trajdl.datasets.arrow.abstract module

class trajdl.datasets.arrow.abstract.BaseArrowDataset(table: Table, unsafe: bool = False)

Bases: ABC

abstract check_table(table: Table) → None

Check the validity of the input table.

Parameters:

table (pa.Table) – The table to validate.

Raises:
  • ValueError – If the table type is not correct.

  • NotImplementedError – If the method is not implemented in a subclass.

property dis_delta: ChunkedArray

Get the DIS_DELTA column from the Arrow table.

Returns:

The DIS_DELTA column as a ChunkedArray.

Return type:

pa.lib.ChunkedArray

property entity_id: ChunkedArray

Get the ENTITY_ID column from the Arrow table.

Returns:

The ENTITY_ID column as a ChunkedArray.

Return type:

pa.lib.ChunkedArray

classmethod init_from_arrow(table: Table)

Initialize the dataset from an Arrow table.

Parameters:

table (pa.Table) – The Arrow table to initialize the dataset.

Returns:

An instance of the dataset initialized from the Arrow table.

Return type:

BaseArrowDataset

Warns:

RuntimeWarning – If any field in the schema does not exist in the input table.

classmethod init_from_parquet(path: str) → BaseArrowDataset

Initialize the dataset from a Parquet file.

Parameters:

path (str) – The file path to the Parquet file.

Returns:

An instance of the dataset initialized from the Parquet file.

Return type:

BaseArrowDataset

Notes

Frameworks differ in how they handle List[List[Float32]] columns, so it is recommended to read the file with PyArrow and convert types where necessary.

classmethod init_from_table(table: pa.Table | pl.DataFrame | pd.DataFrame) → BaseArrowDataset

Initialize the dataset from a Polars DataFrame, a pandas DataFrame, or an Arrow table.

Parameters:

table (Union[pa.Table, pl.DataFrame, pd.DataFrame]) – The input table to initialize the dataset.

Returns:

An instance of the dataset initialized from the provided table.

Return type:

BaseArrowDataset

Raises:

ValueError – If the input is not one of the accepted table types.

iter_as_seqs()

save(path: str | Path) → None

Save the dataset to a Parquet file.

Parameters:

path (Union[str, Path]) – The file path to save the dataset.

Notes

If the provided path does not end with ‘.parquet’, it will be appended automatically.
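
The suffix rule in the note can be sketched as follows. This assumes the extension is appended to the full path rather than replacing an existing extension, which is how the note reads; with_parquet_suffix is a hypothetical helper name.

```python
from pathlib import Path
from typing import Union


def with_parquet_suffix(path: Union[str, Path]) -> Path:
    """Append '.parquet' unless the path already ends with it."""
    text = str(path)
    if not text.endswith(".parquet"):
        text += ".parquet"
    return Path(text)
```

Under this reading, a path such as trips.csv would be saved as trips.csv.parquet.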

abstract classmethod schema() → Schema

Abstract method to define the Arrow schema for the dataset.

Returns:

The schema definition of the dataset.

Return type:

pa.lib.Schema

Raises:

NotImplementedError – If the method is not implemented in a subclass.

property seq: ChunkedArray

Get the SEQ column from the Arrow table.

Returns:

The SEQ column as a ChunkedArray.

Return type:

pa.lib.ChunkedArray

property start_ts: ChunkedArray

Get the START_TS column from the Arrow table.

Returns:

The START_TS column as a ChunkedArray.

Return type:

pa.lib.ChunkedArray

abstract sub_classes_construction(*args, **kwargs) → BaseSeq

Convert one row of the table into an instance of a single sequence.

to_polars() → DataFrame

property ts_delta: ChunkedArray

Get the TS_DELTA column from the Arrow table.

Returns:

The TS_DELTA column as a ChunkedArray.

Return type:

pa.lib.ChunkedArray

property ts_seq: ChunkedArray

Get the TS_SEQ column from the Arrow table.

Returns:

The TS_SEQ column as a ChunkedArray.

Return type:

pa.lib.ChunkedArray