trajdl.datasets.arrow.abstract module


class trajdl.datasets.arrow.abstract.BaseArrowDataset(table: Table, unsafe: bool = False)

Bases: ABC

abstract check_table(table: Table) → None

Check the validity of the input table.

Parameters:

table (pa.Table) – The table to validate.

Raises:

ValueError – If the table type is not correct.
NotImplementedError – If the method is not implemented in a subclass.
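
The validation hook can be pictured with a small standard-library stand-in. Everything here is hypothetical: `SketchDataset` and `DictDataset` are illustrative names, a plain dict of columns stands in for `pa.Table`, and calling `check_table` from the constructor is an assumption rather than documented behavior (the `unsafe` flag is not modeled).

```python
from abc import ABC, abstractmethod
from typing import Any


class SketchDataset(ABC):
    """Hypothetical stand-in for the base class; not trajdl's implementation."""

    def __init__(self, table: Any) -> None:
        # Assumption: validation runs before the table is stored.
        self.check_table(table)
        self.table = table

    @abstractmethod
    def check_table(self, table: Any) -> None:
        raise NotImplementedError


class DictDataset(SketchDataset):
    """Concrete subclass that accepts only a dict of columns."""

    def check_table(self, table: Any) -> None:
        if not isinstance(table, dict):
            raise ValueError(f"expected a dict of columns, got {type(table).__name__}")


# A valid input passes validation and is stored.
ok = DictDataset({"seq": [["a", "b"]]})

# An input of the wrong type is rejected with ValueError.
try:
    DictDataset([1, 2, 3])
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

A real subclass would test for `pa.Table` and could also inspect the columns it requires.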

property dis_delta: ChunkedArray

Get the DIS_DELTA column from the Arrow table.

Returns: The DIS_DELTA column as a ChunkedArray.
Return type: pa.lib.ChunkedArray

property entity_id: ChunkedArray

Get the ENTITY_ID column from the Arrow table.

Returns: The ENTITY_ID column as a ChunkedArray.
Return type: pa.lib.ChunkedArray

classmethod init_from_arrow(table: Table)

Initialize the dataset from an Arrow table.

Parameters: table (pa.Table) – The Arrow table to initialize the dataset.
Returns: An instance of the dataset initialized from the Arrow table.
Return type: BaseArrowDataset
Warns: RuntimeWarning – If any field in the schema does not exist in the input table.
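
The warning described above can be illustrated with a standard-library sketch. The helper name `warn_on_missing_fields` is hypothetical, the field names are illustrative, and plain lists of names stand in for a `pa.Schema` and a `pa.Table`; this is not trajdl's implementation.

```python
import warnings
from typing import Iterable, List


def warn_on_missing_fields(schema_fields: Iterable[str], table_columns: Iterable[str]) -> List[str]:
    """Hypothetical helper: emit a RuntimeWarning for each schema field absent from the table."""
    present = set(table_columns)
    missing = [name for name in schema_fields if name not in present]
    for name in missing:
        warnings.warn(
            f"field {name!r} is in the schema but not in the input table",
            RuntimeWarning,
        )
    return missing


# Capture the warnings so that they can be inspected instead of printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    missing = warn_on_missing_fields(["seq", "entity_id", "ts_seq"], ["seq", "entity_id"])
```

Because the condition is reported as a warning rather than an exception, a table lacking some schema fields can presumably still be loaded; callers who want a hard failure can escalate it with `warnings.simplefilter("error", RuntimeWarning)`.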

classmethod init_from_parquet(path: str) → BaseArrowDataset

Initialize the dataset from a Parquet file.

Parameters: path (str) – The file path to the Parquet file.
Returns: An instance of the dataset initialized from the Parquet file.
Return type: BaseArrowDataset

Notes

Because frameworks handle List[List[Float32]] columns differently, it is recommended to read the file with PyArrow and convert types where necessary.

classmethod init_from_table(table: pa.Table | pl.DataFrame | pd.DataFrame) → BaseArrowDataset

Initialize the dataset from a Polars, Pandas, or Arrow table.

Parameters: table (Union[pa.Table, pl.DataFrame, pd.DataFrame]) – The input table to initialize the dataset.
Returns: An instance of the dataset initialized from the provided table.
Return type: BaseArrowDataset
Raises: ValueError – If the input is not one of the accepted table types.

iter_as_seqs()

save(path: str | Path) → None

Save the dataset to a Parquet file.

Parameters: path (Union[str, Path]) – The file path to save the dataset.

Notes

If the provided path does not end with ‘.parquet’, it will be appended automatically.
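
The suffix rule in the note above can be sketched with `pathlib`. The helper name `with_parquet_suffix` is hypothetical and the exact matching rule is an assumption; the library's own logic may differ in detail (for example in case sensitivity).

```python
from pathlib import Path
from typing import Union


def with_parquet_suffix(path: Union[str, Path]) -> Path:
    """Hypothetical helper: append '.parquet' unless the path already ends with it."""
    text = str(path)
    if not text.endswith(".parquet"):
        # Append rather than replace, so an existing extension is preserved.
        text += ".parquet"
    return Path(text)
```

Under this sketch, `with_parquet_suffix("trips.v2")` yields `trips.v2.parquet`, because the suffix is appended instead of substituted for an existing extension.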

abstract classmethod schema() → Schema

Abstract method to define the Arrow schema for the dataset.

Returns: The schema definition of the dataset.
Return type: pa.lib.Schema
Raises: NotImplementedError – If the method is not implemented in a subclass.

property seq: ChunkedArray

Get the SEQ column from the Arrow table.

Returns: The SEQ column as a ChunkedArray.
Return type: pa.lib.ChunkedArray

property start_ts: ChunkedArray

Get the START_TS column from the Arrow table.

Returns: The START_TS column as a ChunkedArray.
Return type: pa.lib.ChunkedArray

abstract sub_classes_construction(*args, **kwargs) → BaseSeq

Convert one row of the table into a single sequence instance.

to_polars() → DataFrame

property ts_delta: ChunkedArray

Get the TS_DELTA column from the Arrow table.

Returns: The TS_DELTA column as a ChunkedArray.
Return type: pa.lib.ChunkedArray

property ts_seq: ChunkedArray

Get the TS_SEQ column from the Arrow table.

Returns: The TS_SEQ column as a ChunkedArray.
Return type: pa.lib.ChunkedArray
