trajdl.datasets.arrow.abstract module
- class trajdl.datasets.arrow.abstract.BaseArrowDataset(table: Table, unsafe: bool = False)
Bases: ABC
- abstract check_table(table: Table) -> None
Check the validity of the input table.
- Parameters:
table (pa.Table) – The table to validate.
- Raises:
ValueError – If the table type is not correct.
NotImplementedError – If the method is not implemented in a subclass.
- property dis_delta: ChunkedArray
Get the DIS_DELTA column from the Arrow table.
- Returns:
The DIS_DELTA column as a ChunkedArray.
- Return type:
pa.lib.ChunkedArray
- property entity_id: ChunkedArray
Get the ENTITY_ID column from the Arrow table.
- Returns:
The ENTITY_ID column as a ChunkedArray.
- Return type:
pa.lib.ChunkedArray
- classmethod init_from_arrow(table: Table)
Initialize the dataset from an Arrow table.
- Parameters:
table (pa.Table) – The Arrow table to initialize the dataset.
- Returns:
An instance of the dataset initialized from the Arrow table.
- Return type:
BaseArrowDataset
- Warns:
RuntimeWarning – If any field in the schema does not exist in the input table.
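The warning condition above can be pictured with a small, library-free sketch. It is an assumption about the shape of the check, not trajdl's actual code: `missing_schema_fields` and `warn_on_missing_fields` are hypothetical helpers, plain lists of names stand in for the schema's field names and the table's column names, and the column names themselves are illustrative.

```python
import warnings
from typing import Iterable, List


def missing_schema_fields(schema_names: Iterable[str], table_columns: Iterable[str]) -> List[str]:
    """Return the schema field names that are absent from the table, in schema order."""
    present = set(table_columns)
    return [name for name in schema_names if name not in present]


def warn_on_missing_fields(schema_names: Iterable[str], table_columns: Iterable[str]) -> List[str]:
    """Emit one RuntimeWarning per missing field (hypothetical helper, not a trajdl API)."""
    missing = missing_schema_fields(schema_names, table_columns)
    for name in missing:
        warnings.warn(
            f"field {name!r} is defined in the schema but missing from the table",
            RuntimeWarning,
        )
    return missing
```

With a real Arrow table, the two arguments would presumably come from the schema's and the table's name lists; how the library treats a missing field after warning is not stated here.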
- classmethod init_from_parquet(path: str) -> BaseArrowDataset
Initialize the dataset from a Parquet file.
- Parameters:
path (str) – The file path to the Parquet file.
- Returns:
An instance of the dataset initialized from the Parquet file.
- Return type:
BaseArrowDataset
Notes
Because frameworks differ in how they handle List[List[Float32]] columns, it is recommended to read the file with PyArrow and convert types where necessary.
- classmethod init_from_table(table: pa.Table | pl.DataFrame | pd.DataFrame) -> BaseArrowDataset
Initialize the dataset from an Arrow table, a Polars DataFrame, or a Pandas DataFrame.
- Parameters:
table (Union[pa.Table, pl.DataFrame, pd.DataFrame]) – The input table to initialize the dataset.
- Returns:
An instance of the dataset initialized from the provided table.
- Return type:
BaseArrowDataset
- Raises:
ValueError – If the input is not one of the accepted table types.
- save(path: str | Path) -> None
Save the dataset to a Parquet file.
- Parameters:
path (Union[str, Path]) – The file path to save the dataset.
Notes
If the provided path does not end with ‘.parquet’, it will be appended automatically.
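The suffix rule in the note can be sketched as follows. This is a minimal illustration under one reading of "appended" (the suffix is added to the end, so `trips.csv` becomes `trips.csv.parquet`); `with_parquet_suffix` is a hypothetical helper and not part of trajdl.

```python
from pathlib import Path
from typing import Union


def with_parquet_suffix(path: Union[str, Path]) -> Path:
    """Append '.parquet' when the path does not already end with it."""
    text = str(path)
    if not text.endswith(".parquet"):
        # Assumption: the suffix is appended, not substituted for an existing extension.
        text += ".parquet"
    return Path(text)
```

Under this reading, `with_parquet_suffix("data/trips")` yields `data/trips.parquet`, and a path that already ends in `.parquet` is returned unchanged.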
- abstract classmethod schema() -> Schema
Abstract method to define the Arrow schema for the dataset.
- Returns:
The schema definition of the dataset.
- Return type:
pa.lib.Schema
- Raises:
NotImplementedError – If the method is not implemented in a subclass.
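The two abstract members, `schema()` and `check_table()`, form the contract a concrete dataset has to fulfil. The sketch below mimics that contract with the standard library only. `SketchDataset` and `TripDataset` are hypothetical stand-ins, a plain dict of columns replaces the Arrow table, and calling `check_table` from the constructor is an assumption rather than documented trajdl behaviour.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class SketchDataset(ABC):
    """Hypothetical stand-in for an abstract dataset base; not trajdl code."""

    def __init__(self, table: Dict[str, List]) -> None:
        # Assumption: the table is validated on construction.
        self.check_table(table)
        self.table = table

    @classmethod
    @abstractmethod
    def schema(cls) -> Dict[str, type]:
        """Describe the expected columns; subclasses must override."""
        raise NotImplementedError

    @abstractmethod
    def check_table(self, table: Dict[str, List]) -> None:
        """Validate the input table; subclasses must override."""
        raise NotImplementedError


class TripDataset(SketchDataset):
    """Concrete subclass that supplies both abstract members."""

    @classmethod
    def schema(cls) -> Dict[str, type]:
        # Illustrative column names only.
        return {"entity_id": str, "seq": list}

    def check_table(self, table: Dict[str, List]) -> None:
        if not isinstance(table, dict):
            raise ValueError("table must be a dict of columns")
```

In this sketch, instantiating `SketchDataset` directly fails because its abstract members are not overridden, while `TripDataset` can be constructed and rejects inputs of the wrong type with `ValueError`.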
- property seq: ChunkedArray
Get the SEQ column from the Arrow table.
- Returns:
The SEQ column as a ChunkedArray.
- Return type:
pa.lib.ChunkedArray
- property start_ts: ChunkedArray
Get the START_TS column from the Arrow table.
- Returns:
The START_TS column as a ChunkedArray.
- Return type:
pa.lib.ChunkedArray
- property ts_delta: ChunkedArray
Get the TS_DELTA column from the Arrow table.
- Returns:
The TS_DELTA column as a ChunkedArray.
- Return type:
pa.lib.ChunkedArray
- property ts_seq: ChunkedArray
Get the TS_SEQ column from the Arrow table.
- Returns:
The TS_SEQ column as a ChunkedArray.
- Return type:
pa.lib.ChunkedArray