trajdl.datasets.open_source.hasher module

class trajdl.datasets.open_source.hasher.Hasher(hasher_type: str)

Bases: object

digest_arrow(table: Table, max_chunksize: int = 8192) → str

Digest a PyArrow Table and produce its hash.

Parameters:
  • table (pa.Table) – The PyArrow table to be hashed.

  • max_chunksize (int, optional) – The maximum size of each chunk for processing (default is 8192).

Returns:

The hexadecimal representation of the table hash.

Return type:

str

digest_file(path: str) → str

Digest a file and produce its hash.

Parameters:

path (str) – The path to the file to be hashed.

Returns:

The hexadecimal representation of the file hash.

Return type:

str
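
The block-wise file hashing that digest_file describes can be sketched with the standard library alone. This is an illustration under stated assumptions, not the trajdl implementation: the function name sketch_digest_file, the algorithm parameter, and the block_size default are all hypothetical, and the real method may read or hash the file differently.

```python
import hashlib


def sketch_digest_file(path: str, algorithm: str = "sha256", block_size: int = 65536) -> str:
    """Hash a file in fixed-size blocks and return the hexadecimal digest.

    Hypothetical helper: the name, `algorithm`, and `block_size` are
    assumptions for illustration and are not part of the trajdl API.
    """
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read in blocks so that a large file never has to fit in memory.
        for block in iter(lambda: f.read(block_size), b""):
            hasher.update(block)
    return hasher.hexdigest()
```

Feeding the hasher block by block yields the same digest as hashing the whole byte string at once, so the block size affects memory use only, not the result.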

digest_parquet(path: str, max_chunksize: int = 8192) → str

Digest a Parquet file and produce its hash.

Parameters:
  • path (str) – The path to the Parquet file to be hashed.

  • max_chunksize (int, optional) – The maximum size of each chunk for processing (default is 8192).

Returns:

The hexadecimal representation of the Parquet file hash.

Return type:

str

init_hasher()

Initialize the hashing function based on the hasher_type.
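
One plausible way to select a hash function from a string such as hasher_type is shown below. This page does not state which values hasher_type accepts or which hashing library trajdl uses, so the sketch assumes hashlib algorithm names. The class SketchHasher and its digest_bytes method are hypothetical and exist only for illustration.

```python
import hashlib


class SketchHasher:
    """Hypothetical stand-in showing how a hasher can be chosen by name.

    Assumption: `hasher_type` is a hashlib algorithm name such as "sha256".
    The real trajdl Hasher may accept different values.
    """

    def __init__(self, hasher_type: str):
        self.hasher_type = hasher_type
        self.init_hasher()

    def init_hasher(self) -> None:
        # Fail early with a clear message on an unknown algorithm name.
        if self.hasher_type not in hashlib.algorithms_available:
            raise ValueError(f"unsupported hasher_type: {self.hasher_type!r}")
        # A fresh hash object discards any state left by earlier updates.
        self._hasher = hashlib.new(self.hasher_type)

    def digest_bytes(self, data: bytes) -> str:
        # Re-initialize so that repeated calls do not accumulate input.
        self.init_hasher()
        self._hasher.update(data)
        return self._hasher.hexdigest()
```

Re-initializing before each digest keeps results independent of call order, which matters when one instance hashes several inputs in sequence.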