wise_pancake@lemmy.ca 1 week ago
If possible, convert those files to compressed Parquet, and apply sorting and partitioning to them.
I’ve gotten 10-100 GB CSV files down to somewhere between 300 MB and 5 GB just by doing that.
That makes searching and scanning so much faster, and you can do it all with free, open-source software like Polars and Ibis.
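Something like this, for example - a rough sketch with made-up file and column names, and `partition_by` on `write_parquet` needs a fairly recent Polars:

```python
import polars as pl

# Hypothetical input file with a "timestamp" column, just for illustration.
df = pl.read_csv("big_file.csv", try_parse_dates=True)

# Derive partition keys, and sort by the column you filter on most:
# sorted data compresses better and lets scans skip row groups.
df = df.with_columns(
    pl.col("timestamp").dt.year().alias("year"),
    pl.col("timestamp").dt.month().alias("month"),
).sort("timestamp")

# Writes zstd-compressed, hive-partitioned files:
#   events_parquet/year=2024/month=1/....parquet
df.write_parquet(
    "events_parquet",
    compression="zstd",
    partition_by=["year", "month"],
)
```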
Jason2357@lemmy.ca 1 week ago
Parquet is great, especially if there is some reasonable way of partitioning records - for example, by month or year - so that if you only need to search 2024, you can skip everything else. Parquet also only has to read the specific columns you care about from disk, and if you can partition the records and scan just a fraction of them, operations can be extremely efficient.
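e.g. with the made-up dataset from the sketch above (assuming it also has a "value" column), a lazy scan only ever touches the 2024 directories and the two selected columns:

```python
import polars as pl

# With hive partitioning, the year == 2024 filter prunes whole
# directories before any I/O, and select() means only two columns
# are actually read from the surviving files.
out = (
    pl.scan_parquet("events_parquet/**/*.parquet", hive_partitioning=True)
    .filter(pl.col("year") == 2024)
    .select("timestamp", "value")
    .collect()
)
print(out.head())
```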