[deleted]

⁨33⁩ ⁨likes⁩

Submitted ⁨⁨5⁩ ⁨months⁩ ago⁩ by ⁨Dust0741@lemmy.world⁩ to ⁨selfhosted@lemmy.world⁩

[deleted]

source

Comments

Sort:hotnew top

yaroto98@lemmy.org ⁨5⁩ ⁨months⁩ ago
Done this with massive log files. Used perl and regex. That’s basically what the language was built for.

But with CSVs? I’d throw them in a db with an index.

source
- SheeEttin@lemmy.zip ⁨5⁩ ⁨months⁩ ago
  Agreed. If the data is suitable enough, there are plenty of tools to slurp a CSV into mariadb or whatever.
  
  source
sith@lemmy.zip ⁨5⁩ ⁨months⁩ ago
Try ripgrep instead of grep. Unless you go for index etc.

source
jonne@infosec.pub ⁨5⁩ ⁨months⁩ ago
Really depends on what data it is and whether you want to search it regularly or just as a one time thing.

You could load them into an rdbms (MySQL/Postgres) and have it handle the indexing, or use python tools to process the files.

If it’s just a one time thing grep is probably fine tho.

source
tal@lemmy.today ⁨5⁩ ⁨months⁩ ago
Are you looking for specific values in this table, or substrings?

If specific values, I’d probably import the CSV file into a database with an column indexed on the value you care about.

source
- atzanteol@sh.itjust.works ⁨5⁩ ⁨months⁩ ago
  Many (most?) databases these days support some sort of full text search.
  
  source
keepee@lemmy.world ⁨5⁩ ⁨months⁩ ago
Check out DuckDb

source
solrize@lemmy.ml ⁨5⁩ ⁨months⁩ ago
No idea about aleph. I’ve used Solr for that (solr.apache.org), thus my username, but maybe that is considered old school by now.

source
- filister@lemmy.world ⁨5⁩ ⁨months⁩ ago
  Elasticsearch should work too
  
  source
titey@jlai.lu ⁨5⁩ ⁨months⁩ ago
If CSV entries are similars, you can try Opensearch or Elasticsearch. It’s great for plain text search (with Lucene)

source
irotsoma@lemmy.blahaj.zone ⁨5⁩ ⁨months⁩ ago
I’ve used java Scanner objects to do this extremely efficiently with minimal memory required even with multiple parallel searches. Indexing is only necessary if you want to search for information many times and don’t know what exactly the search will be. For one time searches, it’s not going to be useful. Grep honestly is going to be faster and more efficient for most one time searches.

The initial indexing or searching of the files will be bottlenecked by the speed of the disk the files are on, no matter what you do. It only helps to index because you can move future searches to faster memory.

So it greatly depends on what and how often you need to search and the tradeoff is memory usage, but only for multiple searches of data you choose to index from the files in the first pass.

source
JamonBear@sh.itjust.works ⁨5⁩ ⁨months⁩ ago
Assuming you meant CSV, xan is wonderful for this.

source
Treczoks@lemmy.world ⁨5⁩ ⁨months⁩ ago
This depends on what you are actually looking for, and how you are looking for it.

Do you really need pattern matching, or do you only look for fixed strings? Then other tools may be faster.

If you need case independent search on an upper- and lowercase data set, make a copy that is all upper or all lower, and search there.

If you only search in certain columns, make a copy that only includes these.

Or import the data into a database.

source
wise_pancake@lemmy.ca ⁨5⁩ ⁨months⁩ ago
If possible convert those files to compressed parquet, and apply sorting and partitioning to them.

I’ve gotten 10-100gb csv files down to 300-5gb sizes just by doing that

That makes searching and scanning so much faster, and you can do this all with open source free software like polars and ibis.

source
- Jason2357@lemmy.ca ⁨5⁩ ⁨months⁩ ago
  Parquet is great, especially if there is some reasonable way of partitioning records - for example, by month or year - if you might need to only search 2024 or something like that. Parquet is great for only needing to I/O the specific variables you are concerned with, and if you can partition the records and only subset a fraction of them, operations can be extremely efficient.
  
  source
fluckx@lemmy.world ⁨5⁩ ⁨months⁩ ago
RDBMS shine on getbyId queries. Queries where the value starts with should also work well. But queries where the word is in the middle of the value or column generally don’t perform well. Since it’s just for personal use that might not matter too much. If you’re querying on exact values it’ll go pretty smooth. If you’re querying on ‘deniro’ while the value contains ‘bob deniro’ and others it’ll be less performant. But it’s possible it works well enough for your case.

Elasticsearch is well known for text searches and being incredibly flexible with queries and filtering. www.elastic.co

Manticore is one that’s been on my check-it-out for I don’t know how long. It looks great imo: manticoresearch.com

Open search: opensearch.org

Disclaimer: I haven’t really used any RDBMS systems extensively for years so it’s possible there are some that added support for full text searches being more performant.

Aleph also seems to be able to cross reference data between documents. I don’t think any of the ones listed above do this. But I also don’t know if this is part of your requirements.

source