Comment on [deleted]
GenderNeutralBro@lemmy.sdf.org 4 days ago
Sounds interesting. What kind of data can it reliably ingest with “attach”? If I dropped, say, the entire Python docs in there, would it be able to get anything out of that? Or does it need to be minimalistic plain-text statements? How is it actually performing retrieval?
SuspciousCarrot78@lemmy.world 4 days ago
Yes! And, for fun, I forced Qwen to answer your question after querying its own fucking brain (M-E-T-A).
The text below is AI-generated, using the router in >>Fun mode (not >>FR; that’s my special gremlin mode). >>f suspends some of the terseness. Let me know if you want to see the non-FR version.
Yep. With your current code, here’s the literal answer to “what can >>attach ingest, can it handle the whole Python docs, and how does retrieval work?” (based on the files you uploaded).
What >>attach actually does (and does not do)
>>attach <kb> doesn’t ingest anything. It just registers a KB name into state.attached_kbs so later turns know which KB folders are “active.” The router’s own docstring is blunt about the intended contract:
“[…] SUMM_*.md files.”
So the “ingest” step is really:
- run >>summ new to generate SUMM_*.md
What kinds of files can be summarized (and therefore used via attach)?
In the router’s filesystem SUMM pipeline:
- Supported raw extensions: _SUPPORTED_RAW_EXTS = {".md", ".txt", ".pdf", ".html", ".htm"}
- PDF handling depends on pypdf; if it is missing, the router treats that as a failure/skip with a note (your top-level comment calls this out explicitly).
- Input size is capped (summ.max_input_chars = 120_000).
When a SUMM is created, it writes a provenance header including source_rel_path and source_sha256, then moves the original into /original/.
So: you do not need “minimalistic plain-text statements.” You can feed it normal docs (md/txt/html/pdf) and it will produce SUMMs that become queryable.
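To make the provenance step concrete, here is a minimal sketch of an extension check plus a header carrying source_rel_path and source_sha256. Only the extension set and the two field names come from the answer above; the function names and the exact header layout are assumptions for illustration, not the router’s actual code.

```python
import hashlib
from pathlib import Path

# Extension set as quoted above; everything else here is hypothetical.
SUPPORTED_RAW_EXTS = {".md", ".txt", ".pdf", ".html", ".htm"}


def is_supported(path: Path) -> bool:
    """Return True if the file extension is one the SUMM pipeline accepts."""
    return path.suffix.lower() in SUPPORTED_RAW_EXTS


def provenance_header(kb_root: Path, raw_path: Path) -> str:
    """Build a header recording which raw file a SUMM was generated from.

    The "key: value" layout is an assumption; only the two field names
    appear in the original answer.
    """
    digest = hashlib.sha256(raw_path.read_bytes()).hexdigest()
    rel = raw_path.relative_to(kb_root).as_posix()
    return f"source_rel_path: {rel}\nsource_sha256: {digest}\n"
```

Storing the hash alongside the relative path means a later run can notice when a source file changed after its SUMM was written.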
“If I dropped the entire Python docs in there…”
Yes, it will produce something usable, because Python docs are mostly HTML and you explicitly support .html/.htm with stripping.
But there are two practical gotchas in your implementation:
- summ_new_in_kb() walks the KB tree, skips /original/, and summarizes every supported raw doc that doesn’t already have a corresponding SUMM_*.md.
- The SUMM.md template is oriented around “overview / key ideas / steps / tradeoffs / pitfalls” rather than “API reference / signatures / parameters.” So it’ll work better for conceptual docs than for “tell me the exact signature of pathlib.Path.glob”.
If you want Python-docs-as-KB to be strong, the move is: tweak the SUMM prompt to extract API sections + signatures + examples, not just “Key Ideas.”
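The tree walk described in the first gotcha can be sketched roughly as follows. This is not the real summ_new_in_kb(); the helper name pending_raw_docs and the rule that a summary is named SUMM_<stem>.md and sits beside its source are assumptions made so the example runs on its own.

```python
from pathlib import Path

SUPPORTED_RAW_EXTS = {".md", ".txt", ".pdf", ".html", ".htm"}


def pending_raw_docs(kb_root: Path) -> list[Path]:
    """List supported raw docs under kb_root that have no SUMM yet."""
    pending = []
    for path in sorted(kb_root.rglob("*")):
        if not path.is_file():
            continue
        rel_parts = path.relative_to(kb_root).parts
        if "original" in rel_parts[:-1]:
            continue  # archived originals are never re-summarized
        if path.name.startswith("SUMM_"):
            continue  # this is a summary, not a raw doc
        if path.suffix.lower() not in SUPPORTED_RAW_EXTS:
            continue
        # Hypothetical naming rule: SUMM_<stem>.md next to the raw file.
        if not (path.parent / f"SUMM_{path.stem}.md").exists():
            pending.append(path)
    return pending
```

For something the size of the full Python docs, a dry run like this is a cheap way to see how many summarization calls a >>summ new would trigger before committing to it.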
How retrieval works in “attached KB” mode
This is the most important bit: filesystem KB retrieval is not vector embeddings. It’s a lightweight lexical scorer over SUMM markdown blocks.
Concretely:
- It searches SUMM_*.md files under attached KB roots, explicitly excluding /original/.
- Results are bounded (top_k=8, max_blocks_per_file=3, max_chars=2400).
So attached-KB mode is basically: pre-summarize once → then do fast “smart grep” over summaries.
That’s why it’s potato-friendly: you’re not embedding every doc on every query; SUMMs are preprocessed once.
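A toy version of such a lexical scorer might look like the sketch below. The three limits reuse the values quoted above, but the tokenizer, the blank-line block splitting, the term-count score, and the reading of max_chars as a total budget are all assumptions; the router’s actual scoring is not shown in this thread.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase word-ish tokens; deliberately simple."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def split_blocks(markdown: str) -> list[str]:
    """Treat blank-line-separated chunks as blocks (an assumption)."""
    return [b.strip() for b in re.split(r"\n\s*\n", markdown) if b.strip()]


def retrieve(query, summ_files, top_k=8, max_blocks_per_file=3, max_chars=2400):
    """Score SUMM blocks by query-term frequency and return the best ones.

    summ_files maps a file name to its markdown text.
    """
    q_terms = set(tokenize(query))
    scored = []
    for name, text in summ_files.items():
        per_file = []
        for block in split_blocks(text):
            counts = Counter(tokenize(block))
            score = sum(counts[t] for t in q_terms)
            if score > 0:
                per_file.append((score, name, block))
        per_file.sort(key=lambda item: item[0], reverse=True)
        scored.extend(per_file[:max_blocks_per_file])
    scored.sort(key=lambda item: item[0], reverse=True)

    results, used = [], 0
    for _score, name, block in scored[:top_k]:
        if used + len(block) > max_chars:
            break  # stop once the character budget is spent
        results.append((name, block))
        used += len(block)
    return results
```

No embeddings or index are involved: each query is a linear pass over text that was already condensed, which is why this style of retrieval stays cheap on weak hardware.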
And “Vault / Mentats” retrieval is different
When you >>move to vault, you’re no longer doing filesystem lexical search. You promote SUMMs into Qdrant and retrieve top-k chunks with embeddings + rerank (as configured in README / config).
(Also worth noting: per your ingest module, the intended architecture is explicitly “filesystem SUMMs for day-to-day; Vault/Qdrant only for promoted SUMMs.”)
One-liner you can paste as a Lemmy answer
###########################
Is gut, ja?