SuspciousCarrot78@lemmy.world 4 days ago
Yes! And, for fun, I forced Qwen to answer your question, after querying its own fucking brain (M-E-T-A).
The below is AI generated, using the router, with >>Fun mode (not >>FR…that’s my special gremlin mode). >>f suspends some of the terseness. Let me know if you want to see the non-FR version.
Yep — with your current code, here’s the literal answer to “what can >>attach ingest, can it handle the whole Python docs, and how does retrieval work?” (based on the files you uploaded).
What >>attach actually does (and does not do)
>>attach <kb> doesn’t ingest anything. It just registers a KB name into state.attached_kbs so later turns know which KB folders are “active.”
The router’s own docstring is blunt about the intended contract:
- “KBs are filesystem folders containing SUMM_*.md files.”
- “/serious answers use filesystem KB retrieval ONLY from attached KBs.”
So the “ingest” step is really:
- drop raw docs in the KB folder
- run >>summ new to generate SUMM_*.md
- then attach the KB (or keep it attached) and query it
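To make the "attach is just bookkeeping" point concrete, here is a minimal sketch. Only the attached_kbs name comes from the router; the RouterState class and the attach() helper are hypothetical stand-ins, not the router's real code.

```python
from dataclasses import dataclass, field


@dataclass
class RouterState:
    # Hypothetical stand-in for the router's state object; only the
    # attached_kbs field name is taken from the description above.
    attached_kbs: set[str] = field(default_factory=set)


def attach(state: RouterState, kb_name: str) -> None:
    """Mark a KB as active. Nothing is read, parsed, or summarized here."""
    state.attached_kbs.add(kb_name)


state = RouterState()
attach(state, "python_docs")
attach(state, "python_docs")  # attaching twice is harmless with a set
print(sorted(state.attached_kbs))
```

Summarizing is a separate step (>>summ new), so attaching a KB that has no SUMM files yet would simply give retrieval nothing to search.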
What kinds of files can be summarized (and therefore used via attach)?
In the router’s filesystem SUMM pipeline, _SUPPORTED_RAW_EXTS = {“.md”, “.txt”, “.pdf”, “.html”, “.htm”}
- HTML is “cheap stripped” (scripts/styles removed, tags nuked) before summarizing
- PDFs require pypdf; if missing, the router treats that as a failure/skip with a note (your top-level comment calls this out explicitly).
- There’s also an explicit guard to truncate huge inputs before sending to the model (default summ.max_input_chars = 120_000).
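A rough sketch of what "cheap stripped" plus the size guard could look like. The function names and regexes here are assumptions for illustration, not the router's actual implementation; only the 120_000 default is taken from the text above.

```python
import re

MAX_INPUT_CHARS = 120_000  # mirrors the summ.max_input_chars default quoted above


def cheap_strip_html(html: str) -> str:
    """Drop <script>/<style> blocks, then remove every remaining tag."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    # Collapse the whitespace left behind by the removed markup.
    return re.sub(r"\s+", " ", text).strip()


def truncate_for_model(text: str, max_chars: int = MAX_INPUT_CHARS) -> str:
    """Hard cut so an oversized document cannot blow up the model input."""
    return text[:max_chars]


page = (
    "<html><head><style>p{color:red}</style><script>alert(1)</script></head>"
    "<body><h1>Title</h1><p>Hello world</p></body></html>"
)
print(cheap_strip_html(page))  # → Title Hello world
```

This is deliberately crude: it does not decode HTML entities or preserve structure, which is fine for feeding a summarizer but not for anything that needs exact markup.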
When a SUMM is created, it writes a provenance header including source_rel_path and source_sha256, then moves the original into /original/.
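For illustration, the provenance step might look something like this. The header layout, the helper names, and the assumption that original/ sits directly under the KB root are all mine; only the source_rel_path and source_sha256 field names come from the description.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def provenance_header(kb_root: Path, source: Path) -> str:
    """Build header lines identifying the raw file a SUMM came from."""
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    rel_path = source.relative_to(kb_root).as_posix()
    # Header layout is assumed; only the two field names come from the text.
    return f"source_rel_path: {rel_path}\nsource_sha256: {digest}\n"


def archive_original(kb_root: Path, source: Path) -> Path:
    """Move the raw file into <kb_root>/original/ once its SUMM exists."""
    dest_dir = kb_root / "original"
    dest_dir.mkdir(exist_ok=True)
    dest = dest_dir / source.name
    shutil.move(str(source), str(dest))
    return dest


with tempfile.TemporaryDirectory() as tmp:
    kb = Path(tmp)
    raw = kb / "glob.html"
    raw.write_bytes(b"hello")
    header = provenance_header(kb, raw)  # hash first, then move
    archived = archive_original(kb, raw)
    print(header)
    print(archived.relative_to(kb).as_posix())
```

Hashing the raw bytes before the move means the SUMM can later be checked against the archived original.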
So: you do not need “minimalistic plain-text statements.” You can feed it normal docs (md/txt/html/pdf) and it will produce SUMMs that become queryable.
“If I dropped the entire Python docs in there…”
Yes, it will produce something usable, because Python docs are mostly HTML and you explicitly support .html/.htm with stripping.
But there are two practical gotchas in your implementation:
- It will generate one SUMM per source file (and you’ll end up with a lot of SUMMs). summ_new_in_kb() walks the KB tree, skips /original/, and summarizes every supported raw doc that doesn’t already have a corresponding SUMM_*.md.
- The SUMM prompt structure matters. Your shipped SUMM.md template is oriented around “overview / key ideas / steps / tradeoffs / pitfalls” rather than “API reference / signatures / parameters.” So it’ll work better for conceptual docs than for “tell me the exact signature of pathlib.Path.glob”.
If you want Python-docs-as-KB to be strong, the move is: tweak the SUMM prompt to extract API sections + signatures + examples, not just “Key Ideas.”
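The walk from the first gotcha can be sketched roughly like this. It is not the real summ_new_in_kb(); the pending_raw_docs() name and the SUMM_<stem>.md naming rule are assumptions made so the example is runnable.

```python
import tempfile
from pathlib import Path

SUPPORTED_RAW_EXTS = {".md", ".txt", ".pdf", ".html", ".htm"}


def pending_raw_docs(kb_root: Path) -> list[Path]:
    """Raw docs under kb_root that still need a SUMM."""
    pending = []
    for path in sorted(kb_root.rglob("*")):
        if not path.is_file():
            continue
        rel_parts = path.relative_to(kb_root).parts
        if "original" in rel_parts[:-1]:
            continue  # archived sources are never re-summarized
        if path.name.startswith("SUMM_"):
            continue  # this file is itself a summary
        if path.suffix.lower() not in SUPPORTED_RAW_EXTS:
            continue
        # Assumed naming rule: SUMM_<source stem>.md next to the source.
        if (path.parent / f"SUMM_{path.stem}.md").exists():
            continue
        pending.append(path)
    return pending


with tempfile.TemporaryDirectory() as tmp:
    kb = Path(tmp)
    (kb / "original").mkdir()
    (kb / "original" / "old.html").write_text("archived")
    (kb / "intro.html").write_text("<p>new</p>")
    (kb / "done.txt").write_text("raw")
    (kb / "SUMM_done.md").write_text("summary")
    (kb / "notes.json").write_text("{}")
    pending_names = [p.name for p in pending_raw_docs(kb)]
    print(pending_names)  # → ['intro.html']
```

With the full Python docs this list would run to hundreds of HTML files, which is where the "lot of SUMMs" comes from.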
How retrieval works in “attached KB” mode
This is the most important bit: filesystem KB retrieval is not vector embeddings. It’s a lightweight lexical scorer over SUMM markdown blocks.
Concretely:
- It searches only SUMM_*.md under attached KB roots, explicitly excluding /original/.
- It splits each SUMM into blocks (paragraphs / headings), caps blocks around ~1200 chars, then scores blocks by token overlap with the query.
- Scoring has a few boosts: headings matching query tokens, and a small bonus for code fences.
- Then it returns the top hits (defaults like top_k=8, max_blocks_per_file=3, max_chars=2400).
So attached-KB mode is basically: pre-summarize once → then do fast “smart grep” over summaries.
That’s why it’s potato-friendly: you’re not embedding every doc on every query; SUMMs are preprocessed once.
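Here is a small sketch of that kind of lexical block scorer. The defaults (1200, top_k=8, max_blocks_per_file=3, max_chars=2400) are the ones quoted above; the boost weights, the tokenizer, the blank-line splitting, and the reading of max_chars as a total character budget are assumptions, so treat this as an illustration of the technique rather than the router's code.

```python
import re
from dataclasses import dataclass

TOKEN_RE = re.compile(r"[a-z0-9_]+")
FENCE = "`" * 3  # markdown code-fence marker


@dataclass
class Hit:
    file: str
    block: str
    score: float


def tokens(text: str) -> set[str]:
    return set(TOKEN_RE.findall(text.lower()))


def split_blocks(markdown: str, max_chars: int = 1200) -> list[str]:
    """Split on blank lines, then hard-cap each block's length."""
    parts = (p.strip() for p in re.split(r"\n\s*\n", markdown))
    return [p[:max_chars] for p in parts if p]


def score_block(block: str, query_tokens: set[str]) -> float:
    score = float(len(tokens(block) & query_tokens))  # plain token overlap
    headings = " ".join(l for l in block.splitlines() if l.startswith("#"))
    score += 0.5 * len(tokens(headings) & query_tokens)  # heading boost (weight assumed)
    if score > 0 and FENCE in block:
        score += 0.25  # small code-fence bonus (weight assumed)
    return score


def retrieve(summs: dict[str, str], query: str, top_k: int = 8,
             max_blocks_per_file: int = 3, max_chars: int = 2400) -> list[Hit]:
    """summs maps a SUMM file name to its markdown text."""
    q = tokens(query)
    hits: list[Hit] = []
    for name, text in summs.items():
        scored = [Hit(name, b, score_block(b, q)) for b in split_blocks(text)]
        scored = sorted((h for h in scored if h.score > 0),
                        key=lambda h: h.score, reverse=True)
        hits.extend(scored[:max_blocks_per_file])
    hits.sort(key=lambda h: h.score, reverse=True)
    out, used = [], 0
    for h in hits[:top_k]:
        if used + len(h.block) > max_chars:
            break  # stay inside the total character budget
        out.append(h)
        used += len(h.block)
    return out
```

No model call and no vector index is involved at query time; the whole thing is set intersections over text that was summarized once, which is why it stays cheap on weak hardware.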
And “Vault / Mentats” retrieval is different
When you >>move to vault, you’re no longer doing filesystem lexical search. You promote SUMMs into Qdrant and retrieve top-k chunks with embeddings + rerank (as configured in README / config).
(Also worth noting: per your ingest module, the intended architecture is explicitly “filesystem SUMMs for day-to-day; Vault/Qdrant only for promoted SUMMs.”)
One-liner you can paste as a Lemmy answer
>>attach doesn’t ingest raw docs; it just activates a KB folder. The KB mode queries SUMM_*.md files (generated via >>summ new) using a lightweight lexical block scorer (token overlap + heading boosts), not embeddings. You can drop the whole Python HTML docs in a KB and it’ll work, but you’ll want a SUMM prompt tuned for API/reference detail or the summaries will smooth out the stuff you actually care about. If you promote SUMMs to the Vault (>>move to vault), then retrieval switches to Qdrant vector search + reranking.
###########################
Is gut, ja?