
GenderNeutralBro@lemmy.sdf.org 3 weeks ago

Sounds interesting. What kind of data can it reliably ingest with “attach”? If I dropped, say, the entire Python docs in there, would it be able to get anything out of that? Or does it need to be minimalistic plain-text statements? How is it actually performing retrieval?


SuspciousCarrot78@lemmy.world 3 weeks ago
Yes! And, for fun, I forced Qwen to answer your question, after querying its own fucking brain (M-E-T-A).

The text below is AI-generated, using the router, with >>Fun mode (not >>FR…that’s my special gremlin mode). >>f suspends some of the terseness. Let me know if you want to see the non-FR version.

Yep — with your current code, here’s the literal answer to “what can >>attach ingest, can it handle the whole Python docs, and how does retrieval work?” (based on the files you uploaded).

What >>attach actually does (and does not do)

>>attach <kb> doesn’t ingest anything. It just registers a KB name into state.attached_kbs so later turns know which KB folders are “active.”
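To make that concrete, here is a minimal sketch of what "registering" could look like. Only the `attached_kbs` field name comes from the answer above; the `RouterState` class and the `attach` function are hypothetical stand-ins, not the router's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class RouterState:
    # Hypothetical stand-in for the router's state object.
    attached_kbs: set[str] = field(default_factory=set)


def attach(state: RouterState, kb_name: str) -> None:
    """Mark a KB as active for later turns. Reads and parses no files."""
    state.attached_kbs.add(kb_name)


state = RouterState()
attach(state, "python-docs")  # "python-docs" is an example KB name
```

The point of the sketch is that attaching is cheap bookkeeping: all the real work happens earlier, at summarization time.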

The router’s own docstring is blunt about the intended contract:

“KBs are filesystem folders containing SUMM_*.md files.”

“/serious answers use filesystem KB retrieval ONLY from attached KBs.”

So the “ingest” step is really:

1. Drop raw docs in the KB folder.

2. Run >>summ new to generate SUMM_*.md.

3. Attach the KB (or keep it attached) and query it.

What kinds of files can be summarized (and therefore used via attach)?

In the router’s filesystem SUMM pipeline, _SUPPORTED_RAW_EXTS = {".md", ".txt", ".pdf", ".html", ".htm"}.

HTML is “cheap stripped” (scripts/styles removed, tags nuked) before summarizing

PDFs require pypdf — if missing, the router treats that as a failure/skip with a note (your top-level comment calls this out explicitly).

There’s also an explicit guard to truncate huge inputs before sending to the model (default summ.max_input_chars = 120_000).

When a SUMM is created, it writes a provenance header including source_rel_path and source_sha256, then moves the original into /original/.
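The preprocessing described above can be sketched roughly like this. It is an illustration under assumptions, not the router's code: the function names, the regex-based stripping, and the header layout are all made up here; only the extension set, the 120_000 default, and the `source_rel_path` / `source_sha256` field names come from the answer.

```python
import hashlib
import re

SUPPORTED_RAW_EXTS = {".md", ".txt", ".pdf", ".html", ".htm"}
MAX_INPUT_CHARS = 120_000  # mirrors the quoted summ.max_input_chars default


def strip_html(html: str) -> str:
    """Cheap HTML stripping: drop script/style blocks, then all tags."""
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def prepare_input(raw: str, ext: str, max_chars: int = MAX_INPUT_CHARS) -> str:
    """Strip HTML if needed, then truncate before sending to the model."""
    if ext.lower() in {".html", ".htm"}:
        raw = strip_html(raw)
    return raw[:max_chars]


def provenance_header(source_rel_path: str, raw_bytes: bytes) -> str:
    """Assumed header layout; only the two field names are from the source."""
    digest = hashlib.sha256(raw_bytes).hexdigest()
    return f"source_rel_path: {source_rel_path}\nsource_sha256: {digest}\n"
```

Hashing the raw bytes rather than the stripped text means the header identifies the original file, which is what you want if the original later gets moved into /original/.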

So: you do not need “minimalistic plain-text statements.” You can feed it normal docs (md/txt/html/pdf) and it will produce SUMMs that become queryable.

“If I dropped the entire Python docs in there…”

Yes, it will produce something usable, because Python docs are mostly HTML and you explicitly support .html/.htm with stripping.

But there are two practical gotchas in your implementation:

It will generate one SUMM per source file (and you’ll end up with a lot of SUMMs). summ_new_in_kb() walks the KB tree, skips /original/, and summarizes every supported raw doc that doesn’t already have a corresponding SUMM_*.md.

The SUMM prompt structure matters. Your shipped SUMM.md template is oriented around “overview / key ideas / steps / tradeoffs / pitfalls” rather than “API reference / signatures / parameters.” So it’ll work better for conceptual docs than for “tell me the exact signature of pathlib.Path.glob”.
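For the first gotcha, the "walk the tree and find what still needs a SUMM" step might look something like the sketch below. The function name `pending_raw_docs` and the `SUMM_<stem>.md` naming rule are assumptions for illustration; the answer only says SUMMs match SUMM_*.md and that /original/ is skipped.

```python
from pathlib import Path

SUPPORTED_RAW_EXTS = {".md", ".txt", ".pdf", ".html", ".htm"}


def pending_raw_docs(kb_root: Path) -> list[Path]:
    """List supported raw docs under kb_root that have no SUMM yet."""
    pending = []
    for path in sorted(kb_root.rglob("*")):
        if not path.is_file():
            continue
        # Skip anything stored under an original/ folder.
        if "original" in path.relative_to(kb_root).parts:
            continue
        # SUMM files are outputs, never inputs.
        if path.name.startswith("SUMM_"):
            continue
        if path.suffix.lower() not in SUPPORTED_RAW_EXTS:
            continue
        # Assumed naming rule: foo.html -> SUMM_foo.md in the same folder.
        summ = path.with_name(f"SUMM_{path.stem}.md")
        if not summ.exists():
            pending.append(path)
    return pending
```

Pointed at a full copy of the Python HTML docs, a walk like this would return one entry per page, which is where the "lot of SUMMs" comes from.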

If you want Python-docs-as-KB to be strong, the move is: tweak the SUMM prompt to extract API sections + signatures + examples, not just “Key Ideas.”

How retrieval works in “attached KB” mode

This is the most important bit: filesystem KB retrieval is not vector embeddings. It’s a lightweight lexical scorer over SUMM markdown blocks.

Concretely:

It searches only SUMM_*.md under attached KB roots, explicitly excluding /original/.

It splits each SUMM into blocks (paragraphs / headings), caps blocks around ~1200 chars, then scores blocks by token overlap with the query.

Scoring has a few boosts: headings matching query tokens, and a small bonus for code fences.

Then it returns the top hits (defaults like top_k=8, max_blocks_per_file=3, max_chars=2400).
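A rough sketch of that kind of lexical block scorer follows. It is not the router's implementation: the tokenizer, the boost weights (2.0 for a heading match, 0.5 for a code fence), and the reading of `max_chars` as a total character budget are all assumptions. Only the default numbers (1200, 8, 3, 2400) come from the answer.

```python
import re

FENCE = "`" * 3  # markdown code-fence marker


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def split_blocks(markdown: str, max_block_chars: int = 1200) -> list[str]:
    """Split on blank lines, then cap each block at max_block_chars."""
    blocks = []
    for para in re.split(r"\n\s*\n", markdown):
        para = para.strip()
        for i in range(0, len(para), max_block_chars):
            blocks.append(para[i:i + max_block_chars])
    return blocks


def score_block(block: str, query_tokens: set[str]) -> float:
    score = float(len(tokenize(block) & query_tokens))  # token overlap
    if score == 0:
        return 0.0
    first_line = block.splitlines()[0]
    if first_line.startswith("#") and tokenize(first_line) & query_tokens:
        score += 2.0  # assumed heading boost
    if FENCE in block:
        score += 0.5  # assumed code-fence bonus
    return score


def retrieve(summs: dict[str, str], query: str, top_k: int = 8,
             max_blocks_per_file: int = 3, max_chars: int = 2400):
    """summs maps SUMM file name -> markdown text. Returns (name, block) hits."""
    q = tokenize(query)
    scored = []
    for name, text in summs.items():
        hits = [(score_block(b, q), name, b) for b in split_blocks(text)]
        hits = sorted((h for h in hits if h[0] > 0), key=lambda h: -h[0])
        scored.extend(hits[:max_blocks_per_file])
    scored.sort(key=lambda h: -h[0])
    results, used = [], 0
    for _, name, block in scored[:top_k]:
        if used + len(block) > max_chars:
            break
        results.append((name, block))
        used += len(block)
    return results
```

Nothing here needs a model or an index: each query is a single pass over the SUMM text with set intersections, which fits the "smart grep" description.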

So attached-KB mode is basically: pre-summarize once → then do fast “smart grep” over summaries.

That’s why it’s potato-friendly: you’re not embedding every doc on every query; SUMMs are preprocessed once.

And “Vault / Mentats” retrieval is different

When you >>move to vault, you’re no longer doing filesystem lexical search. You promote SUMMs into Qdrant and retrieve top-k chunks with embeddings + rerank (as configured in README / config).

(Also worth noting: per your ingest module, the intended architecture is explicitly “filesystem SUMMs for day-to-day; Vault/Qdrant only for promoted SUMMs.”)

One-liner you can paste as a Lemmy answer

>>attach doesn’t ingest raw docs — it just activates a KB folder. The KB mode queries SUMM_*.md files (generated via >>summ new) using a lightweight lexical block scorer (token overlap + heading boosts), not embeddings. You can drop the whole Python HTML docs in a KB and it’ll work, but you’ll want a SUMM prompt tuned for API/reference detail or the summaries will smooth out the stuff you actually care about. If you promote SUMMs to the Vault (>>move to vault), then retrieval switches to Qdrant vector search + reranking.

###########################

Is gut, ja?