SuspciousCarrot78@lemmy.world ⁨4⁩ ⁨days⁩ ago

Yes! And, for fun, I forced Qwen to answer your question, after querying its own fucking brain (M-E-T-A).

Everything below is AI-generated, using the router in >>Fun mode (not >>FR; that’s my special gremlin mode). >>f suspends some of the terseness. Let me know if you want to see the non-FR version.


Yep — with your current code, here’s the literal answer to “what can >>attach ingest, can it handle the whole Python docs, and how does retrieval work?” (based on the files you uploaded).


What >>attach actually does (and does not do)

>>attach <kb> doesn’t ingest anything. It just registers a KB name into state.attached_kbs so later turns know which KB folders are “active.”
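
To make "registers a name, ingests nothing" concrete, here is a minimal sketch. Only the attached_kbs name comes from the description above; the RouterState class, the attach() function, and the choice of a set are my assumptions for illustration, not the router's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class RouterState:
    # Hypothetical state container; only the attached_kbs name is from the post.
    # Using a set here is an assumption, the real router may use a list.
    attached_kbs: set[str] = field(default_factory=set)


def attach(state: RouterState, kb_name: str) -> None:
    """Mark a KB as active. No files are read or summarized here."""
    state.attached_kbs.add(kb_name)


state = RouterState()
attach(state, "python_docs")
attach(state, "python_docs")  # attaching the same KB twice changes nothing
```

The point of the sketch is that attaching is pure bookkeeping: nothing touches the disk until a later turn actually queries the KB.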

The router’s own docstring is blunt about the intended contract:

So the “ingest” step is really:

  1. drop raw docs in the KB folder
  2. run >>summ new to generate SUMM_*.md
  3. then attach the KB (or keep it attached) and query it

What kinds of files can be summarized (and therefore used via attach)?

In the router’s filesystem SUMM pipeline, _SUPPORTED_RAW_EXTS = {".md", ".txt", ".pdf", ".html", ".htm"}.

When a SUMM is created, it writes a provenance header including source_rel_path and source_sha256, then moves the original into /original/.
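
A rough sketch of that write-then-archive step, under stated assumptions: the field names source_rel_path and source_sha256 and the original/ folder are from the description above, but the write_summ() helper, the exact header layout, and the SUMM_<stem>.md naming are guesses for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def write_summ(kb_dir: Path, raw_path: Path, summary_text: str) -> Path:
    """Write a SUMM file with a provenance header, then archive the raw doc.

    Hypothetical helper: header layout and file naming are assumptions.
    """
    rel = raw_path.relative_to(kb_dir).as_posix()
    sha = hashlib.sha256(raw_path.read_bytes()).hexdigest()
    header = f"source_rel_path: {rel}\nsource_sha256: {sha}\n\n"

    # Assumed naming scheme: SUMM_<stem>.md next to the raw document.
    summ_path = raw_path.with_name(f"SUMM_{raw_path.stem}.md")
    summ_path.write_text(header + summary_text, encoding="utf-8")

    # Move the raw document out of the way so it is not summarized again.
    original_dir = kb_dir / "original"
    original_dir.mkdir(exist_ok=True)
    shutil.move(str(raw_path), str(original_dir / raw_path.name))
    return summ_path
```

Recording the hash means you can later check whether a SUMM still matches the document it was built from.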

So: you do not need “minimalistic plain-text statements.” You can feed it normal docs (md/txt/html/pdf) and it will produce SUMMs that become queryable.


“If I dropped the entire Python docs in there…”

Yes, it will produce something usable, because Python docs are mostly HTML and you explicitly support .html/.htm with stripping.

But there are two practical gotchas in your implementation:

  1. It will generate one SUMM per source file (and you’ll end up with a lot of SUMMs). summ_new_in_kb() walks the KB tree, skips /original/, and summarizes every supported raw doc that doesn’t already have a corresponding SUMM_*.md.
  2. The SUMM prompt structure matters. Your shipped SUMM.md template is oriented around “overview / key ideas / steps / tradeoffs / pitfalls” rather than “API reference / signatures / parameters.” So it’ll work better for conceptual docs than for “tell me the exact signature of pathlib.Path.glob”.
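
The walk described in gotcha 1 might look roughly like this. It is a sketch, not the real summ_new_in_kb(): the find_unsummarized() name and the "SUMM_<stem>.md sits next to the raw file" check are assumptions; the extension set and the original/ skip are from the description above.

```python
from pathlib import Path

SUPPORTED_RAW_EXTS = {".md", ".txt", ".pdf", ".html", ".htm"}


def find_unsummarized(kb_dir: Path) -> list[Path]:
    """Return raw docs under kb_dir that still need a SUMM (hypothetical helper)."""
    pending = []
    for path in sorted(kb_dir.rglob("*")):
        if not path.is_file():
            continue
        # Skip anything archived under an original/ folder.
        if "original" in path.relative_to(kb_dir).parts:
            continue
        # SUMM files are outputs, not inputs.
        if path.name.startswith("SUMM_"):
            continue
        if path.suffix.lower() not in SUPPORTED_RAW_EXTS:
            continue
        # Assumed convention: the SUMM lives next to the raw file.
        if path.with_name(f"SUMM_{path.stem}.md").exists():
            continue
        pending.append(path)
    return pending
```

Pointed at a full copy of the Python HTML docs, a loop like this returns one entry per page, which is where the "a lot of SUMMs" warning comes from.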

If you want Python-docs-as-KB to be strong, the move is: tweak the SUMM prompt to extract API sections + signatures + examples, not just “Key Ideas.”


How retrieval works in “attached KB” mode

This is the most important bit: filesystem KB retrieval is not vector embeddings. It’s a lightweight lexical scorer over SUMM markdown blocks.

Concretely:

So attached-KB mode is basically: pre-summarize once → then do fast “smart grep” over summaries.

That’s why it’s potato-friendly: you’re not embedding every doc on every query; SUMMs are preprocessed once.
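
For a feel of what a "smart grep" over SUMM blocks can look like, here is a toy lexical scorer using token overlap plus a heading boost. It is an assumption-heavy sketch: the function names, the split on "#" heading lines, and the boost weight of 2.0 are mine, not the router's actual scoring.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word-like tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def split_blocks(markdown: str) -> list[tuple[str, str]]:
    """Split markdown into (heading, body) blocks at lines starting with '#'."""
    blocks = []
    heading, body = "", []
    for line in markdown.splitlines():
        if line.startswith("#"):
            if heading or body:
                blocks.append((heading, "\n".join(body)))
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    if heading or body:
        blocks.append((heading, "\n".join(body)))
    return blocks


def score_block(query: str, heading: str, body: str,
                heading_boost: float = 2.0) -> float:
    """Count query tokens found in the body, plus boosted hits in the heading."""
    q = tokenize(query)
    return len(q & tokenize(body)) + heading_boost * len(q & tokenize(heading))


def top_blocks(query: str, markdown: str, k: int = 3) -> list[tuple[float, str]]:
    """Return up to k (score, heading) pairs with a non-zero score, best first."""
    scored = [(score_block(query, h, b), h) for h, b in split_blocks(markdown)]
    scored = [s for s in scored if s[0] > 0]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]
```

No model call and no vector index are involved at query time, just set intersections over text that was summarized earlier, which is why this kind of lookup stays cheap on weak hardware.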


And “Vault / Mentats” retrieval is different

When you >>move to vault, you’re no longer doing filesystem lexical search. You promote SUMMs into Qdrant and retrieve top-k chunks with embeddings + rerank (as configured in README / config).

(Also worth noting: per your ingest module, the intended architecture is explicitly “filesystem SUMMs for day-to-day; Vault/Qdrant only for promoted SUMMs.”)


One-liner you can paste as a Lemmy answer

>>attach doesn’t ingest raw docs — it just activates a KB folder. The KB mode queries SUMM_*.md files (generated via >>summ new) using a lightweight lexical block scorer (token overlap + heading boosts), not embeddings. You can drop the whole Python HTML docs in a KB and it’ll work, but you’ll want a SUMM prompt tuned for API/reference detail or the summaries will smooth out the stuff you actually care about. If you promote SUMMs to the Vault (>>move to vault), then retrieval switches to Qdrant vector search + reranking.

###########################

Is gut, ja?
