alfablend
@alfablend@lemmy.world
- Comment on "What’s Your Preferred Self-Hosted Solution for Deep Monitoring (Beyond Simple Page Changes)?" 3 days ago:
@xyro Ah, I see! I’m not using Ollama at the moment — my setup is based on GPT4All with a locally hosted DeepSeek model, which handles the semantic parsing directly.
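In case it helps, here's roughly what that looks like from Python — a minimal sketch using the GPT4All bindings, with a hypothetical DeepSeek GGUF filename (swap in whichever model file you actually have locally):

```python
from gpt4all import GPT4All

# Load a locally stored GGUF model (filename is illustrative, not the exact one I run)
model = GPT4All("DeepSeek-R1-Distill-Qwen-7B-Q4_0.gguf", allow_download=False)

def summarize(document_text: str) -> str:
    prompt = (
        "Analyze this document:\n"
        "- Purpose: 1-2 sentences\n"
        "- Key provisions: 3-5 bullet points\n\n"
        f"{document_text[:4000]}"  # keep the prompt inside the context window
    )
    with model.chat_session():
        return model.generate(prompt, max_tokens=400, temp=0.2)
```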
As mentioned earlier, the pipeline doesn’t just diff pages — it detects new document URLs from the source feed (via selectors), downloads them, and generates structured summaries. Here’s a snippet from the YAML config to illustrate how that works:
```yaml
extract:
  events:
    selector: "results[*]"
    fields:
      url: pdf_url
      title: title
      order_number: executive_order_number
download:
  extensions: [".pdf"]
gpt:
  prompt: |
    Analyze this Executive Order document:
    - Purpose: 1–2 sentences
    - Key provisions: 3–5 bullet points
    - Agencies involved: list
    - Revokes/amends: if any
    - Policy impact: neutral analysis
```
To keep things efficient, I also support regex-based extraction before passing content to the LLM. That way, I can isolate relevant blocks (e.g. addresses, client names, conclusions) and reduce the noise in the prompt. Example from another config:
```yaml
processing:
  extract_regex:
    - "object of cultural heritage"
    - "address[:\\s]\\s*(.{10,100}?)(?=\\n|$)"
    - "project(?:s)?"
    - "circumstances"
    - "client\\s*:?\\s*(.{10,100}?)(?=\\n|$)"
    - "(?:conclusions?)\\s*(.{50,300}?)(?=\\n|$)"
```
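Under the hood that step is just matching the configured patterns against the extracted text and keeping the hits, so only the relevant fragments end up in the prompt. A rough Python sketch of the idea (function and variable names are illustrative, not my actual code):

```python
import re

def preextract(text: str, patterns: list[str]) -> str:
    """Keep only the fragments matching the configured regexes,
    so the LLM prompt stays small and focused."""
    blocks = []
    for pattern in patterns:
        for match in re.finditer(pattern, text, flags=re.IGNORECASE | re.MULTILINE):
            # Prefer the first capture group (e.g. the address itself), else the whole match
            blocks.append(match.group(1) if match.groups() else match.group(0))
    return "\n".join(dict.fromkeys(blocks))  # de-duplicate while keeping order
```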
Let me know if you’re experimenting with similar flows — I’d be happy to share templates or compare how DeepSeek performs on your sources!
- Comment on "What’s Your Preferred Self-Hosted Solution for Deep Monitoring (Beyond Simple Page Changes)?" 4 days ago:
Hello! For changedetection.io there are setup instructions for installing via pip: github.com/dgtlmoon/…/Microsoft-Windows. What is your use case?
- Comment on "What’s Your Preferred Self-Hosted Solution for Deep Monitoring (Beyond Simple Page Changes)?" 4 days ago:
@xyro Thanks for sharing your case! I’ve also tested changedetection.io — it’s a great tool for basic site monitoring.
But in my tests, it doesn’t go beyond the surface. If there’s a page with multiple document links, it’ll detect changes in the list (via diff), but it won’t automatically download and analyze the new documents themselves.
Here’s how I’ve approached this:
- Crawl the page to extract links
- Detect new document URLs
- Download each document and extract keywords
- Generate an AI summary using a local LLM
- Add the result to a readable feed
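For concreteness, here's a stripped-down sketch of that loop. Library choices and helper names are illustrative (my actual pipeline is driven by the YAML configs above), and summarize() stands in for the local GPT4All/DeepSeek call from the earlier snippet:

```python
from io import BytesIO

import requests
from bs4 import BeautifulSoup
from pypdf import PdfReader

seen: set[str] = set()  # in practice this is persisted (e.g. SQLite) between runs

def extract_text(pdf_bytes: bytes) -> str:
    return "\n".join(page.extract_text() or "" for page in PdfReader(BytesIO(pdf_bytes)).pages)

def summarize(text: str) -> str:
    # Placeholder: in my setup this calls the local GPT4All/DeepSeek model
    return text[:500]

def poll(source_url: str, feed: list[dict]) -> None:
    # 1. Crawl the page and collect candidate document links
    html = requests.get(source_url, timeout=30).text
    links = {a["href"] for a in BeautifulSoup(html, "html.parser").select("a[href$='.pdf']")}

    # 2. Only URLs we haven't seen before go further
    for url in sorted(links - seen):
        seen.add(url)

        # 3. Download the document and pull out its text (regex filtering would go here)
        text = extract_text(requests.get(url, timeout=60).content)

        # 4. Generate an AI summary with the local LLM
        summary = summarize(text)

        # 5. Add the result to a readable feed (here just a list; could be RSS or an HTML digest)
        feed.append({"url": url, "summary": summary})
```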
P.S. If it helps, I can create a YAML template tailored to your grant-tracking case and run a quick test.
- Submitted 4 days ago to selfhosted@lemmy.world | 6 comments