Cloudflare now serves sites in Markdown to AI agents
Submitted 13 hours ago by Beep@lemmus.org to technology@lemmy.world
Comments
TORFdot0@lemmy.world 9 hours ago
Does this mean that if I pretend to be a bot, I can access any Cloudflare site ad-free?
bjoern_tantau@swg-empire.de 8 hours ago
“Prove that you’re a bot by factorising this large number.”
br3d@lemmy.world 6 hours ago
“Prove you’re a bot by failing to click all the motorcycles in this image”
xthexder@l.sw0.com 6 hours ago
How large a number are we talking? This might be impossible for a computer as well, considering that this being a hard problem is effectively the basis for most encryption.
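For a sense of scale, here's a toy trial-division factoriser (plain Python, nothing clever): it chews through small numbers instantly, but the ~2048-bit semiprimes used in RSA are hopeless for it, and for every other known classical method too.

```python
# Toy trial-division factoriser. Fine for small numbers, hopeless for the
# ~2048-bit semiprimes used in RSA - which is why "factor this" works as
# a joke bot-check but not as a real one.
def factorise(n: int) -> list[int]:
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

print(factorise(1_234_567_890))  # [2, 3, 3, 5, 3607, 3803]
```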
LedgeDrop@lemmy.zip 11 hours ago
*jaw-drop* I can go back to lynx now! /s
Potentially, this is actually a fantastic improvement. It (in theory) means you could request Markdown and convert it back to HTML while stripping out ads, JavaScript, tracking cruft, etc.
I wonder how accurate a Markdown translation this would be. Would/could it handle single-page apps?
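If it works anything like ordinary content negotiation, the round trip could be as simple as the sketch below. The `Accept: text/markdown` header and the URL are my guesses, not confirmed details of how Cloudflare exposes the Markdown variant.

```python
# Hedged sketch: fetch a page as Markdown and re-render it locally as plain
# HTML. The Accept header and URL are assumptions, not confirmed details of
# how Cloudflare serves the Markdown version.
import requests   # pip install requests
import markdown   # pip install markdown

resp = requests.get(
    "https://example.com/some-article",      # hypothetical page
    headers={"Accept": "text/markdown"},     # assumed negotiation mechanism
)
resp.raise_for_status()

# Re-render without ads, JavaScript, or tracking scripts - just whatever
# survived the Markdown conversion.
clean_html = markdown.markdown(resp.text)
with open("clean.html", "w", encoding="utf-8") as f:
    f.write(clean_html)
```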
uninvitedguest@piefed.ca 9 hours ago
A few things come to mind:

1. Is this much different from the “reading view” popularized by Instapaper, read-it-later apps, etc., and now baked into most browsers?
2. What is a token?
3. How is it that tokens have become the base unit in which we denominate LLM work/effort/task difficulty/cost?
4. Does every LLM model make use of “tokens”, or is it just one or a select few?
5. If multiple models use the idea of tokens, is a token with one LLM/provider equivalent to a token with a different LLM/provider?
wonderingwanderer@sopuli.xyz 7 hours ago
A token is basically a linguistic unit, like a word or a phrase.
LLMs don’t parse text word-by-word because it would miss a lot of idiomatic meaning and other context. “Dave shot a hole in one at the golf course” might be parsed as “{Dave} {shot} {a hole in one} {at the golf course}”
They use NLP to “tokenize” text, meaning they parse it into individual tokens, so depending on the tokenizer I suppose there could be slight variations in how a text is tokenized.
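If you want to poke at it yourself, OpenAI's tiktoken library shows the splits directly. The encoding name here is just one example; other models use other tokenizers.

```python
# Tokenize a sentence with one concrete tokenizer (OpenAI's tiktoken,
# cl100k_base encoding). Other models split text differently.
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Dave shot a hole in one at the golf course"
tokens = enc.encode(text)                 # list of integer token IDs

print(len(tokens), "tokens")
print([enc.decode([t]) for t in tokens])  # each ID mapped back to its text chunk
```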
Then the LLM runs the tokens through layers of attention heads (basically big matrix operations) to assess the probabilistic relationships between tokens, and uses that process to generate a response via next-token prediction.
It’s a bit more complex than that, of course: tensor operations, billions of weighted parameters, stacked layers with hidden states, matmuls, masks, softmax, dropout. There’s also the “context window”, which is how many tokens it can process at a time. But that’s the gist of it.
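A stripped-down sketch of that attention step (toy sizes, random weights, a single head; real models stack many of these):

```python
# Toy single-head scaled dot-product attention in NumPy. Sizes and weights
# are made up; this only illustrates the "tokens attend to each other" step.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model = 4, 8                         # 4 tokens, 8-dim embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))         # token embeddings

W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / np.sqrt(d_model)             # pairwise token affinities
weights = softmax(scores)                       # each row sums to 1
context = weights @ V                           # every token mixes in the others
print(weights.round(2))
```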
But a token is just the basic unit that gets run through those processes.
wosat@lemmy.world 6 hours ago
Here’s an OpenAI page that allows you to enter text and see how it gets tokenized:
CandleTiger@programming.dev 7 hours ago
A token is the word for the base unit of text that an LLM works with; it’s always been that way. The LLM does not work directly with characters: they are grouped together into chunks (often smaller than a word), and this stream of tokens is what the LLM actually processes. This is also why LLMs have such trouble with spelling questions like “how many Rs in raspberry?” They never see the individual letters in the first place, so they do not know.
No, the LLMs do not all tokenize the same way. Different tokenizers are (or at least were once) one of the major ways models differed from each other. A simple tokenizer might split words into one token per syllable, but I think they’ve gotten much more complicated than that now.
My understanding is very basic and out-of-date.
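To make both points concrete (different tokenizers split differently, and none of them expose letters), you can compare two tiktoken encodings on the same word:

```python
# Same word, two different encodings: the splits differ, and neither one
# hands the model nine individual letters to count.
import tiktoken  # pip install tiktoken

word = "raspberry"
for name in ("gpt2", "cl100k_base"):
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(word)
    print(name, [enc.decode([t]) for t in ids])
```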
webghost0101@sopuli.xyz 12 hours ago
The autistic community has been dying for this kind of accessibility accommodation for years.
I cannot express how deeply this angers me.
“Markdown offers a cleaner, more semantically clear representation of the content. This means less noise for ~~language models and other text-analysis systems~~ people that process information neurodivergently, resulting in more efficient processing and ~~potentially lower compute costs~~ less real-life physical exhaustion.”
artyom@piefed.social 9 hours ago
My brother, this is not just autistic people. Everyone wants this. Except the people who make the sites, because all that noise is how they make money.
If you look at a private blog, it’s usually devoid of much noise.
MagicShel@lemmy.zip 8 hours ago
MD is a nearly ideal format. I keep my personal notes and time-management stuff in Obsidian using Markdown, and I write my blog in Markdown too. AsciiDoc is nice as well, for certain use cases.
kernelle@lemmy.dbzer0.com 10 hours ago
❌️ Adding accessibility features to make the internet usable by anyone
✅️ Adding accessibility features to make the internet usable by ~~anyone~~ other computers
deltaspawn0040@lemmy.zip 9 hours ago
Woohoo, the interests of capital have coincidentally aligned with ours in this one brief moment!
FauxLiving@lemmy.world 5 hours ago
The next step is for Cloudflare to introduce proprietary markdown tags, then release a library to parse their new crap, then update their systems so they serve degraded “legacy” markdown but offer a paid API for access to the “old” markdown, then add features to the library that can only be accessed by API customers, etc. etc.
When I see a commercial entity embrace something, I start looking for the “extend” and “extinguish” parts.
SuspciousCarrot78@lemmy.world 8 hours ago
I have ASD; I made several tools that explicitly convert web sources to .md and JSON.
The shitty thing is, a lot of sites - even if they have the data available in simple, beautiful JSON - refuse to give public access to it. Notoriously, movie session times for local cinemas. That should be a simple lookup… but no.
Oh well, at least cool shit like this still exists
github.com/chubin/wttr.in
github.com/scrapy/scrapy
acosmichippo@lemmy.world 9 hours ago
don’t “reader” views in web browsers essentially accomplish the same thing?
kernelle@lemmy.dbzer0.com 6 hours ago
Yes, and the way this reader functionality works is by using structured, semantic tags in the HTML. A very small effort that leads to a huge gain in accessibility, which some just do not give an F about.
The kicker here is that these tags also heavily impact SEO, so leaving them out not only makes many sites harder to use or read, it also makes them score lower on search engines.
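Roughly what a reader view does under the hood, as a sketch: keep the semantically tagged content, throw away the rest. The HTML here is a made-up example page.

```python
# Sketch of reader-view logic: keep the semantically tagged content,
# drop navigation, asides and the rest. Example HTML is made up.
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<html><body>
  <nav>Home | About | newsletter popup | 37 tracking pixels</nav>
  <article>
    <h1>Actual headline</h1>
    <p>The content someone actually came here to read.</p>
  </article>
  <aside>Sponsored garbage</aside>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
main = soup.find("article") or soup.find("main") or soup.body
print(main.get_text(" ", strip=True))
# -> Actual headline The content someone actually came here to read.
```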