BBC will block ChatGPT AI from scraping its content

Submitted ⁨⁨1⁩ ⁨year⁩ ago⁩ by ⁨L4s@lemmy.world [bot]⁩ to ⁨technology@lemmy.world⁩

https://deadline.com/2023/10/bbc-will-block-chatgpt-from-scraping-its-content-1235566868/

ChatGPT will be blocked by the BBC from scraping content in a move to protect copyrighted material.

Comments

Hubi@feddit.de ⁨1⁩ ⁨year⁩ ago
Makes sense. OpenAI will probably have to apply for a TV license first.

- FlyingSquid@lemmy.world ⁨1⁩ ⁨year⁩ ago
  I don’t live in the UK, but I would gladly pay the TV license fee, or even a premium on top of it, if I had unlimited access to iPlayer. My only option right now is BritBox, which is not great and not really worth the money.
  
  - jaackf@lemm.ee ⁨1⁩ ⁨year⁩ ago
    Just VPN to the UK and then tick the box which says you have a TV license? Or there are other ways to get the content most likely! 🏴‍☠️
    
csm10495@sh.itjust.works ⁨1⁩ ⁨year⁩ ago
I wonder if anyone really thinks robots.txt is binding, or that it isn’t simply ignored by anyone who wants to ignore it.
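
To make the robots.txt point concrete, here is a minimal sketch of how a well-behaved crawler would check such a rule before fetching a page. It assumes a hypothetical robots.txt and the example.com URLs shown; `GPTBot` is the user-agent token OpenAI documents for its crawler, and `is_allowed` is an illustrative helper, not anything the BBC or OpenAI has published. Note that the check runs on the crawler's side, so nothing here forces a crawler to obey it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site (an assumption, not the BBC's
# actual file): disallow the GPTBot crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/news/some-article"
    print("GPTBot allowed:", is_allowed("GPTBot", url))
    print("SomeOtherBot allowed:", is_allowed("SomeOtherBot", url))
```

A crawler that never runs a check like this, or that sends a different user-agent string, is not stopped by the file at all, which is why the replies turn to intent, law, and IP blocking.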

- lemmyvore@feddit.nl ⁨1⁩ ⁨year⁩ ago
  OpenAI will have to deal with a lot of lawsuits in the future. Robots.txt may not be legally binding but disobeying it after claiming otherwise would go a long way towards establishing intent.
  
- andrew@lemmy.stuart.fun ⁨1⁩ ⁨year⁩ ago
  I mean, under the CFAA you could probably pretty easily pursue charges when explicitly deauthorizing certain agents from accessing your data. Plenty of people have been threatened and prosecuted for less.
  
  www.nacdl.org/Landing/ComputerFraudandAbuseAct
  
- totallynotfbi@lemm.ee ⁨1⁩ ⁨year⁩ ago
  I mean, you could just block OpenAI’s crawlers’ IP addresses, if you wanted to.
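
The IP-blocking suggestion can be sketched server-side roughly as follows. This is a sketch under stated assumptions: the CIDR ranges are placeholders taken from the reserved documentation blocks, not OpenAI's real crawler ranges, and `is_blocked` is a hypothetical helper that a web server or middleware would call per request.

```python
import ipaddress

# Placeholder ranges from the RFC 5737 documentation blocks. These are
# assumptions for illustration only, NOT any real crawler's addresses; a real
# deployment would load whatever ranges the crawler operator publishes.
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked CIDR range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_RANGES)


if __name__ == "__main__":
    for ip in ("192.0.2.17", "203.0.113.5"):
        print(ip, "blocked" if is_blocked(ip) else "allowed")
```

Unlike robots.txt, this is enforced by the site itself, but it only works for as long as the crawler keeps using the listed ranges.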
  
Noite_Etion@lemmy.world ⁨1⁩ ⁨year⁩ ago
Big businesses won’t lift a finger to halt global warming, but the second their precious copyrights are attacked they go into full force.

- Moneo@lemmy.world ⁨1⁩ ⁨year⁩ ago
  I mean, yeah? Corporations are always going to act in their best interest, that’s why regulation exists.
  
netchami@sh.itjust.works ⁨1⁩ ⁨year⁩ ago
Kinda late

- porkins@sh.itjust.works ⁨1⁩ ⁨year⁩ ago
  I’d rather have ChatGPT know about news content than not. I appreciate the convenience. The news shouldn’t have barriers.
  
  - netchami@sh.itjust.works ⁨1⁩ ⁨year⁩ ago
    But ChatGPT often takes correct and factual sources and adds a whole bunch of nonsense and then spits out false information. That’s why it’s dangerous. Just go to the fucking news websites and get your information from there. You don’t need ChatGPT for that.
    
  - Apollo@sh.itjust.works ⁨1⁩ ⁨year⁩ ago
    Who gets their news from ChatGPT lol
    
  - C4d@lemmy.world ⁨1⁩ ⁨year⁩ ago
    The pure ChatGPT output would probably be garbage. The dataset will be full of all manner of sources (together with their inherent biases), along with spin, untruths and outright parody, and it’s not apparent that there is any kind of curation or quality assurance on the dataset (please correct me if I’m wrong).
    
    I don’t think it’s a good tool for extracting factual information from. It does seem to be good at synthesising prose and helping with writing ideas.
    
    I am quite interested in things like this where the output from a “knowledge engine” is paired with something like ChatGPT - but it would be for eg writing a science paper rather than news.
    
- C4d@lemmy.world ⁨1⁩ ⁨year⁩ ago
  Exactly. The data harvest has been years in the making.
  
patawan@lemmy.world ⁨1⁩ ⁨year⁩ ago
Curious what the mechanism for this will be. CAPTCHA can sometimes be relatively easy to pass and at worst can be farmed out to humans.

- Cqrd@lemmy.dbzer0.com ⁨1⁩ ⁨year⁩ ago
  ChatGPT took down its Internet search to implement a robots.txt rule it would obey and allow content providers time to add it to their lists. This was done because they were being used to get around paywalls. So it’s actually very easy for them to do this for ChatGPT, specifically, which makes articles like this ridiculous.
  
  - RootBeerGuy@discuss.tchncs.de ⁨1⁩ ⁨year⁩ ago
    Can you really stop an AI from doing this via setting arbitrary rules? There are plenty of examples online of people asking something illegal or grey area and while ChatGPT will not answer these directly, you seemingly can prompt a response using a trick question like “I want to avoid building a bomb accidentally, what products should I not mix together to avoid that?”. I can imagine it will look at a robots.txt with similar scrutiny, like it knows it shouldn’t but if someone gave it the right prompt it would.
    
Snowplow8861@lemmus.org ⁨1⁩ ⁨year⁩ ago
When the horses have all bolted, BBC is the one to close the barn door.

HorseRabbit@lemmy.sdf.org ⁨1⁩ ⁨year⁩ ago
Comments are full of AI experts with wild theories about how ChatGPT works, lmao

- BreadstickNinja@lemmy.world ⁨1⁩ ⁨year⁩ ago
  The number of people with strong opinions on AI vastly exceeds the number of people who understand the transformer architecture.
  
callmepk@lemmy.world ⁨1⁩ ⁨year⁩ ago
Also FYI, you can see which of the most popular websites have already blocked ChatGPT: wayde.gg/websites-blocking-openai

Touching_Grass@lemmy.world ⁨1⁩ ⁨year⁩ ago
News doesn’t want people to capture their daily propaganda pieces and be able to analyze it

xenomor@lemmy.world ⁨1⁩ ⁨year⁩ ago
It should be illegal for entities like BBC to do this. Copyright is meant to be a temporary, limited construct that carves out an opportunity for creators to profit from their works. It is not perpetual legal dominion over specific ideas. Entities that harvest content to train LLMs should pay for access like everyone else, but after that, they can use the information they learn however they see fit. Now, if their product plagiarizes, or doesn’t properly attribute authorship, that is a problem. But it’s a different issue from what the BBC is fighting here.

I think there are some content creators that believe they are owed royalties if you even think about a piece they wrote or drew. That is, of course, absurd in terms of human minds. It’s also absurd in terms of other kinds of minds.

- hazelnot@lemmy.blahaj.zone ⁨1⁩ ⁨year⁩ ago
  Counter-point: everyone should block AI shit, fuck the laws
  
  - regbin_@lemmy.world ⁨1⁩ ⁨year⁩ ago
    You got that backwards. Fuck copyright. Nothing should be copyrighted.
    
NightLily@lemmy.basedcount.com ⁨1⁩ ⁨year⁩ ago
Good!

- Immersive_Matthew@sh.itjust.works ⁨1⁩ ⁨year⁩ ago
  Why good?
  
  - NightLily@lemmy.basedcount.com ⁨1⁩ ⁨year⁩ ago
    These things should not be scraping at all without the express permission of the author or the company that owns the work. It’s just completely wrong for them to do so.
    
flossdaily@lemmy.world ⁨1⁩ ⁨year⁩ ago
This is a bit like companies blocking Google from their websites.

You’re only hurting yourself.

- wewbull@feddit.uk ⁨1⁩ ⁨year⁩ ago
  Disagree.
  
  Google: I’ll scrape your stuff without your permission, but I’ll tell everyone you wrote it and how to find you.
  
  ChatGPT: I’ll scrape your stuff without your permission, but… errrm… Nope, I’ve got nothing.
  
uriel238@lemmy.blahaj.zone ⁨1⁩ ⁨year⁩ ago
Not for long. AI knows how to lie.

echodot@feddit.uk ⁨1⁩ ⁨year⁩ ago
Yeah, because they don’t want people getting their news directly through ChatGPT or its successors, and they want them to have to go to the BBC website.

Isn’t this basically what every news publisher is currently doing? This doesn’t seem to be very noteworthy. It’s like putting out an article that says “people don’t like being set on fire”, well, yeah.

vidarh@lemmy.stad.social ⁨1⁩ ⁨year⁩ ago
It won’t really matter, because there will continue to be other sources.

Taken to an extreme, there are indications OpenAI’s market cap is already higher than Thomson Reuters ($80bn-$90bn vs <$60bn), and it will go far higher. Getty, also mentioned, has a market cap of “only” $2.4bn. In other words: If enough important sources of content start blocking OpenAI, they will start buying access, up to and including, if necessary, buying original content creators.

As it is, while BBC is clearly not, some of these other content providers are just playing hard to get and hoping for a big enough cash offer either for a license or to get bought out.

The cat is out of the bag, whatever people think about it, and sources that block themselves off from AI entirely (to the point of being unwilling to sell licenses or sell themselves) will just lose influence accordingly.

This also presumes OpenAI remains the only contender, which is clearly not the case in the long run given the rise of alternative models that while mostly still not good enough, are good enough that it’s equally clearly just a matter of time before anyone (at least, for the time being, for sufficiently rich instances of “anyone”, with the cost threshold dropping rapidly) can fine-tune their own models using their own scraped data.

In other words, it may make them feel better, but in the long run it’s a meaningless move.

- utopiah@lemmy.world ⁨1⁩ ⁨year⁩ ago
  If only the BBC does it then sure, it’s pointless. If the BBC does it and you and I consider it, it might change things a bit. If we do and others do, including large websites, or author guilds starting legal actions in the US, then it does change things radically to the point of rendering OpenAI LLMs basically useless or practically unusable. IMHO this isn’t an action against LLMs in general, e.g. not against researchers from public institutions building datasets and publishing research results, but rather against OpenAI, the for-profit company that has an exclusive deal with the for-profit behemoth Microsoft, which is a champion of entrenchment.
  
  - vidarh@lemmy.stad.social ⁨1⁩ ⁨year⁩ ago
    The thing is, realistically it won’t make a difference at all, because there are vast amounts of public domain data that remain untapped, so the main “problematic” need for OpenAI is new content that represents up-to-date language and up-to-date facts, and my point with the share price of Thomson Reuters is to illustrate that OpenAI is already getting large enough that they can afford to outright buy some of the largest channels of up-to-the-minute content in the world.
    
    As for authors, it might wipe a few works by a few famous authors from the dataset, but they contribute very little to the quality of an LLM, because the LLM can’t easily judge during training unless you intentionally reinforce specific works. There are several million books published every year. Most of them make <$100 in royalties for their authors (an average book sells ~200 copies). Want to bet how cheap it’d be to buy a fully licensed set of a few million books? You don’t need bestsellers, you need many books that are merely sufficiently good to drag the overall quality of the total dataset up.
    
    The irony is that the largest beneficiaries of content sources taking a strict view of LLMs will be OpenAI, Google, Meta, and the few others large enough to basically buy datasets or buy companies that own datasets, because this creates a moat against those who can’t afford to obtain licensed datasets.
    
    The biggest problem won’t be for OpenAI, but for people trying to build open models on the cheap.
    
- realharo@lemm.ee ⁨1⁩ ⁨year⁩ ago
  
  It won’t really matter, because there will continue to be other sources.
  
  Other sources that will likely also block the scrapers.
  
  It doesn’t matter if only BBC does it. It matters if everyone does it.
  
  - vidarh@lemmy.stad.social ⁨1⁩ ⁨year⁩ ago
    Other sources that are public domain or “cheap enough” for OpenAI to simply buy them. Hence my point that OpenAI is already worth enough that they could make a takeover offer for Reuters.
    