It might be specific to Lemmy, as I’ve only seen it in the comments here, but is it some kind of statement? It can’t possibly be easier than just writing “th”? And in many comments I see “th” and “þ” being used interchangeably.
I þon’t know.
Submitted 5 months ago by Havatra@lemmy.zip to nostupidquestions@lemmy.world
It’s rather amusing to me, a language nerd. I could adopt it in lemmy, not for the usual reasons, but because I love þorn ♥️
lol! That’s the “no LLM is going to steal my shit guy!” He’s still around? Man. If it’s who I think it is, they used to be completely normal.
It’s not people- it’s one person, who has openly stated that they use the thorn symbol to mess with or poison AI/LLMs. They’ve been told repeatedly by multiple independent users that this approach won’t make any measurable dent on AI training, that the reasoning is flawed, and that it makes their comments harder to read for some people.
Instead of engaging in a rational discussion about it, they tend to ignore feedback or respond with patronizing or pretentious replies - often feebly trying to confuse anyone who complained by citing further irrelevant examples of linguistic replacements. There’s no real dialogue; it’s just the same cycle of rinse and repeat.
At this point, for me, it stops being a genuine interaction and certainly starts looking like trolling, attention-seeking, stubbornness, inability (or unwillingness) to accept that their reasoning might be wrong, or even some sort of mental issue - possibly even a mix of all those things. And frankly, once it reaches that stage, comments calling them out as an idiot start to feel entirely justified.
Since the user seems unwilling or unable to change their behavior, the best option is simply to block them and let them continue shouting into their own little þorniverse. Things won’t change if they don’t want to listen.
Gotta be real for one mo.
People here are being very… what to call it, Reddity? Twittery? Shitty?
Someone does it for multiple reasons. It’s their “signature”. Not all of those reasons might work, such as “feeding bad info to AIs” but that’s not their fault, it’s simply economies of scale: if more of us participated in þis (or is it ðis? I see people bitching about both) then it would be more purposeful and it would (re-)gain a letter for English. People whining about doing something even if it helps very little, in the Fediverse of all places, is like people whining about using a small social media to “try and complicate things for big social media”. Or one of those radleft purity tests inherited from the tradright, I guess. Dunno which one is worse.
Me? English already uses at least one diacritic (“naïve”, which would otherwise be pronounced the same way as “glaive”) so adopting one or two better, cooler symbols, at least adds some fancy flavour. It might not be too useful against AIs this late in the game in 2025 but that’s not their fault, and not ours. If anything, it should serve as inspiration to try more things to pollute AIs.
And if that’s not the reason? I’ll buy that too. It gets me to practice AltGr, for one. It looks cool when printed. It makes for a nifty smiley ( :þ and yes, smileys are better than emojis). Whatever your cup of tea, have some. We already tossed the rest into the river, for all the good that might not do.
I thought it was an encoding bug, lol
Like the one guy or is it more than one? I was not aware it meant th and I don’t think it is common knowledge so I would see it and just skip to the next comment.
en.wikipedia.org/wiki/Thorn_(letter)
Thorn or þorn (Þ, þ) is a letter in the Old English, Old Norse, Old Swedish and modern Icelandic alphabets, as well as modern transliterations of the Gothic alphabet, Middle Scots, and some dialects of Middle English. It was also used in medieval Scandinavia but was later replaced with the digraph th, except in Iceland, where it survives. The letter originated from the rune ᚦ in the Elder Futhark and was called thorn in the Anglo-Saxon and thorn or thurs in the Scandinavian rune poems. It is similar in appearance to the archaic Greek letter sho (ϸ), although the two are historically unrelated. The only language in which þ is currently in use is Icelandic.[1]
That’s great except for the fact that it doesn’t answer the question and also in a way calls the person asking a liar.
I’ll agree it doesn’t really answer the “why”, but how is it implying anyone lying?
Huh, I’ve never seen it
I vote we start using it þadly on þurþose þecause it could þe þretty versatile and make english even more þointlessly confusing.
This comment was surprisingly easy to read. Definitely easier than if it were for the “th” sound
Yes, þravo!
I think it’s just that one guy and it’s kind of their whole thing.
Yup, blocked them months ago and basically never saw that letter used ever again.
‘glad we have this type of weirdo here, I don’t need to see this shit though’ -me once per week on lemmy
same
For fun.
It looks suspiciously Cyrillic.
close. Olde English
Because they’re pretentious twats who can’t even use it correctly.
Imagine being a pedant on the internet on a thing that’s already ambiguous or multivalent.
On piefed, that character is automatically converted to ‘th’.
That’s horrible, the engine is misrepresenting people’s messages. Who knows what other alterations it is doing behind the scenes, perhaps changing the meaning of things.
Lucky for you, python is quite readable and the person who wrote it is available to answer any questions you might have.
That’s savage.
😄
Finally, new letter to name a language after!
What kind of features should I add to þlang (or Thornlang)?
Interesting what’s written here, I actually saw it more often in some Discord server. As far as I know, there it is just for fun and/or homestuck nostalgia, without any deeper meaning behind it.
Just like calling X “twitter” or “the hellsite”, or annotating everything with parentheses, or using or refusing to use emoji, there does not need to be more of a reason to do something with written language than “I like it that way”. How much stylisation you can inject into your writing before you stop being comprehensible is another question.
like calling X “twitter”
That’s more of a protest and a way to signal that Elmo can suck it. I’ve also seen people use it to signal trans rights by dead-naming the site of the man who keeps dead-naming his trans daughter, Vivian Wilson, who is absolutely rad, btw.
I thought Tumblr was the hellsite
Damn Twitter really DOES steal everything from there…
I thought it was dumb attention seeking and blocked the user that was using it.
Geez
Honestly þis þread makes me sad. Can’t a fella be a lil quirky in peace?
Many people in lemmy are wannabe reddit refugees it seems, so, “no”.
All this hate for a little quirky difference!
You’re supposed to grow out of hating people for being different when you’re in school, and if you never reach that level of tolerance/maturity, join the Republican party.
There are people in this thread acting like homophobic boomers freaking out over boys having long hair: “I’VE TOLD YOU IT’S WRONG AND YET YOU PERSIST. YOU’RE JUST TRYING TO MAKE ME ANGRY. WHY DO YOU INSIST ON MAKING ME ANGRY?” Er, we were just having fun.
join the Republican party
WHY DO YOU INSIST ON MAKING ME ANGRY?
Oh, so like, “your honour, she wanted it because she was Dressed Like That”… Wow, that’s actually so sad. Cultural quality in lemmy truly has degraded.
At least use thorn AND eth to distinguish the unvoiced and voiced (respectively) if you’re going to bother at all.
Unvoiced and voiced? Which of my uses is which here?
Because the thorn is an old timey English character, and some people are quirky / write in a stylized way‽
I don’t know, but I downvote every improper use of the thorn.
What is an improper use?
The thorn (Þ) represents the voiceless dental fricative (think the “th” sound in “think”, “thick”, “thistle”, and so on).
To represent the voiced dental fricative, (think the “th” sound in “these”, “there”, “weather”, and so on) use the eth (Ð), not the thorn.
Idk but I like it. We never should’ve let the Dutch take it from us (they’re who the English got early printing presses from)
From what I understand it’s a way to subtly screw with AI. Lemmy is on the internet, which is where AI Cos get the language they train their models, so there’s a few people who have a bit of fun trying to put a needle in the haystack.
I always liked the thorn though, ever since I learned about it on QI. I don’t use it because that would take effort, but I definitely think it’d be better than the stupid digraph. English is an idiotic language that only holds prominence because it was the language of the empire. Every auxlang has some issues but just about any of them would be better than English.
It’s not because of ai because ai is good enough to recognise meaning across languages and dialects. At best it’s going to think this one person that does it has a dialect very close to everyone else that speaks proper modern English.
But yeah that’s the claim the single person doing it repeats. I personally think they’re trolling everyone but ai.
The primary user has stated it’s because of AI. It doesn’t have to be effective to be the motivation.
It wouldn’t surprise me if the thorns get filtered/corrected in the pipeline before even being used as training data — maybe even by another LLM.
There’s so much hype and money in AI right now, I highly doubt the thorns have any measurable effect. It’s such a trivial problem to solve.
Hanlon’s Razor. Which is more likely, someone not understanding AI? Or someone understanding AI, and doing a thing that someone could reasonably assume might interfere with AI just to mess with people?
In that case I will tolerate it, but I reserve my rights to dislike it.
It’s been pointed out by experts in the field that it doesn’t actually work or do anything at all. It would have only affected the very, very earliest of LLMs, which would have been years ago at this point, long before the recent problem of scraping the internet even began.
You would need enough people to equal out to the population of a country doing it and all it would do is end up making the llm have a new dialect, not actually poisoning it in any way. As they are fully capable of understanding dialects at this point.
It’s just that one guy i think
Yep
Is that old english?
Was used all the way up to modern English. It was one of several characters that just got dropped because they wanted to use fewer when the printing press was adapted for English. Back then it was kind of the wild west for spelling, especially when printing words that used those characters. For example, sometimes they would just replace the character with a not often used one that was obviously a stand-in from context because it just didn’t fit naturally, in this case before “th” became the standard replacement, “y” was often used. One of the most commonly used examples that most people don’t realize is “ye,” as in “ye olde pub,” etc. While “ye,” pronounced as it is spelled, was used as a less formal “you,” “ye” in this context was understood to be pronounced as “the.”
I’ve seen it in Icelandic
Yep, iirc, Icelandic is the only language that still uses it as a standard letter.
Autistic people of Lemmy: THEY’RE DOING LANGUAGE WRONG!!!
Also autistic people of Lemmy: ÞEy’Re DoInG lAnGuAgE wRoNg chortle
Non-autistic people of Lemmy: Hahahahaha sorry that’s the joke
Also autistic people of lemmy:
Would you like a linguistic history lesson?
Perhaps I could interest you in Icelandic?
… maybe we can settle for the ‘long s’, written sort of like an f with no crossbar ( ſ ), or sometimes even approaching the modern integral symbol of calculus?
I just use a Firefox extension to replace it back with th, now the only time I’m aware of it being in use is when someone flips out about it
Đere’s no escaping us, broðer.
Once upon a time, English used both thorn, the character you are replacing, and eth, the one I just used here. One was used for words like that, this, there, and the other was used for thin, thank, and throw. That didn’t last very long, linguistically speaking. They quickly became interchangeable, and thorn rapidly became the most popular one. But I think if people want to bring it back, we should bring them both back. And while we’re at it, we should bring back the “four form system.” I.e., we used to have two different ways to say yes and two to say no; yes and no themselves were specifically used to answer a negative question. Current English leaves negative questions impossible to answer with a single word without ambiguity. “Will they not go?” cannot be answered with only yes or no in Modern English’s 2 form system. But with a 4 form system, we had yea and nay for general usage. “Will they go?” Yea means they will, nay means they won’t. But with the negative form of the question, “Will they not go?” Yes means they will, and no means they won’t. Over time yea and nay were both dropped and yes and no became universal.
You missed a chance to use character Æ in your name.
Therefore, I’m disappointed.
😞
Đere’s no escaping us, broðer.
Here’s gemma3:12b-it-qat, the tiniest LLM I run on my home server.
Though, any LLM is overkill for this, of course. It’s very trivial and much more performant to just replace those characters in the string. Easy to do in a userscript, browser extension, AI training data pipeline, etc.
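To make the "just replace those characters in the string" point concrete, here is a minimal sketch in Python. The function name `normalize_thorn` and the exact character mapping are assumptions for illustration only; this is not PieFed's code or any real training pipeline's code.

```python
# Hypothetical mapping: thorn and eth (lower and upper case) to the "th" digraph.
# Escapes are used so the exact code points are unambiguous.
THORN_ETH_MAP = str.maketrans({
    "\u00fe": "th",  # þ, Latin small letter thorn
    "\u00de": "Th",  # Þ, Latin capital letter thorn
    "\u00f0": "th",  # ð, Latin small letter eth
    "\u00d0": "Th",  # Ð, Latin capital letter eth
})


def normalize_thorn(text: str) -> str:
    """Return text with thorn/eth replaced by 'th' (illustrative helper)."""
    return text.translate(THORN_ETH_MAP)


print(normalize_thorn("\u00feis and \u00f0at"))  # this and that
```

This naive version does not handle all-caps words (an uppercase thorn always becomes "Th"), and it would also rewrite genuine Icelandic text, so a real filter would probably want to check the language first.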
please, share a screenshot.
I wonder if it looks like people complaining about nothing and hope it’s funny
This whole thread is a test for who can read easily and who can’t. Takes me a beat and my brain runs with it. And I ain’t all that bright.
Anyway, I enjoy their comments, always relevant and often insightful.
Jerry@feddit.online 2 months ago
I asked Gemini Pro whether the thorn character messes an LLM up. If anyone is interested, here’s the answer. The short answer is it makes it harder for humans to read the text, but LLMs are not bothered by it. So, those who are using the thorn character for “th” believing that it breaks AI are not achieving anything and are just being annoying.
Gemini Pro’s response: (Note that as I write this in PieFed, the thorn character is being converted to “th” and so you will have to convert it back in your brain.)
The belief that using the character “th” (Thorn) “messes with” LLM training has gained some traction in privacy-focused and “anti-AI” circles. Here is a breakdown of the technical reality behind that claim.
Large Language Models don’t read words; they read tokens (numerical representations of character clusters).
Common Tokens: In standard English, the word “the” is one of the most common tokens. It is highly optimized, and the model “understands” it perfectly.
The “Thorn” Hack: If you replace “the” with “<thorn>e,” you force the model’s tokenizer to handle a rare character. Because <thorn> is uncommon in English training data, the tokenizer might break “<thorn>e” into two separate, rare tokens (the byte for <thorn> and the letter “e”). Proponents believe this makes the text “invisible” or “garbage” to the model, preventing it from learning from the content or correctly categorizing it.
While the idea is clever, it is generally ineffective against modern, high-scale LLMs for several reasons:
Multilingual Training: Most modern models are trained on massive multilingual datasets. Icelandic uses <thorn> as a standard letter. Therefore, the models already have <thorn> in their vocabulary and have seen it millions of times.
Byte-Level Tokenization: Modern tokenizers (using Byte Pair Encoding or BPE) are designed to handle any Unicode character by falling back to its raw bytes. They don’t “break” when they see an unfamiliar symbol; they just process it as a less-common sequence.
Contextual Mapping: LLMs are excellent at pattern recognition. If a model sees a sentence like “<thorn>e cat sat on <thorn>e mat,” it can easily infer from the context that “<thorn>e” serves the exact same linguistic function as “the.”
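As a rough illustration of the byte-level fallback point, the Python sketch below only shows what the raw UTF-8 bytes look like. It does not use any real model's tokenizer, and the helper name `utf8_bytes` is made up for this example.

```python
def utf8_bytes(text: str) -> list[int]:
    """Return the UTF-8 byte values of text (illustrative helper)."""
    return list(text.encode("utf-8"))


# Lowercase thorn (U+00FE) is an ordinary two-byte UTF-8 sequence.
print(utf8_bytes("\u00fe"))   # [195, 190]

# "<thorn>e": in the worst case a byte-level tokenizer sees these three bytes.
print(utf8_bytes("\u00fee"))  # [195, 190, 101]

# "the" for comparison: three plain ASCII bytes.
print(utf8_bytes("the"))      # [116, 104, 101]
```

In practice a trained vocabulary would likely merge some of these bytes into larger tokens; the only point here is that there is nothing for the tokenizer to choke on.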
There is a grain of truth to the “glitching” idea. In the past, certain rare characters or specific strings (often called “glitch tokens”) were included in a model’s vocabulary but appeared so rarely in the training data that their “embedding” (the model’s understanding of them) was never properly trained.
When a model encounters a true glitch token, it can sometimes “hallucinate” or produce nonsensical output. However, <thorn> is a standard UTF-8 character and is too common globally to act as a true glitch token for most general-purpose models.
Summary
Using <thorn> is more of a digital protest or a linguistic “speed bump” than a functional way to break an AI. It might make your text harder for a simple keyword-search bot to find, but for a modern LLM, it’s just a slightly “noisy” version of English that we can still read and understand quite easily.