What do you think about Abstract Wikipedia?

Submitted 10 months ago by ibt3321@lemmy.blahaj.zone to technology@lemmy.world

Wikifunctions is a new site that has been added to the list of sites operated by WMF. I definitely see uses for it in automating updates on Wikipedia and bots (and also for programmers to reference), but their goal is to translate Wikipedia articles to more languages by writing them in code that has a lot of linguistic information. I have mixed feelings about this, as I don’t like existing programs that automatically generate articles (see the Cebuano and Dutch Wikipedias), and I worry that the system will be too complicated for average people.


Comments


GenderNeutralBro@lemmy.sdf.org 10 months ago
Sounds like a great idea. Plain English (or any human language) is not the best way to store information. I’ve certainly noticed mismatches between the data in different languages, or across related articles, because they don’t share the same data source.

Take a look at the article for NYC in English and French and you’ll see a bunch of data points, like total area, that are different. Not huge differences, but any difference at all is enough to demonstrate the problem. There should be one canonical source of data shared by all representations.

Wikipedia is available in hundreds of languages. Why should hundreds of editors need to update the NYC page every time a new census comes out with new population numbers? Ideally, that would require only one change to update every version of the article.

In programming, the convention is to separate the data from the presentation. In this context, plain English is the presentation, and weaving actual data into it is sub-optimal. Something like the population or area of a city is not language-dependent, and should not be stored in a language-dependent way.
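To make the data/presentation split concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `CityFacts` record, the templates, the `render` helper, and the made-up city and figures. It is not how Wikidata or Wikipedia actually store or render anything.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CityFacts:
    """Hypothetical language-independent record: one canonical copy of the data."""
    name: str
    population: int
    area_km2: int


# Presentation layer: one sentence template per language (illustrative only).
TEMPLATES = {
    "en": "{name} has a population of {population} and covers {area} km².",
    "fr": "{name} compte {population} habitants et s'étend sur {area} km².",
}

# Thousands separators differ by language, so number formatting is presentation too.
THOUSANDS_SEP = {"en": ",", "fr": " "}


def format_int(value: int, lang: str) -> str:
    """Format an integer with the thousands separator used for `lang`."""
    return f"{value:,}".replace(",", THOUSANDS_SEP[lang])


def render(facts: CityFacts, lang: str) -> str:
    """Fill the template for `lang` from the shared data record."""
    return TEMPLATES[lang].format(
        name=facts.name,
        population=format_int(facts.population, lang),
        area=format_int(facts.area_km2, lang),
    )


# Made-up figures for a made-up city.
facts = CityFacts(name="Exampleville", population=1234567, area_km2=890)
for lang in TEMPLATES:
    print(render(facts, lang))

# A new census arrives: one change to the data updates every language version.
updated = replace(facts, population=1300000)
for lang in TEMPLATES:
    print(render(updated, lang))
```

The point of the sketch is only that the number lives in one place; each language contributes a template and formatting rules, not its own copy of the data.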

Ultimately, this is about reducing duplicate effort and maintaining data integrity.

- schnurrito@discuss.tchncs.de 10 months ago
  This problem was solved in like 2012 or 2013 with the introduction of Wikidata, but not all language editions have decided to use that.
  
  - GenderNeutralBro@lemmy.sdf.org 10 months ago
    How common is it in English? I haven’t checked a lot of articles, but I did check the source of the English and French NYC articles I linked and it seems like all the information is hardcoded, not referenced from Wikidata.
    
  - rottingleaf@lemmy.zip 10 months ago
    
    but not all language editions have decided to use that.
    
    Some people like having the little power, which they call “meritocracy”, to decide what belongs in the article and what doesn’t.
    
- robotica@lemmy.world 10 months ago
  Disclaimer: I didn’t do any research on this, but what would be wrong with just having AI translate the text, given a reliable enough AI? No code required, just plain human speech.
  
  - GenderNeutralBro@lemmy.sdf.org 10 months ago
    This will help make machine translation more reliable, ensuring that objective data does not get transformed along with the language presenting that data. It will also make it easier to test and validate the machine translators.
    
    Any automated translations would still need to be reviewed. I don’t think we will (or should) see totally automated translations in the near future, but I do think the machine translators could be a very useful tool for editors.
    
    Language models are impressive, but they are not efficient data retrieval systems. Denny Vrandečić, the founder of Wikidata, has a couple of insightful videos about this topic.
    
    This one talks about knowledge graphs in general, from 2020: www.youtube.com/watch?v=Oips1aW738Q
    
    This one is from last year and is specifically about how you could integrate LLMs with the knowledge graph to greatly increase their accuracy, utility, and efficiency: www.youtube.com/watch?v=WqYBx2gB6vA
    
    I highly recommend that second video. He does a great job laying out what LLMs are efficient for, what more conventional methods are efficient for, and how you can integrate them to get the best of both worlds.
    
AbouBenAdhem@lemmy.world 10 months ago
I assume the main benefit will be for users of less-spoken languages, who currently get out-of-date articles or none at all.

lvxferre@mander.xyz 10 months ago

but their goal is to translate Wikipedia articles to more languages by writing them in code that has a lot of linguistic information

That’ll get unruly really fast.

Languages simply don’t agree on how to split the usage of words. Or grammatical case. Or if, when and how to do agreement.

Just for the sake of example: how are they going to keep track of case in a way that doesn’t break Hindi, or Basque, or English, or Guarani? Or grammatical gender for a word like “milk”? (Not even the Romance languages agree on it.) At a certain point, it simply gets easier to write the article in all those languages than to code something to write it for you.

I think that the best use scenario is to automate tidbits of highly changing data.

- Atemu@lemmy.ml 10 months ago
  
  Languages simply don’t agree on how to split the usage of words. Or grammatical case. Or if, when and how to do agreement.
  
  Just for the sake of example: how are they going to keep track of case in a way that doesn’t break Hindi, or Basque, or English, or Guarani? Or grammatical gender for a word like “milk”? (Not even the Romance languages agree on it.) At a certain point, it simply gets easier to write the article in all those languages than to code something to write it for you.
  
  I don’t know what the WMF is planning here but what you’re pointing out is precisely what abstraction would solve.
  
  If you had an abstract way to represent a sentence, you would be independent of any one order or case or whatever other grammatical feature. In the end you obviously do need actual sentences with these features. To get these, you’d build a mechanism that converts the abstract sentence representation into concrete sentences for specific languages, each correctly constructed according to that language’s rules.
  
  Same with gender. What you’d store would not be that e.g. some German sentence is talking about the feminine milk but rather that it’s talking about the abstract concept of milk. How exactly that abstract concept is represented in words would then be up to individual languages to decide.
  
  I have absolutely no idea whether what I’m talking about here would be practical to implement, but in theory it could work.
  
  - lvxferre@mander.xyz 10 months ago
    Abstractions are not magic, and they cannot make info appear out of nowhere. Somewhere inside that abstraction you’ll need to have the pieces of info that Spanish “leche” [milk] is feminine, that Zulu “ubisi” [milk] is class 11, that English predicative uses the ACC form, and so on.
    
    And you’ll need people to mark a multitude of distinctions in their sentences, when writing them down, that the abstraction layer would demand for other languages. Such as tagging the “I” in “I see a boy” as “+masculine, +older-person, +informal” so Japanese correctly conveys it as “ore” instead of “boku”, “atashi”, “watashi”, etc.
    
    Even the idea of an “abstract concept of milk” doesn’t work as well as it sounds, because languages will split even the abstract concepts in different ways. For example, does the abstract concept associated with a living pig include its flesh?
    
    And the language itself cannot decide those things. A language is not an agent; it doesn’t “do” something. You’d need people to actively insert those pieces of info for each language. That’s perhaps doable for the most spoken ones, but those are the ones that would benefit the least from this.
    
- Silentiea@lemm.ee 10 months ago
  They’re just going to write all the articles in lojban.
  
  - lvxferre@mander.xyz 10 months ago
    Not even that would do the trick: practical usage of Lojban relies heavily on fu’ivla, which carry with them the semantic scope assigned to the original words. .u’i I want to see them try, though.
    
- lvxferre@mander.xyz 10 months ago
  I’ll reply to myself to highlight a point, and issue a challenge for those who assume that WMF’s apparent goal (to translate Wikipedia articles to more languages by writing them in code that has a lot of linguistic information) is actually viable:
  
  Here’s an excerpt from an actual Wikipedia article: “the solubility of these gases depending on the temperature and salinity of the water.” If you think that the goal is viable, then show me all linguistic information that you expect Wikipedia writers to input in said code, and then I’ll show you why it is not enough and it’ll still output “then who was phone?” tier nonsense for some languages.
  
  It isn’t even a full sentence, how hard would it be? /s
  
  Too much work? Then focus on the PP at the end of the sentence. It’s still probably more information than a full article. (And if you don’t know what I mean by “PP” you probably don’t have the knowledge to talk about this.)
  
  Hic Rhodus, hic salta.
  
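The disagreement in the thread above, abstract content plus per-language rendering versus the lexical information each language still needs, can be sketched in a few lines of Python. This is a toy under stated assumptions: the `HasProperty` record, the lexicon tables, and the `realize` function are all hypothetical and are not the Wikifunctions or Abstract Wikipedia design. It handles one sentence pattern in three languages that happen to share word order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HasProperty:
    """Hypothetical language-neutral content: 'entity has property'."""
    entity: str
    prop: str


# Per-language lexicon (illustrative). The abstraction cannot invent grammatical
# facts: someone has to record that "leche" is feminine and "leite" is masculine.
NOUNS = {
    "en": {"milk": ("milk", None)},
    "es": {"milk": ("leche", "f")},
    "pt": {"milk": ("leite", "m")},
}
ADJECTIVES = {
    "en": {"cold": {None: "cold"}},
    "es": {"cold": {"m": "frío", "f": "fría"}},
    "pt": {"cold": {"m": "frio", "f": "fria"}},
}
ARTICLES = {
    "en": {None: "The"},
    "es": {"m": "El", "f": "La"},
    "pt": {"m": "O", "f": "A"},
}
COPULA = {"en": "is", "es": "está", "pt": "está"}


def realize(content: HasProperty, lang: str) -> str:
    """Turn the abstract content into a sentence, applying gender agreement."""
    noun, gender = NOUNS[lang][content.entity]
    adjective = ADJECTIVES[lang][content.prop][gender]
    article = ARTICLES[lang][gender]
    return f"{article} {noun} {COPULA[lang]} {adjective}."


content = HasProperty(entity="milk", prop="cold")
for lang in ("en", "es", "pt"):
    print(realize(content, lang))
```

The stored content never mentions gender, yet Spanish and Portuguese come out with different agreement because each lexicon supplies it. The sketch also shows where the effort goes: every language needs its own hand-filled tables, and it says nothing about case, noun classes, word order, or concepts that languages divide differently.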
abhibeckert@lemmy.world 10 months ago
I can definitely see a use for it - for example I often use Wikipedia to access statistics for sports athletes/teams.

Currently the data is entered by humans, which limits the amount of detail. If it were automated, it could include a lot more detail. For example, how many kilometres did a football player run on the 2nd of March 1994? That data is often available, but not on Wikipedia.

- ibt3321@lemmy.blahaj.zone 10 months ago
  The site itself is for contributors who want to create functions and write code for them. Examples of how it might be used in the future for articles:
  
  Z11884 for articles about chemicals.
  
  Z11302 for use in prose.
  
solrize@lemmy.world 10 months ago
This sounds like more roboticization of Wikipedia. Not good.
