
topic overlap between Wikipedia language versions
Open, Needs Triage, Public

Description

Idea:
The different language Wikipedias cover very different topics in their articles. With the sitelinks on Wikidata we have data to analyze this further. It'd be useful to have an overview of the overlap of articles between the different language versions of Wikipedia. We want to make the result of this actionable.

It could look something like this:

|               | not covered in enwp | not covered in dewp | not covered in frwp |
| enwp articles | -                   | 10                  | 50                  |
| dewp articles | 42                  | -                   | 12                  |
| frwp articles | 15                  | 150                 | -                   |

Each cell could then link to a list of missing topics to make it actionable. Preferably the list would be ordered by the number of other Wikipedias that cover the topic.
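For illustration, one such cell's list could come from a WDQS query along these lines (just a sketch of the idea, not part of the proposal; dewp and enwp are example wikis, and for large wikis a query like this may well run into the public endpoint's timeout):

```sparql
# Sketch: topics covered in dewiki but not in enwiki,
# ordered by how many wikis cover them (sitelink count).
SELECT ?item ?itemLabel ?sitelinks WHERE {
  ?dewpArticle schema:about ?item ;
               schema:isPartOf <https://de.wikipedia.org/> .
  FILTER NOT EXISTS {
    ?enwpArticle schema:about ?item ;
                 schema:isPartOf <https://en.wikipedia.org/> .
  }
  ?item wikibase:sitelinks ?sitelinks .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de" . }
}
ORDER BY DESC(?sitelinks)
LIMIT 100
```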

Notes:

  • We should make it clear that there are good reasons for some topics not being covered in a Wikipedia, and that it is not always necessary to create a new article. These reasons can include:
    • the topic is not considered notable for that Wikipedia
    • the topic is covered, but only as a paragraph in another article, for example
  • Later this could be expanded to the other Wikimedia projects.

See also:
T200859: Add "haswbsitelink" to find items missing in a certain wiki
T236992: Order Wikidata search result by number of statements/labels/sitelinks/identifiers

Event Timeline

@Lydia_Pintscher @Manuel @WMDE-leszek

Before we proceed with this, please take a look at our WDCM Sitelinks Dashboard:

  • Wiki View tab and then
    • Wiki Similarity

I would say that the similarity graph presented there is pretty close to what you are looking for.

Maybe we should just think about extending the functionality of this WDCM system component instead of going for a new data product?

Darcyisverycute subscribed.

{F35452779} {F35452776}
Sorry I didn't have time to write here yesterday; I worked on this as part of the hackathon. I gave a presentation (slides and data in xlsx export attached; it doesn't render great, so I anonymously published it online here as well). Since there is no fast way to test whether a given article about a Wikidata item is in mainspace, my approach works around this by instead relying on inclusion in a large encyclopedia ID system (I chose Encyclopedia Britannica, info in the slides). It is fast enough to run a comparison between two languages over the ~170k items in that ID system within the 1-minute query timeout window on https://query.wikidata.org/
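For reference, a single pairwise comparison of the kind described could look roughly like this (a sketch reconstructed from the description, not the actual query from the slides; it assumes the Britannica filter is the "Encyclopædia Britannica Online ID" property P1417):

```sparql
# Sketch: Britannica-listed items with a dewiki article but no enwiki article
# (one cell of the matrix). P1417 is an assumption; see the slides for the real setup.
SELECT (COUNT(DISTINCT ?item) AS ?missing) WHERE {
  ?item wdt:P1417 ?britannicaId .
  ?dewpArticle schema:about ?item ;
               schema:isPartOf <https://de.wikipedia.org/> .
  FILTER NOT EXISTS {
    ?enwpArticle schema:about ?item ;
                 schema:isPartOf <https://en.wikipedia.org/> .
  }
}
```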

So to fill out the rest of the matrix I just need to work out a way to programmatically combine the queries into a table and run them on a database dump, or to run queries of the form in my presentation sequentially (possibly also against a database dump). The full matrix covers ~170 language wikis (out of the 250+ language editions), so roughly 170 × 170 ≈ 28,900 queries in total if we wanted the full table. @Lydia_Pintscher do you have any advice on scaling up this approach?
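One way the per-pair queries could perhaps be combined (a sketch, not something from the presentation): instead of one query per wiki pair, a single GROUP BY query per source wiki can count, for every other wiki, how many of the source wiki's Britannica-listed items it also covers; the "not covered" numbers then follow by subtracting from the source wiki's total. That would be on the order of 170 queries instead of ~28,900, though each one is heavier and might still need a dump or further splitting if it times out. Again assuming P1417 as the ID property:

```sparql
# Sketch: for items with a Britannica ID (assumed P1417) and an enwiki article,
# count per target Wikipedia how many of them it also covers.
# The row for en.wikipedia.org itself gives the total, so
# "not covered in X" = total - covered(X).
SELECT ?targetWiki (COUNT(DISTINCT ?item) AS ?covered) WHERE {
  ?item wdt:P1417 ?britannicaId .
  ?srcArticle schema:about ?item ;
              schema:isPartOf <https://en.wikipedia.org/> .
  ?tgtArticle schema:about ?item ;
              schema:isPartOf ?targetWiki .
  FILTER(CONTAINS(STR(?targetWiki), ".wikipedia.org/"))
}
GROUP BY ?targetWiki
ORDER BY DESC(?covered)
```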

(NB my spreadsheet is the same as in the idea description but transposed)

Thank you! :)

I unfortunately don't have any good tips for scaling.