
topic overlap between Wikipedia language versions
Open, Needs Triage, Public

Description

Idea:
The different language Wikipedias cover very different topics in their articles. With the sitelinks on Wikidata we have data to analyze this further. It'd be useful to have an overview of the overlap of articles between the different language versions of Wikipedia. We want to make the result of this actionable.

It could look something like this:

|               | not covered in enwp | not covered in dewp | not covered in frwp |
| enwp articles | -                   | 10                  | 50                  |
| dewp articles | 42                  | -                   | 12                  |
| frwp articles | 15                  | 150                 | -                   |

Each cell could then link to a list of missing topics to make it actionable. Preferably the list would be ordered by the number of other Wikipedias that cover the topic.
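For illustration, one such cell's list could come from a WDQS query along these lines (just a sketch of the idea, not part of the proposal; dewp and enwp are example wikis, and for large wikis a query like this may well run into the public endpoint's timeout):

```sparql
# Sketch: topics covered in dewiki but not in enwiki,
# ordered by how many wikis cover them (sitelink count).
SELECT ?item ?itemLabel ?sitelinks WHERE {
  ?dewpArticle schema:about ?item ;
               schema:isPartOf <https://de.wikipedia.org/> .
  FILTER NOT EXISTS {
    ?enwpArticle schema:about ?item ;
                 schema:isPartOf <https://en.wikipedia.org/> .
  }
  ?item wikibase:sitelinks ?sitelinks .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de" . }
}
ORDER BY DESC(?sitelinks)
LIMIT 100
```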

Notes:

  • We should make it clear that there are good reasons for some topics not being covered in a Wikipedia, and that it is not always necessary to create a new article. These reasons can include:
    • the topic is not considered notable for that Wikipedia
    • the topic is covered, but only as a paragraph in another article, for example
  • Later this could be expanded to the other Wikimedia projects.

See also:
T200859: Add "haswbsitelink" to find items missing in a certain wiki
T236992: Order Wikidata search result by number of statements/labels/sitelinks/identifiers

Event Timeline

@Lydia_Pintscher @Manuel @WMDE-leszek

Before we proceed with this, please take a look at our WDCM Sitelinks Dashboard:

  • Wiki View tab and then
    • Wiki Similarity

I would say that the similarity graph presented there is pretty close to what you are looking for.

Maybe we should just think about extending the functionality of this WDCM system component instead of going for a new data product?

Darcyisverycute subscribed.

{F35452779} {F35452776}
Sorry I didn't have time to write here yesterday; I worked on this as part of the hackathon. I gave a presentation (slides and data in xlsx export attached; it doesn't render great, so I anonymously published it online here as well). Since there is no fast way to test whether a given article about a Wikidata item is in mainspace, my approach works around this by instead relying on inclusion in a large encyclopedia ID system (I chose Encyclopedia Britannica, info in the slides). It is fast enough to run a comparison between two languages over the ~170k items in that ID system within the 1-minute query timeout window on https://query.wikidata.org/
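For reference, a single pairwise comparison of the kind described could look roughly like this (a sketch reconstructed from the description, not the actual query from the slides; it assumes the Britannica filter is the "Encyclopædia Britannica Online ID" property P1417):

```sparql
# Sketch: Britannica-listed items with a dewiki article but no enwiki article
# (one cell of the matrix). P1417 is an assumption; see the slides for the real setup.
SELECT (COUNT(DISTINCT ?item) AS ?missing) WHERE {
  ?item wdt:P1417 ?britannicaId .
  ?dewpArticle schema:about ?item ;
               schema:isPartOf <https://de.wikipedia.org/> .
  FILTER NOT EXISTS {
    ?enwpArticle schema:about ?item ;
                 schema:isPartOf <https://en.wikipedia.org/> .
  }
}
```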

So to fill out the rest of the matrix I just need to work out a way to programmatically combine the queries into a table and run them on a database dump, or to run queries of the form in my presentation sequentially (possibly also against a database dump). The full matrix covers ~170 language wikis (out of the 250+ language editions), so roughly 170 × 170 ≈ 28,900 queries in total if we wanted the full table. @Lydia_Pintscher do you have any advice on scaling up this approach?
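One way the per-pair queries could perhaps be combined (a sketch, not something from the presentation): instead of one query per wiki pair, a single GROUP BY query per source wiki can count, for every other wiki, how many of the source wiki's Britannica-listed items it also covers; the "not covered" numbers then follow by subtracting from the source wiki's total. That would be on the order of 170 queries instead of ~28,900, though each one is heavier and might still need a dump or further splitting if it times out. Again assuming P1417 as the ID property:

```sparql
# Sketch: for items with a Britannica ID (assumed P1417) and an enwiki article,
# count per target Wikipedia how many of them it also covers.
# The row for en.wikipedia.org itself gives the total, so
# "not covered in X" = total - covered(X).
SELECT ?targetWiki (COUNT(DISTINCT ?item) AS ?covered) WHERE {
  ?item wdt:P1417 ?britannicaId .
  ?srcArticle schema:about ?item ;
              schema:isPartOf <https://en.wikipedia.org/> .
  ?tgtArticle schema:about ?item ;
              schema:isPartOf ?targetWiki .
  FILTER(CONTAINS(STR(?targetWiki), ".wikipedia.org/"))
}
GROUP BY ?targetWiki
ORDER BY DESC(?covered)
```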

(NB my spreadsheet is the same as in the idea description but transposed)

Thank you! :)

I unfortunately don't have any good tips for scaling.