
Add all remaining languages to WikiWho
Open, In Progress, Needs Triage · Public

Description

I believe there are a number of users across the Wikimedia ecosystem who are eager for access to tools that rely on WikiWho data (including 'Who Wrote That?' as well as my new tool at impact-visualizer.wmcloud.org and the features of Programs & Events Dashboard that rely on it).

It's been running relatively smoothly with English plus about a dozen other languages, including some of the larger ones, so I feel relatively confident that the system can handle enabling the rest of the languages (based on the rule of thumb that en.wiki is about equivalent to the rest of the languages combined when it comes to resource utilization).

Let's add them!

(Two languages that I know in particular have users who want access to tools that need this data are uk and sv.)

Progress

See https://meta.wikimedia.org/wiki/Talk:Community_Wishlist/W503#Current_progress

Event Timeline

Restricted Application added a subscriber: Aklapper.

Hi there, I would also like to use it for tools to measure projects on nl and pap Wikipedia, so wanted to support this task!

Indeed, things seem to be running well enough. To my knowledge there haven't been any outages in many moons, and the last one resulted in a permanent fix. It'd be nice to figure out T344936: Systemd services rely on cron to restart, but I think it's otherwise mostly safe to say the WikiWho service is stable "enough".

T344938: Celery queue is full of deleted pages that won't ever get processed meanwhile is a problem that appears to continue to grow. We're now at ~310,000+ messages! Even so, it doesn't seem to have any real consequences (?)

I should note that if I remember correctly, we added the last round of new languages when we moved to a new VM, so we had a unique opportunity to run the import script on a machine different than the client-facing server. We may or may not want to do the same here, or take some measure to ensure it isn't hogging too many resources from processing the celery queue.

@MusikAnimal @TheresNoTime any chance either of you could squeeze this in soonish? It's a blocker to rolling out support for more languages for my Impact Visualizer (https://impact-visualizer.wmcloud.org/) project, and I'm eager to get some of the folks from those languages using it before I run out of time/money for active development.

@MusikAnimal is there documentation on how WikiWho imports are done, and would it help if someone preprocessed the data locally before adding the languages? (I suppose that processing the dumps takes some time.)

It's worth noting that we could enable additional languages without an initial processing run; in that case, the first request for each article would trigger processing of that article (resulting in slower, but still usable, performance on that first request for Who Wrote That and Dashboard end users). I think that strategy would be fine for enabling all additional languages; the only blocker would be making sure we have appropriate storage for the gradual accumulation of processed article files for the additional languages (and clearing some headroom for continued growth of storage requirements for the already-enabled ones).
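To make the trade-off concrete, here is a minimal sketch of that on-demand strategy. It is not WikiWho's actual code: the function name `get_article_data`, the `process_article` callback, and the one-pickle-per-page file layout are all assumptions made for illustration.

```python
import pickle
from pathlib import Path


def get_article_data(store_dir, language, page_id, process_article):
    """Return processed data for a page, processing it on the first request.

    `process_article` is a hypothetical callback standing in for the real
    (slow) authorship computation; the on-disk layout is also hypothetical.
    """
    path = Path(store_dir) / language / f"{page_id}.p"
    if path.exists():
        # Fast path: an earlier request already processed this article.
        with path.open("rb") as fh:
            return pickle.load(fh)
    # Slow path: the first request for this article triggers processing.
    data = process_article(language, page_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(data, fh)
    return data
```

Each first request adds one file under `store_dir`, which is why storage headroom, rather than CPU time, becomes the limiting factor with this approach.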

@Zache here's the documentation: https://github.com/wikimedia/wikiwho_api/blob/main/WIKIMEDIA_VPS_SETUP.md#adding-new-languages-to-wikiwho

Thank you! Also, follow-up question: how much disk space does one language version take? (for example enwiki, dewiki, huwiki) Note: I don't really need this information but would like to get some understanding of the sizes.

The following wikis are running the FlaggedRevs extension without WikiWho. I think that WikiWho data could be used when reviewing pending changes, so it would be nice to have these enabled at some point.

T407660: WikiWho: pickles are too big for pickle_storage needs to be resolved before we do anything. Adding languages will fill pickle_storage02 more than it already is, so yeah, something like T310386: Investigate realtime processing of non-mainspace pages and unsupported wikis without persisting data to disk would seem worthwhile. I do think we'll have room for a number of other languages, but given the growth we're seeing I'd say cap it at just a few, or start with just one big-ish wiki, or one where we expect WikiWho to be used a lot and that could benefit from the pre-computation. Russian Wikipedia perhaps?

Thank you! Also, follow-up question: how much disk space does one language version take? (for example enwiki, dewiki, huwiki) Note: I don't really need this information but would like to get some understanding of the sizes.

It's generally proportional to the article count. As of the time of writing:

Language      Size (GB)    Article count
English       ~3,200       7,083,607
French        703          2,717,846
German        702          3,064,852
Spanish       535          2,071,519
Italian       455          1,942,311
Japanese      360          1,478,586
Polish        247          1,673,469
Portuguese    240          1,158,983
Dutch         189          2,200,912
Arabic        174          1,284,342
Turkish       105          650,610
Hungarian     97           562,287
Indonesian    90           748,314
Basque        32           474,423
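For anyone wanting a rough estimate for a wiki that isn't listed, the figures above can be turned into a back-of-the-envelope calculation. This is only a sketch: the helper names are made up for illustration, it assumes decimal gigabytes (1 GB = 1,000 MB), and it pools all listed languages into a single GB-per-article ratio.

```python
# (language, size in GB, article count), copied from the table above.
# The English size is approximate (~3,200 GB).
SIZES = [
    ("English", 3200, 7_083_607),
    ("French", 703, 2_717_846),
    ("German", 702, 3_064_852),
    ("Spanish", 535, 2_071_519),
    ("Italian", 455, 1_942_311),
    ("Japanese", 360, 1_478_586),
    ("Polish", 247, 1_673_469),
    ("Portuguese", 240, 1_158_983),
    ("Dutch", 189, 2_200_912),
    ("Arabic", 174, 1_284_342),
    ("Turkish", 105, 650_610),
    ("Hungarian", 97, 562_287),
    ("Indonesian", 90, 748_314),
    ("Basque", 32, 474_423),
]


def mb_per_article(size_gb, articles):
    """Average storage per article in MB, assuming 1 GB = 1,000 MB."""
    return size_gb * 1000 / articles


def estimate_size_gb(article_count, sizes=SIZES):
    """Scale the pooled GB-per-article ratio to a given article count."""
    total_gb = sum(size for _, size, _ in sizes)
    total_articles = sum(count for _, _, count in sizes)
    return article_count * total_gb / total_articles


if __name__ == "__main__":
    for name, size, count in SIZES:
        print(f"{name:<12}{mb_per_article(size, count):.3f} MB/article")
```

The per-article figure varies a lot between wikis, from well under 0.1 MB (Basque, Dutch) to about 0.45 MB (English), so the pooled estimate should be read as an order-of-magnitude guide rather than a prediction.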

I would like to add Chinese Wikipedia (zh) to the list. We really want a true authorship tool, so that we don't need to rely on "Top 10 by added text (approximate)" to determine authorship.

I got the official start on this tonight only to run into two new problems:

  1. The current code relies on the deprecated Database backup dumps format.
  2. XML exports contain an <origin> tag (even the aforementioned deprecated ones). For individual tests, I used e.g. sed -i 's|<origin>[^<]*</origin>||g' filename.xml but this is not in the script.
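If the workaround ever needed to live in the import script rather than in a manual sed call, the same substitution could be sketched in Python as below. The function name `strip_origin_tags` is hypothetical and this is not part of the existing script; it uses the same pattern as the sed expression and, like sed, works line by line.

```python
import re

# Same pattern as the sed expression: an <origin> element with no nested tags.
ORIGIN_TAG = re.compile(r"<origin>[^<]*</origin>")


def strip_origin_tags(lines):
    """Yield each line with any <origin>...</origin> elements removed."""
    for line in lines:
        yield ORIGIN_TAG.sub("", line)
```

Because it is a generator over lines, it can sit between a decompressed dump stream and the parser without loading the whole file into memory.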

Sounds like we just need to adjust the code to work with the new MediaWiki Content File Exports format.

we just need to adjust the code to work with the new MediaWiki Content File Exports format.

I think it involves:


Either that, or hack our way into using the old format as I did for T422230#11807452. So, remove <origin> tags then compress into the format acceptable by WikiWho (7z).

I took a look at this and submitted PR #25 which adds .bz2 support to the import script while keeping backward compat with .7z.
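For readers who haven't opened the PR, supporting both formats can look roughly like the sketch below. This is not the code from PR #25: the function name `open_dump` is hypothetical, and the .7z branch assumes a `7z` binary is available on PATH.

```python
import bz2
import io
import subprocess


def open_dump(path):
    """Return a text stream of the XML inside a .bz2 or .7z dump file."""
    if path.endswith(".bz2"):
        # The standard library decompresses bz2 as a stream.
        return bz2.open(path, mode="rt", encoding="utf-8")
    if path.endswith(".7z"):
        # Assumption: the `7z` CLI is installed; `e -so` extracts to stdout.
        proc = subprocess.Popen(["7z", "e", "-so", path], stdout=subprocess.PIPE)
        return io.TextIOWrapper(proc.stdout, encoding="utf-8")
    raise ValueError(f"Unsupported dump format: {path}")
```

Dispatching on the file extension keeps existing .7z dumps working while letting new .bz2 dumps be read without a separate decompression step.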

For the <origin> tag, upgrading mwxml from 0.3.3 to 0.3.8 should be enough. I checked the dependency chain and it looks safe: mwxml 0.3.8 and its deps (mwtypes, mwcli, para) have no Python version constraint. The only thing to watch is jsonschema: since requirements.txt says Python 3.5.2, we should pin jsonschema<4.18 to avoid pulling a version that needs Python 3.7+.

Tested locally with a real ukwiki dump sample and mwxml==0.3.8 parses it fine, no need for the sed workaround.

FYI https://wikiwho.wmcloud.org is currently erroring out because zh is not a known language code to Django. I'll work on this tomorrow. There are no service disruptions to the APIs (except for the newly added Chinese APIs).

EDIT: This has been resolved

Change #1281309 had a related patch set uploaded (by MusikAnimal; author: MusikAnimal):

[labs/xtools@main] Authorship: enable dsb, fa, hi, ru, sv, uk, vi Wikipedias

https://gerrit.wikimedia.org/r/1281309

Change #1281309 merged by jenkins-bot:

[labs/xtools@main] Authorship: enable dsb, fa, hi, ru, sv, uk, vi Wikipedias

https://gerrit.wikimedia.org/r/1281309

Change #1285530 had a related patch set uploaded (by MusikAnimal; author: MusikAnimal):

[labs/xtools@main] Authorship: enable ce, sr, no, fi, cs, ro and sh language Wikipedias

https://gerrit.wikimedia.org/r/1285530

Change #1285530 merged by jenkins-bot:

[labs/xtools@main] Authorship: enable ce, sr, no, fi, cs, ro and sh language Wikipedias

https://gerrit.wikimedia.org/r/1285530

Hey @MusikAnimal — I worked on T414075 investigating the compression angle, glad to see this moving forward with the two batches merged! I noticed T335786 (zhwiki) is still open and unassigned. I'm also curious whether the remaining FlaggedRevs wikis from @Zache's list (sq, als, be, bn, eo, ka, ia, vec) have been estimated yet.
Happy to run disk space/feasibility estimates for any of those if that would be useful — same kind of investigation as T414075. Just let me know where to focus.

@Xinacod zhwiki is coming :) You are pretty good at Python, perhaps you could help me review https://github.com/wikimedia/WikiWho/pull/2 which I hope will speed things up for CJK (and other languages, too).

I will add the FlaggedRev wikis as part of the next (and likely final) batch, then I will have to wrap up this wish. You can follow progress at https://meta.wikimedia.org/wiki/Talk:Community_Wishlist/W503#Current_progress

Change #1287050 had a related patch set uploaded (by MusikAnimal; author: MusikAnimal):

[labs/xtools@main] Authorship: Add support for eleven more languages

https://gerrit.wikimedia.org/r/1287050

Change #1287050 merged by jenkins-bot:

[labs/xtools@main] Authorship: Add support for eleven more languages

https://gerrit.wikimedia.org/r/1287050

@MusikAnimal Sorry to bother you, but please, let us continue the conversation in PR#2.

Change #1293652 had a related patch set uploaded (by MusikAnimal; author: MusikAnimal):

[labs/xtools@main] Authorship: add support for 28 more lanugages

https://gerrit.wikimedia.org/r/1293652

Change #1293652 merged by jenkins-bot:

[labs/xtools@main] Authorship: add support for 28 more lanugages

https://gerrit.wikimedia.org/r/1293652