Add all remaining languages to WikiWho
Open, Needs Triage · Public

Description

I believe there are a number of users across the Wikimedia ecosystem who are eager for access to tools that rely on WikiWho data (including 'Who Wrote That?' as well as my new tool at impact-visualizer.wmcloud.org and the features of Programs & Events Dashboard that rely on it).

It's been running relatively smoothly with English plus about a dozen other languages, including some of the larger ones, so I feel relatively confident that the system can handle enabling the rest of the languages (based on the rule of thumb that en.wiki is about equivalent to the rest of the languages combined when it comes to resource utilization).

Let's add it!

(Two languages that I know in particular have users who want access to tools that need this data are uk and sv.)

Event Timeline

Hi there, I would also like to use it for tools to measure projects on nl and pap Wikipedia, so wanted to support this task!

Indeed things seem to be running well enough. To my knowledge there haven't been any outages in moons, with the last one resulting in a permanent fix. It'd be nice to figure out T344936: Systemd services rely on cron to restart, but I think it's otherwise mostly safe to say the WikiWho service is stable "enough".

T344938: Celery queue is full of deleted pages that won't ever get processed, meanwhile, is a problem that appears to continue to grow. We're now at ~310,000+ messages! Even so, it doesn't seem to have any real consequences (?)

I should note that if I remember correctly, we added the last round of new languages when we moved to a new VM, so we had a unique opportunity to run the import script on a machine different than the client-facing server. We may or may not want to do the same here, or take some measure to ensure it isn't hogging too many resources from processing the celery queue.

@MusikAnimal @TheresNoTime any chance either of you could squeeze this in soonish? It's a blocker to rolling out support for more languages for my Impact Visualizer (https://impact-visualizer.wmcloud.org/) project, and I'm eager to get some of the folks from those languages using it before I run out of time/money for active development.

@MusikAnimal is there documentation on how WikiWho imports are done, and would it help if someone preprocessed the data locally before adding it? (I suppose that processing dumps takes some time.)

It's worth noting that we could enable additional languages without an initial processing run; in that case, the first request for each article would trigger processing of that article (resulting in slower, but still usable, performance on that first request for Who Wrote That and Dashboard end users). I think that strategy would be fine for enabling all additional languages; the only blocker would be making sure we have appropriate storage for the gradual accumulation of processed article files for the additional languages (and leaving some headroom for continued growth of storage requirements for the already-enabled ones).
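To illustrate the "process on first request" strategy described above, here is a minimal sketch. It is not taken from the WikiWho codebase: the function name `get_or_process`, the `<storage_dir>/<lang>/<page_id>.p` file layout, and the `process` callback are all hypothetical and only show the general shape of lazy processing with results persisted to disk.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def get_or_process(
    storage_dir: str,
    lang: str,
    page_id: int,
    process: Callable[[str, int], Any],
) -> Any:
    """Return stored data for an article, processing it on first request.

    All names and the on-disk layout are assumptions for illustration,
    not WikiWho's actual storage scheme.
    """
    path = Path(storage_dir) / lang / f"{page_id}.p"
    if path.exists():
        # Fast path: the article was processed by an earlier request.
        with path.open("rb") as f:
            return pickle.load(f)

    # Slow path: the first request for this article pays the processing cost.
    result = process(lang, page_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result


if __name__ == "__main__":
    def slow_process(lang: str, page_id: int) -> dict:
        # Stand-in for the real (expensive) authorship computation.
        print(f"processing {lang}:{page_id}")
        return {"lang": lang, "page_id": page_id}

    with tempfile.TemporaryDirectory() as tmp:
        get_or_process(tmp, "uk", 42, slow_process)  # triggers processing
        get_or_process(tmp, "uk", 42, slow_process)  # served from disk
```

With this approach storage grows only as articles are actually requested, which is why the storage headroom mentioned above is the main constraint rather than up-front processing time.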

@Zache here's the documentation: https://github.com/wikimedia/wikiwho_api/blob/main/WIKIMEDIA_VPS_SETUP.md#adding-new-languages-to-wikiwho

Thank you! Also, follow-up question: how much disk space does one language version take? (for example enwiki, dewiki, huwiki) Note: I don't really need this information but would like to get some understanding of the sizes.

The following wikis are running the FlaggedRevs extension without WikiWho. I think WikiWho data could be useful when reviewing pending changes, so it would be nice to have these enabled at some point.

T407660: WikiWho: pickles are too big for pickle_storage needs to be resolved before we do anything. Adding languages will fill pick_storage02 more than it already is, so yeah, something like T310386: Investigate realtime processing of non-mainspace pages and unsupported wikis without persisting data to disk would seem worthwhile. I do think we'll have room for a number of other languages, but given the growth we're seeing I'd say cap it at just a few, or start with just one big-ish wiki, or one where we expect WikiWho to be used a lot and that could benefit from the pre-computation. Russian Wikipedia perhaps?

On the question of how much disk space one language version takes: it's generally proportional to the article count. As of the time of writing:

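| Language | Size (GB) | Article count |
| --- | --- | --- |
| English | ~3,200 | 7,083,607 |
| French | 703 | 2,717,846 |
| German | 702 | 3,064,852 |
| Spanish | 535 | 2,071,519 |
| Italian | 455 | 1,942,311 |
| Japanese | 360 | 1,478,586 |
| Polish | 247 | 1,673,469 |
| Portuguese | 240 | 1,158,983 |
| Dutch | 189 | 2,200,912 |
| Arabic | 174 | 1,284,342 |
| Turkish | 105 | 650,610 |
| Hungarian | 97 | 562,287 |
| Indonesian | 90 | 748,314 |
| Basque | 32 | 474,423 |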
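For anyone who wants to turn the figures above into a rough storage estimate for a wiki that isn't enabled yet, here is a small sketch. The numbers are copied from the table; the helper names (`mb_per_article`, `estimate_size_gb`), the use of the median ratio, and the decimal GB-to-MB conversion are my own assumptions, so treat the output as a ballpark only.

```python
from statistics import median

# (size in GB, article count), copied from the table above.
# The English size is approximate ("~3,200").
WIKIS = {
    "English": (3200, 7_083_607),
    "French": (703, 2_717_846),
    "German": (702, 3_064_852),
    "Spanish": (535, 2_071_519),
    "Italian": (455, 1_942_311),
    "Japanese": (360, 1_478_586),
    "Polish": (247, 1_673_469),
    "Portuguese": (240, 1_158_983),
    "Dutch": (189, 2_200_912),
    "Arabic": (174, 1_284_342),
    "Turkish": (105, 650_610),
    "Hungarian": (97, 562_287),
    "Indonesian": (90, 748_314),
    "Basque": (32, 474_423),
}


def mb_per_article(size_gb: float, articles: int) -> float:
    """Average storage per article in MB (assuming 1 GB = 1000 MB)."""
    return size_gb * 1000 / articles


# Per-wiki ratios, to see how loosely size tracks article count.
RATIOS = {name: mb_per_article(gb, n) for name, (gb, n) in WIKIS.items()}


def estimate_size_gb(article_count: int) -> float:
    """Ballpark size for a new wiki, using the median MB-per-article ratio."""
    return median(RATIOS.values()) * article_count / 1000


if __name__ == "__main__":
    for name, ratio in sorted(RATIOS.items(), key=lambda kv: kv[1]):
        print(f"{name:<11} {ratio:.3f} MB/article")
    # Hypothetical wiki with one million articles.
    print(f"estimate: {estimate_size_gb(1_000_000):.0f} GB")
```

The per-article ratio varies quite a bit between wikis (Basque and Dutch sit at the low end, English at the high end), so "proportional to article count" is only a first approximation; revision history length per article presumably matters too.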

I would like to add Chinese Wikipedia (zh) to the list. We really want a true authorship tool, so that we don't need to rely on "Top 10 by added text (approximate)" to determine authorship.