
Add zhwiki to WikiWho
Open, Medium · Public · 8 Estimated Story Points

Description

Now that we have more disk space (T332630), we're trying to get as many of the top language Wikipedias added to WikiWho as possible. We'll want to do a quick test to ensure the Chinese logograms are treated correctly by the algorithm, but I expect they will be.

Acceptance criteria

  • Adjust algorithm to better support CJK languages
  • Download XML dump
  • Merge code to add zh language to WikiWho
  • Start processing the dump
  • Merge code to add zhwiki to EventStreams listener and restart services
  • Update Who-Wrote-That and XTools clients

Event Timeline

Restricted Application added subscribers: Stang, Aklapper.

WikiWho uses the Django framework, which has a built-in i18n library. That unfortunately does not support zh without a locale. We'll need to manually add the language code, something similar to https://stackoverflow.com/a/20265032/604142. I tried doing this without success, so in the interest of time, we may need to hold off on adding Chinese until after we move to the new VM.

JWheeler-WMF moved this task from Needs Discussion to Freezer on the Community-Tech board.

Moving to our backlog as low priority due to low demand and risk of performance issues re: WikiWho

Can we still get WikiWho enabled on zhwiki? There is strong demand for it -- even if people aren’t posting comments on this issue tracker. Authorship information is particularly important on zhwiki, since "Did You Know? Nominations" require the nominee to have at least 2/3 authorship. Currently, we’re using “Top 10 by added text” on XTools as a substitute, but this method isn’t very accurate and ignores contributions from editors outside the top 10. Please consider prioritizing this request.

@MusikAnimal Thank you for the zhwiki API! I have been testing it ever since it came out and it's great.

There is one bug I want to report though: Chinese text is read character by character rather than word by word. Chinese characters therefore need to be treated as individual tokens for authorship to be calculated correctly.

This is one example from the page "Miku (歌曲)" on zhwiki:

token 241: '发行。anamanaguchi在和vocaloid制作公司'
token 242: '[['
token 243: '克理普敦未來媒體'
token 244: ']]'
token 245: '洽谈后,决定与初音未來一同巡演,並创作一首原创曲目以及設計此歌的编舞。该曲列入2016年'

You see, there are no spaces between Latin words, Chinese characters, Arabic numerals, and half-width punctuation in this example. As a result, tokens become very large chunks, and the text is effectively split only at half-width punctuation.

I suppose Japanese will face the same issue, since it doesn't use spaces either. Korean, however, does use spaces.

My proposed solution is that we just consider all CJK Unified Ideographs + Hiragana + Katakana as individual tokens. This would solve the problem.
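A minimal sketch of that proposal in Python. It is not WikiWho's actual tokenizer: the name `tokenize`, the choice of Unicode ranges, and the handling of everything outside CJK (runs of other word characters stay together, any other non-space character stands alone) are assumptions made for illustration.

```python
import re

# Characters treated as one token each (ranges chosen for this sketch):
#   U+3400-U+4DBF  CJK Unified Ideographs Extension A
#   U+4E00-U+9FFF  CJK Unified Ideographs
#   U+3040-U+309F  Hiragana
#   U+30A0-U+30FF  Katakana
CJK_CHAR = r"[\u3400-\u4dbf\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff]"

TOKEN_RE = re.compile(
    rf"{CJK_CHAR}"               # a single CJK or kana character
    rf"|(?:(?!{CJK_CHAR})\w)+"   # a run of non-CJK word characters (Latin, digits)
    r"|\S"                       # any other single non-space character
)


def tokenize(text):
    """Split text so that each CJK/kana character becomes its own token."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(tokenize("发行。anamanaguchi在和vocaloid制作公司"))
```

Applied to token 241 from the example above, this splits the single chunk into individual characters while keeping "anamanaguchi" and "vocaloid" whole.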

First off, apologies I forgot about this task! I should have been making updates here.

Chinese text is read character by character rather than word by word. Chinese characters therefore need to be treated as individual tokens for authorship to be calculated correctly.

Bah! I had a feeling this all seemed too easy. Fixing this issue won't be particularly easy (for me) as it requires changes to the underlying algorithm. WikiWho was written externally and inherited by WMF. We'll need to fork the algorithm and adjust it to work for languages like Chinese and the others you mention.

I did manage to find the method that I think needs changing, but I've little idea what implications splitting on each character has on the algorithm. I suspect other changes will be needed too, such as to WhoColor (which is what powers Who-Wrote-That). Additionally, with every character as an individual token, I assume it would vastly increase the storage footprint.

All in all, this is not sounding very promising, I'm afraid :(

Is the algorithm at all useful the way it is now? If not, I am inclined to drop support for Chinese and Japanese, as it is expensive to keep it running. If we do manage to fix the algorithm for such languages, we'll want to re-process the XML dumps from scratch, so we'd have to start over anyway.

Sorry this issue did not occur to us beforehand!

Is the algorithm at all useful the way it is now? If not, I am inclined to drop support for Chinese and Japanese, as it is expensive to keep it running. If we do manage to fix the algorithm for such languages, we'll want to re-process the XML dumps from scratch, so we'd have to start over anyway.

Yeah, I have already started promoting WikiWho on Chinese Wikipedia. While waiting for official support of XTools Authorship and WikiBlame, I made a simple user script to check contributions for DYK nominations. On Chinese Wikipedia, to award credits to an editor who wrote an article, we check if they have more than 2/3 authorship, which is hard to calculate manually. Therefore the WikiWho API is a highly anticipated feature request, and many editors are excited to see it being implemented.
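The 2/3 check described above can be sketched as follows. This is not the user script mentioned in the comment. It assumes the input is simply a list with one editor name per surviving token, that every token counts equally, and that exactly 2/3 passes; a real check might weight tokens by character length or treat the boundary differently. The function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def authorship_shares(token_editors):
    """Return each editor's share of the surviving tokens as an exact fraction."""
    counts = Counter(token_editors)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {editor: Fraction(n, total) for editor, n in counts.items()}


def meets_threshold(token_editors, editor, threshold=Fraction(2, 3)):
    """True if `editor` wrote at least `threshold` of the tokens (boundary inclusive by assumption)."""
    share = authorship_shares(token_editors).get(editor, Fraction(0))
    return share >= threshold


if __name__ == "__main__":
    tokens = ["Alice"] * 7 + ["Bob"] * 3  # made-up attribution data
    print(authorship_shares(tokens))
    print(meets_threshold(tokens, "Alice"))
```

Using `Fraction` avoids floating-point rounding right at the 2/3 boundary. Because every token counts once, oversized tokens like the ones in the example above would distort the result, which is why the tokenization issue matters for this use case.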

Don't get discouraged :-)

I don't know the underlying architecture, but from using the API I had a vague feeling that the algorithm only runs when a request is sent (plus caching for all future requests on that revision). So compared to the vast number of articles on Chinese and Japanese Wikipedia, I don't think the footprint would increase that quickly.

And by the way, forking WikiWho seems to be the eventual answer. The original repo has not been updated for 7 years already.

I can help with fixing the algorithm if you need a hand.

And by the way, forking WikiWho seems to be the eventual answer. The original repo has not been updated for 7 years already.

I can help with fixing the algorithm if you need a hand.

If you have the skills and willpower, any help would be fantastic! I can offer code review and help with wiring things up with the API repo, but coding much of anything by myself is likely going to be a time sink given my lack of Python expertise.

I suppose for starters, I should set up forks of the upstream Python packages. I've done so here:

I don't know the underlying architecture, but from using the API I had a vague feeling that the algorithm only runs when a request is sent (plus caching for all future requests on that revision). So compared to the vast number of articles on Chinese and Japanese Wikipedia, I don't think the footprint would increase that quickly.

There is a "user queue" and a systemd process that runs in the background, reacting to every edit made via EventStreams. We only recently imported the dump, which was dated April 1, so whenever you made a request for an article it may have had to re-process a bunch of revisions. Over time, however, the background process will ensure the pickle files are up to date.
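To illustrate the kind of filtering such a background listener might do, here is a small sketch. It is not WikiWho's listener code. It assumes events shaped like the public `recentchange` stream (JSON with `wiki`, `type`, `namespace`, and `title` fields, delivered as server-sent `data:` lines), and the set of tracked wikis, the main-namespace restriction, and the function names are all hypothetical.

```python
import json

TRACKED_WIKIS = {"zhwiki"}        # hypothetical set of wikis to process
TRACKED_TYPES = {"edit", "new"}   # ignore log entries, categorization, etc.


def parse_sse_data(line):
    """Decode the JSON payload of a server-sent 'data:' line; None for other lines."""
    if not line.startswith("data:"):
        return None
    return json.loads(line[len("data:"):].strip())


def should_process(event, wikis=TRACKED_WIKIS):
    """True for article edits on a tracked wiki (main namespace only, by assumption)."""
    return (
        event.get("wiki") in wikis
        and event.get("type") in TRACKED_TYPES
        and event.get("namespace") == 0
    )


if __name__ == "__main__":
    sample = 'data: {"wiki": "zhwiki", "type": "edit", "namespace": 0, "title": "Miku (歌曲)"}'
    event = parse_sse_data(sample)
    if event is not None and should_process(event):
        print("queue for processing:", event["title"])
```

Adding a wiki to the listener then amounts to adding its database name to the tracked set, which is roughly what the "add zhwiki to EventStreams listener" criterion refers to.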

Anyhow, we actually have a significant amount of storage space to work with. It shouldn't be a huge problem unless it ends up needing, say, 2-3x as much disk space. Even then, I have a feeling we can make it happen, so don't let storage be a deterrent; it's just something to keep in mind :)

MusikAnimal changed the point value for this task from 3 to 8. · Apr 24 2026, 5:26 AM

Since WhoColor only consumes the tokenized result from WikiWho, there is nothing to modify in WhoColor.

Thanks so much for contributing!!! This is exciting :)

Heads up it may take me a while to review everything, but rest assured I'll get to it. I will be at the Hackathon over the coming week so it may leak into that time (which might mean a bit more delay given everything else that'll be going on there).

A note that following PR #1, the import process seems to be much slower than before, I guess because there's more tokenization to be done. At the rate we're going now, it may take weeks to finish processing zhwiki.

@Supergrey1 Do you believe PR #2 could help with overall speeds?

@Supergrey1 Do you believe PR #2 could help with overall speeds?

Yes, I believe PR #2 would speed up the process. I am not sure it will save weeks of time, but it is worth trying.