define languages for citolytics A/B test
Closed, Declined (Public)

Description

We should define a list of languages that we want to use for the citolytics A/B test.

  • English
  • German
  • French
  • Spanish

Event Timeline

@EBernhardson Could you please refine the list as needed? It might be interesting to add some languages with fewer articles to compare the performance. After that, @mschwarzer can start generating the data on our university cluster.

It was easiest to grab generic usage of the more_like feature, rather than the very specific usage of more_like for related articles, but it should be a reasonable proxy. The following data is for all queries that have occurred in 2017. This likely tracks the mobile page view percentage fairly closely.

Total more like queries: 120,591,501 (avg of 82 per second)

wiki      queries     share of total
enwiki    21569075    0.17886
jawiki    12261711    0.10168
eswiki     9504855    0.07882
itwiki     9401592    0.07796
dewiki     7365322    0.06108
plwiki     4578461    0.03797
zhwiki     4557144    0.03779
ptwiki     4404892    0.03653
nlwiki     4203904    0.03486
svwiki     3285202    0.02724
ruwiki     3190172    0.02645
frwiki     3010367    0.02496
arwiki     2827600    0.02345
trwiki     2498507    0.02072
fawiki     2221143    0.01842
kowiki     2177463    0.01806
fiwiki     2015266    0.01671
idwiki     1775398    0.01472
cswiki     1640096    0.0136
hewiki     1580373    0.01311
nowiki     1279645    0.01061
huwiki     1205663    0.01
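For reference, the last column appears to be each wiki's query count divided by the total of 120,591,501 (a fraction, not a percentage). The following is a minimal Python sketch of that arithmetic, which also sums the share covered by the four languages proposed in the description. The names `QUERIES`, `share`, and `coverage` are made up for illustration and are not part of any existing tooling.

```python
# Query counts copied from the table above (top wikis only; the listed
# wikis do not add up to the total because smaller wikis are omitted).
TOTAL_QUERIES = 120_591_501

QUERIES = {
    "enwiki": 21_569_075,
    "jawiki": 12_261_711,
    "eswiki": 9_504_855,
    "itwiki": 9_401_592,
    "dewiki": 7_365_322,
    "frwiki": 3_010_367,
    "huwiki": 1_205_663,
}


def share(wiki: str) -> float:
    """Fraction of all more_like queries that went to the given wiki."""
    return QUERIES[wiki] / TOTAL_QUERIES


def coverage(wikis: list[str]) -> float:
    """Combined fraction of all more_like queries for a set of wikis."""
    return sum(QUERIES[w] for w in wikis) / TOTAL_QUERIES


if __name__ == "__main__":
    # The four languages proposed in the task description.
    proposed = ["enwiki", "dewiki", "frwiki", "eswiki"]
    for wiki in proposed:
        print(f"{wiki}: {share(wiki):.5f}")
    print(f"combined: {coverage(proposed):.5f}")
```

Under these numbers, English, German, French, and Spanish together account for roughly a third of the more_like queries, so adding a wiki such as jawiki or itwiki would raise the coverage noticeably.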

Where can I upload the data? The data from all requested wikis won't fit on our labs instances.

@mschwarzer I think you can upload it to the citolytics release page on GitHub. If that does not work, I can share a OneDrive folder with you. But is the result data really that large? Since you do not have access to private data at the moment, you do not need to worry about uploading the dataset to a public location.

@Physikerwelt The uncompressed results (enwiki) are around 50 GB in size (~10 GB compressed). Other languages will be smaller. So they won't fit on GitHub (max. 2 GB per file) or OneDrive. For now, I'll start uploading them to a labs instance.

Labs instances have a secondary drive that is not mounted by default (because it's over-committed; we don't have enough disk space for all instances to use this space). Check with labs on how to configure that. I believe it's a puppet role to enable.