We should define a list of languages that we want to use for the citolytics A/B test.
- English
- German
- French
- Spanish
@EBernhardson Could you please refine the list as needed? It might be interesting to add some languages with fewer articles to compare the performance. After that, @mschwarzer can start generating the data on our university cluster.
It was easiest to grab generic usage of the more_like feature, rather than the very specific usage of more_like for related articles, but it should be a reasonable proxy. The following data covers all more_like queries that have occurred in 2017 so far. This likely tracks fairly closely with each wiki's share of mobile page views.
Total more like queries: 120,591,501 (avg of 82 per second)
| wiki | queries | share of total (fraction) |
| enwiki | 21569075 | 0.17886 |
| jawiki | 12261711 | 0.10168 |
| eswiki | 9504855 | 0.07882 |
| itwiki | 9401592 | 0.07796 |
| dewiki | 7365322 | 0.06108 |
| plwiki | 4578461 | 0.03797 |
| zhwiki | 4557144 | 0.03779 |
| ptwiki | 4404892 | 0.03653 |
| nlwiki | 4203904 | 0.03486 |
| svwiki | 3285202 | 0.02724 |
| ruwiki | 3190172 | 0.02645 |
| frwiki | 3010367 | 0.02496 |
| arwiki | 2827600 | 0.02345 |
| trwiki | 2498507 | 0.02072 |
| fawiki | 2221143 | 0.01842 |
| kowiki | 2177463 | 0.01806 |
| fiwiki | 2015266 | 0.01671 |
| idwiki | 1775398 | 0.01472 |
| cswiki | 1640096 | 0.0136 |
| hewiki | 1580373 | 0.01311 |
| nowiki | 1279645 | 0.01061 |
| huwiki | 1205663 | 0.01 |
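For anyone who wants to check the last column: it appears to be each wiki's query count divided by the total of 120,591,501, rounded to five decimal places (so it is a fraction, not a percentage). A minimal sketch of that arithmetic for the four languages proposed at the top of this task; the names `TOTAL_QUERIES`, `counts` and `share` are made up for illustration and are not part of any existing tooling.

```python
# Total more_like queries reported above.
TOTAL_QUERIES = 120_591_501

# Query counts copied from the table for the four initially proposed wikis.
counts = {
    "enwiki": 21_569_075,
    "dewiki": 7_365_322,
    "frwiki": 3_010_367,
    "eswiki": 9_504_855,
}


def share(queries, total=TOTAL_QUERIES):
    """Fraction of all more_like queries, rounded like the table column."""
    return round(queries / total, 5)


# Per-wiki share, plus the combined share covered by the four wikis.
shares = {wiki: share(n) for wiki, n in counts.items()}
combined = sum(counts.values()) / TOTAL_QUERIES

for wiki, value in shares.items():
    print(f"{wiki}: {value}")
print(f"combined: {combined:.3f}")
```

By this measure the four proposed wikis together cover roughly a third of all more_like traffic, which may help when deciding which additional languages to include.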
Where can I upload the data? The data from all requested wikis won't fit on our labs instances.
@mschwarzer I think you can upload it to the citolytics release page on GitHub. If that does not work, I can share a OneDrive folder with you. But is the result data really that large? Since you do not have access to private data at the moment, you do not need to worry about uploading the dataset to a public location.
@Physikerwelt The uncompressed results (enwiki) are around 50 GB in size (~10 GB compressed). Other languages will be smaller. So they won't fit on GitHub (max. 2 GB per file) or OneDrive. For now, I'll start uploading them to a labs instance.
Labs instances have a secondary drive that is not mounted by default (because it's over-committed; we don't have enough disk space for all instances to use this space). Check with labs on how to configure that. I believe there is a puppet role to enable it.