I discovered that the list of wikis in the published dataset, which is included as wikis.txt, is missing several wikis. See the diff generated below:
urbanecm@martins-mbp Desktop % echo ls | lftp 'https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/' > lftp-wikis-raw.txt
urbanecm@martins-mbp Desktop % grep -o '[a-z_]*wiki$' lftp-wikis-raw.txt > lftp-wikis.txt
urbanecm@martins-mbp Desktop % wget https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/wikis.txt
[...]
urbanecm@martins-mbp Desktop % sort wikis.txt > wikis-sorted.txt; sort lftp-wikis.txt > lftp-wikis-sorted.txt
urbanecm@martins-mbp Desktop % git diff wikis-sorted.txt lftp-wikis-sorted.txt
diff --git a/wikis-sorted.txt b/lftp-wikis-sorted.txt
index 5a7d54f..31e095b 100644
--- a/wikis-sorted.txt
+++ b/lftp-wikis-sorted.txt
@@ -145,19 +145,23 @@ lijwiki
 liwiki
 lmowiki
 lnwiki
+lowiki
 ltgwiki
 ltwiki
 lvwiki
 maiwiki
 map_bmswiki
 mdfwiki
+mgwiki
 mhrwiki
 minwiki
 miwiki
 mkwiki
+mlwiki
 mnwiki
 mrjwiki
 mrwiki
+mswiki
 mtwiki
 mwlwiki
 myvwiki
@@ -180,6 +184,8 @@ nvwiki
 nywiki
 ocwiki
 olowiki
+omwiki
+orwiki
 oswiki
 pagwiki
 pamwiki
@@ -196,6 +202,7 @@ pntwiki
 pswiki
 ptwiki
 quwiki
+rmwiki
 rmywiki
 rnwiki
 roa_rupwiki
@@ -213,16 +220,19 @@ scwiki
 sdwiki
 sewiki
 sgwiki
+shwiki
 simplewiki
 siwiki
 skwiki
 slwiki
 smwiki
+sowiki
 sqwiki
 srnwiki
 srwiki
 sswiki
 stqwiki
+stwiki
 suwiki
 svwiki
 swwiki
@@ -231,6 +241,7 @@ tawiki
 tcywiki
 tetwiki
 tewiki
+tgwiki
 thwiki
 tkwiki
 tlwiki
@@ -245,6 +256,7 @@ twwiki
 tyvwiki
 tywiki
 udmwiki
+ugwiki
 ukwiki
 uzwiki
 vecwiki
I am unsure whether wikis.txt is needed. Since indexing is enabled on analytics.wikimedia.org, consumers can easily see which directories exist on the server without relying on wikis.txt, which might not be in a consistent state.
This was originally discovered when wikis.txt included hywwiki without a matching folder with data, which broke the linkrecommendation service (as it was trying to load nonexistent data).
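For illustration, the same comparison can be sketched in Python so that it reports mismatches in both directions: directories missing from wikis.txt, and wikis.txt entries without a directory (the hywwiki case). This is a minimal sketch, not part of the existing tooling. The function names `extract_wikis` and `compare` are hypothetical, and it assumes the directory listing has one entry per line with the wiki name at the end of the line, as the grep command above does.

```python
import re

# Same pattern as the grep above: lowercase letters/underscores ending in "wiki".
WIKI_NAME = re.compile(r"[a-z_]*wiki$")


def extract_wikis(listing_text):
    """Return the set of wiki names found at the end of each line."""
    names = set()
    for line in listing_text.splitlines():
        match = WIKI_NAME.search(line.strip())
        if match:
            names.add(match.group(0))
    return names


def compare(listed_wikis, directory_wikis):
    """Return (listed without a directory, directories not listed), both sorted."""
    listed_only = sorted(listed_wikis - directory_wikis)
    directory_only = sorted(directory_wikis - listed_wikis)
    return listed_only, directory_only
```

With the contents of wikis.txt and the raw directory listing loaded as strings, `compare(extract_wikis(wikis_txt), extract_wikis(listing))` returns both lists. A non-empty first list is the case that would break a consumer trying to load data for a wiki that has no directory.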