
The published dataset's list of wikis misses a couple of wikis with existing data
Closed, Resolved · Public

Description

I discovered that the list of wikis in the published dataset, which is included as wikis.txt, is missing several wikis. See the diff generated below:

urbanecm@martins-mbp Desktop % echo ls | lftp 'https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/' > lftp-wikis-raw.txt
urbanecm@martins-mbp Desktop % grep -o '[a-z_]*wiki$' lftp-wikis-raw.txt > lftp-wikis.txt
urbanecm@martins-mbp Desktop % wget https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/wikis.txt
[...]
urbanecm@martins-mbp Desktop % sort wikis.txt > wikis-sorted.txt; sort lftp-wikis.txt > lftp-wikis-sorted.txt
urbanecm@martins-mbp Desktop % git diff wikis-sorted.txt lftp-wikis-sorted.txt
diff --git a/wikis-sorted.txt b/lftp-wikis-sorted.txt
index 5a7d54f..31e095b 100644
--- a/wikis-sorted.txt
+++ b/lftp-wikis-sorted.txt
@@ -145,19 +145,23 @@ lijwiki
 liwiki
 lmowiki
 lnwiki
+lowiki
 ltgwiki
 ltwiki
 lvwiki
 maiwiki
 map_bmswiki
 mdfwiki
+mgwiki
 mhrwiki
 minwiki
 miwiki
 mkwiki
+mlwiki
 mnwiki
 mrjwiki
 mrwiki
+mswiki
 mtwiki
 mwlwiki
 myvwiki
@@ -180,6 +184,8 @@ nvwiki
 nywiki
 ocwiki
 olowiki
+omwiki
+orwiki
 oswiki
 pagwiki
 pamwiki
@@ -196,6 +202,7 @@ pntwiki
 pswiki
 ptwiki
 quwiki
+rmwiki
 rmywiki
 rnwiki
 roa_rupwiki
@@ -213,16 +220,19 @@ scwiki
 sdwiki
 sewiki
 sgwiki
+shwiki
 simplewiki
 siwiki
 skwiki
 slwiki
 smwiki
+sowiki
 sqwiki
 srnwiki
 srwiki
 sswiki
 stqwiki
+stwiki
 suwiki
 svwiki
 swwiki
@@ -231,6 +241,7 @@ tawiki
 tcywiki
 tetwiki
 tewiki
+tgwiki
 thwiki
 tkwiki
 tlwiki
@@ -245,6 +256,7 @@ twwiki
 tyvwiki
 tywiki
 udmwiki
+ugwiki
 ukwiki
 uzwiki
 vecwiki

I am unsure whether wikis.txt is needed at all. Since directory indexing is enabled on analytics.wikimedia.org, consumers can easily see which directories exist on the server without relying on wikis.txt, which might not be in a consistent state.

This was originally discovered when wikis.txt included hywwiki even though no matching folder with data existed, which broke the linkrecommendation service (as it was trying to load non-existent data).

Event Timeline

Urbanecm_WMF renamed this task from The published dataset's list of wikis misses a couple of wikis the data exists to The published dataset's list of wikis misses a couple of wikis with existing data. Jul 3 2023, 8:38 AM
Urbanecm_WMF updated the task description.
Restricted Application added a subscriber: Base. Jul 3 2023, 8:38 AM
Sgs subscribed.

Maybe we have a “bug” in the publishing script, at publish-datasets.sh#L20, that prevents a wiki ID from being written to the file when it is a substring of another wiki ID already in the file, e.g.:

➜  ~ echo gaswiki > wikis.txt
➜  ~ grep -q "aswiki" wikis.txt
➜  ~ echo $?
0

This could explain why every missing wiki ID is a substring of another wiki ID that sorts earlier alphabetically:

sgimeno@stat1008:/srv/published/datasets/one-off/research-mwaddlink$ cat wikis.txt | grep omwiki
gomwiki
omwiki
sgimeno@stat1008:/srv/published/datasets/one-off/research-mwaddlink$ cat wikis.txt | grep orwiki
gorwiki
orwiki
sgimeno@stat1008:/srv/published/datasets/one-off/research-mwaddlink$ cat wikis.txt | grep lowiki
ilowiki
lowiki
olowiki
sgimeno@stat1008:/srv/published/datasets/one-off/research-mwaddlink$ cat wikis.txt | grep mgwiki
bat_smgwiki
mgwiki
sgimeno@stat1008:/srv/published/datasets/one-off/research-mwaddlink$ cat wikis.txt | grep mlwiki
emlwiki
mlwiki
sgimeno@stat1008:/srv/published/datasets/one-off/research-mwaddlink$ cat wikis.txt | grep omwiki
gomwiki
omwiki
sgimeno@stat1008:/srv/published/datasets/one-off/research-mwaddlink$ cat wikis.txt | grep orwiki
gorwiki
orwiki
sgimeno@stat1008:/srv/published/datasets/one-off/research-mwaddlink$ cat wikis.txt | grep rmwiki
nrmwiki
sgimeno@stat1008:/srv/published/datasets/one-off/research-mwaddlink$ cat wikis.txt | grep shwiki
kshwiki
sgimeno@stat1008:/srv/published/datasets/one-off/research-mwaddlink$ cat wikis.txt | grep stwiki
astwiki
sgimeno@stat1008:/srv/published/datasets/one-off/research-mwaddlink$ cat wikis.txt | grep tgwiki
ltgwiki
sgimeno@stat1008:/srv/published/datasets/one-off/research-mwaddlink$ cat wikis.txt | grep ugwiki
bugwiki
sgimeno@stat1008:/srv/published/datasets/one-off/research-mwaddlink$
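
The substring hypothesis can also be checked mechanically. The sketch below assumes a plain-text file with one wiki ID per line (for example the lftp-wikis.txt generated in the description); the helper name substring_ids is hypothetical and not part of the publishing script. It prints every ID that also occurs inside another line of the same file:

```shell
#!/bin/sh
# Hypothetical helper: list IDs that are a substring of another line in the
# same file. Assumes one ID per line and no duplicate lines (a duplicated ID
# would be reported as well).
substring_ids() {
    ids_file="$1"
    while IFS= read -r id; do
        # -F: fixed string, -c: count matching lines. The ID always matches
        # its own line, so a count above 1 means some other line contains it.
        count=$(grep -cF -- "$id" "$ids_file")
        if [ "$count" -gt 1 ]; then
            printf '%s\n' "$id"
        fi
    done < "$ids_file"
}
```

Running it against the directory listing rather than wikis.txt should list the candidates that a substring-based check could skip.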

Running the commands from the description now returns an empty diff, so the sources seem to be in sync for the moment. I also asked @kevinbazira about the removal process, and it is done manually. To prevent other occurrences of T344686, and as an alternative to what's proposed in T344711, we could create a remove-dataset.sh script which would also update the index.
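
A minimal sketch of what such a script could do, assuming the layout seen above (one directory per wiki plus wikis.txt in the same base directory). The function name remove_wiki and its arguments are hypothetical, and this is not the existing removal procedure:

```shell
#!/bin/sh
# Hypothetical sketch: remove one wiki's dataset directory and drop its
# entry from wikis.txt in the same step, so the two cannot drift apart.
remove_wiki() {
    wiki_id="$1"
    base_dir="$2"
    list_file="$base_dir/wikis.txt"

    # ${var:?} aborts if the variable is empty, so an empty argument can
    # never turn this into "rm -rf /".
    rm -rf -- "${base_dir:?}/${wiki_id:?}"

    # Keep every line except the exact match (-x whole line, -F fixed string).
    # grep exits non-zero when nothing is left, hence "|| true".
    tmp=$(mktemp)
    grep -vxF -- "$wiki_id" "$list_file" > "$tmp" || true
    # Write back through the original file to preserve its permissions.
    cat "$tmp" > "$list_file"
    rm -f "$tmp"
}
```

For example, `remove_wiki hywwiki /srv/published/datasets/one-off/research-mwaddlink` would delete that wiki's directory and its line in wikis.txt, leaving entries such as other IDs that merely contain the same characters untouched.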

> Maybe we have a “bug” in the publishing script, at publish-datasets.sh#L20, that prevents a wiki ID from being written to the file when it is a substring of another wiki ID already in the file, e.g.:
>
> ➜  ~ echo gaswiki > wikis.txt
> ➜  ~ grep -q "aswiki" wikis.txt
> ➜  ~ echo $?
> 0

Good catch! Let me upload a quick fix for that.
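
For illustration, an exact-match check could look like the sketch below. It is not the actual patch; the helper name add_wiki and its arguments are hypothetical. The point is that `grep -x` only succeeds when a whole line equals the ID, so gaswiki no longer hides aswiki:

```shell
#!/bin/sh
# Hypothetical helper (not the real publish-datasets.sh code): append a wiki
# ID to the list only when no existing line matches it exactly.
add_wiki() {
    wiki_id="$1"
    list_file="$2"
    # -F: treat the ID as a fixed string, not a regex
    # -x: the match must span the whole line
    # -q: print nothing, only the exit status is used
    if ! grep -qxF -- "$wiki_id" "$list_file" 2>/dev/null; then
        printf '%s\n' "$wiki_id" >> "$list_file"
    fi
}
```

With a list that contains only gaswiki, `add_wiki aswiki wikis.txt` appends aswiki, and calling it a second time leaves the file unchanged.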

> Running the commands from the description now returns an empty diff, so the sources seem to be in sync for the moment. I also asked @kevinbazira about the removal process, and it is done manually. To prevent other occurrences of T344686, and as an alternative to what's proposed in T344711, we could create a remove-dataset.sh script which would also update the index.

I still get a diff when I run those commands:

urbanecm@wmf3345 addlink % echo ls | lftp 'https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/' > lftp-wikis-raw.txt
urbanecm@wmf3345 addlink % grep -o '[a-z_]*wiki$' lftp-wikis-raw.txt > lftp-wikis.txt

urbanecm@wmf3345 addlink % wget https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/wikis.txt
[...]
urbanecm@wmf3345 addlink % sort wikis.txt > wikis-sorted.txt; sort lftp-wikis.txt > lftp-wikis-sorted.txt
urbanecm@wmf3345 addlink % git diff wikis-sorted.txt lftp-wikis-sorted.txt
diff --git a/wikis-sorted.txt b/lftp-wikis-sorted.txt
index 880dbcd..5621700 100644
--- a/wikis-sorted.txt
+++ b/lftp-wikis-sorted.txt
@@ -12,7 +12,6 @@ arwiki
 arywiki
 arzwiki
 astwiki
-aswiki
 atjwiki
 avwiki
 aywiki
@@ -200,6 +199,7 @@ pntwiki
 pswiki
 ptwiki
 quwiki
+rmwiki
 rmywiki
 rnwiki
 roa_rupwiki
@@ -217,16 +217,19 @@ scwiki
 sdwiki
 sewiki
 sgwiki
+shwiki
 simplewiki
 siwiki
 skwiki
 slwiki
 smwiki
+sowiki
 sqwiki
 srnwiki
 srwiki
 sswiki
 stqwiki
+stwiki
 suwiki
 svwiki
 swwiki
@@ -235,6 +238,7 @@ tawiki
 tcywiki
 tetwiki
 tewiki
+tgwiki
 thwiki
 tkwiki
 tlwiki
@@ -249,6 +253,7 @@ twwiki
 tyvwiki
 tywiki
 udmwiki
+ugwiki
 ukwiki
 uzwiki
 vecwiki
urbanecm@wmf3345 addlink %

Maybe git diff is not willing to compare files that are not a part of any git repository on your system? You might want to try plain diff instead, although its output is fairly hard to read (but it should show differences).
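
As a sketch of a comparison that does not depend on git at all, `comm` can print only the IDs that appear in the directory listing but not in wikis.txt. The helper name missing_from_list is hypothetical; both arguments are assumed to be plain-text files with one wiki ID per line:

```shell
#!/bin/sh
# Hypothetical helper: print wiki IDs present in the directory listing
# (second argument) but absent from the published list (first argument).
missing_from_list() {
    list_file="$1"      # e.g. wikis.txt
    listing_file="$2"   # e.g. lftp-wikis.txt
    sorted_list=$(mktemp)
    sorted_listing=$(mktemp)
    # comm needs both inputs sorted in the same collation order.
    LC_ALL=C sort -u "$list_file" > "$sorted_list"
    LC_ALL=C sort -u "$listing_file" > "$sorted_listing"
    # -1 hides lines only in the first file, -3 hides lines common to both,
    # leaving lines that exist only in the second file.
    LC_ALL=C comm -13 "$sorted_list" "$sorted_listing"
    rm -f "$sorted_list" "$sorted_listing"
}
```

If git's output format is preferred, `git diff --no-index wikis-sorted.txt lftp-wikis-sorted.txt` asks git to compare two files on disk directly. Inside a working tree, git may otherwise treat both arguments as paths within the repository, which could explain an empty result.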

Change 951548 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[research/mwaddlink@main] publish-datasets: Require exact match in wikis.txt to skip list update

https://gerrit.wikimedia.org/r/951548

> Maybe git diff is not willing to compare files that are not a part of any git repository on your system? You might want to try plain diff instead, although its output is fairly hard to read (but it should show differences).

Oops, you are totally right, I'm getting the same results. Thanks.

Change 951548 merged by jenkins-bot:

[research/mwaddlink@main] publish-datasets: Require exact match in wikis.txt to skip list update

https://gerrit.wikimedia.org/r/951548

Sgs edited projects, added Growth-Team (Sprint 3 (Growth Team)); removed Growth-Team.

The issue is still present, although for fewer wikis. The updated output shows the following wikis:

> sowiki
> stwiki
> tgwiki
> ttermwiki
> ugwiki

I'm going ahead and adding sowiki, stwiki and tgwiki in the context of T308142, and tgwiki and ugwiki in the context of T308143. I'm not sure ttermwiki is a valid domain. @kevinbazira, could you clarify the purpose of the tterm directory so we can close this issue? Ty!

DMburugu moved this task from Incoming to Doing on the Growth-Team (Sprint 3 (Growth Team)) board.

@Sgs, ttermwiki has been removed from the published datasets, it was created to test the unpublish-datasets script in T344799.

Mentioned in SAL (#wikimedia-analytics) [2023-11-16T13:22:35Z] <sergi0> stat1008: Add sowiki, stwiki, tgwiki and ugwiki to /srv/published/datasets/one-off/research-mwaddlink/wikis.txt (T340944)

Mentioned in SAL (#wikimedia-operations) [2023-11-16T13:34:57Z] <sergi0> stat1008: Add sowiki, stwiki, tgwiki and ugwiki to /srv/published/datasets/one-off/research-mwaddlink/wikis.txt (T340944)

> @Sgs, ttermwiki has been removed from the published datasets, it was created to test the unpublish-datasets script in T344799.

Thanks for the clarification. The task is now resolved, and the published dataset wikis match wikis.txt.