Page MenuHomePhabricator

[LEX] pp_sortkey is null for wb-claims, wbl-forms and wbl-senses on many Lexemes
Closed, ResolvedPublic

Description

As a Wikidata product manager, I want our analytics to be accurate.
As a Wikidata power user, I want queries using pp_sortkey to return the correct data.

Problem:
Since April 2022, the page props wb-claims, wbl-forms and wbl-senses often have the pp_sortkey set to NULL in the page_props database table. This means that queries using the sort key, rather than the pp_value (to optimize the query by using a covering index), are returning the wrong result.

Example:

mysql:research@dbstore1005.eqiad.wmnet [wikidatawiki]> SELECT COUNT(*) FROM page_props WHERE pp_propname IN ('wbl-forms', 'wbl-senses') AND pp_sortkey IS NULL;
+----------+
| COUNT(*) |
+----------+
|  1150039 |
+----------+
1 row in set (0.687 sec)

mysql:research@dbstore1005.eqiad.wmnet [wikidatawiki]> SELECT COUNT(*) FROM page_props WHERE pp_propname IN ('wbl-forms', 'wbl-senses') AND pp_sortkey IS NOT NULL;
+----------+
| COUNT(*) |
+----------+
|  1234425 |
+----------+
1 row in set (0.729 sec)

mysql:research@dbstore1005.eqiad.wmnet [wikidatawiki]> SELECT COUNT(*) FROM page_props WHERE pp_propname = 'wb-claims' AND pp_sortkey IS NULL;
+----------+
| COUNT(*) |
+----------+
|   946316 |
+----------+
1 row in set (0.812 sec)

mysql:research@dbstore1005.eqiad.wmnet [wikidatawiki]> SELECT COUNT(*) FROM page_props WHERE pp_propname = 'wb-claims' AND pp_sortkey IS NOT NULL;
+-----------+
| COUNT(*)  |
+-----------+
| 106516430 |
+-----------+
1 row in set (1 min 40.897 sec)

This appears to have been caused by Page properties should always be strings (T305158) – PagePropsTable::getPropertySortKeyValue() returns null for strings, even if the strings are numeric. (The corresponding Wikibase change was thankfully never merged.)

Screenshots/mockups:
This is likely responsible for the reported drop of senses and forms in Grafana:

image.png (258×912 px, 27 KB)

image.png (258×912 px, 29 KB)

BDD
WHEN a lexeme is created or edited
AND its page props have been written
THEN the pp_sortkey for pp_propname='wbl-senses' is non-null
AND the pp_sortkey for pp_propname='wbl-forms' is non-null

Acceptance criteria:

  • The pp_sortkey is populated for new edits
  • We repopulate the pp_sortkey for all affected lexemes

Open questions:

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 970749 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/WikibaseLexeme@master] Revert "Page properties should always be strings"

https://gerrit.wikimedia.org/r/970749

ItamarWMDE renamed this task from pp_sortkey is null for wb-claims, wbl-forms and wbl-senses on many Lexemes to [LEX] pp_sortkey is null for wb-claims, wbl-forms and wbl-senses on many Lexemes.Nov 7 2023, 3:52 PM

Change 970749 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@master] Revert "Page properties should always be strings"

https://gerrit.wikimedia.org/r/970749

Unfortunately, it looks like this isn’t going to fix itself. Not only is purging the page not enough to rewrite the page props – even edits don’t necessarily update them. Apparently the pp_sortkey only becomes non-null when the value itself changes, e.g. on L123 after I added another form:

-- before the edit
mysql:research@dbstore1005.eqiad.wmnet [wikidatawiki]> SELECT pp_propname, pp_value, pp_sortkey FROM page_props JOIN page ON pp_page = page_id WHERE page_namespace = 146 AND page_title = 'L123';
+-------------+----------+------------+
| pp_propname | pp_value | pp_sortkey |
+-------------+----------+------------+
| wb-claims   | 19       |       NULL |
| wbl-forms   | 1        |       NULL |
| wbl-senses  | 4        |       NULL |
+-------------+----------+------------+
3 rows in set (0.001 sec)

-- after the edit
mysql:research@dbstore1005.eqiad.wmnet [wikidatawiki]> SELECT pp_propname, pp_value, pp_sortkey FROM page_props JOIN page ON pp_page = page_id WHERE page_namespace = 146 AND page_title = 'L123';
+-------------+----------+------------+
| pp_propname | pp_value | pp_sortkey |
+-------------+----------+------------+
| wb-claims   | 19       |       NULL |
| wbl-forms   | 2        |          2 |
| wbl-senses  | 4        |       NULL |
+-------------+----------+------------+
3 rows in set (0.001 sec)

So I think we need some maintenance script that goes through the ~2.1 million page_props rows WHERE pp_propname IN ('wb-claims', 'wbl-forms', 'wbl-senses') AND pp_sortkey IS NULL (in batches) and sets the sortkey based on the value…

Change 975023 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikidata.org@master] Add maintenance script to fix pp_sortkey

https://gerrit.wikimedia.org/r/975023

Uploaded a first version of the maintenance script, though I now wonder whether it shouldn’t be in WikibaseLexeme instead (and perhaps even wired up so it runs in update.php automatically) – after all, the issue didn’t only affect Wikidata.org.

though I now wonder whether it shouldn’t be in WikibaseLexeme instead (and perhaps even wired up so it runs in update.php automatically) – after all, the issue didn’t only affect Wikidata.org.

Yeah, I’m now pretty sure it should be in WikibaseLexeme. I’ll move it there.

Change 975250 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/WikibaseLexeme@master] Add update.php maintenance script to fix pp_sortkey

https://gerrit.wikimedia.org/r/975250

Change 975023 abandoned by Lucas Werkmeister (WMDE):

[mediawiki/extensions/Wikidata.org@master] Add maintenance script to fix pp_sortkey

Reason:

Superseded by I94ad711c04 in WikibaseLexeme.

https://gerrit.wikimedia.org/r/975023

Change 975250 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@master] Add update.php maintenance script to fix pp_sortkey

https://gerrit.wikimedia.org/r/975250

Change 975593 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/WikibaseLexeme@wmf/1.42.0-wmf.5] Add update.php maintenance script to fix pp_sortkey

https://gerrit.wikimedia.org/r/975593

Looks like the script worked fine on Beta – update.php output (look for wikidatawiki – starts 11:22:09 UTC) and outcome:

wikiadmin@172.16.4.172(wikidatawiki)> SELECT COUNT(*) FROM page_props WHERE pp_propname IN ('wb-claims', 'wbl-forms', 'wbl-senses') AND pp_sortkey IS NULL;
+----------+
| COUNT(*) |
+----------+
|        0 |
+----------+
1 row in set (0.001 sec)

Change 975593 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@wmf/1.42.0-wmf.5] Add update.php maintenance script to fix pp_sortkey

https://gerrit.wikimedia.org/r/975593

Mentioned in SAL (#wikimedia-operations) [2023-11-20T14:26:45Z] <urbanecm@deploy2002> Started scap: Backport for [[gerrit:975589|EditGrowthConfig: Do not provide default for levelling up threshold when disabled (T351603)]], [[gerrit:975593|Add update.php maintenance script to fix pp_sortkey (T350224)]]

Mentioned in SAL (#wikimedia-operations) [2023-11-20T14:28:06Z] <urbanecm@deploy2002> urbanecm and lucaswerkmeister-wmde and cyndywikime: Backport for [[gerrit:975589|EditGrowthConfig: Do not provide default for levelling up threshold when disabled (T351603)]], [[gerrit:975593|Add update.php maintenance script to fix pp_sortkey (T350224)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2023-11-20T14:37:13Z] <urbanecm@deploy2002> Finished scap: Backport for [[gerrit:975589|EditGrowthConfig: Do not provide default for levelling up threshold when disabled (T351603)]], [[gerrit:975593|Add update.php maintenance script to fix pp_sortkey (T350224)]] (duration: 10m 28s)

Mentioned in SAL (#wikimedia-operations) [2023-11-20T15:48:18Z] <Lucas_WMDE> START lucaswerkmeister-wmde@mwmaint2002:~$ mwscript Wikibase.Lexeme.Maintenance.FixPagePropsSortkey testwikidatawiki --batch-size=1000 # T350224

Mentioned in SAL (#wikimedia-operations) [2023-11-20T15:48:39Z] <Lucas_WMDE> DONE Wikibase.Lexeme.Maintenance.FixPagePropsSortkey (T350224) in 1.079s real time :)

Looks like it worked just fine on Test Wikidata:

lucaswerkmeister-wmde@mwmaint2002:~$ time mwscript Wikibase.Lexeme.Maintenance.FixPagePropsSortkey testwikidatawiki --batch-size=1000 # T350224
Wikibase\Lexeme\Maintenance\FixPagePropsSortkey running...
Update page IDs from 406 to 325101...
Update page IDs from 325102 to 326620...
Update page IDs from 326622 to 284068...
Done.

real    0m1.079s
user    0m0.220s
sys     0m0.057s
mysql:research@dbstore1007.eqiad.wmnet [testwikidatawiki]> SELECT COUNT(*) FROM page_props WHERE pp_propname IN ('wb-claims', 'wbl-forms', 'wbl-senses') AND pp_sortkey IS NULL;
+----------+
| COUNT(*) |
+----------+
|        0 |
+----------+
1 row in set (0.001 sec)

(That’s down from 6104 rows before the script ran.) I’ll probably do real Wikidata tomorrow.

Note to self: for running on Wikidata, probably keep an eye on MySQL Replication Lag in Grafana. (And/or Orchestrator? But I’m not sure how useful that one is to me tbh.)

Mentioned in SAL (#wikimedia-operations) [2023-11-21T14:36:19Z] <Lucas_WMDE> START [in tmux] lucaswerkmeister-wmde@mwmaint2002:~$ mwscript Wikibase.Lexeme.Maintenance.FixPagePropsSortkey wikidatawiki --batch-size=1000 # T350224

Mentioned in SAL (#wikimedia-operations) [2023-11-21T14:45:27Z] <Lucas_WMDE> T350224 maintenance script finished (8m46s real time)

mysql:research@dbstore1005.eqiad.wmnet [wikidatawiki]> SELECT pp_propname, pp_value, pp_sortkey FROM page_props JOIN page ON pp_page = page_id WHERE page_namespace = 146 AND page_title = 'L123';
+-------------+----------+------------+
| pp_propname | pp_value | pp_sortkey |
+-------------+----------+------------+
| wb-claims   | 19       |       NULL |
| wbl-forms   | 2        |          2 |
| wbl-senses  | 4        |       NULL |
+-------------+----------+------------+
3 rows in set (0.001 sec)
mysql:research@dbstore1005.eqiad.wmnet [wikidatawiki]> SELECT pp_propname, pp_value, pp_sortkey FROM page_props JOIN page ON pp_page = page_id WHERE page_namespace = 146 AND page_title = 'L123';
+-------------+----------+------------+
| pp_propname | pp_value | pp_sortkey |
+-------------+----------+------------+
| wb-claims   | 19       |         19 |
| wbl-forms   | 2        |          2 |
| wbl-senses  | 4        |          4 |
+-------------+----------+------------+
3 rows in set (0.002 sec)

And also:

mysql:research@dbstore1005.eqiad.wmnet [wikidatawiki]> SELECT COUNT(*) FROM page_props WHERE pp_propname IN ('wb-claims', 'wbl-forms', 'wbl-senses') AND pp_sortkey IS NULL;
+----------+
| COUNT(*) |
+----------+
|        0 |
+----------+
1 row in set (0.001 sec)

Those numbers look good! Now I'm expecting a sharp increase in Grafana for our panels about counting forms and senses.

Looks like the numbers have gone back up:

image.png (787×1 px, 66 KB)

image.png (787×1 px, 64 KB)

If I do a very poor man’s linear regression analysis (read: I hold a ruler up to my screen), it looks like the number of forms is roughly in line with the growth rate during March 2022, while the growth in senses accelerated a bit.

I also added a note about this to the Grafana panel with the number of forms/senses.

On that Grafana panel, I wonder if we do not want to switch the axis for Forms and Lexemes. We have way more Forms than Lexemes. So Lexemes are by default almost invisible as are all the other Entities that are not Forms. Or maybe switch to a log y-axis?

Change 1004146 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] ParserOutput::setPageProperty(): Update documentation

https://gerrit.wikimedia.org/r/1004146

Change 1004148 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Add ParserOutput::setIndexedPageProperty(); deprecate numeric properties

https://gerrit.wikimedia.org/r/1004148

Change 1004190 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/extensions/Kartographer@master] Ensure page property is numeric so it will be indexed

https://gerrit.wikimedia.org/r/1004190

Change 1004193 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/extensions/ProofreadPage@master] Ensure pages are indexed by quality level

https://gerrit.wikimedia.org/r/1004193

Change 1004146 merged by jenkins-bot:

[mediawiki/core@master] ParserOutput::setPageProperty(): Update documentation

https://gerrit.wikimedia.org/r/1004146

Change 1004190 merged by jenkins-bot:

[mediawiki/extensions/Kartographer@master] Ensure page property is numeric so it will be indexed

https://gerrit.wikimedia.org/r/1004190

Change #1004148 merged by jenkins-bot:

[mediawiki/core@master] Add ParserOutput::setIndexedPageProperty(); deprecate numeric properties

https://gerrit.wikimedia.org/r/1004148

Change #1018798 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/services/parsoid@master] [CMC] Add ::setUnindexedPageProperty and ::setIndexedPageProperty

https://gerrit.wikimedia.org/r/1018798

Change #1018970 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@REL1_42] Add ParserOutput::setIndexedPageProperty(); deprecate numeric properties

https://gerrit.wikimedia.org/r/1018970

Change #1018970 abandoned by C. Scott Ananian:

[mediawiki/core@REL1_42] Add ParserOutput::setIndexedPageProperty(); deprecate numeric properties

Reason:

Botched the cherry-pick

https://gerrit.wikimedia.org/r/1018970

Change #1018970 restored by C. Scott Ananian:

[mediawiki/core@REL1_42] Add ParserOutput::setIndexedPageProperty(); deprecate numeric properties

https://gerrit.wikimedia.org/r/1018970

Change #1018970 merged by jenkins-bot:

[mediawiki/core@REL1_42] Add ParserOutput::setIndexedPageProperty(); deprecate numeric properties

https://gerrit.wikimedia.org/r/1018970

Change #1019071 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] ParserOutput::setUnindexedPageProperty(): use empty string as default value

https://gerrit.wikimedia.org/r/1019071

Change #1019179 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@REL1_42] ParserOutput::setUnindexedPageProperty(): use empty string as default value

https://gerrit.wikimedia.org/r/1019179

Change #1019071 merged by jenkins-bot:

[mediawiki/core@master] ParserOutput::setUnindexedPageProperty(): use empty string as default value

https://gerrit.wikimedia.org/r/1019071

Change #1019360 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/services/parsoid@master] [CMC] Add ::setUnsortedPageProperty and ::setSortedPageProperty

https://gerrit.wikimedia.org/r/1019360

Change #1019179 merged by jenkins-bot:

[mediawiki/core@REL1_42] ParserOutput::setUnindexedPageProperty(): use empty string as default value

https://gerrit.wikimedia.org/r/1019179

Change #1018798 abandoned by C. Scott Ananian:

[mediawiki/services/parsoid@master] [CMC] Add ::setUnindexedPageProperty and ::setIndexedPageProperty

Reason:

Abandoned in favor of Ied9948e10e49113dc0082726b16759e0676444f8

https://gerrit.wikimedia.org/r/1018798

Change #1019360 merged by jenkins-bot:

[mediawiki/services/parsoid@master] [CMC] Add ::setUnsortedPageProperty and ::setNumericPageProperty

https://gerrit.wikimedia.org/r/1019360

Change #1020340 had a related patch set uploaded (by Arlolra; author: C. Scott Ananian):

[mediawiki/services/parsoid@REL1_42] [CMC] Add ::setUnsortedPageProperty and ::setNumericPageProperty

https://gerrit.wikimedia.org/r/1020340

Change #1020340 merged by jenkins-bot:

[mediawiki/services/parsoid@REL1_42] [CMC] Add ::setUnsortedPageProperty and ::setNumericPageProperty

https://gerrit.wikimedia.org/r/1020340

Change #1022258 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.20.0-a2

https://gerrit.wikimedia.org/r/1022258

Change #1022258 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.20.0-a2

https://gerrit.wikimedia.org/r/1022258

Hi @Lucas_Werkmeister_WMDE, Arian asked me to verify this: I can see on Grafana that the missing data was repopulated.
Great that this is now fixed! \o/

One thing before I resolve this: The number of Forms seem to have have stagnated again from March 2024. Is this a true effect this time?

I suspect that’s a true effect – the number of forms in the query service (14058347) matches the number in Grafana (14058150 as of midnight last night) extremely well.

Okay great, thank you for checking, Lucas!

Officially resolved! \o/