Page MenuHomePhabricator

Wikidata term store contains rows for deleted items
Closed, ResolvedPublic8 Estimated Story Points

Description

As a Wikidata editor, I want to create and edit items without getting conflict reports for pages that don’t exist.

Problem:
The Wikidata term store (aka new term store, normalized term store) contains some rows for items that have been deleted during the wb_terms migration phase.

Acceptance criteria:

  • The term store does not contain any data for deleted items.

Example:
An editor tried to set the Spanish label of an item to «Nicki Nicole» and the description to «cantante argentina», but got an error that “Item Nicki Nicole (Q67179790) already has label "Nicki Nicole" associated with language code es, using the same description text” (project chat permalink); however, the page Q67179790 was actually deleted on 12 September 2019.

Querying the term store on Toolforge, we can see that the rows still exist which allowed Wikidata to detect the “conflict”, and also to render {{Q|Q67179790}} using the label of the deleted item:

MariaDB [wikidatawiki_p]> SELECT wbit_item_id, wby_name, wbxl_language, wbx_text FROM wbt_item_terms JOIN wbt_term_in_lang ON wbit_term_in_lang_id = wbtl_id JOIN wbt_type ON wbtl_type_id = wby_id JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id JOIN wbt_text ON wbxl_text_id = wbx_id WHERE wbit_item_id = 67179790;
+--------------+-------------+---------------+---------------------------------------------------------------+
| wbit_item_id | wby_name    | wbxl_language | wbx_text                                                      |
+--------------+-------------+---------------+---------------------------------------------------------------+
|     67179790 | label       | en            | Nicki Nicole                                                  |
|     67179790 | label       | es            | Nicki Nicole                                                  |
|     67179790 | label       | ast           | Nicki Nicole                                                  |
|     67179790 | description | en            | Argentine singer                                              |
|     67179790 | description | es            | cantante argentina                                            |
|     67179790 | description | en-gb         | Argentine singer                                              |
|     67179790 | description | sq            | këngëtare argjentinase                                        |
|     67179790 | description | fr            | chanteuse argentine                                           |
|     67179790 | description | gl            | cantante arxentina                                            |
|     67179790 | description | bn            | আর্জেন্টিনীয় গায়িকা                                         |
|     67179790 | description | ar            | مغنية أرجنتينية                                               |
|     67179790 | description | he            | זמרת ארגנטינאית                                               |
|     67179790 | description | ro            | cântăreață argentiniană                                       |
|     67179790 | description | ca            | cantant argentina                                             |
|     67179790 | description | en-ca         | Argentine singer                                              |
+--------------+-------------+---------------+---------------------------------------------------------------+
15 rows in set (0.04 sec)

Event Timeline

I think this is because we were writing to the new term store at some point when the item was edited, but not when the item was deleted. Timeline:

10 September 2019, 11:06:

Mentioned in SAL (#wikimedia-operations) [2019-09-10T11:06:02Z] <urbanecm@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: 6afe963: Set items term store on write both for all of Wikidata (T225055) (duration: 00m 55s)

10 September 2019, 15:24:
XabatuBot edits the item, adding an Asturian label. (Strictly speaking, this isn’t public information, but I see no harm in revealing it here.) Since we are at this time writing to the new term store for all of Wikidata, the terms of the item are duly written to the new term store.

10 September 2019, 15:58:

Mentioned in SAL (#wikimedia-operations) [2019-09-10T15:58:37Z] ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Revert "Set items term store on write both for all of Wikidata" (duration: 01m 02s)

11 September 2019, 11:15:

Mentioned in SAL (#wikimedia-operations) [2019-09-11T11:15:28Z] <ladsgroup@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:535815|Set item terms on write both up to Q10mio (T225055)]] (duration: 01m 03s)

12 September 2019, 09:51:
MisterSynergy deletes the item as not notable. At this time, we are only writing to the new term store for items with an ID up to ten million, so the deletion is not sent to the new term store and the old terms remain.

12 September2019, 12:03:

Mentioned in SAL (#wikimedia-operations) [2019-09-12T12:03:25Z] <ladsgroup@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:536167|Set item terms on write both up to Q20mio (T225055)]] (duration: 01m 31s)

23 September 2019, 11:23:

Mentioned in SAL (#wikimedia-operations) [2019-09-23T11:23:32Z] <awight@deploy1001> Synchronized wmf-config/VariantSettings.php: SWAT: [[gerrit:538577|Set item terms on write both up to Q40Mio (T225055)]] (duration: 00m 55s)

7 October 2019, 13:27:

Mentioned in SAL (#wikimedia-operations) [2019-10-07T13:27:21Z] <ladsgroup@deploy1001> Synchronized wmf-config/InitialiseSettings.php: [[gerrit:541249|Set all of wikidata to write both for item term store (T225055)]] (duration: 00m 54s)

If the item was deleted now, its terms would be removed from the new term store, since we are now writing to the new term store for all item IDs.

In other words, potentially affected is any item that was edited on 10 September between 11:06 and 15:58 (UTC) and then deleted before 7 October 13:27, if I’m not mistaken. (Items with an ID below 40 million are actually only affected if they were deleted before 23 September 11:23, and similar caveats apply to thresholds of 20 and 10 million, if we want to be exact.)

In other words, potentially affected is any item that was edited on 10 September between 11:06 and 15:58 (UTC) and then deleted before 7 October 13:27, if I’m not mistaken.

MariaDB [wikidatawiki_p]> SELECT DISTINCT wbit_item_id, log_title, log_timestamp FROM wbt_item_terms JOIN logging ON wbit_item_id=SUBSTRING(log_title, 2) WHERE log_type='delete' AND log_action='delete' AND log_namespace=0 AND log_title NOT IN (SELECT page_title FROM page WHERE page_namespace=0) ORDER BY log_timestamp ASC;

+--------------+-----------+----------------+
| wbit_item_id | log_title | log_timestamp  |
+--------------+-----------+----------------+
|      1798871 | Q1798871  | 20190908075346 |
|     67206532 | Q67206532 | 20190910183620 |
|     67206284 | Q67206284 | 20190910183629 |
|     67205707 | Q67205707 | 20190910183639 |
|     67205787 | Q67205787 | 20190910183649 |
|     67205329 | Q67205329 | 20190910183659 |
|     16376239 | Q16376239 | 20190910183709 |
|     12865809 | Q12865809 | 20190910183719 |
|     25667640 | Q25667640 | 20190910183731 |
|     25667272 | Q25667272 | 20190910183740 |
|     25664811 | Q25664811 | 20190910183750 |
|      3546228 | Q3546228  | 20190910183800 |
|     65175512 | Q65175512 | 20190910183810 |
|     65051610 | Q65051610 | 20190910183820 |
|     67205948 | Q67205948 | 20190910183830 |
|      9307721 | Q9307721  | 20190910183840 |

 ... (skip plenty of rows here) ...

|     30686328 | Q30686328 | 20191103185910 |
|     25333226 | Q25333226 | 20191103194317 |
|      8877061 | Q8877061  | 20191103194339 |
|     60975079 | Q60975079 | 20191103194433 |
|      8443133 | Q8443133  | 20191103194539 |
|     70013783 | Q70013783 | 20191107163208 |
|     74670014 | Q74670014 | 20191113061939 |
|     74065265 | Q74065265 | 20191115121708 |
|     74672773 | Q74672773 | 20191125085852 |
|     74688761 | Q74688761 | 20191125090312 |
|     74693264 | Q74693264 | 20191125090707 |
|     74670082 | Q74670082 | 20191125093548 |
|     74669422 | Q74669422 | 20191125093827 |
|     74668425 | Q74668425 | 20191125185307 |
|     74674299 | Q74674299 | 20191125234049 |
|     74691179 | Q74691179 | 20191126080348 |
|     74483053 | Q74483053 | 20191126101239 |
|     74675327 | Q74675327 | 20191127111732 |
|     74453398 | Q74453398 | 20191205123011 |
|     74613482 | Q74613482 | 20191206010632 |
|     74692551 | Q74692551 | 20191223084428 |
|     87589401 | Q87589401 | 20200313152209 |
+--------------+-----------+----------------+

9594 rows in set (17 min 42.76 sec)

Except for one entry, the problem starts indeed on the afternoon of 2019-10-09 (item deletion timestamp). However, there are plenty of entries until ~2019-11-03 in the evening (deletion time), and only 17 entries from a later deletion time.

(For the record, I also created a list of potentially affected items at P12791, but figured maybe it shouldn’t be public.)

So comparing the 2 queries here, in T263730#6490963 and also in P12791 it seems that T263730#6490963 contains more items that we potentially want to cleanup.

I ran the query today and still seemingly get 9594 results, (which means the issue isn't occurring now? and was indeed only during the migration)

Whoever tackles this should make sure they are happy with the query they are using to identify the entities that need to be cleaned up for.

This probably requires a maintenance script to ensure we clean up not just wbt_term_in_lang, but also wbt_text_in_lang and wbt_text. The maintenance script shouldn’t need to do much more than call TermStoreCleaner::cleanTermInLangIds(), after checking that the term really isn’t used anymore. (I don’t think the script should find the unused rows, that should be done separately, by whomever runs the maintenance script.)

noarave set the point value for this task to 8.Dec 15 2020, 2:24 PM
Addshore claimed this task.

SAL logs are connected to T270249

I checked in the DB and the list of issues looked empty :)