
Run maintenance/FlowPopulateRefId.php on all production wikis
Closed, Resolved · Public

Event Timeline

If possible, and if I am online, ping me when this happens so I can check the lag.

@mattflaschen: Is this still blocked, or can it be done now?

T117785: Deploy artificial primary key reference change to all production wikis simultaneously is done (it's in product review), so this isn't blocked.

It might be nice to either wait for https://gerrit.wikimedia.org/r/#/c/251558/ to roll out, or cherry-pick it. It's not actually required, though; it's just to avoid spamming the logs (see T117785#1788660).

Never mind, that actually has rolled out.

catrope@terbium:~$ mwscript extensions/Flow/maintenance/FlowPopulateRefId.php testwiki
Ensured ref_id for 300 Flow\Model\WikiReference references...
Ensured ref_id for 300 Flow\Model\WikiReference references...
Ensured ref_id for 300 Flow\Model\WikiReference references...
Catchable fatal error: Argument 8 passed to Flow\Model\WikiReference::__construct() must be an instance of Title, null given in /srv/mediawiki/php-1.27.0-wmf.7/extensions/Flow/includes/Model/WikiReference.php on line 37

Change 254091 had a related patch set uploaded (by Catrope):
Make counts in FlowPopulateRevId.php output more useful

https://gerrit.wikimedia.org/r/254091

Change 254091 merged by jenkins-bot:
Make counts in FlowPopulateRevId.php output more useful

https://gerrit.wikimedia.org/r/254091

Looks like the errors come from rows with namespace 104, which no longer exists on testwiki. We'll have to go into the DB and remove those rows manually.
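For reference, the rows in question could be located with something along these lines (a sketch; I'm assuming the defunct namespace can show up in either the source or the target namespace column):

select ref_src_wiki, ref_src_namespace, ref_src_title, ref_target_namespace, ref_target_title from flow_wiki_ref where ref_src_namespace = 104 or ref_target_namespace = 104;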

Long term, we could write a cleanup script that removes rows with invalid titles.


T119035: Script to remove namespaces. It's not too complicated.

We could fix FlowPopulateRefId to skip these rows (by querying the database more directly), but then there would just be unprocessed rows without the PK in the DB, which is still a problem.


Nope, I was wrong. I was running the script against testwiki, and the "faulty" row with namespace 104 was for mediawikiwiki, which legitimately has a namespace 104. Today, I ran the script against test2wiki, and it choked on a row with namespace 118 and a title with Chinese characters, which, sure enough, was a zhwiki row. So it looks like our script is broken and tries to process rows that belong to other wikis.
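For illustration, the fix boils down to constraining the backfill query to rows owned by the current wiki, i.e. roughly the equivalent of the following (a sketch only; the real change goes through Flow's storage layer):

select * from flow_wiki_ref where ref_id is null and ref_src_wiki = 'testwiki';
-- (and the same for flow_ext_ref)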

Change 258504 had a related patch set uploaded (by Catrope):
FlowPopulateRefId.php: Only process the current wiki

https://gerrit.wikimedia.org/r/258504

Change 258504 merged by jenkins-bot:
FlowPopulateRefId.php: Only process the current wiki

https://gerrit.wikimedia.org/r/258504

Change 258507 had a related patch set uploaded (by Catrope):
FlowPopulateRefId.php: Only process the current wiki

https://gerrit.wikimedia.org/r/258507

Change 258507 merged by jenkins-bot:
FlowPopulateRefId.php: Only process the current wiki

https://gerrit.wikimedia.org/r/258507

I reran the script with my fix, and it worked pretty well. I did run into two types of issues:

  • There was a row with a deleted namespace (2501) on mediawikiwiki, which caused a fatal error; fixed by deleting that row.
  • On a few wikis, there were rows with bad namespace/title combos (e.g. ns=0, title=Template:Foo). These rows should never have existed, and I don't quite understand why they were ever inserted. The script didn't normalize these rows, and instead ended up in an infinite loop at the end, because it kept finding the same row with ref_id=NULL. I updated these rows manually (e.g. to ns=10, title=Foo; see the sketch below this list) and reran the script, which took care of it.
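Roughly what the manual fix-up looked like, using the example values above (a sketch; the real rows varied per wiki, and I'm assuming the bad combo was in the target columns):

update flow_wiki_ref set ref_target_namespace = 10, ref_target_title = 'Foo' where ref_target_namespace = 0 and ref_target_title = 'Template:Foo' and ref_id is null;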

In any case, the population is all done now:

mysql:research@x1-analytics-slave [flowdb]>  select count(*) from flow_wiki_ref where ref_id is null;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.16 sec)

mysql:research@x1-analytics-slave [flowdb]>  select count(*) from flow_ext_ref where ref_id is null;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.06 sec)

Unfortunately, we have some duplicate IDs (UUID collisions strike back :( ); I'll look into those.

mysql:research@x1-analytics-slave [flowdb]> select count(*), count(distinct ref_id) from flow_wiki_ref;
+----------+------------------------+
| count(*) | count(distinct ref_id) |
+----------+------------------------+
|   251338 |                 251243 |
+----------+------------------------+
1 row in set (0.39 sec)

mysql:research@x1-analytics-slave [flowdb]> select count(*), count(distinct ref_id) from flow_ext_ref;
+----------+------------------------+
| count(*) | count(distinct ref_id) |
+----------+------------------------+
|    83466 |                  83466 |
+----------+------------------------+
1 row in set (0.12 sec)
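The colliding IDs themselves can be enumerated with a simple group-by (assuming ref_id holds the binary UUID):

select hex(ref_id), count(*) as c from flow_wiki_ref group by ref_id having c > 1 order by c desc;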

Full list of duplicate UUIDs: P2415. What's weird is that they all follow this pattern:

05464B37----E-31F9E505
 timestamp | random

(The first 11 digits are completely based on the timestamp, the 12th digit is half timestamp, half random, and the remaining digits are random.)
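Since the first 46 bits are the timestamp, the embedded time can be pulled out directly in SQL: take the first 12 hex digits (48 bits) and drop the bottom 2 non-timestamp bits. A sketch, assuming the 46-bit field is milliseconds since the Unix epoch; for 05464B373A6EE431F9E505, one of the colliding IDs, this should come out to December 2015, around when the population script ran:

select from_unixtime((cast(conv(substring('05464B373A6EE431F9E505', 1, 12), 16, 10) as unsigned) >> 2) / 1000);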

The fact that the timestamp portions all start with the same eight digits makes intuitive sense: all these UUIDs were generated in a short timespan. Indeed, the timestamps of the colliding UUIDs are all very close together. There are two groups, one between 05464B373A6EE431F9E505 and 05464B373ABAE631F9E505 which spans 0.019 seconds, and another one between 05464B375156E331F9E505 and 05464B37517EE531F9E505 which spans 0.100 seconds. The gap between these groups is 1.447 seconds. What's more, even though we'd expect about 100+19 = 119 different millisecond timestamps, the actual number of distinct timestamp values in this set is 28.

Also, it makes much less sense that most of the random digits are shared. Upon closer inspection, the random-influenced digits contain even less variation:

Digit 12 (first two bits timestamp, last two bits random): 2 (=0010), 6 (=0110), A (=1010), E (=1110)
Digit 13: always E
Digit 14: 2, 3, 4, 5, 6, 7, 8
Digits 15-22: always 31F9E505

Note that the random-influenced bits of digit 12 are always 10, so the only actual variation in the random-influenced digits is in digit 14, which takes 7 different values. So in the collision set, there are 7 distinct values of the random-influenced bits, 28 distinct values of the timestamp bits, and 90 distinct values total.

This doesn't mean all UUIDs share this pattern (I haven't checked whether the non-colliding ones do), but the fact that the colliding ones are all incredibly similar could mean something. This might be of interest for T117907: Determine cause of historical UIDs colliding as well.

I tried to fix this by setting all of these back to NULL, then rerunning the maintenance script. Fortunately (and this was quite likely due to the small differences in timestamps), all of these rows belonged to the same wiki (frwiki).
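The reset was essentially the following (a sketch; the full list of IDs is in P2415), followed by another run of FlowPopulateRefId.php against frwiki:

update flow_wiki_ref set ref_id = null where ref_src_wiki = 'frwiki' and ref_id in (unhex('05464B373A6EE431F9E505') /* , ... the rest of P2415 */);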

However, after rerunning the script, I now have a very similar-looking set of collisions: P2416. The timestamp range is 29 ms wide, with every possible timestamp value taken (all 30 of them). Once again, there are exactly 90 collisions, with one of them happening 4 times, one 5 times, and all the others 2 times; digit 12 only takes the values 2, 6, A, E and digit 14 only the values 2 through 8. It sure as hell seems like something deterministic is going on here. As far as I can tell, the offending code isn't even in Flow, because we just call UIDGenerator::newTimestampedUID88() in MediaWiki core.

@mattflaschen, @matthiasmullie, @jcrespo: Should we just use an auto-increment field instead of a UUID?

Because all of the colliding rows (185 rows with only 90 distinct IDs) were for the same piece of content (the header of https://fr.wikipedia.org/wiki/Discussion_utilisateur:Nclm ) and seemed outdated, I made a whitespace-only edit. This successfully regenerated the link table rows. There were now only 17 colliding rows, with 6 distinct IDs. I NULLed those out and ran the script, and I got... 17 colliding rows with 6 distinct IDs, also with the same pattern of recurrences (2, 2, 2, 4, 5, 2).

This is insane. I give up.

I strongly suggested auto-increments, and was confused by this implementation of non-standard UUIDs. I know it is not your fault, but you should reconsider using them; moving away from UUIDs would have additional benefits.

It turns out the pattern and reproducibility I saw were because multiple identical rows exist in the database. I feel stupid for not checking this earlier.

mysql:research@x1-analytics-slave [flowdb]> select ref_src_namespace, ref_src_title, ref_target_namespace, ref_target_title, count(*) as c from flow_wiki_ref where ref_src_wiki='frwiki' and ref_src_namespace=3  group by ref_src_namespace, ref_src_title, ref_target_namespace, ref_target_title having c>1;
+-------------------+---------------+----------------------+-----------------------------------------------------------------------+---+
| ref_src_namespace | ref_src_title | ref_target_namespace | ref_target_title                                                      | c |
+-------------------+---------------+----------------------+-----------------------------------------------------------------------+---+
|                 3 | Nclm          |                    6 | Icon_falscher_Titel.svg                                               | 2 |
|                 3 | Nclm          |                    6 | Kolding,_Haderslevvej_from_Stejlbjergvej,_snowy,_through_a_window.jpg | 2 |
|                 3 | Nclm          |                   10 | Début_de_carte                                                        | 2 |
|                 3 | Nclm          |                   10 | Emoji                                                                 | 4 |
|                 3 | Nclm          |                   10 | Minuscule                                                             | 5 |
|                 3 | Nclm          |                   10 | Titre_incorrect                                                       | 2 |
+-------------------+---------------+----------------------+-----------------------------------------------------------------------+---+

Somewhat unsurprisingly, all the identical rows also got identical IDs. What is weird is that there are lots of other duplicate rows (which means that rEFLW873cbbca3e614a7f5d3c39d91cbe1769687f6c8a lies, by the way: the code apparently doesn't prevent duplicate rows from being inserted), and those do have distinct IDs.

I've deleted the duplicate rows, and now ref_id is in fact unique across the entire table. Even after making a whitespace edit to the offending page, it stayed that way.
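Since the duplicates were identical in every column, the cleanup amounted to deleting all but one copy per group; for the 'Emoji' row above that would be something like (a sketch, not the literal statement I ran):

delete from flow_wiki_ref where ref_src_wiki = 'frwiki' and ref_src_namespace = 3 and ref_src_title = 'Nclm' and ref_target_namespace = 10 and ref_target_title = 'Emoji' limit 3; -- 4 identical rows, keep one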

mysql:research@x1-analytics-slave [flowdb]> select count(*), count(distinct ref_id) from flow_wiki_ref;
+----------+------------------------+
| count(*) | count(distinct ref_id) |
+----------+------------------------+
|   252062 |                 252062 |
+----------+------------------------+
1 row in set (0.42 sec)

mysql:research@x1-analytics-slave [flowdb]> select count(*), count(distinct ref_id) from flow_ext_ref;
+----------+------------------------+
| count(*) | count(distinct ref_id) |
+----------+------------------------+
|    84657 |                  84657 |
+----------+------------------------+

@matthiasmullie: Can you comment on why the UNIQUE constraint was removed? The commit message claims that the code won't insert duplicate rows, but that's clearly not true:

mysql:research@x1-analytics-slave [flowdb]> select count(*) as c, count(distinct ref_src_namespace, ref_target_namespace, ref_src_object_id, ref_src_object_type, ref_src_workflow_id, ref_target_namespace, ref_target_title) as d from flow_wiki_ref;
+--------+--------+
| c      | d      |
+--------+--------+
| 252062 | 242671 |
+--------+--------+
1 row in set (1.01 sec)

@matthiasmullie: Can you comment on why the UNIQUE constraint was removed?

It was to avoid false collisions due to common URL prefixes.

The DB index length limit is shorter than the allowable URL length, which caused URLs that are identical for the first X characters to be considered the same (see the example at T92284).
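To illustrate (a made-up table, not Flow's real schema): with a prefix index, MySQL enforces uniqueness on the prefix only, so two long URLs that share their first 255 bytes count as duplicates.

create table ext_ref_demo (ref_target varbinary(2048), unique key (ref_target(255)));
insert into ext_ref_demo values (concat(repeat('x', 255), 'a'));
insert into ext_ref_demo values (concat(repeat('x', 255), 'b')); -- fails: duplicate entry for the prefix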

Back when I wrote that, it looked like duplicate data wouldn't be inserted (or would be very unlikely to be), but apparently almost 10k rows made it in (almost 4% of the entries). I should probably take another look at why/how that could happen.

This is done on officewiki too:

mysql:research@s2-analytics-slave [officewiki]> select count(*), count(distinct ref_id) from flow_wiki_ref;
+----------+------------------------+
| count(*) | count(distinct ref_id) |
+----------+------------------------+
|     1390 |                   1390 |
+----------+------------------------+
1 row in set (0.14 sec)

mysql:research@s2-analytics-slave [officewiki]> select count(*), count(distinct ref_id) from flow_ext_ref;
+----------+------------------------+
| count(*) | count(distinct ref_id) |
+----------+------------------------+
|      535 |                    535 |
+----------+------------------------+
1 row in set (0.04 sec)