Recheck wikidatawiki.pagelinks table across all hosts in s8
Closed, Resolved
Public

Description

wikidatawiki.pagelinks table has broken a couple of times already (T209521) due to missing rows on the sanitarium host (db1124). We need to check the table again to make sure it is fully fixed.
In particular, compare the master (db1071), the sanitarium master (db1087) and the sanitarium host for s8 (db1124).

Marostegui triaged this task as High priority.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui updated the task description. Sun, Dec 23, 3:57 PM

I have run the following check for pagelinks:
./compare.py wikidatawiki pagelinks pl_from db1087 db1071 db1092 db1099:3318 db1101:3318 db1104 db1109 db1116:3318
Basically db1087 (sanitarium master) vs all the eqiad slaves and the master (db1071).
That revealed no differences, so db1087 is equal to the master and equal to all the slaves.
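
For readers unfamiliar with this kind of check, the following is a minimal sketch of the general idea behind a chunked table comparison: rows are grouped into ranges of the key column (here `pl_from`), each range is checksummed on every host, and the ranges whose checksums disagree are reported. This is not the actual `compare.py` code; the function names, the simplified row tuples and the chunk size are all assumptions made for illustration, and the rows are plain in-memory lists rather than database results.

```python
import hashlib
from collections import defaultdict


def chunk_checksums(rows, chunk_size):
    """Checksum rows per chunk, where a chunk is a range of pl_from values.

    `rows` is an iterable of tuples whose first element is pl_from
    (hypothetical, simplified stand-in for pagelinks rows).
    """
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[0] // chunk_size].append(row)

    sums = {}
    for chunk_id, chunk_rows in buckets.items():
        digest = hashlib.sha1()
        # Sort so that row order on a host does not affect the checksum.
        for row in sorted(chunk_rows):
            digest.update(repr(row).encode("utf-8"))
        sums[chunk_id] = digest.hexdigest()
    return sums


def different_chunks(reference_rows, other_rows, chunk_size=1000):
    """Return the sorted ids of chunks whose checksums differ between two hosts."""
    ref = chunk_checksums(reference_rows, chunk_size)
    other = chunk_checksums(other_rows, chunk_size)
    return sorted(c for c in ref.keys() | other.keys() if ref.get(c) != other.get(c))


# Illustrative data only: the second host is missing one row in chunk 0.
reference = [(1, 0, "Q1"), (2, 0, "Q2"), (1500, 0, "Q3")]
replica = [(1, 0, "Q1"), (1500, 0, "Q3")]
print(different_chunks(reference, replica))  # [0]
```

Comparing per chunk rather than per table narrows a mismatch down to a small key range, which is what makes fixing individual chunks by hand feasible.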

I am now going to run db1087 (sanitarium master) vs db1124:3318 (sanitarium) and will report back the differences for pagelinks between those two.

The check between db1087 and db1124:3318 reported multiple differences for pagelinks:

Execution ended, a total of 301 chunk(s) are different.

We could try to fix those manually, but given that this table isn't massive (136G) and that we have already spent quite a lot of time manually fixing these (especially when they happen on weekends or holidays), I think we should fully re-import it from db1087 into db1124:3318 and let it replicate to labs.
It is strange that we have so many differences. I am pretty sure we fixed all of them when the s8 issue happened, so maybe we have unsafe statements? Fully reimporting the table will allow us to run the check again a few weeks after the reimport; if there are differences again, it will be clear that we have unsafe statements.

We could minimize the impact on labs by, for example:

  • Stop replication on labsdb1009 and labsdb1010
  • Depool labsdb1011
  • Let the table fully replicate to labsdb1011; once labsdb1011 has synced with the master:
    • Depool labsdb1010 and labsdb1009 and let labsdb1011 serve both
  • Start replication on labsdb1009 and labsdb1010 so they get the new table via replication
  • Once labsdb1009 and labsdb1010 are in sync, repool them back to their normal service

@jcrespo @Banyek any objection to this idea to fully re-import wikidatawiki.pagelinks?

I have proactively manually fixed a few chunks and ran the compare again between db1124:3318 and db1087:

Execution ended, a total of 296 chunk(s) are different.

I have gone ahead and manually fixed more chunks in advance:

Execution ended, a total of 270 chunk(s) are different.

I have fixed more rows to reduce the chance of another breakage ;-)

Execution ended, a total of 228 chunk(s) are different.

More chunks fixed and this is the result of the compare.py:

Execution ended, a total of 138 chunk(s) are different.

More fixes done:

Execution ended, a total of 97 chunk(s) are different.

I have done another batch of fixes:

Execution ended, a total of 67 chunk(s) are different.

Fixed a few more and we are close to being fully done:

Execution ended, a total of 31 chunk(s) are different.

More fixes done:

Execution ended, a total of 17 chunk(s) are different.

We are almost done:

Execution ended, a total of 6 chunk(s) are different.

Mentioned in SAL (#wikimedia-operations) [2019-01-02T06:15:14Z] <marostegui> Fix last chunks on db1124:338 - T212574

Marostegui closed this task as Resolved. Wed, Jan 2, 8:30 AM

This is all done:

Execution ended, no differences found.

I have also run a full check for:
db1087 vs db1071 (master) db1092 db1099:3318 db1101:3318 db1104 db1109 db1116:3318

./compare.py wikidatawiki pagelinks pl_from db1087 db1071 db1092 db1099:3318 db1101:3318 db1104 db1109 db1116:3318
...
Execution ended, no differences found.