Recheck wikidatawiki.pagelinks table across all hosts in s8
Closed, Resolved (Public)

Assigned To

Authored By

	Marostegui
	Dec 23 2018, 12:10 PM

Description

wikidatawiki.pagelinks table has broken a couple of times already (T209521) due to missing rows on the sanitarium host (db1124). We need to check the table again to make sure it is fully fixed.
In particular, check between the master (db1071), the sanitarium master (db1087) and the sanitarium host for s8 (db1124).

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		• Banyek	T209521 replication broken on db1124:3318 on wikidata.pagelinks
		Resolved		Marostegui	T212574 Recheck wikidatawiki.pagelinks table across all hosts in s8

Event Timeline

Marostegui triaged this task as High priority. Dec 23 2018, 12:10 PM

Marostegui created this task.

Marostegui moved this task from Triage to In progress on the DBA board.

Marostegui claimed this task. Dec 23 2018, 12:31 PM

Marostegui updated the task description. Dec 23 2018, 3:57 PM

I have run the following check for pagelinks:
./compare.py wikidatawiki pagelinks pl_from db1087 db1071 db1092 db1099:3318 db1101:3318 db1104 db1109 db1116:3318
Basically db1087 (sanitarium master) vs all the eqiad slaves and the master (db1071).
That revealed no differences, so db1087 is equal to the master and equal to all the slaves.
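
For readers unfamiliar with this kind of check, here is a minimal sketch of a chunk-based table comparison. It is not the actual compare.py: the function names, the chunk size, the hashing scheme and the in-memory row lists that stand in for database hosts are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def chunk_checksums(rows, key_index=0, chunk_size=1000):
    """Return {chunk_number: checksum} for rows grouped by key range.

    rows is an iterable of tuples whose integer key (pl_from in this task)
    sits at key_index. In a real check the rows would come from a query
    against each host; here they are plain Python tuples (assumption).
    """
    chunks = defaultdict(list)
    for row in rows:
        chunks[row[key_index] // chunk_size].append(row)

    checksums = {}
    for chunk_no, chunk_rows in chunks.items():
        digest = hashlib.sha256()
        # Sort first so that row order on a host does not change the checksum.
        for row in sorted(chunk_rows):
            digest.update(repr(row).encode("utf-8"))
        checksums[chunk_no] = digest.hexdigest()
    return checksums


def different_chunks(reference_rows, other_rows, chunk_size=1000):
    """Return the sorted chunk numbers whose checksums do not match."""
    ref = chunk_checksums(reference_rows, chunk_size=chunk_size)
    other = chunk_checksums(other_rows, chunk_size=chunk_size)
    return sorted(c for c in ref.keys() | other.keys() if ref.get(c) != other.get(c))


# Hypothetical sample data: the second host is missing one row.
reference = [(1, 0, "Q1"), (2, 0, "Q2"), (1500, 0, "Q3")]
other = [(1, 0, "Q1"), (2, 0, "Q2")]
print(different_chunks(reference, other))  # -> [1]
```

Comparing per-chunk checksums rather than individual rows keeps the amount of data exchanged small, and a differing chunk narrows the search to a bounded key range.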

I am now going to run db1087 (sanitarium master) vs db1124:3318 (sanitarium) and will report back the differences for pagelinks between those two.

The check between db1087 and db1124:3318 reported multiple differences for pagelinks:

Execution ended, a total of 301 chunk(s) are different.

We could try to fix those manually, but this table isn't massive (136G) and we have already spent quite a lot of time fixing such differences by hand (especially when they happen on weekends or holidays), so I think we should fully re-import it from db1087 into db1124:3318 and let it replicate to labs.
It is strange that we have so many differences. I am pretty sure we fixed all of them when the s8 issue happened, so maybe we have unsafe statements? Fully re-importing the table will allow us to run the check again a few weeks after the re-import; if there are differences again, it is clear that we have unsafe statements.

We could minimize the impact on labs by, for example:

Stop replication on labsdb1009 and labsdb1010
Depool labsdb1011.
Let the re-imported table fully replicate to labsdb1011.
Once labsdb1011 has synced with the master: depool labsdb1010 and labsdb1009 and let labsdb1011 serve both.
Start replication on labsdb1009 and labsdb1010 so they get the new table via replication.
Once labsdb1009 and labsdb1010 are in sync: repool them back to their normal service
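
The ordering of these steps can be sanity-checked with a small bookkeeping sketch. This models only pooled and replicating flags and calls no real depooling or replication tooling. The action names are hypothetical, and the explicit repool of labsdb1011 is an assumption implied by "let labsdb1011 serve both".

```python
def run_plan(steps, hosts):
    """Apply (action, host) steps and ensure some host is always pooled."""
    pooled = {host: True for host in hosts}
    replicating = {host: True for host in hosts}
    for action, host in steps:
        if action == "depool":
            pooled[host] = False
        elif action == "repool":
            pooled[host] = True
        elif action == "stop_replication":
            replicating[host] = False
        elif action == "start_replication":
            replicating[host] = True
        else:
            raise ValueError(f"unknown action: {action}")
        # Labs users should always have at least one host to query.
        if not any(pooled.values()):
            raise RuntimeError(f"no pooled host left after {action} {host}")
    return pooled, replicating


HOSTS = ["labsdb1009", "labsdb1010", "labsdb1011"]
PLAN = [
    ("stop_replication", "labsdb1009"),
    ("stop_replication", "labsdb1010"),
    ("depool", "labsdb1011"),
    # The table is re-imported and replicates to labsdb1011 at this point.
    ("repool", "labsdb1011"),  # assumption: needed before it can serve both
    ("depool", "labsdb1009"),
    ("depool", "labsdb1010"),
    ("start_replication", "labsdb1009"),
    ("start_replication", "labsdb1010"),
    # Wait here until labsdb1009 and labsdb1010 are in sync.
    ("repool", "labsdb1009"),
    ("repool", "labsdb1010"),
]

pooled, replicating = run_plan(PLAN, HOSTS)
```

With this ordering, at least one labs host stays pooled at every step, and every host ends up pooled and replicating again.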

@jcrespo @Banyek any objection to this idea to fully re-import wikidatawiki.pagelinks?

I have proactively manually fixed a few chunks and ran the compare again between db1124:3318 and db1087:

Execution ended, a total of 296 chunk(s) are different.
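
Fixing a chunk by hand amounts to making the rows in that key range on the sanitarium host match those on its master. A minimal sketch of that row-level diff, assuming the rows of one chunk have already been fetched from both hosts as tuples (the function name and sample rows are hypothetical, not the commands actually run for this task):

```python
def plan_chunk_fix(source_rows, replica_rows):
    """Return (rows_to_insert, rows_to_delete) that make the replica match."""
    source = set(source_rows)
    replica = set(replica_rows)
    rows_to_insert = sorted(source - replica)  # present on source only
    rows_to_delete = sorted(replica - source)  # present on replica only
    return rows_to_insert, rows_to_delete


# Hypothetical chunk contents: (pl_from, pl_namespace, pl_title)
source_chunk = [(10, 0, "A"), (11, 0, "B")]
replica_chunk = [(10, 0, "A"), (12, 0, "C")]
to_insert, to_delete = plan_chunk_fix(source_chunk, replica_chunk)
```

In this example the replica needs row `(11, 0, "B")` inserted and row `(12, 0, "C")` deleted; re-running the comparison afterwards confirms whether the chunk now matches.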

I have gone ahead and manually fixed more chunks in advance:

Execution ended, a total of 270 chunk(s) are different.

I have fixed more rows to reduce the chance of another breakage ;-)

Execution ended, a total of 228 chunk(s) are different.

More chunks fixed and this is the result of the compare.py:

Execution ended, a total of 138 chunk(s) are different.

More fixes done:

Execution ended, a total of 97 chunk(s) are different.

I have done another batch of fixes:

Execution ended, a total of 67 chunk(s) are different.

Fixed a few more and we are close to being fully done:

Execution ended, a total of 31 chunk(s) are different.

More fixes done:

Execution ended, a total of 17 chunk(s) are different.

We are almost done!

Execution ended, a total of 6 chunk(s) are different.

Mentioned in SAL (#wikimedia-operations) [2019-01-02T06:15:14Z] <marostegui> Fix last chunks on db1124:338 - T212574

This is all done:

Execution ended, no differences found.

I have also run a full check for:
db1087 vs db1071 (master) db1092 db1099:3318 db1101:3318 db1104 db1109 db1116:3318

./compare.py wikidatawiki pagelinks pl_from db1087 db1071 db1092 db1099:3318 db1101:3318 db1104 db1109 db1116:3318
...
Execution ended, no differences found.

awesome!

Marostegui mentioned this in T213108: db1082 power loss resulted on mysql crash. Jan 7 2019, 10:54 PM
