
S5 replication issue, affecting watchlist and probably recentchanges
Closed, ResolvedPublic

Description

Starting at 11:42 UTC, four s5 hosts have complained about replication lag in our monitoring systems.

Event Timeline

akosiaris triaged this task as Unbreak Now! priority. Sep 25 2020, 12:10 PM

This is being internally tracked as there is some PII, but feel free to use this task for updates from the SRE team

List of affected wikis

apiportalwiki
avkwiki
cebwiki
dewiki
enwikivoyage
jawikivoyage
lldwiki
mgwiktionary
mhwiktionary
muswiki
shwiki
srwiki
thankyouwiki

Mentioned in SAL (#wikimedia-operations) [2020-09-25T12:13:33Z] <kormat@cumin1001> dbctl commit (dc=all): 'Add db2113 to various groups T263842', diff saved to https://phabricator.wikimedia.org/P12797 and previous config saved to /var/cache/conftool/dbconfig/20200925-121332-kormat.json

DBA are testing a recovery action prior to applying it broadly.

A fix was applied and users of affected wikis should be seeing recovery now.

jcrespo lowered the priority of this task from Unbreak Now! to High. Sep 25 2020, 12:29 PM
jcrespo subscribed.

This should be fixed now for end-users; removing Unbreak Now. Please report any strange things you may find (there should be none for wiki users).

Eqiad is still broken and we need a data check to understand why this happened.

This needs research; it is weird that this happened, especially after T260042 was done prior to the switchover.

Mentioned in SAL (#wikimedia-operations) [2020-09-25T15:23:48Z] <jynus> fixing enwikivoyage ipblocks inconsistency cluster-wide T263842

jcrespo lowered the priority of this task from High to Medium. Sep 25 2020, 4:01 PM
jcrespo moved this task from Triage to In progress on the DBA board.

After discussing the proposed fix for the table inconsistency with enwikivoyage admins, an old block that was only applied on certain dbs was deleted and the new one applied everywhere else (a DELETE of both ids and an INSERT of the last one back).
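
A minimal sketch of the shape of that fix (not the actual statements run; the ipb_id values below are placeholders):

-- Keep a copy of the newest block row, remove both conflicting rows, then
-- re-insert the kept row so every host ends up with identical data for this key.
-- 11111 and 22222 are hypothetical ipb_id values.
CREATE TEMPORARY TABLE ipblocks_keep AS
  SELECT * FROM ipblocks WHERE ipb_id = 22222;
DELETE FROM ipblocks WHERE ipb_id IN (11111, 22222);
INSERT INTO ipblocks SELECT * FROM ipblocks_keep;
DROP TEMPORARY TABLE ipblocks_keep;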

Table is now confirmed in a consistent state with db-compare:

root@cumin2001:~$ db-compare enwikivoyage ipblocks ipb_id db2123 db2089:3315 db2099:3315 db2111 db2113 db2128 db2137:3315 db2139:3315 --step 100
Starting comparison between id 1 and 16327
2020-09-25T15:43:55.667044: row id 9901/16327, ETA: 00m06s, 0 chunk(s) found different
Execution ended, no differences found.

This is now no longer an ongoing, active issue; all that is left is researching the original trigger of the inconsistency (or directly focusing on preventing new ones with T207253).

After discussing the proposed fix for the table inconsistency with enwikivoyage admins

Was this public anywhere for the sake of transparency? Could a log / page be linked to?

Was this public anywhere for the sake of transparency? Could a log / page be linked to?

Yes, it was on their Village pump. https://en.wikivoyage.org/wiki/Wikivoyage:Travellers%27_pub#IP_block

Effectively no block was applied or removed by me; only metadata was made consistent by "merging" 2 other partially applied blocks. Logs were kept intact. No real data or logs were touched. Also, it was a block on an IP doing vandalism, not an account block.

List of affected wikis

apiportalwiki
avkwiki
cebwiki
dewiki
enwikivoyage
jawikivoyage
lldwiki
mgwiktionary
mhwiktionary
muswiki
shwiki
srwiki
thankyouwiki

This is extremely strange, and I think it could be an app issue. Why do I think that? Because the following wikis were created just a few weeks ago:
thankyouwiki T260551 17th Aug
apiportalwiki T246946#6422639 27th Aug
avkwiki T258077#6344407 29th Jul
lldwiki T259436#6388996 17th Aug
jawikivoyage T260482#6417318 27th Aug

I don't think it is possible to have such drifts on such new wikis if they're not caused by MW (or whatever); we haven't had crashes on s5 recently that could explain some hosts being inconsistent, especially on such new wikis (less than a month old).

List of affected wikis [...]

This is extremely strange, and I think it could be an app issue. [...]

Nevermind this comment. Alex was listing all the wikis in the affected section and not all the wikis that had duplicate entries. Sorry for the confusion.
I should stop reading tickets while on holidays :-)

Mentioned in SAL (#wikimedia-operations) [2020-09-28T08:21:15Z] <kormat@cumin1001> dbctl commit (dc=all): 'Remove db2113 from contributions/logpager/recentchanges*/watchlist T263842', diff saved to https://phabricator.wikimedia.org/P12812 and previous config saved to /var/cache/conftool/dbconfig/20200928-082114-kormat.json

I have checked the previous week's backups (22nd Sept) to see if there was anything existing for any of the involved data, at least on the PK (ipb_id) or the UNIQUE index (ipb_address_unique (ipb_address(255), ipb_user, ipb_auto)).
There is nothing there, but of course the eqiad and codfw hosts the backups were taken from might not have been affected by this, which would explain why they had the same data.
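
For reference, this is roughly the kind of lookup described, assuming a restored copy of the 22nd Sept ipblocks table is available to query; the ipb_id and address values are placeholders:

-- Hypothetical check: does the backup already contain a row for either the
-- conflicting primary key or the ipb_address_unique tuple?
SELECT ipb_id, ipb_address, ipb_user, ipb_auto, ipb_timestamp
FROM ipblocks
WHERE ipb_id = 11111
   OR (ipb_address = '62.x.x.x' AND ipb_user = 0 AND ipb_auto = 0);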

Going to keep digging into binlogs to see if there's anything relevant in there

The sequence of events within the transaction that failed is interesting, and it definitely didn't help that we are using INSERT IGNORE here. This is a summary from the binlog of the transaction for the broken entry, with the affected ipb_address_unique UNIQUE key values:

ipb_address = 'REDACTED', ipb_user = 0, ipb_auto = 0

This contains the first insert, which went through (it may have failed, but as it has the IGNORE...), with the timestamp 20200925113933:

BEGIN
INSERT /* MediaWiki\Block\DatabaseBlockStore::insertBlock  */ IGNORE INTO `ipblocks`

Then an UPDATE on that same row, which broke on some slaves with a duplicate entry error; it has this timestamp: 20200925110538.

UPDATE /* MediaWiki\Block\DatabaseBlockStore::updateBlock  */  `ipblocks` SET ipb_address = 'REDACTED',ipb_user = 0,ipb_timestamp = '20200925110538',ipb_auto = 0

This is the row that already existed on some hosts:

root@db2089.codfw.wmnet[enwikivoyage]> SELECT * FROM ipblocks where ipb_address = 'xxxx'\G
*************************** 1. row ***************************
       ipb_id: xxxx
  ipb_address: xxx
     ipb_user: 0
 ipb_by_actor: xxxx
ipb_reason_id: xxxx
ipb_timestamp: 20200925110538

The values for ipb_address_unique on the insert were:
62.XXX.2X,0,0

And the values for that same key in the UPDATE were exactly the same, with just the timestamp changing from 20200925113933 to 20200925110538.
What doesn't make sense to me is that there was already an existing row with the 20200925110538 timestamp (even though the timestamp isn't part of the UNIQUE key).
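
To illustrate why the INSERT IGNORE matters here, a simplified, hypothetical sketch (a cut-down stand-in for ipblocks, with placeholder values): on a host that already holds a drifted row for the same unique key, the IGNORE silently turns the insert into a no-op instead of surfacing the conflict, so that host and the master end up holding different rows for the same key, which is exactly the kind of divergence that later surfaces as a duplicate entry error and broken replication.

-- Cut-down, hypothetical stand-in for ipblocks (only the columns relevant to the unique key).
CREATE TABLE ipblocks_demo (
  ipb_id INT AUTO_INCREMENT PRIMARY KEY,
  ipb_address VARBINARY(255) NOT NULL,
  ipb_user INT UNSIGNED NOT NULL DEFAULT 0,
  ipb_auto TINYINT NOT NULL DEFAULT 0,
  ipb_timestamp BINARY(14) NOT NULL,
  UNIQUE KEY ipb_address_unique (ipb_address, ipb_user, ipb_auto)
);

-- Pretend this host has a drifted row that the master does not have:
INSERT INTO ipblocks_demo (ipb_address, ipb_user, ipb_auto, ipb_timestamp)
  VALUES ('62.x.x.x', 0, 0, '20200925110538');

-- The kind of statement the master ran: on the master it inserts a new row,
-- but on this host it is silently ignored because of ipb_address_unique.
INSERT IGNORE INTO ipblocks_demo (ipb_address, ipb_user, ipb_auto, ipb_timestamp)
  VALUES ('62.x.x.x', 0, 0, '20200925113933');

-- From here on the master and this host disagree on which row (ipb_id,
-- ipb_timestamp) backs the key ('62.x.x.x', 0, 0), and nothing has errored yet.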

I have looked for that same IP in the existing binlogs on an affected slave (binlogs from today back to 20th Sept) and there's no trace of that IP being added there.
I did the same on the eqiad master, which has binlogs from 29th Aug till today: no trace of that IP being added to ipblocks before this sequence of events. The timestamps of the first entry on the eqiad and codfw masters are the same.
I doubt this row existed in eqiad (or on the other affected codfw hosts) before this event, especially with that ipb_timestamp.

These are the logs of the blocks (2 inserts and 1 update?); the timestamps would be close to (but not necessarily exactly the same as) the logged ones: https://en.wikivoyage.org/w/index.php?title=Special:Log&page=User%3A62.60.63.20&type=block

This and the ids of the rows would suggest that the issue is very recent and app-layer related (but, due to data relations, we cannot discard that it started from a much earlier difference).

Yes, they are those; this is the order of events in the binlog for the ipblocks table for the IP involved in the duplicate entry:

1st insert: 20200925110538
2nd insert: 20200925113933
1st UPDATE: 20200925110538
3rd insert (not involved on the duplicate key error): 20200925114649

The duplicate entry involved xxxx.20-0-0-0 for the ipb_address_unique UNIQUE key, and that combination isn't present in any binlogs from 29th Aug till today.

I just noticed the IR on wikitech says:

duplicate key for ip banning

The ipblocks table actually stores both account and IP blocks, and a ban is not the same thing as a block in the sense most wikis use those terms.

It would probably be more accurate and clearer to just say:

about a duplicate key in enwikivoyage's ipblocks table

Instead of:

about enwikivoyage duplicate key for ip banning

Not making the change myself to avoid stepping on or confusing anyone in SRE.

It would probably be more accurate and clearer to just say: about a duplicate key in enwikivoyage's ipblocks table [...]

LGTM, https://wikitech.wikimedia.org/w/index.php?title=Incident_documentation/20200925-s5-replication-lag&diff=1883786&oldid=1883161

Marostegui closed this task as Resolved. Edited Oct 6 2020, 7:34 AM

I am going to close this as resolved, as the incident is over and the table was cleaned up and no longer shows any differences (T263842#6494474).
The cause is still unknown and I don't think we'll be able to know what happened. This is the closest we've gotten (T263842#6498681). As I wrote in the incident report, there are essentially two options (summary from the IR):

1) The drift was there and we inserted a row that touched it

2) The drift was generated somehow by MW.

The first option is hard to see, as that table was checked recently before the switchover and came up clean (https://phabricator.wikimedia.org/T260042). There is also the fact that the row that failed had a timestamp from around the time of the crash. The sequence of events within the transaction that failed is interesting, and it definitely didn't help that we are using INSERT IGNORE here.

Option 2 for this crash would imply that MW somehow introduced that inconsistency with an unsafe statement like INSERT IGNORE; however, it is difficult to know why only a few hosts failed.

More on the IR, which is now in review.

Let's follow up there. If someone feels this needs to stay open, please re-open.
Thanks to everyone for responding, and thanks specifically to Jaime and Stevie for mitigating and getting this resolved.

As a last comment: I thought at first it was 1, but after some analysis I believe it is more likely that it was 2, given how frequently rows like the ones involved are written; but, as you say, it is not easy to prove. Something due to a race condition involving IPv6 and the constraints of the table???