
S5 replication issue, affecting watchlist and probably recentchanges
Closed, ResolvedPublic

Description

Starting at 11:42 UTC, four s5 hosts have complained about replication lag in our monitoring systems.

Event Timeline

akosiaris triaged this task as Unbreak Now! priority. Sep 25 2020, 12:10 PM

This is being internally tracked as there is some PII, but feel free to use this task for updates from the SRE team

List of affected wikis

apiportalwiki
avkwiki
cebwiki
dewiki
enwikivoyage
jawikivoyage
lldwiki
mgwiktionary
mhwiktionary
muswiki
shwiki
srwiki
thankyouwiki

Mentioned in SAL (#wikimedia-operations) [2020-09-25T12:13:33Z] <kormat@cumin1001> dbctl commit (dc=all): 'Add db2113 to various groups T263842', diff saved to https://phabricator.wikimedia.org/P12797 and previous config saved to /var/cache/conftool/dbconfig/20200925-121332-kormat.json

DBA are testing a recovery action prior to applying it broadly.

A fix was applied and users of affected wikis should be seeing recovery now.

jcrespo lowered the priority of this task from Unbreak Now! to High. Sep 25 2020, 12:29 PM
jcrespo subscribed.

This should be fixed now for end-users; removing Unbreak Now. Please report any strange things you may find (there should be none for wiki users).

Eqiad is still broken and we need a data check to understand why this happened.

This needs research; it is weird that this happened, especially after T260042 was done prior to the switchover.

Mentioned in SAL (#wikimedia-operations) [2020-09-25T15:23:48Z] <jynus> fixing enwikivoyage ipblocks inconsistency cluster-wide T263842

jcrespo lowered the priority of this task from High to Medium. Sep 25 2020, 4:01 PM
jcrespo moved this task from Triage to In progress on the DBA board.

After discussing the proposed fix for the table inconsistency with enwikivoyage admins, an old block that was only applied on certain dbs was deleted and the new one applied everywhere else (a DELETE of both ids and an INSERT of the last one back).
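
A minimal sketch of the shape of that fix (not the actual statements run; the ipb_id values below are placeholders):

-- Keep a copy of the newest block row, remove both conflicting rows, then
-- re-insert the kept row so every host ends up with identical data for this key.
-- 11111 and 22222 are hypothetical ipb_id values.
CREATE TEMPORARY TABLE ipblocks_keep AS
  SELECT * FROM ipblocks WHERE ipb_id = 22222;
DELETE FROM ipblocks WHERE ipb_id IN (11111, 22222);
INSERT INTO ipblocks SELECT * FROM ipblocks_keep;
DROP TEMPORARY TABLE ipblocks_keep;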

Table is now confirmed in a consistent state with db-compare:

root@cumin2001:~$ db-compare enwikivoyage ipblocks ipb_id db2123 db2089:3315 db2099:3315 db2111 db2113 db2128 db2137:3315 db2139:3315 --step 100
Starting comparison between id 1 and 16327
2020-09-25T15:43:55.667044: row id 9901/16327, ETA: 00m06s, 0 chunk(s) found different
Execution ended, no differences found.

This is now no longer an ongoing, active issue; all that is left is researching the original trigger of the inconsistency (or directly focusing on preventing new ones with T207253).

After discussing the proposed fix for the table inconsistency with enwikivoyage admins

Was this public anywhere for the sake of transparency? Could a log / page be linked to?

Was this public anywhere for the sake of transparency? Could a log / page be linked to?

Yes, it was on their Village pump. https://en.wikivoyage.org/wiki/Wikivoyage:Travellers%27_pub#IP_block

Effectively no block was applied or removed by me; only metadata was made consistent by "merging" 2 other partially applied blocks. Logs were kept intact. No real data or logs were touched. Also, it was a block on an IP doing vandalism, not an account block.

List of affected wikis

apiportalwiki
avkwiki
cebwiki
dewiki
enwikivoyage
jawikivoyage
lldwiki
mgwiktionary
mhwiktionary
muswiki
shwiki
srwiki
thankyouwiki

This is extremely strange, and I think it could be an app issue. Why do I think that? Because the following wikis were created just a few weeks ago:
thankyouwiki T260551 17th Aug
apiportalwiki T246946#6422639 27th Aug
avkwiki T258077#6344407 29th Jul
lldwiki T259436#6388996 17th Aug
jawikivoyage T260482#6417318 27th Aug

I don't think it is possible to have such drifts on such new wikis if they're not caused by MW (or whatever); we haven't had crashes on s5 recently that could explain some hosts being inconsistent, especially on such new wikis (less than a month old).

List of affected wikis [...]

This is extremely strange, and I think it could be an app issue. [...]

Nevermind this comment. Alex was listing all the wikis in the affected section and not all the wikis that had duplicate entries. Sorry for the confusion.
I should stop reading tickets while on holidays :-)

Mentioned in SAL (#wikimedia-operations) [2020-09-28T08:21:15Z] <kormat@cumin1001> dbctl commit (dc=all): 'Remove db2113 from contributions/logpager/recentchanges*/watchlist T263842', diff saved to https://phabricator.wikimedia.org/P12812 and previous config saved to /var/cache/conftool/dbconfig/20200928-082114-kormat.json

I have checked the previous week's backups (22nd Sept) to see if there was anything existing for any of the involved data, at least on the PK (ipb_id) or the UNIQUE index (ipb_address_unique (ipb_address(255), ipb_user, ipb_auto)).
There is nothing there, but of course the eqiad and codfw hosts the backups were taken from might not have been affected by this, which would explain why they had the same data.
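
For reference, this is roughly the kind of lookup described, assuming a restored copy of the 22nd Sept ipblocks table is available to query; the ipb_id and address values are placeholders:

-- Hypothetical check: does the backup already contain a row for either the
-- conflicting primary key or the ipb_address_unique tuple?
SELECT ipb_id, ipb_address, ipb_user, ipb_auto, ipb_timestamp
FROM ipblocks
WHERE ipb_id = 11111
   OR (ipb_address = '62.x.x.x' AND ipb_user = 0 AND ipb_auto = 0);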

Going to keep digging into binlogs to see if there's anything relevant in there

The sequence of events within the transaction that failed is interesting, and it definitely didn't help that we are using INSERT IGNORE here. This is a summary from the binlog of the transaction for the broken entry, with the affected ipb_address_unique UNIQUE key values:

ipb_address = 'REDACTED', ipb_user = 0, ipb_auto = 0

This contains the first insert, which went through (it may have failed, but as it has the IGNORE...), with the timestamp 20200925113933:

BEGIN
INSERT /* MediaWiki\Block\DatabaseBlockStore::insertBlock  */ IGNORE INTO `ipblocks`

Then an UPDATE on that same row, which broke on some slaves with a duplicate entry error; it has this timestamp: 20200925110538.

UPDATE /* MediaWiki\Block\DatabaseBlockStore::updateBlock  */  `ipblocks` SET ipb_address = 'REDACTED',ipb_user = 0,ipb_timestamp = '20200925110538',ipb_auto = 0

This is the row that already existed on some hosts:

root@db2089.codfw.wmnet[enwikivoyage]> SELECT * FROM ipblocks where ipb_address = 'xxxx'\G
*************************** 1. row ***************************
       ipb_id: xxxx
  ipb_address: xxx
     ipb_user: 0
 ipb_by_actor: xxxx
ipb_reason_id: xxxx
ipb_timestamp: 20200925110538

The values for ipb_address_unique on the insert were:
62.XXX.2X,0,0

And the values for that same key in the UPDATE were exactly the same, with just the timestamp changing from 20200925113933 to 20200925110538.
What doesn't make sense to me is that there was already an existing row with the 20200925110538 timestamp (even though the timestamp isn't part of the UNIQUE key).
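
To illustrate why the INSERT IGNORE matters here, a simplified, hypothetical sketch (a cut-down stand-in for ipblocks, with placeholder values): on a host that already holds a drifted row for the same unique key, the IGNORE silently turns the insert into a no-op instead of surfacing the conflict, so that host and the master end up holding different rows for the same key, which is exactly the kind of divergence that later surfaces as a duplicate entry error and broken replication.

-- Cut-down, hypothetical stand-in for ipblocks (only the columns relevant to the unique key).
CREATE TABLE ipblocks_demo (
  ipb_id INT AUTO_INCREMENT PRIMARY KEY,
  ipb_address VARBINARY(255) NOT NULL,
  ipb_user INT UNSIGNED NOT NULL DEFAULT 0,
  ipb_auto TINYINT NOT NULL DEFAULT 0,
  ipb_timestamp BINARY(14) NOT NULL,
  UNIQUE KEY ipb_address_unique (ipb_address, ipb_user, ipb_auto)
);

-- Pretend this host has a drifted row that the master does not have:
INSERT INTO ipblocks_demo (ipb_address, ipb_user, ipb_auto, ipb_timestamp)
  VALUES ('62.x.x.x', 0, 0, '20200925110538');

-- The kind of statement the master ran: on the master it inserts a new row,
-- but on this host it is silently ignored because of ipb_address_unique.
INSERT IGNORE INTO ipblocks_demo (ipb_address, ipb_user, ipb_auto, ipb_timestamp)
  VALUES ('62.x.x.x', 0, 0, '20200925113933');

-- From here on the master and this host disagree on which row (ipb_id,
-- ipb_timestamp) backs the key ('62.x.x.x', 0, 0), and nothing has errored yet.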

I have looked for that same IP in the existing binlogs on an affected slave (binlogs from today back to 20th Sept) and there's no trace of that IP being added there.
I did the same on the eqiad master, which has binlogs from 29th Aug till today: no trace of that IP being added to ipblocks before this sequence of events. The timestamps of the first entry on the eqiad and codfw masters are the same.
I doubt this row existed in eqiad (or on the other affected codfw hosts) before this event, especially with that ipb_timestamp.

These are the logs of the blocks (2 inserts and 1 update?); the timestamps would be close to (but not necessarily exactly the same as) the logged ones: https://en.wikivoyage.org/w/index.php?title=Special:Log&page=User%3A62.60.63.20&type=block

This and the ids of the rows would suggest that the issue is very recent and app-layer related (but, due to data relations, we cannot discard that it started from a much earlier difference).

Yes, they are those; this is the order of events in the binlog for the ipblocks table for the IP involved in the duplicate entry:

1st insert: 20200925110538
2nd insert: 20200925113933
1st UPDATE: 20200925110538
3rd insert (not involved on the duplicate key error): 20200925114649

The duplicate entry involved xxxx.20-0-0-0 for the ipb_address_unique UNIQUE key, and that combination isn't present in any binlogs from 29th Aug till today.

I just noticed the IR on wikitech says:

duplicate key for ip banning

The ipblocks table actually stores both account and IP blocks, and a ban is not the same thing as a block in the sense most wikis use those terms.

It would probably be more accurate and clearer to just say:

about a duplicate key in enwikivoyage's ipblocks table

Instead of:

about enwikivoyage duplicate key for ip banning

Not making the change myself to avoid stepping on or confusing anyone in SRE.

It would probably be more accurate and clearer to just say: about a duplicate key in enwikivoyage's ipblocks table [...]

LGTM, https://wikitech.wikimedia.org/w/index.php?title=Incident_documentation/20200925-s5-replication-lag&diff=1883786&oldid=1883161

Marostegui closed this task as Resolved. Edited Oct 6 2020, 7:34 AM

I am going to close this as resolved, as the incident is over and the table was cleaned up and no longer shows any differences (T263842#6494474).
The cause is still unknown and I don't think we'll be able to know what happened. This is the closest we've gotten (T263842#6498681). As I wrote in the incident report, there are essentially two options (summary from the IR):

1) The drift was there and we inserted a row that touched it

2) The drift was generated somehow by MW.

The first option is hard to see, as that table was checked recently before the switchover and came up clean (https://phabricator.wikimedia.org/T260042). There is also the fact that the row that failed had a timestamp from around the time of the crash. The sequence of events within the transaction that failed is interesting, and it definitely didn't help that we are using INSERT IGNORE here.

Option 2 for this crash would imply that MW somehow introduced that inconsistency with an unsafe statement like INSERT IGNORE; however, it is difficult to know why only a few hosts failed.

More on the IR, which is now in review.

Let's follow up there. If someone feels this needs to stay open, please re-open.
Thanks to everyone for responding, and thanks specifically to Jaime and Stevie for mitigating and getting this resolved.

As a last comment: I thought at first it was 1, but after some analysis I believe it is more likely that it was 2, given how frequently rows like the ones involved are written; but, as you say, it is not easy to prove. Something due to a race condition involving IPv6 and the constraints of the table???