
centralauth database on dbstore1002 is out of date, replication stuck?
Closed, Resolved (Public)

Description

mysql:sul@dbstore1002 [centralauth]> select max(gu_id) from globaluser;
+------------+
| max(gu_id) |
+------------+
|   36083321 |
+------------+
1 row in set (0.02 sec)

versus current master:

mysql:wikiadmin@db1033 [centralauth]> select max(gu_id) from globaluser;
+------------+
| max(gu_id) |
+------------+
|   38547313 |
+------------+
1 row in set (0.00 sec)
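To put a number on the gap between the two query results above, here is a minimal Python sketch that pulls the integer out of each mysql client table and subtracts. The helper names `parse_scalar` and `id_gap` are made up for illustration. It assumes gu_id is assigned in increasing order, so the difference is only an upper bound on the number of missing rows, not an exact count.

```python
import re


def parse_scalar(mysql_output: str) -> int:
    """Return the integer from a one-row, one-column mysql client table."""
    for line in mysql_output.splitlines():
        # Data rows look like "|   36083321 |"; header and border rows do not match.
        match = re.fullmatch(r"\|\s*(\d+)\s*\|", line.strip())
        if match:
            return int(match.group(1))
    raise ValueError("no integer row found in output")


def id_gap(replica_output: str, master_output: str) -> int:
    """Difference between the master's and the replica's reported maximum id."""
    return parse_scalar(master_output) - parse_scalar(replica_output)


# Outputs copied from the two queries in the description.
REPLICA_OUTPUT = """\
+------------+
| max(gu_id) |
+------------+
|   36083321 |
+------------+
"""

MASTER_OUTPUT = """\
+------------+
| max(gu_id) |
+------------+
|   38547313 |
+------------+
"""

if __name__ == "__main__":
    print(id_gap(REPLICA_OUTPUT, MASTER_OUTPUT))  # 2463992
```

For the values in the description, dbstore1002 is roughly 2.46 million gu_id values behind db1033.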

I had been using that server to generate stats for SUL finalization, since it can do joins between centralauth and the local wikis. I noticed it was giving very odd results and started investigating further.

Event Timeline

Legoktm raised the priority of this task from to Needs Triage.
Legoktm updated the task description.
Legoktm added subscribers: Legoktm, Springle.

I see dbstore1002 has the wrong replication rules for s7, so the wikis are up to date but centralauth is not (likely 66+ days out of date based on uptime, i.e. since the config was last reloaded).

Related to https://gerrit.wikimedia.org/r/#/c/198292/ which has been fixed only on dbstore1001 so far. Will start a resync on dbstore1002 s7.
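As a rough illustration of how a missing rule can leave one database stale while the others on the same shard keep replicating, here is a Python sketch of wildcard table-rule matching in the style of `replicate-wild-do-table`. The rule lists are hypothetical, since the real dbstore1002 configuration is not shown in this task. The matching is simplified: `%` and `_` are treated as LIKE wildcards, and escape characters, ignore rules and database-level rules are not handled.

```python
import re


def _like_to_regex(pattern: str) -> str:
    """Translate a LIKE-style pattern (% and _) into a regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # % matches any run of characters
        elif ch == "_":
            parts.append(".")    # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def rule_matches(rule: str, db: str, table: str) -> bool:
    """True if a 'db_pattern.table_pattern' rule covers db.table."""
    db_pat, _, tbl_pat = rule.partition(".")
    return (
        re.fullmatch(_like_to_regex(db_pat), db) is not None
        and re.fullmatch(_like_to_regex(tbl_pat), table) is not None
    )


def is_replicated(rules: list[str], db: str, table: str) -> bool:
    """Simplified model: with do-table rules set, only matching tables are applied."""
    return any(rule_matches(rule, db, table) for rule in rules)


# Hypothetical rule sets, for illustration only.
RULES_MISSING = ["arwiki.%", "metawiki.%"]
RULES_FIXED = RULES_MISSING + ["centralauth.%"]

if __name__ == "__main__":
    print(is_replicated(RULES_MISSING, "arwiki", "user"))             # True
    print(is_replicated(RULES_MISSING, "centralauth", "globaluser"))  # False
    print(is_replicated(RULES_FIXED, "centralauth", "globaluser"))    # True
```

Under this model, a rule set that lists the s7 wikis but omits centralauth silently drops centralauth updates, which fits the symptom described above.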

Resync is done and replication rules corrected.

Legoktm claimed this task.

Thanks, looks good to me!

Legoktm set Security to None.

Happening again?

mysql:sul@dbstore1002 [centralauth]> select max(gu_id) from globaluser;
+------------+
| max(gu_id) |
+------------+
|   41826616 |
+------------+
1 row in set (0.00 sec)
mysql:wikiadmin@db1033 [centralauth]> select max(gu_id) from globaluser;
+------------+
| max(gu_id) |
+------------+
|   41984206 |
+------------+
1 row in set (0.00 sec)

Something weird is happening.

Replication for s7 is running, the replication rules (added earlier in this bug) still exist, the statements are appearing in the dbstore1002 relay log, and the centralauth tables seem intact, but there has been no new data for a week. Other shards seem OK.

A week ago dbstore1002 did have an outage caused by analytics queries filling up the /tmp mount, which required a mysqld restart. Perhaps that is connected somehow. There is no more useful information right now.

All of s7 is affected:

arwiki
cawiki
eswiki
fawiki
frwiktionary
hewiki
huwiki
kowiki
metawiki
rowiki
ukwiki
viwiki

...and centralauth.

The other wikis all show up-to-date data.

Oops, sorry, this was fixed a while back.