db1068 (s4/commonswiki slave) is missing data about at least 6 users
Closed, Resolved · Public

Description

On db1040, the s4 master the user "Hobbers10" exists:

mysql:wikiadmin@db1040 [commonswiki]> select user_id from user where user_name="Hobbers10";
+---------+
| user_id |
+---------+
| 4076491 |
+---------+
1 row in set (0.00 sec)

On db1068:

mysql:wikiadmin@db1068 [commonswiki]> select user_id from user where user_name="Hobbers10";
Empty set (0.00 sec)

If you refresh https://commons.wikimedia.org/wiki/Special:Contributions/Hobbers10 enough times you'll occasionally see "User account "Hobbers10" is not registered."

Other likely affected users:

commonswiki:  Local user not found for localname entry Captain Jack Riley@commonswiki
commonswiki:  Local user not found for localname entry Czsheng@commonswiki
commonswiki:  Local user not found for localname entry Dobri.kovachev@commonswiki
commonswiki:  Local user not found for localname entry Dương Thanh Tùng@commonswiki
commonswiki:  Local user not found for localname entry Rani Jyoti@commonswiki

Event Timeline

Legoktm created this task. Mar 8 2015, 4:27 AM
Legoktm raised the priority of this task to Unbreak Now!.
Legoktm updated the task description.
Legoktm added subscribers: Legoktm, Springle, Krenair, Keegan.
Restricted Application added a subscriber: Aklapper. Mar 8 2015, 4:27 AM

I checked the other slaves, only this particular one (db1068.eqiad.wmnet) is missing that entry.

Legoktm updated the task description. Mar 8 2015, 4:46 AM
Legoktm set Security to None.
Krenair added a comment. Edited Mar 8 2015, 5:00 AM

The 5 other affected users are also on master but not db1068 - and they were all registered on commons within minutes of each other:

+-------------------+
| user_registration |
+-------------------+
| 20140905030419    |
| 20140905030543    |
| 20140905030448    |
| 20140905030546    |
| 20140905030505    |
| 20140905030457    |
+-------------------+
6 rows in set (0.00 sec)

mysql:wikiadmin@db1068.eqiad.wmnet [commonswiki]> select user_id from user where user_id >= 4076487 limit 2;
+---------+
| user_id |
+---------+
| 4076487 |
| 4076494 |
+---------+
2 rows in set (0.00 sec)
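
The jump from 4076487 to 4076494 on db1068 can be turned into a list of candidate missing IDs. The following is a minimal sketch, not anything that was run in this task; `find_missing_ids` is a hypothetical helper, and a gap in an auto-increment column is only a hint (IDs can be skipped legitimately), so the candidates still need to be checked against the master.

```python
def find_missing_ids(present_ids, lo, hi):
    """Return the IDs in [lo, hi] that do not appear in present_ids."""
    present = set(present_ids)
    return [i for i in range(lo, hi + 1) if i not in present]

# The two user_id values db1068 returned on either side of the gap.
replica_ids = [4076487, 4076494]

missing = find_missing_ids(replica_ids, 4076487, 4076494)
print(missing)
# [4076488, 4076489, 4076490, 4076491, 4076492, 4076493]
```

That gives six candidate IDs, which matches the six registration timestamps above.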
Krenair renamed this task from db1068 (s4/commonswiki slave) is missing data about some users to db1068 (s4/commonswiki slave) is missing data about at least 6 users. Mar 8 2015, 5:08 AM
Springle claimed this task. Mar 8 2015, 10:28 PM

Binary logs don't go back that far. Starting a sync check to see how large the problem is.

Thanks. How (un)likely is it that other wikis' slaves might have similar issues? Or is this an isolated incident?

Unknown at this stage. The sync-check process is running on db1068 and two other s4 slaves. I included all tables, not just user, so it's a slow process. Once that's done I'll compare results and hopefully know more.

Harej added a subscriber: Harej. Mar 10 2015, 5:57 AM

Are the other two db1062 and es1002?

Springle added a comment. Edited Mar 11 2015, 5:31 AM

No, db1070 and db1064, both S4 slaves.

db1062 is in S7 and es1002 is in ES1 (external storage). Why do you mention those boxes?

The sync check process has finished. Good news is that db1068 is the only slave showing discrepancies, and the affected user records are:

4076488
4076489
4076490
4076491
4076492
4076493

Bad news is that this really doesn't help trace the cause. One possibility is that db1068 is running 10.0.13, which was the MariaDB version showing replication problems in labs last year; however, the bug in that case related to multi-source replication, which labs uses and production does not.

So, still digging. db1068 will stay depooled until it gets rebuilt and upgraded. Will also sync-check the other remaining 10.0.13 slaves.
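
For readers unfamiliar with what a sync check boils down to, here is a minimal sketch of a primary-key comparison between a master and a replica. It is not the process that was run here: it uses two in-memory SQLite databases as stand-ins for db1040 and db1068, `make_db` and `missing_on_replica` are hypothetical helpers, and the user names are placeholders rather than real accounts.

```python
import sqlite3

def make_db(rows):
    """Build an in-memory stand-in for a server holding a tiny user table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE user (user_id INTEGER PRIMARY KEY, user_name TEXT)")
    conn.executemany("INSERT INTO user VALUES (?, ?)", rows)
    return conn

def missing_on_replica(master, replica, lo, hi):
    """Return user_ids in [lo, hi] that exist on master but not on replica."""
    query = "SELECT user_id FROM user WHERE user_id BETWEEN ? AND ?"
    master_ids = {row[0] for row in master.execute(query, (lo, hi))}
    replica_ids = {row[0] for row in replica.execute(query, (lo, hi))}
    return sorted(master_ids - replica_ids)

# Placeholder data: the master has every ID, the replica lacks 4076488..4076493.
all_rows = [(i, f"placeholder_{i}") for i in range(4076487, 4076495)]
master = make_db(all_rows)
replica = make_db([r for r in all_rows if r[0] in (4076487, 4076494)])

diff = missing_on_replica(master, replica, 4076487, 4076494)
print(diff)
```

A key-only comparison like this finds missing rows but not rows whose contents differ, which is one reason to checksum whole rows instead.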

https://wikitech.wikimedia.org/wiki/Server_admin_log/Archive_25#September_5 - look at those timestamps against the user_registration of the missing rows:

The 5 other affected users are also on master but not db1068 - and they were all registered on commons within minutes of each other:

+-------------------+
| user_registration |
+-------------------+
| 20140905030419    |
| 20140905030543    |
| 20140905030448    |
| 20140905030546    |
| 20140905030505    |
| 20140905030457    |
+-------------------+
6 rows in set (0.00 sec)
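
To line these values up against the admin log, it helps to convert them to readable times. A small sketch, assuming the usual MediaWiki 14-digit `YYYYMMDDHHMMSS` format in UTC:

```python
from datetime import datetime

# user_registration values of the six affected rows, as listed above.
stamps = [
    "20140905030419",
    "20140905030543",
    "20140905030448",
    "20140905030546",
    "20140905030505",
    "20140905030457",
]

parsed = sorted(datetime.strptime(s, "%Y%m%d%H%M%S") for s in stamps)
window = parsed[-1] - parsed[0]

print(parsed[0], "to", parsed[-1])
print(window.total_seconds())  # 87.0
```

So all six registrations fall inside an 87-second window, from 03:04:19 to 03:05:46 on 2014-09-05.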
Steinsplitter moved this task from Incoming to Backlog on the Commons board.
Keegan removed a subscriber: Keegan.

Any news / progress here? Asking as this has "Unbreak now" priority...

Given that this DB server is out of rotation, I doubt it's actually still unbreak now. We also know that there are no other users missing.

Krenair lowered the priority of this task from Unbreak Now! to Medium. Apr 1 2015, 7:12 PM
Springle closed this task as Resolved. May 3 2015, 3:00 AM

db1068 is recloned and repooled.

The exact cause is still unknown. Krenair's SAL link suggests the problem had something to do with the db1068 upgrade to 10.0.13, based on the timestamps (which is circumstantial, but a fair and logical observation). All we can do at this stage is watch carefully and do more sanity checks during future upgrades in the form of pt-table-checksum.

The remaining 10.0.13 slaves have not shown discrepancies, but are in the process of being upgraded regardless.
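
As a rough illustration of the idea behind a chunked checksum comparison: hash rows in key-range chunks on each server, then look only at the chunks whose hashes disagree. This is a sketch of the general technique and not pt-table-checksum's actual implementation; `chunk_checksums` and `differing_ranges` are hypothetical helpers, and the rows are placeholders.

```python
import hashlib

def chunk_checksums(rows, chunk_size):
    """Group rows into key-range buckets and hash each bucket's contents."""
    buckets = {}
    for row in sorted(rows):
        buckets.setdefault(row[0] // chunk_size, []).append(row)
    return {
        bucket: hashlib.sha256(repr(chunk).encode("utf-8")).hexdigest()
        for bucket, chunk in buckets.items()
    }

def differing_ranges(master_rows, replica_rows, chunk_size):
    """Return (first_id, last_id) for every chunk whose checksums disagree."""
    m = chunk_checksums(master_rows, chunk_size)
    r = chunk_checksums(replica_rows, chunk_size)
    bad = sorted(b for b in m.keys() | r.keys() if m.get(b) != r.get(b))
    return [(b * chunk_size, b * chunk_size + chunk_size - 1) for b in bad]

# Placeholder rows: the replica lacks user_ids 4076488..4076493.
master_rows = [(i, f"placeholder_{i}") for i in range(4076480, 4076500)]
replica_rows = [r for r in master_rows if not 4076488 <= r[0] <= 4076493]

ranges = differing_ranges(master_rows, replica_rows, chunk_size=4)
print(ranges)
```

Only the chunks covering the missing IDs are reported, so a row-by-row comparison is needed only inside those ranges.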