
Lost database changes on s2 for 3 hours on labs replicas
Closed, Declined · Public

Description

Original report:

The replication for zhwiki_p seems inconsistent at the moment: https://zh.wikipedia.org/?curid=5259556 exists, select count(*) from revision where rev_page = 5259556; returns 1, but select * from page where page_id = 5259556; returns Empty set.

This affects all tables of s2 (on labs, not on production) between: 2016-03-08 18:08:24 and 2016-03-08 21:04:38.
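
For illustration, the report boils down to these two checks against a labs replica (labsdb1001 and the zhwiki_p view are assumed here, following later comments in this task):

$ mysql -h labsdb1001 zhwiki_p -e "SELECT COUNT(*) FROM revision WHERE rev_page = 5259556"
# returns 1, so the revision row is there
$ mysql -h labsdb1001 zhwiki_p -e "SELECT * FROM page WHERE page_id = 5259556"
# returns Empty set, although the page exists on production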

Event Timeline

jcrespo renamed this task from Replication broken for zhwiki_p to Data missing in zhwiki on labs replicas. Mar 10 2016, 2:00 PM
jcrespo claimed this task.
jcrespo triaged this task as High priority.
jcrespo moved this task from Triage to In progress on the DBA board.

It seems that the page creation and the first revision of that page failed to be inserted (page 5259556 has 2 revisions, not one, and obviously 1 page row in production). At least one record in each table is missing on both labsdb servers.

Strangely enough, the number of records is the same as in production, so maybe it was inserted with a different id?

db1069
MariaDB  db1069 zhwiki > SELECT count(*) FROM revision WHERE rev_id BETWEEN 39320000 AND 39330000;
+----------+
| count(*) |
+----------+
|     9865 |
+----------+
1 row in set (0.02 sec)

s2-master
mysql -h s2-master zhwiki -e "SELECT count(*) FROM revision WHERE rev_id BETWEEN 39320000 AND 39330000;"
+----------+
| count(*) |
+----------+
|     9865 |
+----------+

I made a mistake earlier and searched a range where the records were present. There are indeed a few records missing in that range:

$ mysql -h dbstore1002 zhwiki -e "SELECT count(*) FROM revision WHERE rev_id BETWEEN 39310000 AND 39320000"
+----------+
| count(*) |
+----------+
|     9908 |
+----------+
$ mysql -h db1069 -P3312 zhwiki -e "SELECT count(*) FROM revision WHERE rev_id BETWEEN 39310000 AND 39320000"
+----------+
| count(*) |
+----------+
|     9565 |
+----------+

(no discrepancies found at lower or higher ids)

This normally means that those rows were skipped or that the server crashed into an inconsistent state. I will reimport those records and then recheck.
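
For reference, one way such a targeted reimport can be done is piping a filtered dump from a production copy into the affected host. This is only a sketch using the hosts quoted above; the exact procedure actually used is not recorded in this task:

$ mysqldump -h dbstore1002 zhwiki revision --no-create-info --insert-ignore --skip-triggers \
    --where="rev_id BETWEEN 39310000 AND 39320000" | mysql -h db1069 -P3312 zhwiki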

Same for page:

root@neodymium:~$ mysql -h db1069 -P3312 zhwiki -e "SELECT count(*) FROM page WHERE page_id BETWEEN 5250000 AND 5260000"
+----------+
| count(*) |
+----------+
|     9377 |
+----------+
root@neodymium:~$ mysql -h db1018 zhwiki -e "SELECT count(*) FROM page WHERE page_id BETWEEN 5250000 AND 5260000"
+----------+
| count(*) |
+----------+
|     9397 |
+----------+

No issues found before or after that range.

Edit: There were only 20 pages lost, not 8000. :-)

I have reimported all the missing records from production to labs. The original query now returns the expected result for me:

mysql -h labsdb1001 zhwiki_p -e "select * from page where page_id = 5259556"
+---------+----------------+-----------------+-------------------+--------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+
| page_id | page_namespace | page_title      | page_restrictions | page_counter | page_is_redirect | page_is_new | page_random    | page_touched   | page_links_updated | page_latest | page_len | page_content_model |
+---------+----------------+-----------------+-------------------+--------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+
| 5259556 |              0 | 歡樂好聲音      |                   |            0 |                0 |           0 | 0.305063367119 | 20160310131738 | 20160310131739     |    39321798 |     3670 | wikitext           |
+---------+----------------+-----------------+-------------------+--------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+

but I will let @jimmyxu confirm it is ok before closing it.

Looks resolved to me, thanks!

Thank you very much for reporting. If you find more issues, please continue reporting them; these reports help make sure both labs and production stay healthy.

Hmm, seems the user table is also missing rows:

MariaDB [zhwiki_p]> select * from user where user_name = '长几';
Empty set (0.01 sec)
jcrespo renamed this task from Data missing in zhwiki on labs replicas to Lost database changes on s2 for 3 hours on labs replicas. Mar 13 2016, 4:59 PM
jcrespo updated the task description.

The plan here is going to be:

  • Reimport, one by one, the appropriate ranges from page, revision, user and recentchanges

This will fix the most important issues related to data consistency.

Over the following weeks, reimport the whole shard from production at the same time that the new labsdb host is set up.
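
A sketch of how the affected ranges can be located per table, by comparing per-bucket counts between production and the labs-side host and diffing the output (hosts are the ones quoted earlier; the exact queries used are not recorded here):

$ mysql -h dbstore1002 zhwiki -e "SELECT FLOOR(rev_id/10000) AS bucket, COUNT(*) FROM revision WHERE rev_timestamp BETWEEN '20160308180824' AND '20160308210438' GROUP BY bucket" > prod.txt
$ mysql -h db1069 -P3312 zhwiki -e "SELECT FLOOR(rev_id/10000) AS bucket, COUNT(*) FROM revision WHERE rev_timestamp BETWEEN '20160308180824' AND '20160308210438' GROUP BY bucket" > labs.txt
$ diff prod.txt labs.txt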

jcrespo closed this task as Resolved. Edited Mar 17 2016, 10:56 AM

I think the root cause was corruption of the relay log. This started with a forced stop of replication caused by a permission mismatch on trigger owners. It was not detected in time because of the special filtering configuration of the host sitting between production and labs, and because it only failed partially. This led to several missing events from production, affecting all labs servers for that particular shard.
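
For context, these are the kind of checks that surface this type of failure; they are illustrative only, not commands taken from the incident:

$ mysql -h db1069 -P3312 -e "SHOW SLAVE STATUS\G" | grep -E "Last_(IO|SQL)_Err|Seconds_Behind_Master"
$ mysql -h db1069 -P3312 -e "SELECT TRIGGER_SCHEMA, TRIGGER_NAME, DEFINER FROM information_schema.TRIGGERS WHERE TRIGGER_SCHEMA = 'zhwiki'"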

I've reimported the missing ranges on all s2 wikis for the following tables:

page
recentchanges
revision
user
logging

This fix is imperfect, because tables such as user do not have a primary key aligned with time (only user_touched), and the tables are not append-only. However, it provides reasonably consistent metadata for content and edits; the rest (derived tables, properties) will be fixed either by natural updates (*link tables) or by the long-term reimport.

I am going to close this ticket, as the original complaints should be solved. 99% of the data should now be correct, but the final 1% requires the full reimport. The reason I am delaying the reimport is that it creates a certain amount of unavailability, lag and other annoyances, so it needs to be planned carefully, but it should be doable within a few weeks.

Of course, if there is a specific fix that I can apply that directly impacts you, feel free to create a new ticket (as usual) so that I can attend those faster.

Could the similar issue at T115517 be solved in a similar fashion?

Thank you for pointing it out, I was not aware of that issue. I will do the same: solve the most important consistency errors first, then solve things completely during the same reimport.

It looks like the logging table is still missing some records. Could you please take a look at this?

I will. I may have reinserted some records, but not deleted them, because they can be recreated later. I will see if it is easily solvable (some wikis are small enough to be fully reimported).

Is there any chance the missing rows in the trwiki database can be imported? My tools have been experiencing inconsistency issues for a few weeks.

Those were imported already, and I have manually deleted that record, but please do not expect full consistency until all wikis are reimported; that will take weeks to complete.

Superyetkin raised the priority of this task from High to Unbreak Now!. Apr 15 2016, 4:49 PM

Which record are you referring to? The logging table is still missing some records. See this.

jcrespo lowered the priority of this task from Unbreak Now! to High. Apr 15 2016, 5:09 PM

This is not an Unbreak Now! issue because it is not causing insurmountable problems on essential infrastructure. As I said, all tables are being imported now, and raising the priority will not make the reimport any faster.

If you cannot wait, as an alternative you can set up your own wiki from the dumps: https://dumps.wikimedia.org/trwiki/
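
For example, the per-table SQL dumps can be loaded into a local MariaDB roughly like this (the filenames follow the usual dumps.wikimedia.org naming; check the directory listing for the current files):

$ mysql -e "CREATE DATABASE IF NOT EXISTS trwiki"
$ wget https://dumps.wikimedia.org/trwiki/latest/trwiki-latest-page.sql.gz
$ gunzip -c trwiki-latest-page.sql.gz | mysql trwiki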

labsdb1001 and labsdb1003 will be decommissioned at some point in the near future. This will be fixed not by patching things there, but by reloading the whole database fresh from production, including zhwiki. It is very difficult to fix the current setup without disrupting current users.

Also, the initial report was fixed; the task was reopened because other issues started to get mixed in here.