Page MenuHomePhabricator

Eventlogging tables not replicating from master to slave
Closed, ResolvedPublic

Description

Since 201522 there is an issue with Eventlogging SQL data. Data is not replicating from master to slave for some tables.

For example for table: MobileWikiAppSearch_10641988

In master: (m4)

mysql>
mysql> use log;
Database changed
mysql> select max(timestamp) from log.MobileWikiAppSearch_10641988;
+----------------+
| max(timestamp) |
+----------------+
| 20151026145049 |
+----------------+
1 row in set (0.00 sec)

In slave:

mysql:research@analytics-store.eqiad.wmnet [(none)]> use log;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql:research@analytics-store.eqiad.wmnet [log]> select max(timestamp) from log.MobileWikiAppSearch_10641988;
+----------------+
| max(timestamp) |
+----------------+
| 20151022145533 |
+----------------+
1 row in set (0.00 sec)

Event Timeline

Nuria assigned this task to jcrespo.
Nuria raised the priority of this task from to Unbreak Now!.
Nuria updated the task description. (Show Details)
Nuria subscribed.

Since 201522 there is an issue with Eventlogging SQL data. Data is not replicating from master to slave for some tables.

For example for table: MobileWikiAppSearch_10641988

In master: (m4)

mysql>
mysql> use log;
Database changed
mysql> select max(timestamp) from log.MobileWikiAppSearch_10641988;
+----------------+

max(timestamp)

+----------------+

20151026145049

+----------------+
1 row in set (0.00 sec)

In slave:

mysql:research@analytics-store.eqiad.wmnet [(none)]> use log;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql:research@analytics-store.eqiad.wmnet [log]> select max(timestamp) from log.MobileWikiAppSearch_10641988;
+----------------+

max(timestamp)

+----------------+

20151022145533

+----------------+
1 row in set (0.00 sec)

There is something beyond replication lag that has to explain this issue, as lag, at the time of writing of this ticket is of minutes, not days.

Nuria added subscribers: Milimetric, Ironholds.
Nuria added a subscriber: DBA.

Question: the max of max is:

20151026170824

Does this mean that all data prior to this is there, or are events loaded out of order, independently of each schema?

Why is read_only OFF here?

jcrespo lowered the priority of this task from Unbreak Now! to High.Oct 26 2015, 5:32 PM

All of our dashboards are broken and incapable of being fixed. This is definitely an "unbreak now" from the point of view of our team.

Setting it to high because:

I do not think this is a new issue: on 29 sept, I made an audit of the eventlogging database and analytics-store (dbstore1002) was not part of the logging schema. https://phabricator.wikimedia.org/T108857#1712490

This has been like that likely for months, as the last reference of m4 there I can find is from 2015-08-28.

My recommendation right now is to use the codfw box for analytics, as this will probably require a full database reconstruction from m4-master.

@Ironholds when is the last time you knew your data was "fresh"?

This has been like that likely for months, as the last reference of m4 there I can find is from 2015-08-28.

No, this is been broken since the 22nd. Not months, this data is looked at by developers daily, basically.
If it had not been for the weekend this would have been spotted on on the 23rd.

I think we are missing out on changes on the 22nd/21st that broke this. Could you did a little deeper?
Wasn't it about the 22nd when you were doing computation of intervals for schema purging?

I agree with @Ironholds this is an "unbreak now", more so as time passes and all dashboards are even more outdated. It is not only the analytics team who looks at this data but basically every single product team plus research.

If it is an unbreak-now, can someone please join me on DBA ? I have questions that need answers.

What Nuria said, basically. The idea of "use the codfw box for analytics" ignores the problem that not all services have /access/ to the codfw box.

The data was last fresh on the 22nd, as Nuria says. We wouldn't have events from then if it broke in September.

I added a rule to allow analytics to access mysql at 10.192.32.19/32 which is the long term secondary slave for replication of this data as I understand it. The iptables and replication should be there but this piece had not been dealt with.

Chase added network access from stats1003.eqiad.wmnet to dbstore2002.codfw.wmnet in order to be able to be used as a temporary replacement for dbstore1002.codfw.wmnet.

And what pass/user do we use to access it? Because the 'research' user is getting precisely nowhere.

@Ironholds - I have not created yet the access. If it is active, it will be with the same credentials as you are currently using. However, @Milimetric and other people preferred to have an actual fix.

For everybody- we discovered the issue, scripts on terbium where responsible for backfilling in the background the slave. This process either failed or was interrupted (likely) by the schema changes, as there is no real replication relation between the master and the slave. Being a rather unstandardized method of replication and me not knowing all the details, I failed to recognize it.

Depending on the model, I will fix it in a few minutes, or it will take a bit of time for backfilling it all. But I expect to have it done in 24-48 hours.

I found the error on one of springle's screen sessions:

db1047 GatherClicks_12114785 >= 20151022150117 (rows!)ERROR 1136 (21S01) at line 19: Column count doesn't match value count at row 1

It is fixed now, and it will be applied as soon as it physically can. Leaving this ticket open until that happens.

An approximate list of tables that were affected are mentioned in the incident, in case that helps with troubleshooting / correlating the alter schema statements: https://wikitech.wikimedia.org/wiki/Incident_documentation/20151022-EventLogging

db1047 is also involved, doing the changes there, too.

jcrespo moved this task from Triage to In progress on the DBA board.
jcrespo removed a subscriber: DBA.

Waiting for confirmation that everithing is ok on your side to close this ticket.

mysql:research@analytics-store.eqiad.wmnet [log]> SELECT LEFT(timestamp,8), COUNT(*) FROM MobileWikiAppSearch_10641988 WHERE timestamp > '20150930000000' GROUP BY LEFT(timestamp,8);

20150930232595
20151001225983
20151002232353
20151003267675
20151004298958
20151005237340
20151006234255
20151007229578
20151008223746
20151009232425
20151010271461
20151011284058
20151012242607
20151013230570
20151014136978
20151015220422
20151016216151
20151017253137
20151018277341
20151019218754
20151020199498
20151021193217
20151022142991
20151023122223
20151024112698
20151025136807
20151026125432
2015102771041

Wheee, it works! Thanks Jaime!