Eventlogging tables not replicating from master to slave
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	• Nuria
	Oct 26 2015, 3:39 PM

Description

Since 201522 there is an issue with Eventlogging SQL data. Data is not replicating from master to slave for some tables.

For example for table: MobileWikiAppSearch_10641988

In master: (m4)

mysql>
mysql> use log;
Database changed
mysql> select max(timestamp) from log.MobileWikiAppSearch_10641988;
+----------------+
| max(timestamp) |
+----------------+
| 20151026145049 |
+----------------+
1 row in set (0.00 sec)

In slave:

mysql:research@analytics-store.eqiad.wmnet [(none)]> use log;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql:research@analytics-store.eqiad.wmnet [log]> select max(timestamp) from log.MobileWikiAppSearch_10641988;
+----------------+
| max(timestamp) |
+----------------+
| 20151022145533 |
+----------------+
1 row in set (0.00 sec)

Related Objects

Mentioned In: T108856: Set up bucketization of editCount fields {tick}

Event Timeline

• Nuria created this task.Oct 26 2015, 3:39 PM

• Nuria assigned this task to jcrespo.

• Nuria raised the priority of this task from to Unbreak Now!.

• Nuria updated the task description. (Show Details)

• Nuria added projects: Analytics-Backlog, MediaWiki-extensions-EventLogging.

• Nuria subscribed.

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 26 2015, 3:39 PM

Since 201522 there is an issue with Eventlogging SQL data. Data is not replicating from master to slave for some tables.

For example for table: MobileWikiAppSearch_10641988

In master: (m4)

mysql>
mysql> use log;
Database changed
mysql> select max(timestamp) from log.MobileWikiAppSearch_10641988;
+----------------+

max(timestamp)

+----------------+

20151026145049

+----------------+
1 row in set (0.00 sec)

In slave:

mysql:research@analytics-store.eqiad.wmnet [(none)]> use log;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql:research@analytics-store.eqiad.wmnet [log]> select max(timestamp) from log.MobileWikiAppSearch_10641988;
+----------------+

max(timestamp)

+----------------+

20151022145533

+----------------+
1 row in set (0.00 sec)

There is something beyond replication lag that has to explain this issue, as lag, at the time of writing of this ticket is of minutes, not days.

• Nuria set Security to None.Oct 26 2015, 3:42 PM

• Nuria added subscribers: Milimetric, Ironholds.

• Nuria added a subscriber: DBA.

jcrespo updated the task description. (Show Details)Oct 26 2015, 3:44 PM

jcrespo merged a task: T116600: Data missing from dbstore1002 but present in dbstore2002.

jcrespo subscribed.

• ezachte subscribed.Oct 26 2015, 5:00 PM

Question: the max of max is:

20151026170824

Does this mean that all data prior to this is there, or are events loaded out of order, independently of each schema?

Why is read_only OFF here?

jcrespo lowered the priority of this task from Unbreak Now! to High.Oct 26 2015, 5:32 PM

All of our dashboards are broken and incapable of being fixed. This is definitely an "unbreak now" from the point of view of our team.

Setting it to high because:

I do not think this is a new issue: on 29 sept, I made an audit of the eventlogging database and analytics-store (dbstore1002) was not part of the logging schema. https://phabricator.wikimedia.org/T108857#1712490

This has been like that likely for months, as the last reference of m4 there I can find is from 2015-08-28.

My recommendation right now is to use the codfw box for analytics, as this will probably require a full database reconstruction from m4-master.

@Ironholds when is the last time you knew your data was "fresh"?

This has been like that likely for months, as the last reference of m4 there I can find is from 2015-08-28.

No, this is been broken since the 22nd. Not months, this data is looked at by developers daily, basically.
If it had not been for the weekend this would have been spotted on on the 23rd.

I think we are missing out on changes on the 22nd/21st that broke this. Could you did a little deeper?
Wasn't it about the 22nd when you were doing computation of intervals for schema purging?

I agree with @Ironholds this is an "unbreak now", more so as time passes and all dashboards are even more outdated. It is not only the analytics team who looks at this data but basically every single product team plus research.

If it is an unbreak-now, can someone please join me on DBA ? I have questions that need answers.

What Nuria said, basically. The idea of "use the codfw box for analytics" ignores the problem that not all services have /access/ to the codfw box.

The data was last fresh on the 22nd, as Nuria says. We wouldn't have events from then if it broke in September.

• kevinator moved this task from Incoming to Radar on the Analytics-Backlog board.Oct 26 2015, 6:10 PM

I added a rule to allow analytics to access mysql at 10.192.32.19/32 which is the long term secondary slave for replication of this data as I understand it. The iptables and replication should be there but this piece had not been dealt with.

Chase added network access from stats1003.eqiad.wmnet to dbstore2002.codfw.wmnet in order to be able to be used as a temporary replacement for dbstore1002.codfw.wmnet.

And what pass/user do we use to access it? Because the 'research' user is getting precisely nowhere.

@Ironholds - I have not created yet the access. If it is active, it will be with the same credentials as you are currently using. However, @Milimetric and other people preferred to have an actual fix.

For everybody- we discovered the issue, scripts on terbium where responsible for backfilling in the background the slave. This process either failed or was interrupted (likely) by the schema changes, as there is no real replication relation between the master and the slave. Being a rather unstandardized method of replication and me not knowing all the details, I failed to recognize it.

Depending on the model, I will fix it in a few minutes, or it will take a bit of time for backfilling it all. But I expect to have it done in 24-48 hours.

jcrespo mentioned this in T108856: Set up bucketization of editCount fields {tick}.Oct 26 2015, 7:34 PM

I found the error on one of springle's screen sessions:

db1047 GatherClicks_12114785 >= 20151022150117 (rows!)ERROR 1136 (21S01) at line 19: Column count doesn't match value count at row 1

It is fixed now, and it will be applied as soon as it physically can. Leaving this ticket open until that happens.

An approximate list of tables that were affected are mentioned in the incident, in case that helps with troubleshooting / correlating the alter schema statements: https://wikitech.wikimedia.org/wiki/Incident_documentation/20151022-EventLogging

db1047 is also involved, doing the changes there, too.

jcrespo added a project: DBA.Oct 26 2015, 9:06 PM

jcrespo moved this task from Triage to In progress on the DBA board.

jcrespo removed a subscriber: DBA.

Waiting for confirmation that everithing is ok on your side to close this ticket.

mysql:research@analytics-store.eqiad.wmnet [log]> SELECT LEFT(timestamp,8), COUNT(*) FROM MobileWikiAppSearch_10641988 WHERE timestamp > '20150930000000' GROUP BY LEFT(timestamp,8);

20150930	232595
20151001	225983
20151002	232353
20151003	267675
20151004	298958
20151005	237340
20151006	234255
20151007	229578
20151008	223746
20151009	232425
20151010	271461
20151011	284058
20151012	242607
20151013	230570
20151014	136978
20151015	220422
20151016	216151
20151017	253137
20151018	277341
20151019	218754
20151020	199498
20151021	193217
20151022	142991
20151023	122223
20151024	112698
20151025	136807
20151026	125432
20151027	71041

Wheee, it works! Thanks Jaime!

jcrespo closed this task as Resolved.Oct 27 2015, 3:23 PM

Eventlogging tables not replicating from master to slave Closed, ResolvedPublicActions

Description

Related Objects

Event Timeline

Eventlogging tables not replicating from master to slave
Closed, ResolvedPublic
Actions