
Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours
Closed, ResolvedPublic5 Estimated Story Points

Description

Preliminary info:

  • As part of regular database maintenance, db1107 (the eventlogging master database) needed to be rebooted to apply kernel and mariadb updates.
  • We run a proxy in front of the eventlogging database, called m4-master, that is responsible for sending traffic to db1107. If the host goes down, it fails over to db1108 (the analytics slave).

Maintenance steps performed:

  1. Stopped replication on db1108 (eventlogging_sync) and disabled puppet.
  2. Stopped Eventlogging Mysql Consumers on eventlog1001, so mysql traffic to db1107 stopped.
  3. Performed maintenance on db1107.
  4. Re-enabled replication on db1108 and the mysql consumers on eventlog1001.

The main issue is that before step 4 we didn't set db1107 back as m4-master, so db1108 remained the target. Once mysql traffic restarted, new events were inserted on db1108.

So now we are in this state:

  • eventlogging mysql traffic is stopped.
  • after 1), some new rows were inserted on db1107 that have not been replicated to db1108.
  • after 4), two hours of eventlogging data were inserted on db1108.

Ideally we could drop these two hours of new data, move the consumer groups' offsets back to an earlier point, and then restart everything to replay the data, but I'm not sure how feasible that is.
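The replay idea above can be modelled in a few lines of Python. This is a toy sketch, not the real consumer: the Kafka topic is a plain list, the committed offset is an integer, and the uuid-based dedup is a dict key check (all names are illustrative).

```python
# Toy model of the proposed recovery: rewind the consumer group's offset
# past the bad window and replay; uuid uniqueness makes the replay safe.
events = [{"uuid": f"uuid-{i}", "payload": f"event {i}"} for i in range(10)]

table = {}   # uuid -> row, standing in for a unique key on uuid

def insert_unique(row):
    """Insert only if the uuid is unseen, like a unique index on uuid."""
    table.setdefault(row["uuid"], row)

# Pretend events 6..9 landed on the wrong host, so only 0..5 are present:
for ev in events[:6]:
    insert_unique(ev)

# Recovery: move the committed offset back to before the bad window ...
offset = 4

# ... and replay from there; rows 4 and 5 already exist and are skipped.
for ev in events[offset:]:
    insert_unique(ev)

print(len(table))   # 10: every event present exactly once, no duplicates
```

The key property is that replaying an already-consumed range is idempotent as long as inserts are keyed on something unique per event.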

Event Timeline

elukey triaged this task as High priority.Mar 6 2018, 12:46 PM
elukey created this task.

We run a proxy in front of the eventlogging database, called m4-master

If we can't write to the failover, then we probably shouldn't use the failover proxy name for the EventLogging mysql writer, eh? Let's change that to db1107.

I don't really get that comment :-)
Can you explain a bit what you mean?

Basically, that as a follow-up step either the log db on db1108 should be able to accept writes, or we hardcode db1107 instead of m4-master in Eventlogging's config to avoid this in the future. But we'll have a lot of time to follow up on this, let's focus on recovering the data first :)

gotcha - thanks.
Let's follow up after the whole thing is recovered.

Mentioned in SAL (#wikimedia-analytics) [2018-03-06T19:06:21Z] <elukey> cleaned up id=0 rows on db1108 (log database) for T188991

It seems that we were (kind of) lucky. For some reason that we don't know (it predates most of us), the tables on the slave db do not have AUTO_INCREMENT on the id field, while on the master they do by default. When the Eventlogging Mysql consumers pushed new rows to db1108 (the slave), they did not specify any id, so the column default of 0 was used (an id not present in any table on the master, since ids all start from 1). For each table, once the first row was inserted with id 0, all subsequent inserts failed because they tried to reuse the same id 0, returning (only sometimes) a "Duplicate row etc.." error in the eventlogging logs.
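The id 0 collision can be reproduced in miniature with sqlite standing in for MySQL (the table and column names here are made up, and sqlite reports a UNIQUE constraint failure where MySQL would report a duplicate entry):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Mimic a slave-side table whose id column is NOT AUTO_INCREMENT:
# an omitted id falls back to the column default (0), as on db1108.
con.execute(
    "CREATE TABLE log_event (id INT PRIMARY KEY NOT NULL DEFAULT 0, payload TEXT)"
)

# The first insert without an explicit id succeeds and lands on id 0 ...
con.execute("INSERT INTO log_event (payload) VALUES ('first event')")

# ... and every later insert collides with that same id 0 row.
try:
    con.execute("INSERT INTO log_event (payload) VALUES ('second event')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

With AUTO_INCREMENT on the master, the omitted id is generated instead of defaulting to 0, which is why the same consumers work fine against db1107.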

So we ended up deleting on the slave all the rows with id 0 (roughly one per table), and we moved the Eventlogging Mysql Kafka consumers' committed offsets back in time, via a script that @Ottomata wrote, to replay all the data. No duplicates were inserted thanks to the uuid field of the eventlogging tables.
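The uuid-based dedup during replay can also be shown at the SQL level, again with sqlite as a stand-in (MySQL's equivalent of `INSERT OR IGNORE` is `INSERT IGNORE`; the schema below is hypothetical, not the real EventLogging one):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# A UNIQUE constraint on uuid models the eventlogging tables' uuid field.
con.execute("CREATE TABLE log_event (uuid TEXT UNIQUE, payload TEXT)")

batch = [("u-1", "a"), ("u-2", "b")]
con.executemany("INSERT OR IGNORE INTO log_event VALUES (?, ?)", batch)

# Replaying the same batch plus one new event inserts only the new row:
replay = batch + [("u-3", "c")]
con.executemany("INSERT OR IGNORE INTO log_event VALUES (?, ?)", replay)

print(con.execute("SELECT COUNT(*) FROM log_event").fetchone()[0])  # 3
```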

Data should now be recovered.

elukey set the point value for this task to 5.Mar 6 2018, 7:41 PM

Before closing this task:

  1. Review the m4-master failover policy.
  2. Document this procedure on wikitech.

A bit of historical context about why db1108 is not read-only:

# History context: there used to be a distinction between
# EL master and slaves, namely that only the master was not
# in read only mode. The Analytics team removed this constraint
# before deploying the eventlogging_cleaner script (T156933),
# that needed to DELETE/UPDATE rows on the job database without
# running as root for obvious reasons.

So what we could do, in theory, is point the m4-master endpoint at something that is not behind dbproxy (i.e. it resolves to db1107 only, so if the host goes down no automatic failover happens).

In case of a db1107 failure, the mysql consumers would keep failing to insert data and the analytics alarms would eventually fire. Resetting the consumer groups to an earlier committed offset would then ensure that no data is lost.
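The safety argument above hinges on one behaviour: the consumer only commits its offset after a successful insert. A minimal sketch of that behaviour (hypothetical names; the real consumers talk to Kafka and MySQL, not Python lists):

```python
# If m4-master resolves to db1107 only and the host is down, inserts fail,
# the committed offset never advances, and the backlog is replayed later.
events = ["e0", "e1", "e2", "e3"]
committed_offset = 0
db_up = False       # db1107 is down for maintenance
stored = []

def try_insert(event):
    if not db_up:
        raise ConnectionError("db1107 unreachable")
    stored.append(event)

def consume_once():
    """One pass over the backlog; the offset only moves past rows
    that were actually written to the database."""
    global committed_offset
    for ev in events[committed_offset:]:
        try:
            try_insert(ev)
        except ConnectionError:
            return  # keep the offset where it is and retry later
        committed_offset += 1

consume_once()      # db down: nothing stored, offset still 0
db_up = True        # maintenance finished
consume_once()      # backlog replayed from the committed offset
print(committed_offset, stored)   # 4 ['e0', 'e1', 'e2', 'e3']
```

Compare this with the automatic-failover case, where the inserts "succeed" against the wrong host and the offset advances past data that the master never saw.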

Change 425231 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Set m4-master to db1107 rather than dbproxy

https://gerrit.wikimedia.org/r/425231

Change 425231 merged by Elukey:
[operations/dns@master] Set m4-master to db1107 rather than dbproxy

https://gerrit.wikimedia.org/r/425231

Mentioned in SAL (#wikimedia-analytics) [2018-04-10T09:00:44Z] <elukey> restart eventlogging mysql consumers on eventlog1002 to pick up new DNS changes for m4-master - T188991