
Sunset MySQL data store for eventlogging
Closed, Resolved · Public · 0 Estimated Story Points

Description

Move eventlogging backend to hadoop

The eventlogging MySQL consumer and storage have many issues; perhaps the most pressing is scaling, both on the technical side and in the lack of DBA resources to dedicate to general upkeep of the eventlogging boxes. For about six months now, all new eventlogging data has been persisted only in Hadoop, and schemas need to be whitelisted to be persisted in the eventlogging MySQL store.

This is the parent task to track the work to entirely deprecate the MySQL eventlogging consumer in production.

Related Objects

Status · Subtype · Assigned · Task
Resolved · Ottomata
Resolved · Ottomata
Resolved · Ottomata
Resolved · JAllemandou
Resolved · mforns
Duplicate · mforns
Resolved · Ottomata
Resolved · Ottomata
Resolved · Ottomata
Resolved · mforns
Resolved · Ottomata
Resolved · Milimetric
Resolved · mforns
Resolved · Ottomata
Resolved · fdans
Declined · Ottomata
Duplicate · fdans
Resolved · Ottomata
Resolved · elukey
Resolved · elukey
Resolved · elukey
Resolved · elukey
Resolved · Ottomata
Resolved · Milimetric
Resolved · elukey
Resolved · Marostegui

Event Timeline

Ottomata added a comment. Edited Mar 16 2017, 7:15 PM

Quick off the top of my head list of alternative techs to explore:

  • ClickHouse
  • CitusDB
  • Apache Drill
  • Impala
  • Presto
  • Kylin
  • Druid + Plywood (?)

lack of DBA resources to dedicate to general upkeeping to the eventlogging boxes

Actually, we could continue supporting MariaDB (we support 160 of those; 2 more are not a huge deal), and I think our uptime for the analytics MariaDBs is not bad. What we as DBAs do not have time for is maintaining custom replication solutions, in-database code (triggers, events), the current special architecture, schema changes of all documents, and the special eventlogging application needs (which are very different from the rest of the other MariaDBs).

Of course, a different story is that a separate database engine could potentially provide better performance and lower latency, such as a column-based solution designed specifically for analytics, like some of the ones you proposed. I can see a huge win there. Also, NoSQL/JSON-based solutions with high compression rates can be a big win and could avoid future problems with new schema changes.

Nuria renamed this task from Find an alternative query interface for eventlogging Mariadb storage to Find an alternative query interface for eventlogging on analytics cluster that can replace MariaDB.Mar 28 2017, 6:55 PM
Nuria lowered the priority of this task from High to Medium.May 29 2017, 3:37 PM
Nuria moved this task from Wikistats to Dashiki on the Analytics board.
Nuria moved this task from Dashiki to Backlog (Later) on the Analytics board.Jul 3 2017, 4:48 PM
fdans edited projects, added Analytics-Kanban; removed Analytics.Oct 26 2017, 4:31 PM
fdans set the point value for this task to 0.
fdans moved this task from Next Up to Parent Tasks on the Analytics-Kanban board.
Nuria renamed this task from Find an alternative query interface for eventlogging on analytics cluster that can replace MariaDB to Sunset MySQL data store for eventlogging. Find an alternative query interface for eventlogging on analytics cluster that can replace MariaDB.Feb 8 2018, 11:41 PM

Have you considered ClickHouse? It seems like an interesting open source solution with a familiar SQL interface and much better performance for analytics-like workloads. We were considering it for the analytics server on labs, but there will be challenges in loading the data, as it doesn't have a proper replication or update strategy.

We did consider it previously for a different project, and at the time went with Druid. Clickhouse is still a contender for this, as is Presto, and maybe even Elasticsearch. We need to do a lot more research.

@jcrespo: We really liked Clickhouse's performance and interface, but looking at it more closely we came up with the following negatives:

  • At the time, it was hard to install: it required a bleeding-edge GCC, and @elukey found a few other bugs and filed issues that they put on their future roadmap
  • It's harder to load data into than Druid; you have to massage the data, so each dataset we load into it would require attention
  • Performance in general was better than Druid's, but for the workloads we needed, Druid was just as fast

I'm not sure what, if anything, has changed since we looked at it (almost 2 years ago); definitely let us know your thoughts on it. Ideally we can choose the same technology and share packaging/puppet to some extent.

Nuria edited projects, added Analytics; removed Analytics-Kanban.Mar 8 2018, 6:38 PM
Nuria moved this task from Backlog (Later) to Incoming on the Analytics board.
Nuria edited projects, added Analytics; removed Analytics-Kanban.
Nuria moved this task from Backlog (Later) to Incoming on the Analytics board.
Nuria edited projects, added Analytics-Kanban; removed Analytics.
nshahquinn-wmf renamed this task from Sunset MySQL data store for eventlogging. Find an alternative query interface for eventlogging on analytics cluster that can replace MariaDB to Sunset MySQL data store for eventlogging.Mar 15 2018, 11:18 AM
Nuria raised the priority of this task from Medium to High.May 15 2019, 10:06 PM
Nuria updated the task description.May 15 2019, 10:09 PM
Ottomata claimed this task.Jun 27 2019, 5:47 PM

Ok, one step closer to removing the mysql consumer: all queries have been moved out of the limn repositories into the reportupdater one. The next step is https://phabricator.wikimedia.org/T223414

Nuria moved this task from Next Up to Parent Tasks on the Analytics-Kanban board.Oct 9 2019, 11:11 PM
Nuria added a comment.Oct 24 2019, 6:29 PM

ping @Ottomata: I think we are ready to announce that the mysql consumer on eventlogging will be shut down. We will be working this week to remove references to mysql in the docs. T236403: Remove references in doc to mysql storage for EL data

The ChangesListHighlights schema is still active (1,221 events in the past week), but it is still blacklisted from Hive refinement (T212367).

I think the Growth team has inherited responsibility for this schema...@Catrope/@marcella/@MMiller_WMF/@Nettrom are you okay with the loss of the ChangesListHighlights data? It's only going to the MySQL storage, which is going to be shut down next week. I'm not certain what you'd need to do to get it flowing into Hive, but it's probably renaming some fields.

nettrom_WMF edited subscribers, added: nettrom_WMF; removed: Nettrom.Oct 28 2019, 6:14 PM
Nuria added a comment.Oct 28 2019, 7:25 PM

In this case it's more than renaming; the schema seems too deeply nested to be persisted in Hive (a JSON object inside a JSON object). Given that there does not seem to be much interest in this data, maybe we can deactivate the event stream?

Hm actually, looking at the current ChangesListHighlights schema, I don't see any reason why it couldn't be refined. It has a clearly defined schema (sub objects are fine, as long as all properties are defined).

Ah, the filters array's items was specified as an array instead of an object. I edited the schema to fix it, and now I can refine! Un-blacklisting it...Posting on that task.
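The kind of fix described here can be sketched as follows. This is a hypothetical illustration, not the actual ChangesListHighlights schema: in JSON Schema, giving `items` a list of schemas means tuple-style validation (one schema per array position), so there is no single element type for refinement to map onto a Hive array type; a single object schema applies to every element and maps cleanly onto something like `array<string>`.

```python
# Hypothetical sketch of the schema fix described above; the real
# ChangesListHighlights schema differs in field names and types.

# Broken: "items" is a *list* of schemas (tuple-style validation),
# so there is no single element type to derive a Hive array type from.
broken = {
    "type": "object",
    "properties": {
        "filters": {
            "type": "array",
            "items": [{"type": "string"}],
        },
    },
}

# Fixed: "items" is a single schema object applied to every element,
# which maps cleanly onto Hive's array<string>.
fixed = {
    "type": "object",
    "properties": {
        "filters": {
            "type": "array",
            "items": {"type": "string"},
        },
    },
}

def hive_array_type(array_schema):
    """Return a Hive type string for an array field when "items" is a
    single schema object; return None for tuple-style "items"."""
    items = array_schema.get("items")
    if isinstance(items, dict):
        return "array<%s>" % items.get("type", "unknown")
    return None

print(hive_array_type(broken["properties"]["filters"]))  # None
print(hive_array_type(fixed["properties"]["filters"]))   # array<string>
```

The `hive_array_type` helper is illustrative only; the real refinement logic lives in the Refine jobs in the analytics-refinery codebase.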

The question now is...do we need to backfill? Hmmm I can for whatever we already have in hadoop! Doing so.

Nuria added a comment.Oct 28 2019, 8:18 PM

@Ottomata nice! let me update the other task

I've backfilled the last 90 days of ChangesListHighlights data from MySQL. This schema doesn't have any whitelist defined, so that should be all we've got. Now that ChangesListHighlights is in Hive, I'm assuming it no longer blocks this task.

Yup, I think you're good to go now! Thanks for jumping on it so quickly.

Change 547239 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Absent eventlogging 'replication' between db1107 and db1108

https://gerrit.wikimedia.org/r/547239

Change 547239 merged by Ottomata:
[operations/puppet@production] Absent eventlogging 'replication' between db1107 and db1108

https://gerrit.wikimedia.org/r/547239

Change 547273 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Remove now unused eventlogging::replica puppetization

https://gerrit.wikimedia.org/r/547273

Change 547273 merged by Ottomata:
[operations/puppet@production] Remove now unused eventlogging::replica puppetization

https://gerrit.wikimedia.org/r/547273

Change 547378 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::graphite::alerts: remove old Eventlogging alarms

https://gerrit.wikimedia.org/r/547378

Change 547378 merged by Elukey:
[operations/puppet@production] profile::graphite::alerts: remove old Eventlogging alarms

https://gerrit.wikimedia.org/r/547378

Marostegui changed the status of subtask T238113: Repurpose db1107 as a generic database from Stalled to Open.Nov 21 2019, 6:12 AM

Mentioned in SAL (#wikimedia-operations) [2019-11-25T13:27:23Z] <elukey> set global read_only=1 on db1108's log database - T159170

Mentioned in SAL (#wikimedia-operations) [2019-11-28T08:19:57Z] <marostegui> Remove m4 from tendril and zarcillo - T159170

Milimetric closed this task as Resolved.Jan 2 2020, 6:05 PM
Milimetric moved this task from Parent Tasks to Done on the Analytics-Kanban board.

Well done!