Page MenuHomePhabricator

Sunset MySQL data store for eventlogging
Open, HighPublic0 Story Points

Description

Move eventlogging backend to hadoop

The eventlogging mysql consumer and storage have many issues and perhaps the most poignant is scaling when it comes to tech but also the lack of DBA resources to dedicate to general upkeeping to the eventlogging boxes. Also, there are significant scalability issues, it is been about 6 months that all new data from eventlogging is only persisted in hadoop and schemas need to be whitelisted to be persisted in eventlogging.

This is the parent task to keep track of the work to deprecate entirely the mysql eventlogging consumer in prod.

Details

Related Gerrit Patches:

Related Objects

StatusAssignedTask
OpenOttomata
ResolvedOttomata
ResolvedOttomata
ResolvedJAllemandou
Resolvedmforns
Duplicatemforns
ResolvedOttomata
ResolvedOttomata
ResolvedOttomata
Resolvedmforns
ResolvedOttomata
ResolvedMilimetric
Resolvedmforns
OpenOttomata
Resolvedfdans
DeclinedOttomata
Duplicatefdans
OpenOttomata
Resolvedelukey
Resolvedelukey
Resolvedelukey
Openelukey
ResolvedOttomata
ResolvedMilimetric

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 27 2017, 7:24 PM
Ottomata added a subscriber: Ottomata.EditedFeb 27 2017, 7:28 PM

I wonder if we should checkout Citus DB for this type of stuff. Might be worth exploring...but also it might not.

As discussed previously by email, it will be great to be able to query EL data in full in Hive. But removing the existing MySQL setup completely will impose huge switching costs on many people that use this data currently. Hundreds or thousands of existing queries and recipes will need to be rewritten to account for all the differences between HiveQL and MySQL. Adding in all those pesky pseudo LIMITs enforced by Hive's "strict" mode will only be the easy part. Just as a small example, date/time handling is different (e.g. SELECT YEARWEEK(NOW()) works in MariaDB but not in Hive). Sometimes HiveQL offers advantages - e.g. calculating percentiles directly with PERCENTILE() - but sometimes it also lacks inbuilt functions compared to MySQL.

@Tbayer I think you, @JKatzWMF, @Neil_P._Quinn_WMF, and @mpopov raise good points that we definitely want to address in this work. I'll try to catalog them here and I ask you to please fill in details. Nuria's definitely right that we're having major issues with Event Logging's current infrastructure, but we also definitely want to make sure we don't cause everyone else issues.

  • Tilman brought up the many queries and recipes running against Event Logging. I know of reportupdater ones, and one or two ad-hoc dashboards, as well as Mikhail's currently-self-managed-but-considering-reportupdater work. What else is there? For this I would love a good list of as many of these queries as you all know of. I want to estimate how many there are and what kind they are so I can kind of figure out what we could potentially replace with Pivot and what we have to migrate. As for the MySQL specific syntax, we can install compatibility UDFs for some of the things Tilman mentions.
  • Tilman and Jon brought up Hive being much clumsier than MariaDB for getting quick answers. We can look at Impala, Hawk, Kudu, and other tech that would help with this. It would be really helpful to get a sense of how often these queries happen, one or two examples, and how quick is quick. Like, if you get back results in 1 second is it ok? 5 seconds?
  • Jon and Neil brought up using a GUI to access data. You can do that with a JDBC connection to Hive, so that should be possible now. If we implement any of the caching/speed-up engines mentioned above, this will also be a feature we look for. For an example of how to do this now, check out Hue and queries you can save there: https://hue.wikimedia.org/notebook/editor?editor=15
  • Neil brought up joining to checkuser or edit tag data. The vision for the Data Lake long term is to bring all that in as well, so if we have use cases for it, we'll import it.

Now, our job is to support you. So the questions above are just so we can guide our thinking. At the end of the day what matters is that we have good infrastructure for you all to do your work. We believe that if we continue with EventLogging on MariaDB, that infrastructure is in serious danger and we're trying to think of a better way forward, long term. And we don't want to do that unilaterally at all. Nothing is planned yet, this discussion is the plan. That said, prompt responses are appreciated because the infrastructure gets worse every minute.

Nuria renamed this task from Move eventlogging backend to hadoop to Find an alternative query interface for eventlogging Mariadb storage.Mar 2 2017, 5:12 PM
Nuria moved this task from Incoming to Wikistats Production on the Analytics board.
Ottomata added a comment.EditedMar 16 2017, 7:15 PM

Quick off the top of my head list of alternative techs to explore:

  • ClickHouse
  • CitusDB
  • Apache Drill
  • Impala
  • Presto
  • Kylin
  • Druid + Plywood (?)

lack of DBA resources to dedicate to general upkeeping to the eventlogging boxes

Actually, we could continue supporting MariaDB (we support 160 of those, 2 more are not a huge deal)-I think our uptime for the analytics mariadbs is not bad- what we as DBAS do not have time for is to maintain custom replication solutions, in-database code (triggers, events), the current special architecture, schema changes of all documents, and the special eventlogging application needs (that are very different from the rest of the other mariadbs).

Of course, a different story is that a separate database engine could provide potentially better performance and lower latency, such a a column-based solution specifically thought for analytics, like some of the ones you proposed. I can see that as a huge win on that. Also nosql/json-based solutions with high compression rates can also be a big wins that can avoid future problems on new schema changes.

Nuria renamed this task from Find an alternative query interface for eventlogging Mariadb storage to Find an alternative query interface for eventlogging on analytics cluster that can replace MariaDB.Mar 28 2017, 6:55 PM
Nuria lowered the priority of this task from High to Normal.May 29 2017, 3:37 PM
Nuria moved this task from Wikistats Production to Dashiki on the Analytics board.
Nuria moved this task from Dashiki to Backlog (Later) on the Analytics board.Jul 3 2017, 4:48 PM
fdans edited projects, added Analytics-Kanban; removed Analytics.Oct 26 2017, 4:31 PM
fdans set the point value for this task to 0.
fdans moved this task from Next Up to Parent Tasks on the Analytics-Kanban board.
Nuria renamed this task from Find an alternative query interface for eventlogging on analytics cluster that can replace MariaDB to Sunset MySQL data store for eventlogging. Find an alternative query interface for eventlogging on analytics cluster that can replace MariaDB.Feb 8 2018, 11:41 PM

Have you consider clickhouse? It seems like an interesting open source solution with a familiar SQL interface and much better performance for analytics-like workload. We were considering it for the analytics server on labs- but there will be challenges on loading the data, as it doesn't have a proper replication or update strategy.

We did consider it previously for a different project, and at the time went with Druid. Clickhouse is still a contender for this, as is Presto, and maybe even Elasticsearch. We need to do a lot more research.

@jcrespo: We really liked Clickhouse's performance and interface, but looking at it more closely we came up with the following negatives:

  • At the time, it was hard to install, they were using bleeding-edge GCC and @elukey found a few other bugs and filed issues that they put on their future roadmap
  • It's harder to load data than Druid, you have to massage it, so each dataset we load into it would require attention
  • The performance in general was better than Druid, but for the stuff we needed, Druid was just as fast

I'm not sure what if anything changed since we looked at it (almost 2 years ago), definitely let us know your thoughts on it. Ideally we can choose the same technology and share packaging/puppet to some extent.

Nuria edited projects, added Analytics; removed Analytics-Kanban.Mar 8 2018, 6:38 PM
Nuria moved this task from Backlog (Later) to Incoming on the Analytics board.
Nuria edited projects, added Analytics; removed Analytics-Kanban.
Nuria moved this task from Backlog (Later) to Incoming on the Analytics board.
Nuria edited projects, added Analytics-Kanban; removed Analytics.
Neil_P._Quinn_WMF renamed this task from Sunset MySQL data store for eventlogging. Find an alternative query interface for eventlogging on analytics cluster that can replace MariaDB to Sunset MySQL data store for eventlogging.Mar 15 2018, 11:18 AM
Nuria raised the priority of this task from Normal to High.May 15 2019, 10:06 PM
Nuria updated the task description. (Show Details)May 15 2019, 10:09 PM
Ottomata claimed this task.Jun 27 2019, 5:47 PM

Ok, one step closer to removing mysql-consumer as all queries have been moved out of limn repositories into the reportupdater one, the next step is https://phabricator.wikimedia.org/T223414

Nuria moved this task from Next Up to Parent Tasks on the Analytics-Kanban board.Oct 9 2019, 11:11 PM
Nuria added a comment.Thu, Oct 24, 6:29 PM

ping @Ottomata that I think we are ready to announce that mysql consumer on eventlogging will be shutdown, we will be working this week to remove references to mysql on the docs. T236403: Remove references in doc to mysql storage for EL data

The ChangesListHighlights schema is still active (1,221 events in the past week), but is still blacklisted from Hive refinement (T212367).

I think the Growth team has inherited responsibility for this schema...@Catrope/@marcella/@MMiller_WMF/@Nettrom are you okay with the loss of the ChangesListHighlights data? It's only going to the MySQL storage, which is going to be shut down next week. I'm not certain what you'd need to do to get it flowing into Hive, but it's probably renaming some fields.

nettrom_WMF edited subscribers, added: nettrom_WMF; removed: Nettrom.Mon, Oct 28, 6:14 PM
Nuria added a comment.Mon, Oct 28, 7:25 PM

In this case is more than renaming, schema seems too deeply nested to be persisted in hive (json object inside json object). Given that there does not seem to be much interested on this data maybe we can deactivate the even stream?

Hm actually, looking at the current ChangesListHighlights schema, I don't see any reason why it couldn't be refined. It has a clearly defined schema (sub objects are fine, as long as all properties are defined).

Ah, the filters array items was specified as an array instead of an object. I edited the schema to fix and I can refine! Un-blacklisting it...Posting on that task.

The question now is...do we need to backfill? Hmmm I can for whatever we already have in hadoop! Doing so.

Nuria added a comment.Mon, Oct 28, 8:18 PM

@Ottomata nice! let me update the other task

I've backfilled the last 90 days of data for ChangesListHighlights from MySQL. This schema doesn't have any whitelist defined, so that should be all we got! Now that ChangesListHighlights is in Hive, assuming that it does not block this task.

I've backfilled the last 90 days of data for ChangesListHighlights from MySQL. This schema doesn't have any whitelist defined, so that should be all we got! Now that ChangesListHighlights is in Hive, assuming that it does not block this task.

Yup, I think you're good to go now! Thanks for jumping on it so quickly.

Change 547239 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Absent eventlogging 'replication' between db1107 and db1108

https://gerrit.wikimedia.org/r/547239

Change 547239 merged by Ottomata:
[operations/puppet@production] Absent eventlogging 'replication' between db1107 and db1108

https://gerrit.wikimedia.org/r/547239

Change 547273 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Remove now unused eventlogging::replica puppetization

https://gerrit.wikimedia.org/r/547273

Change 547273 merged by Ottomata:
[operations/puppet@production] Remove now unused eventlogging::replica puppetization

https://gerrit.wikimedia.org/r/547273

Change 547378 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::graphite::alerts: remove old Eventlogging alarms

https://gerrit.wikimedia.org/r/547378

Change 547378 merged by Elukey:
[operations/puppet@production] profile::graphite::alerts: remove old Eventlogging alarms

https://gerrit.wikimedia.org/r/547378