
Move reportupdater reports that pull data from EventLogging MySQL to pull data from Hadoop
Closed, Resolved · Public · 8 Estimated Story Points

Description

Move reportupdater reports that pull data from the EventLogging MySQL database to pull from Hadoop instead, via Hive queries, or even transform them into Hive-to-Druid jobs that surface data in Superset. We need to look at every report to determine the best solution for it. We also need to keep in mind that Druid might hit a limit on the number of datasources.

What about joins between analytics replica data and EventLogging data? (Those could only happen monthly, but it seems we do not have any of those right now.)

Event Timeline

Tables are:

MediaWikiPingback_15781718
FlowReplies_10561344_15423246
FlowReplies_10561344

If those tables have data in MySQL that is not in the event_sanitized database in Hadoop, we will need to backfill it.

MediaWikiPingback is definitely used; let's verify that FlowReplies is also used.

mforns triaged this task as High priority. May 16 2019, 5:05 PM
mforns moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

I think we also need to take into account reports that read data from mediawiki_* tables, that is, data now coming from EventBus events. Some important dashboards sit on top of this data, like the ones built for page creations for ACTRIAL on enwiki.

Oh yeah, we missed that one because it’s in the reportupdater-queries repository already. I’ve added it to my comment above with the others.

Note to team: @kaldari has confirmed that we do not need the page-creation dashboards, now that page creation data is on Wikistats and updated monthly, so there is no need to worry about those reports.

Change 540159 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/reportupdater-queries@master] Migrate reports from MySQL EventLogging to Hive

https://gerrit.wikimedia.org/r/540159

Change 540641 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] reportupdater::jobs::mysql.pp: Absent flow reports

https://gerrit.wikimedia.org/r/540641

Change 540641 merged by Ottomata:
[operations/puppet@production] reportupdater::jobs::mysql.pp: Absent flow reports

https://gerrit.wikimedia.org/r/540641

Change 540658 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] reportupdater::jobs::mysql.pp: Absent jobs affected by migration

https://gerrit.wikimedia.org/r/540658

Change 540661 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] ::reportupdater::jobs: Migrate MySQL jobs to Hive

https://gerrit.wikimedia.org/r/540661

Hi @CCicalese_WMF and @santhosh

We in Analytics are migrating some reportupdater reports from MySQL to Hive. The reason is that the EventLogging MySQL database has been deprecated for a while for performance and scalability reasons, and we are decommissioning it completely as part of our goals.

We have already migrated your queries for pingback and published_cx2_translations to use EventLogging in Hive; see https://gerrit.wikimedia.org/r/#/c/analytics/reportupdater-queries/+/540159/. We'll be deploying this soon, probably next Monday.

One of the advantages of this migration is that you won't need to update your report queries every time you modify your EL schema: the refinement process for EL in Hive handles schema versioning transparently. The only difference on your side is that the migrated reports will no longer be processed on stat1006.eqiad.wmnet, but rather on stat1007.eqiad.wmnet (a limitation of our current stats machines' configuration), so be aware of that when looking for your report files. Nevertheless, the reports will continue to be rsync'ed to https://analytics.wikimedia.org/datasets/periodic/reports/metrics/, so if you have a dashboard configured on top of that data, it should continue to work without problems.
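For illustration, a migrated report query against the refined table looks roughly like this (a sketch only; the metric is illustrative, and it assumes the standard year/month/day partition layout of the refined EventLogging tables):

```sql
-- Hypothetical sketch: counting pingback events per day for one month.
-- The partition columns (year, month, day) let Hive prune data instead of
-- scanning the whole table.
SELECT
  printf('%04d-%02d-%02d', year, month, day) AS report_day,
  COUNT(*) AS events
FROM event_sanitized.mediawikipingback
WHERE year = 2019 AND month = 10
GROUP BY year, month, day
ORDER BY report_day;
```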

Please confirm that you've read this; otherwise, I'll try to reach you via email or IRC.
Cheers!

Change 540658 merged by Ottomata:
[operations/puppet@production] reportupdater::jobs::mysql.pp: Absent jobs affected by migration

https://gerrit.wikimedia.org/r/540658

Change 540159 merged by Ottomata:
[analytics/reportupdater-queries@master] Migrate reports from MySQL EventLogging to Hive

https://gerrit.wikimedia.org/r/540159

Change 540661 merged by Ottomata:
[operations/puppet@production] ::reportupdater::jobs: Migrate MySQL jobs to Hive

https://gerrit.wikimedia.org/r/540661

Change 541324 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] :reportupdater:manifests:job.pp: fix typo in config-file param

https://gerrit.wikimedia.org/r/541324

Change 541359 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Ensure eventlogging-consumer mysql is absent on eventlog1002

https://gerrit.wikimedia.org/r/541359

Change 541324 merged by Ottomata:
[operations/puppet@production] reportupdater: fix typo in config-file param

https://gerrit.wikimedia.org/r/541324

Change 541359 merged by Ottomata:
[operations/puppet@production] Ensure eventlogging-consumer mysql is absent on eventlog1002

https://gerrit.wikimedia.org/r/541359

Change 542169 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Re-enable eventlogging-mysql-consumer until we are more ready

https://gerrit.wikimedia.org/r/542169

Change 542169 merged by Ottomata:
[operations/puppet@production] Re-enable eventlogging-mysql-consumer until we are more ready

https://gerrit.wikimedia.org/r/542169

@CCicalese_WMF Hi :]

After checking the results calculated by reportupdater for the pingback queries over this last week, and fixing a couple of bugs we encountered, we found that there are differences in the reported values. They are due to the fact that pingback data in Hive starts at 2017-11, whereas pingback data in MySQL starts at 2017-04. Since the pingback queries use all data since the beginning of the data set to accumulate weekly results, the values differ: Hive values are about 20% smaller than MySQL values, because Hive is missing around 7 months of data from 2017. Apart from this issue, I believe the other properties of the data are preserved, like trend and proportion.

My question is: would that invalidate your reports? Is that a showstopper?
If so, we could consider other solutions, maybe:

  • Importing the MySQL data into Hive. At this point, this would involve some amount of manual work for us.
  • Modifying the queries to obtain the same insights without querying all the existing data for every data point, which will become more computationally costly over time.
  • ?

Please let me know whether you need those initial 7 months of data. If so, let's discuss the possible solutions.
Otherwise, I'll proceed to apply the solution we found to all queries (accepting the ~20% smaller values) and deploy it, so we can sunset MySQL EventLogging.

Thanks a lot!

The trends over time are one of the important aspects of the pingback data. It would be good to import the missing data from 2017-04 to 2017-11 into Hive so we have coverage from the beginning of the period in which we began to collect that data. However, if there is anything that you can think of that would make the queries more efficient without compromising the fidelity of the data, that would be good. For example, since the historical data will not change, can we calculate and save it so that we don't have to recompute it every time?

@CCicalese_WMF

> The trends over time are one of the important aspects of the pingback data. It would be good to import the missing data from 2017-04 to 2017-11 into Hive so we have coverage from the beginning of the period in which we began to collect that data.

I understand.

> However, if there is anything that you can think of that would make the queries more efficient without compromising the fidelity of the data, that would be good. For example, since the historical data will not change, can we calculate and save it so that we don't have to recompute it every time?

The only thing I can think of is executing the queries relative to the week in question (without accumulating all time). Each week you'd get all the metrics relative to the installs that happened that week. Instead of seeing ever-increasing metrics, you'd see more of a flat time series, in which you would still be able to appreciate the proportions of each architecture/OS/PHP/database/etc., only limited to installs that occurred during the current week.
I don't know if this is what you expect from this collection of data sets, or whether you want a picture of all installs that have happened since 2017-04.
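For illustration, such a weekly (non-accumulating) query might look like this (a sketch; the install identifier and field names are assumptions, not the exact MediaWikiPingback schema):

```sql
-- Hypothetical sketch: database-type proportions among installs that
-- pinged during the report week only, instead of accumulating all time.
SELECT
  event.`database` AS db,
  COUNT(DISTINCT wiki) AS installs   -- 'wiki' as install identifier is an assumption
FROM event_sanitized.mediawikipingback
WHERE year = 2019 AND month = 10
  AND day BETWEEN 7 AND 13           -- the report week in question
GROUP BY event.`database`;
```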

@CCicalese_WMF

If you'd be fine with the solution in my prior comment, I could apply it to the MySQL database for the 2017 data, then copy the reports over to stat1007 and complete them by querying Hive.
This way you'd have complete data since 2017-04 and we'd solve the query performance caveat. Plus, it would be easier for us in Analytics to migrate it that way, rather than importing the MySQL data into Hive.
What do you think?

@CCicalese_WMF
> The only thing I can think of is executing the queries relative to the week in question (without accumulating all time). Each week you'd get all the metrics relative to the installs that happened that week. Instead of seeing ever-increasing metrics, you'd see more of a flat time series, in which you would still be able to appreciate the proportions of each architecture/OS/PHP/database/etc., only limited to installs that occurred during the current week.
> I don't know if this is what you expect from this collection of data sets, or whether you want a picture of all installs that have happened since 2017-04.

We aimed to have a view into the total number of active installs for each metric, so only seeing the weekly totals without the accumulation would not give us this. However, there is one additional wrinkle: we added a heartbeat ping about a year and a half ago, but we never found time to update the reports to make use of it. The goal of the heartbeat is to remove from the cumulative total any data for servers that have not sent a heartbeat ping in over a month. In this way, we should be able to see the cumulative install total for a given MediaWiki version rise when it is released and then fall when a new version is released. I think you could probably do this with a 30-day window on pings using the method you described above, but only for the data for MediaWiki versions since https://gerrit.wikimedia.org/r/c/mediawiki/core/+/419506 was merged. That would solve two problems: the queries would not increase in computational complexity over time, AND we would be able to see the effect of the heartbeat, showing decreasing as well as increasing usage of a particular version.
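For illustration, the 30-day heartbeat window could be expressed along these lines (a sketch only; the install identifier, timestamp field, and literal dates are assumptions):

```sql
-- Hypothetical sketch: an install counts as active if its most recent ping
-- (initial or heartbeat) falls within the 30 days before the report date.
WITH latest AS (
  SELECT wiki, MAX(dt) AS last_ping   -- 'wiki' as install id is an assumption
  FROM event_sanitized.mediawikipingback
  GROUP BY wiki
)
SELECT COUNT(*) AS active_installs
FROM latest
WHERE last_ping >= '2019-09-14'       -- report date minus 30 days
  AND last_ping <  '2019-10-14';      -- report date
```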

Is there any way we could have 2 sets of graphs: pre-heartbeat and post-heartbeat, the pre-heartbeat being for purely historical reasons?

@CCicalese_WMF

Totally understand your point.

> Is there any way we could have 2 sets of graphs: pre-heartbeat and post-heartbeat, the pre-heartbeat being for purely historical reasons?

Indeed, having the two versions of the data would be even more insightful.
And I think it can be done; it's a matter of adding more queries and config to the repo.
Maybe some (or even all) of the queries can be reused, and only config additions need to be made.

I would prefer, though, to solve the migration from MySQL to Hive first (so we can sunset EL MySQL), basically taking care of this task, and then tackle that as a separate task, as an improvement to the current data set group.
We in Analytics discussed this task today in our stand-up meeting, and they pointed me to a way to import the missing MySQL data into Hive that is simpler than I thought.
So I'd lean towards doing that, so that the current queries will be able to reproduce the exact same results as the MySQL ones.
Regarding computational cost: since I replaced the JOIN statement in the MySQL queries with another strategy better aligned with Hive (a row_number window function), the queries finish in an acceptable amount of time; I think we can live with that.
Would that be OK?
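For reference, a sketch of the row_number strategy (field names are assumptions; the MySQL originals used a self-join on the latest timestamp per install):

```sql
-- Hypothetical sketch: pick the latest event per install with a window
-- function, replacing the MySQL self-JOIN on MAX(timestamp).
WITH ranked AS (
  SELECT
    event.`database` AS db,
    ROW_NUMBER() OVER (PARTITION BY wiki ORDER BY dt DESC) AS rn
  FROM event_sanitized.mediawikipingback
)
SELECT db, COUNT(*) AS installs
FROM ranked
WHERE rn = 1
GROUP BY db;
```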

@CCicalese_WMF

Cool!
I will then copy the data over to Hive and rerun the queries to correct the inconsistencies in the reports and dashboards.
I will also create a task for the other ideas we discussed.
Thanks a lot for the input!

@CCicalese_WMF Do file another task for the pingback work that should include the heartbeat filtering. From your description, I am assuming that it uses the same schema: https://meta.wikimedia.org/wiki/Schema:MediaWikiPingback. That would enable us to use a one-month window to calculate the data, so rather than an ever-increasing series you would have a graph representing the "state of the world at a given time". That is: "as of 2019-09 there were 100 installs using SQLite".

If I understood the premise correctly, data would be calculated monthly, so you would get 1 point per month per split (DB, OS, ...) rather than 1 point per week.

Sorry for the delay, @Nuria. Yes, that sounds correct. I will create a separate task now.

Change 545607 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/reportupdater-queries@master] Correct and optimize pingback reports

https://gerrit.wikimedia.org/r/545607

Change 545607 merged by Milimetric:
[analytics/reportupdater-queries@master] Correct and optimize pingback reports

https://gerrit.wikimedia.org/r/545607

Change 545707 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/reportupdater-queries@master] Fix bug in pingback/php_drilldown

https://gerrit.wikimedia.org/r/545707

Change 545707 merged by Mforns:
[analytics/reportupdater-queries@master] Fix bug in pingback/php_drilldown

https://gerrit.wikimedia.org/r/545707

Hi @CCicalese_WMF

All data from MySQL:log.MediaWikiPingback_* is now present in Hive:event_sanitized.mediawikipingback.
Also, all pingback reportupdater-queries have been completely migrated to Hive and tested.

I vetted all the results and they look good to me! But please check that everything is OK.
The last 3 data points (2019-09-29, 2019-10-06 and 2019-10-13) have been calculated by the new queries hitting Hive.

Cheers!

Change 545917 had a related patch set uploaded (by Cicalese; owner: Cicalese):
[analytics/reportupdater-queries@master] Update to include 1.33 and 1.34

https://gerrit.wikimedia.org/r/545917

Thank you, @mforns! I noticed looking at the graphs that the data for 1.33 and 1.34 (still in alpha) was missing, so I submitted a patch. Other than that, looks great!

Nuria set the point value for this task to 8.
Nuria unsubscribed.

Change 545917 merged by Mforns:
[analytics/reportupdater-queries@master] Update to include 1.33 and 1.34

https://gerrit.wikimedia.org/r/545917