reportupdater Pingback reports are broken and need to be refactored
Closed, ResolvedPublic

Description

Since mid-December 2019, the reportupdater Pingback reports have not been updated.
https://pingback.wmflabs.org/#unique-wiki-count
The reason is that the corresponding Hive queries fail with GC (garbage collection) errors.
We already discussed in T223414 that those queries were inefficient by design and
that they would probably fail and need to be refactored or rethought in the future.

Event Timeline

That is disappointing. What's the next step in getting them working again?

@CCicalese_WMF The queries need to be entirely rewritten so they do not scan every record in the whole table on every run. We would like to suggest that this would be a good item to work on for someone on your team who wants to get a bit familiar with the data infrastructure.

Otherwise we can work on it probably next quarter.

One thing we could do, as suggested by Dan, is to purge event_sanitized.mediawikipingback by deleting all events that do not reflect the current state of a given wiki (remove all but the last pingback per wiki). We could still keep the un-purged data in a backup if needed.
By my calculations, this would reduce the size of the table by a factor of roughly 20. All queries would continue to work and the results should still be the same.
Of course, this would not be a permanent solution: in 1 or 2 years we'd have to repeat the operation to purge the table of "unused" events again, or we'd see the same issues.
So maybe this could be a 'quick' solution that would populate the dashboard temporarily and give us all a couple of months to rethink the pingback pipeline?
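
As an illustration of the purge idea above, here is a minimal Python sketch, not the actual Hive logic. The row layout `(wiki_id, ping_date, version)`, the sample data, and the function names are all hypothetical; the sketch only shows that a metric computed from each wiki's last ping is unchanged when everything but the last ping is deleted.

```python
from datetime import date

# Hypothetical rows: (wiki_id, ping_date, mediawiki_version).
pings = [
    ("wiki_a", date(2018, 1, 5), "1.30"),
    ("wiki_a", date(2019, 3, 9), "1.32"),
    ("wiki_b", date(2018, 7, 1), "1.31"),
    ("wiki_b", date(2018, 8, 1), "1.31"),
    ("wiki_c", date(2019, 11, 20), "1.33"),
]


def last_ping_per_wiki(rows):
    """Keep only the most recent ping for each wiki."""
    latest = {}
    for wiki_id, ping_date, version in rows:
        if wiki_id not in latest or ping_date > latest[wiki_id][1]:
            latest[wiki_id] = (wiki_id, ping_date, version)
    return sorted(latest.values())


def wikis_per_version(rows):
    """Count wikis per version, using each wiki's last ping only."""
    counts = {}
    for _, _, version in last_ping_per_wiki(rows):
        counts[version] = counts.get(version, 0) + 1
    return counts


# "Purging" the table means keeping one row per wiki.
purged = last_ping_per_wiki(pings)
```

With this sample data, `wikis_per_version(pings)` and `wikis_per_version(purged)` return the same counts, while `purged` holds 3 rows instead of 5. A query with a cutoff date in the past would no longer see the deleted rows, which is the limitation raised in the replies.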

@mforns Actually, I don't think that will work. Since the reports are cumulative, we need the old data to correctly accumulate usage values for past time periods.

@CCicalese_WMF
Hmm, all queries have the same first step, which is to isolate the last ping from each wiki. Only the last ping is considered for the calculations of new data points.
It is true that, if we delete the data as proposed, we'll not be able to use the table for retroactive queries (say accumulate values until 2018-06-12).
But that's why I suggested to copy the whole original data to a backup location before purging event_sanitized.mediawikipingback.
We can even create a table on top of the backup data to allow for retroactive queries there.
Regarding new data points, I think calculations will work even with only last pings for each wiki, no?

We can even create a table on top of the backup data to allow for retroactive queries there.

Since this is not needed to get the same reports we have been getting to date, let's please not consider this option. Any deviation from our regular workflows requires more maintenance.

@CCicalese_WMF Maybe it is worth revisiting the queries? If you only use the last pingback for a wiki, prior pingbacks are not needed.

It is true that, if we delete the data as proposed, we'll not be able to use the table for retroactive queries (say accumulate values until 2018-06-12).

Right. Can we assume that the report will never need to be re-run and that the report output file will never be corrupted or destroyed?

Would we be able to save the report up to the point in December that we had the failure for historical purposes and stop running the report updater for that report? The reason I ask is that the heartbeat ping was introduced in 1.31 in June 2018. With the heartbeat ping, we will hear from each wiki that is still active at least once in every 30 day period. This will limit the period over which we need to accumulate the count. We actually have no way of knowing whether any of the wikis that have emitted pingbacks prior to 1.31 are still active, but since 1.31 is the last LTS, it may be OK to ignore information from versions prior to 1.31. So, we would create a new report with a more efficient query starting with the release of 1.31 in June 2018. Would that alleviate the current GC issue?

Would we be able to save the report up to the point in December that we had the failure

Yes, the "report" is just a csv file here: https://analytics.wikimedia.org/published/datasets/periodic/reports/metrics/pingback/

The reason I ask is that the heartbeat ping was introduced in 1.31 in June 2018

We can also stop these reports entirely and make a new set of queries. Seems like you want <mediawiki version, count(wikis-running this version)> and this is data that can be reported fresh daily just looking back to the past 30 days, right?

Yes, my suggestion is that we stop the existing report but keep the data statically so people can still view the graphs for historical purposes. Then, we would create a new report with a new set of queries that only need to look back 30 days when the report is run.
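
The proposed replacement report can be sketched in Python as follows. This is only an illustration under assumptions: the row layout `(wiki_id, ping_date, version)`, the sample data, and the name `active_wikis_per_version` are hypothetical, and the real reports are Hive queries run by reportupdater.

```python
from collections import Counter
from datetime import date, timedelta


def active_wikis_per_version(rows, report_date, lookback_days=30):
    """Count wikis per MediaWiki version, using only pings from the
    lookback window that ends on report_date.

    Each wiki is counted once, under the version of its most recent
    ping inside the window.
    """
    start = report_date - timedelta(days=lookback_days)
    latest = {}
    for wiki_id, ping_date, version in rows:
        if not (start < ping_date <= report_date):
            continue  # ping is outside the lookback window
        if wiki_id not in latest or ping_date > latest[wiki_id][0]:
            latest[wiki_id] = (ping_date, version)
    return Counter(version for _, version in latest.values())


# Hypothetical sample data.
pings = [
    ("wiki_a", date(2020, 7, 20), "1.33"),
    ("wiki_a", date(2020, 8, 5), "1.34"),  # upgraded inside the window
    ("wiki_b", date(2020, 8, 1), "1.34"),
    ("wiki_c", date(2020, 5, 1), "1.31"),  # no ping in the last 30 days
]
result = active_wikis_per_version(pings, report_date=date(2020, 8, 10))
```

Here `wiki_a` is counted once under 1.34 and `wiki_c` is not counted at all, so the cost of each run depends on the size of the 30-day window rather than on the full table history.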

Oh! Didn't know about the heartbeat pingback.
That is great, looking one month back would be completely fine of course.
Is the heartbeat pingback issued at a fixed date (say the first of the month) for all active wikis?
Or does the date depend on the install timestamp, or maybe other factors?

There's more detail about the heartbeat ping at T236178.

Milimetric triaged this task as Medium priority. Mar 2 2020, 4:52 PM
Milimetric moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
CCicalese_WMF renamed this task from "Should reportupdater Pingback reports be refactored?" to "reportupdater Pingback reports are broken and need to be refactored". Aug 10 2020, 4:43 PM

@CCicalese_WMF
We're tackling this task now.
IIUC, the pingback heartbeat is sent at first install time, then at a minimum interval of 30 days.
Yes, I agree this will be a good enough proxy for active installs.
Plus, it will not count test wikis and obsolete wikis.

I will then modify the reportupdater queries to just consider the last month of data.
And will try to backfill as far as I can considering the heartbeat starts at June 2018.
Also, will keep existing data for all reports. Hopefully I'll be able to amend them seamlessly.
We can add an annotation that marks the transition of the metrics from one calculation style to the other.

Will ping you for CRs and for some data vetting once we have the dashboards updated.

Change 621552 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/reportupdater-queries@master] Adapt pingback queries to use the pingback heartbeat

https://gerrit.wikimedia.org/r/621552

@CCicalese_WMF

I modified the queries to only consider the last 30 days of pingbacks
(the group-by and ordering is still necessary within that period to count wikis correctly and put them in the correct breakdown).

The queries now run smoothly, but I observed that the values are quite different from what we were seeing in the reports before.
They are consistently about 7 times smaller across all reports.

Is that expected?
Should the wikis that issued a pingback in the past and are not active any more match that proportion?

Thanks :]

Interesting. Seven times smaller does seem large, but we haven't had a good sense before how many people install MediaWiki for development, testing, or evaluation and subsequently undeploy it. I would look for consistency in the graphs over time and across different versions as well as the various statistics to see if the results seem reasonable. Are there new graphs available of the new query results?

Change 621552 merged by Mforns:
[analytics/reportupdater-queries@master] Adapt pingback queries to use the pingback heartbeat

https://gerrit.wikimedia.org/r/621552

I merged the patch with the changes in how we calculate the metric.
This will soon populate the charts in https://pingback.wmflabs.org/ from the last date they were updated (2019-09-29) until today.
This way we can vet these new results visually and compare them to the old ones in the same dashboard.
Don't worry. If they make no sense, we can revert the charts in the dashboard and make further changes.
I will ping you once the charts are ready.
Cheers!

I'm not sure if the reports are finished updating, but what is there looks promising. It would be good, though, to use the heartbeat for the calculations back to when it was introduced into the codebase in MW 1.31. It was initially introduced in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/419506 with a bug fix in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/449641 which was cherry-picked to 1.31 at the end of August 2018. So, all numbers for MW 1.31+ starting September, 2018 should use the heartbeat. That would give us the benefit of the heartbeat to see the effect on the graphs over a longer period of time.

They have now finished updating, and yes, they look better!
I will now re-run the reports from 2018-09-02 (not 2018-09-01 because reports are weekly and start on Sunday the 2nd).
And will ping here once the reports are back-filled.
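
The date arithmetic behind the 2018-09-02 start can be checked with a short Python sketch. The helper names are hypothetical and this is not how reportupdater itself schedules reports; it only assumes, as stated above, that reports are weekly and start on a Sunday.

```python
from datetime import date, timedelta


def first_sunday_on_or_after(day):
    """Return the first Sunday that is on or after the given day."""
    # date.weekday(): Monday is 0, Sunday is 6.
    return day + timedelta(days=(6 - day.weekday()) % 7)


def weekly_report_dates(start, end):
    """Yield weekly report dates (Sundays) from start through end."""
    current = first_sunday_on_or_after(start)
    while current <= end:
        yield current
        current += timedelta(weeks=1)


# 2018-09-01 was a Saturday, so the first weekly report falls on 2018-09-02.
first_report = first_sunday_on_or_after(date(2018, 9, 1))
september = list(weekly_report_dates(date(2018, 9, 1), date(2018, 9, 30)))
```

With these inputs, `first_report` is 2018-09-02 and `september` contains the five Sundays of September 2018.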

Change 623002 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/reportupdater-queries@master] Pull back start time of pingback reports

https://gerrit.wikimedia.org/r/623002

Change 623002 merged by Mforns:
[analytics/reportupdater-queries@master] Pull back start time of pingback reports

https://gerrit.wikimedia.org/r/623002

@CCicalese_WMF The back-filling from 2018-09-02 is running.
It will take about 1 more day.
Then, if all went well, all metrics will be up-to-date in the dashboard.
Cheers!

The graphs look great! Thank you so much!!