
EventLogging data missing from event_sanitized schemas
Closed, Resolved, Public

Description

While querying some schemas in the data lake, I noticed that there seems to be no data in any event_sanitized schema for four days. This is important for reports that the Growth team runs, and probably affects reporting run by other teams, too. There is no data on these dates:

  • 2020-04-30
  • 2020-05-01
  • 2020-05-02
  • 2020-05-03

Data seems to be curtailed on the bordering dates of 2020-04-29 and 2020-05-04. Could this data please be fixed, perhaps by re-inserting from the event schemas?

I checked the following tables. In all cases, the data is missing for the event_sanitized version but not for the event version:

  • event_sanitized.serversideaccountcreation
  • event.serversideaccountcreation
  • event_sanitized.homepagemodule
  • event.homepagemodule
  • event_sanitized.helppanel
  • event.helppanel

Below is one of the queries I ran, so you can see how I am looking at this:

SELECT SUBSTRING(dt, 1, 10) AS the_date, COUNT(*) AS events
FROM event_sanitized.serversideaccountcreation
WHERE year = 2020 AND month >= 4
AND wiki IN ('cswiki', 'kowiki', 'viwiki', 'arwiki', 'ukwiki', 'huwiki', 'hywiki', 'srwiki', 'euwiki', 'frwiki')
AND event.isSelfMade = true
AND event.isApi = false
GROUP BY SUBSTRING(dt, 1, 10)
ORDER BY the_date DESC
LIMIT 1000;
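
A grouped query like the one above returns no row at all for a day without data, so the gap only shows up when the result is compared against the full calendar range. Here is a minimal Python sketch of that check. The function name missing_dates and the per-day counts are hypothetical, made up for illustration; only the date range reflects what is reported in this task.

```python
from datetime import date, timedelta


def missing_dates(counts, start, end):
    """Return every ISO date in [start, end] that has no rows in `counts`."""
    missing = []
    day = start
    while day <= end:
        # A date absent from the query result is treated the same as a zero count.
        if counts.get(day.isoformat(), 0) == 0:
            missing.append(day.isoformat())
        day += timedelta(days=1)
    return missing


# Hypothetical (the_date -> count) pairs shaped like the query output.
# The numbers are illustrative, not real measurements.
sample_counts = {
    "2020-04-28": 1200,
    "2020-04-29": 310,
    "2020-05-04": 450,
    "2020-05-05": 1180,
}

gaps = missing_dates(sample_counts, date(2020, 4, 28), date(2020, 5, 5))
print(gaps)
```

Running the same check against the event and event_sanitized versions of a table would show whether a gap exists only on the sanitized side.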

Event Timeline

This seems like it could be related to the issues we had with kerberos-run-command when wrapping refine-sanitize: Kerberos was not surfacing the issue with snakeyaml on EL sanitize. Both issues have been fixed in:

https://github.com/wikimedia/puppet/commit/c96f5fdc12150b9d8d1a2797f0f434bdb50b4d29#diff-1cb73a94124d4f0587d5633c495670a and in refinery-124.0.0. See: https://github.com/wikimedia/analytics-refinery-source/blob/master/changelog.md

Assigning to @Milimetric, who has the ops week. Rerunning sanitize should fix the issue (sanitize runs a second time 45 days after the ingestion date, so this second run would have taken care of correcting the data anyway).
Let's verify other schemas; we probably need to rerun refine for all of them.

Thanks @MMiller_WMF for the ping

Oh! I forgot to back-fill sanitization when I fixed the problem with the breaking snakeyaml upgrade.

[facepalm] My bad...

I will do the back-filling @Nuria & @Milimetric, don't worry.

EDIT: The second run of sanitization after 45 days has not reached those dates yet. It will fix them automatically, but that would still take about 3 weeks. So, I will back-fill today.
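
For reference, the date arithmetic behind that estimate can be sketched in Python. This assumes the 45-day delay is counted in whole calendar days from the date of the data, which is a simplification and not necessarily how the sanitize job schedules itself. The variable names are illustrative.

```python
from datetime import date, timedelta

# Delay before the second sanitization pass, as stated in this thread.
SECOND_RUN_DELAY = timedelta(days=45)

# Affected range from the task description: the four missing days
# plus the two bordering days with curtailed data.
affected = [date(2020, 4, 29) + timedelta(days=i) for i in range(6)]

# Assumption: the second pass for a day happens exactly 45 calendar days later.
second_runs = {d.isoformat(): (d + SECOND_RUN_DELAY).isoformat() for d in affected}

for day, rerun in second_runs.items():
    print(day, "->", rerun)
```

Under that assumption the automatic fix would land between 2020-06-13 and 2020-06-18, which is why a manual back-fill is the quicker option.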

mforns added a project: Analytics-Kanban.
mforns moved this task from Next Up to In Progress on the Analytics-Kanban board.

@MMiller_WMF
I just left the sanitization back-fill running for 2020-04-29 -> 2020-05-05 (issue period).
It will take a couple of hours.
Tomorrow I will vet the data.

OK, it seems everything went well.
I checked several tables (including the ones in the task description) and all have complete data for this period.
Moving this task to DONE.
Nevertheless, @MMiller_WMF please check the data on your side :]
Cheers!

@mforns -- the data looks good to me now. Thank you.