
[Bug] `init` and `mtinfo` event counts drop drastically since June 17 2019
Closed, ResolvedPublic

Description

We are using the graph "Number of events by action type, for all languages" in the Toledo notebook to track the number of events by action for the external guidance extension. We noticed that the event counts for the actions init (Access the translated page) and mtinfo (View information about automatic translation) have dropped drastically since June 17 2019, while the event counts for other actions remain relatively stable.


Daily count of init events:

select CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) AS date,
count(1) as n_events
from event.externalguidance
where year=2019 and month=6
and not useragent.is_bot
and event.action = 'init'
group by year, month, day
order by date
limit 1000000
=============================
date	n_events
2019-06-01	195966
2019-06-02	209519
2019-06-03	227952
2019-06-04	215546
2019-06-05	208091
2019-06-06	99218
2019-06-07	11
2019-06-08	165272
2019-06-09	210345
2019-06-10	233290
2019-06-11	239054
2019-06-12	236156
2019-06-13	235328
2019-06-14	218917
2019-06-15	200240
2019-06-16	219316
2019-06-17	51638
2019-06-18	16
2019-06-19	16
2019-06-20	8
2019-06-21	22
2019-06-22	20
2019-06-23	26
2019-06-24	18
2019-06-25	47
2019-06-26	23
2019-06-27	28
2019-06-28	27
2019-06-29	13
2019-06-30	26
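A minimal sketch of how the affected days could be flagged automatically from the output above. The baseline window (first five days) and the 50% threshold are arbitrary assumptions for illustration, and `flag_drops` is a hypothetical helper, not part of the Toledo notebook:

```python
from statistics import median

# Daily `init` counts copied from the query output above (a subset of June 2019).
daily_counts = [
    ("2019-06-01", 195966),
    ("2019-06-02", 209519),
    ("2019-06-03", 227952),
    ("2019-06-04", 215546),
    ("2019-06-05", 208091),
    ("2019-06-06", 99218),
    ("2019-06-07", 11),
    ("2019-06-08", 165272),
    ("2019-06-16", 219316),
    ("2019-06-17", 51638),
    ("2019-06-18", 16),
]


def flag_drops(rows, baseline_days=5, ratio=0.5):
    """Return the dates whose count is below `ratio` times the median
    of the first `baseline_days` rows (assumed to be healthy days)."""
    baseline = median(n for _, n in rows[:baseline_days])
    return [date for date, n in rows if n < ratio * baseline]


flagged = flag_drops(daily_counts)
print(flagged)
```

With these assumptions the check flags both the June 6-7 dip and the drop that starts on June 17.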

We don't see any errors logged in the eventerror table, and the grafana dashboard for the ExternalGuidance schema didn't show a big drop in event counts in June, so this issue doesn't seem to be the result of a front-end bug.

A-team, do you know where else we can look to find the source of the issue?

Event Timeline

chelsyx moved this task from Triage to Tracking on the Product-Analytics board.
Milimetric triaged this task as Unbreak Now! priority. Jul 8 2019, 3:36 PM
Milimetric moved this task from Incoming to Ops Week on the Analytics board.
Restricted Application added a subscriber: Liuxinyu970226. Jul 8 2019, 3:36 PM
Milimetric lowered the priority of this task from Unbreak Now! to High.
Milimetric added a project: Analytics-Kanban.
Nuria added a subscriber: Nuria. Jul 8 2019, 9:25 PM

Something is definitely going on here. For 2019-07-01 there are 25 "init" events recorded, but if I take one of the raw EL files for that day:

hdfs dfs -text /wmf/data/raw/eventlogging/eventlogging_ExternalGuidance/hourly/2019/07/01/00/eventlogging_ExternalGuidance.1004.0.5536.24929991.1561939200000

and grep for "init", there are 5518 records. So the raw data is there but it is not being refined. Hmm.
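A plain grep for "init" can also match the string in other fields, so a per-action tally gives a cleaner comparison against the refined table. The sketch below assumes each raw record is one JSON object per line with the action under `event.action` (matching the `event.action` column used in the Hive query); that layout of the raw files is an assumption, and `count_actions` is a hypothetical helper:

```python
import json
from collections import Counter


def count_actions(lines):
    """Tally records by event.action from an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            counts["<unparseable>"] += 1
            continue
        # Guard against records that are not JSON objects or lack an event.
        event = record.get("event") if isinstance(record, dict) else None
        if isinstance(event, dict):
            counts[event.get("action", "<missing>")] += 1
        else:
            counts["<missing>"] += 1
    return counts
```

To use it on a raw file, the output of `hdfs dfs -text <raw file>` could be piped into a small script that calls `count_actions(sys.stdin)` and prints the result.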

Nuria added a subscriber: Ottomata. Edited Jul 8 2019, 11:13 PM

Rerunning refine for 07/01 to see if anything changes (it doesn't seem like it will, as the latest refined hours are also missing a lot of events)

  1. Removed success flags: > sudo -u analytics hdfs dfs -rm hdfs://analytics-hadoop/wmf/data/event/ExternalGuidance/year=2019/month=7/day=1/*/_SUCCESS
  2. Rerun refine from an-coord1001

sudo -u analytics sudo -u analytics /usr/local/bin/refine_eventlogging_analytics --ignore_failure_flag=true --since=2019-07-01T00:00:00 --until=2019-07-02T00:00:00 table_whitelist_regex="ExternalGuidance" --verbose refine_eventlogging_analytics

But I cannot get those hours to re-refine (the success flag is not set)

Pinging @Ottomata to look at the refine command while I look at the original data

Nuria added a comment. Jul 8 2019, 11:47 PM

Never mind, I was able to re-refine (I needed to remove the REFINED flags as well as the SUCCESS ones), but there are still no changes. I think we just need to debug this on the spark command line.

Nuria added a comment. Jul 8 2019, 11:59 PM

I think these events are being filtered because their host is translate.googleusercontent.com, which does NOT map to ANY true Wikimedia project domain. Should have thought about this before! It looks like "fake" data coming from a third party clone running our code (similar to the "fake" eventlogging data we filter).

@chelsyx Besides translate.googleusercontent.com is there any other third party domain sending us data?

The table chelsyx.toledo_pageviews (the query that generates this table can be found in T215093) contains pageview data for users viewing wiki pages through Google Translate, and its referer_host field should contain all the third party domains that send data to the ExternalGuidance schema. I just queried chelsyx.toledo_pageviews using:

select referer_host, sum(count) as pageviews
from chelsyx.toledo_pageviews
group by referer_host
order by pageviews desc
limit 1000000

The result shows that the vast majority of the pings are from translate.googleusercontent.com, followed by translate.google.com, translate.google.de, www.google.com, and many other domains of the form translate.google.<country code>. Although they contribute very little traffic, there are also many other domains in the result (e.g. m.facebook.com), and I'm not sure whether they should be counted. @Pginer-WMF @dr0ptp4kt @santhosh do you have any insights?

Nuria added a comment. Edited Jul 9 2019, 3:15 PM

@chelsyx: we will need to whitelist external domains because most EL traffic that comes from 3rd parties is "fake"; that is, we do not really want to count it as legitimate actions performed by Wikipedia users. I am going with 'translate.google*' for now unless I hear otherwise.
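To illustrate the effect of a 'translate.google*' whitelist, here is a minimal sketch of such a host check. It is not the refinery implementation; `is_allowed_host` is a hypothetical helper, and the list of Wikimedia suffixes is a deliberately incomplete assumption used only for illustration:

```python
from fnmatch import fnmatchcase

# Illustrative, incomplete list of project domain suffixes (assumption).
WIKIMEDIA_SUFFIXES = (".wikipedia.org", ".wikimedia.org")

# External hosts allowed through, using the glob pattern proposed in this thread.
EXTERNAL_WHITELIST = ("translate.google*",)


def is_allowed_host(host):
    """Return True if the host is a project domain or matches the whitelist."""
    host = host.strip().lower()
    if host.endswith(WIKIMEDIA_SUFFIXES):
        return True
    return any(fnmatchcase(host, pattern) for pattern in EXTERNAL_WHITELIST)


for h in ("translate.googleusercontent.com", "translate.google.de",
          "en.wikipedia.org", "www.google.com", "m.facebook.com"):
    print(h, is_allowed_host(h))
```

Note that the glob also covers translate.googleusercontent.com, because that host begins with "translate.google", while hosts such as www.google.com and m.facebook.com would still be filtered under this sketch.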

@chelsyx I forget some of the details, but I think it's okay to allow multiple domains to be the referrer for the pageviews.

As for the events, I think the whitelisting @Nuria mentions is fine for ExternalGuidance related events and events on which you're basing relations or ratios - without confining it to such events I believe a little extra noise will get added to the tables for other logged events, so I think that's something you'd probably want to discuss with data people.

Nuria claimed this task. Jul 9 2019, 4:06 PM
Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board.
Nuria added a subscriber: Milimetric.
Nuria added a comment. Jul 9 2019, 4:46 PM

I am going to:

  1. change puppet so we do not apply the 3rd party filter
  2. re-refine all data since June 16th to now
  3. whitelist translate.google*

The next time refinery gets deployed I will re-enable the filter in puppet.

Change 521541 had a related patch set uploaded (by Nuria; owner: Nuria):
[operations/puppet@production] Disabling temporarily the 3rd party filter for EL events

https://gerrit.wikimedia.org/r/521541

Change 521541 merged by Ottomata:
[operations/puppet@production] Disabling temporarily the 3rd party filter for EL events

https://gerrit.wikimedia.org/r/521541

Change 521577 had a related patch set uploaded (by Nuria; owner: Nuria):
[analytics/refinery/source@master] Refactoring eventlogging-specific hostname check

https://gerrit.wikimedia.org/r/521577

Nuria added a comment. Jul 9 2019, 8:30 PM

OK, data for July is there; on to data for June now:

2019-07-01 266256
2019-07-02 265133
2019-07-03 265787
2019-07-04 261505
2019-07-05 245202
2019-07-06 218491
2019-07-07 239823
2019-07-08 265896
2019-07-09 230804

Nuria added a comment. Jul 10 2019, 5:28 AM

Also re-refined June; the sanitized data will get adjusted when the 2nd sweep of sanitization runs. I will keep this ticket open as we need to deploy the code fix, but all data needed should now be available.

Thanks @dr0ptp4kt ! The noise is very small, so I don't think it would have a big impact on our metrics.

chelsyx added a comment. Edited Jul 11 2019, 4:53 AM

@Nuria Thank you for the fix!

The metrics since June 17 2019 have come back up, but the dip on June 6 and 7 is still there. As before, there is no such dip in grafana. Is it a different issue?

Nuria added a comment. Jul 11 2019, 9:02 PM

My mistake, I had only re-refined from the 17th onward. All data should be there by now.

Change 521577 merged by jenkins-bot:
[analytics/refinery/source@master] Refactoring eventlogging-specific hostname check

https://gerrit.wikimedia.org/r/521577

Change 524256 had a related patch set uploaded (by Nuria; owner: Nuria):
[operations/puppet@production] Bumping up jar version of refine and adding transform function

https://gerrit.wikimedia.org/r/524256

Change 524256 had a related patch set uploaded (by Ottomata; owner: Nuria):
[operations/puppet@production] Bumping up jar version of refine and adding transform function

https://gerrit.wikimedia.org/r/524256

Change 524256 merged by Ottomata:
[operations/puppet@production] Bumping up jar version of refine and adding transform function

https://gerrit.wikimedia.org/r/524256

Change 524374 had a related patch set uploaded (by Nuria; owner: Nuria):
[operations/puppet@production] Correcting package name in transform function

https://gerrit.wikimedia.org/r/524374

Change 524374 merged by BBlack:
[operations/puppet@production] Correcting package name in transform function

https://gerrit.wikimedia.org/r/524374

Nuria added a comment. Jul 20 2019, 4:57 AM

Closing this data issue, but noting here that our filtering of 3rd party data is not working, as 3rd party domains are getting through. See:
https://phabricator.wikimedia.org/T228557

Nuria moved this task from Ready to Deploy to Done on the Analytics-Kanban board. Jul 22 2019, 3:07 PM
Nuria closed this task as Resolved. Jul 22 2019, 8:20 PM