Various EventLogging schemas losing events since around September 8/9
Closed, Resolved (Public)

Description

The total event rate for various EventLogging schemas shows a marked drop occurred around September 8/9. See the following screenshots from Grafana for the last 60 days:

MobileWebSearch events 2016-07-29..2016-09-27 from Grafana.png (576×847 px, 113 KB)

RelatedArticles events 2016-07-29..2016-09-27 from Grafana.png (411×832 px, 89 KB)

MobileWebLanguageSwitcher  events 2016-07-29..2016-09-27 from Grafana.png (391×844 px, 80 KB)

  • Edit: Less clear, but at least the initial drop seems related (the maximum hourly values for Sept 9, 10 and 11 are lower than for every day since Jul 30)

Edit events 2016-07-29..2016-09-27 from Grafana.png (394×842 px, 88 KB)

A fifth example is Schema:Popups, where the issue arose first. Here shown for a particular event category on one wiki (the overall event rate for this schema has been affected by software changes and new experiment launches in the above timespan, so Grafana is less useful here):

Schema:Popups pageLoaded events per day (huwiki, anons).png (481×736 px, 46 KB)

This is being discussed in T146620.

Interestingly, Firefox does not seem to be affected there, which leaves Chrome as the main suspect in the case of that particular schema (because it excludes IE and other browsers that are not sendBeacon-compatible):

Schema:Popups pageLoaded events per day (huwiki, anons, Firefox).png (549×871 px, 51 KB)

However, various other schemas don't show such a pattern:

Event Timeline

Interestingly, Firefox does not seem to be affected there, which leaves Chrome as the main suspect in the case of that particular schema (because it excludes IE and other browsers that are not sendBeacon-compatible):
...

Here is the equivalent of that Firefox chart (F4526930) for Chrome, broken down by version. It shows a) how pronounced that Sept 8/9 drop is for Chrome, and b) that it is not tied to any particular version of Chrome.

pageLoaded events in the Popups schema (Chrome only, by version).png (465×977 px, 40 KB)

Data source (relying on a crudely handcrafted browser/version detection):

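-- Daily pageLoaded events from Chrome user agents (huwiki, anonymous users),
-- broken down by the two characters that follow 'Chrome/' in the user agent.
SELECT LEFT(timestamp, 8) AS date,
SUBSTRING(userAgent, INSTR(userAgent, 'Chrome/') + 7, 2) AS chrome_version,
COUNT(*) AS pageloaded_events
FROM log.Popups_15777589
WHERE wiki = 'huwiki'
AND event_isAnon = 1
AND event_action = 'pageLoaded'
AND INSTR(userAgent, 'Chrome')
GROUP BY date, chrome_version
ORDER BY date, chrome_version;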

(versions with <= 200 events in this result are omitted in the above chart)
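
For anyone who wants to reproduce the breakdown outside the database, the crude extraction used in the query above can be sketched in Python. This is only an illustration under assumptions: the names `chrome_major_version` and `count_by_date_and_version` are hypothetical, the timestamp is assumed to be a `YYYYMMDDHHMMSS` string (as the `LEFT(timestamp, 8)` call suggests), and the match is case-sensitive. Unlike the SUBSTRING call, which always takes exactly two characters after `Chrome/`, the regular expression accepts any number of digits. Like the query, it does not separate out other browsers whose user agent also contains a `Chrome/` token.

```python
import re
from collections import Counter

# Hypothetical helper, not part of EventLogging: pulls the major version
# out of a "Chrome/NN" token, mirroring the SUBSTRING/INSTR logic above.
_CHROME_RE = re.compile(r"Chrome/(\d+)")


def chrome_major_version(user_agent):
    """Return the Chrome major version as a string, or None if absent."""
    match = _CHROME_RE.search(user_agent)
    return match.group(1) if match else None


def count_by_date_and_version(rows):
    """Count events per (date, Chrome major version).

    rows: iterable of (timestamp, user_agent) pairs, where timestamp is
    assumed to be a YYYYMMDDHHMMSS string.
    """
    counts = Counter()
    for timestamp, user_agent in rows:
        version = chrome_major_version(user_agent)
        if version is not None:
            # First eight characters are the date, as in LEFT(timestamp, 8).
            counts[(timestamp[:8], version)] += 1
    return counts
```

Rows whose user agent has no `Chrome/` token are skipped here rather than given a garbage version string.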

More evidence that Chrome (possibly also other browsers) was affected across several schemas while Firefox was not: in the Edit schema, the ratio of events with Firefox user agents rose simultaneously with the drop in overall events depicted above. (CC @Neil_P._Quinn_WMF)

Ratio of Firefox user agents in the Edit schema, Sept 4-13.png (438×708 px, 20 KB)

(determined using this somewhat crude criterion for detecting Firefox user agents; this bug is a great example of how a more general solution based on the ua-parser library would be useful.)

Data obtainable via:

-- Hourly event count and share of Firefox user agents in the Edit schema
SELECT LEFT(timestamp, 10) AS datehour, SUM(1) AS all_events,
SUM(IF(INSTR(userAgent, 'Firefox') AND NOT INSTR(userAgent, 'Seamonkey'), 1, 0)) / SUM(1) AS Firefox_ratio
FROM log.Edit_13457736
WHERE timestamp LIKE '20160904%'
OR timestamp LIKE '20160905%'
...
OR timestamp LIKE '20160912%'
OR timestamp LIKE '20160913%'
GROUP BY datehour
ORDER BY datehour;
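
The hourly ratio computed by the query above can also be sketched in Python, for example to re-run the check on an exported sample. This is a rough sketch, not a replacement for ua-parser: the function names `is_firefox` and `firefox_ratio_by_hour` are hypothetical, the timestamp is assumed to be a `YYYYMMDDHHMMSS` string (as `LEFT(timestamp, 10)` suggests), and the substring check is done case-insensitively, which is an assumption of this sketch rather than something the thread states about the database.

```python
from collections import defaultdict


def is_firefox(user_agent):
    """Crude check mirroring the query: has 'Firefox' but not 'Seamonkey'.

    Compared case-insensitively here; that is an assumption of this sketch.
    """
    ua = user_agent.lower()
    return "firefox" in ua and "seamonkey" not in ua


def firefox_ratio_by_hour(rows):
    """Return {datehour: (all_events, firefox_ratio)}.

    rows: iterable of (timestamp, user_agent) pairs, where timestamp is
    assumed to be a YYYYMMDDHHMMSS string.
    """
    totals = defaultdict(int)
    firefox = defaultdict(int)
    for timestamp, user_agent in rows:
        # First ten characters are date plus hour, as in LEFT(timestamp, 10).
        datehour = timestamp[:10]
        totals[datehour] += 1
        if is_firefox(user_agent):
            firefox[datehour] += 1
    return {h: (totals[h], firefox[h] / totals[h]) for h in sorted(totals)}
```

A rise in the second element of each tuple while the first one falls would correspond to the pattern shown in the chart above.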

Forgot to CC @mpopov and @chelsyx earlier regarding the MobileWebSearch schema. Indeed, this drop now seems to show clearly at http://searchdata.wmflabs.org/metrics/#mobile_events; see this screenshot that Nirzar just posted at T143829#2726974:

pasted_file (347×999 px, 99 KB)

MobileWebSearch schema seems to have recovered:

Screen Shot 2016-11-16 at 15.58.31.png (460×1 px, 197 KB)

We no longer have either ops logs or data logs for this item. To reiterate what we said in our e-mail conversation: please tag tickets we need to see with Analytics. And if we do not respond promptly enough to ops issues, do ping us on IRC.

This ticket looks related to the pretty big issue that affected MediaWiki for the same number of days, starting on the 8th, namely a huge perf regression that affected Chrome users: https://phabricator.wikimedia.org/T146099

It started on September 8th and was resolved around September 16th. Note that events that do not come from the JavaScript client were not affected, which makes sense if the issue is related to ResourceLoader problems. Also, features with different JS weights will be affected differently by such an issue, so we would expect drops of different sizes in different features.

Given the correlation of these two events, I think this ticket can be closed.

Nuria reopened this task as Open.
Nuria moved this task from Next Up to Done on the Analytics-Kanban board.
Nuria claimed this task.
Nuria added a subscriber: Milimetric.