Page MenuHomePhabricator

Investigate duplicate EventLogging rows
Closed, ResolvedPublic3 Estimate Story Points

Tokens
"The World Burns" token, awarded by chelsyx."The World Burns" token, awarded by HJiang-WMF."The World Burns" token, awarded by mpopov."The World Burns" token, awarded by nshahquinn-wmf.
Assigned To
Authored By
Tbayer, Aug 11 2016

Description

Various EventLogging tables contain duplicate events, i.e. rows with identical values in all (relevant) fields including the timestamp, only differing in the internal id/uuid fields.

Over the past several months, I and some people from the Reading web team have encountered this in several different tables. Often these duplicates are rare enough to not affect data analysis much (e.g. in T139319#2460171 where I already mentioned this last month). But sometimes they do. In any case this issue has been spotted often enough now to suspect a server-side problem that makes one less confident about EL data in general.

I haven't done a thorough audit of all schemas, but below are three examples to illustrate the issue.

(These queries look a bit complicated, but all they do is to first identify events that have identical values in a set of fields where one assumes that at least one would differ between a pair of real evens, e.g. : timestamp, userAgent, event_linkInteractionToken, event_action, event_totalInteractionTime, in case of the Popups schema. They then calculate the percentage of redundant events per day.)

In the current MediaViewer schema, about 4-6% of events are dupes every month:

MediaViewer_10867062
SELECT dupesperday.yearmonthday, totalevents, ROUND(100*dupes/totalevents,2) AS dupes_percentage FROM
        (SELECT yearmonthday, SUM(copies-1) AS dupes FROM
                (SELECT LEFT(timestamp, 8) AS yearmonthday, timestamp, userAgent, clientIp, webHost, event_action, COUNT(*) AS copies FROM log.MediaViewer_10867062 WHERE timestamp LIKE '201607%' GROUP BY yearmonthday, timestamp, userAgent,  clientIp, webHost, event_action HAVING copies>1) AS dupeslist
        GROUP BY yearmonthday) AS dupesperday
        JOIN
        (SELECT LEFT(timestamp, 8) AS yearmonthday, COUNT(*) AS totalevents FROM log.MediaViewer_10867062 WHERE timestamp LIKE '201607%' GROUP BY yearmonthday) AS totaleventsperday
        ON dupesperday.yearmonthday = totaleventsperday.yearmonthday
ORDER BY yearmonthday;
yearmonthdaytotaleventsdupes_percentage
201607019035244.73
201607028078164.65
201607039483745.06
2016070410431354.53
2016070510473125.33
201607069976835.00
201607079655754.83
201607088894354.60
201607098029745.60
201607108700015.25
2016071110320484.42
2016071210521814.15
2016071310159964.31
2016071410067384.62
201607159193144.46
201607168064094.86
201607178961904.77
201607189987904.41
201607199986694.36
201607209764724.63
201607219646824.41
201607228866694.75
201607237860495.56
201607248812994.71
2016072510238234.42
2016072610247384.74
201607279962854.38
2016072810002364.77
201607299177444.76
201607307844574.93
201607318962624.55

In the schema that counts the Wikimedia blog's pageviews, the problem seems to have become worse suddenly around August 4, 2015 (disregarding initial months for which I would need to double-check the selection of fields tested), leading to overcounts of more than 10%. There were no changes in the blog's theme or setup around that time.

WikimediaBlogVisit_5308166
SELECT dupespermonth.yearmonth, totalevents, ROUND(100*dupes/totalevents,2) AS dupes_percentage FROM 
  (SELECT yearmonth, SUM(copies-1) AS dupes FROM 
    (SELECT LEFT(timestamp, 6) AS yearmonth, timestamp, userAgent, event_requestUrl, event_referrerUrl, COUNT(*) AS copies 
    FROM log.WikimediaBlogVisit_5308166 
    GROUP BY yearmonth, timestamp, userAgent, event_requestUrl, event_referrerUrl HAVING copies>1) AS dupeslist 
  GROUP BY yearmonth) AS dupespermonth
JOIN
  (SELECT LEFT(timestamp, 6) AS yearmonth, COUNT(*) AS totalevents FROM log.WikimediaBlogVisit_5308166 GROUP BY yearmonth) 
  AS totaleventspermonth
ON dupespermonth.yearmonth = totaleventspermonth.yearmonth
ORDER BY yearmonth; 

+-----------+-------------+------------------+
| yearmonth | totalevents | dupes_percentage |
+-----------+-------------+------------------+
| 201303    |       75230 |             7.89 |
| 201304    |      106624 |             6.64 |
| 201305    |       85366 |             0.84 |
| 201306    |       66621 |             0.46 |
| 201307    |       76274 |             0.28 |
| 201308    |       72028 |             0.79 |
| 201309    |       52815 |             0.35 |
| 201310    |       71210 |             0.57 |
| 201311    |       73210 |             0.44 |
| 201312    |       58625 |             0.39 |
| 201401    |       67322 |             0.54 |
| 201402    |       57397 |             0.21 |
| 201403    |       50589 |             0.19 |
| 201404    |       63112 |             0.18 |
| 201405    |     4539339 |             0.12 |
| 201406    |     2602158 |             0.15 |
| 201407    |      799203 |             0.16 |
| 201408    |      107381 |             0.11 |
| 201409    |       81340 |             0.23 |
| 201410    |       67400 |             0.27 |
| 201411    |      101956 |             0.16 |
| 201412    |      111952 |             0.37 |
| 201501    |      110064 |             0.20 |
| 201502    |       76112 |             0.19 |
| 201503    |      149070 |             0.24 |
| 201504    |       82362 |             0.68 |
| 201505    |       71315 |             0.13 |
| 201506    |       77796 |             0.12 |
| 201507    |       69256 |             0.20 |
| 201508    |      108015 |            12.14 |
| 201509    |      132388 |            10.76 |
| 201510    |       87961 |            13.84 |
| 201511    |       71734 |            14.96 |
| 201512    |      106813 |            14.35 |
| 201601    |      134390 |            16.03 |
| 201602    |      150619 |            12.26 |
| 201603    |      144017 |            13.50 |
| 201604    |      200614 |             8.97 |
| 201605    |       90666 |            14.40 |
| 201606    |       88952 |            14.81 |
| 201607    |      164642 |             2.40 |
| 201608    |       24599 |             0.35 |
+-----------+-------------+------------------+
42 rows in set (5 min 32.86 sec)

SELECT dupesperday.yearmonthday, ROUND(100*dupes/totalevents,2) AS dupes_percentage FROM 
	(SELECT yearmonthday, SUM(copies-1) AS dupes FROM 
		(SELECT LEFT(timestamp, 8) AS yearmonthday, timestamp, userAgent, event_requestUrl, event_referrerUrl, COUNT(*) AS copies FROM log.WikimediaBlogVisit_5308166 WHERE timestamp LIKE '201508%' GROUP BY yearmonthday, timestamp, userAgent, event_requestUrl, event_referrerUrl HAVING copies>1) AS dupeslist 
	GROUP BY yearmonthday) AS dupesperday
	JOIN
	(SELECT LEFT(timestamp, 8) AS yearmonthday, COUNT(*) AS totalevents FROM log.WikimediaBlogVisit_5308166 WHERE timestamp LIKE '201508%' GROUP BY yearmonthday) AS totaleventsperday
	ON dupesperday.yearmonthday = totaleventsperday.yearmonthday
ORDER BY yearmonthday; 

+--------------+------------------+
| yearmonthday | dupes_percentage |
+--------------+------------------+
| 20150801     |             0.16 |
| 20150802     |             0.37 |
| 20150803     |             1.16 |
| 20150804     |            10.81 |
| 20150805     |            15.11 |
| 20150806     |             9.87 |
| 20150807     |            12.13 |
| 20150808     |            13.68 |
| 20150809     |            12.38 |
| 20150810     |            11.24 |
| 20150811     |            10.30 |
| 20150812     |             9.68 |
| 20150813     |             8.19 |
| 20150814     |             6.57 |
| 20150815     |            15.54 |
| 20150816     |            13.07 |
| 20150817     |            13.09 |
| 20150818     |            22.61 |
| 20150819     |            18.54 |
| 20150820     |            13.94 |
| 20150821     |            15.68 |
| 20150822     |            12.59 |
| 20150823     |            14.80 |
| 20150824     |            14.63 |
| 20150825     |            15.44 |
| 20150826     |            16.63 |
| 20150827     |            12.96 |
| 20150828     |            14.42 |
| 20150829     |            11.37 |
| 20150830     |             8.53 |
| 20150831     |            13.57 |
+--------------+------------------+
31 rows in set (2.71 sec)

In the Popups schema, these duplicates occur more rarely but still exist (see also T139319#2460171 regarding a previous version of the schema):

Popups_15777589
SELECT dupesperday.popupEnabled, dupesperday.yearmonthday, ROUND(100*dupes/totalevents,2) AS dupes_percentage FROM 
	(SELECT popupEnabled, yearmonthday, SUM(copies-1) AS dupes FROM 
		(SELECT event_popupEnabled AS popupEnabled, LEFT(timestamp, 8) AS yearmonthday, 
		timestamp, userAgent, event_linkInteractionToken, event_action, 
		COUNT(*) AS copies 
		FROM log.Popups_15777589 
		WHERE wiki ='huwiki' AND event_isAnon = 1 
		AND event_linkInteractionToken IS NOT NULL
		GROUP BY popupEnabled, yearmonthday, 
		timestamp, userAgent, event_linkInteractionToken, event_action
		HAVING copies>1) AS dupeslist 
	GROUP BY popupEnabled, yearmonthday) AS dupesperday
	JOIN
	(SELECT event_popupEnabled AS popupEnabled, LEFT(timestamp, 8) AS yearmonthday, 
	COUNT(*) AS totalevents 
	FROM log.Popups_15777589
	WHERE wiki ='huwiki' AND event_isAnon = 1 
	AND event_linkInteractionToken IS NOT NULL
	GROUP BY popupEnabled, yearmonthday) AS totaleventsperday
	ON dupesperday.popupEnabled = totaleventsperday.popupEnabled
        AND dupesperday.yearmonthday = totaleventsperday.yearmonthday
ORDER BY popupEnabled, yearmonthday;
+--------------+--------------+------------------+
| popupEnabled | yearmonthday | dupes_percentage |
+--------------+--------------+------------------+
|            0 | 20160728     |             0.06 |
|            0 | 20160729     |             0.14 |
|            0 | 20160730     |             0.02 |
|            0 | 20160731     |             0.15 |
|            0 | 20160801     |             0.02 |
|            0 | 20160802     |             0.02 |
|            0 | 20160803     |             0.04 |
|            0 | 20160804     |             0.02 |
|            0 | 20160805     |             0.01 |
|            0 | 20160806     |             0.02 |
|            0 | 20160807     |             0.04 |
|            0 | 20160808     |             0.06 |
|            0 | 20160809     |             0.02 |
|            0 | 20160810     |             0.07 |
|            1 | 20160728     |             0.20 |
|            1 | 20160729     |             0.01 |
|            1 | 20160730     |             0.07 |
|            1 | 20160731     |             0.25 |
|            1 | 20160801     |             0.04 |
|            1 | 20160802     |             0.05 |
|            1 | 20160803     |             0.50 |
|            1 | 20160804     |             0.08 |
|            1 | 20160805     |             0.07 |
|            1 | 20160806     |             0.06 |
|            1 | 20160807     |             0.10 |
|            1 | 20160808     |             0.06 |
|            1 | 20160809     |             0.08 |
|            1 | 20160810     |             0.16 |
+--------------+--------------+------------------+
28 rows in set (44.94 sec)

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 11 2016, 12:33 AM

@Nuria curious to hear your take on this, as it is has a fairly significant in some areas

mpopov added a subscriber: mpopov.Sep 16 2016, 11:02 PM
Nuria claimed this task.Dec 19 2016, 4:41 PM
Nuria removed a project: Analytics.
Nuria added a subscriber: Milimetric.
Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board.Dec 20 2016, 8:22 PM
Nuria added a comment.EditedDec 20 2016, 8:54 PM

Running this again number of so-called duplicates is small.

Results of blog schema:

201508 108015 12.14
201509 132388 10.76
201510 87961 13.84
201511 71734 14.96
201512 106813 14.35
201601 134390 16.03
201602 150619 12.26
201603 144017 13.50
201604 200614 8.97
201605 90666 14.40
201606 88952 14.81
201607 164642 2.40
201608 66908 0.27
201609 41178 0.38

And pop ups:
0 201607 0.10
0 201608 0.04
0 201609 0.26
1 201607 0.12
1 201608 0.08
1 201609 0.10

Nuria added a comment.Dec 20 2016, 9:47 PM

I think there are several issues here, the main one is is likely a browser issue, almost all duplicated events on pop up schema are coming from FF (which unlike chrome doesn't cancel outgoing requests even if your finger gets stuck on F5) . It is our experience that when this happens the client is at fault, that is, the instrumentation is sending two events in close succession when it should only send one. In this scenario chrome will cancel the 2nd request, FF would not (at least older FF versions did not). FF 47, 48 and 49 (to a lesser extent) seem to have this issue.

In the case of the blog userAgent is not captured so we cannot check whether FF is involved.

In the case of mediaviewer the user agents under which this happens are more varied, the constant rate of duplicates points to a bug with the instrumentation happening constantly for all browsers.

Nuria added a comment.Dec 20 2016, 9:50 PM

I really do not see any action items here for analytics. The issues afflicting schemas are of different nature and can be tracked back to the client for these two examples.

Nuria moved this task from In Progress to Done on the Analytics-Kanban board.Dec 20 2016, 9:50 PM
Nuria moved this task from Done to In Progress on the Analytics-Kanban board.Dec 22 2016, 10:05 PM
Nuria added a comment.Dec 23 2016, 1:13 AM

NavigationTiming duplicates, about 2%... mmm

yearmonth totalevents dupes_percentage
201610 12500127 3.13
201611 11927725 2.90
201612 7834400 2.80

Nuria moved this task from In Progress to Paused on the Analytics-Kanban board.Jan 6 2017, 4:26 PM
Nuria moved this task from Paused to Next Up on the Analytics-Kanban board.Jan 30 2017, 4:51 PM
Nuria moved this task from Ready to Deploy to Next Up on the Analytics-Kanban board.
Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board.Mar 14 2017, 11:58 PM
Nuria added a comment.Mar 15 2017, 1:03 AM

mmm.. I think these selects are going to work in the case of a major number of duplicates (like the ones we were seeing on popUp schema) but I certainly overcounted duplicates using a similar method for navigationTiming. Will update ticket with details from comparing data from mysql consumer and hadoop. I am just going to calculate signatures for event data + timestamp to the minute + other capsule data and see how many signatures are repeated.

Nuria moved this task from In Progress to Paused on the Analytics-Kanban board.Mar 20 2017, 3:39 PM
Nuria moved this task from Paused to In Progress on the Analytics-Kanban board.Mar 23 2017, 4:46 PM
Nuria added a comment.EditedMar 24 2017, 8:48 PM

Continuing with work on this.

  • I dumped all NavigationTiming events from March 1st to March 15th.
  • Computer signatures like : md5(concat(event_domInteractive,timestamp,userAgent,webhost,wiki) )
  • Counted number of signatures that appear more than once versus signatures that appear only once
  • Looked at events that are duplicated to see if there was some common pattern
Nuria added a comment.Mar 27 2017, 4:22 PM

Of 4328945 events for couple weeks of NavigationTiming data about 400 are duplicated on NavigationTIming so about ~0.01% and basically the totallity of those is happening on Windows NT (there are also a couple in iPhone iPad and couple on IE10)

Nuria moved this task from In Progress to Done on the Analytics-Kanban board.Mar 27 2017, 4:22 PM
Nuria set the point value for this task to 3.Mar 28 2017, 3:20 PM
Nuria closed this task as Resolved.Mar 28 2017, 3:25 PM