Various EventLogging tables contain duplicate events, i.e. rows with identical values in all (relevant) fields including the timestamp, only differing in the internal id/uuid fields.
Over the past several months, I and some people from the Reading web team have encountered this in several different tables. Often these duplicates are rare enough to not affect data analysis much (e.g. in T139319#2460171 where I already mentioned this last month). But sometimes they do. In any case this issue has been spotted often enough now to suspect a server-side problem that makes one less confident about EL data in general.
I haven't done a thorough audit of all schemas, but below are three examples to illustrate the issue.
(These queries look a bit complicated, but all they do is to first identify events that have identical values in a set of fields where one assumes that at least one would differ between a pair of real evens, e.g. : timestamp, userAgent, event_linkInteractionToken, event_action, event_totalInteractionTime, in case of the Popups schema. They then calculate the percentage of redundant events per day.)
In the current MediaViewer schema, about 4-6% of events are dupes every month:
SELECT dupesperday.yearmonthday, totalevents, ROUND(100*dupes/totalevents,2) AS dupes_percentage FROM (SELECT yearmonthday, SUM(copies-1) AS dupes FROM (SELECT LEFT(timestamp, 8) AS yearmonthday, timestamp, userAgent, clientIp, webHost, event_action, COUNT(*) AS copies FROM log.MediaViewer_10867062 WHERE timestamp LIKE '201607%' GROUP BY yearmonthday, timestamp, userAgent, clientIp, webHost, event_action HAVING copies>1) AS dupeslist GROUP BY yearmonthday) AS dupesperday JOIN (SELECT LEFT(timestamp, 8) AS yearmonthday, COUNT(*) AS totalevents FROM log.MediaViewer_10867062 WHERE timestamp LIKE '201607%' GROUP BY yearmonthday) AS totaleventsperday ON dupesperday.yearmonthday = totaleventsperday.yearmonthday ORDER BY yearmonthday;
| yearmonthday | totalevents | dupes_percentage |
| 20160701 | 903524 | 4.73 |
| 20160702 | 807816 | 4.65 |
| 20160703 | 948374 | 5.06 |
| 20160704 | 1043135 | 4.53 |
| 20160705 | 1047312 | 5.33 |
| 20160706 | 997683 | 5.00 |
| 20160707 | 965575 | 4.83 |
| 20160708 | 889435 | 4.60 |
| 20160709 | 802974 | 5.60 |
| 20160710 | 870001 | 5.25 |
| 20160711 | 1032048 | 4.42 |
| 20160712 | 1052181 | 4.15 |
| 20160713 | 1015996 | 4.31 |
| 20160714 | 1006738 | 4.62 |
| 20160715 | 919314 | 4.46 |
| 20160716 | 806409 | 4.86 |
| 20160717 | 896190 | 4.77 |
| 20160718 | 998790 | 4.41 |
| 20160719 | 998669 | 4.36 |
| 20160720 | 976472 | 4.63 |
| 20160721 | 964682 | 4.41 |
| 20160722 | 886669 | 4.75 |
| 20160723 | 786049 | 5.56 |
| 20160724 | 881299 | 4.71 |
| 20160725 | 1023823 | 4.42 |
| 20160726 | 1024738 | 4.74 |
| 20160727 | 996285 | 4.38 |
| 20160728 | 1000236 | 4.77 |
| 20160729 | 917744 | 4.76 |
| 20160730 | 784457 | 4.93 |
| 20160731 | 896262 | 4.55 |
In the schema that counts the Wikimedia blog's pageviews, the problem seems to have become worse suddenly around August 4, 2015 (disregarding initial months for which I would need to double-check the selection of fields tested), leading to overcounts of more than 10%. There were no changes in the blog's theme or setup around that time.
SELECT dupespermonth.yearmonth, totalevents, ROUND(100*dupes/totalevents,2) AS dupes_percentage FROM (SELECT yearmonth, SUM(copies-1) AS dupes FROM (SELECT LEFT(timestamp, 6) AS yearmonth, timestamp, userAgent, event_requestUrl, event_referrerUrl, COUNT(*) AS copies FROM log.WikimediaBlogVisit_5308166 GROUP BY yearmonth, timestamp, userAgent, event_requestUrl, event_referrerUrl HAVING copies>1) AS dupeslist GROUP BY yearmonth) AS dupespermonth JOIN (SELECT LEFT(timestamp, 6) AS yearmonth, COUNT(*) AS totalevents FROM log.WikimediaBlogVisit_5308166 GROUP BY yearmonth) AS totaleventspermonth ON dupespermonth.yearmonth = totaleventspermonth.yearmonth ORDER BY yearmonth; +-----------+-------------+------------------+ | yearmonth | totalevents | dupes_percentage | +-----------+-------------+------------------+ | 201303 | 75230 | 7.89 | | 201304 | 106624 | 6.64 | | 201305 | 85366 | 0.84 | | 201306 | 66621 | 0.46 | | 201307 | 76274 | 0.28 | | 201308 | 72028 | 0.79 | | 201309 | 52815 | 0.35 | | 201310 | 71210 | 0.57 | | 201311 | 73210 | 0.44 | | 201312 | 58625 | 0.39 | | 201401 | 67322 | 0.54 | | 201402 | 57397 | 0.21 | | 201403 | 50589 | 0.19 | | 201404 | 63112 | 0.18 | | 201405 | 4539339 | 0.12 | | 201406 | 2602158 | 0.15 | | 201407 | 799203 | 0.16 | | 201408 | 107381 | 0.11 | | 201409 | 81340 | 0.23 | | 201410 | 67400 | 0.27 | | 201411 | 101956 | 0.16 | | 201412 | 111952 | 0.37 | | 201501 | 110064 | 0.20 | | 201502 | 76112 | 0.19 | | 201503 | 149070 | 0.24 | | 201504 | 82362 | 0.68 | | 201505 | 71315 | 0.13 | | 201506 | 77796 | 0.12 | | 201507 | 69256 | 0.20 | | 201508 | 108015 | 12.14 | | 201509 | 132388 | 10.76 | | 201510 | 87961 | 13.84 | | 201511 | 71734 | 14.96 | | 201512 | 106813 | 14.35 | | 201601 | 134390 | 16.03 | | 201602 | 150619 | 12.26 | | 201603 | 144017 | 13.50 | | 201604 | 200614 | 8.97 | | 201605 | 90666 | 14.40 | | 201606 | 88952 | 14.81 | | 201607 | 164642 | 2.40 | | 201608 | 24599 | 0.35 | +-----------+-------------+------------------+ 42 rows in set (5 min 32.86 sec) SELECT dupesperday.yearmonthday, ROUND(100*dupes/totalevents,2) AS dupes_percentage FROM (SELECT yearmonthday, SUM(copies-1) AS dupes FROM (SELECT LEFT(timestamp, 8) AS yearmonthday, timestamp, userAgent, event_requestUrl, event_referrerUrl, COUNT(*) AS copies FROM log.WikimediaBlogVisit_5308166 WHERE timestamp LIKE '201508%' GROUP BY yearmonthday, timestamp, userAgent, event_requestUrl, event_referrerUrl HAVING copies>1) AS dupeslist GROUP BY yearmonthday) AS dupesperday JOIN (SELECT LEFT(timestamp, 8) AS yearmonthday, COUNT(*) AS totalevents FROM log.WikimediaBlogVisit_5308166 WHERE timestamp LIKE '201508%' GROUP BY yearmonthday) AS totaleventsperday ON dupesperday.yearmonthday = totaleventsperday.yearmonthday ORDER BY yearmonthday; +--------------+------------------+ | yearmonthday | dupes_percentage | +--------------+------------------+ | 20150801 | 0.16 | | 20150802 | 0.37 | | 20150803 | 1.16 | | 20150804 | 10.81 | | 20150805 | 15.11 | | 20150806 | 9.87 | | 20150807 | 12.13 | | 20150808 | 13.68 | | 20150809 | 12.38 | | 20150810 | 11.24 | | 20150811 | 10.30 | | 20150812 | 9.68 | | 20150813 | 8.19 | | 20150814 | 6.57 | | 20150815 | 15.54 | | 20150816 | 13.07 | | 20150817 | 13.09 | | 20150818 | 22.61 | | 20150819 | 18.54 | | 20150820 | 13.94 | | 20150821 | 15.68 | | 20150822 | 12.59 | | 20150823 | 14.80 | | 20150824 | 14.63 | | 20150825 | 15.44 | | 20150826 | 16.63 | | 20150827 | 12.96 | | 20150828 | 14.42 | | 20150829 | 11.37 | | 20150830 | 8.53 | | 20150831 | 13.57 | +--------------+------------------+ 31 rows in set (2.71 sec)
In the Popups schema, these duplicates occur more rarely but still exist (see also T139319#2460171 regarding a previous version of the schema):
SELECT dupesperday.popupEnabled, dupesperday.yearmonthday, ROUND(100*dupes/totalevents,2) AS dupes_percentage FROM
(SELECT popupEnabled, yearmonthday, SUM(copies-1) AS dupes FROM
(SELECT event_popupEnabled AS popupEnabled, LEFT(timestamp, 8) AS yearmonthday,
timestamp, userAgent, event_linkInteractionToken, event_action,
COUNT(*) AS copies
FROM log.Popups_15777589
WHERE wiki ='huwiki' AND event_isAnon = 1
AND event_linkInteractionToken IS NOT NULL
GROUP BY popupEnabled, yearmonthday,
timestamp, userAgent, event_linkInteractionToken, event_action
HAVING copies>1) AS dupeslist
GROUP BY popupEnabled, yearmonthday) AS dupesperday
JOIN
(SELECT event_popupEnabled AS popupEnabled, LEFT(timestamp, 8) AS yearmonthday,
COUNT(*) AS totalevents
FROM log.Popups_15777589
WHERE wiki ='huwiki' AND event_isAnon = 1
AND event_linkInteractionToken IS NOT NULL
GROUP BY popupEnabled, yearmonthday) AS totaleventsperday
ON dupesperday.popupEnabled = totaleventsperday.popupEnabled
AND dupesperday.yearmonthday = totaleventsperday.yearmonthday
ORDER BY popupEnabled, yearmonthday;
+--------------+--------------+------------------+
| popupEnabled | yearmonthday | dupes_percentage |
+--------------+--------------+------------------+
| 0 | 20160728 | 0.06 |
| 0 | 20160729 | 0.14 |
| 0 | 20160730 | 0.02 |
| 0 | 20160731 | 0.15 |
| 0 | 20160801 | 0.02 |
| 0 | 20160802 | 0.02 |
| 0 | 20160803 | 0.04 |
| 0 | 20160804 | 0.02 |
| 0 | 20160805 | 0.01 |
| 0 | 20160806 | 0.02 |
| 0 | 20160807 | 0.04 |
| 0 | 20160808 | 0.06 |
| 0 | 20160809 | 0.02 |
| 0 | 20160810 | 0.07 |
| 1 | 20160728 | 0.20 |
| 1 | 20160729 | 0.01 |
| 1 | 20160730 | 0.07 |
| 1 | 20160731 | 0.25 |
| 1 | 20160801 | 0.04 |
| 1 | 20160802 | 0.05 |
| 1 | 20160803 | 0.50 |
| 1 | 20160804 | 0.08 |
| 1 | 20160805 | 0.07 |
| 1 | 20160806 | 0.06 |
| 1 | 20160807 | 0.10 |
| 1 | 20160808 | 0.06 |
| 1 | 20160809 | 0.08 |
| 1 | 20160810 | 0.16 |
+--------------+--------------+------------------+
28 rows in set (44.94 sec)