
Event counts from Mysql and Hive don't match. Refine is persisting data from crawlers.
Closed, Resolved, Public

Description

From Aug 15 to Sep 6, we are seeing more events in the MobileWikiAppDailyStats table in Hive than in MariaDB:

MariaDB:

select LEFT(timestamp, 8) as date, count(*)
from MobileWikiAppDailyStats_17984412
where LEFT(timestamp, 6) > '201807'
and LEFT(timestamp, 6) < '201810'
group by LEFT(timestamp, 8)

-- 
"date","count(*)"
20180801,56023
20180802,55432
20180803,57239
20180804,62699
20180805,65801
20180806,56967
20180807,57174
20180808,56901
20180809,57041
20180810,58182
20180811,63543
20180812,67376
20180813,57891
20180814,56880
20180815,34880
20180816,32578
20180817,31106
20180818,38840
20180819,37071
20180820,34108
20180821,31844
20180822,36228
20180823,33234
20180824,39703
20180825,42850
20180826,46037
20180827,36714
20180828,36890
20180829,34536
20180830,34683
20180831,37542
20180901,30896
20180902,30064
20180903,28771
20180904,25545
20180905,32876
20180906,54262
20180907,55869
20180908,63575
20180909,65959
20180910,55920
20180911,53646
20180912,44495
20180913,39990
20180914,38106
20180915,42800
20180916,44858
20180917,36713
20180918,36219
20180919,35353
20180920,34916
20180921,35440
20180922,40438
20180923,42501
20180924,35431
20180925,33829
20180926,33436
20180927,32933
20180928,34492
20180929,40432
20180930,41875

Hive:

SELECT Date(CONCAT(year,"-",LPAD(month,2,"0"),"-",LPAD(day,2,"0"))) as date, COUNT(*) 
FROM event.mobilewikiappdailystats
WHERE year=2018 and month > 7 and month < 10
AND revision = 17984412
GROUP BY CONCAT(year,"-",LPAD(month,2,"0"),"-",LPAD(day,2,"0"))
ORDER BY date
LIMIT 1000

--
date	_c1	
2018-08-01	56175	
2018-08-02	55432	
2018-08-03	57239	
2018-08-04	62701	
2018-08-05	65801	
2018-08-06	56967	
2018-08-07	57174	
2018-08-08	56840	
2018-08-09	57043	
2018-08-10	58182	
2018-08-11	63543	
2018-08-12	67376	
2018-08-13	57893	
2018-08-14	57454	
2018-08-15	58694	
2018-08-16	58104	
2018-08-17	59040	
2018-08-18	65094	
2018-08-19	67263	
2018-08-20	58035	
2018-08-21	57410	
2018-08-22	58890	
2018-08-23	54932	
2018-08-24	55986	
2018-08-25	61667	
2018-08-26	65487	
2018-08-27	55596	
2018-08-28	53908	
2018-08-29	54395	
2018-08-30	54134	
2018-08-31	53413	
2018-09-01	62479	
2018-09-02	64395	
2018-09-03	59572	
2018-09-04	54574	
2018-09-05	53557	
2018-09-06	54261	
2018-09-07	55873	
2018-09-08	63577	
2018-09-09	65959	
2018-09-10	55920	
2018-09-11	53646	
2018-09-12	44495	
2018-09-13	39990	
2018-09-14	38106	
2018-09-15	42800	
2018-09-16	44858	
2018-09-17	36713	
2018-09-18	36219	
2018-09-19	35353	
2018-09-20	34916	
2018-09-21	35440	
2018-09-22	40440	
2018-09-23	42501	
2018-09-24	35431	
2018-09-25	33829	
2018-09-26	33503	
2018-09-27	32933	
2018-09-28	34492	
2018-09-29	40436	
2018-09-30	41877

Event Timeline

By looking at some of this data I can see that web crawler events are getting into hive but not into mysql (that would be something for us to fix).
You can see that by comparing data on 2018-09-30T20:30.
The events with app install id:
5979C16C-30DC-4552-8C67-6922E4959796
are from a crawler.
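
For reference, those events can be pulled out of Hive for inspection with a query along these lines (a sketch only: the event.appInstallID and dt column names are assumptions about this schema's Refined output, so adjust them if the actual field names differ):

-- Inspect events from the suspected crawler around 2018-09-30T20:30
SELECT dt, event.appInstallID, useragent.is_bot
FROM event.mobilewikiappdailystats
WHERE year = 2018 AND month = 9 AND day = 30 AND hour = 20
AND revision = 17984412
-- hypothetical filter on the suspected crawler's app install id
AND event.appInstallID = '5979C16C-30DC-4552-8C67-6922E4959796'
LIMIT 100;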

Now, I want to clarify that the two systems are not identical. One of the reasons we want to encourage users to move to Hive for analysis is that the mysql system does not scale; in the data, this can show up as lost events, since restarts of the mysql process (which happened twice in the period you queried for) can drop a few events.
Also, a burst of traffic from a schema that sends too many requests to mysql can clog the consumer, make it die, and drop a few events.

And last but not least, persisting to two different SQL engines through two separate validation pipelines can run into issues that are particular to one engine or to one schema-validation pipeline.
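
As a sketch of how such drops could be localized, the daily counts above can be compared at hour granularity instead (assuming the Hive table's hour partition and the yyyymmddHHMMSS string timestamp used in MariaDB):

-- Hive: per-hour counts for one day
SELECT hour, COUNT(*)
FROM event.mobilewikiappdailystats
WHERE year = 2018 AND month = 9 AND day = 30
AND revision = 17984412
GROUP BY hour
ORDER BY hour;

-- MariaDB: per-hour counts for the same day
SELECT LEFT(timestamp, 10) AS date_hour, COUNT(*)
FROM MobileWikiAppDailyStats_17984412
WHERE LEFT(timestamp, 8) = '20180930'
GROUP BY LEFT(timestamp, 10);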

Nuria renamed this task from "Event counts from Mysql and Hive don't match" to "Event counts from Mysql and Hive don't match. Hive is persisting data from crawlers." Nov 21 2018, 10:26 PM

@Nuria : Do we know why those events don't get inserted into MySQL?

@JAllemandou we have code that prevents them from being so for self-reported bots: https://phabricator.wikimedia.org/T67508

Milimetric renamed this task from "Event counts from Mysql and Hive don't match. Hive is persisting data from crawlers." to "Event counts from Mysql and Hive don't match. Refine is persisting data from crawlers." Nov 29 2018, 6:08 PM
Milimetric triaged this task as High priority.
Milimetric lowered the priority of this task from High to Medium.
Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.
Milimetric moved this task from Operational Excellence to Data Quality on the Analytics board.
Nuria raised the priority of this task from Medium to High. Dec 14 2018, 5:22 AM
Nuria subscribed.

By looking at some of this data I can see that web crawler events are getting into hive but not into mysql (that would be something for us to fix).

Does this (i.e. that T67508 doesn't yet work for the Hive data) apply to all EL schemas? In that case it would seem a valuable addition to the documentation at https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging#Incompatibilities_with_the_MariaDB_setup .

@Nuria & @Tbayer, I looked into this a bit. The is_bot flag is added to the parsed user agent, so all EL events have it. But it's only used to keep events from being inserted into mysql:

https://github.com/wikimedia/puppet/blob/dcac90f33c775dfdca2fa592adaa33d3e7c83394/modules/eventlogging/files/plugins.py#L146

I can see two possibilities:

  1. we did this on purpose and we forgot
  2. we made the mysql filter before the EventLogging Refine -> Hadoop pipeline was operational, so we forgot to apply the filter there

Does everyone think we should filter out is_bot events or keep them there and leave the filtering for query time? I don't see what value we could get out of data from bots, so I'd vote to filter them out.

we made the mysql filter before the EventLogging Refine -> Hadoop pipeline was operational, so we forgot to apply the filter there

Yes, this is what happened. To be fair, these are only "identified" bots (there is a lot of bot data in EL not identified as such), but yes, they should be filtered out.

Hm, I thought I was onto something with the is_bot flag, because the mysql consumer filters out events when useragent.is_bot is true. But I can't back this up with the existing data, mostly because is_bot seems to always be false now (in all the data we have for this schema, only 6 records have is_bot true). I reran the queries with AND NOT useragent.is_bot anyway, and the results are almost always identical. When they're not, my guess is that the mysql consumer is dropping events during restarts, like the April 1st outage (also see March 7th and 19th).
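
For reference, the bot-filtered Hive query is just the original with one extra condition, something like the following (shown over the same 2019 window as the re-run below):

SELECT Date(CONCAT(year,"-",LPAD(month,2,"0"),"-",LPAD(day,2,"0"))) as date, COUNT(*)
FROM event.mobilewikiappdailystats
WHERE year=2019 and month > 1 and month < 5
AND revision = 17984412
-- exclude events whose parsed user agent is flagged as a bot
AND NOT useragent.is_bot
GROUP BY CONCAT(year,"-",LPAD(month,2,"0"),"-",LPAD(day,2,"0"))
ORDER BY date
LIMIT 1000;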

We talked about this and we don't think filtering all events by is_bot would be a good idea. We mostly did that in mysql for scalability reasons, because we saw spiky activity from bots that would crash the consumer. But we don't have that problem in Hadoop. So the question is, @chelsyx: can you filter out records with is_bot true when you need to? If you would almost always need to do that, maybe we can think of another approach.

Rerunning the original queries on new data:

Hive:
SELECT Date(CONCAT(year,"-",LPAD(month,2,"0"),"-",LPAD(day,2,"0"))) as date, COUNT(*) 
FROM event.mobilewikiappdailystats
WHERE year=2019 and month > 1 and month < 5
AND revision = 17984412
GROUP BY CONCAT(year,"-",LPAD(month,2,"0"),"-",LPAD(day,2,"0"))
ORDER BY date
LIMIT 1000
;

MariaDB:
select LEFT(timestamp, 8) as date, count(*)
from MobileWikiAppDailyStats_17984412
where LEFT(timestamp, 6) > '201902'
and LEFT(timestamp, 6) < '201905'
group by LEFT(timestamp, 8)
;

Date            Hive        MariaDB
==========      =====       =======
2019-02-01      52136       52136
2019-02-02      60110       60110
2019-02-03      62252       62252
2019-02-04      52997       52997
2019-02-05      51234       51234
2019-02-06      50962       50962
2019-02-07      51165       51165
2019-02-08      52206       52206
2019-02-09      59006       59006
2019-02-10      62538       62538
2019-02-11      53448       53448
2019-02-12      50974       50974
2019-02-13      50734       50734
2019-02-14      49820       49820
2019-02-15      50891       50891
2019-02-16      58675       58675
2019-02-17      60717       60717
2019-02-18      53260       53260
2019-02-19      52155       52155
2019-02-20      50847       50847
2019-02-21      50710       50710
2019-02-22      52407       52407
2019-02-23      59151       59151
2019-02-24      61078       61078
2019-02-25      54601       54601
2019-02-26      50947       50947
2019-02-27      50488       50488
2019-02-28      50016       50016
2019-03-01      51093       51093
2019-03-02      58070       58070
2019-03-03      60225       60225
2019-03-04      52738       52738
2019-03-05      50948       50948
2019-03-06      50625       50625
2019-03-07      50358       50244
2019-03-08      52143       52143
2019-03-09      58497       58497
2019-03-10      60904       60904
2019-03-11      51534       51534
2019-03-12      50881       50881
2019-03-13      50129       50129
2019-03-14      50778       50778
2019-03-15      51118       51118
2019-03-16      57532       57532
2019-03-17      59959       59959
2019-03-18      50468       50468
2019-03-19      50230       29706
2019-03-20      50401       50401
2019-03-21      52442       52442
2019-03-22      51079       51079
2019-03-23      57270       57270
2019-03-24      59547       59547
2019-03-25      51267       51267
2019-03-26      50061       50061
2019-03-27      50062       50062
2019-03-28      49490       49490
2019-03-29      49907       49907
2019-03-30      55896       55896
2019-03-31      58502       58502
2019-04-01      49570       49110
2019-04-02      49193       49190
2019-04-03      49628       49628
2019-04-04      49486       49486
2019-04-05      50795       50795
2019-04-06      55728       55728
2019-04-07      58317       58317
2019-04-08      50281       50281

@chelsyx: since you can filter the data with the is_bot flag, we're going to close this ticket unless you have any objections.

@fdans No objection from me. Thank you for looking into this issue!