Identical EventLogging queries give different results on db1047 and dbstore1002
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	• Tbayer
	Mar 30 2016, 10:26 AM

Description

For several different EventLogging tables, the same query yields quite different results depending on whether it is run on db1047 (= "s1-analytics-slave") or on dbstore1002 ("analytics-store" = "s2-analytics-slave" etc.).

Example 1 (about MobileWebSectionUsage_15038458, queries done within a few hours on Saturday/Sunday March 26/27):
see P2832

Example 2: (both queries ran just now, and earlier on March 16 with the same results)

mysql:research@analytics-store.eqiad.wmnet [(none)]> SELECT COUNT(*) AS events FROM log.NavigationTiming_14899847 WHERE LEFT(timestamp, 8) = '20151203';
+--------+
| events |
+--------+
| 307440 |
+--------+
1 row in set (5.84 sec)

mysql:research@s1-analytics-slave.eqiad.wmnet [(none)]> SELECT COUNT(*) AS events FROM log.NavigationTiming_14899847 WHERE LEFT(timestamp, 8) = '20151203';
+--------+
| events |
+--------+
| 358908 |
+--------+
1 row in set (2 min 15.64 sec)

Example 3: Many tables seem to differ considerably in size between store and s1, e.g. these (queried just now):

mysql:research@s1-analytics-slave.eqiad.wmnet [(none)]>  SELECT table_name, (DATA_LENGTH + INDEX_LENGTH)/1024/1024/1024 as `TOTAL SIZE (GB)` FROM information_schema.tables WHERE table_name LIKE 'MobileWikiAppToC%';
+--------------------------------------+------------------+
| table_name                           | TOTAL SIZE (GB)  |
+--------------------------------------+------------------+
| MobileWikiAppToCInteraction_10375484 | 150.709785191342 |
| MobileWikiAppToCInteraction_11014396 |   0.988343123347 |
| MobileWikiAppToCInteraction_14585319 |   0.452323913574 |
| MobileWikiAppToCInteraction_8461467  |  81.039179009385 |
+--------------------------------------+------------------+
4 rows in set (0.04 sec)

mysql:research@analytics-store.eqiad.wmnet [(none)]> SELECT table_name, (DATA_LENGTH + INDEX_LENGTH)/1024/1024/1024 as `TOTAL SIZE (GB)` FROM information_schema.tables WHERE table_name LIKE 'MobileWikiAppToC%';
+--------------------------------------+------------------+
| table_name                           | TOTAL SIZE (GB)  |
+--------------------------------------+------------------+
| MobileWikiAppToCInteraction_10375484 | 133.469544879161 |
| MobileWikiAppToCInteraction_11014396 |   0.876318826340 |
| MobileWikiAppToCInteraction_14585319 |   0.465894431807 |
| MobileWikiAppToCInteraction_8461467  |  81.233877775259 |
+--------------------------------------+------------------+
4 rows in set (0.37 sec)

I already discussed such issues briefly with @madhuvishy and @Nuria on #wikimedia-analytics two weeks ago; back then I was informed that (despite the "slave" in the name) s1 might be more reliable than on analytics-store.

(BTW, is the information here still current? https://wikitech.wikimedia.org/wiki/MariaDB#Analytics_Slaves )

Related Objects

Mentioned In: T156844: Decommission old dbstore hosts (db1046, db1047)
T128536: Compare usage of collapsed and uncollapsed section view on mobile web
T136057: Weird German sessions before/since 2.1.141
T128931: investigate instrumentation of section opens/scrolls
Mentioned Here: T125135: Add autoincrement id to EventLogging MySQL tables. {oryx}
T136057: Weird German sessions before/since 2.1.141
P2832 Eventlogging discrepancies between s1 and s2

Event Timeline

• Tbayer created this task.Mar 30 2016, 10:26 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 30 2016, 10:26 AM

• Tbayer mentioned this in T128931: investigate instrumentation of section opens/scrolls.Mar 30 2016, 10:27 AM

FYI that @jcrespo send an e-mail on this regard yesterday to @analytics and db1047 is not fitted for usage as of now.

In T131236#2161126, @Nuria wrote:

FYI that @jcrespo send an e-mail on this regard yesterday to @analytics

This is a different issue. I had actually already seen that email (as you could have inferred from the fact that it was a reply to a question I had posted in the same thread shortly before).

and db1047 is not fitted for usage as of now.

Actually it is, according to Jaime. I had asked him about this on IRC (see partial log below) before filing this bug on his request.

[08:46:36] <HaeB>	 jynus: followup question to https://lists.wikimedia.org/pipermail/analytics/2016-March/005072.html : is it OK to restart queries on s1-analytics-slave now? (it seems to be up and running)
[...]
[08:56:51] <jynus>	 HaeB, yes, queries can go there
[08:57:11] <HaeB>	 jynus: how long might the refilling take?
[08:57:48] <jynus>	 HaeB, the tables are constantly being refilled from the master, the only thing that changes is how late they are from the master
[08:58:24] <HaeB>	 so it's just about the update lag, basically?
[08:58:25] <HaeB>	 i see
[...]
[08:59:38] <jynus>	 most seem to be 5 days behind
[08:59:48] <jynus>	 enwiki and dewiki only minutes
[09:00:17] <HaeB>	 (i'm querying older eventlogging data, so that's not a concern in my case)
[...]
[09:05:08] <jynus>	 right now dbstore1002 is probably more accurante than db1047
[...]
[09:07:37] <jynus>	 except for MobileAppUploadAttempts_5257716
[09:08:37] <HaeB>	 jynus: also for older data? (from around the beginning of january, say)
[09:09:16] <jynus>	 yes, in fact I can say that (except for that table) all rows from January to now are there
[09:09:19] <HaeB>	 jynus: and why? (the way madhu and nuria explained it to me made it sound that it was the other way around... that s1-analytics might be more reliable than store)
[09:09:34] <jynus>	 well, eventually, both will be
[...]
[09:10:23] <jynus>	 but db1047 just had issues and I have not yet completed a full resync
[...]
[09:11:10] <HaeB>	 jynus: ok, but that would not affect (say) eventlogging data from january 7, say?
[09:11:50] <HaeB>	 (the missing resync, i mean)
[09:11:58] <jynus>	 I am actually checking since 2016-01-01
[09:13:15] <jynus>	 I think the best way to check it is for you to run a quick count(*) from the timestamp that interests you an compare results
[09:13:33] <jynus>	 if you see a difference, tell me

@jcrespo Any progress here? (I just reran Example 2 and it still gives the same results with the same discrepancy.)

I will give it a look tomorrow.

jcrespo triaged this task as Medium priority.May 25 2016, 5:47 PM

jcrespo moved this task from Triage to Pending comment on the DBA board.

• Tbayer mentioned this in T136057: Weird German sessions before/since 2.1.141.May 25 2016, 10:04 PM

See also T136057#2328652 (@Dbrant encountering a "wrong checksum" error on s1-analytics-slave while the same query works on s2)

^ I had the exact same problem yesterday. Thanks for documenting this, @Tbayer!

• Tbayer mentioned this in T128536: Compare usage of collapsed and uncollapsed section view on mobile web.Jun 2 2016, 7:11 PM

From what I can see, results are >3 months. By that time, if results arrive, results may have been already purged due to retention policies, resulting on partial results. Why results so old keep being inserted is something I do not know- but as far as I can see, it happens.

BTW, do never do:

MariaDB EVENTLOGGING m4 localhost log > SELECT COUNT(*) AS events FROM log.NavigationTiming_14899847 WHERE LEFT(timestamp, 8) = '20151203';
+--------+
| events |
+--------+
| 358908 |
+--------+
1 row in set (4.69 sec)

I literally have that as an example on may talks on what not to do: http://www.slideshare.net/jynus/query-optimization-from-0-to-10-and-up-to-57/44 . Please give a look at those slides. Compare the speed with:

MariaDB EVENTLOGGING m4 localhost log > SELECT COUNT(*) AS events FROM log.NavigationTiming_14899847 WHERE timestamp like '20151203%';
+--------+
| events |
+--------+
| 358908 |
+--------+
1 row in set (0.46 sec)

This is a query done on the master. It seems analytics-store is missing those particular rows. Blame T125135 for some rows skipped (someone decided to get rid of auto-increments, and now new rows are inserted in parallel, so for a time not all rows were inserted).

I am resyncing everything from December now, that will fix this particular issue.

jcrespo moved this task from In progress to Pending comment on the DBA board.Jun 3 2016, 2:15 PM

Thank you, the code on our end is reayd to add an auto increment column. Please see: https://phabricator.wikimedia.org/T125135

MariaDB  db1047 log > SELECT COUNT(*) AS events FROM log.NavigationTiming_14899847 WHERE timestamp like '20151203%';
+--------+
| events |
+--------+
| 358908 |
+--------+
1 row in set (1.40 sec)

MariaDB  db1046 log > SELECT COUNT(*) AS events FROM log.NavigationTiming_14899847 WHERE timestamp like '20151203%';
+--------+
| events |
+--------+
| 358908 |
+--------+
1 row in set (0.52 sec)

MariaDB  dbstore1002 log > SELECT COUNT(*) AS events FROM log.NavigationTiming_14899847 WHERE timestamp like '20151203%';
+--------+
| events |
+--------+
| 358908 |
+--------+
1 row in set (0.61 sec)

This has been fixed for all tables up to December.

• Tbayer mentioned this in T156844: Decommission old dbstore hosts (db1046, db1047).Mar 17 2017, 10:58 PM

Identical EventLogging queries give different results on db1047 and dbstore1002Closed, ResolvedPublicActions

Description

Related Objects

Event Timeline

Identical EventLogging queries give different results on db1047 and dbstore1002
Closed, ResolvedPublic
Actions