Page MenuHomePhabricator

enwiki_p logging vs logging_userindex returning dramatically different results
Open, Needs TriagePublic

Description

Consider this query on logging:

MariaDB [enwiki_p]> SELECT log_user_text AS `reviewer`,
  COUNT(DISTINCT(log_page)) AS `reviews`
  FROM logging
  WHERE log_timestamp BETWEEN 20170519000000 AND 20170619000000
  AND log_namespace = 0 AND
    (
      (
        log_type = 'patrol'
        AND log_action = 'patrol'
      ) OR (
        log_type = 'pagetriage-curation'
        AND log_action = 'reviewed'
      )
    )
  GROUP BY reviewer
  ORDER BY reviews DESC
  LIMIT 100;

+-------------------------+---------+
| reviewer                | reviews |
+-------------------------+---------+
| MusikAnimal             |      54 |
| Boleyn                  |      53 |
| BigHaz                  |      47 |
| JTtheOG                 |      41 |
| BigHaz                  |      41 |
| Onel5969                |      40 |
| DragonflySixtyseven     |      39 |
| BigHaz                  |      37 |
| Boleyn                  |      36 |
| BigHaz                  |      28 |
| Fitindia                |      28 |
| MusikAnimal             |      27 |
| Boleyn                  |      27 |
| Dschslava               |      27 |
| MusikAnimal             |      27 |
| DragonflySixtyseven     |      27 |
| Boleyn                  |      27 |
| Winged Blades of Godric |      26 |
| DragonflySixtyseven     |      25 |
| BigHaz                  |      24 |
| Boleyn                  |      23 |
| DragonflySixtyseven     |      21 |
| JTtheOG                 |      21 |
| Onel5969                |      20 |
| BigHaz                  |      20 |
| Domdeparis              |      19 |
| JTtheOG                 |      19 |
| BigHaz                  |      19 |
| JTtheOG                 |      19 |
| Onel5969                |      18 |
| Boleyn                  |      18 |
| BigHaz                  |      17 |
| Onel5969                |      17 |
| BigHaz                  |      17 |
| Onel5969                |      16 |
| Boleyn                  |      16 |
| SL93                    |      15 |
| Boleyn                  |      15 |
| Markdask                |      15 |
| Mduvekot                |      15 |
| JTtheOG                 |      15 |
| Sulaimandaud            |      14 |
| JTtheOG                 |      14 |
| BigHaz                  |      14 |
| Onel5969                |      14 |
| JTtheOG                 |      14 |
| Boleyn                  |      14 |
| Boleyn                  |      13 |
| Fitindia                |      13 |
| JTtheOG                 |      13 |
| Bensci54                |      13 |
| BigHaz                  |      13 |
| Fitindia                |      13 |
| Stephreef               |      13 |
| Catmando999             |      13 |
| PRehse                  |      13 |
| JTtheOG                 |      13 |
| PRehse                  |      13 |
| Boleyn                  |      13 |
| JTtheOG                 |      13 |
| PRehse                  |      13 |
| Boleyn                  |      13 |
| PRehse                  |      13 |
| Bensci54                |      13 |
| PRehse                  |      13 |
| Boleyn                  |      13 |
| MusikAnimal             |      13 |
| Sadads                  |      13 |
| Dschslava               |      13 |
| BigHaz                  |      13 |
| Dschslava               |      13 |
| TheCatalyst31           |      13 |
| PRehse                  |      13 |
| JTtheOG                 |      13 |
| Sadads                  |      13 |
| TheCatalyst31           |      13 |
| PRehse                  |      13 |
| PRehse                  |      13 |
| PRehse                  |      13 |
| PRehse                  |      13 |
| PRehse                  |      13 |
| BigHaz                  |      13 |
| BigHaz                  |      13 |
| Legacypac               |      13 |
| Fitindia                |      13 |
| Markdask                |      13 |
| PRehse                  |      13 |
| Dschslava               |      13 |
| JTtheOG                 |      12 |
| Fitindia                |      12 |
| Stephreef               |      12 |
| Blythwood               |      12 |
| Mduvekot                |      12 |
| BigHaz                  |      12 |
| PRehse                  |      12 |
| Nicnote                 |      12 |
| JTtheOG                 |      12 |
| PRehse                  |      12 |
| Boleyn                  |      12 |
| MrX                     |      12 |
+-------------------------+---------+
100 rows in set (4 min 11.29 sec)

Note this took only 4 minutes. However the results are painfully wrong. In fact if you remove the LIMIT you'll get some 10,000+ rows when there should be only ~350 rows.

The same query but with logging_userindex:

MariaDB [enwiki_p]> SELECT log_user_text AS `reviewer`,
  COUNT(DISTINCT(log_page)) AS `reviews`
  FROM logging_userindex
  WHERE log_timestamp BETWEEN 20170519000000 AND 20170619000000
  AND log_namespace = 0 AND
    (
      (
        log_type = 'patrol'
        AND log_action = 'patrol'
      ) OR (
        log_type = 'pagetriage-curation'
        AND log_action = 'reviewed'
      )
    )
  GROUP BY reviewer
  ORDER BY reviews DESC
  LIMIT 100;

+-------------------------+---------+
| reviewer                | reviews |
+-------------------------+---------+
| BigHaz                  |    1197 |
| JTtheOG                 |    1068 |
| Onel5969                |    1062 |
| Boleyn                  |    1045 |
| PRehse                  |     924 |
| Sulaimandaud            |     523 |
| Narutolovehinata5       |     461 |
| Usernamekiran           |     441 |
| Kudpung                 |     373 |
| Domdeparis              |     370 |
| Robert McClenon         |     362 |
| Timothyjosephwood       |     352 |
| MrX                     |     334 |
| Mduvekot                |     305 |
| Chrissymad              |     299 |
| Sadads                  |     295 |
| Winged Blades of Godric |     293 |
| Atlantic306             |     277 |
| Pichpich                |     267 |
| Fitindia                |     261 |
| Reddogsix               |     251 |
| DragonflySixtyseven     |     250 |
| TonyBallioni            |     238 |
| Meatsgains              |     235 |
| MainlyTwelve            |     218 |
| Reb1981                 |     201 |
| RileyBugz               |     198 |
| Stephreef               |     185 |
| MusikAnimal             |     184 |
| KGirlTrucker81          |     181 |
| 331dot                  |     175 |
| Diannaa                 |     172 |
| Legacypac               |     154 |
| Ymblanter               |     149 |
| Nick Moyes              |     147 |
| C.Fred                  |     132 |
| Graeme Bartlett         |     129 |
| Jake Brockman           |     123 |
| Melcous                 |     121 |
| Scope creep             |     120 |
| Insertcleverphrasehere  |     119 |
| Gbawden                 |     116 |
| TheCatalyst31           |     115 |
| DGG                     |     115 |
| Gtstricky               |     113 |
| JamesG5                 |     108 |
| RA0808                  |     107 |
| Markdask                |     103 |
| Slashme                 |      99 |
| Simplexity22            |      96 |
| Malunrenta              |      96 |
| WWGB                    |      90 |
| Gilo1969                |      87 |
| Diako1971               |      82 |
| Dschslava               |      81 |
| Hydronium Hydroxide     |      81 |
| Innisfree987            |      80 |
| SL93                    |      79 |
| Jupitus Smart           |      79 |
| Nicnote                 |      77 |
| Fbdave                  |      77 |
| I dream of horses       |      76 |
| Oluwa2Chainz            |      72 |
| Cahk                    |      70 |
| Winner 42               |      70 |
| Velella                 |      66 |
| Xevus11                 |      64 |
| TheMagikCow             |      61 |
| JHCaufield              |      58 |
| Ajpolino                |      57 |
| Blackguard SF           |      57 |
| Cabayi                  |      56 |
| Yintan                  |      54 |
| WikiDan61               |      54 |
| JMHamo                  |      54 |
| Yashovardhan Dhanania   |      53 |
| L3X1                    |      51 |
| Killer Moff             |      48 |
| Ipigott                 |      48 |
| LuK3                    |      47 |
| Kuyabribri              |      45 |
| Mattlore                |      43 |
| Devopam                 |      42 |
| Sjö                     |      42 |
| Numbermaniac            |      42 |
| Fixer88                 |      39 |
| TheSandDoctor           |      39 |
| Vanjagenije             |      39 |
| Bensci54                |      39 |
| TheMagnificentist       |      38 |
| Oiyarbepsy              |      37 |
| Fabrictramp             |      36 |
| Kaldari                 |      35 |
| Mean as custard         |      35 |
| Catmando999             |      35 |
| DESiegel                |      35 |
| Coderzombie             |      35 |
| PamD                    |      34 |
| Arthistorian1977        |      33 |
| Stikkyy                 |      33 |
+-------------------------+---------+
100 rows in set (21 min 47.53 sec)

These results are correct and identical to production, but it took nearly 22 minutes to get them.

I tried this same query on logging_logindex but it took too long (45+ minutes), so I can't confirm is the same issue is there.

Why are the results from logging so different than logging_userindex? Is MySQL completely bugging out? I understand the indexed tables on Tool Labs may be missing data where the user or other column is NULL due to being suppressed, but that doesn't explain the discrepancy in data that isn't being suppressed. And 10,000+ rows for this query on logging versus ~350 on logging_userindex is pretty dramatic.

Event Timeline

Restricted Application added a project: Cloud-Services. · View Herald TranscriptJun 20 2017, 1:03 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
MusikAnimal updated the task description. (Show Details)Jun 20 2017, 1:03 AM
MusikAnimal updated the task description. (Show Details)Jun 20 2017, 1:06 AM
MZMcBride added a subscriber: MZMcBride.

@jcrespo can answer much better than I can, but in my experience, these types of data integrity issues on Tool Labs are very common. And every time they're brought up, someone threatens "ROW-based replication" that will allegedly solve all of these problems.

MusikAnimal added a comment.EditedJun 20 2017, 1:20 AM

I want to also point out that this query took 9 seconds to finish on production.

zhuyifei1999 added a subscriber: zhuyifei1999.EditedJun 20 2017, 2:30 AM

FWIW, using https://tools.wmflabs.org/tools-info/optimizer.py, the EXPLAIN-s for both queries (logging and logging_userindex) is basically the same as (differ a bit as the table continues to grow):

idselect_typetabletypepossible_keyskeykey_lenrefrowsExtra
1SIMPLEloggingrangetype_time, timestype_time34None34611530Using where; Rowid-ordered scan; Using temporary; Using filesort

Regarding logging_logindex:

idselect_typetabletypepossible_keyskeykey_lenrefrowsExtra
1SIMPLEloggingreftype_time, page_time, timespage_time4const25948436Using where; Using temporary; Using filesort

Key page_time is log_namespace, log_title, log_timestamp, which are not as useful as type_time. (Title is not used in the query)

bd808 renamed this task from Tool Labs logging vs indexed version returning dramatically different results to enwiki_p logging vs logging_userindex returning dramatically different results.Jun 20 2017, 3:23 AM
bd808 removed a project: Toolforge.
zhuyifei1999 added a comment.EditedJun 20 2017, 3:30 AM

Maybe DBAs have better ideas, but this is an optimised-to-4-min query:

1MariaDB [enwiki_p]> SELECT logtemp.log_user_text AS `reviewer`,
2 -> COUNT(DISTINCT(logtemp.log_page)) AS `reviews`
3 -> FROM (
4 -> SELECT log_user_text, log_page
5 -> FROM logging_userindex
6 -> WHERE log_timestamp BETWEEN 20170519000000 AND 20170619000000
7 -> AND log_type = 'patrol'
8 -> AND log_action = 'patrol'
9 -> AND log_namespace = 0
10 -> UNION
11 -> SELECT log_user_text, log_page
12 -> FROM logging_userindex
13 -> WHERE log_timestamp BETWEEN 20170519000000 AND 20170619000000
14 -> AND log_type = 'pagetriage-curation'
15 -> AND log_action = 'reviewed'
16 -> AND log_namespace = 0
17 -> ) logtemp
18 -> GROUP BY reviewer
19 -> ORDER BY reviews DESC
20 -> LIMIT 100;
21+-------------------------+---------+
22| reviewer | reviews |
23+-------------------------+---------+
24| BigHaz | 1197 |
25| JTtheOG | 1068 |
26| Onel5969 | 1062 |
27| Boleyn | 1045 |
28| PRehse | 924 |
29| Sulaimandaud | 523 |
30| Narutolovehinata5 | 461 |
31| Usernamekiran | 441 |
32| Kudpung | 373 |
33| Domdeparis | 370 |
34| Robert McClenon | 362 |
35| Timothyjosephwood | 352 |
36| MrX | 334 |
37| Mduvekot | 305 |
38| Chrissymad | 299 |
39| Sadads | 295 |
40| Winged Blades of Godric | 293 |
41| Atlantic306 | 277 |
42| Pichpich | 267 |
43| Fitindia | 261 |
44| Reddogsix | 251 |
45| DragonflySixtyseven | 250 |
46| TonyBallioni | 238 |
47| Meatsgains | 235 |
48| MainlyTwelve | 218 |
49| Reb1981 | 201 |
50| RileyBugz | 198 |
51| Stephreef | 185 |
52| MusikAnimal | 184 |
53| KGirlTrucker81 | 181 |
54| 331dot | 175 |
55| Diannaa | 172 |
56| Legacypac | 154 |
57| Ymblanter | 149 |
58| Nick Moyes | 147 |
59| C.Fred | 132 |
60| Graeme Bartlett | 129 |
61| Jake Brockman | 123 |
62| Melcous | 121 |
63| Scope creep | 120 |
64| Insertcleverphrasehere | 119 |
65| Gbawden | 116 |
66| TheCatalyst31 | 115 |
67| DGG | 115 |
68| Gtstricky | 113 |
69| JamesG5 | 108 |
70| RA0808 | 107 |
71| Markdask | 103 |
72| Slashme | 99 |
73| Simplexity22 | 96 |
74| Malunrenta | 96 |
75| WWGB | 90 |
76| Gilo1969 | 87 |
77| Diako1971 | 82 |
78| Dschslava | 81 |
79| Hydronium Hydroxide | 81 |
80| Innisfree987 | 80 |
81| SL93 | 79 |
82| Jupitus Smart | 79 |
83| Nicnote | 77 |
84| Fbdave | 77 |
85| I dream of horses | 76 |
86| Oluwa2Chainz | 72 |
87| Cahk | 70 |
88| Winner 42 | 70 |
89| Velella | 66 |
90| Xevus11 | 64 |
91| TheMagikCow | 61 |
92| JHCaufield | 58 |
93| Ajpolino | 57 |
94| Blackguard SF | 57 |
95| Cabayi | 56 |
96| Yintan | 54 |
97| WikiDan61 | 54 |
98| JMHamo | 54 |
99| Yashovardhan Dhanania | 53 |
100| L3X1 | 51 |
101| Killer Moff | 48 |
102| Ipigott | 48 |
103| LuK3 | 47 |
104| Kuyabribri | 45 |
105| Mattlore | 43 |
106| Devopam | 42 |
107| Sjö | 42 |
108| Numbermaniac | 42 |
109| Fixer88 | 39 |
110| TheSandDoctor | 39 |
111| Vanjagenije | 39 |
112| Bensci54 | 39 |
113| TheMagnificentist | 38 |
114| Oiyarbepsy | 37 |
115| Fabrictramp | 36 |
116| Kaldari | 35 |
117| Mean as custard | 35 |
118| Catmando999 | 35 |
119| DESiegel | 35 |
120| Coderzombie | 35 |
121| PamD | 34 |
122| Stikkyy | 33 |
123| Arthistorian1977 | 33 |
124+-------------------------+---------+
125100 rows in set (3 min 48.30 sec)

The relevant EXPLAIN is:

idselect_typetabletypepossible_keyskeykey_lenrefrowsExtra
1PRIMARYALLNoneNoneNoneNone35659591Using temporary; Using filesort
2DERIVEDloggingreftype_time, timestype_time34const34142133Using where
3UNIONloggingreftype_time, timestype_time34const1517458Using where
NoneUNION RESULTALLNoneNoneNoneNoneNone

I executed this query a second time and got 4 min 42.41 sec. Also executed on logging where I got 6 min 27.80 sec (with correct results), and logging_logindex where I grew impatient and the query has not yet finish at the time of writing this I got 30 min 3.67 sec (with correct results).

MusikAnimal added a comment.EditedJun 20 2017, 4:02 AM

Maybe DBAs have better ideas, but this is an optimised-to-4-min query:
P5596

This is amazing. Thank you!

However just to make sure we're on the same page, when I run this without a LIMIT I get ~12,500 results going off of logging:

MariaDB [enwiki_p]> SELECT log_user_text AS `reviewer`,
  COUNT(DISTINCT(log_page)) AS `reviews`
  FROM logging
  WHERE log_timestamp BETWEEN 20170519000000 AND 20170619000000
  AND log_namespace = 0 AND
  (
    (
      log_type = 'patrol'
      AND log_action = 'patrol'
    ) OR (
      log_type = 'pagetriage-curation'
      AND log_action = 'reviewed'
    )
  )
  GROUP BY reviewer
  ORDER BY reviews DESC;

+---------------------------+---------+
| reviewer                  | reviews |
+---------------------------+---------+
| MusikAnimal               |      54 |
| Boleyn                    |      53 |
| BigHaz                    |      47 |

...

| Sulaimandaud              |       1 |
| Kuyabribri                |       1 |
| Sulaimandaud              |       1 |
+---------------------------+---------+
12471 rows in set (3 min 25.47 sec)

and the logging_userindex version gave me 345, which is correct. So my gut tells me this is not a replication issue, or the timestamps of these log entries were somehow mutated. There were not thousands of reviewers in this time period.

Strangely enough, if I do a COUNT on logging it's correct:

MariaDB [enwiki_p]> SELECT COUNT(DISTINCT(log_user_text)) AS count FROM logging WHERE log_timestamp BETWEEN 20170519000000 AND 20170619000000 AND log_namespace = 0 AND ((log_type = 'patrol' AND log_action = 'patrol') OR (log_type = 'pagetriage-curation' AND log_action = 'reviewed'));
+-------+
| count |
+-------+
|   345 |
+-------+
1 row in set (3 min 5.16 sec)

I think something really funky is going on. Maybe MySQL really did bug out, perhaps due to this being such an inefficient query (yet mysteriously finishes in only a few minutes)? Or the data in logging is corrupt in such a way that MySQL got confused?

At the same time I'm OK with this data being wrong, because it puts me as the top reviewer! =P

I think something really funky is going on.

The grouping mechanism don't seem to work correctly from what I see. There are multiple reviewer rows with the same name. But I don't really understand why it's like that.

At the same time I'm OK with this data being wrong, because it puts me as the top reviewer! =P

lol

In which hosts did you do the tests?

In which hosts did you do the tests?

Sorry I didn't record this information. I ran sql enwiki on tools-bastion-02, if that helps!

I tried the original two queries on labsdb-web.eqiad.wmnet without the LIMIT.
The first query;

12471 rows in set (10 min 52.70 sec)

The second query:

345 rows in set (42.76 sec)

And in production (without the LIMIT) the original query took (there is obviously not logging_userindex):

345 rows in set (7 min 28.24 sec)

The explains in production and for the view are different (expected) so probably the views are playing a role here.
If I check the same query on the same labsdb host but on enwiki and not on the views:

345 rows in set (1 min 9.47 sec)

Which is exactly the same result we get in production

So I would say it is definitely the views which is giving some weird results here.

Krenair added a subscriber: Krenair.
bd808 moved this task from Backlog to Wiki replicas on the Data-Services board.Aug 31 2017, 3:07 PM
jcrespo moved this task from Triage to Done on the DBA board.Sep 12 2017, 4:22 PM