Page MenuHomePhabricator

Central Notice: possible CN issue early on December 1st UTC
Open, LowPublic

Description

Banner hours totals take a steep dive on december 1st after 6am. fr-online said they didn't do anything around this time.

banner issues graph.png (457×1 px, 92 KB)

Event Timeline

select sum(count) as impressions, hour(timestamp) as hour from bannerimpressions where timestamp between '20161201000000' and '20161201235959' group by hour(timestamp);
+-------------+------+
| impressions | hour |
+-------------+------+
|     2546560 |    0 |
|     2517810 |    1 |
|     2479290 |    2 |
|     2382240 |    3 |
|     2115490 |    4 |
|     1804680 |    5 |
|     1535710 |    6 |
|     1420290 |    7 |
|      273530 |    8 |
|     1508940 |    9 |
|     1618530 |   10 |
|     1686280 |   11 |
|     1829700 |   12 |
|     2328590 |   13 |
|     2649190 |   14 |
|     2518080 |   15 |
|     2588770 |   16 |
|     2613630 |   17 |
|     1313720 |   18 |
+-------------+------+

The same query for Nov 30 shows that impressions were almost perfectly flat during that hour.

Higher resolution on the drop:

select sum(count) as impressions, hour(timestamp) as hour, minute(timestamp) as minute from bannerimpressions where times
tamp between '20161201080000' and '20161201095959' group by hour(timestamp), minute(timestamp);

+-------------+------+--------+
| impressions | hour | minute |
+-------------+------+--------+
|       23120 |    8 |      0 |
|       22240 |    8 |      1 |
|       22430 |    8 |      2 |
|       23060 |    8 |      3 |
|       22040 |    8 |      4 |
|       23270 |    8 |      5 |
|       21680 |    8 |      6 |
|       20180 |    8 |      7 |
|       18490 |    8 |      8 |
|       15230 |    8 |      9 |
|        2200 |    8 |     10 |
|        1630 |    8 |     11 |
|        1400 |    8 |     12 |
|        1200 |    8 |     13 |
|        1370 |    8 |     14 |
|        1160 |    8 |     15 |
|        1110 |    8 |     16 |
|        1200 |    8 |     17 |
|        1170 |    8 |     18 |
|        1190 |    8 |     19 |
|        1100 |    8 |     20 |
|        1260 |    8 |     21 |
|         930 |    8 |     22 |
|        1210 |    8 |     23 |
|        1230 |    8 |     24 |
|        1190 |    8 |     25 |
|        1340 |    8 |     26 |
|        1090 |    8 |     27 |
|        1200 |    8 |     28 |
|         990 |    8 |     29 |
|        1090 |    8 |     30 |
|        1030 |    8 |     31 |
|        1260 |    8 |     32 |
|        1320 |    8 |     33 |
|        1460 |    8 |     34 |
|        1180 |    8 |     35 |
|        1040 |    8 |     36 |
|        1060 |    8 |     37 |
|        1120 |    8 |     38 |
|        1150 |    8 |     39 |
|        1110 |    8 |     40 |
|        1390 |    8 |     41 |
|        1220 |    8 |     42 |
|        1290 |    8 |     43 |
|        1250 |    8 |     44 |
|        1330 |    8 |     45 |
|        1360 |    8 |     46 |
|        1160 |    8 |     47 |
|        1260 |    8 |     48 |
|        1250 |    8 |     49 |
|        1210 |    8 |     50 |
|        1140 |    8 |     51 |
|        1440 |    8 |     52 |
|        1280 |    8 |     53 |
|        1230 |    8 |     54 |
|        1320 |    8 |     55 |
|        1100 |    8 |     56 |
|        1080 |    8 |     57 |
|        1310 |    8 |     58 |
|        1180 |    8 |     59 |
|        1400 |    9 |      0 |
|        1600 |    9 |      1 |
|        5380 |    9 |      2 |
|        6750 |    9 |      3 |
|       11370 |    9 |      4 |
|       25610 |    9 |      5 |
|       25600 |    9 |      6 |
|       25220 |    9 |      7 |
|       26080 |    9 |      8 |
|       26020 |    9 |      9 |
|       26910 |    9 |     10 |
|       26480 |    9 |     11 |
|       26670 |    9 |     12 |
|       26750 |    9 |     13 |
|       26340 |    9 |     14 |
|       26420 |    9 |     15 |
|       26720 |    9 |     16 |
|       27480 |    9 |     17 |
|       26490 |    9 |     18 |
|       26660 |    9 |     19 |
|       27160 |    9 |     20 |
|       26910 |    9 |     21 |
|       26960 |    9 |     22 |
|       27200 |    9 |     23 |
|       27010 |    9 |     24 |
|       27070 |    9 |     25 |
|       25610 |    9 |     26 |
|       27010 |    9 |     27 |
|       28170 |    9 |     28 |
|       27650 |    9 |     29 |
|       26550 |    9 |     30 |
|       26320 |    9 |     31 |
|       27290 |    9 |     32 |
|       27510 |    9 |     33 |
|       27210 |    9 |     34 |
|       26740 |    9 |     35 |
|       26760 |    9 |     36 |
|       27470 |    9 |     37 |
|       27730 |    9 |     38 |
|       28270 |    9 |     39 |
|       27920 |    9 |     40 |
|       27770 |    9 |     41 |
|       27710 |    9 |     42 |
|       28270 |    9 |     43 |
|       27800 |    9 |     44 |
|       27070 |    9 |     45 |
|       27660 |    9 |     46 |
|       27490 |    9 |     47 |
|       26930 |    9 |     48 |
|       26200 |    9 |     49 |
|       26760 |    9 |     50 |
|       26630 |    9 |     51 |
|       27110 |    9 |     52 |
|       26940 |    9 |     53 |
|       26320 |    9 |     54 |
|       27810 |    9 |     55 |
|       27660 |    9 |     56 |
|       27200 |    9 |     57 |
|       26870 |    9 |     58 |
|       26270 |    9 |     59 |
+-------------+------+--------+

line counts from beaconImpressions logs in the banner impression pipeline:

2016-12-01T08:00 5494
2016-12-01T08:01 5296
2016-12-01T08:02 5330
2016-12-01T08:03 5365
2016-12-01T08:04 5235
2016-12-01T08:05 5342
2016-12-01T08:06 5184
2016-12-01T08:07 4874
2016-12-01T08:08 4107
2016-12-01T08:09 3391
2016-12-01T08:10 1202
2016-12-01T08:11 1047
2016-12-01T08:12 1051
2016-12-01T08:13 968
2016-12-01T08:14 1040
2016-12-01T08:15 945
2016-12-01T08:16 940
2016-12-01T08:17 988
2016-12-01T08:18 941
2016-12-01T08:19 988
2016-12-01T08:20 954
2016-12-01T08:21 944
2016-12-01T08:22 922
2016-12-01T08:23 967
2016-12-01T08:24 952
2016-12-01T08:25 994
2016-12-01T08:26 967
2016-12-01T08:27 940
2016-12-01T08:28 1000
2016-12-01T08:29 977
2016-12-01T08:30 937
2016-12-01T08:31 919
2016-12-01T08:32 969
2016-12-01T08:33 973
2016-12-01T08:34 945
2016-12-01T08:35 927
2016-12-01T08:36 936
2016-12-01T08:37 892
2016-12-01T08:38 945
2016-12-01T08:39 950
2016-12-01T08:40 919
2016-12-01T08:41 974
2016-12-01T08:42 966
2016-12-01T08:43 993
2016-12-01T08:44 1017
2016-12-01T08:45 975
2016-12-01T08:46 969
2016-12-01T08:47 931
2016-12-01T08:48 971
2016-12-01T08:49 980
2016-12-01T08:50 930
2016-12-01T08:51 897
2016-12-01T08:52 951
2016-12-01T08:53 939
2016-12-01T08:54 923
2016-12-01T08:55 929
2016-12-01T08:56 959
2016-12-01T08:57 963
2016-12-01T08:58 952
2016-12-01T08:59 936
2016-12-01T09:00 977
2016-12-01T09:01 1046
2016-12-01T09:02 1477
2016-12-01T09:03 2010
2016-12-01T09:04 2943
2016-12-01T09:05 5361
2016-12-01T09:06 5534
2016-12-01T09:07 5513
2016-12-01T09:08 5710

DStrine raised the priority of this task from High to Unbreak Now!.Dec 1 2016, 8:59 PM

Just to note:

  • CentralNotice logs don't show any changes to banner or campaign settings at the edges of this window.
  • Nothing stands out in the server admin logs, either... Though maybe we're missing something. Copying here:

09:14 jynus: mysql restart and upgrade for es1018 T151995
09:06 jynus@tin: Synchronized wmf-config/db-eqiad.php: Depool es1018 (duration: 00m 48s)
08:57 jynus@tin: Synchronized wmf-config/db-eqiad.php: Repool es1014 (duration: 00m 45s)
08:51 ariel@tin: Finished deploy [dumps/dumps@2b35e77]: less logging, fix regression for db_user/password retrieval (duration: 00m 03s)
08:50 ariel@tin: Starting deploy [dumps/dumps@2b35e77]: less logging, fix regression for db_user/password retrieval
08:41 jynus@tin: Synchronized wmf-config/db-eqiad.php: Repool es1011 (duration: 00m 48s)
08:10 jynus@tin: Synchronized wmf-config/db-eqiad.php: Repool es1016 (duration: 00m 45s)
07:57 elukey@tin: Finished deploy [analytics/pivot/deploy@0513a6e]: (no message) (duration: 00m 02s)

Filtering logs of requests to Special:BannerLoader show a notorious drop-off in this window for FR-targeted countries. This means that browsers in these countries practically stopped requesting banners during this window.

This eliminates the possibility that a change in banner content caused this. However, it could still be an issue with ResourceLoader or GeoIP.

From Hive:

SELECT
  count(*), SUBSTR(ts,1,16) as time
FROM
  wmf.webrequest
WHERE
  year=2016
  AND month=12
  AND day=1
  AND hour= 8
  AND geocoded_data['country_code'] in ('AU', 'CA', 'IE', 'NZ', 'US', 'GB')
  AND uri_query LIKE "%BannerLoader%"
  AND agent_type = 'user'
GROUP BY
  SUBSTR(ts,1,16);

_c0	time
15737	2016-12-01 08:00
15882	2016-12-01 08:01
15865	2016-12-01 08:02
15870	2016-12-01 08:03
15577	2016-12-01 08:04
15615	2016-12-01 08:05
15399	2016-12-01 08:06
15116	2016-12-01 08:07
14907	2016-12-01 08:08
12946	2016-12-01 08:09
1076	2016-12-01 08:10
610	2016-12-01 08:11
433	2016-12-01 08:12
348	2016-12-01 08:13
332	2016-12-01 08:14
245	2016-12-01 08:15
242	2016-12-01 08:16
218	2016-12-01 08:17
206	2016-12-01 08:18
223	2016-12-01 08:19
178	2016-12-01 08:20
197	2016-12-01 08:21
163	2016-12-01 08:22
160	2016-12-01 08:23
122	2016-12-01 08:24
144	2016-12-01 08:25
156	2016-12-01 08:26
116	2016-12-01 08:27
131	2016-12-01 08:28
127	2016-12-01 08:29
126	2016-12-01 08:30
109	2016-12-01 08:31
134	2016-12-01 08:32
123	2016-12-01 08:33
134	2016-12-01 08:34
123	2016-12-01 08:35
97	2016-12-01 08:36
100	2016-12-01 08:37
104	2016-12-01 08:38
94	2016-12-01 08:39
69	2016-12-01 08:40
98	2016-12-01 08:41
94	2016-12-01 08:42
67	2016-12-01 08:43
89	2016-12-01 08:44
92	2016-12-01 08:45
83	2016-12-01 08:46
69	2016-12-01 08:47
107	2016-12-01 08:48
90	2016-12-01 08:49
83	2016-12-01 08:50
85	2016-12-01 08:51
71	2016-12-01 08:52
80	2016-12-01 08:53
83	2016-12-01 08:54
71	2016-12-01 08:55
59	2016-12-01 08:56
74	2016-12-01 08:57
59	2016-12-01 08:58
67	2016-12-01 08:59

What about the non-geotargeted, did they drop off as well? It seems like the dropoff in the country list above is roughly similar to the total dropoff percentage-wise.

What about the non-geotargeted, did they drop off as well?

Great point, we should check that.

Also (noted by @cwdent on IRC) GeoIP cookies last a year. So if all of a sudden, servers stopped sending cookies, we wouldn't see this dropoff. However, I think we would if all of a sudden they started setting incorrect (say, w/ unknown country) cookies...

There wasn't much non-geotargeted stuff runing - just CzechWikiCon and WikiFranca_contribution_month2016_call. CzechWikiCon seems unaffected, and WikiFranca seems affected (but has really low numbers). However, there IS a geotargeted one on eswiki that doesn't seem affected at all

Bannerhourcount

Not geotargeted:

CzechWikiCon7504
CzechWikiCon8580
CzechWikiCon9701
WikiFranca_contribution_month2016_call736
WikiFranca_contribution_month2016_call819
WikiFranca_contribution_month2016_call970

Geotargeted:

WLAfrica_201672344
WLAfrica_201681000
WLAfrica_201693031
WMES_Wiki_Loves_Folk_2016720127
WMES_Wiki_Loves_Folk_2016839854
WMES_Wiki_Loves_Folk_2016949539

All the datacenters show the problem, but to different degrees:
codfw: 92% drop
eqiad: 97% drop
esams: 86% drop
ulsfo: 65% drop

codfw08:05:00 AM2253
codfw08:06:00 AM2177
codfw08:07:00 AM2208
codfw08:08:00 AM1827
codfw08:09:00 AM1582
codfw08:10:00 AM337
codfw08:11:00 AM211
codfw08:12:00 AM215
codfw08:13:00 AM183
codfw08:14:00 AM190
eqiad08:05:00 AM8371
eqiad08:06:00 AM8250
eqiad08:07:00 AM8103
eqiad08:08:00 AM6646
eqiad08:09:00 AM5662
eqiad08:10:00 AM743
eqiad08:11:00 AM435
eqiad08:12:00 AM339
eqiad08:13:00 AM251
eqiad08:14:00 AM295
esams08:05:00 AM25010
esams08:06:00 AM24685
esams08:07:00 AM20347
esams08:08:00 AM18164
esams08:09:00 AM12409
esams08:10:00 AM4080
esams08:11:00 AM3538
esams08:12:00 AM3429
esams08:13:00 AM3303
esams08:14:00 AM3422
ulsfo08:05:00 AM16979
ulsfo08:06:00 AM16802
ulsfo08:07:00 AM16674
ulsfo08:08:00 AM14939
ulsfo08:09:00 AM13576
ulsfo08:10:00 AM6480
ulsfo08:11:00 AM6073
ulsfo08:12:00 AM6083
ulsfo08:13:00 AM6044
ulsfo08:14:00 AM5956

@Ejegg
Nice discoveries! I don't think WikiFranca_contribution_month2016_call was affected. If you zoom out, the 08:00 numbers fit into the rest of the pattern:

select
    banner,
    sum(count) as impressions,
    day(timestamp),
    hour(timestamp) as hour
from bannerimpressions
where
    timestamp > '20161127000000'
    and banner in ('WikiFranca_contribution_month2016_call')
group by
    banner,
    day(timestamp),
    hour(timestamp);

+----------------------------------------+-------------+----------------+------+
| banner                                 | impressions | day(timestamp) | hour |
+----------------------------------------+-------------+----------------+------+
| WikiFranca_contribution_month2016_call |          10 |              1 |    0 |
| WikiFranca_contribution_month2016_call |          50 |              1 |    3 |
| WikiFranca_contribution_month2016_call |          30 |              1 |    4 |
| WikiFranca_contribution_month2016_call |          40 |              1 |    5 |
| WikiFranca_contribution_month2016_call |          40 |              1 |    6 |
| WikiFranca_contribution_month2016_call |          60 |              1 |    7 |
| WikiFranca_contribution_month2016_call |          30 |              1 |    8 |
| WikiFranca_contribution_month2016_call |          50 |              1 |    9 |
| WikiFranca_contribution_month2016_call |         180 |              1 |   10 |
| WikiFranca_contribution_month2016_call |          90 |              1 |   11 |
| WikiFranca_contribution_month2016_call |          50 |              1 |   12 |
| WikiFranca_contribution_month2016_call |         100 |              1 |   13 |
| WikiFranca_contribution_month2016_call |          20 |              1 |   14 |
| WikiFranca_contribution_month2016_call |          30 |              1 |   15 |
| WikiFranca_contribution_month2016_call |          20 |              1 |   16 |
| WikiFranca_contribution_month2016_call |          10 |              1 |   17 |
| WikiFranca_contribution_month2016_call |          10 |              1 |   18 |
| WikiFranca_contribution_month2016_call |          10 |              1 |   20 |
| WikiFranca_contribution_month2016_call |          10 |              1 |   21 |
| WikiFranca_contribution_month2016_call |          10 |              1 |   22 |
| WikiFranca_contribution_month2016_call |         120 |             30 |   17 |
| WikiFranca_contribution_month2016_call |         100 |             30 |   18 |
| WikiFranca_contribution_month2016_call |          70 |             30 |   19 |
| WikiFranca_contribution_month2016_call |         120 |             30 |   20 |
| WikiFranca_contribution_month2016_call |          50 |             30 |   21 |
| WikiFranca_contribution_month2016_call |          20 |             30 |   22 |
| WikiFranca_contribution_month2016_call |          60 |             30 |   23 |
+----------------------------------------+-------------+----------------+------+

Ahh, ok. So my burning question is now "How was the geotargeted WMES_Wiki_Loves_Folk_2016 unaffected?"

Looks like it's language-specific: en, de, fr, it, and simple are affected - none of the others are.

langhourcount
ar7378
ar8438
ar9546
ca72554
ca84143
ca94358
cs7507
cs8586
cs9711
da78599
da89836
da911875
de7744156
de8133495
de91039503
en71940837
en8311084
en91724790
es758956
es899938
es9120584
fr710833
fr81692
fr915529
he7244
he8256
he9280
hu78079
hu810329
hu911243
it763744
it87884
it9107189
ja7317844
ja8308715
ja9292575
lv71262
lv81471
lv91423
nl76500
nl811400
nl911592
no77537
no812408
no912648
pt7162
pt8147
pt9179
ro76122
ro88216
ro99516
simple72009
simple8308
simple92621
sk72650
sk82928
sk93385
sv721513
sv830859
sv934423
zh710343
zh89574
zh98765

Made two partly-processed tables in Hive to be able to run queries more quickly.

In the andyrussg database: t152122beaconimpression20161201, t152122bannerloader20161201.

Here's what I did to create them:

use andyrussg;

CREATE TABLE
  t152122bannerloader20161201
AS SELECT
  *,
  SUBSTR(ts,1,16) as time,
  parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'campaign')  as campaign,
  parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'banner')  as banner,
  parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'uselang')  as uselang,
  parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'debug')  as debug
FROM
  wmf.webrequest
WHERE
  year=2016
  AND month=12
  AND day=1
  AND agent_type = 'user'
  AND uri_query LIKE "%BannerLoader%";
  
CREATE TABLE
  t152122beaconimpression20161201
AS SELECT
  *,
  SUBSTR(ts,1,16) as time,
  parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'campaign')  as campaign,
  parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'banner')  as banner,
  parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'uselang')  as uselang,
  parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'debug')  as debug,
  parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'db')  as db,
  parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'device')  as device,
  parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'statusCode')  as statusCode,
  parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'country')  as country,
  parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'anonymous')  as anonymous
FROM
  wmf.webrequest
WHERE
  year=2016
  AND month=12
  AND day=1
  AND agent_type = 'user'
  AND uri_path LIKE '%beacon%impression%';

I'm seeing signs that this happened again last night at the same hour.
Can someone remind me why we didn't think the activity in the server admin logs was relevant?

I'm seeing signs that this happened again last night at the same hour.
Can someone remind me why we didn't think the activity in the server admin logs was relevant?

  1. It's hard to see how DB server de/re-pooling would affect RL/geoip (this is a weak argument)
  2. The opposite of what you'd expect happened. A re-pool event correlated with the start of the dip, and a de-pool even with the end. There were also de- and re-pools in the middle of the dip, and after it.
  3. The time correlation isn't that good. The dip started gradually over the course of a few (3ish) minutes, and recovered gradually as well. The events that appeared to be correlated were in fact logged after the recovery had already begun.

I guess we'll find out tomorrow: there probably (hopefully!) won't be anyone messing with DB servers at 9am CET on Saturday morning, or much SAL activity of any kind really. If the dip happens again in the absence of SAL activity, that's a pretty strong indicator that it's probably unrelated.

Dip for 8/2:

07:57:00 AM21295
07:58:00 AM21267
07:59:00 AM21290
08:00:00 AM21118
08:01:00 AM14335
08:02:00 AM11844
08:03:00 AM11420
08:04:00 AM10994
08:05:00 AM10015
08:06:00 AM7631
08:07:00 AM4729
08:08:00 AM4413
08:09:00 AM3289
08:10:00 AM2972
08:11:00 AM3008
08:12:00 AM2997
08:13:00 AM2885
08:14:00 AM2961
08:15:00 AM3000
08:16:00 AM3010
08:17:00 AM2985
08:18:00 AM2931
08:19:00 AM2861
08:20:00 AM2949
08:21:00 AM2982
08:22:00 AM2988
08:23:00 AM2930
08:24:00 AM3080
08:25:00 AM3027
08:26:00 AM2917
08:27:00 AM3015
08:28:00 AM3001
08:29:00 AM2977
08:30:00 AM2887
08:31:00 AM2977
08:32:00 AM2994
08:33:00 AM2984
08:34:00 AM3025
08:35:00 AM3010
08:36:00 AM2964
08:37:00 AM2930
08:38:00 AM3003
08:39:00 AM2937
08:40:00 AM3038
08:41:00 AM2849
08:42:00 AM2934
08:43:00 AM2879
08:44:00 AM2940
08:45:00 AM2879
08:46:00 AM2993
08:47:00 AM2920
08:48:00 AM3034
08:49:00 AM3047
08:50:00 AM3133
08:51:00 AM3052
08:52:00 AM3100
08:53:00 AM3069
08:54:00 AM2978
08:55:00 AM3117
08:56:00 AM3024
08:57:00 AM3041
08:58:00 AM3061
08:59:00 AM3086
09:00:00 AM3269
09:01:00 AM14981
09:02:00 AM23320
09:03:00 AM23255
09:04:00 AM25204
09:05:00 AM26367

Current theory:

  1. Something somewhere poisons (=writes bad data to) the choiceData cache at 08:00 UTC
  2. This causes the output of the choiceData RL module to change (this could explain the drop in 304 satisfaction rate and If-None-Match percentage around 8:10 in this graph)
  3. This change rolls out gradually over 5 minutes (this could explain the gradual drop in impressions from 8:00 to 8:05)
  4. choiceData is cached in objectcache for an hour, so the cache expires at 09:00 UTC
  5. The cache refreshes itself with correct data
  6. This causes the output of the choiceData RL module to change again (this could explain the second drop in the 304/INM graphs around 9:07)
  7. This change rolls out gradually over 5 minutes (this could explain the gradual ramp up in impressions from 9:00 to 9:05)

Next steps to diagnose:

  1. @Ejegg wrote a PhantomJS script (P4559) that he intends to run every 5 minutes overnight. It will fetch an article page and log mw.centralNotice.data, mw.centralNotice.choiceData, the version hash of ext.centralNotice.choiceData, and the cookie header
  2. We also plan to run a separate script that dumps load.php?modules=ext.centralNotice.choiceData every 5 mins.
  3. @Krinkle has revived perflogbot in #wikimedia-perf-bots, which checks the startup module every 5 minutes and logs changes to module hashes. Unfortunately this bot went down on Nov 30, so we don't have any data during the Dec 1 and Dec 2 incidents. This bot runs against nlwiki, not enwiki, so it's not certain if we'll see something in its logs, but we might.
  4. Put all these in place and wait and see if the incident repeats itself on Dec 3 at 08:00 UTC. If it does, look at the captured data and try to figure out what happened.

A watched pot never boils. No banner dip observed December 3 08:00 UTC.

A watched pot never boils. No banner dip observed December 3 08:00 UTC.

I had a script (still running it actually) that captures the contents of the choiceData module every 5 mins. I extracted the version hashes, and they're all the same. So there were no choiceData changes at all in the last 24h or so.

I'm seeing signs that this happened again last night at the same hour.
Can someone remind me why we didn't think the activity in the server admin logs was relevant?

  1. It's hard to see how DB server de/re-pooling would affect RL/geoip (this is a weak argument)
  2. The opposite of what you'd expect happened. A re-pool event correlated with the start of the dip, and a de-pool even with the end. There were also de- and re-pools in the middle of the dip, and after it.
  3. The time correlation isn't that good. The dip started gradually over the course of a few (3ish) minutes, and recovered gradually as well. The events that appeared to be correlated were in fact logged after the recovery had already begun.

I guess we'll find out tomorrow: there probably (hopefully!) won't be anyone messing with DB servers at 9am CET on Saturday morning, or much SAL activity of any kind really. If the dip happens again in the absence of SAL activity, that's a pretty strong indicator that it's probably unrelated.

It seems pretty unlikely the SAL activity (DB pools/depools) is related indeed.

Unfortunately traffic patterns on Saturday are pretty different from workdays for all kinds of reasons, and may not tell us much either.

@AndyRussG here's the bash script I use to run the phantomjs as well as figuring out all the RL module versions and the raw choiceData: P4569
It parses out the module versions with this: https://packages.debian.org/search?keywords=jq

@jcrespo It's more than just an EventLogging issue - we're not seeing requests for the banners in the webrequest table, and we're seeing a corresponding drop in donations, so it's pretty clear people are not seeing the banners.

Oh hey, 08:00 UTC is 00:00 PST. Midnight cron job, or witching hour caching issue?

Change 325604 had a related patch set uploaded (by XenoRyet):
Adyen form for Canada

https://gerrit.wikimedia.org/r/325604

Ignore above. Wrong bug number in that patch.

Update: We've run a bunch more queries and talked this over on IRC a lot.

tl;dr: This is a server-side issue with ChoiceData, the info about possible campaigns that we send to browsers. I think it may well an edge-case bug in CentralNotice rather than an infrastructure outage or random data corruption.

Queries:
Mega-query for grouping on several factors: P4583 by @Ejegg
Main conclusion: The issue didn't affect any browser or OS disproportionately, except for a specific mobile Chrome version that may have caching issues, or may have odd proxied caching by a large mobile provider. Also, not related to geolocation. This fits with the theory of a server-side issue with ChoiceData.

Extraction of impression data for outages: P4584

First outage by language and datacenter: P4585
Conclusion: Apparently no significant relationship between outage and datacenter.

Both outages by languages and database name (wgDBname, a proxy for project): P4586
Conclusions:

  • The combination of these two factors show the greatest correlation to the outages that we've seen so far.
  • The lang/project combinations with visible outages are: de/dewiki, fr/frwiki, en/enwiki, it/itwiki and en/simplewiki.
  • The same combinations went down for both outages (2016-12-01 and 2016-12-01).
  • There are cases of no outage with some of these languages and projects in other combinations. For example, en/commons and es/enwiki were fine.
  • There may have been outages from many other combinations. We don't have enough impressions on most combinations to get meaningful data.

Discussion:

  • The nature of our GeoIP infrastructure (static lookup file on each Varnish host) makes a coordinated geolocation outage seem unlikely.
  • The duration of both outages (1 hr) corresponds to TTL of ChoiceData in memcache.
  • Language and project are the two factors that vary ChoiceData.
  • That both outages happened with the same lang/project combinations makes a WMF infrastructure outage or data corruption seem unlikely.
  • There is one data point that does not support an edge-case CentralNotice bug: timing. Since we don't see any changes to campaign or banner configurations in the logs close to the times of the outage, it's hard to imagine what could have triggered it.

Thanks so much, everyone, for helping out and chiming in!!!! :D

P.S. Pls add/correct me if I've missed any important points we discussed elsewhere, or if I've gotten stuff wrong (likely)... thx!!!

WAM 2016_en targeted simple English and other languages and projects with outages, but not high-traffic languages for which we've seen no outages. It ended 8 hrs before the first outage. What about some timezone misconfiguration somewhere, or a bug in how we're querying campaign start/end times, some corruption in the db for just this campaign, some other bug in the ChoiceData provider, or a combination of the above? Also, the campaign has an attention-getting space in its name...

Just to note, the above query (P4586) and conclusions are a little bit wrong, because db (enwiki, frwiki, etc.) doesn't really map to project in CN terms. I think the conclusions are still basically correct. I'll fix and re-run the query...

"Unbreak now" priority for more than a month (since T152122#2839731); no updates in this task since Dec 08.
@AndyRussG: Any updates to share? Is "Unbreak now" still a realistic priority? Please reset/adjust if not.

DStrine lowered the priority of this task from Unbreak Now! to Medium.Jan 2 2017, 2:22 PM

We haven't seen this in a while. I have set it to normal.

DStrine lowered the priority of this task from Medium to Low.