Page MenuHomePhabricator

Investigate source of null / missing If-Match headers in VE html2wt requests
Closed, ResolvedPublic8 Estimated Story Points

Description

The following dashboard shows occasional VE requests with null If-Match headers: https://logstash.wikimedia.org/#dashboard/temp/AVSmkco7Dxp7yus2S9IN

The fact that this triggers the 'etag missing' warning indicates that the header is seen as falsy in RB, so this must be an actual null, and not a string. I suspect that an empty string is converted to null in the request template, while evaluating this line: https://github.com/wikimedia/restbase/blob/ed9bab4823c077dba6a416eaccf9d65b0f8fce4c/v1/transform.yaml#L88.

In any case, this leaves the question why VE is sending some transform requests without If-Match, but with a revision. These requests are clearly not for new pages, so there must be a bug somewhere in VE. @Krenair, do you have any ideas what could cause this? Could this issue be triggered when VisualEditor re-requests content through the VE API, rather than directly from RB?

Event Timeline

Possibly related are transform requests that reach Parsoid without dp data in them; we've had around 200 of them in the last 24h: https://logstash.wikimedia.org/#dashboard/temp/AVTDBDtJDxp7yus25R9g . They come without a revision from the wikimedia/multi-http-client v1.0 user agent, so it's likely a VRS request.

Change 289585 had a related patch set uploaded (by Alex Monk):
Debug log strange-looking ETags being sent to RB

https://gerrit.wikimedia.org/r/289585

Change 289586 had a related patch set uploaded (by Alex Monk):
Debug log strange-looking ETags being sent to RB

https://gerrit.wikimedia.org/r/289586

Change 289587 had a related patch set uploaded (by Alex Monk):
Debug log strange-looking ETags being sent to RB

https://gerrit.wikimedia.org/r/289587

Change 289585 merged by jenkins-bot:
Debug log strange-looking ETags being sent to RB

https://gerrit.wikimedia.org/r/289585

Change 289587 merged by jenkins-bot:
Debug log strange-looking ETags being sent to RB

https://gerrit.wikimedia.org/r/289587

Change 289586 merged by jenkins-bot:
Debug log strange-looking ETags being sent to RB

https://gerrit.wikimedia.org/r/289586

Possibly related are transform requests that reach Parsoid without dp data in them

This sounds a lot like Flow, which afaik is the only client directly talking to Parsoid these days.

There have been no matches for the null etag check in RESTBase in the last seven days, so it looks pretty certain that RB does not return literal (or string) null etags in HTML responses.

Possibly related are transform requests that reach Parsoid without dp data in them

This sounds a lot like Flow, which afaik is the only client directly talking to Parsoid these days.

These are requests coming to RB, so it's not likely that's Flow

@Krenair, did you find anything else while looking into the VE extension?

Change 291123 had a related patch set uploaded (by Alex Monk):
Add debug schema for a VE issue

https://gerrit.wikimedia.org/r/291123

Change 291124 had a related patch set uploaded (by Alex Monk):
Track ETag header responses in special event logging schema

https://gerrit.wikimedia.org/r/291124

Change 291123 merged by jenkins-bot:
Add debug schema for a VE issue

https://gerrit.wikimedia.org/r/291123

Change 291124 merged by jenkins-bot:
Track ETag header responses in special event logging schema

https://gerrit.wikimedia.org/r/291124

Change 291143 had a related patch set uploaded (by Alex Monk):
Add debug schema for a VE issue

https://gerrit.wikimedia.org/r/291143

Change 291144 had a related patch set uploaded (by Alex Monk):
Track ETag header responses in special event logging schema

https://gerrit.wikimedia.org/r/291144

Change 291144 merged by jenkins-bot:
Track ETag header responses in special event logging schema

https://gerrit.wikimedia.org/r/291144

Change 291143 merged by jenkins-bot:
Add debug schema for a VE issue

https://gerrit.wikimedia.org/r/291143

Since that was deployed there have been about 200 instances of jqxhr.getResponseHeader( 'etag' ) returning null from 80 distinct user agents. That's either RESTBase not setting the header properly or browsers not getting the header correctly (e.g. maybe by not respecting RB's Access-Control-Expose-Headers)
Raw data can be found in the VET135171_15651962 table in EventLogging if you have access.

I can reproduce the null etag using the "UC Browser" Android app and pointing it at https://people.wikimedia.org/~krenair/T135171.html (note this page loads jQuery from third parties)
Edit: And getAllResponseHeaders also doesn't show 'etag' like it does when run in Chrome

All the others are Safari, with a few FF46 exceptions and an old IE.

Change 291349 had a related patch set uploaded (by Alex Monk):
Make VE RB URLs domain-relative

https://gerrit.wikimedia.org/r/291349

Change 291350 had a related patch set uploaded (by GWicke):
Use domain-relative URL for client-side RESTBase requests

https://gerrit.wikimedia.org/r/291350

Change 291350 abandoned by GWicke:
Use domain-relative URL for client-side RESTBase requests

Reason:
https://gerrit.wikimedia.org/r/291350 has a better variant.

https://gerrit.wikimedia.org/r/291350

We discussed this on IRC. It looks like the issue is a combination of:

  • VisualEditor on mobile domains (ex: en.m.wikipedia.org) making CORS requests to the main project domain (en.wikipedia.org).
  • Some browsers (especially webkit pre ca. 2012) not implementing Access-Control-Expose-Headers for CORS requests correctly.

@AlexMonk-WMF's patch changes the config to use domain-relative requests instead, which should avoid CORS requests altogether. Hopefully, this will avoid hitting this CORS bug, and also speed things up slightly by reusing the existing TLS connection.

Change 291349 merged by jenkins-bot:
Make VE RB URLs domain-relative

https://gerrit.wikimedia.org/r/291349

I looked through the data we captured via EventLogging and

1mysql:research@analytics-store.eqiad.wmnet [log]> select day(timestamp), hour(timestamp), count(*) from VET135171_15651962 where event_msg = 'getResponseHeader: null' group by hour(timestamp), day(timestamp) order by day(timestamp), hour(timestamp);
2+----------------+-----------------+----------+
3| day(timestamp) | hour(timestamp) | count(*) |
4+----------------+-----------------+----------+
5| 27 | 0 | 2 |
6| 27 | 1 | 3 |
7| 27 | 2 | 9 |
8| 27 | 3 | 6 |
9| 27 | 4 | 14 |
10| 27 | 5 | 3 |
11| 27 | 6 | 3 |
12| 27 | 7 | 6 |
13| 27 | 8 | 2 |
14| 27 | 9 | 6 |
15| 27 | 10 | 42 |
16| 27 | 11 | 14 |
17| 27 | 12 | 12 |
18| 27 | 13 | 9 |
19| 27 | 14 | 6 |
20| 27 | 15 | 10 |
21| 27 | 16 | 21 |
22| 27 | 17 | 23 |
23| 27 | 18 | 21 |
24| 27 | 19 | 9 |
25| 27 | 20 | 18 |
26| 27 | 21 | 12 |
27| 27 | 22 | 7 |
28| 27 | 23 | 13 |
29| 28 | 0 | 5 |
30| 28 | 1 | 8 |
31| 28 | 2 | 14 |
32| 28 | 3 | 14 |
33| 28 | 4 | 18 |
34| 28 | 5 | 14 |
35| 28 | 6 | 13 |
36| 28 | 7 | 13 |
37| 28 | 8 | 16 |
38| 28 | 9 | 12 |
39| 28 | 10 | 15 |
40| 28 | 11 | 29 |
41| 28 | 12 | 24 |
42| 28 | 13 | 21 |
43| 28 | 14 | 15 |
44| 28 | 15 | 24 |
45| 28 | 16 | 15 |
46| 28 | 17 | 18 |
47| 28 | 18 | 30 |
48| 28 | 19 | 20 |
49| 28 | 20 | 22 |
50| 28 | 21 | 35 |
51| 28 | 22 | 33 |
52| 28 | 23 | 24 |
53| 29 | 0 | 9 |
54| 29 | 1 | 6 |
55| 29 | 2 | 12 |
56| 29 | 3 | 30 |
57| 29 | 4 | 14 |
58| 29 | 5 | 16 |
59| 29 | 6 | 36 |
60| 29 | 7 | 19 |
61| 29 | 8 | 19 |
62| 29 | 9 | 23 |
63| 29 | 10 | 20 |
64| 29 | 11 | 27 |
65| 29 | 12 | 32 |
66| 29 | 13 | 33 |
67| 29 | 14 | 22 |
68| 29 | 15 | 22 |
69| 29 | 16 | 33 |
70| 29 | 17 | 19 |
71| 29 | 18 | 48 |
72| 29 | 19 | 32 |
73| 29 | 20 | 22 |
74| 29 | 21 | 27 |
75| 29 | 22 | 17 |
76| 29 | 23 | 13 |
77| 30 | 0 | 13 |
78| 30 | 1 | 14 |
79| 30 | 2 | 23 |
80| 30 | 3 | 28 |
81| 30 | 4 | 24 |
82| 30 | 5 | 47 |
83| 30 | 6 | 23 |
84| 30 | 7 | 17 |
85| 30 | 8 | 23 |
86| 30 | 9 | 25 |
87| 30 | 10 | 12 |
88| 30 | 11 | 14 |
89| 30 | 12 | 24 |
90| 30 | 13 | 34 |
91| 30 | 14 | 40 |
92| 30 | 15 | 24 |
93| 30 | 16 | 10 |
94| 30 | 17 | 1 |
95| 30 | 18 | 2 |
96| 30 | 19 | 1 |
97| 30 | 20 | 9 |
98| 30 | 21 | 3 |
99| 30 | 22 | 4 |
100| 30 | 23 | 1 |
101| 31 | 0 | 4 |
102| 31 | 1 | 2 |
103| 31 | 2 | 5 |
104| 31 | 3 | 3 |
105| 31 | 4 | 6 |
106| 31 | 5 | 1 |
107| 31 | 6 | 4 |
108| 31 | 8 | 3 |
109| 31 | 9 | 2 |
110| 31 | 11 | 3 |
111| 31 | 12 | 6 |
112| 31 | 13 | 4 |
113| 31 | 14 | 5 |
114| 31 | 15 | 9 |
115| 31 | 16 | 3 |
116+----------------+-----------------+----------+
117111 rows in set (0.43 sec)
shows it pretty clearly dropping significantly around the time of the deployment (roughly 2016-05-30 15:37).
Unfortunately it's not completely gone... Obviously there's caching involved, but also the problem does not appear to be strictly limited to the mobile domains.

@AlexMonk-WMF, lets give it a bit more time & re-check the data in two weeks or so, once caches should no longer be a factor.

mysql> select date(timestamp), count(*) from VET135171_15651962 where event_msg = 'getResponseHeader: null' group by date(timestamp) order by date(timestamp);
+-----------------+----------+
| date(timestamp) | count(*) |
+-----------------+----------+
| 2016-05-27      |      271 |
| 2016-05-28      |      452 |
| 2016-05-29      |      551 |
| 2016-05-30      |      416 |
| 2016-05-31      |       74 |
| 2016-06-01      |       75 |
| 2016-06-02      |       54 |
| 2016-06-03      |       64 |
| 2016-06-04      |       47 |
| 2016-06-05      |       61 |
| 2016-06-06      |       71 |
| 2016-06-07      |       91 |
| 2016-06-08      |       66 |
| 2016-06-09      |       69 |
| 2016-06-10      |       42 |
+-----------------+----------+
15 rows in set (0.54 sec)

I think there's a separate issue going on here still.

mysql> select hour(timestamp), sum(event_msg = 'getResponseHeader: null') as c, sum(event_msg != 'getResponseHeader: null') from VET135171_15651962 where month(timestamp) = 6 group by hour(timestamp) order by hour(timestamp), c;
+-----------------+------+---------------------------------------------+
| hour(timestamp) | c    | sum(event_msg != 'getResponseHeader: null') |
+-----------------+------+---------------------------------------------+
|               0 |    7 |                                       23382 |
|               1 |   10 |                                       22664 |
|               2 |   14 |                                       21911 |
|               3 |    9 |                                       19441 |
|               4 |   13 |                                       21281 |
|               5 |   23 |                                       26367 |
|               6 |   34 |                                       36739 |
|               7 |   33 |                                       44697 |
|               8 |   47 |                                       49070 |
|               9 |   42 |                                       50008 |
|              10 |   42 |                                       48007 |
|              11 |   49 |                                       49302 |
|              12 |   43 |                                       55461 |
|              13 |   38 |                                       59098 |
|              14 |   31 |                                       57470 |
|              15 |   33 |                                       49634 |
|              16 |   29 |                                       45694 |
|              17 |   21 |                                       43743 |
|              18 |   22 |                                       41981 |
|              19 |   38 |                                       38950 |
|              20 |   24 |                                       35324 |
|              21 |   22 |                                       29256 |
|              22 |   10 |                                       23984 |
|              23 |    7 |                                       20776 |
+-----------------+------+---------------------------------------------+
24 rows in set (1.46 sec)

The remaining issue appears to be about the same across browsers, across wikis, etc. and I think this should show that it's correlated with overall traffic. I can't help but wonder if RB is doing something like erroring without changing the status code away from 200.

It's still happening and I don't know why. See e.g. select day(timestamp), hour(timestamp), sum(event_msg = 'getResponseHeader: null') as c, sum(event_msg != 'getResponseHeader: null') from VET135171_15651962 where month(timestamp) = 7 group by day(timestamp), hour(timestamp) order by day(timestamp), hour(timestamp), c;
Am wondering if some more logging on the RESTBase side might be a good idea.

Change 306035 had a related patch set uploaded (by Jforrester):
Follow-up 51dd790 & c40070b: Drop T135171 debug tracking

https://gerrit.wikimedia.org/r/306035

Change 308009 had a related patch set uploaded (by Alex Monk):
Get rid of T135171 logging schema

https://gerrit.wikimedia.org/r/308009

Change 306035 merged by jenkins-bot:
Follow-up 51dd790 & c40070b: Drop T135171 debug tracking

https://gerrit.wikimedia.org/r/306035

Change 308009 merged by jenkins-bot:
Get rid of T135171 logging schema

https://gerrit.wikimedia.org/r/308009