
[Bug] Many ReadingDepth validation errors logged
Open, Normal, Public

Description

There have been many ReadingDepth schema validation errors reported lately. E.g., https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.02.13/eventlogging?id=AWjnIl6_zpjgITg6W3pd&_g=h@44136fa. The event appears valid for the schema:

{
  "event": {
    "pageTitle": "夏潮聯合會",
    "namespaceId": 0,
    "skin": "vector",
    "isAnon": true,
    "pageToken": "24f1f4ad991185c25e05",
    "sessionToken": "87cdca2326ac87a04a11",
    "action": "pageUnloaded",
    "domInteractiveTime": 1807,
    "firstPaintTime": 26290,
    "default_sample": true,
    "totalLength": 368277,
    "visibleLength": 12969
  },
  "revision": 18201205,
  "schema": "ReadingDepth",
  "webHost": "www.google.com.hk",
  "wiki": "zhwiki"
}

The raw data is:

?%7B%22event%22%3A%7B%22pageTitle%22%3A%22%E5%A4%8F%E6%BD%AE%E8%81%AF%E5%90%88%E6%9C%83%22%2C%22namespaceId%22%3A0%2C%22skin%22%3A%22vector%22%2C%22isAnon%22%3Atrue%2C%22pageToken%22%3A%2224f1f4ad991185c25e05%22%2C%22sessionToken%22%3A%2287cdca2326ac87a04a11%22%2C%22action%22%3A%22pageUnloaded%22%2C%22domInteractiveTime%22%3A1807%2C%22firstPaintTime%22%3A26290%2C%22default_sample%22%3Atrue%2C%22totalLength%22%3A368277%2C%22visibleLength%22%3A12969%7D%2C%22revision%22%3A18201205%2C%22schema%22%3A%22ReadingDepth%22%2C%22webHost%22%3A%22www.google.com.hk%22%2C%22wiki%22%3A%22zhwiki%22%7D%3B    cp4028.ulsfo.wmnet    97829958    2019-02-13T13:53:34    2604:180:3:241::39d5    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"

The error is something like "Extra data: line 1 column 379 - line 1 column 380 (char 378 - 379)".

I think the extra user agent cruft at the end is expected. For example, here's the URL for an event logged from my browser:

https://en.wikipedia.org/beacon/event?%7B%22event%22%3A%7B%22pageTitle%22%3A%22Barack%20Obama%22%2C%22namespaceId%22%3A0%2C%22skin%22%3A%22minerva%22%2C%22isAnon%22%3Atrue%2C%22pageToken%22%3A%2260c18a32dfabbd3d6533%22%2C%22sessionToken%22%3A%228f50d028a7ae84693ed3%22%2C%22action%22%3A%22pageLoaded%22%2C%22domInteractiveTime%22%3A1196%2C%22firstPaintTime%22%3A1113%2C%22default_sample%22%3Atrue%7D%2C%22revision%22%3A18201205%2C%22schema%22%3A%22ReadingDepth%22%2C%22webHost%22%3A%22en.m.wikipedia.org%22%2C%22wiki%22%3A%22enwiki%22%7D;

Which becomes:

https://en.wikipedia.org/beacon/event?{"event":{"pageTitle":"Barack Obama","namespaceId":0,"skin":"minerva","isAnon":true,"pageToken":"60c18a32dfabbd3d6533","sessionToken":"8f50d028a7ae84693ed3","action":"pageLoaded","domInteractiveTime":1196,"firstPaintTime":1113,"default_sample":true},"revision":18201205,"schema":"ReadingDepth","webHost":"en.m.wikipedia.org","wiki":"enwiki"};

And finally:

{
  "event": {
    "pageTitle": "Barack Obama",
    "namespaceId": 0,
    "skin": "minerva",
    "isAnon": true,
    "pageToken": "60c18a32dfabbd3d6533",
    "sessionToken": "8f50d028a7ae84693ed3",
    "action": "pageLoaded",
    "domInteractiveTime": 1196,
    "firstPaintTime": 1113,
    "default_sample": true
  },
  "revision": 18201205,
  "schema": "ReadingDepth",
  "webHost": "en.m.wikipedia.org",
  "wiki": "enwiki"
}
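The decode-then-parse step above can be reproduced in a few lines of Python (a sketch using a shortened payload, not the full event):

```python
import json
from urllib.parse import unquote

# A shortened percent-encoded payload, as it appears after /beacon/event?
raw = (
    "%7B%22event%22%3A%7B%22pageTitle%22%3A%22Barack%20Obama%22%2C"
    "%22namespaceId%22%3A0%7D%2C%22schema%22%3A%22ReadingDepth%22%7D;"
)

decoded = unquote(raw)                   # percent-decoding leaves the trailing ';'
event = json.loads(decoded.rstrip(";"))  # strip the literal ';' before parsing
print(event["event"]["pageTitle"])       # Barack Obama
```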

Event Timeline

Restricted Application added subscribers: Cosine02, Aklapper. · View Herald Transcript · Feb 13 2019, 7:37 PM

In case it's useful, keep in mind that it's possible to query the webrequest table for more detail on the event in question:

SELECT *
FROM wmf.webrequest
WHERE year = 2019 AND month = 2 AND day = 13 AND hour = 13
AND uri_path = '/beacon/event'
AND uri_query LIKE '%24f1f4ad991185c25e05%'
ORDER BY dt
LIMIT 100;
Jdlrobson added subscribers: Ottomata, Jdlrobson.

The event appears valid for the schema:

"webHost": "www.google.com.hk",

The webHost is invalid. It's likely that a proxy on www.google.com.hk is malforming the data, which seems related to T212330. cc @Ottomata and co

If we filter out invalid events from webHosts we don't own, are there still a lot of ReadingDepth errors? That would be good to know.
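A minimal sketch of that filtering idea, assuming a suffix allowlist (the domain list below is illustrative, not the canonical Wikimedia allowlist):

```python
# Sketch of a webHost filter; the suffix tuple is illustrative,
# not the canonical Wikimedia domain allowlist.
WIKIMEDIA_SUFFIXES = (
    ".wikipedia.org", ".wikimedia.org", ".wiktionary.org",
    ".wikidata.org", ".mediawiki.org",
)

def is_wikimedia_host(web_host):
    """Return True if the event's webHost is on a domain we own."""
    host = web_host.lower().lstrip(".")
    return host.endswith(WIKIMEDIA_SUFFIXES)

print(is_wikimedia_host("en.m.wikipedia.org"))  # True
print(is_wikimedia_host("www.google.com.hk"))   # False
```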

phuedx added a subscriber: phuedx. · Feb 14 2019, 10:46 AM

The "Extra data:" error is raised by json.loads as it encounters the ; character at the end of the JSON string:

[0]
import json

json.loads('{"event":{"pageTitle":"Barack Obama","namespaceId":0,"skin":"minerva","isAnon":true,"pageToken":"60c18a32dfabbd3d6533","sessionToken":"8f50d028a7ae84693ed3","action":"pageLoaded","domInteractiveTime":1196,"firstPaintTime":1113,"default_sample":true},"revision":18201205,"schema":"ReadingDepth","webHost":"en.m.wikipedia.org","wiki":"enwiki"};')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python/3.7.2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/usr/local/Cellar/python/3.7.2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 342 (char 341)

AFAICT the function that should be decoding and parsing the QSON (query-string-encoded JSON) string expects that character not to be encoded, but it is (it's the %3B at the end of the raw event).
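For what it's worth, `json.JSONDecoder.raw_decode` parses the first complete JSON value and reports where it stopped, which tolerates any trailing junk, encoded or not (a sketch, not the actual EventLogging code):

```python
import json

decoder = json.JSONDecoder()
payload = '{"schema":"ReadingDepth","wiki":"enwiki"};'

# raw_decode returns the parsed object plus the index where parsing stopped,
# instead of raising "Extra data" on the trailing ';'.
event, end = decoder.raw_decode(payload)
print(event["wiki"], repr(payload[end:]))  # enwiki ';'

# A percent-encoded trailing ';' (%3B) is tolerated the same way.
event, end = decoder.raw_decode('{"schema":"ReadingDepth","wiki":"enwiki"}%3B')
print(event["wiki"])  # enwiki
```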

phuedx added a comment. · Edited · Feb 14 2019, 2:17 PM

Following @Tbayer's advice, I did a little digging in wmf.webrequest:

There were 253 events logged that didn't end with the expected terminating character sequence ("%7D;"), of which 246 ended with the %3B character sequence.

[0]
+----------------------+------------+-----------+
| n_encoded_semicolon  | n_invalid  |     n     |
+----------------------+------------+-----------+
| 246                  | 253        | 60840190  |
+----------------------+------------+-----------+

select
    sum(if(substring(uri_query, -3) == '%3B', 1, 0)) as n_encoded_semicolon,
    sum(if(substring(uri_query, -4) != '%7D\\;', 1, 0)) as n_invalid,
    sum(1) as n
from
    wmf.webrequest
where
    year = 2019
    and month = 2
    and day = 13
    and uri_path = '/beacon/event'
    and uri_query like '%ReadingDepth%'
;

Looking at the webHost field of the event capsule for those 246 events, Google, as well as one other site (vip.kuaimen.bid), does appear to be doing some sort of proxying:

[1]
+---------------------+------+--+
|        host         |  n   |
+---------------------+------+--+
| www.google.com.hk   | 134  |
| tr.m.wikipedia.org  | 41   |
| en.wikipedia.org    | 36   |
| de.wikipedia.org    | 14   |
| vip.kuaimen.bid     | 7    |
| es.wikipedia.org    | 5    |
| en.m.wikipedia.org  | 5    |
| tr.wikipedia.org    | 4    |
+---------------------+------+--+

with dataset as (
    select
        # The get_json_object UDF is very forgiving of the trailing ';' character but not the leading
        # '?' character.
        get_json_object(
            substring(reflect("java.net.URLDecoder", "decode", uri_query, "UTF-8"), 2),
            '$.webHost'
        ) as host
    from
        wmf.webrequest
    where
        year = 2019
        and month = 2
        and day = 13
        and uri_path = '/beacon/event'
        and uri_query like '%ReadingDepth%'
        and substring(uri_query, -3) == '%3B'
)

select
    host,
    sum(1) as n
from
    dataset
group by
    host
order by
    n desc
limit 10000
;
fdans moved this task from Incoming to Ops Week on the Analytics board. · Feb 18 2019, 4:46 PM

Unless I'm misunderstanding this, I'm assuming this is a problem with the ingestion, not the delivery?

Hm, unless I'm misunderstanding, this is a problem with the proxy sending bad request data?

Hm, unless I'm misunderstanding, this is a problem with the proxy sending bad request data?

Then is this a duplicate of T212330 ?

The "Extra data:" error is raised by json.loads as it encounters the ; character at the end of the JSON string:

Yeah, EventLogging only strips semicolons, not encoded semicolons: https://github.com/wikimedia/eventlogging/blob/master/eventlogging/parse.py#L114

We are leaning towards filtering out events coming from non-Wikimedia domains. If there's reason to try to include them, we could also try stripping %3B; that seems like it would work, no?
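A sketch of what stripping both forms could look like (`parse_qson` is a hypothetical helper for illustration; the real parser lives in eventlogging/parse.py):

```python
import json
import re
from urllib.parse import unquote

def parse_qson(uri_query):
    """Decode a query-string-encoded JSON payload, tolerating both a
    literal trailing ';' and its percent-encoded form '%3B'.
    Hypothetical helper, not the actual eventlogging/parse.py code."""
    q = uri_query.lstrip("?")
    # Strip any encoded semicolons before percent-decoding...
    q = re.sub(r"(?:%3B)+$", "", q, flags=re.IGNORECASE)
    # ...then strip any literal semicolons after decoding.
    return json.loads(unquote(q).rstrip(";"))

print(parse_qson("?%7B%22wiki%22%3A%22zhwiki%22%7D%3B"))  # {'wiki': 'zhwiki'}
print(parse_qson("?%7B%22wiki%22%3A%22enwiki%22%7D;"))    # {'wiki': 'enwiki'}
```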

phuedx added a comment.Apr 1 2019, 6:29 PM

We are leaning towards filtering out events coming from non-wikimedia domains.

I think that's a reasonable solution.

Milimetric moved this task from Ops Week to Incoming on the Analytics board. · Apr 1 2019, 8:37 PM
fdans moved this task from Incoming to Data Quality on the Analytics board. · Apr 4 2019, 5:17 PM
fdans triaged this task as Normal priority.