
[Bug] Many ReadingDepth validation errors logged
Open, Normal, Public

Description

There have been many ReadingDepth schema validation errors reported lately. E.g., https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.02.13/eventlogging?id=AWjnIl6_zpjgITg6W3pd&_g=h@44136fa. The event appears valid for the schema:

{
  "event": {
    "pageTitle": "夏潮聯合會",
    "namespaceId": 0,
    "skin": "vector",
    "isAnon": true,
    "pageToken": "24f1f4ad991185c25e05",
    "sessionToken": "87cdca2326ac87a04a11",
    "action": "pageUnloaded",
    "domInteractiveTime": 1807,
    "firstPaintTime": 26290,
    "default_sample": true,
    "totalLength": 368277,
    "visibleLength": 12969
  },
  "revision": 18201205,
  "schema": "ReadingDepth",
  "webHost": "www.google.com.hk",
  "wiki": "zhwiki"
}

The raw data is:

?%7B%22event%22%3A%7B%22pageTitle%22%3A%22%E5%A4%8F%E6%BD%AE%E8%81%AF%E5%90%88%E6%9C%83%22%2C%22namespaceId%22%3A0%2C%22skin%22%3A%22vector%22%2C%22isAnon%22%3Atrue%2C%22pageToken%22%3A%2224f1f4ad991185c25e05%22%2C%22sessionToken%22%3A%2287cdca2326ac87a04a11%22%2C%22action%22%3A%22pageUnloaded%22%2C%22domInteractiveTime%22%3A1807%2C%22firstPaintTime%22%3A26290%2C%22default_sample%22%3Atrue%2C%22totalLength%22%3A368277%2C%22visibleLength%22%3A12969%7D%2C%22revision%22%3A18201205%2C%22schema%22%3A%22ReadingDepth%22%2C%22webHost%22%3A%22www.google.com.hk%22%2C%22wiki%22%3A%22zhwiki%22%7D%3B    cp4028.ulsfo.wmnet    97829958    2019-02-13T13:53:34    2604:180:3:241::39d5    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"

The error is something like "Extra data: line 1 column 379 - line 1 column 380 (char 378 - 379)".

I think the extra user agent cruft at the end is expected. For example, here's the URL for an event logged from my browser:

https://en.wikipedia.org/beacon/event?%7B%22event%22%3A%7B%22pageTitle%22%3A%22Barack%20Obama%22%2C%22namespaceId%22%3A0%2C%22skin%22%3A%22minerva%22%2C%22isAnon%22%3Atrue%2C%22pageToken%22%3A%2260c18a32dfabbd3d6533%22%2C%22sessionToken%22%3A%228f50d028a7ae84693ed3%22%2C%22action%22%3A%22pageLoaded%22%2C%22domInteractiveTime%22%3A1196%2C%22firstPaintTime%22%3A1113%2C%22default_sample%22%3Atrue%7D%2C%22revision%22%3A18201205%2C%22schema%22%3A%22ReadingDepth%22%2C%22webHost%22%3A%22en.m.wikipedia.org%22%2C%22wiki%22%3A%22enwiki%22%7D;

Which becomes:

https://en.wikipedia.org/beacon/event?{"event":{"pageTitle":"Barack Obama","namespaceId":0,"skin":"minerva","isAnon":true,"pageToken":"60c18a32dfabbd3d6533","sessionToken":"8f50d028a7ae84693ed3","action":"pageLoaded","domInteractiveTime":1196,"firstPaintTime":1113,"default_sample":true},"revision":18201205,"schema":"ReadingDepth","webHost":"en.m.wikipedia.org","wiki":"enwiki"};

And finally:

{
  "event": {
    "pageTitle": "Barack Obama",
    "namespaceId": 0,
    "skin": "minerva",
    "isAnon": true,
    "pageToken": "60c18a32dfabbd3d6533",
    "sessionToken": "8f50d028a7ae84693ed3",
    "action": "pageLoaded",
    "domInteractiveTime": 1196,
    "firstPaintTime": 1113,
    "default_sample": true
  },
  "revision": 18201205,
  "schema": "ReadingDepth",
  "webHost": "en.m.wikipedia.org",
  "wiki": "enwiki"
}
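
For reference, the percent-decoding above can be reproduced with Python's standard library. A minimal sketch (raw_query stands in for the uri_query value, shortened here for readability):

import urllib.parse

# raw_query stands in for the uri_query column value, i.e. the
# "?%7B%22event%22..." string above, shortened here for readability.
raw_query = '?%7B%22event%22%3A%7B%22pageTitle%22%3A%22Barack%20Obama%22%7D%7D;'

decoded = urllib.parse.unquote(raw_query)
print(decoded)
# ?{"event":{"pageTitle":"Barack Obama"}};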

Event Timeline

Restricted Application added subscribers: Cosine02, Aklapper. · View Herald Transcript · Feb 13 2019, 7:37 PM

In case it's useful, keep in mind that it's possible to query the webrequest table for more detail on the event in question:

SELECT *
FROM wmf.webrequest
WHERE year = 2019 AND month = 2 AND day = 13 AND hour = 13
AND uri_path = '/beacon/event'
AND uri_query LIKE '%24f1f4ad991185c25e05%'
ORDER BY dt
LIMIT 100;
Jdlrobson added subscribers: Ottomata, Jdlrobson.

The event appears valid for the schema:

"webHost": "www.google.com.hk",

The webHost is invalid. It's likely that a proxy at www.google.com.hk is mangling the data, which seems related to T212330. cc @Ottomata and co

If we filter out invalid events from webHosts we don't own, are there still a lot of ReadingDepth errors? That would be good to know.

phuedx added a subscriber: phuedx. Feb 14 2019, 10:46 AM

The "Extra data:" error is raised by json.loads as it encounters the ; character at the end of the JSON string:

[0]
import json

json.loads('{"event":{"pageTitle":"Barack Obama","namespaceId":0,"skin":"minerva","isAnon":true,"pageToken":"60c18a32dfabbd3d6533","sessionToken":"8f50d028a7ae84693ed3","action":"pageLoaded","domInteractiveTime":1196,"firstPaintTime":1113,"default_sample":true},"revision":18201205,"schema":"ReadingDepth","webHost":"en.m.wikipedia.org","wiki":"enwiki"};')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python/3.7.2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/usr/local/Cellar/python/3.7.2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 342 (char 341)

AFAICT the function that should be decoding and parsing the QSON (querystring-encoded JSON) string expects that character not to be encoded, but here it is: it's the %3B at the end of the raw event.
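
To illustrate (a rough sketch, not the actual parser code): if a literal trailing ; is stripped before percent-decoding, an encoded %3B slips through and becomes a ; after decoding, which is exactly the "Extra data" case above:

import json
import urllib.parse

def parse_qson(raw):
    # Sketch of the expected flow: drop the leading '?' and a literal
    # trailing ';', then percent-decode and parse the JSON.
    return json.loads(urllib.parse.unquote(raw.lstrip('?').rstrip(';')))

good = '?%7B%22wiki%22%3A%22enwiki%22%7D;'   # literal trailing ';'
bad = '?%7B%22wiki%22%3A%22enwiki%22%7D%3B'  # encoded trailing ';' (%3B)

print(parse_qson(good))  # {'wiki': 'enwiki'}
parse_qson(bad)          # raises json.decoder.JSONDecodeError: Extra data: ...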

phuedx added a comment (edited). Feb 14 2019, 2:17 PM

Following @Tbayer's advice, I did a little digging in wmf.webrequest:

There were 253 events logged that day that didn't end with the expected terminating character sequence ("%7D;"), of which 246 ended with an encoded semicolon (%3B).

[0]
+----------------------+------------+-----------+
| n_encoded_semicolon  | n_invalid  |     n     |
+----------------------+------------+-----------+
| 246                  | 253        | 60840190  |
+----------------------+------------+-----------+

select
    sum(if(substring(uri_query, -3) == '%3B', 1, 0)) as n_encoded_semicolon,
    sum(if(substring(uri_query, -4) != '%7D\;', 1, 0)) as n_invalid,
    sum(1) as n
from
    wmf.webrequest
where
    year = 2019
    and month = 2
    and day = 13
    and uri_path = '/beacon/event'
    and uri_query like '%ReadingDepth%'
;

Looking at the webHost field of the event capsule for those 246 events, Google does appear to be doing some sort of proxying, as does one other site (vip.kuaimen.bid):

[1]
+---------------------+------+--+
|        host         |  n   |
+---------------------+------+--+
| www.google.com.hk   | 134  |
| tr.m.wikipedia.org  | 41   |
| en.wikipedia.org    | 36   |
| de.wikipedia.org    | 14   |
| vip.kuaimen.bid     | 7    |
| es.wikipedia.org    | 5    |
| en.m.wikipedia.org  | 5    |
| tr.wikipedia.org    | 4    |
+---------------------+------+--+

with dataset as (
    select
        -- The get_json_object UDF is very forgiving of the trailing ';' character but not the leading
        -- '?' character.
        get_json_object(
            substring(reflect("java.net.URLDecoder", "decode", uri_query, "UTF-8"), 2),
            '$.webHost'
        ) as host
    from
        wmf.webrequest
    where
        year = 2019
        and month = 2
        and day = 13
        and uri_path = '/beacon/event'
        and uri_query like '%ReadingDepth%'
        and substring(uri_query, -3) == '%3B'
)

select
    host,
    sum(1) as n
from
    dataset
group by
    host
order by
    n desc
limit 10000
;
fdans moved this task from Incoming to Ops Week on the Analytics board. Feb 18 2019, 4:46 PM

Unless I'm misunderstanding this, I'm assuming this is a problem with the ingestion, not the delivery?

Hm, unless I'm misunderstanding, this is a problem with the proxy sending bad request data?

Hm, unless I'm misunderstanding, this is a problem with the proxy sending bad request data?

Then is this a duplicate of T212330?

The "Extra data:" error is raised by json.loads as it encounters the ; character at the end of the JSON string:

Yeah, EventLogging only strips semicolons, not encoded semicolons: https://github.com/wikimedia/eventlogging/blob/master/eventlogging/parse.py#L114

We are leaning towards filtering out events coming from non-Wikimedia domains. If there's a reason to try to include them, we could strip %3B as well; that seems like it would work, no?
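
Roughly something like this, I guess (a sketch of the idea, not a patch against parse.py):

import json
import urllib.parse

def parse_qson_lenient(raw):
    # Strip the leading '?' and then a trailing literal ';' or its
    # percent-encoded form '%3B' (case-insensitive) before decoding.
    raw = raw.lstrip('?')
    if raw.lower().endswith('%3b'):
        raw = raw[:-3]
    raw = raw.rstrip(';')
    return json.loads(urllib.parse.unquote(raw))

# Both variants now parse to the same object.
print(parse_qson_lenient('?%7B%22wiki%22%3A%22enwiki%22%7D;'))
print(parse_qson_lenient('?%7B%22wiki%22%3A%22enwiki%22%7D%3B'))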

phuedx added a comment. Apr 1 2019, 6:29 PM

We are leaning towards filtering out events coming from non-Wikimedia domains.

I think that's a reasonable solution.

Milimetric moved this task from Ops Week to Incoming on the Analytics board. Apr 1 2019, 8:37 PM
fdans triaged this task as Normal priority. Apr 4 2019, 5:17 PM
fdans moved this task from Incoming to Data Quality on the Analytics board.
phuedx added a comment. Jul 4 2019, 3:14 PM

In the last 24-hour period, 11 more ReadingDepth events with odd URLs have caused processing errors. See https://logstash.wikimedia.org/goto/dbe9713837e22a6d5a0d64f09fae1ec2. For these events, the QSON is terminated as we expect, but it's prefixed with either iorg_domain_internal=... or operator_user=....
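
For illustration (the prefix value and exact layout here are made up): any prefix before the encoded '{' means the decoded string no longer starts with a JSON value, so plain json.loads fails straight away rather than with "Extra data":

import json
import urllib.parse

# Hypothetical example: a proxy prepends its own parameter before the
# encoded JSON blob (the value and layout are made up for illustration).
prefixed = 'operator_user=foo%7B%22wiki%22%3A%22enwiki%22%7D;'

# The decoded string starts with 'operator_user=...' rather than '{',
# so json.loads fails at the very first character.
json.loads(urllib.parse.unquote(prefixed.rstrip(';')))
# json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)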

These failures are expected, though: the spec doesn't account for such prefixes, it just expects an encoded JSON blob. Is this something the client has no control over somehow? Actually, yeah, how are these URLs even generated?

phuedx added a comment. Thu, Aug 1, 3:16 PM

These failures are expected, though: the spec doesn't account for such prefixes, it just expects an encoded JSON blob. Is this something the client has no control over somehow? Actually, yeah, how are these URLs even generated?

I think you're looking for mw.eventLog.makeBeaconUrl, which is defined here: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/EventLogging/+/master/modules/ext.eventLogging/core.js#84
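
In Python terms it amounts to roughly this (an approximation for illustration, not the actual JS; the logged URLs above also end with a literal ';' after the encoded payload):

import json
import urllib.parse

def make_beacon_url(base, capsule):
    # Rough analogue of building a beacon URL: serialise the event capsule
    # to JSON and percent-encode it as the query string.
    encoded = urllib.parse.quote(json.dumps(capsule, separators=(',', ':')), safe='')
    return base + '?' + encoded

print(make_beacon_url(
    'https://en.wikipedia.org/beacon/event',
    {'schema': 'ReadingDepth', 'wiki': 'enwiki'},
))
# https://en.wikipedia.org/beacon/event?%7B%22schema%22%3A%22ReadingDepth%22%2C%22wiki%22%3A%22enwiki%22%7D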

I meant the operator_user= and iorg_domain_internal= prefixes, which are not added by us in any way, so these failures are expected, right?

phuedx added a comment (edited). Thu, Aug 1, 4:23 PM

Correct. Those prefixes must be added by whatever application is proxying the site.