
[Bug] Many JSON decode ReadingDepth schema errors from wikiyy
Open, Needs Triage · Public


There are many "No JSON object could be decoded" ReadingDepth schema errors appearing in our dashboard. For example:

URL (encoded)
	?la=ru&lo=%2Fbeacon%2Fevent&	cp3042.esams.wmnet	439103450	2018-12-19T18:40:20	"-"
URL (decoded)
	?la=ru&lo=/beacon/event&{"event":{"pageTitle":"Лимфатическая_система","namespaceId":0,"skin":"minerva","isAnon":true,"pageToken":"e13d6df9fa8530b83849","sessionToken":"a212bac24ccfc05955f2","action":"pageUnloaded","domInteractiveTime":3135,"firstPaintTime":3024,"default_sample":true,"totalLength":12916,"visibleLength":12805},"revision":18201205,"schema":"ReadingDepth","webHost":"ru_m_wikiyy_com","wiki":"ruwiki"};=	cp3042.esams.wmnet	439103450	2018-12-19T18:40:20	"-"
{
  "event": {
    "pageTitle": "Лимфатическая_система",
    "namespaceId": 0,
    "skin": "minerva",
    "isAnon": true,
    "pageToken": "e13d6df9fa8530b83849",
    "sessionToken": "a212bac24ccfc05955f2",
    "action": "pageUnloaded",
    "domInteractiveTime": 3135,
    "firstPaintTime": 3024,
    "default_sample": true,
    "totalLength": 12916,
    "visibleLength": 12805
  },
  "revision": 18201205,
  "schema": "ReadingDepth",
  "webHost": "ru_m_wikiyy_com",
  "wiki": "ruwiki"
}

The webHost varies but often contains "wikiyy" and seems unusual.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Dec 19 2018, 6:55 PM

I think the problem is this part of the url:


The EventLogging parser expects the query string to start with the URL-encoded event. Here it doesn't. It looks like whatever wikiyy is has mirrored our sites (with the JavaScript), but is mangling URLs.
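To illustrate the failure mode, here is a minimal sketch of a decoder that assumes the whole query string is one URL-encoded JSON event. The function name `decode_query_event` and the delimiter handling are assumptions made for illustration, not the actual EventLogging code.

```python
import json
from urllib.parse import quote, unquote_plus


def decode_query_event(query_string):
    """Decode a query string that consists solely of a URL-encoded JSON event."""
    # Drop the leading '?' and trailing ';' delimiters, then undo the URL encoding.
    payload = unquote_plus(query_string.strip("?;"))
    return json.loads(payload)


event_json = '{"schema": "ReadingDepth", "wiki": "ruwiki"}'

# Well-formed beacon: the encoded event follows the '?' directly.
good = "?" + quote(event_json) + ";"
# Mangled beacon: extra parameters come first, as in the wikiyy request.
bad = "?la=ru&lo=%2Fbeacon%2Fevent&" + quote(event_json) + ";="

print(decode_query_event(good)["schema"])

try:
    decode_query_event(bad)
except ValueError as err:  # json.JSONDecodeError subclasses ValueError
    print("decode failed:", err)
```

With the extra `la=ru&lo=...&` prefix, the decoded string no longer begins with `{`, so the JSON decoder rejects it outright.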

Jdlrobson added a subscriber: Jdlrobson.

Can't we just whitelist this URI in EventLogging?

Whitelist what URI?

Hm, do you mean blacklist? We don't want to collect this data at all, right?

phuedx added a comment. Edited · Jan 2 2019, 4:48 PM

We don't want to collect this data at all, right?

In the case of the ReadingDepth instrumentation, no.

I think that the EventLogging processor is following the robustness principle well here and that it's the version of the EventLogging client (?!) that isn't, as it's making requests to non-conformant URLs. This brings me on to itself…

It appears to proxy content from Wikipedia and inject tracking scripts (I see references to Yandex Metrika and Google Ads in the page source). It even uses the Wikipedia branding on the Special:Login page, e.g. clicking on "Log in" in the hamburger menu takes me to the following URL: Something's not right, right?!

It is probably a good idea in general to have a whitelist of domains that we control from which we accept events. I'll add this to the Modern Event Platform Stream Intake Service use cases.

Actually on quick second thought...I think that's not possible? How would apps send events?

Hm, do you mean blacklist? We don't want to collect this data at all, right?

yep sorry for confusion! Some kind of list seems in order!

See also T197971 ("Virtual pageview refine should not refine data that does not come from wikimedia domains") for a similar issue, as well as the explanations at T188804.

Agree that it would be great to have a general solution to limit the logged EL data to WMF webhosts only. (Even if the non-WMF events could sometimes yield interesting information about third-party usage like in this case.)
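As a rough sketch of what limiting logged data to WMF web hosts could look like, the filter below keeps only events whose `webHost` matches an allowlist of domain suffixes. The suffix list and the function names are hypothetical, and `webHost` is reported by the client, so this would only screen out accidental traffic such as mirrors. How events from the apps should be treated is left open here.

```python
# Hypothetical allowlist; the real set of accepted domains would differ.
ALLOWED_SUFFIXES = (".wikipedia.org", ".wikimedia.org")


def is_allowed_web_host(web_host, allowed_suffixes=ALLOWED_SUFFIXES):
    """Return True if web_host is an allowed domain or a subdomain of one."""
    host = web_host.lower().rstrip(".")
    return any(
        host == suffix.lstrip(".") or host.endswith(suffix)
        for suffix in allowed_suffixes
    )


def filter_events(events):
    """Keep only events whose client-reported webHost passes the allowlist."""
    return [e for e in events if is_allowed_web_host(e.get("webHost", ""))]


events = [
    {"schema": "ReadingDepth", "webHost": "ru.m.wikipedia.org"},
    {"schema": "ReadingDepth", "webHost": "ru_m_wikiyy_com"},
]
print(filter_events(events))
```

Matching on a leading-dot suffix avoids accepting look-alike hosts that merely end in the same letters as an allowed domain.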

CC @Groceryheist for awareness, although it looks like this particular issue didn't affect the validity of the data used in the Reading time project per se.

Nuria added a subscriber: Nuria. · Jan 3 2019, 1:36 PM

This is a frequent occurrence, as Wikimedia's code base is used by many other sites as-is. In this case this is unlawful use of content and we will report it as such. The site will probably be taken down in a few days.

The endpoint needs to be able to ingest from "any" domain so that, as @Ottomata mentioned, the 4 clients we have for EventLogging (Android, iOS, JavaScript and PHP) can send events and have them accepted. Remember that there are no sessions and thus no authenticated clients. That does not mean that we cannot use a request dispatcher that filters (or throttles) EventLogging traffic that matches certain criteria. That request dispatcher would probably use a blacklist rather than a whitelist.

I betcha we could also use some kind of special secret key whitelisting. Even if not secure and easily spoofable, it would at least keep stuff like this out of event logs.
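A minimal sketch of that idea, assuming each client attaches a shared key in a hypothetical `clientKey` field and the intake side checks it against a small set of known keys. The field name, the key values, and `has_known_key` are all made up for illustration. As noted, a key shipped in public JavaScript is trivially copyable, so this only weeds out unmodified mirrors.

```python
import hmac

# Hypothetical per-client keys; real values would live in configuration.
KNOWN_CLIENT_KEYS = {
    "web": "example-web-key",
    "android": "example-android-key",
}


def has_known_key(event, known_keys=KNOWN_CLIENT_KEYS):
    """Return True if the event carries one of the known client keys."""
    supplied = event.get("clientKey")
    if not isinstance(supplied, str):
        return False
    supplied_bytes = supplied.encode("utf-8")
    # compare_digest gives a constant-time comparison for each candidate key.
    return any(
        hmac.compare_digest(supplied_bytes, key.encode("utf-8"))
        for key in known_keys.values()
    )


print(has_known_key({"schema": "ReadingDepth", "clientKey": "example-web-key"}))
print(has_known_key({"schema": "ReadingDepth"}))
```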

Milimetric moved this task from Incoming to Radar on the Analytics board. · Jan 3 2019, 6:35 PM