Pageview dumps incorrectly formatted, looks like a result of possibly malicious activity
Open, NormalPublic

Description

When parsing the pageview dumps, specifically https://dumps.wikimedia.org/other/pageviews/2015/2015-08/pageviews-20150821-210000.gz I'm seeing sections violating the expected 4 column format.

As an example

en 
&<script>document.vulnerable=true;</script>
 1 0
en 
&{document.vulnerable=true;};
 1 0
en 
<!--_--_--><script>document.vulnerable=true;</script><!--_--_-->
 1 0
en 
<![CDATA[<!--]]<script>document.vulnerable=true;//--></script>
 1 0
en 
<<script>document.vulnerable=true;</script>
 1 0
en 
<?_echo('<SCR)';echo('IPT>document.vulnerable=true</SCRIPT>');_?>
 1 0
en 
<HEAD><META_HTTP-EQUIV="CONTENT-TYPE"_CONTENT="text/html;_charset=UTF-7">_</HEAD>+ADw-SCRIPT+AD4-document.vulnerable=true;+ADw-/SCRIPT+AD4-
 1 0
en 
<HTML><BODY><?xml:namespace_prefix="t"_ns="urn:schemas-microsoft-com:time"><?import_namespace="t"_implementation="#default#time2"><t:set_attributeName="innerHTML"_to="XSS<SCRIPT_DEFER>document.vulnerable=true</SCRIPT>"></BODY></HTML>
 1 0
en 
<OBJECT_classid=clsid:ae24fdae-03c6-11d1-8b76-0080c744f389><param_name=url_value=javascript:document.vulnerable=true></OBJECT>
 1 0
en 
<XML_ID="xss"><I><B><IMG_SRC="javas<!--_-->cript:document.vulnerable=true"></B></I></XML><SPAN_DATASRC="#xss"_DATAFLD="B"_DATAFORMATAS="HTML"></SPAN>
 1 0
en 
<XML_ID=I><X><C><![CDATA[<IMG_SRC="javas]]<![CDATA[cript:document.vulnerable=true;">]]</C></X></xml><SPAN_DATASRC=#I_DATAFLD=C_DATAFORMATAS=HTML></SPAN>
 1 0
en 
<a_href="about:<script>document.vulnerable=true;</script>">
 1 0
en 
<a_href="javascript#document.vulnerable=true;">
 1 0
en 
<bgsound_src="javascript:document.vulnerable=true;">
 1 0
en 
<body_onload="document.vulnerable=true;">
 1 0
en 
<div_datafld="b"_dataformatas="html"_datasrc="#X"></div>
 1 0
en 
<div_onmouseover="document.vulnerable=true;">
 1 0
en 
<div_style="background-image:_url(javascript:document.vulnerable=true;);">
 1 0
en 
<div_style="behaviour:_url([link_to_code]);">
 1 0
en 
<div_style="binding:_url([link_to_code]);">
 1 0
en 
<div_style="width:_expression(document.vulnerable=true;);">
 1 0
en 
<img_dynsrc="javascript:document.vulnerable=true;">
 1 0
en 
<img_src="blah"onmouseover="document.vulnerable=true;">
 1 0
en 
<img_src="blah>"_onmouseover="document.vulnerable=true;">
 1 0
en 
<img_src="javascript:document.vulnerable=true;">
 1 0
en 
<img_src="livescript:document.vulnerable=true;">
 1 0
en 
<img_src="mocha:document.vulnerable=true;">
 1 0
en 
<img_src=&{document.vulnerable=true;};>
 1 0
en 
<input_type="image"_dynsrc="javascript:document.vulnerable=true;">
 1 0
en 
<link_rel="stylesheet"_href="javascript:document.vulnerable=true;">
 1 0
en 
<object_classid="clsid:..."_codebase="javascript:document.vulnerable=true;">
 1 0
en 
<style><!--</style><script>document.vulnerable=true;//--></script>
 1 0
en 
<style_type="text/javascript">document.vulnerable=true;</style>
 1 0
en 
<xml_id="X"><a><b><script>document.vulnerable=true;</script>;</b></a></xml>
 1 0
en 
<xml_src="javascript:document.vulnerable=true;">
 1 0
en 
[\xC0][\xBC]script>document.vulnerable=true;[\xC0][\xBC]/script>
 1 0
Nuria added a subscriber: Nuria.Sep 12 2016, 3:54 PM

This looks like an XSS on article title, we dump title as is and thus you see it this way. cc-ing security in case some cleanup needs to happen.

If this was a request that resulted in 404 I do not think it will be dumped as such, so this request must be returning something other than 404.

Nuria moved this task from Incoming to Radar on the Analytics board.
Nuria added a subscriber: MoritzMuehlenhoff.

"<" and ">" are invalid character to use in titles, and should result in a 400 status code. Does the pageviews include things that give a 400 result?

Nuria added a comment.EditedSep 13 2016, 7:21 PM

"<" and ">" are invalid character to use in titles, and should result in a 400 status code. Does the pageviews include things that give a 400 result?

No, otherwise logs will be full of garbage-like requests.

But it could be that < were "&lt"? as that one is allowed once it is urlencoded.

Nuria added a comment.EditedSep 13 2016, 7:49 PM

Also, I can see garbage like requests in pageview_hourly table, on that hour:

Select:
select page_title, user_agent_map from pageview_hourly where (page_title like '%<%') and (year=2015 and month=08 and day=21 and hour=20);

Dataset: https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly

<<script>document.vulnerable=true;</script>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<XML_ID=I><X><C><![CDATA[<IMG_SRC="javas]]<![CDATA[cript:document.vulnerable=true;">]]</C></X></xml><SPAN_DATASRC=#I_DATAFLD=C_DATAFORMATAS=HTML></SPAN>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<xml_src="javascript:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<input_type="image"_dynsrc="javascript:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<a_href="about:<script>document.vulnerable=true;</script>">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<img_src=&{document.vulnerable=true;};>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<OBJECT_classid=clsid:ae24fdae-03c6-11d1-8b76-0080c744f389><param_name=url_value=javascript:document.vulnerable=true></OBJECT>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<img_src="mocha:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<img_src="blah"onmouseover="document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<style><!--</style><script>document.vulnerable=true;//--></script>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<link_rel="stylesheet"_href="javascript:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}
�<l)� {"browser_major":"2","os_family":"Other","os_major":"-","device_family":"Spider","browser_family":"bingbot","os_minor":"-","wmf_app_version":"-"}

<img_src="javascript:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<body_onload="document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<![CDATA[<!--]]<script>document.vulnerable=true;//--></script>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<?_echo('<SCR)';echo('IPT>document.vulnerable=true</SCRIPT>');_?>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<div_style="binding:_url([link_to_code]);">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<object_classid="clsid:..."_codebase="javascript:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<!--_--_--><script>document.vulnerable=true;</script><!--_--_-->
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<XML_ID="xss"><I><B><IMG_SRC="javas<!--_-->cript:document.vulnerable=true"></B></I></XML><SPAN_DATASRC="#xss"_DATAFLD="B"_DATAFORMATAS="HTML"></SPAN>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<style_type="text/javascript">document.vulnerable=true;</style>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<HEAD><META_HTTP-EQUIV="CONTENT-TYPE"_CONTENT="text/html;_charset=UTF-7">_</HEAD>+ADw-SCRIPT+AD4-document.vulnerable=true;+ADw-/SCRIPT+AD4-
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<div_style="width:_expression(document.vulnerable=true;);">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<a_href="javascript#document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<img_src="livescript:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}
1078498053271< {"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

&<script>document.vulnerable=true;</script>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<img_dynsrc="javascript:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<HTML><BODY><?xml:namespace_prefix="t"_ns="urn:schemas-microsoft-com:time"><?import_namespace="t"_implementation="#default#time2"><t:set_attributeName="innerHTML"_to="XSS<SCRIPT_DEFER>document.vulnerable=true</SCRIPT>"></BODY></HTML>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<div_style="behaviour:_url([link_to_code]);">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<bgsound_src="javascript:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<xml_id="X"><a><b><script>document.vulnerable=true;</script>;</b></a></xml>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<div_datafld="b"_dataformatas="html"_datasrc="#X"></div>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<div_onmouseover="document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<div_style="background-image:_url(javascript:document.vulnerable=true;);">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<img_src="blah>"_onmouseover="document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}
Time taken: 0.179 seconds, Fetched: 36 row(s)

So those are definitely getting through w/o an error code

But it could be that < were "&lt"? as that one is allowed once it is urlencoded.

&lt is allowed, but &lt; (with a semi-colon at the end) is not allowed. I would assume the lack of semi-colon would make it not get converted to a < in the logs (but not sure). Url encoding should not affect it, as titles get url decoded by MediaWiki.

Perhaps there's some sort of endpoint which doesn't do the 400 response. Do you know what the original request url is?

New theory.

From what I understand, the analytics cluster determines page name by trying to parse the path of the url ( e.g. http://en.wikipedia.org/wiki/Foo -> Foo ). However, sometimes the path of the url is not the effective article name.

For example, consider:

This returns the meta Main_Page with a 200 result. My guess is that the analytics code views this as being a page named "<script>", where MW considers this the page that corresponds to a page_id of 79155 and ignores the url path title.

So assuming that's correct, this leaves the question of should we do anything about it? From a security perspective this doesn't matter (Although we could give a 400 in this case on the principle that the user is up to no good).

From an analytics perspective, this means that a very small number of page views are misattributed, which presumably is bad.

Nuria added a comment.Oct 31 2016, 2:53 PM

@Bawoff: right, this should be a 400 in mw end. We have driven similar changes recently so we return 404s from mw for "bad" requests that were returning 200 before. Such as requests for history for non existing pages.

Arguably from a mw prespective this is a good and valid request - curid parameter takes precedence over title parameter.

btw, I'm untagging from security since there is no security impact on MediaWiki.

So, where can we go from here? Should these be filtered out or left in?

ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board.Dec 1 2016, 2:41 PM
Nuria added a comment.Dec 1 2016, 4:36 PM

@Bawolff @ArielGlenn Valid requests and pageviews are different things. While these could be valid requests in any case should they return 200, which is what must have happened for them to be counted as pageviews and appear on the pageview files. A 4xx is in order. Since these are old files this issue might have been fixed on mediawiki already. We just need to verify that requests like these no longer return 200.

Nuria added a comment.Dec 1 2016, 8:57 PM

I added security again cause these requests should not return 200: https://meta.wikimedia.org/wiki/%3Cscript%3E?curid=79155

I am not sure if these could cause an XSS but given that what is really happening is an internal redirect to https://meta.wikimedia.org the request should be returning something other than "yes! valid request, 200 here-you-go"

dpatrick triaged this task as Normal priority.Dec 7 2016, 5:30 PM

btw, sometimes people use the curid parameter to get to a page with an invalid title (e.g. if a bug causes a page to have a non-unicode normalized title)

Tgr added a subscriber: Tgr.Dec 7 2016, 7:58 PM

These are pageview requests which result in the successful display of an existing, valid page. HTTP 200 is the only reasonable status code in such a case. There is no HTTP response code for "this content has a different canonical URL"; that's what rel=canonical is for. We return HTTP 200 for redirect pages for the same reason.

Anomie added a subscriber: Anomie.Dec 7 2016, 8:17 PM

There is no HTTP response code for "this content has a different canonical URL"; that's what rel=canonical is for.

301?

We return HTTP 200 for redirect pages for the same reason.

See T53736 and T20883. From discussion there, it seems like the main reason is that with a 3xx we'd have to redirect to Target?from=something (or set a very-short-lived cookie that we Vary on, or something else like that) and somehow turn that from=something into the "Redirected from" message (and without opening things up to crap like "Redirected from yo momma").

Nuria added a comment.Dec 7 2016, 8:35 PM

These are pageview requests which result in the successful display of an existing, valid page. HTTP 200 is the only reasonable status code in such a case.

Not really, that the user received a valid case is decoupled from http codes that should be sent. Note that a request that returns 404 might return a friendly page to the user, an example: (history request for non existing page) https://en.wikipedia.org/w/index.php?title=hgdhgfhg&action=history

The requests on this ticket are clearly invalid and in fact are an internal redirect to another page (internal rewrite, not sure how we call those). They should return 301 if nothing else. But really, should be 404 (seems to me). Either one.

Tgr added a comment.Dec 7 2016, 8:58 PM

Per RFC 2616,

  • (301) The requested resource has been assigned a new permanent URI and any future references to this resource SHOULD use one of the returned URIs. Clients with link editing capabilities ought to automatically re-link references to the Request-URI to one or more of the new references returned by the server, where possible. (...) The new permanent URI SHOULD be given by the Location field in the response. Unless the request method was HEAD, the entity of the response SHOULD contain a short hypertext note with a hyperlink to the new URI(s).
  • (404) The server has not found anything matching the Request-URI.

so using either of those while actually returning the content seems wrong. We could of course do a proper 301 redirect, or properly reject the request as invalid / not found, but as bawolff noted, it does have some uses. Also, curids are sometimes used as a kind of permalink that's not affected by renames (some wikis have it in the sidebar IIRC). Issuing 301 redirects wouldn't break that, but would reduce usability as the URL would be harder to copy/paste-

I note that all this has nothing to do with the original bug report, which is about a title encoding issue in the pageview reports. Whatever URL gets used, it should not be allowed to break the formatting of the pageview stats file - that's an internal issue that has nothing to do with response codes.

Tgr added a comment.Dec 7 2016, 9:10 PM

Filed T152628: Reject MediaWiki requests which use an URL with an invalid title but use some alternative mechanism to specify a valid page. As long as the special handling is limited to requests where the apparent page name is invalid, I guess there is no drawback to serving a HTTP 400.

Title sanitization in the code generating the pageview dumps should be fixed nevertheless.

Nuria added a comment.Dec 7 2016, 9:50 PM

(404) The server has not found anything matching the Request-URI.
so using either of those while actually returning the content seems wrong

mmm ...mediawiki has its own stand when it comes to urls pointing to non existing content that defies the standard as we return 200 for not existing urls all the time (seems to me): https://en.wikipedia.org/wiki/UglyDog -> 204

But anyways, I defer on this regard to mediawiki developers. My main point is that issues have to be fixed at the root to be fixed reliably. In this case a url for a non existing page that provides no content to the user (other than a warning) should not return a 200.

Tgr added a comment.Dec 7 2016, 10:18 PM

mmm ...mediawiki has its own stand when it comes to urls pointing to non existing content that defies the standard as we return 200 for not existing urls all the time (seems to me): https://en.wikipedia.org/wiki/UglyDog -> 204

I see that served with a 404.

In this case a url for a non existing page that provides no content to the user (other than a warning) should not return a 200.

I think you misunderstand the situation. https://meta.wikimedia.org/wiki/<script>?curid=79155 will serve the exact same content as https://meta.wikimedia.org/wiki/Main_Page. (It's not even that we show the main page because it's some sort of error - 79155 just happens to be the page ID of the main page. Other numbers will get you other pages.) It's just an alternate URL format (and one where the title of the page is completely unguessable from the URL).

There is no HTTP response code for "this content has a different canonical URL"; that's what rel=canonical is for.

301?

But cf. T106793: Pages with single quote in title are inaccessible by some clients (redirect loop)

Rather than letting this die, is there any sort of concensus on next steps here?

Tgr added a comment.Jan 30 2017, 6:11 PM

The application that creates pageview dumps should escape or filter newlines (and tabs and other unexpected characters that mess up document structure). If we think T152628: Reject MediaWiki requests which use an URL with an invalid title but use some alternative mechanism to specify a valid page is still worth doing after that, it should be an easy change.

Nuria added a comment.Jan 30 2017, 6:37 PM

The application that creates pageview dumps should escape or filter newlines (and tabs and other unexpected characters that mess up document structure)

Agreed.

But it is not only that what needs doing right? faulty requests that are returning 200 should return a different http code

Tgr added a comment.Jan 30 2017, 6:51 PM

But it is not only that what needs doing right? faulty requests that are returning 200 should return a different http code

That's T152628, or did you have something else in mind?

Restricted Application added a subscriber: TerraCodes. · View Herald TranscriptMay 29 2017, 3:54 PM