Page MenuHomePhabricator

Pageview dumps incorrectly formatted, need to escape special characters
Closed, ResolvedPublic

Description

When parsing the pageview dumps, specifically https://dumps.wikimedia.org/other/pageviews/2015/2015-08/pageviews-20150821-210000.gz I'm seeing sections violating the expected 4 column format.

As an example

en 
&<script>document.vulnerable=true;</script>
 1 0
en 
&{document.vulnerable=true;};
 1 0
en 
<!--_--_--><script>document.vulnerable=true;</script><!--_--_-->
 1 0
en 
<![CDATA[<!--]]<script>document.vulnerable=true;//--></script>
 1 0
en 
<<script>document.vulnerable=true;</script>
 1 0
en 
<?_echo('<SCR)';echo('IPT>document.vulnerable=true</SCRIPT>');_?>
 1 0
en 
<HEAD><META_HTTP-EQUIV="CONTENT-TYPE"_CONTENT="text/html;_charset=UTF-7">_</HEAD>+ADw-SCRIPT+AD4-document.vulnerable=true;+ADw-/SCRIPT+AD4-
 1 0
en 
<HTML><BODY><?xml:namespace_prefix="t"_ns="urn:schemas-microsoft-com:time"><?import_namespace="t"_implementation="#default#time2"><t:set_attributeName="innerHTML"_to="XSS<SCRIPT_DEFER>document.vulnerable=true</SCRIPT>"></BODY></HTML>
 1 0
en 
<OBJECT_classid=clsid:ae24fdae-03c6-11d1-8b76-0080c744f389><param_name=url_value=javascript:document.vulnerable=true></OBJECT>
 1 0
en 
<XML_ID="xss"><I><B><IMG_SRC="javas<!--_-->cript:document.vulnerable=true"></B></I></XML><SPAN_DATASRC="#xss"_DATAFLD="B"_DATAFORMATAS="HTML"></SPAN>
 1 0
en 
<XML_ID=I><X><C><![CDATA[<IMG_SRC="javas]]<![CDATA[cript:document.vulnerable=true;">]]</C></X></xml><SPAN_DATASRC=#I_DATAFLD=C_DATAFORMATAS=HTML></SPAN>
 1 0
en 
<a_href="about:<script>document.vulnerable=true;</script>">
 1 0
en 
<a_href="javascript#document.vulnerable=true;">
 1 0
en 
<bgsound_src="javascript:document.vulnerable=true;">
 1 0
en 
<body_onload="document.vulnerable=true;">
 1 0
en 
<div_datafld="b"_dataformatas="html"_datasrc="#X"></div>
 1 0
en 
<div_onmouseover="document.vulnerable=true;">
 1 0
en 
<div_style="background-image:_url(javascript:document.vulnerable=true;);">
 1 0
en 
<div_style="behaviour:_url([link_to_code]);">
 1 0
en 
<div_style="binding:_url([link_to_code]);">
 1 0
en 
<div_style="width:_expression(document.vulnerable=true;);">
 1 0
en 
<img_dynsrc="javascript:document.vulnerable=true;">
 1 0
en 
<img_src="blah"onmouseover="document.vulnerable=true;">
 1 0
en 
<img_src="blah>"_onmouseover="document.vulnerable=true;">
 1 0
en 
<img_src="javascript:document.vulnerable=true;">
 1 0
en 
<img_src="livescript:document.vulnerable=true;">
 1 0
en 
<img_src="mocha:document.vulnerable=true;">
 1 0
en 
<img_src=&{document.vulnerable=true;};>
 1 0
en 
<input_type="image"_dynsrc="javascript:document.vulnerable=true;">
 1 0
en 
<link_rel="stylesheet"_href="javascript:document.vulnerable=true;">
 1 0
en 
<object_classid="clsid:..."_codebase="javascript:document.vulnerable=true;">
 1 0
en 
<style><!--</style><script>document.vulnerable=true;//--></script>
 1 0
en 
<style_type="text/javascript">document.vulnerable=true;</style>
 1 0
en 
<xml_id="X"><a><b><script>document.vulnerable=true;</script>;</b></a></xml>
 1 0
en 
<xml_src="javascript:document.vulnerable=true;">
 1 0
en 
[\xC0][\xBC]script>document.vulnerable=true;[\xC0][\xBC]/script>
 1 0

Event Timeline

Nuria added a subscriber: Nuria.Sep 12 2016, 3:54 PM

This looks like an XSS on article title, we dump title as is and thus you see it this way. cc-ing security in case some cleanup needs to happen.

If this was a request that resulted in 404 I do not think it will be dumped as such, so this request must be returning something other than 404.

Nuria moved this task from Incoming to Radar on the Analytics board.
Nuria added a subscriber: MoritzMuehlenhoff.

"<" and ">" are invalid character to use in titles, and should result in a 400 status code. Does the pageviews include things that give a 400 result?

Nuria added a comment.EditedSep 13 2016, 7:21 PM

"<" and ">" are invalid character to use in titles, and should result in a 400 status code. Does the pageviews include things that give a 400 result?

No, otherwise logs will be full of garbage-like requests.

But it could be that < were "&lt"? as that one is allowed once it is urlencoded.

Nuria added a comment.EditedSep 13 2016, 7:49 PM

Also, I can see garbage like requests in pageview_hourly table, on that hour:

Select:
select page_title, user_agent_map from pageview_hourly where (page_title like '%<%') and (year=2015 and month=08 and day=21 and hour=20);

Dataset: https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly

<<script>document.vulnerable=true;</script>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<XML_ID=I><X><C><![CDATA[<IMG_SRC="javas]]<![CDATA[cript:document.vulnerable=true;">]]</C></X></xml><SPAN_DATASRC=#I_DATAFLD=C_DATAFORMATAS=HTML></SPAN>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<xml_src="javascript:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<input_type="image"_dynsrc="javascript:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<a_href="about:<script>document.vulnerable=true;</script>">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<img_src=&{document.vulnerable=true;};>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<OBJECT_classid=clsid:ae24fdae-03c6-11d1-8b76-0080c744f389><param_name=url_value=javascript:document.vulnerable=true></OBJECT>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<img_src="mocha:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<img_src="blah"onmouseover="document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<style><!--</style><script>document.vulnerable=true;//--></script>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<link_rel="stylesheet"_href="javascript:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}
�<l)� {"browser_major":"2","os_family":"Other","os_major":"-","device_family":"Spider","browser_family":"bingbot","os_minor":"-","wmf_app_version":"-"}

<img_src="javascript:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<body_onload="document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<![CDATA[<!--]]<script>document.vulnerable=true;//--></script>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<?_echo('<SCR)';echo('IPT>document.vulnerable=true</SCRIPT>');_?>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<div_style="binding:_url([link_to_code]);">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<object_classid="clsid:..."_codebase="javascript:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<!--_--_--><script>document.vulnerable=true;</script><!--_--_-->
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<XML_ID="xss"><I><B><IMG_SRC="javas<!--_-->cript:document.vulnerable=true"></B></I></XML><SPAN_DATASRC="#xss"_DATAFLD="B"_DATAFORMATAS="HTML"></SPAN>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<style_type="text/javascript">document.vulnerable=true;</style>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<HEAD><META_HTTP-EQUIV="CONTENT-TYPE"_CONTENT="text/html;_charset=UTF-7">_</HEAD>+ADw-SCRIPT+AD4-document.vulnerable=true;+ADw-/SCRIPT+AD4-
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<div_style="width:_expression(document.vulnerable=true;);">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<a_href="javascript#document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<img_src="livescript:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}
1078498053271< {"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

&<script>document.vulnerable=true;</script>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<img_dynsrc="javascript:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<HTML><BODY><?xml:namespace_prefix="t"_ns="urn:schemas-microsoft-com:time"><?import_namespace="t"_implementation="#default#time2"><t:set_attributeName="innerHTML"_to="XSS<SCRIPT_DEFER>document.vulnerable=true</SCRIPT>"></BODY></HTML>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<div_style="behaviour:_url([link_to_code]);">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<bgsound_src="javascript:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<xml_id="X"><a><b><script>document.vulnerable=true;</script>;</b></a></xml>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<div_datafld="b"_dataformatas="html"_datasrc="#X"></div>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<div_onmouseover="document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<div_style="background-image:_url(javascript:document.vulnerable=true;);">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<img_src="blah>"_onmouseover="document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}
Time taken: 0.179 seconds, Fetched: 36 row(s)

So those are definitely getting through w/o an error code

But it could be that < were "&lt"? as that one is allowed once it is urlencoded.

&lt is allowed, but &lt; (with a semi-colon at the end) is not allowed. I would assume the lack of semi-colon would make it not get converted to a < in the logs (but not sure). Url encoding should not affect it, as titles get url decoded by MediaWiki.

Perhaps there's some sort of endpoint which doesn't do the 400 response. Do you know what the original request url is?

New theory.

From what I understand, the analytics cluster determines page name by trying to parse the path of the url ( e.g. http://en.wikipedia.org/wiki/Foo -> Foo ). However, sometimes the path of the url is not the effective article name.

For example, consider:

This returns the meta Main_Page with a 200 result. My guess is that the analytics code views this as being a page named "<script>", where MW considers this the page that corresponds to a page_id of 79155 and ignores the url path title.

So assuming that's correct, this leaves the question of should we do anything about it? From a security perspective this doesn't matter (Although we could give a 400 in this case on the principle that the user is up to no good).

From an analytics perspective, this means that a very small number of page views are misattributed, which presumably is bad.

Nuria added a comment.Oct 31 2016, 2:53 PM

@Bawoff: right, this should be a 400 in mw end. We have driven similar changes recently so we return 404s from mw for "bad" requests that were returning 200 before. Such as requests for history for non existing pages.

Arguably from a mw prespective this is a good and valid request - curid parameter takes precedence over title parameter.

btw, I'm untagging from security since there is no security impact on MediaWiki.

So, where can we go from here? Should these be filtered out or left in?

ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board.Dec 1 2016, 2:41 PM
Nuria added a comment.Dec 1 2016, 4:36 PM

@Bawolff @ArielGlenn Valid requests and pageviews are different things. While these could be valid requests in any case should they return 200, which is what must have happened for them to be counted as pageviews and appear on the pageview files. A 4xx is in order. Since these are old files this issue might have been fixed on mediawiki already. We just need to verify that requests like these no longer return 200.

Nuria added a comment.Dec 1 2016, 8:57 PM

I added security again cause these requests should not return 200: https://meta.wikimedia.org/wiki/%3Cscript%3E?curid=79155

I am not sure if these could cause an XSS but given that what is really happening is an internal redirect to https://meta.wikimedia.org the request should be returning something other than "yes! valid request, 200 here-you-go"

dpatrick triaged this task as Normal priority.Dec 7 2016, 5:30 PM

btw, sometimes people use the curid parameter to get to a page with an invalid title (e.g. if a bug causes a page to have a non-unicode normalized title)

Tgr added a subscriber: Tgr.Dec 7 2016, 7:58 PM

These are pageview requests which result in the successful display of an existing, valid page. HTTP 200 is the only reasonable status code in such a case. There is no HTTP response code for "this content has a different canonical URL"; that's what rel=canonical is for. We return HTTP 200 for redirect pages for the same reason.

Anomie added a subscriber: Anomie.Dec 7 2016, 8:17 PM

There is no HTTP response code for "this content has a different canonical URL"; that's what rel=canonical is for.

301?

We return HTTP 200 for redirect pages for the same reason.

See T53736 and T20883. From discussion there, it seems like the main reason is that with a 3xx we'd have to redirect to Target?from=something (or set a very-short-lived cookie that we Vary on, or something else like that) and somehow turn that from=something into the "Redirected from" message (and without opening things up to crap like "Redirected from yo momma").

Nuria added a comment.Dec 7 2016, 8:35 PM

These are pageview requests which result in the successful display of an existing, valid page. HTTP 200 is the only reasonable status code in such a case.

Not really, that the user received a valid case is decoupled from http codes that should be sent. Note that a request that returns 404 might return a friendly page to the user, an example: (history request for non existing page) https://en.wikipedia.org/w/index.php?title=hgdhgfhg&action=history

The requests on this ticket are clearly invalid and in fact are an internal redirect to another page (internal rewrite, not sure how we call those). They should return 301 if nothing else. But really, should be 404 (seems to me). Either one.

Tgr added a comment.Dec 7 2016, 8:58 PM

Per RFC 2616,

  • (301) The requested resource has been assigned a new permanent URI and any future references to this resource SHOULD use one of the returned URIs. Clients with link editing capabilities ought to automatically re-link references to the Request-URI to one or more of the new references returned by the server, where possible. (...) The new permanent URI SHOULD be given by the Location field in the response. Unless the request method was HEAD, the entity of the response SHOULD contain a short hypertext note with a hyperlink to the new URI(s).
  • (404) The server has not found anything matching the Request-URI.

so using either of those while actually returning the content seems wrong. We could of course do a proper 301 redirect, or properly reject the request as invalid / not found, but as bawolff noted, it does have some uses. Also, curids are sometimes used as a kind of permalink that's not affected by renames (some wikis have it in the sidebar IIRC). Issuing 301 redirects wouldn't break that, but would reduce usability as the URL would be harder to copy/paste-

I note that all this has nothing to do with the original bug report, which is about a title encoding issue in the pageview reports. Whatever URL gets used, it should not be allowed to break the formatting of the pageview stats file - that's an internal issue that has nothing to do with response codes.

Tgr added a comment.Dec 7 2016, 9:10 PM

Filed T152628: Reject MediaWiki requests which use an URL with an invalid title but use some alternative mechanism to specify a valid page. As long as the special handling is limited to requests where the apparent page name is invalid, I guess there is no drawback to serving a HTTP 400.

Title sanitization in the code generating the pageview dumps should be fixed nevertheless.

Nuria added a comment.Dec 7 2016, 9:50 PM

(404) The server has not found anything matching the Request-URI.
so using either of those while actually returning the content seems wrong

mmm ...mediawiki has its own stand when it comes to urls pointing to non existing content that defies the standard as we return 200 for not existing urls all the time (seems to me): https://en.wikipedia.org/wiki/UglyDog -> 204

But anyways, I defer on this regard to mediawiki developers. My main point is that issues have to be fixed at the root to be fixed reliably. In this case a url for a non existing page that provides no content to the user (other than a warning) should not return a 200.

Tgr added a comment.Dec 7 2016, 10:18 PM

mmm ...mediawiki has its own stand when it comes to urls pointing to non existing content that defies the standard as we return 200 for not existing urls all the time (seems to me): https://en.wikipedia.org/wiki/UglyDog -> 204

I see that served with a 404.

In this case a url for a non existing page that provides no content to the user (other than a warning) should not return a 200.

I think you misunderstand the situation. https://meta.wikimedia.org/wiki/<script>?curid=79155 will serve the exact same content as https://meta.wikimedia.org/wiki/Main_Page. (It's not even that we show the main page because it's some sort of error - 79155 just happens to be the page ID of the main page. Other numbers will get you other pages.) It's just an alternate URL format (and one where the title of the page is completely unguessable from the URL).

There is no HTTP response code for "this content has a different canonical URL"; that's what rel=canonical is for.

301?

But cf. T106793: Pages with single quote in title are inaccessible by some clients (redirect loop)

Rather than letting this die, is there any sort of concensus on next steps here?

Tgr added a comment.Jan 30 2017, 6:11 PM

The application that creates pageview dumps should escape or filter newlines (and tabs and other unexpected characters that mess up document structure). If we think T152628: Reject MediaWiki requests which use an URL with an invalid title but use some alternative mechanism to specify a valid page is still worth doing after that, it should be an easy change.

Nuria added a comment.Jan 30 2017, 6:37 PM

The application that creates pageview dumps should escape or filter newlines (and tabs and other unexpected characters that mess up document structure)

Agreed.

But it is not only that what needs doing right? faulty requests that are returning 200 should return a different http code

Tgr added a comment.Jan 30 2017, 6:51 PM

But it is not only that what needs doing right? faulty requests that are returning 200 should return a different http code

That's T152628, or did you have something else in mind?

Restricted Application added a subscriber: TerraCodes. · View Herald TranscriptMay 29 2017, 3:54 PM
awight claimed this task.Mar 18 2019, 4:18 PM
awight added a subscriber: awight.

I'll take a look at this and its subtasks.

This is pedantic of me, but I want to break the subtask link to T152628 and make it a "mentioned in", for the reason @Tgr gives in T144100#2982645 . I'm renaming this task to make it more clear that the scope is to fix escaping in the dumps files.

awight renamed this task from Pageview dumps incorrectly formatted, looks like a result of possibly malicious activity to Pageview dumps incorrectly formatted, need to escape special characters.Mar 19 2019, 12:45 AM

In https://phabricator.wikimedia.org/diffusion/ANRE/browse/master/oozie/pageview/hourly/transform_pageview_to_legacy_format.hql$25 , we're doing a raw string concatenation in order to produce the output format. The code comment has some suggestions,

INSERT OVERWRITE DIRECTORY "${destination_directory}"
    -- Since "ROW FORMAT DELIMITED DELIMITED FIELDS TERMINATED BY ' '" only
    -- works for exports to local directories (see HIVE-5672), we have to
    -- prepare the lines by hand through concatenation :-(
    -- Set 0 as volume column since we don't use it.
    SELECT
        CONCAT_WS(" ", qualifier, page_title, cast(view_count AS string), "0") line

Maybe it's desirable to change our workflow so we can use ROW FORMAT and then copy the file out of HDFS? Otherwise, continuing down the raw string manipulation path we could transform the page_title to surround with quotes and escape any characters not allowed in CSVs.

thanks for working on this @awight :)

Maybe it's desirable to change our workflow so we can use ROW FORMAT and then copy the file out of HDFS?

I don't think we'd want that:

  • Using ROW FORMAT implies writing files outside of HDFS, and HDFS is our (almost :) infinite storageg
  • Also the workflow for getting those files out would need to be changed which can be avoided.

Otherwise, continuing down the raw string manipulation path we could transform the page_title to surround with quotes and escape any characters not allowed in CSVs.

I think the suggestion of escaping all white-space characters (end-of-lines, spaces, tabs etc) actually makes sense.
I also think we should update the page-title extraction function to better handle the special <script> case shown here.

Reading HIVE-5672, the root bug has been fixed and we're in theory welcome to use ROW FORMAT without (or with) LOCAL. I think that's worth a try because it gives us more flexibility to escape or delimit in common variants. I would lean towards double-quoting the title field as well as escaping whitespace other than <space>, but want to make a case for not escaping <space>, that it makes our format more human-readable.

Using ROW FORMAT implies writing files outside of HDFS, and HDFS is our (almost :) infinite storage

I see, so omitting the LOCAL keyword here means that we do write to HDFS, great.

One side note is that a comment here mentions that the query plan may change, so we should be sure to monitor dump job health after deploying a fix.

Reading HIVE-5672, the root bug has been fixed

True, but the Hive version we have is old enough so that we don't have that fix ... <shame/>

Reading HIVE-5672, the root bug has been fixed

True, but the Hive version we have is old enough so that we don't have that fix ... <shame/>

baahaha

/me hangs coal-stained overalls for the night

Nuria added a comment.Mar 19 2019, 3:13 PM

I think the suggestion of escaping all white-space characters (end-of-lines, spaces, tabs etc) actually makes sense.
I also think we should update the page-title extraction function to better handle the special <script> case shown here.

+1, these changes are to happen on pageviewdefinition java classes rather than HIVE

I think the suggestion of escaping all white-space characters (end-of-lines, spaces, tabs etc) actually makes sense.
I also think we should update the page-title extraction function to better handle the special <script> case shown here.

+1, these changes are to happen on pageviewdefinition java classes rather than HIVE.

We should confirm that white-space chars are not accepted in regular page titles before adding this to the main function. If they are accepted, keeping them in parquet format is probably a good idea, and the filtering should happen in hive. If not accepted, ok to remove them at page-title extraction :)

Nuria added a comment.Mar 19 2019, 5:15 PM

Ah, yes, @JAllemandou is totally right as always.

Change 498700 had a related patch set uploaded (by Awight; owner: Adam Wight):
[analytics/refinery@master] Escape whitespaces in page_title

https://gerrit.wikimedia.org/r/498700

Change 498702 had a related patch set uploaded (by Awight; owner: Adam Wight):
[analytics/refinery/source@master] [WIP] Reject invalid titles

https://gerrit.wikimedia.org/r/498702

Change 498700 abandoned by Awight:
Escape whitespaces in page_title

Reason:
I realized this is redundant work, we're already replacing s/ /_/g in refinery-source, and we can trust it since we own that code. Now I'm wondering how the bad titles broke the CSV format in the first place?

https://gerrit.wikimedia.org/r/498700

Change 498702 merged by jenkins-bot:
[analytics/refinery/source@master] Reject invalid Page titles

https://gerrit.wikimedia.org/r/498702

Nuria moved this task from Next Up to Ready to Deploy on the Analytics-Kanban board.

Change 498702 merged by Fdans:
[analytics/refinery/source@master] Reject invalid Page titles

https://gerrit.wikimedia.org/r/498702

Change 505707 had a related patch set uploaded (by Fdans; owner: Fdans):
[analytics/refinery@master] Bump record version in webrequest load job

https://gerrit.wikimedia.org/r/505707

Change 505707 merged by Joal:
[analytics/refinery@master] Bump record version in webrequest load job and pageview hourly

https://gerrit.wikimedia.org/r/505707

Anomie removed a subscriber: Anomie.Apr 23 2019, 3:39 PM
Nuria moved this task from Ready to Deploy to Done on the Analytics-Kanban board.Apr 23 2019, 4:03 PM
Nuria closed this task as Resolved.Apr 23 2019, 6:10 PM

Super thanks!