Pageview dumps incorrectly formatted, need to escape special characters
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Kjschiroo
	Aug 27 2016, 5:42 PM

Description

When parsing the pageview dumps, specifically https://dumps.wikimedia.org/other/pageviews/2015/2015-08/pageviews-20150821-210000.gz I'm seeing sections violating the expected 4 column format.

As an example

en 
&<script>document.vulnerable=true;</script>
 1 0
en 
&{document.vulnerable=true;};
 1 0
en 
<!--_--_--><script>document.vulnerable=true;</script><!--_--_-->
 1 0
en 
<![CDATA[<!--]]<script>document.vulnerable=true;//--></script>
 1 0
en 
<<script>document.vulnerable=true;</script>
 1 0
en 
<?_echo('<SCR)';echo('IPT>document.vulnerable=true</SCRIPT>');_?>
 1 0
en 
<HEAD><META_HTTP-EQUIV="CONTENT-TYPE"_CONTENT="text/html;_charset=UTF-7">_</HEAD>+ADw-SCRIPT+AD4-document.vulnerable=true;+ADw-/SCRIPT+AD4-
 1 0
en 
<HTML><BODY><?xml:namespace_prefix="t"_ns="urn:schemas-microsoft-com:time"><?import_namespace="t"_implementation="#default#time2"><t:set_attributeName="innerHTML"_to="XSS<SCRIPT_DEFER>document.vulnerable=true</SCRIPT>"></BODY></HTML>
 1 0
en 
<OBJECT_classid=clsid:ae24fdae-03c6-11d1-8b76-0080c744f389><param_name=url_value=javascript:document.vulnerable=true></OBJECT>
 1 0
en 
<XML_ID="xss"><I><B><IMG_SRC="javas<!--_-->cript:document.vulnerable=true"></B></I></XML><SPAN_DATASRC="#xss"_DATAFLD="B"_DATAFORMATAS="HTML"></SPAN>
 1 0
en 
<XML_ID=I><X><C><![CDATA[<IMG_SRC="javas]]<![CDATA[cript:document.vulnerable=true;">]]</C></X></xml><SPAN_DATASRC=#I_DATAFLD=C_DATAFORMATAS=HTML></SPAN>
 1 0
en 
<a_href="about:<script>document.vulnerable=true;</script>">
 1 0
en 
<a_href="javascript#document.vulnerable=true;">
 1 0
en 
<bgsound_src="javascript:document.vulnerable=true;">
 1 0
en 
<body_onload="document.vulnerable=true;">
 1 0
en 
<div_datafld="b"_dataformatas="html"_datasrc="#X"></div>
 1 0
en 
<div_onmouseover="document.vulnerable=true;">
 1 0
en 
<div_style="background-image:_url(javascript:document.vulnerable=true;);">
 1 0
en 
<div_style="behaviour:_url([link_to_code]);">
 1 0
en 
<div_style="binding:_url([link_to_code]);">
 1 0
en 
<div_style="width:_expression(document.vulnerable=true;);">
 1 0
en 
<img_dynsrc="javascript:document.vulnerable=true;">
 1 0
en 
<img_src="blah"onmouseover="document.vulnerable=true;">
 1 0
en 
<img_src="blah>"_onmouseover="document.vulnerable=true;">
 1 0
en 
<img_src="javascript:document.vulnerable=true;">
 1 0
en 
<img_src="livescript:document.vulnerable=true;">
 1 0
en 
<img_src="mocha:document.vulnerable=true;">
 1 0
en 
<img_src=&{document.vulnerable=true;};>
 1 0
en 
<input_type="image"_dynsrc="javascript:document.vulnerable=true;">
 1 0
en 
<link_rel="stylesheet"_href="javascript:document.vulnerable=true;">
 1 0
en 
<object_classid="clsid:..."_codebase="javascript:document.vulnerable=true;">
 1 0
en 
<style><!--</style><script>document.vulnerable=true;//--></script>
 1 0
en 
<style_type="text/javascript">document.vulnerable=true;</style>
 1 0
en 
<xml_id="X"><a><b><script>document.vulnerable=true;</script>;</b></a></xml>
 1 0
en 
<xml_src="javascript:document.vulnerable=true;">
 1 0
en 
[\xC0][\xBC]script>document.vulnerable=true;[\xC0][\xBC]/script>
 1 0

Details

Subject	Repo	Branch	Lines +/-
Reject invalid Page titles	analytics/refinery/source	master	+81 -4
Bump record version in webrequest load job and pageview hourly	analytics/refinery	master	+2 -2
Escape whitespaces in page_title	analytics/refinery	master	+7 -1

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Resolved	awight	T144100 Pageview dumps incorrectly formatted, need to escape special characters
Resolved	None	T156656 Review parent task for any potential pageview definition improvements
Open	None	T152628 Reject MediaWiki requests which use an URL with an invalid title but use some alternative mechanism to specify a valid page

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

ArielGlenn added a project: Analytics.Sep 12 2016, 11:41 AM

This looks like an XSS on article title, we dump title as is and thus you see it this way. cc-ing security in case some cleanup needs to happen.

If this was a request that resulted in 404 I do not think it will be dumped as such, so this request must be returning something other than 404.

• Nuria added a project: acl*security.Sep 12 2016, 3:55 PM

• Nuria moved this task from Incoming to Radar on the Analytics board.

• Nuria added a subscriber: • MoritzMuehlenhoff.

• MoritzMuehlenhoff added a subscriber: ArielGlenn.Sep 12 2016, 4:14 PM

"<" and ">" are invalid character to use in titles, and should result in a 400 status code. Does the pageviews include things that give a 400 result?

"<" and ">" are invalid character to use in titles, and should result in a 400 status code. Does the pageviews include things that give a 400 result?

No, otherwise logs will be full of garbage-like requests.

But it could be that < were "&lt"? as that one is allowed once it is urlencoded.

Also, I can see garbage like requests in pageview_hourly table, on that hour:

Select:
select page_title, user_agent_map from pageview_hourly where (page_title like '%<%') and (year=2015 and month=08 and day=21 and hour=20);

Dataset: https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly

<<script>document.vulnerable=true;</script>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<XML_ID=I><X><C><![CDATA[<IMG_SRC="javas]]<![CDATA[cript:document.vulnerable=true;">]]</C></X></xml><SPAN_DATASRC=#I_DATAFLD=C_DATAFORMATAS=HTML></SPAN>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<xml_src="javascript:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<input_type="image"_dynsrc="javascript:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<a_href="about:<script>document.vulnerable=true;</script>">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<img_src=&{document.vulnerable=true;};>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<OBJECT_classid=clsid:ae24fdae-03c6-11d1-8b76-0080c744f389><param_name=url_value=javascript:document.vulnerable=true></OBJECT>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<img_src="mocha:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<img_src="blah"onmouseover="document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<link_rel="stylesheet"_href="javascript:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}
�<l)� {"browser_major":"2","os_family":"Other","os_major":"-","device_family":"Spider","browser_family":"bingbot","os_minor":"-","wmf_app_version":"-"}

<img_src="javascript:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<body_onload="document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<![CDATA[</script>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<?_echo('<SCR)';echo('IPT>document.vulnerable=true</SCRIPT>');_?>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<div_style="binding:_url([link_to_code]);">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<object_classid="clsid:..."_codebase="javascript:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<script>document.vulnerable=true;</script>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<XML_ID="xss"><I><B><IMG_SRC="javascript:document.vulnerable=true"></B></I></XML><SPAN_DATASRC="#xss"_DATAFLD="B"_DATAFORMATAS="HTML"></SPAN>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<style_type="text/javascript">document.vulnerable=true;</style>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<HEAD><META_HTTP-EQUIV="CONTENT-TYPE"_CONTENT="text/html;_charset=UTF-7">_</HEAD>+ADw-SCRIPT+AD4-document.vulnerable=true;+ADw-/SCRIPT+AD4-
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<div_style="width:_expression(document.vulnerable=true;);">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<a_href="javascript#document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<img_src="livescript:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}
1078498053271< {"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

&<script>document.vulnerable=true;</script>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<img_dynsrc="javascript:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<HTML><BODY><?xml:namespace_prefix="t"_ns="urn:schemas-microsoft-com:time"><?import_namespace="t"_implementation="#default#time2"><t:set_attributeName="innerHTML"_to="XSS<SCRIPT_DEFER>document.vulnerable=true</SCRIPT>"></BODY></HTML>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<div_style="behaviour:_url([link_to_code]);">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<bgsound_src="javascript:document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<xml_id="X"><a><b><script>document.vulnerable=true;</script>;</b></a></xml>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<div_datafld="b"_dataformatas="html"_datasrc="#X"></div>
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<div_onmouseover="document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<div_style="background-image:_url(javascript:document.vulnerable=true;);">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}

<img_src="blah>"_onmouseover="document.vulnerable=true;">
{"browser_major":"28","os_family":"Ubuntu","os_major":"-","device_family":"Other","browser_family":"Firefox","os_minor":"-","wmf_app_version":"-"}
Time taken: 0.179 seconds, Fetched: 36 row(s)

So those are definitely getting through w/o an error code

But it could be that < were "&lt"? as that one is allowed once it is urlencoded.

&lt is allowed, but < (with a semi-colon at the end) is not allowed. I would assume the lack of semi-colon would make it not get converted to a < in the logs (but not sure). Url encoding should not affect it, as titles get url decoded by MediaWiki.

Perhaps there's some sort of endpoint which doesn't do the 400 response. Do you know what the original request url is?

New theory.

From what I understand, the analytics cluster determines page name by trying to parse the path of the url ( e.g. http://en.wikipedia.org/wiki/Foo -> Foo ). However, sometimes the path of the url is not the effective article name.

For example, consider:

https://meta.wikimedia.org/wiki/<script>?curid=79155

This returns the meta Main_Page with a 200 result. My guess is that the analytics code views this as being a page named "<script>", where MW considers this the page that corresponds to a page_id of 79155 and ignores the url path title.

So assuming that's correct, this leaves the question of should we do anything about it? From a security perspective this doesn't matter (Although we could give a 400 in this case on the principle that the user is up to no good).

From an analytics perspective, this means that a very small number of page views are misattributed, which presumably is bad.

@Bawoff: right, this should be a 400 in mw end. We have driven similar changes recently so we return 404s from mw for "bad" requests that were returning 200 before. Such as requests for history for non existing pages.

Arguably from a mw prespective this is a good and valid request - curid parameter takes precedence over title parameter.

btw, I'm untagging from security since there is no security impact on MediaWiki.

So, where can we go from here? Should these be filtered out or left in?

ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board.Dec 1 2016, 2:41 PM

@Bawolff @ArielGlenn Valid requests and pageviews are different things. While these could be valid requests in any case should they return 200, which is what must have happened for them to be counted as pageviews and appear on the pageview files. A 4xx is in order. Since these are old files this issue might have been fixed on mediawiki already. We just need to verify that requests like these no longer return 200.

• Nuria added a project: acl*security.Dec 1 2016, 8:53 PM

I added security again cause these requests should not return 200: https://meta.wikimedia.org/wiki/%3Cscript%3E?curid=79155

I am not sure if these could cause an XSS but given that what is really happening is an internal redirect to https://meta.wikimedia.org the request should be returning something other than "yes! valid request, 200 here-you-go"

• dpatrick triaged this task as Medium priority.Dec 7 2016, 5:30 PM

btw, sometimes people use the curid parameter to get to a page with an invalid title (e.g. if a bug causes a page to have a non-unicode normalized title)

These are pageview requests which result in the successful display of an existing, valid page. HTTP 200 is the only reasonable status code in such a case. There is no HTTP response code for "this content has a different canonical URL"; that's what rel=canonical is for. We return HTTP 200 for redirect pages for the same reason.

In T144100#2855187, @Tgr wrote:

There is no HTTP response code for "this content has a different canonical URL"; that's what rel=canonical is for.

301?

We return HTTP 200 for redirect pages for the same reason.

See T53736 and T20883. From discussion there, it seems like the main reason is that with a 3xx we'd have to redirect to Target?from=something (or set a very-short-lived cookie that we Vary on, or something else like that) and somehow turn that from=something into the "Redirected from" message (and without opening things up to crap like "Redirected from yo momma").

These are pageview requests which result in the successful display of an existing, valid page. HTTP 200 is the only reasonable status code in such a case.

Not really, that the user received a valid case is decoupled from http codes that should be sent. Note that a request that returns 404 might return a friendly page to the user, an example: (history request for non existing page) https://en.wikipedia.org/w/index.php?title=hgdhgfhg&action=history

The requests on this ticket are clearly invalid and in fact are an internal redirect to another page (internal rewrite, not sure how we call those). They should return 301 if nothing else. But really, should be 404 (seems to me). Either one.

Per RFC 2616,

(301) The requested resource has been assigned a new permanent URI and any future references to this resource SHOULD use one of the returned URIs. Clients with link editing capabilities ought to automatically re-link references to the Request-URI to one or more of the new references returned by the server, where possible. (...) The new permanent URI SHOULD be given by the Location field in the response. Unless the request method was HEAD, the entity of the response SHOULD contain a short hypertext note with a hyperlink to the new URI(s).
(404) The server has not found anything matching the Request-URI.

so using either of those while actually returning the content seems wrong. We could of course do a proper 301 redirect, or properly reject the request as invalid / not found, but as bawolff noted, it does have some uses. Also, curids are sometimes used as a kind of permalink that's not affected by renames (some wikis have it in the sidebar IIRC). Issuing 301 redirects wouldn't break that, but would reduce usability as the URL would be harder to copy/paste-

I note that all this has nothing to do with the original bug report, which is about a title encoding issue in the pageview reports. Whatever URL gets used, it should not be allowed to break the formatting of the pageview stats file - that's an internal issue that has nothing to do with response codes.

Tgr mentioned this in T152628: Reject MediaWiki requests which use an URL with an invalid title but use some alternative mechanism to specify a valid page.Dec 7 2016, 9:08 PM

Filed T152628: Reject MediaWiki requests which use an URL with an invalid title but use some alternative mechanism to specify a valid page. As long as the special handling is limited to requests where the apparent page name is invalid, I guess there is no drawback to serving a HTTP 400.

Title sanitization in the code generating the pageview dumps should be fixed nevertheless.

(404) The server has not found anything matching the Request-URI.
so using either of those while actually returning the content seems wrong

mmm ...mediawiki has its own stand when it comes to urls pointing to non existing content that defies the standard as we return 200 for not existing urls all the time (seems to me): https://en.wikipedia.org/wiki/UglyDog -> 204

But anyways, I defer on this regard to mediawiki developers. My main point is that issues have to be fixed at the root to be fixed reliably. In this case a url for a non existing page that provides no content to the user (other than a warning) should not return a 200.

In T144100#2855621, @Nuria wrote:

mmm ...mediawiki has its own stand when it comes to urls pointing to non existing content that defies the standard as we return 200 for not existing urls all the time (seems to me): https://en.wikipedia.org/wiki/UglyDog -> 204

I see that served with a 404.

In this case a url for a non existing page that provides no content to the user (other than a warning) should not return a 200.

I think you misunderstand the situation. https://meta.wikimedia.org/wiki/<script>?curid=79155 will serve the exact same content as https://meta.wikimedia.org/wiki/Main_Page. (It's not even that we show the main page because it's some sort of error - 79155 just happens to be the page ID of the main page. Other numbers will get you other pages.) It's just an alternate URL format (and one where the title of the page is completely unguessable from the URL).

In T144100#2855217, @Anomie wrote:

In T144100#2855187, @Tgr wrote:

There is no HTTP response code for "this content has a different canonical URL"; that's what rel=canonical is for.

301?

But cf. T106793: Pages with single quote in title are inaccessible by some clients (redirect loop)

Rather than letting this die, is there any sort of concensus on next steps here?

The application that creates pageview dumps should escape or filter newlines (and tabs and other unexpected characters that mess up document structure). If we think T152628: Reject MediaWiki requests which use an URL with an invalid title but use some alternative mechanism to specify a valid page is still worth doing after that, it should be an easy change.

The application that creates pageview dumps should escape or filter newlines (and tabs and other unexpected characters that mess up document structure)

Agreed.

But it is not only that what needs doing right? faulty requests that are returning 200 should return a different http code

• Nuria created subtask T156656: Review parent task for any potential pageview definition improvements.Jan 30 2017, 6:37 PM

In T144100#2982767, @Nuria wrote:

But it is not only that what needs doing right? faulty requests that are returning 200 should return a different http code

That's T152628, or did you have something else in mind?

Indeed. Added subtask.

ArielGlenn edited projects, added Datasets-General-or-Unknown; removed Dumps-Generation.Mar 21 2017, 9:02 AM

Bawolff added a subscriber: MusikAnimal.Mar 31 2017, 11:16 AM

• Nuria added a project: good first task.May 29 2017, 3:54 PM

Restricted Application added a subscriber: TerraCodes. · View Herald TranscriptMay 29 2017, 3:54 PM

I'll take a look at this and its subtasks.

This is pedantic of me, but I want to break the subtask link to T152628 and make it a "mentioned in", for the reason @Tgr gives in T144100#2982645 . I'm renaming this task to make it more clear that the scope is to fix escaping in the dumps files.

awight renamed this task from Pageview dumps incorrectly formatted, looks like a result of possibly malicious activity to Pageview dumps incorrectly formatted, need to escape special characters.Mar 19 2019, 12:45 AM

In https://phabricator.wikimedia.org/diffusion/ANRE/browse/master/oozie/pageview/hourly/transform_pageview_to_legacy_format.hql$25 , we're doing a raw string concatenation in order to produce the output format. The code comment has some suggestions,

INSERT OVERWRITE DIRECTORY "${destination_directory}"
    -- Since "ROW FORMAT DELIMITED DELIMITED FIELDS TERMINATED BY ' '" only
    -- works for exports to local directories (see HIVE-5672), we have to
    -- prepare the lines by hand through concatenation :-(
    -- Set 0 as volume column since we don't use it.
    SELECT
        CONCAT_WS(" ", qualifier, page_title, cast(view_count AS string), "0") line

Maybe it's desirable to change our workflow so we can use ROW FORMAT and then copy the file out of HDFS? Otherwise, continuing down the raw string manipulation path we could transform the page_title to surround with quotes and escape any characters not allowed in CSVs.

thanks for working on this @awight :)

In T144100#5034702, @awight wrote:

Maybe it's desirable to change our workflow so we can use ROW FORMAT and then copy the file out of HDFS?

I don't think we'd want that:

Using ROW FORMAT implies writing files outside of HDFS, and HDFS is our (almost :) infinite storageg
Also the workflow for getting those files out would need to be changed which can be avoided.

Otherwise, continuing down the raw string manipulation path we could transform the page_title to surround with quotes and escape any characters not allowed in CSVs.

I think the suggestion of escaping all white-space characters (end-of-lines, spaces, tabs etc) actually makes sense.
I also think we should update the page-title extraction function to better handle the special <script> case shown here.

Reading HIVE-5672, the root bug has been fixed and we're in theory welcome to use ROW FORMAT without (or with) LOCAL. I think that's worth a try because it gives us more flexibility to escape or delimit in common variants. I would lean towards double-quoting the title field as well as escaping whitespace other than <space>, but want to make a case for not escaping <space>, that it makes our format more human-readable.

Using ROW FORMAT implies writing files outside of HDFS, and HDFS is our (almost :) infinite storage

I see, so omitting the LOCAL keyword here means that we do write to HDFS, great.

One side note is that a comment here mentions that the query plan may change, so we should be sure to monitor dump job health after deploying a fix.

Reading HIVE-5672, the root bug has been fixed

True, but the Hive version we have is old enough so that we don't have that fix ... <shame/>

In T144100#5035139, @JAllemandou wrote:

Reading HIVE-5672, the root bug has been fixed

True, but the Hive version we have is old enough so that we don't have that fix ... <shame/>

baahaha

/me hangs coal-stained overalls for the night

I think the suggestion of escaping all white-space characters (end-of-lines, spaces, tabs etc) actually makes sense.
I also think we should update the page-title extraction function to better handle the special <script> case shown here.

+1, these changes are to happen on pageviewdefinition java classes rather than HIVE

In T144100#5036380, @Nuria wrote:

I think the suggestion of escaping all white-space characters (end-of-lines, spaces, tabs etc) actually makes sense.
I also think we should update the page-title extraction function to better handle the special <script> case shown here.

+1, these changes are to happen on pageviewdefinition java classes rather than HIVE.

We should confirm that white-space chars are not accepted in regular page titles before adding this to the main function. If they are accepted, keeping them in parquet format is probably a good idea, and the filtering should happen in hive. If not accepted, ok to remove them at page-title extraction :)

Ah, yes, @JAllemandou is totally right as always.

Change 498700 had a related patch set uploaded (by Awight; owner: Adam Wight):
[analytics/refinery@master] Escape whitespaces in page_title

https://gerrit.wikimedia.org/r/498700

gerritbot added a project: Patch-For-Review.Mar 24 2019, 7:48 AM

Change 498702 had a related patch set uploaded (by Awight; owner: Adam Wight):
[analytics/refinery/source@master] [WIP] Reject invalid titles

https://gerrit.wikimedia.org/r/498702

awight moved this task from Backlog to Active on the Datasets-General-or-Unknown board.Mar 24 2019, 10:09 PM

Updated the page_title field documentation to reflect what happens in code,
https://wikitech.wikimedia.org/w/index.php?title=Analytics/Archive/Data/Pagecounts-raw&diff=1821019&oldid=1818913

Milimetric mentioned this in T156656: Review parent task for any potential pageview definition improvements.Mar 25 2019, 8:39 PM

Change 498700 abandoned by Awight:
Escape whitespaces in page_title

Reason:
I realized this is redundant work, we're already replacing s/ /_/g in refinery-source, and we can trust it since we own that code. Now I'm wondering how the bad titles broke the CSV format in the first place?

https://gerrit.wikimedia.org/r/498700

awight moved this task from Backlog / Other to Patch pending review on the acl*security board.Mar 29 2019, 12:36 AM

Change 498702 merged by jenkins-bot:
[analytics/refinery/source@master] Reject invalid Page titles

https://gerrit.wikimedia.org/r/498702

awight moved this task from Patch pending review to Patch pending deployment on the acl*security board.Mar 29 2019, 8:40 AM

• Nuria added a project: Analytics-Kanban.Mar 29 2019, 2:56 PM

• Nuria moved this task from Next Up to Ready to Deploy on the Analytics-Kanban board.

• Nuria merged a task: T217071: pageviews dumps contain invalid lines.Apr 19 2019, 1:34 PM

• Nuria added a subscriber: Aquameta.

Change 498702 merged by Fdans:
[analytics/refinery/source@master] Reject invalid Page titles

https://gerrit.wikimedia.org/r/498702

Change 505707 had a related patch set uploaded (by Fdans; owner: Fdans):
[analytics/refinery@master] Bump record version in webrequest load job

https://gerrit.wikimedia.org/r/505707

Change 505707 merged by Joal:
[analytics/refinery@master] Bump record version in webrequest load job and pageview hourly

https://gerrit.wikimedia.org/r/505707

Anomie unsubscribed.Apr 23 2019, 3:39 PM

• Nuria moved this task from Ready to Deploy to Done on the Analytics-Kanban board.Apr 23 2019, 4:03 PM

Super thanks!

sbassett moved this task from Patch pending deployment to Done on the acl*security board.Sep 26 2019, 2:38 PM

• chasemp added a project: Security.Feb 10 2020, 10:59 PM

• chasemp removed a project: acl*security.Feb 20 2020, 8:17 PM

• Nuria mentioned this in T245468: Pageviews missing for titles with emojis since April 23, 2019.Apr 16 2020, 6:24 PM

Aklapper edited projects, added Analytics-Radar; removed Analytics.Jun 10 2020, 6:44 AM

• Nuria closed subtask T156656: Review parent task for any potential pageview definition improvements as Resolved.Jun 23 2020, 3:48 PM

Pageview dumps incorrectly formatted, need to escape special charactersClosed, ResolvedPublicActions

Description

Details

Related ObjectsSearch...

Event Timeline

Pageview dumps incorrectly formatted, need to escape special characters
Closed, ResolvedPublic
Actions

Related Objects
Search...