
Pageviews API reporting inaccurate data for page titles containing special characters
Closed, Resolved · Public · 13 Estimated Story Points

Description

Since around 24 February, it appears the pageviews API has been reporting incomplete data for page titles with special characters, such as Cyrillic letters.

Compare:
https://tools.wmflabs.org/pageviews/#start=2016-02-07&end=2016-02-26&project=ru.wikipedia.org&platform=all-access&agent=user&pages=BMW|Intel|Microsoft
https://tools.wmflabs.org/pageviews/#start=2016-02-07&end=2016-02-26&project=ru.wikipedia.org&platform=all-access&agent=user&pages=Путин,_Владимир_Владимирович|Обама,_Барак|Ленин,_Владимир_Ильич

These are visualizations of the data from the pageviews API, unchanged. See an example of hitting the API directly here.

We are seeing the same issue on other wikis.

Event Timeline

Mentioned in SAL [2016-02-29T10:08:27Z] <joal> Deploying refinery to see if previous deploy was causing https://phabricator.wikimedia.org/T128295

First investigations:
The problem comes from the page_title computation on the cluster.
It seems the cluster upgrade we did last week broke something we didn't notice.

ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;
ADD JAR hdfs://analytics-hadoop/wmf/refinery/2016-02-23T18.55.34Z--7dadb6b/artifacts/org/wikimedia/analytics/refinery/refinery-hive-0.0.26.jar;
CREATE TEMPORARY FUNCTION get_pageview_info AS 'org.wikimedia.analytics.refinery.hive.GetPageviewInfoUDF';

-- Compare the page_title computed at refine time (pageview_info)
-- with a fresh run of the same UDF over the same rows.
SELECT
    uri_path,
    uri_query,
    get_pageview_info(uri_host, uri_path, uri_query)['page_title'] as pvinf_title,
    pageview_info['page_title'] as pt,
    COUNT(1) as vc
FROM wmf.webrequest
WHERE year = 2016
    AND month = 2
    AND day = 27
    AND hour = 17
    AND is_pageview
    AND pageview_info['project'] = 'sv.wikipedia'
    AND uri_path = '/wiki/Lasse_%C3%85berg'
GROUP BY
    uri_path,
    uri_query,
    pageview_info['page_title'],
    get_pageview_info(uri_host, uri_path, uri_query)['page_title']
ORDER BY vc DESC
LIMIT 100;


+-------------------------+------------+---------------+---------------+------+--+
|        uri_path         | uri_query  |  pvinf_title  |      pt       |  vc  |
+-------------------------+------------+---------------+---------------+------+--+
| /wiki/Lasse_%C3%85berg  |            | Lasse_��berg  | Lasse_��berg  | 414  |
| /wiki/Lasse_%C3%85berg  |            | Lasse_��berg  | Lasse_Åberg   | 101  |
| /wiki/Lasse_%C3%85berg  |            | Lasse_Åberg   | Lasse_��berg  | 78   |
| /wiki/Lasse_%C3%85berg  |            | Lasse_Åberg   | Lasse_Åberg   | 13   |
+-------------------------+------------+---------------+---------------+------+--+

Same path and query, yet different results for the page_title extraction, and those results differ between the refine run (pt) and my hand-made run (pvinf_title).
My guess is that there are different versions of something on different nodes.
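
To illustrate the suspected mechanism, here is a minimal Java sketch (a hypothetical snippet, not the actual refinery UDF code): if the percent-decoded UTF-8 bytes of a title are turned into a String via the JVM's default charset instead of an explicit one, the title renders correctly on UTF-8 nodes and as replacement characters on nodes defaulting to ANSI_X3.4-1968.

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDecodeDemo {
    public static void main(String[] args) {
        // UTF-8 bytes of "Å", i.e. %C3%85 percent-decoded from /wiki/Lasse_%C3%85berg
        byte[] decoded = {(byte) 0xC3, (byte) 0x85};

        // Explicit charset: identical result on every node
        System.out.println("explicit UTF-8:  " + new String(decoded, StandardCharsets.UTF_8));

        // Default charset: whatever file.encoding the JVM started with
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("implicit decode: " + new String(decoded));
        // Prints "Å" under UTF-8, replacement characters under ANSI_X3.4-1968
    }
}

A worker whose JVM defaults to ANSI_X3.4-1968 would therefore produce the Lasse_��berg rows above for exactly the same request rows.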

Tried to redeploy refinery to HDFS to see if the problem could be coming from a badly deployed version.
There are many errors from HDFS:

java.io.IOException: Bad connect ack with firstBadLink as 10.64.36.131:50010
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1584)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1483)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:668)

Also, when looking at the HDFS web interface (analytics1001:50070, via an ssh tunnel), there is a banner saying:

Upgrade in progress. Not yet finalized.

Still investigating ...

FYI, I just ran a similar query against the raw wmf_raw.webrequest table.

ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;
ADD JAR hdfs://analytics-hadoop/wmf/refinery/2016-02-23T18.55.34Z--7dadb6b/artifacts/org/wikimedia/analytics/refinery/refinery-hive-0.0.26.jar;
CREATE TEMPORARY FUNCTION get_pageview_info AS 'org.wikimedia.analytics.refinery.hive.GetPageviewInfoUDF';

-- Same comparison against the raw (pre-refine) table: the UDF output alone
-- is already inconsistent for a single path.
SELECT
    uri_path,
    get_pageview_info(uri_host, uri_path, uri_query)['page_title'] as pvinf_title,
    COUNT(1) as vc
FROM wmf_raw.webrequest
WHERE webrequest_source='text'
    AND year = 2016
    AND month = 2
    AND day = 27
    AND hour = 17
    AND uri_host = 'sv.wikipedia.org'
    AND uri_path = '/wiki/Lasse_%C3%85berg'
GROUP BY
    uri_path,
    get_pageview_info(uri_host, uri_path, uri_query)['page_title']
ORDER BY vc DESC
LIMIT 100;


+-------------------------+---------------+------+
|        uri_path         |  pvinf_title  |  vc  |
+-------------------------+---------------+------+
| /wiki/Lasse_%C3%85berg  | Lasse_��berg  | 190  |
| /wiki/Lasse_%C3%85berg  | Lasse_Åberg   | 45   |
+-------------------------+---------------+------+

I ran this a second time and got slightly different results, indicating to me (as @JAllemandou already noted) that the problem likely comes from differences between Hadoop worker nodes: each run schedules its tasks on a different set of nodes, so the counts shift with task placement. Still looking...

Milimetric raised the priority of this task from High to Unbreak Now! · Feb 29 2016, 5:10 PM
Milimetric removed projects: Analytics, Pageviews-API.

Phew, somehow many NodeManager (and, unrelatedly, DataNode) JVM processes had gotten stuck with file.encoding = ANSI_X3.4-1968 the last time they were restarted. I don't think it was a JVM problem, since a locale shell exec from the process returned

LANGUAGE=en_US:
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=C

vs

LANGUAGE=en_US:
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

on nodes where file.encoding was UTF-8.

So, somehow, the JVM processes started with a borked environment with a bad locale.

I'm not sure how this happened. Restarting the JVM processes has fixed this.

@JAllemandou, in order to make sure we aren't bitten by this, do you think we should attempt to set file.encoding for Hadoop JVMs by default? I'm not sure.
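
To make that check concrete, a tiny standalone helper like this (hypothetical, not part of refinery) shows what encoding a JVM actually picked up; launching a JVM with -Dfile.encoding=UTF-8 pins the value regardless of the environment the process inherited.

import java.nio.charset.Charset;

public class ShowEncoding {
    public static void main(String[] args) {
        // Both values are fixed at JVM startup from the environment / -D flags
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());
    }
}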

I assume this is related, but I also noticed that the /pageviews/top endpoint sometimes returns the wrong characters when there should be accents like é.

For instance see the topviews on the French Wikipedia:
http://tools.wmflabs.org/topviews/#start=2016-02-22&end=2016-02-28&project=fr.wikipedia.org&platform=all-access

Notice that the second result, Sp?cial:Search, is garbled, while the third, Spécial:Recherche, appears correctly.

Example hitting the API directly: https://wikimedia.org/api/rest_v1/metrics/pageviews/top/fr.wikipedia/all-access/2016/02/22

@MusikAnimal : The issue @Ottomata is describing led to badly formatted data being loaded into the API.

Now that the problem is found and solved, we are starting to backfill correct data in place of the corrupted data.
This process will be long since we have about 7 days of data to recompute, but things will be sorted out at some point :)

@JAllemandou Thanks for the explanation. So are you saying that new pageview data coming in should be accurate? Looks like things are smoothing out in the first example, but not quite back to where they should be.

In the meantime I just want to keep people informed. Could you confirm what day the breaking change occurred? Maybe it was the 23rd or even the 22nd?

Many thanks for the prompt attention!

@MusikAnimal : You're welcome, and thanks for having filed the bug!
The breakage occurred Feb 23rd, polluting data from the 23rd through the 29th, inclusive.
Recomputation / backfilling is currently ongoing, but it'll take a long time (at least through the end of the week and the weekend).

@Ottomata :

@JAllemandou, in order to make sure we aren't bitten by this, do you think we should attempt to set file.encoding for Hadoop JVMs by default? I'm not sure.

While it would be great to ensure we control that setting for every JVM we launch, I'm not sure the cost is worth it ...
I have the feeling it would take a long time to ensure every JVM (HDFS, YARN, MapReduce, Hive, Spark, Oozie, etc.) we run has a puppet setting for the correct encoding.
Happy to discuss it further :)

Milimetric set the point value for this task to 13 · Mar 3 2016, 5:24 PM

Removing the Pageview-API tag so that this task is not automatically tagged as Analytics.

Ehm, can't we fix the Herald rule instead? Tags shouldn't have to be removed simply because some automated system does something undesirable.

From what I can see, the data from Feb. 23-29 looks fixed now. Thank you!

Hi,
Backfilling finished last night.
Data should be correct now :)
Sorry for the inconvenience, and thanks again for having spotted this!