
Why do dumps and pageview api have slightly different counts?
Closed, Resolved | Public | 3 Estimated Story Points

Description

This email: https://lists.wikimedia.org/pipermail/analytics/2018-September/006437.html points out that the counts in the dumps and in the pageview API differ slightly. Intuitively this should be OK because the difference is small, but it would be nice to know exactly where it comes from and to document it.

Event Timeline

I have investigated this topic and have some answers.

The main observed difference is due to a one-hour shift between dumps and pageview-api. The pageview-dumps data was originally built to mimic a legacy tool named webstatscollector (see https://www.mediawiki.org/wiki/Analytics/Pageviews/Webstatscollector), whose convention for naming hourly files differs from the one we currently use: a dump file is named after the end of the hour it covers, while pageview-api uses the start of that hour. For instance, for data between 2018-09-27T13:00:00 and 2018-09-27T14:00:00, pageview-dumps uses 2018-09-27T14:00:00 while pageview-api uses 2018-09-27T13:00:00.
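To make the shift concrete, here is a minimal sketch that maps a pageview-api hourly timestamp to the suffix of the dump file covering the same hour. The function name api_ts_to_dump_suffix is hypothetical, and the sketch assumes GNU date (for the -d option) and timestamps in the YYYYMMDDHH form used by the API.

```shell
# Hypothetical helper (not part of any Wikimedia tooling): convert a
# pageview-api hourly timestamp (YYYYMMDDHH, start of the hour) into the
# dump file suffix (YYYYMMDD-HH0000, end of the hour).
# Assumes GNU date; BSD/macOS date does not accept -d in this form.
api_ts_to_dump_suffix() {
  # Rewrite YYYYMMDDHH as "YYYY-MM-DD HH:00:00" so date can parse it.
  start=$(printf '%s\n' "$1" |
    sed -E 's/^([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})$/\1-\2-\3 \4:00:00/')
  # Work in epoch seconds (UTC) to add exactly one hour.
  start_epoch=$(date -u -d "$start" +%s) || return 1
  date -u -d "@$((start_epoch + 3600))" '+%Y%m%d-%H0000'
}

api_ts_to_dump_suffix 2018092713   # prints 20180927-140000
api_ts_to_dump_suffix 2015100100   # prints 20151001-010000
```

Going through epoch seconds avoids having to handle day, month, or year rollover by hand (for example, hour 23 maps to hour 00 of the next day).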

Something else pollutes the results when looking at individual pages: special end-of-line characters occur in page titles and are not escaped (a new task has been opened for this: T205620). These problems are minor, however.
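One way to spot lines broken by such characters is to flag every line that does not have the expected number of fields. This is only a sketch: the helper name find_malformed_lines is hypothetical, and it assumes a well-formed pageviews line has exactly four space-separated fields (project, title, view count, and a trailing 0), as in the sample output below.

```shell
# Hypothetical helper: read pageviews dump lines on stdin and print,
# with their line number, those that do not have exactly 4 fields.
# Assumption: well-formed lines look like "en.m ! 4 0".
find_malformed_lines() {
  awk 'NF != 4 { print NR ": " $0 }'
}

# Made-up sample in which a newline inside a title splits one record
# in two; the helper flags lines 2 and 3.
printf 'en.m ! 4 0\nen.m \n 1 0\nen.m !!! 3 0\n' | find_malformed_lines
```

On a real file this could be run as zcat pageviews-20151001-010000.gz | find_malformed_lines.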

Confirmation scripts:

# Pageviews by user on en.wikipedia between 2015-10-01T00:00:00 and 2015-10-01T01:00:00
curl -s https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia/all-access/user/hourly/2015100100/2015100100
# {"items":[{"project":"en.wikipedia","access":"all-access","agent":"user","granularity":"hourly","timestamp":"2015100100","views":9720933}]}

# Projectviews dumps sum of (en, en.m and en.zero) for files projectviews-20151001-000000 and projectviews-20151001-010000
grep -r '^en\(\.m\|\.zero\)* ' projectviews-20151001-000000 | awk '{print $3}' | awk '{s+=$1}END{print s}'
# 9466703 -- Different from API
grep -r '^en\(\.m\|\.zero\)* ' projectviews-20151001-010000 | awk '{print $3}' | awk '{s+=$1}END{print s}'
# 9720933 -- Same as API

# Pageview dumps sum of (en, en.m and en.zero) for file pageviews-20151001-010000.gz
zcat pageviews-20151001-010000.gz | grep -e '^en\(\.m\|\.zero\)* .* 0$' | awk '{print $3}' | awk '{s+=$1}END{print s}'
# 9720932 -- 1 missing, because of an incorrect line

# Showing the incorrect line:
zcat pageviews-20151001-010000.gz | grep -e '^en\(\.m\|\.zero\)* ' | grep -v -B 2 -A 2 -e ' 0$'
# en 🛏 1 0
# en 🦄 1 0
# en.m 
# en.m ! 4 0
# en.m !!! 3 0
JAllemandou edited projects, added Analytics-Kanban; removed Analytics.
JAllemandou set the point value for this task to 3.
JAllemandou moved this task from Next Up to Done on the Analytics-Kanban board.