This email (https://lists.wikimedia.org/pipermail/analytics/2018-September/006437.html) points out that the pageview dumps and the pageview API differ slightly. The difference is small, so it is probably harmless, but it would be good to know exactly where it comes from and to document it.
I have investigated this and have some answers.
The main observed difference is due to a one-hour shift between dumps and pageview-api. The pageview-dumps data was originally built to mimic a legacy tool named webstatscollector (see https://www.mediawiki.org/wiki/Analytics/Pageviews/Webstatscollector), whose hourly file naming differs from the convention we use today: pageview-dumps labels an hour by its end, while pageview-api labels it by its start. For instance, for data between 2018-09-27T13:00:00 and 2018-09-27T14:00:00, pageview-dumps uses 2018-09-27T14:00:00 while pageview-api uses 2018-09-27T13:00:00.
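To make the shift concrete, here is a minimal sketch that converts a pageview-api hourly timestamp into the stamp used in the matching dump file name. The function name api_hour_to_dump_stamp is hypothetical, and the script assumes GNU date (for the -d option and @epoch input).

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a pageview-api hourly timestamp (YYYYMMDDHH, UTC)
# to the dump file stamp (YYYYMMDD-HH0000) for the same hour of data.
# Dumps are named after the END of the hour, so we add 3600 seconds.
# Assumes GNU date.
api_hour_to_dump_stamp() {
  local ts="$1"
  local y="${ts:0:4}" m="${ts:4:2}" d="${ts:6:2}" h="${ts:8:2}"
  local start_epoch
  start_epoch=$(date -u -d "${y}-${m}-${d} ${h}:00:00 UTC" +%s) || return 1
  date -u -d "@$((start_epoch + 3600))" +%Y%m%d-%H0000
}

api_hour_to_dump_stamp 2018092713   # prints 20180927-140000
api_hour_to_dump_stamp 2015100100   # prints 20151001-010000
```

Going through epoch seconds rather than incrementing the hour digits keeps day, month and year rollovers correct (for example, hour 23 of one day maps to the 000000 file of the next day).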
A second issue, which pollutes the results when looking at individual pages, is that special end-of-line characters occur in page titles and are not escaped (a new task has been opened for this: T205620). This problem is minor, however.
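One rough way to spot lines broken by such characters is to look for lines that do not have the expected number of fields. The sketch below assumes that every well-formed pageviews dump line has exactly four space-separated fields (as in a line like "en.m ! 4 0"); the function name malformed_lines is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read dump lines on stdin and print, with their line
# number, those that do not split into exactly four whitespace-separated
# fields. Assumption: a well-formed line is "project title count bytes".
malformed_lines() {
  awk 'NF != 4 { printf "%d: %s\n", NR, $0 }'
}

# Small self-contained demo; the second line mimics a title cut by a newline.
printf 'en A 1 0\nen.m\nen.m ! 4 0\n' | malformed_lines   # prints "2: en.m"
```

On a real file this would be used as: zcat pageviews-20151001-010000.gz | malformed_lines. It only catches breaks that change the field count, so it is a quick check rather than a full validation.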
Confirmation scripts:
# Pageviews by user on en.wikipedia between 2015-10-01T00:00:00 and 2015-10-01T01:00:00
curl -s https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia/all-access/user/hourly/2015100100/2015100100
# {"items":[{"project":"en.wikipedia","access":"all-access","agent":"user","granularity":"hourly","timestamp":"2015100100","views":9720933}]}

# Projectviews dumps: sum of (en, en.m and en.zero) for files projectviews-20151001-000000 and projectviews-20151001-010000
grep -r '^en\(\.m\|\.zero\)* ' projectviews-20151001-000000 | awk '{print $3}' | awk '{s+=$1}END{print s}'
# 9466703 -- Different from API
grep -r '^en\(\.m\|\.zero\)* ' projectviews-20151001-010000 | awk '{print $3}' | awk '{s+=$1}END{print s}'
# 9720933 -- Same as API

# Pageview dumps: sum of (en, en.m and en.zero) for file pageviews-20151001-010000.gz
zcat pageviews-20151001-010000.gz | grep -e '^en\(\.m\|\.zero\)* .* 0$' | awk '{print $3}' | awk '{s+=$1}END{print s}'
# 9720932 -- 1 missing, because of an incorrect line

# Showing the incorrect line (with 2 lines of context before and after):
zcat pageviews-20151001-010000.gz | grep -e '^en\(\.m\|\.zero\)* ' | grep -v -B 2 -A 2 -e ' 0$'
# en 🛏 1 0
# en 🦄 1 0
# en.m
# en.m ! 4 0
# en.m !!! 3 0