
Data missing in page creation datasets
Closed, Resolved (Public)

Description

The daily page creation counts suggest that data is missing from the underlying datasets. I manually checked some of them, and the count looks low for Jan 3, 2018, and is missing entirely for Jan 4–6. The data does appear to exist in the log database on analytics-slave:

SELECT DATE(rev_timestamp) AS log_date, COUNT(*) AS num_pages
FROM mediawiki_page_create_2
WHERE rev_timestamp >= '2018-01-01 00:00:00'
  AND `database` = 'enwiki'
GROUP BY log_date;
+------------+-----------+
| log_date   | num_pages |
+------------+-----------+
| 2018-01-01 |      7346 |
| 2018-01-02 |      8404 |
| 2018-01-03 |      7768 |
| 2018-01-04 |      7756 |
| 2018-01-05 |      7184 |
| 2018-01-06 |      7358 |
| 2018-01-07 |      7672 |
| 2018-01-08 |      8311 |
| 2018-01-09 |      8949 |
| 2018-01-10 |      8154 |
| 2018-01-11 |      7843 |
| 2018-01-12 |      7709 |
| 2018-01-13 |      7299 |
| 2018-01-14 |      7382 |
| 2018-01-15 |      8076 |
| 2018-01-16 |      6091 |
+------------+-----------+
16 rows in set (0.52 sec)

Maybe the solution is to delete all the TSVs and recreate, since there's not that much data?
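For anyone wanting to repeat the manual check above, here is a minimal sketch that lists the dates in a range that have no row in a TSV. It assumes the first tab-separated column holds an ISO date (YYYY-MM-DD), which I have not verified against the actual page-creation files; `missing_dates` and `SAMPLE_TSV` are hypothetical names, and the sample rows are illustrative rather than real dataset contents.

```python
from datetime import date, timedelta

# Illustrative sample only: the column layout (date, then a count) is an assumption.
SAMPLE_TSV = "date\tcount\n2018-01-01\t10\n2018-01-02\t12\n2018-01-03\t3\n2018-01-07\t11\n"


def missing_dates(tsv_text, start, end):
    """Return every date in [start, end] that has no row in tsv_text."""
    seen = set()
    for line in tsv_text.splitlines():
        first_field = line.split("\t")[0]
        try:
            seen.add(date.fromisoformat(first_field))
        except ValueError:
            continue  # header row, blank line, or anything that is not a date
    total_days = (end - start).days + 1
    candidates = (start + timedelta(days=n) for n in range(total_days))
    return [day for day in candidates if day not in seen]


if __name__ == "__main__":
    for day in missing_dates(SAMPLE_TSV, date(2018, 1, 1), date(2018, 1, 7)):
        print(day.isoformat())
```

This only finds days that are absent; a day with a suspiciously low count (like Jan 3 here) would still need to be compared against the database query.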

Event Timeline

Nettrom created this task. Jan 16 2018, 6:45 PM
Restricted Application added a subscriber: Aklapper. Jan 16 2018, 6:45 PM

@Milimetric: Any idea why there might be data missing here?

I'm not sure, but my bet is that an outage caused the data to land in the database after the reports had already run. When that happens, you can re-run the reports:

https://wikitech.wikimedia.org/wiki/Analytics/Systems/Reportupdater#Re-runs

I'll do it this time, but here's the command I used so you can do it next time:

sudo -u stats python /srv/reportupdater/reportupdater/rerun_reports.py /srv/reportupdater/jobs/reportupdater-queries/page-creation 2018-01-03 2018-01-08

This is documented in more depth at the link above. The data should show up in the TSVs on ReportUpdater's next run (either every hour or every day depending on how it's configured for those jobs).

And if you're nervous about what that did, you can check the .reruns folder:

milimetric@stat1006:/srv/reportupdater$ cat /srv/reportupdater/jobs/reportupdater-queries/page-creation/.reruns/1516152325112 
2018-01-03
2018-01-08
pagecreations_main
pagecreations_draft
pagecreations_main_autopatrolled
pagecreations_main_noredirects
pagecreations_main_autoconfirmed
pagecreations
pagecreations_main_non-autoconfirmed
pagecreations_main_bots
milimetric@stat1006:/srv/reportupdater$
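To check a pending re-run programmatically, here is a minimal sketch that parses a file like the one shown above. The layout (first line start date, second line end date, remaining lines report names) is inferred from that `cat` output only and is an assumption, not something confirmed by the ReportUpdater documentation; `parse_rerun_file` and `SAMPLE_RERUN` are hypothetical names.

```python
from datetime import date

# Contents copied from the .reruns file shown above.
SAMPLE_RERUN = """2018-01-03
2018-01-08
pagecreations_main
pagecreations_draft
pagecreations_main_autopatrolled
pagecreations_main_noredirects
pagecreations_main_autoconfirmed
pagecreations
pagecreations_main_non-autoconfirmed
pagecreations_main_bots
"""


def parse_rerun_file(text):
    """Split a rerun file into (start_date, end_date, report_names).

    Assumes two ISO dates followed by one report name per line.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("expected at least a start date and an end date")
    start = date.fromisoformat(lines[0])
    end = date.fromisoformat(lines[1])
    return start, end, lines[2:]


if __name__ == "__main__":
    start, end, reports = parse_rerun_file(SAMPLE_RERUN)
    print(start, end, len(reports))
```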
Nettrom closed this task as Resolved. Jan 17 2018, 7:07 PM
Nettrom claimed this task.

I checked the dashboard for enwiki and spot-checked a dataset, and the data appears to be in working order. Thanks for helping take care of this @Milimetric, and great to learn there's a way to easily fix this next time!