
https://dumps.wikimedia.org/other/pageviews/ lacks hourly pageviews since 20190722-17:00
Closed, ResolvedPublic3 Story Points

Description

wikimedia/portals.git is the repository to generate https://www.wikipedia.org/ . The build script uses pageviews data from https://dumps.wikimedia.org/other/pageviews/2019/2019-07/ but the build indicates there are missing data:

00:01:46.265 HTTPError: 404 requesting https://dumps.wikimedia.org/other/pageviews/2019/2019-07/projectviews-20190722-170000
00:01:46.275 Unhandled rejection HTTPError: 404 requesting https://dumps.wikimedia.org/other/pageviews/2019/2019-07/projectviews-20190722-170000
00:01:46.296 HTTPError: 404 requesting https://dumps.wikimedia.org/other/pageviews/2019/2019-07/projectviews-20190722-180000
00:01:46.299 Unhandled rejection HTTPError: 404 requesting https://dumps.wikimedia.org/other/pageviews/2019/2019-07/projectviews-20190722-180000
00:01:46.311 HTTPError: 404 requesting https://dumps.wikimedia.org/other/pageviews/2019/2019-07/projectviews-20190722-190000
00:01:46.312 Unhandled rejection HTTPError: 404 requesting https://dumps.wikimedia.org/other/pageviews/2019/2019-07/projectviews-20190722-190000
00:01:46.327 HTTPError: 404 requesting https://dumps.wikimedia.org/other/pageviews/2019/2019-07/projectviews-20190722-200000
00:01:46.328 Unhandled rejection HTTPError: 404 requesting https://dumps.wikimedia.org/other/pageviews/2019/2019-07/projectviews-20190722-200000
00:01:46.360 HTTPError: 404 requesting https://dumps.wikimedia.org/other/pageviews/2019/2019-07/projectviews-20190722-210000
00:01:46.361 Unhandled rejection HTTPError: 404 requesting https://dumps.wikimedia.org/other/pageviews/2019/2019-07/projectviews-20190722-210000
00:01:46.371 HTTPError: 404 requesting https://dumps.wikimedia.org/other/pageviews/2019/2019-07/projectviews-20190722-220000
00:01:46.373 Unhandled rejection HTTPError: 404 requesting https://dumps.wikimedia.org/other/pageviews/2019/2019-07/projectviews-20190722-220000
00:01:46.387 HTTPError: 404 requesting https://dumps.wikimedia.org/other/pageviews/2019/2019-07/projectviews-20190722-230000
00:01:46.391 Unhandled rejection HTTPError: 404 requesting https://dumps.wikimedia.org/other/pageviews/2019/2019-07/projectviews-20190722-230000

On https://dumps.wikimedia.org/other/pageviews/2019/2019-07/ , the last hourly dump is from 22-Jul-2019 17:03 UTC.
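For reference, the missing hours can be enumerated from the fixed `projectviews-YYYYMMDD-HH0000` naming visible in the 404 errors above. A minimal sketch (Python here purely for illustration; the portals build itself appears to be Node-based):

```python
from datetime import datetime, timedelta

def hourly_filenames(start, end):
    """Yield the projectviews file name for every hour in [start, end)."""
    t = start
    while t < end:
        yield t.strftime("projectviews-%Y%m%d-%H0000")
        t += timedelta(hours=1)

# The gap reported in this task: 17:00 through 23:00 UTC on 2019-07-22.
missing = list(hourly_filenames(datetime(2019, 7, 22, 17), datetime(2019, 7, 23, 0)))
print(len(missing), missing[0], missing[-1])
# → 7 projectviews-20190722-170000 projectviews-20190722-230000
```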

Potentially related SAL entries:

16:36 	<jeh> 	redirecting dumps.wikimedia.org dns to labstore1006 T224228
17:02 	<nuria@deploy1001> 	Started deploy [analytics/refinery@d889893]: deploying refinery jar bump for webrequest/load jobs
17:17 	<nuria@deploy1001> 	Finished deploy [analytics/refinery@d889893]: deploying refinery jar bump for webrequest/load jobs (duration: 14m 51s)

Event Timeline

hashar created this task. Jul 23 2019, 8:48 AM
Restricted Application added a subscriber: Aklapper. Jul 23 2019, 8:48 AM

The DNS change is f4c51db24a4953eb4b24b3480b555ad5cf61b219 Mon Jul 22 16:31:01 2019 +0000

templates/wikimedia.org
@@ -80,7 +80,7 @@ ns2.corp    1H  IN A    198.73.209.16
 ; interactions with selective CN censorhsip at the DNS level)
 dyna            600 IN DYNA geoip!text-addrs
 
-dumps           5M  IN CNAME labstore1007
+dumps           5M  IN CNAME labstore1006
 
 lists           5M  IN A    208.80.154.21
 lists           5M  IN AAAA 2620:0:861:1:208:80:154:21

Maybe related are some hiera values at:

hieradata/common.yaml
# Dumps distribution server currently serving traffic over NFS to cloud vps instances
dumps_dist_active_vps: labstore1007.wikimedia.org
# Dumps distribution server currently serving web and rsync mirror traffic
# Also serves stat* hosts over nfs
dumps_dist_active_web: labstore1006.wikimedia.org

The last file is from 22-Jul-2019 17:03

I lamely looked at labstore1006 and labstore1007 via curl and they are both stalled at the same last file. I used:

curl -H 'Host: dumps.wikimedia.org' -k https://labstore1007.wikimedia.org/other/pageviews/2019/2019-07/
curl -H 'Host: dumps.wikimedia.org' -k https://labstore1006.wikimedia.org/other/pageviews/2019/2019-07/

There are also the SAL entries:

17:17 	<nuria@deploy1001> 	Finished deploy [analytics/refinery@d889893]: deploying refinery jar bump for webrequest/load jobs (duration: 14m 51s)
17:02 	<nuria@deploy1001> 	Started deploy [analytics/refinery@d889893]: deploying refinery jar bump for webrequest/load jobs
hashar renamed this task from https://dumps.wikimedia.org/other/pageviews/ lacks hourly pageviews since 20190722-16:00 to https://dumps.wikimedia.org/other/pageviews/ lacks hourly pageviews since 20190722-17:00. Jul 23 2019, 9:02 AM
hashar updated the task description.
elukey added a subscriber: elukey. Jul 23 2019, 9:10 AM

@hashar thanks a lot for the ping. Yesterday we restarted the job that produces the files, but it has been failing since then: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0024727-190417151359684-oozie-oozi-C/

We don't have a notification for errors generated by this job, so we hadn't noticed until now.

Mentioned in SAL (#wikimedia-analytics) [2019-07-23T09:23:53Z] <elukey> restart projectview-hourly-coordinator with correct config - T228731

elukey added a subscriber: Nuria. Jul 23 2019, 9:31 AM

Ah no ok, there is an explanation for this trouble. @Nuria deployed refinery yesterday as mentioned, but she hadn't restarted any jobs yet. This is usually fine, since when we start the oozie jobs we explicitly set a refinery path on HDFS, usually something like:

refinery_directory	hdfs://analytics-hadoop/wmf/refinery/2019-07-22T17.21.45+00.00--scap_sync_2019-07-22_0001-dirty

Meanwhile, for projectviews it was:

refinery_directory	hdfs://analytics-hadoop/wmf/refinery/current

This means that as soon as the deployment was done, the HDFS directory was updated as well, so the oozie job no longer had the correct settings (we had added a new required parameter) and failed silently (because there are no alarms for this job).
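In short, a job started with a pinned refinery_directory keeps running against the artifacts it was launched with, while one pointing at current picks up every deploy immediately. A hedged illustration of the two settings (the property name comes from the values quoted above; the commented file layout is an assumption, not the actual coordinator configuration):

```
# Pinned: the job keeps using the artifacts it was started with,
# even after a new refinery deploy updates HDFS.
refinery_directory = hdfs://analytics-hadoop/wmf/refinery/2019-07-22T17.21.45+00.00--scap_sync_2019-07-22_0001-dirty

# Unpinned: "current" is a moving target, so a deploy that adds a
# required parameter breaks the running coordinator on its next run.
# refinery_directory = hdfs://analytics-hadoop/wmf/refinery/current
```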

elukey claimed this task. Jul 23 2019, 9:36 AM
elukey triaged this task as Normal priority.
elukey added a project: Analytics-Kanban.
elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.

I can see data at the bottom of https://dumps.wikimedia.org/other/pageviews/2019/2019-07/ (might need a forced refresh in the browser). The labstore nodes pull from stat1007's rsync server so we should be ok with any DNS change.

Maybe related are some hiera values at:

hieradata/common.yaml
# Dumps distribution server currently serving traffic over NFS to cloud vps instances
dumps_dist_active_vps: labstore1007.wikimedia.org
# Dumps distribution server currently serving web and rsync mirror traffic
# Also serves stat* hosts over nfs
dumps_dist_active_web: labstore1006.wikimedia.org

I'll be pushing this update out today, rebooting labstore1006 and then switching DNS back to labstore1007. Please let me know if there are any concerns.

elukey moved this task from In Progress to Done on the Analytics-Kanban board. Jul 23 2019, 3:19 PM
elukey set the point value for this task to 3.

@JHedden it seems the DNS change has been harmless and the issue came from some of the magic in an oozie job.

I will let @elukey mark this task resolved, unless there is a need to add some monitoring to ensure the job runs properly and/or the files get published properly.
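The monitoring idea above could be as simple as checking how far the newest published file lags behind the current time. A minimal sketch, assuming the `projectviews-YYYYMMDD-HH0000` naming from this task (the three-hour threshold and the file-listing source are assumptions, not an existing Wikimedia alert):

```python
import re
from datetime import datetime, timedelta

def latest_hour(filenames):
    """Return the newest hour covered by a list of projectviews file names."""
    hours = []
    for name in filenames:
        m = re.match(r"projectviews-(\d{8})-(\d{2})0000$", name)
        if m:
            hours.append(datetime.strptime(m.group(1) + m.group(2), "%Y%m%d%H"))
    return max(hours) if hours else None

def is_stale(filenames, now, max_lag=timedelta(hours=3)):
    """Alert condition: the newest published hour lags `now` by more than max_lag."""
    newest = latest_hour(filenames)
    return newest is None or now - newest > max_lag

files = ["projectviews-20190722-160000", "projectviews-20190722-170000"]
print(is_stale(files, datetime(2019, 7, 23, 8)))  # → True (15 hours behind)
```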

elukey closed this task as Resolved. Jul 24 2019, 9:48 AM