
Replace cron jobs from EZachte's home directory on stat1005 with rsync fetches
Closed, ResolvedPublic

Description

We're migrating from dataset1001 to labstore1006|7 to serve dumps to our NFS/web users. We'd like to convert the crons that push data to dataset1001 to rsync pulls from the newer servers.

Let's figure out the frequency and what directory we can pull from, and we can set up rsync jobs to pull into the public/other/pageviews-ez directory tree.

For a short while the datasets would be going both to dataset1001 using the existing scripts, and to the new labstore box using the new server-side rsyncs, until the dataset1001 service is finally turned off. We'd potentially like to enable the server-side rsync service on March 28, 2018, and we can expect to turn off the crons pushing to dataset1001 by April 6th.

Related Objects

Status / Assigned / Task
Resolvedbd808
ResolvedArielGlenn
Resolved madhuvishy
Resolved madhuvishy
Resolved madhuvishy
Resolved madhuvishy
ResolvedArielGlenn
ResolvedArielGlenn
Resolvedezachte
Resolved madhuvishy
Resolved madhuvishy
Resolved madhuvishy
Resolved madhuvishy
Resolved madhuvishy
Resolved madhuvishy
Resolved madhuvishy
Resolved madhuvishy
Resolved madhuvishy
Resolved madhuvishy
Resolved madhuvishy

Event Timeline

madhuvishy triaged this task as Normal priority.Mar 9 2018, 7:15 AM
madhuvishy created this task.
Restricted Application removed a project: Patch-For-Review. Mar 9 2018, 7:15 AM
bd808 moved this task from Backlog to Dumps on the Data-Services board.Mar 10 2018, 11:40 PM

Update based on my discussion with @ezachte over email:

Current status: The following bash files in stat1005 at /home/ezachte rsync to dataset1001 (we grant write permissions for specific directories on the dataset1001 end).

./wikistats_data/dumps/bash/zip_out.sh:rsync -avv $out/zip_all/out_w*.zip $dataset2
./wikistats_data/dumps/bash/zip_csv.sh:rsync -avv $zip_all/csv_*.zip $dataset2
./wikistats_data/dumps/bash/rsync.sh:rsync -av /a/dammit.lt/projectcounts/projectcounts-20??.tar dataset2::pagecounts-ez/projectcounts
./wikistats/image_sets/bash/publish_zips.sh:      rsync -ipv4 -avv *.zip  $dataset1001
./wikistats/dumps/bash/count_editors_yoy.sh:rsync -arv -ipv4 --include=*.bz2 $csv/csv_wb/csv_wb_active_editors.zip $dataset1001 
./wikistats/dumps/bash/count_editors_yoy.sh:rsync -arv -ipv4 --include=*.bz2 $csv/csv_wk/csv_wk_active_editors.zip $dataset1001 
./wikistats/dumps/bash/count_editors_yoy.sh:rsync -arv -ipv4 --include=*.bz2 $csv/csv_wn/csv_wn_active_editors.zip $dataset1001 
./wikistats/dumps/bash/count_editors_yoy.sh:rsync -arv -ipv4 --include=*.bz2 $csv/csv_wo/csv_wo_active_editors.zip $dataset1001 
./wikistats/dumps/bash/count_editors_yoy.sh:rsync -arv -ipv4 --include=*.bz2 $csv/csv_wp/csv_wp_active_editors.zip $dataset1001 
./wikistats/dumps/bash/count_editors_yoy.sh:rsync -arv -ipv4 --include=*.bz2 $csv/csv_wq/csv_wq_active_editors.zip $dataset1001 
./wikistats/dumps/bash/count_editors_yoy.sh:rsync -arv -ipv4 --include=*.bz2 $csv/csv_ws/csv_ws_active_editors.zip $dataset1001 
./wikistats/dumps/bash/count_editors_yoy.sh:rsync -arv -ipv4 --include=*.bz2 $csv/csv_wv/csv_wv_active_editors.zip $dataset1001 
./wikistats/dumps/bash/count_editors_yoy.sh:rsync -arv -ipv4 --include=*.bz2 $csv/csv_wx/csv_wx_active_editors.zip $dataset1001 
./wikistats/dumps/bash/rsync.sh:rsync -av /a/dammit.lt/projectcounts/projectcounts-20??.tar dataset1001::pagecounts-ez/projectcounts
./wikistats/dammit.lt/bash/dammit_published_merged.sh:echo "rsync -arv --include=*.bz2 $output/* $dataset1001"
./wikistats/dammit.lt/bash/dammit_published_merged.sh:      rsync -arv --include=*.bz2 $output/* $dataset1001
./wikistats/dammit.lt/bash/dammit_projectviews_monthly.sh:rsync -av -ipv4 projectviews_csv.zip  dataset1001.wikimedia.org::pagecounts-ez/projectviews
./wikistats/backup/zip_out.sh:rsync -ipv4 -avv $backup/out_w*.zip $dataset1001
./wikistats/backup/zip_csv.sh:echo "rsync -ipv4 -avv $zip_all/csv_*.zip $dataset1001"
./wikistats/backup/zip_csv.sh:rsync -ipv4 -avv $zip_all/csv_*.zip $dataset1001

What we'd like this to change to for the new servers labstore1006 & 7:

  • Local jobs run by Erik Zachte in stat1005 that write to specific places in /srv (in stat1005).
  • Rsync jobs that run on labstore1006 & 7 that fetch data from /srv on stat1005.
  • Proposal for new directory structure in stat1005:
- /srv/analytics/wikistats_1                                -> rsync to https://dumps.wikimedia.org/other/wikistats_1/ (_1 added to folder name, as we now also have wikistats 2 project) 

!!! this is a new url which replaces https://dumps.wikimedia.org/other/pagecounts-ez/wikistats/
    I don't expect it will break any automated downloads, since I expect all downloads are manual, but a redirect from the old url could be useful

!!! no need to copy data from dataset1001 for this folder (I repackaged the data into more zip files, some with new names; the new rsync will find all files needed on /srv/analytics/wikistats_1)

- /srv/analytics/pagecounts-ez/merged/            -> rsync to https://dumps.wikimedia.org/other/pagecounts-ez/merged
- /srv/analytics/pagecounts-ez/projectcounts/   -> rsync to https://dumps.wikimedia.org/other/pagecounts-ez/projectcounts
- /srv/analytics/pagecounts-ez/projectviews/    -> rsync to https://dumps.wikimedia.org/other/pagecounts-ez/projectviews

- /srv/media/contest_winners/WLM (Wiki Loves Monuments, at present only WLM contains data) -> rsync to https://dumps.wikimedia.org/other/media/contest_winners/WLM/
- /srv/media/contest_winners/WLA (Wiki Loves Africa) -> rsync to https://dumps.wikimedia.org/other/media/contest_winners/WLA/
- /srv/media/contest_winners/WLE (Wiki Loves Earth) -> rsync to https://dumps.wikimedia.org/other/media/contest_winners/WLE/

@Ottomata or @elukey - Do y'all have any opinions on hosting this data in /srv, or the directory structure? Also we'd need to allow for Erik to be able to write to these directories, and give read only rsync permissions for labstore1006 & 7 to fetch this data.

Hm, is ezachte the only one that pushes to dataset1001?

Proposal for new directory structure in stat1005

Hm, if all that we are doing is making his user's crons write to new locations in /srv, that's fine with me. I'd just make this /srv/wikistats_1 (no /analytics), and chown it stats:wikidev 0775. Then Erik should be able to change his jobs to write to directories there.
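A minimal sketch of that setup, assuming the stats:wikidev 0775 scheme above; a temp directory stands in for stat1005's filesystem root since the chown needs root:

```shell
# Sketch only: create the container directory with group-writable perms.
# A temp dir stands in for / on stat1005; chown needs root in reality.
root=$(mktemp -d)
install -d -m 0775 "$root/srv/wikistats_1"
# In production, additionally: chown stats:wikidev /srv/wikistats_1
stat -c '%a' "$root/srv/wikistats_1"   # prints 775
```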

@Ottomata Thanks so much, /srv/wikistats_1 seems fine. There's also media and pagecounts-ez, cool to have those at the top level in /srv too?

Oh I missed that, hm.

For https://analytics.wikimedia.org/datasets/, we use a script that will automatically rsync directory hierarchies from multiple source hosts into a destination on thorium.eqiad.wmnet. Perhaps we should use the same solution here, so that future rsyncs to the new labstore hosts for dumps.wm.org purposes can all use the same fetch job. E.g.

hostA:/srv/dumps/media
hostA:/srv/dumps/wikistats_1
hostB:/srv/dumps/whatever/else

  • Rsync cron on each labstore* to rsync from each configured source host (hostA, hostB, etc.) -> labstore*:/srv/data/dumps-rsynced/$::hostname
  • hardsync cron job that syncs from /srv/data/dumps-rsynced -> /srv/data/dumps

(replace the labstore paths with whatever is correct or makes the most sense :) )

This way, for any host that has data that needs to show up on dumps.wikimedia.org, we can just include the same class to set up the source host directory structure, and it will be mirrored to the proper location to be served at dumps.wikimedia.org.

This is a little confusing, lemme know in IRC what you think and if you have more Qs

Hm, is ezachte the only one that pushes to dataset1001?

Yep, he's it. Well, the only user. There are a couple of other rsyncs that push to dataset1001 from other hosts, but those are being, or have already been, converted.

Change 423533 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] statistics: Create /srv/dumps directory to host dumps datasets

https://gerrit.wikimedia.org/r/423533

Change 423533 merged by Madhuvishy:
[operations/puppet@production] statistics: Create /srv/dumps directory to host dumps datasets

https://gerrit.wikimedia.org/r/423533

Change 423539 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] dumps: Add rsync fetch jobs for datasets in stat1005

https://gerrit.wikimedia.org/r/423539

Change 423540 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] statistics: Create dumps rsync module to allow read from labstore1006|7

https://gerrit.wikimedia.org/r/423540

Change 423540 abandoned by Madhuvishy:
statistics: Create dumps rsync module to allow read from labstore1006|7

Reason:
/srv is already an rsync module on stat*

https://gerrit.wikimedia.org/r/423540

@ezachte Hello, after chatting with Andrew a bit, here's the direction we have in mind (pretty similar to what we talked about with some naming adjustments)

  • /srv/dumps has been created on stat1005. You should be able to write to it. And it'll serve as the root directory for all the datasets you generate.
  • We'll have the following rsync fetch jobs running from labstore1006|7
Source (stat1005) -> Dest (labstore1006/7):
/srv/dumps/wikistats_1.0 -> /srv/dumps/xmldatadumps/public/other/wikistats_1.0
/srv/dumps/pagecounts-ez/merged -> /srv/dumps/xmldatadumps/public/other/pagecounts-ez/merged
/srv/dumps/pagecounts-ez/projectcounts -> /srv/dumps/xmldatadumps/public/other/pagecounts-ez/projectcounts
/srv/dumps/pagecounts-ez/projectviews -> /srv/dumps/xmldatadumps/public/other/pagecounts-ez/projectviews
/srv/dumps/media/contest_winners/WLM -> /srv/dumps/xmldatadumps/public/other/media/contest_winners/WLM
/srv/dumps/media/contest_winners/WLA -> /srv/dumps/xmldatadumps/public/other/media/contest_winners/WLA
/srv/dumps/media/contest_winners/WLE -> /srv/dumps/xmldatadumps/public/other/media/contest_winners/WLE

I've put up a preliminary puppet patch for this at https://gerrit.wikimedia.org/r/#/c/423539/. Let me know when you've written the data into the new directory structure, and I can get the rsync crons running on our end. If you have a preferred frequency for each job to run, let me know that as well. Thanks!
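Since every destination mirrors the source path under other/, the seven jobs can be sketched as one loop; the rsync module name (srv) and the flags here are assumptions, puppet defines the real jobs:

```shell
# Generate the seven fetch commands from the shared path layout.
src_host="stat1005.eqiad.wmnet"
dest_root="/srv/dumps/xmldatadumps/public/other"
cmds=""
for d in wikistats_1.0 pagecounts-ez/merged pagecounts-ez/projectcounts \
         pagecounts-ez/projectviews media/contest_winners/WLM \
         media/contest_winners/WLA media/contest_winners/WLE; do
  cmds+="rsync -rt --chmod=go-w ${src_host}::srv/dumps/${d}/ ${dest_root}/${d}/"$'\n'
done
printf '%s' "$cmds"
```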

Hmm. Imprecise naming can cause a lot of confusion, as it did already in recent years. (I did my part to add to this confusion, by misnaming things as well)

It might not seem that important how we name things internally, but time and again there was confusion about what Wikistats consisted of, and sometimes there still is.
Wikistats started as 'all files that ezachte builds from xml dumps'. Later I added traffic reports, and those were called Wikistats as well, fine.
But then somehow the meaning narrowed down for some to 'any traffic reports ezachte built', and confusion ensued. And never quite went away ;-)

Keeping /dumps/ in all paths for labstore1006/7 (paths which translate directly to url, I assume) makes sense, as changing those would break automated jobs by third parties. (for traffic files)

For stat1005:

/srv/dumps/wikistats_1 is spot on (there is no 1.0, only 1; the Wikistats 2 team is going to change 2.0 to 2 as well in new reports).

The WL* contest winners are just collections of images prepped for offline viewing.
Nothing to do with dumps, at least not with xml dumps.
('dump' can also be a container term, without much meaning.)
/srv/media seems more straightforward to me.

Nor do the pagecounts-ez folders have anything to do with dumps; they relate to traffic (logs).
Also there are no plans to replace these traffic files, so grouping them under /srv/dumps/ may suggest a connection that isn't there.
If /srv/analytics/ doesn't feel right, how about /srv/traffic/?

The reason I suggested /srv/dumps isn't because I like the name or because any of the data has to do with dumps, but because I wanted a general purpose spot from which files on stat1005 (and possibly other stat boxes) will automatically be published to dumps.wikimedia.org. The expectation should be that if you put things in this directory, they will be published, rather than having us have to configure new rsync jobs and new directory names every time someone wants to do this.

I agree that the naming is unfortunate but that's where we are. As for the directory structure underneath of /srv/dumps, I don't mind so much. @madhuvishy, instead of making rsync jobs for every single of Erik's outputs, why not just mirror the directory structure in /srv/dumps on stat1005, and rsync the whole thing? Hm, also, it seems all data ends up at dumps.wikimedia.org/other, so maybe we can make that a restriction? /srv/dumps/other or /srv/dumps-other, perhaps?

OR! Another idea. We have /srv/published-datasets for this very thing: Serving analytics type datasets from analytics.wikimedia.org/datasets. What if we stopped copying to labstore hosts altogether, and just modified the https://dumps.wikimedia.org/other/ page to point links at new locations over on analytics.wikimedia.org, with some nice redirects too. Then Erik could put files in /srv/published-datasets (we'd have to think of the proper hierarchy here), and they'd automatically show up in analytics.wikimedia.org/datasets. No new crons or rsyncs needed.

A downside of that idea is that analytics.wikimedia.org only lives on thorium. The /datasets exist on their source hosts, but if someone deletes them there, they get deleted from thorium too. There's no failover or redundancy for data there.

Thoughts?

My thoughts on the above:

Not only are there no backups and no failover if we use a copy on thorium, but there are also no mirrors by third parties, and no NFS availability of the data for people on WMF Cloud instances.

If 'dumps' is a problematic name, let's use the more generic "datasets", without the 'public-' part of the name. I don't much care what the subdirectories are called; I don't think we should lock ourselves into keeping the same names as on the web server; things move around sometimes because even in urls, naming is hard. It's not a particular burden to pass a list of sources and destinations to the fetcher manifest to pick up as many items from as many hosts and subdirectories as we need, over multiple rsyncs for convenience.

Hm, maybe we could use published-datasets for this, and ALSO sync everything in published-datasets to dumps.wm.org/other using hardsync? Everything in https://analytics.wikimedia.org/datasets/ would then be available too.

Not sure if that is a good idea or not.

I'm ok with having all the content on dumps.wm.org, as long as growth is planned for well in advance. The exact mechanics are another thing; we want to pull from the dumps.wm.o end. Let's see what @madhuvishy says about this too.

Thanks @ezachte and @Ottomata for the helpful explanations! I think I'd like to get the immediate task at hand done first before talking about using different rsync mechanisms and also having other analytics datasets available on the dumps distribution servers.

As for /srv/dumps in stat1005, I think it is just a directory that serves as a container for all things that are being shipped off to the dumps distribution servers. It makes sense to me to have a bit of generic name there, but don't care if the name is dumps or datasets or anything else. Erik, I'll defer to you for what you'd like the container directory to be named, and I'm cool with the wikistats_1 naming too. Thank you!

And as for having many rsync jobs rather than one thing that mirrors everything in the directory, I think having individual jobs is useful so it's intentional what things get rsynced over and how often. I'd like to not accidentally pick up some dataset that is dropped in the parent directory and have our disk usage go up, and I'd also like to be able to kill any of the individual rsync jobs as and when some of these get phased out or switched to different mechanisms.

published-datasets comes close; although images aren't data, it would work for me.
published-files? [-files] doesn't add much.
How about solely published? I can't log in right now, but I vaguely remember that already exists.

I think using something that looks like 'published-datasets' but is not that will be confusing for folks. It is unfortunate that dumps.wm.org is named as it is, but that is where these files are being hosted from. Unless there is a plan to change the name of 'dumps.wikimedia.org' in the future AND if we are not trying to build a generic solution anyway, then we should use something that is clear. I like /srv/dumps.

If/when we decide to make a generic non-dumps based solution, then we should figure out a better name. I'm fine with solving this as a one off as Madhu prefers now, but beware that this might make future work and confusion for future Madhuvishys (and/or others). :)

I think having individual jobs is useful so it's intentional what things get rsynced over and how often.

I guess this is fine, since historically there aren't that many things that get put on dumps. We did this for analytics.wm.org/datasets because we did not want to require ops intervention when analysts and researchers wanted to make a dataset public.

Am I bike-shedding? Perhaps.

We're not gonna change all urls. That would break too many 3rd party jobs.
So whatever we choose will have to live in an imperfect world.

The merit of /srv/dumps is indeed that it bears the connection to the 'wikimedia dumps server' in the name. I can see that.
The merit of /srv/published is that it does not work against anyone who wants to navigate the folder tree on intuition in search of traffic data and reports.

Enough said. Both have their merits. I rest my case.

Let's just go with /srv/dumps since we already have that set up then.

Bike shedding is important! We'll have to look at that bike shed color for a long time.

Since Madhu is not going to make a generic rsync job for /srv/dumps/other -> dumps.wikimedia.org/other, I'd be fine with calling this something else. (Madhu, I'm going to send folks to you when/if they want stuff to show up on dumps :) )

So, what else? /srv/public-other with a README specifying what the directory is for?

Thanks Andrew, sounds good to me.

Y'all, I'd like to gently point out the primary goal here - we want the rsyncs to happen on the labstores and not from stat1005. To that end, I'm just looking for a directory (or directories) to pull from on stat1005. I think we've all agreed on /srv/dumps as the container at least once. The directory already exists. /srv/public-other seems even more generic to me. I'm happy to add a README to /srv/dumps that says this is the container directory for things that are shipped to the dumps distribution servers.

What @madhuvishy said. /srv/dumps is "good enough" that we did all say it's ok (and I dislike names like public-other rather a lot). Thanks everybody for weighing in. Let's get this show on the road.

/srv/dumps fine with me ok!

There's a report on the xmldatadumps mailing list that pageviews-ez files are missing for April. Indeed they are not available from the web server. Any ideas?

TheDJ added a subscriber: TheDJ.Apr 10 2018, 2:41 PM

Does this explain why the dataset has been missing since April 1st?
https://dumps.wikimedia.org/other/pagecounts-ez/merged/2018/

AFAIK this has not yet happened, so no. The same jobs should be 'just running' on the new web servers.

The files are on dataset1001. Which means that Erik hasn't updated his scripts to point to the new server (dumps.wikimedia.org).

Or, more correctly, he hasn't updated his jobs to write the files to /srv/dumps on stat1005?

I'm not sure there's been a patch merged to pull from there yet. https://gerrit.wikimedia.org/r/#/c/423539/ Still waiting to be merged.

The rsync config that allows old-style @ezachte syncs to labstore1006 & 7 already exists. We haven't talked about switching on the old setup for the new servers, since I thought the jobs were being changed so we can sync from stat1005. I'm ready to turn on the rsync jobs whenever, but I don't see the data in /srv/dumps yet either.

Exactly. Erik needs to update his scripts to point to the new host.

Mentioned in SAL (#wikimedia-analytics) [2018-04-12T20:34:52Z] <ottomata> replacing references to dataset1001.wikimedia.org:: with /srv/dumps in stat1005:~ezachte/wikistats/dammit.lt/bash: for f in $(sudo grep -l dataset1001.wikimedia.org *); do sudo sed -i 's@dataset1001.wikimedia.org::@/srv/dumps/@g' $f; done T189283

Change 423539 merged by Ottomata:
[operations/puppet@production] dumps: Add rsync fetch jobs for datasets in stat1005

https://gerrit.wikimedia.org/r/423539

Change 425897 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Fix typo in stat_dumps jobs

https://gerrit.wikimedia.org/r/425897

Change 425897 merged by Ottomata:
[operations/puppet@production] Fix typo in stat_dumps jobs

https://gerrit.wikimedia.org/r/425897

Change 425898 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use valid minute for stat_dumps fetch job

https://gerrit.wikimedia.org/r/425898

Change 425898 merged by Ottomata:
[operations/puppet@production] Use valid minute for stat_dumps fetch job

https://gerrit.wikimedia.org/r/425898

This seems to be broken. Not sure what you were trying to do (it's past my bedtime) but seeing a lot of messages like:

Cron <dumpsgen@labstore1007> bash -c '/usr/bin/rsync -rt --delete --chmod=go-w stat1005.eqiad.wmnet::/srv/dumps/wikistats_1.0/ /srv/dumps/xmldatadumps/public/other/wikistats_1.0/'

ERROR: The remote path must start with a module name not a /
rsync error: error starting client-server protocol (code 5) at main.c(1653) [Receiver=3.1.1]
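That error is about daemon-mode path syntax: after `host::`, the path must begin with a module name, never `/`. A throwaway local rsync daemon reproduces the difference (the port, module, and file contents here are invented for the demo; it assumes rsync is installed):

```shell
# Start a disposable rsync daemon exposing a module named `srv`.
tmp=$(mktemp -d)
mkdir -p "$tmp/srv/dumps/wikistats_1.0"
echo hello > "$tmp/srv/dumps/wikistats_1.0/readme.txt"
cat > "$tmp/rsyncd.conf" <<EOF
port = 18873
pid file = $tmp/rsyncd.pid
lock file = $tmp/rsyncd.lock
log file = $tmp/rsyncd.log
use chroot = no
[srv]
    path = $tmp/srv
    read only = yes
EOF
rsync --daemon --config="$tmp/rsyncd.conf"
sleep 1
# WRONG: a leading / after :: is rejected by the daemon with
# "ERROR: The remote path must start with a module name not a /"
rsync --port=18873 -rt localhost::/srv/dumps/wikistats_1.0/ "$tmp/dest1/" \
  2>/dev/null && echo "unexpected success" || echo "leading slash rejected"
# RIGHT: module name first, no leading slash
mkdir -p "$tmp/dest2"
rsync --port=18873 -rt localhost::srv/dumps/wikistats_1.0/ "$tmp/dest2/"
cat "$tmp/dest2/readme.txt"
kill "$(cat "$tmp/rsyncd.pid")"
```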

Change 425915 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Fix rsync module refernce for stats_dumps

https://gerrit.wikimedia.org/r/425915

Change 425915 merged by Ottomata:
[operations/puppet@production] Fix rsync module refernce for stats_dumps

https://gerrit.wikimedia.org/r/425915

Now seeing a lot of:

Cron <dumpsgen@labstore1006> bash -c '/usr/bin/rsync -rt --delete --chmod=go-w stat1005.eqiad.wmnet::srv/dumps/media/contest_winners/WLE/ /srv/dumps/xmldatadumps/public/other/media/contest_winners/WLE/'
rsync: change_dir "/dumps/media/contest_winners/WLE" (in srv) failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1655) [Receiver=3.1.1]
rsync: read error: Connection reset by peer (104)

Cron <dumpsgen@labstore1006> bash -c '/usr/bin/rsync -rt --delete --chmod=go-w stat1005.eqiad.wmnet::srv/dumps/wikistats_1.0/ /srv/dumps/xmldatadumps/public/other/wikistats_1.0/'
rsync: change_dir "/dumps/wikistats_1.0" (in srv) failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1655) [Receiver=3.1.1]
rsync: read error: Connection reset by peer (104)

and so on, for all these directories.

Change 425962 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] no stat1005 /sv/dumps rsyncs to dumps servers until there's data

https://gerrit.wikimedia.org/r/425962

Change 425962 merged by ArielGlenn:
[operations/puppet@production] no stat1005 /sv/dumps rsyncs to dumps servers until there's data

https://gerrit.wikimedia.org/r/425962

My bad! I totally forgot to follow up on this (and was behind on mail as well).

I just moved all files in stat1005:/home/ezachte/wikistats_data/dammit/pagecounts/merged to the new location stat1005:/srv/dumps/pagecounts-ez/merged.
Hopefully the new rsync job will pick these up soon.

Tomorrow I will amend the bash files that create new files to also do so in the new location.
And also look into other (unrelated) folders that were to be migrated as well.

I apologize for the inconvenience caused!

New daily pagecounts files exist at stat1005 in /srv/dumps/pagecounts-ez/merged/2018/2018-04

I don't see them yet in https://dumps.wikimedia.org/other/pagecounts-ez/merged/2018/2018-04

Maybe the rsync job hasn't run yet?

BTW Does it run once an hour or once a day? Once a day would be enough if all goes as intended. If that's not the case once an hour would allow quicker fixes.

Change 426928 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Re-enable dumps/other fetcher rsync job, simplify jobs

https://gerrit.wikimedia.org/r/426928

Change 426928 merged by Ottomata:
[operations/puppet@production] Re-enable dumps/other fetcher rsync job, simplify jobs

https://gerrit.wikimedia.org/r/426928

Change 426929 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Remove trailing / from rsync locations

https://gerrit.wikimedia.org/r/426929

Change 426929 merged by Ottomata:
[operations/puppet@production] Remove trailing / from rsync locations

https://gerrit.wikimedia.org/r/426929

Change 426931 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add $delete paramater to dumps::web::fetches::job

https://gerrit.wikimedia.org/r/426931

Change 426931 merged by Ottomata:
[operations/puppet@production] Add $delete paramater to dumps::web::fetches::job

https://gerrit.wikimedia.org/r/426931

The cron had been disabled, because the source locations didn't exist and were breaking things. I just reenabled them.

Files should be copied once an hour.
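For reference, an hourly fetch of the sort described would look roughly like this as a cron entry (the minute, user, and exact flags here are illustrative; puppet manages the real entry):

```
# /etc/cron.d style entry (illustrative only)
# m h dom mon dow user     command
42  * *   *   *   dumpsgen /usr/bin/rsync -rt --chmod=go-w stat1005.eqiad.wmnet::srv/dumps/pagecounts-ez/merged/ /srv/dumps/xmldatadumps/public/other/pagecounts-ez/merged/
```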

@Ottomata the daily merged files are copied now, thanks.

But what happened to the monthly files for previous years? They aren't there.
Files like pagecounts-2018-01-views-ge-5.bz2, but for earlier years.

Yeah, rats. Madhu's original rsync crons used the --delete flag. I disabled that flag in https://gerrit.wikimedia.org/r/426931, but by that time it had already run once. I'm now rsyncing pagecounts-ez over (without --delete) from (old) dataset1001 to restore anything that was there before.

Change 427111 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Allow rsyncing to dumps pagecounts-ez and media from dumps peer hosts

https://gerrit.wikimedia.org/r/427111

Change 427111 merged by Ottomata:
[operations/puppet@production] Allow rsyncing to dumps pagecounts-ez and media from dumps peer hosts

https://gerrit.wikimedia.org/r/427111

@Ottomata Thanks for fixing up the rsync jobs! Can we close this task now?

Think so! We can reopen if there are more problems.

ezachte closed this task as Resolved.Apr 17 2018, 6:31 PM
ezachte claimed this task.

All data look good to me now, and are being updated.

Fingers crossed about the next server migration though; we saw an (unrecoverable) rsync typo on the previous migration (stat1002 -> stat1005).
Is this Murphy's Law or the 2nd Law of Thermodynamics (given enough time, chaos is unavoidable)?

I'm not complaining, I make enough mistakes myself.

But please note https://dumps.wikimedia.org/other/pagecounts-ez/merged/ is our version of the Zeitgeist, comparable to the Twitter archive at the Library of Congress.
A gold mine for future data archaeologists.

I'll look into the HDFS backup script once more to ensure that we're good there.