Done. A couple of patches also went out with this deploy: one adding sample code for a job, and another setting the default config to skip that job.
Thu, May 13
May I suggest we stuff the revId right into the exception message? It's passed as a parameter to the logger but doesn't end up being logged, afaict; at least I don't find it in the logstash entry. See e.g. https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-deploy-2021.05.13?id=yqfCZXkBA6MeBtBq-eAT
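For illustration, a minimal Python sketch of the idea (the `DumpError`/`fail_revision` names are hypothetical, not the actual code): if the id is baked into the message string itself, it survives into whatever backend ends up storing the exception, whether or not structured logger parameters get indexed.

```python
# Hypothetical sketch: put the revision id into the exception message itself,
# so it shows up in logstash even if structured logger parameters are dropped.
class DumpError(Exception):
    pass

def fail_revision(rev_id: int, reason: str) -> None:
    # The id is part of the message string, not just a logger parameter.
    raise DumpError(f"revision {rev_id}: {reason}")

try:
    fail_revision(12345, "text retrieval failed")
except DumpError as e:
    assert "12345" in str(e)
```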
Wed, May 12
I need to look at these when the temp files are still around, which would be on Monday evening for both the json and ttl files. I'll try to remember to do that next week.
Tue, May 11
Mon, May 10
Pinging @Addshore directly, any chance you are still generating this data, or alternatively, that you still have the tools around and could easily do so?
I've added the email address requested via pm in irc, following the instructions at https://wikitech.wikimedia.org/wiki/SRE_Clinic_Duty#Mail_aliases
Whatever happened to media backups? Was an implementation decided on or even completed?
Welp. Didn't do these because of holidays. No harm done, but I guess they will go before the May 20th run.
Fri, May 7
Thu, May 6
So where are you at in the process? You can follow this task until we have something that's more ready for others to try, or you can jump in with us to build the infrastructure, let me know and we can plan accordingly.
Hey just following up, any luck with the elwikiversity issue of too many requests in parallel?
You know there is another airflow-common-usage task around, here it is: T237361
@Cmjohnson These are all yours to decomm whenever you like. Thanks a lot!
The new hosts are busily running dumps and the old ones have been marked as spare. Closing!
<13>May 4 00:08:29 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20210503/commonswiki-20210503-cirrussearch-file.json.gz
Wed, May 5
What are the next steps on this? Should I be tweaking a manifest someplace?
Tue, May 4
Thanks for this. We should also not start a new rsync on our side if one is already running on the same host. And someone needs to double-check that there are timeouts, so that we won't have processes hanging and never completing, although I don't think that's the case for the recent incidents.
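For illustration only (not the actual puppet/rsync setup), a sketch of the skip-if-already-running idea using an advisory file lock; the lock path is a placeholder:

```python
# Illustrative sketch: take a per-host lock file before starting a pull,
# and skip the run if another invocation already holds the lock.
import fcntl
import os

def try_lock(path: str):
    """Return an open fd holding an exclusive lock, or None if already locked."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd
    except BlockingIOError:
        os.close(fd)
        return None

lock = try_lock("/tmp/rsync-demo.lock")  # placeholder path
if lock is not None:
    # run rsync here (e.g. via subprocess, with rsync's own --timeout option),
    # then release the lock:
    fcntl.flock(lock, fcntl.LOCK_UN)
    os.close(lock)
```

A second invocation on the same host finds the lock held and gets None back instead of starting a competing rsync; pairing this with rsync's `--timeout` option covers the hanging-process case.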
Mon, May 3
I've seen some in the past month, indeed.
@ERROR: max connections (6) reached -- try again later
rsync error: error starting client-server protocol (code 5) at main.c(1675) [Receiver=3.1.3]
That was labstore1007 on Apr 14. We had one on Apr 8 as well.
Thu, Apr 29
After merging the above patch, I needed to remove the cron jobs from the dumpsgen crontab manually on snapshot1006,7, since switching the role to testbed does not (and can't really) do that. I also tested angwikibooks and skwikibooks full dump runs with the test config file that writes output to a test directory. The first wiki had previous runs, so we tested prefetch with that; the second one did not, so we tested db fetches of text content with that. Everything looks ready to go.
Thanks for showing up, and please come again AND tell your friends! See you at a deploy window again soon! Adding @thcipriani as an FYI since he's sort of corralling the trainings. And closing this task as done!
Let's see about getting you that +2 on the MediaWiki config repo. Aaaand done by @Reedy already. Please check but you should have the bits; you needed to be added to the wmf-deployment group.
p.s. I think we'll have another trainer in the meeting. But most of the time we don't bother to ack the invite :-D
Yes please, just show up! That would be great!
Wed, Apr 28
I had to manually edit /srv/deployment/dumps/dumps-cache/.config on all three hosts to change the name of the upstream host from deploy1001 to deploy1002; it's still wrong in the repo on deploy1002. See T197470
Just ran into this today on an install of new snapshot1011,12,13: got the dreaded
Apr 28 11:47:52 snapshot1011 puppet-agent: Execution of '/usr/bin/scap deploy-local --repo dumps/dumps -D log_json:False' returned 70: #007
Apr 28 11:47:52 snapshot1011 puppet-agent: (/Stage[main]/Profile::Dumps::Generation::Worker::Common/Scap::Target[dumps/dumps]/Package[dumps/dumps]/ensure) change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy-local --repo dumps/dumps -D log_json:False' returned 70: #007
for the dumps repo. I have manually edited the dumps-cache/config files on those hosts and left the DEPLOY_HEAD file in the dumps repo on deploy1002 untouched, so that any proposed solution can be tested there. I have two more hosts yet to roll out, so we can definitely check what works.
Thanks for that clarification, Liam. The 'regular uploading to IA' example is the sort of thing I had in mind.
I am proactively adding @hoo as he can provide some insight and perhaps tag others as well.
Tue, Apr 27
What prevents someone from uploading the dailies from a WMCS instance to archive.org? Do we want to deter that, encourage it, have no opinion?
Linking here as a related issue: T281048 (storage for security related data also under discussion there)
Ah ok! I didn't mean to be hasty, just saw the reimaging script runs and got excited :-)
After discussion with jbond, reopening for further discussion on better alerting in case of failures.
Note that the only way we found out about these usage errors from the systemd timer wrapper script was that a vigilant user of one of the output datasets happened to notice they weren't being produced. The error messages themselves just silently went into syslog with no one being the wiser. We might want to think about better reporting that nonetheless doesn't mean piles of cronspam.
Folks cc-ed on this should decide whether their jobs ought to start later today rather than not running at all, and either do it or poke me to do it, if so.
This was caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/679292 which added arg parsing to the systemd timer wrapper script.
Hey, this looks almost done, am I reading that right? :-) :-)
In the short term, fewer dumps could be kept, although that only gets us so far.
Sun, Apr 25
If people move stuff off of /srv/security we could get 0.5T back, which would be helpful. Some of those files are from a few years ago.
Fri, Apr 23
For comparison, that 1.1 GB json-encoded html file bz2-compressed down to 348 MB, and the 7z-compressed one is 286 MB: quite a savings when you consider the larger files.
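Just to spell out the arithmetic on those figures (sizes approximate):

```python
# Compression ratios for the numbers quoted above (approximate sizes).
original_mb = 1.1 * 1024   # the 1.1 GB json-encoded html file, in MB
bz2_mb = 348
sevenzip_mb = 286

bz2_fraction = bz2_mb / original_mb            # fraction of original size
sevenzip_fraction = sevenzip_mb / original_mb
print(f"bz2 keeps {bz2_fraction:.0%}, 7z keeps {sevenzip_fraction:.0%} of the original size")
```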
As folks might guess from all the merges, the first email via MAILTO to ops-dumps arrived today, verifying that part of the migration, so get your patches in, Amir! :-)
<13>Apr 22 04:16:02 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20210419/mniwiktionary-20210419-cirrussearch-content.json.gz
<13>Apr 22 04:16:02 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20210419/mniwiktionary-20210419-cirrussearch-general.json.gz
Live on the web server. Have a great weekend when it arrives!
Thu, Apr 22
Wed, Apr 21
I was able to successfully download the dump for elwikisource, so that tells me that the basic functionality of the script is good.
If you have no further questions, I'll close this task. You might also consider subscribing to the xmldatadumps-l mailing list (https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l), a relatively low-traffic list where announcements about dumps are sent and where people familiar with uses of the dumps discuss with each other.
Tue, Apr 20
As long as people are forewarned to run only true unit tests (which by definition don't write to the db) and not integration tests; if they aren't sure, they should look at the specific test to see what it does.
Thanks for this info. I might try with a different wiki and see what happens there. Let me know when the namespace change happens and I'll update my script accordingly.
So... would anyone mind if I went ahead and JFDI? I probably would have to concoct some custom role but if people don't mind the extra manifest it shouldn't be too hard.
I'm not averse to that if we can determine there are no real users of the data, and there's an acceptable substitute. These dumps aren't needed for making a replica, or for doing analysis of the content.
That config fix is live on snapshot1008 and will take effect during the next run (next week). I'll leave the task open until we've verified that it runs ok. After that, I'll check ttl and json file sizes and perhaps adjust the comments in that file to reflect what to look at in order to guesstimate the size in the future.
Cc-ing @Cparle who has worked on these in the past. (If there's a better person, please let me know.)
I have added some information about filenames here: https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download#XML_files if that is helpful. Please let me know if there is other information missing!
Hi, dumps maintainer here.
Mon, Apr 19
Code stash place until it works and I figure out where in the wmf repo system it ought to live: https://github.com/apergos/okapi-downloader
curl -L -u USERNAME_HERE --output got-this.json https://api.wikimediaenterprise.org/v1/exports/json/elwikiversity
when supplied with the right password, gets only a file of size
-rw-rw-r--. 1 ariel ariel 565816 Apr 19 14:06 got-this.json
with 99 entries in it, truncated.
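One hypothetical way a downloader script could detect this kind of truncation (a sketch only, assuming the server sends a Content-Length header; URL and paths are placeholders):

```python
# Hypothetical sketch: detect a truncated download by comparing the bytes
# actually received against the Content-Length header, when the server sends one.
import urllib.request

def download_and_check(url: str, dest: str) -> bool:
    """Download url to dest; return True if the byte count matches Content-Length."""
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        expected = resp.headers.get("Content-Length")
        received = 0
        while chunk := resp.read(64 * 1024):
            out.write(chunk)
            received += len(chunk)
    # If no Content-Length was sent, we can't tell; treat as ok.
    return expected is None or received == int(expected)
```

If the server doesn't send Content-Length, this check can't catch truncation; published per-file hashes would be more robust.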
Fri, Apr 16
<13>Apr 15 03:00:52 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20210412/mniwiktionary-20210412-cirrussearch-content.json.gz
<13>Apr 15 03:00:52 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20210412/mniwiktionary-20210412-cirrussearch-general.json.gz
I see the size of each dump is available in the json project list output, which is great. Can we also get md5 or sha1 hashes of these files via the same endpoint? This would be extremely nice for downloaders, and also for us to verify that downloads are complete and without corruption.
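Published checksums would let downloaders verify files with just a few lines; a sketch of what the client side could look like (the choice of sha1 here is just an example, and how the hash is published via the endpoint is an assumption):

```python
# Sketch: compute the sha1 of a downloaded dump file for comparison against
# a hash published alongside it (the publishing mechanism is hypothetical).
import hashlib

def sha1_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large dumps don't need to fit in memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```

The result would then be compared against the hash field in the project list entry for that dump.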
Additional update via email from the OKAPI folks:
I've received credentials from Ryan, which will get placed into the private puppet repo soon enough, and the access point is https://api.wikimediaenterprise.org/v1/docs/index.html#/ which produces json output. We can now start getting to work on scripting this. We'll set a descriptive user agent in the test and eventual production script including the standard email contact address for dumps, once there's something to test.