Enhancements to vagrant dumps role
Open, Low, Public

Description

In T185116, we got a basic vagrant role running to the point that it would perform the simplest dump. Now, we want to add more functionality to exercise important features:

  • This role should provision a second wiki to allow multi-wiki dumps.
  • Populate empty tables: change_tags, image, iwlinks, langlinks, page_props, page_restrictions, protected_titles, redirect, sites

Here are features we'll explicitly never implement in this role:

  • send mail on error
  • run a monitor
  • do any rsyncs whatsoever
  • run a web server beyond the bare minimum
  • run anything out of cron

Event Timeline

We're still waiting for @bd808 to give a thumbs up on the first patch and merge, right? Not that it has to be merged before work starts on these new features, of course.

The above patchset was merged earlier this week so this task can move full speed ahead whenever!

We should support testing the 'misc' cron job dumps (from the command line); @Smalyshev did a little work on this for wikibase dumps, see https://gerrit.wikimedia.org/r/#/c/mediawiki/vagrant/+/456673/ and I think there might be more patchsets coming (hopefully :-) )

Change 459538 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/vagrant@master] add supporting shell script for crons in dump role and restructure dir locations

https://gerrit.wikimedia.org/r/459538

The above patch is almost ready for review. I have run the regular xml/sql dumps with it applied, and they run clean. I'd like to try one of the 'misc' dumps that uses the misc crons functions and make sure it runs correctly.

Huh, it needed some changes for that. One more round of testing needed for the latest patch.

Finally. The current patchset works with xml/sql dumps, and I've tested it for category rdf dumps by copying in dump_functions.sh, dumpcategoriesrdf-shared.sh, and dumpcategoriesrdf.sh to /usr/local/etc and/or /usr/local/bin, and also making the log dir /var/log/categoriesrdf owned by vagrant. The log dirs for the various misc scripts can be added in a later patchset; this is just verification that the current code works.

Check out also https://www.mediawiki.org/wiki/User:Smalyshev_(WMF)/Dump_Test which describes how I tested dumps and which tweaks may be useful.

I'd like to run the wikidata rdf test too to make sure it works; how did you get content in there, just apply the wikidata role?

Applying wikidata role does not create content, AFAIK. I just went to Special:NewItem and created a bunch of them manually. There is probably a way to load it from dumps etc. (WMDE folks and particularly @Addshore may know some better ways) but I just made them manually.

It looks like I might be able to get the job done with importDump.php and a small selection of content from an existing wd dump. I'll experiment some.
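As a rough illustration of pulling "a small selection of content" out of an existing dump, here is a minimal Python sketch. It is not what was actually used in this task. The function name `take_first_pages` is hypothetical, and the sketch assumes the common dump layout in which `</page>` and `</mediawiki>` each sit on their own line.

```python
def take_first_pages(src, dst, limit):
    """Copy the dump header and the first `limit` <page> blocks from src to dst.

    src and dst are text-mode file objects. Assumes (not verified for every
    dump) that </page> and </mediawiki> each appear alone on a line.
    Returns the number of pages copied.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    pages = 0
    for line in src:
        stripped = line.strip()
        if stripped == "</mediawiki>":
            # Reached the end of the dump before hitting the limit.
            break
        dst.write(line)
        if stripped == "</page>":
            pages += 1
            if pages >= limit:
                break
    # Always close the root element so the output is well-formed XML.
    dst.write("</mediawiki>\n")
    return pages
```

For a bz2-compressed dump, the input could be opened with `bz2.open(path, "rt", encoding="utf-8")`, and the resulting file handed to importDump.php. Working line by line keeps the original header (including the siteinfo block) untouched and avoids loading the whole dump into memory.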

Well, in applying the wikidata role I got:

==> default: Notice: /Stage[main]/Role::Wikidata/Mediawiki::Extension[Wikibase]/Git::Clone[mediawiki/extensions/Wikibase]/Exec[git_clone_mediawiki/extensions/Wikibase]/returns: executed successfully
==> default: Error: Command exceeded timeout
==> default: Error: /Stage[main]/Role::Wikidata/Mediawiki::Extension[Wikibase]/Php::Composer::Install[/vagrant/mediawiki/extensions/Wikibase]/Exec[composer-install--vagrant-mediawiki-extensions-Wikibase]/returns: change from notrun to 0 failed: Command exceeded timeout
==> default: Notice: /Stage[main]/Role::Wikidata/Mediawiki::Extension[Wikibase]/Mediawiki::Settings[Wikibase]/File[/vagrant/settings.d/puppet-managed/10-Wikibase.php]: Dependency Exec[composer-install--vagrant-mediawiki-extensions-Wikibase] has failures: true
==> default: Warning: /Stage[main]/Role::Wikidata/Mediawiki::Extension[Wikibase]/Mediawiki::Settings[Wikibase]/File[/vagrant/settings.d/puppet-managed/10-Wikibase.php]: Skipping because of failed dependencies

So that's a bit of an issue. Any ideas, @Smalyshev? This is after applying the dumps role, but I can't imagine that makes a difference.

Well those were some fun rabbit holes. I'll document what I did, but the long and short of it is, after sorting out github issues and then composer issues, it turns out that there are some annoyances with CommonSettings for the wikidata role. I need to think about the minimal workaround needed for the wikidata role and the dumps role to play well together, as well as the dumps role all by itself.

The latest version of the patchset should allow one to apply the wikidata role, then apply the dumps role, then dump any of the wikis configured, including wikidata, and (once the wikidata json, rdf, and dumps_functions and dcat scripts/config are copied in) dump those too.

I'm going to do a final 'test from scratch' and write up my entire procedure; then there's a couple things I could do to the xml/sql dumps scripts to make a couple kludgey symlinks go away; and then we might be ready for a merge.

https://apergos.wordpress.com/2018/09/19/xml-sql-dumps-and-mediawiki-vagrant-two-great-tastes-that-taste-great-together/ This is much longer than anyone here cares about, just skip to the last few sections and especially 'next steps'.

I really dislike having several symlinks to get everything to see the MWScript.php file but I'd rather get the patchset merged in now than wait for all of those separate changes to separate repos to get in during appropriate windows when dumps aren't running.

What do folks think about putting all the 'misc dump scripts' (puppet/modules/snapshot/files/cron) in their own repo, rather than having them in puppet? They'd have to be deployed by scap I suppose, but they could then be easily cloned into any testing platform instead of each tester having to copy in the scripts to e.g. vagrant or wherever else.

What do folks think about putting all the 'misc dump scripts' (puppet/modules/snapshot/files/cron) in their own repo

I'd be fine with whatever is more convenient, I don't have any specific dependency on where these scripts are. As long as there's a specific process for getting it deployed, it's fine for me.

@awight What was your thinking behind the choice of /vagrant/srv/dumps/output as the location for dumps output files, as opposed to someplace not on the nfs mount?

To add some context, some Vagrant setups (i.e. on development machines) have /vagrant set up in a way that does not easily allow other users to write there or change permissions. This causes a lot of hassle that could be avoided if the directory did not reside in /vagrant space.

@awight What was your thinking behind the choice of /vagrant/srv/dumps/output as the location for dumps output files, as opposed to someplace not on the nfs mount?

NFS mount if we're lucky :) or some other kludgey transport… I think moving the output dir is a great idea. We already have apache serving the generated files IIRC, so debugging the generated files using a web browser is still possible outside of /vagrant.

I've moved the dumps output to /var/www/dumps and the webroot to /var/www/dumps/public; what do folks think about that? (Note I have not tested the module with these new changes.)

If we like that, then I would:

  • get this merged
  • merge in the MWScript.php-related patches for dumps python scripts and for the misc cron scripts, and remove those symlinks from this module
  • remove the devwiki stuff and make this role include the wikidata role
  • discuss adding an initial import of say 25 pages to wikidata (5 minute wait for that to run?)

discuss adding an initial import of say 25 pages to wikidata

Great idea! But I am not sure a puppet role is a good fit for this: since puppet is a stateless "how it should be" tool, putting one-time actions there is, in my experience, somewhat of a hassle. And I probably don't want to re-run the import each time I reprovision anything. Maybe have a script that does it deployed instead? It would also be nice to import something in all three Wikidata namespaces: Items, Properties and Lexemes.

Yes, sorry I wasn't clear. I meant to include some base data and a script, rather than force an import, since some users may not want that data. Info on how to do the import would go in the README. As a side note, importing can be done as a one-time thing in puppet; the dumps role does this already.
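To make the "one-time thing" idea concrete, here is a minimal Python sketch of a marker-file guard, the same general pattern that Puppet's exec resource offers through its `creates` parameter. It does not show how the dumps role actually implements this. The helper name `run_once` and the marker path in the usage note are hypothetical.

```python
from pathlib import Path


def run_once(marker: Path, action) -> bool:
    """Run action() only if the marker file does not exist yet.

    The marker is created after action() returns, so a failed run
    (one that raises) will be retried on the next call.
    Returns True if the action ran, False if it was skipped.
    """
    if marker.exists():
        return False
    action()
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()
    return True
```

A standalone import script could wrap its work as `run_once(Path("/var/lib/dumps-role/imported"), do_import)`, where both the path and `do_import` are placeholders. Reprovisioning then skips the import unless someone deletes the marker on purpose.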

Change 459538 merged by jenkins-bot:
[mediawiki/vagrant@master] add supporting shell scripts for dump role and restructure dir locations

https://gerrit.wikimedia.org/r/459538

Change 463223 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/vagrant@master] dumps role: point to the README from the motd banner

https://gerrit.wikimedia.org/r/463223

While not strictly a blocker, some related work is at T205825.

Change 463809 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/vagrant@master] remove left-over /vagrant/srv/docroot/dumps/www directory from manifest

https://gerrit.wikimedia.org/r/463809

Change 463809 merged by jenkins-bot:
[mediawiki/vagrant@master] remove left-over /vagrant/srv/docroot/dumps/www directory from manifest

https://gerrit.wikimedia.org/r/463809

Aklapper triaged this task as Low priority. Dec 6 2022, 10:26 AM