I've updated the draft RFC to remove the 'final' schema, leaving the 'transitional' schema as the new schema proposal; I've reworked the 'header changes' section, leaving my question about possible changes to slot role names in there for comment. Still thinking about @daniel's proposal (a series of text tags with attributes).
Mon, Aug 13
Thanks for the input so far!
The job is rerunning and output is being produced, closing this ticket.
I will hazard a guess that the script initially retrieves the correct db for the dumps group but does not try to refresh it periodically (or on every failure). We could consider doing so after some number of consecutive failures; @hoo what do you think?
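The refresh-after-some-failures idea could look roughly like this (a sketch only; the class and callable names are hypothetical, not the actual script's API):

```python
# Sketch of "refresh the db config after N consecutive failures".
# All names here are made up for illustration.

MAX_CONSECUTIVE_FAILURES = 3  # hypothetical threshold

class DbPicker:
    def __init__(self, load_config):
        # load_config() returns the current list of candidate dbs
        # for the dumps group (hypothetical callable).
        self.load_config = load_config
        self.dbs = load_config()
        self.failures = 0

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.failures >= MAX_CONSECUTIVE_FAILURES:
            # Re-read the config instead of retrying a stale db forever.
            self.dbs = self.load_config()
            self.failures = 0
```

The point is just that a long-running script shouldn't hold on to the db list it read at startup forever; whether the refresh happens on a timer or after N failures is up for discussion.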
I didn't get a chance to check things then, but all open db connections from snapshot1008 are now to db1087, as they should be. This will take some digging.
2bd7259a2c88a2bcc30a9217770a64268e161305 is the commit which changed the behavior of maintenance scripts, merged on Aug 7th.
Fri, Aug 10
Ah ha! I hope whoever added the file will update it with the information you need. (I wonder who that kind person was?)
The draft at https://www.mediawiki.org/wiki/Requests_for_comment/Schema_update_for_multiple_content_objects_per_revision_(MCR)_in_XML_dumps is ready for a first round of comments by people on this ticket (or people just following along). Anything from 'that schema is wrong' to 'why did you use that color for diffs' to 'this wording is confusing', whatever you think is useful.
It's there for me; I just checked by clicking on the above link. (I also tried going directly from the analytics index page to see if the link is different and broken there, but it worked for me from there too.)
Thu, Aug 9
- Sorry, I should've been watching closer and caught this sooner, but "dumps.wikimedia.org" is one of the handful of domains on the audited shortlist which don't use our standard cache cluster termination. This means the cache clusters don't service its traffic, so they can't easily rewrite into it. We can perhaps add a backend just for this purpose, though (not move dumps to standard termination, just also have the dumps backend available for rewriting these particular requests into). I'll have to double-check that there are no snags with that idea...
Wed, Aug 8
If you are going to store things on the dump web servers:
@daniel I am about to steal a bunch of your preliminary work and comments in order to craft a workable proposal (see the link in the task description now); for this reason you are listed as a co-author of the RFC. If you would rather not, please say so and I'll just give a ton of credit in the document itself.
Tue, Aug 7
A bug indeed. Here you go: T18036
Mon, Aug 6
It's the labstore boxes you want, either 1006 or 1007 depending, and maybe you just want to make the file available and ask someone to drop it into the right location? And that would likely be someone on the WMCS team.
Does 2.16 have the 'a new changeset has been uploaded by so-and-so' feature for polygerrit, like the current UI does, or is that a later release?
Adding @awight as an interested party (who works on eg the mw vagrant dumps role).
Because MCR content on Commons, and specifically the metadata storage piece, is set to go live on October 1st, and we will likely have barely gotten an RFC out by then if we are lucky, we won't be giving adopters much time to convert their existing utilities to the new schema. So I am initially leaning towards something like this:
Making clear here the correspondence between revisions, slots, content, text, and comparing that to the previous setup with just revisions and text.
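To make that correspondence concrete, a revision element in the new schema might look something like the fragment below. This is an illustrative sketch only, not the actual proposal (which lives in the linked RFC draft); element names and values are placeholders:

```xml
<revision>
  <id>12345</id>
  <timestamp>2018-08-01T00:00:00Z</timestamp>
  <!-- old schema: a single <text> element directly under <revision> -->
  <!-- transitional idea: one <content> block per slot -->
  <content>
    <role>main</role>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text xml:space="preserve">...</text>
  </content>
  <content>
    <role>mediainfo</role>
    <model>wikibase-mediainfo</model>
    <format>application/json</format>
    <text xml:space="preserve">...</text>
  </content>
</revision>
```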
@hoo I am adding you, guessing that you will want to weigh in on the new schema. If this is outside your interest, go ahead and take yourself off.
Some initial comments/questions on the slot_roles and content_models tables:
I'm adding here the tables and fields that need to be part of the dumps, both for export and for import, so everyone is on the same page.
This shouldn't run on the snapshot (dumps-generating) hosts; if it were to run anywhere it would run on the web server. Looping in @Bstorm who is the point person for the labstore boxes (which handle web service) now.
Sat, Aug 4
When I chatted with @Imarlier via Hangout, we talked about running these as one-offs as needed to fix specific issues. I don't know if that is still the plan; perhaps he can weigh in.
Thu, Aug 2
Adding @brion as someone who knows these schemas well (thanks in advance!)
Notes on timing: it is expected that Commons MCR writes (metadata for media) will be happening by Oct 1, so it would be really, really nice to have an RFC approved and code written by then. That's a pretty short time frame given that it's summer vacation right now, but otherwise data about some media won't show up in the dumps.
Existing proposals (which were the occasion for my comments at the link above): https://www.mediawiki.org/wiki/Multi-Content_Revisions/Dumps
I've run the script manually with the above change applied; results are available in the expected location.
Wed, Aug 1
This is now deployed; I'll check tomorrow that the dailies ran ok, and we'll know about the fulls over the weekend.
Mon, Jul 30
Google gets updates from us more than once a day; I don't know how their update pipeline works, but they certainly have or could have access to the data more or less live. We should talk to them and find out where the problem is.
This is deployed.
The mail arrived and looks fine, closing this.
Wed, Jul 25
We've never had requests for specific tables like these; the ep tables aren't dumped as part of the regular dumps either.
Mon, Jul 23
Sun, Jul 22
This should happen after the current run is completed.
This is deployed but I'll wait to close it until we see the first email arrive in a few days.
Fri, Jul 20
I have sent mail about adding hewiki to the 'big wikis' list for processing; this is set to happen for the August 1st run.
What's the status on this? Anything needed to get it moving?
What's the status on this? Anything needed to get it moving?
I'm going to go ahead and close this; if it's observed again and it's not the result of too many connections, it can be re-opened.
A dump run has completed since this was deployed, and it works fine. Closing.
Emailed report showed up with all the info I need. Closing.
Thu, Jul 19
Well, 'next time' turned out to be 5 minutes later; I was too twitchy to leave it for tomorrow. The run should happen tomorrow morning, so I'll check the results then.
I'll get a report on the longer running jobs for enwiki, wikidatawiki and the 'big wikis' for now. This should run tomorrow before the 20th dump jobs kick off.
I should add one more job that gives me the slowest 40, say, page-meta-history bz2 content dumps on all wikis, so i can track those. Next time.
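Selecting the slowest N jobs is simple enough to sketch; the data shape and names below are made up for illustration, not taken from the actual report job:

```python
# Sketch: pick the N longest-running dump jobs from per-wiki runtimes.
# The (wiki, seconds) tuple shape is a hypothetical example.

def slowest(jobs, n):
    """jobs: iterable of (wiki, seconds) pairs; return the n slowest."""
    return sorted(jobs, key=lambda job: job[1], reverse=True)[:n]

runtimes = [("enwiki", 86400), ("frwiki", 3600), ("dewiki", 7200)]
print(slowest(runtimes, 2))
```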
Given that the hosting setup for this service is different now, this might as well be closed. If folks notice problems in the future they can create a new task.
The dblist fix has been deployed, off to test the actual bash script now. Until now it's all been manual runs across the dblist with direct calls to the maintenance script.
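The manual runs had roughly this shape: loop over the dblist and invoke the maintenance script once per wiki. The script invocation and file names below are placeholders, not the real ones:

```shell
#!/bin/sh
# Sketch of looping over a dblist and running a maintenance script per wiki.

run_for_all() {
    # read one wiki name per line from the dblist passed as $1
    while read -r wiki; do
        # stand-in for the real maintenance script invocation
        echo "would run maintenance script for $wiki"
    done < "$1"
}

# tiny demo dblist for illustration
printf 'enwiki\nfrwiki\n' > demo.dblist
run_for_all demo.dblist
```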