Did we get signoff from @Legoktm? I'm guessing yes but I didn't see it here or on the changesets.
MediaWiki Flow history dumps were interrupted and are re-running, so while they will finish up right before the new run, the Wikidata weeklies have already started. Saturday the 31st, maybe, for the last two reboots.
Does anyone know where the schema for these xml files lives? I've grepped around in mw core and in the abstract extension repos and found nothing.
Well, I"m not going to make the 20th deadline. I need to do a pile of timing tests with this code to see if it's faster than the horrid head/tail pipeline I have been using until now, for combining files. There's a bunch of variants needed: do the compression in the same C program? Pipe it, does it make much difference? Buffer size? Etc. When I have those results we'll have a decision about the way forward.
Wed, Mar 14
Tue, Mar 13
Actually, is this any different from having 'deleted="deleted"' as the attribute when a revision, contributor or comment is no longer available? AFAIK that's not a standard attribute or anything, it's just in our schema. Which reminds me, the change above needs to go into an updated schema too if we agree on it.
If you plan to implement daily dumps, I would strongly encourage you to do them as diffs, assuming that these will be much more efficient than full dumps every day.
Well, on wikidatawiki in beta, the new code generates a whole lot of <abstract not-applicable="" /> as we expect; on other wikis it produces the usual output. So that looks good.
Now trying to find out about standard xml libraries.
Yes, these are there because there is no equivalent any more; the abstracts dumps are now gz compressed, the page content files were split along a different page range this month because that's how the timing fell out, etc.
@Andrew, I think the issue with thumbs is that you need to have a thumb directory inside wherever you are storing the images, and it needs to be writable by the web server. Try that and see how it goes.
If PHP error logging is enabled (I couldn't find where it was), you should be able to see some complaints that would point you to that, I think. There are probably similar requirements for other subdirectories in /srv/mediawiki/images, like temp, timeline, and maybe the favicon path too.
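As a rough illustration, something like this would show which of those subdirectories the web server can't write to; the paths and the www-data username are assumptions about a typical install, not taken from the actual setup:

```python
#!/usr/bin/env python3
"""Quick check that the web server user can write to the image subdirectories.

The image root path and the www-data username are assumptions; adjust them
for your install. Group permissions are ignored for brevity.
"""
import os
import pwd

IMAGE_ROOT = "/srv/mediawiki/images"
SUBDIRS = ["thumb", "temp", "timeline"]

def check_writable(path, user="www-data"):
    """Report whether the given user appears able to write to path."""
    if not os.path.isdir(path):
        print(f"{path}: missing")
        return
    st = os.stat(path)
    uid = pwd.getpwnam(user).pw_uid
    owner_ok = st.st_uid == uid and st.st_mode & 0o200
    world_ok = st.st_mode & 0o002
    status = "writable" if owner_ok or world_ok else "NOT writable"
    print(f"{path}: {status} by {user}")

if __name__ == "__main__":
    for sub in SUBDIRS:
        check_writable(os.path.join(IMAGE_ROOT, sub))
```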
Welp. I have tried a few times, and while I am not at all confident about the es setup on beta, I am confident that this is a legit attempt to retrieve something from es when it oughtn't.
```
#0 /srv/mediawiki/php-master/extensions/Flow/includes/Model/AbstractRevision.php(413): Flow\Data\Storage\RevisionStorage::loadExternalContentIntoRevision(Flow\Model\Header, string)
#1 /srv/mediawiki/php-master/extensions/Flow/includes/Model/AbstractRevision.php(249): Flow\Model\AbstractRevision->getLoadedContent()
#2 /srv/mediawiki/php-master/extensions/Flow/includes/Dump/Exporter.php(380): Flow\Model\AbstractRevision::toStorageRow(Flow\Model\Header)
#3 /srv/mediawiki/php-master/extensions/Flow/includes/Dump/Exporter.php(368): Flow\Dump\Exporter->formatRevision(Flow\Model\Header)
#4 /srv/mediawiki/php-master/extensions/Flow/includes/Dump/Exporter.php(247): Flow\Dump\Exporter->formatRevisions(Flow\Model\Header)
#5 /srv/mediawiki/php-master/extensions/Flow/includes/Dump/Exporter.php(200): Flow\Dump\Exporter->formatHeader(Flow\Model\Header)
#6 /srv/mediawiki/php-master/extensions/Flow/includes/Dump/Exporter.php(182): Flow\Dump\Exporter->formatWorkflow(Flow\Model\Workflow, Flow\Search\Iterators\HeaderIterator, Flow\Search\Iterators\TopicIterator)
#7 /srv/mediawiki/php-master/extensions/Flow/maintenance/dumpBackup.php(89): Flow\Dump\Exporter->dump(BatchRowIterator)
#8 /srv/mediawiki/php-master/extensions/Flow/maintenance/dumpBackup.php(62): FlowDumpBackup->dump(integer, integer)
#9 /srv/mediawiki/php-master/maintenance/doMaintenance.php(94): FlowDumpBackup->execute()
#10 /srv/mediawiki/php-master/extensions/Flow/maintenance/dumpBackup.php(133): require_once(string)
#11 /srv/mediawiki/multiversion/MWScript.php(100): require_once(string)
```
So it looks like $revision->toStorageRow is doing more than it should. In there, I see 'rev_content' => $obj->getLoadedContent(), which shouldn't load content every time, or we'll never be able to get just the metadata.
Mon, Mar 12
That's because all the thumbs stuff isn't in that directory. You might want to symlink the 0-9a-f directories into the dated ones each time or something, assuming that the run is good (I gotta mark that it is, it's in the todos at the top of the file).
Thank you! I'm going to close this for now, and the ticket to watch is T181936.
I tried testing on a locally patched Flow + mw-core on snapshot01 beta, and I got an attempt to load from ext store but it's inconclusive, especially because it generated an exception at the crucial point instead of actually retrieving the content. I need to look at it further, might be a problem with the testing setup. I will get back to this tomorrow and let you know one way or the other.
Can we defer this ticket until we can move crons such as this to their own host? We should then be able to increase the number of shards and get the job to finish up a day sooner (famous last words), meaning no one's weekends would be ruined and you'd get to import on Thursday.
I'd like to reach a conclusion on this one way or another, either accept or decline. My inclination towards this now is decline but I'm still open to discussion if you have some suggestions re my previous comment.
Have a look for example at dumpTextPass.php in core/maintenance; it's got a rotateDb function that it calls to clean up and try to get a new connection when things go bad.
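The general pattern, sketched here in Python rather than the actual PHP (all the names below are illustrative, not from the MediaWiki code): on failure, throw away the connection, get a fresh one, and retry.

```python
"""Generic illustration of the 'rotate the DB connection and retry' pattern;
connect_fn and fetch_fn stand in for whatever driver and query you use."""
import time

def fetch_with_rotation(connect_fn, fetch_fn, max_retries=3, wait=5):
    """Try a fetch; on failure, drop the connection, open a new one, retry."""
    conn = connect_fn()
    for attempt in range(max_retries):
        try:
            return fetch_fn(conn)
        except Exception:
            # the connection may be stale or the server went away: rotate it
            try:
                conn.close()
            except Exception:
                pass
            if attempt == max_retries - 1:
                raise
            time.sleep(wait)
            conn = connect_fn()
```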
I've updated it according to your second suggestion (untested though). I prefer to have empty abstract tags in there rather than skip them completely. The file ought to compress down to something pretty tiny at least!
Did not mean to close this yet!
I did not get this deployed until after the start of the March run, though it was ready beforehand. We will have to wait for the March 20th run to verify that the fix works. My apologies!
Are there any further issues? If not, I'm going to close this ticket in a few days.
The sample image dump script is in /root on wikitech-static, along with a config file that works ok.
Sun, Mar 11
My Ruby knowledge is next to nothing, so we'd have to make sure one of the co-mentors has that background. Except for that, as long as the bot performs well and isn't a resource hog, it doesn't matter what language it's written in.
Sat, Mar 10
I ask because there are things like infoboxes, which get turned into tables; making those into plain text seems not so useful to me. And you might want to preserve some formatting, e.g. <p> markers turning into blank lines, <br /> into carriage returns, and so on. How exactly that would be done is a task for the plaintext consumer to implement, as they will know their needs best.
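For example, a minimal sketch of one plausible plaintext conversion, keeping <p> as blank lines and <br /> as newlines and dropping tables; the tag handling here is just illustrative, and a real consumer would tune it to their own needs:

```python
"""Minimal HTML-to-plaintext sketch: <p> becomes a blank line, <br/> a newline,
and tables (e.g. rendered infoboxes) are dropped entirely."""
from html.parser import HTMLParser

class PlainTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0   # depth inside tables we want to drop

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.skip_depth += 1
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "table" and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    def text(self):
        return "".join(self.parts).strip()

extractor = PlainTextExtractor()
extractor.feed("<p>First paragraph.</p><table><tr><td>infobox junk</td></tr></table>"
               "<p>Second<br/>line.</p>")
print(extractor.text())
```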
Fri, Mar 9
Is HTML what's asked for on this ticket, a full html expansion of the wikitext?
Test image dump script seems to be ok on labweb1001:
ariel@labweb1001:~$ sudo -u www-data python ./get_images.py --verbose --configfile dump_images.conf.labweb1001 --wiki labswiki
Ran to completion, producing a bunch of files in /mnt/dumpsdata/xmldatadumps/public/images/labswiki/20180309
Thu, Mar 8
@awight You should be able to do more testing on the mw vagrant role now. The patch is available in master.
Wed, Mar 7
Welp. Here's some draft announcements, edit the hell out of 'em. P6813
Tue, Mar 6
As far as dumping the images back out, I'm dusting off some old scripts that used to be used for media tarballs back in the day, and repurposing the code. There might be something more elegant or ready-made that's been written in the meantime. If no one intervenes soon, I'll get a draft of a script together for testing.
I guess this is /srv/dumps/public on labstore1003? It's fine now. Odds are you tried to access it in the middle of an rsync; rsyncs copy file trees first and set permissions afterwards. If you mean some other host, please note the details here. Thanks!
Mon, Mar 5
In the meantime I should see if we can get more speed out of a dedicated C utility instead of a pipeline of bzcat | head | tail for each of these files in the recombine step. I'll need this for the multistream dumps regardless.
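For reference, here's roughly what that pipeline does for each piece, sketched in Python; the header/footer detection is naive line matching and just for illustration, not the real implementation:

```python
"""Stream-decompress each bz2 piece, drop the <mediawiki>/<siteinfo> header from
all but the first piece and the </mediawiki> footer from all but the last, and
write a single combined XML stream."""
import bz2
import sys

def recombine(piece_paths, outfile=sys.stdout):
    last = len(piece_paths) - 1
    for i, path in enumerate(piece_paths):
        in_header = i > 0          # skip header lines on all but the first piece
        with bz2.open(path, "rt", encoding="utf-8") as piece:
            for line in piece:
                if in_header:
                    # the header ends with the closing </siteinfo> tag
                    if "</siteinfo>" in line:
                        in_header = False
                    continue
                if i != last and line.strip() == "</mediawiki>":
                    continue       # drop the footer except on the last piece
                outfile.write(line)

if __name__ == "__main__":
    recombine(sys.argv[1:])
```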
Do you specify the config file anywhere? And I take it the wiki name is actually 'wiki' in the database?
The current changeset would check the mount point at every puppet run, i.e. every 30 minutes, it seems. Is that often enough?
I wonder what other jobs use that mountpoint; it might be nice to find out.
It appears to take over two days for the recombine of the pages-articles files, and over two days for the recombine of the meta-current files. I'm going to skip this step when the combined size of the files to recombine is 20 GB or more, configurable.
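Something along these lines, with the names being illustrative rather than the actual dump code:

```python
"""Sketch of the planned size check: skip the recombine step when the pieces
together meet or exceed a configurable threshold."""
import os

RECOMBINE_MAX_BYTES = 20 * 1024 ** 3   # 20 GB, configurable

def should_recombine(piece_paths, max_bytes=RECOMBINE_MAX_BYTES):
    """Return True only if the pieces are small enough to be worth recombining."""
    total = sum(os.path.getsize(p) for p in piece_paths)
    return total < max_bytes

# e.g. should_recombine(["pages-articles1.xml.bz2", "pages-articles2.xml.bz2"])
```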
I'm tempted to just turn off abstracts for Wikidata altogether, since every item in there is a Qxxx with junk for the abstract. But your approach is better, in case similar content creeps into other projects. What do you think about https://gerrit.wikimedia.org/r/#/c/416409/, as opposed to somehow checking for TextContent and WikitextContent (which requires having the content to hand)?
Time to close this ticket. At this point we have: labstore boxes coming on line soon, dumpsdata hosts deployed months ago, snapshot testbed refresh is at least on the radar with a ticket, and snapshot cron job host is in the request queue. Mid-term plans may look very different, depending on what dumps look like by then.
Hey @RobH, what are next steps on this?
Sat, Mar 3
I was active at en wikt during some of those deletions, so I have an opinion.
@awight According to T185116 utfnormal is gone, so don't bother with that.
You might be able to copy the feed and report templates in from the dumps clone instead of having separate files; I'm assuming no one cares too much about css niceties here.
If you are dumping a local wiki then it had better be running, and if it's running, then the maintenance script to get the db params and other info should 'just work' if configuration is set up properly. Can you give an example of the crash?
Fri, Mar 2
Additional content notes. In the case that a page consisting of a MW redirect points to a non-existent page, it will be omitted. That's easiest and wouldn't really be a loss of content.
https://gerrit.wikimedia.org/r/#/c/394977/ is running fine on snapshot01 (latest patch version, i.e. 29). Besides any other changes reviewers want to propose there, we also need the ICU changes and the hhvm build with the memcached upstream patch.
- "misc" dumps are everything except xml/sql dumps, run off of the 'misc dumps cron' snapshot host. they should already be available on labstore6,7.
- "TBD": cron jobs from Ezachte's home directory on stat1005, need to check specifically what gets synced and when
- "puppet": these html files are managed by puppet.
- "dumps::web::fetches::stats": pulls via rsync from stat1005.eqiad.wmnet::hdfs-archive
- "dumps::web::fetches::kiwix": pulls via rsync from download.kiwix.org
- "dumps::web::fetches::wikitech_dumps": pulls via wget from https://wikitech.wikimedia.org/dumps/
- "profile::phabricator::main": rsync from phab1001
- "role::logging::mediawiki::udp2log": rsync from mwlog1001.eqiad.wmnet
- "-" indicates there's no job, as these folders are not updated.
Thu, Mar 1
Wed, Feb 28
From chat on irc: @Catrope tells me it will likely be about two weeks before he's able to have a look at this again.
Some more design decisions made as we go along (see updated patchset):
snapshot1007 and dumpsdata1001 can't be rebooted right now due to the weekly wikidata nt dumps. And tomorrow the full xml/sql run starts, so we'll be looking at March 18th or 19th for a reboot.
Tue, Feb 27
Rsync is working, files have shown up. Closing this ticket.
OK, I take it back. The only ones for which the files were not updated are wiki projects which have been deleted in the meantime. What's happening instead is that the rsyncs aren't copying over these links from the dumps generation hosts to the web server. I'll check on that.
This is still a problem after the Feb 20th run. There must be an additional issue, having a look.
All of the above wikis completed their run, closing.
The missing hashes now appear in dumpstatus.json as they should, verified for the Feb 20 run. Closing.
This is now deployed and will be in effect for the next run starting March 1st. Leaving this ticket open until we see that the stubs and page content files contain the expected pages.
Fri, Feb 23
This seems fine to me provided it can be handled on the labstore boxes without requiring any changes on the dumpsdata servers. Also we might think about what a 'good' directory structure would look like, and how we could phase out such symlinks over time.
Very little code at all was written on the last incarnation of this project. If a student is interested this round, they should start out on toolforge, meaning that much of last year's configuration work would be irrelevant. The json file containing a partial mapping between emojis and images could be re-used and expanded, though I would encourage the student to explore other avenues for producing such a mapping, so that it is maintained after the project's end. And the generation of an image plus copyright info, or alternatively work on Twittercards on commons, would start from the beginning, nothing's been done on that front.
Wed, Feb 21
In the meantime I am rethinking the way these dumps ought to go. This is an alternative approach, still nascent: https://gerrit.wikimedia.org/r/#/c/413212/ I'll copy over some notes onto this ticket later about this. Meh, I'll copy them now:
Feb 15 2018
I believe this is a side-effect of T185454 and fixed with the commit mentioned there. Local testing with the latest deployed code does not show this issue. We won't know for sure until the next run on the 20th so I'll keep this ticket open.
We have to get this into the budget plan by tomorrow, so I'm going to request: