We dump a list of media filenames (namespace 6) for each wiki every day. These files reside here: https://dumps.wikimedia.org/other/mediatitles/
Right, the sha1 column is indexed then? I hadn't bothered to check that. We dump the image table in any case so that would be available a couple of times a month, not exactly matching the machine vision dumps but close enough.
Tue, Dec 10
Sigh... no. Adding it to my todo list.
Thanks for these updates. The sizes look quite reasonable, even allowing for unexpected growth.
wmf.8 is now everywhere, and this branch has the patch in it, so I can close this task. Thanks for the fix!
Mon, Dec 9
What do we think about the pile of these in the log:
Okay, I have had a chat with one of the core platform folks about the future of RESTBase. TL;DR: it's going away next quarter (Jan-Mar 2020)! It will be replaced by some caching service or other, TBD. I've subscribed to the appropriate ticket (T239743 if no better one comes along), and we'll see what the plans are and whether easy bulk internal access can be negotiated in. Cassandra itself does not lend itself to such things. It's also not even clear if prerendering on edit will happen in all cases; for example, bots may not need a text preview and may not request the rendered text after edit, so skipping prerendering in those cases might save load on the servers.
I'm going to merge this into another task for dumps of HTML produced from expanded wikitext.
I just want to raise this so it's on folks' radar: it would be nice if whatever caching mechanism is introduced could easily have the HTML for current page revisions dumped in bulk, preferably on a per-wiki basis. If that turns out not to be feasible because of the design, that's understandable, but if it turns out not to be a big deal, it would be handy for providing HTML dumps of content, particularly for the large wikis.
Repeating here some things from a short chat on irc:
Sun, Dec 8
New adds-changes dumps are being produced after this patch https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/555732/ was deployed, so I can close this now.
Sat, Dec 7
Rebooted and all is well. Sometime on Sunday I'll enable locking again on the adds-changes dumps and check Monday that they still run properly.
Fri, Dec 6
Weird, I see nothing in the changelog that looks likely: https://salsa.debian.org/debian/bzip2/blob/master/debian/changelog
The dumps referenced above are missing some content and have been a bit fidgety to maintain. For more on that, see T236507.
Thu, Dec 5
No, there is a ticket for dumping parsed wikitext as it is stored in RESTBase, but that's not a full page view with skin etc.
@Nemo_bis thanks for testing! Can you ldd the pbzip2 on both boxes and tell me if there's a difference between the one that succeeds and the one that fails?
Wed, Dec 4
I've emailed the xmldatadumps-l list asking for testers. See https://lists.wikimedia.org/pipermail/xmldatadumps-l/2019-December/001510.html
We don't try to render anything, not for the wikidata entity dumps nor the xml dumps. So I don't have any good ideas about that.
Do either of you have any of the queries that were run? And which host(s) were they from? If this is new and different from the issue patched above, I should look into it.
After a brief discussion on irc, there are a couple of suggestions for updating the content of Special:UnusedFiles (which could then be used via the api, we hope):
To get the list of images not used, we could:
- collect all image names from the imagelinks table (column 'il_to')
- normalize those image names
- for each image in the image table (column 'img_name'), normalize the name and see if it's in the above list; if not, add it to a potential purge list (rough sketch below)
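A rough sketch of that comparison in Python, against a hypothetical DB-API connection; the normalize() helper here is a simplified stand-in for MediaWiki's actual title normalization:

def normalize(name):
    # Simplified stand-in: MediaWiki canonicalizes file titles with
    # underscores and an uppercase first letter.
    name = name.replace(' ', '_')
    return name[:1].upper() + name[1:]

def unused_images(conn):
    cur = conn.cursor()
    cur.execute("SELECT il_to FROM imagelinks")
    used = {normalize(row[0]) for row in cur.fetchall()}
    cur.execute("SELECT img_name FROM image")
    for (img_name,) in cur.fetchall():
        if normalize(img_name) not in used:
            yield img_name  # candidate for the purge list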
We need the following:
Tue, Dec 3
Line length needs to be tweaked to conform with our puppet settings for flake8, I guess.
Mon, Dec 2
Running now and doing the right thing. Closing.
Let me add @Bstorm to make sure she knows I've volunteered us to host a copy and to make sure that there's 7T spare around, since that's more than I expected.
The above is now live. Need to do another reboot test when the misc crons are done, so that will be Saturday again. If it pans out, I'll add the locking back in to the adds-changes dumps then too.
Data verified to be going out to labstore1006. Closing this miserable excuse for a ticket. And kicking myself for not remembering basic bash array syntax, yet again.
https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/1428486 shows that setting NEED_STATD=yes is guaranteed not to work. Explicit enable needed, patch coming.
We need some timing tests on these: is there a happy medium between 'best settings for compression' and 'best settings for speed'? What are we looking at in terms of execution time and space, if we add this step? We'd continue to provide bz2s, I guess, since those are handy for processing into Hadoop, being well-suited to parallel processing.
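Something like this quick harness would give us numbers for the time/space trade-off (a sketch only; the sample filename is made up):

import bz2
import time

def time_bz2_levels(path):
    data = open(path, 'rb').read()
    for level in (1, 5, 9):
        start = time.time()
        out = bz2.compress(data, compresslevel=level)
        print('level %d: %.1fs, %d bytes' % (level, time.time() - start, len(out)))

time_bz2_levels('sample-stub-chunk.xml')  # made-up sample file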
Let's see what happens once this is in production.
This is great news! We would be happy to link to it and host a copy once it's ready to be announced. What is the cumulative size of the files for download?
Sat, Nov 30
Reboot done and rpc.statd did not start, so I have again restarted it manually. I'll leave things as they are for the weekend and see what's needed on Monday, since it's not urgent. Probably I will have to explicitly enable and start the service in puppet.
Fri, Nov 29
This is so done. So very very done. Thanks everyone!
Thu, Nov 28
I started rpc.statd manually on the other two dumpsdata servers as well.
For the record and for my future self, this issue manifested as failure to get locks over nfs from a client:
fcntl.lockf(fhandle, fcntl.LOCK_EX | fcntl.LOCK_NB)
OSError: [Errno 37] No locks available
Both server and client are using nfs v3.
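For the record, a minimal repro in Python (the path is made up; any file on the affected NFS mount will do):

import fcntl

# Any file on the affected NFS mount; this path is hypothetical.
with open('/mnt/dumpsdata/test.lock', 'w') as fhandle:
    # With rpc.statd down on the server, this raises
    # OSError: [Errno 37] No locks available
    fcntl.lockf(fhandle, fcntl.LOCK_EX | fcntl.LOCK_NB)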
Wed, Nov 27
Thanks for the forwards!
Potentially interesting for Airflow/Argo comparison: https://medium.com/flyr-labs-blog/why-were-switching-off-airflow-sort-of-780c4f58a660
message:belong turns up a number of these for various code paths just within the last 15 minutes; https://logstash.wikimedia.org/goto/40f64ef65a75d7609c391dd00ef5d0bb is an example.
I don't really want to revive this ticket but I do want to know if it's seriously on the roadmap or indefinitely deferred/rejected.
Do we know what queries these clients were running? A first pass through the relevant MediaWiki code doesn't show any good suspects.
Do we have a meeting scheduled to talk about capacity needs?
This is now complete. Nov 20th wikidata abstract files are nice little empty files as expected.
https://lists.wikimedia.org/pipermail/wikitech-l/2019-November/092821.html Email sent to wikitech-l and xmldatadumps-l. @leila would you be willing to forward to the research mailing lists? @hoo are you on the wikidata mailing list and can you forward it there? Thanks in advance :)
Closing, any followup issues can get their own tasks.
Aaaaand dumpsdata1001 is reimaged. All the data is still there, available to snapshot hosts.
I have tested on snapshot1008, which mounts only the buster nfs share, that the dump_lock.py script with multiple instances works as it should; this is the locking mechanism for xml/sql dumps. This means that although the adds-changes dumps locking must still be investigated later, I can go ahead and re-image dumpsdata1001 now that the current xml/sql run has completed.
This looks good for stubs and page content dumps on deployment-prep; the stubs no longer have the tag and the page content dumps still do, which is what we want. Once this is deployed to all the wikis, we can close the task.
Ok, that's new and very undesirable behavior. In the past it was always the case that for xml/sql dumps, connections might remain open to vslow hosts that were then depooled or recategorized, but never to non-vslow hosts... unless, I suppose, no vslow hosts were available. I'll need to see if something has broken or changed in the maintenance scripts or in setting up db connections.
I actually checked, and the last time db1087 (the current vslow) was depooled was like 4 months ago: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/520844/
Now that we don't use gerrit anymore for depooling hosts, it is harder to track if a host got depooled via dbctl, but I also checked SAL and did some phabricator searches and couldn't find any recent depooling of db1087.
It was never a vslow host.
Mon, Nov 25
Sure thing! I'm just not sure of the way forward right now.
Snapshot1006 was running regular wikidata dumps. We don't flush LB config after every query for obvious reasons, though page content fetchers should fail and restart with a new config if the connected db server becomes unavailable. Unavailable in this case means it doesn't serve queries and connections are terminated. I don't think there's a facility in MediaWiki to fail a connection if the host has been depooled in etcd/LB config but continues to respond to queries.
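In sketch form (Python, illustration only, not MediaWiki's actual code; the names here are hypothetical):

# The fetcher only picks up a fresh LB config when the connection
# actually dies; a host depooled in etcd that still answers queries
# never hits the except branch, so the fetcher keeps using it.
def fetch_page_content(page_id, load_lb_config):
    while True:
        conn = load_lb_config().get_replica_connection()  # hypothetical API
        try:
            return conn.fetch(page_id)
        except ConnectionError:
            continue  # connection terminated: retry with a new config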
I'm going to send an email announcement to wikitech and xmldatadumps-l. Someone on the research and wikidata lists should forward the announcement there. Adding the relevant projects (sorry if they aren't right, please feel free to move this around where it belongs).
Sun, Nov 24
Adds-changes dumps did not run properly; when I checked this afternoon, the Nov 23 job was hung indefinitely trying to get a lockfile on the first wiki to be processed (abwiki). I watched snapshot1008 attempt to connect to dumpsdata1002 for (some) nfs request and then try dumpsdata1003 when that failed (!). I rebooted snapshot1008, which no longer does this. Some port was still advertised wrongly on dumpsdata1002, it seems; a reboot took care of that.
Sat, Nov 23
See T238972 for the production dumps switchover ticket.
And some of them are already on labstore1006, so rsyncs are working as expected.
snapshot1008 now uses dumpsdata1002 as its nfs server. I had to manually systemctl stop nfs-mountd.service and start it again for dumpsdata1002 to pick up the values (and especially the port setting) in /etc/default/nfs-kernel-server, so that's poor. Other than that, no problems with puppet's unmounting and remounting of the share.
Fri, Nov 22
Given that the wikidata entity dumps are still finishing up the truthy gz files, and after that there will be bz2 recompression and the Lexemes, I'll be making the switchover tomorrow morning or mid-day EET.
https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/473222/ which I have not looked at for even a second (sorry!)
Ah, right - that would allow people to try out 0.11 in Special:Export before we make it the default. It doesn't prevent us from generating dumps in 0.11.
The big question is - do we need to provide both for a while, so people have time to adjust to 0.11? It's technically a breaking change.
That's right, this is an answer to the question "Is that structured data being dumped elsewhere on its own" (like the wikidata entity dumps).
The patchset for tonight/tomorrow, moving misc cron storage to dumpsdata1002, is ready to go.
Thu, Nov 21
I would say we should add size and not just the number of rows. There's a big refactor of the revision table being deployed that will free up lots of space, and that's what matters.