
ArielGlenn (ariel)
User

User Details

User Since
Oct 8 2014, 7:09 PM (270 w, 16 h)
Availability
Available
IRC Nick
apergos
LDAP User
ArielGlenn
MediaWiki User
ArielGlenn [ Global Accounts ]

Recent Activity

Today

ArielGlenn added a comment to T240520: Produce dumps of commons thumbnail URLs.

We dump a list of media filenames (namespace 6) for each wiki every day. These files reside here: https://dumps.wikimedia.org/other/mediatitles/

Thu, Dec 12, 8:34 AM · Dumps-Generation, Internet-Archive, Datasets-Archiving

Yesterday

ArielGlenn added a comment to T236431: Figure out data dumps for the MachineVision extension.

Right, the sha1 column is indexed then? I hadn't bothered to check that. We dump the image table in any case so that would be available a couple of times a month, not exactly matching the machine vision dumps but close enough.

Wed, Dec 11, 3:54 PM · Product-Infrastructure-Team-Backlog (Kanban), Dumps-Generation, Machine vision
ArielGlenn added a comment to T226093: Capacity planning for Commons Structured Data.

generated via https://github.com/apergos/misc-wmf-crap/blob/master/sdc-growth/get_slot_growth.py a quickie one-off script.

Wed, Dec 11, 3:48 PM · Structured-Data-Backlog (Current Work), Dumps-Generation, Operations, Wikidata, SDC General

Tue, Dec 10

ArielGlenn added a comment to T231866: Circular dependency when creating service! ContentLanguage.

Sigh.. no. adding to my todo list.

Tue, Dec 10, 5:24 PM · MW-1.34-notes, MW-1.35-notes (1.35.0-wmf.10; 2019-12-10), Performance-Team (Radar), Language-Team, MW-1.34-release, Core Platform Team Workboards (Clinic Duty Team), MediaWiki-ResourceLoader, MediaWiki-extensions-Gadgets, MediaWiki-ServiceContainer
ArielGlenn added a comment to T236431: Figure out data dumps for the MachineVision extension.

Thanks for these updates. The sizes look quite reasonable, even allowing for unexpected growth.

Tue, Dec 10, 5:17 PM · Product-Infrastructure-Team-Backlog (Kanban), Dumps-Generation, Machine vision
ArielGlenn closed T228763: stubs are produced with xml:space="preserve" in the text tag; this is new behavior for the July 20th run of the xml/sql dumps as Resolved.

wmf.8 is now everywhere, and this branch has the patch in it, so I can close this task. Thanks for the fix!

Tue, Dec 10, 9:21 AM · MW-1.35-notes (1.35.0-wmf.8; 2019-11-26), CPT Initiatives (MCR), Core Platform Team Workboards (Clinic Duty Team), Dumps-Generation

Mon, Dec 9

ArielGlenn added a comment to T239894: Dispatching broken on beta - Fatal error: Class 'Memcached' not found in ObjectCache.php on line 186.

\o/ awesome!

Mon, Dec 9, 9:41 PM · User-Ladsgroup, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata, MediaWiki-Cache, Beta-Cluster-reproducible
ArielGlenn added a comment to T239894: Dispatching broken on beta - Fatal error: Class 'Memcached' not found in ObjectCache.php on line 186.

What do we think about the pile of these in the log:

Mon, Dec 9, 9:32 PM · User-Ladsgroup, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata, MediaWiki-Cache, Beta-Cluster-reproducible
ArielGlenn added a comment to T182351: Make HTML dumps available.

Okay, I have had a chat with one of the core platform folks about the future of RESTBase. TL;DR: it's going away next quarter (Jan-Mar 2020)! It will be replaced by some caching service or other, TBD. I've subscribed to the appropriate ticket (T239743 if no better one comes along), and we'll see what the plans are and whether easy bulk internal access can be negotiated in. Cassandra itself does not lend itself to such things. It's also not even clear if prerendering on edit will happen in all cases; for example, bots may not need a text preview and may not request the rendered text after edit, so skipping prerendering in these cases might save load on the servers.

Mon, Dec 9, 12:13 PM · Research-Backlog, Datasets-Archiving, Analytics
ArielGlenn merged T156581: Produce dumps of parsed page content into T133547: set up automated HTML (restbase) dumps on francium.
Mon, Dec 9, 12:04 PM · Core Platform Team Legacy (Watching / External), Services (watching), Patch-For-Review, Datasets-Archiving, Dumps-Generation
ArielGlenn merged task T156581: Produce dumps of parsed page content into T133547: set up automated HTML (restbase) dumps on francium.
Mon, Dec 9, 12:04 PM · Dumps-Generation
ArielGlenn added a comment to T156581: Produce dumps of parsed page content.

I'm going to merge this into another task for dumps of HTML produced from expanded wikitext.

Mon, Dec 9, 12:03 PM · Dumps-Generation
ArielGlenn added a comment to T239743: Provide cached access to Parsoid PHP within core.

I just want to raise this so it's on folks' radar: it would be nice if whatever caching mechanism is introduced could easily have the HTML for current page revisions dumped in bulk, preferably on a per-wiki basis. If that turns out not to be feasible because of the design, that's understandable, but if it turns out not to be a big deal, it would be handy for providing HTML dumps of content, particularly for the large wikis.

Mon, Dec 9, 11:05 AM · User-ArielGlenn, Parsoid-PHP, Core Platform Team Workboards (Green), MediaWiki-REST-API, CPT Initiatives (Core REST API in PHP)
ArielGlenn added a comment to T222349: Do not rate limit dumps from internal network.

Repeating here some things from a short chat in irc:

Mon, Dec 9, 8:07 AM · cloud-services-team (Kanban), Data-Services, Patch-For-Review, Wikidata, Wikidata-Query-Service, Discovery-Search, Operations
ArielGlenn moved T239866: Investigate use of bz2 decompression tools on multistream files from Backlog to Active on the Dumps-Generation board.
Mon, Dec 9, 8:01 AM · Dumps-Generation
ArielGlenn moved T239905: dumpRdf for mediainfo entities loads data from db more often than it needs to from Backlog to Other teams on the Dumps-Generation board.
Mon, Dec 9, 8:01 AM · Structured-Data-Backlog (Current Work), Dumps-Generation, WikibaseMediaInfo, Wikidata-Query-Service, SDC General, Commons, Wikidata

Sun, Dec 8

ArielGlenn closed T239401: make sure that after reboot, rpc.statd starts on dumpsdata1002 as Resolved.

New adds-changes dumps are being produced after this patch https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/555732/ was deployed so I can close this now.

Sun, Dec 8, 10:44 PM · Dumps-Generation
ArielGlenn renamed T239401: make sure that after reboot, rpc.statd starts on dumpsdata1002 from makre sure that after reboot, rpc.statd starts on dumpsdata1002 to make sure that after reboot, rpc.statd starts on dumpsdata1002.
Sun, Dec 8, 10:42 PM · Dumps-Generation

Sat, Dec 7

ArielGlenn added a comment to T239401: make sure that after reboot, rpc.statd starts on dumpsdata1002.

Rebooted and all is well. Sometime on Sunday I'll enable locking again on the adds-changes dumps and check Monday that they still run properly.

Sat, Dec 7, 9:56 AM · Dumps-Generation

Fri, Dec 6

ArielGlenn added a comment to T239866: Investigate use of bz2 decompression tools on multistream files.

Weird, I see nothing in the changelog that looks likely: https://salsa.debian.org/debian/bzip2/blob/master/debian/changelog

Fri, Dec 6, 2:36 PM · Dumps-Generation
ArielGlenn added a comment to T108199: Evaluate Phabricator for its ability to export and import data.

The dumps referenced above are missing some content and have been a bit fidgety to maintain. For more on that, see T236507.

Fri, Dec 6, 5:01 AM · Phabricator

Thu, Dec 5

ArielGlenn added a project to T239743: Provide cached access to Parsoid PHP within core: User-ArielGlenn.
Thu, Dec 5, 7:23 PM · User-ArielGlenn, Parsoid-PHP, Core Platform Team Workboards (Green), MediaWiki-REST-API, CPT Initiatives (Core REST API in PHP)
ArielGlenn added a comment to T182351: Make HTML dumps available.

...

No, there is a ticket for dumping parsed wikitext as it is stored in RESTBase but that's not a full page view with skin etc.

Thu, Dec 5, 3:53 PM · Research-Backlog, Datasets-Archiving, Analytics
ArielGlenn triaged T239897: wmf-auto-reimage errors: failure to downtime (w/ no rename), pytho gc whine as Medium priority.
Thu, Dec 5, 11:02 AM · Operations
ArielGlenn added a comment to T239866: Investigate use of bz2 decompression tools on multistream files.

@Nemo_bis thanks for testing! Can you ldd the pbzip2 on both boxes and tell me if there's a difference between the one that succeeds and the one that fails?

Thu, Dec 5, 7:53 AM · Dumps-Generation

Wed, Dec 4

ArielGlenn added a comment to T239866: Investigate use of bz2 decompression tools on multistream files.

I've emailed to the xmldatadumps-l list asking for testers. See https://lists.wikimedia.org/pipermail/xmldatadumps-l/2019-December/001510.html

Wed, Dec 4, 10:03 PM · Dumps-Generation
ArielGlenn triaged T239866: Investigate use of bz2 decompression tools on multistream files as High priority.
Wed, Dec 4, 9:52 PM · Dumps-Generation
ArielGlenn added a comment to T143870: Some mw snapshot hosts are accessing main db servers.

We don't try to render anything. Not for wikidata entity dumps nor the xml dumps. So I don't have any good ideas about that.

Wed, Dec 4, 8:28 PM · MW-1.35-notes (1.35.0-wmf.10; 2019-12-10), Dumps-Generation, DBA
ArielGlenn added a comment to T143870: Some mw snapshot hosts are accessing main db servers.

Do either of you have any of the queries run? And which host(s) were they from? If this is new and different from the issue patched above, I should look into it.

Wed, Dec 4, 7:10 PM · MW-1.35-notes (1.35.0-wmf.10; 2019-12-10), Dumps-Generation, DBA
ArielGlenn added a comment to T239807: Clean up old images on wikitech-static.

After a brief discussion on irc, there are a couple of suggestions for updating the content of Special:UnusedFiles (which could then be used via the api, we hope):

Wed, Dec 4, 4:27 PM · wikitech.wikimedia.org
ArielGlenn added a comment to T239807: Clean up old images on wikitech-static.

To get the list of images not used, we could:

  • collect all image names from the imagelinks table (column 'il_to')
  • normalize those image names
  • for each image in the image table (column 'img_name'), normalize the name, see if it's in the above list, otherwise output to a potential list to be purged
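The steps above can be sketched roughly as follows. This is only an illustration, not the script that was actually run: it assumes the il_to and img_name values have already been read out of the database into plain Python iterables, and the normalize() rule (spaces to underscores, first letter capitalized) is an assumption about what "normalize" means here. Both function names are hypothetical.

```python
def normalize(name):
    """Assumed normalization: underscores for spaces, capitalized first letter."""
    name = name.strip().replace(" ", "_")
    return name[:1].upper() + name[1:]


def unused_images(il_to_values, img_names):
    """Return image names that never appear as a link target.

    il_to_values: names taken from imagelinks.il_to
    img_names:    names taken from image.img_name
    """
    # Build the set of normalized names that are in use somewhere.
    used = {normalize(name) for name in il_to_values}
    # Anything in the image table that is not in that set is a purge candidate.
    return sorted(name for name in img_names if normalize(name) not in used)


if __name__ == "__main__":
    candidates = unused_images(
        ["foo bar.png", "Baz.jpg"],
        ["Foo_bar.png", "Baz.jpg", "Old_logo.svg"],
    )
    for name in candidates:
        print(name)
```

The output is a list of candidates only; each one would still need a manual check before anything is purged.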
Wed, Dec 4, 1:17 PM · wikitech.wikimedia.org
ArielGlenn added a comment to T239807: Clean up old images on wikitech-static.

We need the following:

Wed, Dec 4, 12:46 PM · wikitech.wikimedia.org
ArielGlenn triaged T239807: Clean up old images on wikitech-static as Medium priority.
Wed, Dec 4, 12:43 PM · wikitech.wikimedia.org

Tue, Dec 3

ArielGlenn added a comment to T182351: Make HTML dumps available.

@tizianopiccardi thanks for the update and great to see that you're there. :) Please make sure you advertise for it on wiki-research-l when the right time arrives.
@ArielGlenn storing a copy on our end sounds good. How can we go about keeping the data refreshed on our end? This is a one-time effort by tizianopiccardi and colleagues. Can we take the code from them and have a schedule for releasing HTML dumps the same way we do XML dumps?

Tue, Dec 3, 8:00 PM · Research-Backlog, Datasets-Archiving, Analytics
ArielGlenn added a comment to T239334: Python3 style guide.

Line length needs to be tweaked to conform with our puppet settings for flake8 I guess.

Tue, Dec 3, 1:08 PM · Patch-For-Review, User-ArielGlenn, User-jbond, Operations, Puppet
ArielGlenn added a comment to T231866: Circular dependency when creating service! ContentLanguage.

Do we know if anything else is outstanding here?
It seems to have fixed the errors I was getting on my dev wiki on a cold load.... @ArielGlenn does it seem fixed for you too? (though it's not in a wmf branch yet - not sure if it's worth backporting)

Tue, Dec 3, 8:43 AM · MW-1.34-notes, MW-1.35-notes (1.35.0-wmf.10; 2019-12-10), Performance-Team (Radar), Language-Team, MW-1.34-release, Core Platform Team Workboards (Clinic Duty Team), MediaWiki-ResourceLoader, MediaWiki-extensions-Gadgets, MediaWiki-ServiceContainer

Mon, Dec 2

ArielGlenn closed T238646: make cirrussearch dumps write into a temp location and move file into real path when complete as Resolved.

Running now and doing the right thing. Closing.
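For reference, the general write-to-temp-then-move pattern named in the task title looks something like the sketch below. This is not the code used for the cirrussearch dumps; write_atomically() and its arguments are hypothetical names, and the sketch assumes the temporary file lives in the same directory as the final path so that the rename stays on one filesystem.

```python
import os
import tempfile


def write_atomically(final_path, chunks):
    """Write chunks (bytes) to a temp file, then move it to final_path."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # Temp file in the same directory, so os.replace() is a same-filesystem rename.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as out:
            for chunk in chunks:
                out.write(chunk)
        # Readers see either no file or the complete file, never a partial one.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the partial temp file if anything went wrong before the move.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Anything that picks up files by their final name, such as an rsync to another host, then never sees a half-written dump.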

Mon, Dec 2, 9:00 PM · Dumps-Generation
ArielGlenn updated subscribers of T182351: Make HTML dumps available.

Let me add @Bstorm to make sure she knows I've volunteered us to host a copy and to make sure that there's 7T spare around, since that's more than I expected.

Mon, Dec 2, 3:49 PM · Research-Backlog, Datasets-Archiving, Analytics
ArielGlenn added a comment to T239401: make sure that after reboot, rpc.statd starts on dumpsdata1002.

The above is now live. Need to do another reboot test when the misc crons are done, so that will be Saturday again. If it pans out, I'll add the locking back in to the adds-changes dumps then too.

Mon, Dec 2, 2:17 PM · Dumps-Generation
ArielGlenn closed T239590: rsyncs not happening from dumpsdata1001 to labstore boxes as Resolved.

Data verified to be going out to labstore1006. Closing this miserable excuse for a ticket. And kicking myself for not remembering basic bash array syntax, yet again.

Mon, Dec 2, 1:38 PM · Dumps-Generation
ArielGlenn added a comment to T239401: make sure that after reboot, rpc.statd starts on dumpsdata1002.

https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/1428486 shows that setting NEED_STATD=yes is guaranteed not to work. Explicit enable needed, patch coming.

Mon, Dec 2, 1:37 PM · Dumps-Generation
ArielGlenn triaged T239590: rsyncs not happening from dumpsdata1001 to labstore boxes as Medium priority.
Mon, Dec 2, 10:40 AM · Dumps-Generation
ArielGlenn renamed T239401: make sure that after reboot, rpc.statd starts on dumpsdata1002 from reboot dumpsdata1002 to makre sure that after reboot, rpc.statd starts on dumpsdata1002.
Mon, Dec 2, 10:36 AM · Dumps-Generation
ArielGlenn moved T239401: make sure that after reboot, rpc.statd starts on dumpsdata1002 from Up Next to Active on the Dumps-Generation board.
Mon, Dec 2, 10:35 AM · Dumps-Generation
ArielGlenn added a comment to T222985: Provide wikidata JSON dumps compressed with zstd .

We need some timing tests on these: is there a happy medium between 'best settings for compression' and 'best settings for speed'? What are we looking at in terms of execution time and space, if we add this step? We'd continue to provide bz2s I guess, since those are handy for processing into Hadoop, being well-suited to parallel processing.

Mon, Dec 2, 10:35 AM · Dumps-Generation, Wikidata
ArielGlenn moved T143870: Some mw snapshot hosts are accessing main db servers from Active to Blocked/Stalled/Waiting for event on the Dumps-Generation board.
Mon, Dec 2, 10:32 AM · MW-1.35-notes (1.35.0-wmf.10; 2019-12-10), Dumps-Generation, DBA
ArielGlenn added a comment to T143870: Some mw snapshot hosts are accessing main db servers.

Let's see what happens once this is in production.

Mon, Dec 2, 10:32 AM · MW-1.35-notes (1.35.0-wmf.10; 2019-12-10), Dumps-Generation, DBA
ArielGlenn added a comment to T182351: Make HTML dumps available.

This is great news! We would be happy to link to it and host a copy once it's ready to be announced. What is the cumulative size of the files for download?

Mon, Dec 2, 9:46 AM · Research-Backlog, Datasets-Archiving, Analytics

Sat, Nov 30

ArielGlenn added a comment to T239401: make sure that after reboot, rpc.statd starts on dumpsdata1002.

Reboot done and rpc.statd did not start, so I have again restarted it manually. I'll leave things as they are for the weekend and see what's needed on Monday, since it's not urgent. Probably I will have to explicitly enable and start the service in puppet.

Sat, Nov 30, 7:06 AM · Dumps-Generation

Fri, Nov 29

ArielGlenn moved T143870: Some mw snapshot hosts are accessing main db servers from Backlog to Active on the Dumps-Generation board.
Fri, Nov 29, 11:11 AM · MW-1.35-notes (1.35.0-wmf.10; 2019-12-10), Dumps-Generation, DBA
ArielGlenn closed T219768: Get a third dumpsdata server, a subtask of T224563: Migrate dumpsdata hosts to Stretch/Buster, as Resolved.
Fri, Nov 29, 11:08 AM · Dumps-Generation, Operations
ArielGlenn closed T219768: Get a third dumpsdata server as Resolved.

This is so done. So very very done. Thanks everyone!

Fri, Nov 29, 11:08 AM · hardware-requests, Operations, Dumps-Generation
ArielGlenn moved T239401: make sure that after reboot, rpc.statd starts on dumpsdata1002 from Backlog to Up Next on the Dumps-Generation board.
Fri, Nov 29, 6:37 AM · Dumps-Generation

Thu, Nov 28

ArielGlenn added a comment to T239401: make sure that after reboot, rpc.statd starts on dumpsdata1002.

I started rpc.statd manually on the other two dumpsdata servers as well.

Thu, Nov 28, 9:06 PM · Dumps-Generation
ArielGlenn added a comment to T239401: make sure that after reboot, rpc.statd starts on dumpsdata1002.

For the record and for my future self, this issue manifested as failure to get locks over nfs from a client:

fcntl.lockf(fhandle, fcntl.LOCK_EX | fcntl.LOCK_NB)
OSError: [Errno 37] No locks available

Both server and client are using nfs v3.
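A minimal sketch of how a client could tell this failure apart from ordinary lock contention, assuming Linux (where errno 37 is ENOLCK). The acquire_lock() helper and its return strings are hypothetical and are not taken from the dumps scripts.

```python
import errno
import fcntl


def acquire_lock(fhandle):
    """Try an exclusive, non-blocking lock; return a short status string."""
    try:
        fcntl.lockf(fhandle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as err:
        if err.errno == errno.ENOLCK:
            # "No locks available": over NFS v3 this can point at the lock
            # services (such as rpc.statd) rather than at a competing process.
            return "no-locks-available"
        if err.errno in (errno.EACCES, errno.EAGAIN):
            # Another process currently holds the lock.
            return "held-by-another-process"
        raise
    return "locked"
```

On a healthy mount the first caller gets "locked"; seeing "no-locks-available" for every attempt is a hint to check the lock services on the server and client, not the other dump jobs.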

Thu, Nov 28, 1:45 PM · Dumps-Generation
ArielGlenn triaged T239401: make sure that after reboot, rpc.statd starts on dumpsdata1002 as Medium priority.
Thu, Nov 28, 11:44 AM · Dumps-Generation

Wed, Nov 27

ArielGlenn added a comment to T238972: switch xml/sql (and adds-changes) dumps to use 0.11 schema with content from multiple slots.

Thanks for the forwards!

Wed, Nov 27, 7:47 PM · User-notice, Wikidata, Research, Dumps-Generation
ArielGlenn added a comment to T237361: Discuss common needs in a job manager/scheduler.

Potentially interesting for Airflow/Argo comparison: https://medium.com/flyr-labs-blog/why-were-switching-off-airflow-sort-of-780c4f58a660

Wed, Nov 27, 6:18 PM · User-ArielGlenn, Dumps-Rewrite
ArielGlenn added a comment to T221763: Page rename (Special:MovePage) can throw InvalidArgumentException: Title does not belong to page ID X but actually belong to Y..

message:belong turns up a number of these for various code paths just within the last 15 minutes. https://logstash.wikimedia.org/goto/40f64ef65a75d7609c391dd00ef5d0bb as an example.

Wed, Nov 27, 5:46 PM · User-ArielGlenn, CPT Initiatives (MCR), MW-1.34-notes (1.34.0-wmf.16; 2019-07-30), MediaWiki-Revision-backend, Core Platform Team Workboards (Clinic Duty Team), Readers-Web-Backlog (Tracking), PageImages, Multi-Content-Revisions (Reactive), Wikimedia-production-error, Regression
ArielGlenn added a comment to T100705: Consider using Cassandra/restbase in place of external store.

I don't really want to revive this ticket but I do want to know if it's seriously on the roadmap or indefinitely deferred/rejected.

Wed, Nov 27, 4:20 PM · Operations, Wikimedia-General-or-Unknown, Availability
ArielGlenn added a comment to T143870: Some mw snapshot hosts are accessing main db servers.

Do we know what queries these clients were running? A first pass through the relevant MediaWiki code doesn't show any good suspects.

Wed, Nov 27, 1:59 PM · MW-1.35-notes (1.35.0-wmf.10; 2019-12-10), Dumps-Generation, DBA
ArielGlenn moved T238646: make cirrussearch dumps write into a temp location and move file into real path when complete from Backlog to Up Next on the Dumps-Generation board.
Wed, Nov 27, 1:55 PM · Dumps-Generation
ArielGlenn moved T228763: stubs are produced with xml:space="preserve" in the text tag; this is new behavior for the July 20th run of the xml/sql dumps from Other teams to Blocked/Stalled/Waiting for event on the Dumps-Generation board.
Wed, Nov 27, 1:55 PM · MW-1.35-notes (1.35.0-wmf.8; 2019-11-26), CPT Initiatives (MCR), Core Platform Team Workboards (Clinic Duty Team), Dumps-Generation
ArielGlenn added a comment to T226093: Capacity planning for Commons Structured Data.

Do we have a meeting scheduled to talk about capacity needs?

Wed, Nov 27, 1:54 PM · Structured-Data-Backlog (Current Work), Dumps-Generation, Operations, Wikidata, SDC General
ArielGlenn closed T236006: consider generating an empty abstract file for wikidata as Resolved.

This is now complete. Nov 20th wikidata abstract files are nice little empty files as expected.

Wed, Nov 27, 1:53 PM · Wikidata, Dumps-Generation
ArielGlenn moved T221917: Create RDF dump of structured data on Commons from Active to Blocked/Stalled/Waiting for event on the Dumps-Generation board.
Wed, Nov 27, 1:52 PM · Dumps-Generation, MW-1.34-notes (1.34.0-wmf.10; 2019-06-18), Patch-For-Review, WikibaseMediaInfo, Wikidata-Query-Service, SDC General, Commons, Wikidata
ArielGlenn moved T230856: RDF dump performance for SDC from Active to Blocked/Stalled/Waiting for event on the Dumps-Generation board.
Wed, Nov 27, 1:52 PM · Structured-Data-Backlog (Current Work), Dumps-Generation, WikibaseMediaInfo, Wikidata-Query-Service, SDC General, Commons, Wikidata
ArielGlenn updated subscribers of T238972: switch xml/sql (and adds-changes) dumps to use 0.11 schema with content from multiple slots.

https://lists.wikimedia.org/pipermail/wikitech-l/2019-November/092821.html Email sent to wikitech-l and xmldatadumps-l. @leila would you be willing to forward to the research mailing lists? @hoo are you on the wikidata mailing list and can you forward it there? Thanks in advance :)

Wed, Nov 27, 1:41 PM · User-notice, Wikidata, Research, Dumps-Generation
ArielGlenn added a project to T221763: Page rename (Special:MovePage) can throw InvalidArgumentException: Title does not belong to page ID X but actually belong to Y.: User-ArielGlenn.
Wed, Nov 27, 1:24 PM · User-ArielGlenn, CPT Initiatives (MCR), MW-1.34-notes (1.34.0-wmf.16; 2019-07-30), MediaWiki-Revision-backend, Core Platform Team Workboards (Clinic Duty Team), Readers-Web-Backlog (Tracking), PageImages, Multi-Content-Revisions (Reactive), Wikimedia-production-error, Regression
ArielGlenn added a project to T239334: Python3 style guide: User-ArielGlenn.
Wed, Nov 27, 1:22 PM · Patch-For-Review, User-ArielGlenn, User-jbond, Operations, Puppet
ArielGlenn updated the task description for T224549: Track remaining jessie systems in production.
Wed, Nov 27, 12:54 PM · Operations
ArielGlenn updated the task description for T224549: Track remaining jessie systems in production.
Wed, Nov 27, 12:54 PM · Operations
ArielGlenn closed T224563: Migrate dumpsdata hosts to Stretch/Buster, a subtask of T224549: Track remaining jessie systems in production, as Resolved.
Wed, Nov 27, 12:44 PM · Operations
ArielGlenn closed T224563: Migrate dumpsdata hosts to Stretch/Buster as Resolved.

Closing, any followup issues can get their own tasks.

Wed, Nov 27, 12:44 PM · Dumps-Generation, Operations
ArielGlenn updated the task description for T224563: Migrate dumpsdata hosts to Stretch/Buster.
Wed, Nov 27, 12:20 PM · Dumps-Generation, Operations
ArielGlenn added a comment to T224563: Migrate dumpsdata hosts to Stretch/Buster.

Aaaaand dumpsdata1001 is reimaged. All the data is still there, available to snapshot hosts.

Wed, Nov 27, 12:20 PM · Dumps-Generation, Operations
ArielGlenn added a comment to T224563: Migrate dumpsdata hosts to Stretch/Buster.

I have tested on snapshot1008, which mounts only the buster nfs share, that the dump_lock.py script with multiple instances works as it should; this is the locking mechanism for xml/sql dumps. This means that although the adds-changes dumps locking must still be investigated later, I can go ahead and re-image dumpsdata1001 now that the current xml/sql run has completed.

Wed, Nov 27, 11:38 AM · Dumps-Generation, Operations
ArielGlenn added a comment to T228763: stubs are produced with xml:space="preserve" in the text tag; this is new behavior for the July 20th run of the xml/sql dumps.

This looks good for stubs and page content dumps on deployment-prep; stubs now do not have the tag and page content dumps still do, which is what we want. Once this is deployed to all the wikis we can close the task.

Wed, Nov 27, 10:39 AM · MW-1.35-notes (1.35.0-wmf.8; 2019-11-26), CPT Initiatives (MCR), Core Platform Team Workboards (Clinic Duty Team), Dumps-Generation
ArielGlenn added a comment to T143870: Some mw snapshot hosts are accessing main db servers.

...

Ok, that's new and very undesirable behavior. In the past it was always the case that for xml/sql dumps, connections might remain open to vslow hosts that were then depooled or recategorized, but never to non-vslow hosts... unless, I suppose, no vslow hosts were available. I'll need to see if something has broken or changed in the maintenance scripts or in setting up db connections.

I actually checked, and the last time db1087 (the current vslow) was depooled was like 4 months ago: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/520844/
Now that we don't use gerrit anymore for depooling hosts, it is harder to track if a host got depooled via dbctl, but I also checked SAL and did some phabricator searches and couldn't find any recent db1087's depooling.

Wed, Nov 27, 10:25 AM · MW-1.35-notes (1.35.0-wmf.10; 2019-12-10), Dumps-Generation, DBA
ArielGlenn added a comment to T143870: Some mw snapshot hosts are accessing main db servers.

...

It was never a vslow host.

Wed, Nov 27, 9:32 AM · MW-1.35-notes (1.35.0-wmf.10; 2019-12-10), Dumps-Generation, DBA
ArielGlenn added a comment to T230856: RDF dump performance for SDC.

I'm not sure if T222497 covers this stuff and, if not, what is actionable here by the structured data team. @ArielGlenn any thoughts?

Wed, Nov 27, 9:29 AM · Structured-Data-Backlog (Current Work), Dumps-Generation, WikibaseMediaInfo, Wikidata-Query-Service, SDC General, Commons, Wikidata

Mon, Nov 25

ArielGlenn added a comment to T143870: Some mw snapshot hosts are accessing main db servers.

Sure thing! I'm just not sure of the way forward right now.

Mon, Nov 25, 11:57 AM · MW-1.35-notes (1.35.0-wmf.10; 2019-12-10), Dumps-Generation, DBA
ArielGlenn added a comment to T143870: Some mw snapshot hosts are accessing main db servers.

Snapshot1006 was running regular wikidata dumps. We don't flush LB config after every query for obvious reasons, though page content fetchers should fail and restart with a new config if the connected db server becomes unavailable. Unavailable in this case means it doesn't serve queries and connections are terminated. I don't think there's a facility in MediaWiki to fail a connection if the host has been depooled in etcd/LB config but continues to respond to queries.

Mon, Nov 25, 11:43 AM · MW-1.35-notes (1.35.0-wmf.10; 2019-12-10), Dumps-Generation, DBA
ArielGlenn added projects to T238972: switch xml/sql (and adds-changes) dumps to use 0.11 schema with content from multiple slots: Research, Wikidata.

I'm going to send an email announcement to wikitech and xmldatadumps-l. Someone on the research and wikidata lists should forward the announcement there. Adding the relevant projects (sorry if they aren't right, please feel free to move this around where it belongs).

Mon, Nov 25, 8:04 AM · User-notice, Wikidata, Research, Dumps-Generation

Sun, Nov 24

ArielGlenn added a comment to T224563: Migrate dumpsdata hosts to Stretch/Buster.

Adds-changes dumps did not run properly; when I checked this afternoon the Nov 23 job was hung indefinitely trying to get a lockfile on the first wiki to be processed (abwiki). I watched snapshot1008 attempt to connect to dumpsdata1002 for (some) nfs request and then try dumpsdata1003 when that failed (!). I rebooted snapshot1008, which no longer does this. Some port was still advertised wrongly on dumpsdata1002 it seems; a reboot took care of that.

Sun, Nov 24, 3:37 PM · Dumps-Generation, Operations

Sat, Nov 23

ArielGlenn added a project to T238959: Make TextPassDumperTest work with 0.11 dump schema: Dumps-Generation.
Sat, Nov 23, 9:15 AM · Patch-For-Review, Core Platform Team Workboards (Clinic Duty Team), CPT Initiatives (MCR Schema Migration), Dumps-Generation, Structured-Data-Backlog, Multi-Content-Revisions (New Features), Structured Data Engineering, Wikidata
ArielGlenn added a comment to T238921: MCR: Include all slots in XML dumps per default.

See T238972 for the production dumps switchover ticket.

Sat, Nov 23, 8:31 AM · CPT Initiatives (MCR Schema Migration), Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review, Dumps-Generation
ArielGlenn triaged T238972: switch xml/sql (and adds-changes) dumps to use 0.11 schema with content from multiple slots as Medium priority.
Sat, Nov 23, 8:29 AM · User-notice, Wikidata, Research, Dumps-Generation
ArielGlenn added a comment to T224563: Migrate dumpsdata hosts to Stretch/Buster.

And some of them are already on labstore1006, so rsyncs are working as expected.

Sat, Nov 23, 8:19 AM · Dumps-Generation, Operations
ArielGlenn added a comment to T224563: Migrate dumpsdata hosts to Stretch/Buster.

snapshot1008 now uses dumpsdata1002 as its nfs server. I had to manually systemctl stop nfs-mountd.service and start it again for dumpsdata1002 to pick up the values (and especially the port setting) in /etc/default/nfs-kernel-server so that's poor. Other than that, no problems with puppet's unmounting and remounting of the share.

Sat, Nov 23, 8:16 AM · Dumps-Generation, Operations

Fri, Nov 22

ArielGlenn added a project to T220525: MCR: Import all slots from XML dumps: User-ArielGlenn.
Fri, Nov 22, 6:24 PM · User-ArielGlenn, Analytics, CPT Initiatives (MCR), Multi-Content-Revisions (New Features)
ArielGlenn added a comment to T224563: Migrate dumpsdata hosts to Stretch/Buster.

Given that the wikidata entity dumps are still finishing up the truthy gz files, and after that there will be bz2 recompression and the Lexemes, I'll be making the switchover tomorrow morning or mid-day EET.

Fri, Nov 22, 4:37 PM · Dumps-Generation, Operations
ArielGlenn renamed T238921: MCR: Include all slots in XML dumps per default from MCR: Include all sots in XML dumps per default to MCR: Include all slots in XML dumps per default.
Fri, Nov 22, 1:39 PM · CPT Initiatives (MCR Schema Migration), Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps.

...

https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/473222/ which I have not looked at for even a second (sorry!)

Ah, right - that would allow people to try out 0.11 in Special:Export before we make it the default. It doesn't prevent us from generating dumps in 0.11.
The big question is - do we need to provide both for a while, so people have time to adjust to 0.11? It's technically a breaking change.

Fri, Nov 22, 1:38 PM · Structured-Data-Backlog, CPT Initiatives (MCR), Multi-Content-Revisions, Multimedia, Core Platform Team Workboards (Done with CPT), TechCom-RFC (TechCom-Approved), Structured Data Engineering, Dumps-Generation, User-ArielGlenn, User-Daniel, Wikidata
ArielGlenn added a comment to T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps.

I see this ticket is resolved, but the dumps on commons have version="0.10". Since from this ticket I gather that the dumps that contain those slots are version=11, are those being produced?

Not yet. There are still unmerged patches, and we'll need to announce well in advance before rolling out the change.

I am not aware of any unmerged patches blocking the use of the 0.11 schema. Can you point me to them?

Fri, Nov 22, 11:54 AM · Structured-Data-Backlog, CPT Initiatives (MCR), Multi-Content-Revisions, Multimedia, Core Platform Team Workboards (Done with CPT), TechCom-RFC (TechCom-Approved), Structured Data Engineering, Dumps-Generation, User-ArielGlenn, User-Daniel, Wikidata
ArielGlenn added a comment to T220525: MCR: Import all slots from XML dumps.

Not yet; there's a task for that but it's blocked on a performance issue. See https://phabricator.wikimedia.org/T222497 the blocker, and https://phabricator.wikimedia.org/T221917 the dumps task.

To clarify - the blocker is for the RDF dumps. Including the MediaInfo slot in the XML dump is not blocked on anything, we could just do it. Or am I missing something?

That's right, this is an answer to the question "Is that structured data being dumped elsewhere on its own" (like the wikidata entity dumps).

Fri, Nov 22, 11:43 AM · User-ArielGlenn, Analytics, CPT Initiatives (MCR), Multi-Content-Revisions (New Features)
ArielGlenn added a comment to T224563: Migrate dumpsdata hosts to Stretch/Buster.

The patchset for tonight/tomorrow, moving misc cron storage to dumpsdata1002, is ready to go.

Fri, Nov 22, 9:01 AM · Dumps-Generation, Operations
ArielGlenn added a comment to T220525: MCR: Import all slots from XML dumps.

@daniel: just so I understand, since I know little about all this: at this time the slots that contain the structured data items on, say, a page in commons are NOT included in the dumps with the page itself. Correct?
Is that structured data being dumped elsewhere on its own?

Fri, Nov 22, 7:09 AM · User-ArielGlenn, Analytics, CPT Initiatives (MCR), Multi-Content-Revisions (New Features)
ArielGlenn added a comment to T199121: RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps.

I see this ticket is resolved, but the dumps on commons have version="0.10". Since from this ticket I gather that the dumps that contain those slots are version=11, are those being produced?

Fri, Nov 22, 7:07 AM · Structured-Data-Backlog, CPT Initiatives (MCR), Multi-Content-Revisions, Multimedia, Core Platform Team Workboards (Done with CPT), TechCom-RFC (TechCom-Approved), Structured Data Engineering, Dumps-Generation, User-ArielGlenn, User-Daniel, Wikidata

Thu, Nov 21

ArielGlenn added a comment to T68025: [Story] Monitor size of some Wikidata database tables.

...

I would say we should add size and not just the number of rows. There's a big refactor of revision table being deployed that will free up lots of space and that's what matters.

Thu, Nov 21, 6:56 AM · Wikidata-Campsite, WMDE-Analytics-Engineering, DBA, Story, Wikidata, Wikidata.org