
ArielGlenn (ariel)
User

User Details

User Since
Oct 8 2014, 7:09 PM (344 w, 2 d)
Availability
Available
IRC Nick
apergos
LDAP User
ArielGlenn
MediaWiki User
ArielGlenn

Recent Activity

Yesterday

ArielGlenn closed T280654: Pending deploys to dumps repo before May 1 2021 run as Resolved.

Done. A couple patches adding sample code for a job and setting the default config to skip that job also went out with this deploy.

Fri, May 14, 4:19 AM · Dumps-Generation

Thu, May 13

ArielGlenn added a comment to T282723: MediaWiki\Revision\RevisionAccessException: Main slot of revision not found in database. See T212428..

May I suggest we stuff the revId right into the exception message? It's passed as a parameter to the logger but doesn't end up being logged afaict; at least I don't find it in the logstash entry. See e.g. https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-deploy-2021.05.13?id=yqfCZXkBA6MeBtBq-eAT

Thu, May 13, 1:03 PM · MediaWiki-Page-derived-data, MediaWiki-Revision-backend, Platform Engineering, Wikimedia-production-error

Wed, May 12

ArielGlenn added a comment to T280624: commons mediainfo json dumps failing.

I need to look at these when the temp files are still around, which would be on Monday evening for both the json and ttl files. I'll try to remember to do that next week.

Wed, May 12, 1:27 PM · Structured-Data-Backlog (Current Work), WikibaseMediaInfo, Dumps-Generation

Tue, May 11

ArielGlenn added a comment to T280624: commons mediainfo json dumps failing.

@ArielGlenn can we close this out? Thanks!

Tue, May 11, 5:00 PM · Structured-Data-Backlog (Current Work), WikibaseMediaInfo, Dumps-Generation

Mon, May 10

ArielGlenn added a comment to T280678: Crunch and delete many old dumps logs.

Pinging @Addshore directly, any chance you are still generating this data, or alternatively, that you still have the tools around and could easily do so?

Mon, May 10, 5:01 PM · Analytics-Kanban, Analytics
ArielGlenn closed T282445: Add hoo/ Marius Hoch to the ops-dumps mail alias as Resolved.

I've added the email address requested via pm in irc, following the instructions at https://wikitech.wikimedia.org/wiki/SRE_Clinic_Duty#Mail_aliases

Mon, May 10, 12:06 PM · Dumps-Generation
ArielGlenn added a comment to T53001: Image tarball dumps on your.org are not being generated.

Whatever happened to media backups? Was an implementation decided on or even completed?

Mon, May 10, 7:34 AM · Dumps-Generation, SRE, Datasets-Archiving, Datasets-General-or-Unknown
ArielGlenn moved T280624: commons mediainfo json dumps failing from Blocked/Stalled/Waiting for event to Active on the Dumps-Generation board.
Mon, May 10, 7:33 AM · Structured-Data-Backlog (Current Work), WikibaseMediaInfo, Dumps-Generation
ArielGlenn moved T222985: Provide wikidata JSON dumps compressed with zstd from Active to Blocked/Stalled/Waiting for event on the Dumps-Generation board.
Mon, May 10, 7:33 AM · wdwb-tech, Dumps-Generation, Wikidata
ArielGlenn added a comment to T280654: Pending deploys to dumps repo before May 1 2021 run.

Welp. Didn't do these because of holidays. No harm done, but I guess they will go before the May 20th run.

Mon, May 10, 7:32 AM · Dumps-Generation

Fri, May 7

ArielGlenn added a comment to T280311: Temp files left around in wikistats_1/ ?.

So it looks like the https://dumps.wikimedia.org/other/wikistats_1.0/ folder is empty, so that can be deleted.

The https://dumps.wikimedia.org/other/wikistats_1/ folder contains all kinds of crazy and very outdated reports and results. If we wanted to reclaim that space, we could look through access logs for the past month to see if anyone's downloading it. I'd imagine we wouldn't find anything. So, my opinion: archive in HDFS and delete. If anyone feels uncomfortable with that, keep it around until we need the space.

Fri, May 7, 4:14 PM · Analytics-Radar, Dumps-Generation

Thu, May 6

ArielGlenn added a comment to T273585: Host OKAPI HTML dumps on public-facing labstore servers.

Hey, looks like we just need to consume it at a slower rate. We'll be doing a deployment this week that should solve the issue, and we'll keep you posted. Sorry for the late reply. As for the md5 or sha hash, I'll put it into another ticket and add you to it so you can track the progress.

Thu, May 6, 5:33 PM · Datasets-Archiving, Okapi [Wikimedia Enterprise], Dumps-Generation, cloud-services-team (Kanban)
ArielGlenn added a comment to T282033: AirFlow collaboration between PE and DE.

<snip>

So where are you at in the process? You can follow this task until we have something that's more ready for others to try, or you can jump in with us to build the infrastructure, let me know and we can plan accordingly.

Thu, May 6, 4:35 PM · Platform Team Workboards (Image Suggestion API), Analytics
ArielGlenn added a comment to T273585: Host OKAPI HTML dumps on public-facing labstore servers.

Hey just following up, any luck with the elwikiversity issue of too many requests in parallel?

Thu, May 6, 2:55 PM · Datasets-Archiving, Okapi [Wikimedia Enterprise], Dumps-Generation, cloud-services-team (Kanban)
ArielGlenn added a comment to T282033: AirFlow collaboration between PE and DE.

You know there is another airflow-common-usage task around, here it is: T237361

Thu, May 6, 9:14 AM · Platform Team Workboards (Image Suggestion API), Analytics
ArielGlenn moved T273089: mediawiki scripts fail on new buster image in deployment-prep from Active to Blocked/Stalled/Waiting for event on the Dumps-Generation board.
Thu, May 6, 8:57 AM · Dumps-Generation, Beta-Cluster-Infrastructure
ArielGlenn moved T273585: Host OKAPI HTML dumps on public-facing labstore servers from Active to Blocked/Stalled/Waiting for event on the Dumps-Generation board.
Thu, May 6, 8:57 AM · Datasets-Archiving, Okapi [Wikimedia Enterprise], Dumps-Generation, cloud-services-team (Kanban)
ArielGlenn moved T282078: decommission snapshot100[5,6,7].eqiad.wmnet from Active to Blocked/Stalled/Waiting for event on the Dumps-Generation board.
Thu, May 6, 8:57 AM · Dumps-Generation, decommission-hardware
ArielGlenn added a comment to T282078: decommission snapshot100[5,6,7].eqiad.wmnet.

@Cmjohnson These are all yours to decomm whenever you like. Thanks a lot!

Thu, May 6, 8:56 AM · Dumps-Generation, decommission-hardware
ArielGlenn reassigned T282078: decommission snapshot100[5,6,7].eqiad.wmnet from ArielGlenn to Cmjohnson.
Thu, May 6, 8:56 AM · Dumps-Generation, decommission-hardware
ArielGlenn committed rLPRIfff50c31cda6: remove fake mcrouter secrets for snapshot1005,6,7 (authored by ArielGlenn).
remove fake mcrouter secrets for snapshot1005,6,7
Thu, May 6, 8:46 AM
ArielGlenn closed T281330: deploy three new snapshots as replacements for snapshot1005,6,7 and set 1005,6,7 as spare as Resolved.

The new hosts are busily running dumps and the old ones have been marked as spare. Closing!

Thu, May 6, 7:44 AM · Patch-For-Review, Dumps-Generation
ArielGlenn moved T282078: decommission snapshot100[5,6,7].eqiad.wmnet from Backlog to Active on the Dumps-Generation board.
Thu, May 6, 7:43 AM · Dumps-Generation, decommission-hardware
ArielGlenn added a comment to T265056: Make Cirrus Search dump script more resilient to failures (elasticsearch restarts).

Today's report:

<13>May  4 00:08:29 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20210503/commonswiki-20210503-cirrussearch-file.json.gz
Thu, May 6, 5:49 AM · Discovery-Search, CirrusSearch, Dumps-Generation
ArielGlenn updated the task description for T282078: decommission snapshot100[5,6,7].eqiad.wmnet.
Thu, May 6, 5:41 AM · Dumps-Generation, decommission-hardware
ArielGlenn committed R1885:56e32ba295bd: remove snapshot1005,6,7 from dump scap targets (authored by ArielGlenn).
remove snapshot1005,6,7 from dump scap targets
Thu, May 6, 5:36 AM
ArielGlenn claimed T282078: decommission snapshot100[5,6,7].eqiad.wmnet.
Thu, May 6, 5:32 AM · Dumps-Generation, decommission-hardware
ArielGlenn created T282078: decommission snapshot100[5,6,7].eqiad.wmnet.
Thu, May 6, 5:32 AM · Dumps-Generation, decommission-hardware

Wed, May 5

ArielGlenn added a comment to T281267: various weekly and daily dumps run from systemd timers are broken.

What are the next steps on this? Should I be tweaking a manifest someplace?

Wed, May 5, 6:14 AM · wdwb-tech, Wikidata, SRE, observability, Dumps-Generation

Tue, May 4

ArielGlenn added a comment to T260223: Kiwix rsyncs not completing and stacking up on labstore1006,7.

Thanks for this. We should also not start a new rsync on our side if one is already running on the same host. And someone needs to doublecheck that there are timeouts so that we won't have processes hanging and never completing, although I don't think that's the case for the recent incidents.

Tue, May 4, 3:50 AM · affects-Kiwix-and-openZIM, Dumps-Generation, cloud-services-team (Kanban)
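
The two safeguards mentioned in the comment above (skip a run when one is already in progress on the same host, and enforce a timeout so nothing hangs forever) could be sketched roughly as follows. This is a minimal illustration, not the script actually used for the Kiwix rsyncs; the function name `run_exclusive`, the lock path, and the exit code 124 for a timeout are all assumptions made for the example.

```python
import fcntl
import subprocess
import sys


def run_exclusive(cmd, lock_path, timeout_secs):
    """Run cmd only if no other run holds lock_path (hypothetical helper).

    Returns the command's exit code, 124 if it was killed after
    timeout_secs, or None if the run was skipped because the lock
    was already held.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fails at once if already held.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None
        try:
            # subprocess.run kills the child when the timeout expires.
            return subprocess.run(cmd, timeout=timeout_secs).returncode
        except subprocess.TimeoutExpired:
            return 124
```

A caller would pass the real rsync command line as `cmd` and treat a `None` result as "previous run still going, try again next time". The lock is released when the file is closed, including when the process dies.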

Mon, May 3

ArielGlenn added a comment to T260223: Kiwix rsyncs not completing and stacking up on labstore1006,7.

I've seen some in the past month, indeed.

@ERROR: max connections (6) reached -- try again later
rsync error: error starting client-server protocol (code 5) at main.c(1675) [Receiver=3.1.3]

That was labstore1007 on Apr 14. We had one on Apr 8 as well.

Mon, May 3, 2:14 PM · affects-Kiwix-and-openZIM, Dumps-Generation, cloud-services-team (Kanban)

Thu, Apr 29

ArielGlenn awarded Deployment Training Graduate to recipient: MSantos.
Thu, Apr 29, 3:07 PM
MSantos awarded T281458: Deployment training request for mbsantos a Yellow Medal token.
Thu, Apr 29, 1:41 PM · Release-Engineering-Team (Deployment Training Requests)
ArielGlenn added a comment to T281330: deploy three new snapshots as replacements for snapshot1005,6,7 and set 1005,6,7 as spare.

After merging the above patch, I needed to remove the cron jobs from the dumpsgen crontab manually on snapshot1006,7 since switching the role to testbed does not and can't really do that. I also tested angwikibooks and skwikibooks full dump runs with the test config file that writes output to a test directory. The first wiki had previous runs so we tested prefetch with that; the second one did not so we tested db fetches of text content with that. Everything looks ready to go.

Thu, Apr 29, 12:35 PM · Patch-For-Review, Dumps-Generation
ArielGlenn closed T281458: Deployment training request for mbsantos as Resolved.

Thanks for showing up, and please come again AND tell your friends! See you at a deploy window again soon! Adding @thcipriani as a FYI since he's sort of corralling the trainings. And closing this task as done!

Thu, Apr 29, 11:56 AM · Release-Engineering-Team (Deployment Training Requests)
ArielGlenn updated subscribers of T281458: Deployment training request for mbsantos.

Let's see about getting you that +2 on the MediaWiki config repo. Aaaand done by @Reedy already. Please check but you should have the bits; you needed to be added to the wmf-deployment group.

Thu, Apr 29, 11:52 AM · Release-Engineering-Team (Deployment Training Requests)
ArielGlenn added a comment to T281458: Deployment training request for mbsantos.

p.s. I think we'll have another trainer in the meeting. But most of the time we don't bother to ack the invite :-D

Thu, Apr 29, 10:31 AM · Release-Engineering-Team (Deployment Training Requests)
ArielGlenn added a comment to T281458: Deployment training request for mbsantos.

Yes please, just show up! That would be great!

Thu, Apr 29, 10:17 AM · Release-Engineering-Team (Deployment Training Requests)

Wed, Apr 28

ArielGlenn added a comment to T280631: Make Wikimedia Enterprise Daily Exports and Diffs available to WMCS community.

So, in the Enterprise-WMCS ToU, along with a restriction on direct commercial use (and a restriction on on-selling), we should also include a restriction on the "systematic" provision of the feed to others. It would be important to not restrict "fair" sharing of the content (especially since the fortnightly dumps are available anyway). What we'd need to restrict is the systematic sharing - the bootleg recreation of the API.

Dumps are commonly mirrored on third party servers. Should we ask people not to mirror the fortnightly materials to-be-shipped to the dumps.wikimedia.org systems, or is that bit fine?

Wed, Apr 28, 4:32 PM · Dumps-Generation, Okapi [Wikimedia Enterprise]
ArielGlenn committed R1885:c694227ac618: Add snapshot1011,12,13 to scap targets for the dumps repo (authored by ArielGlenn).
Add snapshot1011,12,13 to scap targets for the dumps repo
Wed, Apr 28, 3:59 PM
ArielGlenn added a comment to T281330: deploy three new snapshots as replacements for snapshot1005,6,7 and set 1005,6,7 as spare.

I had to manually edit /srv/deployment/dumps/dumps-cache/.config on all three hosts to change the name of the upstream host from deploy1001 to deploy1002; it is still wrong in the repo on deploy1002. See T197470

Wed, Apr 28, 1:00 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T197470: find a way to systematically update the deployment server name across all repos.

Just ran into this today on an install of new snapshot1011,12,13: got the dreaded

Apr 28 11:47:52 snapshot1011 puppet-agent[8409]: Execution of '/usr/bin/scap deploy-local --repo dumps/dumps -D log_json:False' returned 70: #007
Apr 28 11:47:52 snapshot1011 puppet-agent[8409]: (/Stage[main]/Profile::Dumps::Generation::Worker::Common/Scap::Target[dumps/dumps]/Package[dumps/dumps]/ensure) change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy-local --repo dumps/dumps -D log_json:False' returned 70: #007

for the dumps repo. I have manually edited the dumps-cache/config files on those hosts and left the DEPLOY_HEAD file in the dumps repo on deploy1002 untouched, so that any proposed solution can be tested there. I have two more hosts yet to roll out, so we can definitely check what works.

Wed, Apr 28, 12:55 PM · Release-Engineering-Team (Seen), Scap, SRE
ArielGlenn committed rLPRI4d2d3a056248: Add fake mcrouter secrets for snapshot1011,12,13 (authored by ArielGlenn).
Add fake mcrouter secrets for snapshot1011,12,13
Wed, Apr 28, 11:45 AM
ArielGlenn added a comment to T280631: Make Wikimedia Enterprise Daily Exports and Diffs available to WMCS community.

Thanks for that clarification, Liam. The 'regular uploading to IA' example is the sort of thing I had in mind.

Wed, Apr 28, 10:43 AM · Dumps-Generation, Okapi [Wikimedia Enterprise]
ArielGlenn updated subscribers of T209390: Output some meta data about the wikidata JSON dump.

I am proactively adding @hoo as he can provide some insight and perhaps tag others as well.

Wed, Apr 28, 9:26 AM · wdwb-tech, Dumps-Generation, Wikidata
ArielGlenn moved T281330: deploy three new snapshots as replacements for snapshot1005,6,7 and set 1005,6,7 as spare from Backlog to Active on the Dumps-Generation board.
Wed, Apr 28, 6:30 AM · Patch-For-Review, Dumps-Generation
ArielGlenn triaged T281330: deploy three new snapshots as replacements for snapshot1005,6,7 and set 1005,6,7 as spare as Medium priority.
Wed, Apr 28, 6:10 AM · Patch-For-Review, Dumps-Generation
ArielGlenn moved T272509: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] from Blocked/Stalled/Waiting for event to Done on the Dumps-Generation board.
Wed, Apr 28, 6:07 AM · SRE, Dumps-Generation, ops-eqiad, DC-Ops
ArielGlenn moved T280554: Data dumps need better documentation from Active to Done on the Dumps-Generation board.
Wed, Apr 28, 6:07 AM · Documentation, Dumps-Generation
ArielGlenn moved T281267: various weekly and daily dumps run from systemd timers are broken from Backlog to Active on the Dumps-Generation board.
Wed, Apr 28, 6:07 AM · wdwb-tech, Wikidata, SRE, observability, Dumps-Generation
ArielGlenn moved T280631: Make Wikimedia Enterprise Daily Exports and Diffs available to WMCS community from Backlog to Other teams on the Dumps-Generation board.
Wed, Apr 28, 6:07 AM · Dumps-Generation, Okapi [Wikimedia Enterprise]

Tue, Apr 27

ArielGlenn added a project to T280631: Make Wikimedia Enterprise Daily Exports and Diffs available to WMCS community: Dumps-Generation.

What prevents someone from uploading the dailies from a WMCS instance to archive.org? Do we want to deter that, encourage it, have no opinion?

Tue, Apr 27, 6:52 PM · Dumps-Generation, Okapi [Wikimedia Enterprise]
ArielGlenn closed T280554: Data dumps need better documentation as Resolved.
Tue, Apr 27, 3:31 PM · Documentation, Dumps-Generation
ArielGlenn added a comment to T281203: dumps distribution servers space issues.

Linking here as a related issue: T281048 (storage for security related data also under discussion there)

Tue, Apr 27, 3:15 PM · Security-Team, Data-Services, cloud-services-team (Kanban)
ArielGlenn moved T278416: Mention QRank in “Analytics Datasets” from Other teams to Done on the Dumps-Generation board.
Tue, Apr 27, 2:57 PM · Analytics-Kanban, Dumps-Generation, Analytics
ArielGlenn moved T279661: DRY up .html files in puppet used for snapshot and dumps modules from Backlog to Done on the Dumps-Generation board.
Tue, Apr 27, 2:57 PM · Dumps-Generation
ArielGlenn added a comment to T272509: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5].

Ah ok! I didn't mean to be hasty, just saw the reimaging script runs and got excited :-)

Tue, Apr 27, 2:51 PM · SRE, Dumps-Generation, ops-eqiad, DC-Ops
ArielGlenn reopened T281267: various weekly and daily dumps run from systemd timers are broken as "Open".

After discussion with jbond, reopening for further discussion on better alerting in case of failures.

Tue, Apr 27, 2:34 PM · wdwb-tech, Wikidata, SRE, observability, Dumps-Generation
ArielGlenn added a comment to T281267: various weekly and daily dumps run from systemd timers are broken.

Note that the only way we found out about these usage errors from the systemd timer wrapper script was that a vigilant user of one of the output datasets happened to notice they weren't being produced. The error messages themselves just silently went into syslog with no one being the wiser. We might want to think about better reporting that nonetheless doesn't mean piles of cronspam.

Tue, Apr 27, 2:27 PM · wdwb-tech, Wikidata, SRE, observability, Dumps-Generation
ArielGlenn added a comment to T281267: various weekly and daily dumps run from systemd timers are broken.

Folks cc-ed on this should decide if their jobs ought to start later today in lieu of not running at all, and either do it or poke me to do it, if so.

Tue, Apr 27, 2:22 PM · wdwb-tech, Wikidata, SRE, observability, Dumps-Generation
ArielGlenn updated subscribers of T281267: various weekly and daily dumps run from systemd timers are broken.

This means the wikidata entity dumps and the commons mediainfo dumps did not run this week, cc @hoo and @dcausse for a heads up.

Tue, Apr 27, 2:18 PM · wdwb-tech, Wikidata, SRE, observability, Dumps-Generation
ArielGlenn added a comment to T281267: various weekly and daily dumps run from systemd timers are broken.

This was caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/679292 which added arg parsing to the systemd timer wrapper script.

Tue, Apr 27, 2:15 PM · wdwb-tech, Wikidata, SRE, observability, Dumps-Generation
ArielGlenn triaged T281267: various weekly and daily dumps run from systemd timers are broken as High priority.
Tue, Apr 27, 2:14 PM · wdwb-tech, Wikidata, SRE, observability, Dumps-Generation
ArielGlenn added a comment to T272509: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5].

Hey, this looks almost done, am I reading that right? :-) :-)

Tue, Apr 27, 5:48 AM · SRE, Dumps-Generation, ops-eqiad, DC-Ops
ArielGlenn added a comment to T281203: dumps distribution servers space issues.

In the short term, fewer dumps could be kept, although that only gets us so far.

Tue, Apr 27, 5:48 AM · Security-Team, Data-Services, cloud-services-team (Kanban)

Sun, Apr 25

ArielGlenn added a comment to T281048: mwlog1001 is running out of free space on /srv/mw-log.

If people move stuff off of /srv/security we could get .5T back which would be helpful. Some of those files are from a few years ago.

Sun, Apr 25, 6:16 AM · Performance-Team, MediaWiki-Revision-backend, MW-1.37-notes (1.37.0-wmf.3; 2021-04-27), observability, SRE

Fri, Apr 23

ArielGlenn added a comment to T273585: Host OKAPI HTML dumps on public-facing labstore servers.

For comparison, that 1.1 GB json-encoded html file bz2 compressed down to 348 MB and the 7z compressed one is 286 MB, quite a savings when you consider the larger files.

Fri, Apr 23, 5:51 PM · Datasets-Archiving, Okapi [Wikimedia Enterprise], Dumps-Generation, cloud-services-team (Kanban)
ArielGlenn added a comment to T273585: Host OKAPI HTML dumps on public-facing labstore servers.

We don't have a doc describing the dump. But I can write up a README file just so we have it documented; this will be good for us as well, since as more people start using this we'll have a place to point them to for info.

The basic idea is that every title is written as a separate file inside the tar.gz file. So if you unpack it you'll get a directory full of separate JSON files, one for each title (the file name is the title with a .json extension).

Fri, Apr 23, 4:08 PM · Datasets-Archiving, Okapi [Wikimedia Enterprise], Dumps-Generation, cloud-services-team (Kanban)
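
Given the layout described in the comment above (one JSON file per title inside a tar.gz), a reader could walk the archive without unpacking it to disk. The sketch below assumes only that layout; the function name `iter_titles` is made up for the example, and nothing is assumed about the fields inside each JSON document.

```python
import json
import tarfile


def iter_titles(tar_path):
    """Yield (member name, parsed JSON) for each .json file in the archive.

    Hypothetical helper: assumes one JSON document per title, as in the
    description above, and makes no assumption about the JSON fields.
    """
    with tarfile.open(tar_path, "r:gz") as archive:
        for member in archive:
            # Skip directories and anything that is not a .json file.
            if not member.isfile() or not member.name.endswith(".json"):
                continue
            with archive.extractfile(member) as handle:
                yield member.name, json.load(handle)
```

Iterating the archive member by member keeps memory use bounded by the largest single title rather than by the whole dump.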
ArielGlenn added a comment to T273673: replace all puppet crons with systemd timers.

As folks might guess from all the merges, the first email via MAILTO to ops-dumps arrived today, verifying that part of the migration, so get your patches in, Amir! :-)

Fri, Apr 23, 10:54 AM · Patch-For-Review, User-jbond, puppet-compiler, SRE, Puppet
ArielGlenn added a comment to T265056: Make Cirrus Search dump script more resilient to failures (elasticsearch restarts).

Today's report:

<13>Apr 22 04:16:02 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20210419/mniwiktionary-20210419-cirrussearch-content.json.gz
<13>Apr 22 04:16:02 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20210419/mniwiktionary-20210419-cirrussearch-general.json.gz
Fri, Apr 23, 10:33 AM · Discovery-Search, CirrusSearch, Dumps-Generation
ArielGlenn closed T278416: Mention QRank in “Analytics Datasets” as Resolved.

Live on the web server. Have a great weekend when it arrives!

Fri, Apr 23, 5:03 AM · Analytics-Kanban, Dumps-Generation, Analytics

Thu, Apr 22

ArielGlenn added a comment to T277629: Create new group for root access to snapshot*, dumpsdata* and labstore1006,7 with holger in it.

Any news on this?

Thu, Apr 22, 4:26 AM · SRE, SRE-Access-Requests, Dumps-Generation

Wed, Apr 21

ArielGlenn added a comment to T273585: Host OKAPI HTML dumps on public-facing labstore servers.

I was able to successfully download the dump for elwikisource, so that tells me that the basic functionality of the script is good.

Wed, Apr 21, 4:33 AM · Datasets-Archiving, Okapi [Wikimedia Enterprise], Dumps-Generation, cloud-services-team (Kanban)
ArielGlenn added a comment to T280554: Data dumps need better documentation.

If you have no further questions, I'll close this task. You might also consider subscribing to the xmldatadumps-l mailing list https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l as a relatively low traffic list where announcements about dumps are sent, and people familiar with uses of the dumps discuss with each other.

Wed, Apr 21, 4:26 AM · Documentation, Dumps-Generation

Tue, Apr 20

ArielGlenn added a comment to T278666: create instance in deployment-prep just for testing MediaWiki code not yet merged into master.

adding phpunit to an instance can't be done with existing packages

Running PhpUnit on a host that has config for actual beta databases/services sounds incredibly dangerous to me, are you sure this won't cause any issues?

As long as people are forewarned not to run integration tests but only true unit tests (which by definition don't write to the db), and if they aren't sure, they better look at the specific test to see what it does.

Tue, Apr 20, 9:44 PM · Release-Engineering-Team (Radar), Testing-Roadblocks, User-ArielGlenn, Beta-Cluster-Infrastructure
ArielGlenn added a comment to T273585: Host OKAPI HTML dumps on public-facing labstore servers.

Thanks for this info. I might try with a different wiki and see what happens there. Let me know when the namespace change happens and I'll update my script accordingly.

Tue, Apr 20, 9:40 PM · Datasets-Archiving, Okapi [Wikimedia Enterprise], Dumps-Generation, cloud-services-team (Kanban)
ArielGlenn added a comment to T278666: create instance in deployment-prep just for testing MediaWiki code not yet merged into master.

So... would anyone mind if I went ahead and JFDI? I probably would have to concoct some custom role but if people don't mind the extra manifest it shouldn't be too hard.

Tue, Apr 20, 4:31 PM · Release-Engineering-Team (Radar), Testing-Roadblocks, User-ArielGlenn, Beta-Cluster-Infrastructure
ArielGlenn added a comment to T111775: Infoboxes are mistaken for abstracts in page abstract dumps..

I'm not averse to that if we can determine there are no real users of the data, and there's an acceptable substitute. These dumps aren't needed for making a replica, or for doing analysis of the content.

Tue, Apr 20, 4:02 PM · ActiveAbstract, TextExtracts, Research-Backlog, Dumps-Generation
ArielGlenn moved T280624: commons mediainfo json dumps failing from Other teams to Blocked/Stalled/Waiting for event on the Dumps-Generation board.
Tue, Apr 20, 2:15 PM · Structured-Data-Backlog (Current Work), WikibaseMediaInfo, Dumps-Generation
ArielGlenn added a comment to T280624: commons mediainfo json dumps failing.

That config fix is live on snapshot1008 and will take effect during the next run (next week). I'll leave the task open until we've verified that it runs ok. After that, I'll check ttl and json file sizes and perhaps adjust the comments in that file to reflect what to look at in order to guesstimate the size in the future.

Tue, Apr 20, 2:15 PM · Structured-Data-Backlog (Current Work), WikibaseMediaInfo, Dumps-Generation
ArielGlenn added a comment to T280554: Data dumps need better documentation.

Thanks for the quick response. That looks like it covers most of the questions I had.

I assume the md5sums and sha1sums files are just checksumming the corresponding data files for transmission verification?

Tue, Apr 20, 2:10 PM · Documentation, Dumps-Generation
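
For readers who want to use the md5sums and sha1sums files for the kind of download verification asked about above, here is a rough sketch. It assumes the checksum files use the usual `md5sum`/`sha1sum` tool layout of one `<hex digest>  <filename>` pair per line, which is an assumption and not something stated in this feed; the function names are hypothetical.

```python
import hashlib


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_sums(text):
    """Parse '<hex digest>  <filename>' lines into {filename: digest}.

    Assumes the md5sum/sha1sum tool output format; a leading '*'
    (binary-mode marker) on the filename is dropped.
    """
    sums = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            sums[parts[1].strip().lstrip("*")] = parts[0].lower()
    return sums


def verify(path, name, sums, algorithm="sha1"):
    """True if the file at path matches the digest listed for name."""
    return sums.get(name) == file_digest(path, algorithm)
```

Pass `algorithm="md5"` when checking against an md5sums file. Reading in chunks matters here because dump files can be far larger than available memory.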
ArielGlenn moved T280654: Pending deploys to dumps repo before May 1 2021 run from Backlog to Up Next on the Dumps-Generation board.
Tue, Apr 20, 11:36 AM · Dumps-Generation
ArielGlenn triaged T280654: Pending deploys to dumps repo before May 1 2021 run as Medium priority.
Tue, Apr 20, 11:36 AM · Dumps-Generation
ArielGlenn moved T280554: Data dumps need better documentation from Backlog to Active on the Dumps-Generation board.
Tue, Apr 20, 11:34 AM · Documentation, Dumps-Generation
ArielGlenn moved T280624: commons mediainfo json dumps failing from Backlog to Other teams on the Dumps-Generation board.
Tue, Apr 20, 11:34 AM · Structured-Data-Backlog (Current Work), WikibaseMediaInfo, Dumps-Generation
ArielGlenn updated subscribers of T280624: commons mediainfo json dumps failing.

Cc-ing @Cparle who has worked on these in the past. (If there's a better person, please let me know.)

Tue, Apr 20, 7:54 AM · Structured-Data-Backlog (Current Work), WikibaseMediaInfo, Dumps-Generation
ArielGlenn triaged T280624: commons mediainfo json dumps failing as High priority.
Tue, Apr 20, 7:52 AM · Structured-Data-Backlog (Current Work), WikibaseMediaInfo, Dumps-Generation
ArielGlenn added a comment to T280554: Data dumps need better documentation.

I have added some information about filenames here: https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download#XML_files if that is helpful. Please let me know if there is other information missing!

Tue, Apr 20, 6:31 AM · Documentation, Dumps-Generation
ArielGlenn added a comment to T280554: Data dumps need better documentation.

Hi, dumps maintainer here.

Tue, Apr 20, 5:00 AM · Documentation, Dumps-Generation

Mon, Apr 19

ArielGlenn added a comment to T280311: Temp files left around in wikistats_1/ ?.

@ArielGlenn thank you for noticing, please delete!

Mon, Apr 19, 3:56 PM · Analytics-Radar, Dumps-Generation
ArielGlenn added a comment to T273585: Host OKAPI HTML dumps on public-facing labstore servers.

Code stash place until it works and I figure out where in the wmf repo system it ought to live: https://github.com/apergos/okapi-downloader

Mon, Apr 19, 11:40 AM · Datasets-Archiving, Okapi [Wikimedia Enterprise], Dumps-Generation, cloud-services-team (Kanban)
ArielGlenn added a comment to T273585: Host OKAPI HTML dumps on public-facing labstore servers.
curl -L -u USERNAME_HERE --output got-this.json  https://api.wikimediaenterprise.org/v1/exports/json/elwikiversity

when supplied with the right password, gets only a file of size

-rw-rw-r--. 1 ariel ariel 565816 Απρ  19 14:06 got-this.json

with 99 entries in it, truncated.

Mon, Apr 19, 11:12 AM · Datasets-Archiving, Okapi [Wikimedia Enterprise], Dumps-Generation, cloud-services-team (Kanban)

Fri, Apr 16

ArielGlenn added a comment to T273585: Host OKAPI HTML dumps on public-facing labstore servers.

About sizes: those sizes are still for HTML dumps; we are in the process of switching them to reflect JSON sizes, so you can use them only as an approximation, for now at least. We are not computing hashes at this point, but that's definitely something we can work out in the future. I'll keep you posted on updates. Thanks for all the feedback.

Fri, Apr 16, 1:36 PM · Datasets-Archiving, Okapi [Wikimedia Enterprise], Dumps-Generation, cloud-services-team (Kanban)
ArielGlenn added a comment to T265056: Make Cirrus Search dump script more resilient to failures (elasticsearch restarts).

Today's report:

<13>Apr 15 03:00:52 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20210412/mniwiktionary-20210412-cirrussearch-content.json.gz
<13>Apr 15 03:00:52 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20210412/mniwiktionary-20210412-cirrussearch-general.json.gz
Fri, Apr 16, 1:32 PM · Discovery-Search, CirrusSearch, Dumps-Generation
ArielGlenn added a comment to T273585: Host OKAPI HTML dumps on public-facing labstore servers.

I see the size of each dump is available in the json project list output, which is great. Can we also get md5 or sha1 hashes of these files via the same endpoint? This would be extremely nice for downloaders, and also for us to verify that downloads are complete and without corruption.

Fri, Apr 16, 9:55 AM · Datasets-Archiving, Okapi [Wikimedia Enterprise], Dumps-Generation, cloud-services-team (Kanban)
ArielGlenn added a comment to T273585: Host OKAPI HTML dumps on public-facing labstore servers.

Additional update via email from the OKAPI folks:

Fri, Apr 16, 7:28 AM · Datasets-Archiving, Okapi [Wikimedia Enterprise], Dumps-Generation, cloud-services-team (Kanban)
ArielGlenn added a comment to T279832: Some pages moves fail with "InvalidArgumentException: The Title object yields no ID. Perhaps the page doesn't exist?".

It is hard to quantify the impact (only those with logstash access can quantify it really). But I can tell you that in the last few weeks, at least 6 or 7 users have reported it on fawiki for various pages. I can also tell you that in that same period, hundreds of pages have been successfully moved on fawiki. So your original assessment of "it only impacts some pages on some wikis" is correct, but because no solid workaround exists, I still prefer this to be UBN and attract more attention from the developer community and the WMF engineers.

Fri, Apr 16, 7:17 AM · MW-1.37-notes (1.37.0-wmf.5; 2021-05-11), MediaWiki-extensions-Scribunto, Platform Team Workboards (Clinic Duty Team), MediaWiki-Page-rename, Wikimedia-production-error
ArielGlenn added a comment to T273585: Host OKAPI HTML dumps on public-facing labstore servers.

I've received credentials from Ryan, which will get placed into the private puppet repo soon enough, and the access point is https://api.wikimediaenterprise.org/v1/docs/index.html#/ which produces json output. We can now start getting to work on scripting this. We'll set a descriptive user agent in the test and eventual production script including the standard email contact address for dumps, once there's something to test.

Fri, Apr 16, 5:21 AM · Datasets-Archiving, Okapi [Wikimedia Enterprise], Dumps-Generation, cloud-services-team (Kanban)
ArielGlenn moved T273585: Host OKAPI HTML dumps on public-facing labstore servers from Blocked/Stalled/Waiting for event to Active on the Dumps-Generation board.
Fri, Apr 16, 5:12 AM · Datasets-Archiving, Okapi [Wikimedia Enterprise], Dumps-Generation, cloud-services-team (Kanban)
ArielGlenn moved T280311: Temp files left around in wikistats_1/ ? from Backlog to Other teams on the Dumps-Generation board.
Fri, Apr 16, 5:12 AM · Analytics-Radar, Dumps-Generation