ArielGlenn (ariel)
User

User Details

User Since
Oct 8 2014, 7:09 PM (319 w, 5 d)
Availability
Available
IRC Nick
apergos
LDAP User
ArielGlenn
MediaWiki User
ArielGlenn [ Global Accounts ]

Recent Activity

Yesterday

ArielGlenn closed T268333: New dump run paused; Wikidata xml/sql dump page meta history dumps incomplete as Resolved.

No-op is done, and the files are on the web server, the cloud nfs server, and the dumpsdata fallback server. Closing this task.

Sun, Nov 22, 9:49 PM · Dumps-Generation
ArielGlenn added a comment to T268333: New dump run paused; Wikidata xml/sql dump page meta history dumps incomplete.

The hashes are done. I have edited dumpruninfo.txt for the run and set the status for the meta history jobs to 'done', adding appropriate time stamps according to when the last bz2 and 7z files were completed.

Sun, Nov 22, 8:51 PM · Dumps-Generation
ArielGlenn added a comment to T268333: New dump run paused; Wikidata xml/sql dump page meta history dumps incomplete.

It finished up and the md5 and sha1 hashes are being generated now.

Sun, Nov 22, 8:13 PM · Dumps-Generation
ArielGlenn moved T268333: New dump run paused; Wikidata xml/sql dump page meta history dumps incomplete from Backlog to Active on the Dumps-Generation board.
Sun, Nov 22, 8:12 PM · Dumps-Generation
ArielGlenn triaged T268417: Need tool to append blocks onto the end of an xml dump bz2 file as High priority.
Sun, Nov 22, 6:58 PM · Patch-For-Review, Dumps-Generation
ArielGlenn triaged T268416: Need tool to split up large xml bz2 file into smaller ones as High priority.
Sun, Nov 22, 6:03 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T268333: New dump run paused; Wikidata xml/sql dump page meta history dumps incomplete.

The 7z file is still being written. If all goes well it should finish up today.

Sun, Nov 22, 10:25 AM · Dumps-Generation

Sat, Nov 21

ArielGlenn added a comment to T263319: Be more efficient about generating page ranges for breaking up page content jobs into smaller ones.

The above patch has been deployed everywhere and manually copied into the directory for the run already underway. It won't save us tons of time on the run but still, every savings is useful.

Sat, Nov 21, 8:43 AM · Patch-For-Review, Dumps-Generation

Fri, Nov 20

ArielGlenn added a comment to T268333: New dump run paused; Wikidata xml/sql dump page meta history dumps incomplete.

The missing bz2 file is now available, although not linked for download yet. The last md5 and sha1 sums are running now, and because a couple of the files are quite big (193GB, 65GB), it's going to take a bit for those to finish up before I can run a no-op and have them show up with links.
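
For files of this size, both digests can be computed in a single read pass so the data is not pulled from disk twice. The following is a minimal Python sketch of that idea and makes no claim about how the dumps tooling does it; the function name `md5_and_sha1` and the chunk size are illustrative assumptions.

```python
import hashlib


def md5_and_sha1(path, chunk_size=1024 * 1024):
    """Return (md5_hex, sha1_hex) for the file at path, reading it once."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as infile:
        # Read fixed-size chunks so memory use stays flat for very large files.
        for chunk in iter(lambda: infile.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()
```

Reading in chunks keeps memory use constant regardless of file size; the run time is still dominated by disk reads, which is why hashing a 193GB file takes a while.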

Fri, Nov 20, 7:31 PM · Dumps-Generation
ArielGlenn triaged T268333: New dump run paused; Wikidata xml/sql dump page meta history dumps incomplete as High priority.
Fri, Nov 20, 12:29 PM · Dumps-Generation
ArielGlenn added a comment to T265056: Cirrus Search dumps failed for some wikis.

Today's report:

<13>Nov 18 16:27:19 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20201116/frwiki-20201116-cirrussearch-content.json.gz
Fri, Nov 20, 8:44 AM · Discovery-Search, CirrusSearch, Dumps-Generation
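
The CirrusSearch failure lines quoted in these reports all follow the same shape, so the wiki, run date and index type can be pulled out mechanically when summarizing which wikis failed. A small Python sketch, assuming the line format stays exactly as quoted on this page; the regular expression, the function name `parse_failure` and the group names are illustrative and not part of any existing tool.

```python
import re

# Pattern built from the report lines quoted on this page; group names are illustrative.
FAILURE_RE = re.compile(
    r"DumpIndex\.php failed for \S*/(?P<date>\d{8})/"
    r"(?P<wiki>[a-z0-9_]+)-(?P=date)-cirrussearch-(?P<index>[a-z]+)\.json\.gz"
)


def parse_failure(line):
    """Return (wiki, date, index) for a failure line, or None if it does not match."""
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    return match.group("wiki"), match.group("date"), match.group("index")
```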

Thu, Nov 19

ArielGlenn added a comment to T267037: "Untagging because we were tagged by Herald".

If it's any help, Traffic workboard has a column called "Bad Herald", you can do a similar thing.

Thu, Nov 19, 6:34 AM · Platform Engineering

Mon, Nov 16

ArielGlenn added a comment to T264991: Upgrade the MediaWiki servers to ICU 63.

Note that I ran a little dumps test on a non-latin1 wiki in deployment-prep (ruwiki to be precise) and the results look just fine. But I didn't do any sort of comprehensive testing like MW integration tests or whatever they do over there.

Mon, Nov 16, 4:13 PM · Patch-For-Review, Beta-Cluster-Infrastructure, DBA, Operations, serviceops

Sat, Nov 14

ArielGlenn renamed T267854: Consider dumping tables in batches of rows via mysqldump (using where) from Consider dumping tables in batches of ros via mysql (using where) to Consider dumping tables in batches of rows via mysqldump (using where).
Sat, Nov 14, 12:11 AM · Dumps-Generation
ArielGlenn triaged T267854: Consider dumping tables in batches of rows via mysqldump (using where) as Medium priority.
Sat, Nov 14, 12:09 AM · Dumps-Generation

Thu, Nov 12

ArielGlenn added a comment to T267796: Beta cluster not working (Error: 502, Next Hop Connection Failed).

The varnishes in beta are broken right now, see T267561

Thu, Nov 12, 2:16 PM · Beta-Cluster-Infrastructure
ArielGlenn added a comment to T267561: Beta needs to be upgraded to Varnish 6.

I was able to get further along by doing things manually as root on one of the instances, deployment-cache-text06.

Thu, Nov 12, 1:09 PM · User-Ryasmeen, Operations, Beta-Cluster-Infrastructure, Traffic
ArielGlenn committed R1891:6a2c601da921: use long ints for rev lengths in revsperpage (authored by ArielGlenn).
use long ints for rev lengths in revsperpage
Thu, Nov 12, 12:11 PM

Tue, Nov 10

ArielGlenn added a comment to T267561: Beta needs to be upgraded to Varnish 6.

I guess that T267439 might be related.

Tue, Nov 10, 7:31 PM · User-Ryasmeen, Operations, Beta-Cluster-Infrastructure, Traffic
ArielGlenn added a comment to T219279: Some pages will become completely unreachable after PHP7 update due to Unicode changes.

From looking at the code, it seems like the user list ought to have three fields, the first one being the name of the wiki. That appears to be missing. Someone can correct me on that later if they know better.

Tue, Nov 10, 7:19 PM · MW-1.35-notes (1.35.0-wmf.28; 2020-04-14), User-notice, Platform Team Workboards (Clinic Duty Team), MW-1.34-notes (1.34.0-wmf.16; 2019-07-30), serviceops, Operations, PHP 7.2 support, MediaWiki-General
ArielGlenn added a comment to T264991: Upgrade the MediaWiki servers to ICU 63.

Update in deployment-prep is now complete, assuming I did not miss any hosts.

Tue, Nov 10, 6:38 PM · Patch-For-Review, Beta-Cluster-Infrastructure, DBA, Operations, serviceops
ArielGlenn added a comment to T264991: Upgrade the MediaWiki servers to ICU 63.

Puppet sync back to working. Back on track to continue with the update in deployment-prep.

Tue, Nov 10, 6:03 PM · Patch-For-Review, Beta-Cluster-Infrastructure, DBA, Operations, serviceops
ArielGlenn added a comment to T264991: Upgrade the MediaWiki servers to ICU 63.

This is apparently from T267439. After some discussion with jbond and dancy in irc, I am going to revert that and hope I'm not making the varnish situation worse. Also adding @thcipriani so he is informed.

Tue, Nov 10, 5:54 PM · Patch-For-Review, Beta-Cluster-Infrastructure, DBA, Operations, serviceops
ArielGlenn added a comment to T264991: Upgrade the MediaWiki servers to ICU 63.

Can't proceed at the moment, puppet sync to deployment-prep has been broken since Nov 6. Log excerpts from the earliest error:

2020-11-06T20:00:06Z INFO     git.cmd: git rev-parse --abbrev-ref HEAD -> 0; stdout: 'master'
2020-11-06T20:00:06Z INFO     git.cmd: git rev-parse master -> 0; stdout: 'c52d47210553bde2e89735a73637b918911d0226'
2020-11-06T20:00:06Z INFO     git.cmd: git merge-base 63bd567cfb4a5dfbeeb28e1908fb3f8b976d6a4e HEAD -> 0; stdout: '63bd567cfb4a5dfbeeb28e1908fb3f8b976d6a4e'
2020-11-06T20:00:06Z INFO     sync-upstream: Up-to-date: /var/lib/git/labs/private
2020-11-06T20:10:01Z INFO     git.cmd: git diff --abbrev=40 --full-index -M --raw --no-color
2020-11-06T20:10:01Z ERROR    sync-upstream: Local diffs detected.  Commit your changes!
2020-11-06T20:20:01Z INFO     git.cmd: git diff --abbrev=40 --full-index -M --raw --no-color
2020-11-06T20:20:01Z ERROR    sync-upstream: Local diffs detected.  Commit your changes!

git info:

root@deployment-puppetmaster04:/var/lib/git/operations/puppet/modules/profile/manifests/mediawiki# git status
HEAD detached from 0bd812c4d7
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
Tue, Nov 10, 5:12 PM · Patch-For-Review, Beta-Cluster-Infrastructure, DBA, Operations, serviceops
ArielGlenn added a comment to T264991: Upgrade the MediaWiki servers to ICU 63.

Note that I'm running

cumin  'O{project:deployment-prep  name:^deployment-mediawiki-[0-9]+$ } or O{project:deployment-prep  name:^deployment-parsoid[0-9]+$ } or O{project:deployment-prep  name:^deployment-snapshot[0-9]+$ } or O{project:deployment-prep  name:^deployment-jobrunner[0-9]+$ } or O{project:deployment-prep  name:^deployment-mwmaint[0-9]+$ }'

which targets the following instances:

deployment-jobrunner03.deployment-prep.eqiad1.wikimedia.cloud,deployment-mediawiki-[07,09].deployment-prep.eqiad1.wikimedia.cloud,
deployment-mwmaint01.deployment-prep.eqiad1.wikimedia.cloud,deployment-parsoid11.deployment-prep.eqiad1.wikimedia.cloud,
deployment-snapshot02.deployment-prep.eqiad1.wikimedia.cloud
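
The name patterns in the query above can be checked offline against a list of instance names. The sketch below is plain Python and does not use cumin itself; it assumes the `name:` regexes are applied to the short instance name without the domain suffix, and the function name `select_instances` is made up for illustration.

```python
import re

# Name regexes copied from the cumin query above.
NAME_PATTERNS = [
    r"^deployment-mediawiki-[0-9]+$",
    r"^deployment-parsoid[0-9]+$",
    r"^deployment-snapshot[0-9]+$",
    r"^deployment-jobrunner[0-9]+$",
    r"^deployment-mwmaint[0-9]+$",
]


def select_instances(names):
    """Return the sorted instance names that match at least one pattern."""
    compiled = [re.compile(pattern) for pattern in NAME_PATTERNS]
    return sorted(name for name in names if any(rx.search(name) for rx in compiled))
```

Feeding it the short names of the instances listed above plus an unrelated one such as deployment-cache-text06 should return only the six targeted hosts.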
Tue, Nov 10, 4:19 PM · Patch-For-Review, Beta-Cluster-Infrastructure, DBA, Operations, serviceops
ArielGlenn added a comment to T264991: Upgrade the MediaWiki servers to ICU 63.

Upgrade plan on deployment-prep:

  • add profile::mediawiki::php::icu63: true to hiera for deployment-prep project prefix; this will only have impact on hosts including the appropriate mediawiki manifest
  • apt-get update on deploy*, mediawiki*, jobrunner*, parsoid*, snapshot*
  • export DEBIAN_FRONTEND=noninteractive; apt-get -y install libicu63 libxml2 on all of the above
  • export DEBIAN_FRONTEND=noninteractive; apt-get install php7.2-bcmath php7.2-bz2 php7.2-cli php7.2-common php7.2-curl php7.2-dba php7.2-fpm php7.2-gd php7.2-gmp php7.2-intl php7.2-json php7.2-mbstring php7.2-mysql php7.2-opcache php7.2-readline php7.2-xml php-apcu php-cli php-common php-excimer php-geoip php-igbinary php-luasandbox php-memcached php-mongodb php-msgpack php-redis php-tideways-xhprof php-wmerrors -y on all of the above
Tue, Nov 10, 3:41 PM · Patch-For-Review, Beta-Cluster-Infrastructure, DBA, Operations, serviceops
ArielGlenn added a comment to T252396: Split page-meta-history wikidata dump job across multiple hosts.

Running pages-meta-history for wikidata parts 18 - 23 in screen on snapshot1009, and 23-27 on snapshot1010, one last time, hoping to have the above deployed by Dec 1.

Tue, Nov 10, 3:11 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a project to T264991: Upgrade the MediaWiki servers to ICU 63: Beta-Cluster-Infrastructure.

I will be updating to icu63 in deployment-prep, with Moritz looking on. This will likely happen later today, and I'll post updates about the progress.

Tue, Nov 10, 10:53 AM · Patch-For-Review, Beta-Cluster-Infrastructure, DBA, Operations, serviceops

Fri, Nov 6

ArielGlenn added a comment to T263319: Be more efficient about generating page ranges for breaking up page content jobs into smaller ones.

Welp. In the year 2020 it turns out I still need a long int in order to store byte counts that could get up to 3.5GB. So revsperpage gave values that were much too small for cumulative rev lengths and so on, and the code to split up page content jobs into smallish page ranges mostly didn't.
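
The effect is easy to reproduce. This Python sketch uses `ctypes` to show what happens when a byte count of about 3.5GB is stored in a signed 32-bit integer versus a 64-bit one; it assumes the original counter was a signed 32-bit int, which the comment above implies but does not state outright.

```python
import ctypes

byte_count = 3_500_000_000  # roughly 3.5GB, the figure mentioned above

# A signed 32-bit integer tops out at 2**31 - 1, so the value wraps around.
as_int32 = ctypes.c_int32(byte_count).value

# A signed 64-bit integer holds the value unchanged.
as_int64 = ctypes.c_int64(byte_count).value

print(as_int32)  # -794967296
print(as_int64)  # 3500000000
```

Any size threshold compared against the wrapped value would behave unpredictably, which fits the observation that the job splitting mostly did not happen.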

Fri, Nov 6, 7:01 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T265056: Cirrus Search dumps failed for some wikis.

Today's report:

<13>Nov  3 17:45:52 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20201102/commonswiki-20201102-cirrussearch-file.json.gz
Fri, Nov 6, 9:03 AM · Discovery-Search, CirrusSearch, Dumps-Generation

Wed, Nov 4

ArielGlenn added a project to T246415: Investigate a different db load groups for wikidata / wikibase: User-ArielGlenn.
Wed, Nov 4, 5:48 PM · User-ArielGlenn, User-Michael, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Sustainability (Incident Followup), Patch-For-Review, Wikidata-Trailblazing-Exploration, User-Addshore, wdwb-tech-focus, Wikidata
ArielGlenn added a comment to T263319: Be more efficient about generating page ranges for breaking up page content jobs into smaller ones.

Aaaaand I'm wrong. There was a reference to it in the defaults, I swear I checked that but oh well. So the revinfo files are all created as they should be. I'll know if they have the desired effect when I see how the wikidata pagecontent meta-history jobs start off, probably a few days yet.

Wed, Nov 4, 3:52 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T263319: Be more efficient about generating page ranges for breaking up page content jobs into smaller ones.

Dang it, I was so excited to check if the revinfo stuff was working today but I forgot to add an entry to the path for it in the production configs, after all my careful local testing >_< I'll check in on the jobs again tomorrow.

Wed, Nov 4, 11:25 AM · Patch-For-Review, Dumps-Generation

Tue, Nov 3

ArielGlenn added a comment to T266333: Xml/sql dumps are still querying etcd excessively, fix this..

This looks a lot better, still a spike but half the size. There's probably more that could be done yet.

Tue, Nov 3, 6:56 AM · Dumps-Generation
ArielGlenn added a comment to T267077: Document remaining database load groups .

Thanks @ArielGlenn! I put a really simple/generic blurb in the attached patchset, please feel free to comment/amend however you think would be best.

Tue, Nov 3, 6:43 AM · Patch-For-Review, Platform Engineering, Performance-Team

Mon, Nov 2

ArielGlenn added a comment to T267077: Document remaining database load groups .

I can say something about the "dump" group if someone points me at a location and tells me an appropriate format.

Mon, Nov 2, 10:37 PM · Patch-For-Review, Platform Engineering, Performance-Team
ArielGlenn added a comment to T267037: "Untagging because we were tagged by Herald".

Yeah fine, we are talking about the rule right now. Please give us a few minutes though :-P

Mon, Nov 2, 3:44 PM · Platform Engineering

Sun, Nov 1

ArielGlenn added a comment to T266333: Xml/sql dumps are still querying etcd excessively, fix this..

In fact I had to deploy a couple quick fixes to manage directory creation at the start of a dump run, see https://gerrit.wikimedia.org/r/c/operations/dumps/+/637855 and https://gerrit.wikimedia.org/r/c/operations/dumps/+/637856

Sun, Nov 1, 10:20 AM · Dumps-Generation

Fri, Oct 30

ArielGlenn added a comment to T264298: wb_terms is getting removed.

All of those tables are there: see https://gerrit.wikimedia.org/r/c/operations/puppet/+/527505 and current https://github.com/wikimedia/puppet/blob/production/modules/snapshot/files/dumps/table_jobs.yaml#L142

Fri, Oct 30, 2:29 PM · User-Addshore, Wikidata, Dumps-Generation
ArielGlenn added a comment to T252396: Split page-meta-history wikidata dump job across multiple hosts.

I believe all of the functionality including error handling is there. I need a few more integration-ish tests yet before this can go live, and then there will be a puppet patch needed to add a stage to the regular snapshots (not running wikidata or enwiki) to be secondary batch workers on wikidata for the full run.

Fri, Oct 30, 2:21 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T264298: wb_terms is getting removed.

We also realized that the tablejobs.yaml file didn’t mention the new tables (the replacement for wb_terms: wbt_{item,property}_terms, wbt_{term,text}_in_lang, wbt_text, wbt_type). If wb_terms was worth dumping, then presumably the new tables should be dumped too. Is it enough to add them to the YAML file or do you need some extra setup for new tables?

Fri, Oct 30, 2:19 PM · User-Addshore, Wikidata, Dumps-Generation
ArielGlenn added a comment to T265056: Cirrus Search dumps failed for some wikis.

Last week's report that I added in a comment and forgot to save:

<13>Oct 14 01:47:12 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20201012/enwiki-20201012-cirrussearch-general.json.gz
<13>Oct 15 00:41:35 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20201012/srwiki-20201012-cirrussearch-content.json.gz
<13>Oct 15 00:49:59 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20201012/srwikisource-20201012-cirrussearch-content.json.gz
<13>Oct 15 00:55:24 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20201012/svwiki-20201012-cirrussearch-content.json.gz
Fri, Oct 30, 7:04 AM · Discovery-Search, CirrusSearch, Dumps-Generation

Thu, Oct 29

ArielGlenn added a comment to T266333: Xml/sql dumps are still querying etcd excessively, fix this..

Deployed and live everywhere, I'll leave this task open for the first few days of the next run to make sure that tables jobs run ok and that we have a smaller spike in etcd calls.

Thu, Oct 29, 12:34 PM · Dumps-Generation
ArielGlenn added a comment to T263319: Be more efficient about generating page ranges for breaking up page content jobs into smaller ones.

Everything done except the puppet run. This will be live everywhere in ~ 30 minutes, well in time for the new run on the 1st of the month. Leaving this task open until that run completes for some large wikis.

Thu, Oct 29, 12:28 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T199890: missed pages from kafka outage on July 11 2018.

While this ticket has high prio I'm not that confident it will be resolved based on the history and timeframe so far. Should we keep it open @ArielGlenn ?

Thu, Oct 29, 7:35 AM · User-ArielGlenn, Operations

Tue, Oct 27

ArielGlenn triaged T266519: build command for page content dumps properly when there is no prefetch as Medium priority.
Tue, Oct 27, 12:17 AM · Dumps-Generation

Mon, Oct 26

ArielGlenn added a comment to T258108: Turn off auto-watchlist preferences for bots.

It might be nice to prune some of the existing bot entry rows too. There's precedent for this sort of action, see T184485

Mon, Oct 26, 4:54 PM · Platform Team Workboards (Clinic Duty Team), Growth-Team, MediaWiki-Watchlist
ArielGlenn added a comment to T203075: Warning: MediaWiki\Storage\SqlBlobStore::fetchBlob: Bad data in text row.

See T265989 where I have collected a bunch of bad revisions with timestamps across all the wikis. This may let us make some headway.

Mon, Oct 26, 4:54 PM · MediaWiki-Revision-backend, Platform Team Workboards (Clinic Duty Team), Wikimedia-production-error
ArielGlenn added a project to T258236: Special:Export WikiExporter::dumpPages query needs optimization: User-ArielGlenn.
Mon, Oct 26, 8:33 AM · User-ArielGlenn, Platform Team Workboards (Clinic Duty Team), MediaWiki-Special-pages, MediaWiki-Export-or-Import
ArielGlenn added a comment to T258236: Special:Export WikiExporter::dumpPages query needs optimization.

OIC, this is not using rev_page_id as an index when it really ought to be.

Mon, Oct 26, 8:26 AM · User-ArielGlenn, Platform Team Workboards (Clinic Duty Team), MediaWiki-Special-pages, MediaWiki-Export-or-Import

Sun, Oct 25

ArielGlenn added a comment to T266333: Xml/sql dumps are still querying etcd excessively, fix this..

The above patch has been tested locally, in deployment-prep, and has "unit" (not really unit) tests which pass. It can be deployed once the current dump run finishes, probably Monday or Tuesday.

Sun, Oct 25, 7:03 AM · Dumps-Generation
ArielGlenn renamed T265978: add docstrings to all methods in python dump scripts, in a consistent format for use by pydoc etc from make docs in dumps python scripts pydoc-compliant to add docstrings to all methods in python dump scripts, in a consistent format for use by pydoc etc.
Sun, Oct 25, 5:47 AM · Dumps-Generation
ArielGlenn moved T266333: Xml/sql dumps are still querying etcd excessively, fix this. from Backlog to Active on the Dumps-Generation board.
Sun, Oct 25, 5:34 AM · Dumps-Generation

Oct 24 2020

ArielGlenn committed R1891:9f690514840e: bump version to 0.0.10 (authored by ArielGlenn).
bump version to 0.0.10
Oct 24 2020, 4:53 PM
ArielGlenn committed R1891:3e7ea4d3124a: new util to display info about revisions for one or more pages from XML input (authored by ArielGlenn).
new util to display info about revisions for one or more pages from XML input
Oct 24 2020, 4:53 PM
ArielGlenn added a comment to T263319: Be more efficient about generating page ranges for breaking up page content jobs into smaller ones.

To be done when the current run completes, probably on Monday:

Oct 24 2020, 11:49 AM · Patch-For-Review, Dumps-Generation
ArielGlenn committed R1885:ec8dd7313245: use the new style name for snapshot02 scap target (authored by ArielGlenn).
use the new style name for snapshot02 scap target
Oct 24 2020, 7:34 AM
ArielGlenn committed R1885:392123ff7e56: add snapshot02 in wmcs to beta targets, remove snapshot01 (authored by ArielGlenn).
add snapshot02 in wmcs to beta targets, remove snapshot01
Oct 24 2020, 7:34 AM

Oct 23 2020

ArielGlenn added a comment to T263319: Be more efficient about generating page ranges for breaking up page content jobs into smaller ones.

I looked at the config settings for maxrevbytes and revsPerJob and they look ok to keep everywhere for a first run and we'll see what happens.

Oct 23 2020, 1:59 PM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T263323: for xml/sql dumps, make the means of finding the next wiki to dump cheaper.

More can be done, see T266333 which should reduce the numbers even further.

Oct 23 2020, 12:08 PM · Dumps-Generation
ArielGlenn added a comment to T266333: Xml/sql dumps are still querying etcd excessively, fix this..


Thanks to akosiasris for pointing this out, deliberately not pinging him on this task though :-)

Oct 23 2020, 12:06 PM · Dumps-Generation
ArielGlenn claimed T266333: Xml/sql dumps are still querying etcd excessively, fix this..
Oct 23 2020, 12:05 PM · Dumps-Generation
ArielGlenn created T266333: Xml/sql dumps are still querying etcd excessively, fix this..
Oct 23 2020, 12:04 PM · Dumps-Generation

Oct 22 2020

ArielGlenn added a comment to T263319: Be more efficient about generating page ranges for breaking up page content jobs into smaller ones.

Deployment plan for the above patch, all to be done when the current run is complete, perhaps the 27th or 28th:

  • test in deployment-prep:
    • make a copy of the conf file on snapshot02 in deployment-prep
    • add or double-check settings for maxrevbytes, revsPerJob, revinfostash
    • do some test runs to be sure everything still looks good
  • merge the mwbzutils patch
  • merge the deb packaging patch for mwbzutils
  • push the new package up to our repo
  • manually install the new package on all snapshot hosts
  • merge the python patch, deploy dumps via scap
  • add patchset to puppet with new or double-checked settings for bigwikis, en, wd:
      • maxrevbytes
      • revsPerJob
      • revinfostash
    • add the tested changes to conf file for WMCS as well
    • merge and deploy puppet patches, run puppet everywhere
Oct 22 2020, 6:40 PM · Patch-For-Review, Dumps-Generation

Oct 21 2020

ArielGlenn moved T265989: Corrupt entries in text table for nlwiktionary causing a lot of MW PHP Warning from Backlog to Other teams on the Dumps-Generation board.
Oct 21 2020, 5:58 AM · Platform Team Workboards (Clinic Duty Team), Dumps-Generation

Oct 20 2020

ArielGlenn added a comment to T265989: Corrupt entries in text table for nlwiktionary causing a lot of MW PHP Warning .

A note about the range of affected revisions on the wikis that have a large number of them:

  • dewiki, from 2002-09-13 to 2005-05-14 and then also 2006-04-09 and 2009-03-09
  • enwiki, from 2001-10-01 to 2005-08-25 and then also 2006-04-09 and 2009-03-09
  • eswiki, from 2004-02-19 through 2004-11-25 and then 2009-03-09
  • nlwiktionary: 2003-12-23, and then 2004-04-03 to 2004-07-25
Oct 20 2020, 7:59 PM · Platform Team Workboards (Clinic Duty Team), Dumps-Generation
ArielGlenn added a comment to T265989: Corrupt entries in text table for nlwiktionary causing a lot of MW PHP Warning .

The following wikis have revisions with an empty sha1:
anwiki azwiki cawiki commonswiki dewiki dewikiversity dewiktionary elwiki enwiki eswiki eswiktionary etwiki frwiki hrwiki nlwiki nlwiktionary nowiki ocwiki plwiki ptwiki slwiki viwiki zhwiki

Oct 20 2020, 4:21 PM · Platform Team Workboards (Clinic Duty Team), Dumps-Generation
ArielGlenn added a comment to T263587: CAPEX for ParserCache for Parsoid.

<snip>

Oct 20 2020, 1:30 PM · DBA, serviceops, Platform Team Workboards (Green), MediaWiki-Parser, Parsoid
ArielGlenn added a comment to T265989: Corrupt entries in text table for nlwiktionary causing a lot of MW PHP Warning .

I'm running a crap bash script on the fallback dumps nfs server (dumpsdata1003), crunching metadata xml files to see the pattern of empty sha1s for revisions across all the wikis. I'll drop a report in here when it's complete. Running in screen session from ariel, as dumpsgen user.

Oct 20 2020, 12:37 PM · Platform Team Workboards (Clinic Duty Team), Dumps-Generation
ArielGlenn added a comment to T265989: Corrupt entries in text table for nlwiktionary causing a lot of MW PHP Warning .

Note that the initial report via IRC of over a thousand errors seems to be only these sorts of errors:

dumpsgen@snapshot1005:/mnt/dumpsdata/xmldatadumps/public/nlwiktionary/20201020$ zcat nlwiktionary-20201020-stub-meta-history.xml.gz  | grep '<sha1 />' | wc -l
1817
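
The same count can be taken without a shell pipeline. A minimal Python sketch, assuming the stub file is gzip-compressed XML with one element per line as in the command above; the function name `count_empty_sha1` is illustrative.

```python
import gzip


def count_empty_sha1(path):
    """Count lines containing an empty <sha1 /> element in a gzipped XML file."""
    count = 0
    # errors="replace" keeps the scan going even if a line has invalid UTF-8.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as infile:
        for line in infile:
            if "<sha1 />" in line:
                count += 1
    return count
```

Like `grep ... | wc -l`, this counts matching lines rather than individual occurrences, and it streams the file so memory use stays small.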
Oct 20 2020, 11:19 AM · Platform Team Workboards (Clinic Duty Team), Dumps-Generation
ArielGlenn added a comment to T265989: Corrupt entries in text table for nlwiktionary causing a lot of MW PHP Warning .

Note that these errors can be regenerated at any time by going to a relatively idle snapshot host and, as the dumpsgen user, running

php /srv/mediawiki/multiversion/MWScript.php dumpBackup.php --wiki=nlwiktionary --full --stub --report=1000 --output=file:/mnt/dumpsdata/temp/dumpsgen/nlwiktionary-20201020-stub-meta-history.xml --start=37917 --end 37918
Oct 20 2020, 10:56 AM · Platform Team Workboards (Clinic Duty Team), Dumps-Generation
ArielGlenn added a comment to T265989: Corrupt entries in text table for nlwiktionary causing a lot of MW PHP Warning .

Might be a case for findBadBlobs.php and I really think we'll see this on a bunch of wikis.

Oct 20 2020, 10:44 AM · Platform Team Workboards (Clinic Duty Team), Dumps-Generation
ArielGlenn renamed T265989: Corrupt entries in text table for nlwiktionary causing a lot of MW PHP Warning from Corrupt entries in text table for nlwiktionary causing to Corrupt entries in text table for nlwiktionary causing a lot of MW PHP Warning .
Oct 20 2020, 10:34 AM · Platform Team Workboards (Clinic Duty Team), Dumps-Generation
ArielGlenn added a comment to T265989: Corrupt entries in text table for nlwiktionary causing a lot of MW PHP Warning .

So sometime in 2004-06 things were broken.

Oct 20 2020, 10:33 AM · Platform Team Workboards (Clinic Duty Team), Dumps-Generation
ArielGlenn added a comment to T265989: Corrupt entries in text table for nlwiktionary causing a lot of MW PHP Warning .

Page id: 37917 on nlwiktionary

Oct 20 2020, 10:32 AM · Platform Team Workboards (Clinic Duty Team), Dumps-Generation
ArielGlenn created T265989: Corrupt entries in text table for nlwiktionary causing a lot of MW PHP Warning .
Oct 20 2020, 10:27 AM · Platform Team Workboards (Clinic Duty Team), Dumps-Generation
ArielGlenn created T265978: add docstrings to all methods in python dump scripts, in a consistent format for use by pydoc etc.
Oct 20 2020, 8:38 AM · Dumps-Generation

Oct 19 2020

ArielGlenn moved T51133: Create partial SQL dump of watchlist table from Up Next to Active on the Dumps-Generation board.
Oct 19 2020, 6:39 AM · Platform Team Workboards (Clinic Duty Team), Patch-For-Review, Privacy Engineering, Dumps-Generation

Oct 18 2020

ArielGlenn edited P13017 (An Untitled Masterwork).
Oct 18 2020, 8:21 AM
ArielGlenn created P13017 (An Untitled Masterwork).
Oct 18 2020, 8:20 AM

Oct 15 2020

ArielGlenn added a comment to T265056: Cirrus Search dumps failed for some wikis.

Today's report:

<13>Oct 14 01:47:12 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20201012/enwiki-20201012-cirrussearch-general.json.gz
<13>Oct 15 00:41:35 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20201012/srwiki-20201012-cirrussearch-content.json.gz
<13>Oct 15 00:49:59 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20201012/srwikisource-20201012-cirrussearch-content.json.gz
<13>Oct 15 00:55:24 dumpsgen: extensions/CirrusSearch/maintenance/DumpIndex.php failed for /mnt/dumpsdata/otherdumps/cirrussearch/20201012/svwiki-20201012-cirrussearch-content.json.gz
Oct 15 2020, 5:36 PM · Discovery-Search, CirrusSearch, Dumps-Generation
ArielGlenn created P13002 How Ariel finds and tracks dumps-related gerrit patchsets etc (2020).
Oct 15 2020, 6:12 AM
ArielGlenn lowered the priority of T205825: Restructure 'misc dump' cron scripts and infra so they can be easily tested in mw-vagrant from Medium to Low.

No, not ready. The problem remains that these scripts are in production puppet, and cloning all of puppet into mw-vagrant seems bad, but so does copying the scripts wholesale and committing the copies to the mw-vagrant repo.

Oct 15 2020, 5:47 AM · Dumps-Generation

Oct 14 2020

ArielGlenn added a comment to T203075: Warning: MediaWiki\Storage\SqlBlobStore::fetchBlob: Bad data in text row.

I'd like eventually to run it across all wikis and get a sense of the bad timeframes. Just my 2 cents.

Oct 14 2020, 6:50 PM · MediaWiki-Revision-backend, Platform Team Workboards (Clinic Duty Team), Wikimedia-production-error
ArielGlenn added a comment to T263587: CAPEX for ParserCache for Parsoid.

In case we wanted to cannibalise some servers from the restbase cluster as we move their content to parsercache backends, assuming such a thing were feasible on the software side:

Oct 14 2020, 8:18 AM · DBA, serviceops, Platform Team Workboards (Green), MediaWiki-Parser, Parsoid

Oct 13 2020

ArielGlenn moved T264850: Categorylinks dump might have some problem with the encoding from Backlog to Done on the Dumps-Generation board.
Oct 13 2020, 5:55 AM · Dumps-Generation
ArielGlenn moved T265105: Update ops-dumps email alias for dumps cronspam, errors and other notifications from Backlog to Done on the Dumps-Generation board.
Oct 13 2020, 5:55 AM · Dumps-Generation
ArielGlenn moved T264298: wb_terms is getting removed from Backlog to Blocked/Stalled/Waiting for event on the Dumps-Generation board.
Oct 13 2020, 5:54 AM · User-Addshore, Wikidata, Dumps-Generation
ArielGlenn moved T265056: Cirrus Search dumps failed for some wikis from Backlog to Other teams on the Dumps-Generation board.
Oct 13 2020, 5:54 AM · Discovery-Search, CirrusSearch, Dumps-Generation

Oct 12 2020

ArielGlenn added a comment to T263319: Be more efficient about generating page ranges for breaking up page content jobs into smaller ones.

So... there's the new approach, which once cleaned up, takes about 1 hour and 20 minutes to generate all wikidata page ranges, with 0 db calls. That's the way we like it. The actual patchset has not been tested whatsoever, but rather the revsperpage util and a standalone script making use of it.
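
As a rough illustration of splitting pages into ranges without database calls, the Python sketch below groups consecutive pages until a revision-count or byte limit would be exceeded. It is not the code from the patchset: the input format (one `(page_id, rev_count, rev_bytes)` tuple per page) is an assumption about what per-page revision info might look like, and the parameter names merely echo the maxrevbytes and revsPerJob settings discussed in this task.

```python
def page_ranges(pages, max_rev_bytes, revs_per_job):
    """Group consecutive pages into (first_page_id, last_page_id) ranges.

    pages is an iterable of (page_id, rev_count, rev_bytes) in page id order.
    A range is closed before a page that would push it past either limit;
    a single oversized page still gets a range of its own.
    """
    ranges = []
    first = last = None
    revs = size = 0
    for page_id, rev_count, rev_bytes in pages:
        over = revs + rev_count > revs_per_job or size + rev_bytes > max_rev_bytes
        if first is not None and over:
            # Close the current range and start a new one at this page.
            ranges.append((first, last))
            first, revs, size = None, 0, 0
        if first is None:
            first = page_id
        last = page_id
        revs += rev_count
        size += rev_bytes
    if first is not None:
        ranges.append((first, last))
    return ranges
```

Because Python integers are arbitrary precision, the cumulative byte total here cannot overflow; a C implementation needs a 64-bit type for the same sum.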

Oct 12 2020, 6:05 PM · Patch-For-Review, Dumps-Generation
ArielGlenn renamed T263319: Be more efficient about generating page ranges for breaking up page content jobs into smaller ones from When making guesses about page ranges for page content dumps, use page range info from a previous run, if available to Be more efficient about generating page ranges for breaking up page content jobs into smaller ones.
Oct 12 2020, 6:03 PM · Patch-For-Review, Dumps-Generation
ArielGlenn closed T264850: Categorylinks dump might have some problem with the encoding as Invalid.

They cannot be fixed in the dump; they are truncated on the wiki itself. That's what the sql query shows. Someone will have to go onto rowiki and find out what is going on with the entry of those sortkeys and why they are bad.

Oct 12 2020, 11:13 AM · Dumps-Generation
ArielGlenn added a comment to T264850: Categorylinks dump might have some problem with the encoding.

You have a couple of options. You can replace / skip / ignore the bad characters, see https://docs.python.org/3/howto/unicode.html and look for the paragraph starting "The errors argument specifies the response when the input string can’t be converted according to the encoding’s rules." Alternatively you can split the line into entries yourself and just pull out the ones you want. Whichever is easiest for you!
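
A short Python sketch of the first option, showing how the `errors` argument of `bytes.decode` changes the outcome for a value whose last UTF-8 character was cut in half. The sample string is made up for illustration and is not taken from the rowiki table.

```python
# "ș" is two bytes in UTF-8; dropping the second byte simulates a truncated sortkey.
truncated = "Bucureș".encode("utf-8")[:-1]

try:
    truncated.decode("utf-8")  # errors="strict" is the default
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "UnicodeDecodeError"

# "replace" substitutes U+FFFD for the bad byte; "ignore" silently drops it.
replaced = truncated.decode("utf-8", errors="replace")
ignored = truncated.decode("utf-8", errors="ignore")
```

With "replace" the damage stays visible in the output, which makes bad entries easier to spot later; "ignore" gives cleaner text but hides the fact that something was lost.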

Oct 12 2020, 8:21 AM · Dumps-Generation
ArielGlenn added a comment to T264850: Categorylinks dump might have some problem with the encoding.

It looks to me like there are truncated values for the cl_sortkey for rowiki, which prevent the utf8 conversion on line 55 of the pastebin from working. This leads to your lines all remaining essentially byte-encoded with the results you see when displaying the content. I would look into the sortkeys as stored on rowiki and see what's going on. By contrast when I look at e.g. elwiki's category links, there is plenty of non-ascii text there but no bad entries in the table.

Oct 12 2020, 6:22 AM · Dumps-Generation
ArielGlenn removed projects from T264850: Categorylinks dump might have some problem with the encoding: Wikidata, Wikidata-Query-Service, Analytics.
Oct 12 2020, 5:52 AM · Dumps-Generation

Oct 10 2020

ArielGlenn added a comment to T264838: Poke around production to find the maxmind flatfile! [4H].

I see some stuff on stat1007 in /srv/geoip/archive/ by date; are those what you need?

Oct 10 2020, 5:25 PM · Anti-Harassment (The Letter Song), IP Info

Oct 9 2020

ArielGlenn renamed T263319: Be more efficient about generating page ranges for breaking up page content jobs into smaller ones from When maing guesses about page ranges for page content dumps, use page rangi info from a previous run, if available to When making guesses about page ranges for page content dumps, use page range info from a previous run, if available.
Oct 9 2020, 11:45 AM · Patch-For-Review, Dumps-Generation
ArielGlenn added a comment to T263319: Be more efficient about generating page ranges for breaking up page content jobs into smaller ones.

mwbzutils is packaged and a commit is ready for that: https://gerrit.wikimedia.org/r/c/operations/debs/mwbzutils/+/633167

Oct 9 2020, 11:44 AM · Patch-For-Review, Dumps-Generation
ArielGlenn added a project to T259067: Set up generation of JSON dumps for Wikimedia Commons: Dumps-Generation.
Oct 9 2020, 5:01 AM · Dumps-Generation, Patch-For-Review, Structured-Data-Backlog (Current Work), Datasets-General-or-Unknown, Analytics-Radar, Product-Analytics
ArielGlenn closed T265105: Update ops-dumps email alias for dumps cronspam, errors and other notifications as Resolved.

This has been done, following the procedure described here: https://wikitech.wikimedia.org/wiki/SRE_Clinic_Duty#Mail_aliases Changes are now live.

Oct 9 2020, 4:11 AM · Dumps-Generation