aaron (Aaron Schulz)
User

User Details

User Since
Oct 20 2014, 5:25 PM (164 w, 2 d)
Availability
Available
IRC Nick
AaronSchulz
LDAP User
Aaron Schulz
MediaWiki User
Aaron Schulz

Recent Activity

Yesterday

aaron added a comment to T173450: Setup grafana alert for job error rate.

I started a quick dashboard at https://grafana.wikimedia.org/dashboard/db/job-queue-alerts?orgId=1&from=now-12h&to=now with some alerts.

Tue, Dec 12, 11:29 PM · Performance-Team
aaron added a comment to T175672: Make apache/maintenance hosts TLS connections to mariadb work.

@aaron the proxy is installed but unconfigured; we still have to fix some issues with the start and process. Do you want me to point it to the real master, or to a soon-to-be-set-up master test host?

Tue, Dec 12, 6:29 PM · Patch-For-Review, Performance-Team (Radar), Availability (Multiple-active-datacenters), DBA, Operations

Mon, Dec 11

aaron added a comment to T173450: Setup grafana alert for job error rate.

I suppose we can use jobrunner.runner-status.error.rate, sumSeries(jobrunner.pop.*.failed.*.rate), and sumSeries(jobrunner.pop.*.ok.*.rate) to make alerts in a Grafana dashboard.

Mon, Dec 11, 9:53 PM · Performance-Team
aaron closed T182390: 2017-12-07 Huge SaveTiming spike as Resolved.

Yeah, same thing.

Mon, Dec 11, 7:38 PM · Performance-Team

Fri, Dec 8

aaron added a comment to T182322: ChronologyProtector breaks if two requests write different sets of databases.

I'm not sure why the time check logic is so complicated, I guess it got prematurely generalized from the single-DB case.

Fri, Dec 8, 8:33 PM · MW-1.31-release-notes (WMF-deploy-2017-12-12 (1.31.0-wmf.12)), Patch-For-Review, MediaWiki-Database, Wikidata, Performance-Team, User-Addshore, User-notice

Thu, Dec 7

aaron updated the task description for T151466: Performance Q2 2017/18 goal: Install and use mcrouter in deployment-prep.
Thu, Dec 7, 6:53 AM · Release-Engineering-Team (Watching / External), Availability (Multiple-active-datacenters), Beta-Cluster-Infrastructure, Performance-Team

Wed, Dec 6

Envlh awarded T181385: Wikidata entity dumpers stuck with 100% CPU on snapshot1007 a Heartbreak token.
Wed, Dec 6, 8:50 AM · MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Performance-Team, Wikidata, Datasets-General-or-Unknown

Tue, Dec 5

aaron moved T181385: Wikidata entity dumpers stuck with 100% CPU on snapshot1007 from Doing to Blocked on the Performance-Team board.
Tue, Dec 5, 11:36 PM · MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Performance-Team, Wikidata, Datasets-General-or-Unknown
aaron closed T178531: Add statsd metric to WANObjectCache as Resolved.

Probably some MW fixes actually reaching production.

Tue, Dec 5, 11:07 PM · MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Patch-For-Review, MediaWiki-Cache, monitoring, Performance-Team
aaron added a comment to T180035: MediaWiki core Selenium tests fail when targeting Vagrant.

Does this still occur?

Tue, Dec 5, 4:19 AM · Performance-Team (Radar), MediaWiki-Cache, MediaWiki-Vagrant, Release-Engineering-Team (Kanban), User-zeljkofilipin
aaron closed T180793: Frequent "Wikimedia\\Rdbms\\DatabaseMysqlBase::lock failed to acquire lock" errors on WMF mediawiki logs as Resolved.
Tue, Dec 5, 4:17 AM · MW-1.31-release-notes (WMF-deploy-2017-12-05 (1.31.0-wmf.11)), Patch-For-Review, DBA, Wikimedia-log-errors, Performance-Team
aaron added a comment to T180793: Frequent "Wikimedia\\Rdbms\\DatabaseMysqlBase::lock failed to acquire lock" errors on WMF mediawiki logs.

By reducing the lock max wait times and pushing the brunt of lag waits out of the critical section, less real time should be wasted.

Tue, Dec 5, 4:17 AM · MW-1.31-release-notes (WMF-deploy-2017-12-05 (1.31.0-wmf.11)), Patch-For-Review, DBA, Wikimedia-log-errors, Performance-Team

Mon, Dec 4

aaron moved T181385: Wikidata entity dumpers stuck with 100% CPU on snapshot1007 from Inbox to Doing on the Performance-Team board.
Mon, Dec 4, 9:00 PM · MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Performance-Team, Wikidata, Datasets-General-or-Unknown

Sat, Dec 2

aaron added a comment to T181385: Wikidata entity dumpers stuck with 100% CPU on snapshot1007.

Change 394779 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Try to opportunistically flush statsd data in maintenance scripts

https://gerrit.wikimedia.org/r/394779

Sat, Dec 2, 9:20 PM · MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Performance-Team, Wikidata, Datasets-General-or-Unknown
aaron added a comment to T178531: Add statsd metric to WANObjectCache.

There is some caller that is not making keys correctly, which causes this. I can't find any more looking through all of core and extensions and mediawiki-config.

Thanks for looking into it! In the meantime I've blackholed said metrics to avoid graphite disks filling up

Sat, Dec 2, 8:27 PM · MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Patch-For-Review, MediaWiki-Cache, monitoring, Performance-Team
aaron added a comment to T178531: Add statsd metric to WANObjectCache.

Change 394493 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.10] Add temporary logging for bad WAN cache statsd keys

https://gerrit.wikimedia.org/r/394493

Sat, Dec 2, 8:26 PM · MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Patch-For-Review, MediaWiki-Cache, monitoring, Performance-Team
aaron added a comment to T181385: Wikidata entity dumpers stuck with 100% CPU on snapshot1007.

How long do these run? The sample rate in config is set to be extremely low. So perhaps:

  • The buffering class buffers things that won't even be saved
  • The buffering could be disabled in CLI mode
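
The two points above can be sketched as follows. This is a minimal illustration in Python, not MediaWiki's statsd buffering code; the class name `BufferedStats` and its parameters are hypothetical. It assumes the fix is to apply the sample rate before buffering and to flush whenever the buffer reaches a size cap, so a long-running script never accumulates data that would be discarded anyway.

```python
import random


class BufferedStats:
    """Toy metrics buffer: samples first, then buffers, then flushes in batches."""

    def __init__(self, send, sample_rate=1.0, max_buffer=1000, rng=None):
        self.send = send                # callable receiving a list of keys
        self.sample_rate = sample_rate  # fraction of events to keep (0.0 to 1.0)
        self.max_buffer = max_buffer    # flush once this many events are held
        self.rng = rng or random.Random()
        self.buffer = []

    def increment(self, key):
        # Drop unsampled events immediately instead of buffering them.
        if self.rng.random() >= self.sample_rate:
            return
        self.buffer.append(key)
        # Opportunistic flush keeps memory bounded in long-running scripts.
        if len(self.buffer) >= self.max_buffer:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(list(self.buffer))
            self.buffer.clear()
```

With a very low sample rate, almost nothing reaches the buffer, which is the behaviour the first bullet asks for.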
Sat, Dec 2, 8:18 PM · MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Performance-Team, Wikidata, Datasets-General-or-Unknown

Wed, Nov 29

aaron added a comment to T180035: MediaWiki core Selenium tests fail when targeting Vagrant.

I noticed a worse bug of cpPosTime cookies not being used (not related to WAN cache). The patch for that is above.

Change 393983 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Make ChronologyProtector actually use cpPosTime cookies

https://gerrit.wikimedia.org/r/393983

Wed, Nov 29, 5:51 AM · Performance-Team (Radar), MediaWiki-Cache, MediaWiki-Vagrant, Release-Engineering-Team (Kanban), User-zeljkofilipin
aaron added a comment to T180035: MediaWiki core Selenium tests fail when targeting Vagrant.

I noticed a worse bug of cpPosTime cookies not being used (not related to WAN cache). The patch for that is above.

Wed, Nov 29, 4:52 AM · Performance-Team (Radar), MediaWiki-Cache, MediaWiki-Vagrant, Release-Engineering-Team (Kanban), User-zeljkofilipin
aaron added a comment to T180035: MediaWiki core Selenium tests fail when targeting Vagrant.

The simple thing is to not set INTERIM keys in the same request that purged them. The duration of that rule would be HOLDOFF_TTL so that the array holding the purged keys doesn't get too big for long-running maintenance scripts. This can be done with a HashBagOStuff nested in the WAN cache object easily enough.
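
A rough sketch of that idea, in Python rather than PHP and with hypothetical names throughout (`RecentPurges`, `mark_purged`, `may_set_interim`). It assumes the in-process store only needs to remember each purged key for a fixed hold-off window and can forget it afterwards, which is what keeps it small in long-running scripts.

```python
import time


class RecentPurges:
    """Remembers keys purged by this process for a bounded time window."""

    def __init__(self, holdoff_seconds, clock=time.monotonic):
        self.holdoff_seconds = holdoff_seconds
        self.clock = clock          # injectable so the window can be tested
        self._purged_at = {}        # key -> time of the purge

    def mark_purged(self, key):
        self._prune()
        self._purged_at[key] = self.clock()

    def may_set_interim(self, key):
        # An interim value is allowed only if this process did not purge
        # the key within the hold-off window.
        self._prune()
        return key not in self._purged_at

    def _prune(self):
        # Forget purges older than the window so the map stays small.
        cutoff = self.clock() - self.holdoff_seconds
        for key in [k for k, t in self._purged_at.items() if t <= cutoff]:
            del self._purged_at[key]
```

A caller would check `may_set_interim(key)` before writing an interim value and call `mark_purged(key)` whenever it purges a key.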

Wed, Nov 29, 12:52 AM · Performance-Team (Radar), MediaWiki-Cache, MediaWiki-Vagrant, Release-Engineering-Team (Kanban), User-zeljkofilipin
aaron added a comment to T180035: MediaWiki core Selenium tests fail when targeting Vagrant.

This looks like an integration issue with ChronologyProtector vs WANObjectCache.

Wed, Nov 29, 12:43 AM · Performance-Team (Radar), MediaWiki-Cache, MediaWiki-Vagrant, Release-Engineering-Team (Kanban), User-zeljkofilipin

Tue, Nov 28

aaron added a comment to T178531: Add statsd metric to WANObjectCache.

I guess we will need MW side logging now. Probably can just add it to wmf branch.

Tue, Nov 28, 5:14 PM · MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Patch-For-Review, MediaWiki-Cache, monitoring, Performance-Team
aaron created T181528: Document the various parameters of each Job class.
Tue, Nov 28, 5:02 PM · Documentation, MediaWiki-JobQueue

Mon, Nov 27

aaron added a comment to T178531: Add statsd metric to WANObjectCache.

There is some caller that is not making keys correctly, which causes this. I can't find any more looking through all of core and extensions and mediawiki-config.

Mon, Nov 27, 7:59 PM · MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Patch-For-Review, MediaWiki-Cache, monitoring, Performance-Team

Thu, Nov 23

aaron renamed T181216: Get rid of pointless EnqueueJob usage from Get rid of enqueue job to Get rid of pointless EnqueueJob usage.
Thu, Nov 23, 10:03 AM · Patch-For-Review, Services (done), MW-1.31-release-notes (WMF-deploy-2017-12-05 (1.31.0-wmf.11)), MediaWiki-JobQueue
aaron added a comment to T181216: Get rid of pointless EnqueueJob usage.

They were mentioned in https://www.mediawiki.org/wiki/Requests_for_comment/Master-slave_datacenter_strategy_for_MediaWiki#Job_queuing though it was never set up (partly from people being busy with other things). In general jobs are enqueued on POST requests or from other jobs, all in the master datacenter. In some cases, jobs are enqueued on GET or possibly POST (if the api-promise-nonwrite thing is set up in vlc) in rare cases. This should work in a way where the cross-DC propagation is async, rather than having JobQueue::push() blocking on cross-DC traffic.

Thu, Nov 23, 10:01 AM · Patch-For-Review, Services (done), MW-1.31-release-notes (WMF-deploy-2017-12-05 (1.31.0-wmf.11)), MediaWiki-JobQueue

Tue, Nov 21

jcrespo awarded T180793: Frequent "Wikimedia\\Rdbms\\DatabaseMysqlBase::lock failed to acquire lock" errors on WMF mediawiki logs a Doubloon token.
Tue, Nov 21, 11:57 AM · MW-1.31-release-notes (WMF-deploy-2017-12-05 (1.31.0-wmf.11)), Patch-For-Review, DBA, Wikimedia-log-errors, Performance-Team

Mon, Nov 20

aaron moved T180793: Frequent "Wikimedia\\Rdbms\\DatabaseMysqlBase::lock failed to acquire lock" errors on WMF mediawiki logs from Next-up to Doing on the Performance-Team board.
Mon, Nov 20, 9:23 PM · MW-1.31-release-notes (WMF-deploy-2017-12-05 (1.31.0-wmf.11)), Patch-For-Review, DBA, Wikimedia-log-errors, Performance-Team
aaron moved T151466: Performance Q2 2017/18 goal: Install and use mcrouter in deployment-prep from Next-up to Doing on the Performance-Team board.
Mon, Nov 20, 9:23 PM · Release-Engineering-Team (Watching / External), Availability (Multiple-active-datacenters), Beta-Cluster-Infrastructure, Performance-Team

Fri, Nov 17

aaron added a comment to T171881: CL support for Wikipedia Zero piracy problems.

On the MW side of (2) above, it appears the SwiftFileBackend code in MW uses PHP's urlencode to transform the filenames into upload URL paths. urlencode documentation claims that it percent-encodes everything but alphanumerics and -_. (so the set it does not encode is almost the official Unreserved Set, but it's missing the tilde). It also encodes spaces as + rather than %20 because it's meant for query strings rather than paths. PHP's rawurlencode would probably have been more appropriate here as it conforms to the RFC and excludes from encoding exactly the Unreserved Set and doesn't do the +-for-spaces thing. However, in practice, we can deal with the ~ issue and spaces have already been made into underscores, so the plusses shouldn't ever actually appear.

Regardless, this explanation seems consistent with observations of the upload.wm.o paths I've seen. We can normalize on similar rules there (but leave spaces as %20 just to be technically-correct, which again won't matter in practice). If at some later date we want to use a prettier normalization we can do that, too, but for now it would be simplest to leave the MediaWiki side alone and just conform everything else to its expectations.
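
The difference between the two encodings can be illustrated with a small Python approximation. These helpers are hypothetical stand-ins built on `urllib.parse`, not PHP's functions themselves; they assume the behaviour described above (urlencode leaves only alphanumerics and -_. unencoded and turns spaces into +, while rawurlencode leaves exactly the Unreserved Set unencoded and turns spaces into %20).

```python
from urllib.parse import quote, quote_plus


def php_style_urlencode(text):
    """Approximates PHP urlencode: spaces become '+', and '~' is encoded."""
    return quote_plus(text, safe="").replace("~", "%7E")


def php_style_rawurlencode(text):
    """Approximates PHP rawurlencode: only the Unreserved Set is left as-is."""
    return quote(text, safe="~")


name = "My file~v2.jpg"
print(php_style_urlencode(name))     # My+file%7Ev2.jpg
print(php_style_rawurlencode(name))  # My%20file~v2.jpg
```

Once spaces have been turned into underscores, the two outputs differ only in how the tilde is handled, which matches the observation above.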

Fri, Nov 17, 11:26 PM · Patch-For-Review, Community-Liaisons (Oct-Dec 2017), Zero

Thu, Nov 16

aaron added a comment to T178849: Click on fullImageLink <a> for PDF on File: page no longer rendering in browser.

So, the post_as_copy = true case works if SwiftFileBackend no longer blacklists Content-Type from non-PUTs. It would always re-assert the old value if nothing was passed in by the describe() caller.

Thu, Nov 16, 8:41 PM · MediaWiki-File-management, Commons, MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Patch-For-Review, Regression, media-storage, Multimedia-Team-Working-Board, Multimedia
aaron added a comment to T178849: Click on fullImageLink <a> for PDF on File: page no longer rendering in browser.

We should be mindful of the Swift post_as_copy option when set to false. At the moment that does *not* allow changing Content-Type via POST.

Thu, Nov 16, 8:24 PM · MediaWiki-File-management, Commons, MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Patch-For-Review, Regression, media-storage, Multimedia-Team-Working-Board, Multimedia

Nov 8 2017

aaron closed T178531: Add statsd metric to WANObjectCache as Resolved.
Nov 8 2017, 8:35 PM · MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Patch-For-Review, MediaWiki-Cache, monitoring, Performance-Team
aaron closed T179999: CentralAuthUser::loadFromCache doesn't call the makeKey() methods as needed as Resolved.
Nov 8 2017, 8:35 PM · MW-1.31-release-notes (WMF-deploy-2017-10-31 (1.31.0-wmf.6)), MediaWiki-extensions-CentralAuth, Patch-For-Review, Performance-Team
aaron closed T179999: CentralAuthUser::loadFromCache doesn't call the makeKey() methods as needed, a subtask of T178634: 1.31.0-wmf.7 deployment blockers, as Resolved.
Nov 8 2017, 8:35 PM · Release-Engineering-Team (Kanban), Train Deployments, Release
aaron created T179999: CentralAuthUser::loadFromCache doesn't call the makeKey() methods as needed.
Nov 8 2017, 2:50 AM · MW-1.31-release-notes (WMF-deploy-2017-10-31 (1.31.0-wmf.6)), MediaWiki-extensions-CentralAuth, Patch-For-Review, Performance-Team

Oct 31 2017

aaron updated the task description for T151466: Performance Q2 2017/18 goal: Install and use mcrouter in deployment-prep.
Oct 31 2017, 10:27 PM · Release-Engineering-Team (Watching / External), Availability (Multiple-active-datacenters), Beta-Cluster-Infrastructure, Performance-Team
aaron added a comment to T151466: Performance Q2 2017/18 goal: Install and use mcrouter in deployment-prep.

So, running mcrouter via screen -r with the config in /etc/mcrouter/mcrouter.json on tin seems to work fine. The pool replication works and the timings are comparable to twemproxy -- often better than twemproxy.

Oct 31 2017, 10:21 PM · Release-Engineering-Team (Watching / External), Availability (Multiple-active-datacenters), Beta-Cluster-Infrastructure, Performance-Team

Oct 30 2017

aaron moved T171071: Perform testing for TLS effect on connection rate from Doing to Blocked on the Performance-Team board.
Oct 30 2017, 9:03 PM · Availability (Multiple-active-datacenters), DBA, Operations, Performance-Team

Oct 26 2017

aaron closed T175418: Create new instances memc05 and memc06 running memcached as Resolved.
Oct 26 2017, 10:26 PM · Release-Engineering-Team (Watching / External), Availability (Multiple-active-datacenters), Beta-Cluster-Infrastructure
aaron closed T175418: Create new instances memc05 and memc06 running memcached, a subtask of T151466: Performance Q2 2017/18 goal: Install and use mcrouter in deployment-prep, as Resolved.
Oct 26 2017, 10:26 PM · Release-Engineering-Team (Watching / External), Availability (Multiple-active-datacenters), Beta-Cluster-Infrastructure, Performance-Team

Oct 24 2017

aaron added a comment to T135261: {{REVISION*}} magic words should not display usernames and timestamps for null edits.

That "cannot merge" message is mostly useless and overly-technical in a Gerrit specific way (e.g. you can't "submit" without "+2", which is obvious anyway). Just look for "merge conflict" on the changeset page or where the patch shows up in listings, since that actually matters and is common.

Oct 24 2017, 1:54 AM · MW-1.31-release-notes (WMF-deploy-2017-11-07 (1.31.0-wmf.7)), MW-1.28-release (WMF-deploy-2016-06-14_(1.28.0-wmf.6)), Patch-For-Review, MediaWiki-Parser
aaron added a comment to T178857: Is the ConfirmAccount extension maintained?.

There have always been a lot of feature requests or bug reports due to misconfiguration/version-mismatch and so on. I don't really have the time anymore (for some time in fact) to sift through and find the serious bugs. When I become aware of one I try to fix it, but if it's not major then I probably won't look at it.

Oct 24 2017, 1:52 AM · MediaWiki-extensions-ConfirmAccount

Oct 23 2017

aaron added a comment to T177073: Split the backend savetiming metric into submetrics.

Actually, I just moved them to https://grafana-admin.wikimedia.org/dashboard/db/backend-save-timing-breakdown?refresh=5m&orgId=1 .

Oct 23 2017, 8:50 PM · MW-1.31-release-notes (WMF-deploy-2017-10-10 (1.31.0-wmf.3)), Patch-For-Review, Performance-Team
aaron moved T171071: Perform testing for TLS effect on connection rate from Blocked to Doing on the Performance-Team board.
Oct 23 2017, 8:29 PM · Availability (Multiple-active-datacenters), DBA, Operations, Performance-Team
aaron moved T169249: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam from Doing to Next-up on the Performance-Team board.
Oct 23 2017, 8:29 PM · Patch-For-Review, Performance-Team, Operations
aaron closed T177073: Split the backend savetiming metric into submetrics as Resolved.

They are on the main dashboard. If more are added, it would be good to split them out since the main save timing board is getting long.

Oct 23 2017, 8:09 PM · MW-1.31-release-notes (WMF-deploy-2017-10-10 (1.31.0-wmf.3)), Patch-For-Review, Performance-Team
aaron moved T178531: Add statsd metric to WANObjectCache from Inbox to Doing on the Performance-Team board.
Oct 23 2017, 8:08 PM · MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Patch-For-Review, MediaWiki-Cache, monitoring, Performance-Team
aaron triaged T178531: Add statsd metric to WANObjectCache as Normal priority.
Oct 23 2017, 8:08 PM · MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Patch-For-Review, MediaWiki-Cache, monitoring, Performance-Team

Oct 20 2017

aaron added a comment to T173696: Cache format constraint check results.

Probably hotTTR is way too high. It's really "expected time till refresh given 1 hit/sec". With 50/min, you'd get maybe 2 updates (new values) per regex. I'll put up a patch for that.
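
One simple way to read that definition, as a hedged sketch: assume each cache hit independently triggers a refresh with probability 1/hotTTR. This is a modelling assumption made for illustration, not necessarily how WANObjectCache decides, and the function name is hypothetical. Under that model the expected wait scales inversely with the hit rate.

```python
def expected_seconds_until_refresh(hot_ttr, hits_per_second):
    """Expected wait for a refresh when each hit refreshes with chance 1/hot_ttr.

    The number of hits until the first refresh is geometric with mean
    hot_ttr, so the expected time is hot_ttr hits divided by the hit rate.
    """
    if hot_ttr < 1 or hits_per_second <= 0:
        raise ValueError("hot_ttr must be >= 1 and hits_per_second > 0")
    return hot_ttr / hits_per_second


# At exactly 1 hit/sec the expected wait equals hot_ttr itself.
print(expected_seconds_until_refresh(900, 1.0))
# At 50 hits/min the same hot_ttr takes proportionally longer.
print(expected_seconds_until_refresh(900, 50 / 60))
```

The value 900 is only a placeholder; the point is that a key hit less often than once per second refreshes less often than hotTTR alone suggests.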

Oct 20 2017, 4:16 PM · MW-1.31-release-notes (WMF-deploy-2017-10-24 (1.31.0-wmf.5)), MW-1.30-release-notes (WMF-deploy-2017-07-25_(1.30.0-wmf.11)), Patch-For-Review, Wikidata-Former-Sprint-Board, Wikibase-Quality-Constraints, Wikibase-Quality, Wikidata

Oct 19 2017

aaron added a comment to T173696: Cache format constraint check results.

I did a bunch of requests against https://www.wikidata.org/w/api.php?action=wbcheckconstraints&format=json&id=Q42&constraintid=P1476%24F24FF782-E994-4946-BEEC-104CC592534F, which checks a format constraint for “title”. It’s always the same regex and only a handful of different values (17). But while I could see a sharp rise in requests in Grafana corresponding to the times when I sent those requests (permalink), most of them are still cache misses. I’m not sure how to interpret that – it seems values aren’t entering the cache map very often?

Oct 19 2017, 5:45 PM · MW-1.31-release-notes (WMF-deploy-2017-10-24 (1.31.0-wmf.5)), MW-1.30-release-notes (WMF-deploy-2017-07-25_(1.30.0-wmf.11)), Patch-For-Review, Wikidata-Former-Sprint-Board, Wikibase-Quality-Constraints, Wikibase-Quality, Wikidata

Oct 18 2017

Krinkle awarded T178531: Add statsd metric to WANObjectCache a Orange Medal token.
Oct 18 2017, 8:37 PM · MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Patch-For-Review, MediaWiki-Cache, monitoring, Performance-Team
aaron created T178531: Add statsd metric to WANObjectCache.
Oct 18 2017, 8:31 PM · MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Patch-For-Review, MediaWiki-Cache, monitoring, Performance-Team
aaron placed T160298: "Special:ActiveUsers" throws database query error with sql_mode=only_full_group_by up for grabs.
Oct 18 2017, 6:21 PM · MW-1.27-release-notes, MW-1.29-release-notes, MW-1.28-release-notes, MW-1.29-release, MW-1.27-release, Technical-Debt, MediaWiki-Special-pages

Oct 17 2017

aaron added a comment to T173696: Cache format constraint check results.

Reopening. This task is supposed to be for caching results in general, which isn’t done yet at all, though we had a lot of discussion on caching regex checks specifically here, which in hindsight should’ve been in a separate task. Also, IMO the regex caching isn’t done yet, since the Grafana stats are pretty unsatisfactory.

(Perhaps we should repurpose this task to be just about regex checking, open a new one for general caching, and reshuffle the parent tasks so that this one is a child of the new task?)

Oct 17 2017, 3:53 PM · MW-1.31-release-notes (WMF-deploy-2017-10-24 (1.31.0-wmf.5)), MW-1.30-release-notes (WMF-deploy-2017-07-25_(1.30.0-wmf.11)), Patch-For-Review, Wikidata-Former-Sprint-Board, Wikibase-Quality-Constraints, Wikibase-Quality, Wikidata

Oct 12 2017

aaron placed T75174: Make PHPUnit tests pass with PHP 5.5/PostgreSQL on Travis CI up for grabs.
Oct 12 2017, 9:38 PM · User-Addshore, MW-1.31-release-notes (WMF-deploy-2017-12-05 (1.31.0-wmf.11)), Patch-For-Review, MW-1.30-release-notes, PostgreSQL, Goal, MediaWiki-Core-Tests

Oct 6 2017

aaron moved T177073: Split the backend savetiming metric into submetrics from Next-up to Doing on the Performance-Team board.
Oct 6 2017, 6:02 PM · MW-1.31-release-notes (WMF-deploy-2017-10-10 (1.31.0-wmf.3)), Patch-For-Review, Performance-Team

Oct 5 2017

aaron added a comment to T175672: Make apache/maintenance hosts TLS connections to mariadb work.

We discussed proxies in the last performance meeting and we're OK with that (it would cut down on handshake latency anyway).

Oct 5 2017, 10:06 PM · Patch-For-Review, Performance-Team (Radar), Availability (Multiple-active-datacenters), DBA, Operations
aaron added a comment to T155110: JobRunner transaction fname for Job::run() can mismatch __METHOD__ in a subclass.

JobRunner always starts an LBFactory transaction.

Oct 5 2017, 10:01 PM · MediaWiki-JobQueue
aaron closed T42451: "Transaction already in progress" error in sqlite as Resolved.

This was actually fixed for new installs before that patch by moving the object cache table to a separate DB.

Oct 5 2017, 9:41 PM · Performance-Team, SQLite, MediaWiki-Database
aaron closed T42451: "Transaction already in progress" error in sqlite, a subtask of T72710: StorageException in EditEntityActionTest::testActionForPage (edit-already-exists) and related failures, as Resolved.
Oct 5 2017, 9:41 PM · § Wikidata-Sprint-2015-02-25, Patch-For-Review, Wikidata, MediaWiki-extensions-WikibaseRepository
aaron placed T134811: Consider REST with SSL (HyperSwitch/Cassandra) for session storage up for grabs.
Oct 5 2017, 1:12 AM · Services (blocked), Availability (Multiple-active-datacenters), Operations, Performance-Team

Oct 4 2017

aaron added a comment to T175672: Make apache/maintenance hosts TLS connections to mariadb work.

So what I extract from the errors is you're trying to connect to db2048 by IP and not by hostname, and the certificates we expose for mysql do not include verification information for the ip address in its SAN. In fact, I don't think we ever did add that info to our certs.

So if we had the hostname instead of the IP in db-codfw.php, it should work. I think performance was a reason for using IPs instead of hostnames there, so we might need to reissue the certificates if we want to keep using IPs. I think the implications for DBAs would be a huge maintenance work.

Oct 4 2017, 8:47 PM · Patch-For-Review, Performance-Team (Radar), Availability (Multiple-active-datacenters), DBA, Operations
aaron added a comment to T175672: Make apache/maintenance hosts TLS connections to mariadb work.

Also, there is https://bugs.php.net/bug.php?id=74445 :)

Oct 4 2017, 8:39 PM · Patch-For-Review, Performance-Team (Radar), Availability (Multiple-active-datacenters), DBA, Operations
aaron updated the task description for T175672: Make apache/maintenance hosts TLS connections to mariadb work.
Oct 4 2017, 8:33 PM · Patch-For-Review, Performance-Team (Radar), Availability (Multiple-active-datacenters), DBA, Operations
aaron renamed T175672: Make apache/maintenance hosts TLS connections to mariadb work from Make client certs available for apache/maintenance hosts for TLS connections to mariadb to Make apache/maintenance hosts TLS connections to mariadb work.
Oct 4 2017, 7:07 PM · Patch-For-Review, Performance-Team (Radar), Availability (Multiple-active-datacenters), DBA, Operations
aaron added a comment to T155110: JobRunner transaction fname for Job::run() can mismatch __METHOD__ in a subclass.

You can always do what extensions/CentralAuth/includes/LocalRenameJob/LocalRenameJob.php does AFAIK.

Oct 4 2017, 6:03 PM · MediaWiki-JobQueue

Oct 3 2017

aaron added a comment to T175672: Make apache/maintenance hosts TLS connections to mariadb work.

Looking at http://php.net/manual/en/mysqli.ssl-set.php, I would think you'd only need to set capath=/etc/ssl/certs, while setting all other parameters to NULL (except maybe cipher, as I have no idea what is the actual default cipherlist for mysqli on HHVM).

I tried that first but it yields "SSL connection error: SSL_CTX_set_default_verify_paths failed (10.192.32.108)".

Oct 3 2017, 11:01 PM · Patch-For-Review, Performance-Team (Radar), Availability (Multiple-active-datacenters), DBA, Operations
aaron added a comment to T177017: Re-enable per-filter profiling on wikis where it was disabled.

I'd look for the new method calls that are being reached and whether they show up and how large their profile is if they do. Note that you can use Ctrl-F on the svg images to highlight matches in purple.

Oct 3 2017, 9:15 PM · Anti-Harassment, AbuseFilter

Oct 2 2017

aaron created T177258: Update.php fails with postgres due to ip_changes population.
Oct 2 2017, 9:58 PM · MW-1.31-release-notes (WMF-deploy-2017-12-05 (1.31.0-wmf.11)), MW-1.30-release-notes, MW-1.30-release, Community-Tech-Sprint, PostgreSQL, MediaWiki-Maintenance-scripts, MediaWiki-Installer
aaron added a comment to T177017: Re-enable per-filter profiling on wikis where it was disabled.

I think it's fine to roll out there as long as you are watching https://grafana.wikimedia.org/dashboard/db/save-timing?refresh=5m&orgId=1 and check the -index.svg flamegraph at https://performance.wikimedia.org/xenon/svgs/daily/ for the day of deployment on the next day (current day values are always useless/incomplete).

Oct 2 2017, 9:45 PM · Anti-Harassment, AbuseFilter
aaron closed T160298: "Special:ActiveUsers" throws database query error with sql_mode=only_full_group_by as Resolved.
Oct 2 2017, 6:09 PM · MW-1.27-release-notes, MW-1.29-release-notes, MW-1.28-release-notes, MW-1.29-release, MW-1.27-release, Technical-Debt, MediaWiki-Special-pages
aaron placed T173450: Setup grafana alert for job error rate up for grabs.
Oct 2 2017, 6:08 PM · Performance-Team

Sep 26 2017

aaron added a comment to T175672: Make apache/maintenance hosts TLS connections to mariadb work.

Looking at http://php.net/manual/en/mysqli.ssl-set.php, I would think you'd only need to set capath=/etc/ssl/certs, while setting all other parameters to NULL (except maybe cipher, as I have no idea what is the actual default cipherlist for mysqli on HHVM).

Sep 26 2017, 3:29 PM · Patch-For-Review, Performance-Team (Radar), Availability (Multiple-active-datacenters), DBA, Operations

Sep 20 2017

aaron moved T166199: Add metrics for master queries on HTTP GET/HEAD from Next-up to Doing on the Performance-Team board.
Sep 20 2017, 7:21 PM · MW-1.31-release-notes (WMF-deploy-2017-10-03 (1.31.0-wmf.2)), Performance-Team, Availability (Multiple-active-datacenters)
aaron added a comment to T173696: Cache format constraint check results.

Interesting idea! It feels a bit weird to implement logic like this on top of the cache (I thought that’s the cache’s job?), but you’re the expert :) it sounds like it makes a lot of sense, at least, since the set of regexes is mostly static and the set of values is highly dynamic, with some very commonly used values.

I think I’ll remove the “don’t bother” microtime check, though, since it seems that even for an extremely simple query like SELECT (1 AS ?x) {}, the query service rarely responds in less than 0.04 seconds, and never in less than 0.02 seconds (tested from a Cloud VPS system within the Eqiad cluster).

Sep 20 2017, 10:30 AM · MW-1.31-release-notes (WMF-deploy-2017-10-24 (1.31.0-wmf.5)), MW-1.30-release-notes (WMF-deploy-2017-07-25_(1.30.0-wmf.11)), Patch-For-Review, Wikidata-Former-Sprint-Board, Wikibase-Quality-Constraints, Wikibase-Quality, Wikidata
aaron added a comment to T173696: Cache format constraint check results.

If we want to avoid flooding cache with rarely used long-tail combinations, maybe something like this could be done:

Sep 20 2017, 3:24 AM · MW-1.31-release-notes (WMF-deploy-2017-10-24 (1.31.0-wmf.5)), MW-1.30-release-notes (WMF-deploy-2017-07-25_(1.30.0-wmf.11)), Patch-For-Review, Wikidata-Former-Sprint-Board, Wikibase-Quality-Constraints, Wikibase-Quality, Wikidata

Sep 19 2017

aaron added a comment to T176101: Cannot delete File:MKC,S.jpg on zhwiki due to DBQueryError.

Problem seems to be:

if ( $this->stage <= MIGRATION_WRITE_BOTH ) {
	$fields[$this->key] = $this->lang->truncate( $comment->text, 255 );
}

...LocalFile already used addQuotes(), and this can remove the ending quote character.

That's not the real problem. The problem is that the different behavior of IDatabase->insertSelect()'s $varMap versus ->insert()'s $a wasn't noticed, so the code was incorrectly quoting the value passed to CommentStore->insert() (from the original pre-CommentStore code) rather than quoting the returned literal fields for passing into IDatabase->insertSelect().
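
The symptom from the first remark (truncating an already-quoted string can drop the closing quote) can be reproduced with a toy example. This is a sketch in Python with simplified, hypothetical helpers: `add_quotes` doubles single quotes instead of using a real database driver's escaping, and `truncate` cuts by characters with no ellipsis, unlike MediaWiki's Language::truncate.

```python
def add_quotes(value):
    """Toy SQL quoting: wrap in single quotes, doubling embedded quotes."""
    return "'" + value.replace("'", "''") + "'"


def truncate(text, max_len):
    """Toy truncation by character count."""
    return text[:max_len]


comment = "x" * 300

# Quoting first and truncating afterwards cuts off the closing quote.
quoted_then_truncated = truncate(add_quotes(comment), 255)

# Truncating the raw value first and quoting last keeps the literal intact.
truncated_then_quoted = add_quotes(truncate(comment, 255))

print(quoted_then_truncated.endswith("'"))  # False
print(truncated_then_quoted.endswith("'"))  # True
```

This shows why a value must be quoted exactly once, at the point where it becomes an SQL literal, rather than before it is handed to code that still treats it as raw text.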

Sep 19 2017, 3:07 AM · Patch-For-Review, Vuln-Inject, Wikimedia-log-errors, Chinese-Sites, MediaWiki-Page-deletion

Sep 18 2017

aaron removed a project from T176101: Cannot delete File:MKC,S.jpg on zhwiki due to DBQueryError: Security.

Problem seems to be:

if ( $this->stage <= MIGRATION_WRITE_BOTH ) {
	$fields[$this->key] = $this->lang->truncate( $comment->text, 255 );
}
Sep 18 2017, 9:24 AM · Patch-For-Review, Vuln-Inject, Wikimedia-log-errors, Chinese-Sites, MediaWiki-Page-deletion
aaron triaged T176101: Cannot delete File:MKC,S.jpg on zhwiki due to DBQueryError as Unbreak Now! priority.
Sep 18 2017, 9:23 AM · Patch-For-Review, Vuln-Inject, Wikimedia-log-errors, Chinese-Sites, MediaWiki-Page-deletion
aaron added a project to T176101: Cannot delete File:MKC,S.jpg on zhwiki due to DBQueryError: Security.
Sep 18 2017, 9:10 AM · Patch-For-Review, Vuln-Inject, Wikimedia-log-errors, Chinese-Sites, MediaWiki-Page-deletion

Sep 16 2017

aaron added a comment to T175834: TranslatablePageMoveJob commit while in atomic sections.

Probably the onMoveTranslationUnits handler should be a closure sent to DeferredUpdates. Anything that needs to COMMIT/BEGIN within its scope needs full transaction control, which is not usually guaranteed when some hook triggers.

Sep 16 2017, 9:43 PM · MediaWiki-extensions-Translate, Wikimedia-log-errors

Sep 15 2017

aaron added a comment to T173477: wmf.14 Blocker - Post Mortem - Cannot flush pre-lock snapshot because writes are pending.
  • PROBLEM: in LinksUpdate, runForTitle() started off with acquirePageLock(), then called doUpdate() for the secondary update list, and returned without committing. This meant that any caller using this method inside a loop had to call commitMasterChanges() itself somehow; otherwise, the next acquirePageLock() call would fail. The multi-title case of RefreshLinksJob had a for-loop that did not do this. Note that acquirePageLock() uses getScopedLockAndFlush(), which is intended for "critical sections" (https://en.wikipedia.org/wiki/Critical_section) involving reading/writing to the database. Since it makes no sense to acquire a lock and then read a stale snapshot (from REPEATABLE-READ) from *before* lock acquisition, Database demands that any transaction be cleared. It will do so automatically if there are no writes, but otherwise it fails since committing prematurely may break atomicity.
  • INTRODUCTION: This was broken since 63a3911a67507731695bad3188f486219a563b7d but nothing used multi-title refreshlinks jobs. 0df49eeaf49dcd84cee5afc678de43ebd6c984c5 introduced a use case for this and made the bug manifest itself.
  • AVOIDANCE: since this would seem to happen for any multi-tutle job run, I'm not sure how this got past testing unless there were (a) no links updated and the test jobs were triggered by null or non-link changed edits or (b) the edited test entities only had one backlink. Future backlink change propagation testing should cover these cases.
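
The failure mode described above can be illustrated with a toy model. This is a minimal sketch in Python, not the MediaWiki code: `FakeConnection`, `lock_and_flush`, `run_for_title`, and `refresh_links` are hypothetical stand-ins that only track transaction state, loosely mirroring getScopedLockAndFlush(), runForTitle(), and the RefreshLinksJob loop.

```python
class PendingWritesError(RuntimeError):
    """Raised when a lock is requested while uncommitted writes exist."""


class FakeConnection:
    """Toy stand-in for a database handle; tracks only transaction state."""

    def __init__(self):
        self.in_transaction = False
        self.pending_writes = 0

    def write(self):
        # Any write opens (or continues) a transaction and must be committed.
        self.in_transaction = True
        self.pending_writes += 1

    def commit(self):
        self.in_transaction = False
        self.pending_writes = 0

    def lock_and_flush(self, key):
        # Hypothetical analogue of getScopedLockAndFlush(): the snapshot must
        # be cleared before the lock is taken. A read-only transaction is
        # flushed silently; pending writes are refused, because committing
        # them here could break the caller's atomicity.
        if self.pending_writes:
            raise PendingWritesError(f"writes pending before lock {key!r}")
        self.commit()
        return key


def run_for_title(conn, title):
    # Mirrors the described behaviour: lock, do the secondary updates,
    # and return WITHOUT committing.
    conn.lock_and_flush(f"LinksUpdate:{title}")
    conn.write()


def refresh_links(conn, titles, commit_each):
    # commit_each=False models the buggy multi-title loop;
    # commit_each=True models a caller that commits between titles.
    for title in titles:
        run_for_title(conn, title)
        if commit_each:
            conn.commit()
```

In this model a single-title run succeeds even without the per-title commit, while a two-title run raises `PendingWritesError` on the second lock attempt, which is consistent with the bug staying hidden when test entities had only one backlink.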
Sep 15 2017, 9:39 AM · RelEng-Archive-FY201718-Q1
aaron added a comment to T174993: Vandalism in "In the news" articles persisting in the app' ?.

I'd leave it open. The above change avoids the job queue, and thus avoids fast jobs at the tail piling up behind slow Wikidata jobs at the head of the queue. It should help, I'd assume.

Sep 15 2017, 6:26 AM · Reading-Infrastructure-Team-Backlog, Services (watching), Mobile, Wikipedia-iOS-App-Backlog, Wikipedia-Android-App-Backlog, iOS-app-Bugs, Android-app-Bugs

Sep 14 2017

aaron added a comment to T174993: Vandalism in "In the news" articles persisting in the app' ?.

I think LinksUpdate for the page directly edited can probably be moved (back) to doing the actual work post-send.

The 'enqueue' parameter can be removed from MediaWiki::restInPeace() since, unlike in the PRESEND run, the user is not waiting on it to run. If a caller really wants to enqueue a job post-send, it can always use lazyPush() instead of adding an EnqueueableDataUpdate to the POSTSEND deferred update list.

Sep 14 2017, 8:25 AM · Reading-Infrastructure-Team-Backlog, Services (watching), Mobile, Wikipedia-iOS-App-Backlog, Wikipedia-Android-App-Backlog, iOS-app-Bugs, Android-app-Bugs
aaron added a comment to T174993: Vandalism in "In the news" articles persisting in the app' ?.

I think LinksUpdate for the page directly edited can probably be moved (back) to doing the actual work post-send.

Sep 14 2017, 8:02 AM · Reading-Infrastructure-Team-Backlog, Services (watching), Mobile, Wikipedia-iOS-App-Backlog, Wikipedia-Android-App-Backlog, iOS-app-Bugs, Android-app-Bugs
aaron added a comment to T174993: Vandalism in "In the news" articles persisting in the app' ?.

As far as I can tell, the page image(s) are handled as part of deferred LinksUpdate processing. This means that the updates would be executed after the main web request, but on the same PHP thread that handled the original edit request.

Sep 14 2017, 7:56 AM · Reading-Infrastructure-Team-Backlog, Services (watching), Mobile, Wikipedia-iOS-App-Backlog, Wikipedia-Android-App-Backlog, iOS-app-Bugs, Android-app-Bugs

Sep 13 2017

aaron added a comment to T102899: Implement or find a generic leaderboard web interface.

For catching slow queries, we can use logging to logstash when the runtime passes a certain threshold (to avoid spamming the service). A leaderboard could be added to Kibana for the top occurrences of normalized messages.
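
As a rough illustration of the idea, the sketch below records a query only when its runtime passes a threshold, and counts occurrences by normalized message so that a "top N" list can be derived. Everything here is an assumption for illustration: the 1-second threshold, the function names, and the regex-based normalization are hypothetical, and an in-memory `Counter` stands in for Logstash/Kibana.

```python
import re
from collections import Counter

# Assumed threshold; a real deployment would tune this value.
SLOW_THRESHOLD_SECONDS = 1.0


def normalize_query(sql):
    """Collapse literals so queries of the same shape share one message."""
    sql = re.sub(r"'(?:[^'\\]|\\.)*'", "?", sql)  # quoted string literals
    sql = re.sub(r"\b\d+\b", "?", sql)            # standalone numbers
    return re.sub(r"\s+", " ", sql).strip()       # whitespace runs


def record_if_slow(leaderboard, sql, runtime_seconds,
                   threshold=SLOW_THRESHOLD_SECONDS):
    """Count the query's normalized form only if it exceeded the threshold.

    Returns True when the query was recorded, False when it was skipped.
    """
    if runtime_seconds < threshold:
        return False
    leaderboard[normalize_query(sql)] += 1
    return True


def top_offenders(leaderboard, n=10):
    """Return the n most frequent normalized slow queries with their counts."""
    return leaderboard.most_common(n)
```

Filtering before normalization keeps the cost of the regex work off the fast path, and grouping by normalized text is what makes a leaderboard meaningful when the same query shape recurs with different parameters.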

Sep 13 2017, 1:22 PM · Performance-Team
aaron renamed T99060: Create a dashboard of key user-centric performance metrics from Performance key metrics dashboard(s) to Create a dashboard of key user-centric performance metrics.
Sep 13 2017, 12:25 PM · Performance-Team

Sep 12 2017

aaron moved T95501: Fix causes of slave lag and get it to under 5 seconds at peak from Next-up to Blocked on the Performance-Team board.
Sep 12 2017, 10:08 AM · Goal, Performance-Team, Availability
aaron moved T161749: Introduce InterruptMutexManager from Next-up to Backlog on the Performance-Team board.
Sep 12 2017, 10:08 AM · TechCom-RfC (ArchCom-Approved), User-Daniel, Performance-Team, MediaWiki-General-or-Unknown
aaron moved T121440: Dedicated post-edit cache busting cookie to prevent stale reads (session consistency) from Potential goals to Backlog on the Performance-Team board.
Sep 12 2017, 10:08 AM · Performance-Team
aaron moved T121440: Dedicated post-edit cache busting cookie to prevent stale reads (session consistency) from Next-up to Potential goals on the Performance-Team board.
Sep 12 2017, 10:07 AM · Performance-Team
aaron moved T171071: Perform testing for TLS effect on connection rate from Doing to Blocked on the Performance-Team board.
Sep 12 2017, 10:01 AM · Availability (Multiple-active-datacenters), DBA, Operations, Performance-Team
aaron updated the task description for T175672: Make apache/maintenance hosts TLS connections to mariadb work.
Sep 12 2017, 8:44 AM · Patch-For-Review, Performance-Team (Radar), Availability (Multiple-active-datacenters), DBA, Operations
aaron created T175672: Make apache/maintenance hosts TLS connections to mariadb work.
Sep 12 2017, 8:43 AM · Patch-For-Review, Performance-Team (Radar), Availability (Multiple-active-datacenters), DBA, Operations

Sep 9 2017

aaron updated the task description for T175437: Improve [rollback] logic when it encounters null edits.
Sep 9 2017, 1:08 AM · MediaWiki-Recent-changes, MediaWiki-History-or-Diffs
aaron updated the task description for T175437: Improve [rollback] logic when it encounters null edits.
Sep 9 2017, 12:54 AM · MediaWiki-Recent-changes, MediaWiki-History-or-Diffs
aaron created T175439: SQL error with postgres during 1.30 update.php run.
Sep 9 2017, 12:29 AM · MW-1.28-release-notes, MW-1.27-release-notes, MW-1.29-release-notes, Patch-For-Review, PostgreSQL, MediaWiki-Database

Sep 8 2017

aaron created T175437: Improve [rollback] logic when it encounters null edits.
Sep 8 2017, 11:48 PM · MediaWiki-Recent-changes, MediaWiki-History-or-Diffs