Joe (Giuseppe Lavagetto)

User Details

User Since
Oct 3 2014, 5:57 AM (296 w, 17 h)
Availability
Available
LDAP User
Giuseppe Lavagetto
MediaWiki User
GLavagetto (WMF) [ Global Accounts ]

Recent Activity

Today

Joe added a comment to T254275: HTML Dumps - June/2020.

Having all the content in HDFS would be hugely useful, and doing it right would imply solving _so_ many problems. We should definitely talk about this.

Fri, Jun 5, 9:52 PM · Analytics, Core Platform Team, Dumps-Generation
Joe added a comment to T244340: Reduce read pressure on memcached servers by adding a machine-local Memcache instance.

The problem we're trying to solve here is not an individual cache key read, or even multiple cache key reads, but more generally to smooth out any read swarm we have. For now this is an experiment, and based on our past experience I don't think that caching keys with a TTL lower than the max DB lag could really cause consistency issues. I will go further: if this could cause consistency issues, then we're using memcached as a database with some guarantees of consistency and durability, neither of which is true, and we need to go back and question that.
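
A minimal sketch of the machine-local read-through tier under discussion, assuming pymemcache; the host names, port, and TTL value are illustrative, and the only real constraint is that the local TTL stays below the maximum DB lag we already tolerate:

```python
from pymemcache.client.base import Client

SHARED = Client(("mc1001.example.org", 11211))  # shared memcached shard (illustrative)
LOCAL = Client(("127.0.0.1", 11211))            # machine-local instance

# Keep the local TTL below the max DB replication lag, so a locally
# cached value never outlives the staleness we already accept.
LOCAL_TTL = 10  # seconds, illustrative

def get(key):
    """Read-through lookup: local tier first, then the shared cluster."""
    value = LOCAL.get(key)
    if value is not None:
        return value
    value = SHARED.get(key)
    if value is not None:
        # A read swarm on this host now hits local memory for LOCAL_TTL
        # seconds instead of hammering the shared shard.
        LOCAL.set(key, value, expire=LOCAL_TTL)
    return value
```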

Fri, Jun 5, 11:27 AM · Performance-Team, Patch-For-Review, Operations, serviceops
Joe added a comment to T254275: HTML Dumps - June/2020.

I have one question: I see you're using AWS to build the project. Do you plan to just run a prototype on AWS, or is it intended to be used in production as well?

Fri, Jun 5, 9:43 AM · Analytics, Core Platform Team, Dumps-Generation
Joe added a comment to T254275: HTML Dumps - June/2020.

Having said that, not everything is rosy; the devil is in the details. How much staleness is acceptable? What do we do with pages that fail to render because they consume too many resources? And the most critical question: how would we do this in a RESTBase-less universe?

Fri, Jun 5, 9:40 AM · Analytics, Core Platform Team, Dumps-Generation
Joe added a comment to T254275: HTML Dumps - June/2020.

I'm probably not the first person to think of this, but figuring out how to do all the extra page renders this will need in the "warm" datacenter (usually codfw) seems pretty ideal. I vaguely remember that a big issue the last time this was tried naively was that Parsoid + MediaWiki had a hard time keeping up with the extra traffic. Today the hardware planning might be a bit easier, with Parsoid/PHP having replaced the Node.js sidecar service. Either way, it seems like a nice win to put the unloaded DC to work producing the HTML renders.

Fri, Jun 5, 9:37 AM · Analytics, Core Platform Team, Dumps-Generation

Yesterday

Joe updated the task description for T245594: Many objects in conftool have pooled=yes, weight=0.
Thu, Jun 4, 7:35 AM · Service-Architecture, Operations
Joe closed T245594: Many objects in conftool have pooled=yes, weight=0 as Resolved.

Resolving this, as we have no more services with weight 0, and "pool" now correctly refuses to pool a service if its weight is zero.
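
A toy illustration of that guard (not conftool's actual code; names are made up):

```python
class PoolError(Exception):
    pass

def pool(obj):
    """Refuse to pool an object whose weight is 0: pooled=yes with
    weight=0 would silently receive no traffic."""
    if obj.get("weight", 0) <= 0:
        raise PoolError("refusing to pool %s: weight is 0" % obj.get("name"))
    obj["pooled"] = "yes"
    return obj
```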

Thu, Jun 4, 7:34 AM · Service-Architecture, Operations
Physikerwelt awarded T245170: Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki a The World Burns token.
Thu, Jun 4, 6:38 AM · MW-1.35-notes (1.35.0-wmf.35; 2020-06-02), Core Platform Team Workboards (Clinic Duty Team), Sustainability (Incident Prevention), MediaWiki-General, Operations, serviceops
Joe closed T97972: Figure out a security model for etcd, a subtask of T97978: Create a tool to sync static configuration from a repository to the consistent k/v store, as Resolved.
Thu, Jun 4, 5:55 AM · Operations, services-tooling, discovery-system, Traffic
Joe closed T97972: Figure out a security model for etcd as Resolved.
Thu, Jun 4, 5:55 AM · Patch-For-Review, conftool, Operations, services-tooling, discovery-system, Traffic

Wed, Jun 3

Joe added a comment to T249065: RFC: Wikimedia Push Notification Service.

@Krinkle I think it's perfectly OK not to use changeprop - I just wanted some clarification as to why, written into the RFC, so that the analysis is explicit and documented. I have no concerns regarding the RFC as it is.

Wed, Jun 3, 12:40 PM · TechCom-RFC (TechCom-RFC-Closed), Product-Infrastructure-Team-Backlog (Kanban), Security, Push-Notification-Service, Reading Epics (Push Notifications)
Joe added a comment to T97972: Figure out a security model for etcd.

My tests went fine:

  • mwdebug* servers got the datacenter-appropriate pool-$dc-testserver user
  • the deployment servers got the conftool user instead
Wed, Jun 3, 10:57 AM · Patch-For-Review, conftool, Operations, services-tooling, discovery-system, Traffic
Joe created P11379 puppet merge fail.
Wed, Jun 3, 10:09 AM
Joe closed T82176: Setup HTCP monitoring alerts as Declined.

We're moving to purged using Kafka, so we will set up alerting based on that rather than on htcpd.

Wed, Jun 3, 7:52 AM · Operations, observability

Thu, May 28

Joe added a comment to T253840: "depool" installed on prometheus hosts but not "confctl".

This happens because the prometheus role somehow includes conftool::scripts but not profile::conftool::client.

Thu, May 28, 10:42 AM · Patch-For-Review, conftool
Joe added a comment to T97972: Figure out a security model for etcd.

The deploy strategy is simply to add the new users to etcd, move most hosts to use conftool as the root user immediately, and then progressively move them to the new user system in the long run.

Thu, May 28, 6:51 AM · Patch-For-Review, conftool, Operations, services-tooling, discovery-system, Traffic
Joe added a comment to T236017: Move blubberoid to use TLS only..

Picking this up again - we already migrated the CDN to use https - do we need to do something for CI?

Thu, May 28, 6:49 AM · Release Pipeline (Blubber), Kubernetes, serviceops, Operations
Joe closed T247389: Use envoy for TLS termination on the appservers, a subtask of T244843: Create a service-to-service proxy for handling HTTP calls from services to other entities, as Resolved.
Thu, May 28, 6:47 AM · Patch-For-Review, MediaWiki-General, Operations, serviceops, Service-Architecture
Joe closed T247389: Use envoy for TLS termination on the appservers as Resolved.
Thu, May 28, 6:47 AM · Patch-For-Review, MediaWiki-General, Operations, serviceops, Service-Architecture
Joe closed T247388: Create a grafana dashboard to monitor services proxied via envoy, a subtask of T244843: Create a service-to-service proxy for handling HTTP calls from services to other entities, as Resolved.
Thu, May 28, 6:47 AM · Patch-For-Review, MediaWiki-General, Operations, serviceops, Service-Architecture
Joe closed T247388: Create a grafana dashboard to monitor services proxied via envoy as Resolved.
Thu, May 28, 6:47 AM · MediaWiki-General, Operations, serviceops, Service-Architecture
Joe added a comment to T253822: Remove provisioning for 'mwscript', 'foreachwikiindblist' etc from deployment host.

Just out of curiosity, what's the problem with running scripts from the deployment host, other than "we prefer if they're run on mwmaint for consistency"?

Thu, May 28, 5:20 AM · Sustainability (Incident Prevention), Deployments, serviceops, Release-Engineering-Team-TODO

Wed, May 27

Joe triaged T253715: Degraded RAID on restbase2009 as High priority.

Setting priority to "high" as the failed disk was also used in a JBOD configuration for cassandra, which is now failing to start.

Wed, May 27, 5:38 AM · Operations, ops-codfw

Mon, May 25

Joe added a comment to T242155: Update Doxygen in CI to 1.8.17 or greater.

The package has been uploaded.

Mon, May 25, 1:49 PM · Operations, Patch-For-Review, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Doxygen, Developer Productivity, Continuous-Integration-Config
Joe added a comment to T247389: Use envoy for TLS termination on the appservers.

As of today, all appservers use envoy too.

Mon, May 25, 1:07 PM · Patch-For-Review, MediaWiki-General, Operations, serviceops, Service-Architecture
Joe lowered the priority of T253405: Some (recent?) uploads to Commons are not available on other wikis from High to Medium.

I've been monitoring the status of new images in the following way:

Mon, May 25, 7:21 AM · MW-1.35-notes (1.35.0-wmf.32; 2020-05-12), Wikimedia-Incident, Performance-Team, MediaWiki-Uploading, Operations, Commons, MediaWiki-File-management
Joe added a comment to T244340: Reduce read pressure on memcached servers by adding a machine-local Memcache instance.

The idea is obviously sensible. I do have some concerns about how this will perform with our loaded mwservers. […]

I'm also worried about the latency it might add unconditionally. I'd be inclined to pursue something more explicit from the application side. But let's see how this goes.

Mon, May 25, 5:10 AM · Performance-Team, Patch-For-Review, Operations, serviceops

Fri, May 22

Joe committed rLPRIb9be9c82a78d: Add etcd password autogen seed (authored by Joe).
Add etcd password autogen seed
Fri, May 22, 7:19 AM

Thu, May 21

Joe added a comment to T97972: Figure out a security model for etcd.

I think I will try to implement the following RBAC schema:

Thu, May 21, 5:48 AM · Patch-For-Review, conftool, Operations, services-tooling, discovery-system, Traffic

Wed, May 20

Joe committed rLPRI5a26ff3fbe64: Move jobrunner cert to correct path (authored by Joe).
Move jobrunner cert to correct path
Wed, May 20, 2:24 PM
Joe committed rLPRI976a50232b52: Add cert for combined jobrunner (authored by Joe).
Add cert for combined jobrunner
Wed, May 20, 2:20 PM
Joe added a comment to T247389: Use envoy for TLS termination on the appservers.

Status update: we've deployed envoy on all mediawiki servers with the exception of:

  • jobrunners (where we still have to reproduce what nginx was doing)
  • all servers in the appserver cluster in eqiad with a sequence number above mw1275.
Wed, May 20, 9:29 AM · Patch-For-Review, MediaWiki-General, Operations, serviceops, Service-Architecture
Joe added a comment to T253128: intermittent brief data dropouts for esams netflow data.

Looking at kafka, it seems there is a bizarre pattern in producing the data to the "netflow" topic:

Wed, May 20, 7:06 AM · Operations, netops
Joe added a comment to T249065: RFC: Wikimedia Push Notification Service.

TechCom is placing this on Last Call, ending on the 27th of May.

@Joe can you confirm your concerns have been addressed?

Wed, May 20, 6:19 AM · TechCom-RFC (TechCom-RFC-Closed), Product-Infrastructure-Team-Backlog (Kanban), Security, Push-Notification-Service, Reading Epics (Push Notifications)
Joe added a comment to T249065: RFC: Wikimedia Push Notification Service.
  • For metrics to collect, let's ensure we make them compatible with Prometheus, and let's use the standard Latency, Traffic, Errors, and Saturation metrics. I will expand on this in a later comment.

When we launched the Wikifeeds service on the pipeline a little while ago, I noticed that a dashboard for the service with those metrics appeared in Grafana, though I'm not sure if it happened automagically or someone other than me went to the trouble of creating it manually. In any case, that suggests that we can pattern our metrics collection after what Wikifeeds is doing.

For now I'll make a note in the Metrics section that we should be collecting these.
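
For reference, a minimal sketch of those four signals with prometheus_client; the metric names and the `deliver` helper are hypothetical:

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

TRAFFIC = Counter("push_requests_total", "Traffic: requests received")
ERRORS = Counter("push_errors_total", "Errors: failed requests")
LATENCY = Histogram("push_request_duration_seconds", "Latency: request duration")
SATURATION = Gauge("push_in_flight_requests", "Saturation: requests in flight")

@LATENCY.time()                 # observes wall-clock duration per call
@SATURATION.track_inprogress()  # +1 on entry, -1 on exit
def handle_request(payload):
    TRAFFIC.inc()
    try:
        deliver(payload)  # hypothetical delivery function
    except Exception:
        ERRORS.inc()
        raise

start_http_server(9100)  # exposes /metrics for Prometheus to scrape
```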

Wed, May 20, 6:15 AM · TechCom-RFC (TechCom-RFC-Closed), Product-Infrastructure-Team-Backlog (Kanban), Security, Push-Notification-Service, Reading Epics (Push Notifications)
Joe added a comment to T252091: RFC: Site-wide edit rate limiting with PoolCounter.

So, while I find the idea of using poolcounter to limit editing concurrency (which is not rate-limiting; they are different things) a good proposal, and in general something desirable to have (including the possibility of tuning it down to zero if we're in a crisis, for instance), I think the fundamental problem reported here is that WDQS can't ingest the updates fast enough.
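
The distinction in one file: a concurrency limit bounds simultaneous work, while a rate limit bounds throughput over time. A plain-Python sketch of the semantics only (PoolCounter itself is a separate network service):

```python
import threading
import time

class ConcurrencyLimit:
    """At most n edits in flight at once; tuning n to 0 shuts the door
    entirely, e.g. during an incident."""

    def __init__(self, n):
        self._sem = threading.BoundedSemaphore(n)

    def __enter__(self):
        if not self._sem.acquire(timeout=5):
            raise RuntimeError("too many concurrent edits, try again later")
        return self

    def __exit__(self, *exc):
        self._sem.release()

class RateLimit:
    """At most n edits per `period` seconds, regardless of concurrency."""

    def __init__(self, n, period):
        self.n, self.period = n, period
        self._stamps, self._lock = [], threading.Lock()

    def allow(self):
        now = time.monotonic()
        with self._lock:
            self._stamps = [t for t in self._stamps if now - t < self.period]
            if len(self._stamps) >= self.n:
                return False
            self._stamps.append(now)
            return True
```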

Wed, May 20, 6:09 AM · Sustainability (Incident Prevention), User-Addshore, Wikidata-Campsite, Wikidata, TechCom-RFC
Joe added a comment to T252091: RFC: Site-wide edit rate limiting with PoolCounter.
  • "The above suggests that the current rate limit is too high," this is not correct, the problem is that there is no rate limit for bots at all. The group explicitly doesn't have a rate limit. Adding such ratelimit was tried and caused lots of issues (even with a pretty high number).
Wed, May 20, 6:02 AM · Sustainability (Incident Prevention), User-Addshore, Wikidata-Campsite, Wikidata, TechCom-RFC

Tue, May 19

Joe added a comment to T251869: Regression test for Logstash filters.

I think we're at the point where it would be best if we could change the logic of our testing, and use docker directly, so that we can split tests into different images.

Tue, May 19, 10:19 AM · Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Release-Engineering-Team (CI & Testing services), observability, User-fgiunchedi, Patch-For-Review, Wikimedia-Logstash

Mon, May 18

Joe added a comment to T133821: Make CDN purges reliable.

Status update: purged is now consuming purges from restbase directly via kafka and not via multicast anymore. This should unblock the complete migration of changeprop to kubernetes, amongst other things.

Mon, May 18, 2:48 PM · serviceops, Patch-For-Review, Sustainability (MediaWiki-MultiDC), Performance-Team (Radar), Operations, Traffic

Thu, May 14

Joe triaged T252743: Test effects of forcing numa locality for php-fpm as Medium priority.
Thu, May 14, 8:12 AM · serviceops, Operations
Joe created T252745: Sandbox/limit child processes within a container runtime.
Thu, May 14, 7:56 AM · serviceops, Operations
Joe created T252743: Test effects of forcing numa locality for php-fpm.
Thu, May 14, 7:34 AM · serviceops, Operations

Wed, May 13

Joe added a comment to T252605: Reliable metrics for idle/busy PHP-FPM workers.

I took a brief peek at what flows to systemd from php-fpm on dbus:

Wed, May 13, 3:30 PM · Patch-For-Review, serviceops, observability, Operations
Joe updated the task description for T252619: Debian-glue doesn't check for the validity of the distribution in the changelog..
Wed, May 13, 8:09 AM · Release-Engineering-Team
Joe renamed T252619: Debian-glue doesn't check for the validity of the distribution in the changelog. from Jobs using "debian-glue" on buster fail consistently to Debian-glue doesn't check for the validity of the distribution in the changelog..
Wed, May 13, 7:53 AM · Release-Engineering-Team
Joe lowered the priority of T252619: Debian-glue doesn't check for the validity of the distribution in the changelog. from High to Low.

I just realized the problem is the typo buTster I made in the commit. I'm changing the priority accordingly, and I'll rewrite the text of the bug, as the problem is different :)

Wed, May 13, 7:48 AM · Release-Engineering-Team
Joe triaged T252619: Debian-glue doesn't check for the validity of the distribution in the changelog. as High priority.

Setting priority to "high" as this is blocking a project.

Wed, May 13, 7:42 AM · Release-Engineering-Team
Joe created T252619: Debian-glue doesn't check for the validity of the distribution in the changelog..
Wed, May 13, 7:42 AM · Release-Engineering-Team
Joe added a comment to T252605: Reliable metrics for idle/busy PHP-FPM workers.

FWIW, I seem to remember systemctl status php7.2-fpm stalling on a busy server, but I might be misremembering.

Wed, May 13, 6:56 AM · Patch-For-Review, serviceops, observability, Operations

Tue, May 12

Joe committed rLPRI9d970d3dfba5: Adding the fake certificates for purged (authored by Joe).
Adding the fake certificates for purged
Tue, May 12, 12:53 PM
Ladsgroup awarded T250261: Stop sending purges for `action=history` for linked pages. a Love token.
Tue, May 12, 12:31 PM · Performance-Team-publish, MW-1.35-notes (1.35.0-wmf.31; 2020-05-05), Sustainability (Incident Prevention), Core Platform Team Workboards (Clinic Duty Team), MediaWiki-Cache, Performance-Team (Radar), serviceops, Traffic, Operations
Joe added a comment to T133821: Make CDN purges reliable.

At a later time, we could think of changing the logic and making purges avoid race conditions, removing the need for rebound purges.
One way to implement this would be the following:

  • No more changes are needed at the application layer
  • All purged servers join a single consumer group per datacenter. This will ensure each purge message is consumed by only one purged instance.
  • This instance will take care of sending the purges to all the cache backends in the DC first, and to all the frontends afterwards

This would ensure there are no fe/be race conditions.

I guess we have two categories of rebound purges: the MediaWiki ones (for DB replication lag) and the infrastructure ones for cache-tier race mitigation. The proposed scheme would eliminate the latter. The former case is largely negligible (it only applies to URLs of pages directly edited, not those changed via templates/files).
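
A rough sketch of that consumer-group scheme, assuming kafka-python and requests; the topic name, group id, host lists, and message shape are all illustrative:

```python
import json

import requests
from kafka import KafkaConsumer

BACKENDS = ["cp1075.example.org", "cp1077.example.org"]   # illustrative
FRONTENDS = ["cp1079.example.org", "cp1081.example.org"]  # illustrative

# One consumer group per DC: Kafka delivers each purge message to
# exactly one purged instance in that datacenter.
consumer = KafkaConsumer(
    "resource-purge",                                 # hypothetical topic
    group_id="purged-eqiad",
    bootstrap_servers=["kafka1001.example.org:9092"],
)

def purge(host, path):
    requests.request("PURGE", "http://%s%s" % (host, path), timeout=2)

for message in consumer:
    path = json.loads(message.value)["uri"]  # assumed message field
    # Backends first, frontends afterwards: a frontend miss that refills
    # from a not-yet-purged backend is exactly the race being avoided.
    for host in BACKENDS + FRONTENDS:
        purge(host, path)
```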

Tue, May 12, 9:52 AM · serviceops, Patch-For-Review, Sustainability (MediaWiki-MultiDC), Performance-Team (Radar), Operations, Traffic
ema awarded T250261: Stop sending purges for `action=history` for linked pages. a Love token.
Tue, May 12, 7:38 AM · Performance-Team-publish, MW-1.35-notes (1.35.0-wmf.31; 2020-05-05), Sustainability (Incident Prevention), Core Platform Team Workboards (Clinic Duty Team), MediaWiki-Cache, Performance-Team (Radar), serviceops, Traffic, Operations
Joe added a comment to T250261: Stop sending purges for `action=history` for linked pages..

This change was released to production to all wikis yesterday.

Tue, May 12, 7:31 AM · Performance-Team-publish, MW-1.35-notes (1.35.0-wmf.31; 2020-05-05), Sustainability (Incident Prevention), Core Platform Team Workboards (Clinic Duty Team), MediaWiki-Cache, Performance-Team (Radar), serviceops, Traffic, Operations

Mon, May 11

Joe added a comment to T243106: Phased rollout of sessionstore to production fleet.

Thank you @akosiaris! Does this unblock us to continue with the rollout? If so, I'll arrange to schedule that work on our side and coordinate with @thcipriani.

Mon, May 11, 9:12 AM · Performance-Team-publish, Core Platform Team Workboards (Clinic Duty Team), CPT Initiatives (Session Management Service (CDP2)), serviceops-radar, TPG-Epics (Team Practices Group Coaching Clinic), User-Clarakosi, User-Eevans
Joe closed T251378: Chaos Engineering - Stop for x hours one or more mc10xx memcached shards, a subtask of T244852: Upgrade and improve our application object caching service (memcached), as Resolved.
Mon, May 11, 8:39 AM · Patch-For-Review, Operations, serviceops
Joe closed T251378: Chaos Engineering - Stop for x hours one or more mc10xx memcached shards as Resolved.

We ran this test, and it passed with flying colors:

  • A transient peak of memcached errors, lasting less than 1 minute
  • The gutter pool picks up the slack pretty fast
  • No noticeable effect on latency.
  • The cache hit ratio on the gutter pool was good (88% after less than one hour in the pool, though probably capped around that value by the 10-minute TTL)
  • As soon as the server became available again, the memcached traffic went back quickly but not instantly, over the span of ~2 minutes. This also eases the risk of thundering herds from the deletes that get replayed.
Mon, May 11, 8:39 AM · Operations, serviceops

Fri, May 8

Joe added a comment to T252127: Improve resource-purge request_id and dt propagation.

I don't think we need the request_id to be preserved - purged is definitely not the place to do analysis of such data.

Fri, May 8, 6:14 AM · ChangeProp, Core Platform Team Workboards (Clinic Duty Team), serviceops

May 6 2020

Joe added a comment to T99740: Use static php array files for l10n cache at WMF (instead of CDB).
In T99740#6100595, @ori wrote:

@Joe, I appreciate the effort you put in to evaluating this change! If you have the patience to put up with some more annoying kibitzing from me, I have a few questions :)

Hammering one or two pages may not be representative. The performance test should force MediaWiki to look up entries in the l10n cache at the same rate as production, and that may not happen if every request is a parser cache hit. (Timo, please correct me if I'm wrong.) It's the traffic test we ought to pay attention to.

May 6 2020, 8:10 AM · Performance-Team, Release-Engineering-Team-TODO, Scap, MediaWiki-Internationalization
Joe added a comment to T251378: Chaos Engineering - Stop for x hours one or more mc10xx memcached shards.

@elukey let's schedule this test for 6:00Z on Monday, May 11th?

May 6 2020, 4:39 AM · Operations, serviceops
Joe added a comment to T251942: check_mw_versions alerts for each individual app server.

While it's clear that 400 alerts flooding production are not great, this check is important for each single machine. So we can aggregate the output, but we can't suppress it. We need to know *very clearly* if one single machine is running an outdated version of mediawiki.
So I second the aggregation, if it's possible to show clearly in the icinga alert which machine (or machines, if the number is below, say, 90% of all mw servers) is failing the check.
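
Something along these lines (a sketch, not the actual check; the threshold is illustrative):

```python
def aggregate_mw_version_check(results, threshold=0.9):
    """Collapse per-host version checks into one icinga alert.

    `results` maps hostname -> True when the host runs the expected
    MediaWiki version. Returns a nagios-style (exit_code, message) pair.
    """
    failing = sorted(host for host, ok in results.items() if not ok)
    if not failing:
        return 0, "OK: all %d mw servers on the expected version" % len(results)
    if len(results) - len(failing) >= threshold * len(results):
        # Few failures: naming the hosts is the useful part of the alert.
        return 2, "CRITICAL: outdated MediaWiki on: %s" % ", ".join(failing)
    # Widespread failure: a host list would be noise, report the count.
    return 2, "CRITICAL: %d/%d mw servers outdated" % (len(failing), len(results))
```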

May 6 2020, 4:37 AM · observability, Operations

Apr 30 2020

Joe added a comment to T133821: Make CDN purges reliable.

After a discussion on the patch, it became clearer to me that some information can't be removed from the message, and that makes resource_change the perfect fit for our use case.

Apr 30 2020, 4:01 PM · serviceops, Patch-For-Review, Sustainability (MediaWiki-MultiDC), Performance-Team (Radar), Operations, Traffic
Joe added a comment to T133821: Make CDN purges reliable.

Looking at our existing event schemas, resource_change has all the information we need, but also much more. We would like to get a much smaller object to transmit, and specifically we only want to define:

Apr 30 2020, 9:48 AM · serviceops, Patch-For-Review, Sustainability (MediaWiki-MultiDC), Performance-Team (Radar), Operations, Traffic
Joe added a project to T133821: Make CDN purges reliable: serviceops.
Apr 30 2020, 7:48 AM · serviceops, Patch-For-Review, Sustainability (MediaWiki-MultiDC), Performance-Team (Radar), Operations, Traffic
Joe added a comment to T133821: Make CDN purges reliable.
  • Define a schema for a "url purge message".

If I can throw in another $0.02 here - I would scope this bigger than a URL, and think of it as a schema for purging broader things as well. "Purge a URL" is one kind of purge we have today, and will probably always need as a baseline capability, but we've always wanted to gain the ability to purge on a more semantic sort of level, as with the earlier (never really completed, and now everything has changed) X-Key work. The idea here is the ability to purge alternate K:V sets that can be used to tag small sets (not large swaths, it only scales to small-ish sets well!) of related content. For example, a purge might have a key of type article, and a value like enwiki:Foo, which would purge all of the potentially-many outputs related to enwiki's Foo article (history, various content snippet outputs from APIs, etc), which we'd control by having all the related content outputs contain a special header like X-Key: article=enwiki:foo, and having the caches build alternate lookup indices on these keys to efficiently purge content on them.
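
A toy in-memory model of that alternate index (not how Varnish/ATS would actually implement it):

```python
from collections import defaultdict

class TaggedCache:
    """Cache with a secondary index from X-Key tags to URLs, so one
    semantic purge (e.g. "article=enwiki:Foo") evicts every related
    output without enumerating the URLs."""

    def __init__(self):
        self._objects = {}              # url -> cached body
        self._index = defaultdict(set)  # tag -> set of urls

    def store(self, url, body, xkeys):
        # `xkeys` would come from an X-Key: response header.
        self._objects[url] = body
        for tag in xkeys:
            self._index[tag].add(url)

    def purge_tag(self, tag):
        # Cost scales with the number of objects carrying the tag:
        # fine for small related sets, not for purging large swaths.
        for url in self._index.pop(tag, set()):
            self._objects.pop(url, None)
```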

Apr 30 2020, 7:48 AM · serviceops, Patch-For-Review, Sustainability (MediaWiki-MultiDC), Performance-Team (Radar), Operations, Traffic
Joe added a comment to T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable".

In another case, we had ~ 100 errors corresponding to a spike in latency from the backend:

Apr 30 2020, 7:24 AM · User-brennen, serviceops, Core Platform Team Workboards (Clinic Duty Team), WMF-JobQueue, MediaWiki-extensions-Linter, Parsoid, Wikimedia-production-error
Joe added a comment to T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable".

In general this is happening quite a lot: ~5000 cases in the last 7 days. However, that covers both sending jobs and sending monolog events for the API.

Filtering further to only the errors affecting jobs, the severity seems lower: ~900 cases in 7 days. Still fairly high though.

The error itself is coming from envoy: upstream connect error or disconnect/reset before headers. reset reason: connection failure. Here's an individual log message. Given that @Joe has already bumped the timeout to 25s and retries to 2 in envoy as part of T248602, I'm not sure if further increasing the numbers would help. Perhaps this case is different and retries don't kick in?

Apr 30 2020, 6:52 AM · User-brennen, serviceops, Core Platform Team Workboards (Clinic Duty Team), WMF-JobQueue, MediaWiki-extensions-Linter, Parsoid, Wikimedia-production-error

Apr 29 2020

Joe added a comment to T99740: Use static php array files for l10n cache at WMF (instead of CDB).
In T99740#6089198, @Joe wrote:

Some more data:

Rendering the enwiki Barack Obama page, with concurrency of 25, gives this response time distribution (over 10k requests - so the p99 can be found at 9900 requests):

In this case, we don't see significant differences.

Same thing, for a lighter page (https://it.wikipedia.org/wiki/Nemico_pubblico_(film_1998))

In this case, we have more differences, still well below 1% in p99

Finally, what happens when we try to load a resource via load.php, with concurrency of 40:

As you can see, differences are negligible here too.

One rather random note, and feel free to correct me if I'm wrong: articles themselves are not heavy users of l10n (except in multilingual projects like Commons/Wikidata). Checking special pages or action=history might yield a different result.

Apr 29 2020, 12:37 PM · Performance-Team, Release-Engineering-Team-TODO, Scap, MediaWiki-Internationalization
Joe added a comment to T133821: Make CDN purges reliable.

At a later time, we could think of changing the logic and making purges avoid race conditions, removing the need for rebound purges.
One way to implement this would be the following:

  • No more changes are needed at the application layer
  • All purged servers join a single consumer group per datacenter. This will ensure each purge message is consumed by only one purged instance.
  • This instance will take care of sending the purges to all the cache backends in the DC first, and to all the frontends afterwards
Apr 29 2020, 10:29 AM · serviceops, Patch-For-Review, Sustainability (MediaWiki-MultiDC), Performance-Team (Radar), Operations, Traffic
Joe added a comment to T133821: Make CDN purges reliable.

Since purged is now in production, and we have some work ongoing that will reduce the number of purges we send (T250261), I think it's time to revisit the idea of moving purges to Kafka. This would also help with the transition of change-prop to kubernetes.

Apr 29 2020, 10:29 AM · serviceops, Patch-For-Review, Sustainability (MediaWiki-MultiDC), Performance-Team (Radar), Operations, Traffic
Joe added a comment to T251378: Chaos Engineering - Stop for x hours one or more mc10xx memcached shards.

I think we should run 3 different tests, and I would run them on 1 host first; the packet-drop variants are sketched below.

  • Stop memcached completely
  • Drop all packets directed to port 11211
  • Drop a percentage of incoming and outgoing packets
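
The packet-drop tests map naturally onto iptables; a hedged sketch driving it from Python (real iptables flags, illustrative values; outgoing traffic would mirror this with -A OUTPUT --sport 11211):

```python
import subprocess

def iptables(rule):
    subprocess.run(["iptables"] + rule, check=True)

def drop_all_memcached():
    # Test 2: blackhole every packet to the memcached port.
    iptables(["-A", "INPUT", "-p", "tcp", "--dport", "11211", "-j", "DROP"])

def drop_fraction(probability=0.10):
    # Test 3: drop a random fraction of memcached traffic via the
    # statistic match, simulating a flaky rather than dead shard.
    iptables(["-A", "INPUT", "-p", "tcp", "--dport", "11211",
              "-m", "statistic", "--mode", "random",
              "--probability", str(probability), "-j", "DROP"])
```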
Apr 29 2020, 9:04 AM · Operations, serviceops
Joe added a comment to T249065: RFC: Wikimedia Push Notification Service.

A general observation first: it's not clear which application will be responsible for storing subscription data. I would assume, if we expect multiple possible sources of subscriptions, that those sources would keep track of their own subscriptions. But I can see arguments in the other direction - for example, maintaining them in a centralized place would make it easier for people to manage them. Anyway, this should be clarified in the RFC.

Apr 29 2020, 5:57 AM · TechCom-RFC (TechCom-RFC-Closed), Product-Infrastructure-Team-Backlog (Kanban), Security, Push-Notification-Service, Reading Epics (Push Notifications)
Joe added a comment to T249065: RFC: Wikimedia Push Notification Service.

I have a few questions and observations about this RFC, but let's start from the basics:

Apr 29 2020, 5:36 AM · TechCom-RFC (TechCom-RFC-Closed), Product-Infrastructure-Team-Backlog (Kanban), Security, Push-Notification-Service, Reading Epics (Push Notifications)
Joe added a comment to T249065: RFC: Wikimedia Push Notification Service.

At risk of muddying the waters, @JoeWalsh and I were looking at the Signal app source code recently and noticed that it uses a rather interesting approach to push notifications: it sends an empty notification which prompts the app to wake up and retrieve message content from Signal's servers. I think this approach could also translate well to web push. The chief benefit of following this pattern would be that no message content would transit through push provider servers, which is a privacy win. Are there any strong reasons (particularly from an operations perspective) to rule out this approach?
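
For web push specifically, the pattern is just an empty payload; a sketch assuming the pywebpush library, with illustrative key paths and claims:

```python
from pywebpush import webpush

def nudge(subscription_info):
    """Send a contentless push: the provider relays nothing readable,
    and the woken client fetches the real message over its own
    authenticated channel."""
    webpush(
        subscription_info=subscription_info,            # endpoint + keys from the browser
        data="",                                        # empty body on purpose
        vapid_private_key="path/to/vapid_private.pem",  # illustrative
        vapid_claims={"sub": "mailto:push-admin@example.org"},
    )
```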

Apr 29 2020, 5:15 AM · TechCom-RFC (TechCom-RFC-Closed), Product-Infrastructure-Team-Backlog (Kanban), Security, Push-Notification-Service, Reading Epics (Push Notifications)

Apr 28 2020

Joe updated the task description for T244852: Upgrade and improve our application object caching service (memcached).
Apr 28 2020, 3:04 PM · Patch-For-Review, Operations, serviceops
Joe added a comment to T99740: Use static php array files for l10n cache at WMF (instead of CDB).

Some more data:

Apr 28 2020, 2:19 PM · Performance-Team, Release-Engineering-Team-TODO, Scap, MediaWiki-Internationalization
Joe placed T99740: Use static php array files for l10n cache at WMF (instead of CDB) up for grabs.

Just to be clearer: we achieved a much larger improvement in the average latency of requests by switching to persistent connections to our session storage:

Apr 28 2020, 1:57 PM · Performance-Team, Release-Engineering-Team-TODO, Scap, MediaWiki-Internationalization
Joe added a comment to T99740: Use static php array files for l10n cache at WMF (instead of CDB).

First, the results of the real-traffic test. These are averages over 10 minutes, starting after both servers had been pooled for 20 minutes. This is an attempt at smoothing out the effects of very slow queries at higher percentiles, which can be traffic-dependent.

Apr 28 2020, 10:07 AM · Performance-Team, Release-Engineering-Team-TODO, Scap, MediaWiki-Internationalization
Joe added a comment to T99740: Use static php array files for l10n cache at WMF (instead of CDB).

Assuming we'll be OK with restarting php-fpm at every release, I reduced the amount of interned-strings memory and opcache allocated on mw1407 from the values in the puppet patch. I am now using 300 MB of interned-strings cache and 3.3 GB of opcache space. These figures can probably be reduced further.

Apr 28 2020, 8:44 AM · Performance-Team, Release-Engineering-Team-TODO, Scap, MediaWiki-Internationalization
Joe added a comment to T99740: Use static php array files for l10n cache at WMF (instead of CDB).

I think 5% is huge! As part of T233886 and T189966, I took many work-days to achieve similar gains, and even that is becoming harder and harder without it turning into weeks of multi-person, cross-team dependencies. These kinds of gains will decide how much work it takes to achieve consistent latencies on the new REST API, for example, and they also make latencies generally more consistent.

Apr 28 2020, 7:30 AM · Performance-Team, Release-Engineering-Team-TODO, Scap, MediaWiki-Internationalization

Apr 27 2020

Joe added a comment to T99740: Use static php array files for l10n cache at WMF (instead of CDB).

After running a few benchmarks on mw1407 (where LCStoreStaticArray is used) vs mw1409 (which uses cdb files), it seemed the change made little to no difference for the following urls:

Apr 27 2020, 2:18 PM · Performance-Team, Release-Engineering-Team-TODO, Scap, MediaWiki-Internationalization
Joe updated the task description for T244852: Upgrade and improve our application object caching service (memcached).
Apr 27 2020, 6:32 AM · Patch-For-Review, Operations, serviceops

Apr 20 2020

Joe created P11020 mcrouter gutter proper config.
Apr 20 2020, 11:05 AM

Apr 17 2020

Joe added a comment to T245170: Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki.

Moving this to feature requests for PMs to review; we'll need to investigate what appropriate limits would be and how they should be tailored to requesters' needs in order to move forward.

Apr 17 2020, 7:15 AM · MW-1.35-notes (1.35.0-wmf.35; 2020-06-02), Core Platform Team Workboards (Clinic Duty Team), Sustainability (Incident Prevention), MediaWiki-General, Operations, serviceops

Apr 15 2020

Joe added a comment to T250261: Stop sending purges for `action=history` for linked pages..

I would frankly prefer to pass a flag to getCdnUrls and return those dependent urls only if the flag has its default value. I say this because it won't make Title significantly heavier, and at the same time it will quickly fix a problem we have in production.

Apr 15 2020, 8:07 PM · Performance-Team-publish, MW-1.35-notes (1.35.0-wmf.31; 2020-05-05), Sustainability (Incident Prevention), Core Platform Team Workboards (Clinic Duty Team), MediaWiki-Cache, Performance-Team (Radar), serviceops, Traffic, Operations
Joe created T250261: Stop sending purges for `action=history` for linked pages..
Apr 15 2020, 10:30 AM · Performance-Team-publish, MW-1.35-notes (1.35.0-wmf.31; 2020-05-05), Sustainability (Incident Prevention), Core Platform Team Workboards (Clinic Duty Team), MediaWiki-Cache, Performance-Team (Radar), serviceops, Traffic, Operations
Joe created P10987 We do not cache /w/index.php urls, so why purge them?
Apr 15 2020, 9:11 AM
Joe created T250251: Audit and harmonize timeouts across the stack.
Apr 15 2020, 5:45 AM · DBA, Traffic, serviceops, Operations
Joe placed T250205: Reduce rate of purges emitted by MediaWiki up for grabs.

@Joe You are assigned to this ticket; is this something you are going to work on in the code? Or shall we assign someone from the CPT team once we are working on it?

Apr 15 2020, 5:30 AM · MW-1.35-notes (1.35.0-wmf.35; 2020-06-02), Sustainability (Incident Prevention), Performance-Team (Radar), Core Platform Team, serviceops, Traffic, Operations

Apr 14 2020

Joe renamed T250205: Reduce rate of purges emitted by MediaWiki from placeholder: reduce rate of purges emitted by Mediawiki to Reduce rate of purges emitted by Mediawiki.
Apr 14 2020, 5:26 PM · MW-1.35-notes (1.35.0-wmf.35; 2020-06-02), Sustainability (Incident Prevention), Performance-Team (Radar), Core Platform Team, serviceops, Traffic, Operations
Joe closed T249705: Intermittent internal API errors with Flow as Resolved.

This is now resolved, I see no further errors since my latest change was merged.

Apr 14 2020, 2:44 PM · serviceops, Parsoid, MediaWiki-Configuration, Pywikibot-Flow, Growth-Team, StructuredDiscussions, Pywikibot, Pywikibot-tests
Joe claimed T249705: Intermittent internal API errors with Flow.

I've added some further retry logic for requests to parsoid; this *might* help.

Apr 14 2020, 1:35 PM · serviceops, Parsoid, MediaWiki-Configuration, Pywikibot-Flow, Growth-Team, StructuredDiscussions, Pywikibot, Pywikibot-tests
Joe raised the priority of T249705: Intermittent internal API errors with Flow from Low to High.

Changing priority as this seems to be highly user-visible.

Apr 14 2020, 7:22 AM · serviceops, Parsoid, MediaWiki-Configuration, Pywikibot-Flow, Growth-Team, StructuredDiscussions, Pywikibot, Pywikibot-tests
Joe merged T249997: Flow failing with Error contacting the Parsoid/RESTBase server (HTTP 400) into T249705: Intermittent internal API errors with Flow.
Apr 14 2020, 7:22 AM · serviceops, Parsoid, MediaWiki-Configuration, Pywikibot-Flow, Growth-Team, StructuredDiscussions, Pywikibot, Pywikibot-tests
Joe merged task T249997: Flow failing with Error contacting the Parsoid/RESTBase server (HTTP 400) into T249705: Intermittent internal API errors with Flow.
Apr 14 2020, 7:21 AM · User-Ryasmeen, VisualEditor, Growth-Team, StructuredDiscussions
Joe reopened T249997: Flow failing with Error contacting the Parsoid/RESTBase server (HTTP 400) as "Open".

Ehm, phab UI fail.

Apr 14 2020, 7:21 AM · User-Ryasmeen, VisualEditor, Growth-Team, StructuredDiscussions
Joe closed T249997: Flow failing with Error contacting the Parsoid/RESTBase server (HTTP 400) as Resolved.

This seems to be the same as T249705.

Apr 14 2020, 7:20 AM · User-Ryasmeen, VisualEditor, Growth-Team, StructuredDiscussions
Joe added a comment to T237889: Install php-ldap on all MW appservers.

Also: how temporary? Do you have a tentative timeline for transitioning wikitech to SUL?

Geologically short, but maybe not short by other measures. We need a new Developer account creation portal first. That has been discussed for at least 3 years, but not resourced by any team yet. I wouldn't count on SUL happening in less than 24 months, at best.

Apr 14 2020, 6:31 AM · serviceops, Operations, wikitech.wikimedia.org
Joe added a comment to T249705: Intermittent internal API errors with Flow.

@Joe I see your patch was merged on Friday, but it seems like there is still an instance of this problem occurring as of today (T249997); do you have any thoughts on what might be going on?

Apr 14 2020, 6:00 AM · serviceops, Parsoid, MediaWiki-Configuration, Pywikibot-Flow, Growth-Team, StructuredDiscussions, Pywikibot, Pywikibot-tests

Apr 13 2020

Joe added a comment to T237889: Install php-ldap on all MW appservers.

I don't know enough about php-ldap at the moment to have an opinion. In itself, adding a php extension to production is a big deal, but it's also easy to undo.

Apr 13 2020, 9:04 PM · serviceops, Operations, wikitech.wikimedia.org