
Joe (Giuseppe Lavagetto)
Spy

User Details

User Since
Oct 3 2014, 5:57 AM (259 w, 4 d)
Availability
Available
LDAP User
Giuseppe Lavagetto
MediaWiki User
GLavagetto (WMF) [ Global Accounts ]

Recent Activity

Today

Joe committed rOSCTf39eefdc0cde: Backport of I05914e2b3cd4 (authored by Joe).
Backport of I05914e2b3cd4
Tue, Sep 24, 6:43 AM
Joe committed rOSCTa028e4e8739d: Translate the default section name to DEFAULT in ReadOnlyBySection too (authored by Joe).
Translate the default section name to DEFAULT in ReadOnlyBySection too
Tue, Sep 24, 6:36 AM
Joe added a comment to T233534: db1075 (s3 master) crashed - BBU failure.

I'm wondering if an entry should be added under "Where did we get lucky?" along the lines of "I/We noticed this incident before SMS paging began".

Done

Was there a deliberate decision that DB master hosts appearing offline individually is not page-worthy due to the potential for false positives? For most hosts that probably makes sense. I can't think of any other class of non-dev-facing server where an individual host going offline can cause direct user-visible problems (edit: perhaps arguably IRCd or something, but that one really is a different discussion; maybe a few years ago the poolcounters or something). I do wonder if DB masters should be considered for an exception to this.

I would assume a master not paging on HOST DOWN was not done at the time because of how puppet with Icinga works and/or because it needed a decent amount of refactoring to be able to select which hosts (or roles) can or cannot page if they go down.
Also, depending on the DC, they should or shouldn't page: as of today we do not want codfw masters to page, but we'd need them to if we ran active-active.
This also comes with the discussion: what is the source of truth for database masters, puppet or mediawiki, and which should feed the other (again, a different discussion).
Having said that, I do agree that a master going down means that we are on read-only instantly, so probably they should page. I will create a task for this and add it to the actionable list on the IR.
Thanks!

Tue, Sep 24, 6:02 AM · Wikimedia-Incident, ops-eqiad, Operations, DBA
Joe added a comment to T233679: dbctl doesn't always correctly translate section names in its output.

We should perhaps also modify the schema to prevent s3 from occurring where we know it shouldn't?

Tue, Sep 24, 5:33 AM · conftool
Joe added a comment to T233534: db1075 (s3 master) crashed - BBU failure.

whether that needs changing on the desired thresholds is a different discussion.

The director of SRE was the person who decided that at the time because alerts were too annoying: https://gerrit.wikimedia.org/r/c/operations/puppet/mariadb/+/289825 I don't think we can override his decision.

Tue, Sep 24, 5:02 AM · Wikimedia-Incident, ops-eqiad, Operations, DBA

Yesterday

Joe claimed T147204: Update confd package.
Mon, Sep 23, 7:20 AM · serviceops, User-Joe, Beta-Cluster-reproducible, Operations

Fri, Sep 20

Joe awarded T232711: Deploy ripe-atlas-tools for ad-hoc network tests a Love token.
Fri, Sep 20, 10:54 AM · Operations, netops, observability
Joe committed rDEPLOYCHARTS7050dba0ce3a: Add rakefile to run helm tests (authored by Joe).
Add rakefile to run helm tests
Fri, Sep 20, 9:14 AM
Joe updated the task description for T233291: Set up CI for the deployment-charts repository.
Fri, Sep 20, 9:07 AM · Release-Engineering-Team-TODO (201909), Continuous-Integration-Config, Kubernetes, local-charts, Release Pipeline, serviceops, Operations
Joe added a comment to T233390: zuul-merger fails to fetch from Gerrit.

and yes, if you need to have a service not start when the package is installed, you need a systemd::mask definition in puppet.

Fri, Sep 20, 8:32 AM · Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (201909), Zuul, Continuous-Integration-Infrastructure
Joe added a comment to T233390: zuul-merger fails to fetch from Gerrit.

@hashar I guess the CI servers should have more relaxed thresholds? Is it even possible to configure gerrit to whitelist some host?

Fri, Sep 20, 8:31 AM · Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (201909), Zuul, Continuous-Integration-Infrastructure

Thu, Sep 19

Joe added a comment to T233298: Proposal: simplify set up of basic CI jobs for new projects.

Just to clarify - this wasn't an attempt to imagine a future, perfect system, but just a small MVP that could possibly work on our current infrastructure.

Thu, Sep 19, 4:43 PM · Release-Engineering-Team (CI & Testing services), serviceops-radar, Continuous-Integration-Infrastructure
Tarrow awarded T233298: Proposal: simplify set up of basic CI jobs for new projects a Love token.
Thu, Sep 19, 2:02 PM · Release-Engineering-Team (CI & Testing services), serviceops-radar, Continuous-Integration-Infrastructure
Joe added projects to T233298: Proposal: simplify set up of basic CI jobs for new projects: Continuous-Integration-Infrastructure, serviceops-radar, Release-Engineering-Team.
Thu, Sep 19, 1:31 PM · Release-Engineering-Team (CI & Testing services), serviceops-radar, Continuous-Integration-Infrastructure
akosiaris awarded T233298: Proposal: simplify set up of basic CI jobs for new projects a Love token.
Thu, Sep 19, 11:22 AM · Release-Engineering-Team (CI & Testing services), serviceops-radar, Continuous-Integration-Infrastructure
Joe updated the task description for T233298: Proposal: simplify set up of basic CI jobs for new projects.
Thu, Sep 19, 11:13 AM · Release-Engineering-Team (CI & Testing services), serviceops-radar, Continuous-Integration-Infrastructure
Joe created T233298: Proposal: simplify set up of basic CI jobs for new projects.
Thu, Sep 19, 11:13 AM · Release-Engineering-Team (CI & Testing services), serviceops-radar, Continuous-Integration-Infrastructure
Joe added a parent task for T216049: add CI job into operations/deployments-charts repo that helm lint packages and perform the helm index after merge.: T233291: Set up CI for the deployment-charts repository.
Thu, Sep 19, 10:36 AM · Continuous-Integration-Config, Continuous-Integration-Infrastructure, User-fsero, Kubernetes, serviceops
Joe added subtasks for T233291: Set up CI for the deployment-charts repository: T216049: add CI job into operations/deployments-charts repo that helm lint packages and perform the helm index after merge., T217868: Add tests to local-charts / configure local-charts for CI.
Thu, Sep 19, 10:36 AM · Release-Engineering-Team-TODO (201909), Continuous-Integration-Config, Kubernetes, local-charts, Release Pipeline, serviceops, Operations
Joe added a parent task for T217868: Add tests to local-charts / configure local-charts for CI: T233291: Set up CI for the deployment-charts repository.
Thu, Sep 19, 10:36 AM · Release-Engineering-Team (Local Dev), Release-Engineering-Team-TODO, local-charts, Developer Productivity
Joe claimed T233291: Set up CI for the deployment-charts repository.
Thu, Sep 19, 10:35 AM · Release-Engineering-Team-TODO (201909), Continuous-Integration-Config, Kubernetes, local-charts, Release Pipeline, serviceops, Operations
Joe added a comment to T233291: Set up CI for the deployment-charts repository.

@Jdforrester-WMF has added an experimental helm-lint job to the repository: T216049. It runs helm lint --strict charts/*/ :]
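
A minimal sketch of what such a lint pass could look like when driven from a script rather than a shell glob. It assumes `helm` is on the PATH and that each chart lives in its own directory directly under the given root; the function names `build_lint_commands` and `lint_all` are hypothetical and not taken from the deployment-charts repository or its CI job.

```python
import subprocess
from pathlib import Path


def build_lint_commands(chart_root):
    """Return one `helm lint --strict` argv list per chart directory."""
    root = Path(chart_root)
    # Mirror the shell glob charts/*/ : immediate subdirectories only, sorted
    # so the order of the lint runs is deterministic.
    chart_dirs = sorted(p for p in root.iterdir() if p.is_dir())
    return [["helm", "lint", "--strict", str(p)] for p in chart_dirs]


def lint_all(chart_root, runner=subprocess.run):
    """Run every lint command and return the chart paths that failed.

    `runner` is injectable (an assumption made for testability); by default
    it is subprocess.run, which needs the helm binary to be installed.
    """
    failures = []
    for cmd in build_lint_commands(chart_root):
        if runner(cmd).returncode != 0:
            failures.append(cmd[-1])
    return failures
```

Calling `lint_all("charts")` from the repository root would lint each chart separately and report which ones failed, instead of stopping at the first error.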

Thu, Sep 19, 10:34 AM · Release-Engineering-Team-TODO (201909), Continuous-Integration-Config, Kubernetes, local-charts, Release Pipeline, serviceops, Operations
Joe created T233291: Set up CI for the deployment-charts repository.
Thu, Sep 19, 9:38 AM · Release-Engineering-Team-TODO (201909), Continuous-Integration-Config, Kubernetes, local-charts, Release Pipeline, serviceops, Operations
Joe added a comment to T233047: Apache mod_status aggregator.

One important detail:

Thu, Sep 19, 9:10 AM · observability, Operations

Wed, Sep 18

Joe added a comment to T75181: Remove HHVM revision tag.

Just to note that when we remove the HHVM tag we might also want to remove the PHP7 one, given that the latter migration is almost over as well :P

Wed, Sep 18, 10:34 AM · Core Platform Team Workboards (Clinic Duty Team), MediaWiki-extensions-WikimediaMaintenance, Patch-For-Review, HHVM, Wikimedia-General-or-Unknown

Tue, Sep 17

Joe moved T223469: New Service Request: wikifeeds from Backlog to Doing on the serviceops board.
Tue, Sep 17, 6:34 AM · serviceops, Core Platform Team Legacy (Watching / External), Services (watching), Mobile-Content-Service, Page Content Service, Product-Infrastructure-Team-Backlog, Service-deployment-requests, Operations
Joe added a comment to T233047: Apache mod_status aggregator.

@ori I'm not 100% sure I got what information you think would be useful to extract. At first glance it would seem like collecting those data in a structured manner on logstash would be useful, but the ticket seems to suggest building a specialized interface.

Tue, Sep 17, 6:19 AM · observability, Operations

Mon, Sep 16

Joe added a comment to T180696: Terminate Thumbor with SSL.

TLS on haproxy it is then :)

Mon, Sep 16, 3:04 PM · User-jijiki, serviceops, Performance-Team (Radar), Thumbor
Joe moved T224247: upgrade and rename krypton & create its codfw equivalent from Backlog to Doing on the serviceops board.
Mon, Sep 16, 3:02 PM · serviceops, Operations
Joe closed T228836: recreate eqiad cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME], a subtask of T212123: Kubernetes clusters roadmap, as Resolved.
Mon, Sep 16, 3:00 PM · User-fsero, serviceops, Prod-Kubernetes
Joe closed T228836: recreate eqiad cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] as Resolved.
Mon, Sep 16, 3:00 PM · User-fsero, serviceops, Prod-Kubernetes
Joe moved T228836: recreate eqiad cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] from Backlog to Doing on the serviceops board.
Mon, Sep 16, 3:00 PM · User-fsero, serviceops, Prod-Kubernetes
Joe moved T231546: decom krypton.eqiad.wmnet from Backlog to Doing on the serviceops board.
Mon, Sep 16, 2:59 PM · decommission, serviceops, Operations
Joe closed T229736: Disable now-redundant mediawiki/TorBlock/loadExitNodes.php cron script as Resolved.
Mon, Sep 16, 2:59 PM · Performance-Team (Radar), serviceops, cloud-services-team, wikitech.wikimedia.org
Joe triaged T229736: Disable now-redundant mediawiki/TorBlock/loadExitNodes.php cron script as Normal priority.
Mon, Sep 16, 2:54 PM · Performance-Team (Radar), serviceops, cloud-services-team, wikitech.wikimedia.org
Joe added a comment to T232613: LBFactoryMulti.php: PHP Notice: Undefined index: .

I see no further occurrences of the bug in logstash either for mw1347 - not that I had many doubts at this point.

Mon, Sep 16, 10:55 AM · Patch-For-Review, MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), Core Platform Team Workboards (Clinic Duty Team), Wikimedia-Rdbms, PHP 7.2 support, Wikimedia-production-error
Joe added a comment to T232613: LBFactoryMulti.php: PHP Notice: Undefined index: .

But it's now a mystery as to why opcache.interned_strings_buffer = 0 helps, since this works without opcache.

Mon, Sep 16, 10:36 AM · Patch-For-Review, MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), Core Platform Team Workboards (Clinic Duty Team), Wikimedia-Rdbms, PHP 7.2 support, Wikimedia-production-error

Sun, Sep 15

Joe updated subscribers of T232613: LBFactoryMulti.php: PHP Notice: Undefined index: .

As can be seen on logstash, as @Daimona mentioned, errors suddenly stopped after I disabled the interned strings buffer. While it's too early to fully evaluate any impact on performance, I would say that the effect is clearly not very penalizing. See the 95th percentile for successful responses for mw1348.

Sun, Sep 15, 7:49 PM · Patch-For-Review, MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), Core Platform Team Workboards (Clinic Duty Team), Wikimedia-Rdbms, PHP 7.2 support, Wikimedia-production-error

Fri, Sep 13

Joe added a comment to T232613: LBFactoryMulti.php: PHP Notice: Undefined index: .

We have a first core dump on mw1348 - I moved it under /root/T232613, if you need access ping me on IRC.

Fri, Sep 13, 8:27 AM · Patch-For-Review, MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), Core Platform Team Workboards (Clinic Duty Team), Wikimedia-Rdbms, PHP 7.2 support, Wikimedia-production-error
Joe added a comment to T230570: De-noise systemd alerts (Reduce Icinga alert noise goal).

It's not true that "important services are monitored via dedicated service specific checks"; quite the contrary on a lot of systems. I would rather improve the systemd alert instead of silencing it, and maybe finally be done with using those hacky checks for the number of running processes.

Fri, Sep 13, 6:28 AM · Patch-For-Review, Goal, observability

Thu, Sep 12

Joe added a comment to T232692: Should MediaWiki stop storing sessions on the server?.

Storing session state on server side (whether in memcached or whatever) instead of as an encrypted blob on the client has lots of upsides

  • It's much harder to screw up from a security perspective
  • You are not tossing around large blobs (I assume in this proposal everything in $_SESSION would be stored as an encrypted JWT token?)

I'm not really aware of any significant scalability concerns that would justify backing away from these upsides, but the ops aspect of session storage is not an area I follow too closely

Thu, Sep 12, 4:18 PM · MediaWiki-Authentication-and-authorization
Joe added a comment to T232692: Should MediaWiki stop storing sessions on the server?.

So what's the specific problem with regard to MediaWiki? Is this blocking something? Would it enable us to do something of particular value in our contexts here?

I think the problem (in my mind) is that MediaWiki is storing state on the server, that isn't necessary to store. In other words, it creates a scalability and reliability hurdle that, when removed, simplifies the infrastructure needed to run MediaWiki.

Thu, Sep 12, 4:12 PM · MediaWiki-Authentication-and-authorization
Joe added a comment to T232613: LBFactoryMulti.php: PHP Notice: Undefined index: .

mw1347 and mw1348 receive more traffic than the rest of the php api servers, so it makes sense this happens more frequently there.

Thu, Sep 12, 9:08 AM · Patch-For-Review, MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), Core Platform Team Workboards (Clinic Duty Team), Wikimedia-Rdbms, PHP 7.2 support, Wikimedia-production-error
Joe added a comment to T231089: WikibaseClient.php: PHP Notice: Undefined index:.

Smells like T229433. Which is also about '' array index, and PHP 7.2. It's obviously a bug in PHP 7.2, but I've not been able to find evidence that it is due to opcache/T224491.

Thu, Sep 12, 9:00 AM · PHP 7.2 support, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata, Wikimedia-production-error
Joe added a comment to T231089: WikibaseClient.php: PHP Notice: Undefined index:.
Thu, Sep 12, 8:59 AM · PHP 7.2 support, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata, Wikimedia-production-error
Joe closed T232698: 503 errors when trying to log in to Wikimedia sites as Resolved.

Hi, we had some connectivity issues earlier. As soon as we were alerted and started checking, the issues recovered. We suspect the root cause to be a network maintenance ongoing at the time, but the problem is now resolved.

Thu, Sep 12, 6:00 AM · netops, Traffic, Operations

Wed, Sep 11

Daimona awarded T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters a Love token.
Wed, Sep 11, 9:40 AM · Patch-For-Review, Performance-Team (Radar), User-jijiki, Operations, serviceops
Joe moved T226516: deploy CoreDNS as a in-cluster DNS service from Next up to Doing on the serviceops board.
Wed, Sep 11, 7:20 AM · serviceops
Joe moved T180696: Terminate Thumbor with SSL from Next up to Backlog on the serviceops board.
Wed, Sep 11, 7:20 AM · User-jijiki, serviceops, Performance-Team (Radar), Thumbor
Joe closed T224857: Enhance MediaWiki deployments for support of php7.x, a subtask of T176370: Migrate to PHP 7 in WMF production, as Resolved.
Wed, Sep 11, 7:17 AM · MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), CPT Initiatives (PHP7 (TEC4)), Patch-For-Review, TechCom-RFC (TechCom-Approved), User-ArielGlenn, HHVM, Operations
Joe closed T224857: Enhance MediaWiki deployments for support of php7.x as Resolved.

This is now 99% done. We just need a confctl release to be able to make scap pull work as intended. Resolving this task though, as the work on the mw deployment side has been completed.

Wed, Sep 11, 7:16 AM · Release-Engineering-Team-TODO (201909), Release-Engineering-Team (Deployment services), Patch-For-Review, User-jijiki, PHP 7.2 support, Scap, serviceops
Joe moved T224857: Enhance MediaWiki deployments for support of php7.x from Externally Blocked to Doing on the serviceops board.
Wed, Sep 11, 7:14 AM · Release-Engineering-Team-TODO (201909), Release-Engineering-Team (Deployment services), Patch-For-Review, User-jijiki, PHP 7.2 support, Scap, serviceops
Joe moved T228965: set up limitranges and resourcequotas to protect the cluster from resource abuse and starvation from Doing to Backlog on the serviceops board.
Wed, Sep 11, 7:14 AM · User-fsero, serviceops, Prod-Kubernetes
Joe moved T228967: Set up PodSecurityPolicies in clusters from Doing to Backlog on the serviceops board.
Wed, Sep 11, 7:14 AM · Patch-For-Review, User-fsero, serviceops, Prod-Kubernetes
Joe closed T228837: recreate codfw cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME], a subtask of T212123: Kubernetes clusters roadmap, as Resolved.
Wed, Sep 11, 7:13 AM · User-fsero, serviceops, Prod-Kubernetes
Joe closed T228837: recreate codfw cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] as Resolved.
Wed, Sep 11, 7:13 AM · User-fsero, serviceops, Prod-Kubernetes
Joe closed T232233: mw1317 issue: "DatabaseMysqli.php: Class undefined: stdClass" as Resolved.

I'm not sure about php7 - it was fixed when I got around to fixing this machine.

Wed, Sep 11, 7:12 AM · Performance-Team (Radar), Wikimedia-Rdbms, serviceops
Joe added a comment to T232128: Make MultiHttpClient use CURLMOPT_MAX_HOST_CONNECTIONS and reuse connections.

While I support the use of this patch, the problem you're seeing should be greatly mitigated when we start using a middleware to manage service-to-service RPC. For now that's still in its infancy, but we already use that approach for cirrussearch, where requests are proxied via a local nginx on each appserver.

Wed, Sep 11, 6:31 AM · MediaWiki-libs-HTTP, Patch-For-Review, Performance-Team (Radar), Core Platform Team Workboards (Clinic Duty Team)
Joe added a comment to T232538: Make the parsoid server on the beta cluster a mediawiki app server.

hi @ssastry just a clarification: how would we load the parsoid code, if it can't be merged in the wmf vendor repository? Same way we do on scandium?

Wed, Sep 11, 6:26 AM · Patch-For-Review, Beta-Cluster-Infrastructure, Core Platform Team Workboards (Purple), RESTBase, Parsoid-PHP

Sat, Sep 7

Joe edited P9058 Various failing mtrs.
Sat, Sep 7, 2:39 PM
Joe created P9058 Various failing mtrs.
Sat, Sep 7, 2:37 PM
Joe created P9057 From Vodafone IT.
Sat, Sep 7, 2:00 PM

Fri, Sep 6

Joe committed rOSCTf5dbddac4e2c: Fix configuration file lookup when running with sudo (authored by Joe).
Fix configuration file lookup when running with sudo
Fri, Sep 6, 1:45 PM
Joe committed rOSCTcfc1388ec26f: kvobject: fix some class property ordering (authored by Joe).
kvobject: fix some class property ordering
Fri, Sep 6, 1:10 PM
Joe created P9048 Semantic Versioning.
Fri, Sep 6, 11:00 AM
Joe added a comment to T231027: Cassandra instances outages (was: Outage of restbase2017-b).
Fri, Sep 6, 9:09 AM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team)
Joe created MediaWiki-extensions-WebToolsManager.
Fri, Sep 6, 5:41 AM

Thu, Sep 5

Joe added a comment to T232035: 1.34.0-wmf.21 cause termbox to emit: Test get rendered termbox returned the unexpected status 500.

So the real issue was:

  • termbox correctly uses the api-ro.discovery.wmnet host
  • the discovery record was incorrectly set to active-active
  • so requests from termbox would just go to the nearest dc, meaning that in codfw it would face super-cold caches after every release
  • as a consequence, some requests would time out because of the cold caches at all levels
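
The routing difference behind this failure can be illustrated with a toy model. This is a sketch under stated assumptions, not how the discovery DNS is actually implemented: the function `resolve_discovery`, the mode strings, and the choice of eqiad as the primary datacenter are all hypothetical names used for illustration.

```python
def resolve_discovery(mode, client_dc, primary_dc):
    """Pick the datacenter a discovery hostname would route a client to.

    mode: "active-active" sends each client to its own (nearest) DC;
          "active-passive" sends every client to the primary DC.
    This is a simplified model, not the real DNS discovery logic.
    """
    if mode == "active-active":
        return client_dc
    if mode == "active-passive":
        return primary_dc
    raise ValueError(f"unknown mode: {mode!r}")


# A termbox instance in codfw, with eqiad assumed to be the primary DC:
# the misconfigured record keeps requests local, where caches are cold.
misconfigured = resolve_discovery("active-active", "codfw", "eqiad")
# With an active-passive record the same request goes to the primary DC.
intended = resolve_discovery("active-passive", "codfw", "eqiad")
```

In this model `misconfigured` is the local datacenter and `intended` is the primary one, which is the distinction the list above describes.
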
Thu, Sep 5, 8:55 AM · Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), serviceops, Wikidata, Wikidata-Termbox, Release-Engineering-Team-TODO (201909), Release, Train Deployments
Joe updated subscribers of T224857: Enhance MediaWiki deployments for support of php7.x.

I did some tests, and we still have one problem with scap pull:

Thu, Sep 5, 8:26 AM · Release-Engineering-Team-TODO (201909), Release-Engineering-Team (Deployment services), Patch-For-Review, User-jijiki, PHP 7.2 support, Scap, serviceops
Joe added a comment to T232035: 1.34.0-wmf.21 cause termbox to emit: Test get rendered termbox returned the unexpected status 500.

I think this is probably the same as T229313. We suspected it might be related to T231011; perhaps the new train makes this problem more pronounced

Thu, Sep 5, 7:40 AM · Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), serviceops, Wikidata, Wikidata-Termbox, Release-Engineering-Team-TODO (201909), Release, Train Deployments

Wed, Sep 4

Joe added a comment to T231192: mw2231 is down and unable to reboot.

@Papaul that looks fine - I don't think we need to swap out the SSDs, so just do it if we have a better use of those disks (they're pretty useless on an appserver).

Wed, Sep 4, 4:17 PM · ops-codfw, DC-Ops, Operations
Joe added a comment to T224857: Enhance MediaWiki deployments for support of php7.x.

@thcipriani should we create a new package/release?

Wed, Sep 4, 6:56 AM · Release-Engineering-Team-TODO (201909), Release-Engineering-Team (Deployment services), Patch-For-Review, User-jijiki, PHP 7.2 support, Scap, serviceops

Tue, Sep 3

Joe added a comment to T231192: mw2231 is down and unable to reboot.

I second what @MoritzMuehlenhoff suggested. The system is not scheduled for replacement for another 2 years, so if we can salvage it somehow, that'd be great.

Tue, Sep 3, 4:47 AM · ops-codfw, DC-Ops, Operations

Mon, Sep 2

Joe added a comment to T229686: #dbctl: manage 'externalLoads' data.

A couple comments:

  • I concur with @Volans: I'd keep the first iteration (at least) *very* simple
  • I know adding tags to a schema is a pain (in fact, it will need a data migration) but the flavour thing you were proposing seems like the kind of thing that should be a tag, so scope=eqiad,flavour=main could be a set of tags for an instance object
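
To make the tag idea concrete, here is a minimal sketch of parsing a tag string such as `scope=eqiad,flavour=main` and matching an instance against a selector. It does not use or describe conftool's real API; `parse_tags` and `matches` are hypothetical helpers, and rejecting duplicate keys is an assumption about how such tags would behave.

```python
def parse_tags(spec):
    """Parse 'key=value,key=value' into a dict.

    Raises ValueError on malformed items or duplicate keys.
    """
    tags = {}
    for item in spec.split(","):
        key, sep, value = item.strip().partition("=")
        if not sep or not key or not value:
            raise ValueError(f"malformed tag: {item!r}")
        if key in tags:
            raise ValueError(f"duplicate tag: {key!r}")
        tags[key] = value
    return tags


def matches(instance_tags, selector):
    """True if every tag in the selector is present with the same value."""
    return all(instance_tags.get(k) == v for k, v in selector.items())
```

With this shape, selecting all `main` instances regardless of scope is just `matches(tags, {"flavour": "main"})`, which is one reason a tag fits better than a dedicated field.
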
Mon, Sep 2, 7:07 AM · Performance-Team (Radar), DBA, conftool
Joe updated the task description for T227541: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC).
Mon, Sep 2, 5:39 AM · DC-Ops, Operations, ops-eqiad
Joe updated the task description for T227541: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC).
Mon, Sep 2, 5:38 AM · DC-Ops, Operations, ops-eqiad

Tue, Aug 27

Joe committed rOSCT9af0cc2c074b: Remove the service object for the default schema (authored by Joe).
Remove the service object for the default schema
Tue, Aug 27, 7:58 AM

Mon, Aug 26

Joe triaged T231200: CI performance issues as Unbreak Now! priority.

For context, the actual time to run the tests for operations/puppet is under one minute for most patches.

Mon, Aug 26, 1:46 PM · Patch-For-Review, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (201908), Continuous-Integration-Infrastructure
Joe triaged T231192: mw2231 is down and unable to reboot as Normal priority.
Mon, Aug 26, 9:52 AM · ops-codfw, DC-Ops, Operations
Joe created T231192: mw2231 is down and unable to reboot.
Mon, Aug 26, 9:52 AM · ops-codfw, DC-Ops, Operations
Joe closed T231016: expand list of those who have permissions to edit the #wikimedia-operations topic as Resolved.
Mon, Aug 26, 9:25 AM · Operations
Joe added a comment to T231016: expand list of those who have permissions to edit the #wikimedia-operations topic.

I did some cleanup removing non-sres and adding a few people from the US TZ.

Mon, Aug 26, 9:25 AM · Operations
Joe added a comment to T231009: Make jobprocessor's test not depend on external files.

@Mathew.onipe can I ask for further details on the error you get? It should definitely not be an issue if the test works in a docker image locally.

Mon, Aug 26, 8:24 AM · Release Pipeline, Operations, Maps (Kartotherian)
Joe triaged T231009: Make jobprocessor's test not depend on external files as Normal priority.
Mon, Aug 26, 8:19 AM · Release Pipeline, Operations, Maps (Kartotherian)
Joe added a comment to T216750: Article recommendation API: replace WDQS with MW API.

@leila I just stumbled upon this task, and besides being happy that patch was merged I'm asking myself:

Mon, Aug 26, 8:16 AM · Patch-For-Review, Google-Summer-of-Code (2019), Article-Recommendation, Research, Recommendation-API
Joe triaged T231086: Picture from Commons not found from Singapore as Normal priority.
Mon, Aug 26, 8:13 AM · User-fgiunchedi, Structured-Data-Backlog, Structured Data Engineering, Multimedia, MW-1.34-notes (1.34.0-wmf.21; 2019-09-03), Patch-For-Review, MediaWiki-File-management, Commons, media-storage, Traffic, Operations
Joe added a comment to T231119: Uploading a big PDF file failed.

A file of 473 MB surely goes over the large file limits unless something changed recently.
https://commons.wikimedia.org/wiki/Help:Server-side_upload still seems to suggest you should request that. Untagging Operations as I don't think there is anything SRE should do here.

All of upload_by_url, bigChunkedUpload and the UploadWizard are supposed to allow uploads up to 4 GB. If that's not the case, please say it clearly and fix the documentation. Otherwise, this is a bug.

Mon, Aug 26, 7:42 AM · User-Urbanecm, Commons, Internet-Archive
Joe removed projects from T231119: Uploading a big PDF file failed: serviceops, Operations.
Mon, Aug 26, 6:51 AM · User-Urbanecm, Commons, Internet-Archive
Joe added a comment to T231119: Uploading a big PDF file failed.

A file of 473 MB surely goes over the large file limits unless something changed recently.

Mon, Aug 26, 6:51 AM · User-Urbanecm, Commons, Internet-Archive
Joe updated subscribers of T229980: Need help to create and deploy Debian-packaged Python 3 app.

BTW I see the patch is still under review, and @Volans is on it.

Mon, Aug 26, 5:59 AM · serviceops, Operations, Packaging, CPT Initiatives (Session Management Service (CDP2))
Joe added a comment to T229980: Need help to create and deploy Debian-packaged Python 3 app.

Packaging is primarily handled by serviceops / Operations (meta: which should I tag here, SRE folks?)

Mon, Aug 26, 5:58 AM · serviceops, Operations, Packaging, CPT Initiatives (Session Management Service (CDP2))

Aug 23 2019

Joe created P8970 Thank you puppet, you're the gift that keeps on giving..
Aug 23 2019, 5:22 PM
Joe added a comment to T231011: Mysterious, coordinated slowdowns every ~ 25 minutes on mw1347,mw1348 (php7 api servers).

The first smoking gun: in all the intervals I checked, the offender was parsoid-batch with quite large requests. I'm trying to gather more cases to build better statistics.

Aug 23 2019, 1:35 PM · PHP 7.2 support, serviceops, Operations
Ladsgroup awarded T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters a Love token.
Aug 23 2019, 10:27 AM · Patch-For-Review, Performance-Team (Radar), User-jijiki, Operations, serviceops
Joe added a comment to T231063: Allow blocking requests from specific networks on the edge.

I think it's good to have a first, simple implementation, like the one above, but I think going further we would need a "block" object in puppet (or elsewhere, more on that below) that includes:

Aug 23 2019, 8:17 AM · Operations, Traffic

Aug 22 2019

Joe added a comment to T204056: Move wikimedia.ee under WM-EE.

@Dzahn Can you confirm how @tramm should configure the MX records? I think he'll need to add 18-19 from here to WMEE's elkdata DNS info, but want to make sure that's correct.
Is there anything else SRE needs to do before I change the nameserver with the registrar?

Aug 22 2019, 4:35 PM · WMF-Legal, Patch-For-Review, Operations, Domains, Traffic
Joe added a comment to T229697: Investigate Kask request latency.

@WDoranWMF we will get to this as soon as our resources allow it.

Aug 22 2019, 2:27 PM · User-Eevans, Core Platform Team Workboards (Clinic Duty Team), CPT Initiatives (Session Management Service (CDP2)), Performance-Team (Radar)
Joe triaged T231011: Mysterious, coordinated slowdowns every ~ 25 minutes on mw1347,mw1348 (php7 api servers) as High priority.
Aug 22 2019, 2:20 PM · PHP 7.2 support, serviceops, Operations
Joe created T231011: Mysterious, coordinated slowdowns every ~ 25 minutes on mw1347,mw1348 (php7 api servers).
Aug 22 2019, 2:20 PM · PHP 7.2 support, serviceops, Operations
Joe triaged T230861: PHP 7.2 is very slow on an allocation-intensive benchmark as Normal priority.
Aug 22 2019, 10:12 AM · PHP 7.3 support, PHP 7.2 support, serviceops, Operations