Joe (Giuseppe Lavagetto)
Spy

User Details

User Since
Oct 3 2014, 5:57 AM (277 w, 5 d)
Availability
Available
LDAP User
Giuseppe Lavagetto
MediaWiki User
GLavagetto (WMF)

Recent Activity

Mon, Jan 27

Joe triaged T243803: API action=parse should be poolcounter-limited if a re-parse is necessary as High priority.
Mon, Jan 27, 11:41 PM · Wikimedia-Incident, MediaWiki-Parser, Core Platform Team, serviceops, Operations
Joe created T243803: API action=parse should be poolcounter-limited if a re-parse is necessary.
Mon, Jan 27, 11:41 PM · Wikimedia-Incident, MediaWiki-Parser, Core Platform Team, serviceops, Operations

Sun, Jan 26

Joe created P10271 Stack trace.
Sun, Jan 26, 8:29 PM

Thu, Jan 23

Joe created T243520: Decommission the "session redis" cluster.
Thu, Jan 23, 3:26 PM · Core Platform Team, Goal, Operations
Joe added a parent task for T211250: Create a mediawiki::cronjob define: T243314: 2020 Q3 DC switchover and switchback.
Thu, Jan 23, 2:43 PM · Patch-For-Review, serviceops, User-jijiki, Operations
Joe added a subtask for T243314: 2020 Q3 DC switchover and switchback: T211250: Create a mediawiki::cronjob define.
Thu, Jan 23, 2:43 PM · Goal, Operations
Joe added a comment to T242309: Onboarding Hugh Nowlan.

@Dzahn can we please ensure this procedure is finished before next week?

Thu, Jan 23, 7:57 AM · serviceops-radar, Core Platform Team Workboards (Clinic Duty Team), Operations, LDAP-Access-Requests, SRE-Access-Requests
Joe updated the task description for T242309: Onboarding Hugh Nowlan.
Thu, Jan 23, 7:57 AM · serviceops-radar, Core Platform Team Workboards (Clinic Duty Team), Operations, LDAP-Access-Requests, SRE-Access-Requests

Wed, Jan 22

Joe added a comment to T240884: Standalone service to evaluate user-provided regular expressions.

One complicating factor here is that neither AbuseFilter nor SpamBlacklist has a clear maintainer.

I think @Daimona is understood to be the de facto AF maintainer these days (trusted dev, wmf-NDA, etc.) and is pretty active in its current development.

So, I'm going to answer for myself. I think a re2-like solution would indeed improve performance [1] for regexp-related extensions: AbuseFilter and SpamBlacklist for sure, but also TitleBlacklist, and CentralAuth as of T101615. Given the number of possible consumers, I believe that a reusable service would be the best choice.
Of note, there's also T187669 about adding a static ReDoS validator, in case you want to explore it as an alternative.
[1] - About AbuseFilter performance, some numbers are on grafana, and there's also a dashboard on logstash, although regexps aren't the only cause of slowness.

Wed, Jan 22, 9:43 PM · User-Addshore, TechCom-RFC, Wikidata
Joe added a comment to T240943: Security Concept Review For new CI.
Wed, Jan 22, 9:56 AM · SecTeam Discussion, Security-Team, Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), Security Concept Review

Tue, Jan 21

Joe committed rDEPLOYCHARTS666c256139f8: Enable TLS on citoid (authored by Joe).
Enable TLS on citoid
Tue, Jan 21, 7:14 AM
Joe committed rDEPLOYCHARTS48a2a28b2164: Fix citoid chart tar (authored by Joe).
Fix citoid chart tar
Tue, Jan 21, 7:13 AM
Joe committed rDEPLOYCHARTS58fac378dcd2: citoid: add TLS termination (authored by Joe).
citoid: add TLS termination
Tue, Jan 21, 7:07 AM

Fri, Jan 17

Joe added a comment to T211250: Create a mediawiki::cronjob define.

We should complete this work before we perform the MediaWiki switchover.

Fri, Jan 17, 11:37 AM · Patch-For-Review, serviceops, User-jijiki, Operations
Joe updated the task description for T211250: Create a mediawiki::cronjob define.
Fri, Jan 17, 11:34 AM · Patch-For-Review, serviceops, User-jijiki, Operations
Joe moved T237407: basic prometheus monitoring for PoolCounter from Backlog to Doing on the serviceops board.
Fri, Jan 17, 8:41 AM · Operations, observability, serviceops
Joe moved T241852: rack/setup/install new codfw mw systems from Backlog to Doing on the serviceops board.
Fri, Jan 17, 8:38 AM · ops-codfw, serviceops, Operations
Joe updated subscribers of T212934: etcd1004-1006 is unused and idle, use the cluster or kill it..

@akosiaris didn't we just decommission this cluster last week?

Fri, Jan 17, 8:37 AM · decommission, Prod-Kubernetes, Kubernetes, serviceops
Joe moved T212934: etcd1004-1006 is unused and idle, use the cluster or kill it. from Backlog to Doing on the serviceops board.
Fri, Jan 17, 8:36 AM · decommission, Prod-Kubernetes, Kubernetes, serviceops
Joe triaged T236699: Build a black-box httpd testing framework as Medium priority.
Fri, Jan 17, 8:36 AM · Wikimedia-Apache-configuration, Operations, serviceops
Joe moved T242775: Hundreds of tags for `wikimedia/mediawiki-core` image from Backlog to Doing on the serviceops board.
Fri, Jan 17, 8:35 AM · Release-Engineering-Team, serviceops, Operations
Joe added a comment to T237038: Archive operations/debs/hhvm repository.

We should also remove all the stale docker images.

Fri, Jan 17, 8:32 AM · Patch-For-Review, Phabricator, serviceops, Cleanup, HHVM, Repository-Admins
Joe changed the status of T237038: Archive operations/debs/hhvm repository from Stalled to Open.
Fri, Jan 17, 8:32 AM · Patch-For-Review, Phabricator, serviceops, Cleanup, HHVM, Repository-Admins
Joe added a comment to T236942: Outdated Blubber package 0.6 in repo.

Retagged for release engineering as they're managing blubber.

Fri, Jan 17, 8:30 AM · Release-Engineering-Team
Joe edited projects for T236942: Outdated Blubber package 0.6 in repo, added: Release-Engineering-Team; removed serviceops.
Fri, Jan 17, 8:30 AM · Release-Engineering-Team
Joe added a comment to T230037: Create warmup procedure for MediaWiki app servers.

I am not convinced this is a great idea.

Fri, Jan 17, 7:28 AM · Release-Engineering-Team, serviceops, Performance-Team
Joe triaged T236008: New Deployment charts should allow exposing services via TLS as Medium priority.
Fri, Jan 17, 7:24 AM · Kubernetes, serviceops, Operations
Joe moved T236008: New Deployment charts should allow exposing services via TLS from Backlog to Doing on the serviceops board.
Fri, Jan 17, 7:24 AM · Kubernetes, serviceops, Operations
Joe triaged T238774: Provide the official production base images for Wikimedia use as Medium priority.
Fri, Jan 17, 7:22 AM · serviceops, Release-Engineering-Team, Release Pipeline
Joe added a comment to T238774: Provide the official production base images for Wikimedia use.

This task is generic enough, in this form, that I'm not sure whether it's a good idea.

Fri, Jan 17, 7:20 AM · serviceops, Release-Engineering-Team, Release Pipeline

Thu, Jan 16

Joe added a comment to T242775: Hundreds of tags for `wikimedia/mediawiki-core` image.

The total number of images present on the registry is 1003. I'm going to slowly remove most of the old ones in the coming week.

Thu, Jan 16, 8:04 AM · Release-Engineering-Team, serviceops, Operations
Joe claimed T242775: Hundreds of tags for `wikimedia/mediawiki-core` image.
Thu, Jan 16, 8:03 AM · Release-Engineering-Team, serviceops, Operations

Wed, Jan 15

Reedy defrocked Joe.
Wed, Jan 15, 3:59 PM
People empowered Joe as an administrator.
Wed, Jan 15, 8:54 AM

Tue, Jan 14

Joe triaged T242775: Hundreds of tags for `wikimedia/mediawiki-core` image as High priority.
Tue, Jan 14, 5:47 PM · Release-Engineering-Team, serviceops, Operations
Joe added a comment to T242715: Anycast for webproxies.

One problem I see with this is that proxy IPs regularly get banned by third-party services by accident. So having multiple *external* IPs, and being able to switch between them, is a plus.

Tue, Jan 14, 8:38 AM · Operations

Mon, Jan 13

Jdforrester-WMF awarded T242604: Remove obsoleted docker images a Like token.
Mon, Jan 13, 4:29 PM · User-brennen, Operations, Release Pipeline, Release-Engineering-Team, serviceops
Joe created T242604: Remove obsoleted docker images.
Mon, Jan 13, 1:32 PM · User-brennen, Operations, Release Pipeline, Release-Engineering-Team, serviceops
Joe closed T241206: Report image metadata to debmonitor as Resolved.
Mon, Jan 13, 1:27 PM · docker-pkg, Operations, SRE-tools, serviceops
Joe added a comment to T240884: Standalone service to evaluate user-provided regular expressions.

I think the main question to answer is "does it make sense to create a safe regex evaluation service?".

Mon, Jan 13, 10:39 AM · User-Addshore, TechCom-RFC, Wikidata
Joe added a comment to T240884: Standalone service to evaluate user-provided regular expressions.

Though this is mainly an implementation detail and not significant in terms of requirements or pros/cons.

I disagree for a couple of reasons. First, gRPC is faster: according to some measurements in ASP.net (not in PHP), it's seven times faster than HTTP/JSON. That would be an important factor in deciding whether we should go with a standalone service or another direction.
The other reason I think it's important is that this would be the first time we use gRPC in production, which means introducing new dependencies (in PHP) and services; this is cross-cutting and would involve more work from services and SRE than an HTTP+JSON solution. Finally, the API spec of the regex implementation is hard to undo, as it will be used in several places, not just one part of Wikibase.

Mon, Jan 13, 10:25 AM · User-Addshore, TechCom-RFC, Wikidata

Sat, Jan 11

Joe updated subscribers of T236800: Ensure apcu incr/decr are atomic (Upgrade php-apcu).
Sat, Jan 11, 7:25 AM · Performance-Team (Radar), MediaWiki-Cache, Core Platform Team, serviceops

Thu, Jan 9

Joe added a comment to T218733: Migrate mobileapps to k8s and node 10.

Hi everyone, serviceops needs to decommission the scb cluster by April 2020, so work on the Kubernetes migration should be prioritized accordingly.

Thu, Jan 9, 9:18 PM · Product-Infrastructure-Team-Backlog, Page Content Service, Mobile-Content-Service
Joe added a comment to T241202: ServiceChecker still calls Parsoid/JS.

This happens because parsoid/js is still deployed in production, so it also gets monitoring.

Thu, Jan 9, 6:44 AM · Mobile-Content-Service, Product-Infrastructure-Team-Backlog

Wed, Jan 8

Joe closed T242226: VE over en.wiki gives a 502 error as Resolved.

Hi all, as you have noticed, we were working on a resolution and things should be OK now. An incident report will be published at a later time at https://wikitech.wikimedia.org/wiki/Incident_documentation

Wed, Jan 8, 4:30 PM
Joe added a comment to T242228: Current performance issues.

An incident report will be published later on wikitech at https://wikitech.wikimedia.org/wiki/Incident_documentation

Wed, Jan 8, 4:29 PM · Performance Issue, Traffic, Operations
Joe closed T242228: Current performance issues as Resolved.

Hi, thanks for your report!

Wed, Jan 8, 4:28 PM · Performance Issue, Traffic, Operations
Joe created T242200: Docker registry needs cache to vary on Accept header value.
Wed, Jan 8, 9:23 AM · Traffic, Operations

Tue, Jan 7

Joe closed T122825: Service Ownership and Maintenance as Resolved.
Tue, Jan 7, 3:47 PM · Core Platform Team, TechCom, User-mobrovac, Operations
Joe closed T122825: Service Ownership and Maintenance, a subtask of T141066: Identify "first responders" for "all" "components" deployed on Wikimedia servers, as Resolved.
Tue, Jan 7, 3:47 PM · RelEng-Archive-FY201718-Q1, Wikimedia-Incident, User-greg
Joe changed the status of T242023: Add alert for app servers in prod serving outdated MediaWiki branches, a subtask of T218412: Define a mediawiki "version", from Open to Stalled.
Tue, Jan 7, 6:11 AM · Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, serviceops, Scap
Joe changed the status of T242023: Add alert for app servers in prod serving outdated MediaWiki branches from Open to Stalled.

This isn't going to happen until some effort is put into making scap's management of data saner.

Tue, Jan 7, 6:11 AM · Patch-For-Review, observability, serviceops, Wikimedia-Incident

Dec 23 2019

Joe created P10010 registry omg.
Dec 23 2019, 11:47 AM
Joe created P10009 Filters for debmonitor report.
Dec 23 2019, 9:39 AM
Joe closed T231011: (euwiki) Mysterious, coordinated slowdowns every ~ 25 minutes on API servers as Resolved.

Yes, this is yet another WTF resolved by killing parsoid-js.

Dec 23 2019, 8:05 AM · PHP 7.2 support, serviceops, Operations

Dec 20 2019

Joe claimed T241206: Report image metadata to debmonitor.
Dec 20 2019, 6:56 AM · docker-pkg, Operations, SRE-tools, serviceops
Joe created T241206: Report image metadata to debmonitor.
Dec 20 2019, 6:56 AM · docker-pkg, Operations, SRE-tools, serviceops

Dec 19 2019

Joe added a comment to T235411: Add TLS termination to services running on kubernetes.

Port reservations are for now indicated here: https://wikitech.wikimedia.org/wiki/Service_ports

Dec 19 2019, 2:15 PM · Patch-For-Review, Kubernetes, serviceops, Operations
Joe updated the task description for T235411: Add TLS termination to services running on kubernetes.
Dec 19 2019, 2:12 PM · Patch-For-Review, Kubernetes, serviceops, Operations

Dec 18 2019

Joe updated subscribers of T122825: Service Ownership and Maintenance.

I think most of the issues described here have since been solved by the implementation of the code stewardship review process and a list of developers/maintainers. @Pchelolo @Eevans @Clarakosi any opinions?

Dec 18 2019, 9:35 PM · Core Platform Team, TechCom, User-mobrovac, Operations
Joe added a comment to T240775: Support PHP 7.4 preload.

The model of MW-in-containers we're heading for will have each PHP process (within the container) running only one version of MW at a time ('hetdeploy' as it were will be done at a larger scale, with blue/green etc. container deployment), so this isn't out of scope for Wikimedia usage eventually.

Dec 18 2019, 9:20 PM · TechCom-RFC
Joe added a comment to T240990: cergen should output unencrypted key file for use with envoyproxy kubernetes sidecars.

@Joe, 2 Qs for you:

  • openssl ec ... -out outputs in 'Traditional SSL' serialization format. Python cryptography suggests we use PKCS8, and all the other .pem files are doing the same. Do we need the Traditional format for this file, or can we use PKCS8 too?
Dec 18 2019, 5:30 AM · Analytics-Kanban, Analytics, serviceops, Operations

Dec 17 2019

Joe created P9917 puppet-merge fail..
Dec 17 2019, 5:06 PM
Joe committed rDEPLOYCHARTS1d6ed58a44ed: cxserver: remove the securityContext from the pod (authored by Joe).
cxserver: remove the securityContext from the pod
Dec 17 2019, 8:23 AM
Joe committed rDEPLOYCHARTS9ce4b3c1b152: cxserver: enable TLS in production (authored by Joe).
cxserver: enable TLS in production
Dec 17 2019, 6:56 AM
Joe lowered the priority of T240518: Some jobs are not being processed / are processed slowly from High to Medium.

All jobs have recovered. Not closing the task as we need to reduce concurrency again.

Dec 17 2019, 5:34 AM · Wikimedia-Incident, SDC-Statements (Machine-vision-depicts), MachineVision, Structured-Data-Backlog, Product-Infrastructure-Team-Backlog, MW-1.35-notes (1.35.0-wmf.10; 2019-12-10), Core Platform Team Workboards (Clinic Duty Team), WMF-JobQueue, Operations

Dec 16 2019

Joe added a comment to T240518: Some jobs are not being processed / are processed slowly.

Thank you @mobrovac for the help!

Dec 16 2019, 11:06 PM · Wikimedia-Incident, SDC-Statements (Machine-vision-depicts), MachineVision, Structured-Data-Backlog, Product-Infrastructure-Team-Backlog, MW-1.35-notes (1.35.0-wmf.10; 2019-12-10), Core Platform Team Workboards (Clinic Duty Team), WMF-JobQueue, Operations
Joe added a comment to T240518: Some jobs are not being processed / are processed slowly.

I sometimes see an error like the following:

Dec 16 2019, 5:19 PM · Wikimedia-Incident, SDC-Statements (Machine-vision-depicts), MachineVision, Structured-Data-Backlog, Product-Infrastructure-Team-Backlog, MW-1.35-notes (1.35.0-wmf.10; 2019-12-10), Core Platform Team Workboards (Clinic Duty Team), WMF-JobQueue, Operations
Joe committed rDEPLOYCHARTSa81c25390d16: cxserver: add TLS termination (authored by Joe).
cxserver: add TLS termination
Dec 16 2019, 3:48 PM
Joe committed rDEPLOYCHARTSc0d1a9dce53b: blubberoid: release new chart version using the common templates directory (authored by Joe).
blubberoid: release new chart version using the common templates directory
Dec 16 2019, 2:37 PM
Joe committed rDEPLOYCHARTS12dc3ee32f47: Create common template helpers directory (authored by Joe).
Create common template helpers directory
Dec 16 2019, 2:37 PM
Joe closed T237234: Collect metrics from envoy where it is enabled on k8s, a subtask of T235411: Add TLS termination to services running on kubernetes, as Resolved.
Dec 16 2019, 10:00 AM · Patch-For-Review, Kubernetes, serviceops, Operations
Joe closed T237234: Collect metrics from envoy where it is enabled on k8s as Resolved.
Dec 16 2019, 10:00 AM · Kubernetes, serviceops, Operations
Joe added a comment to T240798: Degraded RAID on ms-be2016.

This happens after the key is offered for authentication.

Dec 16 2019, 8:45 AM · Operations, ops-codfw
Joe added a comment to T240518: Some jobs are not being processed / are processed slowly.

Again, looking specifically to recentChangesUpdate:

Dec 16 2019, 7:42 AM · Wikimedia-Incident, SDC-Statements (Machine-vision-depicts), MachineVision, Structured-Data-Backlog, Product-Infrastructure-Team-Backlog, MW-1.35-notes (1.35.0-wmf.10; 2019-12-10), Core Platform Team Workboards (Clinic Duty Team), WMF-JobQueue, Operations
Joe added a comment to T240518: Some jobs are not being processed / are processed slowly.

In the meantime, it seems that processing of recentChangesUpdate gets completely outdated for some reason. Let's start with the logfile that began on the 12th:

Dec 16 2019, 6:38 AM · Wikimedia-Incident, SDC-Statements (Machine-vision-depicts), MachineVision, Structured-Data-Backlog, Product-Infrastructure-Team-Backlog, MW-1.35-notes (1.35.0-wmf.10; 2019-12-10), Core Platform Team Workboards (Clinic Duty Team), WMF-JobQueue, Operations
Joe added a comment to T240518: Some jobs are not being processed / are processed slowly.

Looking at cpjobqueue logs, it's clear it's getting 500 responses at least to some of the requests:

Dec 16 2019, 6:24 AM · Wikimedia-Incident, SDC-Statements (Machine-vision-depicts), MachineVision, Structured-Data-Backlog, Product-Infrastructure-Team-Backlog, MW-1.35-notes (1.35.0-wmf.10; 2019-12-10), Core Platform Team Workboards (Clinic Duty Team), WMF-JobQueue, Operations
Joe renamed T240518: Some jobs are not being processed / are processed slowly from Job queue seems to be processed slowly than expected to Some jobs are not being processed / are processed slowly.
Dec 16 2019, 6:11 AM · Wikimedia-Incident, SDC-Statements (Machine-vision-depicts), MachineVision, Structured-Data-Backlog, Product-Infrastructure-Team-Backlog, MW-1.35-notes (1.35.0-wmf.10; 2019-12-10), Core Platform Team Workboards (Clinic Duty Team), WMF-JobQueue, Operations
Joe added a comment to T240518: Some jobs are not being processed / are processed slowly.

Specifically, we have a backlog of 6 million jobs on recentChangesUpdate, ever-growing since Dec 11th. I don't see any of them in the logs for JobExecutor, so I guess the problem has to do with changepropagation.

Dec 16 2019, 6:11 AM · Wikimedia-Incident, SDC-Statements (Machine-vision-depicts), MachineVision, Structured-Data-Backlog, Product-Infrastructure-Team-Backlog, MW-1.35-notes (1.35.0-wmf.10; 2019-12-10), Core Platform Team Workboards (Clinic Duty Team), WMF-JobQueue, Operations

Dec 13 2019

Joe added a comment to T240665: pybal fails to reconnect cleanly to etcd when etcd is restarted.

Should I merge this into T169765 @Joe as per ema comment?

Dec 13 2019, 2:25 PM · Operations, Traffic, Pybal
Joe created T240665: pybal fails to reconnect cleanly to etcd when etcd is restarted.
Dec 13 2019, 10:06 AM · Operations, Traffic, Pybal
Joe closed T237362: Rolling restart of etcd to pick up the renewed CA public certificate., a subtask of T237259: Document all uses of the puppetCA certificate, as Resolved.
Dec 13 2019, 10:02 AM · Patch-For-Review, User-jbond, Puppet, Operations
Joe closed T237362: Rolling restart of etcd to pick up the renewed CA public certificate. as Resolved.
Dec 13 2019, 10:02 AM · Patch-For-Review, serviceops, User-jbond, Puppet, Operations

Dec 12 2019

Joe added a comment to T214734: All debug hosts give (likely spurious) message: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp).

It appears to me that we try to send something on that UDP socket with send() after we called close(). Does this only happen on the debug servers? If so, various things are peculiar there:

Dec 12 2019, 4:54 PM · Patch-For-Review, Release-Engineering-Team, Performance-Team (Radar), serviceops, Wikimedia-production-error, User-fgiunchedi, Operations
Joe added a comment to T240576: Use Envoy instead of nginx for TLS termination on Appservers.

While it should be easy to swap nginx for envoy, we need to also convert profile::services_proxy to use envoy at the same time.

Dec 12 2019, 2:46 PM · serviceops, Traffic, Operations
Joe updated subscribers of T240576: Use Envoy instead of nginx for TLS termination on Appservers.
Dec 12 2019, 2:24 PM · serviceops, Traffic, Operations
Joe reassigned T194031: Setup a new PKI software as an alternative to the puppet CA for managing services certificates from Joe to Volans.
Dec 12 2019, 9:30 AM · User-jbond, Traffic, Operations
Joe closed T238050: envoy overwrites the server header as Resolved.
Dec 12 2019, 7:20 AM · RESTBase, Traffic, Operations

Dec 10 2019

Joe added a comment to T237362: Rolling restart of etcd to pick up the renewed CA public certificate..

The good news is that we only need to do a rolling restart in eqiad, not in codfw, where we still don't use the CA for peer connections.

Dec 10 2019, 10:52 AM · Patch-For-Review, serviceops, User-jbond, Puppet, Operations
Joe added a comment to T235216: Reconsider memcached connection method for MW in PHP7 world.

Mcrouter can't be configured to listen both on a unix socket and on a TCP port. This means, apart from how cumbersome the change is going to be if we want to do it, that we'd need to change our architecture and have separate "mcrouter proxies" for cross-DC replication.

Dec 10 2019, 10:01 AM · Performance-Team (Radar), serviceops
Joe committed rDEPLOYCHARTS3767943576d5: blubberoid: break TLS functionality into a helper (authored by Joe).
blubberoid: break TLS functionality into a helper
Dec 10 2019, 8:20 AM

Dec 9 2019

Joe added a comment to T236566: "hat-imagescalers" Cloud VPS project jessie deprecation.

This project was supposed to be orphaned and removed in the last purge by Andrew. Please remove it completely at your earliest convenience.

Dec 9 2019, 6:58 AM · Cloud-VPS (Debian Jessie Deprecation)

Dec 6 2019

Joe updated the task description for T235411: Add TLS termination to services running on kubernetes.
Dec 6 2019, 9:21 AM · Patch-For-Review, Kubernetes, serviceops, Operations
Joe closed T237235: Build and upload envoy 1.12.0 package., a subtask of T238050: envoy overwrites the server header, as Resolved.
Dec 6 2019, 9:20 AM · RESTBase, Traffic, Operations
Joe closed T237235: Build and upload envoy 1.12.0 package. as Resolved.
Dec 6 2019, 9:20 AM · Packaging, Operations, serviceops

Dec 5 2019

Joe added a comment to T235216: Reconsider memcached connection method for MW in PHP7 world.

I'm pretty sure using unix sockets would improve performance; it certainly did when we were on HHVM. It's pretty easy to test this effect as we now collect timing data per-server, so we could change the configuration conditionally on one API server and one appserver, and check pretty quickly whether there is a positive performance impact.

Dec 5 2019, 12:25 PM · Performance-Team (Radar), serviceops

Dec 3 2019

Joe closed T239688: Rename operations/debs/poolcounter-prometheus-exporter to match other Prometheus repositories as Declined.

Except that's the name of the software we're packaging (https://github.com/Wikia/poolcounter-prometheus-exporter), so I think there's little we should do there.

Dec 3 2019, 8:53 AM · Gerrit, Release-Engineering-Team (Development services), serviceops

Dec 2 2019

Joe closed T123854: Set up action API latency / error rate metrics & alerts as Resolved.

This has been resolved for some time:

Dec 2 2019, 8:40 AM · Core Platform Team Legacy (Watching / External), Services (watching), MediaWiki-API, Traffic, Operations, observability

Nov 28 2019

Joe added a comment to T239392: Applications and scripts need to be able to understand the pooled status of servers in our load balancers..

We could also think of writing a sort of HTTP router that returns a list of PyBal API endpoints for a node. For instance:

GET /host/cp3060.esams.wmnet
http://lvs3005.esams.wmnet:9090/pools/textlb6_80/cp3060.esams.wmnet
http://lvs3005.esams.wmnet:9090/pools/textlb6_443/cp3060.esams.wmnet
[...]

This approach has the advantage, compared to the solution outlined in the task, of not duplicating state, and the disadvantage of hitting the PyBal API directly, which might be a deal breaker though?
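As a rough illustration of the router idea above, here is a minimal Python sketch. It assumes a static inventory mapping each LVS host to its pools and backends; the `POOLS` table, the helper names, and the extra backend `cp3061.esams.wmnet` are hypothetical and only there to make the example runnable. The port and URL layout are copied from the example response above.

```python
from typing import Dict, List

# Hypothetical inventory: LVS host -> pool name -> backend hosts.
# A real router would derive this from an existing source of truth.
POOLS: Dict[str, Dict[str, List[str]]] = {
    "lvs3005.esams.wmnet": {
        "textlb6_80": ["cp3060.esams.wmnet", "cp3061.esams.wmnet"],
        "textlb6_443": ["cp3060.esams.wmnet"],
    },
}

PYBAL_PORT = 9090  # port used in the example URLs above


def endpoints_for_host(host: str,
                       pools: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Return the PyBal API URL for every pool that contains `host`."""
    urls = []
    for lvs, by_pool in pools.items():
        for pool, backends in by_pool.items():
            if host in backends:
                urls.append(f"http://{lvs}:{PYBAL_PORT}/pools/{pool}/{host}")
    return urls


def route(path: str) -> List[str]:
    """Resolve a request path of the form /host/<fqdn> to endpoint URLs."""
    prefix = "/host/"
    if not path.startswith(prefix):
        raise ValueError(f"unsupported path: {path}")
    return endpoints_for_host(path[len(prefix):], POOLS)
```

The router holds no pooled/depooled state of its own: it only says where to ask, and the caller still has to query each returned PyBal endpoint.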

Nov 28 2019, 4:00 PM · Operations, serviceops, SRE-tools, Traffic, Pybal
Joe updated the task description for T239392: Applications and scripts need to be able to understand the pooled status of servers in our load balancers..
Nov 28 2019, 9:59 AM · Operations, serviceops, SRE-tools, Traffic, Pybal