
Joe (Giuseppe Lavagetto)

User Details

User Since
Oct 3 2014, 5:57 AM (343 w, 5 d)
Availability
Available
LDAP User
Giuseppe Lavagetto
MediaWiki User
GLavagetto (WMF) [ Global Accounts ]

Recent Activity

Yesterday

Joe added a comment to T281596: Publish wikimedia-bullseye base docker image.

If possible, it would be great if we could get a base bullseye image somehow, even if it is not auto-updated right from the start and is created some other way.
We already have a couple of hosts on bullseye, and I need to add a python-build bullseye image to the registry to build the wheels of a Python application included in the role of one of those hosts; that of course requires a base bullseye image.
FYI Moritz has opened T281984 for the long term solution.

Wed, May 5, 2:22 PM · SRE, serviceops
Joe added a comment to T279108: Introduce a Front-end Build Step for MediaWiki Skins and Extensions.

I would object to using any build step that depends on downloading assets from the internet at runtime. We do that at the moment for many projects and we're aware it's completely wrong and needs fixing.

Wed, May 5, 7:10 AM · Vue.js Migration, tech-decision-forum

Thu, Apr 29

Joe added a comment to T281480: SqlBlobStore no longer caching blobs (DBConnectionError Too many connections).

There is definitely something going very wrong with memcached:

Thu, Apr 29, 4:10 PM · Performance-Team, MW-1.37-notes (1.37.0-wmf.4; 2021-05-04), MediaWiki-Cache, DBA, MediaWiki-Revision-backend, Wikimedia-production-error
Joe updated subscribers of T281480: SqlBlobStore no longer caching blobs (DBConnectionError Too many connections).

Given we only make requests to external storage when parsercache has a miss, it seemed sensible to look for corresponding patterns in parsercache.

I see we introduced a new category of misses on the same date "miss_absent_metadata", see https://grafana-rw.wikimedia.org/d/000000106/parser-cache?viewPanel=7&orgId=1&from=now-30d&to=now which seems related.

Thu, Apr 29, 4:00 PM · Performance-Team, MW-1.37-notes (1.37.0-wmf.4; 2021-05-04), MediaWiki-Cache, DBA, MediaWiki-Revision-backend, Wikimedia-production-error
Joe added a comment to T281480: SqlBlobStore no longer caching blobs (DBConnectionError Too many connections).

Given we only make requests to external storage when parsercache has a miss, it seemed sensible to look for corresponding patterns in parsercache.

Thu, Apr 29, 3:54 PM · Performance-Team, MW-1.37-notes (1.37.0-wmf.4; 2021-05-04), MediaWiki-Cache, DBA, MediaWiki-Revision-backend, Wikimedia-production-error
TBlasta awarded rOPUP4e64be4f8d1c: environment/future: remove redundant settings a 100 token.
Thu, Apr 29, 7:17 AM

Wed, Apr 28

Joe added a comment to T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable".

ok, a bit more time passed and we've lost 24 jobs in 7 days.

@Krinkle this is a significant improvement over the previous state, but given it's a production error, do we need to aim for 0?

Wed, Apr 28, 5:09 AM · Patch-For-Review, User-brennen, serviceops, Platform Team Workboards (Clinic Duty Team), WMF-JobQueue, Wikimedia-production-error

Tue, Apr 27

Joe updated subscribers of T281240: Ensure Changeprop is disabled when the databases are in read only mode .

To be clear, the idea came out of the fact that during read-only time we had a lot of jobs failing, but given we actually retry the jobs, we should not actually need it.

Tue, Apr 27, 12:00 PM · ChangeProp, Sustainability (Incident Followup), SRE, serviceops

Mon, Apr 26

Joe updated subscribers of T281079: Generate non-restricted versions of the mediawiki-webserver image.
Mon, Apr 26, 7:32 AM · Release-Engineering-Team (Seen), serviceops, MW-on-K8s
Joe created T281079: Generate non-restricted versions of the mediawiki-webserver image.
Mon, Apr 26, 7:17 AM · Release-Engineering-Team (Seen), serviceops, MW-on-K8s

Wed, Apr 21

Joe added a comment to T279695: Deploy Scap version 3.17.1-1.

@LarsWirzenius as discussed in T265501, using docker to build scap is not an option. In T277793#6957246 I wrote that "we will be happy with a step by step documented process of how to build the debian package on our build host". Please update the "Building" section in https://wikitech.wikimedia.org/wiki/Scap3#Production_Upgrade, and I will be happy to build and roll it out.

Or simply provide the dsc/debian.tar.*/orig.tar.* somewhere for import/building; that rules out any issues in the git workflows before that step.

Wed, Apr 21, 7:35 AM · Release-Engineering-Team (Radar), serviceops, Scap

Mon, Apr 19

Joe triaged T280497: Benchmark performance of MediaWiki on k8s as High priority.
Mon, Apr 19, 9:44 AM · MW-on-K8s, serviceops, SRE
Joe created T280497: Benchmark performance of MediaWiki on k8s.
Mon, Apr 19, 9:44 AM · MW-on-K8s, serviceops, SRE
Joe moved T265183: In a k8s world: where does MediaWiki code live? from Backlog to In Progress on the MW-on-K8s board.
Mon, Apr 19, 9:38 AM · MW-on-K8s
Joe closed T276908: Figure out appropriate readiness and liveness probes , a subtask of T265327: Create a basic helm chart to test MediaWiki on kubernetes, as Resolved.
Mon, Apr 19, 9:37 AM · Patch-For-Review, SRE, serviceops, MW-on-K8s
Joe closed T276908: Figure out appropriate readiness and liveness probes as Resolved.
Mon, Apr 19, 9:37 AM · SRE, serviceops, MW-on-K8s

Fri, Apr 16

Joe added a comment to T279804: Visits to Wikimedia properties should not be used for Google ad targeting (FLoC).

Is the header needed at all?

https://github.com/WICG/floc/issues/45#issuecomment-781042491 says:

During the Origin Trial, the default for whether a page will be used for FLoC computation will be based on Chrome's existing infrastructure which detects pages that load ads-related resources. Our thinking here is that pages detected as including ads-related resources probably fetched something with an ads-related 3p cookie attached, which means it's reasonable to guess that the page visit contributes to some ads profile today.

Since the WMF doesn't serve ads (I don't think the fundraising banners count), I don't think WMF sites would be included, so the header would be just cruft (and extra bytes down the wire).

Disclosure: I work for Google, but I'm making this (and all other) comments in a personal capacity. I have no relevant insider knowledge and my observation is based purely on the public GitHub thread.
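
For context, the header being discussed is `Permissions-Policy: interest-cohort=()`. A hypothetical sketch of how a site could attach it, written as WSGI middleware (purely illustrative; the actual change under review is not shown in this thread):

```python
def add_floc_optout(app):
    """Wrap a WSGI app so every response carries the FLoC opt-out header.

    Illustrative only: the real change would more likely live at the
    webserver/CDN layer, not in application middleware.
    """
    def middleware(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Append the opt-out header to whatever the app already set.
            headers = list(headers) + [("Permissions-Policy", "interest-cohort=()")]
            return start_response(status, headers, exc_info)
        return app(environ, patched_start_response)
    return middleware
```

The trade-off Joe raises still applies: if Chrome never includes ad-free pages in cohort computation anyway, these are extra bytes on every response for no effect.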

Fri, Apr 16, 4:08 PM · fundraising-tech-ops, Patch-For-Review, SRE, Traffic, Privacy Engineering, Privacy
Joe merged T280377: Protect our users against Google-driven privacy breach via FLOC into T279804: Visits to Wikimedia properties should not be used for Google ad targeting (FLoC).
Fri, Apr 16, 4:07 PM · fundraising-tech-ops, Patch-For-Review, SRE, Traffic, Privacy Engineering, Privacy
Joe merged task T280377: Protect our users against Google-driven privacy breach via FLOC into T279804: Visits to Wikimedia properties should not be used for Google ad targeting (FLoC).
Fri, Apr 16, 4:06 PM · Traffic, SRE
Joe added a comment to T280377: Protect our users against Google-driven privacy breach via FLOC.

Please note that while we surely won't use the js api in our base javascript, this is intended as a defensive measure for all third-party js and software we run.

Fri, Apr 16, 4:02 PM · Traffic, SRE
Joe created T280377: Protect our users against Google-driven privacy breach via FLOC.
Fri, Apr 16, 4:00 PM · Traffic, SRE

Thu, Apr 15

Joe added a comment to T275637: (Need By: TBD) rack/setup/install conf200[456].codfw.wmnet.

Thanks @Papaul! We'll now work on service implementation.

Thu, Apr 15, 7:07 AM · serviceops, SRE, ops-codfw, DC-Ops

Wed, Apr 14

Joe moved T278220: Define the size of a pod for mediawiki in terms of resource usage from Backlog to In Progress on the MW-on-K8s board.
Wed, Apr 14, 3:42 PM · serviceops, MW-on-K8s
Joe closed T276097: Create MediaWiki httpd base image, a subtask of T265327: Create a basic helm chart to test MediaWiki on kubernetes, as Resolved.
Wed, Apr 14, 3:39 PM · Patch-For-Review, SRE, serviceops, MW-on-K8s
Joe closed T276097: Create MediaWiki httpd base image as Resolved.
Wed, Apr 14, 3:39 PM · MW-on-K8s, serviceops
Joe added a comment to T265327: Create a basic helm chart to test MediaWiki on kubernetes.
joe@wotan:~/Sandbox/mw-on-k8s$ kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
mediawiki-test-6fb67b5f8b-2nwqh   6/6     Running   5          3m44s
Wed, Apr 14, 2:35 PM · Patch-For-Review, SRE, serviceops, MW-on-K8s

Tue, Apr 13

Joe raised the priority of T276029: Renew certs for mcrouter on all mw appservers from Medium to High.

I don't realistically see it as possible to switch memcached to TLS in the remaining time before we need to renew the certificates, hence raising the priority. It will be raised to UBN! in a couple of days.

Tue, Apr 13, 6:00 AM · SRE, serviceops

Thu, Apr 8

dcaro awarded T166066: Integrate the puppet compiler in the puppet CI pipeline a Stroopwafel token.
Thu, Apr 8, 2:35 PM · Release-Engineering-Team (Radar), Patch-For-Review, Puppet, puppet-compiler, SRE

Mar 24 2021

Joe closed T278274: Backend Save Timing raised by +80ms at lower percentiles since 23 Mar 2021 as Invalid.

We've had all supporting services serving from codfw for the well-announced rebuild of the eqiad kubernetes cluster (see the email to wikitech-l). Numbers should be unaffected again once we migrate back.

Mar 24 2021, 7:18 AM · serviceops, DBA, Performance-Team (Radar)

Mar 23 2021

Joe added a comment to T277711: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes.

Trying to break down my current thoughts:

Mar 23 2021, 4:38 PM · serviceops, SRE
Joe added a comment to T278220: Define the size of a pod for mediawiki in terms of resource usage.

At 15 workers per pod, we get 5 pods per node (6 if we only reserve 5% of ram and cpu). That's more or less the maximum concurrency at which the sweet spot holds for php-fpm. It gets us either 75 or 90 workers per node, and I think it would be a net win. I will update the task once I have more realistic numbers.
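
A back-of-the-envelope sketch of that arithmetic; the node figures (96 GB / 48 cores, with 85% or 95% usable) come from the comments in this thread, while the per-pod memory and CPU values are hypothetical placeholders chosen only to illustrate the 5-vs-6 pods calculation:

```python
def pods_per_node(node_ram_gb, node_cpus, usable_frac, pod_ram_gb, pod_cpus):
    """Pods that fit on one node, limited by whichever resource runs out first."""
    by_ram = int(node_ram_gb * usable_frac // pod_ram_gb)
    by_cpu = int(node_cpus * usable_frac // pod_cpus)
    return min(by_ram, by_cpu)

# Hypothetical per-pod size (15 php-fpm workers plus sidecars) -- not from the task.
conservative = pods_per_node(96, 48, 0.85, pod_ram_gb=15, pod_cpus=7.5)  # reserve 15%
aggressive = pods_per_node(96, 48, 0.95, pod_ram_gb=15, pod_cpus=7.5)    # reserve 5%
```

With these placeholder pod sizes the function reproduces the comment's conclusion: 5 pods per node at 85% usable, 6 at 95%.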

Mar 23 2021, 11:55 AM · serviceops, MW-on-K8s
Joe added a comment to T278220: Define the size of a pod for mediawiki in terms of resource usage.

A typical appserver has 96 GB of memory and 48 cores. Let's assume we can use up to 85% of those with pods, which looks a bit conservative, but it's ok for our current calculations.

Mar 23 2021, 11:51 AM · serviceops, MW-on-K8s
Joe updated the task description for T278220: Define the size of a pod for mediawiki in terms of resource usage.
Mar 23 2021, 11:07 AM · serviceops, MW-on-K8s
Joe updated the task description for T278220: Define the size of a pod for mediawiki in terms of resource usage.
Mar 23 2021, 11:06 AM · serviceops, MW-on-K8s
Joe added a comment to T278220: Define the size of a pod for mediawiki in terms of resource usage.

Some data from one appserver:

  • httpd uses less than 1 GB of memory and 1 cpu. If we assume we'll reduce the number of workers, it can be safe to assume e.g. 600 MB and 0.6 CPUs are ok
  • mcrouter uses around 300 MB of memory. Again this would be reduced if it's inside the pod, down to ~ 200 MB should be safe. 1 CPU is enough for a whole-host mcrouter, so we can assume 0.5 CPUs should be enough
  • nutcracker currently uses 200 MB of memory + 0.1 cpus
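
Summing the per-sidecar estimates in the list above gives the fixed overhead each pod would carry (a sketch over the numbers quoted here, not measurements of the final setup):

```python
# (memory in MB, CPUs) -- the estimates from the comment above.
sidecars = {
    "httpd": (600, 0.6),
    "mcrouter": (200, 0.5),
    "nutcracker": (200, 0.1),
}

# Fixed per-pod overhead before any php-fpm workers are counted.
overhead_mb = sum(mb for mb, _ in sidecars.values())
overhead_cpus = sum(cpus for _, cpus in sidecars.values())
```

So under these estimates each pod would carry roughly 1 GB of memory and a bit over 1 CPU of sidecar overhead on top of the php-fpm workers.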
Mar 23 2021, 10:57 AM · serviceops, MW-on-K8s
Joe created T278220: Define the size of a pod for mediawiki in terms of resource usage.
Mar 23 2021, 10:24 AM · serviceops, MW-on-K8s

Mar 22 2021

Joe claimed T277711: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes.
Mar 22 2021, 4:07 PM · serviceops, SRE
Joe added a comment to T277711: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes.

Given it has created some doubts, let me clarify: I've created a first version of the charts that implements solution 1 (and not a complete version of it, either).

Mar 22 2021, 4:07 PM · serviceops, SRE
Joe added a comment to T277711: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes.

That is already done in the MediaWiki chart.

But that does now deploy mcrouter as a sidecar in each MW pod. AIUI this might come with additional cons like more connections to memcached, less use of pooled connections etc. or did I get that wrong? Should we evaluate that?

Mar 22 2021, 10:30 AM · serviceops, SRE
Joe added a comment to T277711: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes.

I don't really like option 3 just because it moves parts of the software stack to the node itself, and I would personally like the nodes to be as dumb as possible, ideally just running kubernetes components and docker. This might be a bit opinionated, but IMHO it makes dealing with the nodes easier, as one can be sure that all the actual workload on the node is "visible" via the kubernetes API and there is nothing "hidden" that one might need to take care of when dealing with nodes.

This is understood, but we need to come up with a reasonable way to gradually rollout changes to mcrouter when needed, be that a newer version or a configuration change.

Mar 22 2021, 10:04 AM · serviceops, SRE

Mar 18 2021

Joe added a comment to T277711: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes.

As far as mcrouter goes, the only non-brittle solution is to run it inside the pod, so solution 1. The reason is simple: restarting mcrouter and/or it crashing on the node or in a daemonset would make it unavailable for all the pods on the node, without MediaWiki *or* kubernetes noticing.

Mar 18 2021, 5:47 PM · serviceops, SRE
Joe triaged T277742: Update sury-php images for updated gpg key as High priority.

Triaging as high priority as this is at best going to make building the images fail, at worst it's a security liability.

Mar 18 2021, 10:36 AM · Release-Engineering-Team, Continuous-Integration-Infrastructure
Joe created T277742: Update sury-php images for updated gpg key.
Mar 18 2021, 10:35 AM · Release-Engineering-Team, Continuous-Integration-Infrastructure

Mar 17 2021

Joe added a comment to T276908: Figure out appropriate readiness and liveness probes .

After some more work, these are my ideas for liveness and readiness probes:

  1. httpd:
    • liveness: tcp connection to the main tcp port
    • readiness: http status page (via the monitoring port)
  2. php-fpm
    • liveness: tcp connection to the port (TCP ONLY)
    • liveness: check that the socket is owned by the main php-fpm process (UNIX ONLY)
    • readiness: make a request for /livez, the ping path, using a fcgi client
  3. mcrouter:
    • liveness: tcp connection to the port
    • readiness: get __mcrouter__.config_md5_digest or some other admin request
  4. nutcracker:
    • liveness: tcp connection to the port
    • readiness: tcp connection to the stats port maybe?

I will not add readiness probes for the exporters, and just liveness ones for now.

Sounds pretty good to me, although I don't have much of an understanding of mcrouter and nutcracker. One thing to keep in mind is that the readiness probe for php-fpm will take a child in the default (www) pool and will probably increase the max_requests counter as well. So maybe it would be good to have a separate pool for that; but on the other hand it might be smart to explicitly check the www pool and thereby mark the pod as not ready when, for example, no children are left.
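
The TCP-connection probes in the list above can be sketched in a few lines; this is an illustrative standalone check, not the actual probe configuration (which would be expressed as `tcpSocket`/`exec` probes in the chart):

```python
import socket

def tcp_liveness(host, port, timeout=1.0):
    """Liveness as described above: succeed if a TCP connection opens."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def mcrouter_readiness(host, port, timeout=1.0):
    """Readiness sketch for mcrouter: ask for its config digest over the
    memcached ASCII protocol, as suggested in the list above."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(b"get __mcrouter__.config_md5_digest\r\n")
            return s.recv(4096).startswith(b"VALUE")
    except OSError:
        return False
```

The php-fpm readiness check would be analogous but speak FastCGI to the ping path instead of the memcached protocol.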

Mar 17 2021, 4:49 PM · SRE, serviceops, MW-on-K8s
Joe moved T276908: Figure out appropriate readiness and liveness probes from Backlog to In Progress on the MW-on-K8s board.
Mar 17 2021, 10:03 AM · SRE, serviceops, MW-on-K8s
Joe added a comment to T277539: maintenance/mysql.php does not support ssl.

Using hostnames in mediawiki-config is not really an option.

Mar 17 2021, 9:22 AM · Performance-Team (Radar), Patch-For-Review, User-Urbanecm, Data-Persistence (Consultation), MediaWiki-Maintenance-system
Joe added a comment to T277146: Authoritative ports list.

FWIW, the document on wikitech is not authoritative - service::catalog in hiera is.

Mar 17 2021, 9:18 AM · Patch-For-Review, SRE, netops
Joe added a comment to T272918: Create ml-serve k8s cluster.

FWIW, ServiceOps decided against using a full mesh networking for our services because we considered istio to be both very complex and not really needed for our level of complication.

Mar 17 2021, 8:49 AM · Machine-Learning-Team, Patch-For-Review, Lift-Wing

Mar 16 2021

Joe added a comment to T120242: Consistent MediaWiki state change events | MediaWiki events as source of truth.

The best practices I am talking about are, basically:

Mar 16 2021, 8:12 PM · DBA, WMF-Architecture-Team, Platform Team Legacy (Later), Analytics, Event-Platform, Services (later)
Joe added a comment to T120242: Consistent MediaWiki state change events | MediaWiki events as source of truth.

clouddb1021 is owned by Analytics so we can set up ROW there if that's

Many services (WDQS updater, job queue, change prop, analytics event ingestion, etc.) will rely on it, just like they rely on EventBus generated events right now. In the future, I expect more and more event driven services to be built, all of which will mostly rely on events for state transfer, and many of them will want MediaWiki state.

Mar 16 2021, 6:08 PM · DBA, WMF-Architecture-Team, Platform Team Legacy (Later), Analytics, Event-Platform, Services (later)
Joe triaged T276908: Figure out appropriate readiness and liveness probes as High priority.
Mar 16 2021, 7:46 AM · SRE, serviceops, MW-on-K8s
Joe added a comment to T276908: Figure out appropriate readiness and liveness probes .

After some more work, these are my ideas for liveness and readiness probes:

Mar 16 2021, 7:45 AM · SRE, serviceops, MW-on-K8s

Mar 15 2021

Joe added a comment to T120242: Consistent MediaWiki state change events | MediaWiki events as source of truth.

Ok I'll try to re-summarize my argument:
the problem we're trying to solve is having transactional consistency between mediawiki and kafka. And we want to do it not at the application layer, but at the data layer, which is what I think is wrong for a few reasons. But before we go back to discussing solutions, I'd like to see a better explanation of the problem.

Mar 15 2021, 4:47 PM · DBA, WMF-Architecture-Team, Platform Team Legacy (Later), Analytics, Event-Platform, Services (later)
Joe added a comment to T120242: Consistent MediaWiki state change events | MediaWiki events as source of truth.

We should focus on reconciliation strategies instead of chasing for panaceas for problems we know cannot be solved reliably.

Can you explain why the binlog solution is not solving the problem reliably?

Mar 15 2021, 2:28 PM · DBA, WMF-Architecture-Team, Platform Team Legacy (Later), Analytics, Event-Platform, Services (later)
Joe added a comment to T120242: Consistent MediaWiki state change events | MediaWiki events as source of truth.

Where is debezium supposed to run?

In k8s.

We should keep primary masters as clean as possible

Agree, I'd prefer to consume the binlog of a replica.

What happens with the PII that gets written to binlogs?

We aren't going to produce all MySQL data as events. I looked into this, and the output is just too raw and contextless to get the rich events we have currently. My preferred solution is the 'Change Data Capture via Transactional Outbox' one. In that, we'd convert the EventBus extension to insert into one or several 'event outbox' tables in the exact format we want, likely just a JSON string that Debezium can hoist into an event. So really, we don't even need the full binlog, just the logs for our outbox table(s). (I think MySQL has a binlog table filter setting, right?)
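
The transactional-outbox idea described here can be sketched with sqlite3 standing in for the MediaWiki primary; the table and event names are illustrative, not the real EventBus schema:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE event_outbox (id INTEGER PRIMARY KEY, payload TEXT)")

# One transaction: the state change and its event commit (or roll back)
# together, which is what gives the consistency the thread is after.
with conn:
    conn.execute("INSERT INTO page (title) VALUES (?)", ("Example_page",))
    conn.execute(
        "INSERT INTO event_outbox (payload) VALUES (?)",
        (json.dumps({"event": "page_create", "title": "Example_page"}),),
    )

# A binlog tailer (Debezium in the proposal) would then read the
# event_outbox rows and produce them to Kafka.
events = [json.loads(p) for (p,) in conn.execute("SELECT payload FROM event_outbox")]
```

The point of the pattern is that the event row rides the same MySQL transaction as the page write, so the binlog filter only ever needs to watch the outbox table(s).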

Mar 15 2021, 1:58 PM · DBA, WMF-Architecture-Team, Platform Team Legacy (Later), Analytics, Event-Platform, Services (later)
Joe added a comment to T120242: Consistent MediaWiki state change events | MediaWiki events as source of truth.

Reading the whole history here, it seems that the problem we want to solve is a traditionally unsolvable one (keeping two logically-distinct datastores perfectly consistent in a distributed architecture while not being in a CP setup).

Mar 15 2021, 1:55 PM · DBA, WMF-Architecture-Team, Platform Team Legacy (Later), Analytics, Event-Platform, Services (later)

Mar 11 2021

Joe triaged T277183: Phase out nutcracker for connecting to redis as Medium priority.
Mar 11 2021, 3:27 PM · Performance-Team (Radar), SRE, serviceops
Joe created T277183: Phase out nutcracker for connecting to redis.
Mar 11 2021, 3:27 PM · Performance-Team (Radar), SRE, serviceops
Joe added a comment to T235216: Consider socket files for MW-to-mcrouter connection.

I'm confused. If such a change requires the kind of migration you describe, then what did we do during the PHP 7 migration this year?

Mar 11 2021, 3:14 PM · Performance-Team (Radar), serviceops
Joe committed rLPRIe9a894a68d3c: Fix salt for php monitoring (authored by Joe).
Fix salt for php monitoring
Mar 11 2021, 9:29 AM

Mar 10 2021

Joe awarded T277008: Server fabula outage, www.wikimedia.it offline (provider incident) a Love token.
Mar 10 2021, 11:56 AM · WMIT-Infrastructure
Joe added a comment to T277008: Server fabula outage, www.wikimedia.it offline (provider incident).

Hi @valerio.bozzolan, from this tweet

Mar 10 2021, 8:59 AM · WMIT-Infrastructure

Mar 9 2021

Joe added a comment to T276908: Figure out appropriate readiness and liveness probes .

Ideally, the liveness probe needs to check if the container is running (more or less), while the readiness probe should check that the service is still responding.

Mar 9 2021, 10:22 AM · SRE, serviceops, MW-on-K8s
Joe created T276908: Figure out appropriate readiness and liveness probes .
Mar 9 2021, 10:17 AM · SRE, serviceops, MW-on-K8s

Mar 5 2021

Joe added a comment to T264209: Run stress tests on docker images infrastructure.

For the record, we're now building the actual multiversion images of mediawiki; it would be interesting to do all testing using those. In particular it's interesting, IMHO, to work on the layering so that we reduce the number of layers we need to download for each release.

Mar 5 2021, 11:10 AM · serviceops, Release Pipeline, MW-on-K8s
Joe added a comment to T273950: Modernise memcached systemd unit / sync to current buster setup.

systemd-memcached-wrapper is a Perl script, an evolution of the old wrapper script Debian has always used, and one that caused me more headaches than it solved. I'd very much prefer we keep the approach we took with our systemd unit back in the day (while it might make sense to switch to using the memcached user for the reasons above).

Mar 5 2021, 10:22 AM · serviceops, User-jijiki, SRE

Mar 4 2021

Joe added a comment to T272319: Frequent "Nonce already used" errors in scripts and tools.

@Vort @AntiCompositeNumber this is related to T276415 - we had another hardware failure. I'll route around it now, but we definitely need to make OAuth tokens less brittle.

Mar 4 2021, 7:20 AM · serviceops, cloud-services-team (Kanban), Platform Team Workboards (Clinic Duty Team), MediaWiki-extensions-OAuth

Mar 3 2021

Joe added a comment to T271967: Enable TLS on memcached for cross-dc replication.

Can I ask how we intend to perform the transition from non-TLS to TLS in detail? I see a series of pitfalls given our current setup and the code I see in puppet, so please be explicit about the steps you want to take to enable TLS on one server.

Mar 3 2021, 9:27 AM · Performance-Team (Radar), User-jijiki, SRE, serviceops
Joe added a comment to T276217: Use Envoy for making GET requests to lang.wikipedia.org/api.php.

So AIUI what needs to happen:

  1. add the mwapi-async listener to the helmfile values.yaml (done in the WIP patch)
  2. change the application code; currently the code which calls the MW API looks like this:
getPageDict(page_title, wiki_id, os.environ.get("MEDIAWIKI_API_URL"))

[...]

def getPageDict(title: str, wiki_id: str, api_url: str = None) -> dict:
    [...]
    api_url = api_url or "https://{0}.wikipedia.org/w/api.php".format(
        wiki_id.replace("wiki", "").replace("_", "-")
    )
    headers = {"User-Agent": "mwaddlink"}
    req = requests.get(api_url, headers=headers, params=params)

So in the helm chart, we should be setting MEDIAWIKI_API_URL to https://api-rw.discovery.wmnet/w/api.php (although it looks like some charts use localhost:6500 instead), and then the getPageDict() method above should add a header, e.g. {"Host": "cs.wikipedia.org"}, when calling the API endpoint.

Does this sound correct @JMeybohm?

Indeed you should enable the mwapi-async listener and then:

  • change MEDIAWIKI_API_URL to http://localhost:6500/w/api.php
  • send the Host header with the actual wiki host you want to reach
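
The two bullets above translate into something like the following sketch, using only the stdlib (the real service uses `requests`, as in the snippet quoted earlier); the `localhost:6500` URL and the `Host` header are taken from the comment:

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_mw_api_request(wiki_host, params,
                         api_url="http://localhost:6500/w/api.php"):
    """Build a request to the local Envoy mwapi listener, selecting the
    target wiki via the Host header instead of the URL. Only constructs
    the request; it does not send it."""
    return Request(
        api_url + "?" + urlencode(params),
        headers={"User-Agent": "mwaddlink", "Host": wiki_host},
    )

req = build_mw_api_request("cs.wikipedia.org", {"action": "query", "format": "json"})
```

Envoy then routes on the Host header, so the application never needs to construct per-wiki URLs.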
Mar 3 2021, 8:55 AM · Patch-For-Review, Growth-Team (Current Sprint), serviceops, Add-Link
Joe closed T276299: alert1001's tcpircbot down for all internal clients (spicerack, helmfile, dbctl, klaxon, etc) as Resolved.

After more analysis, this is my understanding of the outstanding problems:

  • pontoon was connecting without a password as logmsgbot, and given the nick has no enforce set, it would just lie around, causing tcpircbot to be unable to connect
  • this also caused problems with freenode's spam protection, so even after ghosting the user we had issues coming from that
  • any attempt to !log while the bot is reconnecting (which can take up to 2 minutes) will crash the bot.
Mar 3 2021, 8:48 AM · observability, SRE
Joe added a comment to T276299: alert1001's tcpircbot down for all internal clients (spicerack, helmfile, dbctl, klaxon, etc).

We found a few other issues:

  • The nick has no enforce set, thus a random instance running in labs is connecting (obviously without a password)
  • Nickserv says it saw the user the last time at when the issues started:
NickServ (NickServ@services.): Last seen  : Mar 03 00:10:46 2021 (7h 45m 43s ago)
NickServ (NickServ@services.): User seen  : Mar 03 07:47:30 2021 (8m 59s ago) [this is me ghosting the user]
Mar 3 2021, 8:44 AM · observability, SRE
Joe added a comment to T276299: alert1001's tcpircbot down for all internal clients (spicerack, helmfile, dbctl, klaxon, etc).

The issue is more general: tcpircbot crashes on every invocation in the following way:

Mar 3 2021, 7:20 AM · observability, SRE

Mar 2 2021

Joe added a comment to T276213: Sudden surge of requests to https://wikipedia.org/ from Telus customers.

Did this cause any actual issue?

Mar 2 2021, 4:39 PM · Traffic, SRE
Joe triaged T276213: Sudden surge of requests to https://wikipedia.org/ from Telus customers as Low priority.
Mar 2 2021, 11:41 AM · Traffic, SRE
Joe added a subtask for T265327: Create a basic helm chart to test MediaWiki on kubernetes: T276097: Create MediaWiki httpd base image.
Mar 2 2021, 9:54 AM · Patch-For-Review, SRE, serviceops, MW-on-K8s
Joe added a parent task for T276097: Create MediaWiki httpd base image: T265327: Create a basic helm chart to test MediaWiki on kubernetes.
Mar 2 2021, 9:54 AM · MW-on-K8s, serviceops
Joe moved T265327: Create a basic helm chart to test MediaWiki on kubernetes from Backlog to In Progress on the MW-on-K8s board.
Mar 2 2021, 9:53 AM · Patch-For-Review, SRE, serviceops, MW-on-K8s
Joe claimed T265327: Create a basic helm chart to test MediaWiki on kubernetes.
Mar 2 2021, 9:53 AM · Patch-For-Review, SRE, serviceops, MW-on-K8s
Joe moved T273521: Create restricted docker-registry namespace for security patched images from Backlog to In Progress on the MW-on-K8s board.
Mar 2 2021, 9:52 AM · Patch-For-Review, serviceops, Release-Engineering-Team-TODO, Release Pipeline, MW-on-K8s
Joe edited projects for T275806: wmf-utils has an outdated script to update known hosts files, added: SRE; removed serviceops.

This task has definitely nothing to do with serviceops specifically.

Mar 2 2021, 7:15 AM · SRE
Joe added a comment to T275731: legoktm can't build CI docker images without using root because he's no longer in contint-admins.

I would rather have cherry-picked people who know about docker-pkg / CI. But I guess it is fine to be liberal in granting the right to all SRE members.

Mar 2 2021, 5:56 AM · serviceops, SRE, Continuous-Integration-Infrastructure

Mar 1 2021

Joe moved T265876: Logging options for apache httpd in k8s from In Progress to Blocked on the MW-on-K8s board.
Mar 1 2021, 2:05 PM · observability, SRE, serviceops, MW-on-K8s
Joe moved T276097: Create MediaWiki httpd base image from Backlog to In Progress on the MW-on-K8s board.
Mar 1 2021, 2:05 PM · MW-on-K8s, serviceops
Joe triaged T276097: Create MediaWiki httpd base image as Medium priority.
Mar 1 2021, 2:05 PM · MW-on-K8s, serviceops
Joe created T276095: Keep calculating latencies for MediaWiki requests that happen k8s.
Mar 1 2021, 1:59 PM · observability, SRE, serviceops, MW-on-K8s
Joe added a comment to T276029: Renew certs for mcrouter on all mw appservers.

I am aiming to at least test TLS on memcached (T271967), hoping to roll it out next month. If this works out, we will not need mcrouter certs. We have 60 days ahead of us; I think it can be done, provided that testing is successful.

Mar 1 2021, 12:03 PM · SRE, serviceops
Joe claimed T265876: Logging options for apache httpd in k8s.

At the meeting we decided it's ok to let apache log to kafka as a main method of collection. We will therefore, at least in a first iteration:

Mar 1 2021, 11:47 AM · observability, SRE, serviceops, MW-on-K8s
Joe moved T265876: Logging options for apache httpd in k8s from Backlog to In Progress on the MW-on-K8s board.
Mar 1 2021, 11:10 AM · observability, SRE, serviceops, MW-on-K8s
Joe updated subscribers of T276029: Renew certs for mcrouter on all mw appservers.

@RLazarus in https://phabricator.wikimedia.org/T248093#6076630 you mentioned committing a script for automating cert renewal, and I see it indeed. Renewing the certs should amount to just running the script, correct?

Mar 1 2021, 9:53 AM · SRE, serviceops
Joe updated subscribers of T275752: Jobrunner on Buster occasional timeout on codfw file upload.

I can't imagine a single valid reason why a distro upgrade would mean that data transfer slows down so much.

Mar 1 2021, 9:26 AM · Sustainability, serviceops, SRE

Feb 25 2021

Joe closed T272319: Frequent "Nonce already used" errors in scripts and tools as Resolved.

I got a few reports of bots no longer having issues, so I would consider the immediate problem solved.

Feb 25 2021, 2:31 PM · serviceops, cloud-services-team (Kanban), Platform Team Workboards (Clinic Duty Team), MediaWiki-extensions-OAuth
Joe added a comment to T272319: Frequent "Nonce already used" errors in scripts and tools.

Further update: in the 10 minutes after the dust of the resharding settled, we had just 36 errors, which seems more than acceptable.

Feb 25 2021, 11:49 AM · serviceops, cloud-services-team (Kanban), Platform Team Workboards (Clinic Duty Team), MediaWiki-extensions-OAuth
Joe claimed T272319: Frequent "Nonce already used" errors in scripts and tools.

After merging my change, the number of errors in OAuth.log regarding 'nonce already used' decreased from ~ 80/minute to ~ 17/minute, which seems to be in line with the rate we had before the incident.

Feb 25 2021, 11:43 AM · serviceops, cloud-services-team (Kanban), Platform Team Workboards (Clinic Duty Team), MediaWiki-extensions-OAuth
Joe added a comment to T272319: Frequent "Nonce already used" errors in scripts and tools.

We lost mc1024 (shard06 on redis) on Jan 14, which might be related. On the other hand, I can't be completely sure, since nutcracker ejects the faulty host and reshards: https://github.com/twitter/twemproxy/blob/master/notes/recommendation.md#liveness.

Regardless, we will remove the shard properly from the pool, and hope that it will fix the issues we are experiencing.

Feb 25 2021, 11:06 AM · serviceops, cloud-services-team (Kanban), Platform Team Workboards (Clinic Duty Team), MediaWiki-extensions-OAuth
Joe committed rLPRIac41c15800fc: Use valid-looking salts for etcd password (authored by Joe).
Use valid-looking salts for etcd password
Feb 25 2021, 11:03 AM
Joe added a comment to T256541: Fix the problem with gravatar and mailman3.

So the fix is merged upstream, we can probably package it and deploy it?

Feb 25 2021, 9:34 AM · Upstream, SRE, Wikimedia-Mailing-lists
Joe updated subscribers of T275731: legoktm can't build CI docker images without using root because he's no longer in contint-admins.

I agree that it would make sense for anyone with global root to also be able to manage CI, but it was a deliberate choice back in the day, AIUI, to exclude global roots.

Feb 25 2021, 7:48 AM · serviceops, SRE, Continuous-Integration-Infrastructure

Feb 24 2021

Joe added a comment to T255568: Envoy should listen on ipv6 and ipv4.

No, that entry is for testreduce, which is another test instance too. So I doubt that what you're seeing in the logs has anything to do with this setting.

Feb 24 2021, 9:18 PM · Patch-For-Review, envoy, User-fgiunchedi, observability, serviceops
Joe added a comment to T255568: Envoy should listen on ipv6 and ipv4.

Just for the record, the restbase cluster that has ipv6_compat activated is the dev cluster. Nothing serving production traffic.

Feb 24 2021, 9:04 PM · Patch-For-Review, envoy, User-fgiunchedi, observability, serviceops
Joe created T275600: Support proxying to etcd v3 storage on buster or later.
Feb 24 2021, 9:59 AM · SRE, serviceops