
Joe (Giuseppe Lavagetto)
Spy

Projects (24)

User Details

User Since
Oct 3 2014, 5:57 AM (380 w, 6 d)
Availability
Available
LDAP User
Giuseppe Lavagetto
MediaWiki User
GLavagetto (WMF) [ Global Accounts ]

Recent Activity

Today

Joe created T299648: Make scap deploy to kubernetes together with the legacy systems.
Thu, Jan 20, 3:22 PM · Release-Engineering-Team, MW-on-K8s, serviceops, SRE
Joe added a comment to T292322: Support large files in Shellbox.

Using a known broken hash like MD5 seems wrong in what's supposed to be a security-sensitive application. Since we are already calculating the SHA-1 hash for img_sha1, can we reuse that and only sign the SHA-1 instead of the entire file? Then the Shellbox server would verify the file matches the provided SHA-1.

(Yes, I know I complained about using broken hashes and then said to use SHA-1, but presumably this would get fixed when we move away from SHA-1 in the image table.)

Thu, Jan 20, 2:21 PM · MW-1.38-notes (1.38.0-wmf.18; 2022-01-17), Patch-For-Review, SRE-swift-storage, Shellbox, serviceops, MW-on-K8s
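
The proposal in the comment above (sign only the SHA-1 of the file, then have the server check that the file matches it) could look roughly like the following. This is a minimal Python sketch, not Shellbox's actual PHP implementation: the function names, the key handling, and the choice of HMAC-SHA256 over the digest string are all assumptions made for illustration.

```python
import hashlib
import hmac


def file_sha1(path, chunk_size=1 << 20):
    """Stream the file through SHA-1 so a large file never sits fully in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sign_digest(secret_key, hex_digest):
    """Sign the short digest string (hypothetical scheme) instead of the file body."""
    return hmac.new(secret_key, hex_digest.encode("ascii"), hashlib.sha256).hexdigest()


def verify_upload(secret_key, path, claimed_sha1, signature):
    """Server side: check the signature first, then that the file matches the digest."""
    expected = sign_digest(secret_key, claimed_sha1)
    if not hmac.compare_digest(expected, signature):
        return False
    return hmac.compare_digest(file_sha1(path), claimed_sha1)
```

With this shape the HMAC only ever covers a 40-character string, so its cost no longer grows with file size. The server still has to hash the file once to confirm it matches the signed digest.
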
Jelto awarded T292390: Upgrade all deployment charts to use the latest version of common_templates a Yellow Medal token.
Thu, Jan 20, 9:04 AM · Patch-For-Review, good first task, SRE, serviceops
Joe claimed T291959: The TLS proxy configuration in deployment-charts allows invalid listeners.
Thu, Jan 20, 7:30 AM · Patch-For-Review, serviceops, envoy, SRE

Yesterday

Joe closed T299542: contint1001.wikimedia.org is almost unresponsive as Resolved.
Wed, Jan 19, 5:30 PM · SRE, ops-eqiad, Release-Engineering-Team, Continuous-Integration-Infrastructure
Joe added a comment to T299542: contint1001.wikimedia.org is almost unresponsive.

No need for further restarts, I was able to powercycle the server using ipmi. @Cmjohnson you don't need to do anything :)

Wed, Jan 19, 5:29 PM · SRE, ops-eqiad, Release-Engineering-Team, Continuous-Integration-Infrastructure
Joe closed T285298: Make all httpbb tests pass on the mwdebug deployment., a subtask of T283056: Create a mwdebug deployment for mediawiki on kubernetes, as Resolved.
Wed, Jan 19, 5:10 PM · Patch-For-Review, User-jijiki, MW-on-K8s, serviceops, SRE
Joe closed T285298: Make all httpbb tests pass on the mwdebug deployment. as Resolved.

Right now we have just 3 tests not passing:

Wed, Jan 19, 5:10 PM · Patch-For-Review, MW-on-K8s, serviceops, SRE
Joe created P18884 blubber.sh.
Wed, Jan 19, 5:04 PM
Joe added a comment to T299501: scap fails deployments on bullseye/python 3.9.

The problem arises because pyyaml version 5.3.1 by default uses the safe loader for python objects, so to make the yaml load we need to change the code from:

Wed, Jan 19, 11:57 AM · SRE, Scap
Joe created T299501: scap fails deployments on bullseye/python 3.9.
Wed, Jan 19, 11:49 AM · SRE, Scap
Joe edited P18841 scap bullseye fail.
Wed, Jan 19, 11:26 AM
Joe created P18841 scap bullseye fail.
Wed, Jan 19, 11:18 AM
Joe added a comment to T292818: Better scaffolding for helm charts / releases.

To be honest, I am not understanding what the solution exactly is. I gather that the point is to not have a lot of boilerplate code in the charts? Because the rest sounds a lot like what we currently have. We already have a wizard. It may be not asking all these questions (and I am of the opinion it should NOT be asking some of these questions e.g. mcrouter or not), but it's asking already 3 of them. We can add a few more with sane defaults.

Now, the generated charts are large, but that's mostly because of duplication and could be greatly simplified by moving more things to common_templates/ and if guarding them. That would make reviews way easier (knowing that a chart just uses parts of already vetted common_templates would definitely be a plus)

Note that generating fully custom tailored charts poses the inverse problem. When the time comes to add some functionality to the chart, it will have to be done manually by the owner of the chart and will lead to non upgradeable situations as well.

Wed, Jan 19, 10:33 AM · Prod-Kubernetes, serviceops
Joe closed T291530: Get mcrouter & prometheus-mcrouter-exporter tags for helmfile.d from upstream config as Resolved.

This task is resolved. I think it's time we write a style guide to writing a helm chart and managing this kind of stuff.

Wed, Jan 19, 10:22 AM · Patch-For-Review, serviceops, Toolhub

Tue, Jan 18

Joe added a comment to T292322: Support large files in Shellbox.

52 seconds in Shellbox\Client::computeHmac over 3 calls, I guess all signatures for the remote shellbox calls

I benchmarked the SHA-256 HMAC we're using at 5.2 seconds per gigabyte. If that's too slow, one option is to weaken the hash. On the same test, SHA-1 took 1.8 seconds, and MD5 took 1.3 seconds. Another option is to improve the implementation. PHP has its own implementation of the SHA family hash functions which are not using hardware acceleration. Intel has an article explaining how to use the Intel SHA intrinsics to implement SHA-256, including sample code.

Tue, Jan 18, 8:46 AM · MW-1.38-notes (1.38.0-wmf.18; 2022-01-17), Patch-For-Review, SRE-swift-storage, Shellbox, serviceops, MW-on-K8s
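
A comparison like the one quoted above can be reproduced in outline with a short timing loop. The sketch below is Python rather than PHP, so its numbers will not match the quoted 5.2 / 1.8 / 1.3 seconds per gigabyte: Python's hashlib is typically backed by OpenSSL, which may use hardware acceleration. The key, the buffer sizes, and the function name are assumptions.

```python
import hashlib
import hmac
import time


def hmac_seconds_per_gib(algorithm, total_bytes=16 * 1024 * 1024, chunk_size=1 << 20):
    """Time an HMAC over total_bytes of zeroes and scale the result to one GiB."""
    key = b"benchmark-key"  # hypothetical key, only used for timing
    chunk = bytes(chunk_size)
    mac = hmac.new(key, digestmod=algorithm)
    start = time.perf_counter()
    for _ in range(total_bytes // chunk_size):
        mac.update(chunk)
    mac.digest()
    elapsed = time.perf_counter() - start
    return elapsed * (1 << 30) / total_bytes


for name in ("sha256", "sha1", "md5"):
    print(f"{name}: {hmac_seconds_per_gib(name):.2f} s/GiB")
```

Going by the figures in the comment, SHA-1 is about 2.9 times faster than SHA-256 (5.2 / 1.8) and MD5 is 4 times faster (5.2 / 1.3).
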

Mon, Jan 17

Joe created P18763 The best url in the world.
Mon, Jan 17, 3:16 PM
Joe updated subscribers of T299302: Linter jobs are running slowly.

The backlog has been recovered; tomorrow I'll lower the concurrency for such jobs.

Given that the lint is being removed, I think we're going to see the same activity in jobs just in the other direction, deleting rows rather than inserting, so it would be nice to keep the extra concurrency through this week too.

Mon, Jan 17, 7:44 AM · MW-1.38-notes (1.38.0-wmf.17; 2022-01-10), Patch-For-Review, serviceops, Platform Engineering, WMF-JobQueue, MediaWiki-extensions-Linter

Sun, Jan 16

Joe closed T299302: Linter jobs are running slowly as Resolved.

The backlog has been recovered; tomorrow I'll lower the concurrency for such jobs.

Sun, Jan 16, 5:42 PM · MW-1.38-notes (1.38.0-wmf.17; 2022-01-10), Patch-For-Review, serviceops, Platform Engineering, WMF-JobQueue, MediaWiki-extensions-Linter
Joe added a comment to T299302: Linter jobs are running slowly.

We're now reducing the number of backlogged items at a rate of 25k/minute. At this pace, the backlog should be back near zero in 6 hours. I think this is a reasonable time for resolution. Leaving the task open so we can come back and assess further.

Sun, Jan 16, 8:36 AM · MW-1.38-notes (1.38.0-wmf.17; 2022-01-10), Patch-For-Review, serviceops, Platform Engineering, WMF-JobQueue, MediaWiki-extensions-Linter
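
As a rough check on the estimate in the comment above: draining 25k items per minute for 6 hours implies a backlog of about 9 million items. The comment does not state the backlog size, so that figure is inferred, and the helper name is hypothetical.

```python
def minutes_to_drain(backlog_items, items_per_minute):
    """Minutes needed to clear a backlog at a constant net drain rate."""
    if items_per_minute <= 0:
        raise ValueError("drain rate must be positive")
    return backlog_items / items_per_minute


rate = 25_000                     # items per minute, from the comment
implied_backlog = rate * 6 * 60   # inferred: 6 hours at that rate
print(implied_backlog)
print(minutes_to_drain(implied_backlog, rate) / 60)  # hours
```
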
Joe added a comment to T299302: Linter jobs are running slowly.

I think the task title is inaccurate.

Sun, Jan 16, 8:11 AM · MW-1.38-notes (1.38.0-wmf.17; 2022-01-10), Patch-For-Review, serviceops, Platform Engineering, WMF-JobQueue, MediaWiki-extensions-Linter

Sat, Jan 15

Joe added a comment to T299282: All search results on afwikibooks redirect to invalid URL (due to code in MediaWiki:Common.js).

The MediaWiki:Common.js snippet causing this seems obsolete to me (it seems to come from dewiktionary which does not use it anymore, we haven't used Squid for user traffic in years and afaik don't collect search analytics from ats/varnish logs). Since there are no local int-admins or sysops I'm tempted to simply remove that code. Thoughts?

Sat, Jan 15, 9:26 PM · WMF-General-or-Unknown

Thu, Jan 13

Joe added a comment to T292322: Support large files in Shellbox.

On mwdebug1002 I have set the excimer time limit (in mediawiki's code), the envoy timeout, the apache timeout, the php-fpm request_terminate_timeout, the php max_execution_time (!!!) and restarted all the relevant daemons. Using Xhgui I found the following results.

Thu, Jan 13, 3:34 PM · MW-1.38-notes (1.38.0-wmf.18; 2022-01-17), Patch-For-Review, SRE-swift-storage, Shellbox, serviceops, MW-on-K8s
Joe closed T298986: Deploy Scap version 4.1.1 as Resolved.
Thu, Jan 13, 11:27 AM · Release-Engineering-Team, serviceops, Scap
Joe added a comment to T292322: Support large files in Shellbox.

I did the following test:

  1. Try to upload the image via Special:Upload to testwiki using "upload via url", which currently has wmgUsePagedTiffHandlerShellbox set to true, on mwdebug. It times out after 202 seconds (the time limit in php-fpm via request_terminate_timeout is 201 seconds, 202 is the timeout in apache). In this case the script seems to be called twice, with the timeout happening while executing the second call, see https://logstash.wikimedia.org/goto/ae32fc43f7b06d894cbd7c59fa6fc080
  2. Try to upload the image to testwiki after changing manually wmgUsePagedTiffHandlerShellbox to false on mwdebug1001. This succeeds after 194 seconds, but an error is returned anyways as (IIRC) the timeout on the frontend caches is 180 seconds. So the image is uploaded but the user gets an error. This is something we've had to fix for a long time, but unrelated to the current issue. In this case the script seems to be called twice, see https://logstash.wikimedia.org/goto/e86e8520c572b4709595d1a10f12aeea
  3. Try to upload the image to testwiki with wmgUsePagedTiffHandlerShellbox set to true but raising the timeout to 300 seconds in mediawiki, php-fpm and apache2. Still fails after 300 seconds. In this case, the script seems to be called three times, see https://logstash.wikimedia.org/goto/0db9bb9d9484149f0d231b3aec2ef701
Thu, Jan 13, 11:17 AM · MW-1.38-notes (1.38.0-wmf.18; 2022-01-17), Patch-For-Review, SRE-swift-storage, Shellbox, serviceops, MW-on-K8s
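
The second test above succeeds on the backend but still shows the user an error, because a layer nearer the client gives up before the layers behind it. The sketch below checks a chain of timeouts for that kind of inversion. It uses the values quoted in the comment (180 seconds for the frontend caches is marked IIRC there), and the function name and the layer ordering are assumptions.

```python
def find_timeout_inversions(layers):
    """layers: (name, timeout_seconds) pairs ordered from the client-facing edge inward.

    Returns (outer, inner) name pairs where the outer layer gives up
    before the layer directly behind it.
    """
    inversions = []
    for (outer_name, outer_t), (inner_name, inner_t) in zip(layers, layers[1:]):
        if outer_t < inner_t:
            inversions.append((outer_name, inner_name))
    return inversions


# Values as quoted in the comment; the envoy timeout is not given there, so it is omitted.
layers = [("frontend cache", 180), ("apache", 202), ("php-fpm", 201)]
print(find_timeout_inversions(layers))
```

With these numbers the only inversion reported is between the frontend cache and apache, which matches the "image is uploaded but the user gets an error" outcome.
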

Wed, Jan 12

Joe claimed T292322: Support large files in Shellbox.

I ran the command locally (I think!) on mwmaint1002, and it took a comparable time to what it took calling shellbox - apparently the transfer time is very small compared to the time it takes to prepare the file for input.
@tstarling I think it is possible that about 10 seconds are spent copying the input file into the sandbox (via InputFileFromFile) - am I correct in thinking that we should be able to avoid this copy when we execute the command remotely?

Wed, Jan 12, 3:52 PM · MW-1.38-notes (1.38.0-wmf.18; 2022-01-17), Patch-For-Review, SRE-swift-storage, Shellbox, serviceops, MW-on-K8s
Joe triaged T292322: Support large files in Shellbox as High priority.
Wed, Jan 12, 2:33 PM · MW-1.38-notes (1.38.0-wmf.18; 2022-01-17), Patch-For-Review, SRE-swift-storage, Shellbox, serviceops, MW-on-K8s
Joe added a comment to T292322: Support large files in Shellbox.

I tried to give more resources to the shellbox container, but that didn't matter much - I guess the shellout we're running is single-threaded anyway. I can try to beef up the cpu/memory of the envoy container in front of shellbox, but that won't improve the render times much either.

Wed, Jan 12, 11:58 AM · MW-1.38-notes (1.38.0-wmf.18; 2022-01-17), Patch-For-Review, SRE-swift-storage, Shellbox, serviceops, MW-on-K8s
Joe added a comment to T292322: Support large files in Shellbox.

Is the procedure the one documented at https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments ?

Wed, Jan 12, 11:10 AM · MW-1.38-notes (1.38.0-wmf.18; 2022-01-17), Patch-For-Review, SRE-swift-storage, Shellbox, serviceops, MW-on-K8s
Joe triaged T298986: Deploy Scap version 4.1.1 as Medium priority.
Wed, Jan 12, 10:53 AM · Release-Engineering-Team, serviceops, Scap
Joe added a comment to T292322: Support large files in Shellbox.

After the deployment, running the same command as before results in:

var_dump($result);
object(Shellbox\Command\BoxedResult)#675 (4) {
  ["files":"Shellbox\Command\BoxedResult":private]=>
  array(0) {
  }
  ["exitCode":"Shellbox\Command\UnboxedResult":private]=>
  int(0)
  ["stdout":"Shellbox\Command\UnboxedResult":private]=>
  string(18173) "TIFF Directory at offset 0x6c466db8 (1816554936)
  Subfile Type: (0 = 0x0)
  Image Width: 40000 Image Length: 12788
  Resolution: 72, 72 pixels/inch
  Bits/Sample: 8
...
Wed, Jan 12, 8:21 AM · MW-1.38-notes (1.38.0-wmf.18; 2022-01-17), Patch-For-Review, SRE-swift-storage, Shellbox, serviceops, MW-on-K8s

Tue, Jan 11

Joe claimed T298986: Deploy Scap version 4.1.1.
Tue, Jan 11, 6:25 PM · Release-Engineering-Team, serviceops, Scap
Joe added a comment to T292322: Support large files in Shellbox.

For the record, I'm taking care of this release, and given I am annoyed at how we manage image versions for shellbox, I'm also slightly modifying the procedure. I'll add docs to wikitech once I'm done.

Tue, Jan 11, 11:23 AM · MW-1.38-notes (1.38.0-wmf.18; 2022-01-17), Patch-For-Review, SRE-swift-storage, Shellbox, serviceops, MW-on-K8s

Mon, Jan 10

Joe added a comment to T298854: NPM "colors" and "faker" package vulnerability audit.

The nodejs12-devel image contains node-colors as shipped in Debian Bullseye, but that version isn't affected (it predates the vandalised release)

Mon, Jan 10, 10:37 AM · Vuln-VulnComponent, SecTeam-Processed, LibUp, JavaScript, Security-Team, Security
Joe added a comment to T298815: cross-validate-accounts: Malformed membership for ops user ..., has additional group(s): {'deployment-ci-admins'}.

Sorry I didn't realize I needed to add the group to the cross-validate-accounts lists.

Mon, Jan 10, 8:55 AM · User-RhinosF1, serviceops, Infrastructure-Foundations
Joe added a comment to T298570: Consider filesystem/disk based improvements on WQDS servers.

XFS has both advantages and disadvantages, including in terms of data safety. In fact, I think the data persistence team has used xfs in the past but switched to ext4 a long time ago.

Mon, Jan 10, 8:54 AM · SRE, Discovery-Search (Current work)
Joe added a comment to T266055: Update Scap to perform rolling restart for all MW deploy.

We ran the test today. @jijiki supplied SRE backup.

It ran in two phases:

Phase 1: php_fpm_always_restart: false (its usual state at the moment)
scap sync-file README
Total time: 1m19.942s

php_fpm_always_restart: true:
scap sync-file README
Total time: 3m12.836s

After the test it was noted in #wikimedia-operations that there was a small spike of HTTP 500 errors during phase 1 and a larger spike during phase 2. A discussion ensued and it was concluded that this is expected, unavoidable, and an acceptable outcome.

It is a surprise to me that without restart there would be an increase in "normal" HTTP 500 errors during the deploy. I'm not aware of this being the status quo or what causal link that would have.

Mon, Jan 10, 7:31 AM · Release-Engineering-Team (Radar), User-jijiki, Patch-For-Review, Scap
Joe added a comment to T298854: NPM "colors" and "faker" package vulnerability audit.

Sadly that code search is not a complete guarantee. For things we build with the pipeline, we might not have a package.json for all dependencies, so we will have to do a manual check of the docker images.

Mon, Jan 10, 6:37 AM · Vuln-VulnComponent, SecTeam-Processed, LibUp, JavaScript, Security-Team, Security

Wed, Jan 5

Joe triaged T295578: Test running php7.2 and php7.4 in parallel on the beta cluster as Medium priority.
Wed, Jan 5, 9:06 AM · Patch-For-Review, serviceops
Joe added a comment to T292322: Support large files in Shellbox.

Updating myself: looks like the error comes from the fact mwmaint is *not* using remote shellbox to execute scripts/retrieveMetaData.sh - see https://logstash.wikimedia.org/goto/54a14411c613702412af091616f31203

Wed, Jan 5, 8:12 AM · MW-1.38-notes (1.38.0-wmf.18; 2022-01-17), Patch-For-Review, SRE-swift-storage, Shellbox, serviceops, MW-on-K8s

Tue, Jan 4

Joe placed T297673: Build MediaWiki images for kubernetes on the deployment servers up for grabs.

The deployment servers should now be able to build the images for mediawiki. I'll de-assign the task from myself for now, please ping me if I'm needed for the next steps!

Tue, Jan 4, 4:31 PM · Release-Engineering-Team, serviceops, MW-on-K8s
Joe updated the task description for T297673: Build MediaWiki images for kubernetes on the deployment servers.
Tue, Jan 4, 4:28 PM · Release-Engineering-Team, serviceops, MW-on-K8s
Joe committed rLPRI804357511c0f: Update for refactor of deployment server role (authored by Joe).
Update for refactor of deployment server role
Tue, Jan 4, 6:48 AM

Mon, Jan 3

Joe updated the task description for T297673: Build MediaWiki images for kubernetes on the deployment servers.
Mon, Jan 3, 3:00 PM · Release-Engineering-Team, serviceops, MW-on-K8s
Joe added a comment to T292322: Support large files in Shellbox.

I thought I had replied earlier, for now the plan is to test POSTing large files to Shellbox, identify what layers it fails at and fix those.

A basic test would be to download a big file (maybe https://commons.wikimedia.org/wiki/File:Andromeda_Galaxy_M31_-_Heic1502a_Full_resolution.tiff) to mwmaint and then run something like:

use MediaWiki\MediaWikiServices;
$command = MediaWikiServices::getInstance()->getShellCommandFactory()->createBoxed( 'pagedtiffhandler' )->disableNetwork()->firejailDefaultSeccomp()->routeName( 'pagedtiffhandler-metadata' );
$command->params( 'tiffinfo', 'file.tiff' );
$command->inputFileFromFile( 'file.tiff', __DIR__ . '/downloaded-file.tiff' );
$result = $command->execute();
var_dump($result);
Mon, Jan 3, 2:52 PM · MW-1.38-notes (1.38.0-wmf.18; 2022-01-17), Patch-For-Review, SRE-swift-storage, Shellbox, serviceops, MW-on-K8s
Joe committed rLPRIa3d2fa31b157: Copy hiera private data for merging Id10dbe7d244ab9b8 (authored by Joe).
Copy hiera private data for merging Id10dbe7d244ab9b8
Mon, Jan 3, 10:22 AM
Joe closed T297667: mysqli/mysqlnd memory leak as Resolved.
Mon, Jan 3, 10:00 AM · serviceops-radar, WMF-General-or-Unknown

Wed, Dec 22

Joe placed T294962: Q2:(Need By: TBD) rack/setup/install mc20[38-55] up for grabs.
Wed, Dec 22, 9:04 AM · SRE, serviceops, ops-codfw, DC-Ops
Joe added a comment to T287130: Container image lifecycle management.

@Joe: My recollection is you were going to take care of the blubber and docker-pkg parts (although it’s been a while since we talked about it) -- with the exception of adding a DNS record to talk to, that should be unblocked, so feel free to grab it while I’m still out, if you're so inclined.

Wed, Dec 22, 6:51 AM · Patch-For-Review, Release Pipeline (Blubber), docker-pkg, serviceops, SRE

Dec 20 2021

Joe added a comment to T297667: mysqli/mysqlnd memory leak.

I just upgraded all the canary appservers, will proceed with the rest of the canaries now (which includes parsoid and the api clusters). If after inspection tomorrow I don't see major issues, I'll proceed with the rest of the clusters.

Dec 20 2021, 6:34 PM · serviceops-radar, WMF-General-or-Unknown
Joe claimed T297673: Build MediaWiki images for kubernetes on the deployment servers.
Dec 20 2021, 10:42 AM · Release-Engineering-Team, serviceops, MW-on-K8s
Joe added a comment to T294962: Q2:(Need By: TBD) rack/setup/install mc20[38-55].

Yes sorry, I dropped the ball on this.

Dec 20 2021, 10:33 AM · SRE, serviceops, ops-codfw, DC-Ops
Joe updated the task description for T288851: Make logging work for mediawiki in k8s.
Dec 20 2021, 8:00 AM · Patch-For-Review, SRE Observability, MW-on-K8s, serviceops, SRE

Dec 17 2021

Joe closed T297613: On the kube-experimental mwdebug cluster, MediaWiki sees all edits as coming from localhost as Resolved.
Dec 17 2021, 11:02 AM · serviceops, MW-on-K8s
Joe added a comment to T297613: On the kube-experimental mwdebug cluster, MediaWiki sees all edits as coming from localhost.

With my last changes, I'm now able to correctly see the page, and REMOTE_ADDR is not set to localhost in either of the following situations:

Dec 17 2021, 9:19 AM · serviceops, MW-on-K8s
Joe added a comment to T294911: Apparent latency warning in 90th centile of eventgate-logging-external.

@BTullis I'm not fully convinced that slowness in connection to the edge would reflect in a longer connection timeout on the backend.

Dec 17 2021, 7:59 AM · Analytics, Data-Engineering, Event-Platform, Observability-Alerting
Joe added a comment to T297667: mysqli/mysqlnd memory leak.

Yeah, it turns out segfaulting once every couple of hours keeps a lid on memory usage. I did a linear regression of the data from 2021-12-16 04:00 to 21:50. mw1414 is leaking 122 MB/hour, and mw1415 is leaking 600MB/hour. Most likely there is a second smaller memory leak, and the task as described is resolved. I would suggest rolling out the patch. This is my last day before vacation. I can do another core dump analysis in the new year.

Dec 17 2021, 6:59 AM · serviceops-radar, WMF-General-or-Unknown

Dec 16 2021

Joe added a comment to T297613: On the kube-experimental mwdebug cluster, MediaWiki sees all edits as coming from localhost.

Sadly the story is more complex; in fact, only requests coming from the edge contain X-Client-Ip by default, so we need to inject it into any request at the tls termination layer.

Dec 16 2021, 12:22 PM · serviceops, MW-on-K8s
Joe added a comment to T297667: mysqli/mysqlnd memory leak.

Interestingly, after the second patch, I think the memory usage is steadily increasing again:

Dec 16 2021, 8:00 AM · serviceops-radar, WMF-General-or-Unknown
Joe added a comment to T297667: mysqli/mysqlnd memory leak.

Can confirm no new segfaults since then (and if any do happen, we first need to check whether they always happen at the same address, as php has its own sizeable amount of segfaults anyway).

Dec 16 2021, 7:56 AM · serviceops-radar, WMF-General-or-Unknown

Dec 15 2021

Joe claimed T297613: On the kube-experimental mwdebug cluster, MediaWiki sees all edits as coming from localhost.

Enabling mod_remoteip did the trick. I will now add the configuration to the base image.

Dec 15 2021, 1:40 PM · serviceops, MW-on-K8s
Joe added a comment to T297667: mysqli/mysqlnd memory leak.

Update: I rebuilt php 7.2 with Tim's backport, published it to our apt repository, and upgraded mw1414. Then repooled it, and restarted php-fpm on mw1415 about 10 minutes later to provide a baseline comparison of memory usage. Tomorrow I'll revisit the situation.

Dec 15 2021, 11:52 AM · serviceops-radar, WMF-General-or-Unknown

Dec 14 2021

Joe added a comment to T297613: On the kube-experimental mwdebug cluster, MediaWiki sees all edits as coming from localhost.

I added a debug script that just dumps $_SERVER, and indeed REMOTE_ADDR is 127.0.0.1, while on mwdebug1001 it's set to the IP address of the host, while X_FORWARDED_FOR is always set to the IP of the client:

Dec 14 2021, 6:03 PM · serviceops, MW-on-K8s
Joe added a comment to T297517: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php.

FWIW, I wholeheartedly agree with @thcipriani's opinions above.

Dec 14 2021, 5:40 PM · User-Ladsgroup, SRE, serviceops, Wikimedia-production-error
Joe closed T293450: Allow coexisting php version in our puppet code as Resolved.
Dec 14 2021, 8:12 AM · serviceops
Joe closed T293450: Allow coexisting php version in our puppet code, a subtask of T271736: Migrate WMF Production from PHP 7.2 to PHP 7.4, as Resolved.
Dec 14 2021, 8:12 AM · Performance-Team (Radar), serviceops
Joe created T297673: Build MediaWiki images for kubernetes on the deployment servers.
Dec 14 2021, 7:44 AM · Release-Engineering-Team, serviceops, MW-on-K8s
Joe added a comment to T297517: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php.

The only thing unique to this report as compared to T296098 and T296063 is the failure mode, i.e. mmap() failure, which as I've said should be fixed by tuning the kernel. Is tuning the kernel the thing that you want unbroken now? Again, it has probably been broken for years.

Dec 14 2021, 7:16 AM · User-Ladsgroup, SRE, serviceops, Wikimedia-production-error
Joe added a comment to T297517: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php.

We're currently on 1.38.0-wmf.9, and this remains a blocker to rolling forward both wmf.12 and this week's wmf.13. Re-raising to UBN!.

This bug is a duplicate of T296098 which is marked as resolved. Per my latest comment there, I think there is a memory leak in mysqli which was always present but was exacerbated by T296063 (also marked as resolved). I don't see how a bug can be a duplicate of two resolved bugs and yet be UBN. I don't know what I'm meant to do with it to fix it. I'm going to work on the memory leak, I'll file a bug about it once it's fully isolated, but it has probably been present for years so doesn't seem like it should block the train.

Dec 14 2021, 7:14 AM · User-Ladsgroup, SRE, serviceops, Wikimedia-production-error

Dec 13 2021

Joe added a comment to T297613: On the kube-experimental mwdebug cluster, MediaWiki sees all edits as coming from localhost.

Narrowing down the problem: I see the actual client IP in the apache httpd logs for my requests. So it seems that the problem is somewhere between what gets passed to mediawiki and how mediawiki treats such data.

Dec 13 2021, 3:48 PM · serviceops, MW-on-K8s
Joe triaged T297613: On the kube-experimental mwdebug cluster, MediaWiki sees all edits as coming from localhost as High priority.
Dec 13 2021, 2:44 PM · serviceops, MW-on-K8s
Joe created T297613: On the kube-experimental mwdebug cluster, MediaWiki sees all edits as coming from localhost.
Dec 13 2021, 2:44 PM · serviceops, MW-on-K8s
Joe updated the task description for T285298: Make all httpbb tests pass on the mwdebug deployment..
Dec 13 2021, 7:06 AM · Patch-For-Review, MW-on-K8s, serviceops, SRE

Dec 9 2021

Joe added a comment to T297322: CVE-2021-44857, CVE-2021-44858: Unauthorized users can undo edits on any protected page and view contents of private wikis using mcrundo .

@Dylsss, @Ladsgroup, @Jdforrester-WMF: thank you very much for finding and fixing this! This is very likely my bad.

Are there any issues remaining to be addressed here?

Dec 9 2021, 10:14 AM · MW-1.38-notes (1.38.0-wmf.18; 2022-01-17), Patch-For-Review, MW-1.37-notes, MW-1.36-notes, MW-1.35-notes, MediaWiki-General, Platform Team Initiatives (MCR), Wikimedia-Incident, Vuln-Infoleak, Security, Security-Team

Dec 8 2021

Joe added a comment to T297259: Compare Parsoid perf on current production servers vs a newer test server.

I would suggest that instead of trying a test server, we should focus on making parsoid tests run on kubernetes, which is where parsoid will be running soon.

Dec 8 2021, 8:08 AM · serviceops, Parsoid

Dec 2 2021

Joe added a comment to T296641: Upgrade kafka-main nodes to buster.

@elukey would it facilitate the work if we disabled eventgate-main in eqiad during the work?

Dec 2 2021, 2:20 PM · Patch-For-Review, serviceops

Nov 24 2021

Joe created P17824 YAML LOL.
Nov 24 2021, 4:40 PM
Joe added a comment to T288851: Make logging work for mediawiki in k8s.

After deploying the changes to php-fatal-error.php, we can now see the error messages delivered by php-wmerrors in logstash.

Nov 24 2021, 11:33 AM · Patch-For-Review, SRE Observability, MW-on-K8s, serviceops, SRE
Joe updated the task description for T288851: Make logging work for mediawiki in k8s.
Nov 24 2021, 11:25 AM · Patch-For-Review, SRE Observability, MW-on-K8s, serviceops, SRE

Nov 23 2021

Joe updated subscribers of T296312: Exception: Invalid reverseProxy configured.

@Legoktm this seems related to your patches?

Nov 23 2021, 4:58 PM · MW-1.38-notes (1.38.0-wmf.12; 2021-12-06), MediaWiki-libs-HTTP, Beta-Cluster-reproducible

Nov 22 2021

Joe added a comment to P17795 (An Untitled Masterwork).

Suspect of needing to be moved to critical: true

Nov 22 2021, 2:54 PM
Joe created P17795 (An Untitled Masterwork).
Nov 22 2021, 2:52 PM

Nov 17 2021

Joe added a comment to T246371: Move job traffic from rpc/RunSingleJob to REST endpoint.

To clarify things a bit:

  • For switching to using the rest endpoint, we need to unify the apache configurations between the appservers and the jobrunners, probably just leaving around the current jobrunner vhosts for the time of the switch.
  • Not needing to implement a separate set of apache configurations for jobrunners on kubernetes would be desirable, so yes, I would like this task to be completed before we switch the jobrunners to kubernetes. It's not a hard requirement though: given we want to do this, it would be better to do it before we migrate to k8s for jobs.
  • We should not wait for the move to kubernetes in order to do this switch though. It should be considered a precondition.
Nov 17 2021, 5:23 PM · serviceops-radar, ChangeProp, Platform Team Workboards (Clinic Duty Team), WMF-JobQueue, MediaWiki-Core-JobQueue
Joe added projects to T295900: Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet: SRE, SRE-Access-Requests.
Nov 17 2021, 3:30 PM · SRE-Access-Requests, SRE, Parsoid (Tracking), serviceops

Nov 16 2021

Joe closed T283056: Create a mwdebug deployment for mediawiki on kubernetes as Resolved.
Nov 16 2021, 2:59 PM · Patch-For-Review, User-jijiki, MW-on-K8s, serviceops, SRE

Nov 12 2021

Joe updated the task description for T295578: Test running php7.2 and php7.4 in parallel on the beta cluster.
Nov 12 2021, 11:10 AM · Patch-For-Review, serviceops
Joe added a comment to T295578: Test running php7.2 and php7.4 in parallel on the beta cluster.

I have installed a fresh appserver with both php 7.2 and 7.4, and:

Nov 12 2021, 11:09 AM · Patch-For-Review, serviceops
Joe created T295580: Test php7.4 for dumps generation.
Nov 12 2021, 11:05 AM · Dumps-Generation, serviceops
Joe created T295578: Test running php7.2 and php7.4 in parallel on the beta cluster.
Nov 12 2021, 11:01 AM · Patch-For-Review, serviceops

Nov 10 2021

Joe added a comment to T295481: Setup GitLab Runner in trusted environment.

The only things I could see as potentially different between these runners and those running in wmcs are:

Nov 10 2021, 6:11 PM · Patch-For-Review, GitLab (CI & Job Runners), SecTeam-Processed, Release-Engineering-Team (Radar), Security-Team, serviceops
Joe claimed T293450: Allow coexisting php version in our puppet code.
Nov 10 2021, 7:04 AM · serviceops

Nov 8 2021

Joe added a comment to T295290: tegola-vector-tiles doesnt execute new tile pregeneration jobs.

From looking at the envoy logs from that particular Pod I'd assume that envoy was not up/ready when tegola tried to connect.
You should be able to wait for it to be up by checking for HTTP 200 on 127.0.0.1:9361/healthz

Nov 8 2021, 3:13 PM · serviceops, Maps
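
The suggestion above (wait until envoy answers HTTP 200 on 127.0.0.1:9361/healthz before connecting) can be sketched as a small polling loop. This is illustrative Python under assumptions, not what tegola or the chart actually does: the function names, the attempt count and the delay are hypothetical.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=1.0):
    """Return True only if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def wait_until_ready(probe, attempts=30, delay=1.0, sleep=time.sleep):
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Example use before starting the real workload:
# ready = wait_until_ready(lambda: http_ok("http://127.0.0.1:9361/healthz"))
```

Passing the probe in as a callable keeps the retry logic independent of the HTTP check, so it can be exercised without a running envoy.
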
Joe added a comment to T219279: Some pages will become completely unreachable after PHP7 update due to Unicode changes.

https://en.wikipedia.org/wiki/WP:VPT#Automatically_renamed_users

Hello! See User:Ʝ and User:DZoo — two users who seem to have disappeared during a rename from lowercase to capital letters. Their user pages' history show them being created by users with the lowercase name, but clicking their user link or contribs link brings you to the nonexistent username with the capital first character.

Nov 8 2021, 10:03 AM · MW-1.38-notes (1.38.0-wmf.6; 2021-10-26), Patch-For-Review, MW-1.35-notes (1.35.0-wmf.28; 2020-04-14), User-notice, Platform Team Workboards (Clinic Duty Team), MW-1.34-notes (1.34.0-wmf.16; 2019-07-30), serviceops, SRE, PHP 7.2 support, MediaWiki-General

Nov 4 2021

Joe added a comment to T294800: Reconcile MediaWiki POST timeout and Varnish/ATS timeouts.

Let me add another data point: Of those 8 requests over 175 seconds, only 2 were POSTs to Special:Upload.

Nov 4 2021, 7:34 AM · Performance-Team (Radar), Traffic, serviceops, SRE
Joe added a comment to T294800: Reconcile MediaWiki POST timeout and Varnish/ATS timeouts.
Nov 4 2021, 7:32 AM · Performance-Team (Radar), Traffic, serviceops, SRE

Nov 3 2021

Joe claimed T294962: Q2:(Need By: TBD) rack/setup/install mc20[38-55].
Nov 3 2021, 6:26 PM · SRE, serviceops, ops-codfw, DC-Ops
Joe updated subscribers of T294962: Q2:(Need By: TBD) rack/setup/install mc20[38-55].

@RobH @jijiki is on PTO at the moment.

Nov 3 2021, 6:25 PM · SRE, serviceops, ops-codfw, DC-Ops

Nov 2 2021

Joe committed rLPRI761e6d4cacfc: Introducing profile::lists::web_deny_conditions (authored by Joe).
Introducing profile::lists::web_deny_conditions
Nov 2 2021, 11:23 AM
Joe raised the priority of T294581: Upgrade ECS to 1.11.0 from Medium to High.

@lmata changing the priority to "high" as this work is a blocker for T288851

Nov 2 2021, 6:45 AM · Patch-For-Review, Observability-Logging, SRE Observability (FY2021/2022-Q2)
Joe added a comment to T294800: Reconcile MediaWiki POST timeout and Varnish/ATS timeouts.

If anything, I think we should go in the other direction, and progressively and drastically reduce our timeouts for any synchronous requests to something nearer to the GET timeout of 60 seconds.

Nov 2 2021, 6:41 AM · Performance-Team (Radar), Traffic, serviceops, SRE