Joe (Giuseppe Lavagetto)
Spy

Projects (24)

User Details

User Since
Oct 3 2014, 5:57 AM (321 w, 4 d)
Availability
Available
LDAP User
Giuseppe Lavagetto
MediaWiki User
GLavagetto (WMF) [ Global Accounts ]

Recent Activity

Thu, Nov 26

Aklapper awarded T264991: Upgrade the MediaWiki servers to ICU 63 a Love token.
Thu, Nov 26, 3:35 PM · Patch-For-Review, Beta-Cluster-Infrastructure, DBA, Operations, serviceops
Joe lowered the priority of T268819: Blubber needs to check if a user is present before creating it as part of its runs stanza from High to Low.

After reviewing some more details:
www-data is already present, but its home directory doesn't exist, so there is no use in re-using it. We're also *not* actively using it, as we're setting the home directory to /srv/app anyway. So we're OK with creating a duplicate user in this case, even if it feels weird.

Thu, Nov 26, 11:24 AM · serviceops, Release-Engineering-Team, MW-on-K8s, Release Pipeline (Blubber)
Joe updated the task description for T268819: Blubber needs to check if a user is present before creating it as part of its runs stanza.
Thu, Nov 26, 11:13 AM · serviceops, Release-Engineering-Team, MW-on-K8s, Release Pipeline (Blubber)
Joe added a comment to T268819: Blubber needs to check if a user is present before creating it as part of its runs stanza.

To clarify:

Thu, Nov 26, 10:39 AM · serviceops, Release-Engineering-Team, MW-on-K8s, Release Pipeline (Blubber)
Joe added a comment to T268819: Blubber needs to check if a user is present before creating it as part of its runs stanza.

I did set the priority to 'high' because this is a blocker to the production deployment of shellbox.

Thu, Nov 26, 10:22 AM · serviceops, Release-Engineering-Team, MW-on-K8s, Release Pipeline (Blubber)
Joe triaged T268819: Blubber needs to check if a user is present before creating it as part of its runs stanza as High priority.
Thu, Nov 26, 10:22 AM · serviceops, Release-Engineering-Team, MW-on-K8s, Release Pipeline (Blubber)
Joe added a comment to T263295: Setup Git repo and CI for shellbox.

Some updates on this after conversations with various people and some more poking:

  • We need the CI to run pipeline-test and pipeline-test-and-publish. This is currently blocked by not having proper images to base the blubber variants on, or perhaps my inability to find the right one.
Thu, Nov 26, 8:02 AM · Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)), Release-Engineering-Team (CI & Testing services), Continuous-Integration-Config, Platform Team Workboards (Clinic Duty Team), Patch-For-Review

Wed, Nov 25

Joe claimed T268612: Docker image on the build host seem to ignore apt priority for wikimedia packages.
Wed, Nov 25, 6:45 AM · docker-pkg, serviceops, Operations
Joe closed T265324: Create the base container images for running MediaWiki in a production environment as Resolved.
Wed, Nov 25, 6:44 AM · Operations, serviceops, MW-on-K8s
Joe closed T265324: Create the base container images for running MediaWiki in a production environment, a subtask of T238774: Provide the official production base images for Wikimedia use, as Resolved.
Wed, Nov 25, 6:44 AM · Release-Engineering-Team-TODO, serviceops, Release-Engineering-Team, Release Pipeline

Tue, Nov 24

Joe added a comment to T268612: Docker image on the build host seem to ignore apt priority for wikimedia packages.

Ouch.

Could we normalize everything to use the public image reference? That would also make local testing easier and more straightforward.
Or do we gain a big benefit by using the internal reference?

Tue, Nov 24, 11:50 AM · docker-pkg, serviceops, Operations
Joe added a comment to T268612: Docker image on the build host seem to ignore apt priority for wikimedia packages.

OK, found the problem, and it's slightly embarrassing:

Tue, Nov 24, 10:26 AM · docker-pkg, serviceops, Operations
Joe created T268612: Docker image on the build host seem to ignore apt priority for wikimedia packages.
Tue, Nov 24, 10:03 AM · docker-pkg, serviceops, Operations

Mon, Nov 23

Joe closed T266515: Set ENV SERVERGROUP for jobrunner MW web requests as Resolved.
Mon, Nov 23, 7:43 AM · Platform Team Workboards (External Code Reviews), Developer Productivity, serviceops, observability
Joe added a comment to T260330: RFC: PHP microservice for containerized shell execution.

I may misunderstand, but would Swift-awareness still be useful for things which are not images? In other words, will the remote caller be responsible for taking the result of a processed command and then persisting it into Swift if it's non-image data if that's what it ultimately wants to do?

I don't have a strong position there and can see reasons why and why not to build logic and firewall rules in as such, was more curious, as it's a potentially common case - the common case could just as easily exist in some sort of orchestrator component, though.

Sure, it's just that I can only find one thing that takes the result of a command and puts it into Swift, and that is Score. Maybe Score should be migrated to do it the same way as everything else, rather than adding more things that work like Score. Score blocks wikitext parsing while command execution and Swift upload are in progress, which is not a great architecture. If Shellbox can upload directly to Swift, then that makes it privileged, partly defeating the purpose of having it. The current plan is that Swift credentials will not be accessible to sandboxed code.

Mon, Nov 23, 6:41 AM · Platform Team Workboards (Purple), MW-on-K8s, Patch-For-Review, TechCom-RFC, serviceops, Operations

Tue, Nov 17

Joe added a comment to T259312: Deal with donatewiki Thank You page launching in apps.

The Apache change has been merged and tested to work with the renewed httpbb test suite. It will be deployed everywhere in the next 30 minutes. Please confirm that this resolves the issue.

Tue, Nov 17, 4:57 PM · Fundraising Sprint We all meet again, Patch-For-Review, Wikimedia-Apache-configuration, Operations, Fundraising Sprint Vagranty McVagrantface, Wikipedia-Android-App-Backlog, Thank-You-Page, Fundraising-Backlog, Android-app-Bugs, Wikipedia-iOS-App-Backlog
Joe closed T267891: Create Debian packages for Node.js 14 upgrade, a subtask of T267888: Create WMF CI image for Node.js 14, as Declined.
Tue, Nov 17, 7:42 AM · Continuous-Integration-Config, Continuous-Integration-Infrastructure, Release-Engineering-Team-TODO
Joe closed T267891: Create Debian packages for Node.js 14 upgrade, a subtask of T267890: Upgrade CI jobs for WMF-deployed projects from Node 10 to Node 14 LTS (tracking), as Declined.
Tue, Nov 17, 7:42 AM · Release-Engineering-Team (CI & Testing services), Continuous-Integration-Config
Joe closed T267891: Create Debian packages for Node.js 14 upgrade as Declined.

I didn't notice the task was opened again; I will decline it again, as I don't see a need for a node 14 package to use in production. CI can easily use the binaries from nodejs.org, as I stated above.

Tue, Nov 17, 7:42 AM · Packaging, serviceops, Operations
Joe added a comment to T267891: Create Debian packages for Node.js 14 upgrade.

While nothing above seems to contradict that we don't have a compelling reason to install node 14 packages for production now, I want to make a point: developers don't just decide to use a newer version of some software in production in a void.

Tue, Nov 17, 7:39 AM · Packaging, serviceops, Operations
Joe updated the task description for T265324: Create the base container images for running MediaWiki in a production environment.
Tue, Nov 17, 7:29 AM · Operations, serviceops, MW-on-K8s

Mon, Nov 16

Joe closed T267891: Create Debian packages for Node.js 14 upgrade, a subtask of T267888: Create WMF CI image for Node.js 14, as Declined.
Mon, Nov 16, 6:53 AM · Continuous-Integration-Config, Continuous-Integration-Infrastructure, Release-Engineering-Team-TODO
Joe closed T267891: Create Debian packages for Node.js 14 upgrade, a subtask of T267890: Upgrade CI jobs for WMF-deployed projects from Node 10 to Node 14 LTS (tracking), as Declined.
Mon, Nov 16, 6:53 AM · Release-Engineering-Team (CI & Testing services), Continuous-Integration-Config
Joe closed T267891: Create Debian packages for Node.js 14 upgrade as Declined.

Not sure what this task's rationale is.

Mon, Nov 16, 6:53 AM · Packaging, serviceops, Operations

Fri, Nov 13

Joe lowered the priority of T267804: Varnish 503 errors on page with large number of flag icons. from High to Low.

From the discussion on VP:T I would assume the editor just hit the maximum number of templates that can be included in a single article. Lowering priority and removing traffic from the task, adding what I think is the actual problem.

Fri, Nov 13, 7:54 AM · MediaWiki-Parser, Operations

Thu, Nov 12

Joe added a comment to T267804: Varnish 503 errors on page with large number of flag icons..

I strongly doubt the problem happens at the traffic layer. This seems to be a different kind of problem - maybe those pages once edited overflow some specific limit.

Thu, Nov 12, 4:58 PM · MediaWiki-Parser, Operations
Joe triaged T267804: Varnish 503 errors on page with large number of flag icons. as High priority.
Thu, Nov 12, 4:55 PM · MediaWiki-Parser, Operations
Joe created T267803: python installation in `mediawiki-services-wikispeech-mishkal` break the system-level python3.
Thu, Nov 12, 3:08 PM · Wikispeech-WMSE, User-kalle, Wikispeech-Text-to-Speech, Release Pipeline (Blubber), Wikispeech-Jobrunner
Joe triaged T267748: Degraded RAID on ms-be2031 as High priority.
Thu, Nov 12, 1:26 PM · Operations, ops-codfw
Joe created T267795: wmf_auto_restart_{jenkins,rsync} failing on releases2002.
Thu, Nov 12, 1:22 PM · Operations
Joe added a comment to T267668: Some recent Commons uploads not available on other wikis (2020-11).

The current situation is:

Thu, Nov 12, 9:22 AM · MW-1.36-notes (1.36.0-wmf.16; 2020-11-03), Patch-For-Review, Wikimedia-production-error, Operations, Commons, MediaWiki-File-management
Joe added a comment to T261534: Strengthen the shared cache key logic in FileRepo classes.

After some digging around Special:Newfiles, searching for a pattern in the files that fail, I think I found a smoking gun:

Thu, Nov 12, 8:52 AM · MW-1.36-notes (1.36.0-wmf.16; 2020-11-03), Sustainability (Incident Followup), Patch-For-Review, Structured-Data-Backlog, Performance-Team (Radar), Structured Data Engineering, MediaWiki-File-management
Joe updated subscribers of T267668: Some recent Commons uploads not available on other wikis (2020-11).

Not many people are around, and most importantly no one with extensive WanCache experience. @ArielGlenn and I are thinking of merging the patch and testing it on a debug server to see if it fixes the issue for files where we see it.

Thu, Nov 12, 8:20 AM · MW-1.36-notes (1.36.0-wmf.16; 2020-11-03), Patch-For-Review, Wikimedia-production-error, Operations, Commons, MediaWiki-File-management
Joe added a comment to T266515: Set ENV SERVERGROUP for jobrunner MW web requests.

@AMooney I don't think that task will be fast to complete, also because we should really dedicate our energies to transitioning MediaWiki to Kubernetes, and freeze all the rest of the work that's unrelated for now. Adding the env variable to the jobrunners is a matter of one simple patch, and I'll take that on.

Thu, Nov 12, 7:17 AM · Platform Team Workboards (External Code Reviews), Developer Productivity, serviceops, observability
Joe claimed T266515: Set ENV SERVERGROUP for jobrunner MW web requests.
Thu, Nov 12, 7:16 AM · Platform Team Workboards (External Code Reviews), Developer Productivity, serviceops, observability

Tue, Nov 10

Joe updated subscribers of T219279: Some pages will become completely unreachable after PHP7 update due to Unicode changes.

Gentle nudge, this really needs to be completed.

Tue, Nov 10, 8:50 AM · MW-1.35-notes (1.35.0-wmf.28; 2020-04-14), User-notice, Platform Team Workboards (Clinic Duty Team), MW-1.34-notes (1.34.0-wmf.16; 2019-07-30), serviceops, Operations, PHP 7.2 support, MediaWiki-General

Mon, Nov 9

Joe added a comment to T267376: Set up IP addresses for the new wiki replicas setup.

Proposals #4 and #5 would directly clash with stuff we do in production, and/or create confusion as to which services are for cloud and which aren't, and I strongly oppose them.

Mon, Nov 9, 9:30 AM · Data-Services, cloud-services-team (Kanban)

Fri, Nov 6

Joe added a comment to T264991: Upgrade the MediaWiki servers to ICU 63.

For the ICU transition it's crucially important that no machine with write access to the databases gets updated before the date of the migration. So please do not test this on the eqiad mwdebugs for the time being.

Fri, Nov 6, 11:04 AM · Patch-For-Review, Beta-Cluster-Infrastructure, DBA, Operations, serviceops

Thu, Nov 5

Joe added a comment to T266893: Build calico 3.17.0.

As for the manifests, if we need them, they should be in helmfile.d/admin I guess?

Thu, Nov 5, 9:00 AM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops
Joe added a comment to T266826: Add Link engineering: Pipeline for moving MySQL database(s) from stats1008 to production MySQL server.

Would this be different than the SwiftUploadComplete event that @Ottomata pointed me to? (@Ottomata as an aside, I can't find any documentation on this in Wikitech or via codesearch, is there something you could point me to please?)

Thu, Nov 5, 6:53 AM · Growth-Team (Current Sprint), Add-Link, Growth-Structured-Tasks

Wed, Nov 4

Joe added a comment to T260401: Avoid unfinished train deploys over holidays, weekends, or other stretches of no-deploy days.

I think that while we should try to avoid such a situation, mandating we either roll forward or back by policy would just be removing the ability for people managing releases to make a judgement call, which is almost never a good idea. And this is not counting that in some cases, rolling back after days might be slightly problematic.

Wed, Nov 4, 2:39 PM · User-brennen, Release-Engineering-Team (Deployment services), Sustainability (Incident Followup), Release-Engineering-Team-TODO, Deployments
Joe added a comment to T266893: Build calico 3.17.0.

calico/node is the only more-than-slightly-worrisome thing here. For everything else we're probably OK using their builds for the time being. It's also true that as far as external images go, a Red Hat certified image is probably one of the most "secure" options we can find:

Wed, Nov 4, 11:57 AM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops
Joe added a comment to T266826: Add Link engineering: Pipeline for moving MySQL database(s) from stats1008 to production MySQL server.

@Marostegui @Joe, @Ottomata and I met yesterday (notes). It doesn't sound like there is an existing workflow for pushing database contents from stats machines to the misc cluster.

As far as I understand the infrastructure we have, this is what I think we need to do:

  • The research/mwaddlink tool is going to write its datasets to tables on the staging database on stats1008. Those table names will be in the format addlink_{language}_{dataset_name}, e.g. addlink_en_anchors for the "anchors" dataset for enwiki.
  • When we are happy with the dataset quality for a particular language, we generate a MySQL dump of the tables (5 tables)
  • Then we will have a script that pushes those files to Swift
  • Then (this part I am unclear on) we either have some service that subscribes to SwiftUploadComplete which imports those tables into the production database on the misc cluster, or we (Growth team engineers) have access to manually import the tables.
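
As an illustration of the table naming convention in the first step above, here is a minimal sketch. The function name and the character check are assumptions made for this example; only the `addlink_{language}_{dataset_name}` format and the `addlink_en_anchors` example come from the comment.

```python
import re

# Assumption for this sketch: name parts are limited to characters that are
# safe in an unquoted MySQL identifier. The original comment does not say this.
_SAFE_PART = re.compile(r"[0-9A-Za-z_]+")


def addlink_table_name(language: str, dataset_name: str) -> str:
    """Build a table name in the addlink_{language}_{dataset_name} format."""
    for part in (language, dataset_name):
        if not _SAFE_PART.fullmatch(part):
            raise ValueError(f"unsafe table name component: {part!r}")
    return f"addlink_{language}_{dataset_name}"


if __name__ == "__main__":
    # The "anchors" dataset for enwiki, as in the comment.
    print(addlink_table_name("en", "anchors"))
```

Only the "anchors" dataset is named in the comment, so the sketch takes the dataset name as a parameter rather than listing all five tables.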

How does this sound to you, @Marostegui & @Joe?

Wed, Nov 4, 10:43 AM · Growth-Team (Current Sprint), Add-Link, Growth-Structured-Tasks

Tue, Nov 3

Joe closed T258572: Refactor our helmfile.d dir structure for services as Resolved.
Tue, Nov 3, 11:31 AM · Patch-For-Review, Prod-Kubernetes, Release Pipeline, serviceops

Mon, Nov 2

Joe added a comment to T266893: Build calico 3.17.0.

I typically prefer that we rebuild images from Dockerfiles, using our base images. That gives us a tad more control over upgrading in case of a disastrous security hole in e.g. Alpine Linux.

Mon, Nov 2, 7:30 AM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops

Oct 30 2020

Joe triaged T266855: Using envoy to connect from MediaWiki to restbase causes an explosion of live LVS connections. as High priority.
Oct 30 2020, 9:34 AM · Service-Architecture, Operations, serviceops
Joe created T266855: Using envoy to connect from MediaWiki to restbase causes an explosion of live LVS connections..
Oct 30 2020, 9:33 AM · Service-Architecture, Operations, serviceops

Oct 29 2020

Joe updated subscribers of T266577: codfw: relocate sessionstore2002 and mc2029 from C4 to C3 .

As far as mc2029 is concerned, you can just proceed without any impact.

Oct 29 2020, 2:09 PM · User-jijiki, serviceops, ops-codfw, Operations

Oct 28 2020

Joe added a comment to T239742: Should npm packages maintained by Wikimedia be scoped or unscoped?.

I second @Mholloway as well. The reasons are both what @Demian said, and we're not going to risk naming collisions in the future.

Oct 28 2020, 9:06 AM · TechCom, Platform Engineering (Icebox), Readers-Web-Backlog (Tracking), Release-Engineering-Team-TODO, Front-end-Standards-Group, Product-Infrastructure-Team-Backlog

Oct 23 2020

Joe added a comment to T243009: Add option in Scap to restart php-fpm for emergency deployments, and skip depooling/pooling servers.

I would also like for someone to investigate if systemctl reload php7.2-fpm clears opcache.

Oct 23 2020, 4:04 PM · Patch-For-Review, Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)), Sustainability (Incident Followup), Release-Engineering-Team (Deployment services), Scap
Joe added a comment to T266055: Update Scap to perform rolling restart for all MW deploy.

I added what I think is an outline of the work we still need to do in order to make this process safe and efficient in T243009#6574045

Oct 23 2020, 10:46 AM · Patch-For-Review, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)), Scap
Joe added a comment to T243009: Add option in Scap to restart php-fpm for emergency deployments, and skip depooling/pooling servers.

Sorry, I see that given I didn't express my thoughts linearly, some confusion ensued.

Oct 23 2020, 10:43 AM · Patch-For-Review, Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)), Sustainability (Incident Followup), Release-Engineering-Team (Deployment services), Scap
Joe added a comment to T243009: Add option in Scap to restart php-fpm for emergency deployments, and skip depooling/pooling servers.

Proposal:

Revise /usr/local/bin/safe-service-restart to accept --force to make it skip depool/repool.

Revise /usr/local/sbin/check-and-restart-php to accept --force and
pass it along to /usr/local/bin/safe-service-restart as needed.

scap: when scap sync or scap sync-world is passed --force, pass
--force to php_fpm_restart_script (aka /usr/local/sbin/check-and-restart-php).
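
The pass-through described in the proposal could look roughly like the sketch below. This is not the actual check-and-restart-php or scap code: the function name is hypothetical, and the layout of any arguments other than `--force` is an assumption, since the comment does not describe it. The sketch only builds the downstream command line and does not run anything.

```python
import argparse

# Path taken from the proposal; everything else here is illustrative.
SAFE_SERVICE_RESTART = "/usr/local/bin/safe-service-restart"


def build_downstream_command(argv):
    """Return the argv for safe-service-restart, forwarding --force if given."""
    parser = argparse.ArgumentParser(description="wrapper sketch")
    parser.add_argument(
        "--force",
        action="store_true",
        help="skip depool/repool (forwarded to the downstream script)",
    )
    # Arguments this wrapper does not know about are passed through untouched.
    args, passthrough = parser.parse_known_args(argv)
    cmd = [SAFE_SERVICE_RESTART, *passthrough]
    if args.force:
        cmd.append("--force")
    return cmd


if __name__ == "__main__":
    print(build_downstream_command(["--force", "php7.2-fpm"]))
```

Each layer in the chain (scap, then check-and-restart-php, then safe-service-restart) would apply the same pattern, so the flag only changes behaviour at the last step, where depool/repool is skipped.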

Oct 23 2020, 7:26 AM · Patch-For-Review, Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)), Sustainability (Incident Followup), Release-Engineering-Team (Deployment services), Scap

Oct 21 2020

Joe added a comment to T265324: Create the base container images for running MediaWiki in a production environment.

Also I want to clarify: we can reduce the pain as much as possible, but for the duration of the transition phase, it will be somewhat more work than we're used to for this kind of change. There is no way around that that I can think of.

Oct 21 2020, 11:52 AM · Operations, serviceops, MW-on-K8s
Joe added a comment to T265324: Create the base container images for running MediaWiki in a production environment.

Overall, I think we may need to take one step back and consider whether an apache container in the pod is something that might cause more problems than it will solve.

Oct 21 2020, 11:49 AM · Operations, serviceops, MW-on-K8s
Joe added a comment to T265324: Create the base container images for running MediaWiki in a production environment.

Regarding the apache httpd container, I am approaching layering as follows:

  • one base image, which uses the apache2-bin debian package and just modifies the vanilla configuration to listen on port 8080 (so that the container can run as user www-data).
  • one image configured to manage a php-fpm application. This image will be used as a base for both MediaWiki and the shellout service. It will include all modules and base configurations we need, and have a single virtualhost sending all "*.php" files to the fastcgi daemon

How are we planning to solve some minor differences our clusters have in our php configurations? The ones I know for sure are max_execution_time (jobrunners) and apc.ttl = 10 (parsoid). Would using ENV work? Moreover, I think we would like to have the ability to easily tweak those settings

Oct 21 2020, 11:26 AM · Operations, serviceops, MW-on-K8s
Joe added a comment to T245183: PHP7 corruption reports in 2020 (Call on wrong object, etc.).
  • timestamp: 2020-10-20T18:10:00 to 2020-10-20T21:15:00
  • host: mw2252
Oct 21 2020, 10:59 AM · Wikimedia-production-error, serviceops, Operations
Joe added a comment to T265324: Create the base container images for running MediaWiki in a production environment.
  • one base image, which uses the apache2-bin debian package and just modifies the vanilla configuration to listen on port 8080 (so that the container can run as user www-data).

In case it makes things easier/cleaner, instead of modifying the configuration you could set the capability CAP_NET_BIND_SERVICE.

Oct 21 2020, 9:00 AM · Operations, serviceops, MW-on-K8s
Joe added a comment to T265324: Create the base container images for running MediaWiki in a production environment.

If I got this right you are proposing to put apache and php-fpm in the same container, correct (talking about vhosts in fpm context)?
I can think of reasons why that might make sense to do (sharing "static" assets for example, ease of MVP), but maybe you could outline them here?

Oct 21 2020, 8:47 AM · Operations, serviceops, MW-on-K8s
Joe added a comment to T265324: Create the base container images for running MediaWiki in a production environment.

Regarding the apache httpd container, I am approaching layering as follows:

Oct 21 2020, 8:11 AM · Operations, serviceops, MW-on-K8s

Oct 20 2020

Joe added a comment to T263587: CAPEX for ParserCache for Parsoid.

Cassandra is not free of its own issues, and it has a much higher cost per GB than parsercache currently has (I did no research, but I suspect it's on the order of 5x at the very least).

Oct 20 2020, 1:32 PM · Data-Persistence-Consultation, serviceops, MediaWiki-Parser, Parsoid
Joe added a comment to T262675: Store Kubernetes events for more than one hour.

A quick chat on IRC revealed that we don't have a "good for Kubernetes" way to discover the Kafka brokers (like DNS SRV records), so producing directly to kafka-logging would require some coupling with Puppet code/re-deployments on changes to kafka-logging brokers (which is obviously bad).

Oct 20 2020, 7:56 AM · Patch-For-Review, observability, Prod-Kubernetes, Kubernetes, serviceops
Joe added a comment to T258572: Refactor our helmfile.d dir structure for services.

Ok! Done.

Oct 20 2020, 7:29 AM · Patch-For-Review, Prod-Kubernetes, Release Pipeline, serviceops

Oct 19 2020

Joe added a comment to T264071: Blog post: Improving the security and performance of PHP RPCs with envoy.

@Joe I did a first pass at the doc and made some grammar/style changes and suggestions. After doing a close read, I want to think about it over the weekend and come back and see if there are any additional suggestions. Feel free to accept/decline the current suggestions, and expect some more early next week.

Oct 19 2020, 2:01 PM · Technical-blog-posts
Joe added a comment to T265183: In a k8s world: where does MediaWiki code live?.

@dduvall I like the idea of using scap prep to extract the code from the images; I didn't think of inverting the logic like that, but it's surely workable.

Oct 19 2020, 10:22 AM · MW-on-K8s
Joe created T265882: Allow canarying new envoy configurations in kubernetes.
Oct 19 2020, 9:51 AM · Operations, Kubernetes, Service-Architecture, serviceops
Joe created T265881: Improve envoy configuration CI checks.
Oct 19 2020, 9:48 AM · Operations, Kubernetes, Service-Architecture, serviceops
Joe created T265880: Upgrade envoy configuration to use the v3 API.
Oct 19 2020, 9:39 AM · Operations, Kubernetes, Service-Architecture, serviceops
Joe created T265879: Consider using a file-based xDS system for envoy in k8s.
Oct 19 2020, 9:38 AM · Operations, Kubernetes, Service-Architecture, serviceops
Joe added a comment to T265876: Logging options for apache httpd in k8s.

Additional datapoint that was required: we should be sending ~10-15k messages per second to the central log server, depending on traffic.

Oct 19 2020, 9:07 AM · observability, Operations, serviceops, MW-on-K8s
Joe updated subscribers of T258572: Refactor our helmfile.d dir structure for services.

I think there are just a few dangling services that are managed by the analytics team, specifically:

Oct 19 2020, 8:59 AM · Patch-For-Review, Prod-Kubernetes, Release Pipeline, serviceops
Joe moved T265324: Create the base container images for running MediaWiki in a production environment from Incoming 🐫 to Doing 😎 on the serviceops board.
Oct 19 2020, 8:53 AM · Operations, serviceops, MW-on-K8s
Joe moved T264209: Run stress tests on docker images infrastructure from Backlog to In Progress on the MW-on-K8s board.
Oct 19 2020, 8:52 AM · Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)), MW-on-K8s, Release-Engineering-Team (Pipeline)
Joe moved T265324: Create the base container images for running MediaWiki in a production environment from Backlog to In Progress on the MW-on-K8s board.
Oct 19 2020, 8:51 AM · Operations, serviceops, MW-on-K8s
Joe claimed T265324: Create the base container images for running MediaWiki in a production environment.
Oct 19 2020, 8:51 AM · Operations, serviceops, MW-on-K8s
Joe triaged T265876: Logging options for apache httpd in k8s as High priority.
Oct 19 2020, 8:51 AM · observability, Operations, serviceops, MW-on-K8s
Joe created T265876: Logging options for apache httpd in k8s.
Oct 19 2020, 8:50 AM · observability, Operations, serviceops, MW-on-K8s
Joe added a comment to T106517: upload.wikimedia.org returns HTTP status code 503 for truncated urls, not 404.

While those urls weren't my original report (which was about truncated URLs), it seems the behaviour has in the meantime changed, but not in a correct way.

Oct 19 2020, 8:29 AM · Sustainability (Incident Followup), Traffic, Operations

Oct 16 2020

Joe added a comment to T264071: Blog post: Improving the security and performance of PHP RPCs with envoy.

@srodlund the post is ready for a thorough review :) I shared the document with you, let me know what you think :)

Oct 16 2020, 10:27 AM · Technical-blog-posts

Oct 15 2020

Joe created P13008 esams fail.
Oct 15 2020, 8:04 PM

Oct 13 2020

Joe added a comment to T265183: In a k8s world: where does MediaWiki code live?.

My concern is that this transition step becomes a permanent step.

Oct 13 2020, 8:33 AM · MW-on-K8s
Joe raised the priority of T264209: Run stress tests on docker images infrastructure from Medium to High.

This needs to be done while we have one DC turned off for most traffic as we do right now, IMHO.

Oct 13 2020, 8:07 AM · Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)), MW-on-K8s, Release-Engineering-Team (Pipeline)
Joe triaged T265327: Create a basic helm chart to test MediaWiki on kubernetes as High priority.
Oct 13 2020, 7:49 AM · Operations, serviceops, MW-on-K8s
Joe created T265324: Create the base container images for running MediaWiki in a production environment.
Oct 13 2020, 7:40 AM · Operations, serviceops, MW-on-K8s
Joe added a comment to T215217: deployment-prep: Code stewardship request.

Normally I'm all for celebrating birthdays.... I can't believe this baby is one year old already... Let us not need to celebrate its second birthday (dark, I know) :-/

Oct 13 2020, 6:29 AM · Release-Engineering-Team (Code Health), Release-Engineering-Team-TODO, Beta-Cluster-Infrastructure, Code-Stewardship-Reviews
Joe updated the task description for T257118: Beta cluster has reached its quota.
Oct 13 2020, 6:23 AM · Beta-Cluster-Infrastructure

Oct 12 2020

Joe added a comment to T265208: Degraded RAID on ms-be2036.

I just want to comment that this server had its root directory filled up today, and it's in a strange state where only 13 GB are found by du -xsh /, but 53 GB are occupied on /dev/md0 according to df. Given there are no huge deleted files I can see, it seems possible the server has some leftover data under /srv on the root partition that is now hidden by the mountpoints.

Oct 12 2020, 2:57 PM · Operations, ops-codfw
Joe updated the task description for T257118: Beta cluster has reached its quota.
Oct 12 2020, 11:19 AM · Beta-Cluster-Infrastructure
Joe updated the task description for T257118: Beta cluster has reached its quota.
Oct 12 2020, 11:18 AM · Beta-Cluster-Infrastructure
Joe added a comment to T257118: Beta cluster has reached its quota.

The situation of the memcached servers is amazingly telling of how deployment-prep is unmanaged, and we should really dedicate some resources to it, but also work better when we do stuff there.

Oct 12 2020, 11:16 AM · Beta-Cluster-Infrastructure
Joe added a comment to T264991: Upgrade the MediaWiki servers to ICU 63.
Oct 12 2020, 7:58 AM · Patch-For-Review, Beta-Cluster-Infrastructure, DBA, Operations, serviceops

Oct 9 2020

Joe added a comment to T258978: Service operations setup for Add a Link project.

Adding some notes after yesterday's meeting:

Oct 9 2020, 2:35 PM · Add-Link, Growth-Team (Current Sprint), Product-Infrastructure-Team-Backlog, Operations, serviceops, GrowthExperiments-NewcomerTasks
Joe added a comment to T264881: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC.

I think some form of rate limiting for that should be present in restbase, and in general, we should rate-limit calls to uncacheable URLs to volumes we can support.

Oct 9 2020, 7:11 AM · iOS-app-v6.8-Manta-Ray-On-A-Riding-Mower, Operations, Traffic, iOS-app-Bugs, Wikipedia-iOS-App-Backlog

Oct 8 2020

Joe created T264991: Upgrade the MediaWiki servers to ICU 63.
Oct 8 2020, 7:50 AM · Patch-For-Review, Beta-Cluster-Infrastructure, DBA, Operations, serviceops
Joe added a comment to T243009: Add option in Scap to restart php-fpm for emergency deployments, and skip depooling/pooling servers.
Oct 8 2020, 6:07 AM · Patch-For-Review, Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)), Sustainability (Incident Followup), Release-Engineering-Team (Deployment services), Scap

Oct 7 2020

Joe added a comment to T264821: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes.

To clarify a bit - restbase has hourly spikes of requests for the feed endpoint, which go back to wikifeeds, which calls both restbase and the action api.

Oct 7 2020, 1:11 PM · Wikidata, serviceops, Operations

Oct 6 2020

Joe added a comment to T260917: Support TLS for service-to-service communication in k8s staging.

I could use a pair of eyes on https://gerrit.wikimedia.org/r/q/bug:T260917
The PCC full diff (https://puppet-compiler.wmflabs.org/compiler1002/25621/) lacks defaultsecret: notdefault for staging zotero. What am I missing here?

Oct 6 2020, 1:40 PM · serviceops, Kubernetes

Oct 5 2020

Joe created P12919 Simple test of RPC performance from PHP.
Oct 5 2020, 9:37 AM

Oct 2 2020

Joe added a comment to T263910: ORES redis: max number of clients reached....

Small status update:

Oct 2 2020, 3:57 PM · User-Ladsgroup, Sustainability (Incident Followup), Patch-For-Review, Okapi, serviceops, Operations, Machine Learning Platform, ORES
Joe added a comment to T263910: ORES redis: max number of clients reached....

Checking in to report that calls from OKAPI have stopped tonight. Thanks @RBrounley_WMF (and the team)!

Oct 2 2020, 9:10 AM · User-Ladsgroup, Sustainability (Incident Followup), Patch-For-Review, Okapi, serviceops, Operations, Machine Learning Platform, ORES