
Mon, Mar 25

TheDJ awarded T357296: Create new flavour of shellbox for video transcoding a Party Time token.
Mon, Mar 25, 1:27 PM · Video, MW-on-K8s, serviceops
Joe edited projects for T360867: httpbb appserver test breaks deployment of the week due to a timeout parsing page, added: serviceops-radar; removed serviceops.

I don't think httpbb tests should really break deployment; they should rather ask the deployer to confirm whether they intend to continue in case of issues.

Mon, Mar 25, 8:42 AM · Patch-For-Review, serviceops, Release-Engineering-Team, Deployments
Joe raised the priority of T313729: Splunk/VictorOps CAPTCHA from Low to High.

Every time I log into the victorops portal I'm shown reCaptcha, and it's the only site on the internet that does, so it's not about the reputation of my IP/browser profile.

Mon, Mar 25, 7:25 AM · Observability-Alerting, observability

Mar 24 2024

Joe added a comment to T295007: Upload by URL should use the job queue, possibly chunked with range requests.

With the API change, you should be able to upload large files by URL (up to 4 GB IIRC) without incurring timeouts; that will also allow us to move file processing to Shellbox, making our infrastructure more secure.

@Joe: How about 5 GiB per https://gerrit.wikimedia.org/r/1002813 ? See https://phabricator.wikimedia.org/T191804#9363066 for a discussion about the capacity implications of this change.

Mar 24 2024, 7:39 AM · MW-1.42-notes (1.42.0-wmf.24; 2024-03-26), Patch-For-Review, MediaWiki CodeJam Dec 2023, MediaWiki-Uploading

Mar 23 2024

Joe added a comment to T295007: Upload by URL should use the job queue, possibly chunked with range requests.

What's the status of this task?

Mar 23 2024, 10:40 AM · MW-1.42-notes (1.42.0-wmf.24; 2024-03-26), Patch-For-Review, MediaWiki CodeJam Dec 2023, MediaWiki-Uploading
Joe created P58902 upload_file_from_url.py.
Mar 23 2024, 10:38 AM

Mar 21 2024

Joe added a comment to T360625: Alter changeprop chart to use the service mesh.

There are a few reasons why we didn't migrate changeprop to use the service mesh, first of all the fact that we don't want to define timeouts outside of it.

Mar 21 2024, 1:39 PM · WMF-JobQueue, ChangeProp, SRE, serviceops, MW-on-K8s
Joe triaged T360597: Increased latency, timeouts from wikifeeds since march 10th as High priority.
Mar 21 2024, 8:58 AM · Content-Transform-Team-WIP, Patch-For-Review, serviceops, Content-Transform-Team
Joe created T360597: Increased latency, timeouts from wikifeeds since march 10th.
Mar 21 2024, 8:58 AM · Content-Transform-Team-WIP, Patch-For-Review, serviceops, Content-Transform-Team

Mar 8 2024

Joe added a comment to T292322: Support large files in Shellbox.

@tstarling I think we determined that the expensive part of handling large files in shellbox was mostly the download/hash verification of the file received; I don't even think we strictly need PUT support.

Mar 8 2024, 8:46 AM · MW-1.38-notes (1.38.0-wmf.21; 2022-02-07), SRE-swift-storage, Shellbox, serviceops, MW-on-K8s

Mar 7 2024

Joe closed T353414: Build and deploy LuaSandbox 4.1.2 as Resolved.
Mar 7 2024, 6:24 AM · serviceops

Mar 6 2024

Joe edited projects for T359234: ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqiad), added: Traffic, Data-Persistence, Content-Transform-Team; removed serviceops.

Hi @Eevans, I'm a bit perplexed by why you think serviceops should be able to assist with this issue. This seems like an application bug triggered by external traffic, from the looks of it.

Mar 6 2024, 7:13 AM · Content-Transform-Team-WIP, Content-Transform-Team, Data-Persistence, Traffic

Mar 5 2024

Joe updated the task description for T355873: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw.
Mar 5 2024, 3:38 PM · DBA, SRE, netops, Infrastructure-Foundations, ops-codfw
Joe added a comment to T358636: etcdmirror does not recover from a cleared waitIndex.

@Scott_French great job, indeed I think we can just go forward and do it, even if there are a few things that look suboptimal.

Mar 5 2024, 8:38 AM · Patch-For-Review, serviceops

Mar 4 2024

Joe closed T358867: php7.4-fpm-multiversion-base Docker image fails to build as Resolved.
Mar 4 2024, 3:00 PM · serviceops
Joe claimed T358867: php7.4-fpm-multiversion-base Docker image fails to build.
Mar 4 2024, 2:28 PM · serviceops
Joe added a comment to T353414: Build and deploy LuaSandbox 4.1.2.

Re-uploaded the packages to the right components.

Mar 4 2024, 2:23 PM · serviceops

Mar 2 2024

Joe added a comment to T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable".

A couple of things I wonder about:

  • Though the bottleneck seems to be EventGate more than Kafka, I still wonder why profile::kafka::mirror::properties doesn't blacklist all MW jobs?* Is anything making use of that extra data?
Mar 2 2024, 9:40 AM · MediaWiki-Engineering, Data-Engineering, Unstewarded-production-error, User-brennen, serviceops, WMF-JobQueue, Wikimedia-production-error

Mar 1 2024

Joe added a comment to T353414: Build and deploy LuaSandbox 4.1.2.

Apologies, I must have made a mistake. I will fix it as there are other issues as well.

Mar 1 2024, 2:31 PM · serviceops

Feb 29 2024

Joe updated subscribers of T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable".

When we're talking about errors, it's always a good idea to reason in terms of error ratios and not error rates.
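To make the distinction concrete, here is a minimal sketch in Python. The traffic figures are hypothetical and only illustrate why the same error rate (errors per unit of time) can correspond to very different error ratios (errors per request).

```python
def error_ratio(errors: int, total: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / total if total else 0.0


# Two hypothetical one-minute windows with the same error rate: 50 errors/min.
quiet = error_ratio(errors=50, total=500)
busy = error_ratio(errors=50, total=50_000)

print(f"quiet window: {quiet:.1%}, busy window: {busy:.1%}")
# prints "quiet window: 10.0%, busy window: 0.1%"
```

An alert on the raw rate would treat both windows identically, while the ratio shows that only the quiet window has a meaningful share of failing requests.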

Feb 29 2024, 12:42 PM · MediaWiki-Engineering, Data-Engineering, Unstewarded-production-error, User-brennen, serviceops, WMF-JobQueue, Wikimedia-production-error
Joe added a comment to T358634: OOM livelock stalls.

I don't think it holds any ground for systems involved in live responses or which have strict latency requirements in general.

For instance, enabling swap might save a database from being OOM killed, but it will slow the database down to a halt, basically making it unusable for serving live requests.

That's not true, according to the article I linked by Chris Down. If a database server is completely out of memory, it's better to swap than to go into a livelock, a state so severely broken that power-cycling is seen as a reasonable solution. But either way, the problem is that you're running out of memory, not that you have swap.

Under light memory pressure, enabling swap allows more RAM to be used for file caches, improving performance.

Feb 29 2024, 8:17 AM · serviceops, Cloud-VPS
Joe added a comment to T358636: etcdmirror does not recover from a cleared waitIndex.

If we're comfortable with that being the default for the entire keyspace (i.e., even for new workloads) then that's a fairly straightforward config change [0].

Feb 29 2024, 7:55 AM · Patch-For-Review, serviceops
Joe added a comment to T358636: etcdmirror does not recover from a cleared waitIndex.

In the meantime, I remembered why we were only replicating the /conftool keyspace:

Feb 29 2024, 7:43 AM · Patch-For-Review, serviceops

Feb 28 2024

Joe added a comment to T358636: etcdmirror does not recover from a cleared waitIndex.

That etcdmirror is mirroring only the /conftool keys is totally news to me; I assumed it was replicating the whole content of etcd. But indeed it does not:

$ sudo etcdctl --endpoints https://conf1007.eqiad.wmnet:4001 ls  /
/conftool
/spicerack
/test

vs

$ sudo etcdctl --endpoints https://conf2005.codfw.wmnet:4001 ls  /
/conftool

Is there any specific reason we shouldn't replicate it all? If we replicated it all, this issue should never happen, right?
Spicerack follows the SRV records for the endpoint, so if that gets failed over to codfw spicerack should follow and I would have expected to find the locks replicated there too.

Feb 28 2024, 3:44 PM · Patch-For-Review, serviceops
Joe added a comment to T354657: Wikifeeds increase on 500 errors after switchover to core page HTML.

I don't think this is a data persistence issue, but rather it's much more probable this is actually a restbase bug. At the very least, investigation should start there.

Feb 28 2024, 3:18 PM · RESTBase, Wikifeeds, RESTBase Sunsetting, serviceops
Joe added a comment to T358507: XSS on doc.wikimedia.org (documentation generated by yard) (CVE-2024-27285).

The file has been removed from the puppet doc tree, so at least that's not a problem anymore.

Feb 28 2024, 3:00 PM · Patch-For-Review, Infrastructure-Foundations, SecTeam-Processed, doc.wikimedia.org, Vuln-XSS, Upstream, Security, Security-Team
Joe created P58031 Puppet is awesome.
Feb 28 2024, 9:55 AM
Joe added a comment to T358634: OOM livelock stalls.

I also want to note that on kubernetes memory is mostly managed by the k8s scheduler on top of the kernel one, so that memory use never overflows, and we OOM containers (which are nothing more than cgroups) to ensure it never does. The k8s scheduler also takes care of never allocating the portion of memory we reserve for system components, and we currently do so.

Feb 28 2024, 7:05 AM · serviceops, Cloud-VPS
Joe added a comment to T358634: OOM livelock stalls.

Focusing on the swap part of the problem, for posterity:

Feb 28 2024, 7:05 AM · serviceops, Cloud-VPS

Feb 23 2024

Joe created P57812 fabtypes.
Feb 23 2024, 10:23 AM
Joe triaged T357949: Code in Shellbox specific to WMF production as Low priority.
Feb 23 2024, 7:59 AM · serviceops, Shellbox
Joe added a comment to T357949: Code in Shellbox specific to WMF production.

Before we move into finding solutions, I'd like to better understand what goal we want to accomplish:

  • If the goal is to make what we upload to packagist cleaner, composer.json has a way to do it (archive.exclude, IIRC?)
  • If we prefer not to have pygmentize in the repository (why?), we can adapt the image build process to fetch it remotely
Feb 23 2024, 7:58 AM · serviceops, Shellbox
Joe added a comment to T345274: Remove similar-users service from k8s.

@Niharika @Tchanders any concerns with this?

Can we please get your feedback on this? It would help to decide whether to put additional work into keeping the similar-users chart up to date.

We are interested in using the service but can't commit to a specific timeline for when it would be connected to a MediaWiki extension interface, because we're sorting out the roadmap and timelines for Trust and Safety Product Team, and have a lot of ongoing projects.

Is it a lot of effort to keep the chart up to date?

Feb 23 2024, 7:39 AM · Patch-For-Review, Similarusers, serviceops
Joe closed T126306: Eliminate symlinks in mediawiki-config (as much as possible) as Invalid.

This task was about HHVM-specific issues. Feel free to reopen if you think it's still valid.

Feb 23 2024, 5:53 AM · Release-Engineering-Team (Seen), scap2

Feb 22 2024

Joe added a comment to T357949: Code in Shellbox specific to WMF production.

The .pipeline files are used to create container images for running in Wikimedia production or anywhere else that may choose to run Shellbox as a container.

You're saying that anyone anywhere who wants to run Shellbox in a container, whether they are using MediaWiki or not, should use the exact same configuration as WMF production, down to the PHP version, number of FPM workers, list of fonts to install, etc.?

Feb 22 2024, 9:04 AM · serviceops, Shellbox
Joe added a comment to T357949: Code in Shellbox specific to WMF production.

@tstarling just removing the .pipelinelib directory isn't an option if we want to keep using shellbox in production. I don't see a good solution for that other than maintaining separate branches and backporting changes from the main branch to a wmf branch, which would come with its own inconveniences of course.

Feb 22 2024, 8:45 AM · serviceops, Shellbox

Feb 21 2024

Joe added a comment to T289228: Convert media handling code (PdfHandler, PagedTiffHandler) to use Shellbox.

We still need to investigate/fix the regression on CLI for DjVuImage, see https://gerrit.wikimedia.org/r/c/mediawiki/core/+/720143/comments/8f625c32_80c745cd

Feb 21 2024, 2:40 PM · Content-Transform-Team, MediaWiki-Engineering, MW-1.38-notes (1.38.0-wmf.7; 2021-11-02), MW-1.37-notes (1.37.0-wmf.23; 2021-09-13), MediaWiki-extensions-PdfHandler, MediaWiki-extensions-PagedTiffHandler, Shellbox, MW-on-K8s
Lucas_Werkmeister_WMDE awarded T349796: Move MediaWiki jobs to mw-on-k8s a Barnstar token.
Feb 21 2024, 10:56 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s

Feb 20 2024

Joe added a comment to T353414: Build and deploy LuaSandbox 4.1.2.

I built and uploaded the packages to apt, which means that next week we should automatically roll them out to kubernetes in the weekly rebuild of base images.

Feb 20 2024, 9:54 AM · serviceops

Feb 16 2024

Joe added a comment to T357595: Investigate restricting match pattern on /wiki RewriteRule.

the intention was probably for this to match something a bit more restrictive (e.g., matching ^/wiki(/.*)?$)

I'm looking at the diff where the RewriteRule was introduced, and I'm pretty sure this is right -- it's replacing a

ProxyPass       /wiki                fcgi://127.0.0.1:9000<%= @docroot %>/w/index.php retry=0

so we went from a path argument to a regex. But some history from @Joe would help to confirm. And we'd still need to make sure nothing else has started relying on this in the last five years.
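As an illustration of the difference being discussed, the sketch below compares a permissive prefix pattern with the more restrictive ^/wiki(/.*)?$ quoted above. The permissive pattern ^/wiki is an assumption about the general shape of the current rule (the excerpt does not show it), and Python's re module stands in for Apache's regex engine, which behaves the same way for patterns this simple.

```python
import re

# Assumed shape of the current, permissive rule (not shown in the excerpt).
BROAD = re.compile(r"^/wiki")
# The more restrictive pattern quoted in the comment above.
STRICT = re.compile(r"^/wiki(/.*)?$")


def matches(pattern: re.Pattern, path: str) -> bool:
    """True if the pattern matches anywhere in the request path."""
    return pattern.search(path) is not None


for path in ["/wiki", "/wiki/Main_Page", "/wikifoo", "/wiki.php"]:
    print(f"{path:18} broad={matches(BROAD, path)!s:5} strict={matches(STRICT, path)}")
```

Under these assumptions, /wiki and /wiki/Main_Page match both patterns, while /wikifoo and /wiki.php match only the permissive one. Those are the paths to check for before tightening the rule.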

Feb 16 2024, 8:28 AM · Patch-For-Review, Wikimedia-Apache-configuration, serviceops

Feb 12 2024

Joe created T357309: Create a deployment for `shellbox-timedmedia`.
Feb 12 2024, 11:56 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
Joe created T357296: Create new flavour of shellbox for video transcoding.
Feb 12 2024, 11:24 AM · Video, MW-on-K8s, serviceops
Joe closed T356242: Convert TimedMediaHandler to use BoxedCommand/Shellbox as Resolved.
Feb 12 2024, 11:20 AM · MW-1.43-notes (1.43.0-wmf.2; 2024-04-23), Patch-For-Review, MW-1.42-notes (1.42.0-wmf.17; 2024-02-06), Video, TimedMediaHandler, MW-on-K8s, serviceops
Joe closed T356242: Convert TimedMediaHandler to use BoxedCommand/Shellbox, a subtask of T356241: Move video transcoding to use Shellbox, as Resolved.
Feb 12 2024, 11:20 AM · Video, TimedMediaHandler, MW-on-K8s, serviceops
Joe lowered the priority of T356984: Stop sending change notification email if edit is done by a bot from Unbreak Now! to Medium.

@Tacsipacsi I would ask you to keep emotions in check, it's very hard to collaborate (which is what we are supposed to be doing here) when someone is that confrontational. Specifically:

  • yelling doesn't get your point through better, quite the opposite. In fact, it took me quite a while to understand what you were trying to say.
  • raising tasks to UBN! because you're in disagreement also won't get your point through better.
Feb 12 2024, 9:55 AM · User-notice-archive, MW-1.42-notes (1.42.0-wmf.18; 2024-02-13), Infrastructure-Foundations, Mail

Feb 8 2024

Joe added a comment to T323201: Evaluate a high available GitLab architecture.

Gitlab needs regular maintenance windows, at least once a month, if not more often, and they usually last around 15 minutes.

Feb 8 2024, 9:18 AM · GitLab (Infrastructure), collaboration-services

Feb 7 2024

Joe added a comment to T356780: Video transcoding fails when firejail is enabled.

My spot tests show transcodes now work on testwiki. I'll let @brion take a look as well before I declare this bug resolved.

Feb 7 2024, 3:42 PM · Unstewarded-production-error, MW-1.42-notes (1.42.0-wmf.17; 2024-02-06), TimedMediaHandler-Transcode, Wikimedia-production-error

Feb 6 2024

Joe claimed T356780: Video transcoding fails when firejail is enabled.

While the original issue is solved, we're running into a new one - now the script exits with an OOM, and unless I'm reading it incorrectly, the memory limit seems to be just 4 MB which would explain the problem.

Feb 6 2024, 6:55 PM · Unstewarded-production-error, MW-1.42-notes (1.42.0-wmf.17; 2024-02-06), TimedMediaHandler-Transcode, Wikimedia-production-error
Joe added a comment to T356780: Video transcoding fails when firejail is enabled.

Hah, I found the issue.

Feb 6 2024, 4:08 PM · Unstewarded-production-error, MW-1.42-notes (1.42.0-wmf.17; 2024-02-06), TimedMediaHandler-Transcode, Wikimedia-production-error
Joe added a comment to T356780: Video transcoding fails when firejail is enabled.

That is not the problem, or not the only one for what it's worth; I've tested encoding an audio file and I get the same error, and there is no TMH_OPT_VIDEOCODEC set there (see reqId in Logstash).

Feb 6 2024, 4:02 PM · Unstewarded-production-error, MW-1.42-notes (1.42.0-wmf.17; 2024-02-06), TimedMediaHandler-Transcode, Wikimedia-production-error
Joe added a comment to T356780: Video transcoding fails when firejail is enabled.

I think the '\\''--env=TMH_OPT_VIDEOCODEC='\\''\\'\\'''\\''vp9'\\''\\'\\'''\\'''\\'' '\\'' looks quite suspicious here.
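Deeply nested quote sequences like this are what an argument looks like after it has been shell-quoted more than once. The sketch below shows the effect with Python's shlex.quote. It is only an illustration, not a diagnosis of this bug: the starting argument is hypothetical, and Python escapes an embedded single quote as '"'"' rather than with the backslash style seen in the log.

```python
import shlex

# Hypothetical argument that already carries its own quotes.
arg = "--env=TMH_OPT_VIDEOCODEC='vp9'"

once = shlex.quote(arg)    # quoted exactly once for the shell
twice = shlex.quote(once)  # quoting an already-quoted string

print(once)
print(twice)

# A shell-style parse removes only one layer of quoting per pass.
assert shlex.split(once) == [arg]
assert shlex.split(twice) == [once]
```

Each extra pass wraps every existing quote in more quoting, so the string grows quickly and the receiving program sees literal quote characters instead of the intended value.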

Feb 6 2024, 3:40 PM · Unstewarded-production-error, MW-1.42-notes (1.42.0-wmf.17; 2024-02-06), TimedMediaHandler-Transcode, Wikimedia-production-error
Joe updated the task description for T356780: Video transcoding fails when firejail is enabled.
Feb 6 2024, 3:31 PM · Unstewarded-production-error, MW-1.42-notes (1.42.0-wmf.17; 2024-02-06), TimedMediaHandler-Transcode, Wikimedia-production-error
Joe updated the task description for T356780: Video transcoding fails when firejail is enabled.
Feb 6 2024, 3:31 PM · Unstewarded-production-error, MW-1.42-notes (1.42.0-wmf.17; 2024-02-06), TimedMediaHandler-Transcode, Wikimedia-production-error
Joe triaged T356780: Video transcoding fails when firejail is enabled as Unbreak Now! priority.
Feb 6 2024, 3:27 PM · Unstewarded-production-error, MW-1.42-notes (1.42.0-wmf.17; 2024-02-06), TimedMediaHandler-Transcode, Wikimedia-production-error

Jan 31 2024

Joe added projects to T355914: Change default image thumbnail size: Data-Persistence, SRE-swift-storage, Traffic.

Given the chosen size is both non-standard (meaning it's not used on most large wikis) and not in the list of thumbnail sizes we pregenerate at upload time, I would imagine switching enwiki to this new thumbnail size would have a big impact on both the upload edge clusters and the backend object storage.

Jan 31 2024, 10:42 AM · Web Team Essential Work 2024, Web-Team-Backlog, Design, Wikimedia-Design, Thumbor, Traffic, SRE-swift-storage, Data-Persistence, SRE, Wikimedia-Site-requests
Joe triaged T356242: Convert TimedMediaHandler to use BoxedCommand/Shellbox as High priority.
Jan 31 2024, 7:47 AM · MW-1.43-notes (1.43.0-wmf.2; 2024-04-23), Patch-For-Review, MW-1.42-notes (1.42.0-wmf.17; 2024-02-06), Video, TimedMediaHandler, MW-on-K8s, serviceops
Joe updated the task description for T356241: Move video transcoding to use Shellbox.
Jan 31 2024, 7:47 AM · Video, TimedMediaHandler, MW-on-K8s, serviceops
Joe triaged T356241: Move video transcoding to use Shellbox as High priority.
Jan 31 2024, 7:46 AM · Video, TimedMediaHandler, MW-on-K8s, serviceops
Joe created T356242: Convert TimedMediaHandler to use BoxedCommand/Shellbox.
Jan 31 2024, 7:46 AM · MW-1.43-notes (1.43.0-wmf.2; 2024-04-23), Patch-For-Review, MW-1.42-notes (1.42.0-wmf.17; 2024-02-06), Video, TimedMediaHandler, MW-on-K8s, serviceops
Joe created T356241: Move video transcoding to use Shellbox.
Jan 31 2024, 7:35 AM · Video, TimedMediaHandler, MW-on-K8s, serviceops
Joe added a comment to T355292: Port videoscaling to kubernetes.
Jan 31 2024, 7:28 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
Joe triaged T338297: Revisit thumbor's poolcounter integration as Low priority.

Low priority as we've found out the problem is fundamentally deeper than we'd like.

Jan 31 2024, 7:24 AM · Patch-For-Review, serviceops, Thumbor

Jan 30 2024

Joe added a comment to T356163: ChieBot: Intermittent connection reset by peer errors.

Just stating for the record that connection refused/reset messages will come from our edge caching layer, specifically from the tcp stack of our servers there, so it wouldn't be related to a migration to kubernetes (which is still only partial, btw).

Jan 30 2024, 9:55 AM · Toolforge

Jan 25 2024

Joe added a comment to T354794: Requesting permission to enable kafka log compaction for page_rerender on kafka-main.

It generally seems ok, but a few considerations:

  • kafka-main is much smaller than kafka-jumbo, and critical to site operations
  • The codfw.mediawiki.currussearch.page_rerender.v1 topic is pretty large at the moment, 292 GB in codfw and 149 GB in eqiad, while the corresponding eqiad topic is as expected tiny/irrelevant.
Jan 25 2024, 9:15 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14), serviceops, Discovery-Search, CirrusSearch

Jan 23 2024

Joe added a comment to T355619: Request MediaWiki +2 for Paladox.

@Paladox is my de facto point of contact for patches I write for Gerrit. He is quite speedy, productive and has valuable reviews.

Jan 23 2024, 10:03 AM · MediaWiki-Gerrit-Group-Requests
Joe awarded T355619: Request MediaWiki +2 for Paladox a Dislike token.
Jan 23 2024, 10:02 AM · MediaWiki-Gerrit-Group-Requests
Joe added a comment to T347004: Create a staging apt repository for CI-based builds of Debian packages.

Please let's make apt-merge less cumbersome to use.

Jan 23 2024, 8:01 AM · collaboration-services, GitLab (CI & Job Runners), serviceops

Jan 18 2024

Joe updated subscribers of T355292: Port videoscaling to kubernetes.

Adding @brion as the resident expert / maintainer of TimedMediaHandler. I'd like to get your opinion on how hard it would be to port WebVideoTranscodeJob to use shellbox :)

Jan 18 2024, 9:43 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
Joe renamed T355292: Port videoscaling to kubernetes from [DRAFT] Port videoscaling to kubernetes to Port videoscaling to kubernetes.
Jan 18 2024, 9:37 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
Joe updated the task description for T355292: Port videoscaling to kubernetes.
Jan 18 2024, 9:37 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
Joe triaged T355292: Port videoscaling to kubernetes as High priority.
Jan 18 2024, 7:46 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
Joe added a comment to T355243: MediaModeration maintenance script scanFilesInScanTable.php indirectly calls $wgImageMagickConvertCommand.

@Joe do you want us to stop the script for now, and switch to not using the job queue?

Jan 18 2024, 7:32 AM · Trust and Safety Product Team, serviceops-radar, Patch-For-Review, SRE, MW-on-K8s
Joe added a comment to T355243: MediaModeration maintenance script scanFilesInScanTable.php indirectly calls $wgImageMagickConvertCommand.
Jan 18 2024, 7:25 AM · Trust and Safety Product Team, serviceops-radar, Patch-For-Review, SRE, MW-on-K8s
Joe added a comment to T355243: MediaModeration maintenance script scanFilesInScanTable.php indirectly calls $wgImageMagickConvertCommand.

EDIT: It looks like the metamoderation script actually spawns jobs, as interestingly all errors seem to come from the same reqId, 8e1438850af4ec4c4b82ebb6, and clearly it's not ThumbnailRenderer jobs.

Jan 18 2024, 7:14 AM · Trust and Safety Product Team, serviceops-radar, Patch-For-Review, SRE, MW-on-K8s

Jan 16 2024

Joe created T355158: Generate netmapper file for abuse requestctl ipblocks.
Jan 16 2024, 4:33 PM · Traffic

Jan 11 2024

Joe added a project to T354832: Create a special throttling class for "rogue proxies" IPs: MediaWiki-extensions-CentralAuth.
Jan 11 2024, 10:38 AM · SecTeam-Processed, MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth, Trust and Safety Product Team, Security-Team, iPoid-Service
Joe created T354832: Create a special throttling class for "rogue proxies" IPs.
Jan 11 2024, 10:03 AM · SecTeam-Processed, MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth, Trust and Safety Product Team, Security-Team, iPoid-Service

Jan 8 2024

Joe added a comment to T354532: Limit the concurrency of envoy in service mesh.

It seems to me that trying to respond to 1k rps with a concurrency of 2 is probably the issue. Throttling is bad because it raises latencies; if any measure we take to avoid throttling increases latencies compared to throttling, why bother?
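A back-of-the-envelope check of that claim, using Little's law (work in progress = arrival rate × time per request). This is a rough sketch: it assumes each unit of concurrency can spend CPU on only one request at a time, and the 5 ms figure is purely hypothetical.

```python
def max_service_time_ms(concurrency: int, rps: float) -> float:
    """Largest average busy time per request that keeps up with the load."""
    return concurrency * 1000.0 / rps


def required_concurrency(rps: float, service_time_ms: float) -> float:
    """Average number of requests being worked on at once (Little's law)."""
    return rps * service_time_ms / 1000.0


# With a concurrency of 2 at 1k rps, each request gets a 2 ms budget.
print(max_service_time_ms(concurrency=2, rps=1000))

# If a request needed 5 ms of work (hypothetical), 5 workers would be busy.
print(required_concurrency(rps=1000, service_time_ms=5))
```

If the real per-request cost is above the budget, requests queue up regardless of throttling, which is consistent with the point that the concurrency setting itself is the likely problem.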

Jan 8 2024, 2:50 PM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops

Jan 5 2024

Joe added a comment to T350507: Update mobileapps k8s deployment chart for Cassandra credentials.

The snippet from the cassandra-http-gateway helm chart is not using keyspace/tables (because it's user-submitted at runtime, from what I understand of the project). I updated my patch with this expected config structure:

cassandra:
  hosts: ["127.0.0.1"]
  port: 9042
  local_dc: "datacenter1"
  authentication:
    username: "cassandra"
    password: "cassandra"
caching:
  enabled: false
  cassandra:
    keyspace: "tests"
    storageTable: "storage"
Jan 5 2024, 2:35 PM · Content-Transform-Team, Patch-For-Review, Page Content Service, serviceops, RESTBase Sunsetting

Jan 4 2024

Joe committed rOSCTa72b0932d197: Release 2.3.3.
Release 2.3.3
Jan 4 2024, 4:59 PM
Joe committed rOSCTfad5ae19f4b7: requestctl: ensure no irc logging happens.
requestctl: ensure no irc logging happens
Jan 4 2024, 4:59 PM

Jan 3 2024

Joe added a comment to T350507: Update mobileapps k8s deployment chart for Cassandra credentials.

I suggest we standardize on the configuration that we've used for the golang applications using cassandra.

Jan 3 2024, 10:18 AM · Content-Transform-Team, Patch-For-Review, Page Content Service, serviceops, RESTBase Sunsetting
Joe added a comment to T354229: Moving jobs to MW-on-k8s decreased their timeout from 1200s to 200s.

Yes, your understanding is correct; I had a patch fixing this that never got merged, I should just make a new version of that.

Jan 3 2024, 8:40 AM · SRE, serviceops, WMF-JobQueue, MW-on-K8s

Dec 14 2023

Joe edited P54403 (An Untitled Masterwork).
Dec 14 2023, 9:49 AM
Joe created P54403 (An Untitled Masterwork).
Dec 14 2023, 9:48 AM

Dec 11 2023

Joe added a comment to T348284: Handle sidecar containers in one-off Kubernetes jobs.

Yeah, good point. Fortunately it looks like a pretty straightforward Go patch to add a --namespace flag if need be, and pipe it through to the API call.

Dec 11 2023, 6:48 AM · MW-on-K8s, serviceops

Dec 7 2023

Joe added a comment to T352650: Migrate current-generation dumps to run from our containerized images.

@Joe thanks for thinking this through. I have three follow-up questions:

Dec 7 2023, 7:55 AM · MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
Joe closed T352870: Add humorous redirect for fox.wikimedia.org as Declined.

Or not :)

Dec 7 2023, 7:02 AM · SRE, Puppet

Dec 6 2023

Joe triaged T295007: Upload by URL should use the job queue, possibly chunked with range requests as High priority.

We intend to try to take a stab at this during next week's MediaWiki CodeJam.

Dec 6 2023, 3:50 PM · MW-1.42-notes (1.42.0-wmf.24; 2024-03-26), Patch-For-Review, MediaWiki CodeJam Dec 2023, MediaWiki-Uploading
Joe claimed T338297: Revisit thumbor's poolcounter integration.
Dec 6 2023, 3:32 PM · Patch-For-Review, serviceops, Thumbor
Joe added a project to T295007: Upload by URL should use the job queue, possibly chunked with range requests: MediaWiki CodeJam Dec 2023.
Dec 6 2023, 2:43 PM · MW-1.42-notes (1.42.0-wmf.24; 2024-03-26), Patch-For-Review, MediaWiki CodeJam Dec 2023, MediaWiki-Uploading

Dec 5 2023

Joe added a comment to T352744: OpenSSL 3.x performance issues.

Almost anything relevant internally uses envoy to mediate TLS both client and server side, so it's probably useful to list the oddballs.

Dec 5 2023, 4:50 PM · SRE-swift-storage, Traffic
Joe triaged T352247: Remove calls to GeoIP 1 in Extension:LandingCheck as High priority.

Raising the priority because at best Special:landingcheck is using data from a year and a half ago. Given we're entering our main fundraiser, this should probably be solved?

Dec 5 2023, 7:21 AM · MW-1.42-notes (1.42.0-wmf.9; 2023-12-12), Fundraising-Backlog, Fundraising Tech - Chaos Crew, LandingCheck, Wikimedia-production-error

Dec 4 2023

Joe added a comment to T352628: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError".

@Ladsgroup I think the log linked by @TheresNoTime is a typical example of a distributed transaction going wrong:

Dec 4 2023, 2:18 PM · MediaWiki-Categories, Patch-For-Review, MW-1.42-notes (1.42.0-wmf.7; 2023-11-28), DBA, Wikimedia-production-error
Joe updated subscribers of T352650: Migrate current-generation dumps to run from our containerized images.
Dec 4 2023, 11:26 AM · MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
Joe triaged T352650: Migrate current-generation dumps to run from our containerized images as Medium priority.
Dec 4 2023, 11:25 AM · MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
Joe created T352650: Migrate current-generation dumps to run from our containerized images.
Dec 4 2023, 11:25 AM · MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
Joe closed T352569: Lift IP cap on 2023-12-04 for Editathon for commonswiki and eswiki as Resolved.

Please next time allow the designated time of at least two weeks

I wonder in what fraction of situations an organiser would know the IP 2 weeks in advance. If that is at least 30% I would be surprised.

Dec 4 2023, 8:40 AM · Wikimedia-Site-requests
Joe added a comment to T352569: Lift IP cap on 2023-12-04 for Editathon for commonswiki and eswiki.

@Joe I'm guessing that comment is to the author of this task. Are you going to schedule a patch for the backport, or do you want me to take care of it for you? :)

Dec 4 2023, 6:27 AM · Wikimedia-Site-requests
Joe added a comment to T352569: Lift IP cap on 2023-12-04 for Editathon for commonswiki and eswiki.

Please note that usually a lead time is requested for throttle requests. Opening the task on a Friday for something needed on Monday (and adding the relevant information needed only on Saturday) isn't the right way to plan things.

Dec 4 2023, 6:18 AM · Wikimedia-Site-requests