
Clement_Goubert (claime)
Senior SRE

User Details

User Since
Jul 26 2022, 2:11 PM (60 w, 3 d)
Availability
Available
IRC Nick
claime
LDAP User
Clément Goubert
MediaWiki User
CGoubert-WMF

Recent Activity

Wed, Sep 20

Clement_Goubert added a project to T345970: Deploy StatsD exporter for Kubernetes: MW-on-K8s.
Wed, Sep 20, 11:18 AM · MW-on-K8s, serviceops, User-herron, Observability-Metrics, SRE Observability (FY2023/2024-Q1)
Clement_Goubert changed the status of T345243: Remove tls-proxy cpu limits on eventstreams from In Progress to Open.
Wed, Sep 20, 11:06 AM · Patch-For-Review, serviceops
Clement_Goubert changed the status of T345244: Remove tls-proxy cpu limits on eventgate from Open to In Progress.
Wed, Sep 20, 11:06 AM · Patch-For-Review, serviceops
Clement_Goubert changed the status of T345244: Remove tls-proxy cpu limits on eventgate, a subtask of T344814: mw-on-k8s tls-proxy container CPU throttling at low average load, from Open to In Progress.
Wed, Sep 20, 11:06 AM · serviceops, MW-on-K8s
Clement_Goubert changed the status of T345243: Remove tls-proxy cpu limits on eventstreams, a subtask of T344814: mw-on-k8s tls-proxy container CPU throttling at low average load, from In Progress to Open.
Wed, Sep 20, 11:06 AM · serviceops, MW-on-K8s
Clement_Goubert changed the status of T345243: Remove tls-proxy cpu limits on eventstreams from Open to In Progress.
Wed, Sep 20, 11:04 AM · Patch-For-Review, serviceops
Clement_Goubert changed the status of T345243: Remove tls-proxy cpu limits on eventstreams, a subtask of T344814: mw-on-k8s tls-proxy container CPU throttling at low average load, from Open to In Progress.
Wed, Sep 20, 11:03 AM · serviceops, MW-on-K8s

Tue, Sep 19

Clement_Goubert edited P52535 (An Untitled Masterwork).
Tue, Sep 19, 4:42 PM
Clement_Goubert added a subtask for T345888: 1.41.0-wmf.27 deployment blockers: T346800: startupregistrystats-testwiki periodic job fails.
Tue, Sep 19, 4:35 PM · Patch-For-Review, User-brennen, Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments
Clement_Goubert added a parent task for T346800: startupregistrystats-testwiki periodic job fails: T345888: 1.41.0-wmf.27 deployment blockers.
Tue, Sep 19, 4:35 PM · MW-1.41-notes (1.41.0-wmf.28; 2023-09-26), MediaWiki-extensions-WikimediaMaintenance, MediaWiki-ResourceLoader, MediaWiki-Platform-Team
Clement_Goubert created T346800: startupregistrystats-testwiki periodic job fails.
Tue, Sep 19, 4:34 PM · MW-1.41-notes (1.41.0-wmf.28; 2023-09-26), MediaWiki-extensions-WikimediaMaintenance, MediaWiki-ResourceLoader, MediaWiki-Platform-Team
Clement_Goubert edited P52535 (An Untitled Masterwork).
Tue, Sep 19, 4:33 PM
Clement_Goubert created P52535 (An Untitled Masterwork).
Tue, Sep 19, 4:26 PM
Clement_Goubert created P52534 (An Untitled Masterwork).
Tue, Sep 19, 3:36 PM

Mon, Sep 18

Clement_Goubert moved T246348: Log the real X-Client-IP in apache mediawiki logs from Incoming 🐫 to 🌻Mediawiki on the serviceops board.
Mon, Sep 18, 11:40 AM · serviceops, SRE
Clement_Goubert moved T199220: Cleanup cirrus keys in $wmfSwiftEqiadConfig from Incoming 🐫 to 🌻Mediawiki on the serviceops board.
Mon, Sep 18, 11:40 AM · serviceops, Wikimedia-Site-requests
Clement_Goubert moved T344154: Allow parallel image pulls in k8s from Incoming 🐫 to ⎈Kubernetes on the serviceops board.
Mon, Sep 18, 11:39 AM · Prod-Kubernetes, serviceops, Kubernetes
Clement_Goubert moved T345740: Sunset onhost memcached on mediawiki servers and puppet from Incoming 🐫 to 🌻Mediawiki on the serviceops board.
Mon, Sep 18, 11:39 AM · serviceops
Clement_Goubert moved T345868: Rename the shellbox service to shellbox-score from Incoming 🐫 to 🛎 Services & Oids on the serviceops board.
Mon, Sep 18, 11:39 AM · Shellbox, serviceops
Clement_Goubert moved T343388: Split the monolithic function-evaluator service up in production so we have differently-scalable pods for python vs. node from Incoming 🐫 to 🌻Mediawiki on the serviceops board.
Mon, Sep 18, 11:39 AM · serviceops, function-evaluator, Abstract Wikipedia team
Clement_Goubert moved T343389: Split the monolithic function-evaluator service up in production so we have differently-scalable pods for python 3.7 vs. python 3.8 vs. … from Incoming 🐫 to 🌻Mediawiki on the serviceops board.
Mon, Sep 18, 11:39 AM · serviceops, function-evaluator, Abstract Wikipedia team
Clement_Goubert moved T343459: Split the wikifunctions k8s pod up in production so we have differently-scalable pods for the orchestrator vs. the evaluator from Incoming 🐫 to 🌻Mediawiki on the serviceops board.
Mon, Sep 18, 11:39 AM · function-orchestrator, serviceops, function-evaluator, Abstract Wikipedia team
Clement_Goubert moved T345263: September 2023 Datacenter Switchover from Incoming 🐫 to Doing 😎 on the serviceops board.
Mon, Sep 18, 11:38 AM · Performance-Team, Data-Persistence, serviceops, Datacenter-Switchover, SRE
Clement_Goubert moved T345853: Fail event on /dev/md/0:kubernetes2028 from Incoming 🐫 to 🛠 Upgrades and Hardware on the serviceops board.
Mon, Sep 18, 11:38 AM · serviceops, ops-codfw, SRE
Clement_Goubert moved T346448: Migrate all eventgate installations to mw-api-int from Incoming 🐫 to this.quarter 🍕 on the serviceops board.
Mon, Sep 18, 11:37 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert moved T346447: Migrate wikifeeds to mw-api-int from Incoming 🐫 to this.quarter 🍕 on the serviceops board.
Mon, Sep 18, 11:37 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert moved T343829: Sandboxing Strategy for Wikifunctions from Incoming 🐫 to 🌻Mediawiki on the serviceops board.
Mon, Sep 18, 11:33 AM · Abstract Wikipedia team, serviceops
Clement_Goubert moved T345289: Come up with a way to make Wikifunctions calls not keep a PHP process alive whilst waiting for the backend from Incoming 🐫 to 🌻Mediawiki on the serviceops board.
Mon, Sep 18, 11:33 AM · function-orchestrator, serviceops, WikiLambda, Abstract Wikipedia team
Clement_Goubert moved T344998: Wikifunctions functions that require a lookup on wikifunctions.org timing out in the orchestrator, UX instead showing 'http' from Incoming 🐫 to 🌻Mediawiki on the serviceops board.
Mon, Sep 18, 11:33 AM · Patch-For-Review, serviceops, Wikimedia-production-error, Abstract Wikipedia team, Wikifunctions
Clement_Goubert moved T346354: restbase deploys via scap lead to all hosts being disabled in conftool from Incoming 🐫 to 🛎 Services & Oids on the serviceops board.
Mon, Sep 18, 11:33 AM · Release-Engineering-Team, Scap, serviceops
Clement_Goubert moved T345823: Wikikube staging clusters are out of IPv4 Pod IP's from Incoming 🐫 to ⎈Kubernetes on the serviceops board.
Mon, Sep 18, 11:33 AM · Prod-Kubernetes, Kubernetes, serviceops
Clement_Goubert moved T331680: Update footer links to direct to proper locations on Foundation Governance Wiki from Incoming 🐫 to 🌻Mediawiki on the serviceops board.
Mon, Sep 18, 11:33 AM · MW-1.41-notes (1.41.0-wmf.13; 2023-06-13), Patch-For-Review, serviceops, MediaWiki-extensions-General, WMF-General-or-Unknown, Wikimedia Foundation Governance Wiki (foundation.wikimedia.org), MediaWiki-Internationalization
Clement_Goubert moved T339119: Switchover plan from restbase to api gateway for wikifeeds from Incoming 🐫 to 🛎 Services & Oids on the serviceops board.
Mon, Sep 18, 11:33 AM · serviceops, Content-Transform-Team-WIP, RESTBase Sunsetting, Code-Health-Objective, Platform Engineering Roadmap, Wikifeeds
Clement_Goubert moved T336627: Accounts taking 30+ minutes to autocreate on metawiki/loginwiki (2023-05) from Incoming 🐫 to 🌻Mediawiki on the serviceops board.
Mon, Sep 18, 11:32 AM · MediaWiki-Platform-Team (Radar), Patch-For-Review, serviceops, WMF-JobQueue, Stewards-and-global-tools, MediaWiki-extensions-CentralAuth
Clement_Goubert moved T343398: Evaluate alternative for noc.wikimedia.org/dbconfig/ file server from Incoming 🐫 to 🌻Mediawiki on the serviceops board.
Mon, Sep 18, 11:31 AM · serviceops, noc.wikimedia.org
Clement_Goubert moved T343801: Create kube-state-metrics docker image from Incoming 🐫 to ⎈Kubernetes on the serviceops board.
Mon, Sep 18, 11:31 AM · Prod-Kubernetes, serviceops, Kubernetes
Clement_Goubert moved T344751: Decide on default histogram buckets for MediaWiki timers from Incoming 🐫 to 🌻Mediawiki on the serviceops board.
Mon, Sep 18, 11:31 AM · Patch-For-Review, serviceops, Observability-Metrics
Clement_Goubert moved T344605: deployment-prep needs a Thumbor instance from Incoming 🐫 to 🛎 Services & Oids on the serviceops board.
Mon, Sep 18, 11:30 AM · serviceops, Beta-Cluster-reproducible, Beta-Cluster-Infrastructure, Thumbor
Clement_Goubert moved T344914: Make termbox-test a proper production release from Incoming 🐫 to 🛎 Services & Oids on the serviceops board.
Mon, Sep 18, 11:30 AM · wdwb-tech, Wikidata-Termbox, serviceops, MW-on-K8s, Wikidata
Clement_Goubert moved T345274: Remove similar-users service from k8s from Incoming 🐫 to 🛎 Services & Oids on the serviceops board.
Mon, Sep 18, 11:30 AM · Similarusers, serviceops
Clement_Goubert moved T346315: Improve the flink-app chart to provide more useful defaults from Incoming 🐫 to 🛎 Services & Oids on the serviceops board.
Mon, Sep 18, 11:30 AM · Patch-For-Review, Discovery-Search (Current work), serviceops, Data-Engineering, Event-Platform
Clement_Goubert moved T346456: Improve concurrency limits configuration of the wdqs updater from Incoming 🐫 to 🛎 Services & Oids on the serviceops board.
Mon, Sep 18, 11:29 AM · Discovery-Search (Current work), wdwb-tech, Wikidata, serviceops, Wikidata-Query-Service
Clement_Goubert moved T343025: Identify path forward for k8s deployment of prometheus-statsd-exporter from Incoming 🐫 to 🌻Mediawiki on the serviceops board.
Mon, Sep 18, 11:29 AM · Patch-For-Review, serviceops, Observability-Metrics, SRE Observability (FY2023/2024-Q1)
Clement_Goubert moved T346472: Sept 2023 Switchover: list new primary DC servers first in debug.json from Incoming 🐫 to Doing 😎 on the serviceops board.
Mon, Sep 18, 11:29 AM · serviceops, Datacenter-Switchover, SRE
Clement_Goubert moved T346330: Sept 2023 Switchover Checklist: Services & Traffic from Incoming 🐫 to Doing 😎 on the serviceops board.
Mon, Sep 18, 11:29 AM · serviceops, Datacenter-Switchover, SRE
Clement_Goubert moved T346474: Sept 2023 Switchover Checklist: MediaWiki from Incoming 🐫 to Doing 😎 on the serviceops board.
Mon, Sep 18, 11:29 AM · Patch-For-Review, serviceops, Datacenter-Switchover, SRE
Clement_Goubert updated subscribers of T343025: Identify path forward for k8s deployment of prometheus-statsd-exporter.

The prometheus-statsd-exporter container and configuration are already deployed as a side-car for a number of kubernetes services:

deployment-charts/charts on  master [$?⇡] 
❯ git grep statsd-exporter | cut -d/ -f1 | sort -u
apertium
api-gateway
blubberoid
changeprop
developer-portal
eventstreams
linkrecommendation
machinetranslation
mediawiki-dev
miscweb
mobileapps
recommendation-api
shellbox
termbox
thumbor
toolhub
wikifeeds
Mon, Sep 18, 11:24 AM · Patch-For-Review, serviceops, Observability-Metrics, SRE Observability (FY2023/2024-Q1)
Clement_Goubert closed T344814: mw-on-k8s tls-proxy container CPU throttling at low average load as Resolved.

p50 latency increased slightly; we may want to raise the concurrency a little to see what shakes out.
Example mw-web eqiad

image.png (500×1 px, 90 KB)

Mon, Sep 18, 11:16 AM · serviceops, MW-on-K8s
Clement_Goubert closed T344814: mw-on-k8s tls-proxy container CPU throttling at low average load, a subtask of T290536: Serve production traffic via Kubernetes, as Resolved.
Mon, Sep 18, 11:16 AM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert moved T346422: Move 10% of mediawiki external requests to mw on k8s from Incoming 🐫 to this.quarter 🍕 on the serviceops board.
Mon, Sep 18, 11:10 AM · Patch-For-Review, serviceops, Kubernetes, Prod-Kubernetes

Fri, Sep 15

Clement_Goubert added a comment to T345884: mw2444 down.

Repooled, thank you!

Fri, Sep 15, 2:06 PM · serviceops, SRE, ops-codfw

Thu, Sep 14

Clement_Goubert closed T341780: Direct 5% of all traffic to mw-on-k8s as Resolved.

We are now serving 5% of global traffic from mw-on-k8s. Resolving.

Thu, Sep 14, 11:09 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert closed T341780: Direct 5% of all traffic to mw-on-k8s, a subtask of T290536: Serve production traffic via Kubernetes, as Resolved.
Thu, Sep 14, 11:07 AM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s

Wed, Sep 13

Clement_Goubert added a comment to T345884: mw2444 down.

I'm putting mw2444 back into pooled=no (instead of pooled=inactive) so it gets scap updates and stops warning. However, I'll wait until we're sure it's stable before actually putting it back in production.

Wed, Sep 13, 10:00 AM · serviceops, SRE, ops-codfw
Clement_Goubert updated subscribers of T346129: Puppet doesn't self-recover on build-envoy-config failure.

Recap from IRC discussion:
As we've given up on the first puppet run working completely, it doesn't make sense to put effort into fixing the first-order root cause of this particular issue, which is that /var/log/envoy permissions are wrong on first run.

Wed, Sep 13, 9:35 AM · serviceops, envoy

Mon, Sep 11

Clement_Goubert added a comment to T345802: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet.

Thanks @Jhancock.wm !

Mon, Sep 11, 2:33 PM · serviceops, SRE, ops-codfw, DC-Ops

Fri, Sep 8

Clement_Goubert added a comment to T345868: Rename the shellbox service to shellbox-score.

No problem with the general approach. I propose using a _shellbox_common_ directory, like the _aqs2-common_ and _mediawiki-common_ directories we already have in helmfile.d/services, and symlinking from there.

Fri, Sep 8, 4:36 AM · Shellbox, serviceops

Thu, Sep 7

Clement_Goubert closed T345812: Uncaught ConfigException: Failed to load configuration from etcd: in /srv/mediawiki/php-1.41.0-wmf.25/includes/config/EtcdConfig.php:229 as Resolved.

A subsequent deployment didn't trigger that error again, so I think we can file it as a transient issue with one pod on startup. We will look into it further if it happens again.

Thu, Sep 7, 11:01 AM · serviceops, Wikimedia-production-error
Clement_Goubert lowered the priority of T345812: Uncaught ConfigException: Failed to load configuration from etcd: in /srv/mediawiki/php-1.41.0-wmf.25/includes/config/EtcdConfig.php:229 from Unbreak Now! to Medium.

Since it only impacts one pod, it has a reqId and an actual request (meaning it's a runtime error, not a startup/load time error), and didn't log anything afterwards despite serving many requests, I'm downgrading to medium on the assumption it's a transient error. I will keep an eye on subsequent deployments to see if it pops up again.

Thu, Sep 7, 10:45 AM · serviceops, Wikimedia-production-error
Clement_Goubert created T345802: hw troubleshooting: DIMM failure for mc2040.codfw.wmnet.
Thu, Sep 7, 8:22 AM · serviceops, SRE, ops-codfw, DC-Ops

Wed, Sep 6

Clement_Goubert closed T345741: hw troubleshooting: Can't reboot mw1349.eqiad.wmnet as Resolved.

You could try running "racadm racreset" over the serial console; it solved similar cases for me in the past. It will kick you off the current SSH connection to the mgmt, but you can usually reconnect after ~30 seconds.

Wed, Sep 6, 2:28 PM · SRE, ops-eqiad, DC-Ops
Clement_Goubert updated subscribers of T345741: hw troubleshooting: Can't reboot mw1349.eqiad.wmnet.
Wed, Sep 6, 2:09 PM · SRE, ops-eqiad, DC-Ops
Clement_Goubert added a comment to T345741: hw troubleshooting: Can't reboot mw1349.eqiad.wmnet.

Server came back up as I pressed submit; however, there is still an issue with the management interface. It does powercycle the server when asked, but reports "Unable to perform requested operation."

Wed, Sep 6, 2:09 PM · SRE, ops-eqiad, DC-Ops
Clement_Goubert created T345741: hw troubleshooting: Can't reboot mw1349.eqiad.wmnet.
Wed, Sep 6, 2:06 PM · SRE, ops-eqiad, DC-Ops

Fri, Sep 1

Clement_Goubert changed the status of T339984: Configure the aggregation job to run periodically on Wikimedia wikis from Open to In Progress.
Fri, Sep 1, 9:36 AM · Patch-For-Review, Campaign-Tools (Campaign-Tools-Current-Sprint), serviceops, Wikimedia-Site-requests, Campaign-Registration
Clement_Goubert changed the status of T339984: Configure the aggregation job to run periodically on Wikimedia wikis, a subtask of T321822: [EPIC] Participant Questions - MVP, from Open to In Progress.
Fri, Sep 1, 9:36 AM · User-Iflorez, CampaignEvents, Campaign-Tools, Campaign-Registration

Wed, Aug 30

Clement_Goubert updated subscribers of T345263: September 2023 Datacenter Switchover.
Wed, Aug 30, 4:25 PM · Performance-Team, Data-Persistence, serviceops, Datacenter-Switchover, SRE
Clement_Goubert triaged T345243: Remove tls-proxy cpu limits on eventstreams as Medium priority.
Wed, Aug 30, 12:14 PM · Patch-For-Review, serviceops
Clement_Goubert triaged T345244: Remove tls-proxy cpu limits on eventgate as Medium priority.
Wed, Aug 30, 12:13 PM · Patch-For-Review, serviceops
Clement_Goubert removed a project from T345244: Remove tls-proxy cpu limits on eventgate: MW-on-K8s.
Wed, Aug 30, 12:13 PM · Patch-For-Review, serviceops
Clement_Goubert created T345244: Remove tls-proxy cpu limits on eventgate.
Wed, Aug 30, 12:12 PM · Patch-For-Review, serviceops
Clement_Goubert created T345243: Remove tls-proxy cpu limits on eventstreams.
Wed, Aug 30, 12:11 PM · Patch-For-Review, serviceops

Tue, Aug 29

Clement_Goubert added a comment to T344814: mw-on-k8s tls-proxy container CPU throttling at low average load.

CPU limits have now been removed on all mw-on-k8s deployments except mw-misc. We'll wait a few days to see how the reduced concurrency impacts latency if at all, then resolve this task.

Tue, Aug 29, 11:01 AM · serviceops, MW-on-K8s
Clement_Goubert added a comment to T326657: Add prometheus-https load balancer.

Just FYI, JS and CSS are currently broken on prometheus-{eqiad,codfw}.wikipedia.org due to 401 and 403 errors, with some CORS sprinkled in

Tue, Aug 29, 9:47 AM · Traffic, Patch-For-Review, Observability-Metrics
Clement_Goubert created P51863 401 errors on prometheus.
Tue, Aug 29, 9:45 AM
Clement_Goubert created T345138: xe-3/2/1: down -> Transport: cr1-esams:xe-0/0/7 (Lumen, BDFS2448 80ms 10Gbps wave) {#2013}.
Tue, Aug 29, 9:37 AM · SRE, netops, Infrastructure-Foundations

Mon, Aug 28

Clement_Goubert reopened T340935: Some apache access logs are invalid json as "Open".

We are still experiencing issues: some logs are being escaped into single-byte ISO-8859-1 values instead of the double-byte UTF-8 encoding.

Mon, Aug 28, 12:36 PM · Patch-For-Review, Observability-Logging, serviceops, MW-on-K8s

Fri, Aug 25

Clement_Goubert added a comment to T341320: Wikimedia\RemexHtml\Tokenizer\TokenizerError: Wikimedia\RemexHtml\Tokenizer\Tokenizer: pcre.backtrack_limit exhausted.

@tstarling I've verified that the above requests work with the correct new configurations applied for pcre.backtrack_limit and max_execution_time after deploying to mw-debug and forcing the requests there through XWD. My checks for the ini values in debug are good as well.

Fri, Aug 25, 11:17 AM · MW-on-K8s, Maintenance-Worktype, RemexHtml, Wikimedia-production-error
Clement_Goubert changed the status of T341320: Wikimedia\RemexHtml\Tokenizer\TokenizerError: Wikimedia\RemexHtml\Tokenizer\Tokenizer: pcre.backtrack_limit exhausted from Open to In Progress.
Fri, Aug 25, 10:41 AM · MW-on-K8s, Maintenance-Worktype, RemexHtml, Wikimedia-production-error

Thu, Aug 24

Clement_Goubert closed T344904: Termbox SSR broken on Test Wikidata (since k8s migration? unclear) as Resolved.

Resolving, feel free to reopen if there are still any issues.

Thu, Aug 24, 12:41 PM · wdwb-tech, Wikidata-Termbox, serviceops, MW-on-K8s, Wikidata
Clement_Goubert added a comment to T344904: Termbox SSR broken on Test Wikidata (since k8s migration? unclear).

Looks like Test Wikidata (which is mw-on-k8s) can’t talk to the Termbox SSR (@Joe says in IRC it’s missing an egress rule):

lucaswerkmeister-wmde@deploy1002 ~ $ sudo mw-debug-repl testwikidatawiki
Finding a mw-debug pod in eqiad...
Now running shell.php for testwikidatawiki inside pod/mw-debug.eqiad.pinkunicorn-8477b6d89d-8r4bc...
Psy Shell v0.11.10 (PHP 7.4.33 — cli) by Justin Hileman
> $rf = mws()->getHttpRequestFactory()
= MediaWiki\Http\HttpRequestFactory {#3812}

> $req = $rf->create( 'http://termbox-test.staging.svc.eqiad.wmnet:3031/termbox?entity=Q229877&revision=630197&language=en&editLink=%2Fwiki%2FSpecial%3ASetLabelDescriptionAliases%2FQ229877&preferredLanguages=en', [ 'method' => 'GET' ], 'Lucas Werkmeister (WMDE) manual testing' )
= GuzzleHttpRequest {#3961}

> $res = $req->execute()
= Status {#3986
    +cleanCallback: false,
    +value: & 0,
    +success: & [],
    +successCount: & 0,
    +failCount: & 0,
  }

> $res->getErrors()
= [
    [
      "type" => "error",
      "message" => "http-curl-error",
      "params" => [
        "Failed to connect to termbox-test.staging.svc.eqiad.wmnet port 3031: Connection timed out",
      ],
    ],
    [
      "type" => "error",
      "message" => "http-bad-status",
      "params" => [
        "0",
        "Error",
      ],
    ],
  ]
Thu, Aug 24, 12:39 PM · wdwb-tech, Wikidata-Termbox, serviceops, MW-on-K8s, Wikidata
Clement_Goubert created T344914: Make termbox-test a proper production release.
Thu, Aug 24, 12:18 PM · wdwb-tech, Wikidata-Termbox, serviceops, MW-on-K8s, Wikidata

Aug 23 2023

Clement_Goubert added a comment to T343708: Physical re-labeling of mw1497 and mw1498 to kubernetes1025 and kubernetes1026.

Thank you, sorry for the out-of-order operation

Aug 23 2023, 2:31 PM · SRE, ops-eqiad, DC-Ops, serviceops-radar, MW-on-K8s
Clement_Goubert added a comment to T344814: mw-on-k8s tls-proxy container CPU throttling at low average load.

The envoy configuration dumped from one of our containers, together with the absence of any CLI flag for it, shows that envoy is setting its number of threads to the number of hardware cores.
In other words, we have a CPU limit of 500mCPU on a 48-core machine, so each worker thread gets ~10mCPU (this is for illustration purposes only, because the allocation doesn't actually work like that).
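
The illustration above can be made concrete with a little arithmetic. This is a minimal sketch in Python, assuming the default 100 ms CFS period commonly used to enforce Kubernetes CPU limits (the period is not stated in the comment). The function names are hypothetical, and the even split across threads is only illustrative, as the comment itself notes.

```python
# Values taken from the comment above.
CPU_LIMIT_MILLICORES = 500   # tls-proxy container CPU limit
HARDWARE_CORES = 48          # one envoy worker thread per hardware core

# Assumption: default CFS enforcement period of 100 ms (not stated in the comment).
CFS_PERIOD_MS = 100


def per_thread_millicores(limit_millicores: int, threads: int) -> float:
    """Naive even split of the CPU limit across worker threads (illustration only)."""
    return limit_millicores / threads


def quota_ms_per_period(limit_millicores: int, period_ms: int = CFS_PERIOD_MS) -> float:
    """CPU time the whole container may consume within one CFS period."""
    return limit_millicores / 1000 * period_ms


def ms_until_throttled(limit_millicores: int, busy_threads: int,
                       period_ms: int = CFS_PERIOD_MS) -> float:
    """Wall-clock time before the quota runs out if busy_threads run in parallel."""
    return quota_ms_per_period(limit_millicores, period_ms) / busy_threads


if __name__ == "__main__":
    print(round(per_thread_millicores(CPU_LIMIT_MILLICORES, HARDWARE_CORES), 1))
    print(quota_ms_per_period(CPU_LIMIT_MILLICORES))
    print(round(ms_until_throttled(CPU_LIMIT_MILLICORES, HARDWARE_CORES), 2))
```

Under these assumptions, 48 simultaneously busy threads would use up the 50 ms quota after roughly 1 ms of wall-clock time, which would be consistent with throttling showing up even at low average load.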

Aug 23 2023, 12:40 PM · serviceops, MW-on-K8s
Clement_Goubert changed the status of T344814: mw-on-k8s tls-proxy container CPU throttling at low average load, a subtask of T290536: Serve production traffic via Kubernetes, from Open to In Progress.
Aug 23 2023, 11:48 AM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert changed the status of T344814: mw-on-k8s tls-proxy container CPU throttling at low average load from Open to In Progress.
Aug 23 2023, 11:48 AM · serviceops, MW-on-K8s
Clement_Goubert removed a project from T344814: mw-on-k8s tls-proxy container CPU throttling at low average load: SRE.
Aug 23 2023, 11:48 AM · serviceops, MW-on-K8s
Clement_Goubert created T344814: mw-on-k8s tls-proxy container CPU throttling at low average load.
Aug 23 2023, 11:46 AM · serviceops, MW-on-K8s
Clement_Goubert closed T342748: mw-on-k8s app container CPU throttling at low average load, a subtask of T290536: Serve production traffic via Kubernetes, as Resolved.
Aug 23 2023, 11:39 AM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert closed T342748: mw-on-k8s app container CPU throttling at low average load as Resolved.
Aug 23 2023, 11:39 AM · serviceops, MW-on-K8s

Aug 22 2023

Clement_Goubert created P50920 (An Untitled Masterwork).
Aug 22 2023, 1:09 PM
Clement_Goubert added a comment to T342748: mw-on-k8s app container CPU throttling at low average load.

Everything is looking OK. We will see how it copes with doubling the incoming traffic from T341780: Direct 5% of all traffic to mw-on-k8s (only going to 2% for now) and resolve afterwards if everything stays OK.

Aug 22 2023, 9:15 AM · serviceops, MW-on-K8s
Clement_Goubert added a comment to T341780: Direct 5% of all traffic to mw-on-k8s.

Pending more hardware, we will move on to 2% first.

Aug 22 2023, 9:08 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s

Aug 21 2023

Clement_Goubert reassigned T334064: Migrate termbox to mw-api-int from Clement_Goubert to Joe.
Aug 21 2023, 2:39 PM · Patch-For-Review, wdwb-tech, Wikidata, Wikidata-Termbox, serviceops, MW-on-K8s
Clement_Goubert moved T334064: Migrate termbox to mw-api-int from this.quarter 🍕 to Doing 😎 on the serviceops board.
Aug 21 2023, 2:34 PM · Patch-For-Review, wdwb-tech, Wikidata, Wikidata-Termbox, serviceops, MW-on-K8s
Clement_Goubert moved T341780: Direct 5% of all traffic to mw-on-k8s from this.quarter 🍕 to Doing 😎 on the serviceops board.
Aug 21 2023, 2:27 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert moved T342748: mw-on-k8s app container CPU throttling at low average load from this.quarter 🍕 to Doing 😎 on the serviceops board.
Aug 21 2023, 2:27 PM · serviceops, MW-on-K8s
Clement_Goubert added a comment to T342748: mw-on-k8s app container CPU throttling at low average load.

All deployments of mw-on-k8s are now using:

  • Autocomputed CPU requests, no limits
  • Autocomputed Memory requests and limits
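
The policy in the list above could be expressed as a Kubernetes-style `resources` mapping. This is a minimal sketch in Python, assuming hypothetical values and a hypothetical helper named `container_resources`. The comment does not show how the values are autocomputed for mw-on-k8s, so that part is not reproduced here.

```python
from typing import Dict


def container_resources(cpu_request: str, memory_request: str,
                        memory_limit: str) -> Dict[str, Dict[str, str]]:
    """Return a Kubernetes-style resources mapping with no CPU limit (hypothetical helper)."""
    return {
        "requests": {"cpu": cpu_request, "memory": memory_request},
        # Only memory is limited; omitting "cpu" here means no CPU limit is set.
        "limits": {"memory": memory_limit},
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(container_resources("4", "2Gi", "3Gi"))
```

Leaving the CPU limit out means the container is not subject to a CFS quota, while the CPU request still informs scheduling.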
Aug 21 2023, 2:05 PM · serviceops, MW-on-K8s

Aug 18 2023

Clement_Goubert added a comment to T277876: Reserve resources for system daemons on kubernetes nodes.

Considering there's no reservation for system resources at the moment, I feel like that would be a better solution than doing nothing, especially as we increase requests for T342748.

Aug 18 2023, 1:28 PM · Patch-For-Review, serviceops, Kubernetes, Prod-Kubernetes
Clement_Goubert updated Clement_Goubert.
Aug 18 2023, 11:13 AM