hnowlan (Hugh Nowlan)
User

User Details

User Since
Jan 6 2020, 12:19 PM (223 w, 3 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
HNowlan (WMF) [ Global Accounts ]

Recent Activity

Yesterday

hnowlan added a comment to T358308: AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes.

Looking at recent logs, it does seem like some of the failures might be caused by things other than a timeout (there was one recently that failed after 17 seconds). However, I still think the larger files are failing due to some sort of timeout, as they tend to fail at around the 202-second mark, which is pretty suspicious. The sample size is pretty low, since users know large uploads don't work and so don't try to upload large files.

Wed, Apr 17, 3:06 PM · Patch-For-Review, WMF-JobQueue, MediaWiki-File-management

Tue, Apr 16

hnowlan triaged T358308: AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes as High priority.
Tue, Apr 16, 3:05 PM · Patch-For-Review, WMF-JobQueue, MediaWiki-File-management
hnowlan updated the task description for T360636: Phase out cergen for ServiceOps services.
Tue, Apr 16, 11:17 AM · Patch-For-Review, serviceops, Epic, SRE

Mon, Apr 15

hnowlan added a comment to T362518: Deprecate buster-backports.

mediawiki/vagrant (php7.4-wikidiff2)

MediaWiki-Vagrant is based on Buster, so it is no longer officially supported. There is a task to update it to Bullseye: T319167

Mon, Apr 15, 10:40 AM · Patch-For-Review, Infrastructure-Foundations, Release-Engineering-Team, serviceops

Thu, Apr 11

hnowlan awarded T362323: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) a Stroopwafel token.
Thu, Apr 11, 12:16 PM · MoveComms-Support, SRE, Traffic, serviceops, MW-on-K8s

Wed, Apr 10

hnowlan added a comment to T360531: [Commons Impact Metrics] Refactor/create helm config for AQS service accessing both Cassandra and Druid.

There is also an opportunity to remove the need to specify all of the cassandra server IPs by implementing T359423: Migrate charts to Calico Network Policies for this chart.

Wed, Apr 10, 9:00 AM · Data Products (Data Products Sprint 12), Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Commons-Impact-Metrics

Tue, Apr 9

hnowlan added a comment to T358308: AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes.

Kubernetes jobrunners were using the default max_execution_time of 180s, which might be the culprit for these behaviours. I've bumped it to match the metal jobrunners.

Hmm, that might also have caused problems for web requests, as in the current configuration MediaWiki assumes that web POST requests can last more than 200 seconds, and there is code that tries to ensure that a request taking too much time is cleaned up gracefully and doesn't die inside a critical section.

As an aside, the jobs do seem to be failing after about 200 seconds, which is the timeout for a non-jobrunner POST request. However, I read through the code and don't see how that limit could be applied here, since it's based on SERVERGROUP and the log entries from the jobs have the correct SERVERGROUP.
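
As a rough aid for this kind of triage, the sketch below compares an observed failure duration against the limits mentioned in this thread (180s default max_execution_time, 200s POST request timeout, ~20 minutes expected for jobrunners). The labels, the `nearest_limit` helper and the 5-second tolerance are illustrative assumptions, not anything taken from the MediaWiki configuration.

```python
# Limits mentioned in the discussion, in seconds. The labels are informal
# and the tolerance below is an arbitrary assumption for illustration.
KNOWN_LIMITS = {
    "default max_execution_time": 180,
    "non-jobrunner POST request timeout": 200,
    "expected jobrunner limit (~20 minutes)": 20 * 60,
}


def nearest_limit(duration_s, tolerance_s=5.0):
    """Return the name of the limit closest to duration_s.

    Returns None when no known limit is within tolerance_s seconds,
    which suggests the failure was probably not a timeout at all.
    """
    name, limit = min(
        KNOWN_LIMITS.items(), key=lambda item: abs(item[1] - duration_s)
    )
    return name if abs(limit - duration_s) <= tolerance_s else None
```

With these assumptions, a job failing at 202 seconds lines up with the POST request timeout, while one failing after 17 seconds matches nothing and likely failed for another reason.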

Tue, Apr 9, 11:53 AM · Patch-For-Review, WMF-JobQueue, MediaWiki-File-management

Mon, Apr 8

hnowlan added a comment to T350507: Update mobileapps k8s deployment chart for Cassandra credentials.

Do we have a plan for when and how we'd like to move this to production?

Mon, Apr 8, 3:41 PM · Content-Transform-Team, Patch-For-Review, Page Content Service, serviceops, RESTBase Sunsetting
hnowlan added a comment to T358308: AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes.

Kubernetes jobrunners were using the default max_execution_time of 180s, which might be the culprit for these behaviours. I've bumped it to match the metal jobrunners.

Mon, Apr 8, 11:16 AM · Patch-For-Review, WMF-JobQueue, MediaWiki-File-management
hnowlan added a comment to T358308: AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes.

Having looked at this a little further, I have a strong suspicion the timeout is coming from somewhere other than the config that we currently have defined (perhaps a default somewhere that isn't explicit). I'm going to keep looking into it. There's a chance this timeout is affecting other mw-on-k8s charts but just hasn't bitten us because they don't use the longer timeouts the jobrunners do.

Mon, Apr 8, 9:41 AM · Patch-For-Review, WMF-JobQueue, MediaWiki-File-management

Fri, Apr 5

hnowlan added a comment to T361576: Switchover plan from restbase to api gateway for Citoid.

We don't strictly need to support $lang/v1/data/citation?format=$format&search=$search or $lang/v1/data/citation/api?format=$format&search=$search. Incoming requests with GET parameters were deprecated many years ago and everyone moved to the restbase pattern. This was more of a nice-to-have.

Fri, Apr 5, 10:09 AM · Citoid, RESTBase Sunsetting

Thu, Apr 4

hnowlan added a comment to T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.

What external paths should we be routing to what internal paths for this service?

Thu, Apr 4, 3:33 PM · Data Products (Data Products Sprint 12), serviceops, Service-deployment-requests, SRE
hnowlan updated the task description for T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.
Thu, Apr 4, 1:50 PM · Data Products (Data Products Sprint 12), serviceops, Service-deployment-requests, SRE
hnowlan updated the task description for T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.
Thu, Apr 4, 1:44 PM · Data Products (Data Products Sprint 12), serviceops, Service-deployment-requests, SRE
jijiki awarded T350507: Update mobileapps k8s deployment chart for Cassandra credentials a Love token.
Thu, Apr 4, 12:48 PM · Content-Transform-Team, Patch-For-Review, Page Content Service, serviceops, RESTBase Sunsetting
hnowlan added a comment to T361576: Switchover plan from restbase to api gateway for Citoid.

After spending some time on this, I think we might have some difficulty with supporting the third citoid pattern of rewriting $site/v1/data/citation/$format/$search to /api?format=$format&search=$search. Envoy's support for this kind of mangling isn't very complete as it's something it kinda expects services to do elsewhere.
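
To make the requested mangling concrete, here is a minimal sketch of the rewrite in Python. It only illustrates the mapping from the path form to the query-string form; it is not Envoy configuration, and the regex, the `rewrite_citation_path` name and the assumption that the `$site` prefix has already been stripped are all hypothetical.

```python
import re
from typing import Optional
from urllib.parse import unquote, urlencode

# Hypothetical pattern for the path form, with the $site prefix already removed.
# $search is allowed to contain slashes, so it takes the rest of the path.
_CITATION_PATH = re.compile(r"^/v1/data/citation/(?P<format>[^/]+)/(?P<search>.+)$")


def rewrite_citation_path(path: str) -> Optional[str]:
    """Map /v1/data/citation/$format/$search to /api?format=$format&search=$search.

    Returns None when the path does not match the citation pattern.
    """
    match = _CITATION_PATH.match(path)
    if match is None:
        return None
    # Decode the path segment first so it is not percent-encoded twice
    # when it is re-encoded as a query parameter.
    params = {
        "format": unquote(match.group("format")),
        "search": unquote(match.group("search")),
    }
    return "/api?" + urlencode(params)
```

The decode and re-encode step is the part a plain prefix or regex substitution in a proxy cannot easily reproduce, which is one reason this kind of rewrite is usually left to the service itself.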

Thu, Apr 4, 10:31 AM · Citoid, RESTBase Sunsetting

Wed, Apr 3

hnowlan added a comment to T361706: 2024-04-03 calico/typha down.

Deployed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1016794 to increase Typha memory limits

Wed, Apr 3, 2:36 PM · Patch-For-Review, Prod-Kubernetes, Wikimedia-Incident
hnowlan added a comment to T350507: Update mobileapps k8s deployment chart for Cassandra credentials.

It looks like the network setup between staging and cassandra-dev is all as we would expect. Pods are allowed to connect, and the cassandra firewall allows the staging IP range. While nsentered into the mobileapps staging pod:

root@kubestage1003:/home/hnowlan# time nc -z 10.192.16.15 9042  && echo ok
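
For environments where nc is not available inside the pod, roughly the same reachability check can be sketched in Python. The `port_is_open` helper and its 3-second default timeout are assumptions for illustration; like `nc -z`, it only opens a TCP connection and closes it again.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection performs the TCP handshake; the context
        # manager closes the socket again without sending any data.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

A call such as `port_is_open("10.192.16.15", 9042)` would mirror the command above, using the address and port from that command.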
Wed, Apr 3, 12:10 PM · Content-Transform-Team, Patch-For-Review, Page Content Service, serviceops, RESTBase Sunsetting

Tue, Apr 2

hnowlan added a comment to T338425: Prepare Citoid for use without RESTbase.

So yeah, this might be something we could add to Rest Gateway. Honestly it's more nice to have than anything... alternatively could have people put in their own accept-language header instead of passing on the one in the wiki automatically. Basically affects how people are using the gui api endpoint, not how the service functions inherently.

Tue, Apr 2, 3:45 PM · Patch-For-Review, Citoid, RESTBase Sunsetting
hnowlan added a comment to T358308: AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes.

I've merged the change to increase the envoy timeout for the jobrunners (thanks for the patches!). It looks like run times are increasing for the jobs in question, but I'll continue to monitor before we make any further changes.

Tue, Apr 2, 12:34 PM · Patch-For-Review, WMF-JobQueue, MediaWiki-File-management
hnowlan closed T336836: REST-Gateway: Requirements for gateway response headers vs service/app layer response headers as Resolved.
Tue, Apr 2, 9:31 AM · serviceops, Platform Team Initiatives (API Gateway), RESTBase Sunsetting
hnowlan closed T277584: [API Gateway] Redefine response time using proxy model as Declined.

Declined in favour of T297222

Tue, Apr 2, 9:29 AM · serviceops, Platform Team Initiatives (API Gateway)
hnowlan closed T277584: [API Gateway] Redefine response time using proxy model, a subtask of T277582: API Gateway SLO v2 , as Declined.
Tue, Apr 2, 9:29 AM · serviceops, Platform Team Initiatives (API Gateway)
hnowlan updated the task description for T297222: [API Gateway] Get insight into proxy time for Envoy .
Tue, Apr 2, 9:28 AM · serviceops, Platform Team Initiatives (API Gateway), Platform Team Workboards (Platform Engineering Reliability)
hnowlan closed T296288: API Gateway needs a dual logging solution as Declined.
Tue, Apr 2, 9:21 AM · serviceops, Platform Team Workboards (Platform Engineering Reliability), Platform Team Initiatives (API Gateway)
hnowlan closed T264095: Fail k8s config template render for api-gateway if some critical values are not defined as Declined.
Tue, Apr 2, 9:16 AM · serviceops, Platform Team Initiatives (API Gateway), Platform Team Workboards (Green)

Mon, Mar 25

hnowlan created P58907 (An Untitled Masterwork).
Mon, Mar 25, 12:59 PM
hnowlan added a comment to T360597: Increased latency, timeouts from wikifeeds since march 10th.

From the logs I think there are two things to investigate:

  • What has happened since ~10th March?
  • Why was traffic completely dropped on the 19th of March in codfw?
    • Could it be related to the datacenter switchover?
    • Meanwhile, eqiad kept receiving traffic before and after that time.

codfw: [attached image]

eqiad: [attached image]

Mon, Mar 25, 11:40 AM · Content-Transform-Team-WIP, Patch-For-Review, serviceops, Content-Transform-Team
hnowlan updated subscribers of T360597: Increased latency, timeouts from wikifeeds since march 10th.

Turnilo says that this is mostly being caused by clients using various versions of the mobile apps. Logstash makes it clear that Wikifeeds itself is timing out when talking to restbase. The URL pattern in that query shows an oddity: request.url shows the featured link (or similar), but the failing internalURI is almost always an image. I can't find a reference for it, but I'm fairly sure we've seen this timeout-upon-image-request behaviour in restbase before, but with changeprop.

Mon, Mar 25, 11:36 AM · Content-Transform-Team-WIP, Patch-For-Review, serviceops, Content-Transform-Team

Mar 6 2024

hnowlan closed T357907: Migrate remaining internal MW API traffic to k8s as Resolved.
Mar 6 2024, 10:53 AM · MW-on-K8s, serviceops
hnowlan updated the task description for T357907: Migrate remaining internal MW API traffic to k8s.
Mar 6 2024, 10:52 AM · MW-on-K8s, serviceops
hnowlan closed T329049: Configure REST Gateway as Resolved.
Mar 6 2024, 10:08 AM · Patch-For-Review, Platform Team Workboards (Platform Engineering Reliability), RESTBase, Platform Team Initiatives (API Gateway)
hnowlan closed T338143: Migrate AQS 1.0 off RESTbase as Resolved.
Mar 6 2024, 10:07 AM · MediaWiki-Engineering, Epic, RESTBase Sunsetting
hnowlan closed T338143: Migrate AQS 1.0 off RESTbase, a subtask of T262315: <CORE TECHNOLOGY> API Migration & RESTbase Sunset, as Resolved.
Mar 6 2024, 10:07 AM · API Platform (RESTbase Deprecation Roadmap), Epic, Foundational Technology Requests, Code-Health, Platform Engineering Roadmap, Platform Engineering Roadmap Decision Making
hnowlan added a comment to T359234: ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqiad).

Is this the same bug as T356369?

Mar 6 2024, 9:51 AM · Content-Transform-Team-WIP, Content-Transform-Team, Data-Persistence, Traffic

Feb 28 2024

hnowlan committed rMSGOae42ec14a6c0: logger: Correct documentation.
logger: Correct documentation
Feb 28 2024, 5:11 PM

Feb 26 2024

akosiaris awarded T345274: Remove similar-users service from k8s a Love token.
Feb 26 2024, 12:57 PM · Patch-For-Review, Similarusers, serviceops

Feb 23 2024

hnowlan created P57817 homer diff .
Feb 23 2024, 11:11 AM

Feb 22 2024

hnowlan closed T358001: Issues reimaging servers in codfw as Resolved.

@hnowlan I've replaced the network cable on both of these. These are both connected to a 1G switch so there is no SFP to replace in this case.

If this does not fix the issue lmk and we can upgrade the idrac and bios firmware.

Feb 22 2024, 7:19 PM · SRE, serviceops, ops-codfw
hnowlan added a comment to T358195: Investigate increased preprocessing latencies on LW of article-descriptions model.

I added some more logging so that we can get a better breakdown of where the time is being spent, and it turns out that it is the MWAPI that is taking too long.

time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "fr", "title": "Tour_Eiffel", "num_beams": 2, "debug": 1}' -H  "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"lang":"fr","title":"Tour_Eiffel","blp":false,"num_beams":2,"groundtruth":"monument de Paris, France","latency":{"wikidata-info (s)":0.0849905014038086,"mwapi - first paragraphs (s)":6.222133636474609,"total network (s)":6.26397967338562,"model (s)":4.969624042510986,"total (s)":11.23367714881897},"features"

In the specific example 16 requests are made and all 16 extracts are returned to be processed by the model. Although the requests seem to be made asynchronously they take approx 6 seconds to complete.
Regarding these requests, I think we should also be specifying the Host header when making requests to Rest Gateway here.
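
The latency figures in the response above can be turned into per-stage shares to show where the time goes. This is a small sketch using only the numbers quoted in that response; the `stage_shares` helper is a hypothetical name.

```python
# Latency breakdown copied from the debug response quoted above (seconds).
latency = {
    "wikidata-info (s)": 0.0849905014038086,
    "mwapi - first paragraphs (s)": 6.222133636474609,
    "total network (s)": 6.26397967338562,
    "model (s)": 4.969624042510986,
    "total (s)": 11.23367714881897,
}


def stage_shares(latency):
    """Return each entry's fraction of the total request time."""
    total = latency["total (s)"]
    return {
        name: seconds / total
        for name, seconds in latency.items()
        if name != "total (s)"
    }


shares = stage_shares(latency)

# "total network (s)" is left out here because it is an aggregate of the
# network calls rather than a separate stage.
stages = ("wikidata-info (s)", "mwapi - first paragraphs (s)", "model (s)")
slowest = max(stages, key=latency.get)
```

On these numbers the MWAPI extract requests account for a little over half of the total, with the model taking most of the rest.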

Feb 22 2024, 2:35 PM · Wikipedia-Android-App-Backlog, Machine-Learning-Team
hnowlan updated the task description for T354791: Reclaim jobrunner hardware for k8s.
Feb 22 2024, 1:13 PM · SRE, serviceops, MW-on-K8s
hnowlan closed T357731: Requesting access to Analytics-privatedata-users for jwheeler as Resolved.

Done

Feb 22 2024, 11:51 AM · SRE, SRE-Access-Requests

Feb 21 2024

Lucas_Werkmeister_WMDE awarded T349796: Move MediaWiki jobs to mw-on-k8s a Barnstar token.
Feb 21 2024, 10:56 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
hnowlan moved T357731: Requesting access to Analytics-privatedata-users for jwheeler from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Feb 21 2024, 10:44 AM · SRE, SRE-Access-Requests
hnowlan closed T349796: Move MediaWiki jobs to mw-on-k8s, a subtask of T290536: Serve production traffic via Kubernetes, as Resolved.
Feb 21 2024, 10:39 AM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
hnowlan closed T349796: Move MediaWiki jobs to mw-on-k8s as Resolved.

All (non-videoscaler) jobs migrated to Kubernetes jobrunners. Videoscaler work tracked in T355292

Feb 21 2024, 10:39 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s

Feb 20 2024

hnowlan created T358001: Issues reimaging servers in codfw.
Feb 20 2024, 3:54 PM · SRE, serviceops, ops-codfw
hnowlan added a comment to T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable".

We saw a recurrence of this issue this morning, with a large number of jobs failing with 503 messages from eventgate for a short period. Envoy also saw failures connecting to eventgate around the same time.

Feb 20 2024, 2:51 PM · MediaWiki-Engineering, Data-Engineering, Unstewarded-production-error, User-brennen, serviceops, WMF-JobQueue, Wikimedia-production-error

Feb 19 2024

hnowlan updated the task description for T349796: Move MediaWiki jobs to mw-on-k8s.
Feb 19 2024, 5:21 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
hnowlan closed T357504: Prepare amount of workers to handle enwiki traffic for parsoid endpoints as Resolved.
Feb 19 2024, 4:39 PM · serviceops, Parsoid (Tracking), RESTBase Sunsetting, Epic
hnowlan closed T357504: Prepare amount of workers to handle enwiki traffic for parsoid endpoints, a subtask of T344945: Disable storage of Parsoid content in RESTbase, as Resolved.
Feb 19 2024, 4:38 PM · Content-Transform-Team-WIP, Parsoid (Tracking), RESTBase Sunsetting, Epic

Feb 16 2024

hnowlan added a comment to T356279: Remove production data access for former WMDE staff member goransm.

Has there been a final decision on access and the scope of this ticket?

Feb 16 2024, 4:19 PM · Patch-For-Review, User-ItamarWMDE, SRE, Data-Platform-SRE, SRE-Access-Requests
hnowlan updated subscribers of T357731: Requesting access to Analytics-privatedata-users for jwheeler.

Does this ticket supersede T355170? Change created but dependent upon approval by group approvers (@odimitrijevic , @WDoranWMF, @Ahoelzl @Milimetric)

Feb 16 2024, 4:19 PM · SRE, SRE-Access-Requests
hnowlan added a comment to T357097: Requesting access to analytics-privatedata-users for ElineWMDE.

Any word on approvals for this ticket @odimitrijevic , @WDoranWMF, @Ahoelzl @Milimetric?

Feb 16 2024, 4:19 PM · Patch-For-Review, SRE, SRE-Access-Requests
hnowlan created P56893 homer diff.
Feb 16 2024, 3:22 PM

Feb 15 2024

hnowlan closed T357483: Updating access key - rkhan as Resolved.

Done

Feb 15 2024, 11:42 AM · SRE, SRE-Access-Requests

Feb 14 2024

hnowlan closed T355333: Possible firmware issues reimaging mw2282 as Resolved.

Reimage was successful, networking survived a reboot. All done!

Feb 14 2024, 6:11 PM · SRE, ops-codfw, DC-Ops, serviceops
hnowlan added a comment to T355333: Possible firmware issues reimaging mw2282.

I tried another reimage and it is currently proceeding successfully. Maybe replacing the SFP did the job? This is all a bit inexplicable.

Feb 14 2024, 5:32 PM · SRE, ops-codfw, DC-Ops, serviceops
hnowlan added a comment to T355333: Possible firmware issues reimaging mw2282.

Reimaging still fails after these changes, fwiw. However, a reboot has broken network connectivity again?! The host is up and rebooted in the management interface, but I can't ssh in again.

Feb 14 2024, 4:54 PM · SRE, ops-codfw, DC-Ops, serviceops
hnowlan added a comment to T355333: Possible firmware issues reimaging mw2282.

This command also fails - but interestingly the host itself appears to have lost network connectivity. ethtool reports that the link is up but I can't connect in or out and the arp table is empty, I can only get in via the management console.

Feb 14 2024, 4:15 PM · SRE, ops-codfw, DC-Ops, serviceops
hnowlan added a comment to T355333: Possible firmware issues reimaging mw2282.

idk if this would help, but can we run the provisioning script with the --no-dhcp and --no-user flags, to catch any BIOS settings that might have changed?

Feb 14 2024, 3:26 PM · SRE, ops-codfw, DC-Ops, serviceops
hnowlan created T357539: Multiple hosts in codfw fail to PXE boot upon reimage.
Feb 14 2024, 3:24 PM · SRE, ops-codfw, serviceops
hnowlan added a comment to T357504: Prepare amount of workers to handle enwiki traffic for parsoid endpoints.

To limit the impact of this rollout we'll be gradually deploying to restbase servers either in small numbers or one at a time rather than doing a full scap rollout.

Feb 14 2024, 2:33 PM · serviceops, Parsoid (Tracking), RESTBase Sunsetting, Epic

Feb 13 2024

hnowlan closed T357198: Page: cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80% as Resolved.

Could "Too many eqiad mediawiki originals uploads" be a red herring? The traffic jumps are all in codfw.

Honestly, that alert is what prompted me to look at the Swift dashboards where network throughput corresponds with the increase on cr2-codfw:

[attached image]

Looking back, I guess I didn't question why that alert had fired for eqiad (which seems pretty curious to me now that you bring it up).

Feb 13 2024, 12:11 PM · SRE
hnowlan added a comment to T355333: Possible firmware issues reimaging mw2282.

I reseated the NIC and it connected. when I rebooted it went down again and didn't come up. swapped it out and rebooted it. stayed up this time. should have replaced the cable as well the first time D= It -should- stay up this time. lmk if it acts up again.

also tried the racreset trick and I'm getting a straight 404 error on the idrac login.

Feb 13 2024, 11:27 AM · SRE, ops-codfw, DC-Ops, serviceops

Feb 12 2024

hnowlan closed T356917: Requesting access to analytics-privatedata-users for JTanner as Resolved.

User jaz has been added to analytics-privatedata-users, you should be able to use superset with the associated password.

Feb 12 2024, 2:48 PM · SRE, SRE-Access-Requests
hnowlan updated the task description for T356917: Requesting access to analytics-privatedata-users for JTanner.
Feb 12 2024, 2:47 PM · SRE, SRE-Access-Requests
hnowlan changed the status of T357147: Requesting access to analytics-privatedata-users for Arthur Taylor from In Progress to Stalled.

Moving to stalled pending approval from an analytics-privatedata-users owner (@odimitrijevic , @WDoranWMF, @Ahoelzl @Milimetric)

Feb 12 2024, 11:39 AM · Patch-For-Review, SRE, SRE-Access-Requests
hnowlan changed the status of T357097: Requesting access to analytics-privatedata-users for ElineWMDE from In Progress to Stalled.

Moving to stalled pending approval from an analytics-privatedata-users owner (@odimitrijevic , @WDoranWMF, @Ahoelzl @Milimetric)

Feb 12 2024, 11:39 AM · Patch-For-Review, SRE, SRE-Access-Requests
hnowlan added a comment to T357198: Page: cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80%.

Could "Too many eqiad mediawiki originals uploads" be a red herring? The traffic jumps are all in codfw. I'm not sure what's actionable in this ticket.

Feb 12 2024, 11:04 AM · SRE
hnowlan closed T357263: Allow Wikimedia Maps usage on <domain> as Resolved.

Required fields empty, no domain specified. Declining.

Feb 12 2024, 10:13 AM · Maps, SRE

Feb 9 2024

hnowlan added a comment to T356526: High level of backend errors for CirrusSearch jobs in jobrunners.

This is kinda verging on a UBN for us as we go into the weekend, because it's causing a lot of spam and it'll hide other prod error states for jobqueues.

Feb 9 2024, 4:48 PM · MW-1.42-notes (1.42.0-wmf.18; 2024-02-13), serviceops, Discovery-Search (Current work), CirrusSearch
hnowlan closed T345952: 3d2png npm install errors as Invalid.
Feb 9 2024, 4:07 PM · Thumbor, Technical-Debt, 3D
hnowlan closed T350566: No thumbnails / images rendered for 1.5GB PDF on Commons as Resolved.

This is now rendering after changes to resources for thumbor workers.

Feb 9 2024, 4:01 PM · Thumbor, Commons
hnowlan closed T354858: Error accessing File:KlimtDieJungfrau.jpg after it was included on the enwiki Main Page as Resolved.

This may have been a question of resources, this image is now rendering after some recent bumps.

Feb 9 2024, 3:57 PM · serviceops, Thumbor
hnowlan created T357145: Consider moving to haproxy ingress for Thumbor workers.
Feb 9 2024, 3:45 PM · Kubernetes, serviceops, Thumbor
hnowlan added a comment to T355333: Possible firmware issues reimaging mw2282.

I can't seem to access the idrac remotely. Is it okay if I power down the server at this time?

Feb 9 2024, 3:21 PM · SRE, ops-codfw, DC-Ops, serviceops
hnowlan added a project to T356526: High level of backend errors for CirrusSearch jobs in jobrunners: serviceops.
Feb 9 2024, 10:57 AM · MW-1.42-notes (1.42.0-wmf.18; 2024-02-13), serviceops, Discovery-Search (Current work), CirrusSearch
hnowlan added a comment to T356526: High level of backend errors for CirrusSearch jobs in jobrunners.

The train rollout hasn't fixed this issue and we're getting alerts every hour for error spikes on jobrunners - this is beginning to hide other prod issues with the jobrunners so this is a pretty serious concern for us. Apologies, the errors in the ticket *have* been fixed, however we're still seeing spikes of the other errors reported, and they are causing alerts to fire every two hours:

Feb 9 2024, 10:57 AM · MW-1.42-notes (1.42.0-wmf.18; 2024-02-13), serviceops, Discovery-Search (Current work), CirrusSearch

Feb 8 2024

hnowlan added a comment to T355333: Possible firmware issues reimaging mw2282.

Good catch! Unfortunately I'm still seeing the same PXE behaviour failing on boot

Feb 8 2024, 4:31 PM · SRE, ops-codfw, DC-Ops, serviceops

Feb 7 2024

hnowlan added a comment to T356797: Response alignment for routing errors between gateway, restbase and services errors.

To be explicit about routing differences - we have a rule for /api/rest_v1/metrics/* that directs traffic to the rest gateway and we have a rule for *most* other things under /api/rest_v1/* that defaults to restbase. There are some exceptions for services that have been migrated out of restbase.
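
The precedence between these rules can be sketched as a small function. This is only an illustration of the ordering described above, not the actual routing configuration; the backend labels, the `pick_backend` name and the `migrated` mapping used for the exceptions are hypothetical.

```python
from typing import Dict, Optional

REST_GATEWAY = "rest-gateway"
RESTBASE = "restbase"


def pick_backend(path: str, migrated: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Pick a backend for a request path using the rules described above.

    `migrated` maps path prefixes of services that have moved out of
    restbase to a backend label; its contents are hypothetical.
    """
    # The metrics rule is the explicit rest gateway rule.
    if path.startswith("/api/rest_v1/metrics/"):
        return REST_GATEWAY
    # Exceptions for migrated services are checked before the default.
    for prefix, backend in (migrated or {}).items():
        if path.startswith(prefix):
            return backend
    # Most other paths under /api/rest_v1/ default to restbase.
    if path.startswith("/api/rest_v1/"):
        return RESTBASE
    return None
```

Because the two backends can answer the same malformed path differently, knowing which rule matched is the first step in comparing their error responses.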

Feb 7 2024, 11:24 AM · AQS2.0
hnowlan closed T355892: Repool maps primaries in Kartotherian as Resolved.
Feb 7 2024, 10:25 AM · serviceops, Maps

Feb 6 2024

hnowlan reopened T355892: Repool maps primaries in Kartotherian as "Open".
Feb 6 2024, 1:16 PM · serviceops, Maps
hnowlan added a comment to T355892: Repool maps primaries in Kartotherian.

This change looks like it is causing an issue. From the apps team:

We seem to be getting intermittent 404s for certain urls, e.g. https://maps.wikimedia.org/static/webgl/wikisprites%402x.json

From a quick look, running this request against all maps nodes, only maps1009 is returning 404s for this URL. I think we should depool maps1009 and 2009 again and figure out what's wrong.
Probably it's running some older version of kartotherian that doesn't expose this static file.

More specifically maps1009 is missing the static folder to serve:

jgiannelos@maps1009:/srv/deployment/kartotherian/deploy/src$ stat static
stat: cannot stat 'static': No such file or directory
Feb 6 2024, 10:22 AM · serviceops, Maps

Feb 2 2024

hnowlan created T356526: High level of backend errors for CirrusSearch jobs in jobrunners.
Feb 2 2024, 6:01 PM · MW-1.42-notes (1.42.0-wmf.18; 2024-02-13), serviceops, Discovery-Search (Current work), CirrusSearch

Feb 1 2024

hnowlan added a project to T356288: Restbase erroring very frequently with "HTTPError: Cannot read property '0' of null" on resource_change events: RESTBase Sunsetting.
Feb 1 2024, 3:58 PM · Page Content Service, Content-Transform-Team-WIP, RESTBase Sunsetting, RESTBase
hnowlan closed T355892: Repool maps primaries in Kartotherian as Resolved.
Feb 1 2024, 3:30 PM · serviceops, Maps

Jan 31 2024

hnowlan created T356288: Restbase erroring very frequently with "HTTPError: Cannot read property '0' of null" on resource_change events.
Jan 31 2024, 4:39 PM · Page Content Service, Content-Transform-Team-WIP, RESTBase Sunsetting, RESTBase

Jan 30 2024

hnowlan added a comment to T354499: Degraded RAID on aqs1013.

Failing disk:

root@aqs1013:/home/hnowlan# udevadm info --query=all --name=/dev/sde| grep SERIAL
E: ID_SERIAL=MZ7KH1T9HAJR0D3_S4KVNA0MB04213
E: ID_SERIAL_SHORT=S4KVNA0MB04213
root@aqs1013:/home/hnowlan# dmesg | grep sde | tail
[6107332.316343] sd 6:0:0:0: [sde] tag#13 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[6107332.316359] sd 6:0:0:0: [sde] tag#13 CDB: Read(10) 28 00 02 e9 0f 80 00 00 08 00
[6107332.316365] blk_update_request: I/O error, dev sde, sector 48828288 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[6107332.327008] Buffer I/O error on dev sde1, logical block 6103280, async page read
[6107332.335094] sd 6:0:0:0: [sde] tag#1 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[6107332.335101] sd 6:0:0:0: [sde] tag#1 CDB: Read(10) 28 00 df 8f df 80 00 00 08 00
[6107332.335107] blk_update_request: I/O error, dev sde, sector 3750748032 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[6107332.345913] Buffer I/O error on dev sde2, logical block 462739952, async page read
[6107370.360633] sd 6:0:0:0: [sde] tag#16 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[6107370.360637] sd 6:0:0:0: [sde] tag#16 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
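
When several disks are suspect, output like the above can be summarised per device. The sketch below counts block-layer I/O errors in dmesg-style lines; the regex is an assumption based only on the line format shown here and may not match other kernel versions.

```python
import re
from collections import Counter

# Matches lines such as:
#   blk_update_request: I/O error, dev sde, sector 48828288 op 0x0:(READ) ...
# The pattern is derived from the sample output only.
IO_ERROR = re.compile(r"I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+)")


def count_io_errors(lines):
    """Count block-layer I/O errors per device in dmesg-style lines."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts
```

The per-partition "Buffer I/O error on dev sde1" lines are deliberately not counted, so each failed request is tallied once against the whole disk.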
Jan 30 2024, 12:45 PM · SRE, ops-eqiad
hnowlan added a comment to T356024: TypeError: Argument 4 passed to Wikimedia\Parsoid\Utils\Title::__construct() must be of the type string, null given, called in /srv/mediawiki/php-1.42.0-wmf.15/vendor/wikimedia/parsoid/src/Utils/Title.php on line 392.

This appears to have trailed off around 1400 on the 29th, but if there is a risk of this recurring it'd be great if we could avoid these exception spikes in future.

Jan 30 2024, 11:18 AM · MW-1.42-notes (1.42.0-wmf.17; 2024-02-06), Patch-For-Review, Essential-Work, serviceops, Wikimedia-production-error

Jan 29 2024

hnowlan changed the status of T355892: Repool maps primaries in Kartotherian from Open to In Progress.
Jan 29 2024, 2:50 PM · serviceops, Maps
hnowlan updated the task description for T354791: Reclaim jobrunner hardware for k8s.
Jan 29 2024, 2:33 PM · SRE, serviceops, MW-on-K8s
hnowlan triaged T356024: TypeError: Argument 4 passed to Wikimedia\Parsoid\Utils\Title::__construct() must be of the type string, null given, called in /srv/mediawiki/php-1.42.0-wmf.15/vendor/wikimedia/parsoid/src/Utils/Title.php on line 392 as High priority.

A surge in this error started around 0700 on the 27th of January, and seems to only occur in volume upon hewikisource (there was a brief spike on itwikivoyage on the 26th also).

Jan 29 2024, 11:01 AM · MW-1.42-notes (1.42.0-wmf.17; 2024-02-06), Patch-For-Review, Essential-Work, serviceops, Wikimedia-production-error

Jan 25 2024

hnowlan added a project to T355892: Repool maps primaries in Kartotherian: serviceops.
Jan 25 2024, 3:38 PM · serviceops, Maps
hnowlan created T355892: Repool maps primaries in Kartotherian.
Jan 25 2024, 3:38 PM · serviceops, Maps

Jan 24 2024

hnowlan updated the task description for T354791: Reclaim jobrunner hardware for k8s.
Jan 24 2024, 5:14 PM · SRE, serviceops, MW-on-K8s
hnowlan updated the task description for T354791: Reclaim jobrunner hardware for k8s.
Jan 24 2024, 12:43 PM · SRE, serviceops, MW-on-K8s
hnowlan updated the task description for T354791: Reclaim jobrunner hardware for k8s.
Jan 24 2024, 12:42 PM · SRE, serviceops, MW-on-K8s
hnowlan added a comment to T355730: Provide developer access to the cassandra-dev cluster.

Presumably we want to restrict access somewhat beyond "everything the cassandra user can do"? At which point a separate user to sudo to seems like a sensible idea unless it's a lot of hassle...

Jan 24 2024, 11:28 AM · Patch-For-Review, Cassandra