
akosiaris (Alexandros Kosiaris)
Site Reliability EngineerAdministrator

User Details

User Since
Oct 3 2014, 8:40 AM (398 w, 16 h)
Roles
Administrator
Availability
Available
IRC Nick
akosiaris
LDAP User
Alexandros Kosiaris
MediaWiki User
AKosiaris (WMF) [ Global Accounts ]

Recent Activity

Wed, May 18

akosiaris added a comment to T308339: eqiad: move non WMCS servers out of rack C8.

deploy1002 will need to be scheduled well in advance and/or failed over to deploy2002 as it is the canonical deployment host.

Wed, May 18, 9:28 AM · SRE, DBA, ops-eqiad
akosiaris added a comment to T308563: Request for a gitlab repo for the kubernetes workshop.

Many thanks for this!

Wed, May 18, 8:00 AM · Gerrit, serviceops-radar, GitLab, Release-Engineering-Team (GitLab-a-thon 🦊)
akosiaris awarded T308563: Request for a gitlab repo for the kubernetes workshop a Love token.
Wed, May 18, 8:00 AM · Gerrit, serviceops-radar, GitLab, Release-Engineering-Team (GitLab-a-thon 🦊)

Tue, May 17

akosiaris added a comment to T306649: Agree strategy for Kubernetes BGP peering to top-of-rack switches.

I merged two changes for the ml-serve-eqiad cluster, and now the concerns expressed in T306649#7881940 should be gone:

elukey@ml-serve1005:~$ ss -tulpna | grep ":179" | sort
tcp   ESTAB     0      0                       10.64.131.3:43807                   10.64.131.1:179         
tcp   ESTAB     0      0      [2620:0:861:10a:10:64:131:3]:60061           [2620:0:861:10a::1]:179         
tcp   LISTEN    0      8                           0.0.0.0:179                         0.0.0.0:*           
tcp   LISTEN    0      8                              [::]:179                            [::]:*

ml-serve1005 is in row E and it doesn't peer with cr{1,2} anymore. Of course the configuration needs to be improved and automated as described above, but for the moment:

  • ml-serve100[5-8] peer only with their ToRs in row E/F
  • ml-serve100[1-4] and ml-serve-ctrl100[1,2] peer with cr{1,2}-eqiad
Tue, May 17, 2:30 PM · Prod-Kubernetes, SRE, Infrastructure-Foundations, netops
akosiaris updated subscribers of T308563: Request for a gitlab repo for the kubernetes workshop.

GitLab-wise, I can create a personal repo easily, but my guess is that's not ideal either. I shouldn't just jump the gun and create a repo on my own under the non-personal hierarchy, right?

Tue, May 17, 2:29 PM · Gerrit, serviceops-radar, GitLab, Release-Engineering-Team (GitLab-a-thon 🦊)
akosiaris created T308563: Request for a gitlab repo for the kubernetes workshop.
Tue, May 17, 2:26 PM · Gerrit, serviceops-radar, GitLab, Release-Engineering-Team (GitLab-a-thon 🦊)
akosiaris added a comment to T306649: Agree strategy for Kubernetes BGP peering to top-of-rack switches.

Regarding the "fake nodes": I think that could be done with adding the leafs as GlobalNetworkSet to the K8s/Calico API. That should make them easily selectable via peerSelectors without creating the confusion fake nodes would create.

Reading that, I am under the impression it won't work because it only applies to network policies. Can't hurt to try, though.

Yeah. Some other docs suggest they can be used in selector fields in general, though.

They have the node-role.kubernetes.io/master:NoSchedule taint, so nothing (aside from calico-node) will be scheduled on them. We can skip them for now while we test our setup; I expect (famous last words) it won't hurt (much) if they don't peer with cr{1,2}.

Not 100% true currently (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/777364).
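For reference, a Calico GlobalNetworkSet of the kind discussed above might look like the following (a sketch only; the name, label, and addresses are invented for illustration). As noted in the thread, these sets are only usable in network policy selectors, not in BGPPeer peerSelectors, which is why the approach likely won't help with peering:

```yaml
apiVersion: projectcalico.org/v3
kind: GlobalNetworkSet
metadata:
  name: tor-leafs          # hypothetical name
  labels:
    role: tor-leaf         # hypothetical label for selectors
spec:
  nets:
    - 10.64.131.1/32       # example leaf switch IPs
    - 2620:0:861:10a::1/128
```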

Tue, May 17, 9:01 AM · Prod-Kubernetes, SRE, Infrastructure-Foundations, netops

Mon, May 16

akosiaris added a comment to T297140: New Service Request: developer-portal.

Ah, cool! Thanks for the update!

Mon, May 16, 4:55 PM · Patch-For-Review, Goal, serviceops, Wikimedia-Developer-Portal, Service-deployment-requests
akosiaris added a comment to T306397: Service Ops SRE support for iOS notifications update.

@akosiaris cool, thanks! My instinct is that it feels a bit low - I wonder if pushes are getting dropped somewhere. It would be cool if we could somehow check how many Echo notifications were generated by accounts with subscribed device tokens in the last 7 days and see if that number is close to 552. Although I know the service batches pushes which may throw the numbers off too much for a direct comparison. Tagging @JMinor and @SNowick_WMF to see if this is worth investigating.

Mon, May 16, 2:53 PM · serviceops, SRE
akosiaris added a comment to T306649: Agree strategy for Kubernetes BGP peering to top-of-rack switches.

Regarding the "fake nodes": I think that could be done with adding the leafs as GlobalNetworkSet to the K8s/Calico API. That should make them easily selectable via peerSelectors without creating the confusion fake nodes would create.

Mon, May 16, 2:41 PM · Prod-Kubernetes, SRE, Infrastructure-Foundations, netops

Fri, May 13

akosiaris added a comment to T306181: intake-analytics is responsible for up to a 85% of varnish backend fetch errors.

The 50% bump in capacity didn't make any noticeable difference this time around. :-(

Fri, May 13, 4:18 PM · Data-Engineering-Kanban, Data-Engineering, SRE, Traffic
akosiaris updated subscribers of T306649: Agree strategy for Kubernetes BGP peering to top-of-rack switches.
Fri, May 13, 4:11 PM · Prod-Kubernetes, SRE, Infrastructure-Foundations, netops
akosiaris added a comment to T305729: Kubernetes credentials on deployment servers should be available to deployers, not all users.

helm 2 did have a different structure for these things; it's definitely fine to revisit and re-evaluate the approach.

Fri, May 13, 11:58 AM · Release-Engineering-Team (Radar), Patch-For-Review, Kubernetes, MW-on-K8s, serviceops
akosiaris added a comment to T306649: Agree strategy for Kubernetes BGP peering to top-of-rack switches.

Even in the legacy setup (pre row e/f) adding new nodes requires manual error-prone gerrit changes like this one 35b0c9a4832068d08

Yes I agree totally. One thing I had previously looked at, and thought it was not an option for us, is the Juniper "dynamic neighbor" feature for BGP peers. Basically this allows you to define an IP range/subnet as the peer, after which it will form a peering session with any remote device on the subnet that tries to establish a session. I'd previously tested this and it works great, however until recently it wouldn't work as you could only configure a single peer-as for the "dynamic-peer", and we have various peer ASNs connected off all our private subnets. I've looked again at this today though, and Juniper have introduced a "peer as-list" command in version 20 which seems to allow for it:

https://phabricator.wikimedia.org/P27787

So we could basically add that kind of config to all top-of-rack switches by default, and no further per-server configuration would be needed on the network side. You can configure a password on the sessions too if that tightens it up security wise. I'd need to run that by @ayounsi but it seems to me to be a good way forward.
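As a rough illustration of what such a default ToR configuration could look like (hypothetical group name, subnet, and ASN range; the actual tested config is in P27787):

```
policy-options {
    /* hypothetical list of the k8s node ASNs on our private subnets */
    as-list K8S-PEERS members 64601-64610;
}
protocols {
    bgp {
        group K8S-DYNAMIC {
            type external;
            /* dynamic neighbors: form a session with any device
               on this subnet that tries to establish one */
            allow 10.64.131.0/24;
            /* Junos >= 20: accept any peer ASN from the list */
            peer-as-list K8S-PEERS;
            authentication-key "<redacted>";
        }
    }
}
```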

Fri, May 13, 11:36 AM · Prod-Kubernetes, SRE, Infrastructure-Foundations, netops
akosiaris added a comment to T293012: Productionise mc20[38-55].

we should consider if it makes sense to make an exception and renumber these hosts from 2037 so they are on par with eqiad.

Fri, May 13, 10:45 AM · Patch-For-Review, serviceops
akosiaris added a comment to T302430: <Tech Initiative> Commons Copy-by-URL Image Uploads Slowdown (Shellbox).

Note that where https://commons.wikimedia.org/wiki/Special:Upload is mentioned above, we are focusing on a specific subset of Special:Upload: Copy by URL, which is only triggered via the respective radio button. See the image for a visual explanation.

image.png (217×1 px, 23 KB)

Fri, May 13, 10:25 AM · Foundational Technology Requests
akosiaris renamed T302430: <Tech Initiative> Commons Copy-by-URL Image Uploads Slowdown (Shellbox) from <Tech Initiative> Commons Image Uploads Slowdown (Shellbox) to <Tech Initiative> Commons Copy-by-URL Image Uploads Slowdown (Shellbox).
Fri, May 13, 10:23 AM · Foundational Technology Requests
akosiaris added a comment to T306397: Service Ops SRE support for iOS notifications update.

For what it's worth, I think we've peaked.

Fri, May 13, 9:51 AM · serviceops, SRE

Wed, May 11

akosiaris added a comment to T306397: Service Ops SRE support for iOS notifications update.

Thanks for the update. Load has somewhat increased on our side, albeit minimally.

Wed, May 11, 4:23 PM · serviceops, SRE
akosiaris added a comment to T306181: intake-analytics is responsible for up to a 85% of varnish backend fetch errors.

I have now deployed the change to double the number of replica pods for eventgate-analytics-external. Monitoring for any changes to the p90 and p99 response times.

Wed, May 11, 10:24 AM · Data-Engineering-Kanban, Data-Engineering, SRE, Traffic
akosiaris added a comment to T306649: Agree strategy for Kubernetes BGP peering to top-of-rack switches.

If there is any kind of anycast with the k8s prefixes (same prefix advertised from multiple locations), we should also prepend the AS once on the core routers to keep path lengths consistent across the infra.

Yeah, I had considered that; in checking, I found what Alex has confirmed (we don't have it right now).

My only gripe is that it's going to be a pretty long and cryptic config. But at least it's not going to be changing often. We'll also have to deduplicate it a bit across clusters before it becomes unwieldy.

Yeah apologies for that. Happy to help in whatever way we can to assist getting this config together. At the very least as we add new racks/switches we should be able to complete the definition/add the IP. The peer on each subnet will be the default GW configured on the host too, and we can probably ensure it's the first usable IP on the subnet. I suspect that doesn't help much but want to mention it.

Wed, May 11, 9:41 AM · Prod-Kubernetes, SRE, Infrastructure-Foundations, netops
akosiaris added a comment to T305729: Kubernetes credentials on deployment servers should be available to deployers, not all users.

Just to note a semantic thing here:

Wed, May 11, 8:52 AM · Release-Engineering-Team (Radar), Patch-For-Review, Kubernetes, MW-on-K8s, serviceops

Tue, May 10

akosiaris added a comment to T306181: intake-analytics is responsible for up to a 85% of varnish backend fetch errors.

We have proposed creating a new bucket for requests that take up to 10 seconds but that hasn't happened yet.

Tue, May 10, 2:49 PM · Data-Engineering-Kanban, Data-Engineering, SRE, Traffic

Mon, May 9

akosiaris added a comment to T306181: intake-analytics is responsible for up to a 85% of varnish backend fetch errors.

Looking at https://grafana-rw.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&viewPanel=37&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos&editPanel=37 and temporarily adding a legend with max to the right:

Mon, May 9, 4:00 PM · Data-Engineering-Kanban, Data-Engineering, SRE, Traffic
akosiaris added a comment to T297140: New Service Request: developer-portal.

@bd808, just for greater visibility, as I said in https://gerrit.wikimedia.org/r/c/773994, you can proceed and self-merge https://gerrit.wikimedia.org/r/c/773994 and https://gerrit.wikimedia.org/r/c/773995 and do the first deploy of it. We are unfortunately short-handed right now and can't properly review the chart. The rest of the changes have been merged, so we shouldn't be a blocker for getting developer-portal deployed. Don't hesitate to reach out if you hit any roadblocks or need any help!

Mon, May 9, 9:29 AM · Patch-For-Review, Goal, serviceops, Wikimedia-Developer-Portal, Service-deployment-requests

Fri, May 6

akosiaris added a comment to T303049: New Service Request: DataHub.

FWIW, we hope that Datahub will one day be a service for more than just analytics data, but for now, it isn't, and can be considered part of the 'analytics cluster' which only exists in eqiad. What that means for DNS names and DC failover for now I'll leave up to yall :)

Fri, May 6, 3:03 PM · Patch-For-Review, serviceops, Data-Catalog, Data-Engineering, Service-deployment-requests, Services, SRE

Thu, May 5

akosiaris added a comment to T306797: [Shared Event Platform] Investigate Event Service Platforms.

Oh, just saw this response, was just chatting with @BBlack about this in IRC.

I'd guess if > 100ms is a rare occasion, Kafka stretch would still be fine.

Thu, May 5, 2:59 PM · Epic, Generated Data Platform
akosiaris added a comment to T307647: 2022-05-05 Wikimedia full site outage.

Just for greater visibility and awareness, there is T301505 for the "upstream connect error or disconnect/reset before headers. reset reason: overflow" error. As pointed out in that task, that error message is a symptom and not the cause.

Thu, May 5, 12:38 PM · SRE-OnFire (FY2021/2022-Q4), GlobalBlocking, DBA, SRE, Wikimedia-Incident
akosiaris added a comment to T306797: [Shared Event Platform] Investigate Event Service Platforms.

Been reading some blogs and talks about this 'stretched' Kafka cluster idea, and it sure would make multi-DC apps much easier. The linked Kafka talk, though, recommends not doing it unless you can guarantee inter-DC latency of <= 100ms. I have a feeling that our SREs would not like to guarantee that.

Thu, May 5, 9:44 AM · Epic, Generated Data Platform

Wed, May 4

akosiaris moved T277849: Convert helm releases to the new release naming schema from Incoming 🐫 to Stalled 🐌 on the serviceops board.
Wed, May 4, 3:15 PM · Prod-Kubernetes, Kubernetes, serviceops, SRE
akosiaris removed a subtask for T251305: Migrate to helm v3: T277849: Convert helm releases to the new release naming schema.
Wed, May 4, 3:15 PM · Patch-For-Review, Kubernetes, serviceops, SRE
akosiaris removed a parent task for T277849: Convert helm releases to the new release naming schema: T251305: Migrate to helm v3.
Wed, May 4, 3:15 PM · Prod-Kubernetes, Kubernetes, serviceops, SRE
akosiaris added a parent task for T260663: Create a cookbook for depooling one or all services from one kubernetes cluster: T277677: Write a cookbook to set a k8s cluster in maintenance mode.
Wed, May 4, 2:29 PM · Sustainability (Incident Followup), Infrastructure-Foundations, Prod-Kubernetes, SRE-tools, SRE, serviceops
akosiaris added a subtask for T277677: Write a cookbook to set a k8s cluster in maintenance mode: T260663: Create a cookbook for depooling one or all services from one kubernetes cluster.
Wed, May 4, 2:29 PM · Sustainability (Incident Followup), Infrastructure-Foundations, SRE-tools, SRE, Prod-Kubernetes, serviceops
akosiaris added a comment to T303049: New Service Request: DataHub.

I was under the impression that datahub should only run/be used in the active datacenter because it relies on state in MySQL and other datastores which are not equally available in both DCs.

Thanks. You're right that the stateful components only exist in eqiad, so running datahub in codfw is going to be slower than eqiad. However, the network policies are in place so that it should work in codfw.
I'm happy to take advice on whether this should be set up as an active/passive or active/active service. Do you think active/passive would be better, if our preferred service is eqiad?

Wed, May 4, 1:05 PM · Patch-For-Review, serviceops, Data-Catalog, Data-Engineering, Service-deployment-requests, Services, SRE

Tue, May 3

akosiaris added a comment to T306649: Agree strategy for Kubernetes BGP peering to top-of-rack switches.

We have discussed this issue in the serviceops channel yesterday, and the idea is to indeed use labels. The ML clusters do not define failure-domain.beta.kubernetes.io at the moment, but we are going to add them asap (I'd say manually, with a comment in Hiera to be deployed when the nodes are reimaged).
The idea for the new ToR-specific configuration is to add a label called wikimedia.org/node-location (or something similar), for example:

lsw1-f3-eqiad-ipv6:
  nodeSelector: wikimedia.org/node-location == lsw1-f3-eqiad
  asNumber: 14907
  peerIP: 2620:0:861:10f::1
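Assuming that fragment feeds a BGPPeer template, the rendered Calico object would presumably look something like this (a sketch, not the actual chart output; note that Calico selector syntax wants the label value quoted):

```yaml
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: lsw1-f3-eqiad-ipv6
spec:
  # only nodes behind this ToR peer with it
  nodeSelector: wikimedia.org/node-location == 'lsw1-f3-eqiad'
  peerIP: 2620:0:861:10f::1
  asNumber: 14907
```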
Tue, May 3, 3:53 PM · Prod-Kubernetes, SRE, Infrastructure-Foundations, netops

Fri, Apr 29

akosiaris updated subscribers of T306797: [Shared Event Platform] Investigate Event Service Platforms.

We should talk to @dcausse and @akosiaris about their experience deploying Flink and what they would do differently.

Fri, Apr 29, 3:14 PM · Epic, Generated Data Platform
akosiaris created T307220: decommission wtp10[25-48].
Fri, Apr 29, 2:08 PM · serviceops
akosiaris created T307219: Put parse parse100[01-24] in production.
Fri, Apr 29, 2:07 PM · serviceops
akosiaris added a comment to T306860: Videoscalers fail health checks while CPU is maxed.

As a starting point: @jhathaway noted that we're running ffmpeg at niceness -19, which is quite assertive; raising that value might be an easy way to relieve the pressure. I don't have historical context for why it is that way, but if we can change it safely, it might be a good first step.
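To illustrate the general mechanism (this is not the actual videoscaler change): `nice` launches a command at a softer niceness, and `renice` adjusts a running process; only root (CAP_SYS_NICE) can lower niceness, e.g. to the -19 currently in use.

```shell
# niceness is inherited: `nice -n N cmd` runs cmd with its niceness raised by N.
# With no command, `nice` prints the current niceness.
nice
nice -n 10 nice   # prints the current niceness + 10

# For an already-running encoder you could raise its niceness with renice;
# unprivileged users may only raise it, never lower it.
sleep 30 & pid=$!
renice -n 10 -p "$pid"
ps -o ni= -p "$pid"   # shows the new niceness
kill "$pid"
```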

Fri, Apr 29, 1:56 PM · Sustainability (Incident Followup), WMF-JobQueue, serviceops, SRE
akosiaris added a comment to T305899: Improve grafana dashboard for monitoring Toolhub in production.

which just doesn't look right. I wonder what prometheus does in this case.

That 301 redirect is basically the same issue that we saw with /healthz checks in T294072. I'll put up a patch to make the same fix for /metrics.

After fixing the redirect metrics dropped down to levels that make more logical sense (<1 rps). There must have been something very strange happening because of the redirect for the traffic to have been reporting 2000% higher than reality.

Fri, Apr 29, 1:27 PM · User-bd808, Developer-Advocacy (Apr-Jun 2022), Toolhub
akosiaris added a comment to T67270: Default license for operations/puppet.

https://gerrit.wikimedia.org/r/787708 removes the puppet lvm module, which is GPL-2 and incompatible with Apache 2.0. That removes an interesting blocker to adopting a cross-repo license.

Fri, Apr 29, 10:49 AM · Patch-For-Review, SRE, Software-Licensing, Documentation, WMF-Legal, WMF-General-or-Unknown

Thu, Apr 28

akosiaris added a comment to T297140: New Service Request: developer-portal.
Thu, Apr 28, 9:35 AM · Patch-For-Review, Goal, serviceops, Wikimedia-Developer-Portal, Service-deployment-requests

Apr 20 2022

akosiaris added a comment to T305613: <math>\land</math> – Unclear why the page appears in an error-category.

@Eevans, @hnowlan let me know if you have any ideas on how to fix this.

Apr 20 2022, 12:09 PM · serviceops, RESTBase, Math

Apr 19 2022

akosiaris updated the task description for T306419: Prometheus: Disable following 3xx redirects by default in at least kubernetes pods scraping.
Apr 19 2022, 8:51 AM · Prod-Kubernetes, Observability-Metrics, serviceops-radar
akosiaris updated subscribers of T306419: Prometheus: Disable following 3xx redirects by default in at least kubernetes pods scraping.
Apr 19 2022, 8:37 AM · Prod-Kubernetes, Observability-Metrics, serviceops-radar
akosiaris triaged T306419: Prometheus: Disable following 3xx redirects by default in at least kubernetes pods scraping as Low priority.
Apr 19 2022, 8:36 AM · Prod-Kubernetes, Observability-Metrics, serviceops-radar
akosiaris created T306419: Prometheus: Disable following 3xx redirects by default in at least kubernetes pods scraping.
Apr 19 2022, 8:36 AM · Prod-Kubernetes, Observability-Metrics, serviceops-radar
akosiaris added a comment to T305899: Improve grafana dashboard for monitoring Toolhub in production.

So, prometheus follows redirects by default. Support for disabling that functionality was added in https://github.com/prometheus/prometheus/commit/646556a2632700f7fca42cec51d0100294d43c52. I'd say that for production we want to default to not following redirects; it should help us avoid weird edge cases like this one.
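In scrape configs this is the `follow_redirects` HTTP client option (true by default); disabling it per job would look roughly like this (a sketch with a hypothetical job name):

```yaml
scrape_configs:
  - job_name: k8s-pods          # hypothetical job name
    follow_redirects: false     # surface 301/302 responses instead of chasing them
    kubernetes_sd_configs:
      - role: pod
```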

Apr 19 2022, 7:41 AM · User-bd808, Developer-Advocacy (Apr-Jun 2022), Toolhub
akosiaris added a comment to T305899: Improve grafana dashboard for monitoring Toolhub in production.

I have attempted to fix the Traffic and Errors rows on https://grafana.wikimedia.org/d/wJHvm8Ank/toolhub?orgId=1&refresh=1m. I'm honestly not convinced that I have the sum(rate(...)) things quite right. The shape of the curves all look reasonable, but the amplitude seems too high even after I filtered out all of the /healthz and /metrics requests from the time series queries.

These are the queries that I used:

  • Traffic / Total: sum(rate(django_http_requests_total_by_view_transport_method_total{app="$service",view!="healthz",view!="prometheus-django-metrics"}[5m]))
  • Traffic / by HTTP method: sum(rate(django_http_requests_total_by_view_transport_method_total{app="$service",view!="healthz",view!="prometheus-django-metrics"}[5m])) by (method)
  • Traffic / by endpoint: sum(rate(django_http_requests_total_by_view_transport_method_total{app="$service",view!="healthz",view!="prometheus-django-metrics"}[5m])) by (view)
  • Traffic / by HTTP status: sum(rate(django_http_responses_total_by_status_view_method_total{app="$service",view!="healthz",view!="prometheus-django-metrics"}[5m])) by (status)
  • Errors / HTTP errors: sum(rate(django_http_responses_total_by_status_view_method_total{app="$service",status=~"4..|5..",view!="healthz",view!="prometheus-django-metrics"}[5m])) by (status)
  • Errors / current error rate: sum(rate(django_http_responses_total_by_status_view_method_total{app="$service",status=~"4..|5..",view!="healthz",view!="prometheus-django-metrics"}[5m]))
  • Errors / current error %: sum(rate(django_http_responses_total_by_status_view_method_total{app="$service",status=~"4..|5..",view!="healthz",view!="prometheus-django-metrics"}[5m])) / sum(rate(django_http_responses_total_by_status_view_method_total{app="$service",view!="healthz",view!="prometheus-django-metrics"}[5m]))
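As a sanity check of the arithmetic those queries perform: `rate()` over a counter is (increase over the window) / (window in seconds), and the error % is the ratio of the two summed rates. A toy Python sketch with made-up sample values:

```python
def rate(counter_start: float, counter_end: float, window_s: float) -> float:
    """Per-second rate of a monotonic counter over a window, like PromQL rate()."""
    return (counter_end - counter_start) / window_s

# Made-up 5-minute counter samples, keyed by HTTP status.
window = 300.0
responses = {"200": (10_000.0, 10_600.0), "500": (40.0, 43.0)}

total_rps = sum(rate(a, b, window) for a, b in responses.values())
error_rps = sum(rate(a, b, window)
                for status, (a, b) in responses.items() if status[0] in "45")

print(round(total_rps, 2))              # total traffic in requests/second
print(round(error_rps / total_rps, 4))  # error ratio (multiply by 100 for %)
```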
Apr 19 2022, 7:33 AM · User-bd808, Developer-Advocacy (Apr-Jun 2022), Toolhub

Apr 18 2022

akosiaris committed rLPRI8dc8281d0402: Add dummy tokens for developer-portal (authored by Majavah).
Add dummy tokens for developer-portal
Apr 18 2022, 2:33 PM
akosiaris added a comment to T305899: Improve grafana dashboard for monitoring Toolhub in production.

I think that you are right. Those numbers don't look ok. And I do notice the following

Apr 18 2022, 2:33 PM · User-bd808, Developer-Advocacy (Apr-Jun 2022), Toolhub
akosiaris moved T297140: New Service Request: developer-portal from Backlog to In progress on the Service-deployment-requests board.
Apr 18 2022, 1:49 PM · Patch-For-Review, Goal, serviceops, Wikimedia-Developer-Portal, Service-deployment-requests
akosiaris added a comment to T301471: New Service Request SchemaTree.

I merged the pull request on github now.

I do not have rights to push to the gerrit repository, it might just be my limited knowledge of how gerrit works.

Apr 18 2022, 1:49 PM · User-ItamarWMDE, serviceops, wdwb-tech, Wikidata, MediaWiki-extensions-PropertySuggester, Service-deployment-requests, Services

Apr 15 2022

akosiaris reopened T300914: cpjobqueue not achieving configured concurrency as "Open".

I am reopening this. Due to https://wikitech.wikimedia.org/wiki/Incidents/2022-03-27_api we had to lower the concurrency by half in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/774462

Apr 15 2022, 2:28 PM · Patch-For-Review, Platform Team Workboards (Platform Engineering Reliability), WMF-JobQueue, Platform Engineering
akosiaris added a comment to T305613: <math>\land</math> – Unclear why the page appears in an error-category.

Some more information regarding this. With the exception of the warnings stanza, the response from mathoid for the queries `\land xcc` and `\and xcc` is identical.

Apr 15 2022, 11:17 AM · serviceops, RESTBase, Math
akosiaris removed a project from T305613: <math>\land</math> – Unclear why the page appears in an error-category: MediaWiki-Categories.

Adding @Eevans and @hnowlan. They are the only two people in the WMF I know of who might be able to help debug this.

Apr 15 2022, 10:55 AM · serviceops, RESTBase, Math
akosiaris added a comment to T305613: <math>\land</math> – Unclear why the page appears in an error-category.

@Wurgl, nailed it!

Apr 15 2022, 10:53 AM · serviceops, RESTBase, Math

Apr 14 2022

akosiaris added a comment to T305613: <math>\land</math> – Unclear why the page appears in an error-category.

OK, I am sending the output of the curl -v in private paste as suggested.

Apr 14 2022, 11:21 AM · serviceops, RESTBase, Math
akosiaris added a comment to T305358: service::catalog entries and dnsdisc for Kubernetes services under Ingress.
  • The monitoring: stanza can't be added, as having it without lvs: breaks Icinga. It can potentially be ignored (T291946); see above.
Apr 14 2022, 11:08 AM · Patch-For-Review, SRE, Traffic, Prod-Kubernetes, Kubernetes, serviceops
akosiaris changed the status of T306162: Decommission mw13[07-48] from Open to Stalled.

Stalling until T306121 is done.

Apr 14 2022, 7:35 AM · SRE, ops-eqiad, serviceops, DC-Ops
akosiaris added a subtask for T306162: Decommission mw13[07-48]: T306121: Q4: (Need By: TBD) rack/setup/install mw14[57-98].
Apr 14 2022, 7:35 AM · SRE, ops-eqiad, serviceops, DC-Ops
akosiaris added a parent task for T306121: Q4: (Need By: TBD) rack/setup/install mw14[57-98]: T306162: Decommission mw13[07-48].
Apr 14 2022, 7:35 AM · SRE, serviceops, ops-eqiad, DC-Ops
akosiaris created T306162: Decommission mw13[07-48].
Apr 14 2022, 7:35 AM · SRE, ops-eqiad, serviceops, DC-Ops
akosiaris updated the task description for T306121: Q4: (Need By: TBD) rack/setup/install mw14[57-98].
Apr 14 2022, 7:33 AM · SRE, serviceops, ops-eqiad, DC-Ops
akosiaris added a comment to T306089: Cloud VPS "packaging" project Stretch deprecation.

I had a quick look at the OpenStack browser. The only VM still on Stretch is packager01.packaging.eqiad1.wikimedia.cloud. I only use that one for building the etherpad-lite Debian packages because they are unfortunately unbuildable in our production infrastructure. I just verified that I can build them on packager02.packaging.eqiad1.wikimedia.cloud, so feel free to delete packager01.

Apr 14 2022, 7:30 AM · Cloud-VPS (Debian Stretch Deprecation)

Apr 13 2022

akosiaris added a comment to T305469: codfw: Dedicate Rack B1 for cloudX-dev servers.

@Papaul: mc2023 and kubestage2002 have been downtimed again (for 2days) and I 've just powered them off. The should be ready to be moved.

Apr 13 2022, 3:49 PM · SRE, ops-codfw
akosiaris added a comment to T238751: Only generate maxlag from pooled query service servers..

@Joe (Also pinging @akosiaris as I know joe is out right now).
It seems like the ideal solution of T239392: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. might not happen for some time.
Would it be possible to resolve this for now with https://gerrit.wikimedia.org/r/c/operations/puppet/+/553097, which I believe would have been "fine"™ for the last 2.5 years, would have decreased the need for humans to touch things, and would also have decreased the number of issues users end up seeing around delayed or broken but depooled wdqs hosts?

Apr 13 2022, 2:59 PM · Discovery-Search (Current work), User-ItamarWMDE, SRE-OnFire, wdwb-tech, Sustainability (Incident Followup), Patch-For-Review, User-Addshore, Wikidata
akosiaris added a comment to T290357: Maintenance environment needed for running one-off commands.

That leaves the use cases of interactively using the Django REPL (or any other interactive tool requiring a PTY), which we are still discussing if and how we will support.

@akosiaris has there been any more discussion of this use case in your team? One of the findings from T303889: Toolhub broken in prod by memcached client library change was that "run maintenance actions from Bryan's laptop" (T290357#7578866) is slow and hindered our recovery from the outage.

Apr 13 2022, 1:37 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops, Toolhub
akosiaris added a comment to T305899: Improve grafana dashboard for monitoring Toolhub in production.

As pointed out in T305902, those metrics are already scraped by prometheus for a pretty long time now. What's left is to actually alter the dashboard to reference those metrics and not the service-runner ones.

Apr 13 2022, 11:08 AM · User-bd808, Developer-Advocacy (Apr-Jun 2022), Toolhub
akosiaris closed T305902: Injest Toolhub custom prometheus metrics as Invalid.

How can one get started working on this, @bd808? Any pointers, especially on the Kubernetes scraping part?

This is yet another area where Toolhub is an early adopter and there does not yet seem to be strong documentation on how to proceed. I think the answer is going to be something like "make a patch to operations/puppet.git". https://github.com/wikimedia/puppet/blob/production/modules/profile/files/prometheus/rules_k8s.yml might be the right place, but I'm not sure.

I would recommend trying to contact folks like @fgiunchedi from SRE Observability or @akosiaris from serviceops to ask for some advice.

Apr 13 2022, 11:07 AM · observability, Toolhub
akosiaris closed T305902: Injest Toolhub custom prometheus metrics, a subtask of T305899: Improve grafana dashboard for monitoring Toolhub in production, as Invalid.
Apr 13 2022, 11:07 AM · User-bd808, Developer-Advocacy (Apr-Jun 2022), Toolhub
akosiaris closed T291707: zotero paging / serving 5xxes after CPU spikes as Resolved.

Tentatively resolving this.

Apr 13 2022, 10:02 AM · Patch-For-Review, serviceops, Citoid
akosiaris added a comment to T291707: zotero paging / serving 5xxes after CPU spikes.

I haven't seen an alert since Apr 5 in Icinga for either codfw or eqiad. I am going to re-enable paging.

Apr 13 2022, 9:44 AM · Patch-For-Review, serviceops, Citoid

Apr 12 2022

akosiaris added a comment to T305613: <math>\land</math> – Unclear why the page appears in an error-category.

I recall that some requests were cached forever. The person who cleaned the cache last time was @GWicke, so some time has passed since then. Do you know what the current lifetime of the cache is? See https://github.com/wikimedia/restbase/blob/ecef17bda6f4efc0d6e187fb05b1eeb389bf7120/sys/mathoid.js#L176

Apr 12 2022, 5:02 PM · serviceops, RESTBase, Math
akosiaris reopened T303857: Need a service account on deploy servers for automated train pre-sync operations as "Open".

Thanks for the ping, I wouldn't have seen it otherwise. Re-opening and I'll have a look.

Apr 12 2022, 4:03 PM · Patch-For-Review, Release-Engineering-Team (Radar), SRE-Access-Requests, serviceops, SRE, Infrastructure-Foundations
akosiaris reopened T303857: Need a service account on deploy servers for automated train pre-sync operations, a subtask of T245187: Automate rebuild l10n cache for Train, as Open.
Apr 12 2022, 4:02 PM · Release-Engineering-Team (🌱 Spring Cleaning — April 2022), Scap
akosiaris added a comment to T305613: <math>\land</math> – Unclear why the page appears in an error-category.

@akosiaris thank you. The check endpoint does not work with a hash but with the actual TeX string.

Apr 12 2022, 2:30 PM · serviceops, RESTBase, Math
akosiaris added a project to T305613: <math>\land</math> – Unclear why the page appears in an error-category: MediaWiki-Categories.
Apr 12 2022, 10:23 AM · serviceops, RESTBase, Math
akosiaris added a comment to T305613: <math>\land</math> – Unclear why the page appears in an error-category.

@Wurgl I have reached the end of my options. I checked, by adding tests, that no warning is emitted in the source code. Maybe this is a caching problem. Here you need support from someone with access to the restbase server who can delete the cache for the land command from Cassandra. As you can see from the edits on https://www.mediawiki.org/w/index.php?title=Extension:Math/T305613&action=history it seems to be a caching issue. I don't know what to do next. Maybe @akosiaris can help?

Apr 12 2022, 10:15 AM · serviceops, RESTBase, Math

Apr 11 2022

akosiaris added a comment to T305469: codfw: Dedicate Rack B1 for cloudX-dev servers.

I marked rdb2008, kubestage2002, and mc2023 as YES in the table. rdb2008 is the secondary, not the primary; kubestage2002 is for the staging cluster anyway; and mc2023 will be handled via mcrouter's configuration, with shard05 moved to the mc-gp* hosts (gutter pool).
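For context, the gutter-pool mechanism in mcrouter is a failover route: if the primary shard is unreachable, requests spill over to the gutter hosts. A hedged sketch of what such a config fragment can look like (pool membership and the gutter hostname below are illustrative placeholders, not our actual configuration):

```json
{
  "pools": {
    "main":   { "servers": [ "mc2023.codfw.wmnet:11211" ] },
    "gutter": { "servers": [ "mc-gpXXXX.codfw.wmnet:11211" ] }
  },
  "route": {
    "type": "FailoverRoute",
    "children": [ "PoolRoute|main", "PoolRoute|gutter" ]
  }
}
```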

Apr 11 2022, 5:14 PM · SRE, ops-codfw
akosiaris updated the task description for T305469: codfw: Dedicate Rack B1 for cloudX-dev servers.
Apr 11 2022, 5:12 PM · SRE, ops-codfw
akosiaris updated the task description for T305469: codfw: Dedicate Rack B1 for cloudX-dev servers.
Apr 11 2022, 5:03 PM · SRE, ops-codfw

Apr 7 2022

akosiaris added a comment to T301471: New Service Request SchemaTree.

So a summary comment as promised!

@QChris I noticed the addition of the .gitreview file on gerrit. Is this file needed? If so, we would merge it into our github repository, so we can keep the active development there and synchronize with gerrit.

Sounds like this would work to me, and would be fine from the WMDE side of things.
@akosiaris would syncing the code from GitHub to Gerrit via a Gerrit change, whenever it needs to be updated, be acceptable on your side?

Apr 7 2022, 2:14 PM · User-ItamarWMDE, serviceops, wdwb-tech, Wikidata, MediaWiki-extensions-PropertySuggester, Service-deployment-requests, Services
akosiaris added a comment to T291707: zotero paging / serving 5xxes after CPU spikes.
akosiaris@deploy1002:~$ kubectl get events
LAST SEEN   TYPE      REASON      OBJECT                                  MESSAGE
2m49s       Warning   Unhealthy   pod/zotero-production-684574794-6n999   Readiness probe failed: Get http://10.64.64.81:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
38s         Warning   Unhealthy   pod/zotero-production-684574794-g5xr6   Readiness probe failed: Get http://10.64.65.14:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
6m17s       Warning   Unhealthy   pod/zotero-production-684574794-gv9nf   Readiness probe failed: Get http://10.64.68.96:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
9m40s       Warning   Unhealthy   pod/zotero-production-684574794-jglk4   Readiness probe failed: Get http://10.64.64.67:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
17m         Warning   Unhealthy   pod/zotero-production-684574794-jlk69   Readiness probe failed: Get http://10.64.68.239:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
3m35s       Warning   Unhealthy   pod/zotero-production-684574794-nlqz2   Readiness probe failed: Get http://10.64.64.60:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
17m         Warning   Unhealthy   pod/zotero-production-684574794-nr4ng   Readiness probe failed: Get http://10.64.69.222:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
31m         Warning   Unhealthy   pod/zotero-production-684574794-ps595   Readiness probe failed: Get http://10.64.68.138:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
27m         Warning   Unhealthy   pod/zotero-production-684574794-sk5zw   Readiness probe failed: Get http://10.64.67.19:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
39m         Warning   Unhealthy   pod/zotero-production-684574794-st8qr   Readiness probe failed: Get http://10.64.67.28:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2m38s       Warning   Unhealthy   pod/zotero-production-684574794-tj2qh   Readiness probe failed: Get http://10.64.67.249:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
48m         Warning   Unhealthy   pod/zotero-production-684574794-wf5rq   Readiness probe failed: Get http://10.64.71.49:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
5m2s        Warning   Unhealthy   pod/zotero-production-684574794-wlbwg   Readiness probe failed: Get http://10.64.66.98:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
34m         Warning   Unhealthy   pod/zotero-production-684574794-xtdqm   Readiness probe failed: Get http://10.64.66.153:1969/?spec: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
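The events above are readiness probe timeouts rather than hard failures, so one relevant knob is the probe's timeoutSeconds in the pod spec. A hedged sketch follows; the path and port match the probe URL in the events above, but the numeric values are illustrative, not the deployed zotero chart:

```yaml
# Illustrative readiness probe only; values are not the deployed chart's.
readinessProbe:
  httpGet:
    path: /?spec        # matches the probe URL in the events above
    port: 1969
  timeoutSeconds: 5     # raise if the app answers slowly under CPU pressure
  periodSeconds: 10
  failureThreshold: 3
```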
Apr 7 2022, 1:37 PM · Patch-For-Review, serviceops, Citoid

Apr 6 2022

akosiaris closed T305482: Mobileapps is often throttled on codfw as Resolved.

I am resolving this followup from https://wikitech.wikimedia.org/wiki/Incident_documentation/2022-03-27_api. This will decrease errors from mobileapps in the future, and responses will be returned to changeprop more promptly, lowering the number of retries.

Apr 6 2022, 9:09 AM · Wikimedia-Incident, Sustainability (Incident Followup), serviceops, Product-Infrastructure-Team-Backlog, Mobile-Content-Service
akosiaris added a comment to T305482: Mobileapps is often throttled on codfw.

And now that enough time has passed, the rate of errors is indeed lower (I've arbitrarily drawn a couple of lines at around the 90th+ percentile to showcase it easily). There is also less variation, so this isn't exactly scientific, but I'd say it's good enough.

Apr 6 2022, 8:54 AM · Wikimedia-Incident, Sustainability (Incident Followup), serviceops, Product-Infrastructure-Team-Backlog, Mobile-Content-Service

Apr 5 2022

akosiaris added a comment to T305482: Mobileapps is often throttled on codfw.

After the patch was merged and deployed, we have happier graphs!

Apr 5 2022, 4:10 PM · Wikimedia-Incident, Sustainability (Incident Followup), serviceops, Product-Infrastructure-Team-Backlog, Mobile-Content-Service
akosiaris created T305482: Mobileapps is often throttled on codfw.
Apr 5 2022, 3:12 PM · Wikimedia-Incident, Sustainability (Incident Followup), serviceops, Product-Infrastructure-Team-Backlog, Mobile-Content-Service

Apr 4 2022

akosiaris added a comment to T303803: Prometheus use of Squid proxies.

The reason for checking these via the proxy is that the Prometheus hosts can't reach all of the watchrat-checked URLs directly, and it's simpler to have one blackbox exporter configuration that uses a proxy and works for all cases than to split the config between proxied and non-proxied URLs. Here's the current config: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/prometheus/templates/blackbox_exporter/common.yml.erb$25-34

I agree it would be nice not to need the proxy, and IMO it's also worth considering whether giving the Prometheus hosts public addresses would be worthwhile, so this kind of config would do the right thing without a proxy.

If it's not causing a problem on the proxies and is more a question of why, then I think we're OK as configured, but I'm happy to adjust if needed.
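For readers following along, the blackbox exporter supports a per-module proxy_url in its http prober configuration, which is the mechanism the linked template relies on. A minimal sketch of such a module (the proxy hostname below is a placeholder, not our actual proxy):

```yaml
# Minimal blackbox exporter module sketch; the proxy host is a placeholder.
modules:
  http_2xx_proxied:
    prober: http
    timeout: 10s
    http:
      proxy_url: "http://webproxy.example.wmnet:8080"
      preferred_ip_protocol: ip4
```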

Apr 4 2022, 10:03 AM · SRE Observability (FY2021/2022-Q4)

Apr 2 2022

akosiaris added a comment to T291707: zotero paging / serving 5xxes after CPU spikes.

For posterity's sake, alert histograms from Icinga for the two instances of zotero.

Apr 2 2022, 11:25 AM · Patch-For-Review, serviceops, Citoid

Apr 1 2022

akosiaris moved T277483: New Service Request: xhgui from In progress to Externally blocked on the Service-deployment-requests board.
Apr 1 2022, 3:34 PM · serviceops-radar, Performance-Team (Radar), Patch-For-Review, Service-deployment-requests, Services, SRE
akosiaris moved T297815: New Service Request memcached-wikifunctions from Backlog to Externally blocked on the Service-deployment-requests board.
Apr 1 2022, 3:31 PM · serviceops, Service-deployment-requests, Services

Mar 29 2022

akosiaris added a comment to T303803: Prometheus use of Squid proxies.

Putting aside whether we should split the config or provide an external IP address, I wonder if https://wikitech.wikimedia.org/wiki/Url-downloader should be preferred for things like this, @akosiaris?

Mar 29 2022, 3:01 PM · SRE Observability (FY2021/2022-Q4)

Mar 28 2022

akosiaris updated the task description for T303045: decommission kubernetes200[1-4].
Mar 28 2022, 3:25 PM · SRE, ops-codfw, decommission-hardware
akosiaris updated the task description for T303044: decommission kubernetes100[1-4].
Mar 28 2022, 3:24 PM · SRE, ops-eqiad, decommission-hardware
akosiaris added a comment to T291707: zotero paging / serving 5xxes after CPU spikes.

In terms of a GET endpoint, would Swagger docs suffice? I started to do something like that but never got around to finishing it :/ https://github.com/zotero/translation-server/issues/76

That would work fine. Many thanks for working on that!

In review -> https://github.com/zotero/translation-server/pull/131

Mar 28 2022, 2:37 PM · Patch-For-Review, serviceops, Citoid
akosiaris added a comment to T288375: IPInfo MediaWiki extension depends on presence of maxmind db in the container/host.

Could we deploy the GeoIP databases to the kube-workers and then mount them into the mw pods as a read-only hostPath volume?
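A hedged sketch of what that would look like in a pod spec; the paths below are assumptions for illustration, not the actual GeoIP layout on the workers:

```yaml
# Illustrative only: mount a host-level GeoIP directory read-only into a pod.
volumes:
  - name: geoip
    hostPath:
      path: /usr/share/GeoIP    # assumed location on the kube-worker
      type: Directory
containers:
  - name: mediawiki
    volumeMounts:
      - name: geoip
        mountPath: /usr/share/GeoIP
        readOnly: true
```

One trade-off of hostPath is that the databases must then be kept up to date on every worker (e.g. by puppet) rather than shipped with the image.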

Mar 28 2022, 12:53 PM · Data-Engineering-Radar, serviceops, MW-on-K8s

Mar 21 2022

akosiaris closed T293729: setup/install kubestage100[34] as Resolved.

This has been done, resolving!

Mar 21 2022, 3:53 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops
akosiaris closed T293729: setup/install kubestage100[34], a subtask of T290894: Q2: (Need By: TBD) rack/setup/install kubestage100[34].eqiad.wmnet, as Resolved.
Mar 21 2022, 3:52 PM · SRE, serviceops, ops-eqiad, DC-Ops