akosiaris (Alexandros Kosiaris)
Senior Site Reliability Engineer

Projects (21)

User Details

User Since
Oct 3 2014, 8:40 AM (299 w, 6 d)
Availability
Available
IRC Nick
akosiaris
LDAP User
Alexandros Kosiaris
MediaWiki User
AKosiaris (WMF) [ Global Accounts ]

Blurb

Recent Activity

Today

akosiaris awarded T242604: Remove obsoleted docker images a Love token.
Thu, Jul 2, 8:38 AM · Patch-For-Review, User-brennen, Operations, Release Pipeline, Release-Engineering-Team, serviceops

Yesterday

akosiaris closed T256726: redis for docker-registry should have maxmemory-policy set to allkeys-lru, a subtask of T209271: improve docker registry architecture, as Resolved.
Wed, Jul 1, 3:27 PM · User-fsero, Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes, Operations
akosiaris closed T256726: redis for docker-registry should have maxmemory-policy set to allkeys-lru as Resolved.

Double checked across all nodes, this has been applied successfully. Resolving. Thanks!

Wed, Jul 1, 3:27 PM · serviceops, Prod-Kubernetes, Kubernetes, Operations
akosiaris added a comment to T256863: restbase2009 down.

show system1/log1 etc has 2 telling entries

Wed, Jul 1, 1:28 PM · RESTBase, Operations, ops-codfw
akosiaris updated the task description for T244530: upgrade memory in ganeti100[5-8].eqiad.wmnet.
Wed, Jul 1, 11:05 AM · ops-eqiad, Operations
akosiaris lowered the priority of T242705: ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) from High to Low.

I'll lower the priority for this. We may have the solution: https://grafana.wikimedia.org/d/000000607/cluster-overview?panelId=87&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=ores&var-instance=All&from=now-7d&to=now no longer shows the pattern on the 30th and the 1st, whereas it is clearly visible on all the days before that. I'll leave it open for a couple more days, though, so we can monitor.

Wed, Jul 1, 11:03 AM · Patch-For-Review, Scoring-platform-team (Current), Operations, ORES

Mon, Jun 29

akosiaris updated subscribers of T256629: Adopt SLIs / SLOs for sessionstore.

@akosiaris Thanks for putting this together. We're very excited to have SLOs around all our services. For the API Gateway work, Hugh, Giuseppe, Wolfgang and I discussed having a basic set of SLOs in place for Gateway and ideally for one of the services that back it under T254916.

The initial areas we were going to look at would be throughput, latency, error budget and availability. I know the dimensions we consider will vary depending on the component or service we're covering but are there general guidelines, process or methodology we could already be reviewing?

Mon, Jun 29, 2:31 PM · Core Platform Team, Sustainability (Incident Prevention), serviceops-radar
akosiaris changed the status of T256629: Adopt SLIs / SLOs for sessionstore from Open to Stalled.
Mon, Jun 29, 12:40 PM · Core Platform Team, Sustainability (Incident Prevention), serviceops-radar
akosiaris created T256629: Adopt SLIs / SLOs for sessionstore.
Mon, Jun 29, 12:40 PM · Core Platform Team, Sustainability (Incident Prevention), serviceops-radar
akosiaris added a comment to T242705: ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart).

My bet is that when you shut down the workers, the PyObject C structure of each Python object is 'touched', forcing the 'copy' part of the COW behaviour.

Mon, Jun 29, 10:04 AM · Patch-For-Review, Scoring-platform-team (Current), Operations, ORES
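
The mechanism behind the bet in the comment above can be illustrated with a minimal sketch. It assumes CPython, where the reference count lives inside each object's header, so merely taking a new reference writes to the object's memory (and, in a forked worker, would dirty the shared page). The function name is hypothetical, and this only illustrates the mechanism; it does not confirm the ORES diagnosis.

```python
import sys


def refcount_delta_for_one_new_reference():
    """Return how much an object's refcount changes when one reference is added."""
    obj = object()          # a fresh, ordinary (non-immortal) object
    holders = []
    before = sys.getrefcount(obj)
    holders.append(obj)     # no "modification" of obj, just one more reference
    after = sys.getrefcount(obj)
    return after - before


# On CPython the object's header is written to even though obj itself is unchanged.
print(refcount_delta_for_one_new_reference())
```

The same header write happens when references are dropped at shutdown, which is why tearing down many objects can touch many otherwise shared pages.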

Fri, Jun 26

akosiaris closed T256358: wikifeeds usage increased by 3x on 2020-06-24 as Resolved.

Everything seems fine. I am gonna resolve this. @Pchelolo thanks again!

Fri, Jun 26, 11:36 AM · Core Platform Team Workboards (Clinic Duty Team), RESTBase, Wikipedia-iOS-App-Backlog, iOS-app-Bugs, serviceops-radar, Mobile-Content-Service, Wikifeeds, Product-Infrastructure-Team-Backlog
akosiaris closed T256236: Increase capacity of the sessionstore dedicated kubernetes nodes as Resolved.

Both paths outlined in the description have been followed. We now have 4 different sessionstore dedicated nodes in each DC, with capacity of 6 kask pods each for a total of 24 pods. That is 6 times larger than the capacity we had when the incident was triggered and 3 times larger than the capacity we now have allocated. So we have room to react to a sudden increase, hopefully in the future even automatically.

Fri, Jun 26, 8:15 AM · Operations, serviceops, Sustainability (Incident Prevention)
akosiaris closed T256254: Site: 2 VM request for kubernetes sessionstore dedicated nodes, a subtask of T256236: Increase capacity of the sessionstore dedicated kubernetes nodes, as Resolved.
Fri, Jun 26, 8:04 AM · Operations, serviceops, Sustainability (Incident Prevention)
akosiaris closed T256254: Site: 2 VM request for kubernetes sessionstore dedicated nodes as Resolved.

VMs created, installed and ready. Resolving.

Fri, Jun 26, 8:04 AM · vm-requests, Operations
akosiaris added a comment to T244530: upgrade memory in ganeti100[5-8].eqiad.wmnet.

kubestagetcd1004 is still down. Not sure if desired. ACKed in Icinga.

Fri, Jun 26, 7:46 AM · ops-eqiad, Operations

Thu, Jun 25

akosiaris lowered the priority of T256358: wikifeeds usage increased by 3x on 2020-06-24 from High to Low.

It seems like the restbase deploy of 821e96b fixed the issue. I'll lower priority but leave this open for a few hours and, if everything is fine, close as Resolved.

Thu, Jun 25, 3:34 PM · Core Platform Team Workboards (Clinic Duty Team), RESTBase, Wikipedia-iOS-App-Backlog, iOS-app-Bugs, serviceops-radar, Mobile-Content-Service, Wikifeeds, Product-Infrastructure-Team-Backlog
akosiaris added a comment to T244530: upgrade memory in ganeti100[5-8].eqiad.wmnet.

@Jclark-ctr Excellent. I started the process of emptying ganeti1006 (and filling ganeti1005), that should take quite a while, but we should be on time for next Thursday. Many thanks!

Thu, Jun 25, 2:20 PM · ops-eqiad, Operations
akosiaris updated the task description for T244530: upgrade memory in ganeti100[5-8].eqiad.wmnet.
Thu, Jun 25, 2:19 PM · ops-eqiad, Operations
akosiaris added a comment to T182331: [Epic] Deploy ORES in kubernetes cluster.

A couple more benefits of k8s I forgot to mention yesterday

Thu, Jun 25, 2:17 PM · Operations, ORES, Scoring-platform-team
akosiaris added a comment to T244530: upgrade memory in ganeti100[5-8].eqiad.wmnet.

@Jclark-ctr: ganeti1005 is ready. Fully depooled, downtimed and powered off.

Thu, Jun 25, 10:05 AM · ops-eqiad, Operations
akosiaris updated the task description for T244530: upgrade memory in ganeti100[5-8].eqiad.wmnet.
Thu, Jun 25, 10:04 AM · ops-eqiad, Operations
akosiaris added a project to T256358: wikifeeds usage increased by 3x on 2020-06-24: RESTBase.

@Pchelolo I think that yesterday's deploy of restbase has caused the issue described in this task. Restbase also alerts with

Thu, Jun 25, 9:49 AM · Core Platform Team Workboards (Clinic Duty Team), RESTBase, Wikipedia-iOS-App-Backlog, iOS-app-Bugs, serviceops-radar, Mobile-Content-Service, Wikifeeds, Product-Infrastructure-Team-Backlog
akosiaris added a comment to T256358: wikifeeds usage increased by 3x on 2020-06-24.

The start of the increases coincides with a deploy of restbase

Thu, Jun 25, 9:35 AM · Core Platform Team Workboards (Clinic Duty Team), RESTBase, Wikipedia-iOS-App-Backlog, iOS-app-Bugs, serviceops-radar, Mobile-Content-Service, Wikifeeds, Product-Infrastructure-Team-Backlog
akosiaris added a comment to T256358: wikifeeds usage increased by 3x on 2020-06-24.

codfw wikifeeds is by the way exhibiting similar behavior. https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?panelId=12&fullscreen&orgId=1&from=now-24h&to=now&var-dc=codfw%20prometheus%2Fk8s&var-service=wikifeeds

Thu, Jun 25, 9:29 AM · Core Platform Team Workboards (Clinic Duty Team), RESTBase, Wikipedia-iOS-App-Backlog, iOS-app-Bugs, serviceops-radar, Mobile-Content-Service, Wikifeeds, Product-Infrastructure-Team-Backlog
akosiaris added a project to T256358: wikifeeds usage increased by 3x on 2020-06-24: iOS-app-Bugs.

Turnilo points out that these requests come disproportionately with this User Agent

Thu, Jun 25, 9:11 AM · Core Platform Team Workboards (Clinic Duty Team), RESTBase, Wikipedia-iOS-App-Backlog, iOS-app-Bugs, serviceops-radar, Mobile-Content-Service, Wikifeeds, Product-Infrastructure-Team-Backlog
akosiaris triaged T256358: wikifeeds usage increased by 3x on 2020-06-24 as High priority.
Thu, Jun 25, 8:32 AM · Core Platform Team Workboards (Clinic Duty Team), RESTBase, Wikipedia-iOS-App-Backlog, iOS-app-Bugs, serviceops-radar, Mobile-Content-Service, Wikifeeds, Product-Infrastructure-Team-Backlog
akosiaris created T256358: wikifeeds usage increased by 3x on 2020-06-24.
Thu, Jun 25, 8:31 AM · Core Platform Team Workboards (Clinic Duty Team), RESTBase, Wikipedia-iOS-App-Backlog, iOS-app-Bugs, serviceops-radar, Mobile-Content-Service, Wikifeeds, Product-Infrastructure-Team-Backlog

Wed, Jun 24

akosiaris added a comment to T256256: Raise an alarm on container restarts/OOMs in kubernetes.

With kube-state-metrics (sorry for me repeating this over and over 😂 ) there is kube_pod_container_status_restarts_total and kube_pod_container_status_last_terminated_reason which can be used to detect OOM on containers.

Wed, Jun 24, 2:16 PM · Sustainability (Incident Prevention), serviceops, Kubernetes, ChangeProp
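
A minimal sketch of how the two kube-state-metrics series named in the comment above could be combined to flag OOM-killed containers. The sample exposition text, pod names, and helper names are hypothetical; only the two metric names come from the comment. A real alert would more likely be a Prometheus rule than a script.

```python
import re

# Hypothetical sample in Prometheus text exposition format.
SAMPLE = """\
kube_pod_container_status_restarts_total{namespace="changeprop",pod="changeprop-abc",container="changeprop"} 7
kube_pod_container_status_restarts_total{namespace="kask",pod="kask-xyz",container="kask"} 0
kube_pod_container_status_last_terminated_reason{namespace="changeprop",pod="changeprop-abc",container="changeprop",reason="OOMKilled"} 1
kube_pod_container_status_last_terminated_reason{namespace="kask",pod="kask-xyz",container="kask",reason="Completed"} 1
"""

LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse(text):
    """Yield (metric_name, labels_dict, value) for each labelled sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match:
            name, labels, value = match.groups()
            yield name, dict(LABEL_RE.findall(labels)), float(value)


def oom_killed_containers(text):
    """Return [((namespace, pod, container), restarts)] for OOM-killed containers."""
    restarts, oom = {}, set()
    for name, labels, value in parse(text):
        key = (labels.get("namespace"), labels.get("pod"), labels.get("container"))
        if name == "kube_pod_container_status_restarts_total":
            restarts[key] = int(value)
        elif (name == "kube_pod_container_status_last_terminated_reason"
              and labels.get("reason") == "OOMKilled" and value == 1):
            oom.add(key)
    return sorted((key, restarts.get(key, 0)) for key in oom)


print(oom_killed_containers(SAMPLE))
```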
akosiaris added a comment to T256256: Raise an alarm on container restarts/OOMs in kubernetes.

An interesting thing to note here is that some services have pod restarts quite often, e.g.

Wed, Jun 24, 1:43 PM · Sustainability (Incident Prevention), serviceops, Kubernetes, ChangeProp
akosiaris triaged T256256: Raise an alarm on container restarts/OOMs in kubernetes as Medium priority.
Wed, Jun 24, 1:41 PM · Sustainability (Incident Prevention), serviceops, Kubernetes, ChangeProp
akosiaris created T256256: Raise an alarm on container restarts/OOMs in kubernetes.
Wed, Jun 24, 1:41 PM · Sustainability (Incident Prevention), serviceops, Kubernetes, ChangeProp
akosiaris updated the task description for T255975: Investigate the iowait issues plaguing kubernetes nodes since 2020-05-29.
Wed, Jun 24, 1:38 PM · Sustainability (Incident Prevention), serviceops, Kubernetes, ChangeProp
akosiaris added a comment to T255975: Investigate the iowait issues plaguing kubernetes nodes since 2020-05-29.

Thanks for writing this up @akosiaris! I think it would be nice to have the follow up tasks linked here. Like the removal of the service-runner and splitting up changeprop into multiple deployments (one per topic?).

Wed, Jun 24, 1:35 PM · Sustainability (Incident Prevention), serviceops, Kubernetes, ChangeProp
akosiaris added a subtask for T256236: Increase capacity of the sessionstore dedicated kubernetes nodes: T256254: Site: 2 VM request for kubernetes sessionstore dedicated nodes.
Wed, Jun 24, 1:24 PM · Operations, serviceops, Sustainability (Incident Prevention)
akosiaris added a parent task for T256254: Site: 2 VM request for kubernetes sessionstore dedicated nodes: T256236: Increase capacity of the sessionstore dedicated kubernetes nodes.
Wed, Jun 24, 1:24 PM · vm-requests, Operations
akosiaris triaged T256254: Site: 2 VM request for kubernetes sessionstore dedicated nodes as High priority.
Wed, Jun 24, 1:24 PM · vm-requests, Operations
akosiaris created T256254: Site: 2 VM request for kubernetes sessionstore dedicated nodes.
Wed, Jun 24, 1:24 PM · vm-requests, Operations
akosiaris removed projects from T210582: New node request: oresrdb[12]003: vm-requests, Operations.

There are no oresrdb nodes at all anymore. ORES redis has been moved to the redis::misc cluster. This is probably no longer relevant. I'll remove the vm-requests tag, feel free to re-add.

Wed, Jun 24, 1:20 PM · Scoring-platform-team, ORES
akosiaris added a comment to T256236: Increase capacity of the sessionstore dedicated kubernetes nodes.

Currently, sessionstore sets a limit of 400Mi and 2.5 CPUs[1]. The nodes have 4GB RAM and 6 CPUs. The easy win here is to increase the number of vCPUs from 6 to 15. That should allow for a 200% increase in the number of pods the node can run (from 2 to 6). The amount of memory available already supports 10 pods in theory; in practice, since the node also consumes some memory, it's more like ~8, which makes the 2 resources pretty close to each other.

Wed, Jun 24, 12:02 PM · Operations, serviceops, Sustainability (Incident Prevention)
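
The capacity arithmetic in the comment above can be checked with a short sketch. It assumes that 4GB means 4096 MiB and that the number of pods per node is bounded by whichever of CPU or memory runs out first, ignoring the memory the node itself consumes. The function name is hypothetical.

```python
def pods_that_fit(node_millicores, node_mem_mib,
                  pod_millicores=2500, pod_mem_mib=400):
    """Return (pods by CPU, pods by memory, effective pods) for one node.

    Defaults mirror the sessionstore limits quoted in the comment
    (2.5 CPUs and 400Mi per pod); CPU is in millicores to avoid floats.
    """
    by_cpu = node_millicores // pod_millicores
    by_mem = node_mem_mib // pod_mem_mib
    return by_cpu, by_mem, min(by_cpu, by_mem)


print(pods_that_fit(6000, 4096))    # 6 vCPUs:  (2, 10, 2)
print(pods_that_fit(15000, 4096))   # 15 vCPUs: (6, 10, 6)
```

Going from 2 to 6 pods is the 200% increase mentioned, and the memory bound of 10 (roughly 8 in practice) sits just above the new CPU bound.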
akosiaris triaged T256236: Increase capacity of the sessionstore dedicated kubernetes nodes as High priority.
Wed, Jun 24, 11:57 AM · Operations, serviceops, Sustainability (Incident Prevention)
akosiaris created T256236: Increase capacity of the sessionstore dedicated kubernetes nodes.
Wed, Jun 24, 11:57 AM · Operations, serviceops, Sustainability (Incident Prevention)

Tue, Jun 23

akosiaris added a comment to T244530: upgrade memory in ganeti100[5-8].eqiad.wmnet.

@Jclark-ctr: OK, how about 1 host per week? No need for specific timeframes. I'll have the host depooled, emptied, powered off and downtimed in icinga and ready for the memory upgrade. All you'll need is to add the memory and power up.

Tue, Jun 23, 2:36 PM · ops-eqiad, Operations
akosiaris created P11640 opusmt locust.
Tue, Jun 23, 1:57 PM
akosiaris closed T249927: Support kubernetes Egress networkpolicies in our helm charts, a subtask of T207804: Upgrade calico in production to version 2.4+, as Resolved.
Tue, Jun 23, 12:58 PM · User-jijiki, User-fsero, serviceops, Kubernetes, Operations
akosiaris closed T249927: Support kubernetes Egress networkpolicies in our helm charts as Resolved.

With the last 2 changes merged, the work on this has been completed. Many thanks @apakhomov !

Tue, Jun 23, 12:58 PM · Patch-For-Review, User-jijiki, User-fsero, serviceops, Kubernetes, Operations

Mon, Jun 22

akosiaris added a comment to T253843: Move helm chart repository out of git.

I need to make decisions regarding TLS and storage:

Do we want to use envoy here (ChartMuseum is able to do TLS termination as well)? I think it might be desirable as it prevents us from writing/maintaining TLS stuff in the ChartMuseum class/profile and the overhead is probably irrelevant here.

Mon, Jun 22, 2:00 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
akosiaris added a comment to T255975: Investigate the iowait issues plaguing kubernetes nodes since 2020-05-29.

This looks great, thanks @akosiaris! The only thing I'd note (not sure if it even warrants inclusion in the body of the report but it's an interesting correlation) is how closely CPU throttling was linked with memory maxing out in the container. The troughing seen is of course the pod being OOM killed but the peaks correspond quite closely with memory usage peaking.

Mon, Jun 22, 1:14 PM · Sustainability (Incident Prevention), serviceops, Kubernetes, ChangeProp
akosiaris updated the task description for T255975: Investigate the iowait issues plaguing kubernetes nodes since 2020-05-29.
Mon, Jun 22, 1:14 PM · Sustainability (Incident Prevention), serviceops, Kubernetes, ChangeProp
akosiaris added a comment to T255975: Investigate the iowait issues plaguing kubernetes nodes since 2020-05-29.

@JMeybohm @Pchelolo @hnowlan: This is an account of the investigation we went through for the changeprop memory/cpu limiting.
It's part of the sessionstore incident[1] actionables, although in the end it's largely unrelated. Please proofread and correct any errors I might have made.

Mon, Jun 22, 12:46 PM · Sustainability (Incident Prevention), serviceops, Kubernetes, ChangeProp
akosiaris triaged T255975: Investigate the iowait issues plaguing kubernetes nodes since 2020-05-29 as Low priority.
Mon, Jun 22, 12:44 PM · Sustainability (Incident Prevention), serviceops, Kubernetes, ChangeProp
akosiaris updated the task description for T255975: Investigate the iowait issues plaguing kubernetes nodes since 2020-05-29.
Mon, Jun 22, 12:43 PM · Sustainability (Incident Prevention), serviceops, Kubernetes, ChangeProp
akosiaris created T255975: Investigate the iowait issues plaguing kubernetes nodes since 2020-05-29.
Mon, Jun 22, 9:03 AM · Sustainability (Incident Prevention), serviceops, Kubernetes, ChangeProp
akosiaris added a project to T252185: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet.: Sustainability (Incident Prevention).
Mon, Jun 22, 8:43 AM · Sustainability (Incident Prevention), ops-codfw, serviceops, Operations
akosiaris added a project to T241850: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet: Sustainability (Incident Prevention).
Mon, Jun 22, 8:43 AM · Sustainability (Incident Prevention), serviceops, ops-eqiad, Operations
akosiaris awarded T105378: Stop a poolcounter server fail from being a SPOF for the service and the api (and the site) a Love token.
Mon, Jun 22, 7:58 AM · Sustainability (Incident Prevention), Operations, WMF-deploy-2015-08-25_(1.26wmf20), Patch-For-Review, MediaWiki-General
akosiaris added a comment to T254556: Upgrade m1 to Buster and Mariadb 10.4.
Mon, Jun 22, 7:55 AM · DBA

Fri, Jun 19

akosiaris added a comment to T187984: Update OTRS to the latest stable version (6.x.x).

For what it's worth, this is probably going to be scheduled for next quarter (so July-September).

Fri, Jun 19, 12:42 PM · User-notice, Operations, OTRS

Thu, Jun 18

akosiaris updated the task description for T252185: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet..
Thu, Jun 18, 3:36 PM · Sustainability (Incident Prevention), ops-codfw, serviceops, Operations
akosiaris added a comment to T105378: Stop a poolcounter server fail from being a SPOF for the service and the api (and the site).

The log messages represent failure of all servers in the pool, although the error is not user-visible. PoolCounter_ConnectionManager::open() fails silently after two attempts to connect to a server. Once the loop over all hosts in PoolCounter_ConnectionManager::get() fails to get a valid connection, the last error is wrapped in a Status and returned to PoolCounterWork::execute(), which logs the error. It's those log messages that we see. The impact is that PoolCounterWork::execute() runs the work locally, as if the pool population was zero -- a fair assumption unless the cause of the connection error is a very heavy system-wide overload.

Thu, Jun 18, 9:20 AM · Sustainability (Incident Prevention), Operations, WMF-deploy-2015-08-25_(1.26wmf20), Patch-For-Review, MediaWiki-General
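
The failure path described in the comment above can be sketched as follows. This is a simplified model in Python, not the MediaWiki PHP code: the function names, the use of OSError, and the list used as a log are all assumptions made for illustration.

```python
MAX_ATTEMPTS = 2  # per the comment: open() gives up after two attempts


def open_connection(host, connect):
    """Try one host up to MAX_ATTEMPTS times; return (conn, last_error)."""
    last_error = None
    for _ in range(MAX_ATTEMPTS):
        try:
            return connect(host), None
        except OSError as exc:
            last_error = exc  # swallowed here, mirroring "fails silently"
    return None, last_error


def get_connection(hosts, connect):
    """Loop over all hosts; return the first connection or the last error."""
    last_error = None
    for host in hosts:
        conn, err = open_connection(host, connect)
        if conn is not None:
            return conn, None
        last_error = err
    return None, last_error


def execute(hosts, connect, work, log):
    """Run work; if no pool server is reachable, log it and run locally."""
    conn, err = get_connection(hosts, connect)
    if conn is None:
        log.append(f"pool unavailable: {err}")  # the message operators see
        return work()  # proceed as if the pool population were zero
    # A real implementation would acquire a pool slot via conn first.
    return work()
```

With two hosts that both time out, `connect` is called four times, one line is logged, and the caller still gets the work's result, which matches the "not user-visible" behaviour described.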

Wed, Jun 17

akosiaris closed T251626: (Need By: TDB) rack/setup/install rdb200[78], a subtask of T255250: Replace rdb200[34] with rdb200[78], as Resolved.
Wed, Jun 17, 2:29 PM · serviceops
akosiaris closed T251626: (Need By: TDB) rack/setup/install rdb200[78] as Resolved.

@akosiaris the racking/setup ticket needs to be closed and another ticket opened for the service.

Thanks

Wed, Jun 17, 2:29 PM · Operations, ops-codfw, DC-Ops
akosiaris triaged T255681: Put rdb200[78] into service as High priority.
Wed, Jun 17, 2:29 PM · Operations, ops-codfw, DC-Ops
akosiaris created T255681: Put rdb200[78] into service.
Wed, Jun 17, 2:28 PM · Operations, ops-codfw, DC-Ops
akosiaris reopened T251626: (Need By: TDB) rack/setup/install rdb200[78], a subtask of T255250: Replace rdb200[34] with rdb200[78], as Open.
Wed, Jun 17, 1:39 PM · serviceops
akosiaris reopened T251626: (Need By: TDB) rack/setup/install rdb200[78] as "Open".

Great! Thanks Papaul

Wed, Jun 17, 1:39 PM · Operations, ops-codfw, DC-Ops
Jdforrester-WMF awarded T255672: Migrate apertium to the deployment pipeline a Like token.
Wed, Jun 17, 1:37 PM · Language-Team (Language-2020-Focus-Sprint), CX-cxserver, serviceops, Release-Engineering-Team (Pipeline)
akosiaris created P11568 wmf-reimage stacktrace.
Wed, Jun 17, 1:20 PM
akosiaris added a subtask for T198901: Migrate production services to kubernetes using the pipeline: T255672: Migrate apertium to the deployment pipeline.
Wed, Jun 17, 1:19 PM · Release-Engineering-Team (Pipeline), Release-Engineering-Team-TODO, Core Platform Team Legacy (Watching / External), Epic, Services (watching), Operations, Release Pipeline
akosiaris added a parent task for T255672: Migrate apertium to the deployment pipeline: T198901: Migrate production services to kubernetes using the pipeline.
Wed, Jun 17, 1:19 PM · Language-Team (Language-2020-Focus-Sprint), CX-cxserver, serviceops, Release-Engineering-Team (Pipeline)
akosiaris updated the task description for T198901: Migrate production services to kubernetes using the pipeline.
Wed, Jun 17, 1:19 PM · Release-Engineering-Team (Pipeline), Release-Engineering-Team-TODO, Core Platform Team Legacy (Watching / External), Epic, Services (watching), Operations, Release Pipeline
akosiaris triaged T255672: Migrate apertium to the deployment pipeline as Medium priority.
Wed, Jun 17, 1:18 PM · Language-Team (Language-2020-Focus-Sprint), CX-cxserver, serviceops, Release-Engineering-Team (Pipeline)
akosiaris created T255672: Migrate apertium to the deployment pipeline.
Wed, Jun 17, 1:18 PM · Language-Team (Language-2020-Focus-Sprint), CX-cxserver, serviceops, Release-Engineering-Team (Pipeline)
akosiaris assigned T255554: decommission ganeti200[1-6].codfw.wmnet to Papaul.

Service owner steps done, DC ops work can begin.

Wed, Jun 17, 12:15 PM · Operations, ops-codfw, decommission-hardware
akosiaris updated the task description for T255554: decommission ganeti200[1-6].codfw.wmnet.
Wed, Jun 17, 12:15 PM · Operations, ops-codfw, decommission-hardware
akosiaris added a comment to T255553: decommission ganeti100[1-4].eqiad.wmnet.

Service ops owner steps done, machines are ready to be handled by dc ops.

Wed, Jun 17, 12:14 PM · Operations, ops-eqiad, decommission-hardware
akosiaris assigned T255553: decommission ganeti100[1-4].eqiad.wmnet to Cmjohnson.
Wed, Jun 17, 12:14 PM · Operations, ops-eqiad, decommission-hardware
akosiaris added a comment to T255179: Session failures preventing edits, login, logout, etc: "invalid CSRF token".

As an external observer, I'm fearful of "log everyone out". This will cause a spike in traffic to the authentication infrastructure as everybody logs back in again. Which, of course, was the root cause of this problem to begin with. Is there some way to do a rolling logout, killing say, 1% of the sessions per hour?

Wed, Jun 17, 8:14 AM · Wikimedia-Incident, Core Platform Team, MediaWiki-Authentication-and-authorization, MediaWiki-User-login-and-signup, User-DannyS712
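
One way the rolling logout asked about in the comment above could work is to hash each session ID into one of 100 buckets and invalidate one bucket per hour. This is a minimal sketch of that idea only; the function names and the choice of SHA-256 are assumptions, and nothing here reflects how sessionstore or MediaWiki actually expire sessions.

```python
import hashlib

BUCKETS = 100  # one bucket per hour gives roughly 1% of sessions per hour


def bucket_of(session_id: str) -> int:
    """Map a session ID to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def sessions_to_expire(session_ids, hours_elapsed: int):
    """Return the sessions whose bucket is due in the given hour."""
    due = hours_elapsed % BUCKETS
    return [s for s in session_ids if bucket_of(s) == due]
```

Because the bucket depends only on the session ID, each session is expired exactly once over 100 hours, and the load from users logging back in is spread across that window.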
akosiaris added a comment to T105378: Stop a poolcounter server fail from being a SPOF for the service and the api (and the site).

The count over 24 hours was 1689 connection timeout errors. I think that crosses my threshold of concern. Another tweak may be needed.

Wed, Jun 17, 8:00 AM · Sustainability (Incident Prevention), Operations, WMF-deploy-2015-08-25_(1.26wmf20), Patch-For-Review, MediaWiki-General

Tue, Jun 16

akosiaris updated the task description for T255554: decommission ganeti200[1-6].codfw.wmnet.
Tue, Jun 16, 2:06 PM · Operations, ops-codfw, decommission-hardware
akosiaris updated the task description for T255553: decommission ganeti100[1-4].eqiad.wmnet.
Tue, Jun 16, 2:05 PM · Operations, ops-eqiad, decommission-hardware
akosiaris updated the task description for T245161: Track down and replace very old HW.
Tue, Jun 16, 12:08 PM · DC-Ops
akosiaris added a subtask for T245161: Track down and replace very old HW: T255554: decommission ganeti200[1-6].codfw.wmnet.
Tue, Jun 16, 12:06 PM · DC-Ops
akosiaris added a parent task for T255554: decommission ganeti200[1-6].codfw.wmnet: T245161: Track down and replace very old HW.
Tue, Jun 16, 12:06 PM · Operations, ops-codfw, decommission-hardware
akosiaris created T255554: decommission ganeti200[1-6].codfw.wmnet.
Tue, Jun 16, 12:06 PM · Operations, ops-codfw, decommission-hardware
akosiaris updated the task description for T245161: Track down and replace very old HW.
Tue, Jun 16, 12:04 PM · DC-Ops
akosiaris added a parent task for T255553: decommission ganeti100[1-4].eqiad.wmnet: T245161: Track down and replace very old HW.
Tue, Jun 16, 12:04 PM · Operations, ops-eqiad, decommission-hardware
akosiaris added a subtask for T245161: Track down and replace very old HW: T255553: decommission ganeti100[1-4].eqiad.wmnet.
Tue, Jun 16, 12:04 PM · DC-Ops
akosiaris created T255553: decommission ganeti100[1-4].eqiad.wmnet.
Tue, Jun 16, 12:03 PM · Operations, ops-eqiad, decommission-hardware
akosiaris added a comment to T255179: Session failures preventing edits, login, logout, etc: "invalid CSRF token".

Should something be done about users who thought they had logged out but actually hadn't?

Tue, Jun 16, 12:00 PM · Wikimedia-Incident, Core Platform Team, MediaWiki-Authentication-and-authorization, MediaWiki-User-login-and-signup, User-DannyS712
akosiaris added a comment to T224041: Kask functional testing with Cassandra via the Deployment Pipeline.

I created a helm test and got the integration and functional tests running in minikube. Do we have DNS within the ci cluster?

Tue, Jun 16, 10:30 AM · Core Platform Team, CPT Initiatives (Session Management Service (CDP2)), Release-Engineering-Team (Pipeline), Release-Engineering-Team-TODO, Services (next), User-Eevans, Release Pipeline, Operations, serviceops

Mon, Jun 15

akosiaris added a comment to T255410: Termbox SSR connection terminated very often.
Mon, Jun 15, 2:56 PM · wikidata-tech-focus, Wikidata, Wikidata-Termbox
akosiaris added a comment to T185664: Code stewardship review: FlaggedRevs.

Is there a default result for an extension if no team agrees to be responsible for it? Undeployment?

Mon, Jun 15, 7:12 AM · Release-Engineering-Team (Code Health), MediaWiki-extensions-FlaggedRevs, Code-Stewardship-Reviews

Sun, Jun 14

akosiaris added a comment to T252185: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet..

kubernetes2007 has been reimaged successfully. It seems like kubernetes2008 to kubernetes2014 require networking configuration on the switch side.

Sun, Jun 14, 8:46 AM · Sustainability (Incident Prevention), ops-codfw, serviceops, Operations

Fri, Jun 12

akosiaris triaged T255250: Replace rdb200[34] with rdb200[78] as Medium priority.
Fri, Jun 12, 10:40 AM · serviceops
akosiaris added a parent task for T251626: (Need By: TDB) rack/setup/install rdb200[78]: T255250: Replace rdb200[34] with rdb200[78].
Fri, Jun 12, 10:40 AM · Operations, ops-codfw, DC-Ops
akosiaris added a subtask for T255250: Replace rdb200[34] with rdb200[78]: T251626: (Need By: TDB) rack/setup/install rdb200[78].
Fri, Jun 12, 10:40 AM · serviceops
akosiaris created T255250: Replace rdb200[34] with rdb200[78].
Fri, Jun 12, 10:39 AM · serviceops
akosiaris closed T255179: Session failures preventing edits, login, logout, etc: "invalid CSRF token", a subtask of T254173: 1.35.0-wmf.36 deployment blockers, as Resolved.
Fri, Jun 12, 9:16 AM · Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Release, Train Deployments
akosiaris closed T255179: Session failures preventing edits, login, logout, etc: "invalid CSRF token" as Resolved.

Yes, absolutely agreed. The trigger was indeed insufficient capacity in sessionstore to handle a sudden increase in requests of at least 33% (~15k to ~20k, if not more). We've gone ahead and added capacity to the service and will follow up with adding more capacity to the entire cluster as well as the dedicated sessionstore nodes. So, I'll be bold and resolve this, feel free to reopen though.

Fri, Jun 12, 9:16 AM · Wikimedia-Incident, Core Platform Team, MediaWiki-Authentication-and-authorization, MediaWiki-User-login-and-signup, User-DannyS712

Thu, Jun 11

akosiaris awarded T250493: Benchmark CPU and memory usage of push notifications service a Love token.
Thu, Jun 11, 3:01 PM · Patch-For-Review, Product-Infrastructure-Team-Backlog (Kanban), Push-Notification-Service