akosiaris (Alexandros Kosiaris)
Senior Site Reliability Engineer

User Details

User Since
Oct 3 2014, 8:40 AM (280 w, 2 d)
Availability
Available
IRC Nick
akosiaris
LDAP User
Alexandros Kosiaris
MediaWiki User
AKosiaris (WMF) [ Global Accounts ]

Blurb

Recent Activity

Fri, Feb 14

akosiaris triaged T245272: Draft a plan for upgrading kubernetes machines to buster as Low priority.
Fri, Feb 14, 2:23 PM · serviceops
akosiaris created T245272: Draft a plan for upgrading kubernetes machines to buster.
Fri, Feb 14, 2:23 PM · serviceops
akosiaris added a comment to T245121: RRDP status alert .

I think this means that the query to that URL times out.
As it completes properly from codfw I'm wondering if it's not an issue with the webproxies (overloaded or similar).
Any idea who can help looking into it?

Fri, Feb 14, 8:59 AM · Operations, netops

Thu, Feb 13

akosiaris closed T245158: ganeti doesn't change the boot order to network, a subtask of T244585: Upgrade rpki VMs to buster, as Resolved.
Thu, Feb 13, 3:39 PM · Operations
akosiaris closed T245158: ganeti doesn't change the boot order to network as Resolved.

The VM had a kvm_extra: -bios OVMF.fd in its configuration. That meant it used UEFI, not BIOS, and hence the usual boot_order: network ganeti functionality wouldn't work, as the boot order was stored in the UEFI "firmware", which ganeti doesn't have support for yet. The fix was

Thu, Feb 13, 3:39 PM · Operations
akosiaris added a comment to T245058: Create an automated alert for 'too many nodes depooled from a service'.

From IRC

Thu, Feb 13, 2:05 PM · serviceops-radar, conftool, Wikimedia-Incident, Operations
akosiaris triaged T245058: Create an automated alert for 'too many nodes depooled from a service' as Medium priority.

Note that we currently have such an alert (or at least something close to it).

Thu, Feb 13, 11:07 AM · serviceops-radar, conftool, Wikimedia-Incident, Operations

Wed, Feb 12

akosiaris added a comment to T238658: Migrate EventStreams to k8s deployment pipeline.

@akosiaris I just did a bit of benchmarking in staging. As I added more concurrent consumers, my throughput dropped. I got up to 10 connections consuming from since=0 (meaning the last 30 days) of the recentchange stream from Kafka. With a fresh pod, a single eventstreams consumer can consume 1-2 thousand messages per second, but as concurrency increases, this can drop to 40 or 50 per second. Still fine, but from what I can tell, the app is CPU bound. I've got it limited to cpu: 1000m now, and I see the CPU pod max bumping up into 1s. Memory usage is also getting close to the 1000Mi limit, but I don't see it reaching there yet.

Wed, Feb 12, 5:46 PM · Analytics-Kanban, Analytics, Patch-For-Review, Release-Engineering-Team (Pipeline), Services (watching), Release Pipeline
akosiaris added a project to T243106: Phased rollout of sessionstore to production fleet: serviceops-radar.
Wed, Feb 12, 3:53 PM · serviceops-radar, Patch-For-Review, TPG-Epics (Team Practices Group Coaching Clinic), CPT Initiatives (Multi-DC (TEC1)), User-Clarakosi, User-Eevans
akosiaris added a comment to T243106: Phased rollout of sessionstore to production fleet.

I 've just conducted 2 separate tests on 2 selected mw hosts, one appserver and one apiserver. Those were mw1331, mw1348 respectively. The tests were (via mangling /etc/hosts)

Wed, Feb 12, 3:52 PM · serviceops-radar, Patch-For-Review, TPG-Epics (Team Practices Group Coaching Clinic), CPT Initiatives (Multi-DC (TEC1)), User-Clarakosi, User-Eevans
akosiaris added a comment to T244238: Upgrade and restart m1 master (db1135).

@jcrespo @akosiaris any tentative date?

Wed, Feb 12, 1:45 PM · Wikimedia-Etherpad, DBA, Operations
akosiaris added a comment to T242705: ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart).

Seems like the deploy did not fix it after all. Most (if not all) hosts alerted this morning. It's evident in the graphs as well

Wed, Feb 12, 8:38 AM · Scoring-platform-team (Current), Operations, ORES

Tue, Feb 11

akosiaris added a comment to T242705: ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart).

Maybe related:

Till uWSGI 2.1, by default, sending the SIGTERM signal to uWSGI means “brutally reload the stack” while the convention is to shut an application down on SIGTERM. To shutdown uWSGI use SIGINT or SIGQUIT instead. If you absolutely can not live with uWSGI being so disrespectful towards SIGTERM, by all means enable the die-on-term option. Fortunately, this bad choice has been fixed in uWSGI 2.1

Tue, Feb 11, 3:18 PM · Scoring-platform-team (Current), Operations, ORES
akosiaris added a comment to T213193: Migrate changeprop to kubernetes.

Updated the list of actions that have to be taken in the task description. Number #1 is done, and we are now doing the helm chart. As soon as the chart is reviewed and merged, the other 2 items are SRE deploys (and should be done fairly quickly). Defining what a "safe" deploy is for changeprop is a pretty good question. I guess @Pchelolo might be able to help with that, there's already some discussion in T244387 about it. Last step is pretty obvious I think :-)

Tue, Feb 11, 2:52 PM · Patch-For-Review, Release-Engineering-Team (Pipeline), Release-Engineering-Team-TODO, Services (watching), Release Pipeline, serviceops, ChangeProp
akosiaris updated the task description for T213193: Migrate changeprop to kubernetes.
Tue, Feb 11, 2:48 PM · Patch-For-Review, Release-Engineering-Team (Pipeline), Release-Engineering-Team-TODO, Services (watching), Release Pipeline, serviceops, ChangeProp
akosiaris added a comment to T213193: Migrate changeprop to kubernetes.

Adding @hnowlan so that he is aware and can perhaps help move this along.

Tue, Feb 11, 2:42 PM · Patch-For-Review, Release-Engineering-Team (Pipeline), Release-Engineering-Team-TODO, Services (watching), Release Pipeline, serviceops, ChangeProp
akosiaris updated subscribers of T213193: Migrate changeprop to kubernetes.
Tue, Feb 11, 2:38 PM · Patch-For-Review, Release-Engineering-Team (Pipeline), Release-Engineering-Team-TODO, Services (watching), Release Pipeline, serviceops, ChangeProp
akosiaris committed rDEPLOYCHARTSce560191aa4f: helpers: Move most charts to common_templates (authored by akosiaris).
helpers: Move most charts to common_templates
Tue, Feb 11, 10:50 AM
akosiaris added a comment to T242705: ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart).

The result of running the command below is at:

Tue, Feb 11, 8:52 AM · Scoring-platform-team (Current), Operations, ORES

Mon, Feb 10

akosiaris added a comment to T242705: ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart).

I'd like to try deploying this change on Monday's deployment window.

Mon, Feb 10, 3:16 PM · Scoring-platform-team (Current), Operations, ORES
akosiaris added a project to T244713: OTRS CVE-2020-1768, CVE-2019-11358: Security.

Adding security team as an FYI

Mon, Feb 10, 10:42 AM · Security Related, Security, OTRS, serviceops
akosiaris closed T244713: OTRS CVE-2020-1768, CVE-2019-11358 as Resolved.
Mon, Feb 10, 10:42 AM · Security Related, Security, OTRS, serviceops
akosiaris created T244713: OTRS CVE-2020-1768, CVE-2019-11358.
Mon, Feb 10, 10:42 AM · Security Related, Security, OTRS, serviceops

Fri, Feb 7

akosiaris updated the task description for T242461: restrouter.svc.{eqiad,codfw}.wmnet in a failed state.
Fri, Feb 7, 2:10 PM · serviceops, Core Platform Team Workboards (Clinic Duty Team)
akosiaris added a comment to T244530: upgrade memory in ganeti100[5-8].eqiad.wmnet.

There's nothing rushing us on this btw, feel free to propose alternative maint windows.

Fri, Feb 7, 1:58 PM · ops-eqiad, Operations
akosiaris updated the task description for T244530: upgrade memory in ganeti100[5-8].eqiad.wmnet.
Fri, Feb 7, 1:57 PM · ops-eqiad, Operations
akosiaris updated subscribers of T244530: upgrade memory in ganeti100[5-8].eqiad.wmnet.

Those 4 machines will have to be done one by one in order as @RobH points out. Overall, about an hour of advance notice should suffice, but let's do one each day ? I 'll add tentative maint windows (last 1 day each for your convenience) to the task

Fri, Feb 7, 1:56 PM · ops-eqiad, Operations
akosiaris closed T244535: wikifeeds - fix the CPU limits so that it doesn't get starved as Resolved.

The capacity increase did not fix anything, and neither did some efforts to increase requests/limits further. In fact the sum of throttled times got a 50% increase, which adds more weight to the hypothesis about CFS quota issues. A TL;DR is that all pods, regardless of the amount of work they do, got mildly throttled because the Linux CFS scheduler accounts for every chunk of time allocated to a task, even if the task has yielded.

Fri, Feb 7, 11:05 AM · Wikimedia-Incident, serviceops, Wikifeeds
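
A minimal sketch, in Python, of how the throttling described in the comment above could be quantified for a single pod. It assumes the cgroup v1 cpu.stat layout (nr_periods, nr_throttled, throttled_time in nanoseconds). The sample counters are made up for illustration and are not wikifeeds measurements.

```python
from typing import Dict

def parse_cpu_stat(text: str) -> Dict[str, int]:
    """Parse the 'key value' lines of a cgroup v1 cpu.stat file."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats

def throttled_ratio(stats: Dict[str, int]) -> float:
    """Fraction of CFS enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods

# Hypothetical counters, for illustration only.
SAMPLE = "nr_periods 1000\nnr_throttled 50\nthrottled_time 2500000000\n"

if __name__ == "__main__":
    ratio = throttled_ratio(parse_cpu_stat(SAMPLE))
    print(f"throttled in {ratio:.1%} of periods")
```

A ratio that stays above zero while total usage sits well under the limit is the kind of signal that would be consistent with the scheduler-artifact hypothesis rather than with real starvation.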
akosiaris added a comment to T242705: ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart).

Let me add my own finding. Doing systemctl stop uwsgi-ores triggers the issue. It's during the stop phase that uwsgi workers go haywire on CPU and memory usage. systemctl start uwsgi-ores after that does not cause any significant CPU or memory increase.

Fri, Feb 7, 10:41 AM · Scoring-platform-team (Current), Operations, ORES
akosiaris committed rDEPLOYCHARTSe4cdd6171c87: wikifeeds: Bump capacity by 50% (authored by akosiaris).
wikifeeds: Bump capacity by 50%
Fri, Feb 7, 9:59 AM
akosiaris added a comment to T244535: wikifeeds - fix the CPU limits so that it doesn't get starved.

We are definitely better than what we used to be, but I am still not happy. I 'll increase the capacity as well, from 4 pods to 6 pods, that is by 50%

Fri, Feb 7, 9:46 AM · Wikimedia-Incident, serviceops, Wikifeeds
akosiaris added a comment to T244535: wikifeeds - fix the CPU limits so that it doesn't get starved.

Limits have been increased to 2.5 cores. However, the app is still mildly throttled [1]. Given the limit is 1.5 times more than the current total usage, I am inclined to think this is a scheduler artifact. We've seen it before with kask and there's a lot of talk about it. It's essentially a recap of 512ac999[2] of the linux kernel. Interestingly, after the deploy, latencies dropped [3] by some 25ms

Fri, Feb 7, 9:39 AM · Wikimedia-Incident, serviceops, Wikifeeds
akosiaris committed rDEPLOYCHARTS48f06ab416a1: wikifeeds: package wikifeeds-0.0.9 (authored by akosiaris).
wikifeeds: package wikifeeds-0.0.9
Fri, Feb 7, 8:54 AM
akosiaris committed rDEPLOYCHARTS2b33f2dc4a16: wikifeeds: slightly lower the CPU limits (authored by akosiaris).
wikifeeds: slightly lower the CPU limits
Fri, Feb 7, 8:54 AM
akosiaris committed rDEPLOYCHARTS818e598a98ec: admin: Remove the limitrange overrides for staging (authored by akosiaris).
admin: Remove the limitrange overrides for staging
Fri, Feb 7, 8:54 AM
akosiaris added a comment to T244535: wikifeeds - fix the CPU limits so that it doesn't get starved.

https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?orgId=1&from=1581018813182&to=1581025628155&var-dc=eqiad%20prometheus%2Fk8s&var-service=wikifeeds is a graph of wikifeeds during the outage yesterday. The CPU throttling is very aggressive there, meaning the service did not have adequate resources to serve the requests in time. That ended up depooling the pods one by one until none were left to serve the load. That triggered the obvious alerts, upon which we investigated and resolved the issue by restarting all pods, as they were probably not salvageable in any decent amount of time. In fact, judging by the output of kubectl get pods, some were occasionally repooled, only to be flooded with requests once more, quickly rendered unable to serve more traffic and depooled again, leading to a self-sustaining downward spiral out of which it was difficult to get.

Fri, Feb 7, 8:27 AM · Wikimedia-Incident, serviceops, Wikifeeds
akosiaris claimed T244535: wikifeeds - fix the CPU limits so that it doesn't get starved.
Fri, Feb 7, 8:20 AM · Wikimedia-Incident, serviceops, Wikifeeds
akosiaris committed rDEPLOYCHARTSfac7c9bcee96: wikifeeds: package wikifeeds-0.0.8 (authored by akosiaris).
wikifeeds: package wikifeeds-0.0.8
Fri, Feb 7, 8:19 AM
akosiaris committed rDEPLOYCHARTSaa5451302396: wikifeeds: Redefine CPU limits (authored by akosiaris).
wikifeeds: Redefine CPU limits
Fri, Feb 7, 8:18 AM

Thu, Feb 6

akosiaris added a reverting change for rDEPLOYCHARTSa5f0dcc19fe0: Update scaffold template names to use chart name: rDEPLOYCHARTS004e7b0ed0f3: Revert "Update scaffold template names to use chart name".
Thu, Feb 6, 3:11 PM
akosiaris committed rDEPLOYCHARTS004e7b0ed0f3: Revert "Update scaffold template names to use chart name" (authored by Joe).
Revert "Update scaffold template names to use chart name"
Thu, Feb 6, 3:11 PM
akosiaris added a comment to T242705: ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart).

I 've tried to reproduce this. It's easily reproducible after all. Just do what logrotate does and issue systemctl reload uwsgi-ores. CPU usage spikes and reaches 100% for all CPUs in the machines for several seconds. Memory usage spikes as well and then OOM killer shows up as the machine is out of memory. The best thing for OOM killer to kill is celery as this is the big memory user.

Thu, Feb 6, 8:47 AM · Scoring-platform-team (Current), Operations, ORES

Wed, Feb 5

akosiaris added a comment to T243634: ulsfo varnish-fe vcache processes overflow on FDs.

I just reverted the cr3, cr4 ulsfo change.

Wed, Feb 5, 11:37 AM · Operations, Traffic
akosiaris updated subscribers of T243634: ulsfo varnish-fe vcache processes overflow on FDs.

I blocked a number of IPs manually on cr3 and cr4 for ulsfo. Command was

Wed, Feb 5, 10:38 AM · Operations, Traffic
akosiaris added a comment to T244335: Upgrade production kubernetes clusters to a security supported version.

Important release notes for 1.13.x that affect us

Wed, Feb 5, 10:22 AM · Prod-Kubernetes, serviceops
akosiaris updated the task description for T244335: Upgrade production kubernetes clusters to a security supported version.
Wed, Feb 5, 9:55 AM · Prod-Kubernetes, serviceops
akosiaris lowered the priority of T244335: Upgrade production kubernetes clusters to a security supported version from High to Medium.
Wed, Feb 5, 9:50 AM · Prod-Kubernetes, serviceops
akosiaris triaged T244335: Upgrade production kubernetes clusters to a security supported version as High priority.
Wed, Feb 5, 9:49 AM · Prod-Kubernetes, serviceops
akosiaris moved T244335: Upgrade production kubernetes clusters to a security supported version from Backlog to Doing on the serviceops board.
Wed, Feb 5, 9:49 AM · Prod-Kubernetes, serviceops
akosiaris created T244335: Upgrade production kubernetes clusters to a security supported version.
Wed, Feb 5, 9:49 AM · Prod-Kubernetes, serviceops
akosiaris added a comment to T242705: ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart).

T243451 does explain the higher memory usage. It even points out that the higher memory usage is worrisome, however it was deployed anyway.

Wed, Feb 5, 8:13 AM · Scoring-platform-team (Current), Operations, ORES

Tue, Feb 4

akosiaris committed rDEPLOYCHARTScb69d479eba3: citoid: build package, update repo index (authored by akosiaris).
citoid: build package, update repo index
Tue, Feb 4, 2:30 PM
akosiaris committed rDEPLOYCHARTSc776565c2354: Remove config for xisbn and update (authored by Mvolz).
Remove config for xisbn and update
Tue, Feb 4, 2:27 PM
akosiaris committed rDEPLOYCHARTS3745b8620bf7: wikifeeds: Bump chart version (authored by akosiaris).
wikifeeds: Bump chart version
Tue, Feb 4, 2:15 PM
akosiaris committed rDEPLOYCHARTSab060cb4500e: _scaffold: Fix YAML syntax error (authored by akosiaris).
_scaffold: Fix YAML syntax error
Tue, Feb 4, 1:14 PM
akosiaris committed rDEPLOYCHARTS83e2bce548d2: wikifeeds: Move to the debug functionality (authored by akosiaris).
wikifeeds: Move to the debug functionality
Tue, Feb 4, 1:14 PM
akosiaris committed rDEPLOYCHARTSf1284e6fb171: wikifeeds: Remove appbase_url_port (authored by akosiaris).
wikifeeds: Remove appbase_url_port
Tue, Feb 4, 1:14 PM
akosiaris committed rDEPLOYCHARTS20813f304c4a: wikifeeds: Add tests (authored by akosiaris).
wikifeeds: Add tests
Tue, Feb 4, 9:32 AM
akosiaris committed rDEPLOYCHARTSa7e8509e9fc6: scaffold: Remove last appbase_url_port mention (authored by akosiaris).
scaffold: Remove last appbase_url_port mention
Tue, Feb 4, 8:39 AM
akosiaris added a comment to T242705: ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart).

eqiad and codfw graphs both point to a similar issue as last time. 06:25UTC seems to be the start of the incident. Memory usage however had increased by close to 100% 9 hours before the event. The trigger is probably logrotate again (it anyway happens every day - if it was the cause we would see this all the time), but the cause is probably something in the traffic patterns.

Tue, Feb 4, 7:56 AM · Scoring-platform-team (Current), Operations, ORES

Fri, Jan 24

akosiaris added a comment to T243444: Request took down both zotero and citoid (exceeding memory).

Thanks. That's working now, but I've downloaded the log file and it's just what's already available on kibana, warn level or higher. There's no debug level or message (10/20) in the logs - I don't suppose we have those anywhere?

Nope. What's in there is what citoid is instructed to log. The config is at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/master/charts/citoid/templates/config.yaml#16

Ah - well unfortunately without the trace and debug messages, and the missing err_body_internalURI in the warning log entries of interest, the logs seem largely useless in terms of figuring out which url was the culprit :/
I've put in a pr to bump up zotero misses up to warn, what do you think about logging trace and debug as well in the meantime? Too overkill?

Fri, Jan 24, 4:15 PM · Operations, Citoid
akosiaris committed rDEPLOYCHARTS94f8b58a54ee: cluster-helmfile: Add a simple sleep 1 (authored by akosiaris).
cluster-helmfile: Add a simple sleep 1
Fri, Jan 24, 2:22 PM
akosiaris committed rDEPLOYCHARTS14d4eca10ad9: calico: Remove all urldownloader IPs (authored by akosiaris).
calico: Remove all urldownloader IPs
Fri, Jan 24, 2:19 PM
akosiaris committed rDEPLOYCHARTS1d262a639b01: Deduplicate cluster-helmfile.sh (authored by akosiaris).
Deduplicate cluster-helmfile.sh
Fri, Jan 24, 2:10 PM
akosiaris committed rDEPLOYCHARTS8d92acac4f38: admin: Realign calico policies between clusters (authored by akosiaris).
admin: Realign calico policies between clusters
Fri, Jan 24, 1:50 PM
akosiaris committed rDEPLOYCHARTS7de32995ae95: admin: Align staging symlink with production clusters (authored by akosiaris).
admin: Align staging symlink with production clusters
Fri, Jan 24, 1:40 PM
akosiaris committed rDEPLOYCHARTS242cad43e922: admin: get rid of apply-calico-policy.sh (authored by akosiaris).
admin: get rid of apply-calico-policy.sh
Fri, Jan 24, 1:40 PM
akosiaris committed rDEPLOYCHARTS790ade1cdbf8: admin: DRY podsecuritypolicies (authored by akosiaris).
admin: DRY podsecuritypolicies
Fri, Jan 24, 1:39 PM
akosiaris committed rDEPLOYCHARTS32cadbccfa3b: rbac: Move under common/ directory (authored by akosiaris).
rbac: Move under common/ directory
Fri, Jan 24, 1:17 PM
akosiaris committed rDEPLOYCHARTS50c9b4f7b9a4: admin: Use a template to drop values symlinks (authored by akosiaris).
admin: Use a template to drop values symlinks
Fri, Jan 24, 1:14 PM
akosiaris committed rDEPLOYCHARTSf392f1fc0130: admin: DRY environments by using a common one (authored by akosiaris).
admin: DRY environments by using a common one
Fri, Jan 24, 10:13 AM
akosiaris added a comment to T224580: Migrate etherpad1001 to Buster.

@Dzahn, I've merged the required remaining changes to get the migration done. Now etherpad.wikimedia.org uses etherpad1002. Checked a couple of pads, it seems everything is fine. Hopefully we have no corruption issues. etherpad1001 is now removed from site.pp and I've removed the etherpad-lite debian package from it. I've also -2ed the discovery record changes due to the issue above about the software not supporting scaling out. I guess what's left is to decommission and delete that VM.

Fri, Jan 24, 9:58 AM · Patch-For-Review, Wikimedia-Etherpad, serviceops, Operations
akosiaris added a comment to T224580: Migrate etherpad1001 to Buster.

Pads that, per the logs, have been accessed on https://etherpad-new.wikimedia.org

Fri, Jan 24, 9:40 AM · Patch-For-Review, Wikimedia-Etherpad, serviceops, Operations
akosiaris added a comment to T224580: Migrate etherpad1001 to Buster.

I 've removed the DNS and stopped and masked the service for now on etherpad1002. Since we proved it works, let's just move over to etherpad1002.eqiad.wmnet, stopping beforehand etherpad1001 (to avoid the issues I alluded to). etherpad is anyway best effort, it's ok to even have an extended downtime.

Fri, Jan 24, 9:31 AM · Patch-For-Review, Wikimedia-Etherpad, serviceops, Operations
akosiaris added a comment to T224580: Migrate etherpad1001 to Buster.
Fri, Jan 24, 7:49 AM · Patch-For-Review, Wikimedia-Etherpad, serviceops, Operations

Thu, Jan 23

akosiaris committed rDEPLOYCHARTS7afddea9a6d7: evenstreams: Add forgotten admin/ values files (authored by akosiaris).
evenstreams: Add forgotten admin/ values files
Thu, Jan 23, 4:58 PM
akosiaris committed rDEPLOYCHARTSc6e30ddc6264: eventstreams: Add the namespace and calico rules (authored by akosiaris).
eventstreams: Add the namespace and calico rules
Thu, Jan 23, 4:51 PM
akosiaris committed rDEPLOYCHARTS5638265eabb0: admin: Remove graphoid remnants (authored by akosiaris).
admin: Remove graphoid remnants
Thu, Jan 23, 4:43 PM
akosiaris committed rDEPLOYCHARTS294e6e4e25ec: admin: DRY all environments (authored by akosiaris).
admin: DRY all environments
Thu, Jan 23, 4:36 PM
akosiaris committed rDEPLOYCHARTSf6994fc83405: admin: Move blubberoid into a more DRY format (authored by akosiaris).
admin: Move blubberoid into a more DRY format
Thu, Jan 23, 3:42 PM
akosiaris committed rLPRIa76760678f15: eventstreams: Add k8s tokens (authored by akosiaris).
eventstreams: Add k8s tokens
Thu, Jan 23, 3:36 PM
akosiaris added a comment to T243444: Request took down both zotero and citoid (exceeding memory).

Thanks. That's working now, but I've downloaded the log file and it's just what's already available on kibana, warn level or higher. There's no debug level or message (10/20) in the logs - I don't suppose we have those anywhere?

Thu, Jan 23, 1:58 PM · Operations, Citoid
akosiaris added a comment to T243444: Request took down both zotero and citoid (exceeding memory).

I just noticed that for some reason setting DEBUG_LEVEL: 0 for zotero no longer works however.

Thu, Jan 23, 1:20 PM · Operations, Citoid
akosiaris added a comment to T243444: Request took down both zotero and citoid (exceeding memory).

-l app=citoid as that's the value for the app label, not citoid-production.

Thu, Jan 23, 1:07 PM · Operations, Citoid
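
A minimal sketch, in Python, of why -l app=citoid selects the pods while a selector on citoid-production does not. It models only equality-based label selectors, and the pod labels below are hypothetical (only the app=citoid value comes from the comment above).

```python
from typing import Dict

def parse_selector(expr: str) -> Dict[str, str]:
    """Parse an equality-based selector such as 'app=citoid,release=production'."""
    selector: Dict[str, str] = {}
    for term in expr.split(","):
        key, _, value = term.partition("=")
        selector[key.strip()] = value.strip()
    return selector

def matches(labels: Dict[str, str], selector: Dict[str, str]) -> bool:
    """A pod matches when every selector key is present with the same value."""
    return all(labels.get(key) == value for key, value in selector.items())

# Hypothetical pod labels; 'release' is an assumed extra label.
POD_LABELS = {"app": "citoid", "release": "production"}

if __name__ == "__main__":
    print(matches(POD_LABELS, parse_selector("app=citoid")))             # True
    print(matches(POD_LABELS, parse_selector("app=citoid-production")))  # False
```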
akosiaris added a comment to T243444: Request took down both zotero and citoid (exceeding memory).

I just noticed that for some reason setting DEBUG_LEVEL: 0 for zotero no longer works however.

Thu, Jan 23, 11:12 AM · Operations, Citoid
akosiaris added a comment to T243444: Request took down both zotero and citoid (exceeding memory).

it should be in the raw logs

Thu, Jan 23, 8:34 AM · Operations, Citoid

Wed, Jan 22

akosiaris added a comment to T224580: Migrate etherpad1001 to Buster.

The following packages are used by the puppet role but so far missing on buster:

  • prometheus-etherpad-exporter
  • etherpad-lite
Wed, Jan 22, 2:21 PM · Patch-For-Review, Wikimedia-Etherpad, serviceops, Operations

Tue, Jan 21

akosiaris added a comment to T242705: ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart).

Thanks for the ping. Notes:

Tue, Jan 21, 5:36 PM · Scoring-platform-team (Current), Operations, ORES
akosiaris committed rDEPLOYCHARTS7c01af0d42b5: eventgate: Remove extraneous comment in _helpers.tpl (authored by akosiaris).
eventgate: Remove extraneous comment in _helpers.tpl
Tue, Jan 21, 4:11 PM
akosiaris added a comment to T242861: Clarify multi-service instance concepts in helm charts and enable canary releases.

But first, I think a big source of confusion in our patches is the conflation of the word 'service'.

Tue, Jan 21, 3:01 PM · Analytics-Kanban, Analytics, serviceops
akosiaris committed rDEPLOYCHARTS4f179cf33348: Remove all 1.7 netpol checks (authored by akosiaris).
Remove all 1.7 netpol checks
Tue, Jan 21, 2:39 PM
akosiaris committed rDEPLOYCHARTS9cec12739133: scaffold: Remove the 1.8 kubernetes netpol if clause (authored by akosiaris).
scaffold: Remove the 1.8 kubernetes netpol if clause
Tue, Jan 21, 12:32 PM
akosiaris committed rDEPLOYCHARTSdae344baf57a: Fix helm release NOTES.txt text (authored by akosiaris).
Fix helm release NOTES.txt text
Tue, Jan 21, 12:32 PM
akosiaris added a comment to T238830: Profile proton memory usage for Helm chart.

I reran the num_workers=3 test 2 times. No big diff. 100 "locust users", spawned at a rate of 0.1/s. After peaking at about 0.5 RPS, errors start happening and latency skyrockets to ~60s. CPU still is around 3K. Memory-wise it has gone up to 4GB after about 1.5h of benchmarking. Funny thing is that the memory usage is not plateauing at all, but rather keeps on increasing. This is I guess expected given we use chromium, which is known for being a memory hog. Kubernetes will anyway take care of the memory leaks by restarting the pod if it goes over usage.

Tue, Jan 21, 11:20 AM · Patch-For-Review, Proton, serviceops, Product-Infrastructure-Team-Backlog (Kanban)
akosiaris closed T243070: Requesting +2 rights for Mvolz for operations/deployment-charts, a subtask of T213269: Requesting access to Citoid/Zotero production servers for MVOLZ, as Resolved.
Tue, Jan 21, 11:00 AM · SRE-Access-Requests, Operations, Citoid
akosiaris closed T243070: Requesting +2 rights for Mvolz for operations/deployment-charts as Resolved.

We resolved this live in a hangout with @Mvolz. Re-resolving

Tue, Jan 21, 11:00 AM · Gerrit-Privilege-Requests, SRE-Access-Requests, Operations, Citoid
akosiaris committed rDEPLOYCHARTS23b6352f35f2: Update zotero to 5953b26 (staging only) (authored by Mvolz).
Update zotero to 5953b26 (staging only)
Tue, Jan 21, 10:34 AM

Mon, Jan 20

akosiaris added a comment to T241230: Migrate recommendation-api to kubernetes.

@akosiaris thanks! The first two points were already done. I've created a chart and uploaded a patch. Would you please review it.

Mon, Jan 20, 11:51 AM · Patch-For-Review, serviceops, Release-Engineering-Team, Services, Recommendation-API
akosiaris added a comment to T238830: Profile proton memory usage for Helm chart.

I 've rerun the benchmark against values of 1,2,3 for num_workers

Mon, Jan 20, 11:16 AM · Patch-For-Review, Proton, serviceops, Product-Infrastructure-Team-Backlog (Kanban)

Sat, Jan 18

akosiaris added a comment to T238830: Profile proton memory usage for Helm chart.

I just noticed that num_workers: ncpu is set in the chart. Sigh, this probably makes all CPU calculations wrong, as it is impossible to size a pod that is dependent on the underlying hardware CPU-wise. I'll have to rerun those tests with values of 1, 2 and 3

I assumed pod size would be determined by values-chart and "ncpu" would fill all CPU available for that pod, is that right?

Sat, Jan 18, 1:58 PM · Patch-For-Review, Proton, serviceops, Product-Infrastructure-Team-Backlog (Kanban)
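
A minimal sketch, in Python, of the sizing concern raised above: a worker count derived from the host CPU count ignores the pod's CPU limit, whereas one derived from the CFS quota does not. The function name and the numbers are hypothetical, and this is not how proton computes num_workers.

```python
import math

def workers_for_quota(quota_us: int, period_us: int, host_cpus: int) -> int:
    """Worker count bounded by the CFS quota rather than by the host CPU count.

    A non-positive quota is treated as "no limit", so host_cpus is used.
    """
    if quota_us <= 0 or period_us <= 0:
        return host_cpus
    limit_cpus = math.ceil(quota_us / period_us)
    return max(1, min(host_cpus, limit_cpus))

if __name__ == "__main__":
    # Hypothetical 40-core node, pod limited to 2.5 CPUs (250ms per 100ms period).
    print(workers_for_quota(250_000, 100_000, host_cpus=40))  # 3
    # Same node, no quota: falls back to the host count, like "ncpu" would.
    print(workers_for_quota(-1, 100_000, host_cpus=40))       # 40
```

With a fixed value such as 1, 2 or 3, as in the reruns mentioned in the comment, the pod's resource needs no longer vary with the node it lands on.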