
akosiaris (Alexandros Kosiaris)
Senior Site Reliability Engineer

User Details

User Since
Oct 3 2014, 8:40 AM (245 w, 3 d)
Availability
Available
IRC Nick
akosiaris
LDAP User
Alexandros Kosiaris
MediaWiki User
AKosiaris (WMF) [ Global Accounts ]

Recent Activity

Tue, Jun 11

akosiaris added a comment to T220402: Introduce wikidata termbox SSR to kubernetes.

Just as an FYI, everything looks OK on this end, but there's a train freeze this week, so we have to wait before deploying this. Patches are up and waiting to be merged on Monday the 17th.

Tue, Jun 11, 8:28 AM · Patch-For-Review, Core Platform Team Backlog (Watching / External), Services (watching), Wikidata-Termbox-Hike, Wikidata, Release Pipeline, Operations, serviceops, Release-Engineering-Team

Fri, Jun 7

akosiaris added a comment to T220402: Introduce wikidata termbox SSR to kubernetes.

This should now be fixed. Sadly, this was due to a mismatch between the code in wikibase master and the code deployed on Wikidata.org.

Fri, Jun 7, 7:31 PM · Patch-For-Review, Core Platform Team Backlog (Watching / External), Services (watching), Wikidata-Termbox-Hike, Wikidata, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris added a comment to T224603: rack/setup/ codfw: ganeti2009 - ganeti201[0-8].

@ayounsi I am planning on installing those new servers in row C and row D, and I don't have the "interface-range ganeti" in either of those rows. Is it okay for me to go ahead and create "interface-range ganeti" on asw-c-codfw and asw-d-codfw?

Fri, Jun 7, 7:19 PM · Patch-For-Review, ops-codfw, Operations
akosiaris added a comment to T220401: Introduce kask session storage service to kubernetes.

sessionstore.discovery.wmnet is now around and should be the canonical DNS used to address the service.

Fri, Jun 7, 1:28 PM · Patch-For-Review, Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris added a comment to T199219: WDQS should use internal endpoint to communicate to Wikidata.

But won't we lose use of the varnish cache if we use the internal endpoint?

Fri, Jun 7, 1:18 PM · Wikidata, Wikidata-Query-Service
akosiaris added a comment to T220402: Introduce wikidata termbox SSR to kubernetes.

Indeed, this was fixed. However, another regression has reared its head.

Fri, Jun 7, 12:13 PM · Patch-For-Review, Core Platform Team Backlog (Watching / External), Services (watching), Wikidata-Termbox-Hike, Wikidata, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris added a comment to T220402: Introduce wikidata termbox SSR to kubernetes.

@Tarrow, @WMDE-leszek
Hi, sorry for taking so long to answer this; it's been really busy.

Fri, Jun 7, 10:39 AM · Patch-For-Review, Core Platform Team Backlog (Watching / External), Services (watching), Wikidata-Termbox-Hike, Wikidata, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris committed rDEPLOYCHARTSc243b5fabb6f: Add termbox-0.0.2.tgz (authored by akosiaris).
Add termbox-0.0.2.tgz
Fri, Jun 7, 10:32 AM
akosiaris committed rDEPLOYCHARTSe8dea9de5556: termbox: Use newer ENV variables (authored by akosiaris).
termbox: Use newer ENV variables
Fri, Jun 7, 10:32 AM

Thu, Jun 6

akosiaris added a comment to T225064: post merge builds in citoid are failing.

Nice, thank you for the explanation :-] Left to figure out in a different task is how to test Citoid together with Zotero, but I guess that is for another task.

Euh, no, that's what this task is for :) We were able to build images before, now we are not.

Not since adding the magic file that makes helm test work: https://gerrit.wikimedia.org/r/#/c/mediawiki/services/citoid/+/506107/

The build is gone from Jenkins (since it was six weeks old), but the failure comment in gerrit is still there.

Thu, Jun 6, 4:47 PM · Core Platform Team Kanban (Done with CPT), Services (done), Release Pipeline, Citoid

Wed, Jun 5

akosiaris updated the task description for T198901: Migrate production services to kubernetes using the pipeline.
Wed, Jun 5, 7:10 PM · Release-Engineering-Team-TODO, Core Platform Team Backlog (Watching / External), Epic, Services (watching), Operations, Release Pipeline
akosiaris updated the task description for T198901: Migrate production services to kubernetes using the pipeline.
Wed, Jun 5, 6:45 PM · Release-Engineering-Team-TODO, Core Platform Team Backlog (Watching / External), Epic, Services (watching), Operations, Release Pipeline
akosiaris updated the task description for T198901: Migrate production services to kubernetes using the pipeline.
Wed, Jun 5, 6:45 PM · Release-Engineering-Team-TODO, Core Platform Team Backlog (Watching / External), Epic, Services (watching), Operations, Release Pipeline
akosiaris added a comment to T198901: Migrate production services to kubernetes using the pipeline.

I don't think eventstreams is in k8s, is it?

Wed, Jun 5, 6:45 PM · Release-Engineering-Team-TODO, Core Platform Team Backlog (Watching / External), Epic, Services (watching), Operations, Release Pipeline
akosiaris updated the task description for T198901: Migrate production services to kubernetes using the pipeline.
Wed, Jun 5, 6:41 PM · Release-Engineering-Team-TODO, Core Platform Team Backlog (Watching / External), Epic, Services (watching), Operations, Release Pipeline
akosiaris updated the task description for T198901: Migrate production services to kubernetes using the pipeline.
Wed, Jun 5, 6:41 PM · Release-Engineering-Team-TODO, Core Platform Team Backlog (Watching / External), Epic, Services (watching), Operations, Release Pipeline

Tue, Jun 4

akosiaris added a comment to T199219: WDQS should use internal endpoint to communicate to Wikidata.

There's a change though that WDQS no longer uses nocache for cache-busting in most common cases (see T217897 for more details). So I am not sure using internal endpoint now makes sense.

Tue, Jun 4, 8:24 PM · Wikidata, Wikidata-Query-Service
akosiaris added a comment to T199219: WDQS should use internal endpoint to communicate to Wikidata.

@BBlack I am getting rather strange results with appservers-ro.discovery.wmnet - if I call the URL you provided, the call takes a lot of time:

real 0m4.270s

while if I call to www.wikidata.org, I get:

real 0m0.127s

Same with api-ro. appservers-rw is a bit faster:

real 0m0.320s

But that is still 3x slower than going through the frontend (and it's not caching - I changed the URL, the result is the same, and varnish settings all say "miss").

Tue, Jun 4, 4:04 PM · Wikidata, Wikidata-Query-Service
akosiaris closed T220401: Introduce kask session storage service to kubernetes as Resolved.

And LVS done today.

Tue, Jun 4, 3:34 PM · Patch-For-Review, Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris closed T220401: Introduce kask session storage service to kubernetes, a subtask of T220398: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline, as Resolved.
Tue, Jun 4, 3:34 PM · Core Platform Team Backlog (Watching / External), Services (watching), Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris committed rDEPLOYCHARTSe30c4a531ea8: kask: Actually ship affinity correctly (authored by akosiaris).
kask: Actually ship affinity correctly
Tue, Jun 4, 3:27 PM
akosiaris moved T210861: OTRS exposes session cookie in URLs from Pending patch / update to Resolved on the OTRS board.
Tue, Jun 4, 8:08 AM · Upstream, OTRS, Security
akosiaris added a comment to T210861: OTRS exposes session cookie in URLs.

Patch has been applied to our own packages and has been deployed and tested. Marking this as resolved, thanks!

Tue, Jun 4, 8:04 AM · Upstream, OTRS, Security
akosiaris closed T210861: OTRS exposes session cookie in URLs as Resolved.
Tue, Jun 4, 8:04 AM · Upstream, OTRS, Security
akosiaris added a comment to T210861: OTRS exposes session cookie in URLs.

Fix by upstream in https://github.com/OTRS/otrs/commit/7ab33e51a4db9f712e979040f644d0d0c39ff0af for 5.x (which we run). Has also been fixed in our package for OTRS in https://gerrit.wikimedia.org/r/#/c/operations/software/otrs/+/514230

Tue, Jun 4, 8:03 AM · Upstream, OTRS, Security

Mon, Jun 3

akosiaris added a comment to T220401: Introduce kask session storage service to kubernetes.

One minor question. Given that, per T220401#5128786, one kask instance is able to handle ~300 req/s, how many instances will we require? I am unsure of the current rate of session requests to/from redis.

What was the test environment used there? When I tested using the sessionstore Cassandra cluster nodes, I got at least two orders of magnitude higher throughput.

An admittedly underpowered minikube environment with a probably untuned cassandra. Some values for cassandra itself are in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/kask/values.yaml#114. It makes absolute sense that a well-tuned and more powerful cassandra cluster would be able to serve more req/s.

Now, to answer my question, and by looking at T221292, I'll assume a single instance for production should be able to serve some 30k req/s (I am rounding down from the lowest score in that table just to be on the safe side). So 1 instance would probably not cover it; we would need at least 2 instances. Adding 2x rack row redundancy means 4 instances. Looks like that's our number for now. We can always increase it, of course.

FWIW, we've been bouncing around a target throughput of 30k/sec in production based on Redis metrics, but as was later noted in T212129, that number includes everything in Mainstash, only a fraction of which is sessions (we're moving sessions over separately from the rest). IOW, sessions should be something considerably less than 30k/s, even if we don't know exactly what.

Mon, Jun 3, 2:05 PM · Patch-For-Review, Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
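
The instance-count reasoning in the comment above can be sketched as a small calculation. This is only an illustration, not the method used in the task: the function name and the 1.5x headroom factor are assumptions (the thread only says that one instance "would probably not cover it"), while the 30k req/s figures and the 2x rack row redundancy come from the comment.

```python
import math


def instances_needed(target_rps, per_instance_rps, headroom=1.5, redundancy=2):
    """Estimate how many instances to run.

    headroom is a hypothetical safety margin (not stated in the thread);
    redundancy multiplies the result, e.g. 2 for two rack rows.
    """
    base = math.ceil(target_rps * headroom / per_instance_rps)
    return base * redundancy


# 30k req/s target, ~30k req/s per instance, 2x row redundancy
print(instances_needed(30_000, 30_000))
```

With these assumed inputs the sketch arrives at the same 4 instances mentioned in the comment. Since sessions are expected to be only a fraction of the 30k/s target, the real requirement may well be lower.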
akosiaris committed rDEPLOYCHARTS8c9d8c6c4eb2: Bump kask to 0.0.6 (authored by akosiaris).
Bump kask to 0.0.6
Mon, Jun 3, 12:27 PM
akosiaris committed rDEPLOYCHARTS7c7523a9deb1: kask: Add affinity/tolerations headings (authored by akosiaris).
kask: Add affinity/tolerations headings
Mon, Jun 3, 12:27 PM

Sat, Jun 1

akosiaris added a comment to T220401: Introduce kask session storage service to kubernetes.

One minor question. Given that, per T220401#5128786, one kask instance is able to handle ~300 req/s, how many instances will we require? I am unsure of the current rate of session requests to/from redis.

What was the test environment used there? When I tested using the sessionstore Cassandra cluster nodes, I got at least two orders of magnitude higher throughput.

Sat, Jun 1, 10:07 AM · Patch-For-Review, Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team

Fri, May 31

akosiaris added a comment to T220401: Introduce kask session storage service to kubernetes.

One minor question. Given that, per T220401#5128786, one kask instance is able to handle ~300 req/s, how many instances will we require? I am unsure of the current rate of session requests to/from redis.

Fri, May 31, 4:50 PM · Patch-For-Review, Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris added a comment to T224562: Decommission darmstadtium.

I think so, let's wait for @fsero though

Fri, May 31, 4:45 PM · Operations, Kubernetes
akosiaris added a comment to T220401: Introduce kask session storage service to kubernetes.

And this uncovered now that prometheus can't talk to it (cause it expects HTTP I guess?). /me looking into it (more deeply this time around).

Fri, May 31, 4:25 PM · Patch-For-Review, Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris committed rDEPLOYCHARTSdce38b7a963c: Bump kask version to 0.0.5 (authored by akosiaris).
Bump kask version to 0.0.5
Fri, May 31, 4:21 PM
akosiaris committed rDEPLOYCHARTS8c9a98c92414: kask: prometheus scraping over HTTPS if TLS enabled (authored by akosiaris).
kask: prometheus scraping over HTTPS if TLS enabled
Fri, May 31, 4:21 PM
akosiaris committed rDEPLOYCHARTS50ccd6c54964: Fix typo in initialize_service.sh (authored by akosiaris).
Fix typo in initialize_service.sh
Fri, May 31, 4:21 PM
akosiaris committed rDEPLOYCHARTS635e34722894: kask: Fix TLS certs checks (authored by akosiaris).
kask: Fix TLS certs checks
Fri, May 31, 4:21 PM
akosiaris added a comment to T220401: Introduce kask session storage service to kubernetes.

One thing that I just ran into is that kask stops accepting HTTP connections if a kask cert/key pair is configured. That's fine normally, but there is a very interesting repercussion: Kubernetes readiness probes to the /healthz endpoint now fail. kask logs:

2019/05/31 09:36:00 http: TLS handshake error from 10.64.0.247:55194: tls: first record does not look like a TLS handshake

Fri, May 31, 3:37 PM · Patch-For-Review, Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
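
The failure described above is what happens when a plain-HTTP probe hits a TLS-only listener. One common way to handle it, sketched here and not necessarily what the kask chart does, is to switch the probe's scheme based on whether TLS is enabled. The helper name and the port number are hypothetical; the `httpGet` fields (`path`, `port`, `scheme`) are standard Kubernetes probe fields, built here as a plain dictionary.

```python
def readiness_probe(tls_enabled, port=8080, path="/healthz"):
    """Build a Kubernetes readinessProbe spec as a plain dict.

    port=8080 is a placeholder, not kask's actual port.
    When TLS is on, the probe must speak HTTPS, otherwise the server
    sees a plain-HTTP request where it expects a TLS handshake.
    """
    scheme = "HTTPS" if tls_enabled else "HTTP"
    return {"httpGet": {"path": path, "port": port, "scheme": scheme}}


print(readiness_probe(tls_enabled=True))
```

For HTTPS probes the kubelet skips certificate verification, so a probe built this way should not depend on the probing side trusting the service certificate.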
akosiaris added a comment to T220401: Introduce kask session storage service to kubernetes.

One thing that I just ran into is that kask stops accepting HTTP connections if a kask cert/key pair is configured. That's fine normally, but there is a very interesting repercussion: Kubernetes readiness probes to the /healthz endpoint now fail. kask logs

Fri, May 31, 10:21 AM · Patch-For-Review, Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris added a comment to T224569: Migrate ORES Redis servers to Stretch/Buster.

We might be able to get away with reusing the redis misc servers (rdb1005/rdb1009). That should give us more memory and allow us to use the empty databases on those hosts and deprecate/remove these VMs.

Fri, May 31, 9:48 AM · Scoring-platform-team, ORES, serviceops, Operations

Thu, May 30

akosiaris added a comment to T198901: Migrate production services to kubernetes using the pipeline.

Indeed. I added it under TBD. The exact way this will be done will need to be investigated.

Thu, May 30, 7:19 PM · Release-Engineering-Team-TODO, Core Platform Team Backlog (Watching / External), Epic, Services (watching), Operations, Release Pipeline
akosiaris updated the task description for T198901: Migrate production services to kubernetes using the pipeline.
Thu, May 30, 7:19 PM · Release-Engineering-Team-TODO, Core Platform Team Backlog (Watching / External), Epic, Services (watching), Operations, Release Pipeline
akosiaris updated the task description for T198901: Migrate production services to kubernetes using the pipeline.
Thu, May 30, 7:18 PM · Release-Engineering-Team-TODO, Core Platform Team Backlog (Watching / External), Epic, Services (watching), Operations, Release Pipeline
akosiaris added a comment to T198901: Migrate production services to kubernetes using the pipeline.

Is apertium part of the cxserver migration?

Thu, May 30, 7:11 PM · Release-Engineering-Team-TODO, Core Platform Team Backlog (Watching / External), Epic, Services (watching), Operations, Release Pipeline
akosiaris committed rDEPLOYCHARTS4c3ec3de3963: Fix TLS support for kask and cassandra connections (authored by akosiaris).
Fix TLS support for kask and cassandra connections
Thu, May 30, 3:51 PM
akosiaris committed rDEPLOYCHARTS6d355ddeba51: Package and publish kask 0.0.4 (authored by akosiaris).
Package and publish kask 0.0.4
Thu, May 30, 3:51 PM
akosiaris committed rDEPLOYCHARTS61edf897d37a: Fix TLS support for kask and cassandra connections (authored by akosiaris).
Fix TLS support for kask and cassandra connections
Thu, May 30, 3:47 PM
akosiaris committed rDEPLOYCHARTS5566aedc8578: Package and publish kask 0.0.4 (authored by akosiaris).
Package and publish kask 0.0.4
Thu, May 30, 3:47 PM
akosiaris committed rDEPLOYCHARTSd7119070ad4c: Package and publish kask 0.0.4 (authored by akosiaris).
Package and publish kask 0.0.4
Thu, May 30, 3:08 PM
akosiaris committed rDEPLOYCHARTS65c9ee36fb12: Fix TLS support for kask and cassandra connections (authored by akosiaris).
Fix TLS support for kask and cassandra connections
Thu, May 30, 3:08 PM

Wed, May 29

akosiaris closed T224556: Migrate etcd ganeti VMs to plain disk template as Resolved.

https://wikitech.wikimedia.org/wiki/Ganeti#VMs_without_DRBD_disk_template has been added to document and communicate the drawback.

Wed, May 29, 11:02 AM · Operations
akosiaris updated the task description for T224556: Migrate etcd ganeti VMs to plain disk template.
Wed, May 29, 11:01 AM · Operations
akosiaris created T224556: Migrate etcd ganeti VMs to plain disk template.
Wed, May 29, 11:01 AM · Operations

Tue, May 28

akosiaris moved T210861: OTRS exposes session cookie in URLs from Incoming to Pending patch / update on the OTRS board.
Tue, May 28, 2:44 PM · Upstream, OTRS, Security
akosiaris added a comment to T210861: OTRS exposes session cookie in URLs.

Hidden and marked as security by upstream.

Tue, May 28, 2:43 PM · Upstream, OTRS, Security
akosiaris added a comment to T210861: OTRS exposes session cookie in URLs.

Upstream bug at https://bugs.otrs.org/show_bug.cgi?id=14568

Tue, May 28, 12:54 PM · Upstream, OTRS, Security
akosiaris closed T224404: OTRS Stewards queue: Auto-response not working when using meta Special:Contact/Stewards as Resolved.
Tue, May 28, 12:14 PM · Patch-For-Review, MediaWiki-extensions-ContactPage, MediaWiki-extensions-WikimediaMaintenance, OTRS
akosiaris added a comment to T224404: OTRS Stewards queue: Auto-response not working when using meta Special:Contact/Stewards.

https://gerrit.wikimedia.org/r/512875 seems to have fixed this. Ticket #2019052810003632 as a demonstration. I'll resolve this as successful. @Reedy, thanks for correcting me, it really helped!

Tue, May 28, 12:14 PM · Patch-For-Review, MediaWiki-extensions-ContactPage, MediaWiki-extensions-WikimediaMaintenance, OTRS
akosiaris triaged T210861: OTRS exposes session cookie in URLs as Normal priority.

I just reproduced it. This is related to a sandboxed <iframe> that is embedded in the page to showcase the content of the email in a safe way. That being said, leaking session info in the url is indeed bad practice, due to all the copy pasting that can happen. In fact, just adding a valid OTRSAgentInterface to any OTRS url seems to be enough to allow assuming the OTRS identity of a user. On the plus side, those aren't guessable at least. The fun part is that

Tue, May 28, 10:49 AM · Upstream, OTRS, Security
akosiaris moved T222171: OTRS security update: CVE-2019-9892 CVE-2019-10067 from Incoming to Resolved on the OTRS board.
Tue, May 28, 10:20 AM · OTRS, Security
akosiaris changed the visibility for T222171: OTRS security update: CVE-2019-9892 CVE-2019-10067.
Tue, May 28, 10:20 AM · OTRS, Security
akosiaris closed T222171: OTRS security update: CVE-2019-9892 CVE-2019-10067 as Resolved.

We've upgraded to 5.0.35. Resolving this. Thanks!

Tue, May 28, 10:20 AM · OTRS, Security
akosiaris moved T224404: OTRS Stewards queue: Auto-response not working when using meta Special:Contact/Stewards from Resolved to Pending patch / update on the OTRS board.
Tue, May 28, 10:10 AM · Patch-For-Review, MediaWiki-extensions-ContactPage, MediaWiki-extensions-WikimediaMaintenance, OTRS
akosiaris moved T224404: OTRS Stewards queue: Auto-response not working when using meta Special:Contact/Stewards from Incoming to Resolved on the OTRS board.
Tue, May 28, 10:10 AM · Patch-For-Review, MediaWiki-extensions-ContactPage, MediaWiki-extensions-WikimediaMaintenance, OTRS
akosiaris added a comment to T224404: OTRS Stewards queue: Auto-response not working when using meta Special:Contact/Stewards.

https://meta.wikimedia.org/wiki/Special:Contact/Stewards is used in the Wikimedia Cluster and as such is using the WikimediaMaintenance extension [1] which per [2] sets the SMTP Precedence header.

I note WikimediaMaintenance is just a repo full of WMF specific maintenance scripts...

ContactPage does not use sendBulkEmail.php from that repo, at all.

Tue, May 28, 9:33 AM · Patch-For-Review, MediaWiki-extensions-ContactPage, MediaWiki-extensions-WikimediaMaintenance, OTRS

Mon, May 27

akosiaris added a comment to T224404: OTRS Stewards queue: Auto-response not working when using meta Special:Contact/Stewards.

That would split the configuration even worse, though. Instead of having it at least in different but related parts of the infrastructure, it would then be spread across different infrastructures (mediawiki vs email).

Mon, May 27, 4:45 PM · Patch-For-Review, MediaWiki-extensions-ContactPage, MediaWiki-extensions-WikimediaMaintenance, OTRS
akosiaris committed rDEPLOYCHARTSfb5c4e666de4: Add log_level, tls, openapi config options (authored by akosiaris).
Add log_level, tls, openapi config options
Mon, May 27, 1:26 PM
akosiaris triaged T224404: OTRS Stewards queue: Auto-response not working when using meta Special:Contact/Stewards as Low priority.
Mon, May 27, 8:49 AM · Patch-For-Review, MediaWiki-extensions-ContactPage, MediaWiki-extensions-WikimediaMaintenance, OTRS
akosiaris created T224404: OTRS Stewards queue: Auto-response not working when using meta Special:Contact/Stewards.
Mon, May 27, 8:49 AM · Patch-For-Review, MediaWiki-extensions-ContactPage, MediaWiki-extensions-WikimediaMaintenance, OTRS

Wed, May 22

akosiaris added a comment to T224041: Kask functional testing with Cassandra via the Deployment Pipeline.

It seems that the cassandra subchart already exists for kask (via https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/509102/ );

Wed, May 22, 1:42 PM · Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris lowered the priority of T221577: Wikimedia\Rdbms\LBFactory::getEmptyTransactionTicket: LinksUpdate does not have outer scope from Unbreak Now! to High.

I am lowering to High, just in the interest of not abusing Unbreak Now!, since this task has been in this state since Apr 23. That being said, this indeed needs to be resolved ASAP.

Wed, May 22, 12:59 PM · MW-1.34-notes (1.34.0-wmf.7; 2019-05-28), Wikidata, Wikidata-Campsite, Patch-For-Review, Performance-Team, Multimedia, MediaWiki-Database, GlobalUsage, MediaWiki-extensions-PageAssessments, Discovery-Search, GeoData, Wikimedia-production-error

Sun, May 19

akosiaris reopened Restricted Task, a subtask of T218750: Re-enable use of Gerrit HTTP token to push patchsets, as Open.
Sun, May 19, 7:41 AM · VPS-project-libraryupgrader, Release-Engineering-Team, Gerrit

May 17 2019

akosiaris awarded T220894: Replacement of network::constant's special_hosts a Yellow Medal token.
May 17 2019, 2:34 PM · Patch-For-Review, Operations

May 16 2019

akosiaris updated the task description for T220894: Replacement of network::constant's special_hosts.
May 16 2019, 11:37 AM · Patch-For-Review, Operations
akosiaris added a comment to T223395: Cxserver container: Container does not send fatal errors to docker logs via stdout?.

needs an extra stanza

May 16 2019, 10:08 AM · CX-cxserver, Beta-Cluster-reproducible, Services (next), serviceops
akosiaris updated the task description for T220894: Replacement of network::constant's special_hosts.
May 16 2019, 9:05 AM · Patch-For-Review, Operations
akosiaris closed T220709: Upgrade statsd_exporter to 0.9 as Resolved.

Every deployment in kubernetes that uses statsd-exporter (zotero & blubberoid don't) has been upgraded. Resolving this. Many thanks!

May 16 2019, 8:54 AM · Patch-For-Review, Core Platform Team Backlog (Watching / External), Services (watching), Analytics, EventBus, observability, User-fgiunchedi, Operations

May 15 2019

akosiaris created P8527 hetzner IP blocks per https://ipinfo.io/AS24940.
May 15 2019, 1:06 PM
akosiaris committed rDEPLOYCHARTSc1786e208d8c: First draft of a wikibase-termbox chart (authored by akosiaris).
First draft of a wikibase-termbox chart
May 15 2019, 11:25 AM
akosiaris committed rDEPLOYCHARTSde0b78535858: Introduce the wikibase-termbox chart (authored by akosiaris).
Introduce the wikibase-termbox chart
May 15 2019, 11:25 AM
akosiaris added a comment to T220235: Migrate Beta cluster services to use Kubernetes.

krenair@deployment-docker-cxserver01:~$ sudo /usr/bin/docker run -p 8080:8080 -v /etc/mediawiki-services-cxserver/:/etc/mediawiki-services-cxserver --name alex-test docker-registry.wikimedia.org/wikimedia/mediawiki-services-cxserver:2019-05-08-064536-production -c /etc/mediawiki-services-cxserver/config.yaml
Error during DHT setup undefined
krenair@deployment-docker-cxserver01:~$

Anyone know what that means?

May 15 2019, 8:18 AM · Editing-team, Core Platform Team Backlog (Next), Services (next), Kubernetes, Release Pipeline, serviceops, Beta-Cluster-Infrastructure
akosiaris added a comment to T223345: Zotero container: Production is running candidate version, last production version is broken due to lack of ca-certificates package.

It's working in production because we connect to external URIs via a proxy, hence we don't need ca-certificates.

May 15 2019, 8:14 AM · Core Platform Team Backlog (Watching / External), Beta-Cluster-reproducible, Editing-team, Services (next), serviceops

May 14 2019

akosiaris added a comment to T220402: Introduce wikidata termbox SSR to kubernetes.

Hi @akosiaris,

thanks for taking the time to explain the way the Host header is intended to be used.
If I understand correctly, the goal is to ensure that requests originating from our service bear a header Host: (www.)wikidata.org and reach whichever IP(s) appservers.discovery.wmnet resolves to on the system running it. This sounds like a name resolution challenge and a case for HostAliases or, more traditionally, a CNAME record.

May 14 2019, 12:03 PM · Patch-For-Review, Core Platform Team Backlog (Watching / External), Services (watching), Wikidata-Termbox-Hike, Wikidata, Release Pipeline, Operations, serviceops, Release-Engineering-Team
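
The Host header approach discussed above can be illustrated with a minimal sketch: the connection targets the internal discovery name while the Host header names the public project. The helper name, the https scheme, and the /w/api.php path are assumptions for illustration only; the request is built but not sent, so nothing here depends on network access.

```python
import urllib.request


def build_request(internal_host, public_host, path="/w/api.php"):
    """Build a request that connects to internal_host but asks for public_host.

    The path and the https scheme are placeholders, not taken from the task.
    """
    url = f"https://{internal_host}{path}"
    # An explicit Host header overrides the one derived from the URL.
    return urllib.request.Request(url, headers={"Host": public_host})


req = build_request("appservers.discovery.wmnet", "www.wikidata.org")
print(req.full_url, req.get_header("Host"))
```

The alternative raised in the comment (HostAliases or a CNAME) would instead keep www.wikidata.org in the URL and change what that name resolves to inside the pod, so the service code would not need to set the header itself.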
akosiaris added a comment to T220402: Introduce wikidata termbox SSR to kubernetes.

Hi @akosiaris - thanks for getting back to us.

sending a Host: HTTP for the identification of the exact project. Would it be possible to add that functionality?

It certainly is possible and depending on operational needs we certainly can make this happen. We quickly discussed this in the team and would like to first truly understand the goal to make sure we don't mix up the different layers of our proverbial sausage pizza without a valid reason. The service would be run in a container inside a k8s pod, controlling its DNS - why not use this option to make sure requests reach the intended endpoint? The correct host would then come "for free" per the host part configured in WIKIBASE_REPO.

May 14 2019, 10:12 AM · Patch-For-Review, Core Platform Team Backlog (Watching / External), Services (watching), Wikidata-Termbox-Hike, Wikidata, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris added a comment to T220402: Introduce wikidata termbox SSR to kubernetes.

With respect to the end point checks it would be great to hear what we are trying to achieve with them. Our service depends on the availability of another service. If the examples are to act as smoke tests then their reliability depends on the upstream service; a dependency which would need to be configured (are we going to point it against prod for this?) & modeled (how to express service inter-dependency in the config?) in order to be able to make sense of the information down the line (i.e. "no need to be alarmed that this service reported 500 while the mw api was down").

May 14 2019, 10:01 AM · Patch-For-Review, Core Platform Team Backlog (Watching / External), Services (watching), Wikidata-Termbox-Hike, Wikidata, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris closed T222899: Set up LVS for eventgate-main on port 32192 as Resolved.
May 14 2019, 7:57 AM · Analytics-Kanban, Patch-For-Review, serviceops, Core Platform Team Backlog (Watching / External), Services (watching), EventBus, Analytics
akosiaris closed T222899: Set up LVS for eventgate-main on port 32192, a subtask of T218346: Modern Event Platform: Deploy instance of EventGate service that produces events to kafka main, as Resolved.
May 14 2019, 7:57 AM · serviceops, Patch-For-Review, Analytics-Kanban, Core Platform Team Backlog (Watching / External), Services (watching), Analytics-EventLogging, EventBus, Analytics
akosiaris added a comment to T220235: Migrate Beta cluster services to use Kubernetes.

Could we use image version: latest in beta hiera? And somehow pull down the new latest and restart the image whenever a new version is created and uploaded to the registry?

Sure, you just have to add a more complex script to exec to service::docker I guess, so that you can properly check that 'latest' or any other similar metatag are correctly respected.

Be my guest, I'll happily review the change!

May 14 2019, 7:50 AM · Editing-team, Core Platform Team Backlog (Next), Services (next), Kubernetes, Release Pipeline, serviceops, Beta-Cluster-Infrastructure

May 13 2019

akosiaris added a comment to T223126: Install new PDUs into b5-eqiad.

Bacula & puppet databases are not going to exhibit any problems anyway. Puppet is literally used only by servermon, which is to be uninstalled pretty soon, and backups don't happen during that time window.
Etherpad, given the software, is a best-effort service, so no guarantees there. It will probably crash anyway, be restarted by systemd (as it does every couple of days anyway), and users will be reconnected.

May 13 2019, 4:55 PM · ops-eqiad, Operations
akosiaris added a comment to T222962: Use new eventgate chart release analytics for eventgate-analytics service..

Hm, question. Currently mediawiki-config ProductionServices.php has:

'eventgate-analytics' => 'http://eventgate-analytics.discovery.wmnet:31192',

If we change LVS port to 33192, this will just change the LVS monitoring/pooling/depooling, right?

May 13 2019, 1:56 PM · Patch-For-Review, serviceops, Analytics-Kanban, Services (watching), EventBus, Analytics
akosiaris committed rDEPLOYCHARTSda66bfb82706: Actually rename the GC metric for eventgate (authored by akosiaris).
Actually rename the GC metric for eventgate
May 13 2019, 1:47 PM
Gerrit Code Review <gerrit@wikimedia.org> committed rDEPLOYCHARTS2ac6149c1fe9: Merge "eventgate: Switch GC metric to microseconds, update buckets" (authored by akosiaris).
Merge "eventgate: Switch GC metric to microseconds, update buckets"
May 13 2019, 1:33 PM
akosiaris committed rDEPLOYCHARTS83da86e1d472: Add initialize_service.sh tool (authored by akosiaris).
Add initialize_service.sh tool
May 13 2019, 1:29 PM
akosiaris committed rDEPLOYCHARTS33070727823c: eventgate: Switch GC metric to microseconds, update buckets (authored by akosiaris).
eventgate: Switch GC metric to microseconds, update buckets
May 13 2019, 1:24 PM
akosiaris committed rDEPLOYCHARTS629ee5835c7c: cxserver: Switch GC stats back to microseconds (authored by akosiaris).
cxserver: Switch GC stats back to microseconds
May 13 2019, 1:05 PM
akosiaris committed rDEPLOYCHARTS0a844a695e1f: cxserver: Switch GC stats back to microseconds (authored by akosiaris).
cxserver: Switch GC stats back to microseconds
May 13 2019, 12:49 PM
akosiaris added a comment to T222795: Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats.

Thank you for an impressive level of details :) There's a bunch of other places where we abuse the timing metric within services exactly for the reason that we've needed to have percentiles, so the decision we make here should probably be adopted elsewhere.

May 13 2019, 12:26 PM · Patch-For-Review, Services (later), service-runner, serviceops, Operations
akosiaris added a comment to T172333: Scap: keyholder Too many authentication failures.

https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/377269/ had fallen through the cracks. It's now merged, right before a SWAT window, in order to identify issues as fast as possible.

May 13 2019, 10:52 AM · RelEng-Archive-FY201718-Q1, Patch-For-Review, Scap

May 10 2019

akosiaris committed rDEPLOYCHARTS37e07d1c0d6b: First draft of a wikibase-termbox chart (authored by akosiaris).
First draft of a wikibase-termbox chart
May 10 2019, 2:22 PM
akosiaris added a comment to T220402: Introduce wikidata termbox SSR to kubernetes.

@Tarrow, @WMDE-leszek I've noticed 3 things while working on the above

May 10 2019, 1:56 PM · Patch-For-Review, Core Platform Team Backlog (Watching / External), Services (watching), Wikidata-Termbox-Hike, Wikidata, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris committed rDEPLOYCHARTSb36eef9f0f37: First draft of a wikibase-termbox chart (authored by akosiaris).
First draft of a wikibase-termbox chart
May 10 2019, 1:21 PM