akosiaris (Alexandros Kosiaris)
Site Reliability Engineer, Administrator

User Details

User Since: Oct 3 2014, 8:40 AM (320 w, 6 d)
Roles: Administrator
Availability: Available
IRC Nick: akosiaris
LDAP User: Alexandros Kosiaris
MediaWiki User: AKosiaris (WMF)

Recent Activity

Today

akosiaris added a comment to T229397: Puppet: get row/rack info from Netbox.

A larger scope could be to look at all the IPs hardcoded in Puppet and see if it would make sense to import them from Netbox?
Same for prefixes I guess.

Thu, Nov 26, 11:02 AM · observability, User-crusnov, User-jbond, Patch-For-Review, Puppet, Operations

Yesterday

akosiaris added a comment to T265512: Set up Pipeline Configuration in WDQS repo.
Wed, Nov 25, 2:49 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
akosiaris edited Description on vm-requests.
Wed, Nov 25, 2:28 PM
akosiaris edited Description on vm-requests.
Wed, Nov 25, 2:23 PM
akosiaris renamed T268747: codfw: 4 VM request for kubernetes staging from Site: 4 VM request for kubernetes staging in codfw to codfw: 4 VM request for kubernetes staging.
Wed, Nov 25, 2:05 PM · Kubernetes, vm-requests, Operations
People empowered akosiaris as an administrator.
Wed, Nov 25, 1:50 PM
akosiaris triaged T268747: codfw: 4 VM request for kubernetes staging as Medium priority.
Wed, Nov 25, 1:43 PM · Kubernetes, vm-requests, Operations
akosiaris added a subtask for T244335: Upgrade kubernetes clusters to a security supported (LTS) version: T268747: codfw: 4 VM request for kubernetes staging.
Wed, Nov 25, 1:42 PM · Kubernetes, Prod-Kubernetes, serviceops
akosiaris added a parent task for T268747: codfw: 4 VM request for kubernetes staging: T244335: Upgrade kubernetes clusters to a security supported (LTS) version.
Wed, Nov 25, 1:42 PM · Kubernetes, vm-requests, Operations
akosiaris created T268747: codfw: 4 VM request for kubernetes staging.
Wed, Nov 25, 1:42 PM · Kubernetes, vm-requests, Operations

Tue, Nov 24

akosiaris added a comment to T268612: Docker image on the build host seem to ignore apt priority for wikimedia packages.

Ouch.

Could we normalize everything to use the public image reference? That would also make local testing easier and more straightforward.
Or do we gain a bit of benefit by using the internal reference?

The point is consistency. We want to use the same registry when referencing images and saving them.

Sure. My question is more like: Why did we start using both names in the first place, and can we stop doing so? :)

Tue, Nov 24, 12:41 PM · docker-pkg, serviceops, Operations

Mon, Nov 23

akosiaris updated the task description for T268505: New database request: sockpuppet.
Mon, Nov 23, 5:31 PM · DBA
akosiaris updated subscribers of T265512: Set up Pipeline Configuration in WDQS repo.

@akosiaris it was unclear to me whether we need the promote section in the pipeline config. I'm referring to this: https://wikitech.wikimedia.org/wiki/PipelineLib/Reference#Promote and I saw it in a couple of configs here: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/mathoid/+/refs/heads/master/.pipeline/config.yaml#34.

Mon, Nov 23, 3:41 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Fri, Nov 20

akosiaris added a comment to T242855: Undeploy graphoid .

What is the status of the decommissioning of Graphoid?

Fri, Nov 20, 10:48 AM · MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Platform Engineering (Icebox), serviceops, Operations, Graphoid
akosiaris updated the task description for T198901: Migrate production services to kubernetes using the pipeline.
Fri, Nov 20, 7:50 AM · Release-Engineering-Team (Pipeline), Release-Engineering-Team-TODO, Platform Team Legacy (Watching / External), Epic, Services (watching), Operations, Release Pipeline
akosiaris closed T182331: [Epic] Deploy ORES in kubernetes cluster as Declined.

Dependent T210268 and T210269 have been declined, declining this as well. See T210268#6488834 for the reasoning.

Fri, Nov 20, 7:49 AM · Operations, ORES, Machine Learning Platform
akosiaris closed T182331: [Epic] Deploy ORES in kubernetes cluster, a subtask of T198901: Migrate production services to kubernetes using the pipeline, as Declined.
Fri, Nov 20, 7:49 AM · Release-Engineering-Team (Pipeline), Release-Engineering-Team-TODO, Platform Team Legacy (Watching / External), Epic, Services (watching), Operations, Release Pipeline

Thu, Nov 19

akosiaris added a comment to T268202: Eq: 5 VM request for kafka-test-eqiad cluster.

OK then. +1 from my side (and my role as a rubber-stamper is done here). Feel free to create those VMs. Docs, if you need them, are at https://wikitech.wikimedia.org/wiki/Ganeti#Create_a_VM

Thu, Nov 19, 2:33 PM · Patch-For-Review, vm-requests, Operations
akosiaris added a comment to T268202: Eq: 5 VM request for kafka-test-eqiad cluster.

Just to verify, total is 20 vCPUs, 40GB RAM and 500GB disk space?

Thu, Nov 19, 2:15 PM · Patch-For-Review, vm-requests, Operations
akosiaris added a comment to T268202: Eq: 5 VM request for kafka-test-eqiad cluster.

Just to verify, total is 20 vCPUs, 40GB RAM and 500GB disk space?

Thu, Nov 19, 1:32 PM · Patch-For-Review, vm-requests, Operations
akosiaris closed T241230: Migrate recommendation-api to kubernetes as Resolved.

The service was deployed yesterday, and the traffic switch happened today. Per https://grafana.wikimedia.org/d/Y5wk80oGk/recommendation-api?orgId=1&var-dc=thanos&var-site=eqiad&var-service=recommendation-api&var-prometheus=k8s&var-container_name=All&from=now-3h&to=now traffic is now flowing to the kubernetes-based deployment (alas, there is no corresponding dashboard for the legacy infrastructure). There is some cleanup work left, but otherwise this is done. I am going to resolve it as successful, but feel free to reopen. Thanks to @bmansurov for working through getting the container created and the helm chart ready.

Thu, Nov 19, 9:59 AM · Product-Infrastructure-Team-Backlog, Patch-For-Review, serviceops, Release-Engineering-Team, Services, Recommendation-API
akosiaris closed T241230: Migrate recommendation-api to kubernetes, a subtask of T198901: Migrate production services to kubernetes using the pipeline, as Resolved.
Thu, Nov 19, 9:59 AM · Release-Engineering-Team (Pipeline), Release-Engineering-Team-TODO, Platform Team Legacy (Watching / External), Epic, Services (watching), Operations, Release Pipeline
akosiaris closed T241230: Migrate recommendation-api to kubernetes, a subtask of T248355: Archive mediawiki/services/recommendation-api/deploy once it's no longer used, as Resolved.
Thu, Nov 19, 9:59 AM · Patch-For-Review, Recommendation-API, Projects-Cleanup
akosiaris updated the task description for T241230: Migrate recommendation-api to kubernetes.
Thu, Nov 19, 9:56 AM · Product-Infrastructure-Team-Backlog, Patch-For-Review, serviceops, Release-Engineering-Team, Services, Recommendation-API
akosiaris closed T266373: Connection closed while downloading PDF of articles as Resolved.

I am going to resolve this per the comments above. Feel free to reopen. Many thanks to @BBlack and @CDanis for figuring this out.

Thu, Nov 19, 9:55 AM · Traffic, Readers-Web-Backlog (Tracking), Proton, Product-Infrastructure-Team-Backlog, serviceops, Operations, Desktop Improvements, Wikimedia-production-error
akosiaris updated subscribers of T267065: eqiad: Server moves to free up space on 10g racks.

For what it's worth

Thu, Nov 19, 9:37 AM · Platform Engineering, ops-eqiad, Operations, DC-Ops

Wed, Nov 18

akosiaris raised the priority of T266373: Connection closed while downloading PDF of articles from Low to High.

More and more duplicates are being merged into this one and stats from tests above suggest a mean rate of failures of ~20%, which is a lot. Bumping priority to High

Wed, Nov 18, 10:08 AM · Traffic, Readers-Web-Backlog (Tracking), Proton, Product-Infrastructure-Team-Backlog, serviceops, Operations, Desktop Improvements, Wikimedia-production-error

Mon, Nov 16

akosiaris added a comment to T266373: Connection closed while downloading PDF of articles.

@akosiaris More from debugging on this issue:

  • Querying the RESTBase service directly for a PDF render of an article doesn't reproduce the issue

Looking at the pastes above, I must say I don't see a direct RESTBase service (restbase.discovery.wmnet or restbase.svc.eqiad.wmnet or restbase.svc.codfw.wmnet) call. There is a single wget against the proton service, but I don't think that gives us enough data. However, I think we indeed need to test this more.

I used the same scripts pointing to restbase.svc.eqiad.wmnet and proton.svc.eqiad.wmnet but since it didn't show any issues I didn't paste the whole run output.

Where are all these run from, btw? Judging by the speeds, a local DSL?

Yes, a local cable connection, but I think I had similar results from the deployment node too.

Mon, Nov 16, 11:05 AM · Traffic, Readers-Web-Backlog (Tracking), Proton, Product-Infrastructure-Team-Backlog, serviceops, Operations, Desktop Improvements, Wikimedia-production-error

Wed, Nov 11

akosiaris added a comment to T266373: Connection closed while downloading PDF of articles.

A few more tests. The TL;DR is that Varnish 6 is probably at fault, but with a question mark.

Wed, Nov 11, 1:47 PM · Traffic, Readers-Web-Backlog (Tracking), Proton, Product-Infrastructure-Team-Backlog, serviceops, Operations, Desktop Improvements, Wikimedia-production-error

Tue, Nov 10

akosiaris added a comment to T266373: Connection closed while downloading PDF of articles.

Interestingly, proton returns transfer-encoding: chunked responses, which obviously don't have a Content-Length. So, for the internal service, cl-matches-bytes makes no sense and it's not there.
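
To illustrate why a chunked response has no Content-Length to compare against, here is a minimal sketch of decoding a chunked body. The sample bytes and the decode_chunked helper are made up for illustration and are not taken from proton or the task; chunk extensions are stripped and trailers are ignored. The total size is only known once the terminating zero-length chunk has been read.

```python
def decode_chunked(raw: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body (illustrative helper, trailers ignored)."""
    body = bytearray()
    pos = 0
    while True:
        # Each chunk starts with its size in hex, terminated by CRLF.
        line_end = raw.index(b"\r\n", pos)
        size = int(raw[pos:line_end].split(b";", 1)[0], 16)
        pos = line_end + 2
        if size == 0:
            break  # zero-length chunk marks the end of the body
        body += raw[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(body)


# Hypothetical wire data: two chunks followed by the terminating chunk.
wire = b"4\r\n%PDF\r\n6\r\n-1.4\n%\r\n0\r\n\r\n"
payload = decode_chunked(wire)
print(len(payload))  # the length is known only after decoding
```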

Tue, Nov 10, 8:54 PM · Traffic, Readers-Web-Backlog (Tracking), Proton, Product-Infrastructure-Team-Backlog, serviceops, Operations, Desktop Improvements, Wikimedia-production-error
akosiaris added a project to T266373: Connection closed while downloading PDF of articles: Traffic.

Given we, Product Infra, are not finding issues at our service level investigation, we're asking that SRE take a look as next steps in helping to resolve. @akosiaris would you be a good person to assign?

Tue, Nov 10, 8:48 PM · Traffic, Readers-Web-Backlog (Tracking), Proton, Product-Infrastructure-Team-Backlog, serviceops, Operations, Desktop Improvements, Wikimedia-production-error
akosiaris added a comment to T266373: Connection closed while downloading PDF of articles.

I've also run the same tests against restbase.svc.eqiad.wmnet in P13257 and I have the following

Tue, Nov 10, 8:44 PM · Traffic, Readers-Web-Backlog (Tracking), Proton, Product-Infrastructure-Team-Backlog, serviceops, Operations, Desktop Improvements, Wikimedia-production-error
akosiaris added a comment to T266373: Connection closed while downloading PDF of articles.

@akosiaris More from debugging on this issue:

  • Querying the RESTBase service directly for a PDF render of an article doesn't reproduce the issue
Tue, Nov 10, 5:06 PM · Traffic, Readers-Web-Backlog (Tracking), Proton, Product-Infrastructure-Team-Backlog, serviceops, Operations, Desktop Improvements, Wikimedia-production-error
akosiaris edited P13257 More pdf stats.
Tue, Nov 10, 2:19 PM · Proton
akosiaris edited P13257 More pdf stats.
Tue, Nov 10, 1:58 PM · Proton
akosiaris edited P13257 More pdf stats.
Tue, Nov 10, 1:22 PM · Proton
akosiaris created P13257 More pdf stats.
Tue, Nov 10, 1:19 PM · Proton
akosiaris added a comment to T265504: Create Blubberfile in WDQS repo.

@akosiaris I started using the new Java images that you uploaded. I wasn't able to install gpg in the build process. There are some conflicts. We can skip gpg verification of the Flink tar, but I don't think that's a good idea. I will continue to do some debugging.

Error message:

The following packages have unmet dependencies:
 gpg : Depends: gpgconf (= 2.2.12-1+deb10u1~bpo9+1) but it is not going to be installed
       Depends: libassuan0 (>= 2.5.0) but 2.4.3-2 is to be installed
       Depends: libgpg-error0 (>= 1.35) but 1.26-2 is to be installed
E: Unable to correct problems, you have held broken packages.
Tue, Nov 10, 12:01 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Mon, Nov 9

akosiaris closed T266730: AttributeError: 'Changelog' object has no attribute 'get_version' as Resolved.

Change merged, 2.1.0 is up for review at https://gerrit.wikimedia.org/r/c/operations/docker-images/docker-pkg/+/640268, I expect it to be released soon. I'll resolve this, feel free to reopen though.

Mon, Nov 9, 11:44 PM · docker-pkg
akosiaris added a comment to T263910: ORES redis: max number of clients reached....

@akosiaris Can you review it? I don't know enough about the nodes vs redis connection to intelligently review.

Mon, Nov 9, 10:23 AM · User-Ladsgroup, Sustainability (Incident Followup), Patch-For-Review, Okapi, serviceops, Operations, Machine Learning Platform, ORES
akosiaris closed T171157: Monitor internal CA expirations as Declined.

Setting to stalled until we decide what to actually do with the internal CA, as we're considering dropping it entirely in favour of other options.

@akosiaris / @faidon: Has this situation somehow changed by resolved T133717: Letsencrypt all the prod things we can - planning / T194962: Create and deploy a centralized letsencrypt service / Acme-chief (though I'm not sure if that also touched CA monitoring at all)?

Mon, Nov 9, 10:23 AM · observability, Operations

Fri, Nov 6

akosiaris added a comment to T266373: Connection closed while downloading PDF of articles.

So this is not specific to frwiki it seems. Is there perhaps some correlation between page size and failure rate? Or maybe some failure rate and response time?

Fri, Nov 6, 3:40 PM · Traffic, Readers-Web-Backlog (Tracking), Proton, Product-Infrastructure-Team-Backlog, serviceops, Operations, Desktop Improvements, Wikimedia-production-error
akosiaris added a comment to T267339: [EPIC] Address maps level of support issues.

That's excellent news @sdkim . Many thanks for this!

Fri, Nov 6, 2:51 PM · Product-Infrastructure-Team-Backlog, Maps (Kartotherian)

Thu, Nov 5

akosiaris added a comment to T267339: [EPIC] Address maps level of support issues.

Some of the above items are optional (e.g. cookbooks if nothing is done often and is automatable) but good to have.

Thu, Nov 5, 4:32 PM · Product-Infrastructure-Team-Backlog, Maps (Kartotherian)
akosiaris created T267339: [EPIC] Address maps level of support issues.
Thu, Nov 5, 4:30 PM · Product-Infrastructure-Team-Backlog, Maps (Kartotherian)

Wed, Nov 4

akosiaris added a comment to T244335: Upgrade kubernetes clusters to a security supported (LTS) version.

We are not able to go 1.19 because of calico only supporting 1.18

Looks like this isn't true. Judging from https://github.com/projectcalico/calico/commit/21a45a4a141fff03b251fde2f1ab77fbb0c903ee#diff-f386c272afd3d855bf9f1d3609d1782962951258a58e2b298df60c70b16517ee, the calico 3.16 requirements page will be updated soon.

Not sure about that. The commit is from 2nd Sept. and never made it to the 3.16 release branch. I would guess it's for 3.17.

Wed, Nov 4, 10:59 AM · Kubernetes, Prod-Kubernetes, serviceops

Tue, Nov 3

akosiaris added a comment to T265504: Create Blubberfile in WDQS repo.

@akosiaris when you get some time, can you please take another look at https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/635074

Tue, Nov 3, 5:30 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
akosiaris added a comment to T244335: Upgrade kubernetes clusters to a security supported (LTS) version.

We are not able to go 1.19 because of calico only supporting 1.18

Tue, Nov 3, 5:26 PM · Kubernetes, Prod-Kubernetes, serviceops

Mon, Nov 2

akosiaris added a comment to T266479: Puppet Proposal to remove require_package.

The idea was indeed to just make sure that the packages are installed before anything else in the class happens. These days, if one puts ensure_packages() at the top of the manifest, we have that. So we can indeed probably move away from require_package. However, the bad thing with all of this is that the migration is untestable. The relationships aren't exercised during catalog compilation but rather during catalog application by the agent, which we don't have any decent way of testing :-(. Of course, the worst that can happen is that we regress to having to run puppet more than once during reimaging of a host.

Mon, Nov 2, 2:42 PM · Patch-For-Review, Operations, Puppet

Fri, Oct 30

akosiaris closed T204907: Scap is checking canary servers in dormant instead of active-dc as Resolved.

This was done, resolving.

Fri, Oct 30, 5:16 PM · Patch-Needs-Improvement, Sustainability (Incident Followup), Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, Operations, Datacenter-Switchover, Scap
akosiaris added a comment to T265876: Logging options for apache httpd in k8s.

Couple of points

Fri, Oct 30, 4:43 PM · observability, Operations, serviceops, MW-on-K8s
akosiaris added a comment to T266766: Build new kubernetes packages.

With Kubernetes 1.19.3 things changed a bit and we now need a docker version supporting the --platform flag for FROM.
The envoy builder host would support that but lacks enough space on /var/lib/docker to hold all the intermediate images used for build.

Fri, Oct 30, 2:55 PM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops

Thu, Oct 29

akosiaris closed T264710: Host static sites on kubernetes as Invalid.

Sounds like a fine solution from our side for now.
I'll let serviceops do with this ticket as they wish (keep it or close it) and I'll get on to creating some tickets for:

Thu, Oct 29, 4:28 PM · Wikidata Query UI, Wikidata Query Builder, serviceops, Wikidata, User-Addshore

Oct 27 2020

akosiaris placed T177371: Phase out DSA keys for SSH access (ssh-dss) up for grabs.
Oct 27 2020, 4:58 PM · Operations
akosiaris added a comment to T266373: Connection closed while downloading PDF of articles.

See also this Grafana dashboard showing increase of daily PDF rendering by Proton from 80k to 20k, since beginning of August.

Looks like that's when Proton migrated to Kubernetes:
https://grafana.wikimedia.org/d/llIEd7MMz/proton?viewPanel=68&orgId=1&from=now-30d&to=now

The dashboard linked by @Framawiki is probably overdue for deletion.

Oct 27 2020, 1:21 PM · Traffic, Readers-Web-Backlog (Tracking), Proton, Product-Infrastructure-Team-Backlog, serviceops, Operations, Desktop Improvements, Wikimedia-production-error
akosiaris added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

I tried installing 6.0.2 on cp4032, and to my surprise I found out that 6.0.6 and 6.0.2 are not binary compatible:

Oct 27 2020, 10:39 AM · Performance-Team (Radar), Operations, Traffic
akosiaris added a comment to T265504: Create Blubberfile in WDQS repo.

Could you elaborate on that a bit?

Sure, here goes: We are using Apache Flink[1] as a platform for the event processing we do to feed Wikidata Query Service. We want to move the Flink deployment to Kubernetes, hence this ticket. Apache Flink provides its own docker image[2] which, in other circumstances, we would build upon. What @Mstyles is doing now is basically replaying the work the original Flink contributors did for their docker image, which, according to our current knowledge, is what we must do.
The actual docker file (with additional entry script) is here [3]. It would be great if we didn't need to make sure, with each Flink update, that we have covered everything that is handled there.

Oct 27 2020, 10:36 AM · Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
akosiaris added a comment to T265504: Create Blubberfile in WDQS repo.

@akosiaris I see, makes sense. I still would like to solve the issue with replicating the original dockerfile. Can we deploy Flink images to our registry, even if we'd need to fork the Flink docker repo?

Oct 27 2020, 10:05 AM · Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Oct 26 2020

akosiaris added a comment to T265504: Create Blubberfile in WDQS repo.

@akosiaris Can we base a blubber-enabled project on a 3rd party docker image provided on docker hub? I was wondering if we have to replicate the original dockerfile here (I'd rather base off their image to reduce future maintenance).

Oct 26 2020, 1:59 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Oct 23 2020

akosiaris added a comment to T229397: Puppet: get row/rack info from Netbox.

I fear that we are going down another path like the dns generation, in which the Netbox API doesn't really suit our needs in terms of performance and efficiency (multiple API calls per device). I'm wondering if we should convert John's patch into a Netbox script instead and take advantage of the speed and power of Django/Netbox internal APIs.

It was already discussed but I want to re-surface the fact that this will create a direct link Netbox data -> Puppet without any manual stopgap. Are we ready for this?

Oct 23 2020, 1:24 PM · observability, User-crusnov, User-jbond, Patch-For-Review, Puppet, Operations
akosiaris added a comment to T266333: Xml/sql dumps are still querying etcd excessively, fix this..

/me subscribing anyway, thanks !

Oct 23 2020, 12:36 PM · Dumps-Generation
akosiaris added a comment to T264710: Host static sites on kubernetes.

For what it's worth, the idea that Daniel explains above would solve the issue for now without the need to move to kubernetes, satisfying several of the requirements without requiring significant effort.

Oct 23 2020, 8:37 AM · Wikidata Query UI, Wikidata Query Builder, serviceops, Wikidata, User-Addshore

Oct 22 2020

akosiaris added a comment to T266194: wikifeeds-production-tls-proxy regularly exceeding its k8s CPU reservation.

That's pretty interesting, there shouldn't be so much throttling at such low CPU usage. user+system summed barely hit 1/5 of the limit.

Oct 22 2020, 10:05 AM · Kubernetes, Wikifeeds, serviceops
akosiaris added a comment to T265904: Remove SLAAC IPs from Ganeti hosts.

@jbond thanks for looking into this, unfortunately the data used in the Netbox import comes from the networking fact because it needs all of them and parses that one, so not sure if the above patch would be useful in practice.

Oct 22 2020, 9:48 AM · Patch-For-Review, Traffic, Operations

Oct 20 2020

akosiaris added a comment to T265982: eqiad: New ganeti instance for orchestrator installation.

LGTM, perhaps do codfw as well, since you are at it, to have a fallback/backup?

Oct 20 2020, 9:33 AM · Operations, vm-requests, serviceops

Oct 19 2020

akosiaris committed rLPRI0a36d8754bc4: Remove restrouter (authored by akosiaris).
Remove restrouter
Oct 19 2020, 2:47 PM
akosiaris added a comment to T254954: Move Wikisource OCR's API proxy to production.

@akosiaris - Thanks for the reply and clearing up my misunderstandings. We have no need to keep the OCR service on Toolforge and would like to eventually move it into a MediaWiki extension (which would also make it easier for all the Wikisource projects to utilize). But it won't make any sense for us to do that unless there is an API proxy available in production that can communicate with the Google Vision API. What API proxy is cxserver.wikimedia.org utilizing to communicate with Google? Would it be possible for us to use that as well or have a similar proxy set up for this service?

Oct 19 2020, 2:20 PM · Tech-Product API Roadmap, Community-Tech, Wikisource OCR
akosiaris added a comment to T255410: Termbox SSR connection terminated very often.
Oct 19 2020, 2:11 PM · User-Michael, serviceops-radar, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), wdwb-tech-focus, Wikidata, Wikidata-Termbox
akosiaris added a comment to T265904: Remove SLAAC IPs from Ganeti hosts.

I fear this ain't gonna be easy. When we tried the approach that exists for all other hosts, we ended up with broken connectivity for ganeti hosts. See T233906, which ended up with the decision described in T233906#5529507. T234207 was then created as an investigation task to handle improvements to our puppetization of network configuration (which is crude and barely existing to be honest). There has been no movement on this since then.

Oct 19 2020, 1:51 PM · Patch-For-Review, Traffic, Operations
akosiaris added a comment to T261130: ganeti5002 was down / powered off, machine check entries in SEL.

Any news on this one? (just found out today about it while working on T265607)

Oct 19 2020, 11:42 AM · serviceops, Operations, ops-eqsin

Oct 16 2020

akosiaris added a comment to T265183: In a k8s world: where does MediaWiki code live?.

My concern is that this transition step becomes a permanent step.

It's up to us to avoid that being true, in case, but I have further doubts I will clarify below. And yes, I clarified in a meeting with our managers I have that concern too (after all, Starling's first law still holds), and they swore commitment to repay that debt once we've transitioned traffic.

Oct 16 2020, 2:55 PM · MW-on-K8s
akosiaris added a comment to T264209: Run stress tests on docker images infrastructure.

For the "We are limited on the docker-registry infrastructure side." point, the sanest way out of this (until we hit the next bottleneck) is to scale out, aka just more docker registry VMs. That should be easily doable, we've got the capacity. The VMs should be split across the rack rows for higher availability.
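
As a rough sketch of what splitting VMs across rack rows could look like, the snippet below assigns VMs to rows round-robin. The VM names, the row labels A to D, and the spread_across_rows helper are all hypothetical and are not taken from the task or from any Ganeti tooling.

```python
from itertools import cycle


def spread_across_rows(vms, rows):
    """Assign each VM to a rack row in round-robin order (illustrative only)."""
    return {vm: row for vm, row in zip(vms, cycle(rows))}


# Hypothetical names: four registry VMs and four rack rows.
vms = [f"registry-vm{i}" for i in range(1, 5)]
rows = ["A", "B", "C", "D"]
placement = spread_across_rows(vms, rows)
for vm, row in placement.items():
    print(vm, "-> row", row)
```

With as many rows as VMs, each VM lands in a different row, so losing a single row takes out at most one registry instance.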

Oct 16 2020, 1:26 PM · Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)), MW-on-K8s, Release-Engineering-Team (Pipeline)
akosiaris added a comment to T264209: Run stress tests on docker images infrastructure.

We need to permanently bump the tmpfs /var/lib/nginx size if we want to be able to consistently push images with blobs that are larger than 1 GB compressed

Couldn't we get around this by using a (bigger) non tmpfs filesystem as client_body_temp_path?
Not sure how much the upload performance would suffer in this case, but we could test that...

+1 on this suggestion. For small requests, there will be minimal writing to a real filesystem for files that exist briefly. These writes would be background I/O in most cases.

Oct 16 2020, 11:02 AM · Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)), MW-on-K8s, Release-Engineering-Team (Pipeline)
akosiaris updated subscribers of T254954: Move Wikisource OCR's API proxy to production.

This was brought to my attention yesterday by @WDoranWMF, sorry for missing it and many thanks for the ping.

Oct 16 2020, 10:23 AM · Tech-Product API Roadmap, Community-Tech, Wikisource OCR
akosiaris added a comment to T255410: Termbox SSR connection terminated very often.

@akosiaris Thank you a lot for your detailed response. I did look into those errors a tiny bit more to properly document them as can be now seen on wikitech.

In the course of that I looked at the last days and noticed some discrepancies to the numbers you provided above. All the following data is for the 7 days between 2020-10-07 00:00:00 and 2020-10-13 23:59:59.

Oct 16 2020, 8:47 AM · User-Michael, serviceops-radar, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), wdwb-tech-focus, Wikidata, Wikidata-Termbox

Oct 15 2020

akosiaris added a comment to T264209: Run stress tests on docker images infrastructure.

The first pull test was successful. 34 hosts pulled from the registry simultaneously. The test lasted about 5 minutes.

Oct 15 2020, 11:55 AM · Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)), MW-on-K8s, Release-Engineering-Team (Pipeline)
akosiaris updated subscribers of T265490: rate limited etherpad.
Oct 15 2020, 9:10 AM · Patch-For-Review, Operations, Wikimedia-Etherpad
akosiaris changed the status of T265490: rate limited etherpad from Open to Stalled.

Lowering priority as the service isn't broken, and setting as Stalled as we are waiting for upstream to release the new version to fix this.

Oct 15 2020, 9:07 AM · Patch-For-Review, Operations, Wikimedia-Etherpad

Oct 14 2020

akosiaris added a comment to T264209: Run stress tests on docker images infrastructure.

1st obstacle found already. The push failed with '500 internal server error'. Logs indicate

Oct 14 2020, 2:17 PM · Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)), MW-on-K8s, Release-Engineering-Team (Pipeline)
akosiaris awarded T219544: Make hadoop cluster able to push to swift a Love token.
Oct 14 2020, 11:09 AM · Patch-For-Review, Analytics-Kanban, Research, Operations, Discovery, Analytics

Oct 12 2020

akosiaris changed the status of T263038: Dev images registry from Open to Stalled.

Setting as stalled for now, pending the investigation mentioned in the last comment.

Oct 12 2020, 9:26 AM · Release-Engineering-Team (Pipeline)
akosiaris added a comment to T263873: "The OTRS background task is not launched. Please contact your administrator" on OTRS web interface.

And again, can't reproduce. Not only that, but logs around the time of the report indicate that the daemon was working fine. That is also supported by systemd's status for the service

Oct 12 2020, 8:59 AM · OTRS

Oct 9 2020

akosiaris added a comment to T264710: Host static sites on kubernetes.

A couple of requirements from my side, regardless of where those sites are deployed and the technology used:

Oct 9 2020, 11:09 AM · Wikidata Query UI, Wikidata Query Builder, serviceops, Wikidata, User-Addshore

Oct 8 2020

akosiaris added a comment to T239459: service-runner apps running on kubernetes emit logs with log level 50 .

The hold-up seems to be eventstreams; it actually uses a fork of service runner, and the fork is missing the feature, so it's slightly more complicated to update. https://github.com/wikimedia/service-runner/tree/prometheus_metrics

Oct 8 2020, 12:48 PM · Platform Team Workboards (Clinic Duty Team), Patch-For-Review, CX-cxserver, serviceops-radar, Product-Infrastructure-Team-Backlog, Operations
akosiaris added a comment to T264888: Review default ferm INPUT policy.

Overall, I am willing to test this out, a couple of points though:

Oct 8 2020, 7:45 AM · Patch-For-Review, Security, Operations, netops, User-jbond

Oct 7 2020

akosiaris added a comment to T260917: Support TLS for service-to-service communication in k8s staging.

What's peculiar, is that https://puppet-compiler.wmflabs.org/compiler1002/25621/deploy1001.eqiad.wmnet/change.deploy1001.eqiad.wmnet.pson doesn't have the test:data thing. Perhaps that's an earlier PCC though?

Yeah, that's a PCC from before the second change to labs/private. https://puppet-compiler.wmflabs.org/compiler1003/25719/ is the current one (which contains test: data in change catalog).

Oct 7 2020, 1:04 PM · serviceops, Kubernetes
akosiaris added a comment to T260917: Support TLS for service-to-service communication in k8s staging.

I could use a pair of eyes on https://gerrit.wikimedia.org/r/q/bug:T260917
The PCC full diff (https://puppet-compiler.wmflabs.org/compiler1002/25621/) lacks defaultsecret: notdefault for staging zotero. What am I missing here?

Oct 7 2020, 12:56 PM · serviceops, Kubernetes
akosiaris edited projects for T263764: Termbox service: unusual errors that could be from envoy, added: serviceops-radar; removed serviceops.

Envoy is being documented at https://wikitech.wikimedia.org/wiki/Envoy#Envoy_at_WMF. It is being used by termbox to talk to mediawiki (it's a component of a service mesh). The idea is to have low cost persistent TLS connections, with retries and telemetry. For more insights, aside from the doc link above, the following grafana dashboard is useful: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=termbox&var-destination=mwapi-async&from=now-7d&to=now

Oct 7 2020, 10:43 AM · serviceops-radar, Wikidata, Wikidata-Termbox
akosiaris added a comment to T122220: Enable optional two-factor authentication for OTRS.

Why does OTRS even use password authentication? All OTRS agents presumably have Wikimedia accounts, so wouldn't OAuth be a more secure and convenient method?

Oct 7 2020, 9:43 AM · Security, OTRS
akosiaris added a project to T255410: Termbox SSR connection terminated very often: serviceops-radar.

Sorry for not answering earlier.

Oct 7 2020, 8:43 AM · User-Michael, serviceops-radar, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), wdwb-tech-focus, Wikidata, Wikidata-Termbox

Oct 6 2020

akosiaris changed the status of T264195: Kubernetes pods are being periodically evicted because of Disk Space pressure caused by cpjobqueue from Open to Stalled.

I think it's showing already

Oct 6 2020, 9:07 AM · serviceops-radar, Kubernetes

Oct 5 2020

akosiaris added a comment to T264195: Kubernetes pods are being periodically evicted because of Disk Space pressure caused by cpjobqueue.

The change to the changeprop service above seems to have solved the daily saw-like pattern

Oct 5 2020, 11:42 AM · serviceops-radar, Kubernetes

Oct 3 2020

akosiaris closed T187984: Update OTRS to the latest stable version (6.0.x) as Resolved.

The upgrade has happened successfully, and tickets for followup work that is required as a result of this upgrade can now be opened under the OTRS 6 column in the OTRS project in phabricator. So, I'll resolve this.

Oct 3 2020, 11:55 AM · Patch-For-Review, serviceops, User-notice, Operations, OTRS
akosiaris closed T187984: Update OTRS to the latest stable version (6.0.x), a subtask of T122220: Enable optional two-factor authentication for OTRS, as Resolved.
Oct 3 2020, 11:55 AM · Security, OTRS
akosiaris closed T187984: Update OTRS to the latest stable version (6.0.x), a subtask of T126759: Some URLs seem to get replaced with [##############2] when written in notes and responses. , as Resolved.
Oct 3 2020, 11:55 AM · Upstream, OTRS

Oct 2 2020

akosiaris created P12908 cleanup etherpad js.
Oct 2 2020, 6:01 PM
akosiaris added a comment to T264390: Site: 4 VM request for LDAP replicas.

/me rubberstamping. Thanks for this!

Oct 2 2020, 8:35 AM · vm-requests, Operations

Oct 1 2020

akosiaris closed T255877: Move proton to use TLS only as Resolved.

All old stuff has been removed, I'll resolve this.

Oct 1 2020, 12:53 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops, Operations
akosiaris closed T255877: Move proton to use TLS only, a subtask of T235411: Add TLS termination to services running on kubernetes, as Resolved.
Oct 1 2020, 12:52 PM · Prod-Kubernetes, Kubernetes, serviceops, Operations
akosiaris updated the task description for T255877: Move proton to use TLS only.
Oct 1 2020, 12:52 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops, Operations