akosiaris (Alexandros Kosiaris)
Senior Site Reliability Engineer

User Details

User Since
Oct 3 2014, 8:40 AM (270 w, 3 d)
Availability
Available
IRC Nick
akosiaris
LDAP User
Alexandros Kosiaris
MediaWiki User
AKosiaris (WMF) [ Global Accounts ]

Recent Activity

Today

akosiaris added a comment to T239942: ORES articlequality for euwiki works differently in production.

I had a quick look. I agree with the above investigation (I reproduced it as well) and can add, in case it's not obvious, that for some reason libenchant on ores1009 seems to prefer aspell over ispell. It becomes even more complicated because hunspell is used as well, lower down in the strace. I dug into it for about an hour without an obvious cause or solution being revealed to me.

Mon, Dec 9, 5:27 PM · Patch-For-Review, artificial-intelligence, articlequality-modeling, ORES, Scoring-platform-team
akosiaris added a comment to T235844: Collect tasks related code and security review.

I'll remove myself from this if you don't mind. I guess I was added because of T240175#5723047, but that was an answer to a very clear-cut question. Overall, I honestly have no idea what the project is about, nor how I could be of help (at least for now). Feel free to re-add me if you disagree.

Mon, Dec 9, 4:26 PM · User-LokalProfil, Wikispeech-jobrunner (Sprint), User-Sebastian_Berlin-WMSE, WMSE-Wikispeech-Speech-Data-Collector-2019

Thu, Dec 5

akosiaris committed rDEPLOYCHARTS6fdf3abd1a5f: blubberoid: Harmonize eqiad limits/requests (authored by akosiaris).
blubberoid: Harmonize eqiad limits/requests
Thu, Dec 5, 3:44 PM
akosiaris closed T236554: "packaging" Cloud VPS project jessie deprecation as Resolved.

I've deleted that instance. It's been a long time since we last used it and it is probably not particularly useful right now. I'll resolve this.

Thu, Dec 5, 9:17 AM · Cloud-VPS (Debian Jessie Deprecation)
akosiaris added a comment to T236550: "analytics" Cloud VPS project jessie deprecation.

I should probably be removed as an admin. I have no use for it and no idea what that machine is.

Thu, Dec 5, 9:13 AM · Cloud-VPS (Debian Jessie Deprecation)
akosiaris added a comment to T236545: "otrs" Cloud VPS project jessie deprecation.

Feel free to delete machines and project.

Thu, Dec 5, 9:12 AM · Cloud-VPS (Debian Jessie Deprecation)
akosiaris added a comment to T236573: "etcd" Cloud VPS project jessie deprecation.

For what it's worth, I have no use for these machines or the project.

Thu, Dec 5, 9:11 AM · Cloud-VPS (Debian Jessie Deprecation)
akosiaris triaged T239862: unwind the Puppetized /etc/hosts override of statsd.eqiad.wmnet as Low priority.
Thu, Dec 5, 9:09 AM · Performance-Team (Radar), Operations

Wed, Dec 4

akosiaris added a parent task for T239838: EQIAD+CODFW: (9) VM request for kubernetes etcd: T239835: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster.
Wed, Dec 4, 4:32 PM · Patch-For-Review, vm-requests, Operations
akosiaris added a subtask for T239835: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster: T239838: EQIAD+CODFW: (9) VM request for kubernetes etcd.
Wed, Dec 4, 4:32 PM · serviceops
akosiaris renamed T239838: EQIAD+CODFW: (9) VM request for kubernetes etcd from Site: (QUANTITY) VM request for SERVICE[S] to EQIAD+CODFW: (9) VM request for kubernetes etcd.
Wed, Dec 4, 4:31 PM · Patch-For-Review, vm-requests, Operations
akosiaris created T239838: EQIAD+CODFW: (9) VM request for kubernetes etcd.
Wed, Dec 4, 4:31 PM · Patch-For-Review, vm-requests, Operations
akosiaris awarded T234642: Wikimedia Technical Conference 2019 Session: What have we learned when deploying a standalone server side rendering service for the new mobile Wikidata termbox a Love token.
Wed, Dec 4, 4:25 PM · International-Developer-Events, Wikimedia-Technical-Conference-2019
akosiaris added a comment to T238048: Followup to backup1001 bacula switchover (misc pending tasks).

Hi, @akosiaris, thanks for the reviews and feedback. Could I have your further thoughts on T238048#5701519 and T238048#5701534?

Wed, Dec 4, 4:23 PM · Goal, Operations
akosiaris moved T239835: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster from Backlog to Doing on the serviceops board.
Wed, Dec 4, 4:12 PM · serviceops
akosiaris triaged T239835: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster as High priority.
Wed, Dec 4, 4:11 PM · serviceops
akosiaris created T239835: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster.
Wed, Dec 4, 4:11 PM · serviceops
akosiaris awarded T238048: Followup to backup1001 bacula switchover (misc pending tasks) a Love token.
Wed, Dec 4, 1:05 PM · Goal, Operations
akosiaris added a comment to T238048: Followup to backup1001 bacula switchover (misc pending tasks).

"Offsite Job" seems to be correctly configured as "Copy", but it is not showing any activity. Needs checking.

Wed, Dec 4, 1:05 PM · Goal, Operations
akosiaris added a comment to T238048: Followup to backup1001 bacula switchover (misc pending tasks).

Same for bast1001:

29-Nov 13:19 backup1001.eqiad.wmnet JobId 162657: Start Restore Job RestoreFiles.2019-11-29_13.18.59_50
29-Nov 13:19 backup1001.eqiad.wmnet JobId 162657: Using Device "FileStorageArchive" to read.
29-Nov 13:19 backup1001.eqiad.wmnet-fd JobId 162657: Ready to read from volume "archive0003" on File device "FileStorageArchive" (/srv/archive).
29-Nov 13:19 backup1001.eqiad.wmnet-fd JobId 162657: Forward spacing Volume "archive0003" to addr=323517936212
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 Encryption session provided an invalid symmetric key: ERR=error:0407109F:rsa routines:RSA_padding_check_PKCS1_type_2:pkcs decoding error
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 Encryption session provided an invalid symmetric key: ERR=error:04065072:rsa routines:RSA_EAY_PRIVATE_DECRYPT:padding check failed
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 Encryption session provided an invalid symmetric key: ERR=error:0607A082:digital envelope routines:EVP_CIPHER_CTX_set_key_length:invalid key length
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: restore.c:741 Failed to initialize decryption context for /srv/tmp/srv/home_pmtpa/aaron/.gitignore
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 OpenSSL digest Verify final failed: ERR=error:0407109F:rsa routines:RSA_padding_check_PKCS1_type_2:pkcs decoding error
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 OpenSSL digest Verify final failed: ERR=error:04065072:rsa routines:RSA_EAY_PRIVATE_DECRYPT:padding check failed
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 OpenSSL digest Verify final failed: ERR=error:04091077:rsa routines:INT_RSA_VERIFY:wrong signature length
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: restore.c:1680 Signature validation failed for file /srv/tmp/srv/home_pmtpa/aaron/.cache/motd.legal-displayed: ERR=Signature is invalid
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 Encryption session provided an invalid symmetric key: ERR=error:0407109F:rsa routines:RSA_padding_check_PKCS1_type_2:pkcs decoding error
29-Nov 13:19 dbprov2001.codfw.wmnet-fd JobId 162657: Error: openssl.c:78 Encryption session provided an invalid symmetric key: ERR=error:04065072:rsa routines:RSA_EAY_PRIVATE_DECRYPT:padding check failed

@akosiaris does this ring any bells? I find it hard to believe that a new version of openssl couldn't decrypt a file encrypted with an older version. Maybe I am using the wrong options.

Wed, Dec 4, 1:03 PM · Goal, Operations
akosiaris updated subscribers of T239795: Connection tracking on kubernetes hosts alerts.
Wed, Dec 4, 12:09 PM · Analytics, serviceops, Event-Platform
akosiaris updated the task description for T239795: Connection tracking on kubernetes hosts alerts.
Wed, Dec 4, 12:01 PM · Analytics, serviceops, Event-Platform
akosiaris updated subscribers of T239795: Connection tracking on kubernetes hosts alerts.
Wed, Dec 4, 11:58 AM · Analytics, serviceops, Event-Platform
akosiaris renamed T239795: Connection tracking on kubernetes hosts alerts from conntrack -L to Connection tracking on kubernetes hosts alerts.
Wed, Dec 4, 11:55 AM · Analytics, serviceops, Event-Platform
akosiaris created T239795: Connection tracking on kubernetes hosts alerts.
Wed, Dec 4, 10:57 AM · Analytics, serviceops, Event-Platform

Tue, Dec 3

akosiaris added a comment to T239009: Degraded RAID on ganeti2002.

Do you think we can keep using the server without replacing the disk, or do we have to buy one disk and keep it on site in case the disk fails?

Tue, Dec 3, 6:10 PM · Operations, ops-codfw
akosiaris updated subscribers of P9799 varnish tests traceback.
Tue, Dec 3, 12:15 PM
akosiaris created P9799 varnish tests traceback.
Tue, Dec 3, 12:12 PM
akosiaris closed T238410: Errors from k8s API for user 'prometheus' as Resolved.

Problem fixed by the change above

Tue, Dec 3, 11:21 AM · observability, Kubernetes
akosiaris committed rDEPLOYCHARTS6fd94a237a23: RBAC: Allow prometheus access to nodes resources (authored by akosiaris).
RBAC: Allow prometheus access to nodes resources
Tue, Dec 3, 8:59 AM
akosiaris committed rDEPLOYCHARTS4e968299bfbf: RBAC: Unify rules into 1 file (authored by akosiaris).
RBAC: Unify rules into 1 file
Tue, Dec 3, 8:47 AM
akosiaris added a comment to T236386: Set up eventgate-logging-external in production.

@akosiaris I merged and applied https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/551610 in staging. My main app isn't coming up, and I suspect it is because it can't reach Kafka at the 9093 TLS port on the logstash hosts.

Tue, Dec 3, 8:23 AM · Services, Operations, Service-deployment-requests, serviceops, Patch-For-Review, Event-Platform, Analytics-Kanban, User-Elukey, User-fgiunchedi, Better Use Of Data, Product-Infrastructure-Team-Backlog, Epic

Mon, Dec 2

akosiaris renamed T234207: Investigate improvements to how puppet manages network interfaces from Investigate improvements to how puppet manages interfaces to Investigate improvements to how puppet manages network interfaces.
Mon, Dec 2, 5:11 PM · User-jbond, Puppet, Operations, netops
akosiaris added a comment to T239009: Degraded RAID on ganeti2002.

We haven't even started the process of refreshing those nodes and they host important services. I'd rather we just replaced the disk.

Mon, Dec 2, 5:01 PM · Operations, ops-codfw

Fri, Nov 29

akosiaris added a comment to T239392: Applications and scripts need to be able to understand the pooled status of servers in our load balancers..

need to be able to understand the pooled status

Fri, Nov 29, 2:44 PM · Operations, serviceops, SRE-tools, Traffic, Pybal
akosiaris renamed T239459: service-runner apps (wikifeeds/cxserver at the least) running on kubernetes emit logs with log level 50 from wikifeeds (Service-runner app) running on kubernetes emit logs with log level 50 to service-runner apps (wikifeeds/cxserver at the least) running on kubernetes emit logs with log level 50 .
Fri, Nov 29, 11:46 AM · Core Platform Team Workboards (Clinic Duty Team), CX-cxserver, serviceops-radar, Mobile-Content-Service, Product-Infrastructure-Team-Backlog, Operations
akosiaris added a project to T239459: service-runner apps (wikifeeds/cxserver at the least) running on kubernetes emit logs with log level 50 : CX-cxserver.

So it does look like it's service-runner specific then.

Fri, Nov 29, 11:45 AM · Core Platform Team Workboards (Clinic Duty Team), CX-cxserver, serviceops-radar, Mobile-Content-Service, Product-Infrastructure-Team-Backlog, Operations
akosiaris renamed T239459: service-runner apps (wikifeeds/cxserver at the least) running on kubernetes emit logs with log level 50 from Kubernetes emits log level 50 to wikifeeds (Service-runner app) running on kubernetes emit logs with log level 50 .
Fri, Nov 29, 11:01 AM · Core Platform Team Workboards (Clinic Duty Team), CX-cxserver, serviceops-radar, Mobile-Content-Service, Product-Infrastructure-Team-Backlog, Operations
akosiaris closed T91649: Deploy Sentry (JavaScript error logging) to production, configured to log only a limited subset of users/pages as Declined.
In T91649#5700798, @Tgr wrote:

Sentry is no longer open source (by the definition of OSI and by the admission of upstream as well). More at https://forum.sentry.io/t/re-licensing-sentry-faq-dAiscussion/8044.

To be more exact, Sentry is now business source, i.e. eventually open source with a non-compete restriction. (The code can be used in any way, except it cannot be used to offer paid error monitoring services; and it becomes proper Apache-2.0 three years after being published.) That is of course a no-go for any code that needs to be integrated with MediaWiki (using it would probably violate both the Sentry license and the GPL); using the Sentry server as a self-contained service should IMO still be considered. The license meets all the conditions for which we care about open source, and it's a reasonable balance between the spirit of open source (far closer to it than open core, which we do use) and not letting large commercial entities cannibalize your project. But of course that's a far larger discussion that would have to involve WMF Legal and TechCom at the minimum.

Fri, Nov 29, 8:11 AM · Developer-Wishlist (2017), Multimedia, Developer-notice, Notice, Epic, UploadWizard, Sentry, Roadmap
akosiaris closed T91649: Deploy Sentry (JavaScript error logging) to production, configured to log only a limited subset of users/pages, a subtask of T106915: Use Sentry in production, as Declined.
Fri, Nov 29, 8:11 AM · Readers-Web-Backlog, Front-end-Standards-Group, Epic, Sentry

Thu, Nov 28

akosiaris added a comment to T238658: Migrate EventStreams to k8s deployment pipeline.

I can't seem to develop this locally due to a barrage of version problems.
From what I can tell, in prod we use:
k8s 1.12.9
helm 2.12.2
helmfile 0.66.0 2.
To use minikube with k8s 1.12.9, I tried installing minikube 1.1.0 (released at about the same time as 1.12.9).
I failed to make this combo work locally (macOS). I think perhaps my Docker Desktop (with hyperkit?) has been keeping itself up to date, and no longer works with older kubectl? Not sure.

Thu, Nov 28, 11:42 AM · Analytics, Patch-For-Review, Release-Engineering-Team (Pipeline), Services (watching), Release Pipeline
akosiaris added a comment to T219923: Move graphoid logging to new logging pipeline.

Graphoid is likely going away, so we shouldn't work on this.

Thu, Nov 28, 9:11 AM · observability, Core Platform Team (Needs Cleaning - Services Operations), service-runner, Wikimedia-Logstash, Operations

Wed, Nov 27

akosiaris triaged T239340: Investigate the remaining usage of X-Real-IP as Low priority.
Wed, Nov 27, 2:07 PM · Patch-For-Review, Traffic, serviceops, Operations
akosiaris created T239340: Investigate the remaining usage of X-Real-IP.
Wed, Nov 27, 2:07 PM · Patch-For-Review, Traffic, serviceops, Operations

Tue, Nov 26

akosiaris created P9747 calico debug logs.
Tue, Nov 26, 8:36 AM

Mon, Nov 25

akosiaris updated the task description for T238909: Proposal: simplify set up of a new load-balanced service on kubernetes.
Mon, Nov 25, 11:49 AM · Operations, Prod-Kubernetes, Pybal, Traffic, serviceops

Fri, Nov 22

akosiaris added a comment to T238927: Build and supply buster nodejs (nodejs-slim and nodejs-devel) container images.

This was mentioned in T238890 (of which it is not a blocker), but we will need this for all services anyway.

Fri, Nov 22, 2:14 PM · serviceops
akosiaris triaged T238927: Build and supply buster nodejs (nodejs-slim and nodejs-devel) container images as Medium priority.
Fri, Nov 22, 2:12 PM · serviceops
akosiaris created T238927: Build and supply buster nodejs (nodejs-slim and nodejs-devel) container images.
Fri, Nov 22, 2:12 PM · serviceops
akosiaris added a comment to T238890: [Bug] Chromium binary missing in proton's production docker image.
docker run --rm --entrypoint /bin/bash -it docker-registry.wikimedia.org/wikimedia/mediawiki-services-chromium-render:2019-11-22-134049-production 
runuser@5f6d4fdc14f8:/srv/service$ ls
LICENSE    app.js           config.yaml  lib           package.json  scripts    spec.yaml     test
README.md  config.dev.yaml  dist         node_modules  routes        server.js  targets.yaml
runuser@5f6d4fdc14f8:/srv/service$ dpkg -l chromium
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                          Version             Architecture        Description
+++-=============================-===================-===================-===============================================================
ii  chromium                      73.0.3683.75-1~deb9 amd64               web browser
Fri, Nov 22, 2:09 PM · Product-Infrastructure-Team-Backlog (Kanban), serviceops
akosiaris added a comment to T238833: Create NRPE check to alert when cergen certificates are due to expire.

Just so that you aren't caught off guard

Fri, Nov 22, 11:09 AM · Patch-For-Review, User-jbond, Puppet, Operations

Thu, Nov 21

akosiaris committed rDEPLOYCHARTSf065ef7a3bb6: Also check charts generated by helmfile (authored by Joe).
Also check charts generated by helmfile
Thu, Nov 21, 5:01 PM
akosiaris added a comment to T238789: dropped packets to echostore.svc.eqiad 8082/tcp.

More investigation in T238823. It's quite possibly expected.

Thu, Nov 21, 1:27 PM · Operations, serviceops
akosiaris updated subscribers of T238823: Sporatic RST drops in the ulogd logs.

Could be totally different, but with @jijiki we've seen this behavior elsewhere as well. The latest installment is T238789. Per that logstash dashboard, it's the mediawikis that send the RST to the echostore.

Thu, Nov 21, 1:26 PM · Operations, netops, User-jbond
akosiaris added a comment to T238817: Request routing to active/passive services active in codfw only stopped working.

\o/. Thanks for taking care of this!

Thu, Nov 21, 11:13 AM · Traffic, Operations
akosiaris closed T238792: Wikifeeds deployment failed in staging as Resolved.

Looks like there are currently issues accessing docker-registry.wikimedia.org, which may be to blame here.

Thu, Nov 21, 9:15 AM · serviceops, Product-Infrastructure-Team-Backlog, Wikifeeds
akosiaris committed rDEPLOYCHARTSabfd99bb8476: wikifeeds: Followup to b129b2088dbe (authored by akosiaris).
wikifeeds: Followup to b129b2088dbe
Thu, Nov 21, 9:07 AM
akosiaris committed rDEPLOYCHARTSb129b2088dbe: Switch all services' docker-registries to internal (authored by akosiaris).
Switch all services' docker-registries to internal
Thu, Nov 21, 8:52 AM
akosiaris triaged T238158: Identify which parts of the "Add a wiki" procedure can be integrated with the deployment pipeline as Low priority.
Thu, Nov 21, 8:11 AM · Operations, Kubernetes, serviceops, Release Pipeline

Wed, Nov 20

akosiaris awarded T236386: Set up eventgate-logging-external in production a Love token.
Wed, Nov 20, 5:02 PM · Services, Operations, Service-deployment-requests, serviceops, Patch-For-Review, Event-Platform, Analytics-Kanban, User-Elukey, User-fgiunchedi, Better Use Of Data, Product-Infrastructure-Team-Backlog, Epic
akosiaris added a comment to T236386: Set up eventgate-logging-external in production.

@Joe @akosiaris @ema I'd like to move forward with these patches this week, hopefully sooner rather than later. Can you find some time to review? I'll add them all to ticket description for easier reference.

Wed, Nov 20, 3:57 PM · Services, Operations, Service-deployment-requests, serviceops, Patch-For-Review, Event-Platform, Analytics-Kanban, User-Elukey, User-fgiunchedi, Better Use Of Data, Product-Infrastructure-Team-Backlog, Epic
akosiaris closed T224247: upgrade and rename krypton & create its codfw equivalent, a subtask of T224549: Track remaining jessie systems in production, as Resolved.
Wed, Nov 20, 8:31 AM · Operations
akosiaris closed T224247: upgrade and rename krypton & create its codfw equivalent as Resolved.

krypton is no more since 7a36b4e7a94f486a400f0363c263c446c33bba80, resolving.

Wed, Nov 20, 8:31 AM · serviceops, Operations
akosiaris closed T228196: docker-registry: some layers has been corrupted due to deleting other swift containers as Resolved.

I'll boldly resolve (no update since July); feel free to reopen.

Wed, Nov 20, 8:30 AM · Release-Engineering-Team-TODO, Patch-For-Review, Operations, Wikimedia-Incident, serviceops
akosiaris moved T224041: Kask functional testing with Cassandra via the Deployment Pipeline from Next up to Backlog on the serviceops board.
Wed, Nov 20, 8:29 AM · CPT Initiatives (Session Management Service (CDP2)), Release-Engineering-Team (Pipeline), Release-Engineering-Team-TODO, Services (next), User-Eevans, Release Pipeline, Operations, serviceops

Tue, Nov 19

akosiaris added a comment to T212828: SRE FY2019 Q3 goal: Ramp-up serving traffic to PHP 7 .

Should we resolve this?

Tue, Nov 19, 6:56 PM · User-Joe, serviceops, Operations
akosiaris closed T223345: Zotero container: Production is running candidate version, last production version is broken due to lack of ca-certificates package as Resolved.

This has been fixed in 3229da692ef3a003a860d6b0024c9ef4813ce13d. The reason production tagged releases are now again available is because we gave up on having swagger/openapi specs and removed helm.yaml (the thing informing the pipeline that integration tests should be run) in https://gerrit.wikimedia.org/r/#/c/mediawiki/services/zotero/+/538171/.

Tue, Nov 19, 6:56 PM · Core Platform Team Legacy (Watching / External), Beta-Cluster-reproducible, Editing-team, Services (next), serviceops
akosiaris closed T226516: deploy CoreDNS as a in-cluster DNS service as Resolved.

This has been done, resolving

Tue, Nov 19, 6:44 PM · serviceops
akosiaris added a comment to T238593: Phabricator downtime due to aphlict and websockets (aphlict current disabled).

By going through SAL and the irc logs on #wikimedia-operations I've reconstructed the events as follows. There are some parts I don't understand so please fill the gaps.

  • 2019-11-15 16:31 phab1003 has 169 busy apache workers according to this grafana dashboard which seems unusual. The number of busy workers is usually lower (below 20), but in the past it had been occasionally about as high (eg: it was 159 on 2019-09-21 at 00:00).
Tue, Nov 19, 2:44 PM · Operations, Traffic, serviceops, Phabricator
akosiaris triaged T238652: Hardware request for Postgres database for censorship monitoring scripts as Medium priority.

Adding some extra info, as we've discussed with @ssingh before this task was filed. After the discussion it became obvious that a ganeti VM would not be a good candidate for this, mainly because of the disk size requirement.

Tue, Nov 19, 12:31 PM · hardware-requests, Operations
akosiaris added a comment to T70113: Alert when Zuul/Gearman queue is stalled.

The alarm got removed via https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/550943/ for some reason ...

Tue, Nov 19, 11:58 AM · Release-Engineering-Team, Patch-For-Review, observability, Continuous-Integration-Infrastructure

Fri, Nov 15

akosiaris added a comment to T231011: (euwiki) Mysterious, coordinated slowdowns every ~ 25 minutes on API servers.

I think it's in the ~35mins "schedule" now, but other than that, it's still present https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200&from=1573809235233&to=1573845235233

Fri, Nov 15, 7:15 PM · PHP 7.2 support, serviceops, Operations
akosiaris updated the task description for T234641: Wikimedia Technical Conference 2019 Session: Continuous Delivery/Deployment in Wikimedia: The Future of the Deployment Pipeline.
Fri, Nov 15, 7:11 PM · International-Developer-Events, Wikimedia-Technical-Conference-2019
akosiaris triaged T238410: Errors from k8s API for user 'prometheus' as Low priority.

I think we can solve this by just adding the system:heapster role to our prometheus kubernetes group. I'll try that approach, but for now I am lowering to Low as it isn't causing an issue with metrics gathering (at least the metrics we currently rely on).

Fri, Nov 15, 5:20 PM · observability, Kubernetes
akosiaris added a reverting change for rLPRIc7d82bc3cf37: Add system:heapster role to prometheus: rLPRIbaf082d7f97e: Revert "Add system:heapster role to prometheus".
Fri, Nov 15, 5:02 PM
akosiaris committed rLPRIbaf082d7f97e: Revert "Add system:heapster role to prometheus" (authored by akosiaris).
Revert "Add system:heapster role to prometheus"
Fri, Nov 15, 5:02 PM
akosiaris committed rLPRIc7d82bc3cf37: Add system:heapster role to prometheus (authored by akosiaris).
Add system:heapster role to prometheus
Fri, Nov 15, 4:35 PM

Thu, Nov 14

akosiaris updated the task description for T238259: Unconference: Interactive content at Wikipedia: how to even start.
Thu, Nov 14, 9:05 PM · Wikimedia-Technical-Conference-2019
akosiaris awarded T236406: Switchover backup director service from helium to backup1001 a Yellow Medal token.
Thu, Nov 14, 7:25 PM · Patch-For-Review, Goal, DBA, serviceops, Operations
akosiaris triaged T236386: Set up eventgate-logging-external in production as Medium priority.
Thu, Nov 14, 7:11 PM · Services, Operations, Service-deployment-requests, serviceops, Patch-For-Review, Event-Platform, Analytics-Kanban, User-Elukey, User-fgiunchedi, Better Use Of Data, Product-Infrastructure-Team-Backlog, Epic
akosiaris added a comment to T236386: Set up eventgate-logging-external in production.

Namespaces and tokens have been created and populated. @Ottomata, you are clear for deployment. I am guessing after that we need LVS, discovery, public endpoint exposing.

Thu, Nov 14, 6:59 PM · Services, Operations, Service-deployment-requests, serviceops, Patch-For-Review, Event-Platform, Analytics-Kanban, User-Elukey, User-fgiunchedi, Better Use Of Data, Product-Infrastructure-Team-Backlog, Epic
akosiaris committed rDEPLOYCHARTS22af35a69d8d: Namespaces for eventgate-logging-external (authored by akosiaris).
Namespaces for eventgate-logging-external
Thu, Nov 14, 5:05 PM
akosiaris committed rLPRIf8cd5689af17: Add eventgate-logging-external k8s stanzas (authored by akosiaris).
Add eventgate-logging-external k8s stanzas
Thu, Nov 14, 2:58 PM

Wed, Nov 13

akosiaris closed T227514: k8s liveness check(?) generating session storage log noise, a subtask of T206016: Create a service for session storage, as Resolved.
Wed, Nov 13, 11:11 PM · CPT Initiatives (Multi-DC (TEC1)), User-Clarakosi, User-Eevans
akosiaris closed T227514: k8s liveness check(?) generating session storage log noise as Resolved.

I'll be bold and resolve this since we now have fixed b). We can always reopen and evaluate if a) becomes an issue (aka I am going with 3.)

Wed, Nov 13, 11:11 PM · CPT Initiatives (Multi-DC (TEC1)), serviceops
akosiaris committed rDEPLOYCHARTSa5f0dcc19fe0: Update scaffold template names to use chart name (authored by jeena).
Update scaffold template names to use chart name
Wed, Nov 13, 9:03 PM
akosiaris added a comment to T238048: Followup to backup1001 bacula switchover (misc pending tasks).

@akosiaris Could you take a quick look to see if this seems like the complete archive contents?
{P9597}

Wed, Nov 13, 4:19 PM · Goal, Operations

Tue, Nov 12

akosiaris added a comment to T106915: Use Sentry in production.

As noted in T91649 (duplicating here just for more dissemination), sentry is no longer open source (by the strict definition of the term). More at https://forum.sentry.io/t/re-licensing-sentry-faq-dAiscussion/8044

Tue, Nov 12, 3:06 PM · Readers-Web-Backlog, Front-end-Standards-Group, Epic, Sentry
akosiaris added a comment to T91649: Deploy Sentry (JavaScript error logging) to production, configured to log only a limited subset of users/pages.

Sentry is no longer open source (by the definition of OSI and by the admission of upstream as well). More at https://forum.sentry.io/t/re-licensing-sentry-faq-dAiscussion/8044.

Tue, Nov 12, 3:04 PM · Developer-Wishlist (2017), Multimedia, Developer-notice, Notice, Epic, UploadWizard, Sentry, Roadmap

Nov 8 2019

akosiaris added a comment to T202982: Requests to MW 404 when on HTTPS.

Seems like it's been fixed, the only thing left to be done is to remove the hacky line from puppet.

Nov 8 2019, 1:27 PM · Core Platform Team Workboards (Clinic Duty Team), Product-Infrastructure-Team-Backlog, Operations, Proton
akosiaris added a comment to T237713: Remove old builds on package builder.

All builds done on boron end up in /var/cache/pbuilder/result/* and we don't expire debs from there.
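Since nothing expires from that directory, a first step could be a report of what is old enough to be a candidate for cleanup. The following is only a minimal sketch: the function name `stale_files` and the 90-day threshold are assumptions for illustration, not an existing tool or agreed policy, and it only lists files rather than deleting anything.

```python
import time
from pathlib import Path


def stale_files(root, max_age_days, now=None):
    """Return files under `root` last modified more than `max_age_days` ago.

    `now` (seconds since the epoch) can be injected for testing;
    it defaults to the current time.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.stat().st_mtime < cutoff
    )


# Hypothetical usage on the package builder (90 days is an assumed threshold):
# for path in stale_files("/var/cache/pbuilder/result", 90):
#     print(path)
```

Keeping the listing separate from any removal makes it easy to review the candidates before deciding on an expiry policy.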

Nov 8 2019, 10:04 AM · Operations

Nov 6 2019

akosiaris added a comment to T235013: Use `git lfs` for large binary files of Design Style Guide.

@Ladsgroup git-lfs is not installed on the prod servers and the puppet git::clone class also does not support changing the command yet. So this breaks cloning on the prod servers.

Nov 6 2019, 3:16 PM · Release-Engineering-Team, Patch-For-Review, User-Ladsgroup, Wikimedia Design Style Guide

Nov 4 2019

akosiaris added a comment to T236277: Extend Puppet CA Expiry date .

@CDanis the problem is that all of those identify clients, while for the CA validation we're mostly interested in the server side. So while that surely would help, it's a 1:1 mapping. Also there might be places that have hardcoded the path to the CA cert for validation, either in the puppet repo or, potentially, in other repos too (as a default for example, dunno).
I don't know if this CA is also used in the k8s world for example.

Nov 4 2019, 11:40 PM · DBA, Patch-For-Review, User-jbond, Puppet, Operations
akosiaris added a comment to T236277: Extend Puppet CA Expiry date .

IMPORTANT: The puppet CA cert (and correspondingly key), is used as a "master" (a failsafe in case the actual host key is not around) key for bacula backups. That is, if we lose it we won't be able to restore backups for hosts that no longer are around. This is documented under https://wikitech.wikimedia.org/wiki/Bacula#Restore_from_a_non-existent_host_(missing_private_key).

@akosiaris I only intend to update the public key with this change; as such, everything should still work post-change, or am I missing something else?

Nov 4 2019, 1:46 PM · DBA, Patch-For-Review, User-jbond, Puppet, Operations
akosiaris closed T237016: Update router ACLs for newer bacula hosts, a subtask of T229209: Strengthen backup infrastructure and support, as Resolved.
Nov 4 2019, 1:43 PM · Patch-For-Review, Goal, DBA, serviceops, Operations
akosiaris closed T237016: Update router ACLs for newer bacula hosts as Resolved.
akosiaris@an-master1002:~$ telnet -4 backup1001.eqiad.wmnet 9103
Trying 10.64.48.36...
Connected to backup1001.eqiad.wmnet.
Escape character is '^]'.
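The same reachability check can be scripted when several hosts need verifying after an ACL change. This is a rough sketch, not part of the original task: the helper name `port_is_open` and the timeout are assumptions, and it forces IPv4 to mirror `telnet -4`.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection over IPv4 to host:port succeeds."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            sock.connect((host, port))
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Hypothetical usage, with the host and port from the transcript above:
# print(port_is_open("backup1001.eqiad.wmnet", 9103))
```

A successful connect only shows that the path and the listener are up; it says nothing about whether the bacula daemons can authenticate to each other.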
Nov 4 2019, 1:43 PM · Operations, netops
akosiaris added a comment to T237198: Kubernetes workers frequent oom-killer in action.

As @Joe said, that's expected. It's how misbehaving services are killed in order to recover. Here's also a breakdown in case anyone is interested

Nov 4 2019, 1:40 PM · Operations, serviceops
akosiaris added a comment to T236277: Extend Puppet CA Expiry date .

IMPORTANT: The puppet CA cert (and correspondingly key), is used as a "master" (a failsafe in case the actual host key is not around) key for bacula backups. That is, if we lose it we won't be able to restore backups for hosts that no longer are around. This is documented under https://wikitech.wikimedia.org/wiki/Bacula#Restore_from_a_non-existent_host_(missing_private_key).

Nov 4 2019, 1:29 PM · DBA, Patch-For-Review, User-jbond, Puppet, Operations
akosiaris added a comment to T237016: Update router ACLs for newer bacula hosts.

Left a minor comment, namely let's keep helium around for a bit more.

Nov 4 2019, 12:19 PM · Operations, netops

Nov 1 2019

akosiaris changed the status of T237091: Pass RDRAND CPU feature flag to ganeti VMs from Resolved to Invalid.

Rolling it back. Per https://en.wikipedia.org/wiki/RDRAND and further tests, rdrand is already present in IvyBridge, which is what we pass as the base anyway.
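One way to confirm this from inside a Linux VM is to look for the `rdrand` flag in `/proc/cpuinfo`. The snippet below is an illustrative sketch only; the helper names `cpu_flags` and `has_rdrand` are made up for this example and are not part of any ganeti tooling.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def has_rdrand(cpuinfo_path="/proc/cpuinfo"):
    """Return True if the CPU seen by this (Linux) machine advertises rdrand."""
    return "rdrand" in cpu_flags(Path(cpuinfo_path).read_text())


# Hypothetical usage inside a VM:
# print("rdrand present:", has_rdrand())
```

Parsing is kept separate from file access so the check can also be run against cpuinfo text collected from several hosts.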

Nov 1 2019, 8:53 AM · Operations