
akosiaris (Alexandros Kosiaris)
Senior Site Reliability Engineer

User Details

User Since
Oct 3 2014, 8:40 AM (237 w, 2 d)
Availability
Available
IRC Nick
akosiaris
LDAP User
Alexandros Kosiaris
MediaWiki User
AKosiaris (WMF) [ Global Accounts ]

Recent Activity

Fri, Apr 19

akosiaris committed rDEPLOYCHARTSa1ceeba25c6f: First version of the kask chart (authored by akosiaris).
First version of the kask chart
Fri, Apr 19, 4:05 PM
akosiaris closed T220785: Requesting deployment access for santhosh as Resolved.

Change merged, thanks!

Fri, Apr 19, 3:07 PM · Patch-For-Review, Operations, SRE-Access-Requests
akosiaris assigned T220822: Site: 4 VM request for kubernetes to ayounsi.

This is almost done. The only thing missing seems to be the peering with the Juniper routers.

Fri, Apr 19, 3:00 PM · Patch-For-Review, User-jijiki, serviceops, vm-requests, Operations
akosiaris added a comment to T218750: Re-enable use of Gerrit HTTP token to push patchsets.

Before we move forward and enable this, let's make sure we have understood the security repercussions and have mitigated them (and if we find it impossible to do so, avoid it).

Fri, Apr 19, 1:49 PM · VPS-project-libraryupgrader, Release-Engineering-Team, Gerrit
akosiaris committed rDEPLOYCHARTSa2c1409df930: Support affinity in all charts (authored by akosiaris).
Support affinity in all charts
Fri, Apr 19, 8:34 AM

Thu, Apr 18

akosiaris added a comment to T196478: rack/setup/install backup1001.

Host is up and running but as @Cmjohnson points out in T196478#4976375

Thu, Apr 18, 3:01 PM · Patch-For-Review, Operations, ops-eqiad
akosiaris added a comment to T200832: remove mathoid from scb.

deployment-mathoid still exists and has been failing puppet runs since December 3rd when profile::mathoid got removed.

I'd delete the VM; profile::mathoid isn't coming back. If anything, the VMs with the role::beta::docker_services role applied can probably handle the service now.

Unfortunately MediaWiki is still expecting the VM to exist:
wmf-config/LabsServices.php: 'mathoid' => 'http://deployment-mathoid.eqiad.wmflabs:10042',

The two hosts with that role are deployment-eventgate-analytics-1 and deployment-sessionstore01, so neither of them is really appropriate for this; plus, they both run jessie.

Thu, Apr 18, 2:17 PM · Beta-Cluster-Infrastructure, Core Platform Team Backlog (Watching / External), Services (watching), SCB, Mathoid, Operations
akosiaris added a comment to T219556: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet).

ganeti1001, 1002 and 2001 have been installed. I dunno what's up with ganeti2002. gnt-instance console schema2002.codfw.wmnet never shows anything.

Thu, Apr 18, 9:56 AM · Analytics-Kanban, Patch-For-Review, Core Platform Team Backlog (Watching / External), Core Platform Team (Modern Event Platform (TEC2)), Services (watching), Operations, vm-requests, EventBus, Analytics
akosiaris added a comment to T221259: eqord - ulsfo Telia link down - IC-313592.

Just noting that at 10:41 UTC the circuit was still down per

Thu, Apr 18, 9:42 AM · Operations, netops

Wed, Apr 17

akosiaris added a comment to T220661: EventGate service runner worker occasionally killed, usually during higher load.

Been doing a lot to get more data, including enabling node profiling and connecting to node inspector. I didn't learn much, except to confirm that I don't think Kafka is the CPU culprit. I did see a good number of cycles being spent generating the x-request-id header using cassandra-uuid.TimeUuid. This won't happen if we can ever get the x-request-id header to be set by varnish (T201409).

But I didn't see anything particularly abnormal, just a lot of time being spent by express parsing the JSON request bodies, which I guess makes sense.

Just to try, I increased k8s resource requests/limits to:

requests:
  cpu: 1000m
  memory: 100Mi
limits:
  cpu: 2000m
  memory: 200Mi
Wed, Apr 17, 9:14 AM · Analytics-Kanban, Patch-For-Review, Services (watching), EventBus, Analytics

Mon, Apr 15

akosiaris added a comment to T165795: Ldap auth extension vs. ldap vs. username Case.

I see and understand the concerns from @akosiaris in T165795#5092381. I personally believe this un-indexed lookup is an acceptable overhead in the short term because it will only be used in the Wikitech account login flow. This is not a high rate of change event in our current system.

Mon, Apr 15, 4:39 PM · Patch-For-Review, cloud-services-team (Kanban), wikitech.wikimedia.org, MediaWiki-extensions-LdapAuthentication
akosiaris triaged T220894: Replacement of network::constant's special_hosts as Low priority.
Mon, Apr 15, 4:22 PM · Patch-For-Review, Operations
akosiaris added a comment to T220661: EventGate service runner worker occasionally killed, usually during higher load.

@akosiaris no, spinning up a new worker takes no time; the problem here is actually killing the old worker.

The heartbeat limit is currently 7.5 seconds, after which we do not just SIGKILL the worker: we optimistically try to shut it down properly and, if that does not happen, we SIGKILL it after 1 minute. For 'normal' operation with many workers, this is the right approach: the master stops dispatching requests to the worker immediately after ordering it to shut down gracefully, and with less incoming pressure the worker might recover a bit and get a chance to finish up requests and do its pre-shutdown housekeeping.
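
The escalation described above (ask the worker to shut down, then SIGKILL it if it has not exited within a grace period) can be sketched in Python as follows. This is only an illustration and not service-runner's actual code: `spawn`, `stop_worker` and the child script are hypothetical names, the grace periods are shortened from the 1 minute mentioned above so the example finishes quickly, and a POSIX system is assumed.

```python
import subprocess
import sys

# Hypothetical stand-in for a worker: optionally ignores SIGTERM, then idles.
CHILD = """
import signal, sys, time
if sys.argv[1] == "stubborn":
    signal.signal(signal.SIGTERM, signal.SIG_IGN)
print("ready", flush=True)
time.sleep(30)
"""


def spawn(mode: str) -> subprocess.Popen:
    """Start a child worker and wait until its signal handling is in place."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, mode], stdout=subprocess.PIPE, text=True
    )
    proc.stdout.readline()  # blocks until the child prints "ready"
    return proc


def stop_worker(proc: subprocess.Popen, grace_seconds: float) -> str:
    """Request a graceful exit, then force-kill once the grace period is over."""
    proc.terminate()  # SIGTERM: a polite request the worker may act on
    try:
        proc.wait(timeout=grace_seconds)
        return "graceful"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: cannot be caught or ignored
        proc.wait()
        return "killed"


if __name__ == "__main__":
    print(stop_worker(spawn("polite"), grace_seconds=5.0))
    print(stop_worker(spawn("stubborn"), grace_seconds=0.5))
```

The point of the two-step approach is that a worker which is merely overloaded gets a chance to finish its in-flight work, while one that is truly stuck is still removed after a bounded time.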

Mon, Apr 15, 1:56 PM · Analytics-Kanban, Patch-For-Review, Services (watching), EventBus, Analytics

Fri, Apr 12

akosiaris triaged T220822: Site: 4 VM request for kubernetes as Normal priority.
Fri, Apr 12, 1:39 PM · Patch-For-Review, User-jijiki, serviceops, vm-requests, Operations
akosiaris added a parent task for T220822: Site: 4 VM request for kubernetes: T220821: Add security sensitive nodes to our kubernetes cluster.
Fri, Apr 12, 1:36 PM · Patch-For-Review, User-jijiki, serviceops, vm-requests, Operations
akosiaris added a subtask for T220821: Add security sensitive nodes to our kubernetes cluster: T220822: Site: 4 VM request for kubernetes.
Fri, Apr 12, 1:36 PM · Patch-For-Review, Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris triaged T220821: Add security sensitive nodes to our kubernetes cluster as Normal priority.
Fri, Apr 12, 1:36 PM · Patch-For-Review, Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris created T220822: Site: 4 VM request for kubernetes.
Fri, Apr 12, 1:34 PM · Patch-For-Review, User-jijiki, serviceops, vm-requests, Operations
akosiaris created T220821: Add security sensitive nodes to our kubernetes cluster.
Fri, Apr 12, 1:31 PM · Patch-For-Review, Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris claimed T220808: Re-evaluate kubelet operation latencies alerts.
Fri, Apr 12, 1:15 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, Operations
akosiaris lowered the priority of T220808: Re-evaluate kubelet operation latencies alerts from High to Low.

Change merged and shepherded into production. I am lowering priority but not resolving as we probably want to evaluate this more

Fri, Apr 12, 1:10 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, Operations
akosiaris added a comment to T220808: Re-evaluate kubelet operation latencies alerts.

I am thinking about excluding exec_sync operations for a while from the checks to restore faith in the alerts.

Fri, Apr 12, 1:04 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, Operations
akosiaris added a comment to T220808: Re-evaluate kubelet operation latencies alerts.

I am thinking about excluding exec_sync operations for a while from the checks to restore faith in the alerts.

Fri, Apr 12, 12:45 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, Operations
akosiaris updated subscribers of T220808: Re-evaluate kubelet operation latencies alerts.
Fri, Apr 12, 12:34 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, Operations
akosiaris added a comment to T220808: Re-evaluate kubelet operation latencies alerts.

A breakdown of the alerts per host from 2019-03-26 to 2019-04-12 follows

Fri, Apr 12, 12:32 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, Operations
akosiaris triaged T220808: Re-evaluate kubelet operation latencies alerts as High priority.

Today (2019-04-12), I've raised the possibility that T220661 is related to the reason these alerts are flapping so much.

Fri, Apr 12, 11:47 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, Operations
akosiaris created T220808: Re-evaluate kubelet operation latencies alerts.
Fri, Apr 12, 11:27 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, Operations
akosiaris triaged T220709: Upgrade statsd_exporter to 0.9 as Normal priority.
Fri, Apr 12, 7:09 AM · Core Platform Team Backlog (Watching / External), Services (watching), Analytics, EventBus, monitoring, User-fgiunchedi, Operations
akosiaris triaged T220661: EventGate service runner worker occasionally killed, usually during higher load as Normal priority.
Fri, Apr 12, 7:08 AM · Analytics-Kanban, Patch-For-Review, Services (watching), EventBus, Analytics
akosiaris added a comment to T220661: EventGate service runner worker occasionally killed, usually during higher load.

FYI, note the pod restarts below. Seems like the worker isn't ready in the timespan of the kubelet liveness probes (3x 10s => 30s) and eventually the pod is killed. Does it really take that long to initialize a new worker? We can definitely tune those numbers, but 30s for initializing a worker sounds like a lot.
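
The 30s figure above is the probe period multiplied by the number of consecutive failures the kubelet tolerates (`periodSeconds` and `failureThreshold` in the Kubernetes probe spec). A rough Python model of that arithmetic follows; the function names are hypothetical, and the model deliberately ignores `initialDelaySeconds` and per-probe timeouts.

```python
def liveness_window(period_seconds: float, failure_threshold: int) -> float:
    """Approximate time a container may stay unresponsive before a restart."""
    return period_seconds * failure_threshold


def pod_gets_restarted(
    unready_seconds: float, period_seconds: float = 10, failure_threshold: int = 3
) -> bool:
    """True if the worker stays unresponsive for the whole liveness window."""
    return unready_seconds >= liveness_window(period_seconds, failure_threshold)


if __name__ == "__main__":
    print(liveness_window(10, 3))   # the 3x 10s case from the comment
    print(pod_gets_restarted(45))   # unresponsive longer than the window
    print(pod_gets_restarted(5))    # recovers well within the window
```

Raising either value widens the window, at the cost of taking longer to notice a pod that is genuinely dead.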

Fri, Apr 12, 7:08 AM · Analytics-Kanban, Patch-For-Review, Services (watching), EventBus, Analytics

Wed, Apr 10

akosiaris updated the task description for T217641: Stop analytics-wmde-scripts/blob/master/src/wikidata/social/googleplus.php script.
Wed, Apr 10, 2:38 PM · Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata
akosiaris closed T217641: Stop analytics-wmde-scripts/blob/master/src/wikidata/social/googleplus.php script as Resolved.

dummy private repo updated, so is the actual private repo. Resolving, thanks!

Wed, Apr 10, 2:38 PM · Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata
akosiaris added a comment to T188281: Investigate deployment concurrency limitations for ORES.

@akosiaris Should we still pursue this?

Wed, Apr 10, 9:00 AM · Scoring-platform-team (Research), Scap
akosiaris added a comment to T219556: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet).

@akosiaris, I'd love to do this sooner rather than later. It'd make some configuration/deployment stuff in the Hive/Hadoop world much easier.

Am I allowed to follow https://wikitech.wikimedia.org/wiki/Ganeti#Create_a_VM and do this myself?

Wed, Apr 10, 8:58 AM · Analytics-Kanban, Patch-For-Review, Core Platform Team Backlog (Watching / External), Core Platform Team (Modern Event Platform (TEC2)), Services (watching), Operations, vm-requests, EventBus, Analytics

Tue, Apr 9

akosiaris added a comment to T209707: tagged_interface sometimes exceeds IFNAMSIZ.

FWIW the ganeti cluster uses exactly the approach outlined by @BBlack for this (among other, even more important) reasons:

Tue, Apr 9, 10:50 AM · Patch-For-Review, Traffic, Operations
akosiaris added a comment to T217724: Investigate 2019-03-01 Proton incident.

Firejail seems a nice extra for robustness, but @akosiaris seemed to suggest in T217724#5008302 that that might introduce some overhead. AIUI the sandbox is discarded anyway after every PDF render, to reduce the likelihood of problems caused by state leaking over to the next render, but I might be misunderstanding.

Tue, Apr 9, 9:06 AM · Patch-For-Review, Reading-Infrastructure-Team-Backlog (Kanban), Core Platform Team (Security, stability, performance and scalability (TEC1)), Proton

Mon, Apr 8

akosiaris added a comment to T212189: New Service Request: Wikidata Termbox SSR.

@WMDE-leszek Hi, sorry for not answering any sooner, last few weeks have been crazy indeed.

Mon, Apr 8, 4:44 PM · Core Platform Team Backlog (Later), User-Addshore, serviceops, Services (next), Wikidata-Termbox-Hike, Wikidata, Service-deployment-requests, Operations
akosiaris triaged T220405: TEC3:05:05.1:Q4 Services and the deployment pipeline are hosted on production-level infrastructure as Normal priority.
Mon, Apr 8, 2:47 PM · Operations, serviceops
akosiaris created T220405: TEC3:05:05.1:Q4 Services and the deployment pipeline are hosted on production-level infrastructure.
Mon, Apr 8, 2:47 PM · Operations, serviceops
akosiaris triaged T220403: TEC3:Q4 Tracking task as Normal priority.
Mon, Apr 8, 2:46 PM · Operations, serviceops
akosiaris added a parent task for T220398: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline: T220403: TEC3:Q4 Tracking task.
Mon, Apr 8, 2:46 PM · Core Platform Team Backlog (Watching / External), Services (watching), Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris added subtasks for T220403: TEC3:Q4 Tracking task: T220397: TEC3:O6:O:6.1:Q4: Deployment Pipeline Documentation, T220398: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline.
Mon, Apr 8, 2:46 PM · Operations, serviceops
akosiaris added a parent task for T220397: TEC3:O6:O:6.1:Q4: Deployment Pipeline Documentation: T220403: TEC3:Q4 Tracking task.
Mon, Apr 8, 2:46 PM · Operations, Prod-Kubernetes, Release Pipeline, Documentation
akosiaris created T220403: TEC3:Q4 Tracking task.
Mon, Apr 8, 2:45 PM · Operations, serviceops
akosiaris triaged T220400: Migrate ORES to kubernetes as Normal priority.
Mon, Apr 8, 2:44 PM · Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris triaged T220402: Introduce wikidata termbox SSR to kubernetes as Normal priority.
Mon, Apr 8, 2:43 PM · Core Platform Team Backlog (Watching / External), Services (watching), Wikidata-Termbox-Hike, Wikidata, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris created T220402: Introduce wikidata termbox SSR to kubernetes.
Mon, Apr 8, 2:43 PM · Core Platform Team Backlog (Watching / External), Services (watching), Wikidata-Termbox-Hike, Wikidata, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris triaged T220401: Introduce kask session storage service to kubernetes as Normal priority.
Mon, Apr 8, 2:42 PM · Patch-For-Review, Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris created T220401: Introduce kask session storage service to kubernetes.
Mon, Apr 8, 2:42 PM · Patch-For-Review, Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris created T220400: Migrate ORES to kubernetes.
Mon, Apr 8, 2:42 PM · Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris updated the task description for T220398: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline.
Mon, Apr 8, 2:41 PM · Core Platform Team Backlog (Watching / External), Services (watching), Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris triaged T220399: Migrate cpjobqueue to kubernetes as Normal priority.
Mon, Apr 8, 2:41 PM · ChangeProp, WMF-JobQueue, Core Platform Team Backlog (Next), Core Platform Team (Security, stability, performance and scalability (TEC1)), Services (next), Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris created T220399: Migrate cpjobqueue to kubernetes.
Mon, Apr 8, 2:41 PM · ChangeProp, WMF-JobQueue, Core Platform Team Backlog (Next), Core Platform Team (Security, stability, performance and scalability (TEC1)), Services (next), Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris triaged T220397: TEC3:O6:O:6.1:Q4: Deployment Pipeline Documentation as Normal priority.
Mon, Apr 8, 2:40 PM · Operations, Prod-Kubernetes, Release Pipeline, Documentation
akosiaris triaged T220398: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline as Normal priority.
Mon, Apr 8, 2:40 PM · Core Platform Team Backlog (Watching / External), Services (watching), Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris created T220398: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline.
Mon, Apr 8, 2:40 PM · Core Platform Team Backlog (Watching / External), Services (watching), Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris changed the status of T213561: Discovery for Kafka cluster brokers from Open to Stalled.

Stalling until we have some sane solution.

Mon, Apr 8, 2:37 PM · Patch-For-Review, Operations, Services (watching), EventBus, Analytics
akosiaris changed the status of T213561: Discovery for Kafka cluster brokers, a subtask of T211247: Modern Event Platform: Stream Intake Service: Implementation: Deployment Pipeline, from Open to Stalled.
Mon, Apr 8, 2:37 PM · Patch-For-Review, Analytics-Kanban, Core Platform Team Backlog (Watching / External), Services (watching), Analytics-EventLogging, EventBus, Analytics
akosiaris created T220397: TEC3:O6:O:6.1:Q4: Deployment Pipeline Documentation.
Mon, Apr 8, 2:36 PM · Operations, Prod-Kubernetes, Release Pipeline, Documentation
akosiaris added a comment to T165795: Ldap auth extension vs. ldap vs. username Case.

Can we tell ldap to enforce non-case sensitivity?

Yes, with the :caseExactMatch: matching rule. See T165795#4110716.

Mon, Apr 8, 9:15 AM · Patch-For-Review, cloud-services-team (Kanban), wikitech.wikimedia.org, MediaWiki-extensions-LdapAuthentication
akosiaris added a comment to T219923: Move graphoid logging to new logging pipeline.

Apparently graphoid is still using service::node::config and not the config template in the deployment repo.

Mon, Apr 8, 8:38 AM · Core Platform Team Kanban (Blocked Externally), Services (blocked), Core Platform Team (Security, stability, performance and scalability (TEC1)), service-runner, Wikimedia-Logstash, Operations

Sat, Apr 6

Mill <mill@mail.com> committed rGBLBRd34ae4dd6604: 0kaaaaaaaaaaaa (authored by akosiaris).
0kaaaaaaaaaaaa
Sat, Apr 6, 12:35 AM

Fri, Apr 5

akosiaris added a comment to T220197: Rework ores grafana board.

Nice! Thanks so much!

Fri, Apr 5, 5:27 PM · User-Ladsgroup, Scoring-platform-team (Current), ORES

Thu, Apr 4

akosiaris added a comment to T198939: Decommission servermon.

And when we do, can we also drop the package_updates custom fact?

Thu, Apr 4, 5:16 PM · Patch-For-Review, Operations
akosiaris added a comment to T198939: Decommission servermon.

I can say I haven't in a pretty long time. If @faidon also doesn't I think we can shut it down.

Thu, Apr 4, 3:56 PM · Patch-For-Review, Operations
akosiaris added a comment to T220003: Add security apt security suites to pbuilder base images .

Currently the /etc/apt/sources.list files for the pbuilder base images are missing entries for the security suites. These files should be updated and managed by puppet.

Why though? When creating that puppet module I avoided that on purpose, relying on the fact that security updates would anyway be making it to our hosts and package names for that remain constant. That assumption might not be true anymore, or I may very well have erred back then, but I'd like to know which of the 2 (or something else entirely) it is.

This is needed for jessie, it's in LTS stage, so packages only get updated/added to jessie-security (the original jessie is frozen). Jessie LTS added a clang-4.0 package (used to build the Rust-based Firefox packages), which we need for a different package, but with the current setup pbuilder doesn't see it as it's only in jessie-security.

Thu, Apr 4, 9:52 AM · Patch-For-Review, Packaging, Operations
akosiaris added a comment to T220003: Add security apt security suites to pbuilder base images .

Currently the /etc/apt/sources.list files for the pbuilder base images are missing entries for the security suites. These files should be updated and managed by puppet.

Thu, Apr 4, 9:43 AM · Patch-For-Review, Packaging, Operations
akosiaris added a comment to T219708: [Discuss] Changes to ores codebase for migrating to kubernets .

Configs need to be overhauled into one big yaml file that's going to be checked out into helm charts

That's not necessary. We can ship more than one config file via helm charts. We can ship config directories as well.

Nice, thank you for letting me know!

Assets that ores uses also need to be pulled down when starting the service.

I am not so sure about this one. Very often assets are tightly coupled with the deployed code version and usually relatively small in size and fine to ship with the application.

We have an asset for ores that is 1.3 GB (word2vec dictionary)

Thu, Apr 4, 9:30 AM · Scoring-platform-team, ORES

Wed, Apr 3

akosiaris closed T219696: Alert "kubelet operational latencies" as Resolved.

Culprit identified.

Wed, Apr 3, 7:54 PM · Prod-Kubernetes, Kubernetes, Operations
akosiaris added a comment to T219708: [Discuss] Changes to ores codebase for migrating to kubernets .

Just some notes, I agree on premise with most of the above.

Wed, Apr 3, 4:40 PM · Scoring-platform-team, ORES

Tue, Apr 2

akosiaris added a comment to T168692: Blocking an account on wikitech should disable LDAP logins.

@bd808, I've submitted https://gerrit.wikimedia.org/r/500681 for review and updated https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/497866 to match.

Tue, Apr 2, 9:46 AM · MW-1.33-notes (1.33.0-wmf.23; 2019-03-26), Security-Team, Patch-For-Review, wikitech.wikimedia.org, LDAP, Wikimedia-Incident
akosiaris added a comment to Blog Post: Help my CI job fails with exit status -11.

One minor correction

Tue, Apr 2, 8:41 AM · Release-Engineering-Team

Mon, Apr 1

akosiaris added a comment to T168692: Blocking an account on wikitech should disable LDAP logins.

Now that I've automated the tests it was easy to combine the 2 approaches and fix the drawbacks. User5 in P8321 is subject to both approaches and the only drawback I could find is a race of 1s between an admin resetting the password (by mistake, or following some old and outdated process) and the user managing to authenticate to a service. This is an attack vector that requires 2 humans to perform an action and thus not something that can be automated. I think we are safe from this one.

Mon, Apr 1, 3:45 PM · MW-1.33-notes (1.33.0-wmf.23; 2019-03-26), Security-Team, Patch-For-Review, wikitech.wikimedia.org, LDAP, Wikimedia-Incident
akosiaris edited P8321 LDAP ppolicy tests.
Mon, Apr 1, 3:28 PM
akosiaris added a comment to T168692: Blocking an account on wikitech should disable LDAP logins.

In the interest of fully figuring this out I've updated my testing openldap vagrant env [1] with a Makefile[2] that automates and runs some tests[2]. I've pasted results for it in P8321. The test suite can (and maybe should) be extended to make sure we got all our bases covered.

Mon, Apr 1, 3:19 PM · MW-1.33-notes (1.33.0-wmf.23; 2019-03-26), Security-Team, Patch-For-Review, wikitech.wikimedia.org, LDAP, Wikimedia-Incident
akosiaris edited P8321 LDAP ppolicy tests.
Mon, Apr 1, 2:03 PM
akosiaris created P8321 LDAP ppolicy tests.
Mon, Apr 1, 1:18 PM

Sat, Mar 30

akosiaris claimed T219696: Alert "kubelet operational latencies".
Sat, Mar 30, 2:59 PM · Prod-Kubernetes, Kubernetes, Operations

Thu, Mar 28

akosiaris added a comment to T187765: Replace the Nginx fronting Thumbor with a reverse proxy capable of queuing requests.

@Gilles, I am a bit unclear as to what remains to be done for this. Could you shed some light?

Thu, Mar 28, 2:48 PM · User-jijiki, User-fgiunchedi, Patch-For-Review, Performance-Team, Thumbor
akosiaris added a comment to T219332: Modern Event Platform: Stream Intake Service: Documentation.
Thu, Mar 28, 1:10 PM · Release Pipeline, serviceops, Services (watching), EventBus, Analytics
akosiaris added a comment to T168692: Blocking an account on wikitech should disable LDAP logins.

Copying from https://gerrit.wikimedia.org/r/497684 (and adding some extra stuff)

Thu, Mar 28, 12:27 PM · MW-1.33-notes (1.33.0-wmf.23; 2019-03-26), Security-Team, Patch-For-Review, wikitech.wikimedia.org, LDAP, Wikimedia-Incident

Wed, Mar 27

akosiaris added a comment to T218680: EventGate Helm chart should POST test event for readinessProbe.

Thanks for the tips. Petr clued me into the fact that JSONSchema itself has an examples field, so we can add example events in schemas. I'm going to try to do something with this. I'm writing a nodejs script that will exist in the EventGate code that will take a schema URI, get it, get the examples from it, and POST them to the service. Could we exec this script for the readinessProbe? This would be like:

readinessProbe:
  exec:
    command: ["/srv/service/scripts/post-events", "examples", "/test/event/0.0.3", "http://localhost:8192/v1/events"]
  initialDelaySeconds: 2
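
The script described above (take a schema, read its examples, POST them to the service) could look roughly like the following. The comment plans a nodejs script; this Python version is only a sketch under assumptions: `example_events`, `build_post` and the inline `SCHEMA` with its fields are hypothetical, the schema is embedded rather than fetched from a URI, and posting the examples as a JSON array is an assumption about the endpoint. The request is built but not sent, so the sketch runs without a service.

```python
import json
import urllib.request


def example_events(schema: dict) -> list:
    """Return the example instances embedded in a JSON Schema document."""
    examples = schema.get("examples", [])
    if not isinstance(examples, list):
        raise ValueError("JSON Schema 'examples' must be an array")
    return examples


def build_post(events: list, endpoint: str) -> urllib.request.Request:
    """Prepare (but do not send) a JSON POST carrying the example events."""
    body = json.dumps(events).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical schema; a real script would fetch it from the schema URI.
SCHEMA = {
    "title": "test/event",
    "type": "object",
    "examples": [{"$schema": "/test/event/0.0.3", "test": "probe"}],
}

request = build_post(example_events(SCHEMA), "http://localhost:8192/v1/events")

if __name__ == "__main__":
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request, timeout=1)
```

An exec probe only looks at the exit status, so a real script would exit non-zero whenever the POST fails or times out.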
Wed, Mar 27, 8:41 PM · Analytics-Kanban, Patch-For-Review, EventBus, Analytics
akosiaris added a comment to T218680: EventGate Helm chart should POST test event for readinessProbe.

it might make more sense to create a specialized GET /healthz endpoint that just produces (and deletes if required/prudent?) a hardcoded test event in kafka.

I don't like this idea so much, mainly because it requires that we include WMF specific schemas in the API routing code. Thus far I've been able to keep any WMF specific stuff in the wikimedia-eventgate specific implementation.

Wed, Mar 27, 6:01 PM · Analytics-Kanban, Patch-For-Review, EventBus, Analytics
akosiaris added a comment to T200832: remove mathoid from scb.

deployment-mathoid still exists and has been failing puppet runs since December 3rd when profile::mathoid got removed.

Wed, Mar 27, 12:50 PM · Beta-Cluster-Infrastructure, Core Platform Team Backlog (Watching / External), Services (watching), SCB, Mathoid, Operations

Tue, Mar 26

akosiaris closed T218268: eventgate-analytics k8s pods occasionally can't produce to kafka as Resolved.

Resolving, feel free to reopen

Tue, Mar 26, 9:40 PM · Analytics-Kanban, Patch-For-Review, Analytics, Prod-Kubernetes, EventBus, serviceops, Operations
akosiaris closed T218268: eventgate-analytics k8s pods occasionally can't produce to kafka, a subtask of T218255: Enabling api-request eventgate to group1 caused minor service disruptions , as Resolved.
Tue, Mar 26, 9:40 PM · Analytics, EventBus, Patch-For-Review, Services (doing), Core Platform Team Kanban (Doing), Core Platform Team (Security, stability, performance and scalability (TEC1)), serviceops, Wikimedia-Incident, Operations
akosiaris claimed T168692: Blocking an account on wikitech should disable LDAP logins.
Tue, Mar 26, 3:46 PM · MW-1.33-notes (1.33.0-wmf.23; 2019-03-26), Security-Team, Patch-For-Review, wikitech.wikimedia.org, LDAP, Wikimedia-Incident
akosiaris added a comment to T218268: eventgate-analytics k8s pods occasionally can't produce to kafka.

apt-get update couldn't connect to the apt source IPv6 addresses.

Tue, Mar 26, 2:51 PM · Analytics-Kanban, Patch-For-Review, Analytics, Prod-Kubernetes, EventBus, serviceops, Operations

Mar 21 2019

akosiaris added a comment to T212189: New Service Request: Wikidata Termbox SSR.

Thanks for the understanding. We are drafting next quarter goals this week, I'll make sure to add this.

Just poking to double check that this was added (I would hate to see it missed).

Mar 21 2019, 3:47 PM · Core Platform Team Backlog (Later), User-Addshore, serviceops, Services (next), Wikidata-Termbox-Hike, Wikidata, Service-deployment-requests, Operations
akosiaris closed T218791: ORES incident 20190320 documentation as Resolved.

Incident report updated. The only actionable followup task is T122676

Mar 21 2019, 9:39 AM · Wikimedia-Incident, ORES, Scoring-platform-team (Current)
Mill <mill@mail.com> committed rDEPLOYCHARTS515939c70126: !qaaaaaaaaaaaa (authored by akosiaris).
!qaaaaaaaaaaaa
Mar 21 2019, 12:36 AM
Mill <mill@mail.com> committed rDEPLOYCHARTS6df6c287d6b0: ysaaaaaaaaaaaa (authored by akosiaris).
ysaaaaaaaaaaaa
Mar 21 2019, 12:36 AM
Mill <mill@mail.com> committed rDEPLOYCHARTSe2aed262194c: b7aaaaaaaaaaaa (authored by akosiaris).
b7aaaaaaaaaaaa
Mar 21 2019, 12:36 AM
Mill <mill@mail.com> committed rDEPLOYCHARTS5368e40e8d4b: #bbaaaaaaaaaaa (authored by akosiaris).
#bbaaaaaaaaaaa
Mar 21 2019, 12:35 AM

Mar 20 2019

akosiaris added a comment to T218305: EventGate wikimedia implementation should emit rdkafka stats.

@akosiaris, @fgiunchedi when you get a chance I'd appreciate a lookover of this dashboard:

https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate-analytics-otto0?refresh=1m&orgId=1&from=1553097349177&to=1553100949177&var-dc=eqiad%20prometheus%2Fk8s-staging&var-service=eventgate-analytics&var-kafka_producer_type=All&var-kafka_broker=All&var-kafka_topic=All

I went ahead and put the Kafka graphs I added into the existing rows (sorry Filippo, if you really don't like it I can move them back to a Kafka specific one!).

Mar 20 2019, 5:19 PM · Analytics-Kanban, Patch-For-Review, Services (watching), EventBus, Analytics

Mar 19 2019

akosiaris added a comment to T218680: EventGate Helm chart should POST test event for readinessProbe.

Well I have my reservations for sure. As I said we are talking about a service-checker run every 10s (tunable, but it's a sensible default). While tunable, the command should not take long. The timeout is tunable but the default is 1s. I think it's a sensible value for a web service and I don't think service-checker doing all the endpoints + a POST that will end up in kafka is going to be THAT fast. It also requires having service-checker on every node out there. While fine in production, the development environments are definitely not gonna have it. We could have the probe in values.yaml (we do for all other services, only eventgate is an exception) and override it in production and that's probably fine, but it should be documented.

Mar 19 2019, 7:53 PM · Analytics-Kanban, Patch-For-Review, EventBus, Analytics
akosiaris added a comment to T218680: EventGate Helm chart should POST test event for readinessProbe.

The readiness probe can't really be POST. The ref is here https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#probe-v1-core, it only allows httpGet, tcpSocket and exec. Exec could be used for this and call the service-checker but it adds a dependency on having service-checker on the nodes (it isn't there currently) and instrumentation (getting the pod IP essentially, I am not sure it's exposed there, will need to check) to make it happen. It might also be a tad heavy to run service-checker every 10 secs for every pod.

Mar 19 2019, 5:03 PM · Analytics-Kanban, Patch-For-Review, EventBus, Analytics
akosiaris added a comment to T217650: Deployment strategy for the session storage application..

I don't like it either; this is what prompted these questions! The image entrypoint was /etc/kask/config.yaml, but I didn't see a way to alter where the configuration would be written via docker_services, and an override_cmd won't work so long as the image entrypoint is passing all of the arguments.

Mar 19 2019, 4:47 PM · Patch-For-Review, Kubernetes, serviceops, Core Platform Team (Multi-DC (TEC1)), User-Clarakosi, Core Platform Team Backlog (Next), User-Eevans
akosiaris added a comment to T217650: Deployment strategy for the session storage application..

Kask has now been set up for session storage in deployment-prep using docker_services (deployment-sessionstore01.deployment-prep.eqiad.wmflabs); I have a few questions about how this all will work in production (and presumably deployment-prep, at some point in the future).

  • The name used in docker_services is a normalization of the Git repo name (i.e. mediawiki-services-kask here), will this also be the case when deployed to k8s in production?
Mar 19 2019, 1:47 PM · Patch-For-Review, Kubernetes, serviceops, Core Platform Team (Multi-DC (TEC1)), User-Clarakosi, Core Platform Team Backlog (Next), User-Eevans

Mar 17 2019

akosiaris added a comment to T218472: gerrit.wikimedia.org is down.

I think the intention is to (somewhat?) limit the impact the vandal is trying to achieve (at least by removing the capability to link to those comments). While it's impossible to fully mitigate that, as you point out, due to the email notifications, and it potentially causes other issues, I still consider it a prudent course of action. Other than that, please wait for a formal announcement (a link to it will be posted to this task as well).

Mar 17 2019, 12:44 PM · User-greg, Operations, Release-Engineering-Team, Gerrit