
Wed, May 22

akosiaris added a comment to T224041: Kask integration testing with Cassandra via the Deployment Pipeline.

It seems that the cassandra subchart already exists for kask (via https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/509102/ );

Wed, May 22, 1:42 PM · Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris lowered the priority of T221577: Wikimedia\Rdbms\LBFactory::getEmptyTransactionTicket: GeoData\Hooks::doLinksUpdate does not have outer scope from Unbreak Now! to High.

I am lowering to High, just in the interest of not abusing Unbreak Now!, since this task has been in this state since Apr 23. That being said, this indeed needs to be resolved ASAP.

Wed, May 22, 12:59 PM · Patch-For-Review, Performance-Team, Multimedia, MediaWiki-Database, GlobalUsage, MediaWiki-extensions-PageAssessments, Discovery-Search, GeoData, Wikimedia-production-error

Sun, May 19

akosiaris reopened Restricted Task, a subtask of T218750: Re-enable use of Gerrit HTTP token to push patchsets, as Open.
Sun, May 19, 7:41 AM · VPS-project-libraryupgrader, Release-Engineering-Team, Gerrit

Fri, May 17

akosiaris awarded T220894: Replacement of network::constant's special_hosts a Yellow Medal token.
Fri, May 17, 2:34 PM · Patch-For-Review, Operations

Thu, May 16

akosiaris updated the task description for T220894: Replacement of network::constant's special_hosts.
Thu, May 16, 11:37 AM · Patch-For-Review, Operations
akosiaris added a comment to T223395: Cxserver container: Container does not send fatal errors to docker logs via stdout?.

needs an extra stanza

Thu, May 16, 10:08 AM · CX-cxserver, Beta-Cluster-reproducible, Services (next), serviceops
akosiaris updated the task description for T220894: Replacement of network::constant's special_hosts.
Thu, May 16, 9:05 AM · Patch-For-Review, Operations
akosiaris closed T220709: Upgrade statsd_exporter to 0.9 as Resolved.

Every deployment in kubernetes that uses statsd-exporter has been upgraded (zotero & blubberoid don't use it). Resolving this. Many thanks!

Thu, May 16, 8:54 AM · Patch-For-Review, Core Platform Team Backlog (Watching / External), Services (watching), Analytics, EventBus, observability, User-fgiunchedi, Operations

Wed, May 15

akosiaris created P8527 hetzner IP blocks per https://ipinfo.io/AS24940.
Wed, May 15, 1:06 PM
akosiaris committed rDEPLOYCHARTSc1786e208d8c: First draft of a wikibase-termbox chart (authored by akosiaris).
First draft of a wikibase-termbox chart
Wed, May 15, 11:25 AM
akosiaris committed rDEPLOYCHARTSde0b78535858: Introduce the wikibase-termbox chart (authored by akosiaris).
Introduce the wikibase-termbox chart
Wed, May 15, 11:25 AM
akosiaris added a comment to T220235: Migrate Beta cluster services to use Kubernetes .
krenair@deployment-docker-cxserver01:~$ sudo /usr/bin/docker run -p 8080:8080 -v /etc/mediawiki-services-cxserver/:/etc/mediawiki-services-cxserver --name alex-test docker-registry.wikimedia.org/wikimedia/mediawiki-services-cxserver:2019-05-08-064536-production -c /etc/mediawiki-services-cxserver/config.yaml
 Error during DHT setup undefined
 krenair@deployment-docker-cxserver01:~$

Anyone know what that means?

Wed, May 15, 8:18 AM · Patch-For-Review, Editing-team, Core Platform Team Backlog (Next), Services (next), Kubernetes, Release Pipeline, serviceops, Beta-Cluster-Infrastructure
akosiaris added a comment to T223345: Zotero container: Production is running candidate version, last production version is broken due to lack of ca-certificates package.

It's working in production because we connect to external URIs via a proxy, hence we don't need ca-certificates.

Wed, May 15, 8:14 AM · Core Platform Team Backlog (Watching / External), Beta-Cluster-reproducible, Editing-team, Services (next), serviceops

Tue, May 14

akosiaris added a comment to T220402: Introduce wikidata termbox SSR to kubernetes.

Hi @akosiaris,

thanks for taking the time to explain the way the Host header is intended to be used.
If I understand correctly the goal is to ensure that requests originating from our service bear a header Host: (www.)wikidata.org and reach whichever IP(s) appservers.discovery.wmnet resolves to on the system running it. This sounds like a name resolution challenge and a case for HostAliases or, more traditionally, a CNAME record.
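
To make the two layers in this discussion concrete, here is a minimal Python sketch of the explicit Host header option (as opposed to HostAliases or a CNAME). The /w/api.php path and the use of urllib are assumptions for illustration only, not the termbox service's actual code.

```python
from urllib.request import Request

# Hypothetical values: the internal endpoint the service connects to and
# the project the request is meant for.
INTERNAL_ENDPOINT = "http://appservers.discovery.wmnet/w/api.php"
PROJECT_HOST = "www.wikidata.org"


def build_request(endpoint: str, project_host: str) -> Request:
    """Build a request that connects to `endpoint` but names `project_host`
    in the Host header, so the receiving side can identify the project."""
    return Request(endpoint, headers={"Host": project_host})


# No network traffic happens here; the request object is only constructed.
req = build_request(INTERNAL_ENDPOINT, PROJECT_HOST)
```

With this approach the connection target and the Host header are controlled separately in the application, whereas HostAliases or a CNAME would move that decision into name resolution.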

Tue, May 14, 12:03 PM · Patch-For-Review, Core Platform Team Backlog (Watching / External), Services (watching), Wikidata-Termbox-Hike, Wikidata, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris added a comment to T220402: Introduce wikidata termbox SSR to kubernetes.

Hi @akosiaris - thanks for getting back to us.

sending a Host: HTTP for the identification of the exact project. Would it be possible to add that functionality?

It certainly is possible and depending on operational needs we certainly can make this happen. We quickly discussed this in the team and would like to first truly understand the goal to make sure we don't mix up the different layers of our proverbial sausage pizza without a valid reason. The service would be run in a container inside a k8s pod, controlling its DNS - why not use this option to make sure requests reach the intended endpoint? The correct host would then come "for free" per the host part configured in WIKIBASE_REPO.

Tue, May 14, 10:12 AM · Patch-For-Review, Core Platform Team Backlog (Watching / External), Services (watching), Wikidata-Termbox-Hike, Wikidata, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris added a comment to T220402: Introduce wikidata termbox SSR to kubernetes.

With respect to the end point checks it would be great to hear what we are trying to achieve with them. Our service depends on the availability of another service. If the examples are to act as smoke tests then their reliability depends on the upstream service; a dependency which would need to be configured (are we going to point it against prod for this?) & modeled (how to express service inter-dependency in the config?) in order to be able to make sense of the information down the line (i.e. "no need to be alarmed that this service reported 500 while the mw api was down").
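
One way to picture the modeling question raised here is a check that only alerts on a server error when the upstream dependency is known to be healthy. The sketch below is purely illustrative: the `CheckResult` type and `should_alert` function are hypothetical names and do not describe how service-checker or any existing monitoring works.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    service: str
    status: int  # HTTP status code returned by the endpoint check


def should_alert(result: CheckResult, upstream_healthy: bool) -> bool:
    """Alert on a 5xx response only when the upstream dependency is healthy.

    If the upstream (for example the MediaWiki API) is down, a 500 from the
    dependent service is expected and is attributed to the upstream instead.
    """
    is_server_error = 500 <= result.status < 600
    return is_server_error and upstream_healthy
```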

Tue, May 14, 10:01 AM · Patch-For-Review, Core Platform Team Backlog (Watching / External), Services (watching), Wikidata-Termbox-Hike, Wikidata, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris closed T222899: Set up LVS for eventgate-main on port 32192 as Resolved.
Tue, May 14, 7:57 AM · Analytics-Kanban, Patch-For-Review, serviceops, Core Platform Team Backlog (Watching / External), Services (watching), EventBus, Analytics
akosiaris closed T222899: Set up LVS for eventgate-main on port 32192, a subtask of T218346: Modern Event Platform: Deploy instance of EventGate service that produces events to kafka main , as Resolved.
Tue, May 14, 7:57 AM · serviceops, Patch-For-Review, Analytics-Kanban, Core Platform Team Backlog (Watching / External), Services (watching), Analytics-EventLogging, EventBus, Analytics
akosiaris added a comment to T220235: Migrate Beta cluster services to use Kubernetes .

Could we use image version: latest in beta hiera? And somehow pull down the new latest and restart the image whenever a new version is created and uploaded to the registry?

Sure, you just have to add a more complex script to exec to service::docker I guess, so that you can properly check that 'latest' or any other similar metatag is correctly respected.

Be my guest, I'll happily review the change!

Tue, May 14, 7:50 AM · Patch-For-Review, Editing-team, Core Platform Team Backlog (Next), Services (next), Kubernetes, Release Pipeline, serviceops, Beta-Cluster-Infrastructure

Mon, May 13

akosiaris added a comment to T223126: Install new PDUs into b5-eqiad.

Bacula & puppet databases are not going to exhibit any problems anyway. Puppet is literally used only by servermon, which is to be uninstalled pretty soon, and backups don't happen during that time window.
etherpad, given the software, is a best-effort service, so no guarantees there. It will probably crash anyway, be restarted by systemd (as it is anyway every couple of days), and users will be reconnected.

Mon, May 13, 4:55 PM · ops-eqiad, Operations
akosiaris added a comment to T222962: Use new eventgate chart release analytics for eventgate-analytics service..

Hm, question. Currently mediawiki-config ProductionServices.php has:

'eventgate-analytics' => 'http://eventgate-analytics.discovery.wmnet:31192',

If we change LVS port to 33192, this will just change the LVS monitoring/pooling/depooling, right?

Mon, May 13, 1:56 PM · Patch-For-Review, serviceops, Analytics-Kanban, Services (watching), EventBus, Analytics
akosiaris committed rDEPLOYCHARTSda66bfb82706: Actually rename the GC metric for eventgate (authored by akosiaris).
Actually rename the GC metric for eventgate
Mon, May 13, 1:47 PM
Gerrit Code Review <gerrit@wikimedia.org> committed rDEPLOYCHARTS2ac6149c1fe9: Merge "eventgate: Switch GC metric to microseconds, update buckets" (authored by akosiaris).
Merge "eventgate: Switch GC metric to microseconds, update buckets"
Mon, May 13, 1:33 PM
akosiaris committed rDEPLOYCHARTS83da86e1d472: Add initialize_service.sh tool (authored by akosiaris).
Add initialize_service.sh tool
Mon, May 13, 1:29 PM
akosiaris committed rDEPLOYCHARTS33070727823c: eventgate: Switch GC metric to microseconds, update buckets (authored by akosiaris).
eventgate: Switch GC metric to microseconds, update buckets
Mon, May 13, 1:24 PM
akosiaris committed rDEPLOYCHARTS629ee5835c7c: cxserver: Switch GC stats back to microseconds (authored by akosiaris).
cxserver: Switch GC stats back to microseconds
Mon, May 13, 1:05 PM
akosiaris committed rDEPLOYCHARTS0a844a695e1f: cxserver: Switch GC stats back to microseconds (authored by akosiaris).
cxserver: Switch GC stats back to microseconds
Mon, May 13, 12:49 PM
akosiaris added a comment to T222795: Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats.

Thank you for an impressive level of detail :) There's a bunch of other places where we abuse the timing metric within services exactly for the reason that we've needed to have percentiles, so the decision we make here should probably be adopted elsewhere.

Mon, May 13, 12:26 PM · Patch-For-Review, Services (later), service-runner, serviceops, Operations
akosiaris added a comment to T172333: Scap: keyholder Too many authentication failures.

https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/377269/ had fallen through the cracks. It's now merged, right before a SWAT window, in order to identify issues as fast as possible

Mon, May 13, 10:52 AM · RelEng-Archive-FY201718-Q1, Patch-For-Review, Scap

Fri, May 10

akosiaris committed rDEPLOYCHARTS37e07d1c0d6b: First draft of a wikibase-termbox chart (authored by akosiaris).
First draft of a wikibase-termbox chart
Fri, May 10, 2:22 PM
akosiaris added a comment to T220402: Introduce wikidata termbox SSR to kubernetes.

@Tarrow, @WMDE-leszek I've noticed 3 things while working on the above

Fri, May 10, 1:56 PM · Patch-For-Review, Core Platform Team Backlog (Watching / External), Services (watching), Wikidata-Termbox-Hike, Wikidata, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris committed rDEPLOYCHARTSb36eef9f0f37: First draft of a wikibase-termbox chart (authored by akosiaris).
First draft of a wikibase-termbox chart
Fri, May 10, 1:21 PM
akosiaris added a comment to T220402: Introduce wikidata termbox SSR to kubernetes.

@WMDE-leszek. Yes I did.

Fri, May 10, 1:16 PM · Patch-For-Review, Core Platform Team Backlog (Watching / External), Services (watching), Wikidata-Termbox-Hike, Wikidata, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris created P8511 termbox.py.
Fri, May 10, 1:10 PM · Wikidata, serviceops, Wikidata-Termbox-Hike
akosiaris added a comment to T222435: /termbox query validation based on openapi.json.

I guess this can be resolved? Nice work on the validation btw! I just witnessed it. Thanks!

Fri, May 10, 1:09 PM · Wikidata-Termbox-Iteration-15, Wikidata-Termbox-Iteration-14
akosiaris added a comment to T220401: Introduce kask session storage service to kubernetes.

@Clarakosi @Eevans. I've updated the chart to also conditionally install a minimal cassandra for use in minikube. In my tests I was able to use and even run some benchmarks on it. On our side we are ready to proceed with deployment to staging and then production as soon as https://gerrit.wikimedia.org/r/#/c/mediawiki/services/kask/+/507397/ is merged.

Fri, May 10, 10:20 AM · Patch-For-Review, Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris committed rDEPLOYCHARTSce175c421ee0: kask: Add incubator/cassandra subchart (authored by akosiaris).
kask: Add incubator/cassandra subchart
Fri, May 10, 10:16 AM

Thu, May 9

akosiaris committed rDEPLOYCHARTS9ed66642275d: kask: Add incubator/cassandra subchart (authored by akosiaris).
kask: Add incubator/cassandra subchart
Thu, May 9, 4:24 PM
akosiaris added a comment to T220235: Migrate Beta cluster services to use Kubernetes .

Since the docker container will be the same as the one running in production, I don't think the environmental differences will be more than they already are in beta. The config for the service will be manually specified in hiera (either in Horizon or in operations/puppet), and here you can get as close to or as far from prod as you like.

I think the main problem will come from the fact that in production the service configs live in Helm charts now, rather than in Puppet. This means we can't easily reuse config defaults between beta and production. If we eventually end up somehow rendering Helm release values files from Puppet (@akosiaris mentioned this might happen), then it might be easier to build some puppet abstraction on top of role::beta::docker_services that makes it a little easier to share configs with beta.

Thu, May 9, 2:14 PM · Patch-For-Review, Editing-team, Core Platform Team Backlog (Next), Services (next), Kubernetes, Release Pipeline, serviceops, Beta-Cluster-Infrastructure
akosiaris committed rMSKS8f33b25175b0: Serve OpenAPI specification (authored by Eevans).
Serve OpenAPI specification
Thu, May 9, 12:34 PM

Wed, May 8

akosiaris renamed T222795: Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats from Re-evaluate service-runner's (ab) of statsd timing metric for nodejs GC stats to Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats.
Wed, May 8, 3:01 PM · Patch-For-Review, Services (later), service-runner, serviceops, Operations
Ottomata awarded T222795: Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats a Stroopwafel token.
Wed, May 8, 2:13 PM · Patch-For-Review, Services (later), service-runner, serviceops, Operations
akosiaris triaged T222795: Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats as Low priority.
Wed, May 8, 2:05 PM · Patch-For-Review, Services (later), service-runner, serviceops, Operations
akosiaris created T222795: Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats.
Wed, May 8, 2:05 PM · Patch-For-Review, Services (later), service-runner, serviceops, Operations

Tue, May 7

akosiaris added a comment to T218346: Modern Event Platform: Deploy instance of EventGate service that produces events to kafka main .
Tue, May 7, 12:04 PM · serviceops, Patch-For-Review, Analytics-Kanban, Core Platform Team Backlog (Watching / External), Services (watching), Analytics-EventLogging, EventBus, Analytics

Mon, May 6

akosiaris added a comment to T221529: Frequent puppet failures .
Mon, May 6, 10:56 AM · Puppet, puppet-compiler, Operations

Thu, Apr 25

akosiaris added a comment to T218346: Modern Event Platform: Deploy instance of EventGate service that produces events to kafka main .

kubernetes_namespace might be right, but will app or service be? If we leave it as is, I believe those will be set to .Chart.Name, which will now be just 'eventgate'.

Thu, Apr 25, 4:47 PM · serviceops, Patch-For-Review, Analytics-Kanban, Core Platform Team Backlog (Watching / External), Services (watching), Analytics-EventLogging, EventBus, Analytics
akosiaris added a comment to T218346: Modern Event Platform: Deploy instance of EventGate service that produces events to kafka main .

I'd advise against having the namespace referenced in charts.

I'm mostly just trying to fix the fact that the service and the metrics are named after .Chart.Name currently. We don't currently use the release (e.g. staging, production) in these names, which is good since they can vary even more than that, as you say. How should I set

# Statsd metrics reporter
metrics:
  name: {{ .Chart.Name }}

If I don't use the namespace?

Thu, Apr 25, 3:53 PM · serviceops, Patch-For-Review, Analytics-Kanban, Core Platform Team Backlog (Watching / External), Services (watching), Analytics-EventLogging, EventBus, Analytics
akosiaris added a comment to T218346: Modern Event Platform: Deploy instance of EventGate service that produces events to kafka main .

@fsero @akosiaris - moving discussion about eventgate-main patch here, will be easier to communicate.

I'm going to try to make a single chart that can be used for all deployments of EventGate, and I think I need a little help figuring out how to organize that. Right now in prod we have:

  • chart: eventgate-analytics
  • releases:
    • production (in CLUSTER=eqiad and CLUSTER=codfw)
    • staging (in CLUSTER=staging).

I think what we want is:

  • chart: eventgate
  • releases:
    • analytics-production (in CLUSTER=eqiad and CLUSTER=codfw)
    • analytics-staging (in CLUSTER=staging)
    • main-production (in CLUSTER=eqiad and CLUSTER=codfw)
    • main-staging (in CLUSTER=staging)

And also eventually, as part of T217142, something like:
  • releases:
    • logging-production (in CLUSTER=eqiad and CLUSTER=codfw)
    • logging-staging (in CLUSTER=staging)
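
The proposed layout amounts to one shared chart with a release per flavor and environment. A small Python sketch of that naming scheme follows; the function and variable names are hypothetical, and only the release names and clusters come from the lists above.

```python
from itertools import product

# Flavors and environments taken from the proposal; "logging" would be
# added to FLAVORS later as part of T217142.
FLAVORS = ["analytics", "main"]
ENVIRONMENTS = {"production": ["eqiad", "codfw"], "staging": ["staging"]}


def planned_releases(flavors, environments):
    """Return {release_name: [clusters]} for a single shared 'eventgate' chart."""
    return {
        f"{flavor}-{env}": clusters
        for flavor, (env, clusters) in product(flavors, environments.items())
    }


releases = planned_releases(FLAVORS, ENVIRONMENTS)
```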
Thu, Apr 25, 10:24 AM · serviceops, Patch-For-Review, Analytics-Kanban, Core Platform Team Backlog (Watching / External), Services (watching), Analytics-EventLogging, EventBus, Analytics

Apr 24 2019

akosiaris closed T221758: Level3 esams <-> eqiad link outage as Resolved.

CenturyLink sent a summary and a notification that they'll close the issues as resolved on their end.

Apr 24 2019, 2:04 PM · netops, Operations
akosiaris lowered the priority of T221758: Level3 esams <-> eqiad link outage from High to Low.
Apr 24 2019, 10:41 AM · netops, Operations
akosiaris updated the task description for T221758: Level3 esams <-> eqiad link outage.
Apr 24 2019, 10:41 AM · netops, Operations
akosiaris added a comment to T221758: Level3 esams <-> eqiad link outage.

New update says:

Apr 24 2019, 10:41 AM · netops, Operations
akosiaris committed rMSKSb9c13b3a97a4: Add helm.yaml file for use by the pipeline (authored by akosiaris).
Add helm.yaml file for use by the pipeline
Apr 24 2019, 10:31 AM
akosiaris updated subscribers of T220401: Introduce kask session storage service to kubernetes.

@Eevans @Clarakosi chart has been merged and is published. The only thing missing before we can move on to the deployment is the swagger/openapi spec so that service-checker[1] can run and monitor this service.

Apr 24 2019, 10:22 AM · Patch-For-Review, Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris triaged T221758: Level3 esams <-> eqiad link outage as High priority.
Apr 24 2019, 10:19 AM · netops, Operations
akosiaris added a comment to T221758: Level3 esams <-> eqiad link outage.

Received information from Level3

Apr 24 2019, 10:12 AM · netops, Operations
akosiaris created T221758: Level3 esams <-> eqiad link outage.
Apr 24 2019, 10:00 AM · netops, Operations
akosiaris committed rDEPLOYCHARTS76bbc8620020: First draft of a wikibase-termbox chart (authored by akosiaris).
First draft of a wikibase-termbox chart
Apr 24 2019, 9:18 AM
akosiaris committed rDEPLOYCHARTSd655879fa29b: First draft of a wikibase-termbox chart (authored by akosiaris).
First draft of a wikibase-termbox chart
Apr 24 2019, 9:18 AM
akosiaris committed rDEPLOYCHARTS2d34fd6ec317: Publish the kask chart in the repo (authored by akosiaris).
Publish the kask chart in the repo
Apr 24 2019, 9:18 AM
akosiaris committed rDEPLOYCHARTSf1eb582f21e2: cxserver: Fix typo in GC metric name (authored by akosiaris).
cxserver: Fix typo in GC metric name
Apr 24 2019, 9:18 AM
akosiaris committed rDEPLOYCHARTSc84ee45019f4: First draft of a wikibase-termbox chart (authored by akosiaris).
First draft of a wikibase-termbox chart
Apr 24 2019, 9:12 AM
akosiaris committed rDEPLOYCHARTSf92146cb70ca: cxserver: Fix typo in GC metric name (authored by akosiaris).
cxserver: Fix typo in GC metric name
Apr 24 2019, 9:12 AM
akosiaris awarded T187765: Replace the Nginx fronting Thumbor with Haproxy a Love token.
Apr 24 2019, 9:06 AM · Patch-For-Review, User-jijiki, User-fgiunchedi, Performance-Team, Thumbor
akosiaris added a comment to T220402: Introduce wikidata termbox SSR to kubernetes.

@Tarrow, @WMDE-leszek. I've been working on the termbox helm chart and while the service seems to be up and running, I see no /_info endpoint nor a swagger/openapi[1] spec published under /?spec. Both are crucial for deploying, as the former is used as a kubernetes readiness probe (aka if an instance of the app can't serve that for any reason it will temporarily not see new traffic) and the latter is used by our monitoring, so we can't proceed without those. Could you please have a look at it and add them? Thanks!
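
As a rough sketch of what is being asked for, the following Python helper reports which of the two required endpoints a service fails to serve. The function name, the injected `fetch_status` callable, and the port in the usage example are assumptions for illustration; this is not the actual readiness probe or service-checker code.

```python
from typing import Callable, Dict

# The two paths named in the comment above and what each is used for.
REQUIRED_PATHS = {"readiness": "/_info", "spec": "/?spec"}


def missing_endpoints(base_url: str, fetch_status: Callable[[str], int]) -> Dict[str, str]:
    """Return {purpose: url} for every required endpoint not answering 200.

    `fetch_status` takes a URL and returns an HTTP status code; it is
    injected so the logic can be exercised without a running service.
    """
    missing = {}
    for purpose, path in REQUIRED_PATHS.items():
        url = base_url.rstrip("/") + path
        if fetch_status(url) != 200:
            missing[purpose] = url
    return missing


# Usage with a fake fetcher that serves /_info but not the spec.
def fake_fetch(url: str) -> int:
    return 200 if url.endswith("/_info") else 404


result = missing_endpoints("http://localhost:3030/", fake_fetch)
```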

Apr 24 2019, 8:57 AM · Patch-For-Review, Core Platform Team Backlog (Watching / External), Services (watching), Wikidata-Termbox-Hike, Wikidata, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris committed rDEPLOYCHARTS915ce8af3d03: Update repo index (authored by akosiaris).
Update repo index
Apr 24 2019, 8:52 AM
akosiaris committed rDEPLOYCHARTS95d17c91077d: Update repo index (authored by akosiaris).
Update repo index
Apr 24 2019, 8:47 AM
akosiaris committed rDEPLOYCHARTS31976757ece5: First draft of a wikibase-termbox chart (authored by akosiaris).
First draft of a wikibase-termbox chart
Apr 24 2019, 8:47 AM
Gerrit Code Review <gerrit@wikimedia.org> committed rDEPLOYCHARTS4b426503a4d7: Merge "Support affinity in all charts" (authored by akosiaris).
Merge "Support affinity in all charts"
Apr 24 2019, 8:31 AM
akosiaris committed rDEPLOYCHARTS1669bbdb56c3: First version of the kask chart (authored by akosiaris).
First version of the kask chart
Apr 24 2019, 7:31 AM

Apr 23 2019

akosiaris committed rDEPLOYCHARTS2426711594d3: First draft of a wikibase-termbox chart (authored by akosiaris).
First draft of a wikibase-termbox chart
Apr 23 2019, 6:08 PM
akosiaris committed rDEPLOYCHARTS096a460dac0a: First draft of a wikibase-termbox chart (authored by akosiaris).
First draft of a wikibase-termbox chart
Apr 23 2019, 3:02 PM
akosiaris created T221622: Tech Talks Proposal July 10, 2019: Deployment Pipeline Overview.
Apr 23 2019, 12:45 PM · Developer-Advocacy, Documentation
akosiaris committed rDEPLOYCHARTS3fc1b72ce950: First version of the kask chart (authored by akosiaris).
First version of the kask chart
Apr 23 2019, 12:30 PM
akosiaris committed rDEPLOYCHARTS4f592dd3931a: First version of the kask chart (authored by akosiaris).
First version of the kask chart
Apr 23 2019, 11:55 AM

Apr 22 2019

akosiaris added a comment to T220401: Introduce kask session storage service to kubernetes.

Just for posterity's sake, at ~1500 artificially simulated users the service started to crumble and started returning

Apr 22 2019, 7:27 PM · Patch-For-Review, Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris added a comment to T220401: Introduce kask session storage service to kubernetes.

I did some benchmarking and here are some first (rather impressive) numbers for kask

Apr 22 2019, 7:10 PM · Patch-For-Review, Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris created P8425 locust_kask.py.
Apr 22 2019, 6:55 PM · Prod-Kubernetes, Operations
akosiaris closed T220821: Add security sensitive nodes to our kubernetes cluster, a subtask of T220401: Introduce kask session storage service to kubernetes, as Resolved.
Apr 22 2019, 6:41 PM · Patch-For-Review, Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris closed T220821: Add security sensitive nodes to our kubernetes cluster as Resolved.

kubernetes1005, kubernetes1006, kubernetes2005, kubernetes2006 added with specific taints in order to have only kask being scheduled on them. Resolving
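
The effect of such taints can be sketched with a deliberately simplified model: a pod may land on a node only if it tolerates every taint on that node. The taint key and value below are hypothetical (the task does not name them), and the matching is exact-match only, which is narrower than what Kubernetes itself supports.

```python
from typing import List, NamedTuple


class Taint(NamedTuple):
    key: str
    value: str
    effect: str = "NoSchedule"


def tolerates(node_taints: List[Taint], pod_tolerations: List[Taint]) -> bool:
    """True if every taint on the node is matched by a pod toleration.

    Simplified: a toleration is modeled as the exact taint it tolerates.
    """
    return all(taint in pod_tolerations for taint in node_taints)


# Hypothetical taint for the dedicated nodes.
DEDICATED = Taint("dedicated", "kask")

kask_can_schedule = tolerates([DEDICATED], [DEDICATED])
other_can_schedule = tolerates([DEDICATED], [])
```

Note that taints only keep other workloads off these nodes; steering kask onto them would additionally need something like node affinity or a node selector.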

Apr 22 2019, 6:41 PM · Patch-For-Review, Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris added a comment to T220822: Site: 4 VM request for kubernetes.

One typo:
codfw has 10.64.32.18 and 2620:0:861:103:10:64:32:18

Apr 22 2019, 5:55 PM · Patch-For-Review, User-jijiki, serviceops, vm-requests, Operations
akosiaris added a comment to T218750: Re-enable use of Gerrit HTTP token to push patchsets.

The same could be said of the ssh key. Temporary access to a browser session would enable someone to add a key to perform actions on behalf of a user. The gerrit command line tools have mostly the same capabilities as the http token.

That's a really good point.

Apr 22 2019, 1:24 PM · VPS-project-libraryupgrader, Release-Engineering-Team, Gerrit

Apr 19 2019

akosiaris committed rDEPLOYCHARTSa1ceeba25c6f: First version of the kask chart (authored by akosiaris).
First version of the kask chart
Apr 19 2019, 4:05 PM
akosiaris closed T220785: Requesting deployment access for santhosh as Resolved.

Change merged, thanks!

Apr 19 2019, 3:07 PM · Patch-For-Review, Operations, SRE-Access-Requests
akosiaris assigned T220822: Site: 4 VM request for kubernetes to ayounsi.

This is almost done. The only thing missing seems to be the peering with the juniper routers.

Apr 19 2019, 3:00 PM · Patch-For-Review, User-jijiki, serviceops, vm-requests, Operations
akosiaris added a comment to T218750: Re-enable use of Gerrit HTTP token to push patchsets.

Before we move forward and enable this, let's make sure we have understood the security repercussions and have mitigated them (and if we find it impossible to do so, avoid it).

Apr 19 2019, 1:49 PM · VPS-project-libraryupgrader, Release-Engineering-Team, Gerrit
akosiaris committed rDEPLOYCHARTSa2c1409df930: Support affinity in all charts (authored by akosiaris).
Support affinity in all charts
Apr 19 2019, 8:34 AM

Apr 18 2019

akosiaris added a comment to T196478: rack/setup/install backup1001.

Host is up and running but as @Cmjohnson points out in T196478#4976375

Apr 18 2019, 3:01 PM · Operations, ops-eqiad
akosiaris added a comment to T200832: remove mathoid from scb.

deployment-mathoid still exists and has been failing puppet runs since December 3rd when profile::mathoid got removed.

I'd delete the VM, profile::mathoid isn't coming back. If anything the VMs with the role::beta::docker_services role applied can probably handle the service now.

Unfortunately MediaWiki is still expecting the VM to exist:
wmf-config/LabsServices.php: 'mathoid' => 'http://deployment-mathoid.eqiad.wmflabs:10042',

The two hosts with that role are deployment-eventgate-analytics-1 and deployment-sessionstore01 so neither of them are really appropriate for this, plus they both run jessie.

Apr 18 2019, 2:17 PM · Beta-Cluster-Infrastructure, Core Platform Team Backlog (Watching / External), Services (watching), SCB, Mathoid, Operations
akosiaris added a comment to T219556: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet).

ganeti1001, 1002 and 2001 have been installed. I dunno what's up with ganeti2002. gnt-instance console schema2002.codfw.wmnet never shows anything.

Apr 18 2019, 9:56 AM · Analytics-Kanban, Patch-For-Review, Core Platform Team Backlog (Watching / External), Core Platform Team (Modern Event Platform (TEC2)), Services (watching), Operations, vm-requests, EventBus, Analytics
akosiaris added a comment to T221259: eqord - ulsfo Telia link down - IC-313592.

Just noting that at 10:41 UTC the circuit was still down per

Apr 18 2019, 9:42 AM · Operations, netops

Apr 17 2019

akosiaris added a comment to T220661: EventGate service runner worker occasionally killed, usually during higher load.

Been doing a lot to get more data, including enabling node profiling and connecting to node inspector. I didn't learn much, except confirming that I don't think Kafka is the CPU culprit. I did see a good number of cycles being spent generating the x-request-id header using cassandra-uuid.TimeUuid. This won't happen if we can ever get the x-request-id header to be set by varnish eventually (T201409).

But I didn't see anything particularly abnormal, just a lot of time being spent by express parsing the JSON request bodies, which I guess makes sense.

Just to try, I increased k8s resource requests/limits to:

requests:
  cpu: 1000m
  memory: 100Mi
limits:
  cpu: 2000m
  memory: 200Mi
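
For readers less familiar with the units, the values above can be converted to cores and bytes with a small Python sketch. The helper names are hypothetical, and only the suffixes needed here (`m` for millicores, binary `Ki`/`Mi`/`Gi` for memory) are handled.

```python
# Binary memory suffixes used by Kubernetes quantities.
_MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity such as '1000m' or '2' to a number of cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '100Mi' to bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count


requested_cores = parse_cpu("1000m")
limit_bytes = parse_memory("200Mi")
```

In other words, the request is one full core and 100 MiB, with limits at twice those values.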
Apr 17 2019, 9:14 AM · Services (done), Analytics-Kanban, EventBus, Analytics

Apr 15 2019

akosiaris added a comment to T165795: Ldap auth extension vs. ldap vs. username Case.

I see and understand the concerns from @akosiaris in T165795#5092381. I personally believe this un-indexed lookup is an acceptable overhead in the short term because it will only be used in the Wikitech account login flow. This is not a high rate of change event in our current system.

Apr 15 2019, 4:39 PM · Patch-For-Review, cloud-services-team (Kanban), wikitech.wikimedia.org, MediaWiki-extensions-LdapAuthentication
akosiaris triaged T220894: Replacement of network::constant's special_hosts as Low priority.
Apr 15 2019, 4:22 PM · Patch-For-Review, Operations
akosiaris added a comment to T220661: EventGate service runner worker occasionally killed, usually during higher load.

@akosiaris no, spawning a new worker takes no time, the problem here is actually killing the old worker.

The heartbeat limit currently is 7.5 seconds, after which we do not just SIGKILL the worker, we optimistically try to properly shut it down and if that does not happen, we SIGKILL it after 1 minute. For 'normal' operation with many workers, this is the right approach: the master stops dispatching requests to the worker immediately after ordering it to shut down gracefully, and with less incoming pressure the worker might recover a bit and get a chance to finish up requests and do its pre-shutdown housekeeping.
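
The escalation described here can be summarized as a small decision function. This Python sketch only mirrors the timings stated in the comment (7.5 seconds, then 1 minute); the function name, action labels, and structure are hypothetical and are not service-runner's implementation.

```python
from typing import Optional

HEARTBEAT_LIMIT_S = 7.5   # missed-heartbeat threshold from the comment
GRACE_PERIOD_S = 60.0     # time allowed for a graceful shutdown


def next_action(now: float, last_heartbeat: float,
                shutdown_requested_at: Optional[float] = None) -> str:
    """Decide what the master does with a worker at time `now` (seconds)."""
    if shutdown_requested_at is not None:
        # Graceful shutdown already ordered: escalate only after the grace period.
        if now - shutdown_requested_at >= GRACE_PERIOD_S:
            return "sigkill"
        return "wait"
    if now - last_heartbeat > HEARTBEAT_LIMIT_S:
        # Stop dispatching requests and ask the worker to shut down cleanly.
        return "graceful_shutdown"
    return "ok"
```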

Apr 15 2019, 1:56 PM · Services (done), Analytics-Kanban, EventBus, Analytics

Apr 12 2019

akosiaris triaged T220822: Site: 4 VM request for kubernetes as Normal priority.
Apr 12 2019, 1:39 PM · Patch-For-Review, User-jijiki, serviceops, vm-requests, Operations
akosiaris added a parent task for T220822: Site: 4 VM request for kubernetes: T220821: Add security sensitive nodes to our kubernetes cluster.
Apr 12 2019, 1:36 PM · Patch-For-Review, User-jijiki, serviceops, vm-requests, Operations
akosiaris added a subtask for T220821: Add security sensitive nodes to our kubernetes cluster: T220822: Site: 4 VM request for kubernetes.
Apr 12 2019, 1:36 PM · Patch-For-Review, Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team
akosiaris triaged T220821: Add security sensitive nodes to our kubernetes cluster as Normal priority.
Apr 12 2019, 1:36 PM · Patch-For-Review, Core Platform Team Backlog (Next), Core Platform Team (Session Management Service (CDP2)), Services (next), User-Eevans, Release Pipeline, Operations, serviceops, Release-Engineering-Team