
Fri, Jan 22

akosiaris added a comment to T238909: Proposal: simplify set up of a new load-balanced service on kubernetes.

Adding https://metallb.universe.tf/ as a potential solution as well.

Fri, Jan 22, 3:37 PM · SRE, Prod-Kubernetes, Pybal, Traffic, serviceops
akosiaris added a comment to T271475: Move private settings to a k8s compatible location.

@akosiaris or @Joe, the current /srv/mediawiki-staging/private appears to be a local git repo with no remotes. Do you know if this repo/files live anywhere else?

Fri, Jan 22, 2:55 PM · Release-Engineering-Team (Pipeline), Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), MW-on-K8s, Release Pipeline
akosiaris closed T272555: releases2002 ganeti VM not getting IP after reboot as Resolved.

Anyway, s/ens5/ens6/ in /etc/network/interfaces and the issue has been fixed. I was wondering whether it makes sense to invest time to "fix" this, but having met 1 instance of it in the 5-6 years that we have had ganeti around, I am gonna say it's not worth it. That being said, let's document this.
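
For reference, the fix amounts to renaming the interface in the stanza and nothing else; a minimal sketch of the relevant part of /etc/network/interfaces after the s/ens5/ens6/ (addresses are placeholders, not the VM's actual config):

auto ens6
iface ens6 inet static
    address <vm-ip>/<prefix>    # unchanged; only the interface name needed editing
    gateway <gateway-ip>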

Fri, Jan 22, 12:00 PM · SRE
akosiaris added a comment to T272555: releases2002 ganeti VM not getting IP after reboot.

I think the following explains it:

Fri, Jan 22, 11:59 AM · SRE

Thu, Jan 21

akosiaris added a comment to T272559: Unused puppet resources audit, early 2021.

I've checked off the stdlib and lvm classes, as they are from external modules that have been imported into the tree as-is (aka vendoring).

Thu, Jan 21, 3:37 PM · Patch-For-Review, SRE, Puppet
akosiaris updated the task description for T272559: Unused puppet resources audit, early 2021.
Thu, Jan 21, 3:35 PM · Patch-For-Review, SRE, Puppet
akosiaris added a comment to T265893: Add Link engineering: Deployment Pipeline setup.

That would work, but it would also require that the mwaddlink repo is checked out and kept up to date on stat1008 with whatever is deployed in production.

The stat1008 repo is used for producing the dataset (since that requires access to stats data, and maybe GPUs as well, I'm not sure about that), and that logic is much more likely to change over time than the very simple "take a bunch of tables and copy them verbatim to another database", so I don't think this would require keeping things in sync to any larger extent than is currently the case.

Thu, Jan 21, 2:44 PM · Release-Engineering-Team (Pipeline), Patch-For-Review, Growth-Team (Current Sprint), Add-Link, Growth-Structured-Tasks
akosiaris added a comment to T272238: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence.

I've marked T272111 as a parent of this task for greater visibility. This task seems more generic than the CirrusSearch-specific one, hence the relationship, but feel free to undo.

Thu, Jan 21, 9:55 AM · observability, Software-Licensing, Wikimedia-Logstash, SRE
akosiaris added a subtask for T272111: Elasticsearch, a CirrusSearch dependency, is switching to SSPL/Custom licence: T272238: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence.
Thu, Jan 21, 9:54 AM · Discovery-Search (Current work), Software-Licensing, CirrusSearch
akosiaris added a parent task for T272238: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence: T272111: Elasticsearch, a CirrusSearch dependency, is switching to SSPL/Custom licence.
Thu, Jan 21, 9:54 AM · observability, Software-Licensing, Wikimedia-Logstash, SRE

Wed, Jan 20

akosiaris added a comment to T265893: Add Link engineering: Deployment Pipeline setup.

Talked this over with @kostajh and the options we saw for the batch job were:

  • Use a MediaWiki maintenance script to download the dataset over the web and import it into the production table. This is the well-trodden path since there's plenty of scaffolding for MediaWiki batch jobs, but conceptually wrong: the script belongs to the mwaddlink repository, not MediaWiki.
Wed, Jan 20, 2:02 PM · Release-Engineering-Team (Pipeline), Patch-For-Review, Growth-Team (Current Sprint), Add-Link, Growth-Structured-Tasks
akosiaris added a comment to T179696: Homepage for https://docker-registry.wikimedia.org.

Got pretty close; the one last sticking point is that docker_report hardcodes connecting to the registry over HTTPS. So if you try https://localhost then you'll end up with requests.exceptions.SSLError: hostname 'localhost' doesn't match either of 'docker-registry.discovery.wmnet', 'docker-registry.svc.eqiad.wmnet', 'docker-registry.svc.codfw.wmnet', 'docker-registry.wikimedia.org'. And of course https://localhost:5000 (the HTTP port) fails with a protocol error.

Should we adapt docker_report to allow connecting over HTTP? I thought about using one of the domain names but then we're generating a homepage for a different registry, not the one that instance is serving...or does it not matter?
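
In the meantime, TLS verification can be satisfied for local testing by pinning one of the certificate's names to the loopback address; a sketch with curl, assuming the registry terminates TLS on port 443 locally (/v2/_catalog is the standard registry catalog endpoint):

$ curl --resolve docker-registry.discovery.wmnet:443:127.0.0.1 https://docker-registry.discovery.wmnet/v2/_catalog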

Wed, Jan 20, 11:16 AM · serviceops, Patch-For-Review, SRE, MediaWiki-Containers
akosiaris closed T271134: Some Machine Learning clusters do not support IPv6, a subtask of T253173: Some clusters do not have DNS for IPv6 addresses (TRACKING TASK), as Declined.
Wed, Jan 20, 11:06 AM · IPv6, User-jbond, netbox
akosiaris closed T271134: Some Machine Learning clusters do not support IPv6 as Declined.

@akosiaris does this mean we need to upgrade ores hosts to buster?

Wed, Jan 20, 11:06 AM · User-crusnov, IPv6, Machine Learning Platform, SRE-tools

Sat, Jan 16

akosiaris committed rGBLBR77a3a01351df: Switch to buster and golang 1.13 for the build phase (authored by akosiaris).
Switch to buster and golang 1.13 for the build phase
Sat, Jan 16, 12:08 AM

Fri, Jan 15

akosiaris added a comment to T258978: Service operations setup for Add a Link project.

@akosiaris picking up the thread on this from before the holiday break; IIRC there was some network setup that had to be done for the production instance (staging was already done). Is there anything else that needs to happen from SRE's side before we can begin calling this service in production with our maintenance script?

Fri, Jan 15, 2:39 PM · Add-Link, Growth-Team (Current Sprint), Product-Infrastructure-Team-Backlog, SRE, serviceops, GrowthExperiments-NewcomerTasks

Thu, Jan 14

akosiaris added a comment to T261369: Deployment infrastructure for PHP microservices.

As I understand it, there's a halt on that npm approach, which indeed seems to have slipped away from the deploy-repo approach for one or two services in the past year without Security realizing it. This is unfortunate, but also makes it a bad example to follow.

Wait, what? This is the first time I hear of this. When did that halt happen? Has it been communicated? All of the nodejs services on kubernetes have followed the npm install approach for a long time now; what does that mean for them?

I'm not certain there's ever been a reasonable, organizationally-accepted policy or set of guidelines around using various npm commands (especially install) at any point along a given extension/app/service's production deployment path.

Thu, Jan 14, 8:47 AM · MW-on-K8s, Release-Engineering-Team (Pipeline), Release Pipeline (Blubber), serviceops, SRE

Wed, Jan 13

akosiaris closed T239459: service-runner apps running on kubernetes emit logs with log level 50 as Resolved.

eventgate done. And with this, we can close this task. Thanks to all those who contributed.

Wed, Jan 13, 3:50 PM · Platform Team Workboards (Clinic Duty Team), Patch-For-Review, CX-cxserver, serviceops-radar, Product-Infrastructure-Team-Backlog, SRE
akosiaris closed T239459: service-runner apps running on kubernetes emit logs with log level 50, a subtask of T211125: Move service-runner to new logging infrastructure, as Resolved.
Wed, Jan 13, 3:49 PM · observability, Platform Team Legacy (Watching / External), Patch-For-Review, service-runner, Wikimedia-Logstash, SRE
akosiaris closed T239459: service-runner apps running on kubernetes emit logs with log level 50, a subtask of T219919: Move citoid logging to new logging pipeline, as Resolved.
Wed, Jan 13, 3:49 PM · observability, Citoid, Platform Team Legacy (Watching / External), Services (watching), service-runner, Wikimedia-Logstash, SRE
akosiaris closed T239459: service-runner apps running on kubernetes emit logs with log level 50, a subtask of T219921: Move cxserver logging to new logging pipeline, as Resolved.
Wed, Jan 13, 3:49 PM · observability, CX-cxserver, Platform Team Legacy (Watching / External), Services (watching), service-runner, Wikimedia-Logstash, SRE
akosiaris closed T239459: service-runner apps running on kubernetes emit logs with log level 50, a subtask of T219924: Move mobileapps logging to new logging pipeline, as Resolved.
Wed, Jan 13, 3:49 PM · Product-Infrastructure-Team-Backlog, Page Content Service, observability, Platform Team Legacy (Watching / External), Services (watching), service-runner, Wikimedia-Logstash, SRE
akosiaris closed T239459: service-runner apps running on kubernetes emit logs with log level 50, a subtask of T227080: Deprecate all non-Kafka logstash inputs, as Resolved.
Wed, Jan 13, 3:49 PM · observability, Wikimedia-Logstash, User-fgiunchedi, SRE
akosiaris closed T239459: service-runner apps running on kubernetes emit logs with log level 50, a subtask of T219925: Move proton logging to new logging pipeline, as Resolved.
Wed, Jan 13, 3:49 PM · Product-Infrastructure-Team-Backlog, observability, Proton, Platform Team Legacy (Watching / External), Services (watching), service-runner, Wikimedia-Logstash, SRE
akosiaris closed T239459: service-runner apps running on kubernetes emit logs with log level 50, a subtask of T245603: Move termbox to the logging pipeline, as Resolved.
Wed, Jan 13, 3:49 PM · Wikidata-Termbox, observability, Wikimedia-Logstash
akosiaris closed T239459: service-runner apps running on kubernetes emit logs with log level 50, a subtask of T245604: Move wikifeeds to the logging pipeline, as Resolved.
Wed, Jan 13, 3:49 PM · Wikifeeds, observability, Wikimedia-Logstash, SRE
akosiaris updated the task description for T239459: service-runner apps running on kubernetes emit logs with log level 50.
Wed, Jan 13, 3:44 PM · Platform Team Workboards (Clinic Duty Team), Patch-For-Review, CX-cxserver, serviceops-radar, Product-Infrastructure-Team-Backlog, SRE
akosiaris added a comment to T239459: service-runner apps running on kubernetes emit logs with log level 50.

eventstreams done. Double checked in logstash and I can see nice log levels now.

Wed, Jan 13, 1:41 PM · Platform Team Workboards (Clinic Duty Team), Patch-For-Review, CX-cxserver, serviceops-radar, Product-Infrastructure-Team-Backlog, SRE
akosiaris updated the task description for T239459: service-runner apps running on kubernetes emit logs with log level 50.
Wed, Jan 13, 1:40 PM · Platform Team Workboards (Clinic Duty Team), Patch-For-Review, CX-cxserver, serviceops-radar, Product-Infrastructure-Team-Backlog, SRE
akosiaris added a comment to T271702: kubestage200* change on every puppet run.

I've had a look into it. The culprit is https://github.com/projectcalico/felix/pull/2424. The reason for the change itself is to honor kube-proxy rules in the case of a service with no endpoints. kube-proxy would normally REJECT, but calico rules before the PR above would just ACCEPT. The accepting rule was indeed in a cali- prefixed chain up to calico 3.16 (we just updated to 3.17.1 in the newer staging env), but it's now directly added as the last rule in the FORWARD chain. It's definitely not as compartmentalized as before, but that's the way it was done.
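
For anyone tracking this down on a node, the moved rule can be spotted by listing the FORWARD chain directly; a sketch (the commented rule shown is illustrative of what to expect, not actual output):

$ sudo iptables -S FORWARD | tail -1
# expect something along the lines of:
#   -A FORWARD -m comment --comment "cali:<hash>" -j ACCEPT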

Wed, Jan 13, 10:12 AM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops

Tue, Jan 12

akosiaris added a comment to T241230: Migrate recommendation-api to kubernetes.

Changes associated with this patch (mostly 27152427487bed18321d582a7f77a301ea114968) have left the following VMs unpuppetized:

  • deployment-sca01.deployment-prep.eqiad1.wikimedia.cloud
  • deployment-sca02.deployment-prep.eqiad1.wikimedia.cloud

Should those VMs be deleted?

Tue, Jan 12, 4:33 PM · Product-Infrastructure-Team-Backlog, Patch-For-Review, serviceops, Release-Engineering-Team, Services, Recommendation-API
akosiaris added a comment to T259686: echostore helm test service checker failing in staging cluster.

@jeena, with https://gerrit.wikimedia.org/r/641790 reviewed and merged, I just released 0.2.1 and the relevant image was built. That allows us to amend the chart and add the --insecure flag to service-checker so that we can ignore mismatched certs in CI.
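
Assuming the flag lands as --insecure as described, the CI check would mirror the invocation seen elsewhere in this feed; a sketch with placeholder host/port:

$ service-checker-swagger --insecure <host> https://<host>:<port> -t 2 -s /apispec_1.json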

Tue, Jan 12, 4:27 PM · Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2))

Mon, Jan 11

akosiaris added a comment to T261369: Deployment infrastructure for PHP microservices.

Some kind of /deploy repo seems needed, I think, as otherwise we would be deploying unaudited code never seen by a trusted pair of eyes. There'd be no diff to review during production dependency updates or image rebuilds.

How feasible is it in this case to audit the code, though, via a trusted pair of eyes? In the nodejs case, which is probably the pathological one, it's borderline impossible. The dependency tree, even flattened, is usually huge and brings in probably tens (if not hundreds) of thousands of lines of code per project. E.g., off the top of my head, citoid's old deploy repo[1] clocked in at 369055 LoC of javascript alone for dependent node modules (that includes blank lines and comments, but the size would be staggering even counting those out). I expect this to have increased since then, and other projects to exhibit similar numbers. This is a known issue, and the npm ecosystem has introduced the npm audit command, which makes things somewhat better by at least informing about known vulnerabilities. But auditing for unknown vulnerabilities is still a herculean task.

As I alluded to earlier, all of the Shellbox dependencies have already been audited via their inclusion in mediawiki/vendor. I don't know if the Shellbox service dependencies will forever be a subset of MediaWiki's, but at least for now I think there's no extra work being added if we want to audit all the PHP dependencies. It might just be me, but I find PHP code easier to audit compared to nodejs's explosion of libraries that might've been transpiled or minified or whatever. And PHP doesn't attempt to ship/compile native code into vendor/!

I guess if the number of dependencies is small enough (which I would expect for the service in question), it remains doable, but it might not be desirable in the future if those increase substantially. In any case, to increase reproducibility and auditability, dependencies should be version-pinned.

Shellbox's vendor/ has a little under 16K LoC. I can see us adding a few more libraries like something for metrics, but nothing like the 369K you mentioned for citoid.
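
For reference, figures like the above can be reproduced with a crude raw line count over the dependency trees (blank lines and comments included, matching the citoid number):

$ find node_modules -name '*.js' -print0 | xargs -0 cat | wc -l   # nodejs deploy repo
$ find vendor -name '*.php' -print0 | xargs -0 cat | wc -l        # PHP vendor/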

Mon, Jan 11, 3:13 PM · MW-on-K8s, Release-Engineering-Team (Pipeline), Release Pipeline (Blubber), serviceops, SRE
akosiaris added a comment to T271404: esams/ulsfo/eqsin: 1 VM requested for bastions.

LGTM.

Mon, Jan 11, 2:39 PM · SRE, vm-requests
akosiaris added a comment to T270071: SVC DNS zonefiles and source of truth.
  • DNS Records with non-standard TTL. We have just one for oresrdb that has a 5M TTL instead of the default 1H and that's not currently supported by the Netbox automation for lack of a place where to store that information properly. This record too points to a host IP and is not using a service IP.

We can get rid of the non-standard TTL for that one. In fact, we can get rid of the entire RR. It was put there to facilitate oresrdb maintenance, but due to various difficulties that never worked out well. We can point to the server directly instead.

Mon, Jan 11, 1:58 PM · Patch-For-Review, serviceops-radar, SRE-tools, SRE
akosiaris added a comment to T271711: Update cxserver to service-runner 2.8.1.

The migration guide is at https://github.com/wikimedia/service-runner/blob/master/doc/2.7-2.8_Migration_Guide.md; it looks like most metrics calls will have to be slightly modified after the version bump. Configuration-wise, no change will be needed at that point, but later on we can enable service-runner's native prometheus support and migrate away from statsd.

Mon, Jan 11, 12:21 PM · CX-cxserver
akosiaris added a comment to T271540: Upgrade and restart m1 master (db1080).

Excellent, thanks. I will double check with @akosiaris to see if he can be around in case etherpad requires some action.

Mon, Jan 11, 10:17 AM · Wikimedia-Etherpad, DBA

Dec 16 2020

akosiaris renamed T270191: Add kubernetes 1.17+ topology annotations from Add kubernetes 1.17+ typology annotations to Add kubernetes 1.17+ topology annotations.
Dec 16 2020, 2:23 PM · Kubernetes, Prod-Kubernetes, serviceops
akosiaris added a comment to T261369: Deployment infrastructure for PHP microservices.

Some kind of /deploy repo seems needed, I think, as otherwise we would be deploying unaudited code never seen by a trusted pair of eyes. There'd be no diff to review during production dependency updates or image rebuilds.

Dec 16 2020, 12:18 PM · MW-on-K8s, Release-Engineering-Team (Pipeline), Release Pipeline (Blubber), serviceops, SRE
akosiaris added a comment to T264006: Deploy Flink to kubernetes (k8s).

After the helm chart is merged and published (both should happen automatically on a +2; I've +1ed already), the final 2 items for deployment are:

Dec 16 2020, 11:13 AM · Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Dec 14 2020

akosiaris added a comment to T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable".

https://grafana.wikimedia.org/d/g0GUXaJMk/t249745?orgId=1 says that it might instead make sense to increase the memory limit. There are 3 spikes above the limit in the last 30 days, leading to extra CPU usage and probably increased GC cycles.
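
For context, raising the limit is a small values change in the relevant chart; a generic sketch of the Kubernetes resources stanza in question (the field layout is the standard Kubernetes one, the numbers are placeholders rather than the chart's real values):

resources:
  requests:
    memory: 400Mi
  limits:
    memory: 800Mi   # raised headroom to absorb the observed spikes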

Dec 14 2020, 5:08 PM · MW-1.36-notes (1.36.0-wmf.28; 2021-01-26), Patch-For-Review, User-brennen, serviceops, Platform Team Workboards (Clinic Duty Team), WMF-JobQueue, Wikimedia-production-error
akosiaris added a comment to T270071: SVC DNS zonefiles and source of truth.
  • DNS Records with non-standard TTL. We have just one for oresrdb that has a 5M TTL instead of the default 1H and that's not currently supported by the Netbox automation for lack of a place where to store that information properly. This record too points to a host IP and is not using a service IP.
Dec 14 2020, 4:54 PM · Patch-For-Review, serviceops-radar, SRE-tools, SRE
akosiaris closed T252185: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. as Resolved.

And they are now finally in use. Resolving this.

Dec 14 2020, 4:07 PM · Patch-For-Review, Sustainability (Incident Followup), ops-codfw, serviceops, SRE
akosiaris closed T252185: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet., a subtask of T241852: (Need by: TBD) rack/setup/install 86 new codfw mw systems, as Resolved.
Dec 14 2020, 4:06 PM · ops-codfw, serviceops, SRE
akosiaris updated the task description for T252185: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet..
Dec 14 2020, 4:06 PM · Patch-For-Review, Sustainability (Incident Followup), ops-codfw, serviceops, SRE
akosiaris updated the task description for T252185: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet..
Dec 14 2020, 4:04 PM · Patch-For-Review, Sustainability (Incident Followup), ops-codfw, serviceops, SRE
akosiaris committed rOHPUab00e5e2c40b: Fix kubestage2002 IPv6 address (authored by akosiaris).
Fix kubestage2002 IPv6 address
Dec 14 2020, 11:24 AM
akosiaris committed rOHPU9f6fab5da2a6: Adding AS64604 policy rules (authored by akosiaris).
Adding AS64604 policy rules
Dec 14 2020, 10:47 AM
akosiaris committed rOHPU0a635031abdc: Specify k8s-stage codfw AS number (authored by akosiaris).
Specify k8s-stage codfw AS number
Dec 14 2020, 10:09 AM
akosiaris added a comment to T265893: Add Link engineering: Deployment Pipeline setup.

To clarify what I'm proposing:

Then, a request from the stats machine to the kubernetes service would instruct the app to download new datasets from the public endpoint via curl / HTTP.

That's a bad pattern IMHO. The main reason is that we would be using part of the capacity of the service for something that has nothing to do with its actual task, which is to serve requests, but is rather a batch job (i.e. network, CPU and memory that could be utilized just for serving requests would be consumed by it). Furthermore, we would be adding an API endpoint that can be called by anyone internally to achieve that; an attacker who somehow gained access to WMF IPs would be able to exploit this to cause an outage.

The mwmaint option would be open to this vulnerability too, right? You could reset the checksum stored locally and re-run refreshLinkRecommendations.php.

Dec 14 2020, 8:54 AM · Release-Engineering-Team (Pipeline), Patch-For-Review, Growth-Team (Current Sprint), Add-Link, Growth-Structured-Tasks

Dec 10 2020

akosiaris added a comment to T269835: Implement switching of staging clusters.

When switching from staging-eqiad to staging-codfw (and vice versa) we would need to:

Dec 10 2020, 3:01 PM · Kubernetes, Prod-Kubernetes, serviceops
akosiaris added a comment to T265893: Add Link engineering: Deployment Pipeline setup.

To clarify what I'm proposing:

Then, a request from the stats machine to the kubernetes service would instruct the app to download new datasets from the public endpoint via curl / HTTP.

Dec 10 2020, 2:26 PM · Release-Engineering-Team (Pipeline), Patch-For-Review, Growth-Team (Current Sprint), Add-Link, Growth-Structured-Tasks
akosiaris added a comment to T265893: Add Link engineering: Deployment Pipeline setup.

@kostajh, @MGerlach I've gone ahead and created tokens and namespaces. You should now be able to deploy the service. Docs are at https://wikitech.wikimedia.org/wiki/Deployments_on_kubernetes#Deploying_with_helmfile. But the TL;DR is you'll need to create a change similar to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/645076, but tailored to your service, and then follow the helmfile -e <env> steps mentioned in the docs. Also, you should have +2 access to the repo; let me know if you don't so I can add that.

Thank you @akosiaris. I've run helmfile -e for staging (verified with curl -L https://staging.svc.eqiad.wmnet:4005/apidocs + service-checker-swagger staging.svc.eqiad.wmnet https://staging.svc.eqiad.wmnet:4005 -t 2 -s /apispec_1.json), eqiad and codfw. AIUI it's back to your team now to implement the LVS / networking setup for eqiad/codfw.

Dec 10 2020, 1:51 PM · Release-Engineering-Team (Pipeline), Patch-For-Review, Growth-Team (Current Sprint), Add-Link, Growth-Structured-Tasks

Dec 9 2020

akosiaris added a comment to T265893: Add Link engineering: Deployment Pipeline setup.

Btw, if the service is going to reach out via the network to any resource other than MySQL (which has already been taken care of), now is the time to say so :-)

I think we will be downloading database dumps via Swift from within the container. Does that require additional setup on your end?

Dec 9 2020, 3:10 PM · Release-Engineering-Team (Pipeline), Patch-For-Review, Growth-Team (Current Sprint), Add-Link, Growth-Structured-Tasks
akosiaris added a comment to T267653: Refactor calico deploy strategy.

The new calico chart is merged, thanks @akosiaris

What is currently missing is a proper RoleBinding for the calicoctl user, as I was not yet sure what permissions it's going to need.
We should not be using the tool for changing calico config; that's to be done via the helm chart now. But we will want to keep the analyze functionality intact. Could not find any docs on that by now, so we will maybe just have to figure it out when we have a node in staging-codfw.

Dec 9 2020, 3:06 PM · Patch-For-Review, Prod-Kubernetes, serviceops, Kubernetes, SRE
akosiaris added a comment to T265893: Add Link engineering: Deployment Pipeline setup.

@kostajh, @MGerlach I've gone ahead and created tokens and namespaces. You should now be able to deploy the service. Docs are at https://wikitech.wikimedia.org/wiki/Deployments_on_kubernetes#Deploying_with_helmfile. But the TL;DR is you'll need to create a change similar to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/645076, but tailored to your service, and then follow the helmfile -e <env> steps mentioned in the docs. Also, you should have +2 access to the repo; let me know if you don't so I can add that.

Thanks @akosiaris! I will work on getting a patch for you this week.

I don't have +2 to that repo, could you please add me, @MGerlach and @Tgr there?

Dec 9 2020, 2:50 PM · Release-Engineering-Team (Pipeline), Patch-For-Review, Growth-Team (Current Sprint), Add-Link, Growth-Structured-Tasks
akosiaris added a comment to T269581: Add Link engineering: Allow external traffic to linkrecommendation service.

Arguably, in the interest of https://en.wikipedia.org/wiki/Separation_of_concerns it's probably better that whatever instance of the service mediawiki queries is NOT exposed to the public. That way requests from the infrastructure won't be mixed with external ones, allowing for better capacity planning, service-level support, etc.

That makes sense.

We could however instantiate a second deployment of the software, e.g. as a second helm/helmfile release and expose that one. Depending on the timeline it might end up being easier than expected, as we are working on some changes in the infrastructure which might remove a lot of manual work from SRE's plate (the stuff in the README).

If it were quick and easy, I was hoping to do this soon to avoid other developers needing to set up the software themselves for testing and development. But I could also figure out an interim solution where the service runs on e.g. Toolforge or WMCS instead of a kubernetes prod instance.

Dec 9 2020, 2:45 PM · Patch-For-Review, Growth-Team, serviceops, Add-Link
akosiaris added a comment to T269581: Add Link engineering: Allow external traffic to linkrecommendation service.

Arguably, in the interest of https://en.wikipedia.org/wiki/Separation_of_concerns it's probably better that whatever instance of the service mediawiki queries is NOT exposed to the public. That way requests from the infrastructure won't be mixed with external ones, allowing for better capacity planning, service-level support, etc.

Dec 9 2020, 2:36 PM · Patch-For-Review, Growth-Team, serviceops, Add-Link
akosiaris added a comment to T269731: Requesting access to deployment for Kosta Harlan.

Requestor -- Please coordinate obtaining a comment of approval on this task from the approving party.

cc @akosiaris @marcella Please let me know if you have any questions, thank you.

Dec 9 2020, 10:15 AM · SRE, SRE-Access-Requests
akosiaris added a comment to T163692: Have puppet create Prometheus LVs.

The lvm puppet module is a bit problematic. It's been released under the GPL-2 license, whereas per T67270 we want to move to Apache2 for that repo. So in the grand scheme of things, we probably want to rip it out of our repo.

Dec 9 2020, 9:47 AM · observability, User-fgiunchedi, Prometheus-metrics-monitoring

Dec 8 2020

akosiaris created T269684: [EPIC] Docker deprecation as a container runtime engine for kubernetes.
Dec 8 2020, 3:07 PM · serviceops

Dec 4 2020

akosiaris added a comment to T265893: Add Link engineering: Deployment Pipeline setup.

Btw, if the service is going to reach out via the network to any resource other than MySQL (which has already been taken care of), now is the time to say so :-)

Dec 4 2020, 3:33 PM · Release-Engineering-Team (Pipeline), Patch-For-Review, Growth-Team (Current Sprint), Add-Link, Growth-Structured-Tasks
akosiaris added a comment to T265893: Add Link engineering: Deployment Pipeline setup.

@kostajh, @MGerlach I've gone ahead and created tokens and namespaces. You should now be able to deploy the service. Docs are at https://wikitech.wikimedia.org/wiki/Deployments_on_kubernetes#Deploying_with_helmfile. But the TL;DR is you'll need to create a change similar to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/645076, but tailored to your service, and then follow the helmfile -e <env> steps mentioned in the docs. Also, you should have +2 access to the repo; let me know if you don't so I can add that.
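
For the record, the helmfile flow from those docs boils down to a diff-then-apply per environment; a sketch (the path is illustrative of the deployment-server layout and may differ):

$ cd /srv/deployment-charts/helmfile.d/services/<service>
$ helmfile -e staging diff
$ helmfile -e staging apply
$ helmfile -e eqiad apply
$ helmfile -e codfw apply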

Dec 4 2020, 3:19 PM · Release-Engineering-Team (Pipeline), Patch-For-Review, Growth-Team (Current Sprint), Add-Link, Growth-Structured-Tasks
akosiaris added a comment to T238753: [OSM] Backport imposm3 to the debian channel.

There's also an additional option: Postgres 9.6 is also available on Buster (we already use it for cescout, which has a strict dependency on 9.6 since OONI upstream publishes their datasets that way).

From a quick glance, Buster has all the deps (src:leveldb, src:geos, Golang 1.11) required to build imposm3, so we can create an imposm3 deb on Buster, set up a Ganeti instance on Buster (maps-import1001) with Postgres 9.6, and add it to the Maps Postgres setup. Then the OSM import can simply happen from that separate instance (until we eventually also migrate maps at large).

I think there is also a dated cassandra requirement somewhere in there, so it might not be that easy. But that's almost hearsay, so @MSantos could you confirm?

Actually, this hypothetical maps-import machine doesn't need to have a Cassandra node. The Cassandra requirement is storage for vector-tiles generated after the OSM data is synced.

I've attempted to document the current data flow in these diagrams. It should give a nice understanding of where the data lies in the infrastructure.

That being said, I worked a bit on packaging imposm3 yesterday. I am happy to report success: https://people.wikimedia.org/~akosiaris/

I started with stretch in mind (since maps is stretch), but it turns out that it can't be built on stretch, so it requires buster after all. That being said, if we can backport leveldb and libgeos we might be able to run it on stretch.

YAY! Well, I guess it's a matter of choosing the best option for you between the tradeoffs of backporting the dependencies or starting a new machine for the OSM DB master and sync scripts.

My 2 cents:
Backports strategy:
Pros:

  • It unblocks our work and we can test the imposm migration and prepare it for deployment at the beginning of next quarter.
  • Less work to leverage the needed infrastructure

Cons:

  • More backported binaries to keep track of
  • Keeps the status quo, and we know this isn't great

maps-import strategy:
Pros:

  • Iteratively start to move towards debian buster
  • Isolate OSM sync scripts from the production infrastructure
  • Doesn't change the way we store OSM data because we already have the main instance doing the OSM sync and replicating the data through the cluster

Cons:

  • Needs more planning and changes the scope of current work
  • More work to leverage infrastructure

Also, the 2 strategies can be iterative steps of the same plan.

Dec 4 2020, 2:24 PM · Patch-For-Review, Discovery-Search, serviceops, Maps, Product-Infrastructure-Team-Backlog
akosiaris added a comment to T238753: [OSM] Backport imposm3 to the debian channel.

There's also an additional option: Postgres 9.6 is also available on Buster (we already use it for cescout, which has a strict dependency on 9.6 since OONI upstream publishes their datasets that way).

From a quick glance, Buster has all the deps (src:leveldb, src:geos, Golang 1.11) required to build imposm3, so we can create an imposm3 deb on Buster, set up a Ganeti instance on Buster (maps-import1001) with Postgres 9.6, and add it to the Maps Postgres setup. Then the OSM import can simply happen from that separate instance (until we eventually also migrate maps at large).

I think there is also a dated cassandra requirement somewhere in there, so it might not be that easy. But that's almost hearsay, so @MSantos could you confirm?

That being said, I worked a bit on packaging imposm3 yesterday. I am happy to report success: https://people.wikimedia.org/~akosiaris/

I started with stretch in mind (since maps is stretch), but it turns out that it can't be built on stretch, so it requires buster after all. That being said, if we can backport leveldb and libgeos we might be able to run it on stretch.

For those not familiar with the state of maps and stumbling upon this task, getting a feeling of exasperation is normal. Trying to adopt an infrastructure that was not well maintained in the past means that some tech debt needs to be paid. It's

Thanks @akosiaris
I also had some success, first with FPM and then by properly packaging imposm3 and 2-3 missing deps according to dh-make-golang (levigo, fsnotify). I can share them if it's of any help.
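
For reference, the dh-make-golang workflow mentioned amounts to pointing it at the upstream import paths; a sketch for the two deps named above (invocation details may vary by dh-make-golang version):

$ dh-make-golang make github.com/jmhodges/levigo
$ dh-make-golang make github.com/fsnotify/fsnotify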

Dec 4 2020, 1:44 PM · Patch-For-Review, Discovery-Search, serviceops, Maps, Product-Infrastructure-Team-Backlog
akosiaris committed rLPRIe7036a5756ee: Add tokens and users for 3 new k8s services (authored by akosiaris).
Add tokens and users for 3 new k8s services
Dec 4 2020, 11:56 AM
akosiaris added a comment to T269357: Requesting access to maps for mbsantos and jgiannelos.

What I am also interested in is the "testing the new puppet rules" part. Could you please share a bit on how this will be done?

If the only way to apply puppet rules is through the master branch, I'm thinking of a conditional rule that would load the new imposm_planet_sync classes. And then I would add the needed parameters to the hieradata configuration for the machine available for testing.

Dec 4 2020, 10:21 AM · Maps, SRE
akosiaris added a comment to T238753: [OSM] Backport imposm3 to the debian channel.

There's also an additional option: Postgres 9.6 is also available on Buster (we already use it for cescout, which has a strict dependency on 9.6 since OONI upstream publishes their datasets that way).

From a quick glance, Buster has all the deps (src:leveldb, src:geos, Golang 1.11) required to build imposm3, so we can create an imposm3 deb on Buster, set up a Ganeti instance on Buster (maps-import1001) with Postgres 9.6, and add it to the Maps Postgres setup. Then the OSM import can simply happen from that separate instance (until we eventually also migrate maps at large).

Dec 4 2020, 10:17 AM · Patch-For-Review, Discovery-Search, serviceops, Maps, Product-Infrastructure-Team-Backlog

Dec 3 2020

akosiaris added a comment to T238753: [OSM] Backport imposm3 to the debian channel.

@hnowlan this can be a good resource for this task: https://github.com/omniscale/imposm3#binary

Dec 3 2020, 9:27 PM · Patch-For-Review, Discovery-Search, serviceops, Maps, Product-Infrastructure-Team-Backlog
akosiaris added a comment to T269357: Requesting access to maps for mbsantos and jgiannelos.

Is this specifically only about maps2007? Given that you're both members of the maps-admin group you can log into any maps host, but maps2007 in particular is currently inaccessible, since Puppet was disabled there for a long time (which made it evicted from the system records in the Puppet database). This needs to be fixed in general and is unrelated to your access permissions; right now not even people in SRE can perform a standard SSH login into the system in its current state.

@MoritzMuehlenhoff hmm, I assumed that the machine was only out of production traffic. I guess there is another machine depooled (maps2002), but that machine has disk space issues with Cassandra.

Dec 3 2020, 3:14 PM · Maps, SRE
akosiaris updated the task description for T255672: Migrate apertium to the deployment pipeline.
Dec 3 2020, 1:26 PM · Patch-For-Review, Language-Team (Language-2020-October-December), CX-cxserver, serviceops, Release-Engineering-Team (Pipeline)
akosiaris closed T268747: codfw: 4 VM request for kubernetes staging as Resolved.

VMs are up and running and the services (etcd, apiserver) have been set up on them.

Dec 3 2020, 9:39 AM · Kubernetes, vm-requests, SRE
akosiaris closed T268747: codfw: 4 VM request for kubernetes staging, a subtask of T244335: Upgrade kubernetes clusters to a security supported (LTS) version, as Resolved.
Dec 3 2020, 9:39 AM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops

Dec 2 2020

akosiaris committed rLPRIe5da6ea04544: Add profile::prometheus::kubernetes::cluster_tokens (authored by akosiaris).
Add profile::prometheus::kubernetes::cluster_tokens
Dec 2 2020, 6:54 PM
akosiaris added a comment to T267214: Add a link engineering: Database for link recommendation service.

I have tested the connection from kubernetes1017, which is on 10.64.0, and it works fine; it can reach m2-master.eqiad.wmnet through port 3306 just fine.

For what it's worth, that test would NOT catch a problem if there was one. Kubernetes pods do not utilize the IPs of their nodes, but rather each have their own.

However, I did test too and it's fine.

For posterity's sake (and this is probably worthy of a wikitech page), the process was as follows (needs sudo on deploy1001, aka root, but this is being revisited due to the helm3 migration):

$ ssh deploy1001
$ sudo -i
$ kube_env admin codfw
$ kubectl -n default run testing --rm -it --image=docker-registry.discovery.wmnet/wmfdebug
# wait for the prompt inside the pod, then:
$ nmap -p 3306 m2-master.eqiad.wmnet
[snip]

rDNS record for 10.64.0.135: dbproxy1013.eqiad.wmnet
PORT     STATE SERVICE
3306/tcp open  mysql

Thanks Alex. For my own understanding, per the comment at T267214#6652778: those pods would be using their own IPs, but always belonging to the 10.64.0 range, no? (which was the intention of the test with the node)
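
One way to check this directly is to list the pods with their IPs and nodes (using the same kube_env as above); a sketch:

$ kubectl -n default get pods -o wide   # the IP column shows each pod's own address, next to the node it runs on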

Dec 2 2020, 1:30 PM · DBA
akosiaris added a comment to T267214: Add a link engineering: Database for link recommendation service.

I have tested the connection from kubernetes1017, which is on 10.64.0, and it works fine; it can reach m2-master.eqiad.wmnet through port 3306 just fine.

Dec 2 2020, 1:18 PM · DBA

Dec 1 2020

akosiaris added a comment to T267327: Run latest Thumbor on Docker with Buster + Python 3.

Oh, it's not stateful, but I think it's high I/O compared to other applications (maybe not as high as jitsi, but higher than other apps in k8s). Let's do a benchmark and see.

Dec 1 2020, 3:31 PM · SRE, User-jijiki, serviceops, Performance-Team

Nov 30 2020

akosiaris committed rLPRI5f97b75046e2: Set profile::prometheus::kubernetes::client_token (authored by akosiaris).
Set profile::prometheus::kubernetes::client_token
Nov 30 2020, 5:20 PM
akosiaris committed rLPRIb90961351d64: Add kubestagemaster dummy keys (authored by akosiaris).
Add kubestagemaster dummy keys
Nov 30 2020, 3:03 PM

Nov 26 2020

akosiaris added a comment to T229397: Puppet: get row/rack info from Netbox.

A larger scope could be to look at all the IPs hardcoded in Puppet and see if it would make sense to import them from Netbox?
Same for prefixes, I guess.

Nov 26 2020, 11:02 AM · observability, User-crusnov, User-jbond, Patch-For-Review, Puppet, SRE

Nov 25 2020

akosiaris added a comment to T265512: Set up Pipeline Configuration in WDQS repo.
Nov 25 2020, 2:49 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
akosiaris edited Description on vm-requests.
Nov 25 2020, 2:28 PM
akosiaris edited Description on vm-requests.
Nov 25 2020, 2:23 PM
akosiaris renamed T268747: codfw: 4 VM request for kubernetes staging from Site: 4 VM request for kubernetes staging in codfw to codfw: 4 VM request for kubernetes staging.
Nov 25 2020, 2:05 PM · Kubernetes, vm-requests, SRE
People empowered akosiaris as an administrator.
Nov 25 2020, 1:50 PM
akosiaris triaged T268747: codfw: 4 VM request for kubernetes staging as Medium priority.
Nov 25 2020, 1:43 PM · Kubernetes, vm-requests, SRE
akosiaris added a subtask for T244335: Upgrade kubernetes clusters to a security supported (LTS) version: T268747: codfw: 4 VM request for kubernetes staging.
Nov 25 2020, 1:42 PM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops
akosiaris added a parent task for T268747: codfw: 4 VM request for kubernetes staging: T244335: Upgrade kubernetes clusters to a security supported (LTS) version.
Nov 25 2020, 1:42 PM · Kubernetes, vm-requests, SRE
akosiaris created T268747: codfw: 4 VM request for kubernetes staging.
Nov 25 2020, 1:42 PM · Kubernetes, vm-requests, SRE

Nov 24 2020

akosiaris added a comment to T268612: Docker image on the build host seems to ignore apt priority for wikimedia packages.

Ouch.

Could we normalize everything to use the public image reference? That would also make local testing easier and more straightforward.
Or do we gain a big benefit by using the internal reference?

The point is consistency. We want to use the same registry when referencing images and saving them.

Sure. My question is more like: why did we start using both names in the first place, and can we stop doing so? :)

Nov 24 2020, 12:41 PM · docker-pkg, serviceops, SRE

Nov 23 2020

akosiaris updated the task description for T268505: New database request: sockpuppet.
Nov 23 2020, 5:31 PM · DBA
akosiaris updated subscribers of T265512: Set up Pipeline Configuration in WDQS repo.

@akosiaris it was unclear to me whether we need the promote section in the pipeline config. I'm referring to this: https://wikitech.wikimedia.org/wiki/PipelineLib/Reference#Promote and I saw it in a couple of configs here: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/mathoid/+/refs/heads/master/.pipeline/config.yaml#34.

Nov 23 2020, 3:41 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Nov 20 2020

akosiaris added a comment to T242855: Undeploy graphoid .

What is the status of the decommissioning of Graphoid?

Nov 20 2020, 10:48 AM · Patch-For-Review, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), Platform Engineering (Icebox), serviceops, SRE, Graphoid
akosiaris updated the task description for T198901: Migrate production services to kubernetes using the pipeline.
Nov 20 2020, 7:50 AM · Release-Engineering-Team (Pipeline), Release-Engineering-Team-TODO, Platform Team Legacy (Watching / External), Epic, Services (watching), SRE, Release Pipeline
akosiaris closed T182331: [Epic] Deploy ORES in kubernetes cluster as Declined.

Dependent tasks T210268 and T210269 have been declined; declining this as well. See T210268#6488834 for the reasoning.

Nov 20 2020, 7:49 AM · SRE, ORES, Machine Learning Platform
akosiaris closed T182331: [Epic] Deploy ORES in kubernetes cluster, a subtask of T198901: Migrate production services to kubernetes using the pipeline, as Declined.
Nov 20 2020, 7:49 AM · Release-Engineering-Team (Pipeline), Release-Engineering-Team-TODO, Platform Team Legacy (Watching / External), Epic, Services (watching), SRE, Release Pipeline

Nov 19 2020

akosiaris added a comment to T268202: Eq: 5 VM request for kafka-test-eqiad cluster.

OK then. +1 from my side (and my role as a rubber-stamper is done here). Feel free to create those VMs. Docs, if you need them, are at https://wikitech.wikimedia.org/wiki/Ganeti#Create_a_VM
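
For the curious, behind those docs VM creation ends up as a gnt-instance invocation on the Ganeti master; a generic sketch, not the exact WMF procedure (sizes, network link and node names are placeholders):

$ sudo gnt-instance add -t drbd --disk 0:size=100g -B vcpus=4,memory=8g \
      -o debootstrap+default --net 0:link=<bridge> -n <pnode>:<snode> <fqdn-of-new-vm>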

Nov 19 2020, 2:33 PM · Patch-For-Review, vm-requests, SRE
akosiaris added a comment to T268202: Eq: 5 VM request for kafka-test-eqiad cluster.

Just to verify, total is 20 vCPUs, 40GB RAM and 500GB disk space?

Nov 19 2020, 2:15 PM · Patch-For-Review, vm-requests, SRE
akosiaris added a comment to T268202: Eq: 5 VM request for kafka-test-eqiad cluster.

Just to verify, total is 20 vCPUs, 40GB RAM and 500GB disk space?

Nov 19 2020, 1:32 PM · Patch-For-Review, vm-requests, SRE
akosiaris closed T241230: Migrate recommendation-api to kubernetes as Resolved.

The service was deployed yesterday, and the traffic switch happened today. Per https://grafana.wikimedia.org/d/Y5wk80oGk/recommendation-api?orgId=1&var-dc=thanos&var-site=eqiad&var-service=recommendation-api&var-prometheus=k8s&var-container_name=All&from=now-3h&to=now traffic is now flowing to the kubernetes-based deployment (alas, there is no corresponding dashboard for the legacy infrastructure). There is some cleanup work to happen, but otherwise this is done. I am gonna resolve it successfully, but feel free to reopen. Thanks to @bmansurov for working through getting the container created and the helm chart ready.

Nov 19 2020, 9:59 AM · Product-Infrastructure-Team-Backlog, Patch-For-Review, serviceops, Release-Engineering-Team, Services, Recommendation-API