Adding https://metallb.universe.tf/ as a potential solution as well.
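For context, a minimal MetalLB layer-2 configuration is just a ConfigMap with an address pool; the snippet below is purely illustrative (namespace/name follow MetalLB's upstream defaults, the address range is made up) and assumes MetalLB itself is already installed per its upstream docs:
$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 192.0.2.10-192.0.2.20
EOF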
Fri, Jan 22
In T271475#6765888, @dduvall wrote:@akosiaris or @Joe, the current /srv/mediawiki-staging/private appears to be a local git repo with no remotes. Do you know if this repo/files live anywhere else?
Anyway, a s/ens5/ens6/ in /etc/network/interfaces fixed the issue. I was wondering whether it makes sense to invest time to "fix" this properly, but having met one instance of it in the 5-6 years we have had ganeti around, I am gonna say it's not worth it. That being said, let's document this.
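For future reference, the fix itself boils down to something like the following (a sketch; the exact interface names obviously depend on the instance):
$ sudo sed -i 's/ens5/ens6/g' /etc/network/interfaces
$ sudo ifup ens6    # or just reboot the VM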
I think the following explains it:
Thu, Jan 21
I've checked off the stdlib and lvm classes as they are from external modules that have been imported into the tree as is (aka vendoring).
In T265893#6764004, @Tgr wrote:In T265893#6761792, @akosiaris wrote:That would work, but would also require that the mwaddlink repo is checked out and kept up to date on stats1008 with whatever is deployed in production.
The stat1008 repo is used for producing the dataset (since that requires access to stats data, and maybe GPUs as well, I'm not sure about that), and that logic is much more likely to change over time than the very simple "take a bunch of tables and copy them verbatim to another database", so I don't think this would require keeping in sync to any larger extent than is already the case.
I've marked T272111 as a parent of this task for greater visibility. This one seems more generic than the CirrusSearch-specific one, hence this relationship, but feel free to undo.
Wed, Jan 20
In T265893#6760579, @Tgr wrote:Talked this over with @kostajh and the options we saw for the batch job were:
- Use a MediaWiki maintenance script to download the dataset over the web and import it into the production table. This is the well-trodden path since there's plenty of scaffolding for MediaWiki batch jobs, but conceptually wrong: the script belongs to the mwaddlink repository, not MediaWiki.
In T179696#6760378, @Legoktm wrote:Got pretty close, one last sticking point is that docker_report hardcodes connecting to the registry over HTTPS. So if you try https://localhost then you'll end up with requests.exceptions.SSLError: hostname 'localhost' doesn't match either of 'docker-registry.discovery.wmnet', 'docker-registry.svc.eqiad.wmnet', 'docker-registry.svc.codfw.wmnet', 'docker-registry.wikimedia.org'. And of course https://localhost:5000 (the HTTP port) fails with a protocol error.
Should we adapt docker_report to allow connecting over HTTP? I thought about using one of the domain names but then we're generating a homepage for a different registry, not the one that instance is serving...or does it not matter?
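For what it's worth, one can sidestep the hostname mismatch when poking at a local registry without touching docker_report at all, by pinning one of the names on the certificate to the loopback address (illustrative only):
$ curl --resolve docker-registry.wikimedia.org:443:127.0.0.1 https://docker-registry.wikimedia.org/v2/_catalog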
In T271134#6760363, @Ladsgroup wrote:@akosiaris does this mean we need to upgrade ores hosts to buster?
Sat, Jan 16
Fri, Jan 15
In T258978#6729580, @kostajh wrote:@akosiaris picking up the thread on this from before the holiday break; IIRC there was some network setup that had to be done for the production instance (staging was already done). Is there anything else that needs to happen from SRE's side before we can begin calling this service in production with our maintenance script?
Thu, Jan 14
In T261369#6746098, @sbassett wrote:In T261369#6695002, @akosiaris wrote:As I understand it, there's a halt on that npm approach which indeed seems to have slipped away from the deploy-repo approach for one or two services in the past year without Security realizing it. This is unfortunate, but also makes it a bad example to follow.
Wait, what? First time I hear of this. When did that halt happen? Has it been communicated? All of the nodejs services on kubernetes have followed the npm install approach for a long time now; what does that mean for them?
I'm not certain there's ever been a reasonable, organizationally-accepted policy or set of guidelines around using various npm commands (especially install) at any point along a given extension/app/service's production deployment path.
Wed, Jan 13
eventgate done. And with this, we can close this task. Thanks to all those that contributed.
eventstreams done. Double checked in logstash and I can see nice log levels now.
I've had a look into it. The culprit is https://github.com/projectcalico/felix/pull/2424. The reason for the change itself is to honor kube-proxy rules in the case of a service with no endpoints. kube-proxy would normally REJECT, but calico rules before the PR above would just ACCEPT. The accepting rule was indeed in a cali- prefixed chain up to calico 3.16 (we just updated to 3.17.1 in the newer staging env), but it's now added directly as the last rule in the FORWARD chain. It's definitely not as compartmentalized as before, but that's the way it was done.
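For anyone debugging this on a node, the difference is visible directly in iptables; the commands below only inspect, they change nothing:
$ sudo iptables -S FORWARD | tail       # since 3.17 the ACCEPT rule is appended directly here
$ sudo iptables -S | grep '^-N cali-'   # the compartmentalized cali- chains used up to 3.16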
Tue, Jan 12
In T241230#6740058, @Andrew wrote:Changes associated with this patch (mostly 27152427487bed18321d582a7f77a301ea114968) have left the following VMs unpuppetized:
- deployment-sca01.deployment-prep.eqiad1.wikimedia.cloud
- deployment-sca02.deployment-prep.eqiad1.wikimedia.cloud
Should those VMs be deleted?
@jeena. With https://gerrit.wikimedia.org/r/641790 reviewed and merged, I just released 0.2.1 and the relevant image was built. That allows us to amend the chart and add the --insecure flag to service-checker so that we can ignore mismatched certs in CI.
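For the record, the resulting CI check should then look something along these lines (host/port and exact flag placement are illustrative, not the actual chart values):
$ service-checker-swagger --insecure <pod-ip> https://<release-name>:<port> -t 2 -s /apispec_1.json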
Mon, Jan 11
In T261369#6707040, @Legoktm wrote:In T261369#6695002, @akosiaris wrote:In T261369#6693707, @Krinkle wrote:Some kind of /deploy repo seems needed, I think, as otherwise we would be deploying unaudited code never seen by a trusted pair of eyes. There'd be no diff to review during production dependency updates or image rebuilds.
How feasible is it in this case to audit the code via a trusted pair of eyes, though? In the nodejs case, which is probably the pathological one, it's borderline impossible. The dependency tree, even flattened, is usually huge and brings in probably tens (if not hundreds) of thousands of lines of code per project. E.g., off the top of my head, citoid's old deploy repo[1] clocked in at 369055 LoC for javascript files only (including blank lines and comments, but the size would be staggering even counting those out) for dependent node modules. I expect this to have increased since then, and other projects exhibit similar numbers. This is a known issue and the npm ecosystem has introduced the npm audit command, which makes this somewhat better by at least informing about known vulnerabilities. But auditing for unknown vulnerabilities is still a herculean task.
As I alluded to earlier, all of the Shellbox dependencies have already been audited via their inclusion in mediawiki/vendor. I don't know if the Shellbox service dependencies will forever be a subset of MediaWiki's, but at least for now I think there's no extra work being added if we want to audit all the PHP dependencies. It might just be me, but I find PHP code easier to audit compared to nodejs's explosion of libraries that might've been transpiled or minified or whatever. And PHP doesn't attempt to ship/compile native code into vendor/!
I guess if the number of dependencies is small enough (which I would expect for the service in question), it remains doable, but it might not be desirable in the future if those increase substantially. In any case, to increase reproducibility and auditability, dependencies should be version pinned.
Shellbox's vendor/ has a little under 16K LoC. I can see us adding a few more libraries like something for metrics, but nothing like the 369K you mentioned for citoid.
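For reference, LoC figures like the ones above can be reproduced roughly with something like this (assuming cloc is available; a plain find/wc works too):
$ cloc --include-lang=JavaScript node_modules/          # JS-only count for a node deploy repo
$ find vendor/ -name '*.php' | xargs wc -l | tail -1    # crude count, includes blanks/comments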
LGTM.
In T270071#6689398, @akosiaris wrote:
- DNS Records with a non-standard TTL. We have just one, for oresrdb, which has a 5M TTL instead of the default 1H; that's not currently supported by the Netbox automation for lack of a place to store that information properly. This record too points to a host IP and is not using a service IP.
We can get rid of the non-standard TTL for that one. In fact, we can get rid of the entire RR. It was put there to facilitate oresrdb maintenance, but due to various difficulties that never really worked out. We can point to the server directly instead.
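For posterity, the record's current TTL can be double-checked with dig before and after the cleanup (the record name below is a placeholder):
$ dig +noall +answer <record>.eqiad.wmnet    # the second column is the remaining TTL in seconds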
Migration guide is at https://github.com/wikimedia/service-runner/blob/master/doc/2.7-2.8_Migration_Guide.md; it looks like most metrics calls will have to be slightly modified after the version bump. Configuration-wise, no change will be needed at that point, but later on we can enable service-runner's native prometheus support and migrate away from statsd.
In T271540#6735233, @Marostegui wrote:Excellent, thanks. I will double check with @akosiaris to see if he can be around in case etherpad requires some action.
Dec 16 2020
In T261369#6693707, @Krinkle wrote:Some kind of /deploy repo seems needed, I think, as otherwise we would be deploying unaudited code never seen by a trusted pair of eyes. There'd be no diff to review during production dependency updates or image rebuilds.
After the helm chart is merged and published (both should happen automatically on a +2, I've +1ed already), the final 2 items for deployment are:
Dec 14 2020
https://grafana.wikimedia.org/d/g0GUXaJMk/t249745?orgId=1 says that instead it might make sense to increase the memory limit. There are 3 spikes above the limit in the last 30 days, leading to extra CPU usage and probably increased GC cycles.
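A quick way to compare the configured limit with what the pods are actually doing, besides the dashboard (namespace/deployment names are placeholders):
$ kube_env <service> eqiad
$ kubectl describe deployment <service> | grep -A 2 Limits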
- DNS Records with a non-standard TTL. We have just one, for oresrdb, which has a 5M TTL instead of the default 1H; that's not currently supported by the Netbox automation for lack of a place to store that information properly. This record too points to a host IP and is not using a service IP.
And finally being now used. Resolving this.
In T265893#6683461, @kostajh wrote:In T265893#6682244, @akosiaris wrote:In T265893#6681943, @kostajh wrote:To clarify what I'm proposing:
Then, a request from stats machine to the kubernetes service would instruct the app to download new datasets from the public endpoint via curl / HTTP.
That's a bad pattern IMHO. The main reason is that we would be using part of the capacity of the service to achieve something that has nothing to do with its actual task, which is to serve requests (i.e. network, CPU and memory would be utilized for a batch job instead of for serving requests). Furthermore, we would be adding an API endpoint that can be called by anyone internally; an attacker that somehow gains access to WMF IPs would be able to exploit this to cause an outage.
The mwmaint option would be open to this vulnerability too, right? You could reset the checksum stored locally and re-run refreshLinkRecommendations.php.
Dec 10 2020
When switching from staging-eqiad to staging-codfw (and vice versa) we would need to:
In T265893#6681943, @kostajh wrote:To clarify what I'm proposing:
Then, a request from stats machine to the kubernetes service would instruct the app to download new datasets from the public endpoint via curl / HTTP.
In T265893#6681979, @kostajh wrote:In T265893#6669424, @akosiaris wrote:@kostajh, @MGerlach I've gone ahead and created tokens and namespaces. You should be able to deploy the service now. Docs are at https://wikitech.wikimedia.org/wiki/Deployments_on_kubernetes#Deploying_with_helmfile, but the TL;DR is you'll need to create a change similar to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/645076, tailored to your service, and then follow the helmfile -e <env> steps mentioned in the docs. Also, you should have +2 access to the repo; let me know if you don't so I can add those.
Thank you @akosiaris. I've run helmfile -e for staging (verified with curl -L https://staging.svc.eqiad.wmnet:4005/apidocs + service-checker-swagger staging.svc.eqiad.wmnet https://staging.svc.eqiad.wmnet:4005 -t 2 -s /apispec_1.json), eqiad and codfw. AIUI it's back to your team now to implement the LVS / networking setup for eqiad/codfw.
Dec 9 2020
In T265893#6672752, @kostajh wrote:In T265893#6669463, @akosiaris wrote:Btw, if the service is going to reach out via the network to any other resource than MySQL (which has already been taken care of), now is the time to say so :-)
I think we will be downloading database dumps via Swift from within the container. Does that require additional setup on your end?
In T267653#6678721, @JMeybohm wrote:The new calico chart is merged, thanks @akosiaris
What is missing currently is a proper RoleBinding for the calicoctl user, as I was not sure yet what permissions it's going to need.
We should not be using the tool for changing calico config, that's to be done via the helm chart now. But we will want to keep the analyze functionality intact. Could not find any docs on that so far, so we will maybe just have to figure it out when we have a node in staging-codfw.
In T265893#6672750, @kostajh wrote:In T265893#6669424, @akosiaris wrote:@kostajh, @MGerlach I've gone ahead and created tokens and namespaces. You should be able to deploy the service now. Docs are at https://wikitech.wikimedia.org/wiki/Deployments_on_kubernetes#Deploying_with_helmfile, but the TL;DR is you'll need to create a change similar to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/645076, tailored to your service, and then follow the helmfile -e <env> steps mentioned in the docs. Also, you should have +2 access to the repo; let me know if you don't so I can add those.
Thanks @akosiaris! I will work on getting a patch for you this week.
I don't have +2 to that repo, could you please add me, @MGerlach and @Tgr there?
In T269581#6679290, @kostajh wrote:In T269581#6679284, @akosiaris wrote:Arguably, in the interest of https://en.wikipedia.org/wiki/Separation_of_concerns it's probably better that whatever instance of the service mediawiki queries is NOT exposed to the public. That way requests from the infrastructure won't be mixed with external ones, allowing for better capacity planning, service level support etc.
That makes sense.
We could however instantiate a second deployment of the software, e.g. as a second helm/helmfile release and expose that one. Depending on the timeline it might end up being easier than expected, as we are working on some changes in the infrastructure which might remove a lot of manual work from SRE's plate (the stuff in the README).
If it were quick and easy, I was hoping to do this soon to avoid the need for other developers to set up the software themselves for testing and development. But I could also figure out an interim solution where the service is running on e.g. Toolforge or WMCS instead of a kubernetes prod instance.
Arguably, in the interest of https://en.wikipedia.org/wiki/Separation_of_concerns it's probably better that whatever instance of the service mediawiki queries is NOT exposed to the public. That way requests from the infrastructure won't be mixed with external ones, allowing for better capacity planning, service level support etc.
In T269731#6678565, @kostajh wrote:Requestor -- Please coordinate obtaining a comment of approval on this task from the approving party.
cc @akosiaris @marcella Please let me know if you have any questions, thank you.
The lvm puppet module is a bit problematic. It's been released under the GPL-2 license, whereas per T67270 we want to move to Apache2 for that repo. So in the grand scheme of things, we probably want to rip it out of our repo.
Dec 8 2020
Dec 4 2020
Btw, if the service is going to reach out via the network to any other resource than MySQL (which has already been taken care of), now is the time to say so :-)
@kostajh, @MGerlach I've gone ahead and created tokens and namespaces. You should be able to deploy the service now. Docs are at https://wikitech.wikimedia.org/wiki/Deployments_on_kubernetes#Deploying_with_helmfile, but the TL;DR is you'll need to create a change similar to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/645076, tailored to your service, and then follow the helmfile -e <env> steps mentioned in the docs. Also, you should have +2 access to the repo; let me know if you don't so I can add those.
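For the avoidance of doubt, the day-to-day deploy flow then looks roughly like this (paths and environment names are indicative; the wikitech page above is authoritative):
$ ssh deploy1001.eqiad.wmnet
$ cd /srv/deployment-charts/helmfile.d/services/<your-service>
$ helmfile -e staging diff    # review what would change
$ helmfile -e staging apply   # deploy to staging
$ helmfile -e eqiad apply     # then the production clusters
$ helmfile -e codfw apply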
In T238753#6668872, @MSantos wrote:In T238753#6668798, @akosiaris wrote:In T238753#6668586, @MoritzMuehlenhoff wrote:There's also an additional option:Postgres 9.6 is also available on Buster (We already use it for cescout, which has a strict dependency on 9.6 since OONI upstream publishes there datasets that way).
From a quick glance Buster has all the deps (src:leveldb, src:geos, Golang 1.11) required to build imposm3, so we can create an imposm 3 deb on Buster, setup a Ganeti instance on Buster (maps-import1001) with Postgres 9.6 and add it to the Maps Postgres setup. Then the OSM import can simply happen from that separate instance (until we eventually also migrate maps at large).
I think there is also a dated cassandra requirement somewhere in there, so it might not be that easy. But that's almost hearsay, so @MSantos could you confirm?
Actually, this hypothetical maps-import machine doesn't need to have a Cassandra node. The Cassandra requirement is storage for vector-tiles generated after the OSM data is synced.
I've attempted to document the current data flow in these diagrams. It should give a nice understanding of where the data lies in the infrastructure.
That being said, I worked a bit on packaging imposm3 yesterday. I am happy to report success: https://people.wikimedia.org/~akosiaris/
I started with stretch in mind (since maps is stretch), but it turns out that it can't be built on stretch, so it requires buster after all. That being said, if we can backport leveldb and libgeos we might be able to run it on stretch.
YAY! Well, I guess it's a matter of choosing the best option for you between the tradeoffs of backporting the dependencies or starting a new machine for the OSM DB master and sync scripts.
My 2 cents:
Backports strategy:
Pros:
- It unblocks our work and we can test imposm migration and prepare it for deployment beginning of next quarter.
- Less work to leverage the needed infrastructure
Cons:
- More backported binaries to keep track of
- Keeps the status quo, and we know this isn't great
maps-import strategy:
Pros:
- Iteratively start to move towards debian buster
- Isolate OSM sync scripts from the production infrastructure
- Doesn't change the way we store OSM data because we already have the main instance doing the OSM sync and replicating the data through the cluster
Cons:
- Needs more planning and changes the scope of current work
- More work to leverage infrastructure
Also, the 2 strategies can be iterative steps of the same plan
In T238753#6668876, @Jgiannelos wrote:In T238753#6668798, @akosiaris wrote:In T238753#6668586, @MoritzMuehlenhoff wrote:There's also an additional option:Postgres 9.6 is also available on Buster (We already use it for cescout, which has a strict dependency on 9.6 since OONI upstream publishes there datasets that way).
From a quick glance Buster has all the deps (src:leveldb, src:geos, Golang 1.11) required to build imposm3, so we can create an imposm 3 deb on Buster, setup a Ganeti instance on Buster (maps-import1001) with Postgres 9.6 and add it to the Maps Postgres setup. Then the OSM import can simply happen from that separate instance (until we eventually also migrate maps at large).
I think there is also a dated cassandra requirement somewhere in there, so it might not be that easy. But that's almost hearsay, so @MSantos could you confirm?
That being said, I worked a bit on packaging imposm3 yesterday. I am happy to report success: https://people.wikimedia.org/~akosiaris/
I started with stretch in mind (since maps is stretch), but it turns out that it can't be built on stretch, so it requires buster after all. That being said, if we can backport leveldb and libgeos we might be able to run it on stretch.
For those not familiar with the state of maps who stumble upon this task, a feeling of exasperation is normal. Trying to adopt infrastructure that was not well maintained in the past means that some tech debt needs to be paid. It's
Thanks @akosiaris
I also had some success, first with FPM and later by properly packaging imposm3 and 2-3 missing deps according to dh-make-golang (levigo, fsnotify). I can share them if it's of any help.
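For anyone retracing this, the rough dh-make-golang flow is something like the below (invocations from memory, the generated directory name will differ):
$ dh-make-golang make github.com/omniscale/imposm3   # scaffolds debian/ packaging for the Go module
$ cd <generated-source-dir>
$ dpkg-buildpackage -us -uc                          # build unsigned binary packages
The missing deps (levigo, fsnotify) can be handled the same way, each as its own source package.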
In T269357#6666944, @MSantos wrote:I am also interested in the "testing the new puppet rules" part. Could you please share a bit on how this will be done?
If the only way to apply puppet rules is through the master branch, I'm thinking of a conditional rule that would load the new imposm_planet_sync classes. Then I would add the needed parameters to the hieradata configuration for the machine available for testing.
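One low-risk way to verify such a conditional rule before it takes effect, assuming shell access to the test machine, is a no-op puppet run:
$ sudo puppet agent --test --noop   # shows what would change without applying anything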
In T238753#6668586, @MoritzMuehlenhoff wrote:There's also an additional option:Postgres 9.6 is also available on Buster (We already use it for cescout, which has a strict dependency on 9.6 since OONI upstream publishes there datasets that way).
From a quick glance Buster has all the deps (src:leveldb, src:geos, Golang 1.11) required to build imposm3, so we can create an imposm 3 deb on Buster, setup a Ganeti instance on Buster (maps-import1001) with Postgres 9.6 and add it to the Maps Postgres setup. Then the OSM import can simply happen from that separate instance (until we eventually also migrate maps at large).
Dec 3 2020
In T238753#6630924, @MSantos wrote:@hnowlan this can be a good resource for this task https://github.com/omniscale/imposm3#binary
In T269357#6666853, @MSantos wrote:In T269357#6666829, @MoritzMuehlenhoff wrote:Is this specifically only about maps2007? Given that you're both members of the maps-admin group you can log into any maps host, but maps2007 in particular is currently inaccessible since Puppet was disabled there for a long time (which made it evicted from the system records in the Puppet database). This needs to be fixed in general and is unrelated to your access permissions, right now not even people in SRE can perform a standard SSH login into the system in the current state of it.
@MoritzMuehlenhoff hmm, I assumed that the machine was only out of production traffic. I guess there is another machine depooled (maps2002), but that machine has disk space issues with Cassandra.
VMs are up and running and the services (etcd, apiserver) have been set up on them.
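A couple of quick sanity checks on those VMs (ports and flags are indicative, adjust to the actual cluster setup):
$ sudo etcdctl cluster-health                # etcd v2 API; for v3 use: ETCDCTL_API=3 etcdctl endpoint health
$ curl -k https://127.0.0.1:6443/healthz     # apiserver liveness; the port may differ per cluster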
Dec 2 2020
In T267214#6662807, @Marostegui wrote:In T267214#6662790, @akosiaris wrote:In T267214#6662762, @Marostegui wrote:I have tested the connection from kubernetes1017 which is on 10.64.0, and it works fine, it can reach m2-master.eqiad.wmnet thru port 3306 just fine.
For what is worth, that test would NOT catch a problem if there was one. Kubernetes pods do not utilize the IPs of their nodes, but rather each have their own.
However, I did test too and it's fine.
For posterity's sake (and this is probably worthy of a wikitech page), the process was (needs sudo on deploy1001, aka root, but this is being revisited due to the helm3 migration):
$ ssh deploy1001
$ sudo -i
$ kube_env admin codfw
$ kubectl -n default run testing --rm -it --image=docker-registry.discovery.wmnet/wmfdebug
(wait for the prompt)
$ nmap -p 3306 m2-master.eqiad.wmnet
[snip]
rDNS record for 10.64.0.135: dbproxy1013.eqiad.wmnet
PORT     STATE SERVICE
3306/tcp open  mysql
Thanks Alex: for my own understanding, by the comment at T267214#6652778 - any of those pods would be using their own IPs, but always belonging to the 10.64.0 range, no? (which was the intention of the test with the node)
In T267214#6662762, @Marostegui wrote:I have tested the connection from kubernetes1017 which is on 10.64.0, and it works fine, it can reach m2-master.eqiad.wmnet thru port 3306 just fine.
Dec 1 2020
In T267327#6659813, @Ladsgroup wrote:oh it's not stateful but I think it's high I/O compared to other applications (maybe not as high as jitsi but higher than other apps in k8s). Let's do a benchmark and see.
Nov 30 2020
Nov 26 2020
In T229397#6651039, @ayounsi wrote:Larger scope could be to look at all the IPs hardcoded in Puppet and see if it would make sense to import them from Netbox?
Same for prefixes I guess.
Nov 25 2020
In T265512#6648623, @Gehel wrote:
Nov 24 2020
In T268612#6644689, @JMeybohm wrote:In T268612#6644662, @Joe wrote:In T268612#6644659, @JMeybohm wrote:Ouch.
Could we normalize everything to use the public image reference? That would also make local testing easier and more straightforward.
Or do we gain any benefit by using the internal reference? The point is consistency. We want to use the same registry when referencing images and saving them.
Sure. My question is more like: why did we start using both names in the first place, and can we stop doing so? :)
Nov 23 2020
In T265512#6637980, @Mstyles wrote:@akosiaris it was unclear to me whether we need the promote section in the pipeline config. I'm referring to this: https://wikitech.wikimedia.org/wiki/PipelineLib/Reference#Promote and I saw it in a couple of configs here: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/mathoid/+/refs/heads/master/.pipeline/config.yaml#34.
Nov 20 2020
In T242855#6636185, @hashar wrote:What is the status of the decommissioning of Graphoid?
Dependent T210268 and T210269 have been declined, declining this as well. See T210268#6488834 for the reasoning.
Nov 19 2020
OK then. +1 from my side (and my role as a rubber-stamper is done here). Feel free to create those VMs. Docs if you need them are at https://wikitech.wikimedia.org/wiki/Ganeti#Create_a_VM
In T268202#6633500, @akosiaris wrote:Just to verify, total is 20 vCPUs, 40GB RAM and 500GB disk space?
Just to verify, total is 20 vCPUs, 40GB RAM and 500GB disk space?
The service was deployed yesterday, and the traffic switch happened today. Per https://grafana.wikimedia.org/d/Y5wk80oGk/recommendation-api?orgId=1&var-dc=thanos&var-site=eqiad&var-service=recommendation-api&var-prometheus=k8s&var-container_name=All&from=now-3h&to=now, traffic is now flowing to the kubernetes-based deployment (alas, there is no corresponding dashboard for the legacy infrastructure). There is some cleanup work left to happen, but otherwise this is done. I am gonna resolve it successfully, but feel free to reopen. Thanks to @bmansurov for working through getting the container created and the helm chart ready.