Tue, Nov 24
Mon, Nov 23
Fri, Nov 20
Thu, Nov 19
OK then. +1 from my side (and my role as a rubber-stamper is done here). Feel free to create those VMs. Docs if you need them are at https://wikitech.wikimedia.org/wiki/Ganeti#Create_a_VM
Just to verify, the total is 20 vCPUs, 40GB RAM and 500GB of disk space?
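For anyone following along, creating one of these VMs by hand with plain Ganeti looks roughly like the sketch below; the hostname, OS variant, network link and per-VM sizing are placeholders and not the actual values for this request (the authoritative procedure is the wikitech page linked above).

```
# Hedged sketch: standard Ganeti flags, made-up values for illustration only.
sudo gnt-instance add \
  -t drbd -I hail \
  --net 0:link=private \
  -o debootstrap+default --no-install \
  -B vcpus=4,memory=8g \
  --disk 0:size=100g \
  example1001.eqiad.wmnet
```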
The service was deployed yesterday, and the traffic switch happened today. Per https://grafana.wikimedia.org/d/Y5wk80oGk/recommendation-api?orgId=1&var-dc=thanos&var-site=eqiad&var-service=recommendation-api&var-prometheus=k8s&var-container_name=All&from=now-3h&to=now, traffic is now flowing to the Kubernetes-based deployment (alas, there is no corresponding dashboard for the legacy infrastructure). There is some cleanup work left to do, but otherwise this is done. I am gonna resolve it successfully, but feel free to reopen. Thanks to @bmansurov for working through getting the container created and the helm chart ready.
For what it's worth
Wed, Nov 18
More and more duplicates are being merged into this one, and the stats from the tests above suggest a mean failure rate of ~20%, which is a lot. Bumping priority to High.
Mon, Nov 16
Wed, Nov 11
A few more tests. The TL;DR is that Varnish 6 is probably at fault, but with a question mark.
Tue, Nov 10
Interestingly, proton returns transfer-encoding: chunked responses, which obviously don't have a Content-Length. So, for the internal service, cl-matches-bytes makes no sense and it's not there.
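A quick way to see this for yourself (the URL below is a placeholder, not the actual proton endpoint):

```
# Dump only the response headers of a GET; a chunked response shows
# Transfer-Encoding: chunked and no Content-Length header at all,
# so there is no declared length to compare against the bytes received.
curl -s -o /dev/null -D - "http://proton.example.internal/some/endpoint" \
  | grep -iE '^(transfer-encoding|content-length)'
```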
I've also run the same tests against restbase.svc.eqiad.wmnet in P13257 and I have the following
Mon, Nov 9
Change merged; 2.1.0 is up for review at https://gerrit.wikimedia.org/r/c/operations/docker-images/docker-pkg/+/640268 and I expect it to be released soon. I'll resolve this, feel free to reopen though.
Fri, Nov 6
So this is not specific to frwiki, it seems. Is there perhaps some correlation between page size and failure rate? Or maybe between failure rate and response time?
That's excellent news @sdkim. Many thanks for this!
Thu, Nov 5
Some of the above items are optional (e.g. cookbooks, if nothing is done often enough to be worth automating), but good to have.
Wed, Nov 4
Tue, Nov 3
We are not able to go to 1.19 because Calico only supports 1.18.
Mon, Nov 2
The idea was indeed to just make sure that the packages are installed before anything else in the class happens. These days, if one puts ensure_packages() at the top of the manifest, we get that. So we can indeed probably move off require_packages. However, the bad thing with all of this is that the migration is untestable: the relationships aren't exercised during catalog compilation but rather during catalog application by the agent, which we don't have any decent way of testing :-(. Of course, the worst that can happen is that we regress to having to run puppet more than once during the reimaging of a host.
Fri, Oct 30
This was done, resolving.
Couple of points
Thu, Oct 29
Oct 27 2020
I tried installing 6.0.2 on cp4032, and to my surprise I found out that 6.0.6 and 6.0.2 are not binary compatible:
Oct 26 2020
Oct 23 2020
/me subscribing anyway, thanks!
For what it's worth, the idea that Daniel explains above would solve the issue for now without the need to move to Kubernetes, satisfying several of the requirements without requiring significant effort.
Oct 22 2020
That's pretty interesting; there shouldn't be so much throttling at such low CPU usage. user+system summed barely hits 1/5 of the limit.
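For the record, one can eyeball the throttling straight from the container's own cgroup, roughly like this (cgroup v1 paths assumed; the pod and container names are placeholders):

```
# nr_throttled vs nr_periods shows how often the CFS quota was hit,
# even when the average user+system CPU looks well below the limit.
kubectl exec <pod> -c <container> -- cat /sys/fs/cgroup/cpu,cpuacct/cpu.stat
```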
Oct 20 2020
LGTM. Perhaps do codfw as well since you are at it, to have a fallback/backup?
Oct 19 2020
I fear this ain't gonna be easy. When we tried the approach that exists for all other hosts, we ended up with broken connectivity for the Ganeti hosts. See T233906, which ended up with the decision described in T233906#5529507. T234207 was then created as an investigation task to handle improvements to our puppetization of network configuration (which is crude and barely existent, to be honest). There has been no movement on this since then.
Any news on this one? (I just found out about it today while working on T265607)
Oct 16 2020
For the "We are limited on the docker-registry infrastructure side" point, the sanest way out of this (until we hit the next bottleneck) is to scale out, aka just more docker registry VMs. That should be easily doable; we have the capacity. The VMs should be split across the rack rows for higher availability.
This was brought to my attention yesterday by @WDoranWMF, sorry for missing it and many thanks for the ping.
Oct 15 2020
The first pull test was successful: 34 hosts pulled from the registry simultaneously. The test lasted about 5 minutes.
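For context, the test boiled down to kicking off the same pull from many hosts at once; a rough sketch of the idea (the image name, registry hostname and host list are placeholders, and in practice this would be driven via our orchestration tooling rather than a plain ssh loop):

```
# Start one docker pull per host in parallel and wait for all of them.
while read -r host; do
  ssh "$host" 'sudo docker pull docker-registry.example.org/some/image:tag' &
done < hosts.txt
wait
```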
Lowering priority as the service isn't broken, and setting as Stalled as we are waiting for upstream to release the new version that fixes this.
Oct 14 2020
1st obstacle already found. The push failed with a '500 Internal Server Error'. Logs indicate
Oct 12 2020
Setting as stalled for now, pending the investigation mentioned in the last comment.
And again, can't reproduce. Not only that, but logs around the time of the report indicate that the daemon was working fine. That is also supported by systemd's status for the service
Oct 9 2020
A couple of requirements from my side, regardless of where those sites are deployed and the technology used:
Oct 8 2020
Overall, I am willing to test this out; a couple of points though:
Oct 7 2020
Envoy is being documented at https://wikitech.wikimedia.org/wiki/Envoy#Envoy_at_WMF. It is being used by termbox to talk to mediawiki (it's a component of a service mesh). The idea is to have low-cost persistent TLS connections, with retries and telemetry. For more insight, aside from the doc link above, the following Grafana dashboard is useful: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=termbox&var-destination=mwapi-async&from=now-7d&to=now
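For a quick local look at what the sidecar is doing, Envoy's admin /stats endpoint exposes the retry and connection counters that the dashboard above is built on; the admin port here is a placeholder, not necessarily what our charts configure:

```
# Count retries and active upstream connections as seen by the envoy sidecar.
curl -s http://localhost:9901/stats | grep -E 'upstream_rq_retry|upstream_cx_active'
```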
Sorry for not answering earlier.
Oct 6 2020
I think it's showing already
Oct 5 2020
The change to the changeprop service above seems to have solved the daily saw-like pattern.
Oct 3 2020
The upgrade happened successfully, and tickets for follow-up work required as a result of this upgrade can now be opened under the OTRS 6 column in the OTRS project in Phabricator. So, I'll resolve this.
Oct 2 2020
/me rubberstamping. Thanks for this!
Oct 1 2020
All the old stuff has been removed; I'll resolve this.