We just reverted this deployment on production; it seems to have caused issues with the queryservice-updater.
@WMDE-leszek, give me the go-ahead to delete the old instance and I will try to do that.
Ok, the queryservice should now be updated with a fresh dump of the old instance, and the queryservice-updater is running again.
https://openstack-browser.toolforge.org/project/wikibase-registry is cached now but should hopefully update and indicate that the proxies are using the new instance.
So, I think it's done, but it turns out the wdqs-updater has been broken since forever :D
We haven't seen any of these errors on the nodes in a while; I'd suggest we close this ticket as resolved or declined.
So, another update: in the last 7 days this occurred 24 times in total, but it has not happened since June 24th.
Since the pods got more memory last week, we've had 10 aborted clients.
Thu, Jun 23
Ok, we are definitely seeing this again, and this time it seems like normal usage through the API, which might indicate something is still broken.
Ok, so I tried installing this on staging for a while, and it does indeed seem to work pretty much out of the box. Some dashboards still do not report the metrics I suppose they should, but overall it seems promising.
So, starting off, I had a wee look at https://github.com/bitnami/charts/tree/master/bitnami/kube-prometheus and realized I'd have to connect all the pieces myself to get the full stack with Alertmanager etc.
Looking at this together with @Rosalie_WMDE, we saw this just happened for a bunch of requests; let's at least figure out what happened there and see if it's some other bot or a user.
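As a starting point, here's a minimal sketch of tallying which user agents are behind those requests; it assumes a combined-format nginx access log, and the file name and status codes filtered on are guesses, not our actual setup:

```python
import re
from collections import Counter

# Tally user agents behind suspicious requests. Assumes a combined-format
# nginx access log; file name and status codes are assumptions.
UA_RE = re.compile(r'"[^"]*" "([^"]*)"$')  # last two quoted fields: referer, UA
counts = Counter()
with open("access.log") as log:
    for line in log:
        if '" 499 ' in line or '" 502 ' in line:  # client aborts / upstream errors
            m = UA_RE.search(line.rstrip())
            if m:
                counts[m.group(1)] += 1

for agent, n in counts.most_common(10):
    print(f"{n:6d}  {agent}")
```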
Sweet, seems to work!
Seems to work, moving to done.
Wed, Jun 22
Let's cut a new chart and just change the image value for each environment; it's already there.
Fantastic, it works!
Sweet, looks like we got them all. Thanks!
Deployed https://github.com/wmde/wbaas-deploy/pull/418 to staging, but the email is still exposing the internal MediaWiki IP.
Tue, Jun 21
Jobs are currently executed during web requests, and you can keep an eye on the number of jobs in the queue by checking https://beyond-notability.wikibase.cloud/w/api.php?action=query&meta=siteinfo&siprop=statistics&format=json
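For convenience, here's a minimal sketch of polling that endpoint from Python; the `jobs` field in the statistics block is the current queue length:

```python
import requests

# Poll the MediaWiki siteinfo statistics for the current job queue length.
API = "https://beyond-notability.wikibase.cloud/w/api.php"
params = {
    "action": "query",
    "meta": "siteinfo",
    "siprop": "statistics",
    "format": "json",
}
stats = requests.get(API, params=params, timeout=10).json()["query"]["statistics"]
print("jobs in queue:", stats["jobs"])
```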
Hello @Drjwbaker, there is currently no dedicated job runner for wikibase.cloud, so there could be a delay in how quickly items are updated/added to Elasticsearch.
Will make the same change for that pod.
Ok, I didn't realize it, but the backend pod is suffering from the same problems; it's using the default values in the probes.
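For illustration, a sketch of loosening those probe values with the Kubernetes Python client; the deployment/container names, namespace, and timings here are hypothetical, and in practice we'd set this through the chart values rather than a live patch:

```python
from kubernetes import client, config

# Loosen the default probe timings on the backend deployment.
# Names, namespace, and values are hypothetical placeholders.
config.load_kube_config()
probe_patch = {
    "spec": {"template": {"spec": {"containers": [{
        "name": "backend",
        "readinessProbe": {"timeoutSeconds": 5, "periodSeconds": 10},
        "livenessProbe": {"timeoutSeconds": 5, "failureThreshold": 6},
    }]}}}
}
client.AppsV1Api().patch_namespaced_deployment(
    name="backend", namespace="default", body=probe_patch
)
```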
None of the nodes are reporting these events any longer.
So, after a few days, the large number of aborted connections has stopped along with the drop in traffic from the bot.
Fri, Jun 17
Needs another prod deployment after staging.
The timeout is too short.
Thanks @Deniz_WMDE for deploying the changes; moving back to blocked/stalled to keep an eye on it.
Since we don't have a robots.txt file yet, this has previously been done in the nginx-ingress chart by specifying the user-agent pattern.
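Once deployed, a quick way to sanity-check the block from Python; the host and user-agent strings below are placeholders, not the actual bot pattern:

```python
import requests

# Expect the nginx-ingress user-agent block to reject the bot pattern
# while normal clients still get through. Host and UAs are placeholders.
URL = "https://example.wikibase.cloud/wiki/Main_Page"
blocked = requests.get(URL, headers={"User-Agent": "SomeBot/1.0"})
normal = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"})
print(blocked.status_code, normal.status_code)  # e.g. 403 vs 200 if the block works
```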
Ok, so looking into these premature closures of the upstream, it seems most of them are due to a bot.
Thu, Jun 16
So, giving up on that idea, I had a look at trying to correlate the name of the MediaWiki database with the actual domain used for the requests, and I think I found something interesting.
Batch C looks good to me.