I am going to resolve this per the comments above. Feel free to reopen. Many thanks to @BBlack and @CDanis for figuring this out.
Nov 19 2020
For what it's worth
Nov 18 2020
More and more duplicates are being merged into this one, and stats from the tests above suggest a mean failure rate of ~20%, which is a lot. Bumping priority to High.
Nov 16 2020
In T266373#6623109, @Jgiannelos wrote:
In T266373#6616778, @akosiaris wrote:
In T266373#6613038, @Jgiannelos wrote: @akosiaris More from debugging on this issue:
- Querying directly the RESTBase service for a PDF render of an article doesn't reproduce the issue
Looking at the pastes above, I must say I don't see a direct RESTBase service (restbase.discovery.wmnet or restbase.svc.eqiad.wmnet or restbase.svc.codfw.wmnet) call. There is a single wget against the proton service, but I don't think that gives us enough data. However, I think we indeed need to test this more.
I used the same scripts pointing to restbase.svc.eqiad.wmnet and proton.svc.eqiad.wmnet but since it didn't show any issues I didn't paste the whole run output.
Where are all these run from, btw? Judging by the speeds, a local DSL?
Yes, a local cable connection, but I think I had similar results from the deployment node too.
Nov 11 2020
A few more tests. The TL;DR: Varnish 6 is probably at fault, but with a question mark.
Nov 10 2020
Interestingly, proton returns transfer-encoding: chunked responses, which obviously don't carry a Content-Length. So, for the internal service, the cl-matches-bytes check makes no sense and isn't there.
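To illustrate the point, here is a minimal sketch (not from the task; the URL below is a placeholder for a proton-style endpoint, not the real internal address): a cl-matches-bytes style comparison only works when a Content-Length header is present, which a chunked response never carries.

import requests

# Placeholder for a proton-style PDF endpoint; not the real internal address.
url = "https://proton.example.internal/en.wikipedia.org/v1/pdf/Example/a4"

resp = requests.get(url, stream=True)
body = b"".join(resp.iter_content(chunk_size=8192))

if resp.headers.get("Transfer-Encoding", "").lower() == "chunked":
    # Chunked responses carry no Content-Length, so there is nothing to compare.
    print(f"chunked: received {len(body)} bytes, no declared length")
else:
    expected = int(resp.headers["Content-Length"])
    print(f"cl-matches-bytes: {expected == len(body)} "
          f"(declared {expected}, received {len(body)})")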
In T266373#6616625, @sdkim wrote: Given that we, Product Infra, are not finding issues in our service-level investigation, we're asking that SRE take a look as a next step in helping to resolve this. @akosiaris, would you be a good person to assign?
I've also run the same tests against restbase.svc.eqiad.wmnet in P13257 and I have the following
In T266373#6613038, @Jgiannelos wrote:@akosiaris More from debugging on this issue:
- Querying directly the RESTBase service for a PDF render of an article doesn't reproduce the issue
In T265504#6615197, @Mstyles wrote:@akosiaris I started using the new Java images that you uploaded. I wasn't able to install gpg in the build process. There are some conflicts. We can skip gpg verification of the Flink tar, but I don't think that's a good idea. I will continue to do some debugging.
Error message:
The following packages have unmet dependencies:
 gpg : Depends: gpgconf (= 2.2.12-1+deb10u1~bpo9+1) but it is not going to be installed
       Depends: libassuan0 (>= 2.5.0) but 2.4.3-2 is to be installed
       Depends: libgpg-error0 (>= 1.35) but 1.26-2 is to be installed
E: Unable to correct problems, you have held broken packages.
Nov 9 2020
Change merged; 2.1.0 is up for review at https://gerrit.wikimedia.org/r/c/operations/docker-images/docker-pkg/+/640268 and I expect it to be released soon. I'll resolve this, feel free to reopen though.
In T263910#6609648, @calbon wrote:@akosiaris Can you review it? I don't know enough about the nodes vs redis connection to intelligently review.
In T171157#6611644, @Aklapper wrote:
In T171157#3538015, @faidon wrote: Setting to stalled until we decide what to actually do with the internal CA, as we're considering dropping it entirely in favour of other options.
@akosiaris / @faidon: Has this situation somehow changed by the now-resolved T133717: Letsencrypt all the prod things we can - planning / T194962: Create and deploy a centralized letsencrypt service / Acme-chief (though I'm not sure if that also touched CA monitoring at all)?
Nov 6 2020
So this is not specific to frwiki, it seems. Is there perhaps some correlation between page size and failure rate? Or maybe between failure rate and response time?
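If it helps frame that question, here is a quick sketch of how one might check it (the numbers below are hypothetical; real per-article results would come from the test scripts above):

from statistics import correlation  # available in Python 3.10+

# Hypothetical per-article results: (page size in bytes, 1 if the render failed, 0 otherwise)
results = [(120_000, 0), (2_400_000, 1), (800_000, 0), (5_100_000, 1), (300_000, 0)]

sizes = [size for size, _ in results]
failures = [failed for _, failed in results]

# Pearson (point-biserial) correlation between page size and failure;
# a value near +1 would suggest larger pages fail more often.
print(correlation(sizes, failures))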
That's excellent news, @sdkim. Many thanks for this!
Nov 5 2020
Some of the above items are optional (e.g. cookbooks, if there is nothing that is done often and is automatable) but good to have.
Nov 4 2020
In T244335#6602317, @JMeybohm wrote:
In T244335#6600895, @akosiaris wrote: We are not able to go to 1.19 because calico only supports 1.18.
Looks like this isn't true. Judging from https://github.com/projectcalico/calico/commit/21a45a4a141fff03b251fde2f1ab77fbb0c903ee#diff-f386c272afd3d855bf9f1d3609d1782962951258a58e2b298df60c70b16517ee, the calico 3.16 requirements page will be updated soon.
Not sure about that. The commit is from 2nd Sept. and never made it to the 3.16 release branch. I would guess it's for 3.17.
Nov 3 2020
In T265504#6598430, @Mstyles wrote:@akosiaris when you get some time, can you please take another look at https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/635074
We are not able to go to 1.19 because calico only supports 1.18.
Nov 2 2020
The idea was indeed to just make sure that the packages are installed before anything else in the class happens. These days, if one puts ensure_packages() at the top of the manifest, we have that, so we can indeed probably move off require_packages. However, the bad thing with all of this is that the migration is untestable: the relationships aren't exercised during catalog compilation but rather during catalog application by the agent, which we don't have any decent way of testing :-(. Of course, the worst that can happen is that we regress to having to run puppet more than once during reimaging of a host. Anyway, I am fine with this if require_packages feels like tech debt; just keep in mind we might have regressions that show their heads down the road.
Oct 30 2020
This was done, resolving.
Couple of points
In T266766#6591219, @JMeybohm wrote:With Kubernetes 1.19.3 things changed a bit and we now need a docker version supporting the --platform flag for FROM.
The envoy builder host would support that but lacks enough space on /var/lib/docker to hold all the intermediate images used for build.
Oct 29 2020
In T264710#6586070, @Addshore wrote:Sounds like a fine solution from our side for now.
I'll let serviceops do with this ticket as they wish (keep it or close it) and I'll get on to creating some tickets for:
Oct 27 2020
In T266373#6576166, @CDanis wrote:
In T266373#6576161, @Framawiki wrote: See also this Grafana dashboard showing a decrease of daily PDF rendering by Proton from 80k to 20k since the beginning of August.
Looks like that's when Proton migrated to Kubernetes:
https://grafana.wikimedia.org/d/llIEd7MMz/proton?viewPanel=68&orgId=1&from=now-30d&to=now
The dashboard linked by @Framawiki is probably overdue for deletion.
I tried installing 6.0.2 on cp4032, and to my surprise I found out that 6.0.6 and 6.0.2 are not binary compatible:
In T265504#6580948, @Zbyszko wrote:Could you elaborate on that a bit?
Sure, here goes: We are using Apache Flink[1] as a platform for the event processing we do to feed Wikidata Query Service. We want to move the Flink deployment to Kubernetes, hence this ticket. Apache Flink provides its own docker image[2] which, in other circumstances, we would build upon. What @Mstyles is doing now is basically replaying the work the original Flink contributors did for their docker image, which, according to our current knowledge, is what we must do.
The actual Dockerfile (with an additional entry script) is here [3]; it would be great if we didn't need to make sure we cover everything that is handled there with each Flink update.
In T265504#6580645, @Zbyszko wrote:@akosiaris I see, makes sense. I still would like to solve the issue with replicating the original dockerfile - can we deploy Flink images to our registry - even if we'd need to fork Flink docker repo?
Oct 26 2020
In T265504#6578280, @Zbyszko wrote: @akosiaris Can we base a blubber-enabled project on a 3rd party docker image provided on docker hub? I was wondering if we have to replicate the original dockerfile here (I'd rather base off their image to reduce future maintenance).
Oct 23 2020
In T229397#6574323, @Volans wrote: I fear that we are going down another path like the dns generation, in which the Netbox API doesn't really suit our needs in terms of performance and efficiency (multiple API calls per device). I'm wondering if we should convert John's patch into a Netbox script instead and take advantage of the speed and power of the Django/Netbox internal APIs.
It was already discussed but I want to re-surface the fact that this will create a direct link Netbox data -> Puppet without any manual stopgap. Are we ready for this?
/me subscribing anyway, thanks!
For what it's worth, the idea that Daniel explains above would solve the issue for now without the need to move to kubernetes, satisfying multiple of the requirements without requiring significant effort.
Oct 22 2020
That's pretty interesting; there shouldn't be so much throttling at such low CPU usage. user+system summed barely hit 1/5 of the limit.
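For reference, a minimal sketch of how the throttling numbers can be read straight from the container's cgroup (the cgroup v1 CFS path below is an assumption, not from the original comment; adjust it for the container in question):

# Read CFS throttling stats from a cgroup (cgroup v1 path assumed; adjust per container).
def throttle_ratio(cpu_stat_path="/sys/fs/cgroup/cpu,cpuacct/cpu.stat"):
    stats = {}
    with open(cpu_stat_path) as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    periods = stats.get("nr_periods", 0)
    throttled = stats.get("nr_throttled", 0)
    # Fraction of CFS periods in which the cgroup was throttled.
    return throttled / periods if periods else 0.0

print(f"throttled in {throttle_ratio():.1%} of CFS periods")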
In T265904#6570744, @Volans wrote:@jbond thanks for looking into this, unfortunately the data used in the Netbox import comes from the networking fact because it needs all of them and parses that one, so not sure if the above patch would be useful in practice.
Oct 20 2020
LGTM. Perhaps do codfw as well since you are at it, to have a fallback/backup?
Oct 19 2020
In T254954#6556769, @kaldari wrote:@akosiaris - Thanks for the reply and clearing up my misunderstandings. We have no need to keep the OCR service on Toolforge and would like to eventually move it into a MediaWiki extension (which would also make it easier for all the Wikisource projects to utilize). But it won't make any sense for us to do that unless there is an API proxy available in production that can communicate with the Google Vision API. What API proxy is cxserver.wikimedia.org utilizing to communicate with Google? Would it be possible for us to use that as well or have a similar proxy set up for this service?
In T255410#6550492, @Michael wrote:
I fear this ain't gonna be easy. When we tried the approach that exists for all other hosts, we ended up with broken connectivity for ganeti hosts. See T233906, which ended up with the decision described in T233906#5529507. T234207 was then created as an investigation task to handle improvements to our puppetization of network configuration (which is crude and barely existent, to be honest). There has been no movement on this since then.
Any news on this one? (just found out today about it while working on T265607)
Oct 16 2020
In T265183#6537941, @Joe wrote:
In T265183#6533921, @jeena wrote: My concern is that this transition step becomes a permanent step.
It's up to us to avoid that being true, in any case, but I have further doubts that I will clarify below. And yes, I clarified in a meeting with our managers that I have that concern too (after all, Starling's first law still holds), and they swore commitment to repay that debt once we've transitioned traffic.
For the "We are limited on the docker-registry infrastructure side" point, the sanest way out of this (until we hit the next bottleneck) is to scale out, aka just more docker registry VMs. That should be easily doable; we've got the capacity. The VMs should be split across the rack rows for higher availability.
In T264209#6547169, @dancy wrote:
In T264209#6547152, @JMeybohm wrote: We need to permanently bump the tmpfs /var/lib/nginx size if we want to be able to consistently push images with blobs that are larger than 1 GB compressed.
Couldn't we get around this by using a (bigger) non-tmpfs filesystem as client_body_temp_path? Not sure how much the upload performance would suffer in this case, but we could test that...
+1 on this suggestion. For small requests there will be minimal writing to a real filesystem, for files that exist only briefly; these writes would be background I/O in most cases.
This was brought to my attention yesterday by @WDoranWMF, sorry for missing it and many thanks for the ping.
In T255410#6543118, @Michael wrote: @akosiaris Thank you a lot for your detailed response. I did look into those errors a tiny bit more to properly document them, as can now be seen on wikitech.
In the course of that I looked at the last few days and noticed some discrepancies with the numbers you provided above. All the following data is for the 7 days between 2020-10-07 00:00:00 and 2020-10-13 23:59:59.
Oct 15 2020
The first pull test was successful. 34 hosts pulled from the registry simultaneously. The test lasted about 5 minutes.
Lowering priority as the service isn't broken, and setting as Stalled as we are waiting for upstream to release the new version that fixes this.
Oct 14 2020
1st obstacle found already. The push failed with '500 internal server error'. Logs indicate
Oct 12 2020
Setting as stalled for now, pending the investigation mentioned in the last comment.
And again, can't reproduce. Not only that, but logs around the time of the report indicate that the daemon was working fine. That is also supported by systemd's status for the service
Oct 9 2020
A couple of requirements from my side, regardless of where those sites are deployed and the technology used:
Oct 8 2020
In T239459#6504349, @Mvolz wrote:The hold-up seems to be eventstreams; it actually uses a fork of service runner, and the fork is missing the feature, so it's slightly more complicated to update. https://github.com/wikimedia/service-runner/tree/prometheus_metrics
Overall, I am willing to test this out; a couple of points though:
Oct 7 2020
In T260917#6525089, @JMeybohm wrote:
In T260917#6525081, @akosiaris wrote: What's peculiar is that https://puppet-compiler.wmflabs.org/compiler1002/25621/deploy1001.eqiad.wmnet/change.deploy1001.eqiad.wmnet.pson doesn't have the test:data thing. Perhaps that's an earlier PCC though?
Yeah, that's a PCC from before the second change to labs/private. https://puppet-compiler.wmflabs.org/compiler1003/25719/ is the current one (which contains test: data in change catalog).
In T260917#6518755, @JMeybohm wrote:I could use a pair of eyes on https://gerrit.wikimedia.org/r/q/bug:T260917
The PCC full diff (https://puppet-compiler.wmflabs.org/compiler1002/25621/) lacks defaultsecret: notdefault for staging zotero. What am I missing here?
Envoy is being documented at https://wikitech.wikimedia.org/wiki/Envoy#Envoy_at_WMF. It is being used by termbox to talk to mediawiki (it's a component of a service mesh). The idea is to have low-cost persistent TLS connections, with retries and telemetry. For more insights, aside from the doc link above, the following grafana dashboard is useful: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=termbox&var-destination=mwapi-async&from=now-7d&to=now
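As a rough illustration of the per-upstream telemetry involved, here is a sketch of querying Envoy's admin stats endpoint (the admin port 9901 is Envoy's upstream default and the cluster name mirrors the dashboard's mwapi-async destination; both are assumptions about our setup, not WMF tooling):

import requests

# Envoy admin interface; 9901 is the upstream default port and may differ in our deployment.
ADMIN = "http://127.0.0.1:9901"

# /stats accepts a filter (regex) to narrow results to one upstream cluster.
stats = requests.get(f"{ADMIN}/stats", params={"filter": "cluster.mwapi-async."}).text
for line in stats.splitlines():
    if any(key in line for key in ("upstream_rq_retry", "upstream_cx_active", "upstream_rq_total")):
        print(line)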
In T122220#6523243, @King_of_Hearts wrote:Why does OTRS even use password authentication? All OTRS agents presumably have Wikimedia accounts, so wouldn't OAuth be a more secure and convenient method?
Sorry for not answering earlier.
Oct 6 2020
I think it's showing already
Oct 5 2020
The change about the changeprop service above seems to have solved the daily saw-like pattern.
Oct 3 2020
The upgrade has happened successfully, and tickets for followup work required as a result of this upgrade can now be opened under the OTRS 6 column in the Znuny project in Phabricator. So, I'll resolve this.
Oct 2 2020
/me rubberstamping. Thanks for this!
Oct 1 2020
All old stuff has been removed; I'll resolve this.
In T122220#6474967, @Matthewrbowker wrote:
In T122220#6473842, @akosiaris wrote:
In T122220#6471605, @Matthewrbowker wrote:
In T122220#6471256, @Krd wrote: I appreciate that we have that feature now and do test it, but I second akosiaris' statement above that we don't have any process yet to recover an account. I guess if anybody locks themselves out at the moment we could be unable to recover the account. Please consider that OTRS admins may not have enough resources for assistance at the moment.
If 2FA forces us to rely on anything not verifiable during account recovery, it perhaps doesn't add much additional security.
This is my biggest concern as well. My feeling, at least for the trial period, would be to require a post to [[AR]] on OTRS wiki. But if we have to remove their 2FA token, we also scramble their password and force them to reset it.
Long-term, we need a better solution (like the committed identity that enwiki has), but all of that is taking more time from an already stretched-thin admin team.
Another question/idea would be: how hard are OTRS modules to build? We could theoretically build an enhanced version of the built-in 2FA module which allows for scratch codes and hides the generator token.
A more suitable question would be "How costly are OTRS modules to maintain?", as my experience up to now with OTRS modules is that most of their cost goes into maintenance and not building them. And since we would probably be patching the current implementation, it would be considerable. We would need to find someone to build it and maintain it down the road.
Would... this be a bad time to mention that I know Perl? While I could maintain it, I don't know how to build an OTRS module, so it might take a bit to build.
It's been a week and no one has complained; I don't see much point in keeping this open waiting for someone to complain. I'll resolve this for now, please do reopen if you hear any complaints.
In T263842#6506360, @RhinosF1 wrote:I just noticed the IR on wikitech says:
duplicate key for ip banning
The ipblocks table actually stores both account and IP blocks. Also, a ban is not the same as a block, in the terms most wikis use them.
It would probably be more accurate and clearer to just say:
about a duplicate key in enwikivoyage's ipblocks table
Instead of:
about enwikivoyage duplicate key for ip banning
Not making the change myself to avoid stepping on or confusing anyone in SRE.
Sep 30 2020
Awesome, many thanks!
In T263910#6503978, @Joe wrote:
In T263910#6503699, @Ladsgroup wrote: Looking at the pattern of requests to ores in the past couple of days, it seems OKAPI has been bringing down ores: https://logstash.wikimedia.org/goto/f91cdb8acc0f46a02bd7552ffaf0efd4
- Suddenly, there are 1M more requests a day by OKAPI; has this been accounted for? Especially during the switchover, when we are basically at half the capacity.
My expectation would be that if such a volume of requests is going to happen, we (SRE and the service owners) would be contacted beforehand. If this was any other external client, we would just ban it with no second thoughts.
Sep 29 2020
We've talked with @MoritzMuehlenhoff and we'll solve this on the process upgrade front by restarting apache and otrs-daemon after perl upgrades on OTRS. Resolving for now.
I've manually restarted both apache and otrs-daemon just to make sure everything is fine and using the new version of the DBI perl module. Otherwise this doesn't seem worth digging deep into; I think we'll just treat perl upgrades on the host as requiring an apache restart, just to be on the safe side, and close this.
The relevant SAL log line is https://sal.toolforge.org/log/rY9y2XQBhxWNv8gITPYl. Manual remediation is also probably very easy: just restarting apache.
@Cmjohnson, Wednesday is fine.
Sep 28 2020
This is close to 15 months old, and the service has been moved to kubernetes in the meantime, so most stuff in here is quite probably not relevant anymore. I'll just close as Invalid.