
Nov 19 2020

akosiaris closed T241230: Migrate recommendation-api to kubernetes, a subtask of T198901: Migrate production services to kubernetes using the pipeline, as Resolved.
Nov 19 2020, 9:59 AM · serviceops, Release-Engineering-Team (Seen), Platform Team Legacy (Watching / External), Epic, Services (watching), SRE, Release Pipeline
akosiaris closed T241230: Migrate recommendation-api to kubernetes, a subtask of T248355: Archive mediawiki/services/recommendation-api/deploy once it's no longer used, as Resolved.
Nov 19 2020, 9:59 AM · Patch-For-Review, Recommendation-API, Projects-Cleanup
akosiaris updated the task description for T241230: Migrate recommendation-api to kubernetes.
Nov 19 2020, 9:56 AM · Product-Infrastructure-Team-Backlog-Deprecated, serviceops, Release-Engineering-Team, Services, Recommendation-API
akosiaris closed T266373: Connection closed while downloading PDF of articles as Resolved.

I am going to resolve this per the comments above. Feel free to reopen. Many thanks to @BBlack and @CDanis for figuring this out.

Nov 19 2020, 9:55 AM · Traffic, Web-Team-Backlog (Tracking), Proton, serviceops, Product-Infrastructure-Team-Backlog-Deprecated, SRE, Desktop Improvements (Vector 2022), Wikimedia-production-error
akosiaris updated subscribers of T267065: eqiad: Server moves to free up space on 10g racks.

For what it's worth

Nov 19 2020, 9:37 AM · Platform Team Workboards (Green), ops-eqiad, SRE, DC-Ops

Nov 18 2020

akosiaris raised the priority of T266373: Connection closed while downloading PDF of articles from Low to High.

More and more duplicates are being merged into this one, and stats from the tests above suggest a mean failure rate of ~20%, which is a lot. Bumping priority to High.

Nov 18 2020, 10:08 AM · Traffic, Web-Team-Backlog (Tracking), Proton, serviceops, Product-Infrastructure-Team-Backlog-Deprecated, SRE, Desktop Improvements (Vector 2022), Wikimedia-production-error

Nov 16 2020

akosiaris added a comment to T266373: Connection closed while downloading PDF of articles.

@akosiaris More from debugging on this issue:

  • Querying the RESTBase service directly for a PDF render of an article doesn't reproduce the issue

Looking at the pastes above, I must say I don't see a direct RESTBase service (restbase.discovery.wmnet or restbase.svc.eqiad.wmnet or restbase.svc.codfw.wmnet) call. There is a single wget against the proton service, but I don't think that gives us enough data. However, I think we indeed need to test this more.

I used the same scripts pointing to restbase.svc.eqiad.wmnet and proton.svc.eqiad.wmnet but since it didn't show any issues I didn't paste the whole run output.

Where are all these run from, btw? Judging by the speeds, a local DSL?

Yes, local cable connection, but I think I had similar results from the deployment node too.

Nov 16 2020, 11:05 AM · Traffic, Web-Team-Backlog (Tracking), Proton, serviceops, Product-Infrastructure-Team-Backlog-Deprecated, SRE, Desktop Improvements (Vector 2022), Wikimedia-production-error

Nov 11 2020

akosiaris added a comment to T266373: Connection closed while downloading PDF of articles.

A few more tests. The TL;DR is that Varnish 6 is probably at fault, but with a question mark.

Nov 11 2020, 1:47 PM · Traffic, Web-Team-Backlog (Tracking), Proton, serviceops, Product-Infrastructure-Team-Backlog-Deprecated, SRE, Desktop Improvements (Vector 2022), Wikimedia-production-error

Nov 10 2020

akosiaris added a comment to T266373: Connection closed while downloading PDF of articles.

Interestingly, proton returns transfer-encoding: chunked responses, which obviously don't have a Content-Length. So, for the internal service, cl-matches-bytes makes no sense and it's not there.
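
A minimal sketch of why the check is skipped, assuming cl-matches-bytes compares the declared Content-Length against the number of bytes actually received (the function name and return convention below are hypothetical, not taken from the test scripts):

```python
from typing import Mapping, Optional


def content_length_matches(headers: Mapping[str, str], body: bytes) -> Optional[bool]:
    """Compare the declared Content-Length with the bytes received.

    Returns True or False when the comparison is possible, and None when it
    does not apply (chunked response, or no Content-Length header at all).
    """
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}

    # Transfer-Encoding may list several codings separated by commas.
    codings = [c.strip().lower() for c in normalized.get("transfer-encoding", "").split(",")]
    if "chunked" in codings:
        return None

    declared = normalized.get("content-length")
    if declared is None:
        return None

    return int(declared) == len(body)
```

A None result means the comparison does not apply, which is the situation described above for proton's chunked responses.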

Nov 10 2020, 8:54 PM · Traffic, Web-Team-Backlog (Tracking), Proton, serviceops, Product-Infrastructure-Team-Backlog-Deprecated, SRE, Desktop Improvements (Vector 2022), Wikimedia-production-error
akosiaris added a project to T266373: Connection closed while downloading PDF of articles: Traffic.

Given we, Product Infra, are not finding issues at our service level investigation, we're asking that SRE take a look as next steps in helping to resolve. @akosiaris would you be a good person to assign?

Nov 10 2020, 8:48 PM · Traffic, Web-Team-Backlog (Tracking), Proton, serviceops, Product-Infrastructure-Team-Backlog-Deprecated, SRE, Desktop Improvements (Vector 2022), Wikimedia-production-error
akosiaris added a comment to T266373: Connection closed while downloading PDF of articles.

I've also run the same tests against restbase.svc.eqiad.wmnet in P13257 and I have the following

Nov 10 2020, 8:44 PM · Traffic, Web-Team-Backlog (Tracking), Proton, serviceops, Product-Infrastructure-Team-Backlog-Deprecated, SRE, Desktop Improvements (Vector 2022), Wikimedia-production-error
akosiaris added a comment to T266373: Connection closed while downloading PDF of articles.

@akosiaris More from debugging on this issue:

  • Querying the RESTBase service directly for a PDF render of an article doesn't reproduce the issue
Nov 10 2020, 5:06 PM · Traffic, Web-Team-Backlog (Tracking), Proton, serviceops, Product-Infrastructure-Team-Backlog-Deprecated, SRE, Desktop Improvements (Vector 2022), Wikimedia-production-error
akosiaris edited P13257 More pdf stats.
Nov 10 2020, 2:19 PM · Proton
akosiaris edited P13257 More pdf stats.
Nov 10 2020, 1:58 PM · Proton
akosiaris edited P13257 More pdf stats.
Nov 10 2020, 1:22 PM · Proton
akosiaris created P13257 More pdf stats.
Nov 10 2020, 1:19 PM · Proton
akosiaris added a comment to T265504: Create Blubberfile in WDQS repo.

@akosiaris I started using the new Java images that you uploaded. I wasn't able to install gpg in the build process. There are some conflicts. We can skip gpg verification of the Flink tar, but I don't think that's a good idea. I will continue to do some debugging.

Error message:

The following packages have unmet dependencies:
 gpg : Depends: gpgconf (= 2.2.12-1+deb10u1~bpo9+1) but it is not going to be installed
       Depends: libassuan0 (>= 2.5.0) but 2.4.3-2 is to be installed
       Depends: libgpg-error0 (>= 1.35) but 1.26-2 is to be installed
E: Unable to correct problems, you have held broken packages.
Nov 10 2020, 12:01 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Nov 9 2020

akosiaris closed T266730: AttributeError: 'Changelog' object has no attribute 'get_version' as Resolved.

Change merged, 2.1.0 is up for review at https://gerrit.wikimedia.org/r/c/operations/docker-images/docker-pkg/+/640268, I expect it to be released soon. I'll resolve this, feel free to reopen though.

Nov 9 2020, 11:44 PM · docker-pkg
akosiaris added a comment to T263910: ORES redis: max number of clients reached....

@akosiaris Can you review it? I don't know enough about the nodes vs redis connection to intelligently review.

Nov 9 2020, 10:23 AM · User-Ladsgroup, Sustainability (Incident Followup), Patch-For-Review, Wikimedia Enterprise, serviceops, SRE, Machine-Learning-Team, ORES
akosiaris closed T171157: Monitor internal CA expirations as Declined.

Setting to stalled until we decide what to actually do with the internal CA, as we're considering dropping it entirely in favour of other options.

@akosiaris / @faidon: Has this situation somehow changed by resolved T133717: Letsencrypt all the prod things we can - planning / T194962: Create and deploy a centralized letsencrypt service / Acme-chief (though I'm not sure if that also touched CA monitoring at all)?

Nov 9 2020, 10:23 AM · observability, SRE

Nov 6 2020

akosiaris added a comment to T266373: Connection closed while downloading PDF of articles.

So this is not specific to frwiki, it seems. Is there perhaps some correlation between page size and failure rate? Or maybe between failure rate and response time?
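
One way to check for such a correlation, sketched under the assumption that each test request is recorded as a page size (or response time) plus a 0/1 failure flag. The helper name is hypothetical, and a plain Pearson coefficient is only a first approximation:

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series.

    With ys given as 0/1 failure flags this is the point-biserial
    correlation between the measured quantity and failure.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length, at least 2 points")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of products of deviations from the mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one series is constant")

    return cov / sqrt(var_x * var_y)
```

A value near 0 would suggest no linear relationship; a value near 1 would suggest that larger (or slower) pages fail more often.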

Nov 6 2020, 3:40 PM · Traffic, Web-Team-Backlog (Tracking), Proton, serviceops, Product-Infrastructure-Team-Backlog-Deprecated, SRE, Desktop Improvements (Vector 2022), Wikimedia-production-error
akosiaris added a comment to T267339: [EPIC] Address maps level of support issues.

That's excellent news @sdkim . Many thanks for this!

Nov 6 2020, 2:51 PM · Maps (Kartotherian)

Nov 5 2020

akosiaris added a comment to T267339: [EPIC] Address maps level of support issues.

Some of the above items are optional (e.g. cookbooks if nothing is done often and is automatable) but good to have.

Nov 5 2020, 4:32 PM · Maps (Kartotherian)
akosiaris created T267339: [EPIC] Address maps level of support issues.
Nov 5 2020, 4:30 PM · Maps (Kartotherian)

Nov 4 2020

akosiaris added a comment to T244335: Upgrade kubernetes clusters to v1.16.

We are not able to go to 1.19 because calico only supports 1.18.

Looks like this isn't true. Judging from https://github.com/projectcalico/calico/commit/21a45a4a141fff03b251fde2f1ab77fbb0c903ee#diff-f386c272afd3d855bf9f1d3609d1782962951258a58e2b298df60c70b16517ee, the calico 3.16 requirements page will be updated soon.

Not sure about that. The commit is from 2nd Sept. and never made it to the 3.16 release branch. I would guess it's for 3.17.

Nov 4 2020, 10:59 AM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops

Nov 3 2020

akosiaris added a comment to T265504: Create Blubberfile in WDQS repo.

@akosiaris when you get some time, can you please take another look at https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/635074

Nov 3 2020, 5:30 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
akosiaris added a comment to T244335: Upgrade kubernetes clusters to v1.16.

We are not able to go to 1.19 because calico only supports 1.18.

Nov 3 2020, 5:26 PM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops

Nov 2 2020

akosiaris added a comment to T266479: Puppet Proposal to remove require_package.

The idea was indeed to just make sure that the packages are installed before anything else in the class happens. These days, if one puts ensure_packages() at the top of the manifest, we have that. So we can indeed probably move off require_package. However, the bad thing with all of this is that the migration is untestable. The relationships aren't exercised during catalog compilation but rather during catalog application by the agent, which we don't have any decent way of testing :-(. Of course, the worst that can happen is that we regress to having to run puppet >1 times during reimaging of a host. Anyway, I am fine with this if require_package feels like tech debt; just keep in mind we might have regressions that rear their heads down the road.

Nov 2 2020, 2:42 PM · Infrastructure-Foundations, Patch-For-Review, SRE, Puppet

Oct 30 2020

akosiaris closed T204907: Scap is checking canary servers in dormant instead of active-dc as Resolved.

This was done, resolving.

Oct 30 2020, 5:16 PM · Patch-Needs-Improvement, Sustainability (Incident Followup), Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, SRE, Datacenter-Switchover, Scap
akosiaris added a comment to T265876: Logging options for apache httpd in k8s.

Couple of points

Oct 30 2020, 4:43 PM · observability, SRE, serviceops, MW-on-K8s
akosiaris added a comment to T266766: Build new kubernetes packages.

With Kubernetes 1.19.3 things changed a bit and we now need a docker version supporting the --platform flag for FROM.
The envoy builder host would support that but lacks enough space on /var/lib/docker to hold all the intermediate images used for build.

Oct 30 2020, 2:55 PM · Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops

Oct 29 2020

akosiaris closed T264710: Host static sites on kubernetes as Invalid.

Sounds like a fine solution from our side for now.
I'll let serviceops do with this ticket as they wish (keep it or close it) and I'll get on to creating some tickets for:

Oct 29 2020, 4:28 PM · Wikidata Query UI, Wikidata Query Builder, serviceops, Wikidata, User-Addshore

Oct 27 2020

akosiaris placed T177371: Phase out DSA keys for SSH access (ssh-dss) up for grabs.
Oct 27 2020, 4:58 PM · Patch-For-Review, Infrastructure-Foundations, SRE
akosiaris added a comment to T266373: Connection closed while downloading PDF of articles.

See also this Grafana dashboard showing increase of daily PDF rendering by Proton from 80k to 20k, since beginning of August.

Looks like that's when Proton migrated to Kubernetes:
https://grafana.wikimedia.org/d/llIEd7MMz/proton?viewPanel=68&orgId=1&from=now-30d&to=now

The dashboard linked by @Framawiki is probably overdue for deletion.

Oct 27 2020, 1:21 PM · Traffic, Web-Team-Backlog (Tracking), Proton, serviceops, Product-Infrastructure-Team-Backlog-Deprecated, SRE, Desktop Improvements (Vector 2022), Wikimedia-production-error
akosiaris added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

I tried installing 6.0.2 on cp4032, and to my surprise I found out that 6.0.6 and 6.0.2 are not binary compatible:

Oct 27 2020, 10:39 AM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic
akosiaris added a comment to T265504: Create Blubberfile in WDQS repo.

Could you elaborate on that a bit?

Sure, here goes: We are using Apache Flink[1] as a platform for the event processing we do to feed Wikidata Query Service. We want to move the Flink deployment to Kubernetes, hence this ticket. Apache Flink provides its own docker image[2] which, in other circumstances, we would build upon. What @Mstyles is doing now is basically replaying the work the original Flink contributors did for their docker image, which, according to our current knowledge, is what we must do.
The actual docker file (with additional entry script) is here [3]. It would be great if we didn't need to make sure that we covered everything that is handled there with each Flink update.

Oct 27 2020, 10:36 AM · Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
akosiaris added a comment to T265504: Create Blubberfile in WDQS repo.

@akosiaris I see, makes sense. I would still like to solve the issue with replicating the original dockerfile. Can we deploy Flink images to our registry, even if we'd need to fork the Flink docker repo?

Oct 27 2020, 10:05 AM · Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Oct 26 2020

akosiaris added a comment to T265504: Create Blubberfile in WDQS repo.

@akosiaris Can we base a blubber-enabled project on a 3rd party docker image provided on docker hub? I was wondering if we have to replicate the original dockerfile here (I'd rather base off their image to reduce future maintenance).

Oct 26 2020, 1:59 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Oct 23 2020

akosiaris added a comment to T229397: Puppet: get data (row, rack, site, and other information) from Netbox.

I fear that we are going down another path like the dns generation, in which the Netbox API doesn't really suit our needs in terms of performance and efficiency (multiple API calls per device). I'm wondering if we should convert John's patch into a Netbox script instead and take advantage of the speed and power of Django/Netbox internal APIs.

It was already discussed but I want to re-surface the fact that this will create a direct link Netbox data -> Puppet without any manual stopgap. Are we ready for this?

Oct 23 2020, 1:24 PM · Infrastructure-Foundations, observability, User-crusnov, User-jbond, Patch-For-Review, Puppet, SRE
akosiaris added a comment to T266333: Xml/sql dumps are still querying etcd excessively, fix this..

/me subscribing anyway, thanks !

Oct 23 2020, 12:36 PM · Dumps-Generation
akosiaris added a comment to T264710: Host static sites on kubernetes.

For what it's worth, the idea that Daniel explains above would solve the issue for now without the need to move to kubernetes, satisfying several of the requirements without requiring significant effort.

Oct 23 2020, 8:37 AM · Wikidata Query UI, Wikidata Query Builder, serviceops, Wikidata, User-Addshore

Oct 22 2020

akosiaris added a comment to T266194: wikifeeds-production-tls-proxy regularly exceeding its k8s CPU reservation.

That's pretty interesting, there shouldn't be so much throttling at such low CPU usage. user+system summed barely hit 1/5 of the limit.
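
One possible (unconfirmed) explanation, sketched as a simplified model: CFS quota is enforced per scheduling period (100 ms by default), while dashboards usually average over much longer windows, so short bursts can be throttled even though the average stays low. The function name and the demand figures are hypothetical, and the model ignores that throttled work carries over to the next period:

```python
from typing import Sequence, Tuple


def throttling_summary(demand_ms: Sequence[float], quota_ms: float) -> Tuple[float, float]:
    """Summarize a series of per-period CPU demands against a per-period quota.

    demand_ms[i] is the CPU time the container wants in period i.
    Returns (average usage as a fraction of the limit,
             fraction of periods in which the container was throttled).
    """
    if not demand_ms:
        raise ValueError("need at least one period")
    if quota_ms <= 0:
        raise ValueError("quota must be positive")

    # In each period the container can use at most its quota.
    used_ms = [min(demand, quota_ms) for demand in demand_ms]
    # A period counts as throttled when demand exceeded the quota.
    throttled_periods = sum(1 for demand in demand_ms if demand > quota_ms)

    average_fraction = sum(used_ms) / (quota_ms * len(demand_ms))
    throttled_fraction = throttled_periods / len(demand_ms)
    return average_fraction, throttled_fraction
```

With a quota of 100 ms per period and a single 250 ms burst every five periods, this model reports average usage at 1/5 of the limit while 1 in 5 periods is throttled.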

Oct 22 2020, 10:05 AM · Kubernetes, Wikifeeds, serviceops
akosiaris added a comment to T265904: Remove SLAAC IPs from Ganeti hosts.

@jbond thanks for looking into this, unfortunately the data used in the Netbox import comes from the networking fact because it needs all of them and parses that one, so not sure if the above patch would be useful in practice.

Oct 22 2020, 9:48 AM · Ganeti, Patch-For-Review, Traffic, SRE

Oct 20 2020

akosiaris added a comment to T265982: eqiad: New ganeti instance for orchestrator installation.

LGTM, perhaps do codfw as well while you are at it, to have a fallback/backup?

Oct 20 2020, 9:33 AM · SRE, vm-requests, serviceops

Oct 19 2020

akosiaris committed rLPRI0a36d8754bc4: Remove restrouter.
Remove restrouter
Oct 19 2020, 2:47 PM
akosiaris added a comment to T254954: Move Wikisource OCR's API proxy to production.

@akosiaris - Thanks for the reply and clearing up my misunderstandings. We have no need to keep the OCR service on Toolforge and would like to eventually move it into a MediaWiki extension (which would also make it easier for all the Wikisource projects to utilize). But it won't make any sense for us to do that unless there is an API proxy available in production that can communicate with the Google Vision API. What API proxy is cxserver.wikimedia.org utilizing to communicate with Google? Would it be possible for us to use that as well or have a similar proxy set up for this service?

Oct 19 2020, 2:20 PM · Tech-Product API Roadmap, Community-Tech, Wikimedia OCR
akosiaris added a comment to T255410: Termbox SSR connection terminated very often.
Oct 19 2020, 2:11 PM · User-Michael, serviceops-radar, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)), [DEPRECATED] wdwb-tech, Wikidata, Wikidata-Termbox
akosiaris added a comment to T265904: Remove SLAAC IPs from Ganeti hosts.

I fear this ain't gonna be easy. When we tried the approach that exists for all other hosts, we ended up with broken connectivity for ganeti hosts. See T233906 which ended up with the decision described in T233906#5529507. T234207 was then created as an investigation task to handle improvements to our puppetization of network configuration (which is crude and barely existing to be honest). There has been no movement on this since then.

Oct 19 2020, 1:51 PM · Ganeti, Patch-For-Review, Traffic, SRE
akosiaris added a comment to T261130: ganeti5002 was down / powered off, machine check entries in SEL.

Any news on this one? (just found out today about it while working on T265607)

Oct 19 2020, 11:42 AM · serviceops, SRE, ops-eqsin

Oct 16 2020

akosiaris added a comment to T265183: In a k8s world: where does MediaWiki code live?.

My concern is that this transition step becomes a permanent step.

It's up to us to avoid that being true, in any case, but I have further doubts I will clarify below. And yes, I clarified in a meeting with our managers that I have that concern too (after all, Starling's first law still holds), and they swore commitment to repay that debt once we've transitioned traffic.

Oct 16 2020, 2:55 PM · MW-on-K8s
akosiaris added a comment to T264209: Run stress tests on docker images infrastructure.

For the "We are limited on the docker-registry infrastructure side." point, the sanest way out of this (until we hit the next bottleneck) is to scale out, aka just add more docker registry VMs. That should be easily doable; we have the capacity. The VMs should be split across the rack rows for higher availability.

Oct 16 2020, 1:26 PM · Patch-For-Review, serviceops, Release Pipeline, MW-on-K8s
akosiaris added a comment to T264209: Run stress tests on docker images infrastructure.

We need to permanently bump the tmpfs /var/lib/nginx size if we want to be able to consistently push images with blobs that are larger than 1 GB compressed

Couldn't we get around this by using a (bigger) non tmpfs filesystem as client_body_temp_path?
Not sure how much the upload performance would suffer in this case, but we could test that...

+1 on this suggestion. For small requests, there will be minimal writing to a real filesystem for files that exist briefly. These writes would be background I/O in most cases.

Oct 16 2020, 11:02 AM · Patch-For-Review, serviceops, Release Pipeline, MW-on-K8s
akosiaris updated subscribers of T254954: Move Wikisource OCR's API proxy to production.

This was brought to my attention yesterday by @WDoranWMF, sorry for missing it and many thanks for the ping.

Oct 16 2020, 10:23 AM · Tech-Product API Roadmap, Community-Tech, Wikimedia OCR
akosiaris added a comment to T255410: Termbox SSR connection terminated very often.

@akosiaris Thank you a lot for your detailed response. I did look into those errors a tiny bit more to properly document them as can be now seen on wikitech.

In the course of that I looked at the last days and noticed some discrepancies to the numbers you provided above. All the following data is for the 7 days between 2020-10-07 00:00:00 and 2020-10-13 23:59:59.

Oct 16 2020, 8:47 AM · User-Michael, serviceops-radar, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)), [DEPRECATED] wdwb-tech, Wikidata, Wikidata-Termbox

Oct 15 2020

akosiaris added a comment to T264209: Run stress tests on docker images infrastructure.

The first pull test was successful. 34 hosts pulled from the registry simultaneously. The test lasted about 5 minutes.

Oct 15 2020, 11:55 AM · Patch-For-Review, serviceops, Release Pipeline, MW-on-K8s
akosiaris updated subscribers of T265490: rate limited etherpad.
Oct 15 2020, 9:10 AM · Patch-For-Review, SRE, Wikimedia-Etherpad
akosiaris changed the status of T265490: rate limited etherpad from Open to Stalled.

Lowering priority as the service isn't broken and setting as Stalled as we are waiting for upstream to release the new version to fix this.

Oct 15 2020, 9:07 AM · Patch-For-Review, SRE, Wikimedia-Etherpad

Oct 14 2020

akosiaris added a comment to T264209: Run stress tests on docker images infrastructure.

1st obstacle found already. The push failed with '500 internal server error'. Logs indicate

Oct 14 2020, 2:17 PM · Patch-For-Review, serviceops, Release Pipeline, MW-on-K8s
akosiaris awarded T219544: Make hadoop cluster able to push to swift a Love token.
Oct 14 2020, 11:09 AM · Patch-For-Review, Analytics-Kanban, Research, SRE, Discovery-ARCHIVED, Analytics

Oct 12 2020

akosiaris changed the status of T263038: Dev images registry from Open to Stalled.

Setting as stalled for now, pending the investigation mentioned in the last comment.

Oct 12 2020, 9:26 AM · Release-Engineering-Team (Seen)
akosiaris added a comment to T263873: "The OTRS background task is not launched. Please contact your administrator" on OTRS web interface.

And again, can't reproduce. Not only that, but logs around the time of the report indicate that the daemon was working fine. That is also supported by systemd's status for the service

Oct 12 2020, 8:59 AM · Znuny

Oct 9 2020

akosiaris added a comment to T264710: Host static sites on kubernetes.

A couple of requirements from my side, regardless of where those sites are deployed and the technology used:

Oct 9 2020, 11:09 AM · Wikidata Query UI, Wikidata Query Builder, serviceops, Wikidata, User-Addshore

Oct 8 2020

akosiaris added a comment to T239459: service-runner apps running on kubernetes emit logs with log level 50 .

The hold-up seems to be eventstreams; it actually uses a fork of service runner, and the fork is missing the feature, so it's slightly more complicated to update. https://github.com/wikimedia/service-runner/tree/prometheus_metrics

Oct 8 2020, 12:48 PM · Platform Team Workboards (Clinic Duty Team), Patch-For-Review, CX-cxserver, serviceops-radar, Product-Infrastructure-Team-Backlog-Deprecated, SRE
akosiaris added a comment to T264888: Review default ferm INPUT policy.

Overall, I am willing to test this out, a couple of points though:

Oct 8 2020, 7:45 AM · Infrastructure-Foundations, Security, SRE, netops, User-jbond

Oct 7 2020

akosiaris added a comment to T260917: Support TLS for service-to-service communication in k8s staging.

What's peculiar, is that https://puppet-compiler.wmflabs.org/compiler1002/25621/deploy1001.eqiad.wmnet/change.deploy1001.eqiad.wmnet.pson doesn't have the test:data thing. Perhaps that's an earlier PCC though?

Yeah, that's a PCC from before the second change to labs/private. https://puppet-compiler.wmflabs.org/compiler1003/25719/ is the current one (which contains test: data in change catalog).

Oct 7 2020, 1:04 PM · serviceops, Kubernetes
akosiaris added a comment to T260917: Support TLS for service-to-service communication in k8s staging.

I could use a pair of eyes on https://gerrit.wikimedia.org/r/q/bug:T260917
The PCC full diff (https://puppet-compiler.wmflabs.org/compiler1002/25621/) lacks defaultsecret: notdefault for staging zotero. What am I missing here?

Oct 7 2020, 12:56 PM · serviceops, Kubernetes
akosiaris edited projects for T263764: Termbox service: unusual errors that could be from envoy, added: serviceops-radar; removed serviceops.

Envoy is being documented at https://wikitech.wikimedia.org/wiki/Envoy#Envoy_at_WMF. It is being used by termbox to talk to mediawiki (it's a component of a service mesh). The idea is to have low cost persistent TLS connections, with retries and telemetry. For more insights, aside from the doc link above, the following grafana dashboard is useful: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-app=termbox&var-destination=mwapi-async&from=now-7d&to=now

Oct 7 2020, 10:43 AM · serviceops-radar, Wikidata, Wikidata-Termbox
akosiaris added a comment to T122220: Enable optional two-factor authentication for OTRS.

Why does OTRS even use password authentication? All OTRS agents presumably have Wikimedia accounts, so wouldn't OAuth be a more secure and convenient method?

Oct 7 2020, 9:43 AM · collaboration-services, Security, Znuny
akosiaris added a project to T255410: Termbox SSR connection terminated very often: serviceops-radar.

Sorry for not answering earlier.

Oct 7 2020, 8:43 AM · User-Michael, serviceops-radar, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)), [DEPRECATED] wdwb-tech, Wikidata, Wikidata-Termbox

Oct 6 2020

akosiaris changed the status of T264195: Kubernetes pods are being periodically evicted because of Disk Space pressure caused by cpjobqueue from Open to Stalled.

I think it's showing already

Oct 6 2020, 9:07 AM · serviceops-radar, Kubernetes

Oct 5 2020

akosiaris added a comment to T264195: Kubernetes pods are being periodically evicted because of Disk Space pressure caused by cpjobqueue.

The change to the changeprop service above seems to have solved the daily saw-like pattern

Oct 5 2020, 11:42 AM · serviceops-radar, Kubernetes

Oct 3 2020

akosiaris closed T187984: Update OTRS to the latest stable version (6.0.x) as Resolved.

The upgrade has happened successfully, and tickets for followup work that is required as a result of this upgrade can now be opened under the OTRS 6 column in the Znuny project in phabricator. So, I'll resolve this.

Oct 3 2020, 11:55 AM · User-notice-archive, Patch-For-Review, serviceops, SRE, Znuny
akosiaris closed T187984: Update OTRS to the latest stable version (6.0.x), a subtask of T122220: Enable optional two-factor authentication for OTRS, as Resolved.
Oct 3 2020, 11:55 AM · collaboration-services, Security, Znuny
akosiaris closed T187984: Update OTRS to the latest stable version (6.0.x), a subtask of T126759: Some URLs seem to get replaced with [##############2] when written in notes and responses. , as Resolved.
Oct 3 2020, 11:55 AM · Upstream, Znuny

Oct 2 2020

akosiaris created P12908 cleanup etherpad js.
Oct 2 2020, 6:01 PM
akosiaris added a comment to T264390: Site: 4 VM request for LDAP replicas.

/me rubberstamping. Thanks for this!

Oct 2 2020, 8:35 AM · vm-requests, SRE

Oct 1 2020

akosiaris closed T255877: Move proton to use TLS only as Resolved.

All old stuff has been removed, I'll resolve this.

Oct 1 2020, 12:53 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops, SRE
akosiaris closed T255877: Move proton to use TLS only, a subtask of T235411: Add TLS termination to services running on kubernetes, as Resolved.
Oct 1 2020, 12:52 PM · Prod-Kubernetes, Kubernetes, serviceops, SRE
akosiaris updated the task description for T255877: Move proton to use TLS only.
Oct 1 2020, 12:52 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops, SRE
akosiaris created P12876 (An Untitled Masterwork).
Oct 1 2020, 8:25 AM
akosiaris updated the task description for T255877: Move proton to use TLS only.
Oct 1 2020, 7:57 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops, SRE
akosiaris added a comment to T122220: Enable optional two-factor authentication for OTRS.

I appreciate that we have that feature now and do test it, but I second akosiaris' statement above that we don't have any process yet to recover an account. I guess if anybody locks themselves out at the moment we could be unable to recover the account. Please consider that OTRS admins may not have enough resources for assistance at the moment.
If 2fa forces us to rely on anything not verifiable during account recovery, it perhaps doesn't add much additional security.

This is my biggest concern as well. My feeling, at least for the trial period, would be to require a post to [[AR]] on OTRS wiki. But, if we have to remove their 2FA token we also scramble their password and force them to reset it.

Long-term, we need a better solution (like a committed identity that enwiki has) - but all of that is taking more time from an already stretched thin admin team.

Another question/idea would be: How hard are OTRS modules to build? We could theoretically build an enhanced version of the built-in 2FA module, which allowed for scratch codes and hide the generator token.

A more suitable question would be "How costly are OTRS modules to maintain?" as my experience up to now with OTRS modules is that most of their cost goes into maintenance and not building them. And since we would probably be patching the current implementation, it would be considerable. We would need to find someone to build it and maintain it down the road.

Would... this be a bad time to mention that I know Perl? While I could maintain it, I don't know how to build an OTRS module, so it might take a bit to build.

Oct 1 2020, 7:41 AM · collaboration-services, Security, Znuny
RhinosF1 awarded T263842: S5 replication issue, affecting watchlist and probably recentchanges a Like token.
Oct 1 2020, 7:20 AM · Sustainability (Incident Followup), Wikimedia-Incident, SRE, DBA
akosiaris closed T263328: Agents can view watched tickets outside of assigned queues as Resolved.

It's been a week and no one has complained, so I don't see much point in keeping this open waiting for someone to complain. I'll resolve this for now, please do reopen if you hear any complaints.

Oct 1 2020, 7:20 AM · Znuny
akosiaris added a comment to T263842: S5 replication issue, affecting watchlist and probably recentchanges.

I just noticed the IR on wikitech says:

duplicate key for ip banning

The ipblocks table actually stores both account and ip blocks. A ban does not also equal a block in the terms most wikis use it.

It would probably be more accurate and clearer to just say:

about a duplicate key in enwikivoyage's ipblocks table

Instead of:

about enwikivoyage duplicate key for ip banning

Not making the change myself to avoid stepping on or confusing anyone in SRE.

Oct 1 2020, 7:15 AM · Sustainability (Incident Followup), Wikimedia-Incident, SRE, DBA

Sep 30 2020

akosiaris added a comment to T259909: (Need By: TBD) install memory upgrades in ores100[1-9].

Awesome, many thanks!

Sep 30 2020, 4:01 PM · ops-eqiad, DC-Ops, SRE
akosiaris triaged T264209: Run stress tests on docker images infrastructure as Medium priority.
Sep 30 2020, 3:59 PM · Patch-For-Review, serviceops, Release Pipeline, MW-on-K8s
akosiaris created T264209: Run stress tests on docker images infrastructure.
Sep 30 2020, 3:59 PM · Patch-For-Review, serviceops, Release Pipeline, MW-on-K8s
akosiaris updated the task description for T264195: Kubernetes pods are being periodically evicted because of Disk Space pressure caused by cpjobqueue.
Sep 30 2020, 3:51 PM · serviceops-radar, Kubernetes
akosiaris updated the task description for T264195: Kubernetes pods are being periodically evicted because of Disk Space pressure caused by cpjobqueue.
Sep 30 2020, 2:33 PM · serviceops-radar, Kubernetes
akosiaris created T264195: Kubernetes pods are being periodically evicted because of Disk Space pressure caused by cpjobqueue.
Sep 30 2020, 2:31 PM · serviceops-radar, Kubernetes
akosiaris added a comment to T263910: ORES redis: max number of clients reached....

Looking at the pattern of requests to ores in the past couple of days, it seems OKAPI has been bringing down ores: https://logstash.wikimedia.org/goto/f91cdb8acc0f46a02bd7552ffaf0efd4

  • Suddenly, there are 1M more requests a day from OKAPI. Has this been accounted for? Especially during the switchover, when we are basically at half capacity.

My expectation would be that if such a volume of requests is going to happen, we (SRE and the service owners) would be contacted beforehand. If this was any other external client, we would just ban it with no second thoughts.

Sep 30 2020, 7:39 AM · User-Ladsgroup, Sustainability (Incident Followup), Patch-For-Review, Wikimedia Enterprise, serviceops, SRE, Machine-Learning-Team, ORES

Sep 29 2020

akosiaris closed T264093: OTRS will return sporadic 500 HTTP errors in some cases, reporting that it can't find parts of the perl code as Resolved.

We've talked with @MoritzMuehlenhoff and we'll solve this on the upgrade process front by restarting apache and otrs-daemon after perl upgrades on OTRS. Resolving for now.

Sep 29 2020, 1:59 PM · Znuny
akosiaris added a comment to T264093: OTRS will return sporadic 500 HTTP errors in some cases, reporting that it can't find parts of the perl code.

I've manually restarted both apache and otrs-daemon just to make sure everything is fine and using the new version of the DBI perl module. Otherwise, this doesn't seem worth digging into deeply. I think we'll just treat perl upgrades on the host as requiring an apache restart, to be on the safe side, and close this.

Sep 29 2020, 1:53 PM · Znuny
akosiaris triaged T264093: OTRS will return sporadic 500 HTTP errors in some cases, reporting that it can't find parts of the perl code as Lowest priority.

The relevant SAL log line is https://sal.toolforge.org/log/rY9y2XQBhxWNv8gITPYl. Manual remediation is also probably very easy: just restart apache.

Sep 29 2020, 1:40 PM · Znuny
akosiaris created T264093: OTRS will return sporadic 500 HTTP errors in some cases, reporting that it can't find parts of the perl code.
Sep 29 2020, 1:37 PM · Znuny
akosiaris added a comment to T259909: (Need By: TBD) install memory upgrades in ores100[1-9].

@Cmjohnson, Wednesday is fine.

Sep 29 2020, 8:20 AM · ops-eqiad, SRE, DC-Ops

Sep 28 2020

akosiaris closed T263993: Decommission mendelevium as Resolved.
Sep 28 2020, 4:21 PM · Znuny, vm-requests, SRE
akosiaris closed T214975: proton experienced a period of high CPU usage, busy queue, lockups as Invalid.

This is close to 15 months old, and the service has been moved to kubernetes in the meantime, so most of what is in here is quite probably not relevant anymore. I'll just close it as Invalid.

Sep 28 2020, 2:57 PM · Product-Infrastructure-Team-Backlog-Deprecated, Proton, SRE
akosiaris awarded T260670: db2125 crashed - mgmt iface also not available a Love token.
Sep 28 2020, 2:46 PM · User-Kormat, ops-codfw, DBA, SRE