User Details
- User Since: Aug 31 2020, 5:40 PM (169 w, 6 d)
- Availability: Available
- LDAP User: LSobanski
- MediaWiki User: LSobanski (WMF)
Thu, Nov 30
I wonder if https://phabricator.wikimedia.org/L3 would also be in scope for this question, as moving it elsewhere would allow us to decommission Legalpad?
Wed, Nov 29
We're on the second-to-last upgrade. We ran into some performance issues, the cause of which is not yet fully clear. Arnold is out this week; looking at the calendar, there is a chance we'll be done before the holiday break.
Resolving as the alert is no longer active.
Resolving as the alert is no longer active.
Tue, Nov 28
Mon, Nov 27
Stalled until we have clarity on Dockerfile-based builds in GitLab.
@bking, this change is causing Puppet failures for miscweb1003 because of the existence of duplicate blackbox checks.
Fri, Nov 24
As of midday 2023-11-22 the backup size is down to around 8GB, in line with what it was before the recent increase. Resolving.
Nov 24 17:37:01 vrts1001 systemd[1]: apache2.service: A process of this unit has been killed by the OOM killer.
Thu, Nov 23
I'll leave this open for another week or two to see if the backup size changes after the reboots.
Wed, Nov 22
If we end up copying a static snapshot to a folder on a host, and that snapshot will no longer be regenerated afterwards, let's make sure a backup of it exists as well.
Tue, Nov 21
Mon, Nov 20
Removing collaboration-services as I don't see any clear activity for us here.
Fri, Nov 17
operations/software/wmfbackups
Wed, Nov 15
Tue, Nov 14
The alert has since recovered, but looking at the names in the linked change, I'm adding Data Platform SRE to review.
Mon, Nov 13
Fri, Nov 10
@BCornwall For operations/software/varnish, looks like it should just be archived and not migrated? Let me know if that's the case.
Thu, Nov 9
The one thing that remains is to revert the config change and restart the daemon to see if this brings the problem back.
Wed, Nov 8
Tue, Nov 7
Mon, Nov 6
I think this can be closed in the light of later troubleshooting.
All good recommendations. So far I have confirmed:
- operations/docker-images/debian
- operations/software/certpy
- operations/software/hhvm_exporter
Would operations/software/knead-wikidough and operations/software/liberica fit here as well?
Nov 3 2023
operations/debs/wikistats
No, it fails on different runners; I just used 1004 as an example.
Nov 2 2023
Nov 1 2023
Both Apache and ClamAV were OOM-killed.
Looks like Apache restarted, as all its metrics are missing during this time: https://grafana.wikimedia.org/d/000000371/vrts?orgId=1&from=1698612983731&to=1698644084645
The service is up after a Puppet run and needs to be monitored. The resource usage could be related to the Envoy timeout changes made in T349471: Error when accessing a specific VRTS ticket: "upstream request timeout".
100% CPU and ~100% memory usage since yesterday: https://grafana.wikimedia.org/d/000000371/vrts?orgId=1&from=1698832908164&to=1698842575883
The error on ticket.wikimedia.org is:
Most likely related to T350118: Investigate PKI errors
Related to GitLab security update - {T350215}.