User Details
- User Since
- Sep 30 2014, 4:39 PM (443 w, 3 d)
- Roles
- Administrator
- Availability
- Available
- IRC Nick
- mutante
- LDAP User
- Dzahn
- MediaWiki User
- Mutante
Today
Yesterday
also: maybe it should be more than 2 people for all of SRE nowadays? not sure. cc: @Muehlenhoff
Since I am planning to go on a sabbatical, I should find someone to replace me as one of the only 2 users who can add/remove users in pwstore.
works again per logstash (no errors shown)
Thu, Mar 30
well there is still https://gerrit.wikimedia.org/r/c/operations/puppet/+/904616 as a follow-up action.. so maybe it should still be open.
available disk space in /var/lib/docker is back to: used: 21G available: 17G usage: 56%
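The 56% figure is consistent with the way df reports its Use% column: used space divided by used plus available, rounded up. A minimal sketch of that arithmetic follows, using the rounded 21G and 17G values above as inputs. The function names are illustrative only and do not come from any tool on the runner.

```python
import shutil


def use_percent(used: int, available: int) -> int:
    """Return used / (used + available) as a percentage, rounded up."""
    total = used + available
    if total == 0:
        return 0
    # Integer ceiling division avoids floating point rounding surprises.
    return -((-100 * used) // total)


def path_use_percent(path: str) -> int:
    """Same calculation for a live filesystem, via shutil.disk_usage."""
    usage = shutil.disk_usage(path)
    return use_percent(usage.used, usage.free)


if __name__ == "__main__":
    print(use_percent(21, 17))  # 56
```

Because the inputs are already rounded by the human-readable output, the result can differ by a point from what df computes on exact block counts.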
root@runner-1030:/var/lib/docker/volumes# for volume in $(du -hs * | grep G | cut -d "G" -f2 | xargs); do ls -1 ${volume}/_data/*; du -hs ${volume}; done
mwbot
mwbot.tmp
1.1G runner-m4mqfjvt-project-1177-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
mwbot
mwbot.tmp
1.7G runner-m4mqfjvt-project-1177-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
mwbot
mwbot.tmp
4.5G runner-m4mqfjvt-project-1177-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8
mwbot
mwbot.tmp
1.1G runner-m4mqfjvt-project-1187-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
poty-stuff
poty-stuff.tmp
2.0G runner-m4mqfjvt-project-1215-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8
apt-browser
apt-browser.tmp
1.7G runner-m4mqfjvt-project-828-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8
upcoming-mainpage
upcoming-mainpage.tmp
1.8G runner-m4mqfjvt-project-837-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
mwbot-rs
3.2G runner-m4mqfjvt-project-860-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
mwbot-rs
1.7G runner-m4mqfjvt-project-860-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8
mwbot-rs
1.6G runner-m4mqfjvt-project-860-concurrent-3-cache-c33bcaa1fd2c77edfc3893b41966cea8
data-engineering
1.4G runner-m4mqfjvt-project-93-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
data-engineering
2.1G runner-m4mqfjvt-project-93-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
data-engineering
1.1G runner-m4mqfjvt-project-93-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
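The same kind of survey (which cache volumes are at least about 1G) can be sketched in Python. This is only an illustration, not what was run on the runner: the function names and the 1 GiB default threshold are assumptions, and it sums apparent file sizes, so its numbers can differ from the block usage that du reports. Comparing numeric sizes also avoids depending on the letter "G" appearing in the output, which the grep filter in the one-liner relies on.

```python
import os


def dir_size(path: str) -> int:
    """Sum apparent file sizes below path, without following symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
    return total


def large_volumes(volumes_dir: str, min_bytes: int = 1024 ** 3):
    """Return (size, name) pairs for subdirectories of at least min_bytes,
    largest first."""
    results = []
    for entry in os.scandir(volumes_dir):
        if entry.is_dir(follow_symlinks=False):
            size = dir_size(entry.path)
            if size >= min_bytes:
                results.append((size, entry.name))
    return sorted(results, reverse=True)


if __name__ == "__main__":
    # Path taken from the shell prompt above; reading it needs root.
    for size, name in large_volumes("/var/lib/docker/volumes"):
        print(f"{size / 1024 ** 3:.1f}G {name}")
```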
Alright, thanks dancy!
I wonder if this happens to be the runner from:
usage right now in / is 40% and in /var/lib/docker it's 85%
The production role for ci::master is now applied on contint2002.
01:20 <+jinxer-wm> (ProbeDown) resolved: (7) Service miscweb1002:443 has failed probes (http_commons_query_wikimedia_org_ip4) -
also see T333510
filed T333510 for the root cause of this
host commons-query.wikimedia.org
commons-query.wikimedia.org is an alias for dyna.wikimedia.org.
dyna.wikimedia.org has address 208.80.154.224
dyna.wikimedia.org has IPv6 address 2620:0:861:ed1a::1
Wed, Mar 29
these can all be seen together at:
multiple of these have been done in T329587 meanwhile
Yes, the issue with mismatched UIDs on hosts that use rsync has come up multiple times before, and the preferred fix in SRE is definitely to use reserved UIDs. We have applied this to other hosts and it ends the problem once and for all.
I did not get my own subtask to add new wikis to wikistats, as I usually did in the past.
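To see whether two hosts disagree on a UID before reserving one, the passwd data from both can be compared. A minimal sketch, assuming the passwd-format text of each host has already been collected (for example the output of getent passwd); the function names and the sample account are hypothetical:

```python
def parse_passwd(text: str) -> dict:
    """Map user name to numeric UID from passwd-format text."""
    uids = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        # name:password:UID:GID:gecos:home:shell
        if len(fields) < 3 or not fields[2].isdigit():
            continue
        uids[fields[0]] = int(fields[2])
    return uids


def uid_mismatches(passwd_a: str, passwd_b: str) -> dict:
    """Return {user: (uid_on_a, uid_on_b)} for users present on both
    hosts whose UIDs differ."""
    a = parse_passwd(passwd_a)
    b = parse_passwd(passwd_b)
    return {
        user: (a[user], b[user])
        for user in sorted(a.keys() & b.keys())
        if a[user] != b[user]
    }


if __name__ == "__main__":
    # Hypothetical sample data, not taken from any real host.
    host_a = "root:x:0:0:root:/root:/bin/bash\nexampleuser:x:998:998::/nonexistent:/bin/false\n"
    host_b = "root:x:0:0:root:/root:/bin/bash\nexampleuser:x:997:997::/nonexistent:/bin/false\n"
    print(uid_mismatches(host_a, host_b))
```

An empty result means every account that exists on both hosts has the same UID on each.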
Given that this blocks getting off of "miscweb on buster" and "switch miscweb to codfw" I would prioritize it a bit higher.
I have definitely imported a repo from Gerrit before, so it seems this must have been introduced in a GitLab version upgrade.
The usual ticket to add a new wiki to wikistats is notably missing; it used to be created automatically.
Tue, Mar 28
my best path right now is to download that notebook from the stat box I was working on and then either email it or DM it to them via Slack, which they will need to download to view.
@jcrespo Oh, I noticed just now that before the restart it was listening only on 127.0.0.1, but after it is listening on 0.0.0.0. Must be a race condition. Seems rare though. Thanks for confirming :)
@jcrespo The bacula-fd service was running and listening on port 9102 but still refusing connections. Restarting the service fixed it though and now a connection from backup1001 to port 9102 on miscweb2003 works.
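A service that is running and listening on the host itself can still be unreachable from another machine, so the useful check is a connection attempt from the client side. A minimal sketch of such a probe follows; the function name and the timeout are assumptions, and the host and port in the example come from the comment above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    # Meant to be run from the backup host; the short host name is used
    # as written in the comment and may need to be fully qualified.
    print(can_connect("miscweb2003", 9102))
```

A False result does not say why the connection failed. A refused connection, a firewall drop, and a DNS failure all look the same here.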
Mon, Mar 27
23:47 < mutante> !log people1003 - taking down apache to provoke monitoring alert (inactive instances) and confirm IRC alerting change works
Hi @Htriedman and @MoritzMuehlenhoff,
We got the "widespread puppet failures" alert which made me look at some random failed hosts in the list. I found the reason was this offboarding, because:
@larissagaulia Thank you for adding the information. Does "until July" mean "until last day of June"? I uploaded a code change above that is now in review.
Thanks @Varnent no problem:)
fyi, regardless of how one feels about Differential, there are already no more active repositories on Phabricator.
The code that installs it is at modules/profile/manifests/phorge.pp in the operations/puppet git repo.