Wow, vm-name.wmflabs must be from the extremely early days because I don't remember anything before vm-name.pmtpa.wmflabs, and that was in 2012.
Jan 8 2024
Nov 18 2022
Jun 29 2022
Apr 19 2022
Dec 28 2021
I just went to have a look and it appears the cert in /var/lib/acme-chief/certs/tools-legacy/live/rsa-2048.crt just got renewed like a minute ago. Majavah I see you're logged in, did you do some magic?
Sep 22 2021
May 29 2021
May 26 2021
May 24 2021
Martin: Sounds good, thanks. If you re-run your WebAuthn query now you should get a more convenient result :)
May 22 2021
If we're going to do it, should probably be VRT rather than VRTS
May 12 2021
I would like to know why bureaucrats should not be allowed to revoke IA rights.
Apr 22 2021
Second one is part of and dependent on T280400
Apr 19 2021
Sorry, just saw this. I might not be the best of contacts for OTRS, am just an ordinary user with technical knowledge, not an admin or anything.
Apr 18 2021
tl;dr: I think we can do a wiki domain rename here:
- add new name to dns
- add new name to apache config
- add to staticMappings
- change these in mediawiki-config:
  tests/multiversion/MWMultiVersionTest.php: [ 'otrs_wikiwiki', 'otrs-wiki.wikimedia.org' ],
  tests/urls.txt:https://otrs-wiki.wikimedia.org/wiki/Main_page
  wmf-config/CommonSettings.php: 'otrs-wiki.wikimedia.org',
  wmf-config/CommonSettings.php: 'otrs-wiki.m.wikimedia.org',
  wmf-config/InitialiseSettings.php: 'otrs_wikiwiki' => '//otrs-wiki.wikimedia.org',
  wmf-config/InitialiseSettings.php: 'otrs_wikiwiki' => 'https://otrs-wiki.wikimedia.org',
  wmf-config/InitialiseSettings.php: 'otrs_wikiwiki' => 'OTRS Wiki',
  wmf-config/logos.php: 'otrs_wikiwiki' => '/static/images/project-logos/otrs_wikiwiki.png',
- update the interwiki map on meta appropriately and go through the dumpInterwiki process
- find and update any RB/Parsoid/etc. config needed
- set up redirects for old name in apache (a quick smoke-test sketch follows this list)
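Purely as an illustration, a minimal way to sanity-check that last redirect step once everything is switched over, assuming a plain permanent redirect in apache; new-name.wikimedia.org is a placeholder here, not the actual chosen name:
curl -sI https://otrs-wiki.wikimedia.org/wiki/Main_Page | grep -iE '^(HTTP|location)'
# roughly expected output:
#   HTTP/2 301
#   location: https://new-name.wikimedia.org/wiki/Main_Page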
In T280400#7011162, @Ladsgroup wrote:Renaming a wiki is extremely complex and we have done it only once (b-x-old to be-tarask and that one is not also finished properly yet) and it's still quite a mess (probably the renaming script is also terribly broken by now) and all renames are blocked T172035: Blockers for Wikimedia wiki domain renaming
Apr 13 2021
Looks like it's all okay. Thanks Alexandros
In T275294#6972793, @Keegan wrote:In T275294#6972574, @Krenair wrote:I can be around during that window I think.
Thanks!
Apr 10 2021
I would assume the owner of a task is the assignee, not the author.
Apr 7 2021
https://gerrit.wikimedia.org/r/c/openstack/horizon/wmf-proxy-dashboard/+/609859/1/wikimediaproxydashboard/views.py#b240 looks possibly suspect, as it appears to conflate the proxy external IP with the backend internal IP, but it's from July; maybe it didn't get rolled out until recently?
Apr 6 2021
In T275294#6972793, @Keegan wrote:You might be interested in something else we may need to do in the near future, move OTRS wiki to a new name. It doesn't have the Wikidata problem blocking it, so perhaps it's feasible? Anyway, that's a ticket to look out for eventually and it can be discussed there when it's created.
Apr 5 2021
In T275294#6972463, @akosiaris wrote:In T275294#6972430, @Keegan wrote:In T275294#6971756, @akosiaris wrote:In T275294#6968900, @Keegan wrote:I would prefer to wait a week until this is published in Tech/News, I'm meeting with the admins and we're discussing and planning a larger re-branding effort. For now the migration is simply a re-branding our software internals, communicating this change needs to be done carefully to avoid confusion.
So, April 20th? Fine by me.
@akosiaris apologies, I was referring to waiting to publish the notice in this week's Tech/News instead of last week. I'd still like to do this on 13 April. Do you have time window we can schedule this for, and would you like me to make a task for the migration?
Oh, sorry my bad. I had reserved 2 hours on the 13th, from 07:00 UTC to 09:00 UTC, for the main migration. Judging from the input from Znuny, it should be sufficient, major issues aside. If any minor issues show up after the migration, we can handle them outside that window of course.
+1 on the task. We probably want to have the most technically inclined agents aware and able to quickly provide input.
Apr 2 2021
Thanks @Majavah
Mar 26 2021
Feb 23 2021
This brings back memories. Think it's long past time we get some tests added so this script stops getting broken multiple times a year.
Feb 22 2021
@TonyBallioni's original comment before deletion:
Is there any non-open source proprietary software that will function well and we wouldn’t be picking for ideological reasons?
I’d strongly support one of those.
Feb 19 2021
Feb 6 2021
I think I've seen acme-chief not responding to SIGHUP as expected before in deployment-prep, I worry this could happen in prod too.
Feb 5 2021
Jan 30 2021
Also ran into this; after first login I saw over 100 logins on this page. It either needs removal or clarification.
Jan 17 2021
In T271778#6749717, @AlexisJazz wrote:In T271778#6747588, @Krenair wrote:Unlikely
I was asking because T267006#6624466 (deployment-cache-upload06 is upload.beta.wmflabs.org I think?) and the problems started I think around T267858. But could be just coincidence.
Jan 14 2021
Unlikely
In T271808#6739578, @Vgutierrez wrote:
root@deployment-cache-upload06:/etc/acmecerts/unified/live# openssl x509 -dates -noout -in rsa-2048.crt
notBefore=Jan 12 01:23:09 2021 GMT
notAfter=Apr 12 01:23:09 2021 GMT
root@deployment-cache-upload06:/etc/acmecerts/unified/live# touch /srv/trafficserver/tls/etc/ssl_multicert.config
root@deployment-cache-upload06:/etc/acmecerts/unified/live# systemctl reload trafficserver-tls.service
It should be up & running now.. I'm not really familiar with the cloud puppetization but this doesn't mimic production behaviour
Jan 12 2021
re acme-chief part: It looks like the same thing happened to the mx and wikibase certs too. Haven't checked whether those got updated on the machines that serve them.
Also spotted various prod ncredir certs in /etc/acme-chief/config.yaml that can't be doing any good.
well, ideally it would've been a script applicable to all installs of the package, not just in wikimedia puppet.git
Dec 5 2020
Dec 1 2020
reopening too to ensure this gets looked at
Protecting as security issue due to presence of what appears to be a Jenkins API token in the task description, based on https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Catalog_compiler_local_run_(pcc_utility)
Nov 30 2020
@Urbanecm: I can see it being interpreted either way - at the time this task was named for Wikipedias :) But I don't mind
Nov 29 2020
I just went and checked on this again and found I can SSH in, tasks are running on it, SGE has cleared the alarm/unreachable flags, based on the prometheus data it came back at 02:12:25 (after having stopped at 21:36:25), and according to uptime it hasn't been restarted.
sudo service webservicemonitor restart has shut it up. Broken connection to LDAP/SSSD or something? I notice sssd has only been running since Tue 2020-11-24 18:06:07 UTC; 4 days ago, and zgrep collector-runner /var/log/syslog.3.gz | grep Traceback -C3 | head -n 300 reveals these exceptions took off only 34 seconds later. That file also shows puppet had just applied a config change and restarted sssd. Maybe we're missing a subscribe/notify relationship in puppet to have it restart webservicemonitor as well, or if that's awkward (do we still have some old sssd alternative lurking somewhere that's conditional in puppet?) then maybe we can make it detect this through monitoring the existence of some always-existing LDAP user, and when that fails, crash to have systemd restart it.
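The same crash-and-restart effect could be approximated externally with something like the following, purely as a sketch - the tool account name is illustrative and this isn't anything that exists in puppet today:
# hypothetical periodic check (e.g. run from a systemd timer), not existing webservicemonitor code:
# if an LDAP-backed tool account can't be resolved, restart webservicemonitor so it recovers once sssd is healthy again
getent passwd tools.admin >/dev/null || systemctl restart webservicemonitor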
No we just removed some stuff under the system's /var/log. It looks like /var/log/syslog for example had filled up with collector-runner exceptions, it had managed to generate over 9 million lines in 36 hours, like this:
Nov 29 17:42:11 tools-sgecron-01 collector-runner[9414]: 2020-11-29 17:42:11,517 Exception trying to validate / load tool grantmetrics
Nov 29 17:42:11 tools-sgecron-01 collector-runner[9414]: Traceback (most recent call last):
Nov 29 17:42:11 tools-sgecron-01 collector-runner[9414]:   File "/usr/lib/python3/dist-packages/tools/manifest/webservicemonitor.py", line 39, in from_name
Nov 29 17:42:11 tools-sgecron-01 collector-runner[9414]:     user_info = pwd.getpwnam(username)
Nov 29 17:42:11 tools-sgecron-01 collector-runner[9414]: KeyError: 'getpwnam(): name not found: tools.grantmetrics'
Nov 29 17:42:11 tools-sgecron-01 collector-runner[9414]: During handling of the above exception, another exception occurred:
Nov 29 17:42:11 tools-sgecron-01 collector-runner[9414]: Traceback (most recent call last):
Nov 29 17:42:11 tools-sgecron-01 collector-runner[9414]:   File "/usr/lib/python3/dist-packages/tools/manifest/webservicemonitor.py", line 146, in collect
Nov 29 17:42:11 tools-sgecron-01 collector-runner[9414]:     tool = Tool.from_name(toolname)
Nov 29 17:42:11 tools-sgecron-01 collector-runner[9414]:   File "/usr/lib/python3/dist-packages/tools/manifest/webservicemonitor.py", line 42, in from_name
Nov 29 17:42:11 tools-sgecron-01 collector-runner[9414]:     raise Tool.InvalidToolException("No tool with name %s" % (name,))
Nov 29 17:42:11 tools-sgecron-01 collector-runner[9414]: tools.manifest.webservicemonitor.Tool.InvalidToolException: No tool with name grantmetrics
It's still generating more so this will happen again at some point. It has 3.7G to burn through first though.
@Andrew and I removed some stuff; please try again. Note this is writing files on tools-sgecron-01 rather than whatever bastion you are logged on to, so a simple df won't show anything.
Some of the continuous jobs that were stopped (except anomie's) have issued root@ failure emails with errors like can't get password entry for user "tools.ket-bot" (I imagine that, given how broken this instance is, LDAP connectivity is one of the issues), and there are a couple more 'problems with defaults entries' emails too. All around 01:12.
Nov 28 2020
In T268904#6653490, @Mohnd_Kh wrote:In T268904#6653489, @Krenair wrote:In T268904#6653487, @Mohnd_Kh wrote:Hi @Krenair sorry for bothering you, i just have a little question, shortly before i use --backend=kubernetes and worked fine, now i try to use --backend=gridengine, and it give me a massage:
Could not find a public_html folder or a .lighttpd.conf file in your tool home.is that normal and i just need to set lighttpd.conf file or it's a problem need to be fixed? thx again.
I don't know much about the Grid Engine, sorry.
@Krenair that's fire thank you, and if can mention any body knows i'll be grateful, if u don't it's OK also.
In T268904#6653487, @Mohnd_Kh wrote:Hi @Krenair sorry for bothering you, i just have a little question, shortly before i use --backend=kubernetes and worked fine, now i try to use --backend=gridengine, and it give me a massage:
Could not find a public_html folder or a .lighttpd.conf file in your tool home.is that normal and i just need to set lighttpd.conf file or it's a problem need to be fixed? thx again.
alex@alex-laptop:~$ ssh deployment-puppetdb03
Linux deployment-puppetdb03 4.19.0-11-amd64 #1 SMP Debian 4.19.146-1 (2020-09-17) x86_64
Debian GNU/Linux 10 (buster)
deployment-puppetdb03 is a PuppetDB server (puppetmaster::puppetdb (postgres master))
The last Puppet run was at Thu Nov 26 20:45:39 UTC 2020 (1890 minutes ago).
Last puppet commit:
Last login: Sun Jul 26 10:45:20 2020 from 172.16.1.136
krenair@deployment-puppetdb03:~$ sudo service puppetdb status
● puppetdb.service - Puppet data warehouse server
   Loaded: loaded (/lib/systemd/system/puppetdb.service; enabled; vendor preset: enabled)
   Active: failed (Result: signal) since Thu 2020-11-26 21:15:48 UTC; 1 day 7h ago
     Docs: man:puppetdb(8)
           file:/usr/share/doc/puppetdb/index.markdown
 Main PID: 519 (code=killed, signal=KILL)
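(For context, a hedged sketch of how one might confirm and recover from the state shown above; the journal window and commands are illustrative, not a record of what was actually run:)
sudo journalctl -u puppetdb --since '2020-11-26 21:00' | tail -n 30   # signal=KILL usually means the OOM killer; the journal should confirm
sudo service puppetdb start
sudo service puppetdb status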
@Mohnd_Kh: Try now
Nov 25 2020
Based on that task having been done, I think we can safely say the rest of this is fairly pointless. If you have membership on deployment-prep, the most destructive stuff you could do is likely within the instances you're expected to have root on; DNS is likely the more easily recoverable part. If anything, this permissions setup more closely relates to what a production root would be able to do.
Nov 22 2020
@Vgutierrez: Can we make it dynamically reload its code somehow? We should probably have another task for this if so, I'm resolving this one
Nov 16 2020
It's coming from the job runner; given the values shown this is likely a post distributed via MassMessage or something that expects to be able to write cross-wiki, and a wikitech page was one of the expected destinations? This probably should've been permitted.
I don't think we use certbot anywhere except maybe Gerrit.
Nov 14 2020
@Vgutierrez FYI in case this could happen in prod too, I haven't been keeping track of changes lately. If we think it won't happen again or won't happen in prod (e.g. maybe it didn't restart because puppet is erroring somewhere in varnish code on this box?) then I guess we can close this
For some reason I had to do a full restart of the trafficserver-tls service on the cache-upload06 VM but it has loaded the latest cert now:
root@deployment-cache-upload06:~# openssl s_client -connect upload.beta.wmflabs.org:443 2>/dev/null | openssl x509 -noout -text | grep After
            Not After : Jan 12 06:00:26 2021 GMT
Cert was renewed:
root@deployment-acme-chief03:~# openssl x509 -in /var/lib/acme-chief/certs/unified/live/rsa-2048.crt -noout -text | grep After
            Not After : Jan 12 05:01:51 2021 GMT
Nov 13 2020
Sep 19 2020
This worked in the past on OTRS 5 with e.g. oversight queues. I assumed it was deliberate - the most sensitive part of a ticket is almost always going to be the first article and in this case the agent has already seen it.
Sep 14 2020
It's possible - if acme-chief has got a new cert issued but the cache-text box hasn't run puppet since, you'll see this. Check whether acme-chief has a new one and if it does, fix puppet on cache-text. If not, investigate why. Am having lunch and then working again, but I can look this evening if no one has fixed it by then.
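A hedged sketch of that check - the cert paths are the ones quoted elsewhere in these tasks, while the hostnames and the puppet invocation are illustrative rather than the exact procedure:
# on the acme-chief host: when does the freshly-issued cert expire?
openssl x509 -noout -enddate -in /var/lib/acme-chief/certs/unified/live/rsa-2048.crt
# on the cache-text host: when does the cert it is actually serving expire?
openssl x509 -noout -enddate -in /etc/acmecerts/unified/live/rsa-2048.crt
# if the dates differ, re-run puppet on the cache host and see whether it picks up the new cert
sudo puppet agent -t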
Sep 2 2020
Yeah that would work and is the mechanism that allows people direct access to e.g. the bastions and the tools login machines. The other potential option is just to require people using the test setup to use some custom SSH config to proxy through the bastions to get there.
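For illustration, the proxying option boils down to something like the following; the hostnames are placeholders, and a ProxyJump entry in ~/.ssh/config achieves the same thing:
# jump through a Cloud VPS bastion to reach an instance that has no directly reachable SSH
ssh -J yourshellname@bastion.wmcloud.org yourshellname@test-instance.someproject.eqiad1.wikimedia.cloud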
Aug 31 2020
Aug 30 2020
Maybe you can look for an X-Forwarded-Proto: https header, which I think the proxy should be setting? If it's set then treat the request as you would on port 443; if it's not set then issue a redirect?
Aug 28 2020
In T251414#6420161, @Cyberpower678 wrote:This is not something I believe I have control over.
Aug 24 2020
Note to those I see in the ptwiki comments proposing AbuseFilters: Abuse Filter has emergency checks that will disable a filter matching 5% or more of edits.
Yeah this should probably be added to https://meta.wikimedia.org/wiki/Limits_to_configuration_changes
Aug 19 2020
Aug 18 2020
modules/icinga/manifests/monitor/ores_labs_web_node.pp has check_command => "check_ores_workers!oresweb/node/${title}", which would be e.g. check_ores_workers!oresweb/node/ores-web-04. Also host ores.wmflabs.org.
modules/nagios_common/files/check_commands/check_ores_workers.cfg says this is $USER4$/check_ores_workers $HOSTADDRESS$ '$ARG1$'
So it becomes /usr/local/lib/nagios/plugins/check_ores_workers ores.wmflabs.org 'oresweb/node/ores-web-04' ($ARG1$ is only the part after the ! in the check_command)
./modules/nagios_common/files/check_commands/check_ores_workers turns that into /usr/local/lib/nagios/plugins/check_http -f follow -H "ores.wmflabs.org" -I "ores.wmflabs.org" -A "wmf-icinga/something (root@wikimedia.org)" -u "http://oresweb/node/ores-web-04/v3/scores/fakewiki/$(/bin/date +%s)/"
Aug 17 2020
In T260449#6387326, @CDanis wrote:In T260449#6387317, @Josve05a wrote:I don't have OTRS access, sorry. Is this a new reported issue with Jio users?
Aug 4 2020
Aug 3 2020
replacing with a medium instance, deployment-puppetdb04
Aug 2 2020
As I recall, with the meet project the project itself in OpenStack was named meet, so you automatically got a meet.wmflabs.org designate zone. One could be created for lists too, I guess (similar to the beta zone in deployment-prep). That way you could administer it without going through more tickets in future.
This should probably just be a record under mailman.wmcloud.org ?
Jul 30 2020
I think the keys are generated first and the certs appear when acme-chief has gone through the ACME API to get stuff signed by the CA
Jul 17 2020
In T257968#6311130, @Samwilson wrote:I'm still getting the cert error on https://upload.beta.wmflabs.org . Other subdomains, e.g. https://en.wikisource.beta.wmflabs.org , are working fine now.
Jul 15 2020
Jul 14 2020
@Vgutierrez: I'm guessing puppet had failed to run the reload exec itself due to the errors connecting to acme-chief (Error 400 on SERVER: part must be in ['ec-prime256v1.crt', 'ec-prime256v1.chain.crt', 'ec-prime256v1.chained.crt', 'ec-prime256v1.key', 'ec-prime256v1.ocsp', 'rsa-2048.crt', 'rsa-2048.chain.crt', 'rsa-2048.chained.crt', 'rsa-2048.key', 'rsa-2048.ocsp'] from puppet, and requests like /puppet/v3/file_content/acmedata/mx/bfcd4752e6b346289533bcb6934671a2/rsa-2048.crt.key?environment=production& showing up in the uwsgi-acme-chief logs). It had new puppet classes and was making the new .crt.key CERTIFICATE_TYPE calls to acme-chief, and the acme-chief instance had v0.26 installed, but the uwsgi-acme-chief service on the acme-chief box had not been restarted. Wonder if we should automatically restart uwsgi-acme-chief on upgrading the acme-chief package somehow (puppet?)
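One rough way to spot that situation, purely as a sketch (standard Debian/systemd log and unit conventions assumed; this isn't a check that exists anywhere today):
# compare when the acme-chief package was last upgraded with when the daemon last started;
# if the upgrade is newer than the start time, the service is still running old code
zgrep -h 'upgrade acme-chief' /var/log/dpkg.log* | tail -n 1
systemctl show -p ActiveEnterTimestamp uwsgi-acme-chief.service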
The immediate problem is solved by me manually doing the cert reload (something like touch /srv/trafficserver/tls/etc/ssl_multicert.config && /bin/systemctl reload trafficserver, except there are two different ssl_multicert.config files on the system and two different trafficserver services)