The guy who fixes bugs. In daily life I am a Computer Science student specialising in Cyber Security & Cloud.
User Details
- User Since
- Oct 12 2014, 7:12 AM (402 w, 3 d)
- Availability
- Available
- IRC Nick
- Southparkfan
- LDAP User
- Southparkfan
- MediaWiki User
- Southparkfan [ Global Accounts ]
Wed, Jun 1
May 24 2022
Is this a duplicate of T292348?
Important: the cherry-pick has fixed this bug in REL1_35, but introduced another bug: T292348. Please see T292348#7954637 for a workaround.
This explains why my logs went missing when dumping with --full. For the record, creating two XML dumps (one with --full, the other with --logs) and importing both dumps using importDump.php (I did the --full one first, then the --logs one, but this may not matter) works fine. It's just annoying to have to create two dumps.
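For anyone else hitting this, a minimal sketch of that workaround (the output file names and paths are only examples):

$ cd /path/to/mediawiki                                            # path is an example
$ php maintenance/dumpBackup.php --full --output=file:pages.xml    # page/revision dump
$ php maintenance/dumpBackup.php --logs --output=file:logs.xml     # log entries dump
$ php maintenance/importDump.php pages.xml                         # import pages first
$ php maintenance/importDump.php logs.xml                          # then the log entries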
I hit this bug after applying https://gerrit.wikimedia.org/r/c/mediawiki/core/+/787858 to REL1_35 (which was merged due to T288423). However, the patch provided in T288423#7268918 didn't trigger this bug.
May 2 2022
Added per https://www.mediawiki.org/wiki/Backporting_fixes. 1.35 is the only affected (supported) release; >= 1.36 already contains the fix that I have cherry-picked (see above).
Apr 24 2022
As long as a file containing the CA's root certificate (and intermediates, if applicable) and a machine certificate signed by the CA (restricted to client authentication via the Extended Key Usage extension) are present on all syslog clients, that's fine. In hindsight, adding the CA to the OS' trust store (which impacts all applications on that machine that do not manage their own trust store) would introduce a vulnerability, should the (root) CA's private key ever be compromised.
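For illustration, verifying such a machine certificate against a dedicated CA bundle (rather than the OS trust store) could look like this; the file paths are placeholders:

$ openssl verify -CAfile /etc/syslog-tls/ca-bundle.pem /etc/syslog-tls/client-cert.pem
$ openssl x509 -in /etc/syslog-tls/client-cert.pem -noout -text | grep -A1 'Extended Key Usage'
# the second command should list "TLS Web Client Authentication" only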
Since all puppetmasters do some trickery with the central puppetmaster before starting to use local ones, we might be able to conditionally copy the central puppetmaster certs to some other location, based on the puppetmaster variable, to preserve them before they are overwritten by the local puppetmaster certs.
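A rough shell sketch of that idea (the variable name and paths are assumptions; in practice this would be driven from puppet):

# Hypothetical: before an instance switches to a local puppetmaster, keep a copy
# of the certificates issued by the central puppetmaster for syslog TLS use.
if [ -n "$LOCAL_PUPPETMASTER" ]; then
    cp -a /var/lib/puppet/ssl /var/lib/puppet/ssl-central
fi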
Note that the central puppetmasters don't do particularly strong verification for the certificates they give out, as the setup was not designed to hold any secrets. The duplicate-prevention and instance-must-exist checks might, however, be good enough for this use case.
Thanks! I am not sure how these checks work, unfortunately.
[...]
Is that correct?
Yes! In particular, the handling of mutual authentication for instances from Cloud VPS projects that employ local puppetmasters. Instances that use the cloud-wide puppetmaster must receive a valid certificate[1] from that puppetmaster; otherwise the integrity of the hostname in a syslog message is substantially weakened.
Let me know if you need unblocking/help on any of those; I don't have lots of time, but I can make some if that unblocks you.
Looking forward to ideas regarding establishing a chain of trust between the central syslog server and all of its clients (i.e. Cloud VPS Instances).
Feb 13 2022
Now that the patch above has been merged, we can start thinking about applying the syslog client configuration by default on Cloud VPS instances. The central syslog server should be in the cloudinfra project. There are a few challenges to tackle, though:
Oct 8 2021
Jun 29 2021
May 4 2021
Apr 12 2021
Apr 2 2021
Short update: using local puppet patches, I got the syslog forwarding working. Next step will be getting those patches into the production branch.
Mar 4 2021
@Bstorm thank you! I presume the project will be created shortly? Let me know if there is anything I can do.
Mar 2 2021
Feb 26 2021
Feb 25 2021
Yes, Grafana Loki is one of the newer solutions. I am not experienced with Grafana Loki, but it may be the best solution for extracting metrics from logs, although I am not sure whether Grafana Loki meets any high availability / scalability requirements. Elastic is better known, but given the license issues, it's better to let that slide for now (if the software is forked, we can reconsider!). What do you think?
Feb 17 2021
Feb 15 2021
As the original reporter of T127656, could you use my help here? Central logging sounds like a fun, but useful side project.
Feb 6 2021
Aug 25 2020
@MoritzMuehlenhoff understood. Patch set 4 will use the custom unit (with /run) on systems older than bullseye. On bullseye systems we'll be using the built-in unit.
Correct, see this:
Aug 24 2020
I have uploaded a new patch using /run on all servers (regardless of OS). However, what about removing our unit and using the built-in one? That reduces the likelihood of our unit conflicting with intended upstream behavior. Running apt-get source nagios-nrpe=3.2.1-2 shows the following unit (debian/nagios-nrpe-server.service):
[Unit]
Description=Nagios Remote Plugin Executor
Documentation=http://www.nagios.org/documentation
After=var-run.mount nss-lookup.target network.target local-fs.target remote-fs.target time-sync.target
Before=getty@tty1.service plymouth-quit.service xdm.service
Conflicts=nrpe.socket
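If we do drop our custom unit, local tweaks could still be layered on top of the packaged one with a drop-in instead of a full replacement; a sketch (the override contents are purely illustrative):

$ systemctl edit nagios-nrpe-server
# this opens /etc/systemd/system/nagios-nrpe-server.service.d/override.conf, e.g.:
[Service]
# illustrative only: point the PID file at /run
PIDFile=/run/nrpe.pid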
Aug 23 2020
@Ciencia_Al_Poder What do you mean by internal URLs? purgeList.php fetches the URLs using Title->getCdnUrls(), which seems to work fine in 1.34.2:
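A quick way to inspect what getCdnUrls() returns, in case it helps debugging (a sketch using eval.php; the page title is just an example):

$ php maintenance/eval.php
> $title = Title::newFromText( 'Main Page' );
> print_r( $title->getCdnUrls() );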
The PID directory has been changed to /run since Buster: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=932353#10
You're welcome.
Except I wanted to ask where you got the 308 from specifically.
I got that from T256276. However, https://gerrit.wikimedia.org/r/c/operations/puppet/+/612442/3/modules/dynamicproxy/files/domainproxy.lua says "ngx.HTTP_MOVED_PERMANENTLY", which is a 301. Either one is fine, though.
Aug 18 2020
Aug 11 2020
Aug 8 2020
Aug 7 2020
May 22 2020
Apr 10 2020
Mar 13 2020
Jan 11 2020
Definitely not working on this. Thanks for the help!
Dec 4 2019
Nov 4 2019
I feel my point was completely missed, regardless of the cause or fix: this change has a performance impact of roughly 28x. And since I have new questions, reopening.
Oct 17 2019
It doesn't seem to be limited to broken citations, {{Special:WhatlinksHere/{{FULLPAGENAME}}}} also seems enough to reproduce this.
Oct 15 2019
Aug 8 2019
gdnsd supports HTTP health checks (and custom checks as well). Provided the DNS TTL is no longer than a few minutes, 'automatic failover' is possible.
Jun 10 2019
The tasks regarding loss of PSU redundancy on cp303[2689] are normal priority, does this one need to be high priority?
Mar 28 2019
Jan 25 2017
(correct datacenter)
Dec 30 2016
Nov 18 2016
The same issue is occurring again at a high rate, and is causing problems. Also, there are users reporting that they are silently being logged out of the site. I'm not sure how closely the two are related, but I spotted this in the debug logs:
[session] Session "[50]CentralAuthSessionProvider<-:2:Southparkfan>[REDACTED]": Metadata merge failed: [Exception MediaWiki\Session\MetadataMergeException( /srv/mediawiki/w/includes/session/SessionProvider.php:195) Key "CentralAuthSource" changed]
#0 /srv/mediawiki/w/includes/session/SessionManager.php(629): MediaWiki\Session\SessionProvider->mergeMetadata(array, array)
#1 /srv/mediawiki/w/includes/session/SessionManager.php(498): MediaWiki\Session\SessionManager->loadSessionInfoFromStore(MediaWiki\Session\SessionInfo, WebRequest)
#2 /srv/mediawiki/w/includes/session/SessionManager.php(182): MediaWiki\Session\SessionManager->getSessionInfoForRequest(WebRequest)
#3 /srv/mediawiki/w/includes/WebRequest.php(700): MediaWiki\Session\SessionManager->getSessionForRequest(WebRequest)
#4 /srv/mediawiki/w/includes/session/SessionManager.php(121): WebRequest->getSession()
#5 /srv/mediawiki/w/includes/Setup.php(747): MediaWiki\Session\SessionManager::getGlobalSession()
#6 /srv/mediawiki/w/includes/WebStart.php(137): require_once(string)
#7 /srv/mediawiki/w/index.php(40): require(string)
#8 {main}
Nov 9 2016
Sep 17 2016
Is this the same as the recent behavior seen on various API appservers? https://ganglia.wikimedia.org/latest/?c=API%20application%20servers%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2
Aug 24 2016
I wonder why @RobH and @Cmjohnson are talking about 4TB disks. The current problems are caused by 4k (6TB) disks, and the accessories link given by Cmjohnson mentions a non-4k 6TB disk.
Aug 10 2016
Do we prefer a fallback that cannot be impacted by a Wikimedia outage of any kind? Conpherence is an option, but it is not off-site; a network outage affecting Wikimedia will render Conpherence useless.
Jul 31 2016
It's possible that my configuration is not up to date. Like Wikimedia, we use Redis for sessions (production 1.26 config):
$wgObjectCaches['redis'] = array(
    'class' => 'RedisBagOStuff',
    'servers' => array( '<IP>' ),
    'password' => $wmgRedisPassword,
);
$wgMainCacheType = 'redis';
$wgSessionCacheType = 'redis';
$wgSessionsInObjectCache = true;
$wgMessageCacheType = CACHE_NONE;
$wgParserCacheType = CACHE_DB;
$wgLanguageConverterCacheType = CACHE_DB;
Jul 27 2016
Jul 26 2016
Jul 15 2016
Jul 12 2016
Yes, it does.
Step 1 seems to be re-introducing the texvc package. Its deletion was actually unrelated to the issue when I created it; the deletion just made this task more or less invalid, because the problem was largely gone: nobody would be hit by it anymore, except users who still had that package installed. Re-introducing the package is a blocker for this task.
Jul 11 2016
@Andrew labvirt1012 lacks hyperthreading. Can you enable that?
Jul 3 2016
mw1017 and mw1099 are the eqiad MW debug hosts, so those should not be in conftool-data.
mw1161-mw1169 were previously appservers, but were converted to jobrunners in December, per rOPUPba0a47b56ded3dc748c436c2940f114389b312f2. Jobrunners are not present in conftool-data, so I guess everything is as expected?
@Kghbln yes, I am aware of that, and I spoke to @MoritzMuehlenhoff, and he said that in hindsight the mediawiki-math-texvc package should never have been deleted. Perhaps it will be packaged again, but I'm not sure if Debian's policies allow that, since it would mean it has to be introduced into the jessie (stable) suite again. Otherwise we'd still need to wait until Debian 9 has been released.
Jun 17 2016
@Joe mw1306 seems to have the same IP as mw1091 (although that one shows up as mw1091 in Ganglia, whereas mw1090 shows up as mw1305...). So I think you need to fix that one as well.
Jun 11 2016
Just tested this: it doesn't happen in Firefox 47.0 on Windows 8.1. Perhaps an upgrade to 47.0 might help, although I doubt it, because I can't see anything related to this in the 47.0 release notes.
Yep, that change (and the maintenance script) worked perfectly. Everything is displaying correctly now.
Jun 10 2016
cp1044 has been decommissioned per T133614.
Jun 8 2016
As of PHP 5.5 (Jessie has 5.6), an opcache extension is included (it only needs to be enabled in the config), which caches compiled bytecode in memory. Together with the other performance improvements since 5.4 (assuming paymentswiki runs on 5.3 now), CPU usage / response time should already decrease without using HHVM (so you get some more headroom before HHVM is really needed).
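For illustration, checking and enabling it could look like this (the php.ini lines are only an example of the stock settings):

$ php -m | grep -i opcache
# if nothing is listed, enable the extension in php.ini (or a conf.d snippet):
#   zend_extension=opcache.so
#   opcache.enable=1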
Jessie has PHP 5.6 actually.
Jun 6 2016
I'm sure Legoktm is right :-)! But I failed to make it work, so someone else should take it over honestly...
It would be cool if this bug could be fixed (whatever the solution is).
May 31 2016
southparkfan@tools-exec-1407:~$ echo 'oht0ipe1Pho7aa7pohChie8eath0ogoo9Eesoh9nahc3aefoh7ie2ais6oohugoo' > reset_2fa.txt
southparkfan@tools-exec-1407:~$ cat reset_2fa.txt
oht0ipe1Pho7aa7pohChie8eath0ogoo9Eesoh9nahc3aefoh7ie2ais6oohugoo
southparkfan@tools-exec-1407:~$
May 30 2016
May 22 2016
@Paladox I've already upgraded to HHVM 3.12, which seems to work.
@Joe did you already install one of those servers? I noticed https://ganglia.wikimedia.org/latest/?c=Application%20servers%20eqiad&h=mw1305.eqiad.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 (I can't view which specs the new servers have, but this 'mw1305' seems to have the same CPU/RAM specs as the old appservers?)
May 6 2016
@Cmjohnson yeah, perhaps I have been a bit too quick by already doing the DNS part (although that seems to be the only thing I can do) :-)
May 5 2016
Just to learn how the process works, I've submitted a patch for the DNS adjustments. I noticed db1058 is referenced in the dhcpd and manifests/role/coredb.pp files in puppet but I have no idea how the latter one works, so I'll leave the puppet work to someone else.
Apr 28 2016
Apr 26 2016
Can this patch be backported to REL1_26, REL1_25 and REL1_23 too? The current security patch is only available in master, and these three branches still receive security support.