Things my team is working on: MediaWiki-Platform-Team
Side projects I am working on (or planning to, eventually): User-Tgr
You can find more info about me on my user page.
User Details
- User Since
- Sep 19 2014, 4:55 PM (538 w, 3 d)
- Availability
- Available
- IRC Nick
- tgr
- LDAP User
- Gergő Tisza
- MediaWiki User
- Tgr (WMF) [ Global Accounts ]
Yesterday
I think the more reliable solution would be to add a hook to MediaWikiIntegrationTestCase::resetNonServiceCaches() so that extensions can reset their own caches.
(Or we could just create a CentralAuthUserFactory service and move the cache there, of course.)
Optimistically closing, maybe Cassandra just needs a reboot every couple months or something. We'll see whether it repeats.
systemctl says
Jan 11 08:09:58 deployment-sessionstore06 systemd[1]: cassandra.service: Main process exited, code=killed, status=9/KILL
Jan 11 08:10:07 deployment-sessionstore06 nodetool[922509]: nodetool: Found unexpected parameters: [disablethrift]
Jan 11 08:10:07 deployment-sessionstore06 nodetool[922509]: See 'nodetool help' or 'nodetool help <command>'.
Jan 11 08:10:09 deployment-sessionstore06 nodetool[923332]: nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused (Connection refused)'.
Jan 11 08:10:10 deployment-sessionstore06 nodetool[923396]: nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused (Connection refused)'.
Jan 11 08:10:12 deployment-sessionstore06 nodetool[923458]: nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused (Connection refused)'.
Jan 11 08:10:13 deployment-sessionstore06 nodetool[923520]: nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused (Connection refused)'.
Jan 11 08:10:13 deployment-sessionstore06 systemd[1]: cassandra.service: Control process exited, code=exited, status=1/FAILURE
Jan 11 08:10:13 deployment-sessionstore06 systemd[1]: cassandra.service: Failed with result 'oom-kill'.
Jan 11 08:10:13 deployment-sessionstore06 systemd[1]: cassandra.service: Consumed 3min 8.148s CPU time.
free says there's almost 1.5G available, which seems decent. A restart seems to work, with some complaints about free space (but those seem to be about disk rather than memory):
Jan 13 19:57:36 deployment-sessionstore06 cassandra[1055000]: WARN [main] 2025-01-13 19:57:36,983 DatabaseDescriptor.java:1034 - Small commitlog volume detected at '/var/lib/cassandra/commitlog'; setting commitlog_total_space to 4997. You can override this in cassandra.yaml
Jan 13 19:57:36 deployment-sessionstore06 cassandra[1055000]: WARN [main] 2025-01-13 19:57:36,987 DatabaseDescriptor.java:650 - Only 13.541GiB free across all data volumes. Consider adding more capacity to your cluster or removing obsolete snapshots
Jan 13 19:57:39 deployment-sessionstore06 cassandra[1055000]: WARN [main] 2025-01-13 19:57:39,176 StartupChecks.java:257 - JMX is not enabled to receive remote connections. Please see cassandra-env.sh for more info.
Jan 13 19:57:39 deployment-sessionstore06 cassandra[1055000]: WARN [main] 2025-01-13 19:57:39,211 SigarLibrary.java:172 - Cassandra server running in degraded mode. Is swap disabled? : true, Address space adequate? : true, nofile limit adequate? : true, nproc limit adequate? : false
No idea if that's bad.
Lock wait timeout means the process was waiting for a lock held by another process for more than something like 3s, right? So it's not obvious to me how import being a long-running process would cause this. Or rather, it could cause lock wait timeouts in some other process, if it holds locks, but not in its own process, no?
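To illustrate the reasoning above, here is a minimal sketch using Python threads in place of database processes. It is only an analogy: the function name, the thread roles and the timings are made up for the example and have nothing to do with the actual import script or the database's real timeout. It shows that the long-running lock holder completes normally, while the other party gives up after waiting too long.

```python
import threading
import time


def demo_lock_wait_timeout(hold_seconds=0.5, wait_timeout=0.1):
    """Return (holder_ok, waiter_ok): whether each thread obtained the lock."""
    lock = threading.Lock()
    holder_has_lock = threading.Event()
    results = {}

    def long_running_holder():
        # Stands in for a long-running process that holds a lock for a while.
        with lock:
            results["holder"] = True
            holder_has_lock.set()
            time.sleep(hold_seconds)

    def short_waiter():
        # Stands in for some other process that needs the same lock.
        holder_has_lock.wait()
        acquired = lock.acquire(timeout=wait_timeout)
        results["waiter"] = acquired
        if acquired:
            lock.release()

    threads = [
        threading.Thread(target=long_running_holder),
        threading.Thread(target=short_waiter),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results["holder"], results["waiter"]


holder_ok, waiter_ok = demo_lock_wait_timeout()
print(holder_ok, waiter_ok)  # True False
```

In this toy model the timeout only ever shows up on the waiting side, which is why a timeout inside the long-running process itself would point to some other lock holder.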
Notes from @elukey on IRC:
17:12 < elukey> IIUC the config needs to run on the deployment servers via puppet run, so the correspondent yaml files for helmfile are updated
17:13 < elukey> and after that, a deploy would need to be kicked off to refresh the httpd config
I think that proves my suspicion that there were two unrelated errors: the one described in T379254 (introduced around August 10 and fixed around November 20) which reduced session lifetime to 24 hours under certain fairly common circumstances, and resulted in a big increase in top-level autologins; and another one which affects fewer people, and can cause multiple logouts on the same wiki within 24 hours.
Sun, Jan 12
I'll deprioritize this because it doesn't seem to affect many people and probably only happens in an edge case (switching between multiple accounts). Also I don't have much idea what could be done here, short of someone being able to reproduce and inspect in detail what's happening with the cookies. None of us could reproduce it, and looking through the relevant code didn't surface anything suspect (as far as I can see from the code this behavior should be impossible without manual cookie tampering, which means I'm probably missing something, but just knowing that isn't much help).
Fri, Jan 10
More specifically, what we want is probably
- the Matrix server (probably Synapse - the other options are less mature)
- probably the synapse-admin web UI
- Element (the standard Matrix web interface)
- A Matrix-Slack bridge (T382163) - we'd have to figure out what exactly, see task
- A Matrix-IRC bridge (T382164) - probably matrix-appservice-irc
The most official bridge is matrix-appservice-irc which seemed fine from a user perspective, but Libera kicked out the official matrix.org bridge because it caused problems at scale. They didn't ban use of matrix-appservice-irc so it's probably still the way to go.
Our experience with the "official" matrix-appservice-slack bridge has been pretty disappointing on the WMF internal Matrix instance. There is an Automattic fork which supposedly fixes most of the problems with the bridge, but isn't actively maintained. There are a bunch of alternative bridges but those seem to be meant for users, not admins (i.e. every user would have to set it up separately? I might be misunderstanding how it works).
Thu, Jan 9
Should be straightforward technically via the LinkerMakeExternalLink hook, but yes it would raise all kinds of SEO and usability and tool B/C issues, and on Wikimedia sites it would increase the amount of user tracking (currently we don't learn about a user clicking a link; I think it would be hard to keep that data from at least ending up in the webrequest table).
Moving to radar until we get more details.
Firefox apparently doesn't treat the field as a username field if it has a class like search. We could use that as a workaround but eww.
Wed, Jan 8
Consumer keys are public identifiers.
Tue, Jan 7
@MarioB if they are still not approved, can you provide the app IDs? It's hard to find apps by name.
Edge login doesn't create accounts on wikis where the user doesn't have them yet, so for temporary accounts at the time of creation it's a no-op. (Which also means there is not much point in triggering it... although figuring out whether we are right after an account creation might be more hassle than it's worth.)
Possible although a bit ugly. There are two core mechanisms for removing authentication methods, invalidateSessionsForUser() which we are already calling for credential changes, and preventSessionsForUser() which is intended to be permanent (used for account usurpation by a system account). AuthManager::changeAuthenticationData() which handles credential changes does not handle multiple form fragments like login/signup does. This doesn't fit any of those generic mechanisms, so it would have to be a one-off - modify the form generation logic in SpecialChangeCredentials to add the checkbox, add a new hook to SpecialChangeCredentials::success() (and I suppose changePassword.php) that gets called when the option is checked, and add a hook handler to the OAuth extension that removes the access tokens. And maybe add some way for hook handlers to return a message, so e.g. OAuth can advise the user on how to reset owner-only keys.
This was done in rMW8c8654cce0af: Add a maintenance script to create bot passwords. a while ago (although only for creation).
Mon, Jan 6
T383049: No central session found is similar, but with the session store instead of the token store. (Although the session store is Kask and the token store is the microstash so there is not much infrastructure overlap there.)
Possibly caused by T380500: CentralAuthUser returning outdated data after user creation? It seems like the only way for this to happen is for CentralAuthUser::exists() or CentralAuthUser::isAttached() to return false.
Added a note to the install instructions about lack of SQLite support.
Seems to be affecting temporary accounts mostly.
Doesn't seem to have gotten more frequent in the last 90 days; I think this is just normal DB noise, like the various deadlock errors.
Happens a few hundred times per week, and has been going on for at least the 90 days Logstash can remember. Not sure about the k8s connection, this is a fairly generic error message.
Fri, Jan 3
(Ideally, of course, we would not have two separate HTTP components, just one that's librarized and uses a PSR standard. That's T110022: Move HTTP-related code from MW to its own library.)
T296433: MultiHttpClient is in includes/libs/ but uses MediaWiki components is the older task for MultiHttpClient.
The task description explicitly says that the skin name is already localizable but it would be confusing to change the localized name but not change the internal identifier. Are you asking for another level of mapping (internal identifier -> publicly exposed identifier -> localized string)? It seems like a lot of potential complexity for dubious value - the disparity would still be confusing (e.g. technical documentation would still use the internal identifier, since that's the one that's the same on all wikis), it would add one more problem to copying gadget code between wikis, and the internal identifier would still have to be publicly exposed (because e.g. JS code in source control can't account for how a site might set up its mapping). Just changing skinname-vector-2022 seems like a strictly less bad option at that point.
Thu, Jan 2
(See T323867: Clarify use of non-confidential OAuth 2.0 clients for some limitations which currently make JS apps hard to use.)
Oh, right, I missed that this is a publicly available wiki.
Stretch goal: access_token query parameter.
It says "Probably a Web UI?" so I don't think so.
Or rather this is a duplicate of T229505: Admin adds new client since that form didn't support OAuth 2 yet at the time of writing this task.
If you think something other than https://meta.wikimedia.org/wiki/Special:OAuthConsumerRegistration/propose/oauth2 is needed, please clarify and reopen.
I'd say the best practice is never to change $wgPasswordPolicy['checks'] (you can of course append your own). Added a warning to that effect to the documentation page.
This has existed since the beginning: https://meta.wikimedia.org/wiki/Special:OAuthConsumerRegistration/list
Admittedly the UX is terrible; that's T104078: Update OAuth consumer list table styles.
Duplicate of T254190: Allow a user to disable an OAuth client, I think.
Seems essentially a duplicate of T234674: Delete OAuth 2.0 access tokens on password change? We could provide a separate button if that's useful, of course.
We'd have to delete accepted consumers in the OAuth session provider's invalidateSessionsForUser() callback, much like we already do for preventSessionsForUser(). That would be run every time the user is force-logged out (e.g. password change, 2FA change, user rename, steward lock, invalidateUserSessions.php). Code-wise a trivial change, not 100% sure of the implications but seems reasonable. (Maybe a bit disruptive for owner-only consumers, where the user would have to go to each such app's Special:OAuthConsumerRegistration subpage and do a token reset. Not much security value if we don't disable owner-only consumers, though.)
@sbassett maybe the Security team has an opinion on this?
It's unfortunate but I'm not sure what would be a more reasonable behavior. Just take the message key of the first error and use it as the error code, ignoring all others?
Can you copy the output from api.php?action=query&format=json&meta=authmanagerinfo&formatversion=2&amirequestsfor=create&amimergerequestfields=1?
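For anyone who prefers to script that request, a minimal sketch of building the same query URL in Python follows. The host wiki.example.org and the helper name are placeholders, not anything from the task; the parameters are exactly the ones in the URL above. The sketch only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_authmanagerinfo_url(api_base):
    """Build the authmanagerinfo query URL for account creation fields."""
    params = {
        "action": "query",
        "format": "json",
        "meta": "authmanagerinfo",
        "formatversion": "2",
        "amirequestsfor": "create",
        "amimergerequestfields": "1",
    }
    return api_base + "?" + urlencode(params)


# Placeholder host; substitute the api.php URL of the wiki in question.
url = build_authmanagerinfo_url("https://wiki.example.org/w/api.php")
print(url)
```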
Mon, Dec 23
I guess the likely fallout from pyJWT is larger. Let's make sure the change is well-announced then.
@Reedy not sure about the backports... a similar change in T283456: OAuth identfy endpoint should not expose unconfirmed email address broke lots of things. A breaking API change should probably not go into minor releases?
Thu, Dec 19
(We don't have Developer-notice anymore, that would be the appropriate tag here.)
The session ID gets reset during password change; apparently something went wrong with that and it tried to reuse the old session ID to create a new session. Since the report is very old, not worth looking into IMO - please reopen if it's still happening.
I'd generally try to avoid classes which are neither services (stateless) nor value objects (pure state). A PreviewParserInput that mixes page content with some DB access logic seems unwieldy. Since (AFAIK) we only ever use one speculative rev / page ID mechanism, we just need to track whether 1) the parse is actually associated with a real page / revision 2) whether it already has a page / rev id, and have a service do the speculating. I don't think either the ParserInput or ParserOptions needs to depend on that service.
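A rough sketch of that split, written in Python purely for illustration (MediaWiki itself is PHP, and none of these names exist there): ParseTarget is a hypothetical value object holding only state, and SpeculativeIdService is a hypothetical service that does the speculating. The ID source is injected as a plain callable so the value object never touches the database.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class ParseTarget:
    """Hypothetical value object: pure state, no DB access."""
    is_real_page: bool
    page_id: Optional[int] = None
    rev_id: Optional[int] = None


class SpeculativeIdService:
    """Hypothetical service: fills in missing IDs, holds no per-parse state."""

    def __init__(self, next_ids: Callable[[], Tuple[int, int]]):
        # next_ids stands in for whatever DB lookup predicts the next IDs.
        self._next_ids = next_ids

    def with_ids(self, target: ParseTarget) -> ParseTarget:
        # Previews not tied to a real page never get speculative IDs.
        if not target.is_real_page:
            return target
        # Nothing to speculate about if both IDs are already known.
        if target.page_id is not None and target.rev_id is not None:
            return target
        page_id, rev_id = self._next_ids()
        return replace(
            target,
            page_id=target.page_id if target.page_id is not None else page_id,
            rev_id=target.rev_id if target.rev_id is not None else rev_id,
        )


service = SpeculativeIdService(lambda: (101, 2002))
preview = ParseTarget(is_real_page=False)
new_page = ParseTarget(is_real_page=True)
existing = ParseTarget(is_real_page=True, page_id=7)

filled_new = service.with_ids(new_page)
filled_existing = service.with_ids(existing)
print(filled_new, filled_existing)
```

The point of the split is that neither the input object nor the options object needs a reference to the service; only the caller that wants speculative IDs does.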
The CentralAuth part seems fine - the editcount is tracked in the DB, the account is attached.
Not sure what could be done about this at the RDBMS or CentralAuth level. What could be done is having a DB connection pool of one, so when using SQLite requests involving the DB in any way just have to wait until the previous request has been served.
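As a sketch of what a "pool of one" could look like, here is a small Python example using the standard sqlite3 module. It is an illustration of the idea only, not MediaWiki's connection handling: the class name, table and request handler are made up, and threads stand in for concurrent web requests. Every request goes through the same connection and simply waits on a lock until the previous one is done.

```python
import sqlite3
import threading


class SingleConnectionPool:
    """Hypothetical pool of exactly one SQLite connection."""

    def __init__(self, path):
        # check_same_thread=False allows sharing the connection between
        # threads; the lock below makes sure only one uses it at a time.
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self._lock = threading.Lock()

    def run(self, sql, params=()):
        # Callers block here until the previous request has finished.
        with self._lock:
            rows = self._conn.execute(sql, params).fetchall()
            self._conn.commit()
            return rows


pool = SingleConnectionPool(":memory:")
pool.run("CREATE TABLE edits (user_name TEXT)")


def handle_request(name):
    # Stands in for a web request that writes to the database.
    pool.run("INSERT INTO edits (user_name) VALUES (?)", (name,))


threads = [
    threading.Thread(target=handle_request, args=(f"user{i}",))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

count = pool.run("SELECT COUNT(*) FROM edits")[0][0]
print(count)  # 8
```

The trade-off is throughput: requests are fully serialized, which is likely acceptable for the small installs where SQLite is used but would not scale.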
The private part is in commit 95517e85 in PrivateSettings. I'll apply that at the same time when the two patches above get deployed, to minimize disruption.
Tue, Dec 17
This was done a while ago, I forgot to close it.