Run maintenance/cleanupUsersWithNoId.php on all wikis
Closed, Resolved · Public

Description

This should only be done after 1.31.0-wmf.11 is deployed.

To fully resolve T9240: Usernames in history of imported pages should refer to original wiki and prepare for T167246: Refactor "user" & "user_text" fields into "actor" reference table, this maintenance script needs to be run to clean up existing imported rows and CentralAuth global blocks.

The script needs to be run twice for each wiki.

  1. With --table ipblocks --prefix meta to adjust CentralAuth global blocks. If for some reason $wgCentralAuthGlobalBlockInterwikiPrefix is changed, adjust the prefix accordingly.
  2. With --assign --prefix imported --force to clean up old imports. Or if someone wants to give me a list of wiki language codes and short prefixes that mean more or less "imported", I could probably use that.
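
For reference, the two passes described above could be sketched as a dry run like the following. The option strings come from the task description; the `RUNNER` variable, the `cleanup_commands` helper and the use of `--wiki` to select the target wiki are assumptions for illustration, not the exact commands that were run in production.

```shell
#!/bin/sh
# Dry-run sketch: print the two passes for one wiki instead of executing them.
# RUNNER is an assumption; substitute whatever wrapper launches maintenance
# scripts in your environment.
RUNNER="${RUNNER:-php maintenance/cleanupUsersWithNoId.php}"

cleanup_commands() {
    wiki="$1"
    # Pass 1: adjust CentralAuth global blocks stored in ipblocks.
    echo "$RUNNER --wiki=$wiki --table ipblocks --prefix meta"
    # Pass 2: clean up old imports.
    echo "$RUNNER --wiki=$wiki --assign --prefix imported --force"
}

cleanup_commands testwiki
```

Printing the commands first makes it easy to review the prefix (for example, if $wgCentralAuthGlobalBlockInterwikiPrefix differs from "meta") before anything touches the database.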

Restricted Application added a subscriber: Aklapper. · Nov 30 2017, 4:29 PM
Anomie updated the task description. · Nov 30 2017, 4:57 PM

Mentioned in SAL (#wikimedia-releng) [2017-11-30T16:58:36Z] <anomie> Running cleanupUsersWithNoId.php on Beta Cluster, see T181731

Mentioned in SAL (#wikimedia-cloud) [2017-11-30T16:58:36Z] <anomie> Running cleanupUsersWithNoId.php on Beta Cluster, see T181731

Anomie updated the task description. · Nov 30 2017, 5:00 PM

Mentioned in SAL (#wikimedia-releng) [2017-11-30T17:49:42Z] <anomie> Finished running cleanupUsersWithNoId.php on Beta Cluster for T181731

Mentioned in SAL (#wikimedia-cloud) [2017-11-30T17:49:42Z] <anomie> Finished running cleanupUsersWithNoId.php on Beta Cluster for T181731

Nirmos added a subscriber: Nirmos. · Nov 30 2017, 6:15 PM

Mentioned in SAL (#wikimedia-cloud) [2017-11-30T18:59:57Z] <bd808> Testing stashbot fix for double phab logging (T181731)

Mentioned in SAL (#wikimedia-operations) [2017-12-06T16:03:11Z] <anomie@terbium> Running cleanupUsersWithNoId.php for testwiki, see T181731

Mentioned in SAL (#wikimedia-operations) [2017-12-06T16:14:38Z] <anomie@terbium> Running cleanupUsersWithNoId.php for test2wiki, see T181731

Mentioned in SAL (#wikimedia-operations) [2017-12-06T16:27:00Z] <anomie@terbium> Running cleanupUsersWithNoId.php for testwikidatawiki, see T181731

Mentioned in SAL (#wikimedia-operations) [2017-12-06T16:29:37Z] <anomie@terbium> Running cleanupUsersWithNoId.php for mediawikiwiki, see T181731

Graham87 added a subscriber: Graham87. · Edited · Dec 8 2017, 12:48 AM

Another complication relating to this script would be T2323, involving usernames stored with underlines, extra spaces and initial lower-case letters. Quite a few edits affected by this bug also have a rev_user of 0 ... they can probably be found in all the tables besides the "/Positive rev_user" one here: https://en.wikipedia.org/wiki/User:Nemo_bis/Bug_323_revisions

Anomie added a comment. · Dec 8 2017, 5:53 PM

Another complication relating to this script would be T2323, involving usernames stored with underlines, extra spaces and initial lower-case letters.

Such entries won't be touched by this script, since they cause User::isUsableName() to return false, and will eventually be copied as-is into the actor table.

If someone decides to resolve that bug in the future, they would need to implement similar prefix-or-assign logic as is being used here.

Mentioned in SAL (#wikimedia-operations) [2017-12-12T19:35:26Z] <anomie> Running cleanupUsersWithNoId.php on all wikis (this will take a while), see T181731

It appears that this script is causing SUL accounts to be created at wikis where pages have been imported – not unreasonable, but the comments at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Local_accounts_attached_without_a_visit_(and_welcomed_without_an_edit) indicate that it's surprising a few people.

This broke wikidata and dewiki during the night, causing MediaWiki exceptions and (surprisingly, only) timeouts on page views/edits, and a freeze of the recentchanges and watchlist functionality. Apparently, the archive table differs between servers due to an old MediaWiki bug that inserted into archive with INSERT...SELECT, and this script touches many old archive tables, breaking replication on half of the servers.

@Anomie were you by any chance running unattended long-running scripts without screen? Maybe I was wrong, but it confused me a lot to be able to kill your processes.

Sigh, I guess this task is another one that's blocked on s5 DB weirdness being fixed. Let me know when that happens please.

I don't know much of anything about screen. Are there instructions somewhere for how to run maintenance scripts in it correctly? I don't see any mention of it on https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment in the sections about running maintenance scripts.

Graham87 added a comment. · Edited · Dec 14 2017, 7:28 AM

Would we be able to undo the results of this script or reconfigure it for the Nostalgia Wikipedia? One of the good things about that site (a copy of the Wikipedia database from 20 December 2001) is that it made it fairly easy to compare edits from 2001 between enwiki and the Nostalgia Wikipedia database. Now that's impossible:
https://en.wikipedia.org/w/index.php?title=Astronomy_and_Astrophysics/History&action=history

I'm 99% sure we won't have these problems with pre-MediaWiki edits on enwiki, except perhaps for the editors who got renamed to a ~enwiki prefix while their edits got left behind (see my earlier link to the village pump thread), because I created all the old account names on that site. I'm not so sure about other old Wikipedias though ...
*edit* it's not really a problem in this case, e.g.
https://en.wikipedia.org/wiki/Special:CentralAuth/Jmccann

Sigh, I guess this task is another one that's blocked on s5 DB weirdness being fixed

@Anomie- actually it is blocked on core taking ownership and followup of the problems generated by bad archive queries. It made other wikis break, it just complains loudly on s5. If those queries wouldn't have broken dewiki, s5 weirdness would not have affected them.

@Anomie As you can see here, for example, s7 has the same problems with archive: T163190 (also tags-related tables, but that is not part of core, and a lesser issue). The data loss and non-deterministic query problems on archive were brought up as early as 2015 in T112637 (and even before that, informally).

jcrespo added a comment. · Edited · Dec 14 2017, 10:21 AM

There is a good introduction to screen at https://wikitech.wikimedia.org/wiki/Screen. I do not think there should be any guidelines on deployment (as you said in the past, let's not red-tape unnecessarily), but it is a hugely vital tool for managing tasks on a server. With screen I can deploy code from a train or a plane, and not worry about connection interruptions. I thought about what you said of "being limited by buffer", and I wonder if you didn't know you could, on a screen session:

Ctrl+a, Esc (technically, '[', but escape is easier on my keyboard) to go to edit mode, then scroll up and down. I think by default it has around 1000 lines of buffer; you can add more to match those on your terminal, like [https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/admin/files/home/jynus/.screenrc | I did ].

Nobody expects you to know all these, but it certainly makes collaborating with others way easier. I see a SCREEN process by anomie? I know some long-running process is ongoing there. You can connect from any client, share it with other people, etc. Like the vim editor, it takes some getting used to, but later you cannot live without it.
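
As a rough illustration of the workflow described above, a long-running job can be started inside a detached, named screen session with a larger scrollback buffer. The session name, the `screen_start_cmd` helper and the placeholder job below are hypothetical; the sketch only prints the command so it can be checked before use.

```shell
#!/bin/sh
# Sketch: build the command that starts a job in a detached, named screen session.
# SESSION and the job string are placeholders, not taken from this task.
SESSION="${SESSION:-cleanup-T181731}"

screen_start_cmd() {
    # -h sets the scrollback buffer size in lines.
    # -d -m starts the session detached; -S gives it a name others can find.
    echo "screen -h 10000 -dmS $1 sh -c '$2'"
}

screen_start_cmd "$SESSION" "echo long-running job here"
# Afterwards: `screen -ls` lists sessions, `screen -r cleanup-T181731` reattaches,
# and Ctrl+a d detaches again without stopping the job.
```

A named session makes it obvious to anyone looking at the process list what the job is and who owns it, which is the collaboration benefit mentioned in the comment.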

Trizek-WMF added a subscriber: Trizek-WMF.

The script has caused SUL accounts to be created and it has been noted, like Sherry said. I have also seen some reports on fr.wp. Explaining it in Tech News would be a good thing IMO.

Johan added a subscriber: Johan. · Dec 14 2017, 11:32 AM

@Trizek-WMF Something like this for Tech News?

When you import a page from another wiki the usernames of the users who edited the article on the wiki you imported it from are shown in the article history. This should link to the users on the original wiki. A script to fix this caused problems for Wikidata and German Wikipedia. It also created a large number of [[<tvar|sul>m:Special:MyLanguage/Help:Unified login|SUL accounts]] on wikis where editors had never edited.

Sigh, I guess this task is another one that's blocked on s5 DB weirdness being fixed

@Anomie- actually it is blocked on core taking ownership and followup of the problems generated by bad archive queries.

Is there a task somewhere that says specifically what needs to be done? Not a huge generic RFC like T112637, a task with a checklist of actual work needed.

I note that if problems of this sort are already in the database, that's probably outside the scope of "core".

@Trizek-WMF Something like this for Tech News?

When you import a page from another wiki the usernames of the users who edited the article on the wiki you imported it from are shown in the article history. This should link to the users on the original wiki. A script to fix this caused problems for Wikidata and German Wikipedia. It also created a large number of [[<tvar|sul>m:Special:MyLanguage/Help:Unified login|SUL accounts]] on wikis where editors had never edited.

@Anomie, can you review that sentence?

That text seems appropriate to me.

jhsoby added a subscriber: jhsoby. · Dec 15 2017, 12:31 AM

@Trizek-WMF Something like this for Tech News?

When you import a page from another wiki the usernames of the users who edited the article on the wiki you imported it from are shown in the article history. This should link to the users on the original wiki. A script to fix this caused problems for Wikidata and German Wikipedia. It also created a large number of [[<tvar|sul>m:Special:MyLanguage/Help:Unified login|SUL accounts]] on wikis where editors had never edited.

Actually, it's not just imports. My work account got registered in plenty of wikis, even though it hasn't edited anything that would be imported to many wikis. EBernhardson on IRC figured out that it was because of Wikidata changes being reflected in projects' Recent changes and Watchlist feeds. In my case, this edit to an item for a template used in many different projects probably triggered most of those account creations.

Johan added a comment. · Dec 15 2017, 5:39 PM

@jhsoby Noted, but the text doesn't say it's because of the imports, but because of a script that tried to fix an issue with the imports.

EBernhardson on IRC figured out that it was because of Wikidata changes being reflected in projects' Recent changes and Watchlist feeds.

Hmm. That probably means I'll need to figure out what code in Wikidata is doing this, and then re-run the script over the recentchanges tables. Thanks for pointing that out.

Nemo_bis added a subscriber: Nemo_bis. · Edited · Jan 5 2018, 8:54 AM

EBernhardson on IRC figured out that it was because of Wikidata changes being reflected in projects' Recent changes and Watchlist feeds.

Hmm. That probably means I'll need to figure out what code in Wikidata is doing this, and then re-run the script over the recentchanges tables. Thanks for pointing that out.

I see that the script works on the recentchanges table too (which I didn't expect, and maybe should be optional: it's not very useful when you throw away the RC in a few weeks, IMHO). Then the query needs a rc_type = 0 condition, or at any rate rc_type < 5.

Current:

		$this->cleanup(
			'recentchanges', 'rc_id', 'rc_user', 'rc_user_text',
			[], [ 'rc_id' ]
		);

Anomie added a comment. · Jan 5 2018, 2:23 PM

I see that the script works on the recentchanges table too (which I didn't expect, and maybe should be optional: it's not very useful when you throw away the RC in a few weeks, IMHO).

It's required for the actor table migration that all tables involved, including recentchanges, are properly cleaned up.

Then the query needs a rc_type = 0 condition, or at any rate rc_type < 5.

All rows have to be cleaned up for the actor table migration, regardless of rc_type.

Anomie added a comment. · Jan 9 2018, 2:51 PM

@jcrespo: Is this still blocked for dewiki (s5) and wikidatawiki (now s8)? Or did the issues blocking it get fixed with the resolution of T161294?

You can run it, but please add it to "week of" on the deployment page. Check with @Marostegui as it may or may not interfere with the comment refactoring schema change.

I am currently running the comment refactoring schema change on s5. Once done, I will go for s8.

Anomie closed this task as Resolved. · Feb 10 2018, 4:34 PM

Would we be able to undo the results of this script or reconfigure it for the Nostalgia Wikipedia?

Someone could do such a thing. Whether anyone will is a different question.

One of the good things about that site (a copy of the Wikipedia database from 20 December 2001) is that it made it fairly easy to compare edits from 2001 between enwiki and the Nostalgia Wikipedia database. Now that's impossible:
https://en.wikipedia.org/w/index.php?title=Astronomy_and_Astrophysics/History&action=history

How is it impossible? It's just slightly more difficult to match up the usernames.

One of the good things about that site (a copy of the Wikipedia database from 20 December 2001) is that it made it fairly easy to compare edits from 2001 between enwiki and the Nostalgia Wikipedia database. Now that's impossible:
https://en.wikipedia.org/w/index.php?title=Astronomy_and_Astrophysics/History&action=history

How is it impossible? It's just slightly more difficult to match up the usernames.

In hindsight, that wasn't the best explanation.

What I should have said was comparing lists of contributions by a specific user. I might do this, for instance, to find article creations that were recorded in the Nostalgia Wikipedia but not the English Wikipedia. This could occur because when the UseModWiki edits were imported into Wikipedia, the latest edit was omitted. An example that comes to mind is Magnus Manske: when he was creating articles back in 2001, he would often use the phrase "Initial entry" in the edit summary. I used to be able to open his English and Nostalgia Wikipedia contributions side-by-side, search for that phrase in each of those lists, and import missing edits where I found them. I imported all the edits from the Nostalgia Wikipedia with the edit summary "Initial entry" in this list of contribs where the byte difference is negative.

I might also want to compare lists of contribs to check that both enwiki and the Nostalgia Wikipedia have the same first edit for a specific user, or just to see if any old edits by a specific editor are missing from enwiki. I can still do this through the regular interface with IP addresses, but not users.

Anomie added a comment. · Mar 8 2018, 7:19 PM

I'm going to have to run the script again. But since it'll be only Wikidata changes, I'm going to use options --table recentchanges --prefix wikidata --force that will not create any new SUL accounts.
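
A re-run limited to recentchanges could be looped over a list of client wikis roughly as follows. This is only a dry-run sketch: the options are the ones quoted above, while the `rerun_commands` helper, the `--wiki` selector and the list-file format (one database name per line, with '#' comments) are assumptions.

```shell
#!/bin/sh
# Dry-run sketch: emit the recentchanges-only pass for every wiki in a list file.
# The list format (one database name per line, '#' comments) is an assumption.
rerun_commands() {
    list="$1"
    while IFS= read -r wiki; do
        case "$wiki" in
            ''|'#'*) continue ;;  # skip blank lines and comments
        esac
        echo "php maintenance/cleanupUsersWithNoId.php --wiki=$wiki --table recentchanges --prefix wikidata --force"
    done < "$list"
}

# Example with a throwaway list file.
tmp="$(mktemp)"
printf '%s\n' '# client wikis' 'enwiki' '' 'dewiki' > "$tmp"
rerun_commands "$tmp"
rm -f "$tmp"
```

Because this pass omits --assign, it only prefixes the affected rows, which matches the stated intent of not creating new SUL accounts.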

Mentioned in SAL (#wikimedia-operations) [2018-03-08T19:33:05Z] <anomie> Running cleanupUsersWithNoId.php --table recentchanges --prefix wikidata --force on wikidata client wikis for T181731. This shouldn't create any local SUL accounts.