
Run maintenance/cleanupUsersWithNoId.php on all wikis
Closed, ResolvedPublic

Description

This should only be done after 1.31.0-wmf.11 is deployed.

To fully resolve T9240: Usernames in history of imported pages should refer to original wiki and prepare for T167246: Refactor "user" & "user_text" fields into "actor" reference table, this maintenance script needs to be run to clean up existing imported rows and CentralAuth global blocks.

The script needs to be run twice for each wiki.

  1. With --table ipblocks --prefix meta to adjust CentralAuth global blocks. If for some reason $wgCentralAuthGlobalBlockInterwikiPrefix is changed, adjust the prefix accordingly.
  2. With --assign --prefix imported --force to clean up old imports (a sketch of both runs follows this list). Or if someone wants to give me a list of wiki language codes and short prefixes that mean more or less "imported", I could probably use that.
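
For concreteness, a rough sketch of the two runs as they might be issued per wiki from a maintenance host; the mwscript wrapper and the $wiki variable are assumptions here, so adjust to the actual deployment tooling:

    # Pass 1: prefix CentralAuth global block rows in ipblocks
    mwscript maintenance/cleanupUsersWithNoId.php --wiki="$wiki" \
        --table ipblocks --prefix meta

    # Pass 2: assign old imported rows to existing accounts where possible,
    # prefixing whatever is left
    mwscript maintenance/cleanupUsersWithNoId.php --wiki="$wiki" \
        --assign --prefix imported --force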

Related Objects

Event Timeline


Mentioned in SAL (#wikimedia-releng) [2017-11-30T16:58:36Z] <anomie> Running cleanupUsersWithNoId.php on Beta Cluster, see T181731

Mentioned in SAL (#wikimedia-cloud) [2017-11-30T16:58:36Z] <anomie> Running cleanupUsersWithNoId.php on Beta Cluster, see T181731

Mentioned in SAL (#wikimedia-releng) [2017-11-30T17:49:42Z] <anomie> Finished running cleanupUsersWithNoId.php on Beta Cluster for T181731

Mentioned in SAL (#wikimedia-cloud) [2017-11-30T17:49:42Z] <anomie> Finished running cleanupUsersWithNoId.php on Beta Cluster for T181731

Mentioned in SAL (#wikimedia-cloud) [2017-11-30T18:59:57Z] <bd808> Testing stashbot fix for double phab logging (T181731)

Mentioned in SAL (#wikimedia-operations) [2017-12-06T16:03:11Z] <anomie@terbium> Running cleanupUsersWithNoId.php for testwiki, see T181731

Mentioned in SAL (#wikimedia-operations) [2017-12-06T16:14:38Z] <anomie@terbium> Running cleanupUsersWithNoId.php for test2wiki, see T181731

Mentioned in SAL (#wikimedia-operations) [2017-12-06T16:27:00Z] <anomie@terbium> Running cleanupUsersWithNoId.php for testwikidatawiki, see T181731

Mentioned in SAL (#wikimedia-operations) [2017-12-06T16:29:37Z] <anomie@terbium> Running cleanupUsersWithNoId.php for mediawikiwiki, see T181731

Another complication relating to this script would be T2323, involving usernames stored with underscores, extra spaces and initial lower-case letters. Quite a few edits affected by this bug also have a rev_user of 0 ... they can probably be found in all the tables besides the "/Positive rev_user" one here: https://en.wikipedia.org/wiki/User:Nemo_bis/Bug_323_revisions

Another complication relating to this script would be T2323, involving usernames stored with underscores, extra spaces and initial lower-case letters.

Such entries won't be touched by this script, since they cause User::isUsableName() to return false, and will eventually be copied as-is into the actor table.

If someone decides to resolve that bug in the future, they would need to implement similar prefix-or-assign logic as is being used here.
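
For illustration (not from the task itself), roughly how such names fail validation; a minimal sketch assuming the User class of MediaWiki 1.31, where exact results also depend on configuration such as $wgCapitalLinks:

    // Names that don't survive title normalization are rejected, so the
    // cleanup script leaves those rows alone.
    var_dump( User::isValidUserName( 'Foo_bar' ) );  // bool(false): underscore
    var_dump( User::isValidUserName( 'foo bar' ) );  // bool(false): lower-case initial
    var_dump( User::isValidUserName( 'Foo  bar' ) ); // bool(false): extra space
    var_dump( User::isValidUserName( 'Foo bar' ) );  // bool(true)
    // User::isUsableName() additionally rejects $wgReservedUsernames entries.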

Mentioned in SAL (#wikimedia-operations) [2017-12-12T19:35:26Z] <anomie> Running cleanupUsersWithNoId.php on all wikis (this will take a while), see T181731

It appears that this script is causing SUL accounts to be created at wikis where pages have been imported – not unreasonable, but the comments at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Local_accounts_attached_without_a_visit_(and_welcomed_without_an_edit) indicate that it's surprising a few people.

This broke Wikidata and dewiki during the night, causing MediaWiki exceptions and (surprisingly, only) timeouts on page views/edits, and freezing the recentchanges and watchlist functionality. Apparently the archive table differs between servers due to an old MediaWiki bug that inserted into archive with INSERT...SELECT, and this script touches many old archive rows, breaking replication on half of the servers.
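
(My reading of that failure mode, illustrated with a toy schema rather than the real one: under statement-based replication, an INSERT ... SELECT with no ORDER BY may apply rows in a different order on each server, so tables without a reliable key drift apart between master and replicas:)

    -- Toy tables, not the MediaWiki schema.
    CREATE TABLE src (title VARBINARY(255), user_text VARBINARY(255));
    CREATE TABLE archive_like (title VARBINARY(255), user_text VARBINARY(255));

    -- No ORDER BY and no primary key: each server is free to materialize
    -- the selected rows in its own order.
    INSERT INTO archive_like (title, user_text)
    SELECT title, user_text FROM src;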

@Anomie were you by any chance running unattended long-running scripts without screen? Maybe I was wrong, but it confused me a lot to be able to kill your processes.

Sigh, I guess this task is another one that's blocked on s5 DB weirdness being fixed. Let me know when that happens please.

I don't know much of anything about screen. Are there instructions somewhere for how to run maintenance scripts in it correctly? I don't see any mention of it on https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment in the sections about running maintenance scripts.

Would we be able to undo the results of this script, or reconfigure it, for the Nostalgia Wikipedia? One of the good things about that site (a copy of the Wikipedia database from 20 December 2001) is that it made it fairly easy to compare edits from 2001 between enwiki and the Nostalgia Wikipedia database. Now that's impossible:
https://en.wikipedia.org/w/index.php?title=Astronomy_and_Astrophysics/History&action=history

I'm 99% sure we won't have these problems with pre-MediaWiki edits on enwiki, except perhaps for the editors who got renamed to a ~enwiki prefix while their edits got left behind (see my earlier link to the village pump thread), because I created all the old account names on that site. I'm not so sure about other old Wikipedias though ...
*edit* it's not really a problem in this case, e.g.
https://en.wikipedia.org/wiki/Special:CentralAuth/Jmccann

Sigh, I guess this task is another one that's blocked on s5 DB weirdness being fixed

@Anomie- actually it is blocked on core taking ownership and follow-up of the problems generated by bad archive queries. It made other wikis break too; it just complains loudly on s5. If those queries hadn't broken dewiki, the s5 weirdness would not have affected them.

@Anomie As you can see here, for example, s7 has the same problems with archive: T163190 (also the tags-related tables, but those are not part of core and are a lesser issue). The data loss on archive and the non-deterministic query problems were brought up as early as 2015 in T112637 (and even before that, informally).

There is a good introduction to screen at https://wikitech.wikimedia.org/wiki/Screen. I do not think there should be any guidelines on deployment (as you said in the past, let's not add red tape unnecessarily), but it is a hugely vital tool for managing tasks on a server. With screen I can deploy code from a train or a plane and not worry about connection interruptions. I thought about what you said about "being limited by buffer", and I wonder if you knew that, in a screen session, you can:

Press Ctrl+a, Esc (technically '[', but Escape is easier on my keyboard) to enter copy mode, then scroll up and down. I think by default it has around 1000 lines of scrollback buffer; you can add more to match your terminal's, like [https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/admin/files/home/jynus/.screenrc | I did].
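
(For reference, the relevant ~/.screenrc directive; 10000 is just an example value:)

    # Enlarge screen's per-window scrollback buffer
    defscrollback 10000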

Nobody expects you to know all of this, but it certainly makes collaborating with others much easier: if I see a SCREEN process owned by anomie, I know some long-running process is going on there. You can connect from any client, share the session with other people, etc. Like the vim editor, it takes some getting used to, but afterwards you cannot live without it.
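
(The handful of commands involved, for anyone following along; the session name is only an example:)

    screen -S cleanup     # start a named session
    # run the long-lived script inside it; Ctrl+a d detaches
    screen -ls            # list sessions (how others spot a SCREEN process)
    screen -r cleanup     # reattach later, from any client
    screen -x cleanup     # attach in shared mode alongside someone else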

Screenshot_20171214_113441.png (703×1 px, 43 KB)

Trizek-WMF subscribed.

The script has caused SUL accounts to be created, and it has been noticed, as Sherry said. I have also seen some reports on fr.wp. Explaining it in Tech News would be a good thing IMO.

@Trizek-WMF Something like this for Tech News?

When you import a page from another wiki, the usernames of the users who edited the article on the wiki you imported it from are shown in the article history. These should link to the users on the original wiki. A script to fix this caused problems for Wikidata and German Wikipedia. It also created a large number of [[<tvar|sul>m:Special:MyLanguage/Help:Unified login</>|SUL accounts]] on wikis where editors had never edited.

Sigh, I guess this task is another one that's blocked on s5 DB weirdness being fixed

@Anomie- actually it is blocked on core taking ownership and follow-up of the problems generated by bad archive queries.

Is there a task somewhere that says specifically what needs to be done? Not a huge generic RFC like T112637, but a task with a checklist of the actual work needed.

I note that if problems of this sort are already in the database, that's probably outside the scope of "core".

@Trizek-WMF Something like this for Tech News?

When you import a page from another wiki, the usernames of the users who edited the article on the wiki you imported it from are shown in the article history. These should link to the users on the original wiki. A script to fix this caused problems for Wikidata and German Wikipedia. It also created a large number of [[<tvar|sul>m:Special:MyLanguage/Help:Unified login</>|SUL accounts]] on wikis where editors had never edited.

@Anomie, can you review that sentence?

That text seems appropriate to me.

@Trizek-WMF Something like this for Tech News?

When you import a page from another wiki, the usernames of the users who edited the article on the wiki you imported it from are shown in the article history. These should link to the users on the original wiki. A script to fix this caused problems for Wikidata and German Wikipedia. It also created a large number of [[<tvar|sul>m:Special:MyLanguage/Help:Unified login</>|SUL accounts]] on wikis where editors had never edited.

Actually, it's not just imports. My work account got registered in plenty of wikis, even though it hasn't edited anything that would be imported to many wikis. EBernhardson on IRC figured out that it was because of Wikidata changes being reflected in projects' Recent changes and Watchlist feeds. In my case, this edit to an item for a template used in many different projects probably triggered most of those account creations.

@jhsoby Noted, but the text doesn't say it's because of the imports, but because of a script that tried to fix an issue with the imports.

EBernhardson on IRC figured out that it was because of Wikidata changes being reflected in projects' Recent changes and Watchlist feeds.

Hmm. That probably means I'll need to figure out what code in Wikidata is doing this, and then re-run the script over the recentchanges tables. Thanks for pointing that out.

EBernhardson on IRC figured out that it was because of Wikidata changes being reflected in projects' Recent changes and Watchlist feeds.

Hmm. That probably means I'll need to figure out what code in Wikidata is doing this, and then re-run the script over the recentchanges tables. Thanks for pointing that out.

I see that the script works on the recentchanges table too (which I didn't expect, and maybe should be optional: it's not very useful when you throw away the RC in a few weeks, IMHO). Then the query needs an rc_type = 0 condition, or at any rate rc_type < 5.

Current:

		$this->cleanup(
			'recentchanges', 'rc_id', 'rc_user', 'rc_user_text',
			[], [ 'rc_id' ]
		);
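
For illustration only, the suggested restriction would presumably look something like this, assuming the fifth argument is the extra-conditions array (it was not adopted; see the replies below):

		// Hypothetical variant restricting the cleanup to edit rows only
		// (RC_EDIT is the constant 0); not what the script actually does.
		$this->cleanup(
			'recentchanges', 'rc_id', 'rc_user', 'rc_user_text',
			[ 'rc_type' => RC_EDIT ], [ 'rc_id' ]
		);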

I see that the script works on the recentchanges table too (which I didn't expect, and maybe should be optional: it's not very useful when you throw away the RC in a few weeks, IMHO).

It's required for the actor table migration that all tables involved, including recentchanges, are properly cleaned up.

Then the query needs a rc_type = 0 condition, or at any rate rc_type < 5.

All rows have to be cleaned up for the actor table migration, regardless of rc_type.

@jcrespo: Is this still blocked for dewiki (s5) and wikidatawiki (now s8)? Or did the issues blocking it get fixed with the resolution of T161294?

You can run it, but please add it to "week of" on the deployment page. Check with @Marostegui as it may or may not interfere with the comment refactoring schema change.

I am currently running the comment refactoring schema change on s5. Once done, I will go for s8.

Would we able to undo the results of this script or reconfigure it for the Nostalgia Wikipedia?

Someone could do such a thing. Whether anyone will is a different question.

One of the good things about that site (a copy of the Wikipedia database from 20 December 2001) is that it made it fairly easy to compare edits from 2001 between enwiki and the Nostalgia Wikipedia database. Now that's impossible:
https://en.wikipedia.org/w/index.php?title=Astronomy_and_Astrophysics/History&action=history

How is it impossible? It's just slightly more difficult to match up the usernames.

One of the good things about that site (a copy of the Wikipedia database from 20 December 2001) is that it made it fairly easy to compare edits from 2001 between enwiki and the Nostalgia Wikipedia database. Now that's impossible:
https://en.wikipedia.org/w/index.php?title=Astronomy_and_Astrophysics/History&action=history

How is it impossible? It's just slightly more difficult to match up the usernames.

In hindsight, that wasn't the best explanation.

What I should have said was comparing lists of contributions by a specific user. I might do this, for instance, to find article creations that were recorded in the Nostalgia Wikipedia but not the English Wikipedia. This could occur because when the UseModWiki edits were imported into Wikipedia, the latest edit was omitted. An example that comes to mind is Magnus Manske: when he was creating articles back in 2001, he would often use the phrase "Initial entry" in the edit summary. I used to be able to open his English and Nostalgia Wikipedia contributions side-by-side, search for that phrase in each of those lists, and import missing edits where I found them. I imported all the edits from the Nostalgia Wikipedia with the edit summary "Initial entry" in this list of contribs where the byte difference is negative.

I might also want to compare lists of contribs to check that both enwiki and the Nostalgia Wikipedia have the same first edit for a specific user, or just to see if any old edits by a specific editor are missing from enwiki. I can still do this through the regular interface for IP addresses, but not for users.

I'm going to have to run the script again. But since it will only involve Wikidata changes, I'm going to use the options --table recentchanges --prefix wikidata --force, which will not create any new SUL accounts.

Mentioned in SAL (#wikimedia-operations) [2018-03-08T19:33:05Z] <anomie> Running cleanupUsersWithNoId.php --table recentchanges --prefix wikidata --force on wikidata client wikis for T181731. This shouldn't create any local SUL accounts.

Looks like some wikis may have been missed. The logs I have don't record getting as far as the recentchanges table for the following: advisorswiki bnwikivoyage euwikisource fawiki fixcopyrightwiki frwiktionary gorwiki hewiki hiwikimedia huwiki id_internalwikimedia idwikimedia inhwiki kowiki labswiki labtestwiki lfnwiki liwikinews metawiki pmswikisource pswikivoyage punjabiwikimedia romdwikimedia rowiki sahwikiquote satwiki shnwiki ukwiki viwiki wikimaniawiki yuewiktionary zhwikiversity

I'm going to re-run the script for those wikis, just in case.

Mentioned in SAL (#wikimedia-operations) [2018-12-05T14:43:02Z] <anomie> Running cleanupUsersWithNoId.php on metawiki for T181731 / T210985

Mentioned in SAL (#wikimedia-operations) [2018-12-05T15:07:03Z] <anomie> Running cleanupUsersWithNoId.php on potentially missed s3 and s7 wikis for T181731

Retrospective: The code I used to log the runs last year wasn't as robust at logging as what I've been using since, and in particular it seems not to have logged that the s3 and s7 runs were killed when issues arose. I didn't follow up with a double-check that all the wikis were completely processed; I just assumed that the lack of logged errors meant everything was good.

I hope not too many people complain about accounts being auto-created on the missed wikis this time (cf. T181731#3835006). If they do, I apologize in advance.

It turned out none of the wikis from s3 actually needed the run (all reported "assigned 0 and prefixed 0 row(s)" for all tables).

The s7 wikis all completed successfully, and this time I know any errors would have been logged. ;)