
user names show up in <ip> tags
Closed, Resolved (Public)

Description

Sometimes, the <ip> tag does not contain an IP address, but a user's name. Example from dewiki-20100903-stub-meta-history.xml:

<revision>
<id>7</id>
<timestamp>2002-07-08T01:55:46Z</timestamp>
<contributor>
<ip>Ben-Zin</ip>
</contributor>
<minor/>
<comment>*</comment>
<text id="7" />
</revision>

Supposedly this happens when rev_user is 0 because the user was unknown on the wiki when the revision was created. A rev_user of 0 is itself expected and occurs frequently when revisions are imported from other wikis. It is especially common for very old revisions imported from UseMod.

The expected behavior would be to show only valid IPv4 and IPv6 addresses in the <ip> tag. If the user ID is 0 but the user name is not a valid IP address, it should be exported as a regular user, but without an ID:

<contributor>
<username>Ben-Zin</username>
</contributor>
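
The proposed rule can be sketched as follows. This is only an illustration of the decision, not MediaWiki's export code: the function names are hypothetical, and Python's ipaddress module stands in for whatever IP check MediaWiki itself uses (it may accept more IPv6 spellings than MediaWiki records for anonymous edits).

```python
import ipaddress
from xml.sax.saxutils import escape


def is_valid_ip(text):
    """Return True only if the whole string parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(text)
    except ValueError:
        return False
    return True


def contributor_xml(user_id, user_text):
    """Build a <contributor> element for one revision (hypothetical helper).

    user_id == 0 covers both anonymous edits and imported revisions whose
    author is unknown on this wiki, so the name decides which tag is used.
    """
    name = escape(user_text)
    if user_id == 0:
        if is_valid_ip(user_text):
            return f"<contributor><ip>{name}</ip></contributor>"
        # Not an IP address: export as a user, but without an ID.
        return f"<contributor><username>{name}</username></contributor>"
    return f"<contributor><username>{name}</username><id>{user_id}</id></contributor>"
```

With this sketch, contributor_xml(0, "Ben-Zin") yields a <username> element, while contributor_xml(0, "192.0.2.1") yields an <ip> element.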

This is especially important for researchers who want to be able to distinguish between anonymous contributions and contributions of logged-in users. The presence of the <ip> tag is supposed to indicate an anonymous contribution. This bug makes that assumption false and leaves researchers with no option but to work around the issue by checking for themselves whether the <ip> tag actually contains an IP address.
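
A minimal sketch of that workaround on the consumer side, assuming a dump fragment without an XML namespace (real dumps declare one, so the tag lookups would need adjusting). The classify function and its label strings are hypothetical, and Python's ipaddress module is used as the validity check.

```python
import ipaddress
import xml.etree.ElementTree as ET


def classify(contributor):
    """Label a <contributor> element based on what its <ip> tag really holds."""
    ip = contributor.findtext("ip")
    if ip is None:
        # No <ip> tag at all: a named contributor, or no usable data.
        return "named" if contributor.findtext("username") else "unknown"
    try:
        ipaddress.ip_address(ip.strip())
    except ValueError:
        # The <ip> tag holds a user name, as in the example above.
        return "named-without-id"
    return "anonymous"


SAMPLE = """<revision>
  <id>7</id>
  <contributor>
    <ip>Ben-Zin</ip>
  </contributor>
</revision>"""

revision = ET.fromstring(SAMPLE)
label = classify(revision.find("contributor"))  # "named-without-id"
```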


Version: unspecified
Severity: normal

Details

Reference
bz27992

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 11:29 PM)
bzimport set Reference to bz27992.

gleim wrote:

Since I wrote a little patch for the data: it concerns the contributors of 42916 revisions (dewiki 2010-09-03). I can provide additional data if required; just contact me.

gleim, what does your patch do exactly?

What do you propose to do with values such as 210.50.203.xxx or 123.office.bomis.com ?

If researchers are going to make software to crawl it by themselves, doesn't seem unreasonable that they filter such values to their liking, too.

(In reply to comment #3)

What do you propose to do with values such as 210.50.203.xxx or
123.office.bomis.com ?

they are not valid IP addresses, so they should not be treated as IP addresses. We *should* recognize valid IPv6 addresses (but only in the form MediaWiki uses when recording them for anon edits).

If researchers are going to make software to crawl it by themselves, doesn't
seem unreasonable that they filter such values to their liking, too.

requiring our users to fix our broken output isn't really the best practice, is it? we can easily fix this, so we should.

Ideally, truly anonymous edits should be distinguishable from edits by unknown authors in the database. Instead of using user=0 for both, unknown users could have -1 or something. But I suppose there's no knowing what that would break... adding an extra "anon" flag along with every xxx_user field would be a pain too. Oh, well.

gleim wrote:

(In reply to comment #2)

gleim, what does your patch do exactly?

Hello,
sorry for my late reply, I haven't been online for a few days. My patch corrects the data; I have not touched any code. All I did was check, for each case where a username is falsely marked as an anonymous IP, whether there is a newer entry with a valid user ID etc. That allows me to rewrite the old entries. Of course this does not fix anything, but it helped in our case.

Best wishes,

Rüdiger
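
The data correction described in this comment might look roughly like the sketch below. The actual script was not published, so everything here is an assumption: the Rev record, the repair function, and the choice to take the user ID from any revision carrying the same name. Matching purely by name can also be wrong if an account was renamed.

```python
import ipaddress
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Rev:
    rev_id: int
    name: str
    user_id: Optional[int]  # None when the dump used an <ip> tag


def looks_like_ip(text):
    try:
        ipaddress.ip_address(text)
    except ValueError:
        return False
    return True


def repair(revs: List[Rev]) -> List[Rev]:
    """Fill in user IDs for names that were exported inside <ip> tags."""
    # Collect every name that appears somewhere with a real (non-zero) ID.
    known = {r.name: r.user_id for r in revs if r.user_id}
    fixed = []
    for r in revs:
        if r.user_id is None and not looks_like_ip(r.name) and r.name in known:
            fixed.append(replace(r, user_id=known[r.name]))
        else:
            fixed.append(r)  # genuine IPs and unmatched names stay untouched
    return fixed
```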

Rüdiger:

Would you mind making your script available as an attachment here, so that users of the dumps can make these corrections until we have a patch approved and deployed?

To Daniel:

We have no way to be sure that a username really exists on a project, i.e. really existed at the time of an edit. We can't actually look up all names in the user table to see if they are valid, because if a user is renamed and all goes well, the old name disappears from the table. Bearing that in mind, I think the only approach we can reasonably take is that if the rev_user is 0, and rev_user_text looks *exactly* like an IP address, then we log it as an IP address, otherwise not.

Examples that would be recorded as usernames, all real usernames taken from enwp:

193.251.9.132 is back for more
152.163.xx.xx
64.175.249.214 (Hephaestos)

(In reply to comment #6)

yes, i agree. it can and should be implemented as "if user == 0 and user_name matches ip_pattern".

my thoughts about making the distinction explicit in the database only apply to fresh imports. this cannot be done reliably in retrospect, as you pointed out.

Examples that would be recorded as usernames, all real usernames taken from
enwp:

193.251.9.132 is back for more
152.163.xx.xx
64.175.249.214 (Hephaestos)

and they should, because they *are* usernames. people actually created an account with that name (on *some* wiki, somewhere. or someone imported manipulated/broken dumps).

btw, something like 123.345.111.333 should also be logged as a user name :) not sure about 0.0.0.0, but i don't think that can happen anyway.
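
For illustration, the names mentioned in this thread can be run through a strict check. This sketch uses Python's ipaddress module as a stand-in for MediaWiki's own IP test, which may differ in detail (for example in which IPv6 spellings it accepts); is_exact_ip is a hypothetical name.

```python
import ipaddress


def is_exact_ip(name):
    """True only when the entire string is a syntactically valid IP address."""
    try:
        ipaddress.ip_address(name)
    except ValueError:
        return False
    return True


EXAMPLES = [
    "193.251.9.132 is back for more",
    "152.163.xx.xx",
    "64.175.249.214 (Hephaestos)",
    "123.345.111.333",       # octets out of range
    "210.50.203.xxx",
    "123.office.bomis.com",
    "0.0.0.0",
    "193.251.9.132",
]

for name in EXAMPLES:
    tag = "ip" if is_exact_ip(name) else "username"
    print(f"{name!r} -> <{tag}>")
```

Under this check only 0.0.0.0 and 193.251.9.132 end up in <ip> tags; every other name in the list is treated as a username.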

(In reply to comment #7)

(In reply to comment #6)
btw, something like 123.345.111.333 should also be logged as a user name :) not
sure about 0.0.0.0, but i don't think that can happen anyway.

I believe we consider 0.0.0.0 to be an IP address server side, so naturally that should be considered an ip.

We already have this kind of heuristic going on server side and already have code to differentiate; any change we make should just make use of that same code.

Maybe a better place to fix this is to have a cleanupUsernames.php file in /maintenance?

(In reply to comment #9)

Maybe a better place to fix this is to have a cleanupUsernames.php file in
/maintenance?

and what exactly would that do?

in the database, we have user entries with id 0 for two things: unknown users (from imports) and anon edits (IPs). how would a cleanup script help with that?

this bug is making this distinction correctly in xml dumps, just as mediawiki makes that distinction in other places.

Three things: the third is revisions which got screwed up by undeletions or other complications with a user rename. I'm not sure what the exact bug(s) are, but there are some.

(In reply to comment #11)

Three things:

right. and that *could* perhaps be fixed with a maintenance script.

but as far as this ticket is concerned, it doesn't make a difference: if the user ID is 0, put the username into <ip> tags only if it's a valid IP address.

Ah, I meant to suggest rather that some of those can't be fixed by a maintenance script actually :-D Not without knowing for sure what the right fix is for each such revision, and I don't think we can know that. Anyways, as has now been said to death on this bug, if it is exactly a valid ip and has rev_user 0 it goes into ip tags.

gleim wrote:

(In reply to comment #6)

Rüdiger:

Would you mind making your script available as an attachment here, so that
users of the dumps can make these corrections until we have a patch approved
and deployed?

I am using an optimized MySQL representation with code which would not be of much use for others :-(.

Patched in rev 103448. Note that in the case where we conclude that the username is really a username, we still write out a 0 uid since that's the value in the db.

(In reply to comment #15)

Note that in the case where we conclude that the
username is really a username, we still write out a 0 uid since that's the
value in the db.

That is only true for imports, so that should be no problem.