Set $wgTitleBlacklistLogHits = true on WMF wikis
Open, Medium, Public

Description

+++ This bug was initially created as a clone of T23206: Log of title blacklist hits +++

Change merged in extension, needs WMF config update.


Details

Reference
bz66450
Related Gerrit Patches:
operations/mediawiki-config : master: Set $wgTitleBlacklistLogHits = true on all wikis

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:16 AM
bzimport set Reference to bz66450.
bzimport added a subscriber: Unknown Object (MLST).
Legoktm created this task. Jun 10 2014, 8:41 PM

Change 138684 had a related patch set uploaded by Legoktm:
Set $wgTitleBlacklistLogHits = true on all wikis

https://gerrit.wikimedia.org/r/138684

It should be restricted to oversighters, per the privacy policy.

CU, if it does contain private info. Might be best to consult the legal team.

CU is more relevant than OS. So CU.

CC'd James in case LCA wants any comments on this.

Roping Luis in, -1'd the patch turning it on just for now so that we know what it does. Is there a mediawiki page or something else that gives examples (or can someone help explain/point me in the right direction?) I see some suggestion of the IP of newly created users being in the log? Anything else?

The log would show hits of users attempting to create accounts which trigger the TBL (global or local). An example log entry: 20:28, 9 June 2014 99.99.99.99 (talk | block) attempted to create "Account name" (rule: whatever rule prevents it). The formatting would be slightly different but that is the info that would be included. I've requested Pir2 to set up a test instance to confirm.

Unfortunately, this log is needed for the TBL to be usable. Currently there is no way to see what impact the TBL is having, and this would allow CheckUsers at least to confirm the impact of it.

An alternative would be to somehow take out the initiator from the log, listing only the name and which rule blocked it.
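A minimal sketch of that alternative, written in Python rather than the extension's PHP. The `BlacklistHit` record and the `format_entry` and `without_initiator` helpers are hypothetical names for illustration only, not part of TitleBlacklist; the entry wording follows the example entry above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class BlacklistHit:
    """One blocked account-creation attempt (hypothetical record layout)."""
    timestamp: datetime
    attempted_name: str
    rule: str
    initiator_ip: Optional[str] = None  # None when the initiator is not recorded


def format_entry(hit: BlacklistHit) -> str:
    """Render a log line, naming the initiator only if one was recorded."""
    when = hit.timestamp.strftime("%H:%M, %d %B %Y")
    who = hit.initiator_ip if hit.initiator_ip is not None else "(initiator not recorded)"
    return f'{when} {who} attempted to create "{hit.attempted_name}" (rule: {hit.rule})'


def without_initiator(hit: BlacklistHit) -> BlacklistHit:
    """Return a copy of the hit with the initiator removed before it is stored."""
    return BlacklistHit(hit.timestamp, hit.attempted_name, hit.rule, None)
```

Stripping the initiator before the entry is stored, rather than hiding it at display time, means there is nothing sensitive left in the log to leak later.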

In its current form, CheckUser seems to be the only group for which this would be usable.

Copying my gerrit comment here, with some annotations, for the record:
"Yeah, I'm actually a bit [read: very] surprised it [read: original patch for title blacklist log] was merged (I was about to abandon it, frankly, since pagemoves and editing cannot be logged), but I trust Legoktm reviewed it sufficiently. Perhaps it would be possible to only log the account name and not the IP. Another problem might be that when the log is viewed, there is no record (unlike CU log [for when people check others with Special:CheckUser]). Admins definitely should not be allowed to view it, as they can just add .* to the title blacklist and collect list of all IPs and usernames [that are blocked by the blacklist]."

(In reply to Ajraddatz from comment #7)

The log would show hits of users attempting to create accounts which trigger
the TBL (global or local). An example log entry: 20:28, 9 June 2014
99.99.99.99 (talk | block) attempted to create "Account name" (rule:
whatever rule prevents it). The formatting would be slightly different but
that is the info that would be included. I've requested Pir2 to set up a
test instance to confirm.
Unfortunately, this log is needed for the TBL to be usable. Currently there
is no way to see what impact the TBL is having, and this would allow
CheckUsers at least to confirm the impact of it.
An alternative would be to somehow take out the initiator from the log,
listing only the name and which rule blocked it.

I think without the IP completely eliminates the concern, with the IP however is concerning and would want to be limited to just Checkusers (and related, stewards obviously count as well).

The major use case that I've seen listed in the request is to see whether the filter is successful/useful (and potentially I guess if people are trying, and being blocked, from making legitimate accounts). Is that the main use case? For that use case having the IP/requester doesn't actually seem horribly useful and we should only reveal an IP if we have a strong use case for it.

Do the stewards or others believe that the IP/requester would be useful itself? (and what would be the primary use case for it).

(In reply to James Alexander from comment #10)

I think without the IP completely eliminates the concern, with the IP
however is concerning and would want to be limited to just Checkusers (and
related, stewards obviously count as well).
The major use case that I've seen listed in the request is to see whether
the filter is successful/useful (and potentially I guess if people are
trying, and being blocked, from making legitimate accounts). Is that the
main use case? For that use case having the IP/requester doesn't actually
seem horribly useful and we should only reveal an IP if we have a strong use
case for it.
Do the stewards or others believe that the IP/requester would be useful
itself? (and what would be the primary use case for it).

It would be easy to remove the IP from the log. That would mean CUs could not find it if they need it, however. It may be best to remove IPs from the log even if it prevents CUs from finding it, as they could CheckUser any successfully created accounts.

We can already CheckUser any successfully created accounts, so no change from the status quo there.

I would argue it is needed. If I globally blacklist a common string found in usernames being created by an LTA, being able to see the IPs he is using to try and create new accounts could allow for a proactive response - being able to block before they have even made an account to vandalize with. This is especially important if they are creating attack names.

Even modified to not include the IP, the log would be useful to ensure that entries aren't blocking obvious good-faith names. This would also mean that sysops could view it. I'd certainly be fine with either option.

(In reply to PiRSquared17 from comment #11)

It would be easy to remove the IP from the log. That would mean CUs could
not find it if they need it, however. It may be best to remove IPs from the
log even if it prevents CUs from finding it, as they could CheckUser any
successfully created accounts.

In the long run, if it's useful for checkusers, it is probably best going into the checkuser log database anyway (so if they CU'd the IP the attempts to create an account would come up). Without it going into the CU log table it is likely to be less useful just because it's "another" place that would need to be checked.

If I85717770c9885b48f128474aad77833994714778 is merged, then the community and the legal team can choose whether to include IPs in the log.

I am not certain why the privacy policy is coming into play as the priority, when the terms of use sit as equal level. At this point of time there is nothing to indicate which person is making the edits to identify against an IP, and could be considered no different from any normal standard IP edit, especially as we are not recording a username of the person editing, just the IP address and the TBL hit.

Here we are talking about accounts that are hitting the TitleBlacklist, which is bigger than account usernames, and also includes numbers of keywords. So let us then explore a little more ...

Do we chase someone down who uses a(n absolute)? vulgarity in a user name? Generally not, though we can. If we act, it is usually to block and to maybe block the IP for autoblock time.

Do we chase down someone who successfully finds a new variation of a TBL keyword? Between sometimes and probably, and we update the TBL.

Do we chase down an IP address of someone who is creating TBL-like pages with their IP address? No, we just block the IP address, and update the TBL.

So we have the situations of

Scenario A) — Limiting output to stewards and checkusers (both by default) so that they can see the IP address of someone who is 1) accidentally looking to circumvent the TBL, or 2) purposefully looking to circumvent the TBL.

In situation 1) I doubt that we will know who the person is, nor care, nor would take any action. We will not know what other accounts exist unless the IP address is reverse checked, and there is no reason to do so, and for dynamic IP addresses would be pointless.

In situation 2), as with a revealed IP address, we won't necessarily know who they are, any more than with any other set of LTAs, and it won't really matter; we are more interested in terminating the abuse. If it is a known IP address for a vandal, CUs are normally already familiar with it, so nothing new anyway.

Exceptions to this may be where a person's name has been added to the GLOBAL TBL where it has been abused xwiki and added by agreement. To the point at a new wiki that a person creating a new account would have their IP exposed. This alone may be reason to limit an IP address, though such

Scenario B) — We have no IP addresses, and limited to Stewards and Checkusers. We have nothing of value for situations 1 or 2, beyond "gee look someone is maybe abusing ... sit. 1) silly beggars; sit. 2) I hope they don't find a way around ... what a PITA. Heads up in case they do."

Scenario C) — We have no IP addresses and make it more visible to the advanced rights holders (admins +). Sit 1) Gee look! Sit 2) Gee look! Alert! (plus). So we have more vigilance, and probably a lot more people watching the page for not a lot of reason beyond reaction time. Not necessarily a lot of value.

Scenario D) — We have IP addresses and make it more visible to the advanced rights holders (admin+). This has been addressed above as being bad as it could easily be (mis|ab)used by a poor addition. => not reasonable to have.

So to me, it would seem that if we are going to judge this by effectiveness it would be along the lines of
A >> C > B (not D)

So how do I see that the privacy policy does come into play? That would be if we chose Scenario A) then we should only record that IP address temporarily, which would be up to three months (noting that the effectiveness of the IP address is probably only really good for a week to a month anyway).

Though maybe as has been indicated above, that situation C is the case that we then have the ability to have recorded and discoverable through a CU search, though I am not sure what we would have to be able to search in that space. If there is no username created, or no edit, for what are we searching? At this stage the only other means to find something is through a global filter, and at this stage I am unaware of any filter that is limited to checkusers.

AbuseFilter uses the new account name to show hits on account creations in its own log. That means clicking the user name shows a user page with the hint that no such user exists, but there is no problem with IPs. That sounds like the best solution, and it would also allow sysops to see the log.

For CheckUser, the generated log entry should be passed to CheckUser and then saved to that table with the IP used. Then TBL hits by logged-in users can also be found by IP (for page creation, or for account creation from an existing account without bypassing the title blacklist).

The privacy policy comes into play because the privacy policy always comes into play when recording IP addresses. That is non-negotiable.

This isn't to say IPs can never be recorded, but they should only be recorded when (1) there is a clearly stated and described reason for the recording and (2) the code is already written to delete or otherwise handle them after a reasonable retention period (likely 90 days but ideally shorter).

I'm not really clear if #1 is the case here; I didn't raise #2 initially because we're normally pretty good about that, but comment 16 implies otherwise?

Note that it's impossible for this to ever result in an account and IP being linked, since any creation attempts recorded in this log necessarily did not result in an account being created.

I personally think just not recording the IPs at all [see comment 14] may be the best way to go, since having to remove old (90 days) entries in a log seems like something to avoid. Besides, the way the log works, one could just copy all the IPs/usernames at once before it expires. Not very good for privacy.

Another solution would be to just not enable it at all on Wikimedia (i.e. WONTFIX the bug).

(In reply to Jackmcbarn from comment #19)

Note that it's impossible for this to ever result in an account and IP being
linked, since any creation attempts recorded in this log necessarily did not
result in an account being created.

That may be true in theory, but in practice it would most likely be possible to correlate the IPs with creations of accounts around the same time or with similar names.

(In reply to Jackmcbarn from comment #19)

Note that it's impossible for this to ever result in an account and IP being
linked, since any creation attempts recorded in this log necessarily did not
result in an account being created.

If you are indicating that without a reverse CU, then it would not be the case for never, just unusual and rarely.

The [[m:Titleblacklist]] has entries that have user names within them, so if one of those users went to create an account at a new wiki, they would not be able to do so, and it would log in the TBL log.

Even with some small wikis, they copy the Mediawiki:... files from enWP and utilise them at their wiki, and if they copy that TBL, and the user looks to create an account at that new wiki, we are in the same situation.

While such may possibly be remedied by oversight of the logs, I am not sure that there is an oversight capacity of the TBL log.

If you mean that with the IP address we could not identify a user, it may be anywhere between certain and impossible with checkuser, and that is due to the nature of the tool and IP addresses.

(In reply to billinghurst from comment #21)

If you are indicating that without a reverse CU, then it would not be the
case for never, just unusual and rarely.
The [[m:Titleblacklist]] has entries that have user names within them, so if
one of those users went to create an account at a new wiki, they would not
be able to do so, and it would log in the TBL log.
Even with some small wikis, they copy the Mediawiki:... files from enWP and
utilise them at their wiki, and if they copy that TBL, and the user looks to
create an account at that new wiki, we are in the same situation.
While such may possibly be remedied by oversight of the logs, I am not sure
that there is an oversight capacity of the TBL log.
If you mean that with the IP address we could not identify a user, it may be
anywhere between certain and impossible with checkuser, and that is due to
the nature of the tool and IP addresses.

Attempts to autocreate an account aren't logged, so getting an IP that way isn't possible.

I think that the most useful application of this log would be with IPs, thus accessible by CheckUser only (keeping in mind Jackmcbarn's valid observation with which I agree). If we were to disallow any action which could infer IPs to usernames then we'd need to disable anonymous editing entirely.

I don't think that it would be more useful to hide the IPs from the log but make it CheckUser-able. Far better to be able to easily see which IP is trying to create those usernames so that it can be blocked to prevent abuse. Otherwise it would be impossible to check the IP until after an account had been made, and thus we'd be getting back into reactive rather than proactive territory.

Nonpublic information-granting tools are defined as "tool[s] that permits them to view nonpublic information about other users". Jackmcbarn's point here that this would never occur, due to any accounts listed in the log not being created, is important IMO.

That said, if this is too much of a stretch, it would still be good to be able to see the log without IPs so I think that implementing it in some form would be a positive. The TBL is largely useless with no way to see its impact.

(In reply to PiRSquared17 from comment #20)

I personally think just not recording the IPs at all [see comment 14] may be
the best way to go, since having to remove old (90 days) entries in a log
seems like something to avoid. Besides, the way the log works, one could
just copy all the IPs/usernames at once before it expires. Not very good for
privacy.
Another solution would be to just not enable it at all on Wikimedia (i.e.
WONTFIX the bug).
(In reply to Jackmcbarn from comment #19)

Note that it's impossible for this to ever result in an account and IP being
linked, since any creation attempts recorded in this log necessarily did not
result in an account being created.

That may be true in theory, but in practice it would most likely be possible
to correlate the IPs with creations of accounts around the same time or with
similar names.

Going back to where this was originally lodged at https://bugzilla.wikimedia.org/show_bug.cgi?id=1542#c4

As an effective tool to _prevent_ abuse, the tool itself, the logging, and the IP addresses together provide the value for prevention.

Without the IP address, the modified log should be considered a reflective tool that allows you to assess the validity of the use of the tool.

(In reply to Jackmcbarn from comment #19)

Note that it's impossible for this to ever result in an account and IP being
linked, since any creation attempts recorded in this log necessarily did not
result in an account being created.

There are aggressive TBL entries which block innocent user names. They request sysops to create those names afterwards. In this case users get linked to IPs.

(In reply to Jackmcbarn from comment #22)

Attempts to autocreate an account aren't logged, so getting an IP that way
isn't possible.

Not sure that I fully follow, and I don't know the backend code, however, ...

Autocreates are logged, so I am not certain why someone attempting to login to zzWX would not show on the TBL. Such autocreate account creations definitely show through RC and pop in IRC feeds.

(In reply to billinghurst from comment #26)

Not sure that I fully follow, and I don't know the backend code, however, ...
Autocreates are logged, so I am not certain why someone attempting to login
to zzWX would not show on the TBL. Such autocreate account creations
definitely show through RC and pop in IRC feeds.

I meant that autocreate attempts that fail due to the titleblacklist won't be logged.

(In reply to Jackmcbarn from comment #27)

I meant that autocreate attempts that fail due to the titleblacklist won't
be logged.

I can confirm that this is correct as the author of the original code. It specifically does not log in the auto-creation hook. As far as I know, that should prevent these from being logged, but perhaps someone should test it to be sure.

(In reply to Liangent from comment #25)

There are aggressive TBL entries which block innocent user names. They
request sysops to create those names afterwards. In this case users get
linked to IPs.

+1, this is a valid concern, which is another reason admins should not have this data (besides the fact that it violates the privacy policy)

(In reply to PiRSquared17 from comment #28)

I can confirm that this is correct as the author of the original code. It
specifically does not log in the auto-creation hook. As far as I know, that
should prevent these from being logged, but perhaps someone should test it
to be sure.

I did test that (before merging), and it worked as expected.


So I think the general consensus is that we *should* log the IP address and restrict to CU only? (Please correct me if I'm wrong...)

If so, are we going to need a script to remove the IP addresses after 90 days? Should it delete the entire log entry or just the IP address?

(In reply to PiRSquared17 from comment #28)

+1, this is a valid concern, which is another reason admins should not have
this data (besides the fact that it violates the privacy policy)

To be clear, we can log IP addresses under the privacy policy, and even share them publicly, if there is a good reason for it and we do appropriate clean up. I mean, for better or for worse, what we currently do with "anonymous" edits is permitted under the policy! We're just trying not to create new problems. :)

(In reply to Kunal Mehta (Legoktm) from comment #29)

So I think the general consensus is that we *should* log the IP address and
restrict to CU only? (Please correct me if I'm wrong...)

I'm still not clear that I've seen a persuasive rationale that the addresses should be logged, and at least one person has said this won't get used if it isn't integrated into existing workflows. But I'm not super-familiar with the workflows here so I probably shouldn't be the final decision-maker here.

If so, are we going to need a script to remove the IP addresses after 90
days? Should it delete the entire log entry or just the IP address?

Addresses only is fine - those are the potentially identifying information here.
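A rough sketch of what such a cleanup pass could look like, in Python for illustration only. The 90-day period is the figure discussed in this thread; the entry layout (a dict with `timestamp`, `name`, `rule` and `ip` keys) and the `purge_expired_ips` name are assumptions, not the extension's actual schema or a real maintenance script.

```python
from datetime import datetime, timedelta
from typing import List

RETENTION = timedelta(days=90)  # assumed retention period from the discussion


def purge_expired_ips(entries: List[dict], now: datetime) -> int:
    """Blank the IP on entries older than RETENTION; return how many were blanked.

    Each entry is a dict with "timestamp" (datetime), "name", "rule" and "ip".
    The entry itself is kept so the rule's impact can still be reviewed.
    """
    cutoff = now - RETENTION
    purged = 0
    for entry in entries:
        if entry["ip"] is not None and entry["timestamp"] < cutoff:
            entry["ip"] = None  # remove only the potentially identifying field
            purged += 1
    return purged
```

Running the pass repeatedly is harmless: entries that were already blanked are skipped, so it could be scheduled as often as needed.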

The only person who has said this wouldn't be used unless integrated with existing workflows doesn't currently serve in a role which would make use of this. As I said in a post above, there would be no benefit to integrating it with CU - we'd be back to reactive, rather than proactive action since we'd only be able to find the IP if they successfully create an account. Integrating it with CU in addition to providing the IP would be beneficial.

To make perfectly clear the rationale for this: It gives us a unique opportunity to take proactive action while allowing us to confirm that we aren't generating ridiculous false positives using the TBL. It turns the TBL from an unusable extension to something which we can use alongside the abusefilter and spamblacklist to actively prevent abuse. With this, we could stop abuse before it has even happened - especially important when dealing with attack names. Is that a sufficiently-convincing rationale?

(In reply to Ajraddatz from comment #31)

The only person who has said this wouldn't be used unless integrated with
existing workflows doesn't currently serve in a role which would make use of
this. As I said in a post above, there would be no benefit to integrating it
with CU - we'd be back to reactive, rather than proactive action since we'd
only be able to find the IP if they successfully create an account.
Integrating it with CU in addition to providing the IP would be beneficial.

I would question your statement that I would not make use of it as someone who does quite a lot of log reading and checkusering in my job ;).

That said there is no doubt it would be most useful to stewards and other active global users which is why I was interested in their thoughts. If they/you think it would be more actively used as a separate log that is a mark in favor of having that separate log. My concern was mostly that relatively few people would be actively using it and so having it in the CU tables would be more beneficial (having it in the CU tables also allows it to take advantage of the CU automatic self destruct). I agree it would be beneficial integrating it with CU, and I can see some usefulness for the separate log if it would be used.

To make perfectly clear the rationale for this: It gives us a unique
opportunity to take proactive action while allowing us to confirm that we
aren't generating ridiculous false positives using the TBL. It turns the TBL
from an unusable extension to something which we can use alongside the
abusefilter and spamblacklist to actively prevent abuse. With this, we could
stop abuse before it has even happened - especially important when dealing
with attack names. Is that a sufficiently-convincing rationale?

I think it is. To be clear (I know you didn't say this, but I think some did), this current change is ONLY for accounts; edits that trigger the TBL are not logged.

I am not yet completely convinced that there is sufficient need to display the IP to non advanced users though (I would currently think CU/Steward) though I don't have a real issue with the log itself (without IP) being shown to Sysops or others who interact with the TBL/Abusefilter. My biggest worry is the 'ridiculous false positives' part, I can see a lot of cases where I'd be able to piece together IP information on regular users in those cases (though I can certainly see a high desire to want to KNOW when those cases are happening).

(In reply to James Alexander from comment #32)

I would question your statement that I would not make use of it as someone
who does quite a lot of log reading and checkusering in my job ;).

I thought about that after I posted my comment. Certainly no offence meant, and sorry if it was overly confrontational or dismissive. What I know is that I would regularly use it, and I've been trying to drive that point home here :).

My comment was also a bit confusing there. I think the most benefit would come from both having the log display IPs and being integrated with CU. Simply being integrated with CU would be little change from what we have now. In the interim, I think that the log displaying IPs would be best, and then integrating with CU when possible.

I am not yet completely convinced that there is sufficient need to display
the IP to non advanced users though (I would currently think CU/Steward)
though I don't have a real issue with the log itself (without IP) being
shown to Sysops or others who interact with the TBL/Abusefilter. My biggest
worry is the 'ridiculous false positives' part, I can see a lot of cases
where I'd be able to piece together IP information on regular users in those
cases (though I can certainly see a high desire to want to KNOW when those
cases are happening).

I'd agree that CU/steward access would make the most sense. There is definitely some potential for inferencing IPs --> accounts, but no more than what already exists. Most LTAs/spambots use mobile ranges it seems these days, so any check can have the potential for false positives. Same with the ACC interface, or even just looking at the histories of articles that new users have edited. I'd argue that the potential benefits here outweigh the potential risks.

I wonder if it would be possible to somehow give admins access to a redacted version of the logs, and give checkusers the ability to see the IPs.

*** Bug 72905 has been marked as a duplicate of this bug. ***
tomasz removed a project: Shell. Feb 23 2015, 8:02 PM
tomasz set Security to None.

I wonder if it would be possible to somehow give admins access to a redacted version of the logs, and give checkusers the ability to see the IPs.

If it's not possible with Special:Log, create a Special:TitleBlacklistLog or something like that.. but we would probably need a table for that, I guess..

Restricted Application added a subscriber: Matanya. Aug 8 2015, 12:08 PM

Change 138684 had a related patch set uploaded (by Paladox):
Set $wgTitleBlacklistLogHits = true on all wikis

https://gerrit.wikimedia.org/r/138684

I've found the spamblacklist log very helpful and the TBL log would probably be as well, but restricting it to checkusers won't solve the issue. In its current state I think this can't be deployed, as it would give unlogged access to users' IP addresses (the same problem we had with 'abusefilter-private' rights), and that is a no-go even for checkusers.

Teles added a subscriber: Teles.Dec 12 2015, 4:10 AM

Change 138684 abandoned by Legoktm:
Set $wgTitleBlacklistLogHits = true on all wikis

Reason:
I thought I had abandoned this.

https://gerrit.wikimedia.org/r/138684

Restricted Application added a subscriber: JEumerus.Jan 7 2016, 3:29 AM
Meno25 removed a subscriber: Meno25.Feb 19 2016, 5:41 PM

Came across this task a while ago. In my mind restricting the visibility to OS or CU is not an acceptable solution; aside from the issues @MarcoAurelio mentioned in T68450#1656957 (and that such a log would keep an IP-account connection listed indefinitely, while CU only keeps it for 3 months at this time), the scope of this log is much broader than just identifying the IPs behind a blacklisted username.

It also struck me that the issue mentioned in T68450#696668 is potentially a problem for non-Wikimedia sites that use the Titleblacklist but don't bother with CU, to avoid privacy issues altogether.

Thus, I'd recommend using the same solution that autoblock uses to prevent private information leaks and replace the IP of a failed account creation in the TBL with a random ID.
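A minimal Python sketch of that idea, not the extension's actual code: each initiator IP is swapped for a random opaque ID before the entry is rendered, and the IP-to-ID mapping stays private. The names `InitiatorMasker` and `format_hit` and the ID format are hypothetical; the entry wording follows the example log entry given earlier in this task.

```python
import secrets


class InitiatorMasker:
    """Hypothetical helper: hands out one random opaque ID per IP address."""

    def __init__(self):
        # Private mapping of IP -> opaque ID. In a real deployment this would
        # have to live somewhere only privileged, access-logged tools can read.
        self._ids = {}

    def mask(self, ip: str) -> str:
        # Reuse the same ID for repeat hits from one IP, so that patterns
        # stay visible in the log without exposing the address itself.
        if ip not in self._ids:
            self._ids[ip] = "#" + secrets.token_hex(4)
        return self._ids[ip]


def format_hit(masker: InitiatorMasker, ip: str, attempted_name: str, rule: str) -> str:
    """Render a log line that shows the opaque ID instead of the IP."""
    return f'{masker.mask(ip)} attempted to create "{attempted_name}" (rule: {rule})'
```

With this shape, admins could still see that one initiator keeps hitting the same rule, while the address itself never appears in the rendered log.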

As for passing the IP information on to the CheckUser tables, I see potentially two implementations:

  1. If the CU extension tolerates non-existent usernames in its tables, just pass on the log entry into the table like any other log action, but using the disallowed username for the username field.
  2. If that doesn't work, one would need a CheckUser function specifically for failed account creations (and not for actions by existing accounts, as normal CheckUser does). That could also help for T107651

Final consideration: The TBL should not be publicly visible, seeing as it will almost certainly contain a large number of offensive or vandal usernames or pagetitles that were disallowed by the titleblacklist.

Came across this task a while ago. In my mind restricting the visibility to OS or CU is not an acceptable solution; aside from the issues @MarcoAurelio mentioned in T68450#1656957 (and that such a log would keep an IP-account connection listed indefinitely, while CU only keeps it for 3 months at this time), the scope of this log is much broader than just identifying the IPs behind a blacklisted username.

We currently keep logs for many IP address events forever, so is the problem the TBL hits for an account name that was stopped from being created, or for a standard page creation, or is it just a philosophical point?

A CU would only be able to identify accounts actually made by the IP address for three months, thereafter, the IP address cannot identify associated accounts, just the TBL hits.

It also struck me that the issue mentioned in T68450#696668 is potentially a problem for non-Wikimedia sites that use the Titleblacklist but don't bother with CU, to avoid privacy issues altogether.

Since when is usage at a non-WMF site our concern? Our job is to tell them of the issues and the dangers, to make recommendations to improve. We cannot direct or enforce any installation, and they can already have that setting open, and open broadly. Due care, due diligence, information that allows appropriate decision making, security that protects appropriately, and ensure that we do not do evil.

Thus, I'd recommend using the same solution that autoblock uses to prevent private information leaks and replace the IP of a failed account creation in the TBL with a random ID.

Having that as a possible configuration seems valuable, especially as the default; whether it is always desirable may be a different matter. The purpose of the bug request is to improve our defences against vandals, and the incessant vandals/bots/...

As for passing the IP information on to the CheckUser tables, I see potentially two implementations:

  1. If the CU extension tolerates non-existent usernames in its tables, just pass on the log entry into the table like any other log action, but using the disallowed username for the username field.
  2. If that doesn't work, one would need a CheckUser function specifically for failed account creations (and not for actions by existing accounts, as normal CheckUser does). That could also help for T107651

Final consideration: The TBL should not be publicly visible, seeing as it will almost certainly contain a large number of offensive or vandal usernames or pagetitles that were disallowed by the titleblacklist.

I am not certain that this is a reasonable solution. The words appear throughout the wiktionaries, the encyclopaedia, the history of articles, etc. It would also mean that, for the global blacklist, only meta admins could identify a problem reported across WMF, which is undesirable. Most of the worst words in the TBL are now usually regex'd anyway, so the worst risk is that vandals can work around the existing regex.

Also note that the TBL has components like .*WMF.* which I doubt are really that secret; what we want is the prohibition, not the secrecy. TBLs have been in use for a long time, and to this point their abuse is not evidently widespread, especially in the larger scope of abuse that does occur.

Sometimes account creations that were held up by the titleblacklist are overridden by an administrator (due to false positives, say). In such cases, logging an IP of an accountcreation attempt may connect it to the future user of such an account. Which was already mentioned in T68450#696668.

How many are we talking about? Having some metrics would be useful.

Having a deletion/oversight function would overcome this problem from most perspectives.

Agree with Billinghurst. If any IP attempts to create a page that is blocked by an abusefilter, that IP is kept forever in the logs. An IP attempting to create a username (and failing) is no different in that regard. The log could be restricted to sysops, and individual IP --> account connections suppressed out.

So to wrap this up, I think from the consensus we need to:

  1. avoid displaying anons publicly, for this I did T155967, so the log can be displayed to admins
  2. if WMF wants it, remove anons from the log entries after 90 days, but this can be done with a job on the servers
  3. build a way inside the checkuser extension to view disallowed creation attempts on a given account name, for this I did T155969

For this specific task, only 1 and 2 are blockers.
I've made a patch for 1 awaiting review. 2 can only be done by someone at the WMF. 3 is a bit more complex than 1 but looks feasible for anyone interested.

T155967 seems to restrict all IP addresses from title blacklist logs. I have commented there that this is not reasonable, as we have many spam strings which are favoured by IP addresses. We want to be able to cull strings if they are never hit.

Also note that if it is the IP address that is the problem for account creations, not the account creation hit itself, then we could:

  1. restrict all logs to admins ++
  2. for account creations only
    1. munge IPs in logs for admins, though show the regex hit
    2. <90 days display IPs for designated group; 90+ days munge IP addresses
  3. set any log deletion by a separate process (if it is needed)

This keeps the logs intact and, I would think, makes them more useful as a historical and usable log.
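The rules in the list above can be sketched in Python as follows. This is only an illustration under stated assumptions: the group names `sysop` and `ip-viewer`, the function `render_hit`, and the placeholder text are all hypothetical, and "admins ++" is read as "admins plus the designated IP-viewing group".

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Set

IP_WINDOW = timedelta(days=90)      # IPs on account-creation hits shown only within this window
LOG_GROUPS = {"sysop", "ip-viewer"}  # hypothetical groups allowed to read the log at all
MUNGED = "(IP hidden)"              # hypothetical placeholder for a munged IP


def render_hit(ip: str, rule: str, is_account_creation: bool,
               logged_at: datetime, viewer_groups: Set[str],
               now: datetime) -> Optional[str]:
    """Return the log line as this viewer would see it, or None if no access."""
    # Rule 1: the whole log is restricted to admins and above.
    if not viewer_groups & LOG_GROUPS:
        return None
    show_ip = True
    if is_account_creation:
        # Rule 2: for account creations, only the designated group sees the IP,
        # and only while the entry is younger than 90 days.
        recent = now - logged_at < IP_WINDOW
        show_ip = recent and "ip-viewer" in viewer_groups
    who = ip if show_ip else MUNGED
    # The matched rule is always shown, so unused strings can still be culled.
    return f"{who} matched rule {rule}"
```

Because the IP is hidden at display time rather than deleted, the stored entries stay intact; any actual deletion would be a separate process, as in point 3.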

I can also see value in someone like @Jalexander and the WMF security team having access to full IP logs (not time restricted) with a special staff right, when they are chasing down global ban evaders, so that we can determine whether to add the evaders' preferred evasion regex.

@Billinghurst I've clarified the meaning of T155967, it doesn't remove the log entries by IPs, but replaces "IP(talk) matched..." with "An anonymous user matched...".
Regarding your suggestion to display the IP to members of a privileged usergroup, Jforrester said that we needed to log accesses to this data, and the only way to do that is through the checkuser extension.
How I imagine it: in addition to the "Get IP addresses", "Get edits", "Get users" options, there would be a "Get creation attempts" option showing the IPs that attempted to create a disallowed username.
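A rough Python sketch of such an option, assuming a simple in-memory store: every lookup writes an access-log entry before returning any IPs. The class `CreationAttemptStore` and its method names are hypothetical and are not part of the CheckUser extension.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CreationAttemptStore:
    """Hypothetical store of disallowed account-creation attempts."""
    # (attempted username, initiator IP)
    attempts: List[Tuple[str, str]] = field(default_factory=list)
    # (checking user, username looked up, stated reason)
    access_log: List[Tuple[str, str, str]] = field(default_factory=list)

    def record_attempt(self, username: str, ip: str) -> None:
        """Called when the title blacklist disallows an account creation."""
        self.attempts.append((username, ip))

    def get_creation_attempts(self, checker: str, username: str, reason: str) -> List[str]:
        """Return the IPs that tried to create `username`, logging the lookup first."""
        # The access is logged unconditionally, even if nothing is found,
        # so every view of this data leaves a trace.
        self.access_log.append((checker, username, reason))
        return [ip for name, ip in self.attempts if name == username]
```

The point of the design is that there is no code path that returns IPs without also appending to the access log.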

Okay, that makes sense to me now, as it disambiguates the "access_logging" from the "view_logging"; to this point the argument (for me) had been about the restriction, not about noting who did what and when.

As a supplementary question for clarity, will the logging of "get creation attempts" be at the local wiki only? Or will it be at the prime set of "(local)wiki, metawiki, loginwiki", as happens for current creations?

I don't know CentralAuth enough to answer.

I've looked into this and the titleblacklist log entry will only appear at the wiki where the user created the account, not on the central loginwiki.
A centralized log of titleblacklist hits (same for spamblacklist) would certainly be appreciated, but it doesn't look easy to accomplish.

Maybe one could roll such a task together with the one to create a global contributions list, that is to make a dedicated wiki that combines all logs and changelogs of the individual non-private projects. Probably something "epic" in size, though.

revi added a subscriber: revi.Jan 27 2017, 1:53 AM