Page MenuHomePhabricator

Expose spamblacklist log type on wiki replica servers
Closed, ResolvedPublic

Description

In order to maintain the meta spam blacklist, I need access to the 'logging' table in the database where log_type is 'spamblacklist'.
This was possible for several months at tool server:

$ sql dewiki
MariaDB [dewiki_p]> SELECT `log_timestamp`, `log_namespace`, `log_title`, `log_comment`, `log_params`, `log_user_text` FROM `logging` WHERE `log_type` = 'spamblacklist' AND log_params regexp '(?:some spam url part)' ORDER BY `log_timestamp`;

But now those table entries seem to be invisible, so

SELECT DISTINCT log_type FROM `logging`;

does not list 'spamblacklist'.

I know that this information should be visible to admins only, but being admin at meta I need this information to maintain the global SBL and get rid of old entries.

A more detailed example, why I need access can be seen at https://meta.wikimedia.org/wiki/Talk:Spam_blacklist#Expert_maintenance (at that time I had success to this information)

What can we do about this?

Event Timeline

Umherirrender added subscribers: Bawolff, Andrew, bd808 and 2 others.

The spamblacklist log is not public data and therefor not viewable on toollabs (T178052#3684930).

You have to use the api's list=logevents to get the data, but that does not allow to search for an url.

One solution could be a new anon view without user data and only the url (in log_params) and timestamp, maybe ns + title

The spamblacklist log records were deliberately removed in T178052: pagetranslation log_type missing on replicas as non-public information that should not have been present in the Wiki Replica databases. You will need to use the Action API and an authenticated account with proper rights to see them.

I'm afraid, bdb808's answer does not help here, because the first part is just a repetition of what has been said already. And the API will not help, as Umherirrender already said.

The spamblacklist log is not public data and therefor not viewable on toollabs (T178052#3684930).

yes, it's a pity.

You have to use the api's list=logevents to get the data, but that does not allow to search for an url.

Correct. Even if I were admin everywhere, it would take several years to generate statistics like mentioned above (that took only minutes or maybe a few hours)

One solution could be a new anon view without user data and only the url (in log_params) and timestamp, maybe ns + title

That would indeed be sufficient for most of the cases.

You have to use the api's list=logevents to get the data, but that does not allow to search for an url.

Correct. Even if I were admin everywhere, it would take several years to generate statistics like mentioned above (that took only minutes or maybe a few hours)

The Action API could be extended to support your use case. ApiQueryLogEvents implements the list=logevents action. This is the only safe place to expose information which requires special rights on-wiki to view.

One solution could be a new anon view without user data and only the url (in log_params) and timestamp, maybe ns + title

That would indeed be sufficient for most of the cases.

Scraping data out of the serialized PHP data contained in the log_params field in a MariaDB view would be difficult at best. Performance would certainly be horrible if this artificial field was then expected to be searchable with SQL.

One solution could be a new anon view without user data and only the url (in log_params) and timestamp, maybe ns + title

That would indeed be sufficient for most of the cases.

Scraping data out of the serialized PHP data contained in the log_params field in a MariaDB view would be difficult at best. Performance would certainly be horrible if this artificial field was then expected to be searchable with SQL.

I'm also not sure how good an idea this would be in general. Isn't the entire point the log is restricted so that non-admins cannot access this info?

Scraping data out of the serialized PHP data contained in the log_params field in a MariaDB view would be difficult at best. Performance would certainly be horrible if this artificial field was then expected to be searchable with SQL.

log_search table would probably fit here much better. That is if we were to do something to make this much easier to query.

I still can't imagine we'd allow to search against arbitrary regexes though. Maybe we could search against which spam blacklist rule or something.

One solution could be a new anon view without user data and only the url (in log_params) and timestamp, maybe ns + title

That would indeed be sufficient for most of the cases.

Scraping data out of the serialized PHP data contained in the log_params field in a MariaDB view would be difficult at best. Performance would certainly be horrible if this artificial field was then expected to be searchable with SQL.

I'm also not sure how good an idea this would be in general. Isn't the entire point the log is restricted so that non-admins cannot access this info?

Yes, the log is restricted, but the question is, if the whole log is restricted or the relationship between user and log data. When the whole log is restricted a view cannot added. But when only the user part is restricted, a view without user data could be created and used to restore the necessary data for the requesting user without need of sysop rights on all wikis.
The evaluation of the log type, if whole or just parts are restricted, must be done by security or other peoples.

This would be similar to the watchlist, where the title part of the watchlist is viewable as a watchlist_count.

Scraping data out of the serialized PHP data contained in the log_params field in a MariaDB view would be difficult at best. Performance would certainly be horrible if this artificial field was then expected to be searchable with SQL.

log_search table would probably fit here much better. That is if we were to do something to make this much easier to query.

I still can't imagine we'd allow to search against arbitrary regexes though. Maybe we could search against which spam blacklist rule or something.

When mediawiki core or the extension want allows to search blocked urls (over api or special page or direct on Special:Log) than the log_search could contains such data for the search. The data should be stored in "reversed order" for the domain level as done in externallinks table el_index, than queries could be simple and without regex.

The log_search table is not on toollabs (T85756).

I'm also not sure how good an idea this would be in general. Isn't the entire point the log is restricted so that non-admins cannot access this info?

The evaluation of the log type, if whole or just parts are restricted, must be done by security or other peoples.

@Bawolff is a member of the Security-Team and the person who generally does access reviews for the Wiki Replicas.

Bawolff and I discussed this in person, and we lowered the security level of the logs on-wiki themselves: rESPBe62376620d8a: Make spamblacklist log viewable by logged-in users by default. With that I think we can basically treat these logs as public, and make them visible in the replica dbs.

That sounds great! Do we/I have to trigger anyone? Or do I just have to wait now?

bd808 renamed this task from reading spamblacklist logs to Expose spamblacklist log type on wiki replica servers.Mar 11 2018, 12:59 AM

Change 418710 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] wiki replicas: Add spamblacklist to allowed log types

https://gerrit.wikimedia.org/r/418710

Change 418710 merged by Bstorm:
[operations/puppet@production] wiki replicas: Add spamblacklist to allowed log types

https://gerrit.wikimedia.org/r/418710

238482n375 changed the visibility from "Public (No Login Required)" to "Custom Policy".
This comment was removed by Reedy.
Aklapper raised the priority of this task from Lowest to Needs Triage.Jun 15 2018, 2:25 PM
Aklapper subscribed.