
Should CheckUser track bot edits?
Closed, Declined · Public

Description

Problem
CheckUser currently tracks all changes, even if they are marked with the bot flag. Does including these edits in CheckUser add value or does it increase the noise?

If these edits can be skipped, it will significantly reduce the size of the cu_changes table.

Proposed Solution
If a change is marked with the bot flag, do not insert the change into the cu_changes table.

Event Timeline

Change 583468 had a related patch set uploaded (by DannyS712; owner: DannyS712):
[mediawiki/extensions/CheckUser@master] Don't track edits marked as bot edits

https://gerrit.wikimedia.org/r/583468

Reedy subscribed.

I know bot edits make up a large chunk of edits on most wikis, but it feels a bit odd to make a change like this without any real information or statistics.

Has anyone asked the people with checkuser rights and the stewards?

Do we actually lose anything in terms of useful data? Sure, we still get login events for bots, but what if a bot account goes rogue, is compromised, and ends up hopping between IPs? We lose that data.

Would sampling bot edits be a reasonable compromise? 1 in 10? 1 in 100?

It's not a decision that should be taken lightly, even if it gives performance gains by decreasing the amount of data by a large factor.

A quick poke at enwiki and the recent changes table...

10209779 rows. 1369809 rows where rc_bot = 1... That's only 13.4%

I'm not going to claim 13% is an insignificant amount of rows, but it doesn't seem anywhere near as high as I'd expect...

rc_type in (5,6) (i.e. RC_EXTERNAL and RC_CATEGORIZE) is 4474425 rows... which is just shy of 44% of the rows, and those are already filtered out...
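(For reference, the figures above presumably come from counts of this shape; the same queries appear verbatim for wikidatawiki further down.)

  SELECT COUNT(*) FROM recentchanges;
  SELECT COUNT(*) FROM recentchanges WHERE rc_bot = 1;
  SELECT COUNT(*) FROM recentchanges WHERE rc_type IN (5, 6);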

MariaDB [enwiki]> select count(*) from cu_changes where cuc_user > 0 AND cuc_user IN (select user_id from user INNER JOIN user_groups ON (user_id=ug_user) WHERE ug_group='bot');
+----------+
| count(*) |
+----------+
|  2303874 |
+----------+
1 row in set (2.52 sec)

(2303874/10209779)*100 = 22.56% of the rows in cu_changes are bot edits

The cu_changes table is only 5.11G for enwiki (data and indexes). Considering revision_actor_temp is 135G and revision_comment_temp is just under 25G... cu_changes really isn't big, and as it only keeps 90 days' worth of data, it won't grow much unless we stuff more data into it.
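(If anyone wants to reproduce those size figures, here is a minimal sketch against information_schema; data_length and index_length are the standard MariaDB columns, and the numbers will vary by replica.)

  SELECT table_name,
         ROUND(data_length  / POW(1024, 3), 2) AS data_gb,
         ROUND(index_length / POW(1024, 3), 2) AS index_gb,
         ROUND((data_length + index_length) / POW(1024, 3), 2) AS total_gb
  FROM information_schema.TABLES
  WHERE table_schema = 'enwiki'
    AND table_name IN ('cu_changes', 'revision_actor_temp', 'revision_comment_temp');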

I can't say I've ever had a reason to CheckUser a bot, but I *can* imagine situations where it would be useful to be able to do so. Sampling is a good idea, but does the cu_changes table really represent a significant storage cost?


Would it be useful to have info for edits if there was still info for log entries? I.e. what about only ignoring bots' edits, while keeping log entries, etc.?

I also oppose this task, for the following three reasons:

  1. Bot edits are only a fraction of all edits.
  2. They often don't add any noise to CU results (typically, bots edit from specific IP addresses, like those of Labs, that are not used by regular users).
  3. Abuse of bot accounts is a possibility, and throwing away CU data will make that very hard to investigate.

I am aware of at least one incident of misuse by a sysop through Labs in which CU played an important role in the investigation. Details of the case are irrelevant here, and not something I would discuss in a public task anyway.

Bot edits MUST be stored in cu_changes.
In security theory, search spaces should not be restricted.

If a bot is compromised or abused, or a sockpuppeteer is using the same IP address or CIDR range, that may hint that the operator or the bot is also malicious. Also, a bureaucrat can grant the bot flag to any user (including themselves), some wikis have introduced a flood flag, and rollback edits can be marked as bot edits too. Under this patch, none of the actions marked as bot would be stored in cu_changes.

I can't say I've ever had a reason to CheckUser a bot, but I *can* imagine situations where it would be useful to be able to do so. Sampling is a good idea, but does the cu_changes table really represent a significant storage cost?

Would it be useful to have info for edits if there was still info for log entries? I.e. what about only ignoring bots' edits, while keeping log entries, etc.?

I don't think that would be wise, no. If we need to reduce the size of the table, I would suggest using sampling to reduce duplicate rows, i.e. never allow sampling to hide a (user, ip, ua) tuple that is not already present in the table, but if the user has already edited from that IP with that UA X times in the last Y days, then sample 1 of every 10 edits. However, it would confuse the CUs and may hide useful information, so I would not advise it unless it is necessary.
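(To make that sampling idea concrete, here is a minimal SQL sketch of the dedup check it implies, assuming the existing cuc_user / cuc_ip / cuc_agent / cuc_timestamp columns; the X, Y and 1-in-10 figures are the placeholders from the comment above, and the real decision would live in the extension code rather than in SQL.)

  -- Placeholders: @user_id, @ip, @user_agent, @cutoff (now minus Y days, in TS_MW format).
  SELECT COUNT(*) AS recent_matches
  FROM cu_changes
  WHERE cuc_user = @user_id
    AND cuc_ip = @ip
    AND cuc_agent = @user_agent
    AND cuc_timestamp > @cutoff;
  -- If recent_matches >= X, insert only 1 in every 10 such edits; otherwise always
  -- insert, so no (user, ip, ua) tuple is ever hidden from the table entirely.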

Sampling is a good idea, but does the cu_changes table really represent a significant storage cost?

It's very much undefined in the task's purpose what it's trying to solve.

Reduce DB size on disk? (I've not seen the DBAs complaining about it, and at that size on enwiki (sure, wikidata is probably larger), it's far from a storage concern.)

Increase query speed by having fewer rows to potentially look at? Could this be done with better/different indexes, or by storing rc_bot in cu_changes too as an extra filtering criterion?
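(To make that second idea concrete, a hypothetical sketch; the cuc_bot column and the index name are made up for illustration and are not part of any existing schema or patch.)

  -- Mirror rc_bot into cu_changes so checks can cheaply include or exclude bot rows.
  ALTER TABLE cu_changes
    ADD COLUMN cuc_bot TINYINT UNSIGNED NOT NULL DEFAULT 0;
  -- An index that lets IP/range lookups skip bot rows without scanning them.
  CREATE INDEX cuc_ip_hex_bot_time ON cu_changes (cuc_ip_hex, cuc_bot, cuc_timestamp);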

A bureaucrat can grant the bot flag to any user (including themselves), some wikis have introduced a flood flag, and rollback edits can be marked as bot edits too. Under this patch, none of the actions marked as bot would be stored in cu_changes.

I forgot about this. When is a bot not a bot? When it's a user pretending to be a bot...

And of course, bots can choose not to flag their own edits as bot edits if they want...

Oppose this change. I’ve never directly checked a bot but there absolutely have been circumstances where CU data from a bot has been extremely helpful when checking a range or IP.

Increase query speed by having fewer rows to potentially look at? Could this be done with better/different indexes, or by storing rc_bot in cu_changes too as an extra filtering criterion?

Query speed isn’t really an issue. There’s the 5000 edit timeout limitation, but no one is running a bot on the Reliance Jio or Three mobile networks. Ordinarily CU returns results in seconds if not less than a second. I’ve literally never heard a complaint about speed.

Having been around longer than some of the other checkusers, I can say that I *have* checked bots, and that there have been abuse issues and other technical problems identified through these checks. (Examples of technical issues identified in the past: finding that a bot was using a *non-existent* IP address, and finding bots that were using the WMF server addresses, both apparently due to misconfigurations.)

Oppose proceeding with this task. It does not seem that it has been requested by anyone who is actually using the checkuser interface or maintaining the CU databases.

Change 583468 abandoned by DannyS712:
Don't track edits marked as bot edits

https://gerrit.wikimedia.org/r/583468

DannyS712 subscribed.

Declining per the above discussion - @dbarratt, if there is a compelling reason to remove bot info that wasn't raised here, please let us know.

Increase query speed by having fewer rows to potentially look at? Could this be done with better/different indexes, or by storing rc_bot in cu_changes too as an extra filtering criterion?

Query speed isn’t really an issue. There’s the 5000 edit timeout limitation, but no one is running a bot on the Reliance Jio or Three mobile networks. Ordinarily CU returns results in seconds if not less than a second. I’ve literally never heard a complaint about speed.

It should be noted that Anti-Harassment are working on a 2.0 with different queries, which might have different performance requirements

You can't really answer that question on @dbarratt's or Anti-Harassment's behalf as you're not developing it

Just because a query has a 5000-result limit doesn't mean it'll necessarily hit it, nor does it say how many rows it'll search through. It is certainly possible for a query to select/inspect many more rows (because of a lack of an index) and throw many of them away. Similarly, just because a LIMIT 10 exists doesn't mean it'll be a quick query.
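(As an illustration, a made-up range query, not necessarily one CheckUser runs; EXPLAIN shows how many rows MariaDB expects to examine regardless of the LIMIT.)

  EXPLAIN
  SELECT *
  FROM cu_changes
  WHERE cuc_ip_hex BETWEEN 'C0A80000' AND 'C0A8FFFF'   -- e.g. a 192.168.0.0/16 range
  ORDER BY cuc_timestamp DESC
  LIMIT 5000;
  -- The LIMIT caps the result set, not the rows scanned to produce it.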

dbarratt reopened this task as Open. Edited Apr 27 2020, 5:15 PM
dbarratt added a subscriber: Niharika.

I'm going to re-open this for @Niharika's feedback. This actually came out of looking at Wikidata's cu_changes, not English Wikipedia's. Wikidata has way more bot edits than English Wikipedia, and I have no idea whether having those in there adds value or not.

As above, please listen to the people actually using the tools in the wild before making any breaking changes like this.

MariaDB [wikidatawiki]> select count(*) from cu_changes;
+----------+
| count(*) |
+----------+
| 63162149 |
+----------+
1 row in set (1 min 12.83 sec)

MariaDB [wikidatawiki]> select count(*) from cu_changes where cuc_user > 0 AND cuc_user IN (select user_id from user INNER JOIN user_groups ON (user_id=ug_user) WHERE ug_group='bot');
+----------+
| count(*) |
+----------+
| 28895087 |
+----------+
1 row in set (1 min 4.58 sec)

(28895087/63162149)*100 ≈ 45.7%

Are we sure? 22.5 to 45 isn't that much... There's obviously more, but is it a massive amount more?

MariaDB [wikidatawiki]> select count(*) from recentchanges;
+----------+
| count(*) |
+----------+
| 22828968 |
+----------+
1 row in set (29.72 sec)

MariaDB [wikidatawiki]> select count(*) from recentchanges where rc_bot = 1;
+----------+
| count(*) |
+----------+
|  9833424 |
+----------+
1 row in set (1 min 26.22 sec)

MariaDB [wikidatawiki]> select count(*) from recentchanges where rc_type in (5,6);
+----------+
| count(*) |
+----------+
|   145564 |
+----------+
1 row in set (41.25 sec)

Table is ~46% larger than the enwiki one

+---------------------------------------+---------------------------------------------+---------+---------+------------+---------+
| CONCAT(table_schema, '.', table_name) | CONCAT(ROUND(table_rows / 1000000, 2), 'M') | DATA    | idx     | total_size | idxfrac |
+---------------------------------------+---------------------------------------------+---------+---------+------------+---------+
| wikidatawiki.cu_changes               | 63.12M                                      | 9.41G   | 5.03G   | 14.44G     |    0.54 |
+---------------------------------------+---------------------------------------------+---------+---------+------------+---------+

I'm inclined not to change this until it proves to be a problem. It seems straightforward to change these queries should they prove to be a problem in the future.

Ultimately, @Niharika decides the product value, but from the engineering side, keeping bot edits included doesn't seem risky.

Yeah, it definitely needs some evidence/stats/metrics before making any sort of change like this. There are certainly technical things that could potentially be done: additional indexes, additional columns, etc.

It's really not clear what problem this is trying to solve

it will significantly reduce the size of the cu_changes table

The DBAs haven't complained about this, so there's not a problem here IMHO. A ~10G table (~15G with indexes) on wikidata is not an issue.

It's clear that checkusers benefit from having access to this data. I don't see a compelling reason to remove bot edits from CheckUser.
If query performance starts being an issue, especially with Wikidata, we can talk about other ways to solve that problem.

FWIW, if it is ever implemented, it should be a choice, not an all-or-nothing change. In other words, maybe we want not to log bot edits (or only log a random subset) on Wikidata, but for most other wikis we want 100% of the edits to be logged in cu_changes.