Page MenuHomePhabricator

Create cu_private_event_no_actor table
Closed, DeclinedPublic2 Estimated Story Points

Description

A table to store a subset of rows which currently go into the cu_private_event table is needed. These rows will be for private events performed by IP addresses, as when temporary accounts are enabled we cannot get an actor ID. The proposed schema of this table is:

MariaDB [my_database]> describe cu_private_event_no_actor;
+-------------------+---------------------+------+-----+---------+----------------+
| Field             | Type                | Null | Key | Default | Extra          |
+-------------------+---------------------+------+-----+---------+----------------+
| cupena_id         | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment |
| cupena_namespace  | int(11)             | NO   |     | 0       |                |
| cupena_title      | varbinary(255)      | NO   |     | NULL    |                |
| cupena_log_type   | varbinary(32)       | NO   |     | NULL    |                |
| cupena_log_action | varbinary(32)       | NO   |     | NULL    |                |
| cupena_params     | blob                | NO   |     | NULL    |                |
| cupena_comment_id | bigint(20) unsigned | NO   |     | NULL    |                |
| cupena_page       | int(10) unsigned    | NO   |     | NULL    |                |
| cupena_timestamp  | binary(14)          | NO   | MUL | NULL    |                |
| cupena_ip_hex     | varbinary(255)      | YES  | MUL | NULL    |                |
| cupena_xff        | varbinary(255)      | YES  |     | NULL    |                |
| cupena_xff_hex    | varbinary(255)      | YES  | MUL | NULL    |                |
| cupena_agent_id   | bigint(20) unsigned | NO   |     | NULL    |                |
+-------------------+---------------------+------+-----+---------+----------------+

The indexes in this table:

MariaDB [my_database]> show indexes in cu_private_event_no_actor;
+---------------------------+------------+---------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
| Table                     | Non_unique | Key_name            | Seq_in_index | Column_name      | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Ignored |
+---------------------------+------------+---------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
| cu_private_event_no_actor |          0 | PRIMARY             |            1 | cupena_id        | A         | 0           |     NULL | NULL   |      | BTREE      |         |               | NO      |
| cu_private_event_no_actor |          1 | cupena_ip_hex_time  |            1 | cupena_ip_hex    | A         | 0           |     NULL | NULL   | YES  | BTREE      |         |               | NO      |
| cu_private_event_no_actor |          1 | cupena_ip_hex_time  |            2 | cupena_timestamp | A         | 0           |     NULL | NULL   |      | BTREE      |         |               | NO      |
| cu_private_event_no_actor |          1 | cupena_xff_hex_time |            1 | cupena_xff_hex   | A         | 0           |     NULL | NULL   | YES  | BTREE      |         |               | NO      |
| cu_private_event_no_actor |          1 | cupena_xff_hex_time |            2 | cupena_timestamp | A         | 0           |     NULL | NULL   |      | BTREE      |         |               | NO      |
| cu_private_event_no_actor |          1 | cupena_timestamp    |            1 | cupena_timestamp | A         | 0           |     NULL | NULL   |      | BTREE      |         |               | NO      |
+---------------------------+------------+---------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
Acceptance criteria
  • Create the cu_private_event_no_actor table

Related Objects

StatusSubtypeAssignedTask
ResolvedSecurityZabe
ResolvedSecurityDreamy_Jazz
ResolvedSecurityDreamy_Jazz
ResolvedSecurityDreamy_Jazz
OpenBUG REPORTNone
ResolvedBUG REPORTDreamy_Jazz
OpenBUG REPORTDreamy_Jazz
ResolvedDreamy_Jazz
OpenNone
OpenNone
OpenNone
OpenFeatureDreamy_Jazz
DuplicateNone
OpenFeatureNone
OpenFeatureNone
OpenFeatureNone
ResolvedFeatureDreamy_Jazz
ResolvedFeatureDreamy_Jazz
ResolvedFeatureDreamy_Jazz
OpenFeatureNone
OpenFeatureNone
ResolvedBUG REPORTDreamy_Jazz
OpenFeatureNone
OpenFeatureNone
ResolvedGlaisher
OpenNone
ResolvedFeatureDreamy_Jazz
ResolvedGlaisher
ResolvedNiharika
ResolvedNone
ResolvedFeatureDreamy_Jazz
DeclinedNone
ResolvedFeatureDreamy_Jazz
ResolvedFeatureDreamy_Jazz
DuplicateNone
OpenNone
DuplicateNone
ResolvedDreamy_Jazz
ResolvedDreamy_Jazz
ResolvedDreamy_Jazz
OpenFeatureNone
ResolvedDreamy_Jazz
ResolvedBUG REPORTDreamy_Jazz
ResolvedBUG REPORTDreamy_Jazz
OpenNone
OpenFeatureNone
DeclinedFeatureNone
OpenFeatureNone
ResolvedDreamy_Jazz
Resolvedkostajh
OpenNone
OpenNone
Resolved tstarling
OpenNone
ResolvedTchanders
OpenBUG REPORTNone
OpenFeatureDreamy_Jazz
DeclinedDreamy_Jazz
DeclinedDreamy_Jazz

Event Timeline

Change 1009236 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[mediawiki/extensions/CheckUser@master] Add cu_private_event_no_actor table

https://gerrit.wikimedia.org/r/1009236

Dreamy_Jazz added a project: DBA.

I would welcome DBA feedback on the structure of this new table. Adding DBA tag to this task per https://wikitech.wikimedia.org/wiki/Creating_new_tables. Note that the cupena_agent_id column refers to the table that has been proposed in T359312.

Dreamy_Jazz updated the task description. (Show Details)
Dreamy_Jazz updated the task description. (Show Details)
Dreamy_Jazz set the point value for this task to 2.Mar 6 2024, 12:19 PM

This schema is basically duplicating schema of other tables and it will be quite a pain if we want to make cross table schema changes, e.g. add something for browser fingerprinting, then you have to add the columns to every table. Why not simply make actor nullable in private events table?

This schema is basically duplicating schema of other tables and it will be quite a pain if we want to make cross table schema changes, e.g. add something for browser fingerprinting, then you have to add the columns to every table. Why not simply make actor nullable in private events table?

I did discuss setting the value to 0 in T353953#9584881 and AFAIK setting the row value to NULL still has the same problems as setting the value to 0 (though a value of NULL outside a MediaWiki context may be clearer than 0).

For example, on commonswiki the actor_user column is be NULL for both IP addresses, external users and also for the Conversion script system user, so we would need to ensure the code does not use actor_user being NULL to indicate that the row is an IP address with no actor ID. Instead, we should be able to rely on the actor_name being NULL as that column is not nullable. However, if there is a bug in a CheckUser interface where the LEFT JOIN conditions are wrong, we could hide this problem behind the handling of a NULL actor_user. This could be addressed by checking the value of the cupe_actor column and asserting that the value in it is NULL, but my concern was that we could end up having a lot of extra code to handle these edge-cases.

Furthermore there are some other differences to the schema that I didn't mention in the commit message:

  • This new table also does not need to have the cupe_private column as this is only used when emailing users (and IP addresses cannot do this). I don't think this would affect the size of the table on disk in any significant way (as it seems that MEDIUMBLOB is sized based on the content and not the maximum size).
  • We don't need the index on the actor ID for this new table (the cupe_actor_ip_time column on the cu_private_event table) and as such there is one less index. There are at least a million rows which would be moved over to the new table on enwiki and therefore would no longer have this index.

However, I am happy to explore the idea of setting the column value as NULL if you disagree that the above reasons are enough justification for a new table.

Every new table adds a bit of overhead specially when it's a table in core dbs. For example, this one table means maraidb needs to open 2000 more files in runtime in s3 (the sheer number of files being opened by mariadb in s3 is causing issues and that's why we create new wikis on s5 instead). Also backup time, reporting of backup sizes and etc will get bigger. It's not a lot of overhead but it's not zero either. Lots of times we need to make sure it justifies it (e.g. the UA table definitely justify it). Specially it looks like DRY violation as schema changes are harder to pull off (so again, if we want to add a column for browser fingerprint or any other changes like that, we now have to add it to three or four tables and that's much more painful work than adding it to one or two). The other note is that, I assume there won't be that many rows in this table so the benefits of not indexing on actor id and so on are quite small (tell me if my assumption is wrong).

We could mitigate many of these issues you mentioned. I thought system users get an actor id and user id too (maybe that could be implemented).

The other note is that, I assume there won't be that many rows in this table so the benefits of not indexing on actor id and so on are quite small (tell me if my assumption is wrong).

While I don't know what the "that many rows in this table" threshold is, there is a million rows from failed login attempts on enwiki that would be moved to this new table:

wikiadmin2023@10.192.16.4(enwiki)> select count(*) from cu_private_event where cupe_log_action = 'login-failure';
+----------+
| count(*) |
+----------+
|  1045089 |
+----------+
1 row in set (9.107 sec)

Furthermore, also these rows would be moved too:

wikiadmin2023@10.192.16.4(enwiki)> select count(*) from cu_private_event join actor on actor_id = cupe_actor where cupe_log_action = 'password-reset-email-sent' and actor_user is null;
+----------+
| count(*) |
+----------+
|    42362 |
+----------+
1 row in set (3.112 sec)

We could mitigate many of these issues you mentioned. I thought system users get an actor id and user id too (maybe that could be implemented).

It seems on commonswiki that this specific system user does not have a user ID but has an actor ID. Most system users do get a user ID, but it seems that this is a historical special case.

If you think that the million rows does not outweigh the problems of adding a new table, I will explore using NULL as the value for the column. I think using NULL over 0 will be easier to deal with from a non-MediaWIki context, as to me NULL indicates no ID exists whereas 0 doesn't do that as clearly.

I'm actually more concerned that we have 1m failed login attempt in the past 90 days. (If you catch my drift). In normal times, it should be much lower.

I think going with NULL is a much better option for the simplicity of the infra.

In which case I will close this as declined. Thanks for the feedback.

Change 1009236 abandoned by Dreamy Jazz:

[mediawiki/extensions/CheckUser@master] Add cu_private_event_no_actor table

Reason:

Per DBA feedback, making the cupe_actor column NULLable is better for WMF infrastructure.

https://gerrit.wikimedia.org/r/1009236