Page MenuHomePhabricator

Create cu_useragent table
Closed, ResolvedPublic2 Estimated Story Points

Description

A table is needed to de-duplicate user agent strings in the CheckUser tables. This is because, as described in T305930 and T326379, there is a lot of duplication in these columns. This has become even more of a problem since T295073: <Org-Wide Impact> Google Chrome User-Agent Deprecation Impact. For example, on enwiki there are about on average 200 rows to each distinct user agent string value. Some rough calculations suggest that by de-duplicating the column the cu_changes table on enwiki would be several gigabytes smaller.

We cannot use the comment table, because as described in T305930 Old UA strings should be removed. Therefore, we need a table that CheckUser solely controls and can perform delete operations on.


The schema for this proposed table looks like the following:

MariaDB [my_database]> describe cu_useragent;
+-----------+---------------------+------+-----+---------+----------------+
| Field     | Type                | Null | Key | Default | Extra          |
+-----------+---------------------+------+-----+---------+----------------+
| cuua_id   | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment |
| cuua_text | varbinary(255)      | NO   | MUL | NULL    |                |
+-----------+---------------------+------+-----+---------+----------------+

The indexes for this proposed table are as follows:

MariaDB [my_database]> show indexes in cu_useragent;
+--------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
| Table        | Non_unique | Key_name  | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Ignored |
+--------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
| cu_useragent |          0 | PRIMARY   |            1 | cuua_id     | A         |           0 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| cu_useragent |          1 | cuua_text |            1 | cuua_text   | A         |           0 |     NULL | NULL   |      | BTREE      |         |               | NO      |
+--------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
Acceptance criteria
  • Create the cu_useragent table

Related Objects

StatusSubtypeAssignedTask
OpenFeatureNone
OpenFeatureNone
Resolved TBolliger
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
ResolvedDreamy_Jazz
ResolvedDreamy_Jazz
ResolvedDreamy_Jazz
ResolvedMarostegui
OpenNone
ResolvedMarostegui

Event Timeline

Change 1009234 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[mediawiki/extensions/CheckUser@master] Add cu_useragent table

https://gerrit.wikimedia.org/r/1009234

I would welcome DBA feedback on the structure of this new table. Adding DBA tag to this task per https://wikitech.wikimedia.org/wiki/Creating_new_tables.

Dreamy_Jazz set the point value for this task to 2.Mar 6 2024, 12:19 PM

Hi, I think we need to step back and look at the design of the tables more holistically. Give me a bit.

This table has my sign off as long as purgeOldData.php also clean pruned UAs from cu_useragent as well (to avoid race conditions in large wikis, you have to do some magic such as only deleting when pk value is below 99%) otherwise this will grow out of bound and gets filled up with garbage.

[...] as long as purgeOldData.php also clean pruned UAs from cu_useragent as well (to avoid race conditions in large wikis, you have to do some magic such as only deleting when pk value is below 99%) otherwise this will grow out of bound and gets filled up with garbage.

Is this captured in a task? I see it's mentioned in T305930#7854184 too, but would be good to make a specific task (or acceptance criterion in another task) before adding the table.

[...] as long as purgeOldData.php also clean pruned UAs from cu_useragent as well (to avoid race conditions in large wikis, you have to do some magic such as only deleting when pk value is below 99%) otherwise this will grow out of bound and gets filled up with garbage.

Is this captured in a task? I see it's mentioned in T305930#7854184 too, but would be good to make a specific task (or acceptance criterion in another task) before adding the table.

I will file a task.

Change #1009234 merged by jenkins-bot:

[mediawiki/extensions/CheckUser@master] Add cu_useragent table

https://gerrit.wikimedia.org/r/1009234

Suggested QA steps for a local wiki:

  1. Run update.php
  2. Connect to the database for the wiki
  3. List the tables in the database (if using mariadb you can use show tables;, for SQLite you can use .tables)
  4. Verify that the table cu_useragent appears in the list
  5. Inspect the contents of the table by using describe cu_useragent; for mariadb, or .schema cu_useragent for SQLite
  6. Verify that the columns cuua_id and cuua_text are in the table. For example, you should see the following for a MariaDB local wiki:
+-----------+---------------------+------+-----+---------+----------------+
| Field     | Type                | Null | Key | Default | Extra          |
+-----------+---------------------+------+-----+---------+----------------+
| cuua_id   | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment |
| cuua_text | varbinary(255)      | NO   | MUL | NULL    |                |
+-----------+---------------------+------+-----+---------+----------------+

Mentioned in SAL (#wikimedia-operations) [2024-04-02T17:13:41Z] <Dreamy_Jazz> Creating cu_useragent table on WMF wikis - T359312

I don't remember if we discussed this but just to be sure, is this table meant to be public or private?

I don't remember if we discussed this but just to be sure, is this table meant to be public or private?

It is meant to be private.

I don't remember if we discussed this but just to be sure, is this table meant to be public or private?

It is meant to be private.

Right, then please hold on creating the table until we filter it on sanitarium. I will create a task for it and get it done today hopefully

I don't remember if we discussed this but just to be sure, is this table meant to be public or private?

It is meant to be private.

Right, then please hold on creating the table until we filter it on sanitarium. I will create a task for it and get it done today hopefully

I had forgot that this needed to be done. I already created the table on all wikis except labtestwiki. However, no data has been written to the table.

Should the tables be deleted? I'm not sure if deleting would break replication.

The need to do this should be added to https://wikitech.wikimedia.org/wiki/Creating_new_tables so that it's clear this needs to be done.

I don't remember if we discussed this but just to be sure, is this table meant to be public or private?

It is meant to be private.

Right, then please hold on creating the table until we filter it on sanitarium. I will create a task for it and get it done today hopefully

I had forgot that this needed to be done. I already created the table on all wikis except labtestwiki. However, no data has been written to the table.

Should the tables be deleted? I'm not sure if deleting would break replication.

Don't worry about it - it will take care of it but please don't enable writes for now (even if you did by mistake, the data is still not leaked cause there are no views on that table that's why we have those two systems in place :), but anyway, better be safe and wait for the subtask to be completed)

The need to do this should be added to https://wikitech.wikimedia.org/wiki/Creating_new_tables so that it's clear this needs to be done.

It is there:

Request DBA signoff (requests are processed in order, with no specific SLA, if you are working against a deadline contact the DBA manager to negotiate):
Add the following information to the task:
Should this table be replicated to wiki replicas (does it not contain private data)?

I don't remember if we discussed this but just to be sure, is this table meant to be public or private?

It is meant to be private.

Right, then please hold on creating the table until we filter it on sanitarium. I will create a task for it and get it done today hopefully

I had forgot that this needed to be done. I already created the table on all wikis except labtestwiki. However, no data has been written to the table.

Should the tables be deleted? I'm not sure if deleting would break replication.

Don't worry about it - it will take care of it but please don't enable writes for now (even if you did by mistake, the data is still not leaked cause there are no views on that table that's why we have those two systems in place :), but anyway, better be safe and wait for the subtask to be completed)

The need to do this should be added to https://wikitech.wikimedia.org/wiki/Creating_new_tables so that it's clear this needs to be done.

It is there:

Request DBA signoff (requests are processed in order, with no specific SLA, if you are working against a deadline contact the DBA manager to negotiate):
Add the following information to the task:
Should this table be replicated to wiki replicas (does it not contain private data)?

Ah, that's why this was not done as I had taken the signoff in T359312#9614750 and then not listed these points in the task.

Perhaps the need to add the table to the private list, if the table is private, should be added a step before step 8 in the preparation section. But then usually DBA would handle this.

I believe the main issue was that the template wasn't followed - perhaps we need to make it more clear like we do for schema changes which I believe the template format is a bit clearer https://wikitech.wikimedia.org/wiki/Schema_changes#Workflow_of_a_schema_change

Whereas the CREATE table one is a bit hidden in the doc but it does say:

Add the following information to the task:
Should this table be replicated to wiki replicas (does it not contain private data)?

^ This would make us know we need to add it to the private list of tables

Will you be doing cross-joins with the wiki metadata?
Size of the table (number of rows expected).
Expected growth per year (number of rows).
Expected amount of queries, both writes and reads (per minute, per hour...per day, any of those are ok).
Examples of queries that will be using the table.
The release plan for the feature (are there specific wikis you'd like to test first etc).
Move the task to the Triage column on the DBA workboard.

I believe the main issue was that the template wasn't followed - perhaps we need to make it more clear like we do for schema changes which I believe the template format is a bit clearer https://wikitech.wikimedia.org/wiki/Schema_changes#Workflow_of_a_schema_change

Yeah. Making the creating tables process also use similar text to the schema changes one would make it easier to see that adding this information (even if the table has received DBA signoff) is necessary.

dom_walden subscribed.

I checked the new table on MariaDB and SQLite3.

FTR on sqlite3 I see:

CREATE TABLE IF NOT EXISTS "cu_useragent" (
 cuua_id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
 cuua_text BLOB NOT NULL
 );
CREATE INDEX cuua_text ON "cu_useragent" (cuua_text)
;

I couldn't get my local postgres environment to work, so I couldn't see how the table was created there.