Page MenuHomePhabricator

Creation of database tables cu_useragent_clienthints and cu_useragent_clienthints_map
Closed, ResolvedPublic

Description

As part T257893: [EPIC] Support User-Agent Client Hints header in CheckUser, CheckUser needs tables to store Client Hint data. The implementation proposed in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/931875 adds two new tables, namely:

  • cu_useragent_clienthints - A table that stores name and value combinations of Client Hint data.
  • cu_useragent_clienthints_map - A many-to-many table between cu_useragent_clienthints and one of cu_changes, cu_private_event or cu_log_event.

Answers to the questions at https://wikitech.wikimedia.org/wiki/Creating_new_tables:

  • Should this table be replicated to wiki replicas (does it not contain private data)? No, as private data is stored.
  • Will you be doing cross-joins with the wiki metadata? There will be no CROSS JOINs in the SQL sense. If this means any JOIN then, JOINs for these tables will occur to the tables cu_useragent_clienthints, cu_useragent_clienthints, cu_changes, cu_private_event and cu_log_event. When generating results for Special:CheckUser and Special:Investigate a JOIN would be made to the logging and comment table.
  • Size of the table (number of rows expected):
    • cu_useragent_clienthints - Minimal as the table stores unique combinations of client hint name and value, with most being shared by browsers. Probably around 1000 rows. Local testing using current versions of Chrome and Edge gives 14 rows, but each version number of a browser adds a new row.
    • cu_useragent_clienthints_map - Very large number of rows expected, but entries are to be removed when the entry they reference is deleted in cu_changes and the column size and count is kept deliberately small to address this. Local testing indicates around 10 rows per entry in cu_changes, cu_private_event or cu_log_event. As such, the row count at most should be roughly 10 times the number of rows in cu_changes on wikidata.
  • Expected growth per year (number of rows): Initially there will be a large increase, but data will be deleted after 3 months as is done for cu_changes so no expected growth after 3 months of full deployment.
  • Expected amount of queries, both writes and reads (per minute, per hour...per day, any of those are ok): At minimum writes will occur to at least one of the tables every time a entry is stored in cu_changes when the browser is Chromium based (i.e. no Firefox or Safari). Initially this would be at least for every successful edit, but is likely to expand to log actions including logging in and out.
  • Examples of queries that will be using the table: (` character removed from queries for code highlighting reasons)
    • cu_useragent_clienthints
      • SELECT cu_useragent_clienthints_id FROM cu_useragent_clienthints WHERE uach_name = 'architecture' AND uach_value = 'x86' LIMIT 1
      • INSERT INTO cu_useragent_clienthints (uach_name,uach_value) VALUES ('fullVersionList','Google Chrome 114.0.5735.199')
    • cu_useragent_clienthints_map
      • SELECT COUNT(*) AS rowcount FROM (SELECT 1 FROM cu_useragent_clienthints_map WHERE uachm_id_type = 0 AND uachm_reference_id = 353 AND (uachm_id IS NOT NULL) ) tmp_count
      • INSERT INTO cu_useragent_clienthints_map (uachm_uach_id,uachm_id_type,uachm_reference_id) VALUES (109,0,'353')
  • The release plan for the feature (are there specific wikis you'd like to test first etc): T341110: Deploy client hints functionality (testwiki, four production wikis and then all wikis)

Event Timeline

Ladsgroup triaged this task as Medium priority.
Ladsgroup moved this task from Triage to Ready on the DBA board.

I will take a look at this soon

Dreamy_Jazz removed Ladsgroup as the assignee of this task.EditedJul 4 2023, 3:15 PM
Dreamy_Jazz assigned this task to Ladsgroup.
Dreamy_Jazz raised the priority of this task from Medium to Needs Triage.
Dreamy_Jazz triaged this task as Medium priority.
Dreamy_Jazz updated the task description. (Show Details)
Dreamy_Jazz updated the task description. (Show Details)

No idea why Phabricator did that... Seems like editing the task at the same time as others causes issues with that.

Example data after one submission:

cu_useragent_clienthints table:

cu_useragent_clienthints_idcu_useragent_clienthints_namecu_useragent_clienthints_value
56architecturearm
57bitness64
58brandsNot.A/Brand 8
59brandsChromium 114
60brandsGoogle Chrome 114
61fullVersionListNot.A/Brand 8.0.0.0
62fullVersionListChromium 114.0.5735.198+
63fullVersionListGoogle Chrome 114.0.5735.198
64mobile0
65platformmacOS
66platformVersion13.4.1

cu_useragent_clienthints_map table (6639 refers to the revision ID, 0 in`cu_useragent_clienthints_map_id_type` column refers to the cu_changes table, where 6639 maps to cuc_this_oldid column.

cu_useragent_clienthints_map_idcu_useragent_clienthints_map_cu_useragent_clienthints_idcu_useragent_clienthints_map_reference_idcu_useragent_clienthints_map_id_type
1135666390
1145766390
1155866390
1165966390
1176066390
1186166390
1196266390
1206366390
1216466390
1226566390
1236666390

Haven't fully reviewed everything (while the structure generally looks good). The naming needs a bit of work.

For fields, it's better to prefix them with an abbreviation of table name, not the full table name (with exception of one-word-tables such as actor). So instead of having a field name such as cu_useragent_clienthints_map_cu_useragent_clienthints_id which is more like name of a German law. Use uachm_clienthints_id and instead of cu_useragent_clienthints_name use uach_name (make sure it's documented in https://www.mediawiki.org/wiki/Database_field_prefixes)

Thanks. The original abbreviation wording was decided against, but the abbreviation you've suggested would be good as neither are words.

For fields, it's better to prefix them with an abbreviation of table name, not the full table name (with exception of one-word-tables such as actor).

We can make this change, but I'd also like to understand the rationale for this. Abbreviations like uachm are hard to read, and you have to mentally map them to an expanded set of multiple words when looking at them.

A couple of reasons:

For example, the query given above:

INSERT INTO cu_useragent_clienthints_map (cu_useragent_clienthints_map_cu_useragent_clienthints_id,cu_useragent_clienthints_map_id_type,cu_useragent_clienthints_map_reference_id) VALUES (109,0,'353')

can turn into

INSERT INTO cu_useragent_clienthints_map (uachm_clienthints_id,uachm_reference_id_type,uachm_clienthints_reference_id) VALUES (109,0,'353')

Which makes you focus on the part that matters (_reference_id_type) in the query that's going to repeat name of multiple fields many times.

you have to mentally map them to an expanded set of multiple words when looking at them.

Instead of mapping them to words, they should be mapped to a table, to me when I see lt_ it maps the link normalization table (whatever it name it may have)

A couple of reasons:

For example, the query given above:

INSERT INTO cu_useragent_clienthints_map (cu_useragent_clienthints_map_cu_useragent_clienthints_id,cu_useragent_clienthints_map_id_type,cu_useragent_clienthints_map_reference_id) VALUES (109,0,'353')

can turn into

INSERT INTO cu_useragent_clienthints_map (uachm_clienthints_id,uachm_reference_id_type,uachm_clienthints_reference_id) VALUES (109,0,'353')

Which makes you focus on the part that matters (_reference_id_type) in the query that's going to repeat name of multiple fields many times.

you have to mentally map them to an expanded set of multiple words when looking at them.

Instead of mapping them to words, they should be mapped to a table, to me when I see lt_ it maps the link normalization table (whatever it name it may have)

Thank you for the explanation! That makes sense.

Updated the column names and should be ready for a re-review on DB structure.

Change 936055 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[operations/mediawiki-config@master] Disable purging of old client hint data by default

https://gerrit.wikimedia.org/r/936055

Change 936055 merged by jenkins-bot:

[operations/mediawiki-config@master] Disable purging of old client hint data by default

https://gerrit.wikimedia.org/r/936055

Mentioned in SAL (#wikimedia-operations) [2023-07-06T20:06:31Z] <thcipriani@deploy1002> Started scap: Backport for [[gerrit:936055|Disable purging of old client hint data by default (T340959 T341076)]]

Mentioned in SAL (#wikimedia-operations) [2023-07-06T20:07:54Z] <thcipriani@deploy1002> thcipriani and dreamyjazz: Backport for [[gerrit:936055|Disable purging of old client hint data by default (T340959 T341076)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-07-06T20:16:40Z] <thcipriani@deploy1002> Finished scap: Backport for [[gerrit:936055|Disable purging of old client hint data by default (T340959 T341076)]] (duration: 10m 08s)

The tables have been added into the main branch of the extension. I presume a patch like https://gerrit.wikimedia.org/r/c/operations/puppet/+/874781/ will be needed?

Yup. I'll make it if noone beats me to it.

Change 938841 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] realm: Add two new private tables of CheckUser

https://gerrit.wikimedia.org/r/938841

I've added the creation of the tables to the Thursday late backport window, so that the SQL files would be on all wikis.

We have to wait a bit :/ making sure private tables don't accidentally get replicated to the cloud is complex.

We have to wait a bit :/ making sure private tables don't accidentally get replicated to the cloud is complex.

Is this time needed meaning the table creation can't be done on Thursday? If so, is there an estimate as to how long that might be (plan currently is to collect the data ASAP to work out what has enough entropy to be displayed to checkusers so it would be useful to know when I can expect to start to collect this data).

If it helps, the tables are unused and will not be written to until a different change is made (to change a config to true on some wikis as part of T341110: Deploy client hints functionality)

yeah, we need to wait for a bit. I can't tell how long. Sorry.

Change 938841 merged by Ladsgroup:

[operations/puppet@production] realm: Add two new private tables of CheckUser

https://gerrit.wikimedia.org/r/938841

Restarted sanitarium of s5 in codfw (db2186:3315). I can see the clienthint tables in SHOW SLAVE STATUS\G: ...%.cu_changes,%.cu_log,%.cu_log_event,%.cu_private_event,%.cu_useragent_clienthints,%.cu_useragent_clienthints_map,...

Progress:

  • s1
    • eqiad
    • codfw
  • s2
    • eqiad
    • codfw
  • s3
    • eqiad
    • codfw
  • s4
    • eqiad
    • codfw
  • s5
    • eqiad
    • codfw
  • s6
    • eqiad
    • codfw
  • s7
    • eqiad
    • codfw
  • s8
    • eqiad
    • codfw

@Dreamy_Jazz I restarted the script in s5, so you can start writing to any small wiki there https://noc.wikimedia.org/conf/dblists/s5.dblist

Thanks @Ladsgroup. Do the tables need to be created on s5 / s3 (just wondering as they have been added to the private list, but didn't see a notification here about the tables being created)?

The tables still have to be created. It is usually done by the deployers https://wikitech.wikimedia.org/wiki/Creating_new_tables

Thanks. I'm going to presume that the tables cannot be created on any wikis other than those on s3 / s5.

As such, I will create the tables on just testwiki in the backport window in a few minutes (as that is on s3).

Tables created on testwiki. I intend to enable client hints on testwiki in the UTC late backport window that happens today.

s2 is also done now.

Tables created on testwiki. I intend to enable client hints on testwiki in the UTC late backport window that happens today.

Did this happen?

s2 is also done now.

Tables created on testwiki. I intend to enable client hints on testwiki in the UTC late backport window that happens today.

Did this happen?

Yes. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/941954

Awesome. It doesn't show up in toolforge \o/

All sections are done except s1, s4 and s8 (enwiki, commonswiki, wikidatawiki). I'll do that next week.

This is done everywhere. I did a getting replication status on all sanitarium instances both in eqiad and codfw and all have the tables to filter.

Feel free to move forward with creating the tables.

Tables created on all wikis:

13:14 TheresNoTime: `[samtar@mwmaint1002 ~]$ foreachwiki sql.php /srv/mediawiki-staging/php-1.41.0-wmf.20/extensions/CheckUser/schema/mysql/cu_useragent_clienthints_map.sql` for T258105
13:09 TheresNoTime: `[samtar@mwmaint1002 ~]$ foreachwiki sql.php /srv/mediawiki-staging/php-1.41.0-wmf.20/extensions/CheckUser/schema/mysql/cu_useragent_clienthints.sql` for T258105

I can see no further tasks to perform. Closing.