Page MenuHomePhabricator

Create index for agent_id columns in the CheckUser result tables
Open, MediumPublic3 Estimated Story Points

Description

In order to make user agents searchable from the CheckUser interface, we will need to make it possible to search for rows in the CheckUser result tables by their agent_id column (which were added in T361140).

Related Objects

StatusSubtypeAssignedTask
Resolved TBolliger
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
ResolvedDreamy_Jazz
OpenNone
ResolvedDreamy_Jazz
ResolvedDreamy_Jazz
Resolved Marostegui
OpenNone
OpenNone
Resolved Marostegui

Event Timeline

DannyH set the point value for this task to 3.Oct 11 2016, 9:37 PM
DannyH moved this task from Needs Discussion to Up Next (June 3-21) on the Community-Tech board.
Huji raised the priority of this task from Low to Medium.Oct 12 2016, 2:08 AM

@Huji mentioned in T147895#2768090:

"If we want a new index it should probably be a multi-column index (both IP and UA)."

@DannyH in the 10 months that have passed since T147895, I have actually changed my opinion a bit: there are cases where a user-agent is so unique that you would want to search for it everywhere (i.e. with no IP restriction).

Gladly, partial indexes can help in situations like that. So I think an index on user_agent, ip_hex would be ideal because it can also be used as a partial index if IP is not provided.

there are cases where a user-agent is so unique that you would want to search for it everywhere (i.e. with no IP restriction).

This is more often than not the use case for me. If it's limited to a particular range or specific IP, I just query for that and use the browser's "find in page" feature to identify instances of the user agent I'm looking for.

Huji renamed this task from Create index for user agents in cu_changes table to Create index for cu_agents in cu_changes table.Feb 24 2018, 7:56 PM

Change 414120 had a related patch set uploaded (by Huji; owner: Huji):
[mediawiki/extensions/CheckUser@master] Create index for cu_agents in cu_changes table

https://gerrit.wikimedia.org/r/414120

@jcrespo can you please take a look at this as well?

Hey @Huji - we, DBAs, are a bit overwhelmed lately with lots of unexpected fires and requests, so it might take sometime until we can have a proper look at this.
This doesn't mean will not take a look, but I am basically setting some expectations :-)
If this is really really super urgent let us know so we can try to see if we can make some room for this.

Thanks and sorry for the inconveniences.

@jcrespo said on gerrit: "The idea seems sane, but I am concerned about the number of indexes this table has- it amounts to half of its total size right now (5 out of 10GB)."

I wonder what alternatives we have in a situation like this. Best be discussed here on Phab.

well, what I mean is that for short term this can be deployed as is, but tables with many indexes and trying to apply all kinds of filters normally need more advanced search technology like elastic search (e.g. the data could be kept canonically on the database, but indexed on the specialized search engine). Of course I understand that is not a small task, nor it is for long-term, but that is how things like wikidata item search (in addition to the general search) and other parts of mediawiki have been moved to.

Understandable. The question is, is the CU tool worth these extra measures? It is used by a very small community of users, sporadically, so a slight latency might be okay and may be we shouldn't even put an index in place. Also, the data set is so small (compared to Wikidata or the Wikipedia pages) that maybe an index is a more cost-effective strategy than elastic search? I don't know how to call the shots on that, or who should.

This can be a hash index, it's fast and not heavy or large but you can get only exact match, I still think it's okay and very needed (as a CU).

For doing a hash match, you need to store the hashed UA as a new column in the table, correct? There will be some space implications for it (I think CU tables on WMF production servers are in the 20GB range now, and the largest field in them is the UA field). That being said, I am okay with an interim solution that consists of adding a hashed UA field, indexing it, and allowing exact match searches. That will probably be easier to implement, and it will help us demonstrate several use cases of UA lookups, which would hopefully facilitate moving this task and its parent along.

For doing a hash match, you need to store the hashed UA as a new column in the table, correct? There will be some space implications for it (I think CU tables on WMF production servers are in the 20GB range now, and the largest field in them is the UA field). That being said, I am okay with an interim solution that consists of adding a hashed UA field, indexing it, and allowing exact match searches. That will probably be easier to implement, and it will help us demonstrate several use cases of UA lookups, which would hopefully facilitate moving this task and its parent along.

Hash values are way smaller than UAs, otherwise what's the point of having them at all? Also 20 GB is not much for enwiki for example.

@jcrespo @Marostegui: Is a hash value column and index on that would work for you? Which hashing do you prefer?

@jcrespo Given the analysis you already did on T212092#4934152 would you recommend that we go ahead with creating this index? (If I understand your comment there, the index would be beneficial for a search only on agent and timestamp, with no restrictions on IP).

If yes, would that change your opinion about https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CheckUser/+/414120/ which your previously were not a fan of?

@Huji I would need to see the code changed, the proposals I gave at T212092 are different from the one shown at https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CheckUser/+/414120/ . Do you have such code or at least some plan about it somewhere? I would also like to analyze the table and see if all indexes are being used, and drop at the same time those that are not used, if any.

@jcrespo The index you created in T212092#4934152 is exactly the one proposed in https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CheckUser/+/414120/1/cu_changes.sql so I am a bit confused about what exactly you are asking for.

Do you want me to create a patch that would be using this index once created? That is, a patch for T146837? (Happy to do that, though it would be a waste of time if we end up concluding that the index won't be created and the code would take too long with the index)

Or do you want me to create a scenario similar to T212092 for T146837?

Note that, at least for now, I do *not* want to pursue T146837 completely; if we were to do all that task asks for, we would need an index both on IP and UA, because the wireframe proposed there allows for specifying both an IP address and a UA string. For now, though, I would rather we make it an either-or kind of thing (which means the DB will also use either the existing index on the IPs or the proposed one on the UAs). If we ever decide to allow both the UA and the IP to be specified, then we can have the separate discussion about whether the other index would be needed and justified. (It may not be needed, as in most cases the DB would probably use one of the index [the one on IP or the one on UA] to narrow down the results enough that a filesort on the returned rows to apply the other WHERE logic wouldn't take too long).

Aklapper removed Huji as the assignee of this task.Jul 2 2021, 5:22 AM

Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and T270544). Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be very welcome!

(See https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator.)

Change 414120 abandoned by Dreamy Jazz:

[mediawiki/extensions/CheckUser@master] Create index for cu_agents in cu_changes table

Reason:

Due to the move to the abstract schema in CheckUser, this patch would probably need to be largely re-written and so I'm abandoning this due to it's age and that need.

https://gerrit.wikimedia.org/r/414120

Dreamy_Jazz renamed this task from Create index for cu_agents in cu_changes table to Create index for agent_id columns in the CheckUser result tables.Mar 28 2024, 9:30 AM
Dreamy_Jazz updated the task description. (Show Details)