Page MenuHomePhabricator

Create index for cu_agents in cu_changes table
Open, NormalPublic3 Story Points

Description

In order to make user agents searchable from the CheckUser interface, we will need to add a new index for the user agent column.

Event Timeline

kaldari created this task.Oct 11 2016, 9:28 PM
kaldari moved this task from Untriaged to Ready for development on the Community-Tech board.
DannyH set the point value for this task to 3.Oct 11 2016, 9:37 PM
Huji raised the priority of this task from Low to Normal.Oct 12 2016, 2:08 AM

@Huji mentioned in T147895#2768090:

"If we want a new index it should probably be a multi-column index (both IP and UA)."

Krinkle moved this task from Backlog to Schema on the MediaWiki-Database board.May 8 2017, 1:21 AM
Huji added a comment.Sep 11 2017, 3:18 PM

@DannyH in the 10 months that have passed since T147895, I have actually changed my opinion a bit: there are cases where a user-agent is so unique that you would want to search for it everywhere (i.e. with no IP restriction).

Gladly, partial indexes can help in situations like that. So I think an index on user_agent, ip_hex would be ideal because it can also be used as a partial index if IP is not provided.

there are cases where a user-agent is so unique that you would want to search for it everywhere (i.e. with no IP restriction).

This is more often than not the use case for me. If it's limited to a particular range or specific IP, I just query for that and use the browser's "find in page" feature to identify instances of the user agent I'm looking for.

Huji renamed this task from Create index for user agents in cu_changes table to Create index for cu_agents in cu_changes table.Feb 24 2018, 7:56 PM

Change 414120 had a related patch set uploaded (by Huji; owner: Huji):
[mediawiki/extensions/CheckUser@master] Create index for cu_agents in cu_changes table

https://gerrit.wikimedia.org/r/414120

Huji added a project: DBA.Feb 24 2018, 7:56 PM
Huji claimed this task.
Huji moved this task from Backlog to Patches in review on the CheckUser board.Feb 24 2018, 8:08 PM
Huji added a subscriber: jcrespo.Feb 25 2018, 3:57 AM

@jcrespo can you please take a look at this as well?

Hey @Huji - we, DBAs, are a bit overwhelmed lately with lots of unexpected fires and requests, so it might take sometime until we can have a proper look at this.
This doesn't mean will not take a look, but I am basically setting some expectations :-)
If this is really really super urgent let us know so we can try to see if we can make some room for this.

Thanks and sorry for the inconveniences.

Marostegui moved this task from Triage to Backlog on the DBA board.Feb 25 2018, 9:20 AM
Huji added a comment.May 8 2018, 3:28 PM

@jcrespo said on gerrit: "The idea seems sane, but I am concerned about the number of indexes this table has- it amounts to half of its total size right now (5 out of 10GB)."

I wonder what alternatives we have in a situation like this. Best be discussed here on Phab.

well, what I mean is that for short term this can be deployed as is, but tables with many indexes and trying to apply all kinds of filters normally need more advanced search technology like elastic search (e.g. the data could be kept canonically on the database, but indexed on the specialized search engine). Of course I understand that is not a small task, nor it is for long-term, but that is how things like wikidata item search (in addition to the general search) and other parts of mediawiki have been moved to.

Huji added a comment.May 8 2018, 7:34 PM

Understandable. The question is, is the CU tool worth these extra measures? It is used by a very small community of users, sporadically, so a slight latency might be okay and may be we shouldn't even put an index in place. Also, the data set is so small (compared to Wikidata or the Wikipedia pages) that maybe an index is a more cost-effective strategy than elastic search? I don't know how to call the shots on that, or who should.

This can be a hash index, it's fast and not heavy or large but you can get only exact match, I still think it's okay and very needed (as a CU).

Huji added a comment.Feb 1 2019, 4:15 PM

For doing a hash match, you need to store the hashed UA as a new column in the table, correct? There will be some space implications for it (I think CU tables on WMF production servers are in the 20GB range now, and the largest field in them is the UA field). That being said, I am okay with an interim solution that consists of adding a hashed UA field, indexing it, and allowing exact match searches. That will probably be easier to implement, and it will help us demonstrate several use cases of UA lookups, which would hopefully facilitate moving this task and its parent along.

For doing a hash match, you need to store the hashed UA as a new column in the table, correct? There will be some space implications for it (I think CU tables on WMF production servers are in the 20GB range now, and the largest field in them is the UA field). That being said, I am okay with an interim solution that consists of adding a hashed UA field, indexing it, and allowing exact match searches. That will probably be easier to implement, and it will help us demonstrate several use cases of UA lookups, which would hopefully facilitate moving this task and its parent along.

Hash values are way smaller than UAs, otherwise what's the point of having them at all? Also 20 GB is not much for enwiki for example.

@jcrespo @Marostegui: Is a hash value column and index on that would work for you? Which hashing do you prefer?

Huji added a comment.Feb 12 2019, 1:58 AM

@jcrespo Given the analysis you already did on T212092#4934152 would you recommend that we go ahead with creating this index? (If I understand your comment there, the index would be beneficial for a search only on agent and timestamp, with no restrictions on IP).

If yes, would that change your opinion about https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CheckUser/+/414120/ which your previously were not a fan of?

@Huji I would need to see the code changed, the proposals I gave at T212092 are different from the one shown at https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CheckUser/+/414120/ . Do you have such code or at least some plan about it somewhere? I would also like to analyze the table and see if all indexes are being used, and drop at the same time those that are not used, if any.

Huji added a comment.Feb 18 2019, 10:23 PM

@jcrespo The index you created in T212092#4934152 is exactly the one proposed in https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CheckUser/+/414120/1/cu_changes.sql so I am a bit confused about what exactly you are asking for.

Do you want me to create a patch that would be using this index once created? That is, a patch for T146837? (Happy to do that, though it would be a waste of time if we end up concluding that the index won't be created and the code would take too long with the index)

Or do you want me to create a scenario similar to T212092 for T146837?

Note that, at least for now, I do *not* want to pursue T146837 completely; if we were to do all that task asks for, we would need an index both on IP and UA, because the wireframe proposed there allows for specifying both an IP address and a UA string. For now, though, I would rather we make it an either-or kind of thing (which means the DB will also use either the existing index on the IPs or the proposed one on the UAs). If we ever decide to allow both the UA and the IP to be specified, then we can have the separate discussion about whether the other index would be needed and justified. (It may not be needed, as in most cases the DB would probably use one of the index [the one on IP or the one on UA] to narrow down the results enough that a filesort on the returned rows to apply the other WHERE logic wouldn't take too long).