
Newcomer tasks: task suggestions fail because of search queries exceeding length limits
Closed, Resolved · Public

Description

CirrusSearch has a limit on search query length (currently 300 characters on Wikimedia wikis) to avoid expensive searches, which are typically made by bots and are pointless or abusive. Task suggestion searches, which take the form hastemplate:<list of templates defining a task type> morelikethis:<list of 5 pages defining a topic, combined for up to 20 topics selected by the user>, can easily exceed that limit, causing the user to get an error.
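For illustration, a combined query of that shape could look like the following (the template and page titles here are invented for the example, not taken from the actual configuration):

```
hastemplate:"Copy edit|Citation needed" morelikethis:"Paris|Berlin|Rome|Madrid|Lisbon"
```

With 5 pages per topic and up to 20 topics, the morelikethis list alone can run to hundreds of characters.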

Possible fixes:

  • increase the search limit for newcomer task queries (we would first have to ensure that these searches are not particularly expensive; since morelikethis works by picking the top 50 representative words from each page and substituting them into the search query, they might well be). This would still leave the remote (API-based) task suggester broken, since we can't reliably identify newcomer task queries there.
  • split into multiple queries. Currently we make up to 5 queries (one per task type); split per topic, that could be up to 100. That could take seconds (worth profiling, at least). A sketch of this approach follows the list.
  • prevent the user from picking too many topics in some way.
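A minimal sketch of the second option, splitting the combined search into one query per topic. All function and variable names here are hypothetical, not the actual GrowthExperiments implementation:

```php
// Hypothetical sketch of per-topic query splitting; names are
// illustrative, not the real GrowthExperiments code.
function buildPerTopicQueries( array $taskTypeTemplates, array $topicToPages ): array {
	$templateClause = 'hastemplate:"' . implode( '|', $taskTypeTemplates ) . '"';
	$queries = [];
	foreach ( $topicToPages as $topicId => $pages ) {
		// One query per topic keeps each search string well under the
		// 300-character limit, at the cost of up to one search round
		// trip per selected topic.
		$queries[$topicId] = $templateClause
			. ' morelikethis:"' . implode( '|', $pages ) . '"';
	}
	return $queries;
}
```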

Event Timeline

Change 563810 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Newcomer tasks: Make a separate search query for every topic

https://gerrit.wikimedia.org/r/563810

Change 563810 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Newcomer tasks: Make a separate search query for every topic

https://gerrit.wikimedia.org/r/563810

For QA (I'm not sure whether this needs QA): before this patch, selecting many topics probably caused an error. Now it should work (though it may cause some slowdown).

When done, this should be moved off the sprint board and the 1.1 board, but I'd like to keep the task open since I want to find a better solution for this in the long term.

Change 564162 had a related patch set uploaded (by Catrope; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@wmf/1.35.0-wmf.14] Newcomer tasks: Make a separate search query for every topic

https://gerrit.wikimedia.org/r/564162

Change 564162 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@wmf/1.35.0-wmf.14] Newcomer tasks: Make a separate search query for every topic

https://gerrit.wikimedia.org/r/564162

When done, this should be moved off the sprint board and the 1.1 board, but I'd like to keep the task open since I want to find a better solution for this in the long term.

Doing that requires solving T243478: Newcomer tasks: fetch ElasticSearch data for search results and T242476: Newcomer tasks: when selecting multiple topics, one topic should not dominate over the others, both of which are currently also worked around via per-topic queries.

Etonkovidova closed this task as Resolved. Edited Jan 23 2020, 5:11 PM.
Etonkovidova subscribed.

Checked in betalabs and in production: selecting all topics and all difficulty levels (as well as less extreme but still extensive queries) does not fail. However, the article counts differ between the topic overlay and the task card counter; often the correct count appears only on the task card and not in the topic overlay. That issue will be addressed in separate tickets.

This is fixed right now, but will become an issue again when we undo the separate-search-query-for-every-topic hack.

FWIW all the ORES topics joined together are about 800 characters. So this starts becoming a problem after around a third of the topics have been selected.

We could also chunk queries so as to use the character limit as fully as possible, but that seems like more effort than it's worth.

kostajh renamed this task from Newcomer tasks: task suggestions fail because of search queries exceeding lenght limits to Newcomer tasks: task suggestions fail because of search queries exceeding length limits. Mar 5 2020, 12:31 PM.
kostajh subscribed.

FWIW all the ORES topics joined together are about 800 characters. So this starts becoming a problem after around a third of the topics have been selected.

Could we define a two-character alias for each topic, and update the API to support mapping those aliases to the canonical names? e.g. bi <-> biography

If we use two-character aliases, we'd need 128 characters for the existing 64 topics.

@dcausse @Gehel @TJones is it possible to handle something like this on the search backend? So instead of articletopic:architecture|philosophy we'd have aliases so you could equally do articletopic:ar|ph?

Similar to, though not as important as, the topic aliases: I was wondering whether it's possible to use a numeric identifier for the hastemplate query, so that instead of hastemplate:Style non encyclopédique (which looks for articles with this template) you could do hastemplate:235874 (the page ID of the template).

Actually, now that I look at this again, I think we could take TERMS_TO_LABELS in ArticleTopicFeature and modify it to use a pipe-delimited notation (e.g. 'bi|biography' => 'Culture.Biography.Biography*'), or we could define a new const ABBREVIATIONS_TO_TERMS, like 'bi' => 'biography', and adjust the relevant code to make use of it (see the sketch below). That's something we (Growth) could potentially do if the Search team agrees with the approach.
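For concreteness, a minimal sketch of the second variant. The const name, entries, and helper are proposals from this comment, not existing ArticleTopicFeature code:

```php
// Inside ArticleTopicFeature (sketch). Hypothetical alias table;
// only a few entries shown.
private const ABBREVIATIONS_TO_TERMS = [
	'ar' => 'architecture',
	'bi' => 'biography',
	'ph' => 'philosophy',
];

private function expandTopicAlias( string $term ): string {
	// Map a two-character alias to its canonical topic term;
	// pass full names through unchanged.
	return self::ABBREVIATIONS_TO_TERMS[$term] ?? $term;
}
```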

That feels like a hack. I'd rather lift or loosen the length restrictions. We talked about that with @EBernhardson (although now I can't find where) and it seemed doable: the length limit is fairly arbitrary and was intended as a stopgap measure against bots making expensive queries. If long articletopic lists aren't expensive, we should be able to exempt them from the length limit.

Change 607290 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/CirrusSearch@master] Exempt articletopic from query length limitations

https://gerrit.wikimedia.org/r/607290

The patch above should exempt articletopic from the length limit (similar to what we do for incategory); the hard limit remains 2048 chars.

The patch above should exempt articletopic from the length limit (similar to what we do for incategory); the hard limit remains 2048 chars.

Nice, thanks very much!

That feels like a hack. ... If long articletopic lists aren't expensive, we should be able to exempt them from the length limit.

A looooooong time ago, when we first put in the query length limit, we broke the Deep Cat tool because it made very long queries. We put in a check to exempt Deep Cat, and that worked okay. (Now deepcat: is a keyword, but that's another story.) So it is technically feasible.

I'm not sure which is the hackier hack—hacking your query so it can get past the safety net, or hacking a hole in the safety net to let it through.

Two thoughts in favor of hacking the query:

  • Depending on implementation, the hole in the safety net could be abused; would articletopic:ar|ph very long expensive query.... sneak through?
  • Would power users appreciate the articletopic:ar|ph syntax? It's easier to type. (But it may be hard to remember—the list of codes will not all be mnemonic if they are 2 letters long.) Or do the likely use cases not include people typing queries manually?

On the other hand, will the list of topics grow until it could hit the limit again?

On the other other hand, if a query did hit the limit, would it be too expensive or nonsensical anyway?

If we use two-character aliases, we'd need 128 characters for the existing 64 topics.

Plus 63 | separators and 13 characters for articletopic:, so it's 204; still under the limit, but getting closer.
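As a quick sanity check on those numbers (topic count as of this discussion):

```php
// 64 topics × 2-char aliases, 63 '|' separators, 'articletopic:' prefix.
$topics = 64;
$length = strlen( 'articletopic:' ) + $topics * 2 + ( $topics - 1 );
echo $length; // 204, under the 300-character limit
```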

That feels like a hack. ... If long articletopic lists aren't expensive, we should be able to exempt them from the length limit.

A looooooong time ago, when we first put in the query length limit, we broke the Deep Cat tool because it made very long queries. We put in a check to exempt Deep Cat, and that worked okay. (Now deepcat: is a keyword, but that's another story.) So it is technically feasible.

We still have this exemption in place for incategory, even though it might no longer be necessary since the addition of the deepcat keyword.

I'm not sure which is the hackier hack—hacking your query so it can get past the safety net, or hacking a hole in the safety net to let it through.

Two thoughts in favor of hacking the query:

  • Depending on implementation, the hole in the safety net could be abused; would articletopic:ar|ph very long expensive query.... sneak through?

The way we now compute the length prevents such abuse: it's handled by the new query parser in Cirrus (only the characters owned by the keyword are exempted; the remaining ones are counted as if the keyword were not used).
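Roughly, the counting rule described above amounts to something like the following. This is an illustration with made-up names, not the actual Cirrus parser code:

```php
// Illustration only: characters "owned" by an exempted keyword (e.g.
// articletopic: and its value) contribute nothing to the length check;
// everything else is counted normally. A hard cap of 2048 characters
// still applies to the full query string.
function effectiveQueryLength( string $query, array $exemptedSpans ): int {
	$exempt = 0;
	foreach ( $exemptedSpans as [ $start, $end ] ) {
		$exempt += $end - $start;
	}
	return strlen( $query ) - $exempt;
}
```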

Change 607290 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Exempt articletopic from query length limitations

https://gerrit.wikimedia.org/r/607290

I think this is all done.