
Optimize number of results requested from API
Closed, ResolvedPublic

Description

We currently request 250 results from the API when providing a list of suggested edits. We should investigate optimizing the number of results we request to speed up the time it takes to retrieve the data we need.

Notes

  • Update client-side and server-side code to request a smaller set of tasks (25-50)
  • Retrieve more results when the user gets to the end of the queue, and deduplicate items that are already in the queue (since we don't have continuation in the API)
  • We can show the true number of tasks available (no upper limit of 200)
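
Since the API has no continuation, the dedupe-on-append step above can be sketched client-side. This is a minimal sketch, assuming each task object carries a unique `pageid` field (a hypothetical shape, not necessarily the extension's actual task objects):

```javascript
// Append a freshly fetched batch to the queue, skipping tasks already present.
// Assumes each task object carries a unique `pageid` (hypothetical field name).
function appendWithoutDuplicates( queue, batch ) {
	const seen = new Set( queue.map( ( task ) => task.pageid ) );
	const fresh = batch.filter( ( task ) => !seen.has( task.pageid ) );
	return queue.concat( fresh );
}
```

New tasks keep their fetch order, so the user's position in the queue is unaffected by the merge.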

Event Timeline

@nettrom_WMF / @MMiller_WMF would it be easy for you to find the lowest number of tasks we could fetch that would result in 95% or 99% of users not needing to fetch more tasks in the suggested edits queue? You'd have to look at ordinal_position from the NewcomerTask schema.

The idea is that we request, say, 25 tasks (currently we ask for 250) in the hope that the user won't get to task number 25 in the queue. If the user is clicking through and approaches task number 25, we would make a request to fetch more tasks. The counter would still show 200 (assuming there are more than 200 tasks available for the query).
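
The "fetch more as the user approaches the end" check is simple to sketch. The threshold of 5 remaining cards below is an illustrative choice, not a value from this task:

```javascript
// Decide whether to prefetch the next batch of tasks.
// currentIndex: zero-based position of the card the user is viewing.
// queueLength: number of tasks fetched so far.
// threshold: how few remaining tasks should trigger a prefetch (illustrative).
function shouldFetchMore( currentIndex, queueLength, threshold = 5 ) {
	return queueLength - currentIndex <= threshold;
}
```

For example, with a 25-task queue the check stays false at card 19 (6 tasks remain) and fires at card 20 (5 remain), so the next batch arrives before the user hits the end.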

Reducing the number of tasks fetched means faster interactions as users click through topic/task type filters, and also helps us avoid problems like T272103.

I wonder if keeping the 200 visible size is worth the effort of complicating things with multiple requests? I find it very hard to believe that anyone would want to tab through hundreds of task cards.

If we do want to stitch multiple requests into 200 tasks, we'd either need query continuation for the growthtasks API (and for that we need to handle the offset parameter in the task suggester, and for *that* we'd probably want pseudorandom sorting instead of random), or we could just fetch N random articles multiple times, deduplicate, and cut off when the rate of duplicates gets high.
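
The second option, fetching random batches repeatedly and stopping once duplicates dominate, can be sketched as follows. `fetchBatch` is a hypothetical callback standing in for the API call, and the 0.5 duplicate-rate cutoff is an illustrative value:

```javascript
// Build a task list by repeatedly asking for random batches and deduplicating,
// stopping once a batch is mostly duplicates of what we already have.
// fetchBatch( n ) is a hypothetical callback returning an array of page IDs.
function collectTasks( fetchBatch, batchSize, target, duplicateRateCutoff = 0.5 ) {
	const seen = new Set();
	while ( seen.size < target ) {
		const batch = fetchBatch( batchSize );
		if ( batch.length === 0 ) {
			break;
		}
		const before = seen.size;
		batch.forEach( ( id ) => seen.add( id ) );
		// Fraction of this batch that we had already seen.
		const duplicateRate = 1 - ( seen.size - before ) / batch.length;
		if ( duplicateRate >= duplicateRateCutoff ) {
			break;
		}
	}
	return Array.from( seen );
}
```

When the result pool is small, random batches quickly start returning mostly known IDs, so the cutoff ends the loop early instead of hammering the API for diminishing returns.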

> @nettrom_WMF / @MMiller_WMF would it be easy for you to find the lowest number of tasks we could fetch that would result in 95% or 99% of users not needing to fetch more tasks in the suggested edits queue? You'd have to look at ordinal_position from the NewcomerTask schema.

Actually, probably not worth your time to look this up, we could just use a value like 25 to begin with. The end-user shouldn't notice any difference in any case.

> I wonder if keeping the 200 visible size is worth the effort of complicating things with multiple requests? I find it very hard to believe that anyone would want to tab through hundreds of task cards.

I think these are separate concerns, no? I'm proposing to make the full task queue available to the user if they feel inclined to click through hundreds or thousands of tasks. But that we only fetch this queue in chunks of ~25 tasks at a time.

And I agree that the 200 visible size makes the code more complicated and would be nice to do away with. @RHo / @MMiller_WMF would you be open to doing that? If so, then we could start using the already available total task count number in the pager.

> If we do want to stitch multiple requests into 200 tasks, we'd either need query continuation for the growthtasks API (and for that we need to handle the offset parameter in the task suggester, and for *that* we'd probably want pseudorandom sorting instead of random), or we could just fetch N random articles multiple times, deduplicate, and cut off when the rate of duplicates gets high.

Yeah, the latter would be less work for now.

> I think these are separate concerns, no? I'm proposing to make the full task queue available to the user if they feel inclined to click through hundreds or thousands of tasks. But that we only fetch this queue in chunks of ~25 tasks at a time.

They are different approaches, sure. Paging the results (invisibly to the user) results in no change in functionality, but it's not trivial to implement. Just reducing the queue length to 50 or something is a user-visible change, but it can be done by just changing a number. So I'm wondering if the extra effort is worth it, given that 200 tasks is a lot and it seems unlikely to me that users would actually engage with them. If the issue is that most results are not relevant enough, there are probably better solutions (e.g. keyword-based task search).

@kostajh -- what is it that the user would see for the task count if you do your proposal?

> And I agree that the 200 visible size makes the code more complicated and would be nice to do away with. @RHo / @MMiller_WMF would you be open to doing that? If so, then we could start using the already available total task count number in the pager.

> @kostajh -- what is it that the user would see for the task count if you do your proposal?

The actual number of available tasks for that task type + topic filter combination. If you open your browser's dev tools on Special:Homepage, click on the network tab and look for api.php?action=query you'll find the API call, and if you look at the response you can see the total count. In the example above, it's 775.

image.png (824×2 px, 853 KB)

> I think these are separate concerns, no? I'm proposing to make the full task queue available to the user if they feel inclined to click through hundreds or thousands of tasks. But that we only fetch this queue in chunks of ~25 tasks at a time.
>
> They are different approaches, sure. Paging the results (invisibly to the user) results in no change in functionality, but it's not trivial to implement. Just reducing the queue length to 50 or something is a user-visible change, but it can be done by just changing a number. So I'm wondering if the extra effort is worth it, given that 200 tasks is a lot and it seems unlikely to me that users would actually engage with them. If the issue is that most results are not relevant enough, there are probably better solutions (e.g. keyword-based task search).

Oh, I see. Yes, I'm OK with that too, assuming that a task queue size of 50 is satisfactory from a product perspective.


So to summarize, we are proposing to either 1) reduce the "max queue size" from 200 to something like 50, or 2) show the true number of tasks returned by a set of task type and topic filters, and allow unlimited paging through the queue of tasks. The first is little work; the second is medium effort.

> @kostajh -- what is it that the user would see for the task count if you do your proposal?
>
> The actual number of available tasks for that task type + topic filter combination. If you open your browser's dev tools on Special:Homepage, click on the network tab and look for api.php?action=query you'll find the API call, and if you look at the response you can see the total count. In the example above, it's 775.
>
> image.png (824×2 px, 853 KB)
>
> I think these are separate concerns, no? I'm proposing to make the full task queue available to the user if they feel inclined to click through hundreds or thousands of tasks. But that we only fetch this queue in chunks of ~25 tasks at a time.
>
> They are different approaches, sure. Paging the results (invisibly to the user) results in no change in functionality, but it's not trivial to implement. Just reducing the queue length to 50 or something is a user-visible change, but it can be done by just changing a number. So I'm wondering if the extra effort is worth it, given that 200 tasks is a lot and it seems unlikely to me that users would actually engage with them. If the issue is that most results are not relevant enough, there are probably better solutions (e.g. keyword-based task search).
>
> Oh, I see. Yes, I'm OK with that too, assuming that a task queue size of 50 is satisfactory from a product perspective.
>
> So to summarize, we are proposing to either 1) reduce the "max queue size" from 200 to something like 50, or 2) show the true number of tasks returned by a set of task type and topic filters, and allow unlimited paging through the queue of tasks. The first is little work; the second is medium effort.

My preference is for option 2, to show the true number of tasks. The reason is that it is confusing feedback when people change filters and the count still sits at the relatively low number of 50. This is currently true with the max set at 200, but it seems less noticeable than it would be at 50.
My theory is that more people will filter to relevant interest topics if we show the actual number, and this may lead more people to try a suggested edit.

> show the true size of the tasks returned by a set of task type and topic filters, and allow for unlimited paging through the queue of tasks.

FWIW showing the true size is orthogonal (although as a user I'd approve); if we wanted to keep 200 as an arbitrary cutoff, it would be the same amount of effort. I can't remember whether we only had it to avoid paging, or whether there was some product purpose.

@Tgr -- the product purpose was that we were worried that really large numbers would confuse or intimidate users. For instance, it would be easy on some wikis to select no topics and end up with a queue of 50,000 or more articles.

Anyway, I agree with @RHo. Let's show the true number and have unlimited paging.

Is this work needed for "add a link", or just for general performance? If the latter, do we need to discuss when to do the work?

> Is this work needed for "add a link", or just for general performance? If the latter, do we need to discuss when to do the work?

General performance, and to undo the quirk we introduced with the last round of performance fixes (which is more or less T271993: Homepage - variant D users see un-prompted SE cards change, except it happens in more situations than described in that task).
I think for the purposes of searching for tasks we'll be able to handle link recommendations and template-based tasks the exact same way so the same performance improvement will work for both.

Change 663041 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[mediawiki/extensions/GrowthExperiments@master] [WIP] Reduce number of tasks requested and allow unlimited paging

https://gerrit.wikimedia.org/r/663041

kostajh raised the priority of this task from Low to Medium. Jun 29 2021, 7:27 AM

I'm going to finish up my patch for this, it's long overdue.

Change 663041 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Reduce number of tasks requested and show full task count

https://gerrit.wikimedia.org/r/663041

Some nice improvements in load times and API response times after this change went live:

image.png (1×1 px, 214 KB)

image.png (1×1 px, 195 KB)

That's really cool! 0-1s is usually given as the ideal range for waiting time, and it looks like we just managed to get under one second.
Also I think showing the total number of tasks (vs. the total on the filter dialog footer but max 200 above the task card) made the interface less confusing.
Also, the excluded ID set is a great solution for paging! I was thinking of using search offsets (now in theory possible with T288845: Support pseudo-random sorting in CirrusSearch fixed), but this is so much simpler.
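
The excluded-ID-set idea can be illustrated with a small sketch: each follow-up request carries the IDs already shown so the server can filter them out. The `excludepageids` parameter name here is hypothetical, not the extension's actual API:

```javascript
// Sketch of paging via an excluded-ID set: the next request includes the
// page IDs the user has already seen, so the server never returns them again.
// The `excludepageids` parameter name is hypothetical.
function buildNextRequest( baseParams, shownTasks ) {
	return Object.assign( {}, baseParams, {
		excludepageids: shownTasks.map( ( task ) => task.pageid ).join( '|' )
	} );
}
```

Unlike offset-based continuation, this stays correct even when the server's sort order is random, which is why it sidesteps the pseudo-random-sorting requirement.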

There are a few issues that we should IMO iron out:

  • The total number of tasks is not very accurate. Not related to the patch here, we just didn't notice before as that number wasn't used much. Specifically:
    • If you select multiple topics / task types we make separate search queries for each, and then just sum the results together, which is not very accurate. This gets weird for very low results (e.g. the total count is 5 but actually there are only two cards). We could either subtract the number of deduplicated tasks (I have a patch for that) or make an extra search query for all topics/task types at the same time to get an accurate count.
    • Haven't verified but I think we have a similar issue when filtering the cached task suggester result set, and the solution could be similar too.
    • (In theory we have the same issue for page protection, but I think that very rarely happens in real life, and probably existed with the old paging system too.)
  • The logic for fetching the next 20 tasks gets triggered in some cases where it's not really necessary (because we already have $totalCount tasks, or we already tried to load more tasks and did not find any).
  • T289942: Suggested edits pager is hard to read (this might be language-specific)

Fetching 20 tasks at a time also means there is no point in making more than 20 search requests internally (we make topics × task types requests, so that limit is not so hard to hit). Not sure if our code is currently smart enough to take advantage of this.
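
The overcounting described above (summing per-filter totals counts an article matching several filters more than once) can be corrected by subtracting the duplicates found while merging. A sketch, assuming each per-filter result looks like `{ total, tasks }` with `tasks` an array of page IDs (a hypothetical shape):

```javascript
// Combine per-filter search results: sum the reported totals, then subtract
// one for every fetched task that appeared under more than one filter, so the
// displayed count is closer to the deduplicated queue size.
function combineCounts( results ) {
	const seen = new Set();
	let total = 0;
	let duplicates = 0;
	results.forEach( ( result ) => {
		total += result.total;
		result.tasks.forEach( ( pageid ) => {
			if ( seen.has( pageid ) ) {
				duplicates++;
			} else {
				seen.add( pageid );
			}
		} );
	} );
	return total - duplicates;
}
```

This only accounts for overlap among the tasks actually fetched, so it is an approximation; the extra combined search query mentioned above would give an exact count.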

> That's really cool! 0-1s is usually given as the ideal range for waiting time, and it looks like we just managed to get under one second.
> Also I think showing the total number of tasks (vs. the total on the filter dialog footer but max 200 above the task card) made the interface less confusing.
> Also, the excluded ID set is a great solution for paging! I was thinking of using search offsets (now in theory possible with T288845: Support pseudo-random sorting in CirrusSearch fixed), but this is so much simpler.
>
> There are a few issues that we should IMO iron out:
>
> • The total number of tasks is not very accurate. Not related to the patch here, we just didn't notice before as that number wasn't used much. Specifically:
>   • If you select multiple topics / task types we make separate search queries for each, and then just sum the results together, which is not very accurate. This gets weird for very low results (e.g. the total count is 5 but actually there are only two cards). We could either subtract the number of deduplicated tasks (I have a patch for that) or make an extra search query for all topics/task types at the same time to get an accurate count.

Could you link the patch for that to this task?

Etonkovidova subscribed.

> There are a few issues that we should IMO iron out:
>
> • The total number of tasks is not very accurate. Not related to the patch here, we just didn't notice before as that number wasn't used much. Specifically:
>   • If you select multiple topics / task types we make separate search queries for each, and then just sum the results together, which is not very accurate. This gets weird for very low results (e.g. the total count is 5 but actually there are only two cards). We could either subtract the number of deduplicated tasks (I have a patch for that) or make an extra search query for all topics/task types at the same time to get an accurate count.
>
> Could you link the patch for that to this task?

I did some additional testing, selecting multiple topics and checking whether the counts add up. It seems to be an edge case: I cannot find specific examples now (although I did see them before). The numbers were off when an article has multiple templates or is classified as fitting into different topics (not sure about the latter).

Resolving since the scope of the task is done (checked on testwiki wmf.23 and lang wiki wmf.21).