
[Microtask generator] Reduce latency / load for tool and external APIs
Closed, Resolved · Public

Description

Existing approaches:

Additional changes to consider:

  • Cache results. This could be at the level of the full output table or individual quality/topic/etc. outputs for each page title
  • Limit number of results that the tool will fetch
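A config-driven cap on fetched results could be wired up roughly as below. This is a minimal sketch, not the tool's actual implementation; the file name `config.json` and keys like `max_articles` and `async_threads` are assumptions invented here for illustration.

```python
import json

# Defaults used when no config file is present. The key names are
# hypothetical -- they illustrate the idea of runtime-tunable limits.
DEFAULT_CONFIG = {"max_articles": 300, "async_threads": 20}

def load_config(path="config.json"):
    """Load tunable limits at tool startup; fall back to defaults."""
    try:
        with open(path) as f:
            overrides = json.load(f)
    except FileNotFoundError:
        overrides = {}
    return {**DEFAULT_CONFIG, **overrides}

def cap_articles(titles, cfg):
    """Truncate the input list to the configured hard cap."""
    return titles[: cfg["max_articles"]]
```

Because the limits live in a file rather than in code, an operator can edit the file and restart the tool to change them, with no deploy needed.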

Stretch:

  • Ability to come back later to get the result (this relates to the cache-results concept, but would mean assigning the output a stable ID, like PagePile does)

Event Timeline

Isaac renamed this task from [Microtask generator] Reduce latency / load for tool to [Microtask generator] Reduce latency / load for tool and external APIs. (Tue, Feb 17, 6:24 PM)
Isaac updated the task description.

I made some updates to reflect where we stand. Thoughts on next steps:

  • I think it's worth putting a hard cap on how many articles will be fetched, even if we set it so high that it won't be reached. This is a useful "switch" to adjust if we hear complaints from folks about the tool being slow, or from LiftWing about our API call volume. I would suggest that this not be hard-coded but instead put in a config file that is loaded when the tool starts. That way we can make very fast changes as needed without deploying a code change -- e.g., someone just logs into the tool, switches the config from a max of 300 down to 50, and restarts the tool. We can make the number of async threads (currently 20) a config variable as well. There might be others, but those are the two that stand out to me.
  • I think it'd be interesting to consider caching. I suspect one use case of the tool is someone sharing an article list that multiple people then enter in the tool to interact with it. Or someone entering a list, deciding they want to add another 10 articles, and repeating with a slightly different input. That is a lot of redundancy in our API calls; some basic caching would greatly reduce load on the external APIs we depend on and speed up the tool's response time. In the table below, I added the different features we might cache and my thoughts on how to approach them. My current suggestion is to start with a cache for the quality/task data (we can expand to topics/countries later). It needs to be keyed on the language + page title, with the revision ID stored as metadata to decide whether to use the cached entry or replace it with fresh data. In theory it doesn't need an expiration date, but to keep the cache size reasonable we probably want to cap it at a certain number of entries. The value then needs to be the quality + feature scores.
| Feature | Cache? | Cost | True expiration | Practical expiration |
| --- | --- | --- | --- | --- |
| Current revision ID (and therefore timestamp) | No | Cheap (part of bulk API call w/ 50 articles) | Only changes if an article is edited | Core data that determines the rest of the features, so important to have up-to-date information |
| Quality score + individual article features | Yes | Semi-expensive (LiftWing call that has no caching) | Only changes if an article is edited | Same (on article edit) -- this is core data and we don't want it to be out-of-date |
| Pageviews | Probably | I'm not sure how expensive the API call is | Changes daily | Not super important, so being out-of-date is not a big deal |
| Topics | Yes | Semi-expensive (LiftWing call that has no caching, though that's being worked on) | Only changes if an article is edited | Changes slowly even with edits for most articles, so not a priority to update |
| Countries | Yes | Semi-expensive (LiftWing call that has no caching) | Only changes if an article or its Wikidata item is edited | Changes slowly even with edits for most articles, so not a priority to update |
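The keying scheme described above (language + page title as the key, revision ID as freshness metadata, a cap on entry count) might look roughly like the following. This is a sketch under those assumptions, not the tool's implementation; the class name and cap are invented here.

```python
from collections import OrderedDict

class QualityCache:
    """Bounded LRU cache keyed on (lang, title). Each entry carries the
    revision ID it was computed for, so scores for since-edited articles
    are treated as misses and replaced with fresh data."""

    def __init__(self, max_entries=10000):
        self.max_entries = max_entries
        self._data = OrderedDict()  # (lang, title) -> (rev_id, scores)

    def get(self, lang, title, current_rev_id):
        entry = self._data.get((lang, title))
        if entry is None or entry[0] != current_rev_id:
            return None  # miss, or the article was edited since caching
        self._data.move_to_end((lang, title))  # mark as recently used
        return entry[1]

    def put(self, lang, title, rev_id, scores):
        self._data[(lang, title)] = (rev_id, scores)
        self._data.move_to_end((lang, title))
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least-recently-used
```

Note that the revision-ID check means the cache never needs a time-based expiration for quality data: an entry is valid exactly until the article is edited, matching the "true expiration" column above.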

@SBisson I'd appreciate your thoughts on the above, specifically on the design of the cache system and any approaches you'd recommend. I've used sqlite for very generic "caches" (but it's really just a database, so it doesn't have nice automatic deletion mechanisms). It looks like diskcache is a possibility too, or the official Redis instance on Toolforge, though that feels like more overhead than we want and would complicate our local dev/testing.
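For the sqlite route, the missing "automatic deletion" can be approximated with an explicit eviction query after each write. A sketch, assuming a modern sqlite (row-value `IN` needs 3.15+); the table and column names are invented for illustration:

```python
import json
import sqlite3
import time

def open_cache(path=":memory:", max_rows=10000):
    """Create (if needed) and open a sqlite-backed cache."""
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS cache (
        lang TEXT, title TEXT, rev_id INTEGER,
        scores TEXT, accessed REAL,
        PRIMARY KEY (lang, title))""")
    return conn, max_rows

def cache_put(conn, max_rows, lang, title, rev_id, scores):
    conn.execute("INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?, ?)",
                 (lang, title, rev_id, json.dumps(scores), time.time()))
    # sqlite has no built-in eviction, so cap the table manually by
    # deleting everything outside the most-recently-accessed max_rows.
    conn.execute("""DELETE FROM cache WHERE (lang, title) NOT IN (
        SELECT lang, title FROM cache ORDER BY accessed DESC LIMIT ?)""",
                 (max_rows,))
    conn.commit()

def cache_get(conn, lang, title, current_rev_id):
    row = conn.execute(
        "SELECT rev_id, scores FROM cache WHERE lang=? AND title=?",
        (lang, title)).fetchone()
    if row is None or row[0] != current_rev_id:
        return None  # miss, or stale (article edited since caching)
    conn.execute("UPDATE cache SET accessed=? WHERE lang=? AND title=?",
                 (time.time(), lang, title))
    return json.loads(row[1])
```

Compared to diskcache or Redis, this keeps the dependency footprint at zero (sqlite3 is in the standard library) at the cost of hand-rolling the eviction logic, which may be the right trade-off for local dev/testing.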

Isaac claimed this task.

Per discussion, we're not going to prioritize this right now. We aren't seeing clear issues with latency, we don't know how much duplicate usage there actually is, and we haven't been told that the load on external APIs is too much. I'm marking this as resolved because we did make some changes, and we can re-open if needed.