Some validators (notably, UserDef) do DB lookups as part of their validation. There should be a way to batch these; otherwise, they can become a bottleneck for batch APIs.
Description
Event Timeline
Which validators need batch lookup support, besides UserDef? Is there a precedent for batch DB lookups? Moving to needs further discussion before considering a possible implementation.
TitleDef with the must-exist flag, probably.
Is there a precedent for batch DB lookups?
As in, at all? For users? For the API?
I think most user-related API modules do it, but by that point parameter validation already happened, so even if the batch lookup would warm the User cache (which might or might not be the case) UserDef doesn't benefit from that.
Moving to bugs and chores.
Layman's summary: We are doing DB lookups sequentially, where they could be optimized as a bulk operation for performance reasons. Marking as medium because it seems like things aren't currently breaking, but performance is generally a priority. Could also be a good first task for understanding the Action API + DB config -- assigning to @AGhirelli-WMF to poke around if/when he joins chore rotation :)
Next steps:
- Initial investigation needed to understand what this might look like in code.
- Create a light proposal (~1 page design doc) for how we might fix to review for the team.
I took some time to do some investigation into this and take the moment also to try to build some knowledge in my mind 😄
So, here is a possible approach (let's discuss about it, maybe I forget something and correct if I'm wrong in some way).
We can try to add an optional preloadForValidation() method to the base TypeDef class (default no-op) that ParamValidator calls before the validation loop. The idea, basically, is that ParamValidator::validateValue(), right before the foreach that iterates over the values list, would call this method passing all the values at once. TypeDefs that need DB lookups can override it to batch-fetch everything upfront and warm their caches, so the individual validate() calls that follow find cache hits instead of going to the DB. TypeDefs that don't need it just inherit the no-op default and nothing changes for them.
This could work because if I'm not wrong the batch infrastructure already exists, we're just not using it during parameter validation:
- For users, ActorStore already has an in-memory cache that gets populated when you do a batch query through UserSelectQueryBuilder. The individual lookups during validation would just hit that cache.
- For titles, LinkBatch already does exactly this — one query for all titles, results stored in LinkCache, including negative caching for pages that don't exist.
The change would be minimal and backwards-compatible: one new optional method on TypeDef, a single call added before the existing loop in ParamValidator, and override implementations in UserDef and TitleDef. The existing validate() contract doesn't change at all.
(When we will reach a kind of deal, I can create a little design doc, with some more technical infos, I wanted just to write a thought at a high level perspective)
Sounds good at a high level. The details might be a bit more fiddly:
- the same API request can contain a mix of user IDs and names so you might need two lookups
- might want a size threshold under which batching is not used - the objects might be in the cache already, but the query is always uncached. At the very least for a batch of size 1 it does not make sense to do it.