Page MenuHomePhabricator

RFC: Expand API title generator to support other generated data
Closed, ResolvedPublic

Description

  • Affected components: Classic MW API generators
  • Engineer(s) or team for initial implementation: Policy change, no implementation. If needed, Search Platform and Structured Data.
  • Code steward: Policy change, no code steward needed.

Motivation

https://www.mediawiki.org/wiki/API:Query#Additional_notes states that "Generators only pass page titles to the query module and do not output any information themselves. Setting parameters like gcmprop will therefore have no effect."
While I understand this POV to some extent, it also undermines the use of generators.

The search API, for example, can be used both as list and generator.
As list, it includes additional search information that it does not when used as generator:
via srinfo: rewrittenquery, suggestion, totalhits
via srprop: categorysnippet, extensiondata, isfilematch, redirectsnippet, redirecttitle, sectionsnippet, sectiontitle, size, snippet, timestamp, titlesnippet, wordcount, hasrelated, score

As generator, the result simply contain mostly the same things as other generators: pageid, ns & title (and index, which indicates the order of the search result)

It essentially means that - if you need any of that search-specific data, if basically becomes useless as a generator because you can't access the data that way.
You're then forced to fetch the rest of the (non-search, other API) properties you're interested in via a separate request, the old fashioned way.

In recent months, while building out an alternative search results UI for Commons, we have had 3 use cases where we could really have used the additional search data:

  • the matched search snippet (srprop=sectionsnippet|titlesnippet|snippet)
  • the word count (srprop=wordcount)
  • the "did you mean" term suggestion (srinfo=suggestion)

This would only be a change in policy and would not require significant changes in code (see this patch - already possible with existing infrastructure and would not change or take away existing functionality or data.)
In a way, the search API is already doing this with the addition of the 'index' key to convey the order of search results.

Requirements
  • https://www.mediawiki.org/wiki/API:Query#Additional_notes is updated to reflect that "Generators always pass page titles to the query module. Unlike lists (which may include additional data by default), generators should not output any information themselves, unless when explicitly requested via the generator module's query parameters."

Exploration

  • If generators are allowed to start adding additional data, should they default to no data at all (and require all props to be requested explicitly) or use the same existing set of (non-generator) defaults?

Event Timeline

Krinkle renamed this task from RFC: Allow generators to return data instead of just generating titles to RFC: Expand API title generator to support other generated data.Sep 30 2020, 6:24 PM

One potential question, is this going to be a generic implementation where all generators suddenly expose their properties, or are we expecting to implement this per-generator where it's requested?

CBogen subscribed.

Hi TechCom - the SD and Search teams are looking for feedback on this and would love to move it forward by the end of this quarter (Q2 2020-2021). We'd love some help moving it through the process. Thanks!

@CBogen I note that "Code steward: TBD" is still empty, which is expected to be determined during Phase 2.

I note that "Code steward: TBD" is still empty, which is expected to be determined during Phase 2.

@Krinkle I believe that's the Core Platform team that owns the mw api. Not sure who the best person is to confirm, maybe @EvanProdromou or @Naike
can help?

Generators are allowed to add additional data, they are just discouraged from doing so. (Plus it's probably undocumented.) See ApiPageSet::setGeneratorData.

It is indeed perfectly possible to do it already, but https://www.mediawiki.org/wiki/API:Query#Additional_notes states that "Generators only pass page titles to the query module and do not output any information themselves."
This position has prevented at least https://gerrit.wikimedia.org/r/c/mediawiki/core/+/394120 from moving forward a few years ago.
If generators are allowed (but not required/without enforced consistency) to add additional data, then it's simply a matter of updating https://www.mediawiki.org/wiki/API:Query#Additional_notes to reflect that :)

I have not dug into this yet, but I recall that generator logic is complicated by two factors:

  • It has to work efficiently with continuations. Search doesn't really have efficient continuations anyway, but for other kinds of queries, this is significant.
  • It's mix and match, the data added by the generator has to match the structure of the result generated by the query module.

These are the only two issues I see with allowing that. Have they been considered/addressed?

@Krinkle I believe that's the Core Platform team that owns the mw api. Not sure who the best person is to confirm, maybe @EvanProdromou or @Naike
can help?

By the "new API" you mean the core REST API? Yes, but that doesn't have generators.

Ownership of the action API infrastructure isn't clear cut. I suppose PET (aka core platform) inherited it, but we currently have nobody on the team who is intimately familiar with it. The generator logic is probably the most complex part of that part pf the code.

In terms of process: if you are proposing something something as an RFC that you are not going to do yourself, but would need to be done by another team, you need to get at least preliminary commitment from that team that they are willing to spend time working on this. Otherwise, an RFC may be discussed and accepted, without anybody actually implementing it.

Dropping this back to phase 2 until the resourcing is clear.

This RFC is only about a change in policy of the action API.

I am more than happy to change the wording on https://www.mediawiki.org/wiki/API:Query#Additional_notes to reflect that generators *are* allowed to add their own data.
And both Search Platform & Structured Data teams will be happy to update the search API (since the lack of own data as generator has already been a blocker for multiple feature) in accordance to the outcome of this RFC.

Current policy explicitly disallows this, but:

  • there already are generators that add their own data (e.g. T263841#6586697)
  • without an ability to add own data, the point of using generators is sometimes lost (e.g. search, if you want search-related extra output along with other props)

That said, if further discussion of this leads to the policy change takes another step forward toward enforcing certain additional behavior across the whole range of existing generator APIs (requiring them all to be updated), then that's out of scope for this RFC and cause for denying it, as there is no steward to see those changes through.

I don't think that note is any kind of policy; in any case, it only says the normal prop parameters don't work. The inline docs for ApiPageSet::setGeneratorData() say "This should generally be limited to data that is likely to be particularly useful to end users rather than just being a dump of everything returned in non-generator mode." so if you feel this is something that's generally useful and important for search results, you should just go ahead and add it. (Just remember there will be no way to turn it off and it will increase response size for people who don't need it, so it should be limited to important data, and ideally to small data.)

If the proposal is about making generators not ignore prop parameters, that's worth an RfC, but I think every query API module would have to be updated to implement it.

At some point in the past, https://gerrit.wikimedia.org/r/c/mediawiki/core/+/394120/1#message-f61538d02dc5612b011cd177eaaa59cc4255fc6a got blocked over this, stating that it would require a change in policy/RFC.
That was 3y ago, though, so I'm more than happy to go ahead and just do it, if we no longer feel like that is true.

I certainly agree that the existing evidence (counter-examples and conflicting documentation) certainly advocate against the current notes on https://www.mediawiki.org/wiki/API:Query#Additional_notes - can we update that to reflect reality?

It looks like Roan wrote the language in question, so I guess we could just ask how he meant it, but to my eye too it reads as a description of how generators behave (or did at the time it was written) rather than a prescription of how they should behave. I think there are good reasons (as suggested in T263841#6590941) for it to be treated as policy in general, but to make an exception at least for search, where the data with which we're interested in supplementing the query results doesn't consist of properties of the result pages but rather is produced by the search itself. So, I'd support updating that language to better reflect present reality and to clarify that search data may be included in the query results even in generator mode. As a first draft, something like:

Additional notes
[...]

  • In general, generators only pass page titles to the query module and do not output any information themselves. Setting parameters like gcmprop will therefore have no effect.
  • Queries using search results as a generator may produce additional search data in their results, such as the index of the page in the search result set.

I think there are good reasons (as suggested in T263841#6590941) for it to be treated as policy in general

I don't think those two things specifically are problematic. Continuation is not affected much by the extra data (it can force continuation to happen earlier if the response exceeds the allowed size limits, but that can happen with prop modules just the same as generators), and adding to the module just means ensuring that the field names used in ApiPageSet::setGeneratorData() do not conflict with the field names used by other modules / the same module when it is not a generator, which seems like an unlikely thing to happen in the first place.

I think the main reason is simply that the client might want a small response for better performance or to minimize bandwidth use, and normally the action API gives the client a lot of control over that, so generators adding a lot of data that cannot be opted out from could be a problem.

In terms of process, the RFC page says with respect to Phase 2: Resource:

For RFCs that propose a new development policy, the above does not apply; you can lead the writing yourself, if you want to. TechCom is also available to help write the documentation for new policies.

It seems the consensus above is that no change to mediawiki is required from this proposal. This is about clarifying the policy around what generators are allowed to do. Once the policy is resolved the search api's can be adjusted to expose the properties, but the specific changes to search api's can go through typical development processes and not be involved in the RFC. Based on this reasoning, I'm adjusting the code steward from TBD to no code involved, along with adjusting the requirements to no longer mention performing changes to the search API's.
With that complete, I'm moving this forward to Phase 3: Explore.

With respect to Phase 3: Explore, it seems this has been actively occuring in the discussions so far. To resolve Phase 3:

Once you have heard from the stakeholders, and have a selected at least one proposal that meets the requirements, the RFC should enter Phase 4. You can move it on the Phabricator workboard.

We have discussed the proposals a bit, and the requirements appear to be met. It seems we could move forward to Phase 4: Tune, where the exact details need to be agreed upon before last call. But i want to allow anyone to chime in before jumping forward a few steps.

In terms of process, the RFC page says with respect to Phase 2: Resource […]

It seems the consensus above is that no change to mediawiki is required from this proposal. This is about clarifying the policy around what generators are allowed to do. […]

It wasn't clear initially that no code change would be required. It seemed like it may've potentially required cross-cutting changing to how pagination works fundamentally. However now that we know this isn't the case, there is indeed no code steward needed. Thanks :)

(That's not to say it isn't an issue that there is once again a lapse in stewardship, that issue is separate from the RFC process.)

Thanks all - what's left to do to move this forward?

Hi all - I noticed that the techcom notes said this was "awaiting resourcing from core API steward to confirm support, risk, compatibility as proposed." It's my understanding that we no longer need that. Is that a misunderstanding?

I think at this point, all that is needed is a concrete proposal for a the new desired wording.

The suggested new wording is in the description (under "requirements"): "Generators always pass page titles to the query module, but may also output information themselves"
It would replace the current: "Generators only pass page titles to the query module and do not output any information themselves"

I'm not very familiar with this, but I reviewed Anomie's comments, and briefly reviewed the relevant code (ApiPageSet, ApiContinuationManager, ApiQueryGeneratorBase, etc.). I have no objection to this change, although per Anomie's suggestion on the task, and per @Tgr's concern that responses will be bloated, I suggest making it default to adding no extra data. Per the task description:

If generators are allowed to start adding additional data, should they default to no data at all (and require all props to be requested explicitly) or use the same existing set of (non-generator) defaults?

It seems to me that it is backwards compatible to make the default be empty, and that typically a client will not need as much data from a module when it is being used as a generator.

Also, maybe it is awkward to have a different default for the "prop" parameter when the module is used as a generator. So maybe the parameter should have a different name when it is used as a generator, like "gprop". Add prefixes and you have "gsrgprop". Basically "prop" would be the properties to include in the result, and "gprop" would be the properties to include in the ApiPageSet.

Defaulting to empty I think would be a safer direction indeed as it naturally can't affect existing API users in potentially unexpected ways, and also makes it more explicitly used, which helps track adoption and potentially iterate on it, which is harder to do it is optaimed implicitly without any parameter.

@matthiasmullie Would that work or would it complicate things?

I think it makes sense to default to empty and expect desired data to be requested explicitly via params.

I don't think we need to touch the existing general prop parameter (for enriching the generator list with other data), though - we'd simply allow the module-specific parameters (e.g. srprop`, which would become gsrprop) to influence the response.

New proposal is changing the wording to: "Generators always pass page titles to the query module. Unlike lists (which may include additional data by default), generators should not output any information themselves, unless when explicitly requested via the generator module's query parameters."
(which would replace the exitsing "Generators only pass page titles to the query module and do not output any information themselves")

Does that work?


Here's what this would mean in practice:

For reference, the response of that last uri is currently the same as the second; i.e. the gsrprop is ignored (even though the API does not warn of "Unrecognized parameter")

Per yesterday's TechCom meeting, this RFC enters the Last Call period. If no objections remain unaddressed by January 6th, the proposed change to the policy will be adopted.

Approved as of today, no concerns were raised during the Last Call period, and there's approval quorom from TechCom (no objections or abstains).