Page MenuHomePhabricator

Performance review of Extension:WikimediaCampaignEvents
Closed, ResolvedPublic

Description

Description

(Please provide the context of the performance review, and describe how the feature or service works at a high level technically and from a user point of view, or link to documentation describing that.)

The WikimediaCampaignEvents extension adds WMF-specific features to the CampaignEvents extension. Currently, it provides a single such feature: it lets organisers organizers share the grant ID (if they have one) when they configure a specific event so that grant officers can easily track and analyze the impact of events supported by grants.

Here's a high-level overview of how it works:

  1. When the organiser visits Special:EnableEventRegistration (or Special:EditEventRegistration), they see a form field where they can enter the grant ID
  2. If they enter an ID and submit the form, then we run the following validations on the grant ID value:
    1. First check if it's empty, in which case we skip straight to (3.)
    2. Then check if it's the same value as currently stored in the database, and skip to (4.) if it is.
    3. Then check if the format of the ID is correct (e.g., 1234-56789). If not, show an error.
    4. If the format is valid, make a request to the Fluxx API to confirm that the ID exists and is valid. If not, show an error.
  3. If the ID is valid, we store it into the database. Or we just delete it from the DB if the value is empty.
  4. Event organizers can then see their grant ID in two places: in Special:EditEventRegistration (where they can remove/update it if they wish), and in Special:EventDetails (read-only).
Preview environment

(Insert one or more links to where the feature can be tested, e.g. on Beta Cluster.)

TODO: we will provide a link to the beta cluster as soon as the extension is enabled there.

Which code to review

(Provide links to all proposed changes and/or repositories. It should also describe changes which have not yet been merged or deployed but are planned prior to deployment. E.g. production Puppet, wmf config, or in-flight features expected to complete prior to launch date, etc.).

https://gerrit.wikimedia.org/g/mediawiki/extensions/WikimediaCampaignEvents

NOTE: There's more planned work to do, and the code base is not ready for review at this time.

Performance assessment

Please initiate the performance assessment by answering the below:

  • What are likely to be the weak areas (e.g. bottlenecks) of the code in terms of performance?

Making requests to the Fluxx API is probably the main/only bottleneck in the application. In particular, given that these requests must be authenticated, we need to request an authorization token before we can send the actual request, which would mean 2 HTTP requests when the form is submitted.

  • What work has been done to ensure the best possible performance of the feature?

The optimisations we implemented are all geared towards avoiding HTTP requests if possible. The first barrier is a simple validation layer to ensure that the grant ID value is non-empty, different than the one we already have, and that it matches a certain pattern. Validation with Fluxx is only attempted if the grant ID looks legitimate.

Next, we use object caching to reduce the number of HTTP request. The Fluxx token is cached in APCu for as long as possible (the API includes an expires_in field). API responses for an individual grant ID are also cached in APCu for 30 seconds.

We believe that changes to the grant ID would be relatively uncommon in real-world usage, and therefore that the cache may not be extremely effective, particularly the one for individual IDs. The main goal of this cache is to optimise subsequent form submissions whenever an unrelated validation error occurs, so that 1) we do not attempt validation of a grant ID we just validated, and 2) we don't need to re-authenticate if the user tries a different ID. Still, we would probably be making 2 requests most of the times in the happy path.

  • Are there potential optimisations that haven't been performed yet?

The aforementioned caching could be improved further. For instance, we might consider switching to a shared cache backend such as memcached. We could also consider increasing the cache TTL for individual IDs. And finally, we could cache request failures caused by the ID being invalid (as opposed to other issues such as network errors, or bad auth data, etc.); however, this is not straightforward as we would need to further distinguish between the possible failure scenarios.

  • Please list which performance measurements are in place for the feature and/or what you've measured ad-hoc so far. If you are unsure what to measure, ask the MediaWiki Platform Team for advice: mediawiki-platform-team@wikimedia.org.

We currently have no performance measurements in place.

Event Timeline

Note, as mentioned in the task description, we're filing this request now even if the code base isn't ready and the test environment isn't available, as a sort of heads-up to allow proper scheduling etc. We will get back to this once the extension is available in beta. Thanks in advance!

The beta deployment has been scheduled for tomorrow. Do you have an ETA of when this might be done?

Krinkle triaged this task as Medium priority.Feb 20 2024, 3:57 PM

Change 1005853 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/extensions/WikimediaCampaignEvents@master] [WIP] Minor improvents and consistency tweaks

https://gerrit.wikimedia.org/r/1005853

Change 1005853 merged by jenkins-bot:

[mediawiki/extensions/WikimediaCampaignEvents@master] Minor improvements and consistency tweaks

https://gerrit.wikimedia.org/r/1005853

I'm currently reviewing this extension in isolation, without the CampaignEvents frontend under consideration. If I understand correctly, from a frontend perspective WikimediaCampaignEvents only modifies a pre-existing special page, and only generates static HTML through an extra form field, using fields of a built-in type (i.e. not a custom HTMLFormField subclass or custom OOUI widget). I was unable to find any CSS or JavaScript code, and no places where it generates its own custom HTML. If I missed something, please let me know!

Database schema

The schema looks like a typical schema, that follows our best practices to the best of my knowledge. It's pretty minimal already, but to cover all bases however, I have a few questions. Feel free to point elsewhere if these were answered before as part of a DBA or perf review for the main extension.

  • Which wikis need this table? One central one? Many wikis in a given family? All public wikis? (I'm assuming Meta-Wiki only, apart from testing.)
  • In what scenarios can a user or bot cause writes to this table? Does it closely follow the write rate (insert/update/deletes) of an existing table in CampaignEvents, or is it likely to have its own independent load? (I'm assuming only the shared SpecialPage, but if there are write paths through other routes, let me know!)
  • What if any user right restrictions and/or rate limits (PingLimiter/wgRateLimits) are in place for these writes? In particular, to prevent and limit abusive growth.

The use of the novel CampaignsDatabaseHelper and MWDatabaseProxy do stand out to me as as unusual and distracting from the POV of a fellow MW developer who may have to debug this in production. Especially now that T330590 is resolved, I would recommend CampaignEvents adopts IConnectionProvider directly without any wrapping class. This makes the code more obviously correct instead of requiring additional review and familiarity for operations that are logically equivalent to what other core components and extensions do, yet the same confidence, guarantees, and familiarity cannot be applied here due to the indirection.

API

I do notice the newly added event_registration REST API endpoints. However, I could not find any calls to this. Is this for development and/or internal use only? (i.e. some low-traffic non-human use from a scheduled maintenance script perhaps?) If not, it may be worth reviewing these a bit more closely in terms of latency and timeouts.

GrantIDLookup / FluxxClient

GrantIDLookup currently uses MediaWikiServices->LocalServerCache as cache with an expiry of 1 minute. I see this is currently backed by php-apcu on individual app servers. I believe this is equivalent in practice to no caching at all, given that we have 400+ separate web servers across 2 data centers, and this endpoint has presumably a relatively low request rate. Even for a popular campaign that many people read about and sign up for, I believe the lookups would be limited to reads and changes on the organizer's view, not the participant's view. Is that correct? It seems unlikely to me that a person would hit the same web server twice in a row, much less get a cache hit.

I recommend removing the cache, or moving it from per-server APCU to per-datacenter Memcached, via WANObjectCache. For GrantIDLookup, I recommend to (if keeping the cache) to make the expiry at least 1 hour, so that it gets more reliably re-used. Assuming people submit the form within 60 seconds seems needlessly tight. This is backed by Memcached and thus shared by servers in the same data center, and will naturally be evicted when unused.

I am unable to review the way the Fluxx token is cached, since I can't see in what format the Fluxx API returns the expires_infield. The Fluxx API docs are hidden behind an auth wall. Is this something we can allow all WMF staff+contractors to at least read? Or is there a public documentation page for at least the API contract and response format? I'll assume for now that the API returns TTL in "relative" and "whole" seconds, and thus that it is compatible with the BagOStuff &$ttl by-ref parameter.

I would recommend, regardless of whether or not all staff can access it, to write a summary of the shape and response format of the Fluxx REST API on a subpage of the extensions' public mediawiki.org page, accessible to all contributors, and link to that from the class documentation.

EventDetailsHandler

I would caution against writing boolean returns or typing hook handlers against :bool and to instead use void. Boolean handlers is needlessly risky. In attempting to replace the returns with implicit nulls in the form of return;, I found Phan rejects this. It appears the mistake is actually in the CampaignEvents extension, where it is declared as CampaignEventsGetEventDetailsHook: bool.

We have two kinds of hooks:

  • Non-abortable hooks, documented as @return void. Most new hooks fall in this category (since MW 1.23 in 2014).
  • Abortable hooks, documented as @return bool|void.

Under no circumstances should a hook mandate a signature of @return bool, as this prevents consumers from the safe way to consume this hook for the cases where aborting isn't relevant. This stands out as suspicious in code review per https://www.mediawiki.org/wiki/Manual:Hooks#Hook_handler_return_values. This also came up during T245364 for another perf review, so I've improved the on-wiki docs to incorporate more of this guidance at https://doc.wikimedia.org/mediawiki-core/REL1_35/php/md_docs_2Hooks.html

Thanks for the review! @ifried @cmelo I have a few questions for you below. Please feel free to skip the wall of text and go straight to the places where you are mentioned.

If I understand correctly, from a frontend perspective WikimediaCampaignEvents only modifies a pre-existing special page, and only generates static HTML through an extra form field, using fields of a built-in type (i.e. not a custom HTMLFormField subclass or custom OOUI widget). I was unable to find any CSS or JavaScript code, and no places where it generates its own custom HTML. If I missed something, please let me know!

Yup, your understanding is correct. I guess one thing I may add is that the extension might, in the future, be expanded with more Wikimedia-specific features; but we have no examples of that at the moment as far as I can tell.

Database schema

[...] Feel free to point elsewhere if these were answered before as part of a DBA or perf review for the main extension.

Answering your questions below, but for completeness, here are some previous DBA reviews: T350906: Review database schema of the WikimediaCampaignEvents extension, T318593: Review database schema of the CampaignEvents extension, T308738: Ensure that all tables have all the necessary indexes.

  • Which wikis need this table? One central one? Many wikis in a given family? All public wikis? (I'm assuming Meta-Wiki only, apart from testing.)

Our envisioned setup is that this table (like the CampaignEvents tables) will live in x1's wikishared for most production wikis. The main exceptions to this are test wikis and non-CentralAuth wikis (including private etc.); those will use their own per-wiki databases (also in x1).

  • In what scenarios can a user or bot cause writes to this table? Does it closely follow the write rate (insert/update/deletes) of an existing table in CampaignEvents, or is it likely to have its own independent load? (I'm assuming only the shared SpecialPage, but if there are write paths through other routes, let me know!)

It follows the write rate of the existing campaign_events table. In fact, the wikimedia_campaign_events_grant is an extension of that table (the wceg_grant_id and wceg_grant_agreement_at fields can conceptually be thought of as columns in the campaign_events table, although they live in a different place).

  • What if any user right restrictions and/or rate limits (PingLimiter/wgRateLimits) are in place for these writes? In particular, to prevent and limit abusive growth.

There are no rate limits in place. I don't think we have ever talked about adding one, but I believe it would make sense to add it. @ifried Would you have any objections to adding a rate limit for how often someone can enable/edit an event? And probably also registering. I think it may be worth discussing with the team.

As for user rights: by default every registered user can create new events and therefore cause writes to this table. This is how the extension is configured in the beta cluster, and also testwiki, test2wiki, and officewiki in production. On meta (and other wikis in the near future), the necessary rights are assigned to the event-organizer user group.

The use of the novel CampaignsDatabaseHelper and MWDatabaseProxy do stand out to me as as unusual and distracting [...]

We all agree with this, and made the decision to do that: T361026. The reason why these abstractions exist in the first place is briefly explained in this README. The TLDR is that in the early stages of development we were leaving the door open to the possibility of moving the backend of the extension to an off-MW application. Some of the existing abstractions (like the DB ones) are there to ease such a transition. We ended up not moving the code, and I don't think we have plans to do so. OTOH, the code has come to rely more and more on MW internals, and the abstractions also make it harder for us to do certain things, hence the (recent) decision to get rid of them.

API

I do notice the newly added event_registration REST API endpoints. However, I could not find any calls to this. Is this for development and/or internal use only?

They are currently unused. The endpoint only exists for feature parity with the web UI.

GrantIDLookup / FluxxClient

GrantIDLookup currently uses MediaWikiServices->LocalServerCache as cache with an expiry of 1 minute. I see this is currently backed by php-apcu on individual app servers. I believe this is equivalent in practice to no caching at all, given that we have 400+ separate web servers across 2 data centers, and this endpoint has presumably a relatively low request rate. Even for a popular campaign that many people read about and sign up for, I believe the lookups would be limited to reads and changes on the organizer's view, not the participant's view. Is that correct?

Lookups are only triggered when adding or changing the "Grant ID" field on Special:EnableEventRegistration. This cache has two main purposes:

  1. When someone submits the form but there's an error (unrelated to the grant ID), it avoids a new lookup when the form is submitted again. As you pointed out, this is currently not working because the cache is per-server and we should fix that.
  2. Within the same request, the lookup code is actually called twice: first in the validation callback of the HTMLForm field, to verify that the grant ID is valid (calls doLookup). Then in the form submit callback, to get additional information for the grant ID we just validated (specifically the timestamp when the grant was agreed to, calls getAgreementAt()). While we could use in-object caching to save this state without resorting to object caching, I think we went straight with object caching because we needed it anyway for the item above. AIUI, caching is currently working as expected for this part.

I recommend removing the cache, or moving it from per-server APCU to per-datacenter Memcached, via WANObjectCache. For GrantIDLookup, I recommend to (if keeping the cache) to make the expiry at least 1 hour, so that it gets more reliably re-used. Assuming people submit the form within 60 seconds seems needlessly tight. This is backed by Memcached and thus shared by servers in the same data center, and will naturally be evicted when unused.

I think moving to WAN cache would make sense, and also extending the TTL.

I am unable to review the way the Fluxx token is cached, since I can't see in what format the Fluxx API returns the expires_infield. The Fluxx API docs are hidden behind an auth wall. Is this something we can allow all WMF staff+contractors to at least read?

Wholeheartedly agree with this. I would also like to have access and I'm not even sure why it's auth-walled in the first place. @cmelo do you have any ideas about this?

Or is there a public documentation page for at least the API contract and response format?

I don't think there's one, but again, I'll defer to @cmelo.

I'll assume for now that the API returns TTL in "relative" and "whole" seconds, and thus that it is compatible with the BagOStuff &$ttl by-ref parameter.

IIRC, we ran some test Fluxx queries while writing this code, and made the same assumptions based on the response data.

I would recommend, regardless of whether or not all staff can access it, to write a summary of the shape and response format of the Fluxx REST API on a subpage of the extensions' public mediawiki.org page, accessible to all contributors, and link to that from the class documentation.

Agreed, if removing the auth wall is not an option. @cmelo would you be willing to start a conversation with someone in Advancement about this?

EventDetailsHandler

I would caution against writing boolean returns or typing hook handlers against :bool and to instead use void.

Makes sense, we'll change that. The intention was really to have it as @return true,as I think the existing @return true|void pattern is confusing (at least for code analyzers). But indeed, : void would be perfect.

@Daimona wrote:
@Krinkle wrote:

GrantIDLookup currently uses MediaWikiServices->LocalServerCache as cache with an expiry of 1 minute. […]

  1. […]
  2. Within the same request, the lookup code is actually called twice: […]

[…]

I think moving to WAN cache would make sense, and also extending the TTL.

Awesome. In that case, I'll note that WAN cache has a pcTTL option that you may want to enable at the same time. This gives you the in-request cache for free, without incurring a Memcached roundtrip.

Change #1015528 had a related patch set uploaded (by Daimona Eaytoy; author: Daimona Eaytoy):

[mediawiki/extensions/CampaignEvents@master] Hooks: Drop `:bool` return type hint

https://gerrit.wikimedia.org/r/1015528

Change #1015529 had a related patch set uploaded (by Daimona Eaytoy; author: Daimona Eaytoy):

[mediawiki/extensions/WikimediaCampaignEvents@master] Hooks: Use `:void` as return type hint

https://gerrit.wikimedia.org/r/1015529

Change #1015531 had a related patch set uploaded (by Daimona Eaytoy; author: Daimona Eaytoy):

[mediawiki/extensions/CampaignEvents@master] Hooks: add `void` return type hint

https://gerrit.wikimedia.org/r/1015531

Change #1015536 had a related patch set uploaded (by Daimona Eaytoy; author: Daimona Eaytoy):

[mediawiki/extensions/WikimediaCampaignEvents@master] Use WAN cache for Fluxx data

https://gerrit.wikimedia.org/r/1015536

Change #1015536 merged by jenkins-bot:

[mediawiki/extensions/WikimediaCampaignEvents@master] Use WAN cache for Fluxx data

https://gerrit.wikimedia.org/r/1015536

Thank you @Krinkle and @Daimona, and sorry for the delay on this.

I am unable to review the way the Fluxx token is cached, since I can't see in what format the Fluxx API returns the expires_infield. The Fluxx API docs are hidden behind an auth wall. Is this something we can allow all WMF staff+contractors to at least read?

@Krinkle I will check if we can allow alal WMF staff+contractors to at least read it, but here is the response format for the oauth/token request:
Respose

{
    "access_token": "yourReturnedAccessToken",
    "token_type": "bearer",
    "expires_in": 7200,
    "created_at": 1712627859
}

Or is there a public documentation page for at least the API contract and response format?

Unhapply there is no public documentation page, but I will check if we can create one for you

I would recommend, regardless of whether or not all staff can access it, to write a summary of the shape and response format of the Fluxx REST API on a subpage of the extensions' public mediawiki.org page, accessible to all contributors, and link to that from the class documentation.

Yes, I agree, I will look into this as well thank you

Change #1015528 merged by jenkins-bot:

[mediawiki/extensions/CampaignEvents@master] Hooks: Drop `:bool` return type hint

https://gerrit.wikimedia.org/r/1015528

Change #1015529 merged by jenkins-bot:

[mediawiki/extensions/WikimediaCampaignEvents@master] Hooks: Use `:void` as return type hint

https://gerrit.wikimedia.org/r/1015529

Change #1015531 merged by jenkins-bot:

[mediawiki/extensions/CampaignEvents@master] Hooks: add `void` return type hint

https://gerrit.wikimedia.org/r/1015531

There are no rate limits in place. I don't think we have ever talked about adding one, but I believe it would make sense to add it. @ifried Would you have any objections to adding a rate limit for how often someone can enable/edit an event? And probably also registering. I think it may be worth discussing with the team.

As for user rights: by default every registered user can create new events and therefore cause writes to this table. This is how the extension is configured in the beta cluster, and also testwiki, test2wiki, and officewiki in production. On meta (and other wikis in the near future), the necessary rights are assigned to the event-organizer user group.

@Daimona: Sure, I would be open to a rate limit for enabling/editing event, provided that it was very high. I don't know what number would be sufficient. Perhaps 1000 edits per event? I don't see anyone editing event information more than 1000 times (or anywhere close!). If that is too high to do the job, then perhaps 500 edits? My inclination, though, is to keep the number quite high, if possible.

As for registering, this would be for registering for the same event (so, someone registers, then unregisters, and then registers again)? Perhaps that could be 30 times?

What do you think, @Astinson, of these proposed limits?

I would consider doing something, like putting rate limits in a short window of time (i.e. x number of interactions in a 24 or 48 hour period). The thing you want to do is prevent a short term accidental misuse that effects others, rather than actually having someone who is having a bunch of tweaks to the tool right?

Yup, that makes sense, @Astinson! Any recommendations on what could be reasonable within, say, a 24 hour period of time for editing event registration or re-registering, based on what you have observed of organizer and participant behavior?

In a 24 hour period: 40 configuration interactions from the organizer? 6-10 registration actions from the editor? It should be just enough to get them to cool down what they are doing, without being too restrictive -- and create enough space for them to make mistakes are being indecisive.

Krinkle closed this task as Resolved.EditedFri, Apr 26, 9:04 PM

Marking as resolved as I believe the performance review is completed. Feel free to keep tagging this task with more inspired follow-up commits, though!