
Provide a way to run Parsoid against "latest git HEAD of mediawiki-vendor" during round trip testing on scandium
Open, High, Public

Description

(I feel like this was already filed, but I can't find the dupe. Feel free to merge this into the previous task if you've got better search-fu than I do.)

We already provide a "lookaside" to composer in order to run Parsoid on scandium against a specific commit of Parsoid, instead of whatever the train build includes. But sometimes the latest Parsoid depends on a new library which has not yet been train-deployed. We need a similar trick to ensure that round-trip testing runs Parsoid with a specific git tag of mediawiki-vendor as well as a specific git tag of Parsoid, in order to avoid a two-week release delay (upgrade the library, wait a week for the library to ride the train, then commit the corresponding fixes to Parsoid and rt-test them, wait a week for that to ride the train, and if any problems are found during rt-testing, loop back to the beginning and start over).
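
For context, the existing Parsoid lookaside is essentially a classloader override: the rt-testing host's config maps Parsoid's PSR-4 namespace to a local checkout instead of the copy shipped in mediawiki-vendor. A minimal sketch of that idea (paths and the exact registration mechanism are illustrative, not the real wmf-config code):

```php
// Illustrative sketch only -- not the real wmf-config code, and the path
// is invented.  The idea: on the rt-testing host, map Parsoid's PSR-4
// namespace to a local checkout so it wins over the copy bundled in the
// train's mediawiki-vendor.
if ( gethostname() === 'scandium' ) {
	$parsoidTestingDir = '/srv/parsoid-testing'; // hypothetical checkout location
	// Depending on the core version, this is either a direct write to
	// AutoLoader::$psr4Namespaces or a call to the newer
	// AutoLoader::registerNamespaces() mentioned later in this task.
	AutoLoader::$psr4Namespaces['Wikimedia\\Parsoid\\'] = "$parsoidTestingDir/src";
}
```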

EDIT: when we change code in core (i.e., stuff in includes/parser/Parsoid) we also can't run regression testing until that core code rides the train fully. So the task description should perhaps be amended to include "latest git HEAD of mediawiki-core" as well. But as discussed below (T298046#7785777), a partial solution would be to run regression tests against the latest wmf.X version *regardless of domain*, so that we can start regression testing as soon as the train rolls out to group0, instead of having to wait until it reaches group2.

Event Timeline

There's a patch for T194880: Allow the path to the vendor directory to be customized within MediaWiki, which might help solve our problem by allowing MediaWiki to run with a configurable $VENDOR directory.
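
If that patch (or something like it) landed, the fix here might be a one-line, scandium-only override. A purely hypothetical sketch, using an invented MW_VENDOR_PATH constant that does not exist in core today:

```php
// Purely hypothetical: if something like T194880 landed, the
// scandium-specific config could presumably just point core at a
// separately updated vendor checkout.  MW_VENDOR_PATH is an invented
// name, not an existing core setting, and the path is made up.
if ( gethostname() === 'scandium' ) {
	define( 'MW_VENDOR_PATH', '/srv/mediawiki-vendor-testing' );
}
```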

@cscott: since you mention rt-testing, I presume you mean scandium.

Arlolra renamed this task from Provide a way to run Parsoid against "latest git HEAD of mediawiki-vendor" during round trip testing on selenium to Provide a way to run Parsoid against "latest git HEAD of mediawiki-vendor" during round trip testing on scandium.Jan 6 2022, 12:09 AM
Arlolra triaged this task as High priority.
Arlolra moved this task from Needs Triage to Testing on the Parsoid board.

Came up again this week, as another patch required core changes to be deployed before it could be RT tested. We ended up pushing our deploy schedule back, and it was one of the factors causing this to miss the 1.38 branch. I expect that, as we move more and more Parsoid integration into core, this will become an increasingly frequent issue.

The "train-speriment" T303759: Determine if we need to communicate anything special about forward and backwards compatibility during train experiment week interacts with this task. If trains are running very frequently, we might find that our regression tests are most often running against a mix of different mediawiki-core versions.

One way of partially solving this task is, instead of running latest mediawiki-core master on scandium, to pin requests to a given train version *regardless of domain* -- i.e., usually requests for en.wikipedia.org go to whatever train version is active for group2, but we'd want to pin all of our rt test requests to either group0 (which lets us do RT testing without waiting for a full train rollout) or (ideally, but maybe technically more difficult) a specific release, like -wmf.25. The latter would also ensure consistent testing even if the long-running rt testing task overlaps with a train deploy. (If we pinned to group0 and train deploys were daily, we'd probably often end up running against at least two different mediawiki-core versions, since rt tests take a significant fraction of a day.)
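
A rough sketch of what "pin rt requests to one branch regardless of domain" could look like at the version-selection layer (entirely illustrative; the real multiversion code in operations/mediawiki-config is more involved, and every name below is invented):

```php
// Illustrative only -- not the real multiversion code; all names here are
// invented.  Pin every request on the rt-testing host to a single
// MediaWiki branch instead of the branch chosen by the wiki's train group.

/** Normal lookup: wiki dbname -> train branch (stubbed for this sketch). */
function versionFromWikiversions( string $dbName ): string {
	// In production this would come from wikiversions.json.
	$wikiversions = [
		'testwiki' => 'php-1.39.0-wmf.26', // group0
		'enwiki'   => 'php-1.39.0-wmf.25', // group2
	];
	return $wikiversions[$dbName] ?? 'php-1.39.0-wmf.25';
}

function pickMwVersion( string $dbName ): string {
	$forced = getenv( 'RT_TEST_MW_VERSION' ); // e.g. "php-1.39.0-wmf.26"
	if ( $forced !== false && gethostname() === 'scandium' ) {
		// Same branch for every domain => consistent rt-test runs, and a
		// run can start as soon as the branch is deployed to group0.
		return $forced;
	}
	return versionFromWikiversions( $dbName );
}
```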

Notes from Content-Transform-Team forum

Options for RT-testing / scandium discussion

  • Status quo: just accept that we will have some degradation of our QA abilities
    • Improve ways to break cycles between core and Parsoid
    • Wait for RelEng team to get their “production” beta cluster in place and we leverage that
  • Non-status quo: Two steps in the process:
    • Given the risks of running undeployed master code on a production server, before we take any action to get the master branch of MediaWiki on scandium, we want to get a robust solution for a read-only connection from the SREs
    • If the above is done, we can figure out strategies for deployment:
      • Use classloader hacks
      • Create a new “group” (say, “group-parsoid”) in the deployment scripts so we can deploy special branches to scandium.
    • Taking https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/779920 as an example:
      • Option 1: Status quo (pause Parsoid merge and/or deploy until required code hits production)
        • we just have to wait a week for the new Autoloader::registerNamespaces() to be deployed to production before we can rt-test this Parsoid change
      • Option 2: Status quo plus expanded Parsoid-side hooks (e.g., we sort of did this with Translate)
        • wouldn’t really work in this case, since it’s not a change to a core Parsoid API like SiteConfig
      • Option 3: Expanded lookaside (classpath hack) (maybe with DBA support to make the read-only db more robust); see the sketch after this list
        • the new Autoloader::registerNamespaces() is an API in the root \ namespace, so this would be very hard
      • Option 4: Make a “group-parsoid” to go along with group0/1/etc. This is a full “release” with its own mediawiki-vendor etc. (a bit heavyweight to be doing on a regular basis for RT testing)
        • This particular patch would work in this model.
      • Option 5: Wait for beta – need to work with the “beta team” (which doesn’t exist) to ensure that content mirroring is included. (a side-effect benefit would be that it would be very easy to set up a VM which points at a read-only version of the “beta” database, for development.)
        • This patch would work in that model, assuming beta is running the master branches of MediaWiki core and Parsoid.
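
To make Option 3 concrete for the mediawiki-vendor half of this task: the same PSR-4 trick used for Parsoid's own namespace could, in principle, be pointed at an individual library inside a newer vendor checkout. A sketch under the assumption that Autoloader::registerNamespaces() takes a namespace-to-directory map, with invented paths and library name; as noted above, it breaks down for APIs in the root \ namespace:

```php
// Illustrative only.  Extend the existing lookaside so a newer,
// not-yet-train-deployed library is also loaded from a local checkout of
// mediawiki-vendor.  Paths and the library namespace are invented, and
// registerNamespaces() is assumed to accept a namespace => directory map.
if ( gethostname() === 'scandium' ) {
	$vendorTestingDir = '/srv/mediawiki-vendor-testing';
	AutoLoader::registerNamespaces( [
		// A hypothetical new library Parsoid depends on, loaded ahead of
		// the train-deployed copy in the regular vendor directory.
		'Wikimedia\\SomeNewLib\\' => "$vendorTestingDir/wikimedia/some-new-lib/src",
	] );
	// Caveat: this only helps for namespaced, PSR-4 autoloaded code; it
	// can't override an API in the root \ namespace, which is exactly the
	// problem with the patch discussed above.
}
```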

I was asked to comment on the database part of testing.

I talked to @Marostegui and we came up with three paths forward, with respect to database access, to minimize the risk:

Good

Get a dedicated replica per section (s1-s8, x1, es1-6, ...) and take away write rights (DELETE, for example, on the grants) from wikiuser and wikiadmin, so that even if all the checks in mw fail, it wouldn't be able to cause an issue in production.

Then you need to change the mw config and override the db configs so that these replicas are used as both master and replica.
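
For the "override db configs" step, a hedged sketch of what that could look like via $wgLBFactoryConf (host names and IPs invented; the real production configuration is generated and far more detailed):

```php
// Illustrative only: make one section use a single dedicated read-only
// replica as both "master" and replica from MediaWiki's point of view.
// Host name and IP are invented.
$wgLBFactoryConf['class'] = \Wikimedia\Rdbms\LBFactoryMulti::class;
$wgLBFactoryConf['sectionLoads']['s1'] = [
	// With a single entry, it is treated as the primary, so all reads and
	// any (attempted) writes go to this host; the grants on the host
	// would reject the writes anyway.
	'db-rt-s1' => 0,
];
$wgLBFactoryConf['hostsByName']['db-rt-s1'] = '10.64.0.1:3306';
```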

This is the best approach, but it's expensive and time-consuming. Even if we go with multi-instance, we would still probably need five to ten new hosts, which require a dedicated budget, ordering, racking and setup, etc. We don't have the budget for that this FY, and you would need to talk to the SRE directors about this (and provide the budget).

While this is expensive, you could probably frame it as part of T308283: Beta Cluster Tech Decision Forum, in which we would slowly build a new beta cluster in production.

Bad

We can provide you with a new mw user which won't have any write rights, and you can even make sure it doesn't bring down production with slow queries by setting "MAX_USER_CONNECTIONS" to 1 or 2. But you would still be connecting to the same replicas that serve production traffic, so be careful (though not overly so; the max-connections limit should help a lot here).

Again, you need to find a way to override mw configs.
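
For completeness, the mw-side override for this option looks comparatively small: swap the DB credentials used on the rt-testing host for the restricted account. A sketch with invented names:

```php
// Illustrative only: have MediaWiki on the rt-testing host connect with
// the restricted account described above.  The account name and the
// password variable are invented; in practice this would live behind the
// scandium-specific config hook mentioned later in this task.
if ( gethostname() === 'scandium' ) {
	$wgLBFactoryConf['serverTemplate']['user'] = 'wikiuser_rt_readonly';
	$wgLBFactoryConf['serverTemplate']['password'] = $wmgRtReadOnlyDbPassword;
	// Note: the MAX_USER_CONNECTIONS = 1 or 2 limit is set on the MySQL
	// grant itself (DBA side), not in this PHP config.
}
```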

Ugly

The previous option, but you would connect to the dbstore replicas instead of the production ones (the ones used in analytics: https://wikitech.wikimedia.org/wiki/Analytics/Systems/MariaDB). This is cheap, you can do it, and it won't bring anything down, but you also need to talk to the data engineering folks to see if they are okay with this.

The same thing applies: you need to override the production mw configs.

Thanks @Ladsgroup for laying this all out for us! We'll evaluate these options and see how we want to proceed.

Discussed this in team meeting today: we think the "Bad" option is probably the direction we want to go. We've already got a hook to override mw configs on Scandium: https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/CommonSettings.php#L4270

Long queries have never been an issue in all the time we've been using scandium to date; perhaps there's a process-level connection timeout or some such we could set for an extra belt-and-suspenders protection? But again, we are rarely doing database-query-level work, and if we did hit some issue of that sort we can always kill our rt-testing process to recover; we're generally closely watching it anyway. My primary worry has always been db corruption: if our in-test code ended up writing to the database, how to recover would be a lot less obvious.