
Provide a way to run Parsoid against "latest git HEAD of mediawiki-vendor" during round trip testing on scandium
Open, High, Public

Description

(I feel like this was already filed, but I can't find the dupe. Feel free to merge this into the previous task if you've got better search-fu than I do.)

We already provide a "lookaside" to composer in order to run Parsoid on scandium against a specific commit of Parsoid, instead of whatever the train build includes. But sometimes the latest Parsoid depends on a new library which has not yet been train-deployed. We need a similar trick to ensure that round-trip testing runs Parsoid with a specific git tag of mediawiki-vendor as well as a specific git tag of Parsoid, in order to avoid a two-week release delay (upgrade the library, wait a week for the library to ride the train, then commit the corresponding fixes to Parsoid and rt-test them, wait a week for that to ride the train, and if any problems are found during rt-testing, loop back to the beginning and start over).
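
For context, the existing Parsoid lookaside is essentially a classloader override: the rt-testing host's config maps Parsoid's PSR-4 namespace to a local checkout instead of the copy shipped in mediawiki-vendor. A minimal sketch of that idea (paths and the exact registration mechanism are illustrative, not the real wmf-config code):

```php
// Illustrative sketch only -- not the real wmf-config code, and the path
// is invented.  The idea: on the rt-testing host, map Parsoid's PSR-4
// namespace to a local checkout so it wins over the copy bundled in the
// train's mediawiki-vendor.
if ( gethostname() === 'scandium' ) {
	$parsoidTestingDir = '/srv/parsoid-testing'; // hypothetical checkout location
	// Depending on the core version, this is either a direct write to
	// AutoLoader::$psr4Namespaces or a call to the newer
	// AutoLoader::registerNamespaces() mentioned later in this task.
	AutoLoader::$psr4Namespaces['Wikimedia\\Parsoid\\'] = "$parsoidTestingDir/src";
}
```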

EDIT: when we change code in core (i.e., stuff in includes/parser/Parsoid) we also can't run regression testing until that core code rides the train fully. So the task description should perhaps be amended to include "latest git HEAD of mediawiki-core" as well. But as discussed below (T298046#7785777), a partial solution would be to run regression tests against the latest wmf.X version *regardless of domain*, so that we can start regression testing as soon as the train rolls out to group0, instead of having to wait until it reaches group2.

Event Timeline

There's a patch for T194880: Allow the path to the vendor directory to be customized within MediaWiki, which might help solve our problem by allowing MediaWiki to run with a configurable $VENDOR directory.
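
If that patch (or something like it) landed, the fix here might be a one-line, scandium-only override. A purely hypothetical sketch, using an invented MW_VENDOR_PATH constant that does not exist in core today:

```php
// Purely hypothetical: if something like T194880 landed, the
// scandium-specific config could presumably just point core at a
// separately updated vendor checkout.  MW_VENDOR_PATH is an invented
// name, not an existing core setting, and the path is made up.
if ( gethostname() === 'scandium' ) {
	define( 'MW_VENDOR_PATH', '/srv/mediawiki-vendor-testing' );
}
```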

@cscott: since you mention rt-testing, I presume you mean scandium.

Arlolra renamed this task from Provide a way to run Parsoid against "latest git HEAD of mediawiki-vendor" during round trip testing on selenium to Provide a way to run Parsoid against "latest git HEAD of mediawiki-vendor" during round trip testing on scandium.Jan 6 2022, 12:09 AM
Arlolra triaged this task as High priority.
Arlolra moved this task from Needs Triage to Testing on the Parsoid board.

Came up again this week, as another patch required core changes to be deployed before it could be RT tested. We ended up pushing our deploy schedule back, and it was one of the factors causing this to miss the 1.38 branch. I expect that, as we move more and more Parsoid integration into core, this will become an increasingly frequent issue.

The "train-speriment" T303759: Determine if we need to communicate anything special about forward and backwards compatibility during train experiment week interacts with this task. If trains are running very frequently, we might find that our regression tests are most often running against a mix of different mediawiki-core versions.

One way of partially solving this task is, instead of running latest mediawiki-core master on scandium, to pin requests to a given train version *regardless of domain* -- i.e., usually requests for en.wikipedia.org go to whatever train version is active for group2, but we'd want to pin all of our rt test requests to either group0 (which lets us do RT testing without waiting for a full train rollout) or (ideally, but maybe technically more difficult) a specific release, like -wmf.25. The latter would also ensure consistent testing even if the long-running rt testing task overlaps with a train deploy. (If we pinned to group0 and train deploys were daily, we'd probably often end up running against at least two different mediawiki-core versions, since rt tests take a significant fraction of a day.)
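
A rough sketch of what "pin rt requests to one branch regardless of domain" could look like at the version-selection layer (entirely illustrative; the real multiversion code in operations/mediawiki-config is more involved, and every name below is invented):

```php
// Illustrative only -- not the real multiversion code; all names here are
// invented.  Pin every request on the rt-testing host to a single
// MediaWiki branch instead of the branch chosen by the wiki's train group.

/** Normal lookup: wiki dbname -> train branch (stubbed for this sketch). */
function versionFromWikiversions( string $dbName ): string {
	// In production this would come from wikiversions.json.
	$wikiversions = [
		'testwiki' => 'php-1.39.0-wmf.26', // group0
		'enwiki'   => 'php-1.39.0-wmf.25', // group2
	];
	return $wikiversions[$dbName] ?? 'php-1.39.0-wmf.25';
}

function pickMwVersion( string $dbName ): string {
	$forced = getenv( 'RT_TEST_MW_VERSION' ); // e.g. "php-1.39.0-wmf.26"
	if ( $forced !== false && gethostname() === 'scandium' ) {
		// Same branch for every domain => consistent rt-test runs, and a
		// run can start as soon as the branch is deployed to group0.
		return $forced;
	}
	return versionFromWikiversions( $dbName );
}
```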

Notes from Content-Transform-Team forum

Options for RT-testing / scandium discussion

  • Status quo: just accept that we will have some degradation of our QA abilities
    • Improve ways to break cycles between core and Parsoid
    • Wait for RelEng team to get their “production” beta cluster in place and we leverage that
  • Non-status quo: Two steps in the process:
    • Given the risks of running undeployed master code on a production server, before we take any action to get the master branch of MediaWiki on scandium, we want to get a robust solution for a read-only connection from the SREs
    • If the above is done, we can figure out strategies for deployment:
      • Use classloader hacks
      • Create a new “group” (say, “group-parsoid”) in the deployment scripts so we can deploy special branches to scandium.
    • Taking https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/779920 as an example:
      • Option 1: Status quo (pause Parsoid merge and/or deploy until required code hits production)
        • we just have to wait a week for the new Autoloader::registerNamespaces() to be deployed to production before we can rt-test this Parsoid change
      • Option 2: Status quo plus expanded Parsoid-side hooks (e.g., we sort of did this with Translate)
        • wouldn’t really work in this case, since it’s not a change to a core Parsoid API like SiteConfig
      • Option 3: Expanded lookaside (classpath hack) (maybe with DBA support to make the read-only db more robust); see the sketch after this list
        • the new Autoloader::registerNamespaces() is an API in the root \ namespace, so this would be very hard
      • Option 4: Make a “group-parsoid” to go along with group0/1/etc. This is a full “release” with its own mediawiki-vendor etc. (a bit heavyweight to be doing on a regular basis for RT testing)
        • This particular patch would work in this model.
      • Option 5: Wait for beta – need to work with the “beta team” (which doesn’t exist) to ensure that content mirroring is included. (a side-effect benefit would be that it would be very easy to set up a VM which points at a read-only version of the “beta” database, for development.)
        • This patch would work in that model, assuming beta is running the master branches of MediaWiki core and Parsoid.
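
To make Option 3 concrete for the mediawiki-vendor half of this task: the same PSR-4 trick used for Parsoid's own namespace could, in principle, be pointed at an individual library inside a newer vendor checkout. A sketch under the assumption that Autoloader::registerNamespaces() takes a namespace-to-directory map, with invented paths and library name; as noted above, it breaks down for APIs in the root \ namespace:

```php
// Illustrative only.  Extend the existing lookaside so a newer,
// not-yet-train-deployed library is also loaded from a local checkout of
// mediawiki-vendor.  Paths and the library namespace are invented, and
// registerNamespaces() is assumed to accept a namespace => directory map.
if ( gethostname() === 'scandium' ) {
	$vendorTestingDir = '/srv/mediawiki-vendor-testing';
	AutoLoader::registerNamespaces( [
		// A hypothetical new library Parsoid depends on, loaded ahead of
		// the train-deployed copy in the regular vendor directory.
		'Wikimedia\\SomeNewLib\\' => "$vendorTestingDir/wikimedia/some-new-lib/src",
	] );
	// Caveat: this only helps for namespaced, PSR-4 autoloaded code; it
	// can't override an API in the root \ namespace, which is exactly the
	// problem with the patch discussed above.
}
```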

I was asked to comment on the database part of testing.

I talked to @Marostegui and we came up with three paths forward, with respect to database access, to minimize the risk:

Good

Get a dedicated replica per section (s1-s8, x1, es1-6, ...) and take away write rights (DELETE, for example, on the grants) from wikiuser and wikiadmin, so that even if all the checks in mw fail, it wouldn't be able to cause an issue in production.

Then you need to change the mw config and override the db configs so that these replicas are used as both master and replica.
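
For the "override db configs" step, a hedged sketch of what that could look like via $wgLBFactoryConf (host names and IPs invented; the real production configuration is generated and far more detailed):

```php
// Illustrative only: make one section use a single dedicated read-only
// replica as both "master" and replica from MediaWiki's point of view.
// Host name and IP are invented.
$wgLBFactoryConf['class'] = \Wikimedia\Rdbms\LBFactoryMulti::class;
$wgLBFactoryConf['sectionLoads']['s1'] = [
	// With a single entry, it is treated as the primary, so all reads and
	// any (attempted) writes go to this host; the grants on the host
	// would reject the writes anyway.
	'db-rt-s1' => 0,
];
$wgLBFactoryConf['hostsByName']['db-rt-s1'] = '10.64.0.1:3306';
```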

This is the best approach, but it's expensive and time-consuming. Even if we go with multi-instance, we would still probably need five to ten new hosts, which require a dedicated budget, ordering, racking and setup, etc. We don't have the budget for that this FY, and you would need to talk to the SRE directors about this (and provide the budget).

While this is expensive, you could probably frame it as part of T308283: Beta Cluster Tech Decision Forum, in which we would slowly build a new beta cluster in production.

Bad

We can provide you with a new mw user which won't have any write rights, and you can even make sure it doesn't bring down production with slow queries by setting "MAX_USER_CONNECTIONS" to 1 or 2. But you would still be connecting to the same replicas that serve production traffic, so be careful (though not overly so; the max-connections limit should help a lot here).

Again, you need to find a way to override mw configs.
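
For completeness, the mw-side override for this option looks comparatively small: swap the DB credentials used on the rt-testing host for the restricted account. A sketch with invented names:

```php
// Illustrative only: have MediaWiki on the rt-testing host connect with
// the restricted account described above.  The account name and the
// password variable are invented; in practice this would live behind the
// scandium-specific config hook mentioned later in this task.
if ( gethostname() === 'scandium' ) {
	$wgLBFactoryConf['serverTemplate']['user'] = 'wikiuser_rt_readonly';
	$wgLBFactoryConf['serverTemplate']['password'] = $wmgRtReadOnlyDbPassword;
	// Note: the MAX_USER_CONNECTIONS = 1 or 2 limit is set on the MySQL
	// grant itself (DBA side), not in this PHP config.
}
```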

Ugly

The previous option, but you would connect to the dbstore replicas instead of the production ones (the ones used in analytics: https://wikitech.wikimedia.org/wiki/Analytics/Systems/MariaDB). This is cheap, you can do it, and it won't bring anything down, but you also need to talk to the data engineering folks to see if they are okay with this.

The same thing applies: you need to override the production mw configs.

Thanks @Ladsgroup for laying this all out for us! We'll evaluate these options and see how we want to proceed.

Discussed this in team meeting today: we think the "Bad" option is probably the direction we want to go. We've already got a hook to override mw configs on Scandium: https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/CommonSettings.php#L4270

Long queries have never been an issue in all the time we've been using scandium to date; perhaps there's a process-level connection timeout or some such we could set for an extra belt-and-suspenders protection? But again, we are rarely doing database-query-level work, and if we did hit some issue of that sort we can always kill our rt-testing process to recover; we're generally closely watching it anyway. My primary worry has always been db corruption: if our in-test code ended up writing to the database, how to recover would be a lot less obvious.