
[PS] Update PropertySuggester update process for mwscript-k8s
Open, Needs Triage, Public

Description

The process to update PropertySuggester data (permalink) currently uses mwscript in one step:

mwscript extensions/PropertySuggester/maintenance/UpdateTable.php --wiki wikidatawiki --file wbs_propertypairs.csv

We need to figure out what to do here following the deprecation (and eventual removal) of “classic” mwscript, and how to migrate this use case to mwscript-k8s.

Event Timeline

I assume that the input on stdin option should work in theory:

mwscript-k8s --attach -- extensions/PropertySuggester/maintenance/UpdateTable.php --wiki wikidatawiki --file php://stdin < wbs_propertypairs.csv

However, I’m not sure this is the best way to do it. The input file is quite large (the last one was 192 MB uncompressed), and I’m not sure how long the maintenance script runs (the wiki page says “a few minutes”, but I vaguely recall it taking much longer, though it’s been a while since I last ran it). If the surrounding shell dies partway through the maintenance script, will the input file remain attached?

What do you think?

Prio Notes:

Impact Area              Affected
production / end users   yes
monitoring               no
development efforts      no
onboarding efforts       no
additional stakeholders  yes (deprecation of legacy mwscript)
ItamarWMDE renamed this task from Update PropertySuggester update process for mwscript-k8s to [PropertySuggester] Update PropertySuggester update process for mwscript-k8s.Oct 10 2024, 11:01 AM
ItamarWMDE moved this task from Incoming to [DOT] Prioritized on the wmde-wikidata-tech board.
ItamarWMDE renamed this task from [PropertySuggester] Update PropertySuggester update process for mwscript-k8s to [PS] Update PropertySuggester update process for mwscript-k8s.Oct 10 2024, 11:04 AM

> I assume that the input on stdin option should work in theory:
>
> mwscript-k8s --attach -- extensions/PropertySuggester/maintenance/UpdateTable.php --wiki wikidatawiki --file php://stdin < wbs_propertypairs.csv

Eh, it probably wouldn’t work out of the box after all :/ UpdateTable.php does this:

		$path = $this->getOption( 'file' );
		$fullPath = realpath( $path );
		$fullPath = str_replace( '\\', '/', $fullPath );

		if ( !file_exists( $fullPath ) ) {
			$this->fatalError( "Cant find $path \n" );
		}

Which isn’t going to work (realpath( 'php://stdin' ) is false, and even if it wasn’t, file_exists( 'php://stdin' ) is also false). I don’t see an obvious reason why a stream wouldn’t work with the rest of the process (BasicImporter only reads from the file, without seeking), but we’d first have to convince the maintenance script to accept stdin.

> The input file is quite large (the last one was 192 MB uncompressed), and I’m not sure how long the maintenance script runs (the wiki page says “a few minutes”, but I vaguely recall it taking much longer, though it’s been a while since I last ran it). If the surrounding shell dies partway through the maintenance script, will the input file remain attached?
>
> What do you think?

FTR, I mainly intended this question for @RLazarus or other WMFers familiar with (mw-on-)k8s :)

(If passing in the large input file via stdin seems fine, then I’m sure we can patch around the issue mentioned in T376604#10254725.)

As currently implemented, if the shell dies (or the network is disconnected, or anything else interrupts the stream) then stdin is closed in the container.

That's configurable at the pod level with the stdinOnce setting, but note the maybe-unexpected consequence of setting it to false: the maintenance script will never get an EOF, because it's always possible to reconnect, so the platform can never tell if the input is done.
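For reference, the two fields involved are part of the standard Kubernetes container spec; a minimal sketch (the container name is hypothetical, and the actual manifest mwscript-k8s generates will differ):

```yaml
spec:
  containers:
    - name: mediawiki   # hypothetical container name
      stdin: true       # allocate a stdin stream a client can attach to
      stdinOnce: true   # if true, stdin is closed for good once the first
                        # attached client disconnects; if false, clients may
                        # re-attach, so the container never sees EOF
```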

I can see a couple of options here:

  1. Start mwscript-k8s inside a tmux (for safekeeping) and pass the input on stdin as-is, hoping that it runs straight through on the first attempt. This strategy is easiest, and it's reasonable if a partial run can be retried safely -- if anything weird happens, just run it again. ("Safely" might mean the partial run has no effect, like an aborted transaction -- or it might mean the individual steps are idempotent.) It's a bad strategy if a partial run would be dangerous -- for example if the structure of the maintenance script were "first, delete all the data in prod; then, go through the input file and recreate the data line-by-line." But if the script is structured that way, I urge you to rethink it anyway! :)
  2. 192 megs is too big to pass through mwscript-k8s with --file; the maximum size (enforced at the Kubernetes level for the ConfigMap mechanism we're using) is 1 MiB. But if the structure of the script is such that the input file can be broken into 1 MiB chunks, and the script run once on each chunk, that could work nicely. That also gives you quicker retries if any individual run fails.
  3. Launch the pod with stdinOnce: false and pass the input on stdin; add an extra sentinel line at the bottom (a line by itself with a plain - or . or THE_INPUT_IS_NOW_CONCLUDED or something) and have the script terminate on that line instead of on EOF. Now you can attach, disconnect, re-attach, and resume sending input. But this requires some thoughtfulness on the client side: when you resume, you might not know which line was the last one received successfully. This is doable, with some care, especially if the script's handling of individual input lines is idempotent so you can back up and repeat some safely. But it's not trivial. (mwscript-k8s doesn't currently have an option for controlling stdinOnce, but it could easily enough; if you decide to go this route, open a feature request and I'll take care of it.)
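For option 2, GNU split can produce chunks that stay under a size cap while keeping whole lines together, so no CSV row is cut in half. A toy demonstration with a small stand-in file (the real wbs_propertypairs.csv is ~192 MB, and the mwscript-k8s loop at the end is illustrative only, not verified flags):

```shell
# Stand-in for wbs_propertypairs.csv; --line-bytes caps each chunk's size
# without splitting mid-line (use e.g. --line-bytes=900k for the 1 MiB limit).
seq 1 5000 > pairs.csv
split --line-bytes=4k pairs.csv chunk_

# Because chunks end on line boundaries, rejoining reproduces the input:
cat chunk_* | cmp - pairs.csv && echo "chunks are line-aligned"

# Then run the script once per chunk (illustrative only):
# for f in chunk_*; do
#   mwscript-k8s --attach -- extensions/PropertySuggester/maintenance/UpdateTable.php \
#     --wiki wikidatawiki --file "$f"
# done
```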

This is one of those times when mwscript-k8s can support whatever you need, but the reliable systems design problem itself is (and always has been) the complicated part.
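The sentinel idea in option 3 amounts to “stop at a marker line instead of at EOF”. A toy loop (the marker string is arbitrary):

```shell
# Read lines until the sentinel appears; anything after it is ignored,
# so a dangling never-EOF stream can't keep the consumer waiting.
printf 'row1\nrow2\nTHE_INPUT_IS_NOW_CONCLUDED\nstray-data\n' |
while IFS= read -r line; do
  if [ "$line" = "THE_INPUT_IS_NOW_CONCLUDED" ]; then break; fi
  echo "imported: $line"
done
# → imported: row1
#   imported: row2
```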

Thanks! IMHO 1 seems like the best option, but we’ll look into how the script behaves (I suspect it can be retried safely).

I had a quick look at approach #1, and it looks doable. The BasicImporter there would need to be changed to receive (and be happy with) a filehandle instead of expecting and making assertions about a filename, but that's not the end of the world. That said, the current implementation does appear to truncate the table (UpdateTable.php#L55), and the importer appears to commit in batches (i.e. as far as I can see there's no top-level transaction to keep the truncate and the inserts in an atomic block). Running the script more than once isn't a problem - it is in that sense idempotent - but when the script breaks off, the table is left in an inconsistent (incomplete) state, and naïvely resuming the CSV file at a later point would then (because of the truncate at the start) delete any already-imported rows.
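The resume hazard can be modelled with a plain file standing in for the table (a toy illustration only; the `>` redirection plays the role of the truncate at the start of each run):

```shell
seq 1 10 > input.csv           # ten rows to import

# First run: truncate the "table", import rows 1-4, then die mid-way.
head -n 4 input.csv > table

# Naive resume from row 5 truncates again, discarding rows 1-4:
tail -n +5 input.csv > table
wc -l < table                  # prints 6: rows 1-4 are gone, not re-imported
```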

Still, none of these seem insurmountable obstacles to approach #1. For me the only remaining questions are how long the job actually takes in prod and how often the input is aborted for whatever reason. I don't think I would bother with any resume logic if the runtime of the job is short and aborts are rare.

If the script runs relatively fast in production (assuming the wikitech page is accurate in saying that it takes “a few minutes”), then, given that it’s okay for the property suggester to be temporarily broken, I think the truncate-and-insert approach is acceptable, and option 1 still sounds good. But let’s try out how long the script takes to run under non-k8s mwscript, just so we have a current figure.

Change #1092818 had a related patch set uploaded (by Arthur taylor; author: Arthur taylor):

[mediawiki/extensions/PropertySuggester@master] Add support for stream paths (`php://stdin`) for CSV import

https://gerrit.wikimedia.org/r/1092818

We (@hoo, @AudreyPenven_WMDE) just reimported the 20240429 data using the legacy mwscript, and it took 9 min 45 s for 5.5 million rows. We didn't use a tmux, and it doesn't seem like it would be the end of the world if the connection got reset and you needed to restart the import. I think this speaks for the stdin approach.