
[PS] Update PropertySuggester update process for mwscript-k8s
Open, Needs Triage, Public

Description

The process to update PropertySuggester data (permalink) currently uses mwscript in one step:

mwscript extensions/PropertySuggester/maintenance/UpdateTable.php --wiki wikidatawiki --file wbs_propertypairs.csv

We need to figure out what to do here following the deprecation (and eventual removal) of “classic” mwscript, and how to migrate this use case to mwscript-k8s.

Event Timeline

I assume that the input on stdin option should work in theory:

mwscript-k8s --attach -- extensions/PropertySuggester/maintenance/UpdateTable.php --wiki wikidatawiki --file php://stdin < wbs_propertypairs.csv

However, I’m not sure this is the best way to do it. The input file is quite large (the last one was 192 MB uncompressed), and I’m not sure how long the maintenance script runs (the wiki page says “a few minutes”, but I vaguely recall it taking much longer, though it’s been a while since I last ran it). If the surrounding shell dies partway through the maintenance script, will the input file remain attached?

What do you think?

Prio Notes:

Impact Area              Affected
production / end users   yes
monitoring               no
development efforts      no
onboarding efforts       no
additional stakeholders  yes (deprecation of legacy mwscript)
ItamarWMDE renamed this task from Update PropertySuggester update process for mwscript-k8s to [PropertySuggester] Update PropertySuggester update process for mwscript-k8s.Oct 10 2024, 11:01 AM
ItamarWMDE moved this task from Incoming to [DOT] Prioritized on the wmde-wikidata-tech board.
ItamarWMDE renamed this task from [PropertySuggester] Update PropertySuggester update process for mwscript-k8s to [PS] Update PropertySuggester update process for mwscript-k8s.Oct 10 2024, 11:04 AM

> I assume that the input on stdin option should work in theory:
>
> mwscript-k8s --attach -- extensions/PropertySuggester/maintenance/UpdateTable.php --wiki wikidatawiki --file php://stdin < wbs_propertypairs.csv

Eh, it probably wouldn’t work out of the box after all :/ UpdateTable.php does this:

		$path = $this->getOption( 'file' );
		$fullPath = realpath( $path );
		$fullPath = str_replace( '\\', '/', $fullPath );

		if ( !file_exists( $fullPath ) ) {
			$this->fatalError( "Cant find $path \n" );
		}

Which isn’t going to work (realpath( 'php://stdin' ) is false, and even if it wasn’t, file_exists( 'php://stdin' ) is also false). I don’t see an obvious reason why a stream wouldn’t work with the rest of the process (BasicImporter only reads from the file, without seeking), but we’d first have to convince the maintenance script to accept stdin.

> The input file is quite large (the last one was 192 MB uncompressed), and I’m not sure how long the maintenance script runs (the wiki page says “a few minutes”, but I vaguely recall it taking much longer, though it’s been a while since I last ran it). If the surrounding shell dies partway through the maintenance script, will the input file remain attached?
>
> What do you think?

FTR, I mainly intended this question for @RLazarus or other WMFers familiar with (mw-on-)k8s :)

(If passing in the large input file via stdin seems fine, then I’m sure we can patch around the issue mentioned in T376604#10254725.)

As currently implemented, if the shell dies (or the network is disconnected, or anything else interrupts the stream) then stdin is closed in the container.

That's configurable at the pod level with the stdinOnce setting, but note the maybe-unexpected consequence of setting it to false: the maintenance script will never get an EOF, because it's always possible to reconnect, so the platform can never tell if the input is done.
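For reference, the two fields involved are part of the standard Kubernetes container spec; a minimal sketch (the container name is hypothetical, and the actual manifest mwscript-k8s generates will differ):

```yaml
spec:
  containers:
    - name: mediawiki   # hypothetical container name
      stdin: true       # allocate a stdin stream a client can attach to
      stdinOnce: true   # if true, stdin is closed for good once the first
                        # attached client disconnects; if false, clients may
                        # re-attach, so the container never sees EOF
```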

I can see a couple of options here:

  1. Start mwscript-k8s inside a tmux (for safekeeping) and pass the input on stdin as-is, hoping that it runs straight through on the first attempt. This strategy is easiest, and it's reasonable if a partial run can be retried safely -- if anything weird happens, just run it again. ("Safely" might mean the partial run has no effect, like an aborted transaction -- or it might mean the individual steps are idempotent.) It's a bad strategy if a partial run would be dangerous -- for example if the structure of the maintenance script were "first, delete all the data in prod; then, go through the input file and recreate the data line-by-line." But if the script is structured that way, I urge you to rethink it anyway! :)
  2. 192 megs is too big to pass through mwscript-k8s with --file; the maximum size (enforced at the Kubernetes level for the ConfigMap mechanism we're using) is 1 MiB. But if the structure of the script is such that the input file can be broken into 1 MiB chunks, and the script run once on each chunk, that could work nicely. That also gives you quicker retries if any individual run fails.
  3. Launch the pod with stdinOnce: false and pass the input on stdin; add an extra sentinel line at the bottom (a line by itself with a plain - or . or THE_INPUT_IS_NOW_CONCLUDED or something) and have the script terminate on that line instead of on EOF. Now you can attach, disconnect, re-attach, and resume sending input. But this requires some thoughtfulness on the client side: when you resume, you might not know which line was the last one received successfully. This is doable, with some care, especially if the script's handling of individual input lines is idempotent so you can back up and repeat some safely. But it's not trivial. (mwscript-k8s doesn't currently have an option for controlling stdinOnce, but it could easily enough; if you decide to go this route, open a feature request and I'll take care of it.)
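For option 2, GNU split can produce chunks that stay under a size cap while keeping whole lines together, so no CSV row is cut in half. A toy demonstration with a small stand-in file (the real wbs_propertypairs.csv is ~192 MB, and the mwscript-k8s loop at the end is illustrative only, not verified flags):

```shell
# Stand-in for wbs_propertypairs.csv; --line-bytes caps each chunk's size
# without splitting mid-line (use e.g. --line-bytes=900k for the 1 MiB limit).
seq 1 5000 > pairs.csv
split --line-bytes=4k pairs.csv chunk_

# Because chunks end on line boundaries, rejoining reproduces the input:
cat chunk_* | cmp - pairs.csv && echo "chunks are line-aligned"

# Then run the script once per chunk (illustrative only):
# for f in chunk_*; do
#   mwscript-k8s --attach -- extensions/PropertySuggester/maintenance/UpdateTable.php \
#     --wiki wikidatawiki --file "$f"
# done
```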

This is one of those times when mwscript-k8s can support whatever you need, but the reliable systems design problem itself is (and always has been) the complicated part.
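The sentinel idea in option 3 amounts to “stop at a marker line instead of at EOF”. A toy loop (the marker string is arbitrary):

```shell
# Read lines until the sentinel appears; anything after it is ignored,
# so a dangling never-EOF stream can't keep the consumer waiting.
printf 'row1\nrow2\nTHE_INPUT_IS_NOW_CONCLUDED\nstray-data\n' |
while IFS= read -r line; do
  if [ "$line" = "THE_INPUT_IS_NOW_CONCLUDED" ]; then break; fi
  echo "imported: $line"
done
# → imported: row1
#   imported: row2
```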

Thanks! IMHO 1 seems like the best option, but we’ll look into how the script behaves (I suspect it can be retried safely).

I had a quick look at approach #1, and it looks doable. The BasicImporter there would need to be changed to receive (and be happy with) a filehandle instead of expecting and making assertions about a filename, but that's not the end of the world. That said, the current implementation does appear to truncate the table (UpdateTable.php#L55), and the importer appears to commit in batches (i.e. as far as I can see there's no top-level transaction to keep the truncate and the inserts in an atomic block). Running the script more than once isn't a problem - it is in that sense idempotent - but when the script breaks off, the table is left in an inconsistent (incomplete) state, and naïvely resuming the CSV file at a later point would then (because of the truncate at the start) delete any already-imported rows.
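The resume hazard can be modelled with a plain file standing in for the table (a toy illustration only; the `>` redirection plays the role of the truncate at the start of each run):

```shell
seq 1 10 > input.csv           # ten rows to import

# First run: truncate the "table", import rows 1-4, then die mid-way.
head -n 4 input.csv > table

# Naive resume from row 5 truncates again, discarding rows 1-4:
tail -n +5 input.csv > table
wc -l < table                  # prints 6: rows 1-4 are gone, not re-imported
```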

Still, none of these seem insurmountable obstacles to approach #1. For me the only remaining questions are how long the job actually takes in prod and how often the input is aborted for whatever reason. I don't think I would bother with any resume logic if the runtime of the job is short and aborts are rare.

If the script runs relatively fast in production (assuming the wikitech page is accurate in saying that it takes “a few minutes”), then, given that it’s okay for the property suggester to be temporarily broken, I think the truncate-and-insert approach is acceptable, and option 1 still sounds good. But let’s try out how long the script takes to run under non-k8s mwscript, just so we have a current figure.

Change #1092818 had a related patch set uploaded (by Arthur taylor; author: Arthur taylor):

[mediawiki/extensions/PropertySuggester@master] Add support for stream paths (`php://stdin`) for CSV import

https://gerrit.wikimedia.org/r/1092818

We (@hoo, @AudreyPenven_WMDE) just reimported the 20240429 data using the legacy mwscript, and it took 9 min 45 s for 5.5 million rows. We didn't use a tmux, and it doesn't seem like it would be the end of the world if the connection got reset and you needed to restart the import. I think this speaks for the stdin approach.