Load models' config directly rather than proxying via MediaWiki
Open, Needs Triage, Public

Description

At the beginning of the OCR project, we loaded the language information for each engine as part of the ext.wikisource.OCR module, because we wanted to avoid calls to non-production APIs from the Wikisource frontend.

The request is made to e.g. https://en.wikisource.org/w/load.php?modules=ext.wikisource.OCR to fetch the three engines' details. To build that response, MediaWiki makes three calls, one per engine, to e.g. https://ocr.wmcloud.org/api/available_langs?engine=tesseract

The data from those three API calls is then cached for a day in the MainWANObjectCache.

The available_langs API (https://github.com/wikimedia/wikimedia-ocr/blob/96a22ac2a6a76c0e8397fad81ef6de372dfcfeda/src/Engine/EngineBase.php#L76) gets its data from a file called models.json, which is manually maintained because not all engines have a quick API to list what they support.

That file is also directly available at https://ocr.wmcloud.org/models.json

So, if we accept that we don't mind making remote calls from the frontend (and we don't, because that's how we run the actual OCR requests), then we can just fetch the models file directly and bypass MediaWiki altogether. That saves several HTTP requests and avoids reading the same file three times.

This will also avoid the /Langs.json virtual ResourceLoader file being constructed on all page requests (which is bad), and it'll lighten the load of the above RL module (which is good). We'll also only load the models.json file when opening the config popup, rather than on all Page-NS page edits.
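
As a rough illustration of the proposal, here is a minimal TypeScript sketch of a lazy, memoized loader that fetches models.json directly and only on first use (for example, when the config popup is opened). Only the URL comes from this task. The names `createModelsLoader`, `JsonFetcher` and `ModelsConfig` are hypothetical, the structure of models.json is not described here so the payload is treated as opaque JSON, and the sketch assumes the file can be read cross-origin from the wiki.

```typescript
// Assumption: the shape of models.json is not specified in this task,
// so it is treated as opaque JSON here.
type ModelsConfig = unknown;

// Hypothetical abstraction over "GET a URL and parse it as JSON",
// so the loader can be exercised without a network.
type JsonFetcher = (url: string) => Promise<ModelsConfig>;

// URL taken from the task description.
const MODELS_URL = 'https://ocr.wmcloud.org/models.json';

// Returns a function that fetches the models file at most once.
// Nothing is requested until the returned function is first called.
function createModelsLoader(
  fetchJson: JsonFetcher,
  url: string = MODELS_URL
): () => Promise<ModelsConfig> {
  let pending: Promise<ModelsConfig> | null = null;

  return function loadModels(): Promise<ModelsConfig> {
    if (pending === null) {
      pending = fetchJson(url).catch((err) => {
        // Forget a failed attempt so a later call can retry.
        pending = null;
        throw err;
      });
    }
    return pending;
  };
}

// Possible browser usage (hypothetical wiring, not the extension's actual code):
//
//   const loadModels = createModelsLoader(async (url) => {
//     const response = await fetch(url);
//     if (!response.ok) {
//       throw new Error('Failed to load ' + url + ': ' + response.status);
//     }
//     return response.json();
//   });
//   // In the popup's open handler:
//   //   loadModels().then((models) => { /* populate language options */ });
```

Because the pending promise is cached, opening the popup repeatedly would reuse the first response instead of issuing a new request each time. Whether that in-page cache is sufficient, or whether HTTP caching headers on models.json should be relied on instead, is a design choice this sketch does not settle.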