
New OCR tool should be able to OCR part of a page
Closed, Resolved · Public

Description

Many pages at Wikisource are extremely difficult to OCR because they flummox the segmenting engines of the OCR tools. For example:

While not as ideal as an AI-driven automagic system that just works, a decent fallback would be if the user could select a rectangular region to OCR. This at least would allow the proofreading to proceed in chunks.

Event Timeline

I would add that being able to 'partition' a page would also be useful to help OCR content that is tabular or columnar in nature.

This is because of certain use cases:

1. Sometimes with existing OCR I've found that the text has been read column by column, whereas what's typically needed is a row-by-row reading.
2. Some older works use multiple columns of text. In some OCR this is read as interleaved blocks from each column, or in the worst case as interleaved lines, which can be tedious to split and descramble (as with some pages of Cary's Itinerary).
3. Some older works, especially British legislation, have extensive marginal citations or annotations. OCR has on occasion interleaved these marginal citations into the main text (on a line-by-line basis), again needing manual de-scrambling. An OCR tool that let the user set up partitioning would enable these citations and annotations to be read separately, potentially with differently optimised OCR settings, given that such citations may be in smaller fonts, with different line spacings, etc. (Aside: having partitioned marginal citations and annotations would also potentially allow alternative approaches to their display to be implemented in a cleaner way than has been attempted to date.)

This is very common. Much of the time we work with tabular data or whole works that have a two-column layout, and all the current OCR tools fail on them.

We already have a tool from Commons, CropTool, one of the coolest apps of 2020, that has a very simple UI and could be reused to generate the segments and feed them to the OCR tools.

https://commons.m.wikimedia.org/wiki/Commons:CropTool

Invoking external tools should not be necessary, assuming you have ImageMagick. Once the image is downloaded to the OCR server, it can be cropped with a command like this:

convert in.png -crop +10+20 -crop -30-40 +repage out.png

Where 10, 20, 30 and 40 are the number of pixels to crop from the left, top, right and bottom respectively.

There are probably image libraries for PHP/Python/whatever you are using too, but a quick shell-out is sometimes all it takes :-D
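If shelling out is not wanted, a rough equivalent with an image library — here Pillow in Python, purely as an illustration with placeholder filenames and the same margins as the convert example above — would be:

from PIL import Image  # Pillow

# Crop 10 px off the left, 20 off the top, 30 off the right, 40 off the bottom.
with Image.open("in.png") as img:
    width, height = img.size
    img.crop((10, 20, width - 30, height - 40)).save("out.png")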

The UI to select a rectangle on the image is the harder part here (maybe a couple of hours' work with jQuery for an in-place MVP in the PRP image pane).


However, I suggest that the backend be done FIRST, so that the community can then build their own region selections should the OCR project not get around to it, and so that the core service is more generally useful for other tools. That is to say, the URL for the request could be something like:

https://ocr.wmcloud.org/api.php?engine=tesseract&langs[]=en&image=URLHERE&uselang=en&x=100&y=100&w=100&h=100

Where:

  • x is the selection left margin: omitting means 0
  • y is the selection top margin: omitting means 0
  • w is the selection width: omitting means total width - x
  • h is the selection height: omitting means total height - y
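As a minimal sketch of how a backend might resolve those defaults against the full image dimensions (the function and variable names here are hypothetical, not part of any existing API):

def resolve_crop(total_width, total_height, x=None, y=None, w=None, h=None):
    # Omitted x/y default to 0; omitted w/h default to the rest of the image.
    x = 0 if x is None else x
    y = 0 if y is None else y
    w = (total_width - x) if w is None else w
    h = (total_height - y) if h is None else h
    return x, y, w, h

print(resolve_crop(2000, 3000, x=100, y=100))  # (100, 100, 1900, 2900)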

Thumbor supports dynamic cropping of images, although I'm not sure if our installation of it has this feature enabled. That'd make it even easier to OCR the cropped image because we'd just be altering the upload.wikimedia.org URL, and it would also work with Google (to which we just pass the URL, so if we had to crop we'd have to do so and store the result in a publicly-accessible spot – not that that's too hard, and the image would still just be deleted afterwards).

@Samwilson apparently MediaWiki doesn't support this (according to #wikimedia-tech).

Which I suppose makes sense, otherwise https://en.wikipedia.org/wiki/Template:CSS_image_crop would not be needed.

Also let me know if this is something CommTech plans to include in the project, because if not, I can take a crack at it myself and hopefully get it into review while there are current eyes-on.

apparently MediaWiki doesn't support this (according to #wikimedia-tech).

Could it, though? Not MW itself, obviously, but Thumbor?

You can do in-browser cropping in JS by using Canvas and some tricks (I've been toying with the idea of implementing that for OCRtoy), but doing it in Thumbor, which is actually designed for that task, would be a lot more elegant and preserve a nice clean separation of concerns: the UA provides selection and requests OCR; the OCR tool backend just fetches the image data it is supposed to perform OCR on; and Thumbor generates thumbnails and performs other transformations (in this case cropping).

The most labour-intensive part in that architecture would be enabling cropping in Thumbor in a way that doesn't have undesirable unintended consequences. Not having worked with Thumbor, much less the WMF deployment of it, I have no idea if that is feasible without a major effort or not.

The second most would be making a JS UI for rectangular image selection, and I'm pretty sure I can dig up at least example code if not outright supported frameworks for doing that (I think the image annotations gadget on Commons even has some code for that already). This should definitely be doable, one way or the other.

Does one have access to Thumbor at that level in the OCR tool (I don't know how it works), if you're not going through the upload.wikimedia.org endpoint?

The OCR tool already has to download the image somewhere, at least for Tesseract; cropping it is trivial once you have that. Faffing about with anything outside the OCR tool sounds like it would guarantee that this task will not even be started until the OCR project is over. Meanwhile, ImageMagick will do this in about 20 lines of code all-in, including bounds checking. Or any other image library, for that matter.
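For illustration, a sketch of that shell-out in Python with basic bounds checking; the helper name and the clamping policy are just assumptions, and only standard ImageMagick commands (identify, convert) are used:

import subprocess

def crop_for_ocr(in_path, out_path, x, y, w, h):
    # Ask ImageMagick for the image dimensions.
    result = subprocess.run(
        ["identify", "-format", "%w %h", in_path],
        capture_output=True, text=True, check=True,
    )
    total_w, total_h = (int(n) for n in result.stdout.split())
    # Clamp the requested selection to the image bounds.
    x = max(0, min(x, total_w - 1))
    y = max(0, min(y, total_h - 1))
    w = max(1, min(w, total_w - x))
    h = max(1, min(h, total_h - y))
    # -crop geometry is WIDTHxHEIGHT+X+Y, measured from the top left.
    subprocess.run(
        ["convert", in_path, "-crop", f"{w}x{h}+{x}+{y}", "+repage", out_path],
        check=True,
    )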

The UI for this is also not that hard; I could throw something together today if I knew there was an API to throw the four numbers at. Longer-term, I think PRP is looking at OpenSeadragon for image zoom/pan, but again, if we wait for that, we'll have to do it outside of the current OCR project. The UI can always be upgraded/ported later, as long as the JS posts the same four numbers.

ImageMagick will do this in about 20 lines of code all-in, including bounds checking.

As a POC/MVP, sure. As a permanent solution I don't recommend basing it on ImageMagick, for all sorts of reasons, but mostly I was thinking about what the most elegant overall solution would be long term.

The request path, AIUI, is client ➔ Varnish ➔ either Swift or MediaWiki ➔ Thumbor; and MW is only in the path for some edge cases (it's handling authentication, essentially). The WMF Swift deployment is pretty customised, which might make things complicated, but barring that it is not inconceivable that this could be enabled through config changes in Varnish and Swift.

@Tgr Is this something you would be able to comment on the feasibility of (cf. T269818#7220781)?

Another option

  • Write a super simple Toolforge tool that only crops images from upload.wikimedia.org (sketched below). Is this a microservice? Sounds like it might be? They're "in", right? ¯\_(ツ)_/¯
  • Then, add the x, y, w, h parameters to the OCR tool
  • If the OCR tool gets those parameters, use the cropping service URL instead
  • One fine day, if there's a way to construct an upload.wikimedia.org URL that includes a crop, just switch out the URLs in the OCR backend
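For what it's worth, a bare-bones sketch of what such a cropping microservice could look like (Flask and the /crop route are just illustrative choices, not an existing tool):

import io
import requests
from flask import Flask, abort, request, send_file
from PIL import Image

app = Flask(__name__)

@app.route("/crop")
def crop():
    # Only crop images hosted on upload.wikimedia.org.
    url = request.args.get("image", "")
    if not url.startswith("https://upload.wikimedia.org/"):
        abort(400)
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    img = Image.open(io.BytesIO(resp.content))
    # x/y/w/h follow the defaults proposed earlier in this thread.
    x = int(request.args.get("x", 0))
    y = int(request.args.get("y", 0))
    w = int(request.args.get("w", img.width - x))
    h = int(request.args.get("h", img.height - y))
    out = io.BytesIO()
    img.crop((x, y, x + w, y + h)).save(out, format="PNG")
    out.seek(0)
    return send_file(out, mimetype="image/png")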

I wonder if CropTool would be a good place to provide that sort of service? It's already doing the download-and-crop part.

@Samwilson I'm adding this to the "to be discussed" column. I want to understand how complex it would be in terms of points, and see our bandwidth for sprint 6! Not sure if you'll be at estimation, but maybe an asynchronous pointing if not? I talked to @Inductiveload yesterday and want to see if we can point this and then discuss whether it can be part of our maintenance percentage for sprint 6.

cc @HMonroy

Please note we are not committed to this work; this is just moving it to estimation to understand its complexity.

Also let me know if this is something CommTech plans to include in the project, because if not, I can take a crack at it myself and hopefully get it into review while there are current eyes-on.

@Inductiveload if you want to give it a stab and get a code review, we could estimate that work too

The request path, AIUI, is client ➔ Varnish ➔ either Swift or MediaWiki ➔ Thumbor; and MW is only in the path for some edge cases (it's handling authentication, essentially). The WMF Swift deployment is pretty customised, which might make things complicated, but barring that it is not inconceivable that this could be enabled through config changes in Varnish and Swift.

@Tgr Is this something you would be able to comment on the feasibility of (cf. T269818#7220781)?

If I understand you correctly, you are asking about T9757: allow cropping images when rendered.
It's theoretically possible, sure. You'd have an easier time adding the cropping feature into the OCR tool, though.

We discussed this and believe the best route forward is to add server-side cropping with ImageMagick, as proposed at T269818#7219863. The UI aspect we may not be able to get to, unfortunately.

Adding cropping to the tool should be fine for both Tesseract and Google. We'll add four new parameters to the form: width and height of the region, and x and y offset from the top left of the image (this matches the -crop command of ImageMagick, but should be usable with any other image manipulation software we might switch to in the future).
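Assuming the four parameters land on the API URL proposed earlier in this thread (the image URL below is just a placeholder), a client request might look roughly like:

import requests

params = {
    "engine": "tesseract",
    "langs[]": "en",
    "image": "https://upload.wikimedia.org/example-page-scan.png",  # placeholder
    "x": 100, "y": 100, "w": 600, "h": 400,  # the four new crop parameters
}
resp = requests.get("https://ocr.wmcloud.org/api.php", params=params)
print(resp.text)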

We could also add something like Cropper.js to the tool form, perhaps without too much drama (because it wouldn't have to coexist with zooming and panning). That'd make it much easier to figure out the coordinates.

We discussed this and believe the best route forward is to add server-side cropping with ImageMagick, as proposed at T269818#7219863. The UI aspect we may not be able to get to, unfortunately.

@Inductiveload on our call, there was some mention of volunteers being able to work on the UI if we built the server-side cropping! Lmk if I remember correctly

and thanks @Samwilson for chipping away and @MusikAnimal for making sure this ticket didn't get lost in the mix

@NRodriguez sure, I have some progress in a "plain" JS way, but it's rather ugly around the jQuery events. Since OpenSeadragon looks very close for the PageList widget, I'm kind of hoping it'll be possible to use that instead, without having to do it twice. It's quite easy in OSD:

[Screenshot: 2021-08-04_074417_566x461_screenshot.png (461×566 px, 20 KB)]

For reference:

@nayoub You might want to take a look at the above patch, which I've installed on the test site. I'm not sure if we want to change any of the design now that there's a draggable box thing.

The cropping is nice. Do you have plans to implement multiple boxes, so you can identify regions of a page? (Would that need a new ticket?)

The cropping is nice. Do you have plans to implement multiple boxes, so you can identify regions of a page? (Would that need a new ticket?)

I haven't really looked at multiple regions; a separate task would be best (we're hoping to get this one completed next week). It looks like it should be possible, although I guess it'd also be good to be able to save those region definitions between pages... not sure how that'd work, given the tool doesn't actually know about multipage works. Perhaps it'd be better to do it on the wiki side, where the region info could be saved against the Index. Then e.g. columns could be set and reused on each page (is that the sort of thing you mean?).

@Samwilson: see T288389 for the 'multi-region' feature request ticket. I've listed 3 use cases I've encountered (there may be others).

Doing a bulk batch QA, all looks great!