Sun, Jun 20
Hmm. Is this really an issue with LST as such? The example at frWS uses the pseudo-LST ## section name ## syntax provided by a local Gadget, which is what forces a newline. So far as I know, raw <section begin="section name" /> syntax does not force a newline and should work out of the box for this.
Why Flexbox for this?
Sat, Jun 12
@Catrope @Reedy As someone who has handled similar requests in the past (5+ years ago), can you advise on who one might ping now to get eyes on this request? While I'm sure Mathieu can be persuaded to refresh as needed, the current download links expire in ~3 days and it would be good, if practical, to get the job started before then.
Fri, Jun 11
I’m pretty sure the math pseudo-language is not supported in Tesseract 4.x.
Sun, Jun 6
May 15 2021
@Mnafisalmukhdi1 The idWS OCR Gadget is outdated. You need to get a local interface administrator to apply this diff, or you could just cross-load OCR.js from Multilingual Wikisource. The only current interface admin I see on idWS is @Rachmat04, but if they are unavailable you can probably request assistance from the global interface admins by making a request on meta.
Just to note, treating the OCR text layer as metadata is conceptually a bit awkward: it is a separate representation of the file that happens to be automatically generated. It's more akin to a MIME email message that contains both a text/html representation and a text/plain representation. Still speaking conceptually, metadata about the text layer would be stuff like "Does this file have a text layer?" and "What format/text encoding is the text layer using?" and "Is the text layer for this file in a structured format?" and "What is the size in bytes of the text layer for this file?".
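For illustration, here's a minimal sketch of that MIME analogy using Python's standard email library (the message content is invented): one logical message, two alternative representations of the same content, much like a scan plus its OCR text layer.

```python
from email.message import EmailMessage

# Build a message carrying two alternative representations of the
# same content, analogous to a scanned file plus its OCR text layer.
msg = EmailMessage()
msg["Subject"] = "Example"
msg.set_content("Plain-text representation of the content.")
msg.add_alternative("<p>HTML representation of the content.</p>", subtype="html")

print(msg.get_content_type())  # multipart/alternative
for part in msg.iter_parts():
    print(part.get_content_type())  # text/plain, then text/html
```

Neither part is "metadata about" the other; they are peers under one container, which is the point of the analogy.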
T253072 strikes again!
May 2 2021
@Jdlrobson The screenshots in the description are using the then-default view (the RC filters were still a beta feature, or needed explicit opt-in, until 2018-ish as I recall; T157642, maybe?), but it doesn't matter which variant you use. The problem is equally evident in the screenshot you provided:
Apr 30 2021
Getting an empty string back for an image that contains no recognizable text is not an error; that's just the correct output. There are any number of reasons people might ask for OCR of an image that would return no text: there is text but the OCR engine fails to recognize it; they are doing a page image in a sequence and hitting the OCR button by habit (or their gadget does more than just load the OCR); they have a gadget that automatically requests the OCR on page load; etc. And in a bulk OCR scenario it will be entirely normal for the sequence of images being processed to contain anything from a few to several tens of blank pages.
Apr 29 2021
I just hit (what I think is) this issue with a ~30k page watchlist on enWS. The error message now (9 years after first report) looks like:
Going by the comments on the patch in Gerrit, isn't the actual state of this task "Declined"?
Apr 26 2021
Apr 24 2021
Ping. It would be useful to get an idea of what would be involved in making this work, and whether there are any on-wiki workarounds or fixes that could be made.
People involved with the project have been saying on-wiki that it is actually dead (the front page is just a zombie) for going on a year now, and the iw prefix is on a todo for removal on enWS along with deleting and/or deprecating all the related templates and references to it. Linking to it was also always iffy legally speaking due to the concept of linking as contributory copyright infringement (a concept one may disagree with, but one that courts and legislators appear entirely at ease with). I think the relevant support not only could but /should/ go.
Apr 17 2021
Apr 8 2021
@ldelench_wmf That page is counting pages that have been marked as "Proofread" or "Validated"—using the radio buttons the Proofread Page extension adds to the edit form—as a result of a manual transcription that may or may not have used OCR text from one of several different possible sources as a starting point. It does not directly measure anything related to OCR (but could, of course, conceivably provide an indirect measure).
Apr 5 2021
The very simplest way would be to just change pageForIAItem() to always return an empty string.
@Samwilson A couple of thoughts on skimming (and I do mean skimming) the diff…
Apr 1 2021
In fact, looking at the code in SpecialFileDuplicateSearch.php it looks like querying for Commons media isn't particularly more complicated than local media when inside core, and T175088 suggests Special:ListDuplicatedFiles should be on the monthly "expensive query pages" cron job in any case. In that context, is there any particular reason SpecialListDuplicatedFiles.php for a given project couldn't do a (very specialised version of a) cross-wiki join itself and stuff the results in a category?
I'm sure there are other uses for the functionality described here, but…
Mar 31 2021
As Peter says, this needs some form of configurability and probably at the per-user level. English Wikisource generally unwraps lines, but even there there are users who rely on hard linebreaks when proofreading. OCR is also imperfect at detecting page features, so for some scans automatic unwrapping will end up going to the opposite extreme (all text in one big lump with no line breaks).
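To illustrate why the detection failures matter: a naive unwrapping pass (a hypothetical sketch, not any proposed implementation) joins lines within a paragraph and relies on blank lines as paragraph separators, so when OCR fails to emit those blank lines the whole page collapses into one lump.

```python
def unwrap(text: str) -> str:
    """Naively unwrap hard line breaks, keeping blank lines
    as paragraph boundaries."""
    paragraphs = text.split("\n\n")
    return "\n\n".join(" ".join(p.split("\n")) for p in paragraphs)

good = "First line\nof para one.\n\nSecond para."
print(unwrap(good))  # paragraph break preserved

bad = "First line\nof para one.\nSecond para."  # OCR missed the blank line
print(unwrap(bad))   # everything merged into a single paragraph
```

This is why a per-user (or at least per-page) toggle matters: when the heuristic misfires, the user needs to be able to get the raw line breaks back.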
Mar 28 2021
@Aklapper I'm not entirely steady on the projects/components and their scope, so apologies if I'm hopelessly confused, but looking at the descriptions for them I would say this task falls under MediaWiki-Uploading and UploadWizard? Or is this obviously pinpointed somewhere down in the Swift part of the stack? And maybe UploadWizard is excluded since this happens via API upload too?
Mar 22 2021
Possibly related: T254459
Ok, testing the >100MB file locally on enWS (I think most of the relevant bits of the stack are the same as for Commons), bigChunkedUpload.js tells me "Upload is stuck" for every single chunk (32 x 20MB chunks) but then seems to recover. After the last chunk hits 100% it tells me "Server error 0 after uploading chunk:" (I think this is an empty response from the server). After waiting and retrying a couple more times it terminates with the message "FAILED: internal_api_error_DBQueryError: [91f56af6-cec2-4969-938f-3aeaf9f35aff] Caught exception of type Wikimedia\Rdbms\DBQueryError" which I'm pretty certain is coming from somewhere inside MW proper rather than from Rillke's code.
I've successfully uploaded several <100MB files in the time period. The one >100MB file I've tried fails (I've been blindly trying different things so exact failure symptoms are a bit vague). All uploads with bigChunkedUpload.js with stash/async deselected.
Mar 16 2021
Credits by default may be playing it safe, but does the risk really justify that much caution?
Hmm. Does it actually need to be machine-readable? I would have thought what was wanted was a way to just identify the license template output so that it could be rendered in the appropriate place, but otherwise just use the on-wiki rendered template. Structured data is nice for all sorts of other reasons, but for this purpose I would think a simple CSS class would be sufficient; or possibly an ID in order to ensure there is only one container for license information.
Mar 6 2021
Mar 4 2021
@Prtksxna The Wikisourcen (unlike Wikipedia) do not create original content that attracts a copyright. They merely (mechanically) reproduce public domain or already-freely-licensed works. The standard licensing terms under the edit form are for contributions outside the content namespaces (Scriptorium, User pages, Talk, etc.). Thus the only relevant licensing information is the one for the work itself, much as the licensing for a media file on Commons.
Mar 2 2021
Should the config change be a separate task for Site-Requests to be visible on the board?
Feb 27 2021
Hmm. As I recall, PRP uses a hard 1024px size for the "thumbnail" it requests. I am assuming this was a value picked as a sort of compromise between full fidelity to the user and various optimization concerns.
Hmm. Based on this and a few other recent failures, I'm starting to wonder if php-exec-command (the Command::exec() wrapper ia-upload is using to execute binaries) is broken and returning "Command not found" for any non-zero exit status.
Feb 26 2021
Feb 25 2021
Feb 19 2021
So… we're currently waiting for a suitable volunteer to materialize out of thin air to address an issue whose details are not public for security reasons? And in the meantime we have many thousands of broken pages across multiple projects and all we can do is bleed contributors in those areas?
Feb 11 2021
Absent specific proposals for better wording I think the status quo works well enough. Far from every ebook user has any conception of file formats, much less any idea what kind is best for their device, so giving them enough information suited to their frame of reference to make a sensible choice is a priority.
Feb 6 2021
Let me throw an extra angel on the head of this needle: a user might conceivably want to export a work when currently on a wikipage in these namespaces, and a user might conceivably want to export a single page, as defined by a Page: wikipage, of a work.
I think that for any inherently paged format (like PDF), print should be a primary concern. For everything else we should nudge people to ePub where content can be dynamically reflowed. I have trouble imagining that a significant number of people actually print these onto dead trees, but that is the main rationale for the design of the PDF format the way it is.
Uhm. A5? Every printer in the world is designed for A4 (or its bastard offshoot, US Letter), and every sheet of printer paper sold ditto. The other sizes, including A5, are barely measurable in comparison. In fact, I think some of the B sizes may actually outsell A5 due to use in automated mass-mailings of various kinds.
Jan 20 2021
No, please don't. Forcing links to open in a new tab or window to keep the user on your site is literally a dark pattern in web design. Users are quite capable of opening a link in a new tab if they want to, and, conversely, those users who have trouble with this are also apt to be confused by navigating multiple tabs or windows.
Jan 13 2021
Jan 10 2021
Dec 21 2020
Dec 20 2020
Dec 18 2020
Yeah, daily would be better for newly added works. For changes to existing works the frequency could be much lower without much problem, I think. Alternatively, new works could be manually triggered (we have lots of manual processes already) given an interface for it.
Dec 11 2020
Oh, no, wait… I think I'm just being a dummy!
Dec 10 2020
It definitely isn't working. On this page the paragraphs run together, but the output from djvutxt thefile.djvu -page=17 -detail=page is:
Hmm. $wgDjvuTxt is set in CommonSettings.php, so that should be ok.
Hmm. I didn't think there'd be any caching of this, but I may have misunderstood. It might also be that retrieveMetaData() is called once on upload rather than on demand as I'd assumed. And we need to check what $wgDjvuTxt is set to, since this whole block is only executed if that config var isset().
Nov 18 2020
Just to add a perspective…
Nov 14 2020
Nov 13 2020
Nov 12 2020
Ok, I've now had some independent testing (Big big thank you to Jan!) that confirms the tweaked Gadget code now produces results that are at least within a reasonable distance of what it used to produce.
Nov 11 2020
Ok, an update on the corrupted cache…
Nov 10 2020
Nov 9 2020
Nov 6 2020
The cache for a given work will be in a subdirectory of ~/cache/hocr/ created from the MD5 hash of the file's name (spaces replaced with underscores) concatenated with the invoking project's language code. So for Mexico_under_Carranza.djvu requested from English Wikisource, you can generate the hash with…
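A minimal sketch of generating that hash, assuming the concatenation is exactly filename-then-language-code as described above (worth verifying against the actual phetools code before relying on it):

```python
import hashlib

def hocr_cache_key(filename: str, lang: str) -> str:
    """MD5 of the file name (spaces replaced with underscores)
    concatenated with the requesting project's language code,
    per the cache layout described above."""
    normalized = filename.replace(" ", "_")
    return hashlib.md5((normalized + lang).encode("utf-8")).hexdigest()

# e.g. the cache subdirectory name for the example work:
print(hocr_cache_key("Mexico under Carranza.djvu", "en"))
```

The resulting hex digest is the subdirectory name to look for under ~/cache/hocr/.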
Ok, having gotten access to the project in connection with T265640 I've been trying to debug this a bit.
Nov 5 2020
Nov 4 2020
@Candalua That leaves you as the only admin on phetools with any likelihood of having the spare cycles to look at this (Phe and Tpt are highly unlikely to be available for this any time soon). Any chance you could poke around here a bit?
Nov 2 2020
Nov 1 2020
Oct 19 2020
@kaldari Nope, still seeing the same failure mode. It greys out the text in the editor and then throws an error in the JS console, à la: An error occurred during ocr processing: /tmp/52004_6179/page_0199.tif.