@ldelench_wmf That page is counting pages that have been marked as "Proofread" or "Validated"—using the radio buttons the Proofread Page extension adds to the edit form—as the result of a manual transcription that may or may not have used OCR text from one of several possible sources as a starting point. It does not directly measure anything related to OCR (but could, of course, conceivably provide an indirect measure).
Sat, Apr 17
Thu, Apr 8
In T269518#6983324, @Samwilson wrote:Do we want to allow duplicates of the same format?
Mon, Apr 5
The very simplest way would be to just change pageForIAItem() to always return an empty string.
@Samwilson A couple of thoughts on skimming (and I do mean skimming) the diff…
Thu, Apr 1
In fact, looking at the code in SpecialFileDuplicateSearch.php it looks like querying for Commons media isn't particularly more complicated than local media when inside core, and T175088 suggests Special:ListDuplicatedFiles should be on the monthly "expensive query pages" cron job in any case. In that context, is there any particular reason SpecialListDuplicatedFiles.php for a given project couldn't do a (very specialised version of a) cross-wiki join itself and stuff the results in a category?
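(As an aside, and not a substitute for a batch listing: for a single known file the web API can already report shared-repo duplicates via prop=duplicatefiles. A minimal sketch; the endpoint and file name are just examples, and my reading of the shared flag is an assumption.)
```
// Sketch: ask the API for duplicates of one file; entries flagged `shared`
// should (as I understand it) be the copies living on the shared repo (Commons).
const api = 'https://en.wikisource.org/w/api.php';
const params = new URLSearchParams( {
	action: 'query',
	prop: 'duplicatefiles',
	titles: 'File:Example.jpg', // illustrative file name
	format: 'json',
	origin: '*'
} );

fetch( `${ api }?${ params }` )
	.then( ( r ) => r.json() )
	.then( ( data ) => {
		for ( const page of Object.values( data.query.pages ) ) {
			for ( const dup of page.duplicatefiles || [] ) {
				console.log( dup.name, dup.shared ? '(Commons)' : '(local)' );
			}
		}
	} );
```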
In T268240#6965596, @JJMC89 wrote:The file usage section on file pages lists duplicates, including from Commons. However, there is no way to find these since Special:ListDuplicatedFiles only lists local duplicates.
I'm sure there are other uses for the functionality described here, but…
In T277768#6963612, @Inductiveload wrote:I think the current OCR tool will read ahead in the current file and OCR the other pages in the background and cache the results, on the assumption that if you want one, you or others will want more. But I'm not sure how far ahead it goes.
In T277768#6963360, @Samwilson wrote:
- it's not possible to add the text layer to the PDF/DjVu/etc.
Wed, Mar 31
In T278623#6961814, @Inductiveload wrote:For your conversion of Lippincot's v45 from Hathi, you can do a lot better:
In T278623#6958104, @Languageseeker wrote:There are multiple issues with PDF.
As Peter says, this needs some form of configurability and probably at the per-user level. English Wikisource generally unwraps lines, but even there there are users who rely on hard linebreaks when proofreading. OCR is also imperfect at detecting page features, so for some scans automatic unwrapping will end up going to the opposite extreme (all text in one big lump with no line breaks).
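For illustration, a minimal sketch of the kind of unwrapping heuristic that would need to be configurable (the function name and regexes are mine, not the OCR tool's):
```
// Sketch only: join hard-wrapped OCR lines into paragraphs. Blank lines stay
// as paragraph breaks; words hyphenated across a line break are rejoined.
// This is exactly the kind of heuristic that misfires when page features were
// mis-detected, hence the need for a per-user setting.
function unwrapOcrLines( text ) {
	return text
		// rejoin "exam-\nple" into "example"
		.replace( /(\w)-\n(?=\w)/g, '$1' )
		// turn a lone newline (not part of a blank line) into a space
		.replace( /([^\n])\n(?!\n)/g, '$1 ' );
}

console.log( unwrapOcrLines( 'A line that was\nhard-wrapped by the en-\ngine.\n\nA new paragraph.' ) );
// -> "A line that was hard-wrapped by the engine.\n\nA new paragraph."
```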
Sun, Mar 28
@Aklapper I'm not entirely steady on the projects/components and their scope, so apologies if I'm hopelessly confused, but looking at the descriptions for them I would say this task falls under MediaWiki-Uploading and UploadWizard? Or is this obviously pinpointed somewhere down in the Swift part of the stack? And maybe UploadWizard is excluded since this happens via API upload too?
Mon, Mar 22
Possibly related: T254459
Ok, testing the >100MB file locally on enWS (I think most of the relevant bits of the stack are the same as for Commons), bigChunkedUpload.js tells me "Upload is stuck" for every single chunk (32 x 20MB chunks) but then seems to recover. After the last chunk hits 100% it tells me "Server error 0 after uploading chunk:" (I think this is an empty response from the server). After waiting and retrying a couple more times it terminates with the message "FAILED: internal_api_error_DBQueryError: [91f56af6-cec2-4969-938f-3aeaf9f35aff] Caught exception of type Wikimedia\Rdbms\DBQueryError" which I'm pretty certain is coming from somewhere inside MW proper rather than from Rillke's code.
I've successfully uploaded several <100MB files in the time period. The one >100MB file I've tried fails (I've been blindly trying different things so exact failure symptoms are a bit vague). All uploads with bigChunkedUpload.js with stash/async deselected.
Mar 16 2021
Random, possibly not useful or relevant, thought: there's an effort somewhere to tighten the privacy policy in such a way that IP addresses are no longer visible (not even to Checkusers). IPs are also not very useful as an entry in a "Contributors to this book" list. Perhaps both issues could be addressed by grouping all logged-out contributions at the end as "…, and n anonymous contributors."?
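A rough sketch of what that grouping could look like, assuming a contributor list shaped roughly like what prop=contributors returns (the shape and names here are illustrative only):
```
// Sketch: collapse logged-out contributions into one trailing entry instead of
// listing IP addresses. `contributors` is assumed to be an array of
// { name, anon } objects; the real API shape may differ.
function formatCredits( contributors ) {
	const named = contributors.filter( ( c ) => !c.anon ).map( ( c ) => c.name );
	const anonCount = contributors.length - named.length;
	const parts = named.slice();
	if ( anonCount > 0 ) {
		parts.push( `and ${ anonCount } anonymous contributor${ anonCount === 1 ? '' : 's' }` );
	}
	return parts.join( ', ' );
}

console.log( formatCredits( [
	{ name: 'Alice', anon: false },
	{ name: '203.0.113.7', anon: true },
	{ name: '198.51.100.2', anon: true }
] ) );
// -> "Alice, and 2 anonymous contributors"
```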
Credits by default may be playing it safe, but does the risk really justify that much caution?
Hmm. Does it actually need to be machine-readable? I would have thought what was wanted was a way to just identify the license template output so that it could be rendered in the appropriate place, but otherwise just use the on-wiki rendered template. Structured data is nice for all sorts of other reasons, but for this purpose I would think a simple CSS class would be sufficient; or possibly an ID in order to ensure there is only one container for license information.
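To make that concrete, a sketch of how little an exporter would need if the rendered license block carried a known hook (the ws-license id here is hypothetical, not an existing convention):
```
// Sketch: lift the on-wiki rendered license block straight out of the page HTML.
// Using an id guarantees a single container; a class would allow several.
const licenseNode = document.getElementById( 'ws-license' );
const licenseHtml = licenseNode ? licenseNode.outerHTML : '';
// The exporter can then drop licenseHtml into the colophon/back matter as-is.
```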
Mar 6 2021
In T274959#6884032, @Xover wrote:The Wikisourcen (unlike Wikipedia) do not create original content that attracts a copyright.
In T274959#6888776, @Prtksxna wrote:… we won’t be showing it to most downloaders.
Mar 4 2021
@Prtksxna The Wikisourcen (unlike Wikipedia) do not create original content that attracts a copyright. They merely (mechanically) reproduce public domain or already-freely-licensed works. The standard licensing terms under the edit form are for contributions outside the content namespaces (Scriptorium, User pages, Talk, etc.). Thus the only relevant licensing information is the one for the work itself, much as the licensing for a media file on Commons.
In T273708#6803667, @JAnD wrote:Not very good idea. There are works like encyclopedias or periodicals with thousands of subpages.
This solution would need to add a magic word to every subpage.
Mar 2 2021
Should the config change be a separate task for Site-Requests to be visible on the board?
Feb 27 2021
Hmm. As I recall, PRP uses a hard 1024px size for the "thumbnail" it requests. I am assuming this was a value picked as a sort of compromise between full fidelity to the user and various optimization concerns.
Hmm. Based on this and a few other recent failures, I'm starting to wonder if php-exec-command (which is the Command::exec() wrapper ia-upload uses to execute binaries) is broken and returns "Command not found" for any non-zero exit status.
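The distinction I suspect the wrapper is collapsing, sketched here in Node terms rather than PHP purely for illustration (ia-upload itself is PHP and uses php-exec-command):
```
// Sketch: "command not found" and "ran but exited non-zero" are different
// failures and should be reported differently.
const { execFile } = require( 'child_process' );

execFile( 'some-binary', [ '--flag' ], ( err, stdout, stderr ) => {
	if ( err && err.code === 'ENOENT' ) {
		console.error( 'Command not found' );                   // the binary is missing
	} else if ( err ) {
		console.error( `Exit status ${ err.code }:`, stderr );  // the binary ran and failed
	} else {
		console.log( stdout );
	}
} );
```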
Feb 26 2021
Feb 25 2021
In T101075#6780815, @matmarex wrote:… there's a difference in wikitext between an empty parameter and a not-provided-at-all parameter, …
Feb 19 2021
In T257066#6843760, @sbassett wrote:There is some progress being made on various protected tasks, …
So… we're currently waiting for a suitable volunteer to materialize out of thin air to address an issue whose details are not public for security reasons? And in the meantime we have many thousands of broken pages across multiple projects, and all we can do is bleed contributors in those areas?
Feb 11 2021
In T274495#6822475, @Koavf wrote:In T274495#6822466, @Xover wrote:Absent specific proposals for better wording
I gave a proposal.
Absent specific proposals for better wording I think the status quo works well enough. Far from every ebook user has any conception of file formats, much less any idea what kind is best for their device, so giving them enough information suited to their frame of reference to make a sensible choice is a priority.
Feb 6 2021
Let me throw an extra angel on the head of this needle: a user might conceivably want to export a work when currently on a wikipage in these namespaces, and a user might conceivably want to export a single page, as defined by a Page: wikipage, of a work.
I think that for any inherently paged format (like PDF), print should be a primary concern. For everything else we should nudge people towards ePub, where content can be dynamically reflowed. I have trouble imagining that a significant number of people actually print these onto dead trees, but that is the main rationale for the PDF format being designed the way it is.
Uhm. A5? Every printer in the world is designed for A4 (or its bastard offshoot, US Letter), and every sheet of printer paper sold ditto. The other sizes, including A5, are barely measurable in comparison. In fact, I think some of the B sizes may actually outsell A5 due to use in automated mass-mailings of various kinds.
Jan 20 2021
No, please don't. Forcing links to open in a new tab or window to keep the user on your site is literally a dark pattern in web design. Users are quite capable of opening a link in a new tab if they want to, and, conversely, those users who have trouble with this are also apt to be confused by navigating multiple tabs or windows.
Jan 13 2021
Jan 10 2021
Dec 21 2020
In T134469#6705758, @cscott wrote:I bet something like __NO_P_WRAP__ would be fairly easy to support. Would it get enough adoption to get us closer to our goal of turning it off by default?
Dec 20 2020
In T134469#2366099, @cscott wrote:… In ten years, I'd love for us to be at the point where we don't do <p>-wrapping at all …
Dec 18 2020
Yeah, daily would be better for newly added works. For changes to existing works the frequency could be much lower with not much problem I think. Alternatively new works could be manually triggered (we have lots of manual processes already) given an interface for it.
Dec 11 2020
Oh, no, wait… I think I'm just being a dummy!
Dec 10 2020
It definitely isn't working. On this page the paragraphs run together, but the output from djvutxt thefile.djvu -page=17 -detail=page is:
Hmm. $wgDjvuTxt is set in CommonSettings.php, so that should be ok.
Hmm. I didn't think there'd be any caching of this, but I may have misunderstood. It might also be that retrieveMetaData() is called once on upload rather than on demand as I'd assumed. And we need to check what $wgDjvuTxt is set to, since this whole block is only executed if that config var isset().
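For debugging, a sketch (via Node's child_process, since that's what I have handy for examples) of checking whether a given page actually has a text layer; the file path is illustrative and the empty-output check is an assumption about djvutxt's behaviour:
```
// Sketch: run djvutxt against one page; without -detail it should emit the
// plain text layer, so (roughly) empty output suggests there is no text layer.
const { execFile } = require( 'child_process' );

execFile( 'djvutxt', [ 'thefile.djvu', '-page=17' ], ( err, stdout ) => {
	if ( err ) {
		console.error( 'djvutxt failed:', err.message );
	} else if ( stdout.replace( /\f/g, '' ).trim() ) {
		console.log( 'Text layer present on page 17' );
	} else {
		console.log( 'No text layer on page 17' );
	}
} );
```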
Nov 18 2020
Just to add a perspective…
Nov 14 2020
In T228594#6622243, @Ankry wrote:In T228594#6622158, @Xover wrote:Could you apply this diff
Done.
In T228594#6622087, @FreeCorp wrote:In T228594#6615484, @Jan.Kamenicek wrote:… every word is on a new line. …
Same feedback as @Jan.Kamenicek tonight, although it seemed to work great a week ago.
In T228594#6621848, @Mpaa wrote:@Xover, I think it is a misunderstanding
data.text.substring(0,5) != "<?xml" → XML is accepted; if it is not XML, then it is considered an error.
Nov 13 2020
In T228594#6620968, @Pols12 wrote:…fall back to the old OCR when the returned text is an error message instead of XML content:
function hocr_callback(data) { if ( data.error || data.text.substring(0,5)!="<?xml" ) {
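For context, a hedged sketch of what that guard presumably amounts to in full (the fallback function name is hypothetical; the real gadget code is only quoted in fragment above):
```
// Sketch only: accept the response as hOCR only when it is XML; anything else
// (or an explicit error) falls back to the old OCR path.
function hocr_callback( data ) {
	if ( data.error || data.text.substring( 0, 5 ) !== '<?xml' ) {
		fallbackToOldOcr(); // hypothetical name for the old OCR path
		return;
	}
	// …otherwise process the hOCR XML in data.text as before.
}

// Hypothetical stand-in for the gadget's old OCR request path.
function fallbackToOldOcr() {
	console.log( 'Falling back to the old OCR service' );
}
```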
Nov 12 2020
Ok, I've now had some independent testing (Big big thank you to Jan!) that confirms the tweaked Gadget code now produces results that are at least within a reasonable distance of what it used to produce.
Nov 11 2020
Ok, an update on the corrupted cache…
In T228594#6615608, @Xover wrote:In T228594#6615484, @Jan.Kamenicek wrote:… the [OCR] result is very poor, …: every word is on a new line.
This is a separate problem, and is most likely related to Tesseract being upgraded to 4.x.
Nov 10 2020
In T228594#6615484, @Jan.Kamenicek wrote:Unfortunately, the OCR does not work with any of these at all
Nov 9 2020
In T228594#6614254, @Jan.Kamenicek wrote:… I tested it now e. g. on Page:John_Huss,_his_life,_teachings_and_death,_after_five_hundred_years.pdf/122 and some other pages of the same book and it still does not work here :-(
In T228594#6613257, @kaldari wrote:@Xover - What would be the effect of just deleting all the caches? Tesseract has been upgraded since most of those caches were generated anyway.
In T228594#6611611, @Pols12 wrote:… still broken for some others, e.g. Page:Europe in China.djvu/422, Page:Europe in China.djvu/422 or Page:Plutarch's Lives (Clough, v.4, 1865).djvu/215.
Nov 6 2020
The cache for a given work will be in a subdirectory of ~/cache/hocr/ created from the MD5 hash of the file's name (spaces replaced with underscores) concatenated with the invoking project's language code. So for Mexico_under_Carranza.djvu requested from English Wikisource, you can generate the hash with…
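A hedged sketch of that computation as described (Node.js; the exact concatenation order is my assumption):
```
// Sketch: derive the cache directory name from md5( filename + language code ).
const crypto = require( 'crypto' );

const filename = 'Mexico_under_Carranza.djvu'; // spaces already replaced with underscores
const lang = 'en';                             // invoking project's language code
const hash = crypto.createHash( 'md5' ).update( filename + lang ).digest( 'hex' );

console.log( `~/cache/hocr/${ hash }/` );
```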
Ok, having gotten access to the project in connection with T265640 I've been trying to debug this a bit.
Nov 5 2020
@JJMC89 Thanks!
Nov 4 2020
@Candalua Thanks!
@Candalua That leaves you as the only admin on phetools with any likelihood of having the spare cycles to look at this (Phe and Tpt are highly unlikely to be available for this any time soon). Any chance you could poke around here a bit?
Nov 2 2020
@Aklapper Indeed. Community-Tech was added as their Toolforge group account is one of the four accounts set as admin for the phetools Toolforge project.
Nov 1 2020
In T244657#5995256, @JTannerWMF wrote:… is this a challenge a lot of people are encountering?
Oct 19 2020
@kaldari Nope, still seeing the same failure mode. It greys out the text in the editor and then throws an error in the JS console à la "An error occurred during ocr processing: /tmp/52004_6179/page_0199.tif".
Oct 16 2020
In T265571#6549666, @Seudo wrote:Apparently the HTML entities are fixed automatically in the English Wikisource (when I try in this book).
This is a dup of T265571.
Oct 15 2020
Not having access to T263371 it's hard to say anything intelligent about the specific issue, but…
Hmm. If one API endpoint returns unencoded text, then lower-level components seem unlikely culprits. If one API returns encoded text, then higher-level components seem unlikely. IOW: those results suggest to me that this is happening at the API layer.
Oct 4 2020
In T256086#6515146, @Soda wrote:… That isn't what this task is about... It's about deprecating a hook that specifically allows the proofreadpage extension to disable the javascript-based mobile editor. …
In T255345#6514822, @Soda wrote:I'm not really sure if having the header and footer will help given that they will probably be in wikitext format and extracting it from the html would be easier than attempting to parse the wikitext.
In T256086#6484551, @ifried wrote:… This would mean that, if no additional changes are made, wikis that use the ProofReadPage extension would not be able to disable the editor. Instead, all wikis with the extension would have the same, standard editor?
Sep 4 2020
I didn't see the "Has somebody already looked at this edit?" aspect (i.e. mark as patrolled) in the above list. But maybe that's included in one of the existing bullet points?
Aug 15 2020
@kamholz I don't understand what it is you're proposing to do here, nor see how it will have applicability outside just Balinese content. From whence comes #transliterate and what does it do? Why hard-code <br> inside ProofreadPage and provide two copies of the text? Why can this not be done with a normal template?
Aug 8 2020
Aug 5 2020
In T259645#6362157, @kamholz wrote:You have a good point about transclusion. I haven't looked into that yet, but presumably the pagelang should be applied there too?
How would this interact with the ability to set the language for a single Page: page? How about mainspace transclusions of Page pages?
Aug 3 2020
@Aklapper Thanks, but as @TheDJ notes, going by that overview nobody owns multimedia features in WMF wikis now. That's a pretty sad state of affairs given how central multimedia is for almost all the projects (including Wikidata and whatever "Abstract" will end up as).
Jul 30 2020
Just to note: Multimedia was tagged as a key player here, but that Phab team seems to have been archived. Who owns the components previously in that group's remit now?
Jul 23 2020
In T258666#6330133, @daniel wrote:… the import code will try to load the previous revision immediately after creating it. …
Presuming T212428 is an API race condition triggered by replication lag, this seems to be a different issue.
Note that T258666 looks like it could be an expression of this problem that was exacerbated by MediaWiki 1.36/wmf.1 (which is, I think, scheduled to hit the Wikipedias today). And as I commented there: I've never before seen this problem, and I've now seen 3 reports from 3 different projects in the last 24 hours. If so, this is not just a log spamming problem with the occasional user-visible weirdness any more.
The FileImporter issue is apparently T258666 and not obviously related.
I'd never seen the issue from T212428 (and I use FileImporter a lot), but in the last 24 hours I've seen this issue reported from enWS, jpWP, and by a Chinese user (not sure which project they call home) on Commons. If it's not actually more prevalent after deploying MediaWiki 1.36/wmf.1 that's a heck of a coincidence!