Maniphest T194861

Text is offset by one page
Open, High, Public

Assigned To

None

Authored By

	Mpaa
	May 16 2018, 9:46 PM

Description

In the generated DjVu file, the OCR text is offset by one page relative to the scan.

See e.g. page 27 of https://ia801202.us.archive.org/34/items/dollshousetwooth00ibse/dollshousetwooth00ibse.pdf
vs.
https://en.wikisource.org/w/index.php?title=Page:A_Doll%27s_House_and_two_other_Plays_by_Henrik_Ibsen.djvu/27&action=edit&redlink=1

I am not 100% sure it is a bug in the IAupload tool, as two files uploaded after this one were OK.
Maybe some special character in some files triggers the problem?

As a matter of fact:
djvutoxml A_Doll\'s_House_and_two_other_Plays_by_Henrik_Ibsen.djvu is OK
while
djvused A_Doll\'s_House_and_two_other_Plays_by_Henrik_Ibsen.djvu -e 'output-txt' gives SegFault.

Related Objects

Mentioned In: T347797: IA upload tool
T204020: OCR extracted from DjVu files is incorrectly assigned to pages

Event Timeline

Mpaa created this task. May 16 2018, 9:46 PM

Restricted Application added projects: Internet-Archive, Community-Tech. May 16 2018, 9:46 PM

Mpaa updated the task description. May 16 2018, 9:55 PM

The offset is present from the first page, so I think the segfault is a separate issue (it happens later in the file).

The text below is on page 2 of the PDF file, but on page 1 of the DjVu file.

djvudump A_Doll\'s_House_and_two_other_Plays_by_Henrik_Ibsen.djvu | less

FORM:DJVM [61940416] 
    DIRM [2475]       Document directory (bundled, 293 files 293 pages)
    FORM:DJVU [324804] {dollshousetwooth00ibse_p1.djvu} [P1]
      INFO [10]         DjVu 936x1500, v24, 650 dpi, gamma=2.2
      BG44 [8735]       IW4 data #1, 74 slices, v1.2 (color), 936x1500
      BG44 [150752]     IW4 data #2, 15 slices
      BG44 [164970]     IW4 data #3, 10 slices
      TXTz [292]        Hidden text (text, etc.)
    FORM:DJVU [74378] {dollshousetwooth00ibse_p2.djvu} [P2]
      INFO [10]         DjVu 936x1500, v24, 650 dpi, gamma=2.2
      BG44 [13446]      IW4 data #1, 74 slices, v1.2 (color), 936x1500
      BG44 [30066]      IW4 data #2, 15 slices
      BG44 [30584]      IW4 data #3, 10 slices
      TXTz [228]        Hidden text (text, etc.)

djvused A_Doll\'s_House_and_two_other_Plays_by_Henrik_Ibsen.djvu -e 'output-txt' | less

select; remove-txt
# ------------------------- 
select "dollshousetwooth00ibse_p1.djvu" # page 1
set-txt
(page 241 495 677 1188
 (column 407 1171 507 1188
  (region 407 1171 507 1188
   (para 407 1171 507 1188
    (line 407 1171 507 1188
     (word 407 1171 507 1188 "LIBRARY")))))
 (column 276 1127 637 1144
  (region 276 1127 637 1144
   (para 276 1127 637 1144
    (line 276 1127 637 1144
     (word 276 1127 390 1144 "BRIGHAM")
     (word 399 1127 486 1144 "YOUNG")
     (word 495 1127 637 1144 "UNIVERSITY")))))
 (column 241 936 677 1062
  (region 241 936 677 1062
   (para 241 936 677 1062
    (line 241 1029 677 1062
     (word 241 1029 506 1062 "THEODORE")
     (word 524 1029 677 1062 "FUCHS"))
    (line 339 936 574 972
     (word 339 937 512 972 "collection")
     (word 531 936 574 960 "on")))))
 (column 0 0 936 1500 "")
 (column 268 495 640 538
  (region 268 495 640 538
   (para 268 495 640 538
    (line 268 495 640 538
     (word 268 503 408 538 "Theatre")
     (word 429 495 640 538 "Technology")))))
 (column 0 0 936 1500 ""))

In the XML file at https://ia801202.us.archive.org/34/items/dollshousetwooth00ibse/dollshousetwooth00ibse_djvu.xml this text is also on page 2.
Is something wrong with the XML parsing or with its mapping onto the DjVu?

<OBJECT data="file://localhost//tmp/derive/dollshousetwooth00ibse//dollshousetwooth00ibse.djvu" height="4011" type="image/x.djvu" usemap="dollshousetwooth00ibse_0002.djvu" width="2502">
<PARAM name="PAGE" value="dollshousetwooth00ibse_0002.djvu"/>
<PARAM name="DPI" value="650"/>
<HIDDENTEXT>
<PAGECOLUMN>
<REGION backgroundColor="13346986">
<PARAGRAPH>
<LINE>
<WORD coords="1088,879,1357,834,878">LIBRARY</WORD>
</LINE>
</PARAGRAPH>
</REGION>
</PAGECOLUMN>
<PAGECOLUMN>
<REGION backgroundColor="13281193">
<PARAGRAPH>
<LINE>
<WORD coords="738,996,1044,951,995">BRIGHAM</WORD>
<WORD coords="1068,997,1301,951,995">YOUNG</WORD>
<WORD coords="1325,996,1704,951,995">UNIVERSITY</WORD>
</LINE>
</PARAGRAPH>
</REGION>
</PAGECOLUMN>
<PAGECOLUMN>
<REGION backgroundColor="13412778">
<PARAGRAPH>
<LINE>
<WORD coords="645,1257,1355,1171,1254">THEODORE</WORD>
<WORD coords="1403,1257,1811,1171,1256">FUCHS</WORD>
</LINE>
<LINE>
<WORD coords="907,1505,1371,1411,1503">collection</WORD>
<WORD coords="1421,1507,1537,1443,1507">on</WORD>
</LINE>
</PARAGRAPH>
</REGION>
</PAGECOLUMN>
<PAGECOLUMN></PAGECOLUMN>
<PAGECOLUMN>
<REGION backgroundColor="13347242">
<PARAGRAPH>
<LINE>
<WORD coords="719,2665,1093,2571,2663">Theatre</WORD>
<WORD coords="1147,2687,1711,2571,2663">Technology</WORD>
</LINE>
</PARAGRAPH>
</REGION>
</PAGECOLUMN>
<PAGECOLUMN></PAGECOLUMN>
</HIDDENTEXT>
</OBJECT>
<MAP name="dollshousetwooth00ibse_0002.djvu"/>

I have no debugging capabilities.
One possibility is that, under some circumstances, a misalignment arises between the generation of the
DjVu numbering:

$djvuFile = $buildDir . '/' . $this->itemId . '_p' . $jp2FileNum . '.djvu';

and XML numbering:

$object->PARAM[0]['value'] = $this->itemId . '_p' . $pageNum . '.djvu';

So djvuxmlparser makes the wrong association when matching the two.

Mpaa added a subscriber: Samwilson. May 17 2018, 6:56 PM

4nn1l2 subscribed. May 17 2018, 11:00 PM

It happened again: https://commons.wikimedia.org/wiki/File:Atlantis_Arisen.djvu
The explanation above also accounts for this case.

This is happening quite often nowadays.

Mpaa triaged this task as High priority. May 18 2018, 8:13 PM

Thanks Mpaa, I'll keep track of the ticket. Like I said on Wikisource, no urgency on my end with regard to this particular file (Atlantis Arisen).

Perhaps helpful in troubleshooting: IAupload offers the opportunity to remove the first page of a book (intended, I believe, for the kind of misleading rights claims typically introduced on a leading page by Google and others). On this one, it offered that option, but I declined. However, the interesting part: the image it showed me was of PAGE 2, not of the one that actually came through as PAGE 1. In other words, the actual cover [https://en.wikisource.org/w/index.php?title=Page:Atlantis_Arisen.djvu/2] rather than the registration page, or whatever it is, used by the Internet Archive [https://en.wikisource.org/w/index.php?title=Page:Atlantis_Arisen.djvu/1]. Since the offset of this unexpected behavior was also one page, it seems possible it might be connected...

@Peteforsyth, there seems to be little activity here; if you need, I can fix the file.
The original upload will remain available for inspection, if needed.

Hi @Mpaa, just noticed your reply. Yes, if you're able to fix the file, I'd appreciate it! I'm also curious what the process looks like, if you're able to describe it.

Done.
I edit the script file produced by djvused (http://djvu.sourceforge.net/doc/man/djvused.html, see the section "Dumping/restoring annotations and text"), realigning the page numbers in myfile.dsed, and then load it back into the DjVu file.
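The realignment step described above could be sketched roughly as follows. This is only an illustration in Python, not the procedure actually used in this thread: the function name `shift_selects` is hypothetical, and it assumes the per-page `select` lines look like the `output-txt` dump quoted earlier in this task (`select "<item>_p<N>.djvu" # page N`).

```python
import re

# Matches per-page lines such as: select "item_p1.djvu" # page 1
# (format assumed from the output-txt dump quoted in this task).
SELECT_RE = re.compile(r'^select "(?P<stem>.+_p)(?P<num>\d+)\.djvu"')


def shift_selects(script: str, offset: int) -> str:
    """Return the djvused script with each per-page select target shifted by offset."""
    out = []
    for line in script.splitlines():
        m = SELECT_RE.match(line)
        if m:
            new_num = int(m.group("num")) + offset
            # The trailing "# page N" comment is dropped rather than guessed.
            line = f'select "{m.group("stem")}{new_num}.djvu"'
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = (
        "select; remove-txt\n"
        'select "dollshousetwooth00ibse_p1.djvu" # page 1\n'
        "set-txt\n"
        '(page 0 0 936 1500 "")\n'
        ".\n"
    )
    print(shift_selects(sample, 1))
```

Assuming the dump succeeds for the file at hand, the script would come from `djvused myfile.djvu -e 'output-txt' > myfile.dsed`, and the shifted copy (here called shifted.dsed, a made-up name) could be applied with `djvused myfile.djvu -f shifted.dsed -s`. Blocks that end up pointing past the last page would still have to be removed by hand.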

Vvjjkkii renamed this task from Text is offset by one page to iucaaaaaaa. Jul 1 2018, 1:09 AM

Vvjjkkii added projects: CheckUser, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), Tamil-Sites, Gamepress, Hashtags, Jade, KartoEditor, Language-2018-Apr-June, New-Editor-Experiences, Mail, TCB-Team (now WMDE-TechWish).

Vvjjkkii updated the task description.

Bodhisattwa renamed this task from iucaaaaaaa to Text is offset by one page. Jul 1 2018, 3:00 PM

Bodhisattwa removed projects: TCB-Team (now WMDE-TechWish), Mail, New-Editor-Experiences, Language-2018-Apr-June, KartoEditor, Jade, Hashtags, Gamepress, Tamil-Sites, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), CheckUser.

Bodhisattwa updated the task description.

Niharika removed a project: Community-Tech. Jul 24 2018, 3:16 AM

@Samwilson:
IMO the root cause is that the XML file does not always contain an entry for every JP2 image.
In such cases, because the XML entries are renamed with an incrementing counter, every entry after a missing one is shifted and an offset is introduced.

In Jp2DjvuMaker.php:

		$pageNum = 0;
		foreach ( $xml->BODY->OBJECT as $object ) {
			$object['data'] = 'file://localhost'.$djvuFile;
			// The first PARAM is always 'PAGE'.
			$object->PARAM[0]['value'] = $this->itemId . '_p' . $pageNum . '.djvu';
			$pageNum++;

I think the best approach is to keep the original names rather than create new ones, which would actually make the XML modification even easier.
But maybe you have a reason for renaming that I am missing, given this comment in the code:

			// Make DjVu file of this page. Use the item identifier as the filename instead of
			// matching the JP2 so we can later modify the XML more easily.
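A minimal sketch of the "keep the original names" idea, written in Python for illustration rather than as a patch to Jp2DjvuMaker.php. The helper names are hypothetical; the only assumption taken from this thread is that the IA XML `usemap` values are the JP2 base names with a `.djvu` extension (as in the b28710964 example quoted in this task).

```python
from pathlib import PurePosixPath


def djvu_name_for_jp2(jp2_path: str) -> str:
    """Derive the per-page DjVu name from the JP2 file name, not from a counter."""
    return PurePosixPath(jp2_path).with_suffix(".djvu").name


def pages_with_text(jp2_paths, xml_usemaps):
    """Map each JP2 to True if the XML has an OBJECT whose usemap matches its name."""
    usemaps = set(xml_usemaps)
    return {p: djvu_name_for_jp2(p) in usemaps for p in jp2_paths}


if __name__ == "__main__":
    jp2s = [
        "b28710964_jp2/b28710964_0008.jp2",
        "b28710964_jp2/b28710964_0009.jp2",
    ]
    print(pages_with_text(jp2s, ["b28710964_0008.djvu"]))
```

With names derived this way, the XML would not need to be renamed at all, and a JP2 that has no OBJECT in the XML would simply get no text layer instead of shifting all later pages.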

See for instance https://archive.org/details/b28710964:

XML:

...
<OBJECT data="file://localhost//var/tmp/autoclean/derive/b28710964//b28710964.djvu" height="2999" type="image/x.djvu" usemap="b28710964_0008.djvu" width="1755"></OBJECT><MAP name="b28710964_0008.djvu"/>
<OBJECT data="file://localhost//var/tmp/autoclean/derive/b28710964//b28710964.djvu" height="2999" type="image/x.djvu" usemap="b28710964_0011.djvu" width="1755"></OBJECT><MAP name="b28710964_0011.djvu"/>
<OBJECT data="file://localhost//var/tmp/autoclean/derive/b28710964//b28710964.djvu" height="2999" type="image/x.djvu" usemap="b28710964_0012.djvu" width="1755">
...

JP2: the tool also processes 0009 and 0010, which have no entry in the XML

...
b28710964_jp2/b28710964_0008.jp2	jpg	2016-09-21 16:39	317877
b28710964_jp2/b28710964_0009.jp2	jpg	2016-09-21 16:39	331279
b28710964_jp2/b28710964_0010.jp2	jpg	2016-09-21 16:39	576592
b28710964_jp2/b28710964_0011.jp2	jpg	2016-09-21 16:39	97126
...
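To make the effect concrete, here is a small Python sketch of what sequential renaming does with the page lists shown above. It is a simplification: it assumes both counters are still in step at 0008, and `sequential_pairing` is a made-up name, not a function of the tool.

```python
# Page suffixes taken from the listings above.
jp2_pages = ["0008", "0009", "0010", "0011"]
xml_pages = ["0008", "0011", "0012"]


def sequential_pairing(jp2_pages, xml_pages):
    """Pair the i-th XML OBJECT with the i-th JP2, as two independent counters would."""
    return list(zip(xml_pages, jp2_pages))


pairs = sequential_pairing(jp2_pages, xml_pages)
# Keep only the pairs where the text of one page lands on a different image.
mismatched = [(xml, jp2) for xml, jp2 in pairs if xml != jp2]

if __name__ == "__main__":
    for xml, jp2 in pairs:
        print(f"text of {xml} -> image {jp2}")
```

Under these assumptions the text of 0011 ends up on the image of 0009, and every later page inherits the same shift.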

Ankry mentioned this in T204020: OCR extracted from DjVu files is incorrectly assigned to pages. Sep 11 2018, 6:39 AM

If you look in the scandata XML file at the IA, some pages are marked <addToAccessFormats>false</addToAccessFormats>, for example the registration images (generally the first and last, sometimes more). Skipping these images brings everything back into alignment.
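A rough Python sketch of that filtering step. Only the `addToAccessFormats` element comes from this thread; the surrounding layout used in the sample (`book`/`pageData`/`page` with a `leafNum` attribute) and the function name `accessible_leaves` are assumptions for illustration and should be checked against a real scandata file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed scandata sample (layout is an assumption).
SAMPLE_SCANDATA = """<book>
  <pageData>
    <page leafNum="0"><addToAccessFormats>false</addToAccessFormats></page>
    <page leafNum="1"><addToAccessFormats>true</addToAccessFormats></page>
    <page leafNum="2"><addToAccessFormats>true</addToAccessFormats></page>
  </pageData>
</book>"""


def accessible_leaves(scandata_xml: str) -> list[int]:
    """Return leaf numbers of pages not marked addToAccessFormats=false."""
    root = ET.fromstring(scandata_xml)
    leaves = []
    for page in root.iter("page"):
        # Pages without the element are treated as included (an assumption).
        flag = page.findtext("addToAccessFormats", default="true")
        if flag.strip().lower() != "false":
            leaves.append(int(page.get("leafNum")))
    return leaves


if __name__ == "__main__":
    print(accessible_leaves(SAMPLE_SCANDATA))
```

The resulting list could then decide which JP2 images are converted at all, so that the pages in the DjVu and the OBJECT entries in the XML are counted over the same set.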

4nn1l2 unsubscribed. Jun 11 2020, 4:33 PM

Peteforsyth merged a task: T276616: Text in DJVU file generated by IA-Import 2 is offset by one page. Mar 5 2021, 8:45 PM

Harej moved this task from Backlog to Integrations on the Internet-Archive board. Nov 2 2021, 3:39 AM

Yann subscribed. Feb 18 2023, 10:18 AM

Samwilson mentioned this in T347797: IA upload tool. Oct 3 2023, 11:14 AM