Figure out why two different runs of WikiExtractor.py used different paragraph delimiters?
Closed, ResolvedPublic
Assigned To

Authored By

	NikolajLindberg
	Sep 5 2020, 12:12 PM

Description

WikiExtractor.py was run on the Swedish Wikipedia dump file on two different machines and produced different text files: one used "\n" and the other "\n\n" as the paragraph delimiter (the latter should be more correct).

Furthermore, WikiExtractor.py drops some tokens, such as "1 500" (they disappear from the output).

(WikiExtractor.py is currently used but will probably be replaced.)

Related Objects

Status	Assigned	Task
Resolved	NikolajLindberg	T261928 ☂Wikispeech Recording Manuscript Tool
Resolved	NikolajLindberg	T261938 Find out how to best extract raw text from Wikipedia articles (probably from a dump file)
Resolved	NikolajLindberg	T262131 Figure out why two different runs of WikiExtractor.py used different paragraph delimiters?

Event Timeline

NikolajLindberg created this task. Sep 5 2020, 12:12 PM

Restricted Application added a project: Wikispeech-Jobrunner. Sep 5 2020, 12:12 PM

Somehow, the "same" script had probably been downloaded using different methods on the two machines.

A diff of the two copies indicates that some non-ASCII characters had been mangled in one of them:

558c558
<         text = text.replace('<<', u'«').replace('>>', u'»')
---
>         text = text.replace('<<', u'Â«').replace('>>', u'Â»')
566,567c566,567
<         text = re.sub(u' (,:\.\)\]»)', r'\1', text)
<         text = re.sub(u'(\[\(«) ', r'\1', text)
---
>         text = re.sub(u' (,:\.\)\]Â»)', r'\1', text)
>         text = re.sub(u'(\[\(Â«) ', r'\1', text)
2082c2082
<         # Funny characters like ö aren't valid in URLs anyway
---
>         # Funny characters like Ã¶ aren't valid in URLs anyway
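
The mangled characters in the diff match the pattern produced when UTF-8 bytes are decoded with a single-byte codec such as Latin-1. That this is what happened during one of the downloads is an assumption, not something confirmed in this task. A minimal Python sketch that reproduces the pattern (the function names `mangle` and `unmangle` are made up for illustration):

```python
def mangle(text: str) -> str:
    """Encode as UTF-8, then wrongly decode the bytes as Latin-1.

    Assumption: this is how one copy of the script was damaged;
    the task does not confirm the actual cause.
    """
    return text.encode("utf-8").decode("latin-1")


def unmangle(text: str) -> str:
    """Reverse the damage: recover the original bytes, decode as UTF-8."""
    return text.encode("latin-1").decode("utf-8")


# The three characters that differ in the diff above.
for ch in ("«", "»", "ö"):
    damaged = mangle(ch)
    print(repr(ch), "->", repr(damaged), "->", repr(unmangle(damaged)))
```

Each two-byte UTF-8 character becomes two visible characters, which is why a single guillemet turns into "Â" followed by the guillemet in the second copy of the script.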

What is still puzzling is that the mangled version of the script seems to produce the correct output with respect to paragraph delimiters ("\n\n").

NikolajLindberg closed this task as Resolved. Sep 17 2020, 10:52 AM
