Figure out why two different runs of WikiExtractor.py used different paragraph delimiters?
Closed, Resolved, Public

Description

WikiExtractor.py was run on the Swedish Wikipedia dump file on two different machines and produced different text files: one uses "\n" as the paragraph delimiter, the other "\n\n" (the latter is probably the more correct one).
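
To tell which delimiter a given output file uses, a small check like the following may help. It is only a sketch: `delimiter_stats` and the sample strings are hypothetical and not part of WikiExtractor.py, and real output would be read from the extracted text files instead.

```python
import re


def delimiter_stats(text: str) -> dict:
    """Count single newlines versus blank-line breaks (runs of 2+ newlines)."""
    runs = re.findall(r"\n+", text)
    return {
        "single": sum(1 for r in runs if len(r) == 1),
        "blank_line": sum(1 for r in runs if len(r) >= 2),
    }


# Hypothetical samples standing in for the two output variants.
sample_a = "First paragraph.\nSecond paragraph.\n"
sample_b = "First paragraph.\n\nSecond paragraph.\n"

stats_a = delimiter_stats(sample_a)
stats_b = delimiter_stats(sample_b)
print(stats_a, stats_b)
```

An output with no blank-line breaks at all would suggest the "\n" variant.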

Furthermore, WikiExtractor.py drops some tokens, such as "1 500" (they disappear from the output).

(WikiExtractor.py is currently used but will probably be replaced.)

Event Timeline

Somehow, the "same" script had probably been downloaded using different methods.

A diff indicates that some characters had been mangled:

558c558
<         text = text.replace('<<', u'«').replace('>>', u'»')
---
>         text = text.replace('<<', u'«').replace('>>', u'»')
566,567c566,567
<         text = re.sub(u' (,:\.\)\]»)', r'\1', text)
<         text = re.sub(u'(\[\(«) ', r'\1', text)
---
>         text = re.sub(u' (,:\.\)\]»)', r'\1', text)
>         text = re.sub(u'(\[\(«) ', r'\1', text)
2082c2082
<         # Funny characters like ö aren't valid in URLs anyway
---
>         # Funny characters like ö aren't valid in URLs anyway
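
The following sketch shows one way such a difference can arise and how to locate it at the byte level. It assumes, purely for illustration, that the mangling came from UTF-8 source being decoded as Latin-1 and saved again; the actual cause is not established here, and `mangled_lines` is a hypothetical helper.

```python
def mangled_lines(clean: bytes, other: bytes):
    """Yield (line_number, clean_line, other_line) for lines whose bytes differ."""
    pairs = zip(clean.splitlines(), other.splitlines())
    for number, (a, b) in enumerate(pairs, start=1):
        if a != b:
            yield number, a, b


# One line from the diff above, encoded as UTF-8.
clean = "text = text.replace('<<', u'«').replace('>>', u'»')\n".encode("utf-8")

# Assumed mangling: the UTF-8 bytes are misread as Latin-1, then re-saved as UTF-8.
mangled = clean.decode("latin-1").encode("utf-8")

diffs = list(mangled_lines(clean, mangled))
for number, a, b in diffs:
    print(number, a, b, sep="\n")
```

Comparing raw bytes avoids the problem that two lines can look identical once a terminal or web page has rendered them.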

What is still puzzling is that the mangled version of the script seems to produce the correct output with respect to paragraph delimiters ("\n\n").