
CirrusSearch dumps for the Norwegian Bokmål Wikipedia link to the Italian Wikipedia
Closed, ResolvedPublic

Description

(Sorry, not sure exactly what tag to use here.)

Someone reported to Wikimedia Norge's Twitter account that some records in the CirrusSearch content dumps for the Norwegian Bokmål Wikipedia (nowiki) link to the Italian Wikipedia instead of to nowiki. They included an example in the tweet:

<doc id="1382225" url="https://it.wikipedia.org/wiki?curid=1382225" title="Gunnar Simenstad">

That link is not valid, but if you change the it in the URL to no, it leads to the correct article. And that's all I know.

Event Timeline

Restricted Application added subscribers: jeblad, Danmichaelo, jhsoby.

@Aklapper Could you please see if these are the right tags for this bug?

We'll double-check where the data is going here.

Thanks! As you may see in the Twitter thread, the error was in the file "nowiki-20180910-cirrussearch-content.json.gz", which seems to already have been deleted. I don't know how to check if it's a current issue myself.

These were being sent to archive.org before, but it looks like that stopped in May: https://archive.org/search.php?query=cirrussearch&sort=-publicdate

So sadly I don't seem to have a copy of the 20180910 dump anywhere. I've grabbed the oldest available dump (20180917) and the latest (20181126) and will see if I can find anything related. One curious part of this report is that the dump is not in XML, and we don't have a url field in the dumps. I'll still poke around for something plausibly related.

Looking at the dump itself, I can't find anything suspicious. I poked around to figure out what software they were likely using; it seems to be: https://github.com/attardi/wikiextractor/blob/master/cirrus-extract.py

Note that this is from the University of Pisa, and line 48 hardcodes: urlbase = 'http://it.wikipedia.org/'
Essentially, the script doesn't know how to determine the URL from the dumps (it's actually super not obvious; I don't know that we have a public API anywhere that they could use to figure out, from the dump filename, which URL everything belongs to).
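To make that concrete, here is a rough sketch of how a hardcoded base produces the links seen in the report. Only the urlbase assignment on line 48 is taken from the script; the function and variable names below are illustrative, not the script's actual code:

# Confirmed from cirrus-extract.py, line 48: the wiki base URL is a constant.
urlbase = 'http://it.wikipedia.org/'

# Illustrative sketch of how each record's url attribute is presumably
# assembled from that constant plus the page id, which would explain why a
# nowiki dump ends up linking to it.wikipedia.org:
def doc_url(page_id):
    return urlbase + 'wiki?curid=%s' % page_id

print(doc_url(1382225))  # -> http://it.wikipedia.org/wiki?curid=1382225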

This essentially amounts to a bug report for wikiextractor. For the current user to move forward, they will need to modify their copy of cirrus-extract.py to use the correct urlbase for the file being parsed.
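For reference, a minimal sketch of the kind of change that could work, assuming the dump filename (e.g. nowiki-20180910-cirrussearch-content.json.gz) is available to the script. The function name is my own and the mapping only covers the plain *.wikipedia.org case:

import re

def urlbase_from_dump_filename(filename):
    """Derive the wiki base URL from a CirrusSearch dump filename.

    Only handles the common '<lang>wiki-...' pattern, e.g.
    'nowiki-20180910-cirrussearch-content.json.gz' -> 'https://no.wikipedia.org/'.
    Other projects (wiktionary, wikisource, ...) would need their own rules.
    """
    m = re.match(r'([a-z_]+?)wiki-\d{8}-cirrussearch', filename)
    if not m:
        raise ValueError('cannot determine wiki from filename: %s' % filename)
    lang = m.group(1).replace('_', '-')
    return 'https://%s.wikipedia.org/' % lang

print(urlbase_from_dump_filename('nowiki-20180910-cirrussearch-content.json.gz'))
# -> https://no.wikipedia.org/

# Instead of the hardcoded assignment on line 48,
#   urlbase = 'http://it.wikipedia.org/'
# the script could derive the base from the file it was given, e.g.:
#   urlbase = urlbase_from_dump_filename(os.path.basename(dump_path))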

Thank you very much, @EBernhardson, that's some impressive detective work. I'll let the user know!