
Recent Dump with UTF-8 problem
Open, Needs Triage, Public

Description

From Support Desk:

I'm trying to get all Wikipedia category links for a project. I've successfully managed to load the enwiki-latest-page.sql dump using UTF-8. However, I get the following error when trying to parse enwiki-latest-categorylinks.sql in Python:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 1957: invalid continuation byte

If I ignore the errors, or decode the file byte by byte, the SQL file seems to contain non-Unicode characters after line 45, which messes the file up. Can anyone shed some light on the issue? Is this expected? I couldn't find anything on the help page. Since I was able to easily open the page database using UTF-8, I did not expect this error. The code I'm using is very simple:

with open(filepath, "r", encoding="utf-8") as f:
    for _ in range(80):
        # Peek at the first 80 lines
        # outputFile.write(f.readline())
        print(f.readline())
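For reference, one way to locate the offending bytes without aborting is to read the dump in binary mode and decode each line leniently. This is a minimal sketch for inspection only, not a fix for the dump itself; it reuses the `filepath` variable from the snippet above:

with open(filepath, "rb") as f:
    for lineno, raw in enumerate(f, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            # Report the first line that is not valid UTF-8.
            print(f"line {lineno}: {err}")
            # errors="replace" substitutes U+FFFD for undecodable bytes,
            # so the surrounding context stays readable.
            print(raw.decode("utf-8", errors="replace")[:200])
            break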

Wikidump link: https://dumps.wikimedia.org/enwiki/latest/

Relevant help page: Manual:Categorylinks table, https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download#Database_tables