From Support Desk:
I'm trying to get all Wikipedia category links for a project. I've successfully managed to load the enwiki-latest-page.sql SQL dump using UTF-8. However, I get the following error when trying to parse enwiki-latest-categorylinks.sql in Python:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 1957: invalid continuation byte

If I ignore the errors, or decode the file byte by byte, the SQL file seems to contain non-Unicode characters after line 45, which messes the file up. Can anyone shed some light on the issue? Is this expected? I couldn't find anything on the help page. Since I was able to easily open the page database using UTF-8, I did not expect this error. The code I'm using is very simple:
with open(filepath, "r", encoding="utf-8") as f:
    for _ in range(80):  # Peek first 80 lines
        # outputFile.write(f.readline())
        print(f.readline())

Wikidump link: https://dumps.wikimedia.org/enwiki/latest/
Relevant help pages: Manual:Categorylinks table, https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download#Database_tables
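For context on what I've tried: a minimal sketch of the byte-by-byte approach mentioned above, reading the file in binary mode and decoding each line leniently rather than letting the text layer raise. The helper name `peek_lines` is my own, not from any library, and this assumes the stray bytes (like 0xdc) simply aren't valid UTF-8 in the dump:

```python
def peek_lines(filepath, n=80):
    """Yield the first n lines of a file that may contain
    non-UTF-8 bytes, replacing undecodable bytes with U+FFFD."""
    with open(filepath, "rb") as f:  # binary mode: no decode errors here
        for _ in range(n):
            raw = f.readline()
            if not raw:  # EOF reached before n lines
                break
            # errors="replace" substitutes invalid bytes instead of raising
            yield raw.decode("utf-8", errors="replace")
```

The same effect without a helper: open(filepath, "r", encoding="utf-8", errors="replace"), which makes the text layer itself substitute invalid bytes, at the cost of losing the original byte values.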