
Investigate `UnicodeEncodeError` thrown by Add-A-Link training pipeline for fywiki model
Open, Needs Triage, Public

Description

During the training pipeline for the Western Frisian Wikipedia Add-A-Link model, a UnicodeEncodeError is thrown as shown in the screenshot below:

fywiki training pipeline error - Screenshot from 2022-12-12 10-19-15.png (741×1 px, 222 KB)

The goal is to identify the cause of this error and a way to fix it.

Event Timeline

Discussed this issue with @MGerlach and he advised that it could be caused by the wikipedia2vec package that the link recommendation algorithm relies on.

Reproduced the error by running wikipedia2vec on its own instead of the entire training pipeline. The same UnicodeEncodeError was thrown, as seen in the screenshot below:

Input:

$ WIKI_ID=fywiki
$ ionice wikipedia2vec train \
--min-entity-count=0 \
--dim-size 50 \
--pool-size 8 \
"/mnt/data/xmldatadumps/public/${WIKI_ID}/latest/${WIKI_ID}-latest-pages-articles.xml.bz2" \
"${WIKI_ID}.w2v.bin"

Results:

wikipedia2vec package fywiki training pipeline error - Screenshot from 2022-12-20 11-34-28.png (741×1 px, 222 KB)

Looked at the wikipedia2vec repo and found existing issues that report a UnicodeEncodeError:
https://github.com/wikipedia2vec/wikipedia2vec/search?q=UnicodeEncodeError&type=issues

The comments on these issues suggest two possible solutions:

Solution 1: use the --disambi option. It is not clear whether this will give us the training data we need.
Source: https://github.com/wikipedia2vec/wikipedia2vec/issues/68#issuecomment-848422809

Solution 2: change the encoding in wikipedia2vec. This requires us to patch the wikipedia2vec package. If our fix is not merged upstream, like the pull request linked below, we are stuck with maintaining our own version of wikipedia2vec.
Source: https://github.com/wikipedia2vec/wikipedia2vec/pull/72#issue-1182830170

Will consider more solutions as I come across them.

While looking into where in wikipedia2vec the UnicodeEncodeError could be fixed, I noticed that it is raised at this point:

File "wikipedia2vec/mention_db.pyx", line 162, in wikipedia2vec.mention_db.MentionDB.build
File "src/marisa_trie.pyx", line 88, in marisa_trie._Trie.__init__
File "src/marisa_trie.pyx", line 120, in marisa_trie._Trie._build
File "src/marisa_trie.pyx", line 86, in genexpr
File "src/marisa_trie.pyx", line 399, in marisa_trie._UnicodeKeyedTrie._encode_key
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-11: surrogates not allowed

This suggests that the error is raised in the marisa-trie package that wikipedia2vec relies on, at the point where it encodes trie keys as UTF-8.
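
For reference, the "surrogates not allowed" message can be reproduced in plain Python without wikipedia2vec or marisa-trie. The sketch below shows that a string containing a code point in the surrogate range (U+D800 to U+DFFF) cannot be encoded as UTF-8, and how such code points could be located in a key. The helper names and the sample strings are made up for illustration; they are not taken from the fywiki dump.

```python
def find_surrogates(text):
    """Return (index, code point) pairs for surrogate code points in text."""
    return [(i, hex(ord(ch))) for i, ch in enumerate(text)
            if 0xD800 <= ord(ch) <= 0xDFFF]


def utf8_error_reason(text):
    """Return the codec's error reason if text cannot be UTF-8 encoded, else None."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError as exc:
        return exc.reason
    return None


if __name__ == "__main__":
    clean = "Fryslân"        # ordinary non-ASCII text encodes without problems
    broken = "Frysl\ud83dn"  # hypothetical key containing a surrogate code point
    print(utf8_error_reason(clean))
    print(utf8_error_reason(broken))
    print(find_surrogates(broken))
```

Running a check like this over the keys that end up in the trie would show which mentions carry the surrogates, which might help trace them back to specific pages in the dump.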

Since the problem sits two levels down the dependency tree, the ideal solution would be one that does not modify any of the dependencies.
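
One direction that would leave the dependencies untouched is to clean the strings before they are ever encoded. The following is only a sketch under a big assumption: that the offending strings can be intercepted on our side before they reach the trie, which has not been established here. The function names are hypothetical. It round-trips each string through UTF-16, which merges valid high/low surrogate pairs into the character they represent and replaces any unpaired surrogate with U+FFFD.

```python
def repair_surrogates(text):
    """Return a copy of text that is safe to encode as UTF-8."""
    # 'surrogatepass' lets surrogate code points through on encode.
    # Decoding the resulting UTF-16 merges valid high/low pairs into a
    # single character; 'replace' substitutes U+FFFD for unpaired ones.
    raw = text.encode("utf-16-le", "surrogatepass")
    return raw.decode("utf-16-le", "replace")


def encodable_keys(keys):
    """Yield each key with its surrogate code points repaired."""
    for key in keys:
        yield repair_surrogates(key)


if __name__ == "__main__":
    # Hypothetical keys: plain text, a surrogate pair, and a lone surrogate.
    samples = ["Fryslân", "\ud83d\ude00", "a\ud800b"]
    for key in encodable_keys(samples):
        key.encode("utf-8")  # no longer raises UnicodeEncodeError
```

Whether replacing unpaired surrogates is acceptable depends on how the affected mentions are used in training; dropping those keys entirely would be an alternative.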

Just want to add that this error is not unique to fywiki and will likely appear when working with other wikis as well. It has been reported for zhwiki (see the issues mentioned above) and I was able to reproduce that error.

The Chinese Wikipedia (zhwiki) training pipeline is also throwing this error:

File "wikipedia2vec/dictionary.pyx", line 231, in wikipedia2vec.dictionary.Dictionary.build
File "wikipedia2vec/dump_db.pyx", line 124, in wikipedia2vec.dump_db.DumpDB.is_disambiguation
File "wikipedia2vec/dump_db.pyx", line 125, in wikipedia2vec.dump_db.DumpDB.is_disambiguation
File "wikipedia2vec/dump_db.pyx", line 126, in wikipedia2vec.dump_db.DumpDB.is_disambiguation
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 2-3: surrogates not allowed

Sharing this here to keep track of the wikis affected by this error, so that once it is fixed we remember to rerun their training pipelines.
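
A lightweight way to keep that list could be to scan the captured output of each wiki's training run for this specific error. The sketch below assumes the stderr of each run is available as text, keyed by wiki ID; the `logs` dictionary and the `affected_wikis` helper are hypothetical, and the log excerpts are shortened from the tracebacks above.

```python
import re

# Matches the final traceback line seen for both fywiki and zhwiki.
SURROGATE_ERROR = re.compile(r"UnicodeEncodeError:.*surrogates not allowed")


def affected_wikis(logs):
    """Return the sorted wiki IDs whose log contains the surrogate error."""
    return sorted(wiki for wiki, text in logs.items()
                  if SURROGATE_ERROR.search(text))


if __name__ == "__main__":
    # Hypothetical stand-in for captured stderr, one entry per wiki.
    logs = {
        "fywiki": "UnicodeEncodeError: 'utf-8' codec can't encode characters "
                  "in position 0-11: surrogates not allowed",
        "zhwiki": "UnicodeEncodeError: 'utf-8' codec can't encode characters "
                  "in position 2-3: surrogates not allowed",
        "examplewiki": "training finished",
    }
    print(affected_wikis(logs))
```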