Categorylinks dump might have some problem with the encoding
Closed, Invalid | Public | BUG REPORT

Description

Steps to Reproduce:
I am parsing the dump rowiki-latest-categorylinks.sql and noticed that the characters seem corrupted.
I was looking for specific categories and could not find them. That is when I realized that the characters were not the same.
I printed some lines in the terminal and saw the problem.

Expected and Actual Results:
For example, I expected to see Dansuri_românești and I see Dansuri_rom\xc3\xa2ne\xc8\x99ti.

This is the line I printed in the terminal:
['750456', "'Dansuri_rom\xc3\xa2ne\xc8\x99ti'", "'GEGQ?)K1\x03\x06/)CMQK9\x04GEGQ?)K1\x04KEA*\x03C1N\x02O9\x01%\x06\x01\xdc\xbe\xdc\x1d'", "'2010-08-04 16:36:40'", "'Populare'", "'uca-ro-u-kn'", "'subcat'"]

This is a line in the sql file:
(750456,'Dansuri_româneÈ™ti','GEGQ?)K1/)CMQK9GEGQ?)K1KEA*C1NO9%ܾÜ','2010-08-04 16:36:40','Populare','uca-ro-u-kn','subcat')

This is the same line in the Mysql replica database:
select * from categorylinks where cl_from = 750456 limit 1;
+---------+---------------------+------------------------------------------------+---------------------+-------------------+--------------+---------+
| cl_from | cl_to               | cl_sortkey                                     | cl_timestamp        | cl_sortkey_prefix | cl_collation | cl_type |
+---------+---------------------+------------------------------------------------+---------------------+-------------------+--------------+---------+
|  750456 | Dansuri_românești   | GEGQ?)K1/)CMQK9GEGQ?)K1KEA*C1NO9%ܾ?            | 2010-08-04 16:36:40 | Populare          | uca-ro-u-kn  | subcat  |
+---------+---------------------+------------------------------------------------+---------------------+-------------------+--------------+---------+

Perhaps I am doing something incorrectly but I had no problem with other languages. Please let me know.
Thank you very much.

Marc Miquel

Event Timeline

I've noticed that other languages like Russian or Macedonian have the same problem.

@ArielGlenn is this something you'd know about or know who to point me to?

echo -n ânești  | od -t x1
0000000 c3 a2 6e 65 c8 99 74 69

You appear to be seeing a string representation of the non-ASCII characters as hex bytes, i.e. xc3 xa2 ne xc8 x99 ti. What command are you using to display the text in the file, and on what platform?
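To make the point above concrete, here is a small Python sketch (Python 3.8+ is assumed for `bytes.hex` with a separator). It reproduces the `od` output and shows that printing a `bytes` object, rather than a decoded string, gives exactly the kind of escaped text seen in the report. The variable names are only illustrative.

```python
# UTF-8 bytes of the fragment passed to od above.
fragment = "ânești".encode("utf-8")
hex_dump = fragment.hex(" ")
print(hex_dump)  # c3 a2 6e 65 c8 99 74 69

# Printing (or str()-ing) a bytes object shows non-ASCII bytes as \xNN escapes.
raw_title = "Dansuri_românești".encode("utf-8")
as_shown = str(raw_title)
print(as_shown)  # b'Dansuri_rom\xc3\xa2ne\xc8\x99ti'

# Decoding the same bytes as UTF-8 gives the expected title back.
decoded = raw_title.decode("utf-8")
print(decoded)
```

So the bytes themselves are fine; what matters is whether the program decodes them before display.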

> @ArielGlenn is this something you'd know about or know who to point me to?

I think the wdqs folks are going to be your best bet, I've added the project. Looks like a simple text encoding error, but I'd like to know exactly what tools were used to display the text before saying that for sure.

The encoding looks correct in my terminal:

$ curl -s https://dumps.wikimedia.org/rowiki/20201001/rowiki-20201001-categorylinks.sql.gz | gunzip | sed 's/),(/),\n(/g' | grep -aF Dansuri_rom
(750456,'Dansuri_românești','GEGQ?)K1/)CMQK9GEGQ?)K1KEA*C1NO9%ܾ','2010-08-04 16:36:40','Populare','uca-ro-u-kn','subcat'),
(770750,'Dansuri_românești','+K*Q       /)CM    ','2012-03-01 17:39:10','','uca-ro-u-kn','page'),
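For anyone who prefers to do the same check from Python, a rough equivalent of the pipeline above might look like this. It is only a sketch: it assumes the dump has already been downloaded to a local path, `grep_tuples` is a made-up helper name, and the split on `),(` is as naive as the `sed` expression (it would break on a value that contains that sequence).

```python
import gzip


def grep_tuples(path, needle):
    """Yield rows of a gzipped SQL dump that contain `needle` (a bytes value)."""
    with gzip.open(path, "rb") as fh:
        for line in fh:
            # Only the INSERT statements carry row data.
            if not line.startswith(b"INSERT INTO"):
                continue
            # Same naive split as the sed expression: one chunk per row.
            for chunk in line.split(b"),("):
                if needle in chunk:
                    # Search on bytes, decode only for display; bad bytes
                    # become U+FFFD instead of raising an exception.
                    yield chunk.decode("utf-8", errors="replace")


# Example call (the file name is an assumption):
# for row in grep_tuples("rowiki-20201001-categorylinks.sql.gz", "Dansuri_rom".encode("utf-8")):
#     print(row)
```

Searching on bytes and decoding only for display keeps one bad sort key from hiding every title on the same line.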

Thank you @ArielGlenn and @Lucas_Werkmeister_WMDE,

So, to explain what I am doing ( https://pastebin.com/kPrwQ0Lb ).

I am first collecting all the categories from the page dump and putting them into some dictionaries.
Then, I am parsing the categorylinks dump and I add the page_ids these categories contain.

The problem is in the category titles in which there are these special characters.
The first dump seems to work, but the second shows these hex bytes.

Perhaps it has to do with how the second dump must be opened or read, but I cannot find a way to read it as UTF-8. I just added a print('error') and I see many of them.

Shouldn't the second dump work exactly like the first?
What could I do?

Thanks.

It looks to me like there are truncated values for the cl_sortkey for rowiki, which prevent the utf8 conversion on line 55 of the pastebin from working. This leads to your lines all remaining essentially byte-encoded with the results you see when displaying the content. I would look into the sortkeys as stored on rowiki and see what's going on. By contrast when I look at e.g. elwiki's category links, there is plenty of non-ascii text there but no bad entries in the table.

I copy-paste a selection from the raw sql on rowiki so that you can see what I mean:

ariel@mwmaint2001:~$ sql rowiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 3392742762
Server version: 10.4.14-MariaDB-log MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

wikiadmin@10.192.0.146(rowiki)> select * from categorylinks limit 5;
+---------+-----------------------------------------------------------------+--------------+---------------------+-------------------+--------------+---------+
| cl_from | cl_to                                                           | cl_sortkey   | cl_timestamp        | cl_sortkey_prefix | cl_collation | cl_type |
+---------+-----------------------------------------------------------------+--------------+---------------------+-------------------+--------------+---------+
|       1 | Articole_cu_multe_probleme                                      | KE-)KO)
                                                                                     �
  | 2015-10-25 20:15:17 |                   | uca-ro-u-kn  | page    |
|       1 | Articole_cu_suport_bibliografic_necorespunzător                 | KE-)KO)
                                                                                     �
  | 2010-07-24 17:29:17 |                   | uca-ro-u-kn  | page    |
|       1 | Articole_cu_suport_bibliografic_necorespunzător_din_iulie_2010  | KE-)KO)
                                                                                     �
  | 2015-01-10 20:21:51 |                   | uca-ro-u-kn  | page    |
|       1 | Articole_scrise_într-un_ton_nepotrivit_iulie_2010               | KE-)KO)
                                                                                     �
  | 2012-07-08 20:03:11 |                   | uca-ro-u-kn  | page    |
|       1 | Enciclopedii_din_secolul_al_XX-lea                              | KE-)KO)
                                                                                     �
  | 2019-10-05 07:22:16 |                   | uca-ro-u-kn  | page    |
+---------+-----------------------------------------------------------------+--------------+---------------------+-------------------+--------------+---------+
5 rows in set (0.00 sec)
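To illustrate why one truncated sort key is enough to leave a whole line "byte-encoded", here is a minimal sketch. The row is hypothetical: the last two bytes of its sort key (0xdc followed by 0x1d) are an assumption modelled on the tail of the sort key printed in the report, and the fallback to `str()` is only a guess at what a script might do when decoding fails.

```python
# Hypothetical row: a valid UTF-8 title plus a sort key ending in a cut-off character.
title = "Dansuri_românești".encode("utf-8")
sortkey = b"GEGQ?)K1\xdc\xbe\xdc\x1d"  # 0xdc starts a 2-byte sequence, 0x1d cannot continue it
row = b"(750456,'" + title + b"','" + sortkey + b"','subcat')"

try:
    text = row.decode("utf-8")
    failed = False
except UnicodeDecodeError as err:
    # A strict decode rejects the whole row, not just the bad field.
    failed = True
    text = str(row)  # fallback: the bytes repr, full of \xNN escapes
    print("decode failed:", err.reason)

print(text)
```

The title on its own is valid UTF-8; it only shows up as escapes because it shares a line with the bad sort key.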

I think I understand what you mean, but I am not entirely familiar with encoding.
So, if the cl_sortkey field is breaking the conversion into utf-8, can't I just avoid this field somehow? Actually, I don't need it - I only need cl_from and cl_to.

You have a couple of options. You can replace, skip, or ignore the bad characters; see https://docs.python.org/3/howto/unicode.html and look for the paragraph starting "The errors argument specifies the response when the input string can’t be converted according to the encoding’s rules." Alternatively you can split the line into entries yourself and just pull out the ones you want. Whichever is easiest for you!
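Both options could be sketched roughly as follows. The row is a hypothetical example with a deliberately broken sort key, and the regular expression is an assumption about the tuple layout (integer cl_from followed by a quoted cl_to), not something taken from the pastebin.

```python
import re

# Hypothetical dump row whose sort key is not valid UTF-8.
row = (b"(750456,'Dansuri_rom\xc3\xa2ne\xc8\x99ti','GEGQ?)K1\xdc\x1d',"
       b"'2010-08-04 16:36:40','Populare','uca-ro-u-kn','subcat')")

# Option 1: decode the whole row and tell the codec what to do with bad bytes.
ignored = row.decode("utf-8", errors="ignore")    # silently drops undecodable bytes
replaced = row.decode("utf-8", errors="replace")  # substitutes U+FFFD for them

# Option 2: pick out cl_from and cl_to at the byte level and decode only those,
# so the sort key is never decoded at all.
FIRST_TWO = re.compile(rb"\((\d+),'((?:[^'\\]|\\.)*)'")
match = FIRST_TWO.match(row)
cl_from = int(match.group(1))
cl_to = match.group(2).decode("utf-8")

print(cl_from, cl_to)
```

Option 2 has the small advantage that a genuinely broken title would still raise an error instead of being silently altered.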

I hope that at some point these truncated values will be fixed in the dump.
But meanwhile I passed "ignore" to .decode and it works!

Thank you so much, ArielGlenn!

Marc

They cannot be fixed in the dump; they are truncated on the wiki itself. That's what the sql query shows. Someone will have to go onto rowiki and find out what is going on with the entry of those sortkeys and why they are bad.

For now though I am resolving this task as "invalid" because it turned out not to be an error with the dumps or with the code anywhere. Good luck with your processing!

Ah, I see what you mean. It's good I have this workaround with the "ignore" parameter.
Thanks.

Marc

Message from ArielGlenn <no-reply@phabricator.wikimedia.org> on Mon, 12 Oct 2020 at 13:13:

ArielGlenn closed this task as "Invalid".
ArielGlenn added a comment. View Task
https://phabricator.wikimedia.org/T264850
