Incorrect encoding of id fields when running StructuredDiscussion on a Windows server
Open, Needs Triage · Public

Description

We recently ran a wiki on a Windows Server (never again) on which we had StructuredDiscussion/Flow 1.1 installed. When we dumped the SQL tables to migrate to a Linux server, we realised that all of the binary(11) fields (UUIDs?) related to the Flow extension were encoded weirdly and could not be imported in the Linux environment (duplicate key entries).

No other tables in the dump suffered from these issues.

Some example fields:

  • ref_id, ref_src_object_id, ref_src_workflow_id in flow_ext_ref
  • rev_id, rev_type_id in flow_revision
  • topic_list_id, topic_id in flow_topic_list

A trimmed version of the dump (Flow tables only)

Event Timeline

Restricted Application added a project: Collaboration-Team-Triage. · May 14 2018, 9:11 PM

The dump itself was made with the following commands in Windows PowerShell:

cd 'C:\ProgramData\MySQL\MySQL Server 5.7\Data\'
[System.Console]::OutputEncoding = [System.Text.Encoding]::UTF8
.'C:\Program Files\MySQL\MySQL Server 5.7\bin\mysqldump.exe' -e -u<wgDBuser> -h<wgDBserver> -p<wgDBpassword> <wgDBname> | Out-File '<path to outfile>' -Encoding UTF8

My stab-in-the-dark guess is that UUID::create misidentifies the string somehow.

I think this is a problem with your export method. The data in fields like ref_id is actually raw binary data, but it looks like your export tool tried to interpret it as UTF-8 (decode and re-encode it). It is not valid UTF-8 text, so some of the bytes were replaced with '�' (U+FFFD replacement character), which obviously corrupts the data. If you managed to import it, you'd probably find your Flow threads to be inaccessible.
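
To illustrate the failure mode described above, here is a minimal Python sketch of what a decode-and-re-encode pass does to raw binary ids. The two 11-byte values are made up for the example (they are not taken from the dump), and the helper name roundtrip_as_utf8 is hypothetical. It only models the general effect, not what PowerShell does internally.

```python
def roundtrip_as_utf8(raw: bytes) -> bytes:
    """Decode bytes as UTF-8, replacing invalid sequences with U+FFFD,
    then encode the resulting text back to UTF-8."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")


# Two distinct, made-up 11-byte ids. They differ only in the second byte,
# and that byte (0x9C vs 0x9D) is not valid UTF-8 on its own.
id_a = bytes([0x05, 0x9C]) + b"ABCDEFGHI"
id_b = bytes([0x05, 0x9D]) + b"ABCDEFGHI"

mangled_a = roundtrip_as_utf8(id_a)
mangled_b = roundtrip_as_utf8(id_b)

# The invalid byte becomes the three-byte UTF-8 encoding of U+FFFD,
# so both ids collapse to the same value and the column grows in length.
print(id_a != id_b)            # the originals are different
print(mangled_a == mangled_b)  # the mangled copies are identical
print(len(id_a), len(mangled_a))
```

Two different ids collapsing to the same bytes would explain the duplicate key errors on import, and plain UTF-8 text passes through such a round trip unchanged, which would explain why the other tables looked fine.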

Maybe try just removing -Encoding UTF8? That might make it pass through without messing with the data. I am not really sure how PowerShell handles encodings, though. Or maybe you can get mysqldump to escape it somehow.

The reason I believed it might be due to flow is that the dump contains other binary fields which do not get bytes replaced by � (but that might be due to differing contents).

Yes the import worked for everything apart from the flow threads which we had to nuke after doing a forced import.

Dropping -Encoding UTF8 actually still gives the same result (I guess that is now handled by [System.Console]::OutputEncoding = [System.Text.Encoding]::UTF8). Dropping that one as well gives me UTF-16 output (I don't know if it still messed up the binary fields). Outputting using the -r flag instead of piping also resulted in other encoding issues (again, I don't know if it still messed up the binary fields).
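
One rough way to check whether a given dump file was mangled is to scan its raw bytes for a UTF-16 byte order mark and for the UTF-8 encoding of U+FFFD. The sketch below assumes this heuristic is good enough for a first look. The function name inspect_dump and the returned keys are hypothetical, and a nonzero count is only a hint, since a wiki page could legitimately contain the '�' character.

```python
def inspect_dump(data: bytes) -> dict:
    """Report simple signs that a dump was re-encoded as text."""
    return {
        # A UTF-16 byte order mark at the start means the whole file was
        # written as UTF-16 text rather than as raw mysqldump output.
        "utf16_bom": data.startswith((b"\xff\xfe", b"\xfe\xff")),
        # EF BF BD is U+FFFD in UTF-8; many hits inside binary columns
        # suggest bytes were replaced during a decode step.
        "replacement_chars": data.count(b"\xef\xbf\xbd"),
    }


# Made-up sample standing in for a fragment of a dump file.
sample = b"INSERT INTO t VALUES ('\x05\xef\xbf\xbdAB','\xef\xbf\xbd\xef\xbf\xbdC');"
print(inspect_dump(sample))
```

For a real file, the bytes would be read with open(path, "rb").read() so that Python does not decode them first.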

If you are convinced that this is down to the dumping rather than the original data entry into those fields, then I would chalk this up to Windows Behaving Badly and maybe just add a warning to Manual:Backing up a wiki and maybe somewhere in the Flow extension documentation.

MediaWiki uses the binary type for many text fields, but most of them contain UTF-8 text, so treating their contents as UTF-8 instead of binary data doesn't corrupt them. (Off the top of my head, the only columns that actually may contain binary data, depending on your configuration, are categorylinks.cl_sortkey and page_props.pp_value.)

Try passing --result-file='<path to outfile>' to mysqldump instead of relying on PowerShell to write the file. I think that should just dump the data as-is and not try to treat it as text. The doc page I found even mentions doing this when using PowerShell: https://dev.mysql.com/doc/refman/8.0/en/mysqldump.html
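
As a sketch of that suggestion, the snippet below builds a mysqldump command line that writes the file itself via --result-file, so no shell pipe touches the bytes. It also adds --hex-blob, which makes mysqldump write binary columns in hexadecimal notation and may be the "escape it somehow" option wondered about earlier in the thread. The function names, the default executable name, and the example arguments are assumptions for illustration, and this has not been tested against the setup in this task.

```python
import subprocess
from typing import List


def build_dump_command(user: str, host: str, database: str, outfile: str,
                       mysqldump: str = "mysqldump") -> List[str]:
    """Return an argument list for mysqldump that avoids shell redirection."""
    return [
        mysqldump,
        "--extended-insert",          # long form of the -e flag used in the task
        "--hex-blob",                 # dump binary columns as hex literals
        f"--user={user}",
        f"--host={host}",
        "--password",                 # no value given: mysqldump prompts for it
        f"--result-file={outfile}",   # mysqldump writes the file directly
        database,
    ]


def run_dump(cmd: List[str]) -> None:
    """Run the command without a shell, so nothing re-encodes the output."""
    subprocess.run(cmd, check=True)


# Example values are placeholders, not the ones from the affected wiki.
cmd = build_dump_command("wikiuser", "localhost", "wikidb", "dump.sql")
print(" ".join(cmd))
```

Calling run_dump(cmd) would start the actual dump. With --hex-blob the ids are written as plain ASCII hex, so the file should survive later handling as text.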

@Catrope, is this something we should fix or does it belong to someone else?

@jmatazzoni I don't think there is a StructuredDiscussion bug here.

OK thanks. I'll mark it as "external" on our board.

Vvjjkkii renamed this task from Incorrect encoding of id fields when runnign StructuredDiscussion on a Windows server to 9ycaaaaaaa. · Jul 1 2018, 1:09 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
CommunityTechBot renamed this task from 9ycaaaaaaa to Incorrect encoding of id fields when runnign StructuredDiscussion on a Windows server. · Jul 2 2018, 4:12 PM
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description.
Restricted Application added a project: Growth-Team. · Oct 5 2018, 1:45 PM