Page MenuHomePhabricator

Apply patch to fix Bugzilla XML-RPC API issue before importing comments, remove it before importing attachments
Closed, ResolvedPublic

Description

@Aklapper made a patch to fix a Bugzilla XML-RPC API issue that works for comments but breaks attachments. We need to apply it, migrate comments, revert it, and migrate attachments.

Old description

When we tried to export some Bugzilla tickets via Bugzilla's XML RPC API we ran into an upstream Bugzilla bug. Certain characters in bug report text comments create invalid XML provided by the API. This breaks our custom script to export data from Bugzilla (to import it into Phabricator later).
That API issue was tracked in https://bugzilla.wikimedia.org/show_bug.cgi?id=69747 . The corresponding (unresolved) upstream ticket is https://bugzilla.mozilla.org/show_bug.cgi?id=839023.

We locally applied the hack described in https://bugzilla.mozilla.org/show_bug.cgi?id=839023#c10 in https://gerrit.wikimedia.org/r/#/c/156100/ . It just drops specific problematic characters/bytes from the output, as a quick'n'dirty workaround.

This worked around the problem for text in Bugzilla comments, but as that Perl hack is applied to any output, the hack also drops those certain bytes from binary attachment content, breaking certain attachments containing this bytes. Hence we reverted that hack in https://gerrit.wikimedia.org/r/#/c/168200/ again.

Somebody with Perl skills (who?) could improve the hack to only drop certain characters from Bugzilla text comment content provided via Bugzilla's XML RPC API, but not to apply that hack on any other content such as attachments.

Original summary of this task:

See: https://phabricator.wikimedia.org/P36

I swear this worked, maybe that fix put in to strip characters for funky comments caused weirdness here?

UPDATE:

the patch indeed caused it. reverted to fix attachments. We need a new patch.

Event Timeline

chasemp raised the priority of this task from to High.
chasemp updated the task description. (Show Details)
chasemp added a project: Bugzilla-Preview.
chasemp changed Security from none to None.
chasemp subscribed.
~/test/data
cpettet@cair>md5 ff_from_api.jpg
MD5 (ff_from_api.jpg) = 9bb0dbe9e207b82ff2629748d99a649a

~/test/data
cpettet@cair>md5 ff_fromui.jpg
MD5 (ff_fromui.jpg) = fbe699a57bffa278e8ff3600dffd3702

we could do one of these things:

  • try the script on bugzilla.mozilla.org as the reference upstream Bugzilla, to confirm it's our custom code or the BZ version or not
  • setup a Bugzilla in labs with current upstream code, unpatched attachment.cgi do confirm the same (i can help to get it done)
  • create a patch in bugzilla-modifications repo that diffs attachment.cgi in our repo against upstream and suggest to merge that , then test again
  • try to dump the files straight out of the mysql database, i glanced at the tables, seems doable too

I just tested the script from P36.

When downloading the attachment https://bug-attachment.wikimedia.org/attachment.cgi?id=16853 (from bug 72256) through the browser
I got a file with md5sum fbe699a57bffa278e8ff3600dffd3702.

Uploading this file to a local vanilla Bugzilla 4.4.4 and fetching it with the script from P36, I get the same file back (md5sum fbe699a57bffa278e8ff3600dffd3702)

But when running the same script against wmf's bugzilla I got a file with md5sum 9bb0dbe9e207b82ff2629748d99a649a.

confirmed. reverting that patch fixes attachments.

i think it's this part. "slightly
overkill (applied to the entire output, not just the values provided back when
querying the API)
" where this is also applied if it is an attachment.

i remember now checking all those chars that are removed here, see comments on PS1 of https://gerrit.wikimedia.org/r/#/c/156100/ but we can't apply this to attachment content or it breaks

let me paste my entire gerrit comment from back then here:

so, we are removing x01-x08 x0b,x0c and x0f-x1f, which means the entire control character range [1], besides keeping x00, x09, x0a, x0d and x0e. These are C0 and C1 control characters.

Let's look at the ones we keep:

x00 = NUL [3] x09 = (horizontal) tab (\t) x0a = line feed (\n) x0d = carriage return (\r) x0e = shift out (to a differente char set) [4]

I'm not sure about NUL and 'shift out' but \t, \n and \r certainly make sense.

Wikipedia says on [5] that "Most of these characters play no explicit role in Unicode text handling. The characters U+0000 <control-0000>, U+0009 <control-0009> (HT), U+000A <control-000A> (LF), U+000D <control-000D> (CR), and U+0085 <control-0085> (CR+LF) are commonly used in text processing as formatting characters."

[1] http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=string-literal [2] https://en.wikipedia.org/wiki/C0_and_C1_control_codes [3] https://en.wikipedia.org/wiki/Null_character [4] https://en.wikipedia.org/wiki/Shift_Out_and_Shift_In_characters [5] https://en.wikipedia.org/wiki/Unicode_control_characters#ISO_6429_control_characters_.28C0_and_C1.29

Issues affected by original bug in bugzilla preview dataset:

id: 63579
*************************** 2. row ***************************
id: 62532
*************************** 3. row ***************************
id: 68404
luser renamed this task from bugzilla attachments more complex than text are junk on retrieval? to bugzilla attachments more complex than text are junk on retrieval.Oct 24 2014, 1:19 AM
luser updated the task description. (Show Details)
In T815#14222, @chasemp wrote:

Issues affected by original bug in bugzilla preview dataset

Wondering if T854 is another example or a separate problem.

Aklapper renamed this task from bugzilla attachments more complex than text are junk on retrieval to Upstream Bugzilla XML-RPC API issue creates invalid XML & our workaround hack damages attachments - hence apply our hack only on text comment output.Oct 24 2014, 7:48 PM

Petr Bena via lists.wikimedia.org @Petrb ?
2:20 PM (2 hours ago)

to Wikimedia
After short investigation the answer is pretty straight forward and
explained in https://bugzilla.mozilla.org/show_bug.cgi?id=839023

quoting:

U+0000-U+001F are illegal in HTML 4.0 and XML 1.0 (except the
characters HR, LF and CR). And it's not permitted to use named
character references such as &#x03; either (although it is permitted
in XML 1.1, except for NUL):
http://www.w3.org/International/questions/qa-controls

possible fixes:

  • Run SQL query that find and replace these characters
  • Patch bugzilla so that it replace them during xml conversion

Inside Bugzilla/WebService/Server/XMLRPC.pm, in _strip_undefs, at the
end of the function (around line 250):

if (ref $initial eq '')
{
  $initial =~ s/([\x01-\x08\x0b\x0c\x0f-\x1f])/sprintf "\\x%02x",ord($1)/ge;
}

should do the trick but that, indeed, damages some binaries. Do we
actually want to export them? Because XML is not a good format for
exports of binary files as it doesn't allow some characters. What
about getting the out using some SQL query? Why do we even need to use
XML? Is it only way to import to phab?

Jeff Gage asked me to take a look on this problem. Here's a workaround, but it's a bit icky:

*** xmlrpc.cgi  2014-10-24 17:16:49.137457470 -0700
--- xmlrpc.cgi  2014-10-24 20:54:20.652451329 -0700
*************** use Bugzilla::WebService::Server::XMLRPC
*** 22,27 ****
--- 22,29 ----

  Bugzilla->usage_mode(USAGE_MODE_XMLRPC);

+ $ENV{'BUGZILLA_WORKAROUND_XMLRPC_ESC'} = 1;  # TTK qv https://phabricator.wikimedia.org/T815
+
  # Fix the error code that SOAP::Lite uses for Perl errors.
  local $SOAP::Constants::FAULT_SERVER;
  $SOAP::Constants::FAULT_SERVER = ERROR_UNKNOWN_FATAL;
*** Bugzilla/Comment.pm 2014-10-24 20:17:09.771281474 -0700
--- Bugzilla/Comment.pm 2014-10-24 20:55:40.316457391 -0700
*************** sub body_full {
*** 156,161 ****
--- 156,165 ----
      if ($params->{wrap} and !$self->already_wrapped) {
          $body = wrap_comment($body);
      }
+     if (defined($ENV{'BUGZILLA_WORKAROUND_XMLRPC_ESC'})) { # TTK qv https://phabricator.wikimedia.org/T815
+         use bytes;
+         $body =~ s/([\x01-\x08\x0b\x0c\x0f-\x1f\x7f-\xff])/sprintf "\\x%02x", ord($1)/ge;
+     }
      return $body;
  }

I ran it against a local instance of Bugzilla 4.4.4 and it appears to work, though only on the comment text. Usernames and summaries with funky characters will require further escaping:

http://ciar.org/ttk/public/patch.bugzilla.2014-10-24.txt

Why not running the SQL query against all tables that contain these affected data which would just replace these characters in DB?

I agree with petrb, although I think we should write a script that fetches comments text from the db, sanitize it in a sane language (not SQL), and update the comment.

Should be easy doable and it's probably the right thing to do - but!

Did you check if those chars are in fact spurious and the result of some funky encoding on the client? Or some strange encoding/collation combination on the db?

I'd rather fix this at the data source level than doing patches to the output program.

Would it be an option to change

if (ref $initial eq '')
{
  $initial =~ s/([\x01-\x08\x0b\x0c\x0f-\x1f])/sprintf "\\x%02x",ord($1)/ge;
}

to

if (ref $initial eq '')
{
  $initial =~ s/([\x01-\x08\x0b\x0c\x0f-\x1f\x5c])/sprintf "\\x%02x",ord($1)/ge;
}

(i.e. also escape backslashes), and then do the reverse transformation on the python side?

We have decided that it is better to solve this problem in Bugzilla land, before the migration. Details to be defined. Anybody wants to step in with a solution that can be tested in Bugzilla sooner than later?

Thanks everybody for the comments here, really appreciated. I owe you drinks. :)
65, 66, 67, 412, 896, 3079, 9444 are example tickets in Wikimedia Bugzilla which currently break our export script.

I'm attaching a stripped down minimal version of that export script here that exposes the problem and could be used locally for further testing, called "minimal.py". It reads the configuration values from a configuration file defined in parser.read(); that configuration file should include four lines:

[bz]
url = https://bugzilla.wikimedia.org/xmlrpc.cgi
Bugzilla_login = <your email>
Bugzilla_password = <your password>

The script calls a range of Bugzilla IDs, called like this:

time seq 1 100 | python minimal.py

Also note that passing "verbose=True" in the "server" value provides more debug output (that line is commented out in the file).

We really need help here. If someone wants to take this task as a weekend project or equivalent and solve this task, it will be very difficult to forget their name when they request travel sponsorship for the Wikimedia EU Hackathon. :roll:

For the records, if all goes wrong we will put the ugly hack back in place when pulling the comments from Bugzilla, remove the hack after having finished, and then pulling the attachments from Bugzilla.

Okay, with this hack I have 65, 66, 67, 412, 3079 and 9444 working. 896 is still elusive due to the binary blob in the comment -- did that work with the old export script?!

I copied the system xmlrpclib.py to the same directory as minimal.py, and applied this patch:

valhallasw@maeglin:tmp$ diff -u /usr/lib/python2.7/xmlrpclib.py "xmlrpclib.py"                                             0
--- /usr/lib/python2.7/xmlrpclib.py     2014-03-23 00:55:10.000000000 +0100
+++ xmlrpclib.py        2014-11-05 23:46:20.753698853 +0100
@@ -138,6 +138,12 @@

 import re, string, time, operator

+illegalxmlchars = re.compile(r"[\x01-\x08\x0b\x0c\x0d\x0f-\x1f\x9c]")
+illegalformatter = lambda x: "\\x%02x" % ord(x.group(0))
+
+filterillegalxmlchars = lambda x: illegalxmlchars.sub(illegalformatter, x)
+
+
 from types import *
 import socket
 import errno
@@ -1460,6 +1466,7 @@

         while 1:
             data = stream.read(1024)
+            data = filterillegalxmlchars(data)
             if not data:
                 break
             if self.verbose:

(final xmlrpclib.py =

)

This basically does the same thing as the bugzilla patch, but client-side instead.

I still think the approach I suggested in T815#15155 should be considered; it's *much* easier to undo a transform client-side (after all the XML parsing) than hacking into the XML machinery.

Would it be possible to have a test bugzilla instance with that suggested patch applied?

After digging a bit further into the code to see what character caused this: this was \x0e, which although according to https://stackoverflow.com/a/5110103 is legal, is not a useful character. Filtering that out makes 896 work, too.

The patch then becomes:

valhallasw@maeglin:tmp$ diff -u /usr/lib/python2.7/xmlrpclib.py "xmlrpclib.py"                                             1
--- /usr/lib/python2.7/xmlrpclib.py     2014-03-23 00:55:10.000000000 +0100
+++ xmlrpclib.py        2014-11-06 00:00:54.417669816 +0100
@@ -138,6 +138,12 @@

 import re, string, time, operator

+illegalxmlchars = re.compile(r"[\x01-\x08\x0b-\x1f]")
+illegalformatter = lambda x: "\\x%02x" % ord(x.group(0))
+
+filterillegalxmlchars = lambda x: illegalxmlchars.sub(illegalformatter, x)
+
+
 from types import *
 import socket
 import errno
@@ -1459,12 +1465,17 @@
         p, u = self.getparser()

         while 1:
-            data = stream.read(1024)
+            data = stream.read() #1024)
+            data = filterillegalxmlchars(data)
             if not data:
                 break
             if self.verbose:
                 print "body:", repr(data)
-            p.feed(data)
+            try:
+                p.feed(data)
+            except Exception:
+                print p._parser.CurrentByteIndex, repr(data[p._parser.CurrentByteIndex])
+                raise

         if stream is not response:
             stream.close()

download:

Which now gives me:

valhallasw@maeglin:tmp$ python minimal.py 65 66 67 412 896 3079 9444                                                      
got 65
got 66
got 67
got 412
got 896
got 3079
got 9444

Two more notes I thought of:

  1. it might be better to replace with &#x12; instead of \x12
  2. in minimal.py, one would probably want to do
import xmlrpclib
dofilter = xmlrpclib.filterillegalxmlchars
nofilter = lambda x: x

xmlrpclib.filterillegalxmlchars = nofilter

(...)
xmlrpclib.filterillegalxmlchars = dofilter
comment = GET_COMMENT_CODE()
xmlrpclib.filterillegalxmlchars = nofilter

to prevent every part of the code being affected.

hi all, honestly it's hard to understand what the current status is here from just reading the ticket.

it seems like several people all shared ideas what _could_/_should_/_probably_ works and can be done but which route are we going to take?

afaict the current option would be to re-apply andre_'s patch again, export the tickets with comments, then revert the patch, and export the attachments. is it all ok then and we don't need to worry about other ways to do it?

There are three options:

  1. My hack, https://phabricator.wikimedia.org/T815#19294 . Seems to work for the test cases listed here.
  2. Re-apply a reversible version of Andre_'s patch (see https://phabricator.wikimedia.org/T815#15155 ) , then apply the reverse transformation in the attachment-getting code. Probably more stable (errors show up quicker), but not currently available.
  3. Apply Andre_'s existing patch, export comments, remove patch, export attachments.

if there is no drawback to 3. why don't we just do that

Yesterday in our team meeting we also were leaning toward 3.

If this is our decision, then it would be useful to demonstrate it in bugzillapreview (somehow), by simply migrating a few bugs with the patch applied/reverted, showing that attachments are now correct and no collateral damage was created.

We are leaning towards option 3.

The "xml.parsers.expat.ExpatError: not well-formed (invalid token)" problem seems to happen on 57 tickets currently.
If somebody fancies more testing, the bug IDs are:

65,66,67,412,896,3079,6455,7111,8716,9444,9445,9901,9909,10321,11745,13180,13204,15428,16262,21964,25563,33152,34471,35948,37002,43340,44259,44846,46377,47719,47855,47992,48741,49342,49944,50782,51010,51832,53400,54195,55844,57789,58659,58909,59079,59611,59669,60172,60363,61463,61702,62532,63579,68404,71479,72505,73274

Bugzilla's advanced search is funny (attachment mime type does not match regular expression ^$ does not work). But both attachment mime type matches regular expression .+ and Attachment is obsolete: 0 OR Attachment is obsolete: 1) lists these nine affected reports contain attachments out of the list of 57 tickets above (pasted into the "Bugs numbered" field of Bugzilla's advanced search). The bug IDs of affected tickets which also have attachments are:

6455,9444,13180,25563,33152,49944,50782,51010,57789

This task needs an owner. Who will apply and revert the patch during the Bugzilla migration?

For option (3), please make sure to test 896, as it might not be covered fully by the existing regexp (specifically, \x0e might need to be added).

As for

If somebody fancies more testing, the bug IDs are:

65,66,67,412,896,3079,6455,7111,8716,9444,9445,9901,9909,10321,11745,13180,13204,15428,16262,21964,25563,33152,34471,35948,37002,43340,44259,44846,46377,47719,47855,47992,48741,49342,49944,50782,51010,51832,53400,54195,55844,57789,58659,58909,59079,59611,59669,60172,60363,61463,61702,62532,63579,68404,71479,72505

These are all retrieved without issues with option (1), so (3) with \x0e added should work, too.

In T815#20734, @Qgil wrote:

This task needs an owner. Who will apply and revert the patch during the Bugzilla migration?

i can do it, i'm the one who applied and reverted it in the past

Great, thank you! All yours.

In T815#21264, @Dzahn wrote:
In T815#20734, @Qgil wrote:

This task needs an owner. Who will apply and revert the patch during the Bugzilla migration?

i can do it, i'm the one who applied and reverted it in the past

Thanks man, just fyi, this will be weird times. Evening and then really early morning probably. You are cool w/ being around?

Qgil renamed this task from Upstream Bugzilla XML-RPC API issue creates invalid XML & our workaround hack damages attachments - hence apply our hack only on text comment output to Apply patch to fix Bugzilla XML-RPC API issue before importing comments, remove it before importing attachments.Nov 13 2014, 2:28 PM
Qgil updated the task description. (Show Details)
Qgil lowered the priority of this task from High to Medium.Nov 14 2014, 4:03 PM

i'v been asked to add docs. happy to do so:

basics:

the gerrit project for our custom changes to Bugzilla upstream is 'wikimedia/bugzilla/modifications'

https://gerrit.wikimedia.org/r/#/q/project:wikimedia/bugzilla/modifications,n,z

the change that we want to apply / revert is:

https://gerrit.wikimedia.org/r/#/c/156100/
Change-Id: I960b2c80308213f32

how to apply and revert patch on bugzilla:

  • ssh to host zirconium (via bastion)
  • either git clone your own copy of the bugzilla-modifications repo or just become root and reuse mine (/home/dzahn/modifications/ or git clone https://gerrit.wikimedia.org/r/p/wikimedia/bugzilla/modifications and end up with a 'modifications' dir in your home)
  • on gerrit, merge and submit (you need to +2 and Verify, no Jenkins doing +2 here)
  • on zirconium in the local repo, git pull origin master, see the change you just merged
  • optional: use "diff" one more time between the local repo and the actual Bugzilla directory to double-check diff Bugzilla/WebService/Server/XMLRPC.pm /srv/org/wikimedia/bugzilla/Bugzilla/WebService/Server/XMLRPC.pm
  • copy the file XMLRPC.pm into the actual Bugzilla directory to deploy: cp Bugzilla/WebService/Server/XMLRPC.pm /srv/org/wikimedia/bugzilla/Bugzilla/WebService/Server/XMLRPC.pm

using cp because it's just one file, if multiple files i'd use rsync to deploy

revert:

Bugzilla installation directory is /srv/org/wikimedia/bugzilla/ (and in there you have ./Bugzilla/, capitalized)