
Page titles are being double-encoded
Closed, Resolved · Public

Description

It looks like there are some issues with double-encoding in some of the codes. For example, the centre plaque in
https://commons.wikimedia.org/wiki/File:Toodyaypedia_plates_stg_1_gnangarra_fs-1.jpg
resolves to
http://en.qrwp.org/St_John_the_Baptist_Church,_Toodyay_%25281963-_%2529
instead of
http://en.qrwp.org/St_John_the_Baptist_Church,_Toodyay_%281963-_%29
(i.e. the % was URL-encoded a second time).
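
For context, the failure can be reproduced in a few lines. This is an illustrative sketch in Python (the QRpedia server itself is PHP), using the title from the example above:

```python
from urllib.parse import quote, unquote

title = "St_John_the_Baptist_Church,_Toodyay_(1963-_)"

# Correct: percent-encode the title once.
once = quote(title)

# The bug: encoding the already-encoded string a second time
# turns every '%' into '%25'.
twice = quote(once)

# A single decode of the double-encoded string only gets back to
# the once-encoded form, so the wiki sees a bad title.
assert once == "St_John_the_Baptist_Church%2C_Toodyay_%281963-_%29"
assert twice == "St_John_the_Baptist_Church%252C_Toodyay_%25281963-_%2529"
assert unquote(twice) == once
assert unquote(once) == title
```
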

Did a prior version of the QRpedia software handle this double-encoding, or was this code always incorrect?

(Initially reported at T209019#4761418)

Event Timeline

@ErrantX
As this hasn't been resolved, would it be possible to incorporate the code from https://meta.wikimedia.org/wiki/Free_Knowledge_Portal, which points to the Wikidata identifier, as a solution?

@ErrantX: Where is the QRpedia source code? The rGQRP repository that's linked from the project's page on Meta has not been updated since 2 September 2015. Is that the most recent? (Also pinging @Kelson and @Addshore, as it looks like you were involved back then.)

@Samwilson: yes, the qrpedia repo on Wikimedia's source forge is the right one.


Do you have a link to this?

I'm also working on this issue, looking to resolve the immediate problems on the site and to create a canonical codebase.

@Jkbr the repository is: rGQRP (it's hosted here in Phabricator).


Thanks. I'm trying to find the most recent version and this just looked like it hadn't been changed for a while.


I think that's just because it's been a while. :)

I'm happy to help review any patches.

Since I'm told I posted this in the wrong place, I'll try once more and paste it here:

The issues still persist:
a) Double slashes in de.Wikipedia: https://de.qrwp.org/Berlin
The solution should be known at WMUK, because the bug has been fixed for other language versions.

b) Incorrect URL resolution for articles with non-letter characters:
The issue already existed in 2014, but was quickly fixed by a simple method at qrwp.org: on the incoming URL, do a regex_replace or string_replace, finding '%25' and replacing it with '%'.
https://en.wikipedia.org/w/index.php?title=User_talk:Victuallers&oldid=592634542#QRpedia_bug_-_Apostrophes._HELP!
I would like to know why the fix was undone.


Thanks. WMUK is now actively reviewing the QRPedia project including these issues and others reported here.

As of 5 July, codes like

https://commons.wikimedia.org/wiki/File:Connor%27s_Cottage_QRpedia.png are looking for the en article Connor%2527s Cottage instead of Connor%27s_Mill.

To me, it's reading/adding the hex code of % (which is 25) in the URL, and that then makes it unable to resolve %2527 rather than just reading the hex code as %27.

Actually, since the URL is https://en.wikipedia.org/wiki/Connor%27s_Cottage, it shouldn't be trying to do anything with the hex code at all.
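
The chain described here can be checked directly. A quick Python illustration (the article names are just those from the comment above):

```python
from urllib.parse import unquote

# '%2527' is the percent-escape '%27' (apostrophe) encoded a second
# time: the leading '%' itself became '%25'.
assert unquote("%2527") == "%27"   # first decode: back to the single escape
assert unquote("%27") == "'"       # second decode: the apostrophe itself

# So the URL baked into the QR code needs two decodes, but the
# server (and Wikipedia) only apply one.
encoded_in_code = "Connor%2527s_Cottage"
assert unquote(encoded_in_code) == "Connor%27s_Cottage"
assert unquote(unquote(encoded_in_code)) == "Connor's_Cottage"
```
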

Jo Brook, working for WMUK, who I don't think is active on Phabricator, sent the following email today:

Hi all,

Firstly, I've made changes on our server so that URLs with wrongly encoded apostrophes and parentheses as in the 37 Toodyay URLs are now handled and redirected.
This change was at the webserver config rather than in the PHP code, which is only involved in generating new codes.

There is one exception to this:
https://en.wikipedia.org/wiki/St_John_the_Baptist_Church,_Toodyay_(1963-_) mentioned in https://phabricator.wikimedia.org/T210050

which fails because the English Wikipedia page itself no longer exists (I think it's perhaps now https://en.wikipedia.org/wiki/St_John_the_Baptist_Church_(Toodyay) ?)
Fixing this would need someone to create a redirect page on wikipedia.org.

Thanks for your patience on this -- during the call I was speaking on the basis mainly of newly created codes (which was fixed earlier this year) rather than redirecting ones which are already in use, but this email clarified that.

I've also done some further work towards a more general solution that handles other double-encoded codes involving characters not in the Toodyay set, but there are some remaining special cases to deal with, like https://en.wikipedia.org/wiki/We_are_the_99%25, which break assumptions about applying decoding that works in other cases.

I will test any changes against your 37 codes to ensure they are fixed up correctly.

thanks and all the best,

Jo

Firstly, I've made changes on our server so that URLs with wrongly encoded apostrophes and parentheses as in the 37 Toodyay URLs are now handled and redirected.

Thanks a lot! Wouldn't it be better not to handle every single special character but to generally run an urldecode? In German, for example, we also need äöüÄÖÜß. Accents are also needed (French, Danish...).


I've forwarded your question to Jo via email; I don't know if she's monitoring this ticket.

The problem with the double slashes is fixed - very good! Many QRpedia codes now work reliably again. But unfortunately the issue with the double-urlencoded addresses persists, although the solution is not difficult at all.
Above it was pointed out that the code is on rGQRP. Apparently index.php is the important file here.
In line 47 the page title is loaded:

$request = stripslashes( $_GET['title'] );

Because the input is urldecoded by $_GET, everything is fine with normal (simple) urlencoded strings.
The issue is the processing of double-urlencoded strings, which qrpedia.org built into the QRpedia codes and which now exist physically in the world, on information boards for example.
The solution: Give $request an additional urldecode.
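
The effect of that extra urldecode can be modelled as follows. This is a sketch in Python for illustration only (the live code is PHP's index.php); the function name `resolve_title` and the conditional are assumptions, not the actual implementation:

```python
from urllib.parse import unquote

def resolve_title(raw_title: str) -> str:
    """Sketch of the proposed handling: $_GET already applies one
    URL-decode; legacy QR codes carry double-encoded titles, so
    those need one more decode."""
    # raw_title is what PHP's $_GET['title'] would hold: decoded once.
    if "%" in raw_title:
        # Assumption: a remaining percent-escape means the original QR
        # URL was double-encoded, so decode once more.
        return unquote(raw_title)
    return raw_title

# A legacy double-encoded code: the query string held '...%2528...%2529',
# $_GET decoded it once to '%28'/'%29', and the extra decode finishes it.
assert resolve_title("Toodyay_%281963-_%29") == "Toodyay_(1963-_)"
# A normal title passes through untouched.
assert resolve_title("Berlin") == "Berlin"
```

As later comments in this thread note, a title that legitimately contains a % character can collide with this unconditional second decode, so a production fix would need to handle those special cases.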

Change 530236 had a related patch set uploaded (by Putnik; owner: Putnik):
[qrpedia@master] Fix double-encoded page titles

https://gerrit.wikimedia.org/r/530236

Change 530331 had a related patch set uploaded (by Putnik; owner: Putnik):
[qrpedia@master] Fix double-encoded page titles

https://gerrit.wikimedia.org/r/530331

Change 530331 abandoned by Putnik:
Fix double-encoded page titles

Reason:
Duplicate of https://gerrit.wikimedia.org/r/#/c/qrpedia/+/530236/

https://gerrit.wikimedia.org/r/530331

I think progress here is waiting on feedback from WMUK.

The above patch may fix the issue, but I'm not sure where we're at with ensuring that this repo is actually the one in production.

I think progress here is waiting on feedback from WMUK.

Who, at WMUK?

MichaelMaggs added a subscriber: ErrantX.

Jkbr I believe.

A patch was uploaded three months ago. I'm curious how this will proceed. Is it accepted that many QRpedia codes worldwide no longer function - that they are garbage? Or does WMUK still want to repair the software, and if so, when?
Here in Bremen we have the situation that 70 codes on public information boards are affected by the bug. Because of a faulty QRpedia generator, do we now have to look for a sponsor to renew the codes?

The patch failed some of our manual tests on the punctuation-encoding edge cases. To go forward we are developing a test suite and adding some basic READMEs etc. so functionality can be agreed and fixes tested.

The patch failed some of our manual tests on the punctuation-encoding edge cases.

The patch was provided over three months ago. Has the failure been fed back to the author? Are they available to fix it? If not, what is "plan B"? Who is responsible for ensuring this gets resolved? Is there a target date for resolution?

The patch failed some of our manual tests on the punctuation-encoding edge cases. To go forward we are developing a test suite and adding some basic READMEs etc. so functionality can be agreed and fixes tested.

I would strongly prefer that most faulty codes be cured immediately if possible!
That is better than waiting even longer in the hope of completely bug-free behaviour. Repairs can also be made in steps.

The patch fails for codes the live system accepts. The live deployment is based on an earlier codebase which we have patched to fix this issue. This code will be reconciled with the current master and deployed after testing.
The test suite is close to being ready. Task: https://phabricator.wikimedia.org/T62525
There really needs to be an accepted set of use cases for the valid/invalid decodes (as well as trying to handle 'legacy' codes which were created by the previous system).

What would be helpful is to report to me specific examples which are failing for you on qrpedia.org. I have the set of codes relating to Toodyay, but others would be useful.

It's always the same issue with every character, so you should not look for a separate solution for each individual one.
The string was urlencoded:
Bremer_Baumwollbörse -> Bremer_Baumwollb%C3%B6rse
OK, but it was then urlencoded again:
-> Bremer_Baumwollb%25C3%25B6rse
The % became %25. The address stored in the QR code is
http://de.qrwp.org/Bremer_Baumwollb%25C3%25B6rse
and results in an error message: "Der angeforderte Seitentitel enthält ungültige Zeichen: „%C3“." (In English: "The requested page title contains invalid characters: '%C3'.")
More broken URLs generated by qrpedia.org:
http://de.qrwp.org/Sch%25C3%25BCtting_%2528Bremen%2529
http://de.qrwp.org/Ratscaf%25C3%A9e/Deutsches_Haus
http://fr.qrwp.org/H%25C3%25B4tel_de_ville_de_Br%25C3%25AAme
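
Each of these recovers correctly with exactly two decodes. For instance, in Python (illustration only; the live code is PHP):

```python
from urllib.parse import unquote

broken = "Bremer_Baumwollb%25C3%25B6rse"   # path as stored in the QR code
step1 = unquote(broken)                    # what one decode yields
assert step1 == "Bremer_Baumwollb%C3%B6rse"
# The second decode resolves the UTF-8 escape %C3%B6 into 'ö'.
assert unquote(step1) == "Bremer_Baumwollbörse"
```
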

This %25 identifies all faulty QR codes - or are there valid article names in Wikipedia that contain a %?
Affected are characters such as äöüÄÖÜ()áà and many more that are not members of a-z, A-Z and 0-9.

The solution is: replace %25 with % before the urldecode. This is only possible if the change is made before the input is read via $_GET (which already decodes once).

Better controlled way:
Step 1: Change htaccess RewriteRule to
RewriteRule ^(.*)$ index.php?$1 [L,QSA]
(instead of RewriteRule ^(.*)$ index.php?title=$1 [L,QSA])

Step 2: In index.php, read the input from $_SERVER['QUERY_STRING'] instead of $_GET:
$request = stripslashes($_SERVER['QUERY_STRING']);
$request = str_replace('%25','%',$request);
$request = urldecode($request);
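
The behaviour of that three-line sequence can be modelled like this (Python for illustration; the proposal itself is PHP, and the function name `decode_request` is just for this sketch). Note the caveat with titles that genuinely contain %:

```python
from urllib.parse import unquote

def decode_request(query_string: str) -> str:
    """Model of the proposed index.php steps:
    str_replace('%25', '%', ...) followed by one urldecode()."""
    collapsed = query_string.replace("%25", "%")
    return unquote(collapsed)

# A legacy double-encoded code: works.
assert decode_request("Sch%25C3%25BCtting_%2528Bremen%2529") == "Schütting_(Bremen)"

# A correctly single-encoded literal '%' happens to survive, because a
# lone '%' is not a valid escape and is left alone by the decode.
assert decode_request("We_are_the_99%25") == "We_are_the_99%"

# But a hypothetical title containing a literal escape-like sequence,
# say "50%2Boff", would be wrongly collapsed and decoded to "50+off".
assert decode_request("50%252Boff") == "50+off"
```

This is why, as noted above, an agreed set of valid/invalid decode cases is needed before a general fix can be signed off.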

Again, more than a month has passed without any noticeable progress on the matter.
What are WMUK's intentions?
What is WMUK able to do?
Is QRpedia going to be fixed in the near future?

Someone needs to review (and merge) https://gerrit.wikimedia.org/r/#/c/530236/ I guess.
https://gerrit.wikimedia.org/r/#/admin/groups/534,members implies that currently only @Kelson could +2 that patch, if I understand correctly.

Edit: However, T210050#5699611 implies that this is not the actual codebase? Then where is the codebase?

This %25 identifies all faulty QR codes - or are there valid article names in Wikipedia that contain a %?

Yes; en.Wikipedia has:

and others. The first of those is also in hu. and ja. Wikipedias.

Yes; en.Wikipedia has:

and others. The first of those is also in hu. and ja. Wikipedias.

it's not very likely that QRpedia codes are needed for these articles, right?
Don't forget: There are a lot of physical QRpedia codes in the world that cost money to make, and are now dead. People want to use them and find out: Wikipedia is broken.

it's not very likely that QRpedia codes are needed for these articles, right?

I couldn't say, as I haven't reviewed all those on the English Wikipedia, let alone all those in other languages.

I am certainly not going to make guesses.

Don't forget: There are a lot of physical QRpedia codes in the world that cost money to make, and are now dead. People want to use them and find out: Wikipedia is broken.

Thank you. I haven't forgotten that; indeed, I've been lobbying to get this resolved. Have you perhaps mistaken me for someone else?

Thank you. I haven't forgotten that; indeed, I've been lobbying to get this resolved. Have you perhaps mistaken me for someone else?

I'm sorry my words were misleading. I didn't mean you. I remember Pigsonthewing as very constructive! It was intended as a general appeal.

Apparently, the doubly urlencoded addresses are now processed correctly again.
Many thanks to all who have worked on this!

My pleasure is mixed with some concern: will qrwp.org process the double-urlencoded addresses correctly permanently?
I was now able to test whether the article names mentioned in T210050#5792831 by @Pigsonthewing that contain % characters are processed correctly. They are. Test it yourself:

It looks like this task has been resolved (since last year). The codes I've checked are all working now. I think we can close it; does everyone agree?

Yes, this task can be closed.
But it would be better if QRpedia were resistant to such problems with non-ASCII URL characters in the future. It would also be better if QRpedia codes permanently pointed to their articles. I made a suggestion for this in February and offered the appropriate code adaptations, but they are based on the outdated rGQRP. I would also adapt the current code if I were given read access.
Please have a look at T245907.

I totally agree!

(For others' reference, the code-access issue is T244135.)