
Page titles are being double-encoded
Closed, Resolved · Public

Description

It looks like there are some issues with double-encoding in some of the codes. For example, the centre plaque in
https://commons.wikimedia.org/wiki/File:Toodyaypedia_plates_stg_1_gnangarra_fs-1.jpg
resolves to
http://en.qrwp.org/St_John_the_Baptist_Church,_Toodyay_%25281963-_%2529
instead of
http://en.qrwp.org/St_John_the_Baptist_Church,_Toodyay_%281963-_%29
(i.e. the % was URL-encoded a second time).
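
For context, the failure can be reproduced in a few lines. This is an illustrative sketch in Python (the QRpedia server itself is PHP), using the title from the example above:

```python
from urllib.parse import quote, unquote

title = "St_John_the_Baptist_Church,_Toodyay_(1963-_)"

# Correct: percent-encode the title once.
once = quote(title)

# The bug: encoding the already-encoded string a second time
# turns every '%' into '%25'.
twice = quote(once)

# A single decode of the double-encoded string only gets back to
# the once-encoded form, so the wiki sees a bad title.
assert once == "St_John_the_Baptist_Church%2C_Toodyay_%281963-_%29"
assert twice == "St_John_the_Baptist_Church%252C_Toodyay_%25281963-_%2529"
assert unquote(twice) == once
assert unquote(once) == title
```
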

Did a prior version of the QRpedia software handle this double-encoding, or was this code always incorrect?

(Initially reported at T209019#4761418)

Event Timeline

@ErrantX
As this hasn't been resolved, would it be possible to incorporate the code from https://meta.wikimedia.org/wiki/Free_Knowledge_Portal, which points to the Wikidata identifier, as a solution?

@ErrantX: Where is the QRpedia source code? The rGQRP repository that's linked from the project's page on Meta has not been updated since 2 September 2015. Is that the most recent? (Also pinging @Kelson and @Addshore, as it looks like you were involved back then.)

@Samwilson: yes, the qrpedia repo on Wikimedia's source forge is the right one.


Do you have a link to this?

I'm also working on this issue, looking to resolve the immediate problems on the site and to create a canonical codebase.

@Jkbr the repository is: rGQRP (it's hosted here in Phabricator).


Thanks. I'm trying to find the most recent version and this just looked like it hadn't been changed for a while.


I think that's just because it's been a while. :)

I'm happy to help review any patches.

Since I'm told I posted this in the wrong place, I'll try once more and paste it here:

The issues still persist:
a) Double slashes in de.Wikipedia: https://de.qrwp.org/Berlin
The solution should be known at WMUK, because the bug has been fixed for other language versions.

b) Incorrect URL resolution for articles with non-letter characters:
The issue already existed in 2014, but was quickly fixed by a simple method at qrwp.org: on the incoming URL, do a regex_replace or string_replace, finding '%25' and replacing it with '%'.
https://en.wikipedia.org/w/index.php?title=User_talk:Victuallers&oldid=592634542#QRpedia_bug_-_Apostrophes._HELP!
I would like to know why the fix was undone.


Thanks. WMUK is now actively reviewing the QRPedia project including these issues and others reported here.

As of 5 July, codes like

https://commons.wikimedia.org/wiki/File:Connor%27s_Cottage_QRpedia.png are looking for the en article Connor%2527s Cottage instead of Connor%27s_Mill.

To me, it's reading/adding the hex code of % (which is 25) in the URL, and that then makes it unable to resolve %2527 rather than just reading the hex code as %27.

Actually, since the URL is https://en.wikipedia.org/wiki/Connor%27s_Cottage, it shouldn't be trying to do anything with the hex code at all.
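
The chain described here can be checked directly. A quick Python illustration (the article names are just those from the comment above):

```python
from urllib.parse import unquote

# '%2527' is the percent-escape '%27' (apostrophe) encoded a second
# time: the leading '%' itself became '%25'.
assert unquote("%2527") == "%27"   # first decode: back to the single escape
assert unquote("%27") == "'"       # second decode: the apostrophe itself

# So the URL baked into the QR code needs two decodes, but the
# server (and Wikipedia) only apply one.
encoded_in_code = "Connor%2527s_Cottage"
assert unquote(encoded_in_code) == "Connor%27s_Cottage"
assert unquote(unquote(encoded_in_code)) == "Connor's_Cottage"
```
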

Jo Brook, working for WMUK, who I don't think is active on Phabricator, sent the following email today:

Hi all,

Firstly, I've made changes on our server so that URLs with wrongly encoded apostrophes and parentheses as in the 37 Toodyay URLs are now handled and redirected.
This change was at the webserver config rather than in the PHP code, which is only involved in generating new codes.

There is one exception to this:
https://en.wikipedia.org/wiki/St_John_the_Baptist_Church,_Toodyay_(1963-_) mentioned in https://phabricator.wikimedia.org/T210050

which fails because the English Wikipedia page itself no longer exists (I think it's perhaps now https://en.wikipedia.org/wiki/St_John_the_Baptist_Church_(Toodyay) ?)
Fixing this would need someone to create a redirect page on wikipedia.org.

Thanks for your patience on this -- during the call I was speaking on the basis mainly of newly created codes (which was fixed earlier this year) rather than redirecting ones which are already in use, but this email clarified that.

I've also done some further work towards a more general solution that handles other double-encoded codes involving characters not in the Toodyay set, but there are some remaining special cases to deal with, like https://en.wikipedia.org/wiki/We_are_the_99%25, which break assumptions about applying decoding that works in other cases.

I will test any changes against your 37 codes to ensure they are fixed up correctly.

thanks and all the best,

Jo

Firstly, I've made changes on our server so that URLs with wrongly encoded apostrophes and parentheses as in the 37 Toodyay URLs are now handled and redirected.

Thanks a lot! Wouldn't it be better not to handle every single special character but to generally run an urldecode? In German, for example, we also need äöüÄÖÜß. Accents are also needed (French, Danish...).


I've forwarded your question to Jo via email; I don't know if she's monitoring this ticket.

The problem with the double slashes is fixed - very good! Many QRpedia codes now work reliably again. But unfortunately the issue with the double-urlencoded addresses persists, although the solution is not difficult at all.
Above it was pointed out that the code is on rGQRP. Apparently index.php is the important file here.
In line 47 the page title is loaded:

$request = stripslashes( $_GET['title'] );

Because the input is urldecoded by $_GET, everything is fine with normal (simple) urlencoded strings.
The issue is the processing of double-urlencoded strings, which qrpedia.org built into the QRpedia codes and which now exist physically in the world, on information boards for example.
The solution: Give $request an additional urldecode.
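
The effect of that extra urldecode can be modelled as follows. This is a sketch in Python for illustration only (the live code is PHP's index.php); the function name `resolve_title` and the conditional are assumptions, not the actual implementation:

```python
from urllib.parse import unquote

def resolve_title(raw_title: str) -> str:
    """Sketch of the proposed handling: $_GET already applies one
    URL-decode; legacy QR codes carry double-encoded titles, so
    those need one more decode."""
    # raw_title is what PHP's $_GET['title'] would hold: decoded once.
    if "%" in raw_title:
        # Assumption: a remaining percent-escape means the original QR
        # URL was double-encoded, so decode once more.
        return unquote(raw_title)
    return raw_title

# A legacy double-encoded code: the query string held '...%2528...%2529',
# $_GET decoded it once to '%28'/'%29', and the extra decode finishes it.
assert resolve_title("Toodyay_%281963-_%29") == "Toodyay_(1963-_)"
# A normal title passes through untouched.
assert resolve_title("Berlin") == "Berlin"
```

As later comments in this thread note, a title that legitimately contains a % character can collide with this unconditional second decode, so a production fix would need to handle those special cases.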

Change 530236 had a related patch set uploaded (by Putnik; owner: Putnik):
[qrpedia@master] Fix double-encoded page titles

https://gerrit.wikimedia.org/r/530236

Change 530331 had a related patch set uploaded (by Putnik; owner: Putnik):
[qrpedia@master] Fix double-encoded page titles

https://gerrit.wikimedia.org/r/530331

Change 530331 abandoned by Putnik:
Fix double-encoded page titles

Reason:
Duplicate of https://gerrit.wikimedia.org/r/#/c/qrpedia/+/530236/

https://gerrit.wikimedia.org/r/530331

I think progress here is waiting on feedback from WMUK.

The above patch may fix the issue, but I'm not sure where we're at with ensuring that this repo is actually the one in production.

I think progress here is waiting on feedback from WMUK.

Who, at WMUK?

MichaelMaggs added a subscriber: ErrantX.

Jkbr I believe.

A patch was uploaded three months ago. I'm curious how this will proceed. Is it accepted that many QRpedia codes worldwide no longer function - that they are garbage? Or does WMUK still want to repair the software, and if so, when?
Here in Bremen we have the situation that 70 codes on public information boards are affected by the bug. Because of a faulty QRpedia generator, do we now have to look for a sponsor to renew the codes?

The patch failed some of our manual tests on the punctuation-encoding edge cases. To go forward we are developing a test suite and adding some basic READMEs etc. so functionality can be agreed and fixes tested.

The patch failed some of our manual tests on the punctuation-encoding edge cases.

The patch was provided over three months ago. Has the failure been fed back to the author? Are they available to fix it? If not, what is "plan B"? Who is responsible for ensuring this gets resolved? Is there a target date for resolution?

The patch failed some of our manual tests on the punctuation-encoding edge cases. To go forward we are developing a test suite and adding some basic READMEs etc. so functionality can be agreed and fixes tested.

I would strongly prefer that most faulty codes be cured immediately if possible!
That is better than waiting even longer in the hope of completely bug-free behaviour. Repairs can also be made in steps.

The patch fails for codes the live system accepts. The live deployment is based on an earlier codebase which we have patched to fix this issue. This code will be reconciled with the current master and deployed after testing.
The test suite is close to being ready. Task: https://phabricator.wikimedia.org/T62525
There really needs to be an accepted set of use cases for the valid/invalid decodes (as well as trying to handle 'legacy' codes which were created by the previous system).

What would be helpful is to report to me specific examples which are failing for you on qrpedia.org. I have the set of codes relating to Toodyay, but others would be useful.

It's always the same issue with every character, so you should not look for a separate solution for each individual one.
The string was urlencoded:
Bremer_Baumwollbörse -> Bremer_Baumwollb%C3%B6rse
OK, but it was then urlencoded again:
-> Bremer_Baumwollb%25C3%25B6rse
The % became %25. The address stored in the QR code is
http://de.qrwp.org/Bremer_Baumwollb%25C3%25B6rse
and results in an error message: "Der angeforderte Seitentitel enthält ungültige Zeichen: „%C3“." (In English: "The requested page title contains invalid characters: '%C3'.")
More broken URLs generated by qrpedia.org:
http://de.qrwp.org/Sch%25C3%25BCtting_%2528Bremen%2529
http://de.qrwp.org/Ratscaf%25C3%A9e/Deutsches_Haus
http://fr.qrwp.org/H%25C3%25B4tel_de_ville_de_Br%25C3%25AAme
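
Each of these recovers correctly with exactly two decodes. For instance, in Python (illustration only; the live code is PHP):

```python
from urllib.parse import unquote

broken = "Bremer_Baumwollb%25C3%25B6rse"   # path as stored in the QR code
step1 = unquote(broken)                    # what one decode yields
assert step1 == "Bremer_Baumwollb%C3%B6rse"
# The second decode resolves the UTF-8 escape %C3%B6 into 'ö'.
assert unquote(step1) == "Bremer_Baumwollbörse"
```
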

This %25 identifies all faulty QR codes - or are there valid article names in Wikipedia that contain a %?
Affected are characters such as äöüÄÖÜ()áà and many more that are not members of a-z, A-Z and 0-9.

The solution is: replace %25 with % before the urldecode. This is only possible if the change is made before the input is read via $_GET (which already decodes once).

Better controlled way:
Step 1: Change htaccess RewriteRule to
RewriteRule ^(.*)$ index.php?$1 [L,QSA]
(instead of RewriteRule ^(.*)$ index.php?title=$1 [L,QSA])

Step 2: In index.php, read the input from $_SERVER['QUERY_STRING'] instead of $_GET:
$request = stripslashes($_SERVER['QUERY_STRING']);
$request = str_replace('%25','%',$request);
$request = urldecode($request);
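
The behaviour of that three-line sequence can be modelled like this (Python for illustration; the proposal itself is PHP, and the function name `decode_request` is just for this sketch). Note the caveat with titles that genuinely contain %:

```python
from urllib.parse import unquote

def decode_request(query_string: str) -> str:
    """Model of the proposed index.php steps:
    str_replace('%25', '%', ...) followed by one urldecode()."""
    collapsed = query_string.replace("%25", "%")
    return unquote(collapsed)

# A legacy double-encoded code: works.
assert decode_request("Sch%25C3%25BCtting_%2528Bremen%2529") == "Schütting_(Bremen)"

# A correctly single-encoded literal '%' happens to survive, because a
# lone '%' is not a valid escape and is left alone by the decode.
assert decode_request("We_are_the_99%25") == "We_are_the_99%"

# But a hypothetical title containing a literal escape-like sequence,
# say "50%2Boff", would be wrongly collapsed and decoded to "50+off".
assert decode_request("50%252Boff") == "50+off"
```

This is why, as noted above, an agreed set of valid/invalid decode cases is needed before a general fix can be signed off.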

Again, more than a month has passed without any noticeable progress on the matter.
What are WMUK's intentions?
What is WMUK able to do?
Is QRpedia going to be fixed in the near future?

Someone needs to review (and merge) https://gerrit.wikimedia.org/r/#/c/530236/ I guess.
https://gerrit.wikimedia.org/r/#/admin/groups/534,members implies that currently only @Kelson could +2 that patch, if I understand correctly.

Edit: However, T210050#5699611 implies that this is not the actual codebase? Then where is the codebase?

This %25 identifies all faulty QR codes - or are there valid article names in Wikipedia that contain a %?

Yes; en.Wikipedia has:

and others. The first of those is also in hu. and ja. Wikipedias.

Yes; en.Wikipedia has:

and others. The first of those is also in hu. and ja. Wikipedias.

it's not very likely that QRpedia codes are needed for these articles, right?
Don't forget: There are a lot of physical QRpedia codes in the world that cost money to make, and are now dead. People want to use them and find out: Wikipedia is broken.

it's not very likely that QRpedia codes are needed for these articles, right?

I couldn't say, as I haven't reviewed all those on the English Wikipedia, let alone all those in other languages.

I am certainly not going to make guesses.

Don't forget: There are a lot of physical QRpedia codes in the world that cost money to make, and are now dead. People want to use them and find out: Wikipedia is broken.

Thank you. I haven't forgotten that; indeed, I've been lobbying to get this resolved. Have you perhaps mistaken me for someone else?

Thank you. I haven't forgotten that; indeed, I've been lobbying to get this resolved. Have you perhaps mistaken me for someone else?

I'm sorry my words were misleading. I didn't mean you. I remember Pigsonthewing as very constructive! It was intended as a general appeal.

Apparently, the doubly urlencoded addresses are now processed correctly again.
Many thanks to all who have worked on this!

My pleasure is mixed with some concern: will qrwp.org process the double-urlencoded addresses correctly permanently?
I was now able to test whether the article names mentioned in T210050#5792831 by @Pigsonthewing that contain % characters are processed correctly. They are. Test it yourself:

It looks like this task has been resolved (since last year). The codes I've checked are all working now. I think we can close it; does everyone agree?

Yes, this task can be closed.
But it would be better if QRpedia were resistant to such problems with non-ASCII URL characters in the future. It would also be better if QRpedia codes permanently pointed to their articles. I made a suggestion for this in February and offered the appropriate code adaptations, but they are based on the outdated rGQRP. I would also adapt the current code if I were given read access.
Please have a look at T245907.

I totally agree!

(For others' reference, the code-access issue is T244135.)