Keep page number parameter when processing Google Books links
Open, Normal · Public · 8 Story Points

Josve05a created this task. Sep 21 2015, 5:52 PM
Josve05a updated the task description.
Josve05a raised the priority of this task from to Needs Triage.
Josve05a added a subscriber: Josve05a.
Restricted Application added a subscriber: Aklapper. Sep 21 2015, 5:52 PM
Josve05a updated the task description. Sep 21 2015, 5:55 PM
Josve05a set Security to None.
Elitre added a subscriber: Elitre. Sep 21 2015, 6:03 PM

(It does work if you add that very same link in the template after generating the reference, FWIW.)

Josve05a added a comment (edited). Sep 21 2015, 6:03 PM

Currently citing that link generates this: https://sv.wikipedia.org/w/index.php?title=Anv%C3%A4ndare%3AElitre_%28WMF%29%2Fsandl%C3%A5da&type=revision&diff=30579070&oldid=29855053

{{Bokref|titel = Principles in the Emergence and Evolution of Linguistic Features in World Englishes|url = https://books.google.com/books?id=FJ7dAgAAQBAJ|utgivare = Anchor Academic Publishing (aap_verlag)|datum = 2014-02-01|hämtdatum = 2015-09-21|isbn = 9783954891917|språk = en|förnamn = Tobias|efternamn = Weber}}

It completely removes the book preview and the page number from the URL. I would expect Citoid not to change the link (beyond removing redundant strings that don't do anything), so that when I click on it I can still find and preview the page, as was intended when citing the link.

An example of how that would look can be generated with CitationBot; see https://test.wikipedia.org/w/index.php?title=User%3AJosve05a%2Fsandbox&diff=prev&oldid=242717

{{cite book|url=http://books.google.com/?id=FJ7dAgAAQBAJ&pg=PA48&lpg=PA48&dq=Th-debuccalization#v=onepage&q=Th-debuccalization&f=false|title=Principles in the Emergence and Evolution of Linguistic Features in World Englishes|isbn=9783954891917|author1=Weber|first1=Tobias|date=2014-02}}

It removed &source=bl&ots=1W9kMsAB1l&sig=2AySeojrJAExh74uHfHAHx7lBQs&hl=gd&sa=X&ved=0CDgQ6AEwCGoVChMI-MmEx7bTxwIVTOwUCh1pVAva but left &pg=PA48&lpg=PA48&dq=Th-debuccalization#v=onepage&q=Th-debuccalization&f=false, which is the expected result from a user's perspective.
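For illustration, the behaviour described above could be sketched in Python as a denylist filter. The parameter set below is only what was dropped in this one example, and the function name is hypothetical; this is not Citation bot's actual code or list.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: exactly the parameters that disappeared in the example above.
TRACKING_PARAMS = {"source", "ots", "sig", "hl", "sa", "ved"}


def strip_tracking_params(url: str) -> str:
    """Drop the listed query parameters; keep everything else, including the fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

With this approach the page (`pg`, `lpg`), the search term (`dq`) and the `#v=onepage&q=...` fragment all survive, which matches the result shown above.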

Mvolz moved this task from Backlog to Site specific issues on the Citoid board. Sep 30 2015, 6:05 PM
Mvolz added a subscriber: Mvolz. Sep 30 2015, 6:09 PM

This was caused by the fix for T107322

Restricted Application added a project: VisualEditor. Jul 17 2016, 12:34 PM
Nemo_bis triaged this task as Normal priority. Jul 17 2016, 12:35 PM
Jdforrester-WMF renamed this task from "Problem with stripping Google Books-links" to "Citoid doesn't strip links to Google Books correctly, removing page number along with preview".
Jdforrester-WMF set the point value for this task to 8. Aug 1 2016, 4:56 PM

This was by design, because there was concern about leaving the search string in there, so we removed all query strings for google book links and used the canonical link instead.

T107322

However, this had the side effect of removing page numbers, and sometimes the search word is actually useful for pointing to instances in the book. I experienced this myself recently and manually pasted the link back in to include the search string.

It would be trivial to revert this, but I'm wondering what @Jdforrester-WMF thinks about putting the search string back in?

Mvolz claimed this task. Aug 22 2016, 4:20 PM

This is potentially important. I had begun using this feature to convert Google Books URLs, but I abandoned it after getting complaints that the converted URLs did not identify the page. I had originally requested that the conversion automatically leave in the page parameter so it would be easy to fill in, but it seems it would be better if the URL took you directly to the page. I don't know the history that led to the decision to switch to the base URL, and perhaps it can't easily be undone because there may be other problems of which I am unaware, but as constituted the reference conversion option is useless to me.

Yeah, I think not having the main search string is a good call (for user privacy reasons). Would it be possible to pull out just the page number bit from the URL? If not, let's Decline this.

I do get that there could be some issues with the search string. It is certainly not a problem in most circumstances, but there can be situations where it is problematic. I don't see such an issue with the page number, although I recognize that converting the string to remove the search and keep the page number may be tricky.

Can you provide more details on what the privacy problem is? I'm concerned that removing the page numbers and search string turns Citationbot into Vandalbot. That's a non-trivial change to the ref, and there is no clear consensus for removing it with a bot.

Can you provide more details on what the privacy problem is?

Search queries are generally problematic because they can contain anything, see e.g. https://blog.wikimedia.org/2012/09/19/what-are-readers-looking-for-wikipedia-search-data-now-available/ ; but AFAIK Google Books only retains the query from which you clicked the book link, and it's unlikely you ended up on a book page by searching e.g. your phone number or exact home address. If you did, I think problems will arise only if you don't realise that the original search query is preserved even when you search again and the fragment updates, as you get a URL like https://books.google.it/books?id=gS65RUG1Ns0C&pg=PA288&dq=via+pagano+54&hl=en&sa=X&ved=0ahUKEwiWmKDl_9bOAhVCVBQKHSSMClQQ6AEIKTAC#v=snippet&q=disse&f=false ; however, this search URL doesn't update when you click a page number, so you wouldn't be able to use such a URL to reference a page anyway.
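To make the point about the two search terms concrete, here is a small Python sketch that pulls them out of such a URL. It assumes (based only on the example URL above) that the original search is carried in the `dq` query parameter and the later in-book search in the `q` part of the fragment; the function name is made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def search_terms(url: str) -> dict:
    """Return the search strings found in the query (dq) and in the fragment (q)."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    fragment = parse_qs(parts.fragment)  # the fragment uses the same key=value&... layout
    return {
        "original_search": query.get("dq", []),
        "in_book_search": fragment.get("q", []),
    }
```

Applied to the books.google.it URL above, the original search ("via pagano 54") is still present in the query even though the fragment now carries a different term ("disse").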

Is there a scenario where one can reference a single page and unwittingly include unexpected information in the URL?

I've thought a little bit more about the privacy concern. I can appreciate the decision not to publish a database of search queries because of the realization that people occasionally, either deliberately or accidentally, included personal information in the search query. However, I think that case can be distinguished from this case in two ways.

The first is that, while many people are vaguely aware that anything typed into a search box may be captured by some organization, even those people would be quite surprised if their specific search were published in a publicly available database (as opposed to being used in aggregate to study search terms). In contrast, adding a reference to a Google Books entry in an article is a deliberate act of public publication. Editors would not be surprised that the link they published is available; they would be unhappy if it were not. (I do realize I am glossing over a possibly important point: the editor might respond, "Yes, I fully understood I was publishing a link to a search in Google Books", but might also respond, "Seriously, I included my home phone number in the search string? That was not intended.")

The second point is that they copy and paste the link as a reference, and while they might view it as a long string of barely comprehensible characters, they, more than anyone else, are likely to notice if the string includes some personal information. While they still might miss it, I think it is extremely unlikely that a search string including some personal information will produce a hit in a Google Books search for information relevant to the article they are editing.

For example, I can imagine accidentally having my phone number in my copy-paste buffer and doing a search, either in Wikipedia or in Google Books, for my phone number. I fully understand that both Wikimedia and Google, respectively, have that search in some database. However, I have trouble imagining that such a search would result in a useful hit, i.e. a Google Books reference that I want to use in an article. I suggest that such a case is extremely rare, and if it ever happens, we can do the usual revdel and remove it. (Keep in mind that it was already in the article, so converting the raw URL to a properly formatted reference does not add the personal information; it was already there.)

Finally, if someone still thinks the theoretical concern is worth worrying about, we could proceed as follows. Presuming that the system can distinguish between a URL pointing to the title page and a URL pointing elsewhere (as the result of a page reference and/or search term), then in the second case a popup could be displayed, warning that the link is not to the title page and that the URL should be reviewed to see if it includes personal information. I think that is overkill, but it would let the editor check to make sure they weren't preserving personal information in the link.
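The check that would trigger such a popup might look roughly like the Python sketch below. It assumes that a page reference or search term shows up as one of the `pg`, `lpg`, `dq` or `q` keys seen in the example URLs in this task; both that key list and the function name are assumptions, not anything Citoid implements.

```python
from urllib.parse import parse_qs, urlsplit

# Assumption: keys that indicate a page reference or a search term.
REVIEW_KEYS = {"pg", "lpg", "dq", "q"}


def points_beyond_title_page(url: str) -> bool:
    """True if the URL carries a page or search key in its query or fragment."""
    parts = urlsplit(url)
    keys = set(parse_qs(parts.query)) | set(parse_qs(parts.fragment))
    return bool(keys & REVIEW_KEYS)
```

A caller would show the warning only when this returns True, so plain `?id=...` links would pass through without interrupting the editor.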

(Apologies for length, let me know if long dispatches should be handled differently.)

Mvolz added a comment (edited). Aug 26 2016, 1:51 PM

What is the conclusion for this? Do we

  • Remove search query string only
  • Leave all query parameters in

The first is preferable, the second could be ok.

The second would make it so we don't have to have a separate translator repo from zotero anymore which I would LOVE so I'm clearly angling at that one... :D Appeal to the higher authority of Security? @dpatrick?

Mvolz added a comment. Aug 26 2016, 2:21 PM

(the git merge algorithm does NOT merge these well at all, it involves lots
of manual fixing of the merge unfortunately, which impacts our ability to
regularly pull in upstream changes)

The second please.

czar added a subscriber: czar. Aug 31 2016, 5:59 AM

I typically strip my Google Books citations of everything but the book ID and the desired page to link (manually). So for the original example: https://books.google.com/books?id=FJ7dAgAAQBAJ&pg=PA48

Is there a need for more metadata? I'd argue that highlighting the search term makes more of a mess than it's worth, though users are welcome to link whatever they want manually.
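As a rough Python sketch of that manual clean-up: keep only `id` and `pg` and rebuild the link. Normalising every input to `https://books.google.com/books` is an assumption made for this example (the original host and path are discarded), and the function name is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: all results are rebuilt on this base, whatever the input host was.
BASE = "https://books.google.com/books"


def minimal_google_books_url(url: str) -> str:
    """Rebuild the link with only the book ID and, if present, the page."""
    query = parse_qs(urlsplit(url).query)
    if "id" not in query:
        return url  # nothing recognisable to reduce; leave the link untouched
    kept = [(key, query[key][0]) for key in ("id", "pg") if key in query]
    return BASE + "?" + urlencode(kept)
```

Run on the Citation bot URL quoted earlier in this task, this yields the short `id` plus `pg=PA48` form given above; the search term and fragment are dropped.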

Ramalepe set Security to Software security bug. Nov 20 2016, 6:36 PM
Ramalepe added a project: Security.
Ramalepe changed the visibility from "Public (No Login Required)" to "Custom Policy".
Ramalepe added a subscriber: Ramalepe.

How to make wiki projects to be able to get Google ...lyk to google through wikidata?????tnx reply plz..

Restricted Application removed a subscriber: Zppix. Nov 20 2016, 6:36 PM
Restricted Application added a project: Security. Nov 20 2016, 6:56 PM
Legoktm changed the visibility from "Custom Policy" to "Public (No Login Required)".
Legoktm changed Security from Software security bug to None.

[offtopic]

How to make wiki projects to be able to get Google ...lyk to google through wikidata?????tnx reply plz..

@Ramalepe: This question is off-topic for this task as it is unrelated to Google Books links in Citoid. Please ask in a support forum like https://www.mediawiki.org/wiki/Project:Support_desk - thanks!

What is the conclusion for this? Do we

  • Remove search query string only
  • Leave all query parameters in

I believe that removing the search query string is the most privacy-respecting option, and the one that allows for safe contribution by the broadest range of users (less-knowledgeable to highly-knowledgeable).
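A minimal Python sketch of that first option, removing only the search term while leaving the page number and everything else alone, might look like this. It assumes the search term lives in `dq` (query) and `q` (fragment), as in the example URLs in this task; the names are illustrative only.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: the only keys that carry user-typed search text.
SEARCH_KEYS = {"dq", "q"}


def _without_search(component: str) -> str:
    """Re-encode a key=value&... string with the search keys removed."""
    pairs = parse_qsl(component, keep_blank_values=True)
    return urlencode([(k, v) for k, v in pairs if k not in SEARCH_KEYS])


def remove_search_terms(url: str) -> str:
    """Strip search text from both the query and the fragment; keep the rest."""
    parts = urlsplit(url)
    return urlunsplit(
        parts._replace(
            query=_without_search(parts.query),
            fragment=_without_search(parts.fragment),
        )
    )
```

The fragment is filtered as well because, as noted earlier in the thread, a second search term can sit after the `#` even when `dq` holds the original one.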

I can't tell you how badly this breaks things; it might as well be vandalism. It's not just the search query being removed, it's the page number too. So when you click on the link, it goes to the first page of the book, not to the page where the cited material occurs. Theoretically the page number is given in the cite itself, but often it is not. So you end up deleting critical information needed for verification. I would expect that if this went to RfC and the community saw what it was doing, there would be an outcry. The privacy issue is, IMO, not nearly as common a problem, and there is a level of personal responsibility.

The second would make it so we don't have to have a separate translator repo from zotero anymore

Sounds like a good reason to prefer the second then.

Mvolz removed Mvolz as the assignee of this task. Jan 23 2017, 12:57 PM
czar added a comment. Mar 14 2017, 2:53 AM

I'm not exactly sure how this ended but it looks like the GitHub discussion (#1066) ended in the decision to put the Google-Books-ID in the extra field. I suppose this means that we need a Citoid-specific Google Books translator that will retain the URL submitted to Citoid? Or is there some other means for catching the URL without making a Citoid-specific translator? I still think the GB link's parameters should be reduced to the single page only (as done in http://reftag.appspot.com/).

czar renamed this task from "Citoid doesn't strip links to Google Books correctly, removing page number along with preview" to "Keep page number parameter when processing Google Books links".