
Keep anchor when generating reference from URL
Open, Medium, Public

Description

For example, https://net.jogtar.hu/jogszabaly?docid=A1200001.TV#pr575id results in:

[
  {
    "key": "RS78AC8V",
    "version": 0,
    "itemType": "webpage",
    "tags": [],
    "title": "2012. évi I. törvény - 1.oldal - Hatályos Jogszabályok Gyűjteménye",
    "url": "https://net.jogtar.hu/jogszabaly?docid=A1200001.TV",
    "abstractNote": "a munka törvénykönyvéről",
    "language": "en",
    "accessDate": "2018-12-25",
    "websiteTitle": "net.jogtar.hu",
    "author": [
      [
        "Wolters Kluwer Hungary",
        "Kft"
      ]
    ],
    "source": [
      "Zotero"
    ]
  }
]

Note how #pr575id is not present in the URL.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Mvolz renamed this task from "Citoid drops anchor when generating cite web reference" to "Keep anchor when generating reference from URL". Mar 7 2019, 2:52 PM
Mvolz triaged this task as Medium priority.

Somewhat amusingly, with this one the anchor is kept in the DOI (where it shouldn't be) but not in the URL: https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/https%3A%2F%2Fwww.jstor.org%2Fstable%2F10.14321%2Frhetpublaffa.21.2.0279%3Fseq%3D1%23page_scan_tab_contents

Ideally, a solution would handle anchors in a way that also avoids adding them to the DOI. Unfortunately the DOI spec allows hashes, but it's still probably avoidable in most cases.
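One way to picture the "avoidable in most cases" part: if the DOI was extracted from the requested URL and ends with exactly that URL's anchor, the trailing anchor is almost certainly not part of the DOI. The following is only a sketch of that heuristic, not how Citoid works; the function name `drop_url_anchor_from_doi` and the example DOI string are made up for illustration.

```python
from urllib.parse import urlsplit


def drop_url_anchor_from_doi(doi: str, requested_url: str) -> str:
    """Remove a trailing '#fragment' from a DOI candidate, but only when it
    is identical to the fragment of the URL the DOI was scraped from.

    Hypothetical helper: DOIs may legitimately contain '#', so anything
    that does not match the URL's own anchor is left untouched.
    """
    fragment = urlsplit(requested_url).fragment
    suffix = "#" + fragment
    if fragment and doi.endswith(suffix):
        return doi[: -len(suffix)]
    return doi


# Illustrative input only; the exact DOI string returned by the service
# is an assumption here.
url = ("https://www.jstor.org/stable/10.14321/rhetpublaffa.21.2.0279"
       "?seq=1#page_scan_tab_contents")
cleaned = drop_url_anchor_from_doi(
    "10.14321/rhetpublaffa.21.2.0279#page_scan_tab_contents", url
)
```

A DOI that contains a hash for its own reasons is returned unchanged, because it will not end with the anchor of the requested URL.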

This issue (Citoid replacing the given URL with the one found in the page's metadata) causes problems not only with anchors, but also with sites that use short URLs in their metadata:

As reported in https://de.wikipedia.org/wiki/Wikipedia:Technik/Text/Edit/VisualEditor/R%C3%BCckmeldungen#kurz-urls_von_faz.net , Citoid is replacing

https://www.faz.net/aktuell/gesellschaft/menschen/verleumdungsklage-von-e-jean-carroll-gegen-trump-16469341.html

with the short URL https://www.faz.net/1.6469341 , which is not recommended for use in citations.

Wouldn't it be an easy solution to just keep whatever URL was entered in the first place?


It sure would! In fact, that's how it used to be. We started using the canonical URL instead of the user-entered one for privacy reasons: many URLs contain query parameters that track users, and they may also contain identifying information. See T107322, where this was done.

I don't see any easy way around this conflict other than maintaining a blacklist of query parameters we *don't* want to include. Some are used universally (e.g. utm, https://en.wikipedia.org/wiki/UTM_parameters), but a lot of them are site-specific, unfortunately, and of course they change. This is the sort of thing that kind of explodes, maintenance-wise.
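To make the blacklist idea concrete, here is a minimal sketch that keeps the user-entered URL (anchor included) and only drops query parameters on a denylist. It is not Citoid code: the name `strip_tracking_params` and the one-entry `TRACKING_PREFIXES` list are assumptions, and a real list would need the site-specific entries mentioned above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical denylist: only the widely used utm_* family is covered.
TRACKING_PREFIXES = ("utm_",)


def strip_tracking_params(url: str) -> str:
    """Return the URL without denylisted query parameters.

    Path, remaining parameters, and the #fragment are preserved.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# The utm_source parameter is invented for the example.
cleaned_url = strip_tracking_params(
    "https://net.jogtar.hu/jogszabaly?docid=A1200001.TV&utm_source=x#pr575id"
)
```

Note that `urlencode` re-serializes the query string, so unusual escaping in the remaining parameters may come out normalized rather than byte-for-byte identical.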

Use the short URL, but keep the anchor? It's not typically used for tracking (although in theory it could be). If you want to be super paranoid, maybe verify that there's an HTML id matching the anchor.
(Note though that some webapps might use the anchor for routing and completely break when it is removed.)
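The suggestion above could look roughly like the following sketch: take the canonical URL from the metadata, carry over the anchor from the URL the user entered, and optionally require that the page contains a matching id. The names `keep_anchor` and `_IdCollector` are hypothetical, and treating `<a name="...">` as a valid target is an assumption about what should count as a match.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import unquote, urlsplit, urlunsplit


class _IdCollector(HTMLParser):
    """Collects id attributes (and legacy <a name="...">) from a page."""

    def __init__(self) -> None:
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (name == "id" or (tag == "a" and name == "name")):
                self.ids.add(value)


def keep_anchor(requested_url: str, canonical_url: str,
                html: Optional[str] = None) -> str:
    """Append the requested URL's anchor to the canonical URL.

    If html is given, the anchor is kept only when an element with a
    matching id exists (the "super paranoid" check).
    """
    fragment = urlsplit(requested_url).fragment
    canonical = urlsplit(canonical_url)
    if not fragment or canonical.fragment:
        return canonical_url
    if html is not None:
        collector = _IdCollector()
        collector.feed(html)
        if unquote(fragment) not in collector.ids:
            return canonical_url
    return urlunsplit(canonical._replace(fragment=fragment))


with_anchor = keep_anchor(
    "https://net.jogtar.hu/jogszabaly?docid=A1200001.TV#pr575id",
    "https://net.jogtar.hu/jogszabaly?docid=A1200001.TV",
)
```

The id check would reject anchors that a web app uses purely for client-side routing, since those usually have no matching element, so it trades the breakage noted above for stricter privacy.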