Maniphest T242089

Consider keeping user entered URL and removing tracking parameters
Open, Medium, Public
Assigned To

None

Authored By

	Mvolz
	Jan 7 2020, 10:22 AM

Description

We started using the canonical URL rather than the user-entered URL for security reasons (T107322), but users regularly complain that this is destructive: it removes anchors (T212608), page numbers in Google Books links, and other important query parameters, and websites sometimes redirect us to an unrelated page such as the home page if we aren't emulating a browser well enough.

An alternative is to keep the user-entered URL and remove tracking parameters based on a blacklist (e.g. https://en.wikipedia.org/wiki/UTM_parameters for some), defaulting to leaving a parameter in if we don't "know" that it isn't needed.
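A minimal sketch of the blacklist approach, in Python for illustration only (it is not taken from Citoid's code). The `TRACKING_PARAMS` set and the `strip_tracking` name are hypothetical, and a real list would need curation. Anything not on the list, including the fragment, is left in place.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical blacklist: only the common UTM parameters, for illustration.
TRACKING_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
}

def strip_tracking(url: str) -> str:
    """Drop blacklisted query parameters; keep everything else as entered."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS  # unknown parameters default to staying in
    ]
    # The fragment (anchor) is carried over untouched by _replace.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

For example, `strip_tracking("https://books.google.com/books?id=abc&pg=PA12&utm_source=x#v=onepage")` keeps `id`, `pg`, and the anchor, and drops only `utm_source`. One caveat of this sketch: `urlencode` re-encodes the remaining parameters, so unusual percent-encoding in the original query may not survive byte for byte.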

Cons

Prohibited URLs that users add, such as private IP addresses that we don't allow, could be left in (this can probably be ameliorated if we disallow this particular case).
User tracking parameters that aren't on our blacklist would be left in.
The metadata might not actually come from the URL the user entered: in T210871, for instance, the metadata is from a splash page, and in this bug here the metadata is from the home page, not the intended URL. Putting the "bad" URL in indicates to the user that they need to fix the metadata manually, and also records the actual source of the metadata.
The actual source of the metadata is discarded (the resolved / canonical url, as opposed to the user entered one)
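The first con could be handled by rejecting user-entered URLs whose host is a private address before keeping them. The sketch below, in Python for illustration only, assumes the check covers IP literals alone. The function name `is_private_literal` is hypothetical, and hostnames that resolve to private addresses would need a separate DNS-level check.

```python
import ipaddress
from urllib.parse import urlsplit

def is_private_literal(url: str) -> bool:
    """Return True if the URL's host is a private, loopback, or link-local IP literal."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal (an ordinary hostname); this sketch does not resolve it.
        return False
    return addr.is_private or addr.is_loopback or addr.is_link_local
```

A URL such as `http://192.168.1.10/admin` would be flagged, while `https://en.wikipedia.org/` would pass through.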

Pros

Leaves important page query parameters in
Leaves anchors in
Creates a citation that needs slightly less fixing if the metadata is bad, because
Less confusing for users who expect that the url they enter will go in the url field as written

Related Objects

Mentioned In: T271438: Citoid breaks URLs for www.transfermarkt.de - domain replaced by underscore in URL
Mentioned Here: T210871: Citoid is overwriting editor provided values without notification (was "Bloomberg - Are you a robot?")
T107322: Trim the user's search string from Google Books search URLs
T212608: Keep anchor when generating reference from URL

Event Timeline

Mvolz created this task. Jan 7 2020, 10:22 AM

Restricted Application added a subscriber: Aklapper. Jan 7 2020, 10:22 AM

@Xaosflux I think this was what you were trying to report on the other task.

Mvolz added a project: acl*security. Jan 9 2020, 10:02 AM

Mvolz updated the task description.

Mvolz updated the task description. Jan 9 2020, 10:49 AM

@Mvolz in T210871 the problem isn't so much about "removing parameters"; it seems to be that the process follows third-party redirects and then replaces the entire path (and, coincidentally, in the example this results in adding parameters). By replacing the entire path, editorial control of references is lost (and in the example this also produces a reference that is useless for readers and editors).