We started using the canonical URL rather than the user-entered URL for security reasons (T107322), but users have regularly complained that this is destructive. It removes anchors (T212608), page numbers in Google Books links, and other important query parameters, and websites sometimes redirect us to an unexpected place such as the home page if we aren't emulating a browser well enough.
An alternative is to remove tracking parameters based on a blacklist (e.g. https://en.wikipedia.org/wiki/UTM_parameters for some of them) and to default to leaving a parameter in unless we "know" it isn't needed.
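A minimal sketch of the blacklist approach, in Python for illustration only. The function name `strip_tracking_params` and the contents of `TRACKING_PARAMS` are assumptions, not an agreed list. The query string is filtered pair by pair without re-encoding, so parameters that stay are left exactly as the user wrote them, and the anchor is untouched.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical blacklist: only the five standard UTM parameters.
# The real list would need to be decided separately.
TRACKING_PARAMS = {
    "utm_source",
    "utm_medium",
    "utm_campaign",
    "utm_term",
    "utm_content",
}


def strip_tracking_params(url: str) -> str:
    """Drop blacklisted query parameters; keep everything else as written."""
    parts = urlsplit(url)
    kept = [
        pair
        for pair in parts.query.split("&")
        # Compare only the parameter name (the text before the first "=").
        if pair and pair.split("=", 1)[0].lower() not in TRACKING_PARAMS
    ]
    # The fragment (anchor) and all other components pass through unchanged.
    return urlunsplit(parts._replace(query="&".join(kept)))
```

Any parameter that is not on the list is kept, which matches the "leave it in unless we know it isn't needed" default.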
Cons
- Prohibited URLs that users enter, such as private IP addresses, could be left in (this can probably be mitigated by explicitly disallowing this particular case).
- User tracking parameters that aren't on our blacklist would be left in.
- The metadata might not actually come from the URL the user entered. For instance, in T210871 the metadata is from a splash page, and in this bug the metadata is from the home page, not the intended URL. Putting the "bad" URL in the citation signals to the user that they need to fix the metadata manually, and also records the actual source of the metadata.
- The actual source of the metadata is discarded (the resolved/canonical URL, as opposed to the user-entered one).
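The private IP address case from the first point above could be rejected up front. The following is a rough sketch in Python, under the assumption that only literal IP addresses in the URL are checked; the function name `is_prohibited_host` is hypothetical. A domain name that resolves to a private address is not caught here and would need a DNS lookup.

```python
import ipaddress
from urllib.parse import urlsplit


def is_prohibited_host(url: str) -> bool:
    """Return True if the URL's host is a private, loopback, or link-local IP literal."""
    host = urlsplit(url).hostname
    if host is None:
        # No host could be parsed; treat the URL as not allowed.
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal (for example a domain name); out of scope for this sketch.
        return False
    return addr.is_private or addr.is_loopback or addr.is_link_local
```

With a check like this in place, the user-entered URL could be kept in the citation only when it passes.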
Pros
- Leaves important page query parameters in
- Leaves anchors in
- Creates a citation that needs slightly less fixing if the metadata is bad, because the URL field already holds the intended URL
- Less confusing for users, who expect the URL they enter to go in the URL field as written