
Filter editing links in tracking_param_remover.py by domain
Open, Needs Triage | Public | Feature

Description

Feature summary (what you would like to be able to do and where): Add logic to tracking_param_remover.py that excludes links on certain domains from editing, starting with "web.archive.org" and "archive.is".

Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution): I uncovered the issue while running my bot; an example of such an edit can be seen here. In this case I would expect the script to leave the link to web.archive.org unedited and thus skip the page.

Benefits (why should this be implemented?): By design, links to archived URLs must stay exactly as they were archived; otherwise they stop working. There are probably other cases where tracking parameter removal shouldn't apply, but filtering out the archive links is my primary concern here.
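
A minimal sketch of how such a domain filter might look, assuming the script can test each URL before rewriting it. The names EXCLUDED_DOMAINS and is_excluded_domain are hypothetical and are not taken from tracking_param_remover.py; only the standard library is used.

```python
from urllib.parse import urlsplit

# Hypothetical constant: domains whose links should never be edited.
EXCLUDED_DOMAINS = frozenset({"web.archive.org", "archive.is"})


def is_excluded_domain(url, excluded=EXCLUDED_DOMAINS):
    """Return True if the URL's host is an excluded domain or a subdomain of one."""
    # hostname is None for URLs without a network location, hence the fallback.
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in excluded)
```

With a check like this, a link such as https://web.archive.org/web/2020/https://example.com/?utm_source=x would be left untouched even though the archived URL inside it carries a tracking parameter.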

Event Timeline

I also suggest filtering out ASP.NET links (those that include ".asp") because they make extensive use of URI parameters, and some of those happen to match "KNOWN_TRACKER_PARAMS" in the script; see example.
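
One possible way to detect such links, as a rough sketch: the helper name looks_like_asp is hypothetical, and checking only the path component (rather than the whole URL) is an assumption made here so that a query value containing ".asp" does not trigger the filter.

```python
from urllib.parse import urlsplit


def looks_like_asp(url):
    """Return True if the URL path contains ".asp" (this also covers ".aspx")."""
    # Only the path is inspected; the query string is deliberately ignored.
    return ".asp" in urlsplit(url).path.lower()
```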