InternetArchiveBot should detect and not link to snapshots of domain reselling/domain squatting pages
Open, Needs Triage, Public

Description

When domains expire, they are sometimes "squatted" by domain resellers. The resulting pages are unsuitable as references for encyclopaedia articles (except possibly in an article about domain squatting, but that's a niche case).
Where possible, IABot should:

  • detect a site as a reselling page and mark it as dead
  • not link to archives of the domain reselling page

In my experience, the following strings as the page title reliably indicate that the page is a domain reselling page:

  • "This website is for sale"
  • "Deze website is te koop" [Dutch: "This website is for sale"]
  • "HugeDomains.com"
  • "Denna sida är till salu" [Swedish: "This page is for sale"]
  • "available at DomainMarket.com" [this is the tail end of the string which typically includes the domain name]

The following strings indicate that the page is not the original content, though not necessarily a domain reselling page:

  • "主婦が消費者金融に対して思う事" [Japanese: roughly "What housewives think about consumer finance companies"]
  • "page not found"
  • "ACTUAL ARTICLE TITLE BELONGS HERE"
  • "Website disabled"

Neither list is exhaustive. False positives are possible but will be very rare.
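
To make the proposal concrete, here is a minimal Python sketch of title-based detection using the two lists above. It is an illustration only, not IABot's actual code: the names `extract_title`, `classify_title`, `RESELLER_MARKERS` and `NOT_ORIGINAL_MARKERS` are hypothetical, and the choice of case-insensitive substring matching (rather than exact title matching) is an assumption.

```python
from html.parser import HTMLParser
from typing import Optional

# Title fragments copied from the lists in this task.
# Hypothetical names; matching is case-insensitive substring (an assumption).
RESELLER_MARKERS = (
    "This website is for sale",
    "Deze website is te koop",
    "HugeDomains.com",
    "Denna sida är till salu",
    "available at DomainMarket.com",
)
NOT_ORIGINAL_MARKERS = (
    "主婦が消費者金融に対して思う事",
    "page not found",
    "ACTUAL ARTICLE TITLE BELONGS HERE",
    "Website disabled",
)


class _TitleParser(HTMLParser):
    """Collects the text of the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> str:
    """Return the page title with runs of whitespace collapsed."""
    parser = _TitleParser()
    parser.feed(html)
    return " ".join(parser.title.split())


def classify_title(title: str) -> Optional[str]:
    """Return "reseller", "not_original", or None if no marker matches."""
    normalized = " ".join(title.split()).casefold()
    if any(m.casefold() in normalized for m in RESELLER_MARKERS):
        return "reseller"
    if any(m.casefold() in normalized for m in NOT_ORIGINAL_MARKERS):
        return "not_original"
    return None
```

Under this sketch, a "reseller" result would correspond to both requested behaviours (mark the live URL as dead and skip that snapshot), while "not_original" would only justify skipping the snapshot. How either outcome maps onto IABot's internals is left open here.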