InternetArchiveBot should detect blank snapshots
Closed, Resolved (Public)

Description

Here IABot added a link to this snapshot, which is essentially a blank page. It should be possible to detect whether a snapshot contains a reasonable amount of text.

Of course, there might be snapshots which contain the referenced information within an image or a video, but I believe that at least cases like the one given above could be avoided.
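
As a rough illustration of the kind of check described above, here is a minimal sketch in Python (not IABot's actual code). It assumes the snapshot's HTML has already been fetched as a string, and it treats a page as blank when its visible text falls below a threshold. The names `visible_text`, `looks_blank`, and `MIN_VISIBLE_CHARS`, as well as the threshold value of 200 characters, are hypothetical and chosen only for illustration; the task does not define what counts as a "reasonable amount of text".

```python
from html.parser import HTMLParser

# Hypothetical threshold: the task does not say what a "reasonable amount
# of text" is, so this value is an assumption for illustration only.
MIN_VISIBLE_CHARS = 200


class _VisibleTextParser(HTMLParser):
    """Collects text that is not inside script, style, or similar elements."""

    _SKIPPED = {"script", "style", "noscript", "template", "title"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    """Return the text content of the page with whitespace collapsed."""
    parser = _VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def looks_blank(html: str, min_chars: int = MIN_VISIBLE_CHARS) -> bool:
    """Heuristic: True if the page has fewer than min_chars of visible text."""
    return len(visible_text(html)) < min_chars
```

A heuristic like this would flag pages whose body is empty or consists only of scripts, but it would also flag the image-only or video-only snapshots mentioned below, so it could at most serve as a first filter.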

Event Timeline

Cirdan triaged this task as Lowest priority. May 17 2018, 2:27 PM

I personally think this should be done on the Wayback Machine's end. There's little point to fetching a snapshot and then making an additional call to try and read the snapshot, when they have the means to do all of that internally themselves.

While I agree, I believe that this and other cases under T193158: Detect unsuitable archive snapshots should also be dealt with on Wikipedia's end. That doesn't necessarily mean that IABot has to do this; there could also be an external service interacting with the database.

Aklapper added subscribers: Cyberpower678, Aklapper.

@Cyberpower678: I am resetting the assignee of this task because there has not been progress lately (please correct me if I am wrong!). Resetting the assignee avoids the impression that somebody is already working on this task. It also allows others to potentially work towards fixing this task. Please claim this task again when you realistically plan to work on it (via Add Action...Assign / Claim in the dropdown menu). Thanks for your understanding!

Cyberpower678 claimed this task.

Closing this as resolved, as the Wayback Machine has gotten better over the years.