IABot sets title as "Archived copy" even when title is available on the archived webpage
Closed, Declined · Public

Description

See this diff, where "Archived copy" is set as the title of the cite templates. In a subsequent edit, I was easily able to go to these archived webpages and get the titles to plug in manually. I thought it was odd that the bot couldn't retrieve this readily available information.

Event Timeline

The bot isn't given this information in API calls, and adding code to screen-scrape the title adds too much overhead for the benefit. The bot also only uses that title when it converts an external link to a cite template.
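For readers unfamiliar with what screen-scraping a title involves, here is a minimal sketch in Python. It is not the bot's code; `extract_title` and `_TitleParser` are hypothetical names introduced for illustration. It assumes the archived page's HTML has already been downloaded, which is the expensive step being discussed: one extra full page fetch per link.

```python
from html.parser import HTMLParser
from typing import Optional


class _TitleParser(HTMLParser):
    """Collects the text of the first <title> element (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; ignore any later ones.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self) -> Optional[str]:
        # Collapse runs of whitespace; return None if nothing was found.
        text = " ".join("".join(self._chunks).split())
        return text or None


def extract_title(html: str) -> Optional[str]:
    """Return the page title from already-fetched HTML, or None if absent."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


# Assumption: the caller falls back to the placeholder when no title exists.
page = "<html><head><title> Example  Page </title></head><body></body></html>"
print(extract_title(page) or "Archived copy")
```

The parsing itself is cheap. The overhead raised in this thread comes from having to request the full archived page for every link just to obtain the HTML to feed into it.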

Is the time editors spend manually correcting titles taken into account in the cost-benefit analysis? The idea of a bot creating new work doesn't seem quite right.

Considering the external links didn't have a title to begin with, it doesn't seem to be adding new work.

It can be argued that way, but "Archived copy" sticks out like a sore thumb and calls for fixing, so additional time is spent when the bot could, in theory, have inserted the real title.

The bot does a large amount of work already. Pulling the page content of a page to extract the title adds a significant load to the bot. Not to mention, the bot is more likely to get blacklisted from the site when it pings it for a heartbeat, which will increase the false positive rate. I'm not willing to risk that.

Would it be reasonable to open a request to have a particular API that is used return the page's title?

I'm not sure what you mean. I don't know of any API that does that.

This was discussed above. You had said that the APIs you use to fix a dead link using an archive didn't return the title from the archive page. Would it be prudent to request that they do?

They currently have much larger issues to deal with. Adding title support is very low on the priority list.

So, let's open a ticket and they will set it as low priority. As long as it's documented in the system, that would be fine with me.

OK, so I'll open a ticket wherever these API developers are. Have a link?

I unfortunately do not. I only have direct communication with the API devs of the Wayback Machine.

I have started reaching out to them about this. If they can do it, I will re-open this ticket and wait for them to implement it, so I can then implement it on my end. If it's too hard on available resources, then this will rest here.

They said it's not feasible without severely limiting performance, which is the same reason I won't do it either. The cost is too high for the gain.