Try to find link in archive.org when direct scraping fails
Closed, Resolved | Public | 0 Estimated Story Points

Description

Given our recent issues with large publishers blocking us from scraping them directly, it would be very useful to fall back to getting our metadata from archive.org. In the process we could fetch the archive URL and data and feed that into the template (similar to T115224).
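
As a rough illustration of feeding the archive URL and date into the template, the sketch below derives those fields from a Wayback Machine snapshot URL. The function name `archive_fields` and the keys `archiveUrl` and `archiveDate` are hypothetical and not taken from citoid; the only thing assumed about archive.org is the snapshot URL layout `web.archive.org/web/<14-digit timestamp>/<original URL>`.

```python
import re
from datetime import datetime

# Assumed snapshot layout: http(s)://web.archive.org/web/<YYYYMMDDhhmmss>/<original URL>
_SNAPSHOT_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def archive_fields(snapshot_url):
    """Split a snapshot URL into hypothetical citation fields.

    Returns None when the URL does not look like a Wayback snapshot.
    """
    match = _SNAPSHOT_RE.match(snapshot_url)
    if match is None:
        return None
    timestamp, original_url = match.groups()
    # The timestamp encodes the capture time, e.g. 20150408084400.
    captured = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    return {
        "url": original_url,
        "archiveUrl": snapshot_url,
        "archiveDate": captured.date().isoformat(),
    }
```

A caller would merge these fields into whatever metadata was scraped from the archived copy, so the citation can point at both the original and the archived page.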

Original description:

This is a total wishlist, but it would be nice if (when given a URL that 404s) citoid attempted to pull something from archive.org instead of saying (more or less) "you can construct this manually".

Event Timeline

LuisVilla set the priority of this task to Needs Triage.
LuisVilla updated the task description.
LuisVilla added a project: Citoid.
LuisVilla subscribed.
Mvolz renamed this task from Be more constructive on a 404d page to Try to find link in archive.org for 520 page. Apr 8 2015, 8:44 AM
Mvolz set Security to None.

We actually currently do have partial metadata for "bad" links that return a 520 (404s only return for bad pmid/pmcid/doi). UX/VE team decided not to use the 520s in the extension because the 520s usually just put the url in the title and url field, but it's something I'd like to revisit soon. Looking for something in archive.org could help.

Mvolz triaged this task as Medium priority.
Esanders renamed this task from Try to find link in archive.org for 520 page to Try to find link in archive.org when direct scraping fails. Jun 25 2024, 4:15 PM
Esanders updated the task description.

Change #375810 had a related patch set uploaded (by Mvolz; author: Mvolz):

[mediawiki/services/citoid@master] Scrape archive.org if page isn't available

https://gerrit.wikimedia.org/r/375810

I've implemented this.

Unfortunately, checks are currently failing because of instability in the availability API. For some reason the API often returns empty results even when a snapshot is available. So it works sometimes, but not always or even often! I'm not sure the API is stable enough for us to merge this.

I think periodic failure is acceptable in theory, because the occasional success is better than the 100% failure we have now.

It's just that at this point it's not passing CI and is hard for people to test because of how often the failures occur. :/
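
To make the failure mode above concrete, here is a minimal sketch of an availability lookup that treats an empty result as "try again" rather than "no snapshot". This is not the code in the patch: the retry loop, the function names, and the `attempts` and `delay` parameters are assumptions for illustration. It relies on the Wayback availability endpoint at `https://archive.org/wayback/available`, whose JSON response nests the best match under `archived_snapshots.closest` and leaves `archived_snapshots` empty when nothing is found.

```python
import json
import time
import urllib.parse
import urllib.request

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def fetch_availability(url):
    """Query the availability API over HTTP and return the decoded JSON."""
    query = urllib.parse.urlencode({"url": url})
    with urllib.request.urlopen(f"{AVAILABILITY_ENDPOINT}?{query}", timeout=10) as resp:
        return json.load(resp)


def closest_snapshot(payload):
    """Return the closest snapshot URL from a response, or None if it is empty."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_snapshot(url, fetch=fetch_availability, attempts=3, delay=1.0):
    """Ask for a snapshot up to `attempts` times, pausing `delay` seconds between tries.

    `fetch` is injectable so tests can substitute canned responses
    instead of calling the live API.
    """
    for attempt in range(attempts):
        snapshot = closest_snapshot(fetch(url))
        if snapshot:
            return snapshot
        if attempt < attempts - 1:
            time.sleep(delay)
    return None
```

Passing a fake `fetch` that returns canned responses is one way a test suite could avoid depending on the live API, which is the part that makes CI flaky here.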

ppelberg subscribed.

Per what the Editing Team discussed offline during Tuesday's standup, we're NOT going to merge this patch (375810/21) until the team can consider the value of doing so with the information @Mvolz uncovered above about the reliability of this approach.

ppelberg added a subscriber: Esanders.

Next step
Editing Team to make a clear choice about whether to move forward with this approach or to let it go for now.

My understanding is we were all in agreement to do it and the patch is just waiting for review.

Change #375810 merged by jenkins-bot:

[mediawiki/services/citoid@master] Scrape archive.org if page isn't available

https://gerrit.wikimedia.org/r/375810

ppelberg added a project: Editing QA.
ppelberg added a subscriber: Ryasmeen.

Assigning this to @Ryasmeen to verify. Assuming no issues/bits of confusion surface, please boldly resolve when you're through.

Note: we'll report on the impact of the deployment in T370682.

In the meantime, you can see the URLs the archive.org fallback is "rescuing" here: https://logstash.wikimedia.org/goto/aee518bcd082cca005da5d01befde2db

I checked this with a couple of different URLs. For example, for this one: http://www.nanaimodailynews.com/news/nanaimo-region/nanaimo-s-20-most-powerful-people-1.292150, Citoid correctly pulled the metadata from archive.org.

However, I found some URLs for which it did not seem to do that:

http://www.tahoordanesh.com/page.php?pid=10562
http://www.ville.amos.qc.ca/

stjn subscribed.

Yeah, I just encountered that. If this is supposed to work, it does not for https://kargormaslihat.gov.kz/ru/ispolvlast/, which was archived 43 times: https://web.archive.org/web/*/https://kargormaslihat.gov.kz/ru/ispolvlast/
You can still put in the archive URL directly, but this feature would be so good if it were fully implemented.

This is working now, so probably just an intermittent issue with the archive.org API:

image.png (415×229 px, 25 KB)

http://www.tahoordanesh.com/page.php?pid=10562

Working now:

image.png (421×216 px, 20 KB)

http://www.ville.amos.qc.ca/

Working now:

image.png (421×216 px, 20 KB)