Save contents of URLs linked from Wikidata in the Internet Archive
Closed, Resolved · Public

Description

Just like Wikipedia, Wikidata faces the problem of link rot (https://en.wikipedia.org/wiki/Link_rot). A lot of the URLs we link to, for example as references, will stop working or have already stopped working. The first step is to make a backup of the contents of these linked pages.

Let's automatically save the external webpages linked from Wikidata items in the Internet Archive in order to prevent data loss.

The mechanism could consist of invoking https://web.archive.org/save/<URL> internally whenever a new value is added for a property that carries an external link (URL or external identifier data types).
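
A rough sketch of what that call could look like, as a standalone Python helper (the function name, the example URL and the Content-Location detail are illustrative assumptions, not part of any existing bot):

  import requests

  def save_to_wayback(url, timeout=60):
      # Ask the Wayback Machine to snapshot one URL via the save endpoint above.
      response = requests.get("https://web.archive.org/save/" + url, timeout=timeout)
      # The snapshot path is typically returned in the Content-Location header.
      return response.status_code, response.headers.get("Content-Location")

  status, snapshot = save_to_wayback("https://example.org/some/reference")
  print(status, snapshot)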

One implementation path is to update or fork InternetArchiveBot (PHP source at https://github.com/cyberpower678/Cyberbot_II/tree/master/IABot ); another is to write a minimal bot from scratch to bridge the gap until InternetArchiveBot supports Wikidata.

Event Timeline

Restricted Application added a subscriber: Aklapper.

I think it might be easier to run InternetArchive bot here.

Perfect. InternetArchive bot, then. :-)

I'm currently in contact with them to get this done.

Any news? Would it be easy to add Wikidata to the list of wikis where the InternetArchiveBot runs?

It's doable, but not easy. Wikidata has a different structure.

Multichill raised the priority of this task from Lowest to High. Oct 17 2018, 12:02 PM
Multichill updated the task description. (Show Details)

Extending the current bot seems to be the most future-proof solution. In this task we only care about getting things into the archive, nothing else. So my guess is that parsing a Wikidata item is where you run into trouble? Take for example https://www.wikidata.org/wiki/Q24066189 . You can request it in another format, such as https://www.wikidata.org/entity/Q24066189.rdf or https://www.wikidata.org/entity/Q24066189.json to make it easier to find and extract URLs.
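
As a rough, untested sketch of what extracting URLs from the JSON flavor could look like (this only walks url-datatype snaks in main values, qualifiers and references; external identifiers are a separate step):

  import requests

  def extract_urls(qid):
      # Fetch the item as JSON and collect every url-datatype snak value.
      data = requests.get("https://www.wikidata.org/entity/%s.json" % qid, timeout=30).json()
      urls = []

      def visit(snak):
          if snak.get("datatype") == "url" and snak.get("snaktype") == "value":
              urls.append(snak["datavalue"]["value"])

      for statements in data["entities"][qid]["claims"].values():
          for statement in statements:
              visit(statement["mainsnak"])
              for snaks in statement.get("qualifiers", {}).values():
                  for snak in snaks:
                      visit(snak)
              for reference in statement.get("references", []):
                  for snaks in reference["snaks"].values():
                      for snak in snaks:
                          visit(snak)
      return urls

  print(extract_urls("Q24066189"))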

Do you have some pointers on where you think the challenge is going to be? We have an upcoming hackathon, and we might be able to work on this.

I am at that hackathon, in Thompson 150, right now. If you care to meet me during the lunch break, I will be happy to work on this with you.

That didn't work out. Can you please have a look at my previous questions? You don't seem to be working on it and the task does not contain enough information for others to work on this.

And a Merry Christmas to you too. I guess you don't celebrate it, as I've personally been busy with the holiday season. Just because you don't see public movement doesn't mean I'm not working on it. It takes some work to get IABot ready for Wikidata.

To answer your questions: IABot will be using the wbgetentities and wbeditentity API calls on MediaWiki.
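
For the read side, the wbgetentities call would look roughly like this (shown here as a standalone Python sketch rather than IABot's actual PHP code; the item id and props are just placeholders):

  import requests

  params = {
      "action": "wbgetentities",
      "ids": "Q219831",        # placeholder item
      "props": "claims",
      "format": "json",
  }
  entity = requests.get("https://www.wikidata.org/w/api.php", params=params, timeout=30).json()
  # Property ids whose statements would then be checked for URLs / external ids.
  print(list(entity["entities"]["Q219831"]["claims"].keys()))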

Good to hear you're working on this! I guess that for this part of the bot to work, you need to be able to fetch all the external links on a Wikidata item, so you can go on to the next step of checking whether you need to index them? I understand you considering "wbgetentities", but I'm not sure how "wbeditentity" fits in. If you're just reading, I would probably use EntityData ( https://www.mediawiki.org/wiki/Wikibase/EntityData ) to fetch the item in your favorite format. That will give you most of the links ( https://www.wikidata.org/wiki/Q219831 / http://www.wikidata.org/entity/Q219831.rdf / http://www.wikidata.org/entity/Q219831.json ). The RDF format is quite nice because some of the links are already expanded (look for wdtn).

For some of the links you have to do the expansion yourself using the formatter URL ( https://www.wikidata.org/wiki/Property:P1630 ). At the start of the bot run I would do a SPARQL query for all the formatter URLs and build a lookup table from it. For each item you process, you can then do a quick lookup in this table whenever you encounter an external-id property.
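
A rough sketch of that lookup-table idea, untested and with made-up function names (the IMDb property and id at the end are used purely as an example):

  import requests

  WDQS = "https://query.wikidata.org/sparql"

  def formatter_lookup():
      # property id (e.g. "P345") -> list of formatter URLs (P1630), built once per bot run.
      query = "SELECT ?property ?formatter WHERE { ?property wdt:P1630 ?formatter . }"
      rows = requests.get(WDQS, params={"query": query, "format": "json"}, timeout=60).json()
      table = {}
      for row in rows["results"]["bindings"]:
          pid = row["property"]["value"].rsplit("/", 1)[-1]
          table.setdefault(pid, []).append(row["formatter"]["value"])
      return table

  def expand_external_id(table, pid, value):
      # Substitute the external-id value into each formatter URL for that property.
      return [formatter.replace("$1", value) for formatter in table.get(pid, [])]

  table = formatter_lookup()
  print(expand_external_id(table, "P345", "nm0000122"))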

Anyway, if you need any help, please let me know. Link rot is quite a large problem right now on Wikidata and I would love to have a bot start indexing links so we at least have a copy somewhere.

Well, IABot can save to the Wayback Machine, and it probably will, but its primary job is to find an archive for the URL and add it to the Wikidata entry.

Support finally added in beta13. BRFA will be filed shortly.