Save contents of URLs linked from Wikidata in the Internet Archive
Closed, Resolved · Public

Description

Just like Wikipedia, Wikidata faces the problem of link rot (https://en.wikipedia.org/wiki/Link_rot). A lot of the URLs we link to, for example as references, will stop working or have already stopped working. The first step is to make a backup of the contents of these linked pages.

Let's automatically save the external webpages linked from Wikidata items in the Internet Archive in order to prevent data loss.

The mechanism could consist of invoking https://web.archive.org/save/<URL> internally whenever a new value is added for a property that carries an external link (URL or external identifier data types).
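
A rough sketch of what that call could look like, as a standalone Python helper (the function name, the example URL and the Content-Location detail are illustrative assumptions, not part of any existing bot):

  import requests

  def save_to_wayback(url, timeout=60):
      # Ask the Wayback Machine to snapshot one URL via the save endpoint above.
      response = requests.get("https://web.archive.org/save/" + url, timeout=timeout)
      # The snapshot path is typically returned in the Content-Location header.
      return response.status_code, response.headers.get("Content-Location")

  status, snapshot = save_to_wayback("https://example.org/some/reference")
  print(status, snapshot)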

One implementation path is to update or fork InternetArchiveBot (PHP source at https://github.com/cyberpower678/Cyberbot_II/tree/master/IABot ); another is to write a minimal bot from scratch to bridge the gap until InternetArchiveBot supports Wikidata.

Event Timeline

Restricted Application added a subscriber: Aklapper.

I think it might be easier to run InternetArchive bot here.

Perfect. InternetArchive bot, then. :-)

I'm currently in contact with them to get this done.

Any news? Would it be easy to add Wikidata to the list of wikis where the InternetArchiveBot runs?

It's doable, but not easy. Wikidata has a different structure.

Multichill raised the priority of this task from Lowest to High. Oct 17 2018, 12:02 PM
Multichill updated the task description. (Show Details)

Extending the current bot seems to be the most future-proof solution. In this task we only care about getting things into the archive, nothing else. So my guess is that parsing a Wikidata item is where you run into trouble? Take for example https://www.wikidata.org/wiki/Q24066189 . You can request it in another format, such as https://www.wikidata.org/entity/Q24066189.rdf or https://www.wikidata.org/entity/Q24066189.json to make it easier to find and extract URLs.
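
As a rough, untested sketch of what extracting URLs from the JSON flavor could look like (this only walks url-datatype snaks in main values, qualifiers and references; external identifiers are a separate step):

  import requests

  def extract_urls(qid):
      # Fetch the item as JSON and collect every url-datatype snak value.
      data = requests.get("https://www.wikidata.org/entity/%s.json" % qid, timeout=30).json()
      urls = []

      def visit(snak):
          if snak.get("datatype") == "url" and snak.get("snaktype") == "value":
              urls.append(snak["datavalue"]["value"])

      for statements in data["entities"][qid]["claims"].values():
          for statement in statements:
              visit(statement["mainsnak"])
              for snaks in statement.get("qualifiers", {}).values():
                  for snak in snaks:
                      visit(snak)
              for reference in statement.get("references", []):
                  for snaks in reference["snaks"].values():
                      for snak in snaks:
                          visit(snak)
      return urls

  print(extract_urls("Q24066189"))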

Do you have some pointers on where you think the challenge is going to be? We have an upcoming hackathon, and we might be able to work on this.

I am at that hackathon, in Thompson 150, right now. If you care to meet me during the lunch break, I will be happy to work on this with you.

That didn't work out. Can you please have a look at my previous questions? You don't seem to be working on it and the task does not contain enough information for others to work on this.

And a Merry Christmas to you too. I guess you don't celebrate it, as I've personally been busy with the holiday season. Just because you don't see public movement doesn't mean I'm not working on it. It takes some work to get IABot ready for Wikidata.

To answer your questions: IABot will be using the wbgetentities and wbeditentity API calls on MediaWiki.
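
For the read side, the wbgetentities call would look roughly like this (shown here as a standalone Python sketch rather than IABot's actual PHP code; the item id and props are just placeholders):

  import requests

  params = {
      "action": "wbgetentities",
      "ids": "Q219831",        # placeholder item
      "props": "claims",
      "format": "json",
  }
  entity = requests.get("https://www.wikidata.org/w/api.php", params=params, timeout=30).json()
  # Property ids whose statements would then be checked for URLs / external ids.
  print(list(entity["entities"]["Q219831"]["claims"].keys()))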

Good to hear you're working on this! I guess that for this part of the bot to work, you need to be able to fetch all the external links on a Wikidata item, so you can go on to the next step of checking whether you need to index them? I understand you considering "wbgetentities", but I'm not sure how "wbeditentity" fits in. If you're just reading, I would probably use EntityData ( https://www.mediawiki.org/wiki/Wikibase/EntityData ) to fetch the item in your favorite format. That will give you most of the links ( https://www.wikidata.org/wiki/Q219831 / http://www.wikidata.org/entity/Q219831.rdf / http://www.wikidata.org/entity/Q219831.json ). The RDF format is quite nice because some of the links are already expanded (look for wdtn).

For some of the links you have to do the expansion yourself using the formatter URL ( https://www.wikidata.org/wiki/Property:P1630 ). At the start of the bot run I would do a SPARQL query for all the formatter URLs and build a lookup table from it. For each item you process, you can then do a quick lookup in this table whenever you encounter an external-id property.
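
A rough sketch of that lookup-table idea, untested and with made-up function names (the IMDb property and id at the end are used purely as an example):

  import requests

  WDQS = "https://query.wikidata.org/sparql"

  def formatter_lookup():
      # property id (e.g. "P345") -> list of formatter URLs (P1630), built once per bot run.
      query = "SELECT ?property ?formatter WHERE { ?property wdt:P1630 ?formatter . }"
      rows = requests.get(WDQS, params={"query": query, "format": "json"}, timeout=60).json()
      table = {}
      for row in rows["results"]["bindings"]:
          pid = row["property"]["value"].rsplit("/", 1)[-1]
          table.setdefault(pid, []).append(row["formatter"]["value"])
      return table

  def expand_external_id(table, pid, value):
      # Substitute the external-id value into each formatter URL for that property.
      return [formatter.replace("$1", value) for formatter in table.get(pid, [])]

  table = formatter_lookup()
  print(expand_external_id(table, "P345", "nm0000122"))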

Anyway, if you need any help, please let me know. Link rot is quite a large problem right now on Wikidata and I would love to have a bot start indexing links so we at least have a copy somewhere.

Well, IABot can save to the Wayback Machine, and it probably will, but its primary job is to find an archive for the URL and add it to the Wikidata entry.

Support finally added in beta13. BRFA will be filed shortly.