Goal: Create a list of pages paired with their corresponding articles in other languages, as a first step toward finding pages that were translated.
Points to think about:
- We might want to create a list for every other language (EN, RU...).
- Make sure that we are not missing information that is recorded in the text of the page itself.
Possible approaches to completing the task:
- Working with the API Sandbox:
- An example query for one article: Michael Jordan
- Write a script that first collects all the Hebrew page titles to use as identifiers, then runs the language-links query on those titles.
- It might be good to take a time slice and look only at edits within a particular date range.
- Working with Wikimedia dumps:
- Download the "Wiki interlanguage link records" dump.
- If we want the other language's page IDs, also download the "Name/value pairs for pages" dump from that language's wiki.
- Build an SQL query that returns the IDs.
- Working with Wikidata dumps:
- Download the dump wikidatawiki-latest-langlinks.sql.gz
- Build an SQL query that returns the IDs.
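
For the API approach above, the script could look roughly like the following. This is only a sketch under assumptions: the endpoint URL, the User-Agent string, and the function names (build_params, parse_langlinks, fetch_langlinks) are illustrative and not part of the original plan. It uses the MediaWiki query API with prop=langlinks and formatversion=2, and follows the "continue" block when results are split across responses. Titles would need to be sent in batches, since the API caps how many titles one request may carry.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the Hebrew Wikipedia API.
API_URL = "https://he.wikipedia.org/w/api.php"


def build_params(titles, target_lang="en"):
    """Build query parameters asking for language links of the given titles."""
    return {
        "action": "query",
        "prop": "langlinks",
        "titles": "|".join(titles),   # several titles are separated by "|"
        "lllang": target_lang,        # only return links to this language
        "lllimit": "max",
        "format": "json",
        "formatversion": "2",
    }


def parse_langlinks(response):
    """Turn one API response into (page_id, title, lang, linked_title) tuples."""
    pairs = []
    for page in response.get("query", {}).get("pages", []):
        # Pages without language links simply have no "langlinks" key.
        for link in page.get("langlinks", []):
            pairs.append(
                (page.get("pageid"), page["title"], link["lang"], link["title"])
            )
    return pairs


def fetch_langlinks(titles, target_lang="en"):
    """Query the API, following continuation until all links are returned."""
    params = build_params(titles, target_lang)
    pairs = []
    while True:
        url = API_URL + "?" + urllib.parse.urlencode(params)
        # Hypothetical User-Agent; replace with real contact details.
        req = urllib.request.Request(
            url, headers={"User-Agent": "langlinks-sketch/0.1"}
        )
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        pairs.extend(parse_langlinks(data))
        if "continue" not in data:
            return pairs
        params.update(data["continue"])
```

A call such as fetch_langlinks(["מייקל ג'ורדן"], "en") would request the English link for the Michael Jordan example. A date-range restriction is not covered here and would need a separate query.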
NOTE: Decide what to do with inline interlanguage links and with contradictions between Wikidata and local links.
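
For the dump-based approaches, the SQL step might be prototyped as below. This is a sketch under assumptions: it loads toy rows into an in-memory SQLite database rather than a real dump, the langlinks columns (ll_from, ll_lang, ll_title) follow the MediaWiki langlinks table, and target_pages is a hypothetical title-to-ID lookup for the other language. How that lookup is derived from the downloaded dumps is left open. The space-to-underscore normalization in the join is also an assumption that should be checked against the real data.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with toy rows (all IDs are made up)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE langlinks (ll_from INTEGER, ll_lang TEXT, ll_title TEXT);
        -- Hypothetical lookup table for the target-language wiki.
        CREATE TABLE target_pages (page_id INTEGER, page_title TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO langlinks VALUES (?, ?, ?)",
        [
            (101, "en", "Michael Jordan"),
            (101, "ru", "Джордан, Майкл"),
            (102, "en", "Page Without Match"),
        ],
    )
    conn.executemany(
        "INSERT INTO target_pages VALUES (?, ?)",
        [(2001, "Michael_Jordan")],
    )
    return conn


# LEFT JOIN keeps source pages whose target title has no known ID,
# so unmatched links stay visible instead of silently disappearing.
QUERY = """
SELECT ll.ll_from AS source_id, tp.page_id AS target_id, ll.ll_title
FROM langlinks AS ll
LEFT JOIN target_pages AS tp
  ON tp.page_title = REPLACE(ll.ll_title, ' ', '_')
WHERE ll.ll_lang = ?
ORDER BY ll.ll_from
"""


def linked_ids(conn, lang):
    """Return (source_id, target_id, target_title) rows for one language."""
    return conn.execute(QUERY, (lang,)).fetchall()
```

Rows where target_id comes back empty are candidates for manual review, alongside the contradictions mentioned in the note above.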