
Improve interlingual links across wikis through Wikidata IDs
Open, High, Public

Description

Given that the page's Wikidata ID has become de facto one of the main identifiers for Wikipedia pages, it is really useful to have easy access to that ID. Currently, there are three ways to do this (please complete the list if you know more):

(1) querying the wb_items_per_site table in the wikidatawiki on MariaDB, or
(2) through the sitelinks on Wikidata Json dumps.
(3) using the page_props table for a specific wiki (its wikibase_item rows give the local page IDs corresponding to a Wikidata ID and vice versa, cf. T215616#4945911)

In the first two cases, the result is the triplet {WikidataId, wiki, Page_Title}, for example {Q298, enwiki, Chile}. Therefore, for any other information about that page, a join on page_title is needed. Considering that page titles might change, and also that joining by strings might create some problems depending on the language, having the page ID would also be useful.

For example, if I now want to know how many revisions Q298 has in each wiki, I would first need to query wikidatawiki.wb_items_per_site for Q298, and then join by page title in each wiki_db.
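A minimal sketch of step one of that example, assuming MariaDB replica access (host and credential settings below are placeholders, not real connection details):

import pymysql

# Placeholder connection; point it at a wikidatawiki replica you can reach.
conn = pymysql.connect(host='REPLICA_HOST', db='wikidatawiki',
                       read_default_file='~/.my.cnf', charset='utf8mb4')
with conn.cursor() as cur:
    # ips_item_id stores the numeric part of the item ID, so Q298 -> 298
    cur.execute("SELECT ips_site_id, ips_site_page FROM wb_items_per_site "
                "WHERE ips_item_id = %s", (298,))
    sitelinks = cur.fetchall()   # e.g. (('enwiki', 'Chile'), ('eswiki', 'Chile'), ...)
# Step two would then need one join per wiki_db on page_title to reach revisions.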

We should find a schema that makes these kinds of tasks easier.

UPDATE:

  • @JAllemandou suggested using sitelinks instead of the wb_items_per_site table. I'm exploring whether these two methodologies return the same results.
  • I've updated the task to describe a more general problem: how to link content across wikis.

Event Timeline

I wonder if the information present in the table mentioned is the same as the one we could extract from site-links in the wikidata items. @diego : Could you please triple check that? If it is the case, this task is another one in need of wikidata-json dumps being productionized :)

diego renamed this task from Add (scoop) wikidatadawiki.wb_items_per_site MariaDB table to wmf_raw to Improve interlingual links across wikis through Wikidata IDs.Feb 11 2019, 12:58 PM
diego added a project: DBA.
diego updated the task description. (Show Details)
Nuria moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

@diego :
This has worked for me (takes some time to compute and needs a bunch of resources). I hope it's close enough to what you want :) :

spark.sql("SET spark.sql.shuffle.partitions=512")
val wikidataParquetPath = "/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20181001"
spark.read.parquet(wikidataParquetPath).createOrReplaceTempView("wikidata")

val df = spark.sql("""

WITH namespaced_revisions AS (
  SELECT
    wiki_db,
    revision_id,
    event_timestamp,
    page_title,
    page_namespace,
    CASE WHEN (LENGTH(namespace_localized_name) > 0)
      THEN CONCAT(namespace_localized_name, ':', page_title)
      ELSE page_title
    END AS title_namespace_localized
  FROM wmf.mediawiki_history mwh
    INNER JOIN wmf_raw.mediawiki_project_namespace_map nsm
      ON (
        mwh.wiki_db = nsm.dbname
        AND mwh.page_namespace = nsm.namespace
        AND mwh.snapshot = nsm.snapshot
      )
  WHERE mwh.snapshot = '2019-01'
    AND nsm.snapshot = '2019-01'
    AND event_entity = 'revision'
    AND NOT revision_is_deleted
),

wikidata_sitelinks AS (
  SELECT
    id as item_id,
    EXPLODE(siteLinks) AS sitelink
  FROM wikidata
  WHERE size(siteLinks) > 0
)

SELECT
  item_id,
  wiki_db,
  revision_id,
  event_timestamp,
  page_title,
  page_namespace
FROM wikidata_sitelinks ws
  INNER JOIN namespaced_revisions nsr
    ON (
      ws.sitelink.site = nsr.wiki_db
      AND ws.sitelink.title = title_namespace_localized
    )
""")

Looks good @JAllemandou, thanks.
This is a good workaround, but imho, we should have a structure or schema that makes these kinds of tasks easier, especially for people without access to a cluster.

I don't know if this meets your needs, but the cirrussearch dumps have the Wikidata IDs broken out. This is the wikibase_item field of the ebernhardson.cirrus2hive table in hive. Alternatively, there are full dumps with each article as a json object: https://dumps.wikimedia.your.org/other/cirrussearch/

Example hive query:

SELECT page_id, title, wikibase_item from ebernhardson.cirrus2hive where wikiid = 'enwiki' and dump_date='20190121' limit 10;

@EBernhardson , this looks exactly like what I was looking for, initially. Thank you very much for that.

However, I won't close this task, because wikibase_item is still missing the page_id information. Joining by page_title does not seem very 'healthy'. We should keep discussing how to solve that. Thanks

The page_id property is in the cirrus2hive table as well

diego added a project: DBA.

I don't understand what the actionable here is for us. Without context, I would say that:

(1) querying the wb_items_per_site table in the wikidatawiki on MariaDB, or
(2) through the sitelinks on Wikidata Json dumps.

That is not accurate; Wikidata has an API for that:

https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q298&props=sitelinks

Which returns a JSON to further analyze. Of course, you cannot join that with anything else, but those would be features outside of MediaWiki, something that an analytics-focused data store could handle, or your own scratch data? Unless you are suggesting that MediaWiki (the software) should have such complex queries out of the box, I am not sure how the #DBAs could help you?
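For completeness, the same call from Python (a sketch that just wraps the URL above and assumes the standard wbgetentities response shape):

import requests

resp = requests.get('https://www.wikidata.org/w/api.php',
                    params={'action': 'wbgetentities', 'ids': 'Q298',
                            'props': 'sitelinks', 'format': 'json'})
sitelinks = resp.json()['entities']['Q298']['sitelinks']
print(sitelinks['enwiki']['title'])   # 'Chile'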

@jcrespo, the API works well for querying specific pages/entities, but not, for example, for knowing which pages that exist in X_wiki are missing in Y_wiki.
My point here is that the Wikidata identifier is currently the main identifier for a page/concept, and that this fact is not reflected in the DB structure. I understand that this might be due to historical reasons, but it would be good to think of a way for our DBs to make it easier to link content across wikis.
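As a concrete sketch of that kind of cross-wiki question, assuming an item/page mapping like the parquet generated later in this thread (path below) and a SparkSession named spark, a "missing in Y_wiki" check is a simple anti-join:

# Items with a dewiki article but no enwiki article, per the mapping parquet.
links = spark.read.parquet('/user/joal/wmf/data/wmf/wikidata/item_page_link/20190204')
de = links.where("wiki_db = 'dewiki' AND page_namespace = 0").select('item_id')
en = links.where("wiki_db = 'enwiki' AND page_namespace = 0").select('item_id')
missing_in_en = de.join(en, 'item_id', 'left_anti')
print(missing_in_en.count())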

The page_props table contains wikibase_item values for a given page ID. See e.g. T209891#4798717 for a query that uses this.
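For illustration, a hedged sketch of that page_props route on a single wiki (same placeholder connection caveats as in the earlier sketch):

import pymysql

# Placeholder connection; point it at an enwiki replica you can reach.
conn = pymysql.connect(host='REPLICA_HOST', db='enwiki',
                       read_default_file='~/.my.cnf', charset='utf8mb4')
with conn.cursor() as cur:
    # pp_value holds the Wikidata ID ('Q298', ...) when pp_propname = 'wikibase_item'
    cur.execute("""
        SELECT p.page_id, p.page_title, pp.pp_value AS wikidata_id
        FROM page_props pp
        JOIN page p ON p.page_id = pp.pp_page
        WHERE pp.pp_propname = 'wikibase_item'
          AND p.page_namespace = 0
        LIMIT 10""")
    print(cur.fetchall())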

the API works well for querying specific pages/entities, but not, for example, for knowing which pages that exist in X_wiki are missing in Y_wiki.

Sure, and you are free to either:

a) Set up your own database from dumps or even expose it to other people
b) Do a feature request for Mediawiki to support that

I think a) has more chances to go through, similar to T59617, but if you want b), you are not adding the right people here; see https://www.mediawiki.org/wiki/RFC

Hi @Isaac, I have generated some parquet data here (/user/joal/wmf/data/wmf/wikidata/item_page_link/20190204) with the following query:

spark.sql("SET spark.sql.shuffle.partitions=128")
val wikidataParquetPath = "/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20190204"
spark.read.parquet(wikidataParquetPath).createOrReplaceTempView("wikidata")

spark.sql("""

WITH namespaced_revisions AS (
  SELECT
    wiki_db,
    page_id,
    page_title,
    page_namespace,
    CASE WHEN (LENGTH(namespace_localized_name) > 0)
      THEN CONCAT(namespace_localized_name, ':', page_title)
      ELSE page_title
    END AS title_namespace_localized
  FROM (
    SELECT
      wiki_db,
      page_id,
      page_title,
      page_namespace,
      row_number() OVER (PARTITION BY wiki_db, page_id ORDER BY start_timestamp DESC) as row_num
    FROM wmf.mediawiki_page_history
    WHERE snapshot = '2019-01'
      AND page_id IS NOT NULL AND page_id > 0
      AND page_title IS NOT NULL and LENGTH(page_title) > 0
  ) ph
    INNER JOIN wmf_raw.mediawiki_project_namespace_map nsm
      ON (
        ph.wiki_db = nsm.dbname
        AND ph.page_namespace = nsm.namespace
        AND nsm.snapshot = '2019-01'
      )
  WHERE row_num = 1
),

wikidata_sitelinks AS (
  SELECT
    id as item_id,
    EXPLODE(siteLinks) AS sitelink
  FROM wikidata
  WHERE size(siteLinks) > 0
)

SELECT
  item_id,
  wiki_db,
  page_id,
  page_title,
  page_namespace,
  title_namespace_localized
FROM wikidata_sitelinks ws
  INNER JOIN namespaced_revisions nsr
    ON (
      ws.sitelink.site = nsr.wiki_db
      AND REPLACE(ws.sitelink.title, ' ', '_') = title_namespace_localized
    )
""").repartition(16).write.parquet("/user/joal/wmf/data/wmf/wikidata/item_page_link/20190204")

@diego : I can generate similar data by revision, but before doing so I wanted to be sure we agree on the fact that it'll flag every historical revision associated to an item currently linked to a page. Is that what you're after? Or more the history of linkage between page and item?
Thanks!

thank you @JAllemandou this is awesome!!! completely unblocks me (i have a bunch of page titles across all the wikipedias and need to check whether a pair of them match the same wikidata item)!

@JAllemandou , yes. Having this by revision would be great!

@diego: my interpretation is that right now in the revision history version, the same wikidb/page ID/title is associated with the same wikidata ID regardless of when the revision occurred. what is the use for that over a table that has just one entry per wikidb/page ID/title? i'm trying to understand so i don't end up making a mistake about my interpretation of the links

Thanks @Isaac for reformulating the question I tried to explain above :)
@diego: Can you confirm there is value for you in having revisions tied to wikidata-items regardless of when the link happened?

I think we are talking about three different things:

i) page_id -> CurrentWikidataItem: this was my original request, and I think @JAllemandou 's script solves this issue. Having that table updated would be great.
ii) revision_id -> CurrentWikidataItem: this can be obtained by joining the previous table with the revision table (a sketch of that join appears after this list). Having that table pre-computed would save time and resources on joining, but we can also do the join just when it is needed.
iii) revision_id -> HistoricalWikidataItem: I was not looking for that, although it would be very interesting information.
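To illustrate ii) against the original Q298 example, here is a sketch that reuses the parquet path and snapshot from above (it assumes a SparkSession named spark and the page_id field of wmf.mediawiki_history):

links_path = '/user/joal/wmf/data/wmf/wikidata/item_page_link/20190204'
spark.read.parquet(links_path).createOrReplaceTempView('item_page_link')

# Revisions per wiki for one item, via the item/page mapping.
revisions_per_wiki = spark.sql("""
    SELECT ipl.wiki_db, COUNT(*) AS revisions
    FROM item_page_link ipl
    JOIN wmf.mediawiki_history mwh
      ON mwh.wiki_db = ipl.wiki_db
     AND mwh.page_id = ipl.page_id
    WHERE ipl.item_id = 'Q298'
      AND mwh.snapshot = '2019-01'
      AND mwh.event_entity = 'revision'
      AND NOT mwh.revision_is_deleted
    GROUP BY ipl.wiki_db
    ORDER BY revisions DESC""")
revisions_per_wiki.show(20)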

We're on the same page @diego :)
I can precompute the table described in ii) if needed, and will surely do it once we have the wikidata dump productionized. Let me know if you need it before then.

Marostegui subscribed.

Going to remove the DBA tag from here, as there are not really any actionables (yet) for the DBAs; we already provided some input here (T215616#4946564) and there is not much we can do about this at the moment.
I am leaving the MediaWiki-libs-Rdbms tag in case you want to discuss queries or even schema changes (then I would suggest you add Schema-change once you have some thoughts or proposals about it).
Lastly, I will remain subscribed to this task in case you need further help from us!

Hey @JAllemandou, some debugging: a number of items aren't showing up and I can't for the life of me figure out why. The few I've looked at are pretty normal articles (for example: https://de.wikipedia.org/wiki/Gregor_Grillemeier) and they show up in the original parquet files (/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20190204)

But according to this analysis (T209891#4798717) and ebernhardson's table (SELECT count(page_id) from ebernhardson.cirrus2hive where wikiid = 'enwiki' and dump_date='20190121';), there should be ~5.7 million english articles w/ associated wikidata items and I'm only seeing 916 thousand. I went through your query but could not find anything that would be causing this dropout so I'm at a loss. Thoughts?

Code in case I'm doing something wrong:

# Count rows per wiki in the generated item/page link parquet
wikidataParquetPath = '/user/joal/wmf/data/wmf/wikidata/item_page_link/20190204'
spark.read.parquet(wikidataParquetPath).createOrReplaceTempView('wikidata')
count_per_db = sqlContext.sql('SELECT wiki_db, count(*) FROM wikidata GROUP BY wiki_db')

If you then sort the outcome, you get:

+--------------+--------+
|       wiki_db|count(1)|
+--------------+--------+
|        zhwiki| 1245854|
|        jawiki| 1210483|
|        enwiki|  916393|
|       cebwiki|  891045|
|        svwiki|  778952|
|        dewiki|  656622|
|        frwiki|  414492|
|        nlwiki|  414469|
|        ruwiki|  413733|
...

Hi @Isaac
Sorry for the issue. I corrected the query above (the last query; join criteria: AND ws.sitelink.title = title_namespace_localized --> AND REPLACE(ws.sitelink.title, ' ', '_') = title_namespace_localized).
We were not joining correctly on title, as mediawiki-history encodes titles with underscores while the wikidata dump uses spaces.
Problem solved, data regenerated at the same place as before; a double check on enwiki numbers looks good: 5.96M pages have an item in namespace 0 (7.95M for all namespaces).

Hey @JAllemandou - this is great! thanks for catching that - looks all good to me now too.

@diego Hi! Is there anything additional for us (Analytics) here? Thanks

Krinkle added a project: MediaWiki-General.
Krinkle subscribed.

(This does not appear to be an issue with the wikimedia-rdbms PHP library, but rather a bug or specific change request for the MediaWiki core schema.)