
Number of links to other Wikimedia projects
Closed, Declined (Public)

Description

As a PM of Wikidata Analytics, @Manuel needs to know what proportion of Wikidata is a “knowledge-base of general statements about the world”.

Indicator

  • Does an Item have a link to another Wikimedia project?
  • i.e., we will be looking into the number of Sitelinks per Item

Notes on analytical systems that might be of help here

Segments

  • all Items excluding astronomical data and citation data
  • astronomical data
  • citation data

Hypotheses

The fundamental dataset for this task

  • Rows: Wikidata Classes
  • Columns: Wikimedia projects
  • Cells: number of items in that class that link to the Wikimedia project in the respective column
  • Comment: this is essentially a distribution of the number of Sitelinks across items, per class
  • Additional:
    • Number of items per class
    • Number of items with Sitelinks per class
    • Total number of Sitelinks per class
    • Proportions (% of items that have a Sitelink towards a specific project)
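
A minimal sketch of how such a table could be assembled, assuming the input is a flat list of (item, class, project) records with one record per sitelink and `None` for items without any sitelink. The record layout, the `contingency` function and the sample IDs are illustrative assumptions, not the actual ETL.

```python
from collections import defaultdict

# Hypothetical input: one (item, class, project) record per sitelink;
# project is None when the item has no sitelink at all.
records = [
    ("Q1", "Q5", "enwiki"),
    ("Q1", "Q5", "dewiki"),
    ("Q2", "Q5", "enwiki"),
    ("Q3", "Q5", None),
    ("Q4", "Q6999", "commonswiki"),
]

def contingency(records):
    """Build one row per class: project cells plus the additional columns."""
    items = defaultdict(set)                        # class -> all items
    linked = defaultdict(set)                       # class -> items with >= 1 sitelink
    cells = defaultdict(lambda: defaultdict(set))   # class -> project -> items
    total = defaultdict(int)                        # class -> number of sitelinks
    for item, cls, project in records:
        items[cls].add(item)
        if project is not None:
            linked[cls].add(item)
            cells[cls][project].add(item)
            total[cls] += 1  # assumes each (item, class, project) record is unique
    table = {}
    for cls in items:
        n = len(items[cls])
        row = {project: len(members) for project, members in cells[cls].items()}
        table[cls] = {
            "cells": row,
            "num_items": n,
            "num_items_w_sitelinks": len(linked[cls]),
            "total_sitelinks": total[cls],
            "proportions": {project: count / n for project, count in row.items()},
        }
    return table

table = contingency(records)
```

Sets are used for the cells so that an item is counted once per class and project, even if it shows up in several records.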

Related

Event Timeline

@Manuel

The general case (whole Wikidata) is solved, result:

  • a table
  • rows: Wikidata classes
  • columns: Wikimedia projects
  • cells: number of items in a particular class w. sitelinks towards a particular project
  • additional columns:
    • number of items in the class
    • number of items w. sitelinks in the class
    • total number of sitelinks in the class

The dataset is huge (477.4 MB, ~11 MB as a .zip archive) and will be shared via Google Drive.

I will deliver tomorrow (Friday), as agreed:

  • reduced dataset 1: Wikidata - (Scholarly Articles + Astronomical Objects)
  • reduced dataset 2: Scholarly Articles
  • reduced dataset 3: Astronomical Objects

N.B. The resulting datasets will be large, and I don't think that you will be able to easily spot the patterns (if any interesting ones emerge) from the data alone (at least I wouldn't be able to do so). Please get in touch and let's discuss what type of insights/visualizations we are looking for to understand this.

@Manuel

  • The data are published here (tar.gz -> .csv files) - better than in Google Drive;
  • Filenames
    • contingency_WD_FULL.tar.gz - everything, whole Wikidata
    • contingency_WD_CORE.tar.gz - Wikidata - (Astronomical Objects + Scholarly Articles)
    • contingency_WD_CITATIONS.tar.gz - Scholarly Articles only
    • contingency_WD_ASTRONOMY.tar.gz - Astronomical Objects only
  • Columns
    • A set of columns, each indicating a particular WMF project; please note there is one column called NO_SITELINK among them;
    • class - the Wikidata class in the respective row
    • num_items - how many items are found in this Wikidata class
    • num_items_w_sitelinks - how many items w. sitelinks are found in this Wikidata class
    • total_sitelinks - how many sitelinks in total exist for the items in this class
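
For reading these files, a rough sketch, assuming the .csv files have a header row with the column names listed above. The column order, the two project columns and the sample values are made up for illustration; `project_shares` is a hypothetical helper.

```python
import csv
import io

# Hypothetical two-row excerpt; the real files come out of the tar.gz archives.
sample = io.StringIO(
    "class,enwiki,commonswiki,NO_SITELINK,num_items,num_items_w_sitelinks,total_sitelinks\n"
    "Q5,2,1,1,3,2,3\n"
    "Q6999,0,1,0,1,1,1\n"
)

# Columns that describe the class itself rather than a project.
META = {"class", "num_items", "num_items_w_sitelinks", "total_sitelinks"}

def project_shares(fh):
    """Per class, the share of items that link to each project column."""
    shares = {}
    for row in csv.DictReader(fh):
        n = int(row["num_items"])
        if n == 0:
            shares[row["class"]] = {}
            continue
        shares[row["class"]] = {
            col: int(val) / n for col, val in row.items() if col not in META
        }
    return shares

shares = project_shares(sample)
```

Everything that is not one of the four descriptive columns is treated as a project column, so NO_SITELINK comes out as the share of items in the class without any sitelink.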

@Manuel

From our 1:1 TUE 17. August 2021:

Number and % of items in WD with (no) sitelinks [split by core, astronomical, citation]

"Core" Wikidata (i.e. Wikidata - (Astronomical Objects + Scholarly Articles))

  • number of items w. sitelinks: 27,907,021, percent of items w. sitelinks: 31.35%

Astronomical Objects only

  • number of items w. sitelinks: 480,508, percent of items w. sitelinks: 3.99%

Scholarly Articles only

  • number of items w. sitelinks: 22,063, percent of items w. sitelinks: 0.47%

N.B. Keep in mind that these data are approximate, because an item can be an instanceOf/subclassOf/partOf several different classes, and our source dataset here is organized class-wise, not item-wise. However, I doubt that the result would change in a significant way if we went for a whole new item-wise ETL here.

Edit: In fact, I can refine this analysis from our basic ETL datasets (not used in the analytics here; the class-wise data are derived from them) so as to have exact, per-item statistics.
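
To illustrate why class-wise totals are only approximate, here is a toy sketch with made-up items and classes: an item that belongs to two classes is counted twice when the per-class counts are summed, but only once in a per-item count.

```python
# Hypothetical memberships: Q10 belongs to two classes.
membership = {
    "Q10": {"classA", "classB"},
    "Q11": {"classA"},
    "Q12": {"classB"},
}
has_sitelink = {"Q10": True, "Q11": False, "Q12": True}

# Class-wise: summing per-class counts; an item in k classes is counted k times.
class_wise = sum(
    1
    for item, classes in membership.items()
    for _ in classes
    if has_sitelink[item]
)

# Item-wise: every item is counted exactly once.
item_wise = sum(1 for item in membership if has_sitelink[item])
```

Here the class-wise sum over-counts Q10, which is the kind of discrepancy a per-item ETL would remove.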

@Manuel

Do we know why there are so many astronomical objects with sitelinks? (e.g. what projects do they predominantly connect to?)

The following table should be able to help answer your question.

@Manuel

Here are a few more things, general statistics on whole Wikidata, to consider:

  • we consider 590,404 classes in total;
  • 307,646 classes (52%) do not have a single item with a sitelink;
  • here are (a) a chart with the top 50 classes by number of items missing sitelinks (point labels: number of items w/o sitelinks, followed in parentheses by the % of items in that class w/o sitelinks), and
  • (b) a table (csv, zip compression) with all of the classes listed, sorted by the number of items w/o sitelinks; English labels are provided for the top 1,000 classes only.
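
A sketch of how the ranking behind the chart and the table could be derived from the per-class counts, assuming rows of (class, num_items, num_items_w_sitelinks). The class names and numbers below are invented, and `without_sitelinks` is a hypothetical helper.

```python
# Hypothetical rows: (class, num_items, num_items_w_sitelinks)
rows = [
    ("classA", 1000, 0),
    ("classB", 500, 450),
    ("classC", 200, 0),
    ("classD", 800, 100),
]

def without_sitelinks(rows):
    """Return (class, items w/o sitelinks, % w/o sitelinks), largest count first."""
    ranked = []
    for cls, n, linked in rows:
        missing = n - linked
        # Assumes every listed class has at least one item (n > 0).
        ranked.append((cls, missing, 100.0 * missing / n))
    ranked.sort(key=lambda r: r[1], reverse=True)
    return ranked

ranked = without_sitelinks(rows)
top_50 = ranked[:50]

# Classes that do not have a single item with a sitelink, and their share.
no_link_classes = [cls for cls, n, linked in rows if linked == 0]
share_no_link = len(no_link_classes) / len(rows)
```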

Wikidata_NO_SITELINKS.png (929×1 px, 99 KB)

@Manuel From our 1:1

Number and % of items in WD with (no) external identifier [split by core, astronomical, citation]

  • ETL phase completed, datasets obtained;
  • re-composition in R; in-RAM analysis in progress.

@Manuel

IMPORTANT. Probably all numbers - except those reported for whole Wikidata - will have to be corrected here.
I have been using WDQS to obtain the instances of all sub-classes of Astronomical Objects and Scholarly Articles until now.
How naive of me. I have just realized - while I should have been well aware of the fact - that some of my queries in Scholarly Articles time out.
The consequence is that I have only partial lists of items that are instances of Scholarly Articles. Most probably, nothing similar has happened in Astronomical Objects.

Re-running everything on the dump with a PySpark ETL. Reporting back ASAP.
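
For orientation, a plain-Python sketch of what a dump-based pass might do per entity, rather than querying WDQS. It assumes the JSON dump layout with one entity per line (possibly followed by a comma) and the usual claims/mainsnak/sitelinks structure. The two class sets are placeholders that would have to hold every relevant subclass ID; Q13442814 and Q6999 are, to my knowledge, scholarly article and astronomical object. `classify` and the sample entity Q999 are hypothetical, and this is not the actual PySpark ETL.

```python
import json

# Placeholder class sets; in practice they would hold every subclass ID.
SCHOLARLY = {"Q13442814"}   # scholarly article (assumed ID)
ASTRONOMY = {"Q6999"}       # astronomical object (assumed ID)

def classify(line):
    """Return (item id, segment, number of sitelinks) for one dump line."""
    entity = json.loads(line.strip().rstrip(","))
    classes = set()
    for claim in entity.get("claims", {}).get("P31", []):
        # novalue/somevalue snaks carry no datavalue, hence the .get() chain
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
        if isinstance(value, dict) and "id" in value:
            classes.add(value["id"])
    if classes & SCHOLARLY:
        segment = "citations"
    elif classes & ASTRONOMY:
        segment = "astronomy"
    else:
        segment = "core"
    return entity["id"], segment, len(entity.get("sitelinks", {}))

# Made-up entity in the dump's shape, with a trailing comma as in the dump.
line = json.dumps({
    "id": "Q999",
    "claims": {"P31": [{"mainsnak": {
        "snaktype": "value",
        "datavalue": {"value": {"entity-type": "item", "id": "Q13442814"},
                      "type": "wikibase-entityid"},
    }}]},
    "sitelinks": {},
}) + ",\n"

result = classify(line)
```

Since every entity is read from the dump, no item is lost to a query timeout; the cost is a full pass over the file.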

@Manuel

External Identifiers Statistics

  1. In whole Wikidata, we currently find 78,505,497 (out of 94,158,141) items with at least one External Id: that would be about 83% of all Wikidata items, implying 17% of items w/o External Ids.
  2. In Astronomical Objects alone, we currently find 8,415,674 (out of 8,417,204) items with at least one External Id: that would be about 99.98% of all Astronomical Objects items, implying almost no items w/o External Ids.
  3. In Scholarly Papers alone, we currently find 37,264,887 (out of 37,380,570) items with at least one External Id: that would be about 99.69% of all Scholarly Papers items, implying almost no items w/o External Ids.
  4. In "core" Wikidata (i.e. Wikidata - (Astronomical Objects + Scholarly Papers)), we currently find 32,824,937 (out of 48,360,368) items with at least one External Id: that would be about 67.88% of all "core Wikidata items", implying around 32.12% of "core items" w/o External Ids.
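
The shares can be recomputed directly from the counts quoted in the list above; a small sketch (the dictionary keys and the `pct` helper are just names chosen here):

```python
# Counts as quoted above: (items with >= 1 External Id, all items in the segment).
counts = {
    "whole": (78_505_497, 94_158_141),
    "astronomy": (8_415_674, 8_417_204),
    "scholarly": (37_264_887, 37_380_570),
    "core": (32_824_937, 48_360_368),
}

def pct(with_id, total):
    """Percentage of items with at least one External Id, two decimals."""
    return round(100.0 * with_id / total, 2)

shares = {name: pct(*pair) for name, pair in counts.items()}

for name, share in shares.items():
    print(f"{name}: {share}% with, {round(100 - share, 2)}% without External Ids")
```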

Now I really need to focus on a partial re-do of the Sitelinks datasets in accordance with T288611#7296573: I had underestimated the number of Scholarly Papers by a possibly large margin of error in my previous analyses!

@Manuel

The datasets described in T288611#7283258 are now updated with correct data and found in this public directory.

Next step: a re-work of T288611#7293369.

@Manuel

Here is a refinement of T288611#7293369:

Sitelinks Statistics

  1. In whole Wikidata, we currently find 26,368,626 items (out of 91,437,737 items with P31 instance of, P279 subclass of, or P361 part of) with sitelinks: that would be about 28.84% of all Wikidata items, implying 71.16% of items w/o sitelinks.
  2. In Astronomical Objects alone, we currently find 354,814 items (out of 8,417,204) with sitelinks: that would be about 4.22% of all Astronomical Objects items in Wikidata, implying 95.78% of items w/o sitelinks.
  3. In Scholarly Papers alone, we currently find 20,700 items (out of 37,380,570) with sitelinks: that means that close to 0% of items in Scholarly Papers have sitelinks.
  4. In "core" Wikidata (i.e. Wikidata - (Astronomical Objects + Scholarly Papers)), we currently find 25,993,112 items (out of 45,639,964 items with P31 instance of, P279 subclass of, or P361 part of) with sitelinks: that means that 57% of items in "core" Wikidata have sitelinks while 43% do not.
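
As a sanity check on how the "core" figure relates to the others, a short sketch using the counts quoted above. It assumes the two excluded segments do not overlap, which the thread does not state explicitly.

```python
# Items with sitelinks, as quoted above.
whole_w_links = 26_368_626
astro_w_links = 354_814
scholarly_w_links = 20_700

# "Core" = whole Wikidata minus the two segments (assumes they are disjoint).
core_w_links = whole_w_links - astro_w_links - scholarly_w_links

core_total = 45_639_964  # "core" denominator as quoted above
core_share = 100.0 * core_w_links / core_total

print(f"core items with sitelinks: {core_w_links:,} ({core_share:.2f}%)")
```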

For Commons, don't forget that the sitelink may be in a different item (category vs. topic or list item), which probably complicates the queries quite a bit.
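
A rough sketch of what handling this could look like, assuming the topic item points to its category item via P910 (topic's main category) and that the Commons sitelink is stored under the commonswiki key. The entity dictionaries are heavily simplified stand-ins, `commons_sitelink` is a hypothetical helper, and list items or other linking properties are not covered.

```python
def commons_sitelink(item_id, entities):
    """Return the Commons page for an item, following P910 when needed."""
    entity = entities[item_id]
    direct = entity.get("sitelinks", {}).get("commonswiki")
    if direct:
        return direct
    # No direct sitelink: look at the linked category item(s), if any.
    for category_id in entity.get("P910", []):
        category = entities.get(category_id, {})
        page = category.get("sitelinks", {}).get("commonswiki")
        if page:
            return page
    return None

# Simplified, made-up entities: the topic has no Commons sitelink of its own.
entities = {
    "Q_topic": {"sitelinks": {"enwiki": "Foo"}, "P910": ["Q_category"]},
    "Q_category": {"sitelinks": {"commonswiki": "Category:Foo"}},
    "Q_other": {"sitelinks": {}},
}

found = commons_sitelink("Q_topic", entities)
missing = commons_sitelink("Q_other", entities)
```

Counting only direct commonswiki sitelinks would treat Q_topic as unlinked, which is the complication raised above.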

@GoranSMilovanovic: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task because there has not been progress lately (please correct me if I am wrong!). Resetting the assignee avoids the impression that somebody is already working on this task. It also allows others to potentially work towards fixing this task. Please claim this task again when you plan to work on it (via Add Action...Assign / Claim in the dropdown menu) - it would be welcome. Thanks for your understanding!