
[Analytics] Find out size of term subgraph
Closed, Resolved · Public

Description

Problem:
As Wikidata PMs we need to better understand how much of Wikidata's graph consists of Labels, Descriptions, and Aliases, in order to make a good decision about how to split the Blazegraph database.

Questions:

  • # of triples that describe Labels
  • % of triples that describe Labels
  • # of triples that describe Descriptions
  • % of triples that describe Descriptions
  • # of triples that describe Aliases
  • % of triples that describe Aliases

How the data will be used:

What difference will these insights make:

Notes:

  • The most recent numbers that we can get will do.

Assignee Planning

Information below this point is filled out by WMDE Analytics and specifically the assignee of this task.

Sub Tasks

Full breakdown of the steps to complete this task:

  • Look into prior research on this topic
  • Define tables to be used below
  • Derive total triples (see the query sketch after this list)
    • 2023-07-10: 15,033,775,713
    • 2023-07-19: 15,043,046,814
  • Aggregate total and percent for labels
    • 2023-07-10:
      • Total: 801,847,766
      • Percent: 5.334
    • 2023-07-19:
      • Total: 802,163,906
      • Percent: 5.332
  • Aggregate total and percent for descriptions
    • 2023-07-10:
      • Total: 2,877,509,113
      • Percent: 19.14
    • 2023-07-19:
      • Total: 2,878,727,304
      • Percent: 19.137
  • Aggregate total and percent for aliases
    • 2023-07-10:
      • Total: 178,352,219
      • Percent: 1.186
    • 2023-07-19:
      • Total: 178,333,657
      • Percent: 1.185
  • Put results/process in a public place for future reference
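
For future reference, here is a minimal PySpark sketch of how these aggregates could be derived. This is not the task's actual notebook; the column name predicate, the date/wiki partition fields, and the partition values are assumptions based on the table description in the next section.

# Minimal sketch, assuming a Spark session on the analytics cluster.
# Column and partition names (predicate, date, wiki) are unverified
# assumptions; check the table schema first. Predicates may also be
# stored without the surrounding angle brackets.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

SNAPSHOT = "20230710"  # hypothetical partition value

# Term predicates (the same IRIs as in the notebook snippet quoted below).
PREDICATES = {
    "label": "<http://www.w3.org/2000/01/rdf-schema#label>",
    "description": "<http://schema.org/description>",
    "alias": "<http://www.w3.org/2004/02/skos/core#altLabel>",
}

triples = spark.sql(f"""
    SELECT predicate
    FROM discovery.wikibase_rdf
    WHERE date = '{SNAPSHOT}' AND wiki = 'wikidata'
""")

total = triples.count()
print(f"total: {total:,}")

for term, iri in PREDICATES.items():
    n = triples.where(F.col("predicate") == iri).count()
    print(f"{term}: {n:,} triples ({100 * n / total:.3f}%)")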

Data to be used

See Analytics/Data_Lake for the breakdown of the data lake databases and tables.

The following tables will be referenced in this task:

  • The discovery.wikibase_rdf table will be used for this
    • The schema is not documented on Wikitech, but anyone with access to the analytics cluster can query the table (as of 2023-07-17)
    • The table includes subject-predicate-object relationships for Wikibase instances including Wikidata
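
Since the schema is undocumented, it can be inspected directly from a notebook on the cluster. A quick sketch (assuming a working Spark session):

# Sketch: inspect the undocumented table before querying it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("DESCRIBE discovery.wikibase_rdf").show(truncate=False)
spark.sql("SHOW PARTITIONS discovery.wikibase_rdf").show(truncate=False)  # if partitioned
spark.sql("SELECT * FROM discovery.wikibase_rdf LIMIT 5").show(truncate=False)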

Notes and Questions

Things that came up during the completion of this task, questions to be answered, and follow-up tasks:

Event Timeline

Manuel renamed this task from Find out size of term subgraph to [Analytics] Find out size of term subgraph. May 30 2023, 6:39 PM
Manuel moved this task from Incoming to Needs PM work on the Wikidata Analytics board.
Manuel moved this task from Needs PM work to Kanban on the Wikidata Analytics board.
Manuel edited projects, added Wikidata Analytics (Kanban); removed Wikidata Analytics.
Manuel moved this task from Incoming to Prioritized backlog on the Wikidata Analytics (Kanban) board.
Manuel reassigned this task from Andrew-WMDE to AndrewTavis_WMDE.
Manuel added a subscriber: Andrew-WMDE.
Manuel removed a subscriber: Andrew-WMDE.

@Manuel, the task description has now been updated with the aggregate values and percents for the dump from 2023-07-10. As the dump is weekly, the next dump will be made on the 19th, so I'll go ahead and rerun the process then so we have the most up-to-date numbers.

How do we want to document this? Everything in the notebook is using discovery.wikibase_rdf, which should be fine to publish. Would this be a case where we'd put the notebook on GitHub for reference?

And would a table output be preferable for easier comparison?

Cool, thank you!

the next dump will be made on the 19th, so I'll go ahead and rerun the process then

I don't think that there should be significant changes on that scale. So no need to rerun.

How do we want to document this?

We might create a little report for this and the coming WDQS task.

Would this be a case where we'd put the notebook on GitHub for reference?

I think so, yes: In terms of code documentation GitHub would make sense to me. That way we can link to it in the report and we (and others) can find it easily in the repo in case we need to rerun it. What would you think makes most sense?

We might create a little report for this and the coming WDQS task.

We could do this, yes. Do you mean something in Google Docs? (I mention a readme with the code below as another alternative)

In terms of code documentation GitHub would make sense to me. That way we can link to it in the report and we (and others) can find it easily in the repo in case we need to rerun it. What would you think makes most sense?

GitHub makes sense for visibility, and I'd say we should try to figure out the repo structure we'll be operating from. For small tasks like this I think it makes sense for the work to go into a general wmde/wikidata-analytics/tasks directory or the like, whereas larger projects would get their own repos in wmde. We could split the tasks directory by quarter or something like that so we're not dumping things into it endlessly, and then any code that can be open-sourced would go there along with a quick README.md file that serves as the report?

Yes, let's think about the structure some more and then just try something out. We can change the structure later, correct?

Some thoughts:

  • all of our work for Wikidata is Wikidata Analytics
  • tasks can sometimes be grouped by topic (e.g. content/contributors/etc or epics like T337799)
  • topical groupings would also not be perfect, as it might miss some links between topics

we can change the structure later, correct?

Yes, no stress on that whatsoever 😊

tasks can sometimes be grouped by topic (e.g. content/contributors/etc or epics like T337799)

We could also have something like a wmde/wikidata-analytics/epics directory or the like that would allow us to group tasks together, where /tasks would be separated by quarter.

topical groupings would also not be perfect, as it might miss some links between topics

I agree that it'd be best to limit ourselves to tasks and epics for this, as topical groupings would end up getting muddled. No use stressing about where to put some work as long as it stays reasonably ordered :)
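
Putting the pieces of this exchange together, the layout being discussed would look roughly like this (illustrative only; all directory and file names below are hypothetical):

wmde/wikidata-analytics/
    tasks/
        2023-Q3/
            Txxxxxx-term-subgraph/    (hypothetical task directory)
                notebook.ipynb
                README.md             (doubles as the short report)
    epics/
        T337799/                      (groups the tasks of an epic)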

Some thoughts about the notebook:

Double-checking

Triples should always be distinct, correct? But the total of 15 billion seems lower than what I have read elsewhere.
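
One way to double-check distinctness directly (a sketch with the same unverified schema assumptions as the one in the task description above):

# Sketch: compare raw vs. de-duplicated triple counts.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
triples = spark.sql("""
    SELECT subject, predicate, object
    FROM discovery.wikibase_rdf
    WHERE date = '20230710' AND wiki = 'wikidata'
""")
print("raw:     ", triples.count())
print("distinct:", triples.distinct().count())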

Size calculations

The predicates look correct to me for this analysis.

predicate_representation_dict = {
    # rdfs:label
    "label": "<http://www.w3.org/2000/01/rdf-schema#label>",
    # schema:description
    "description": "<http://schema.org/description>",
    # skos:altLabel
    "alias": "<http://www.w3.org/2004/02/skos/core#altLabel>",
}

But for the other tasks (e.g. T342111) it will not be as easy as querying Q-IDs in subjects; doing only that would underestimate the size of the subgraph in question. I can see, for example, that qualifiers and references follow a different pattern.
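
For context, the Wikidata RDF mapping uses distinct IRI prefixes for these node types, which is why filtering subjects by Q-ID alone undercounts a subgraph:

wd:    <http://www.wikidata.org/entity/>            items and properties (Q..., P...)
wds:   <http://www.wikidata.org/entity/statement/>  statement nodes
wdref: <http://www.wikidata.org/reference/>         reference nodes
wdv:   <http://www.wikidata.org/value/>             complex value nodes (also used by qualifiers)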

I would suggest that we set up a short meeting with someone from the Wikidata team who can explain this table to us. In the meeting, you could also briefly explain the most relevant steps in this notebook so that they could provide a high-level code review.

Manuel closed this task as Resolved. Edited Jul 20 2023, 9:32 AM

For documentation:

  • No high-level code review is required as we followed AKhatun's approach.
  • The notebook will still be documented in a public place for future reference.

We are done here! \o/

Great, @Manuel! Let me know what you want to do for the documentation of this. Happy to set up a repo for us on GitHub in the coming days if that would help :)

Is triple count the only important parameter? It seems likely that the descriptions could be larger, on average, than labels.

It seems odd that there are more descriptions (19% of total) than labels (5%), although that agrees with what the previous study found. The strong spike at 58-61 descriptions per item tells me that some bot probably machine-generated templated descriptions for a large number of languages. The fact that there are more Dutch descriptions than any other language also says "machine generated" to me. (Edit: Most likely due to Edoderoobot)

Storing machine-generated templated descriptions in the graph seems wasteful. I've observed anecdotally when working with person entities that a large number of them have pro-forma descriptions of <nationality> <occupation> (<birth year> - <death year>). These obviously don't need to be stored in the graph because they're just duplicating existing information. If Wikidata search/autocomplete were made smarter, these could be generated on the fly.

I have a theory as to where a big chunk of the machine-generated descriptions come from. They are the phrase "Wikimedia category" in hundreds of languages, as a textual transcription of the triple instanceOf Q4167836. For example, Catégorie:Naissance à Seri Menanti has a single label in French and the P31 instanceOf claim, which together occupy 802 bytes. Then two bots (Mr.Ibrahembot and Emijrpbot) came along and added another 11.5K (!) of static text (not even anything templated) in 129 languages, none of which have labels for the category.

There are 5.1M category items, 1.4M disambiguation page items, and more than 7M internal items of this type in total. The bots haven't fully populated all the descriptions yet, but this could amount to over 0.6B triples and 58 GB of wasted storage just for category items at the 130-language level. Imagine the waste as more languages are included and more items are added.
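
As a back-of-the-envelope check of those figures (the arithmetic below is the editor's, not from the original comment; inputs are taken from the comment):

# Rough check of the estimates above.
category_items = 5.1e6   # category items on Wikidata
languages = 130          # languages with a static "Wikimedia category" description
bytes_per_item = 11.5e3  # static description text observed on the example item

triples = category_items * languages                 # ~6.6e8, i.e. over 0.6B triples
storage_gb = category_items * bytes_per_item / 1e9   # ~58.7 GB

print(f"{triples:.2e} triples, {storage_gb:.1f} GB")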

This is a huge waste of resources caused by humans attempting to work around a single product deficiency. It's only going to get more expensive over time.

p.s. These bots apparently aren't limited to internal Wikipedia items. Here's a user who's adding Asturian boilerplate descriptions not only to Wikipedia categories but also to U.S. patents. This flood of useless data isn't going to be sustainable.

Thanks for writing, @tfmorris! :)

Is triple count the only important parameter? It seems likely that the descriptions could be larger, on average, than labels.

Descriptions are something we could definitely look into in relation to this. This task and T342111 are generally trying to get up-to-date values for the study you mentioned, both as a check of the process and as onboarding for me in working with this data (I'm referencing AKhatun's work).

(not even anything templated)

Just so I understand this a bit better, is there some kind of functionality where ideally they would have added a template for Wikimedia category as a description which then would return any of the localized versions of it while only saving that one reference? There's still lots I don't understand 😇 Would a goal here be to try to template things that we see people keep adding statically and that could be replaced with templates?

Really appreciate your insights and the time you took to look into all this. Thanks again 🙏

Updated the totals given the most recent dump to test my connection to it in relation to T342416. As expected, no major changes in terms of percentages :)

@tfmorris:

Is triple count the only important parameter? It seems likely that the descriptions could be larger, on average, than labels.

This task is about Blazegraph, so triple counts are what matter for this specifically. But we are also concerned about the more general problems that you mentioned, and your comments were helpful in that regard, so thank you for sharing! :)

@AndrewTavis_WMDE:

functionality where ideally they would have added a template for Wikimedia category as a description which then would return any of the localized versions of it while only saving that one reference

A new feature that would solve this problem is already planned, but it does not exist yet (see T303677).

@Manuel when you write:

A new feature that would solve this problem is already planned, but it does not exist yet (see T303677).

Thanks for the pointer! What does "planned" mean in this context? How do I find the schedule and/or priority of the task? My naive reading of the ticket gives the impression that it's been stalled without action for over a year.

@tfmorris:

What does "planned" mean in this context?

It is something that we decided to do but I do not know when we can prioritize this over other work.

My naive reading of the ticket gives the impression that it's been stalled without action for over a year.

It has a high priority. But unfortunately, as you know, we have only limited developer resources available. In this case we decided to first start with the related problem of redundancy in Labels and Aliases (see T285156).

I hope this helps!

I realise this ticket is already closed (I only just noticed it) but please bear in mind when making any decisions about how to split the data that introducing mul will hopefully result in a huge change to these numbers (at least for labels/aliases).

I've been adding things where we might be able to use mul to https://www.wikidata.org/wiki/User:Nikki/Terms (with statistics) and have already found 500 million labels (62% of all labels) and nearly 150 million aliases (82% of all aliases) which could potentially be removed by using mul.

Thank you, Nikki, and great to see your estimates! I am mainly responsible for our analytics tasks these days, so no worries: I already made the point in the evaluation (and I also recommended your analysis for more details). :)

I realise this ticket is already closed (I only just noticed it) but please bear in mind when making any decisions about how to split the data that introducing mul will hopefully result in a huge change to these numbers (at least for labels/aliases).

I've been adding things where we might be able to use mul to https://www.wikidata.org/wiki/User:Nikki/Terms (with statistics) and have already found 500 million labels (62% of all labels) and nearly 150 million aliases (82% of all aliases) which could potentially be removed by using mul.

We're keeping mul in mind. Even if those are large numbers, they don't seem to be large enough to address our scaling concerns on their own.