
Improve bulk import via API
Open, Needs Triage, Public

Description

Introduction

The Wikibase API is a recommended way to import entities in bulk into a Wikibase instance. However, the current performance of entity creation via the Wikibase API and its wrappers is roughly 0.5-20 items per second. There is no reported comparison, but a few values were mentioned in the Wikibase Community Telegram group in March 2021: 0.55 (Andra), 5 (Myst), 18 (Adam) and 20 (Jeroen). I usually managed to create 5 items per second using the Wikibase API or its wrappers. That performance is fine for years-long collaborative knowledge graph construction. But for short projects with 5-100 million entities it would be great to reach 100 items per second or faster, at least for the initial upload of data.
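
For reference, the kind of throughput measurement those numbers come from can be reproduced with a minimal sketch like the one below. It assumes a local Wikibase at http://localhost/w/api.php and a bot-password account (both placeholders); it creates items one at a time via wbeditentity and reports the resulting rate:

```
import json
import time
import requests

WB_API = "http://localhost/w/api.php"            # assumed local Wikibase endpoint
USER, PASSWORD = "BotUser@bulk", "botpassword"   # placeholder bot-password credentials

session = requests.Session()

# Standard MediaWiki action API flow: login token -> login -> CSRF token.
login_token = session.get(WB_API, params={
    "action": "query", "meta": "tokens", "type": "login", "format": "json",
}).json()["query"]["tokens"]["logintoken"]
session.post(WB_API, data={
    "action": "login", "lgname": USER, "lgpassword": PASSWORD,
    "lgtoken": login_token, "format": "json",
})
csrf_token = session.get(WB_API, params={
    "action": "query", "meta": "tokens", "format": "json",
}).json()["query"]["tokens"]["csrftoken"]

# Create N items one by one via wbeditentity and measure the rate.
N = 100
start = time.time()
for i in range(N):
    data = {"labels": {"en": {"language": "en", "value": f"Bulk test item {i}"}}}
    session.post(WB_API, data={
        "action": "wbeditentity", "new": "item", "data": json.dumps(data),
        "token": csrf_token, "bot": 1, "format": "json",
    })
elapsed = time.time() - start
print(f"{N / elapsed:.1f} items/second")
```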

As a consequence of that performance issue, third-party Wikibase users are searching for workarounds for bulk import. For example, RaiseWikibase inserts entities and wikitexts directly into roughly ten tables of the SQL database. However, filling four secondary tables (needed to show labels of entities in the Wikibase frontend) and building the CirrusSearch index are outsourced to the Wikibase maintenance scripts (see the building_indexing function). The direct insert of data into the SQL database boosts performance to 280 items per second, but filling the secondary tables and CirrusSearch indexing also perform poorly, as Aidan Hogan pointed out on the Wikibase Community mailing list. Update on 04.08.2021: the secondary tables are now filled on the fly as well (see the commit).
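
To illustrate where the speedup from writing to the database directly comes from, here is a purely illustrative sketch of the batching pattern (many rows per statement, a single commit per batch). The table name bulk_import_staging is invented for this example; the real MediaWiki/Wikibase schema spans many more tables (page, revision, slots, content, text, comment, actor, ...), which is what RaiseWikibase actually deals with:

```
import json
import pymysql

# Assumed local database credentials; adjust to the actual MediaWiki settings.
conn = pymysql.connect(host="localhost", user="wikiuser",
                       password="wikipass", database="my_wiki")

entities = [
    {"type": "item", "labels": {"en": {"language": "en", "value": f"Item {i}"}}}
    for i in range(10000)
]

with conn.cursor() as cur:
    # Hypothetical staging table standing in for the real MediaWiki tables:
    # the point is one multi-row INSERT instead of one round trip per entity.
    cur.executemany(
        "INSERT INTO bulk_import_staging (entity_json) VALUES (%s)",
        [(json.dumps(e),) for e in entities],
    )
conn.commit()  # a single commit for the whole batch avoids per-row transaction overhead
```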

Adam Shorland participated in all those discussions. This resulted in the performance benchmark tool wikibase-profile, the post What happens in Wikibase when you make a new Item? and the related ticket T285987.

Problem

In short, the Wikibase API does many things. Some of those are not needed for bulk import by admins.

Adam mentioned six levels that could be improved and optimized for a bulk import use case. I'll mention some things which could probably be ignored during the initial bulk import by admins.

  1. The API
    • parameter validation
    • permission checks for the user
  2. The Business
    • The edit token is validated
    • edit permissions are checked
    • rate limits are also checked
    • edit filter hooks run
  3. Wikibase persistence time
    • ?
  4. MediaWiki PageUpdater
    • some more permission checks
  5. Derived data updates
    • ?
  6. Secondary data updates
    • ?

Apart from that, we need some kind of benchmark. The wikibase-profile is a good starting point for that.

Possible tasks & solutions

  1. Benchmark
  2. Discuss what is not needed in the Wikibase API for bulk import.
  3. Check the code in the post and find performance bottlenecks. See T285987 as an example.

Predicted impact

  • Faster data import.
  • More users of Wikibase.
  • Faster growth of the Wikibase ecosystem.

Event Timeline

Hmm, this is missing a detail of how your entity data sets, or the community's, are likely to be formatted (either coming from some other system or program, or created manually via database exports or software tools).

  1. What are the import formats that are likely to be wanted to import in bulk into Wikibase? Simple CSV Tables? JSON? RDF/XML? Or directly any of the formats that Rio https://github.com/oxigraph/rio currently provides (RDF-star is one of the newest it now supports)?

My experience with enterprise bulk data loading strategies has always been to do the following separately, because of the load that indexing can put on a single server's CPUs (a likely case with any Wikibase deployment by consumers):

  1. bulk load
  2. then index
> What are the import formats that are likely to be wanted to import in bulk into Wikibase? Simple CSV Tables? JSON? RDF/XML? Or directly any of the formats that Rio https://github.com/oxigraph/rio currently provides (RDF-star is one of the newest it now supports)?

I would say CSV & JSON are highly likely, and RDF is likely. Does it affect the ticket?

> What are the import formats that are likely to be wanted to import in bulk into Wikibase? Simple CSV Tables? JSON? RDF/XML? Or directly any of the formats that Rio https://github.com/oxigraph/rio currently provides (RDF-star is one of the newest it now supports)?

I would say that if we could solve this issue for one format, it would be a major step forward from the current situation. Users would then just need to get their data into the desired format, call the script, and ingest the data. Probably the most natural format to focus on would be JSON, as this is the "native format" of Wikidata. Wrappers can later be added for other syntaxes.
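
As a sketch of what "get their data into the desired format" could look like for CSV, the snippet below converts CSV rows into minimal Wikibase-style entity JSON. The column names and the property ID P1 are assumptions for illustration only:

```
import csv
import json

def row_to_entity(row):
    """Map one CSV row to a minimal Wikibase item JSON document."""
    return {
        "type": "item",
        "labels": {"en": {"language": "en", "value": row["label"]}},
        "descriptions": {"en": {"language": "en", "value": row["description"]}},
        "claims": {
            "P1": [{  # assumed string property holding an external identifier
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P1",
                    "datavalue": {"value": row["identifier"], "type": "string"},
                },
                "type": "statement",
                "rank": "normal",
            }],
        },
    }

# Write one JSON entity per line, ready to be fed to whatever bulk loader is used.
with open("input.csv", newline="", encoding="utf-8") as f, \
     open("entities.json", "w", encoding="utf-8") as out:
    for row in csv.DictReader(f):
        out.write(json.dumps(row_to_entity(row), ensure_ascii=False) + "\n")
```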

It would be good to have some defined public dataset that we could use for profiling and testing load times over time.
The wikibase-profile project was also a good starting-point experiment for entity creation, but it has its flaws; it would be good to create a profiling script that could be used, with said dataset, to track progress on some fairly fixed hardware.

A good experiment could be a "hack" prototype API module that simply skips most of the complex parts of the current edit process that may not be needed, and then seeing how wbeditentity compares with that "hack" prototype, so that we can figure out what is realistic.

In general there will also be quite a lot to be gained from running multiple servers or processes (or beefing up existing processes) and from tweaking caches etc. to speed things up.
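
A minimal sketch of the multiple-processes idea, assuming some per-entity import call; the create_entity placeholder below is hypothetical and would wrap e.g. wbeditentity or any other single-entity loader:

```
from concurrent.futures import ProcessPoolExecutor

def create_entity(entity):
    """Placeholder for the real per-entity import call (e.g. a wbeditentity wrapper)."""
    pass

def import_chunk(chunk):
    # Runs in a worker process; a real version would open its own API session here.
    for entity in chunk:
        create_entity(entity)
    return len(chunk)

def parallel_import(entities, workers=8):
    # Round-robin split so each worker gets a similar share of the load.
    chunks = [entities[i::workers] for i in range(workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(import_chunk, chunks))

if __name__ == "__main__":
    print(parallel_import([{"label": f"Item {i}"} for i in range(1000)]), "entities imported")
```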

Regarding the dataset for testing: it would be good to formulate some requirements for it. Then we could just create a synthetic dataset. For example, in the performance analysis with RaiseWikibase I used randomly generated strings of fixed length and different numbers of claims. I did it for one datatype only (string). The public dataset for this ticket should probably cover more datatypes.
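
A possible sketch of such a synthetic-dataset generator, along the lines described above but extended to a few datatypes (string, quantity, time). The property IDs P1-P3 and the output layout (one JSON entity per line) are arbitrary choices for illustration:

```
import json
import random
import string

def random_string(length=20):
    return "".join(random.choices(string.ascii_letters, k=length))

def statement(prop, datavalue):
    return {"mainsnak": {"snaktype": "value", "property": prop, "datavalue": datavalue},
            "type": "statement", "rank": "normal"}

def synthetic_item(n_claims=9):
    claims = {"P1": [], "P2": [], "P3": []}
    for i in range(n_claims):
        if i % 3 == 0:    # string datatype
            claims["P1"].append(statement("P1", {"value": random_string(), "type": "string"}))
        elif i % 3 == 1:  # quantity datatype
            claims["P2"].append(statement("P2", {
                "value": {"amount": f"+{random.randint(0, 10**6)}", "unit": "1"},
                "type": "quantity"}))
        else:             # time datatype (year precision)
            claims["P3"].append(statement("P3", {
                "value": {"time": f"+{random.randint(1500, 2021)}-01-01T00:00:00Z",
                          "timezone": 0, "before": 0, "after": 0, "precision": 9,
                          "calendarmodel": "http://www.wikidata.org/entity/Q1985727"},
                "type": "time"}))
    return {"type": "item",
            "labels": {"en": {"language": "en", "value": random_string()}},
            "claims": claims}

with open("synthetic.json", "w", encoding="utf-8") as f:
    for _ in range(1000):
        f.write(json.dumps(synthetic_item(), ensure_ascii=False) + "\n")
```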

Regarding the dataset, we are working on inserting DBLP into Wikibase. It might be a good test-case? The scale is sufficient to be a challenge without being overwhelming, and the dumps are available here.

A downside is that it's mostly monolingual (mostly English labels, but in the case of papers in other languages, to the best of my knowledge, there is no translation, and no indication of language).

Another option, of course, would be to use self-contained extracts of Wikidata for testing. :)

Hi @aidhog Aidan, in my opinion I would say "NO, not a good test case for this need". And the only reason is this: it's ASCII only (characters below 128) and doesn't let us ensure proper load handling for all data in all languages, i.e. multilingual data (characters above 128) such as UTF-8, etc.
DBLP.xml is however a great test case for any SAX parser, as I can see in its PDF https://dblp.uni-trier.de/xml/docu/dblpxml.pdf
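
To make the ASCII argument checkable for any candidate dataset, a small sketch like this could report how much of a line-delimited JSON dump actually uses non-ASCII labels (the file name and the "labels" layout are assumptions):

```
import json

total = non_ascii = 0
with open("entities.json", encoding="utf-8") as f:   # one JSON entity per line
    for line in f:
        entity = json.loads(line)
        for label in entity.get("labels", {}).values():
            total += 1
            if not label["value"].isascii():         # str.isascii() needs Python 3.7+
                non_ascii += 1

print(f"{non_ascii}/{total} labels contain non-ASCII characters")
```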

We ideally need to find a CC-0 public-domain dataset (or even create or generate one) in UTF-8, in both JSON and RDF/XML. Leaving out CSV for now, since pre-processing CSV files into JSON records or RDF/XML is best done in other tools that can handle those conversions more easily.

Something like the British National Library's Linked Open Data - Serials LOD samples file https://www.bl.uk/bibliographic/downloads/BNBLODSerials_sample_rdf.zip (or the full file https://www.bl.uk/bibliographic/downloads/BNBLODSerials_202106_rdf.zip) available here https://www.bl.uk/collection-metadata/downloads#

Hi folks! Any suggestions on how to move this forward a bit? It seems we have not even agreed on a dataset for testing, let alone the rest.

At WikidataCon 2021 we had an open meeting of the Wikibase Stakeholder Group and an interactive roadmapping session. During that session we worked interactively on a roadmap Miro board. I copy and paste the discussion about this ticket:

Anonymous: How can we push this? Can this really be done without Wikimedia Deutschland?

Adam Shorland: I'd like to think that most of the problems here in some way can be worked on without WMDE, but collaboration there would always be needed.

The question around API performance is: are we going for small improvements, or waiting for the rewrite to REST?

Renat Shigapov: When will the REST API be ready?

Adam Shorland: Though it has been designed and feedback gathered, work has not started on the implementation yet.

Dragan Espenschied: The REST API is probably not going to end up calling different PHP for doing the actual API work than the "action API", I assume? So if an improvement was made on the action API level before the REST API was ready, the REST API would end up getting it too?

> Dragan Espenschied: The REST API is probably not going to end up calling different PHP for doing the actual API work than the "action API", I assume?

The topics of improving bulk imports and making editing / importing / data loading faster are likely two separate concerns here.
The REST API will probably call some different code, but the majority of this core part of editing would likely stay the same.

> So if an improvement was made on the action API level before the REST API was ready, the REST API would end up getting it too?

Yes

Any news on this? Is something hindering it from being triaged?

Hi all, I'm also interested in this item's progress. Is anyone still working on this?

Hi, we have an extension for this:

https://gitlab.the-qa-company.com/FrozenMink/batchingestionextension

It is based on the ideas proposed by AddShore.

> Hi, we have an extension for this:
>
> https://gitlab.the-qa-company.com/FrozenMink/batchingestionextension
>
> It is based on the ideas proposed by AddShore.

Nice.

(FYI) I had a similar import problem in 2020 when I was going to submit a paper, but the importer was too slow to finish before the deadline. I solved it by writing my own import script:

https://phabricator.wikimedia.org/rEMASe12cd7a9d47a289a189f4283cfac5ff57588044b

Handling of foreign keys

It has one feature I did not see in other importers, but I find it pretty helpful. Often you have foreign keys in your data model, for example users and posts in the StackExchange example from above. If you start with an empty Wikibase, both users and posts will be empty, so you don't know the item IDs of the referenced Wikibase items upfront. To handle that, I created a references field in my data model, which replaces external foreign key IDs with internal Wikibase QIDs.
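
One way to implement the idea described above is a two-pass import: first create all entities and record which QID each external ID received, then resolve the references field against that mapping. The field names and the create_entity callable below are placeholders, not the actual script's API:

```
def import_with_references(entities, create_entity):
    """entities: dicts with 'external_id', 'data' and an optional 'references'
    mapping of claim fields to external foreign-key IDs."""
    external_to_qid = {}

    # First pass: create every entity and remember which QID its external ID got.
    for entity in entities:
        qid = create_entity(entity["data"])              # returns e.g. "Q123"
        external_to_qid[entity["external_id"]] = qid

    # Second pass: all QIDs now exist, so the foreign keys can be resolved.
    resolved = []
    for entity in entities:
        refs = {field: external_to_qid[ext_id]
                for field, ext_id in entity.get("references", {}).items()}
        resolved.append((external_to_qid[entity["external_id"]], refs))
    return resolved   # (QID, {field: referenced QID}) pairs, ready to be added as claims
```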

I recently started to write new custom import PHP scripts, but I think using a batch REST API, as presented in the batchingestionextension, would be much more convenient.