
Make Exports output contain one large ndjson file of content
Closed, Resolved (Public)

Description

As a downloader, I am able to download the export file for each project and have it contain one large JSON file (or a few large JSON files) with the objects that I need, so that I can more easily upload the content into my KG.

Need to research:

  • The maximum JSON file size (or number of objects) that can still be ingested efficiently.

Event Timeline

Here are two proposed solutions:

  1. Have a single JSON file with the collection of all titles. The file size may reach several hundred GB.
  2. Split the data into multiple files, each small enough to fit in the memory of a regular computer. Each file would be presented as one page of a paginated set and would carry the total number of pages, the total number of titles, the current page, and the number of titles on this page. Each file would be about 10GB, and the total number of files may reach 100.
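
To make option 2 more concrete, here is a minimal sketch of what one page could look like. The field names (`total_pages`, `total_titles`, `page`, `titles_on_page`, `titles`) and the `paginate` helper are hypothetical and not taken from any existing export code. The sketch pages by title count rather than by bytes, and holds everything in memory purely for illustration.

```python
import json
import math
from typing import List


def paginate(titles: List[dict], page_size: int) -> List[dict]:
    """Wrap slices of `titles` in a page envelope (hypothetical field names)."""
    total_pages = math.ceil(len(titles) / page_size)
    pages = []
    for i in range(total_pages):
        items = titles[i * page_size:(i + 1) * page_size]
        pages.append({
            "total_pages": total_pages,     # total number of pages
            "total_titles": len(titles),    # total number of titles
            "page": i + 1,                  # current page, 1-based
            "titles_on_page": len(items),   # number of titles on this page
            "titles": items,
        })
    return pages


# Made-up sample data; a real export would stream titles instead.
pages = paginate([{"title": f"title{i}"} for i in range(5)], page_size=2)
print(json.dumps(pages[-1]))
```

A real implementation would write each page to its own file as it is produced rather than building the whole list first.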

#1 seems like not an option :)

For #2... what would that paginated version look like?

We can follow the approach we have for diffs: a set of files named with the wiki name and an incrementing number.

enwiki_1.json
enwiki_2.json
...

The content of the files will be similar to the diffs:

[
 {title1},
 {title2},
  ...
]

We want this to be in ndjson format.

Meaning without a top-level array, one object per line, so that the file can easily be parsed line by line.

{...}
{...}
{...}
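
As a rough illustration of why the line-per-object layout is easy to consume, here is a minimal sketch of a reader that parses one object at a time instead of loading a whole array. The `iter_ndjson` helper and the sample data are made up for this example.

```python
import io
import json
from typing import Iterator, TextIO


def iter_ndjson(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed object per non-empty line of an ndjson stream."""
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines, e.g. a trailing newline
            yield json.loads(line)


# Hypothetical sample standing in for a real export file.
sample = io.StringIO('{"title": "A"}\n{"title": "B"}\n{"title": "C"}\n')
titles = [obj["title"] for obj in iter_ndjson(sample)]
print(titles)
```

With a real file, passing the object returned by `open(path, encoding="utf-8")` would work the same way, and memory use stays bounded by the largest single line.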

If we are to keep each file under a certain size limit, let's say 10GB, then this approach to naming files makes the most sense to me:

enwiki_1.json
enwiki_2.json
...
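
A minimal sketch of how such size-limited files could be produced, assuming objects arrive as an iterable and the limit is counted in UTF-8 bytes. The `write_ndjson_chunks` helper and its parameters are hypothetical, and the demo uses a tiny limit in place of 10GB.

```python
import json
import os
import tempfile
from typing import Iterable, List


def write_ndjson_chunks(objects: Iterable[dict], out_dir: str,
                        wiki: str, max_bytes: int) -> List[str]:
    """Write objects as ndjson, starting a new file before max_bytes is exceeded."""
    paths: List[str] = []
    out = None
    written = 0
    try:
        for obj in objects:
            line = (json.dumps(obj, ensure_ascii=False) + "\n").encode("utf-8")
            # Roll over when the next line would push the current file past the limit.
            # A single object larger than max_bytes still gets a file of its own.
            if out is None or (written > 0 and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"{wiki}_{len(paths) + 1}.json")
                out = open(path, "wb")
                paths.append(path)
                written = 0
            out.write(line)
            written += len(line)
    finally:
        if out is not None:
            out.close()
    return paths


with tempfile.TemporaryDirectory() as tmp:
    demo = [{"title": f"title{i}"} for i in range(5)]  # each line is 20 bytes
    files = write_ndjson_chunks(demo, tmp, "enwiki", max_bytes=40)
    names = [os.path.basename(p) for p in files]
    counts = []
    for p in files:
        with open(p, "rb") as f:
            counts.append(sum(1 for _ in f))
    print(names, counts)
```

Because files are split only on line boundaries, every file remains valid ndjson on its own and can be ingested independently of the others.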
RBrounley_WMF renamed this task from JSON format of Exports to Make Exports output contain one large ndjson file of content. May 12 2021, 3:41 PM
RBrounley_WMF raised the priority of this task from High to Unbreak Now!. May 19 2021, 1:52 PM
tstarling lowered the priority of this task from Unbreak Now! to High. May 20 2021, 4:40 AM