There are several categories of files and data that we'll handle in this scraper:
| **Name** | **Container format(s)** | **Content format** | **Example filename** | **Description** |
| HTML dump (tarball) | tar + gzip + split | JSON lines, [[ https://www.mediawiki.org/wiki/Specs/HTML/2.7.0 | Parsoid-RDFa ]] | hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz | These are the canonical upstream data source. They contain all the articles from one wiki, for one namespace (we only deal with the Main namespace or NS 0). |
| Uncompressed dump | none | JSON lines | hawiki_0.ndjson | Same as the tarball. We may manually decompress these files when debugging but they aren't a normal part of the pipeline. |
| Intermediate summary | gzip | JSON lines | hawiki-summary.jsonlines | The output of HtmlPageParser is a map of summary statistics for each page in a dump. |
| Sample | any | any | hawiki-sample100.ndjson | We may take smaller samples of each file type while prototyping the processor. |
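As a rough illustration of how the tarball format could be consumed, here is a minimal Python sketch using only the standard library. It assumes each tar member is an NDJSON file of one page record per line; the `article_body` key is an assumption about the dump schema, and split archives would need their parts concatenated first.

```
import json
import tarfile

def iter_dump_pages(dump_path):
    """Stream page records out of an Enterprise HTML dump tarball.

    Each tar member is assumed to be an NDJSON file where every line is
    one page; the Parsoid HTML is assumed to live under "article_body".
    """
    # Mode "r|gz" reads the archive as a stream, so the tarball never
    # has to be fully decompressed to disk.
    with tarfile.open(dump_path, mode="r|gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            with tar.extractfile(member) as lines:
                for line in lines:
                    yield json.loads(line)

for page in iter_dump_pages("hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz"):
    ...  # hand each page record to HtmlPageParser
```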
There are several dimensions of organization to consider for each file type.
| **Name** | **Example value** | **Description** |
| snapshot date | 20230320 | Date the dump run completed, shared across all wikis in the job (so it is an upper bound on the included revision timestamps). This is the top-level organizing key when downloading dumps. |
| wiki | hawiki | Database name of a wiki. |
| file type | tarball / summary / sample / ... | The contents and purpose of a file. |
| aggregation scale | page / wiki | The level of detail that one line of data corresponds to. |
| pipeline stage | raw input / intermediate result / final output | Note that the intermediate files need a more specific name. |
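To make these dimensions concrete, here is a hypothetical helper that recovers the wiki and snapshot date from an upstream dump filename. The class name and regex are illustrative assumptions, not an agreed convention.

```
import re
from typing import NamedTuple

class DumpName(NamedTuple):
    wiki: str       # e.g. "hawiki"
    snapshot: str   # e.g. "20230320"

# Matches names like "hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz".
DUMP_NAME_RE = re.compile(
    r"^(?P<wiki>[a-z_]+)-NS0-(?P<snapshot>\d{8})-ENTERPRISE-HTML\.json\.tar\.gz$"
)

def parse_dump_name(filename: str) -> DumpName:
    match = DUMP_NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected dump filename: {filename!r}")
    return DumpName(wiki=match["wiki"], snapshot=match["snapshot"])
```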
This task is complete when we agree on a directory structure and file naming convention.
Iterating on filesystem organization is annoying and error-prone, so the first draft should already be close to final.
Assume that the filesystem is mounted over an NFS share, possibly with separate shares for inputs and outputs.
---
Draft proposal to edit and extend:
```
inputs/
    <snapshot>/
        <wiki>-NS0-<snapshot>-ENTERPRISE-HTML.json.tar.gz
        <wiki>-NS0-<snapshot>-ENTERPRISE-HTML-sample<count>.json.tar.gz
references/
    <wiki>-references.ndjson.gz
    <wiki>-references-sample<count>.ndjson.gz
reports/
    reference-summary-<snapshot>.csv
```
Reasoning:
* Inputs are streamed from a dedicated, read-only dumps source (fileshare or web), while outputs are persisted to another store, so this is the highest-level partitioning.
* We'll follow up with a second run on a later snapshot in 1-2 years. Snapshot results will be compared, so they are parallel siblings and a natural partition. This is also the top-level partitioning on the dumps server.
* Analysing references will be just one application of the dump scraper. This directory also gives a descriptive name to our "intermediate files" and hints at what each one is: a list of summarized references per article, over a whole wiki.
* Reports are the only files which should normally be opened by end users. We'll have a diverse array of reports in here.
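To sanity-check the convention, here is a sketch of path helpers that would follow the draft layout above. The mount points and function names are illustrative assumptions, not part of the proposal.

```
from pathlib import Path

# Illustrative mount points only; per the reasoning above, inputs would
# sit on a read-only dumps share and outputs on a separate writable share.
INPUTS_ROOT = Path("/mnt/dumps/inputs")
OUTPUTS_ROOT = Path("/mnt/scraper")

def dump_tarball(snapshot: str, wiki: str) -> Path:
    """inputs/<snapshot>/<wiki>-NS0-<snapshot>-ENTERPRISE-HTML.json.tar.gz"""
    return INPUTS_ROOT / snapshot / f"{wiki}-NS0-{snapshot}-ENTERPRISE-HTML.json.tar.gz"

def references_file(wiki: str) -> Path:
    """references/<wiki>-references.ndjson.gz"""
    return OUTPUTS_ROOT / "references" / f"{wiki}-references.ndjson.gz"

def report_file(snapshot: str) -> Path:
    """reports/reference-summary-<snapshot>.csv"""
    return OUTPUTS_ROOT / "reports" / f"reference-summary-{snapshot}.csv"
```

If we later decide that outputs should also be partitioned by snapshot, only helpers like these would need to change.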