There are several categories of files and data that we'll handle in this scraper:
| **Name** | **Container format(s)** | **Content format** | **Example filename** | **Description** |
| HTML dump (tarball) | tar + gzip + split | JSON lines, [[ https://www.mediawiki.org/wiki/Specs/HTML | Parsoid-RDFa ]] | hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz | This is the canonical upstream data source. Each file contains all the articles from one wiki, for one namespace (we only deal with the Main namespace, NS 0). |
| Uncompressed dump | none | JSON lines | hawiki_0.ndjson | Same as the tarball. We may manually decompress these files when debugging but they aren't a normal part of the pipeline. |
| Intermediate summary | gzip (TODO) | JSON lines | hawiki-references-20230320.ndjson | The output of HtmlPageParser: a map of summary statistics for each page in a dump (see the example line below the table). |
| Sample | any | any | hawiki-sample100.ndjson | We may take smaller samples of each file type while prototyping the processor. |
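For concreteness, one line of the intermediate summary might look like the sketch below. The field names here are illustrative placeholders, not HtmlPageParser's actual output:

```
{"page_id": 1234, "title": "Example article", "reference_count": 17, "template_counts": {"cite web": 9, "cite book": 2}}
```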
There are several dimensions of organization to consider for each file type.
| **Name** | **Example value** | **Description** |
| snapshot date | 20230320 | Date the dump run completed, shared across all wikis in the job (so it is an upper bound on the revision timestamps it contains). This is the top-level organizing key when downloading dumps. |
| wiki | hawiki | Database name of a wiki. |
| file type | tarball / summary / sample / ... | The contents and purpose of a file. |
| aggregation scale | page / wiki | The level of detail that one line of the file corresponds to. |
| pipeline stage | raw input / intermediate result / final output | Note that the intermediate files need a more specific name. |
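These dimensions could be captured in one small value type that every path-building function accepts, so no code concatenates path fragments ad hoc. A minimal sketch; the type and field names are assumptions, not settled API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class DumpKey:
    """Coordinates identifying one file in the pipeline."""
    snapshot: str                       # e.g. "20230320"
    wiki: str                           # e.g. "hawiki"
    file_type: str                      # "tarball" / "summary" / "sample" / ...
    sample_count: Optional[int] = None  # set only for sample files
```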
This task is complete when we agree on a directory structure and file naming convention.
Iterating on filesystem organization is annoying and error-prone, so the first draft should already be close to final.
Assume that the filesystem is mounted over an NFS share, perhaps even separate shares for inputs and outputs. The input share will probably have an existing directory structure that should be hardcoded into the scraper.
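A minimal sketch of how that hardcoding could look, assuming hypothetical mount points (the real shares are not decided yet). The two roots live in one module so the rest of the scraper never builds absolute paths itself:

```python
from pathlib import Path

# Hypothetical mount points; the actual NFS shares are not decided yet.
INPUT_ROOT = Path("/mnt/dumps")     # read-only, structure owned upstream
OUTPUT_ROOT = Path("/mnt/scraper")  # read-write, structure owned by the scraper
```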
---
Draft proposal to edit and extend:
```
inputs/
<snapshot>/
<wiki>-NS0-<snapshot>-ENTERPRISE-HTML.json.tar.gz
samples/
<snapshot>/
<wiki>-NS0-<snapshot>-ENTERPRISE-HTML-sample<count>.json.tar.gz
references/
<snapshot>/
<wiki>-references-<snapshot>.ndjson.gz
<wiki>-references-<snapshot>-sample<count>.ndjson.gz
reports/
<snapshot>/
        all-wikis-reference-summary-statistics-<snapshot>.csv
all-wikis-references-templates-<snapshot>.csv
```
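To keep the convention easy to change in one place, the layout above could be expressed as a handful of path builders. A sketch, reusing the hypothetical mount points from earlier (none of these names are final):

```python
from pathlib import Path

INPUT_ROOT = Path("/mnt/dumps")     # hypothetical read-only input share
OUTPUT_ROOT = Path("/mnt/scraper")  # hypothetical output share

def input_dump(snapshot: str, wiki: str) -> Path:
    return INPUT_ROOT / "inputs" / snapshot / f"{wiki}-NS0-{snapshot}-ENTERPRISE-HTML.json.tar.gz"

def sample_dump(snapshot: str, wiki: str, count: int) -> Path:
    return OUTPUT_ROOT / "samples" / snapshot / f"{wiki}-NS0-{snapshot}-ENTERPRISE-HTML-sample{count}.json.tar.gz"

def references_aggregate(snapshot: str, wiki: str, sample: int = 0) -> Path:
    name = f"{wiki}-references-{snapshot}" + (f"-sample{sample}" if sample else "")
    return OUTPUT_ROOT / "references" / snapshot / f"{name}.ndjson.gz"

def report(snapshot: str, name: str) -> Path:
    return OUTPUT_ROOT / "reports" / snapshot / f"all-wikis-{name}-{snapshot}.csv"
```

For example, `references_aggregate("20230320", "hawiki")` yields `/mnt/scraper/references/20230320/hawiki-references-20230320.ndjson.gz`, matching the tree above.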
Reasoning:
* Inputs stream from a dedicated, read-only dumps source (fileshare or web) and have their own predetermined structure, which we should treat as opaque here; outputs are persisted to another store.
* Inputs vs. outputs must be the highest-level partitioning because it crosses filesystem boundaries.
* Samples are like inputs, but again cross a filesystem boundary and are owned by the scraper whereas input files are not.
* We'll be following up with a second run on a later snapshot, in 1-2 years. Snapshot results will be compared, so they are parallel siblings and a natural partition (see the comparison sketch after this list). This is also the highest-level partitioning on the dumps server, where more than one snapshot set can be present.
* Analysing references will be just one application of the dump scraper. This directory also gives a descriptive name to our "intermediate files" and nearly suggests what each one is: a list of aggregated references. Aggregate file names should include the meaningful term "references" because these data are the summarized references per article over a wiki.
* Reports are the only files which should normally be opened by end users. We'll have a diverse array of reports in here; formats should be user-friendly, ...
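As a sketch of the snapshot-comparison point above (paths and the helper name are assumptions): because snapshots are parallel siblings, pairing the same wiki's aggregates across two runs is a directory walk, not a naming puzzle.

```python
from pathlib import Path

OUTPUT_ROOT = Path("/mnt/scraper")  # hypothetical output share, as above

def paired_aggregates(old_snapshot: str, new_snapshot: str):
    """Yield (old, new) aggregate paths for wikis present in both snapshots."""
    old_dir = OUTPUT_ROOT / "references" / old_snapshot
    new_dir = OUTPUT_ROOT / "references" / new_snapshot
    for old_path in sorted(old_dir.glob("*-references-*.ndjson.gz")):
        if "-sample" in old_path.name:
            continue  # skip sampled aggregates
        wiki = old_path.name.split("-references-")[0]
        new_path = new_dir / f"{wiki}-references-{new_snapshot}.ndjson.gz"
        if new_path.exists():
            yield old_path, new_path
```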