There are several categories of files and data that we'll handle in this scraper:
| Name | Container format(s) | Content format | Example filename | Description |
| --- | --- | --- | --- | --- |
| HTML dump (tarball) | tar + gzip + split | JSON lines, Parsoid-RDFa | hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz | The canonical upstream data source. Each contains all the articles from one wiki for one namespace (we only deal with the Main namespace, NS 0). |
| Uncompressed dump | none | JSON lines | hawiki_0.ndjson | Same content as the tarball. We may manually decompress these files when debugging, but they aren't a normal part of the pipeline. |
| Intermediate summary | gzip (TODO) | JSON lines | hawiki-references-20230320.ndjson | The output of HtmlPageParser: a map of summary statistics for each page in a dump. |
| Sample | any | any | hawiki-sample100.ndjson | Smaller samples of each file type, which we may take while prototyping the processor. |
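To make the dump format concrete, here is a minimal sketch of streaming page records out of one tarball, assuming Python's standard `tarfile` and `json` modules; the helper name `iter_dump_pages` is hypothetical and not part of the scraper:

```python
import json
import tarfile


def iter_dump_pages(dump_path):
    """Yield one JSON object per article line from an Enterprise HTML dump tarball.

    Hypothetical helper: assumes each tar member is a newline-delimited JSON file.
    Split multi-part dumps would need to be concatenated before opening.
    """
    with tarfile.open(dump_path, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            with tar.extractfile(member) as ndjson_stream:
                for line in ndjson_stream:
                    yield json.loads(line)


# Example: count Main-namespace pages in one wiki's dump.
print(sum(1 for _ in iter_dump_pages("hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz")))
```

Streaming the tar members line by line avoids extracting the whole dump to disk, which matters for the larger wikis.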
There are several dimensions of organization to consider for each file type.
| Name | Example value | Description |
| --- | --- | --- |
| snapshot date | 20230320 | Date the dump run completed, shared across all wikis in the job (so effectively the maximum of the revision timestamps). This is the top-level organizing key when downloading dumps. |
| wiki | hawiki | Database name of a wiki. |
| file type | tarball / summary / sample / ... | The contents and purpose of a file. |
| aggregation scale | page / wiki | What level of detail does one line of the file correspond to? |
| pipeline stage | raw input / intermediate result / final output | Note that the intermediate files need a more specific name. |
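For concreteness, these dimensions could travel through the pipeline as one small value object instead of loose strings; this is only a sketch, and the name `FileCoordinates` is an assumption rather than an existing class:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileCoordinates:
    """Hypothetical bundle of the organizational dimensions above."""

    snapshot: str                        # e.g. "20230320"
    wiki: str                            # e.g. "hawiki"
    file_type: str                       # "tarball" / "summary" / "sample" / ...
    aggregation: str                     # "page" or "wiki"
    stage: str                           # "raw input" / "intermediate result" / "final output"
    sample_count: Optional[int] = None   # set only for sampled files
```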
This task is complete when we agree on a directory structure and file naming convention.
Iterating on filesystem organization is annoying and error-prone, so the first draft should already be close to final.
Assume that the filesystem is mounted over an NFS share, possibly even separate shares for inputs and outputs. The input share will probably have an existing directory structure that should be hardcoded into the scraper.
Draft proposal to edit and extend:
    inputs/
        <snapshot>/
            <wiki>-NS0-<snapshot>-ENTERPRISE-HTML.json.tar.gz
    aggregates/
        <snapshot>/
            <wiki>-references-<snapshot>.ndjson.gz
    reports/
        <snapshot>/
            all-wikis-references-statistics-<snapshot>.csv
            all-wikis-references-templates-<snapshot>.csv
    sampled/
        inputs/
            <snapshot>/
                <wiki>-NS0-<snapshot>-ENTERPRISE-HTML-sample<count>.ndjson.gz
        aggregates/
            <snapshot>/
                <wiki>-references-<snapshot>-sample<count>.ndjson.gz
        reports/
            <snapshot>/
                all-wikis-references-statistics-<snapshot>-sample<count>.csv
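One way to keep the convention in a single place would be a path-building helper that encodes the draft layout; this is a hedged sketch, where `aggregate_path` and the `root` mount point are assumptions rather than existing code:

```python
from pathlib import Path
from typing import Optional


def aggregate_path(root: Path, snapshot: str, wiki: str,
                   sample_count: Optional[int] = None) -> Path:
    """Return the intermediate-summary path under the draft layout above.

    `root` is the output share's mount point; sampled files land under the
    mirrored sampled/ tree with a -sample<count> suffix.
    """
    stem = f"{wiki}-references-{snapshot}"
    if sample_count is None:
        return root / "aggregates" / snapshot / f"{stem}.ndjson.gz"
    return root / "sampled" / "aggregates" / snapshot / f"{stem}-sample{sample_count}.ndjson.gz"


# aggregate_path(Path("/mnt/out"), "20230320", "hawiki")
#   -> /mnt/out/aggregates/20230320/hawiki-references-20230320.ndjson.gz
# aggregate_path(Path("/mnt/out"), "20230320", "hawiki", sample_count=100)
#   -> /mnt/out/sampled/aggregates/20230320/hawiki-references-20230320-sample100.ndjson.gz
```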
Reasoning:
- Inputs are streamed from a dedicated, read-only dumps source (fileshare or web), and have their own predetermined structure which we should treat as opaque here.
- Inputs vs. outputs must be the highest-level partitioning because it crosses filesystem boundaries.
- Snapshots are siblings because more than one snapshot set can be present.
- Aggregate file names should include the meaningful term "references" because these data are the summarized references per article across a wiki.
- Reports are the only files which should normally be opened by end users. We'll have a diverse array of reports in here, so formats should be user-friendly.
- File naming is verbose enough to convey meaning even out-of-context.
- I went back and forth about "sampled". Keeping it as a top-level partition is safe and clear; reproducing the whole tree underneath it is a bit odd, but I think of it as a bit of a scratch directory compared to the others.