There are several categories of files and data that we'll handle in this scraper (see the filename-parsing sketch after the table):
| **Name** | **Container format(s)** | **Content format** | **Example filename** | **Description** |
| HTML dump (tarball) | tar + gzip + split | JSON lines, [[ https://www.mediawiki.org/wiki/Specs/HTML/2.7.0 | Parsoid-RDFa ]] | hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz | These are the canonical upstream data source. They contain all the articles from one wiki for a single namespace (we only deal with the Main namespace, NS 0). |
| Uncompressed dump | | JSON lines | hawiki_0.ndjson | Same as the tarball. We may manually decompress these files when debugging but they aren't a normal part of the pipeline. |
| Intermediate summary | | JSON lines | hawiki-summary.jsonlines | The output of HtmlPageParser is a map of summary statistics for each page in a dump. |
| Sample | | | hawiki-sample100.ndjson | We may take smaller samples of each file type while prototyping the processor. |
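To make the naming concrete, here is a minimal parsing sketch in Python. It assumes the tarball filenames follow the pattern shown in the example above (`hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz`); the regex, the `DumpName` dataclass, and `parse_dump_name` are hypothetical names for illustration, not existing pipeline code.

```
import re
from dataclasses import dataclass

# Assumed filename pattern, based only on the example in the table above,
# e.g. "hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz".
DUMP_NAME_RE = re.compile(
    r"^(?P<wiki>[a-z_]+)-NS(?P<namespace>\d+)-(?P<snapshot>\d{8})"
    r"-ENTERPRISE-HTML\.json\.tar\.gz$"
)

@dataclass
class DumpName:
    wiki: str        # database name, e.g. "hawiki"
    namespace: int   # namespace number, e.g. 0 for the Main namespace
    snapshot: str    # snapshot date as YYYYMMDD

def parse_dump_name(filename: str) -> DumpName:
    """Split an Enterprise HTML dump filename into its organizing dimensions."""
    match = DUMP_NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not an Enterprise HTML dump filename: {filename}")
    return DumpName(
        wiki=match.group("wiki"),
        namespace=int(match.group("namespace")),
        snapshot=match.group("snapshot"),
    )

# Example:
# parse_dump_name("hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz")
# -> DumpName(wiki="hawiki", namespace=0, snapshot="20230320")
```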
There are several dimensions of organization to consider for each file type (a path-layout sketch follows the table).
| **Name** | **Example value** | **Description** |
| snapshot date | 20230320 | Dump snapshot timestamp. This is the top-level organizing key when downloading dumps. |
| wiki | hawiki | Database name of a wiki. |
| file type | tarball / summary / sample / ... | The contents and purpose of a file. |
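As a starting point for that discussion, here is a minimal sketch of one possible on-disk layout, keyed by the dimensions above: snapshot date first (the top-level organizing key when downloading dumps), then wiki, then file type. The `data_path` helper and the directory order are assumptions for illustration, not the agreed convention.

```
from pathlib import Path

def data_path(root: Path, snapshot: str, wiki: str, file_type: str, filename: str) -> Path:
    """Build a path of the form <root>/<snapshot>/<wiki>/<file type>/<filename>."""
    return root / snapshot / wiki / file_type / filename

# Example:
# data_path(Path("data"), "20230320", "hawiki", "tarball",
#           "hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz")
# -> data/20230320/hawiki/tarball/hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz
```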
This task is complete when we agree on a directory structure and file-naming convention.