There are several categories of files and data that we'll handle in this scraper (see the filename-parsing sketch after the table):
| **Name** | **Container format(s)** | **Content format** | **Example filename** | **Description** |
| HTML dump (tarball) | tar + gzip + split | JSON lines, [[ https://www.mediawiki.org/wiki/Specs/HTML/2.7.0 | Parsoid-RDFa ]] | hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz | These are the canonical upstream data source. They contain all the articles from one wiki for a single namespace (we only deal with the Main namespace, NS 0). |
| Uncompressed dump | | JSON lines | hawiki_0.ndjson | Same as the tarball. We may manually decompress these files when debugging but they aren't a normal part of the pipeline. |
| Intermediate summary | | JSON lines | hawiki-summary.jsonlines | The output of HtmlPageParser is a map of summary statistics for each page in a dump. |
| Sample | | | hawiki-sample100.ndjson | We may take smaller samples of each file type while prototyping the processor. |
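To make the naming concrete, here is a minimal parsing sketch in Python. It assumes the tarball filenames follow the pattern shown in the example above (`hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz`); the regex, the `DumpName` dataclass, and `parse_dump_name` are hypothetical names for illustration, not existing pipeline code.

```
import re
from dataclasses import dataclass

# Assumed filename pattern, based only on the example in the table above,
# e.g. "hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz".
DUMP_NAME_RE = re.compile(
    r"^(?P<wiki>[a-z_]+)-NS(?P<namespace>\d+)-(?P<snapshot>\d{8})"
    r"-ENTERPRISE-HTML\.json\.tar\.gz$"
)

@dataclass
class DumpName:
    wiki: str        # database name, e.g. "hawiki"
    namespace: int   # namespace number, e.g. 0 for the Main namespace
    snapshot: str    # snapshot date as YYYYMMDD

def parse_dump_name(filename: str) -> DumpName:
    """Split an Enterprise HTML dump filename into its organizing dimensions."""
    match = DUMP_NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not an Enterprise HTML dump filename: {filename}")
    return DumpName(
        wiki=match.group("wiki"),
        namespace=int(match.group("namespace")),
        snapshot=match.group("snapshot"),
    )

# Example:
# parse_dump_name("hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz")
# -> DumpName(wiki="hawiki", namespace=0, snapshot="20230320")
```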
There are several dimensions of organization to consider for each file type (a path-layout sketch follows the table).
| **Name** | **Example value** | **Description** |
| snapshot date | 20230320 | Dump snapshot timestamp. This is the top-level organizing key when downloading dumps. |
| wiki | hawiki | Database name of a wiki. |
| file type | tarball / summary / sample / ... | The contents and purpose of a file. |
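As a starting point for that discussion, here is a minimal sketch of one possible on-disk layout, keyed by the dimensions above: snapshot date first (the top-level organizing key when downloading dumps), then wiki, then file type. The `data_path` helper and the directory order are assumptions for illustration, not the agreed convention.

```
from pathlib import Path

def data_path(root: Path, snapshot: str, wiki: str, file_type: str, filename: str) -> Path:
    """Build a path of the form <root>/<snapshot>/<wiki>/<file type>/<filename>."""
    return root / snapshot / wiki / file_type / filename

# Example:
# data_path(Path("data"), "20230320", "hawiki", "tarball",
#           "hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz")
# -> data/20230320/hawiki/tarball/hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz
```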
This task is complete when we agree on a directory structure and file-naming convention.