There are several categories of files and data that we'll handle in this scraper:
| **Name** | **Container format(s)** | **Content format** | **Example filename** | **Description** |
| HTML dump (tarball) | tar + gzip + split | JSON lines, [[ https://www.mediawiki.org/wiki/Specs/HTML/2.7.0 | Parsoid-RDFa ]] | hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz | These are the canonical upstream data source. They contain all the articles from one wiki, for one namespace (we only deal with the Main namespace or NS 0). |
| Uncompressed dump | none | JSON lines | hawiki_0.ndjson | Same as the tarball. We may manually decompress these files when debugging but they aren't a normal part of the pipeline. |
| Intermediate summary | gzip | JSON lines | hawiki-summary.jsonlines | The output of HtmlPageParser is a map of summary statistics for each page in a dump. |
| Sample | any | any | hawiki-sample100.ndjson | We may take smaller samples of each file type while prototyping the processor. |
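As a rough illustration of how the tarball format could be consumed, here is a minimal Python sketch using only the standard library. It assumes each tar member is an NDJSON file of one page record per line; the `article_body` key is an assumption about the dump schema, and split archives would need their parts concatenated first.

```
import json
import tarfile

def iter_dump_pages(dump_path):
    """Stream page records out of an Enterprise HTML dump tarball.

    Each tar member is assumed to be an NDJSON file where every line is
    one page; the Parsoid HTML is assumed to live under "article_body".
    """
    # Mode "r|gz" reads the archive as a stream, so the tarball never
    # has to be fully decompressed to disk.
    with tarfile.open(dump_path, mode="r|gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            with tar.extractfile(member) as lines:
                for line in lines:
                    yield json.loads(line)

for page in iter_dump_pages("hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz"):
    ...  # hand each page record to HtmlPageParser
```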
There are several dimensions of organization to consider for each file type.
| **Name** | **Example value** | **Description** |
| snapshot date | 20230320 | Date the dump run completed, shared across all wikis in the job (so it is an upper bound on the included revision timestamps). This is the top-level organizing key when downloading dumps. |
| wiki | hawiki | Database name of a wiki. |
| file type | tarball / summary / sample / ... | The contents and purpose of a file. |
| aggregation scale | page / wiki | The level of detail that one line of data corresponds to. |
| pipeline stage | raw input / intermediate result / final output | Note that the intermediate files need a more specific name. |
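To make these dimensions concrete, here is a hypothetical helper that recovers the wiki and snapshot date from an upstream dump filename. The class name and regex are illustrative assumptions, not an agreed convention.

```
import re
from typing import NamedTuple

class DumpName(NamedTuple):
    wiki: str       # e.g. "hawiki"
    snapshot: str   # e.g. "20230320"

# Matches names like "hawiki-NS0-20230320-ENTERPRISE-HTML.json.tar.gz".
DUMP_NAME_RE = re.compile(
    r"^(?P<wiki>[a-z_]+)-NS0-(?P<snapshot>\d{8})-ENTERPRISE-HTML\.json\.tar\.gz$"
)

def parse_dump_name(filename: str) -> DumpName:
    match = DUMP_NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected dump filename: {filename!r}")
    return DumpName(wiki=match["wiki"], snapshot=match["snapshot"])
```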
This task is complete when we agree on a directory structure and file naming convention.
Iterating on filesystem organization is annoying and error-prone, so the first draft should already be close to final.
Assume that the filesystem is mounted over an NFS share, possibly with separate shares for inputs and outputs.
---
Draft proposal to edit and extend:
```
inputs/
    <snapshot>/
        <wiki>-NS0-<snapshot>-ENTERPRISE-HTML.json.tar.gz
        <wiki>-NS0-<snapshot>-ENTERPRISE-HTML-sample<count>.json.tar.gz
references/
    <wiki>-references.ndjson.gz
    <wiki>-references-sample<count>.ndjson.gz
reports/
    reference-summary-<snapshot>.csv
```
Reasoning:
* Inputs are streamed from a dedicated, read-only dumps source (fileshare or web), while outputs are persisted to another store, so this is the highest-level partitioning.
* We'll follow up with a second run on a later snapshot in 1-2 years. Snapshot results will be compared, so they are parallel siblings and a natural partition. This is also the top-level partitioning on the dumps server.
* Analysing references will be just one application of the dump scraper. This directory also gives a descriptive name to our "intermediate files" and hints at what each one is: a list of summarized references per article, over a whole wiki.
* Reports are the only files which should normally be opened by end users. We'll have a diverse array of reports in here.
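To sanity-check the convention, here is a sketch of path helpers that would follow the draft layout above. The mount points and function names are illustrative assumptions, not part of the proposal.

```
from pathlib import Path

# Illustrative mount points only; per the reasoning above, inputs would
# sit on a read-only dumps share and outputs on a separate writable share.
INPUTS_ROOT = Path("/mnt/dumps/inputs")
OUTPUTS_ROOT = Path("/mnt/scraper")

def dump_tarball(snapshot: str, wiki: str) -> Path:
    """inputs/<snapshot>/<wiki>-NS0-<snapshot>-ENTERPRISE-HTML.json.tar.gz"""
    return INPUTS_ROOT / snapshot / f"{wiki}-NS0-{snapshot}-ENTERPRISE-HTML.json.tar.gz"

def references_file(wiki: str) -> Path:
    """references/<wiki>-references.ndjson.gz"""
    return OUTPUTS_ROOT / "references" / f"{wiki}-references.ndjson.gz"

def report_file(snapshot: str) -> Path:
    """reports/reference-summary-<snapshot>.csv"""
    return OUTPUTS_ROOT / "reports" / f"reference-summary-{snapshot}.csv"
```

If we later decide that outputs should also be partitioned by snapshot, only helpers like these would need to change.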