
Create one-off HTML dataset for "Who are moderators" SDS 1.2.3
Closed, Resolved · Public

Description

Research Engineering to create a one-off HTML diff dataset:

  • for 6 larger wikis (including enwiki), covering 1 month of revisions (~10-15M html diffs)
  • as html needs to be rendered by Parsoid, going to the API for older revisions (that are not cached) is expensive (as opposed to fetching older wikitext, which is just a database lookup). Need to coordinate with SRE to ensure we do this safely.

Context: While we are waiting for a production html dataset (T380874), an urgent and strategically important need has appeared that cannot wait for that task. As a result, we now need to create a one-off HTML dataset as quickly as possible.

Event Timeline

fkaelin changed the task status from Open to In Progress.Nov 26 2024, 2:06 PM
fkaelin triaged this task as High priority.
fkaelin added a project: Research.
fkaelin moved this task from Backlog to In Progress on the Research board.

Update:

The one-off html dataset is 99% done after ~10 days of querying the API for html. Once the last batch (of 100) completes later today, the final dataset can be generated. I will update this phab with the path and some statistics, and then this ticket can be closed.

The dataset is for the month of October 2024 and the following wikis: enwiki, dewiki, frwiki, svwiki, nlwiki, ruwiki, eswiki, itwiki, plwiki, arzwiki, zhwiki, jawiki. It contains the html of all revisions created in that month, plus the html of their parent revisions. The schema is equivalent to the dumps2 wikitext history, with these additional columns:

|-- revision_html: struct (nullable = true)
|    |-- html: string (nullable = true)
|    |-- error: string (nullable = true)
|-- revision_parent_html: struct (nullable = true)
|    |-- html: string (nullable = true)
|    |-- error: string (nullable = true)

This notebook shows how to read the data, and how to parse the html using html-dumps to extract e.g. infoboxes.

The jobs have completed: the final dataset is available on HDFS at /user/fab/html_diff/oct/html_diff (size: ~1TB).

Breakdown per wiki db

+-------+-------+
|wiki_db|  count|
+-------+-------+
| svwiki| 136051|
| nlwiki| 127367|
| enwiki|4749961|
| jawiki| 392964|
| zhwiki| 333316|
| dewiki| 703302|
| ruwiki| 574340|
| itwiki| 536121|
| plwiki| 195844|
|arzwiki|  31197|
| eswiki| 594948|
| frwiki| 714685|
+-------+-------+
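As a quick sanity check on the table above, the per-wiki counts can be summed to get the total number of revision pairs in the dataset (the wiki names and counts below are copied from the breakdown; the script itself is just illustrative arithmetic):

```python
# Per-wiki revision counts, copied from the breakdown table above.
counts = {
    "svwiki": 136051, "nlwiki": 127367, "enwiki": 4749961, "jawiki": 392964,
    "zhwiki": 333316, "dewiki": 703302, "ruwiki": 574340, "itwiki": 536121,
    "plwiki": 195844, "arzwiki": 31197, "eswiki": 594948, "frwiki": 714685,
}
total = sum(counts.values())
print(total)                               # 9090096 revisions in total
print(f"{counts['enwiki'] / total:.1%}")   # enwiki's share (~52%)
```

The ~9.09M total also matches the row count of the error table below, and since each row carries both the revision html and its parent's html, this is on the order of the ~10-15M diffs estimated in the task description.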

The number of errors that occurred (after retries/back-off etc) is non-zero but very small. E.g. the largest category: for 0.8% of revisions the API returned a 404 (not found) status code, indicating that they have since been deleted. The first error column below is for the revision html, the second for the parent revision html.

+---------------+---------------+-------+
| revision_error|   parent_error|  count|
+---------------+---------------+-------+
|           null|           null|8992895|
|           null|http status 503|      1|
|http status 504|           null|    467|
|           null|http status 504|    485|
|http status 403|http status 403|   6527|
|http status 500|http status 500|      4|
|http status 500|           null|     58|
|           null|http status 500|     65|
|http status 400|http status 400|     34|
|http status 403|           null|   5962|
|           null|http status 403|   6247|
|http status 500|http status 504|     29|
|http status 404|http status 403|      2|
|http status 404|http status 404|  73264|
|           null|http status 404|   2581|
|http status 404|           null|    556|
|http status 504|http status 504|    895|
|http status 504|http status 500|     24|
+---------------+---------------+-------+
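The 0.8% figure above can be reproduced from the table by summing every row where either the revision or the parent fetch returned a 404 (the tuples below are the table rows, with null rendered as None and status codes abbreviated to the numeric code):

```python
# Error table rows: (revision_error, parent_error, count); None = success.
rows = [
    (None, None, 8992895),
    (None, "503", 1),     ("504", None, 467),  (None, "504", 485),
    ("403", "403", 6527), ("500", "500", 4),   ("500", None, 58),
    (None, "500", 65),    ("400", "400", 34),  ("403", None, 5962),
    (None, "403", 6247),  ("500", "504", 29),  ("404", "403", 2),
    ("404", "404", 73264), (None, "404", 2581), ("404", None, 556),
    ("504", "504", 895),  ("504", "500", 24),
]
total = sum(c for _, _, c in rows)
not_found = sum(c for rev, par, c in rows if rev == "404" or par == "404")
print(not_found)                       # 76403 rows touched by a 404
print(f"{not_found / total:.2%}")      # 0.84%, i.e. the ~0.8% quoted above
```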