
Sample HTML Dumps - Request for feedback
Closed, Declined · Public

Description

Hello!

We have completed a very early form of complete HTML dumps and are looking for feedback on their quality to help us plan the right paths to improve them. We have an HTML dump for each text-based wiki project (not Commons or Wikidata), built as the first exploratory part of an API project aimed at our largest users.

I thought the best route here would be to provide access to a public drive folder that folks are free to download from and take a look. In the folder right now is Simple English Wikipedia (944 MB), which seems like a good case to look at. Please let me know which other languages you'd like and I can add them to the drive folder (except English Wikipedia, which is absolutely massive...).

Things I'm curious to learn:

  • Is the HTML complete/adequate for use at programmatic scale?
  • Is the file structure clear and easy to work with?
  • Is there content missing here that you would like to see?
  • Is there extraneous content? Is there content that is not useful?

I will add more as I think of them. Let me know your thoughts - below I'm going to add some feedback I've received already to kick off the conversation. Thanks @Isaac!

Thanks in advance everyone,
Ryan

Event Timeline

Would someone on the team be able to add a pointer to the repo for how these are generated? That will be helpful if we are to evaluate completeness. Thanks!

A couple of quick thoughts about the format: it would be good for the articles to be written into subdirectories for the larger wikis, so that we don't have hundreds of thousands of files (or millions!) in one directory. See https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/DumpHTML/+/refs/heads/master/dumpHTML.inc#477 for how these were produced by an extension way back in 2008; I think three levels of subdirectories was the default then, but this could be made adjustable depending on the size of the wiki.
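A minimal sketch of that kind of fan-out, using a hash-prefix layout purely as an illustration (the old DumpHTML extension's exact scheme may differ, and the helper name and paths here are hypothetical):

import hashlib
from pathlib import Path

def article_path(base_dir: str, title: str, levels: int = 3) -> Path:
    # Fan article files out across nested subdirectories so that no single
    # directory ends up holding hundreds of thousands (or millions) of files.
    # The first `levels` hex digits of a hash of the title pick the
    # subdirectories, e.g. Title -> base/x/xy/xyz/Title.html.
    safe_title = title.replace(" ", "_").replace("/", "%2F")
    digest = hashlib.md5(safe_title.encode("utf-8")).hexdigest()
    parts = [digest[:i + 1] for i in range(levels)]
    return Path(base_dir, *parts, safe_title + ".html")

# Hypothetical usage inside the dump job:
path = article_path("simplewiki-html", "Apgar score")
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text("<html>...</html>", encoding="utf-8")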

Although the large tech partners that will consume these dumps will probably be fine with one large gz tarball, we want these to be easily usable by volunteers and researchers too, so I'd consider also providing them in a format that makes parallel processing of the dumps possible, such as the bz2 multistream format with 100 or 1000 pages per 'stream', maybe without any tarring at all. It might also be nice to have a closing </html> tag; the sample articles I looked at didn't have one.
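On the writing side, the multistream idea is roughly the following (a sketch, assuming a hypothetical iterable of (title, html) pairs; the stream size and index format are illustrative, not a spec):

import bz2

def write_multistream(pages, out_path, pages_per_stream=1000):
    # Write pages as concatenated but independently decompressible bz2 streams:
    # each group of N pages becomes its own stream, so a reader can seek to a
    # stream's byte offset and decompress just that chunk, in parallel with others.
    # Returns (byte_offset, first_title) entries for a separate index file.
    index = []
    with open(out_path, "wb") as out:
        buf, first_title = [], None
        for i, (title, html) in enumerate(pages, start=1):
            if first_title is None:
                first_title = title
            buf.append(html)
            if i % pages_per_stream == 0:
                index.append((out.tell(), first_title))
                out.write(bz2.compress("".join(buf).encode("utf-8")))
                buf, first_title = [], None
        if buf:  # final, partially filled stream
            index.append((out.tell(), first_title))
            out.write(bz2.compress("".join(buf).encode("utf-8")))
    return index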

As an example, here is our dump of metadata for all history and how we structure it to solve the problem of "enwiki is massive while other wikis are tiny": https://dumps.wikimedia.org/other/mediawiki_history/readme.html (tl;dr: we applied a flexible grouping by time and size of wiki in a way that kept the paths formulaic enough for machines to wildcard and make use of).

Interesting - so the bz2 multistream is actually a solution for parallel processing, not just for download reliability? We'll look into this... I think regardless of the larger tech orgs' use case, ease of ingesting this data is the top priority for us. English Wikipedia has 15M articles (I believe), so with 1000 pages per stream that would be 15k streams? The file structure suggestions make sense; we're attacking that next over here. Will post updates.

IIUC, are you able to dump by month because each dump contains just the new revisions, or is it a full dump of MediaWiki history? Since we are just trying to keep the most current "state" of every wiki (with the best last revision of every article) as our output, I think it might be tough to follow a time cutoff, since the long tail of content that hasn't been edited in years will be part of every dump. Maybe I'm misunderstanding, though... helpful nonetheless; we need to think about how to break our stuff up along similar axes.

@Milimetric @ArielGlenn - can I ask you about compression? One thing we're wrestling with is file size. While I think we can split things up better to make them more digestible, a full enwiki dump is clocking in at 944 GB or something insanely large. We're approaching cleaning the HTML and the file structure, but I'm curious how you've successfully worked around this. For instance, the full enwiki XML dump is something like 17 GB, if I'm not mistaken? @R.zhurba - can you elaborate on what we're doing exactly here?

Would someone on the team be able to add a pointer to the repo for how these are generated? That will be helpful if we are to evaluate completeness. Thanks!

Also...working on this! Will post here when ready.

English Wiki has 15m articles (I believe)
a full enwiki dump is clocking in at 944gb or something insanely large

I'm pretty sure a large part of this issue comes down to how you handle redirects rather than the compression format. Enwiki has 9.3M redirects. Right now the HTML of an article is fully reproduced for a redirect (i.e. not just "redirect to [[article]]" but the full text of that article as the reader would see it). English Wikipedia has just over 6M articles in the classic sense, so reproducing the full article text in the redirects is probably what explodes it to 15M full articles and a very large file (as opposed to 6M full articles plus ~9M very tiny files that just indicate they are redirects).

@RBrounley_WMF The 15k streams are concatenated together into one file, but it's easy to look for start of bz2 file markers (since they are byte aligned) and process multiple streams at once if you want your tools to do that.
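A rough sketch of that marker scan, assuming level-9 bz2 compression (so each stream starts with the ten bytes b"BZh91AY&SY"); a careful reader would guard against those bytes occurring by chance inside compressed data, or simply rely on a published index of stream offsets:

import bz2

MAGIC = b"BZh91AY&SY"  # stream header + first block magic, assuming level-9 compression

def stream_offsets(path):
    # Find the byte-aligned start of every bz2 stream so workers can each be
    # handed a (start, end) range and decompress it independently.
    data = open(path, "rb").read()  # fine for a sample; use mmap/chunks for huge dumps
    offsets, pos = [], 0
    while True:
        pos = data.find(MAGIC, pos)
        if pos == -1:
            break
        offsets.append(pos)
        pos += len(MAGIC)
    return offsets

def read_stream(path, start, end=None):
    # Decompress the single bz2 stream that begins at byte `start`.
    with open(path, "rb") as f:
        f.seek(start)
        chunk = f.read() if end is None else f.read(end - start)
    return bz2.BZ2Decompressor().decompress(chunk)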

We don't use gz compression for text output; that's just not going to cut it for large wikis. You could provide bz2 multistreams and much smaller files for folks not concerned with fast processing. There might be other compression algorithms people would prefer as well.

@ArielGlenn regarding HTML cleaning, I'm thinking about removing attributes that don't hold information, such as styles and classes.
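For what it's worth, that kind of attribute stripping is only a few lines; the sketch below uses lxml purely as an example of the idea (the pipeline's actual tooling isn't specified here), and note the caution in the next comment that classes, IDs and styling do carry signal for information extraction:

from lxml import html

def strip_presentational_attrs(doc_html, attrs=("style", "class")):
    # Drop the listed attributes from every element in the document.
    # Per the discussion below, classes/IDs are used as features by many
    # information-extraction pipelines, so this may be the wrong trade-off.
    root = html.fromstring(doc_html)
    for el in root.iter("*"):
        for attr in attrs:
            el.attrib.pop(attr, None)
    return html.tostring(root, encoding="unicode")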

@RBrounley_WMF:
+1 on publishing the dataset as a small number of large splittable files compressed with a splittable format. It helps with both downloading and distributed data processing.

@R.zhurba:
I'm not sure what the specific use cases for this work/dataset are. However, if you are "cleaning up" the HTML content and removing attributes such as styles and classes, you are actually removing a lot of useful information typically used for direct display and information extraction. Without them, this dataset won't be as useful.

Thanks @Isaac, this is very useful and you're totally right. Does anybody know the best way to discover redirects - as in, is there a directly queryable way to access them, or a list somewhere? This might be something we need to build into our service. @ArielGlenn, how have you approached this in the past? I believe we are grabbing all of the URLs based on the dumps.

@Nicolastorzec, we're going to hold off on messing with styles for now and see how much fixing this redirect issue shrinks the file size... Out of curiosity, what do you see as the main uses of styling for information extraction? Just for my understanding as we plan here.

@RBrounley_WMF Page content for redirects consists of the wikitext that contains the directive for the redirect. For example,

#REDIRECT [[Apgar score]]</text>

which can be seen in the output of the page content dumps for enwiki from this month, so that's how we deal with it. Here's the full entry for the example revision:

<revision>
  <id>17510236</id>
  <timestamp>2005-05-11T21:53:06Z</timestamp>
  <contributor>
    <username>Eugene van der Pijll</username>
    <id>22016</id>
  </contributor>
  <comment>EB2004</comment>
  <model>wikitext</model>
  <format>text/x-wiki</format>
  <text bytes="25" xml:space="preserve">#REDIRECT [[Apgar score]]</text>
  <sha1>l5sah1snscdu1y39tqy1i3ohgocfe0k</sha1>
</revision>

You could produce placeholders for those indicating that they are redirects, or omit them and provide a separate list of redirect titles along with the HTML dumps.
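A minimal sketch of the placeholder / separate-list approach, assuming the dump job can see each page's wikitext; the regex is a simplification of MediaWiki's real redirect syntax (which also accepts localized keywords), and the writer hooks are hypothetical:

import re

# Simplified: MediaWiki also accepts localized redirect keywords and is more
# lenient about whitespace and casing than this pattern.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)

def redirect_target(wikitext):
    # Return the redirect target title if the page is a redirect, else None.
    m = REDIRECT_RE.match(wikitext)
    return m.group(1).strip().replace("_", " ") if m else None

def handle_page(title, wikitext, html, redirect_list, write_html):
    # Write full HTML only for real articles; record redirects in a separate list
    # (e.g. a redirects.tsv shipped alongside the HTML dump).
    target = redirect_target(wikitext)
    if target is not None:
        redirect_list.append((title, target))
    else:
        write_html(title, html)  # hypothetical writer from the dump job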

We dump the redirects table twice a month along with all other public tables; you can find it at https://dumps.wikimedia.org/enwiki/20200701/enwiki-20200701-redirect.sql.gz for example. The table's schema is described at https://www.mediawiki.org/wiki/Manual:Redirect_table. You would need the page id of the title whose content is being retrieved in order to check titles against the redirect table entries.
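A rough sketch of consuming that redirect table dump, mapping rd_from page ids to target titles; the tuple regex ignores NULLs, escaping corner cases and the remaining columns, so treat it as an illustration rather than a robust SQL parser:

import gzip
import re

# Matches the leading (rd_from, rd_namespace, 'rd_title' ...) of each tuple in the
# INSERT statements of enwiki-YYYYMMDD-redirect.sql.gz.
TUPLE_RE = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)'")

def load_redirects(sql_gz_path):
    # Build a map: source page id -> (target namespace, target title).
    redirects = {}
    with gzip.open(sql_gz_path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if not line.startswith("INSERT INTO"):
                continue
            for rd_from, ns, title in TUPLE_RE.findall(line):
                redirects[int(rd_from)] = (int(ns), title.replace("_", " "))
    return redirects

# To check a given title you also need its page id (from page.sql.gz or the API);
# then `page_id in redirects` tells you it is a redirect, and the value is its target.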

@RBrounley_WMF:

RE redirects - We don't reuse the redirect tables that @ArielGlenn dumps every few weeks because we need more up-to-date data, but we do something similar to what he described. We typically parse article pages for redirect templates, and add the information we extract about the redirect pages to the pages they redirect to.

RE information extraction - As you probably know, many organizations use machine learning techniques to automatically extract information about named entities and common topics from Wikipedia in various languages. Beyond the actual textual content of the pages, most of these techniques also explicitly or implicitly use the relative position of the elements being extracted on the page, and the metadata associated with them (incl. the HTML and CSS properties related to classes, IDs, and styling), as features.

I had a chance to look at the code a little, thanks for making it available!

I note that it currently skips titles that are redirects. Perhaps you would want to include a list of titles and what they redirect to, separately, rather than omitting them altogether?

In the event handler code, I don't see anything that deals with oversighted or hidden revisions. I'm not sure if it's guaranteed that a page will be re-rendered after oversight or hiding of the current revision. Maybe we need to have a closer look at this. Typically revisions are hidden or oversighted because of egregious copyright violations or really offensive material relating to living persons, see https://en.wikipedia.org/wiki/Wikipedia:Revision_deletion#Criteria_for_redaction so we definitely want to catch those. I hope that these are visible via the recent changes event stream; maybe someone who knows better can weigh in on that.

I hope that these are visible via the recent changes event stream; maybe someone who knows better can weigh in on that.

I don't think so. Sometimes the very fact that a revision was hidden is itself privacy sensitive, so we don't expose these. We have internal page-restrictions-change and page-suppress (which reuses the page/delete schema IIRC) streams, but we don't expose these externally.

Split the oversighted-revisions conversation into T262479 to continue it there.

Aklapper removed a subscriber: Fjalapeno.

Removing task assignee due to inactivity as this open task has been assigned for more than two years. See the email sent to the task assignee on August 22nd, 2022.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome!
If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator. Thanks!