
set up automated HTML (restbase) dumps on francium
Open, Normal, Public

Description

This involves:

  • setting up a script on francium to dump a single wiki to a specified directory (ns0 current revision only, but these should be configurable)
  • setting up a script, also on francium, to loop through all wikis and dump them, doing cleanup of old files, generating pages with links for downloaders, etc.
  • adding a cron job to automate the run

This task does not cover dumping of namespaces other than the main ns, nor dealing with revision history. If/when desired, that should be a new task.

Related tasks:
T93396 Decide on format options for HTML and possibly other dumps
T93113 deploy francium for html/zim dumps
T97125 Determine service infra for HTML dumps
T17017 Wikimedia static HTML dumps broken (this task started out about the HTML dump extension but discussion veered off to dumps from Restbase)

Event Timeline

ArielGlenn moved this task from Up Next to Active on the Dumps-Generation board. Jul 18 2016, 8:07 AM

And 'this week' turned into several weeks, due to contract issues. At any rate, here we are, back in 'this week' again :-)

I think they've switched over to https-only for blah.wikblah.org/w/api.php; does that ring any bells?

Never mind, I'll use this as a learning opportunity. Current status: working on getting manual testing set up on francium.

I've likely set up the dependencies all wrong, etc., but the important point is that after fixing up the url for content retrieval, a test run seems to be going smoothly. I'll have a look at the redirect issue once that's done, and after that I'll ask around about https. I think we're getting redirected to use https; perhaps I want to change that right away in the code.

After playing for a while, a couple of comments:

./bin/dump_wiki --domain el.wikinews.org --ns 0 --apiURL http://el.wikinews.org/w/api.php --dataBase /srv/test/el.wikinews.org.articles.ns0.sqlite3

This worked but produced the following error message at the end:

{ cause: { [Error: SQLITE_BUSY: unable to close due to unfinalized statements or unfinished backups] errno: 5, code: 'SQLITE_BUSY' },
  isOperational: true,
  errno: 5,
  code: 'SQLITE_BUSY' }

Any chance the sqlite db would not have been closed properly? Can this message be ignored (and if so perhaps it should not be displayed)?

/usr/bin/nodejs ./bin/dump_wiki --domain el.wikinews.org --ns 0 --apiURL http://el.wikinews.org/w/api.php --database /srv/test/el.wikinews.org.articles.ns0.sqlite3

This appeared to dump articles but wrote them nowhere, as far as I can tell. The script ought to whine rather than do that.

I've got fixups for those, and I'll be adding some more error checks soon.

However, I have just come across a roadblock. This setup assumes we will re-use a database once it has been created, to save a lot of time. That's a great move. However... articles, once dumped, are never deleted, even when the page itself is deleted from the wiki and is no longer available to the public. I've just done a check of this on live data to be sure.

That must be fixed before we can provide these dumps on a regular basis. The design of the dump seems such that it's not a quickie fix, or at least, I'm not sure what the best approach would be. Can I get your input, @GWicke ?

Right, the way the dump works is that it retrieves the list of titles from the MW API, so if a page gets deleted between two dump runs, the old version is left in the DB. Perhaps a way to fix this would be to mark the pages in the DB that were returned by the MW API, and then check individually the ones found in storage, but not returned by MW. This approach is rather time-consuming, though.
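For illustration, a minimal Python sketch of that "mark, then check" idea; the `pages` table and its `title` column are hypothetical stand-ins, since the actual htmldumper sqlite schema isn't spelled out in this thread.

```python
# Minimal sketch of the "mark, then check" idea. The `pages` table and
# `title` column are hypothetical; the real htmldumper schema may differ.
import sqlite3

def find_and_prune_stale(db_path, titles_from_api):
    """Record the titles the MW API just returned, find stored pages the API
    no longer lists, and delete them. A real run would re-check each
    candidate against the API individually before deleting it."""
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute("CREATE TEMP TABLE current_titles (title TEXT PRIMARY KEY)")
    cur.executemany("INSERT OR IGNORE INTO current_titles (title) VALUES (?)",
                    ((t,) for t in titles_from_api))
    cur.execute("SELECT title FROM pages "
                "WHERE title NOT IN (SELECT title FROM current_titles)")
    stale = [row[0] for row in cur.fetchall()]
    cur.executemany("DELETE FROM pages WHERE title = ?", ((t,) for t in stale))
    conn.commit()
    conn.close()
    return stale
```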

ArielGlenn added a comment. Edited Aug 4 2016, 2:12 PM

In the meantime I'll leave here a couple more errors I encountered while doing a dump of wikimania2014.wm.o (a closed wiki; it's fine that it's listed in the restbase domains though, since our policy is to continue to dump closed wikis as long as they don't use too many resources).

Error in htmldumper: Friendly_Space_Policy 46566 { name: 'HTTPError',
message: '504: internal_http_error',
status: 504,
body: 
 { type: 'internal_http_error',
   description: 'Error: Exceeded maxRedirects. Probably stuck in a redirect loop https://wikimania2014.wikimedia.org/api/rest_v1/page/html/Friendly_Space_Policy',
   error: 
    { cause: [Error: Exceeded maxRedirects. Probably stuck in a redirect loop https://wikimania2014.wikimedia.org/api/rest_v1/page/html/Friendly_Space_Policy],
      isOperational: true },
   stack: 'Error: Exceeded maxRedirects. Probably stuck in a redirect loop https://wikimania2014.wikimedia.org/api/rest_v1/page/html/Friendly_Space_Policy\n    at Redirect.onResponse (/home/ariel/htmldumper/node_modules/preq/node_modules/request/lib/redirect.js:94:27)\n    at Request.onRequestResponse (/home/ariel/htmldumper/node_modules/preq/node_modules/request/request.js:897:22)\n    at ClientRequest.EventEmitter.emit (events.js:95:17)\n    at HTTPParser.parserOnIncomingClient [as onIncoming] (http.js:1688:21)\n    at HTTPParser.parserOnHeadersComplete [as onHeadersComplete] (http.js:121:23)\n    at CleartextStream.socketOnData [as ondata] (http.js:1583:20)\n    at CleartextStream.read [as _read] (tls.js:511:12)\n    at CleartextStream.Readable.read (_stream_readable.js:320:10)\n    at EncryptedStream.write [as _write] (tls.js:366:25)\n    at doWrite (_stream_writable.js:223:10)\n    at writeOrBuffer (_stream_writable.js:213:5)\n    at EncryptedStream.Writable.write (_stream_writable.js:180:11)\n    at write (_stream_readable.js:583:24)\n    at flow (_stream_readable.js:592:7)\n    at Socket.pipeOnReadable (_stream_readable.js:624:5)\n    at Socket.EventEmitter.emit (events.js:92:17)\n    at emitReadable_ (_stream_readable.js:408:10)\n    at emitReadable (_stream_readable.js:404:5)',
   uri: 'http://wikimania2014.wikimedia.org/api/rest_v1/page/html/Friendly_Space_Policy/46566',
   method: 'get' } }

Dumping Wikimania/yi 51065   
Error in htmldumper: Wikimania/Digital_Collaboration 44825 { name: 'HTTPError',
message: '404',
status: 404,
headers:
 { date: 'Thu, 04 Aug 2016 14:06:53 GMT',
   'content-type': 'application/problem+json',
   connection: 'keep-alive',
   'access-control-allow-origin': '*',
   'access-control-allow-methods': 'GET',
   'access-control-allow-headers': 'accept, content-type',
   'access-control-expose-headers': 'etag',
   'cache-control': 'private, max-age=0, s-maxage=0, must-revalidate',
   'x-content-type-options': 'nosniff',
   'x-frame-options': 'SAMEORIGIN',
   'x-xss-protection': '1; mode=block',
   'content-security-policy': 'default-src \'none\'; frame-ancestors \'none\'',
   'x-content-security-policy': 'default-src \'none\'; frame-ancestors \'none\'',
   'x-webkit-csp': 'default-src \'none\'; frame-ancestors \'none\'',
   'x-request-id': 'b23ff252-5a4c-11e6-9905-2c6680f91574',
   'x-varnish': '2810322805, 3174603618',
   via: '1.1 varnish, 1.1 varnish',
   'accept-ranges': 'bytes',
   age: '0',
   'x-cache': 'cp1065 pass, cp1067 pass',
   'strict-transport-security': 'max-age=31536000; includeSubDomains; preload',
   'set-cookie':
    [ 'WMF-Last-Access=04-Aug-2016;Path=/;HttpOnly;secure;Expires=Mon, 05 Sep 2016 12:00:00 GMT',
      'GeoIP=:::::v4; Path=/; secure; Domain=.wikimedia.org' ],
   'x-analytics': 'https=1;nocookies=1',
   'x-client-ip': '10.64.32.168',
   'content-location': 'https://wikimania2014.wikimedia.org/api/rest_v1/page/html/Main_Page%2FSocial_Machines' },
body: <Buffer 7b 22 74 79 70 65 22 3a 22 68 74 74 70 73 3a 2f 2f 6d 65 64 69 61 77 69 6b 69 2e 6f 72 67 2f 77 69 6b 69 2f 48 79 70 65 72 53 77 69 74 63 68 2f 65 72 72 ...> }
Dumping WikiWomen's_Lunch 49009

Why would we have a 404 for a page from a closed wiki? It's not like something could have changed between getting the titles and getting the article content... Weird.

  Dumping Wikimania/es 51039   
  (node) warning: possible EventEmitter memory leak detected. 11 listeners added. Use emitter.setMaxListeners() to increase limit.
 Trace
  at Request.EventEmitter.addListener (events.js:160:15)
  at Request.init (/home/ariel/htmldumper/node_modules/preq/node_modules/request/request.js:506:8)
  at Redirect.onResponse (/home/ariel/htmldumper/node_modules/preq/node_modules/request/lib/redirect.js:148:11)
  at Request.onRequestResponse (/home/ariel/htmldumper/node_modules/preq/node_modules/request/request.js:897:22)
  at ClientRequest.EventEmitter.emit (events.js:95:17)
  at HTTPParser.parserOnIncomingClient [as onIncoming] (http.js:1688:21)
  at HTTPParser.parserOnHeadersComplete [as onHeadersComplete] (http.js:121:23)
  at CleartextStream.socketOnData [as ondata] (http.js:1583:20)
  at CleartextStream.read [as _read] (tls.js:511:12)
  at CleartextStream.Readable.read (_stream_readable.js:320:10)
  at EncryptedStream.write [as _write] (tls.js:366:25)
  at doWrite (_stream_writable.js:223:10)
  at writeOrBuffer (_stream_writable.js:213:5)
  at EncryptedStream.Writable.write (_stream_writable.js:180:11)
  at write (_stream_readable.js:583:24)
  at flow (_stream_readable.js:592:7)
  at Socket.pipeOnReadable (_stream_readable.js:624:5)
  at Socket.EventEmitter.emit (events.js:92:17)
(node) warning: possible EventEmitter memory leak detected. 11 listeners added. Use emitter.setMaxListeners() to increase limit.
 Trace
  at Request.EventEmitter.addListener (events.js:160:15)
  at Request.start (/home/ariel/htmldumper/node_modules/preq/node_modules/request/request.js:796:8)
  at Request.end (/home/ariel/htmldumper/node_modules/preq/node_modules/request/request.js:1357:10)
  at end (/home/ariel/htmldumper/node_modules/preq/node_modules/request/request.js:574:14)
  at Object._onImmediate (/home/ariel/htmldumper/node_modules/preq/node_modules/request/request.js:588:7)
  at processImmediate [as _immediateCallback] (timers.js:330:15)
Dumping Wikimania/fr 51040
Restricted Application added a subscriber: Hydriz. · View Herald Transcript · Aug 8 2016, 1:41 PM

Right, the way the dump works is that it retrieves the list of titles from the MW API, so if a page gets deleted between two dump runs, the old version is left in the DB. Perhaps a way to fix this would be to mark the pages in the DB that were returned by the MW API, and then check individually the ones found in storage, but not returned by MW. This approach is rather time-consuming, though.

I don't know what approach you folks want to take; you'll probably want to do testing of a few different options. But deletion of deleted material is a must, for this as for all dumps.

daniel added a subscriber: daniel.
ArielGlenn triaged this task as Normal priority. Aug 11 2016, 5:52 PM

The general idea was to leverage Event-Platform for all incremental updates (deletions & edits / creations). As a side effect, this will also be a lot faster than iterating through all titles.

The main issue is that we'll need to hook up a kafka client, and then adapt how events are processed.
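To make the idea concrete, here is a rough Python sketch of consuming deletion events and applying them to a per-wiki sqlite dump db. The topic name, broker address, payload fields, and table schema are all assumptions for illustration, not something specified in this task; it uses the kafka-python package.

```python
# Rough sketch only: topic name, broker, payload fields, db path, and table
# schema are assumptions for illustration, not the real event platform setup.
import json
import sqlite3

from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "eqiad.mediawiki.page-delete",                    # assumed topic name
    bootstrap_servers="kafka1001.eqiad.wmnet:9092",   # placeholder broker
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

conn = sqlite3.connect("/srv/dumps/enwiki.articles.ns0.sqlite3")  # example path

for message in consumer:
    event = message.value
    if event.get("database") != "enwiki":             # assumed field name
        continue
    title = event.get("page_title")                   # assumed field name
    if title:
        # drop the deleted page from the html dump db
        conn.execute("DELETE FROM pages WHERE title = ?", (title,))
        conn.commit()
```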

This sure sounds like the way to go for the mid-long term. For the short term to get this rolled out the door, how do you want to proceed?

@ArielGlenn: A simple solution to deletions would be to always do full dumps (from scratch) for now. We can then iteratively improve on this:

  1. Apply deletions from eventbus & use current full-title iteration to produce incremental dumps.
    • Alternatively, delete all titles / revisions that were not listed in the current run, using a new "run id" column.
  2. Efficiently apply all changes from eventbus, without iterating through all titles.

All right, I'll proceed on that basis. Thanks.

@GWicke's comment more or less reflects what we were discussing during the last developer summit. Namely:

  • start with a complete dump
  • monitor events from the Event-Platform and apply the changes accordingly to the dumps

Yep, I concur that we want to use the eventbus stuff for updates for the mid-long term. My question was just about what we do in the next week to Get Stuff Done. :-)

mobrovac added a comment. Edited Aug 25 2016, 11:25 AM

Yep, I concur that we want to use the eventbus stuff for updates for the mid-long term. My question was just about what we do in the next week to Get Stuff Done. :-)

Start a clean dump of only the latest revisions so that we at least have something?

EDIT: s/titles/revisions/

Yep, that's the plan now.

Yep, I concur that we want to use the eventbus stuff for updates for the mid-long term. My question was just about what we do in the next week to Get Stuff Done. :-)

Start a clean dump of only the latest revisions so that we at least have something?
EDIT: s/titles/revisions/

@ArielGlenn, https://github.com/wikimedia/htmldumper has logic for both single-wiki and incremental all-wiki dumping. These scripts were used to create https://dumps.wikimedia.org/htmldumps/dumps/. However, as you point out, it will need some minor tweaks to reflect recent changes:

  • Use main project domains instead of rest.wikimedia.org for content requests (project listings are still available at rest.wikimedia.org).
  • Possibly, send ?redirect=false in content requests. However, redirects aren't currently included by default, so this should not make a noticeable difference in practice.

We will need to send it indeed; this case

https://wikimania2014.wikimedia.org/api/rest_v1/page/html/Friendly_Space_Policy

illustrates why.

Hm, the en wikipedia dump took (at least) two days to run; I'm not sure yet if it ran to completion. Might there be some concurrency setting I have misplaced?

We will need to send it indeed; this case
https://wikimania2014.wikimedia.org/api/rest_v1/page/html/Friendly_Space_Policy
illustrates why.

You mean the ERR_TOO_MANY_REDIRECTS?

We will need to send it indeed; this case
https://wikimania2014.wikimedia.org/api/rest_v1/page/html/Friendly_Space_Policy
illustrates why.

You mean the ERR_TOO_MANY_REDIRECTS?

The page is a self-redirect. MW appends a redirect=no query parameter when it detects that, so RB should probably do the same.

Hm, the en wikipedia dump took (at least) two days to run; I'm not sure yet if it ran to completion.

Why aren't you sure if it has completed or not?

Might there be some concurrency setting I have misplaced?

I submitted PR #7 for the dumper script, which allows you to control that parameter. By default, it fetches and processes 50 articles in parallel.
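As a rough illustration of what "50 articles in parallel" means: bounded-concurrency fetching looks something like the Python sketch below. htmldumper itself is Node.js and PR #7 exposes its own option, so none of these names are its real interface.

```python
# Illustration only: bounded-concurrency fetching. htmldumper is Node.js and
# PR #7 exposes its own option; the function names here are stand-ins.
from concurrent.futures import ThreadPoolExecutor

import requests

def fetch_html(domain, title):
    # latest-revision HTML for one title from the REST API
    url = "https://%s/api/rest_v1/page/html/%s" % (domain, title)
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    return title, resp.text

def dump_titles(domain, titles, concurrency=50):
    # keep at most `concurrency` requests in flight at any time
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for title, html in pool.map(lambda t: fetch_html(domain, t), titles):
            yield title, html
```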

The page is a self-redirect. MW appends a redirect=no query parameter when it detects that, so RB should probably do the same.

Filed T144218: RESTBase should detect self-redirects for this.

I wasn't sure if it had completed properly or not. I have since uncompressed the dump and compared its size to the 2015 en wp dump and it's bigger, so I believe the run was complete.

I have not touched the concurrency setting in any way, so may I assume that the two days for the run was with 50 articles dumped in parallel? If so, I may rethink how we handle deletions in the short term; it may be faster to add a column to the rows in the sqlite table with the update date, and make a second pass across all entries, deleting any which have an update date earlier than the start of the run, than to dump from scratch. It would not be hard to test this.
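A minimal sketch of that second-pass idea, assuming a hypothetical `pages` table keyed by title with an `updated_at` column (not the actual htmldumper schema):

```python
# Sketch of the "update date + second pass" idea; table and column names are
# hypothetical, and `title` is assumed to be the primary key.
import sqlite3

def touch_page(conn, title, revision, html, now):
    # refresh (or create) the row and stamp it with the current run's time
    conn.execute(
        "INSERT OR REPLACE INTO pages (title, revision, html, updated_at) "
        "VALUES (?, ?, ?, ?)",
        (title, revision, html, now),
    )

def second_pass_cleanup(db_path, run_start):
    # anything not touched during this run was not returned by the API,
    # so treat it as deleted on the wiki and drop it
    conn = sqlite3.connect(db_path)
    conn.execute("DELETE FROM pages WHERE updated_at < ?", (run_start,))
    conn.commit()
    conn.close()
```

Here run_start would be captured (e.g. with time.time()) just before the run begins.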

The following three patchsets are work in progress for this task:

https://gerrit.wikimedia.org/r/#/c/307257/
https://gerrit.wikimedia.org/r/#/c/308015/
https://gerrit.wikimedia.org/r/#/c/308016/

The first patchset will be broken up into smaller pieces for review, but folks can see where this is going. It needs a bit more cleanup (log exceptions when a dump job fails; don't regenerate the index file when a dump job isn't run, unless indexonly is explicitly set), and then it will be ready to use.

Pchelolo moved this task from Backlog to watching on the Services board. Oct 12 2016, 5:27 PM
Pchelolo edited projects, added Services (watching); removed Services.

Change 322450 had a related patch set uploaded (by ArielGlenn):
move IncrDumpLib to miscdumpslib and rename classes and methods accordingly

https://gerrit.wikimedia.org/r/322450

Change 322451 had a related patch set uploaded (by ArielGlenn):
move config defaults to a separate method

https://gerrit.wikimedia.org/r/322451

Change 322452 had a related patch set uploaded (by ArielGlenn):
move some adds/changes-specific code out of miscdumpslib

https://gerrit.wikimedia.org/r/322452

Change 322453 had a related patch set uploaded (by ArielGlenn):
move some methods into miscdumpslib that will be reused for other misc dumps

https://gerrit.wikimedia.org/r/322453

Change 322491 had a related patch set uploaded (by ArielGlenn):
start moving adds/changes methods out to incr_dumps module

https://gerrit.wikimedia.org/r/322491

Change 322510 had a related patch set uploaded (by ArielGlenn):
move more incremental-related methods out to incr_dumps module

https://gerrit.wikimedia.org/r/322510

Change 322511 had a related patch set uploaded (by ArielGlenn):
move methods that dump things into the IncrDump class in incr_dump

https://gerrit.wikimedia.org/r/322511

Change 322512 had a related patch set uploaded (by ArielGlenn):
add run method to the IncrDump class to be used by the generate wrapper

https://gerrit.wikimedia.org/r/322512

Change 322514 had a related patch set uploaded (by ArielGlenn):
move options specific to adds/changes into args dict

https://gerrit.wikimedia.org/r/322514

Change 322515 had a related patch set uploaded (by ArielGlenn):
Change last few config options from 'incr' to 'misc'

https://gerrit.wikimedia.org/r/322515

Sorry for the delay. A bunch of patchsets are going to land here shortly, including a WIP for the HTML dumps (it really just needs the appropriate nodejs wrapper to call; otherwise it's not really WIP).

Change 323874 had a related patch set uploaded (by ArielGlenn):
remove get_lockinfo and bogus info about when run started

https://gerrit.wikimedia.org/r/323874

Change 322450 merged by ArielGlenn:
move IncrDumpLib to miscdumpslib and rename classes and methods accordingly

https://gerrit.wikimedia.org/r/322450

Change 323874 merged by ArielGlenn:
remove get_lockinfo and bogus info about when run started

https://gerrit.wikimedia.org/r/323874

Change 322451 merged by ArielGlenn:
move config defaults to a separate method

https://gerrit.wikimedia.org/r/322451

Change 322452 merged by ArielGlenn:
move some adds/changes-specific code out of miscdumpslib

https://gerrit.wikimedia.org/r/322452

Change 322491 merged by ArielGlenn:
start moving adds/changes methods out to incr_dumps module

https://gerrit.wikimedia.org/r/322491

Change 322510 merged by ArielGlenn:
move more incremental-related methods out to incr_dumps module

https://gerrit.wikimedia.org/r/322510

Change 322511 merged by ArielGlenn:
move methods that dump things into the IncrDump class in incr_dump

https://gerrit.wikimedia.org/r/322511

Change 322512 merged by ArielGlenn:
add run method to the IncrDump class to be used by the generate wrapper

https://gerrit.wikimedia.org/r/322512

Change 322514 merged by ArielGlenn:
move options specific to adds/changes into args dict

https://gerrit.wikimedia.org/r/322514

Change 322515 merged by ArielGlenn:
Change last few config options from 'incr' to 'misc'

https://gerrit.wikimedia.org/r/322515

Change 322453 abandoned by ArielGlenn:
move some methods into miscdumpslib that will be reused for other misc dumps

Reason:
superseded by ba6390a0e0c5963ac209d6877c59284a76ccabe6
due to crappy rebases.

https://gerrit.wikimedia.org/r/322453

Change 324018 had a related patch set uploaded (by ArielGlenn):
move last references to incr/Incr out of generateincrementals module

https://gerrit.wikimedia.org/r/324018

Change 324019 had a related patch set uploaded (by ArielGlenn):
generateincrementals.py becomes generatemiscdumps.py at last

https://gerrit.wikimedia.org/r/324019

daniel removed a subscriber: daniel. Nov 29 2016, 10:35 AM

Change 324018 merged by ArielGlenn:
move last references to incr/Incr out of generateincrementals module

https://gerrit.wikimedia.org/r/324018

Change 324019 merged by ArielGlenn:
generateincrementals.py becomes generatemiscdumps.py at last

https://gerrit.wikimedia.org/r/324019

Change 324231 had a related patch set uploaded (by ArielGlenn):
html dumps script using misc dump generation framework

https://gerrit.wikimedia.org/r/324231

Change 324231 merged by ArielGlenn:
html dumps script using misc dump generation framework

https://gerrit.wikimedia.org/r/324231

Kelson added a subscriber: Kelson. Jan 17 2017, 11:58 AM

@GWicke

As promised at the dev summit, here is a list of the things still pending:

After these are done, I can add the appropriate call to the wrapper script and get these going out of cron. Let me know what you need.

Merged

Left some comments there.

  • add functionality to dump just one wiki as a compressed sqlite db, with no re-use of an existing db unless specified by a cli arg (avoids the problem with deleted revs discussed earlier)
  • scap repo setup and deploy on francium with current dependencies

I can help with the latter when the time comes.

awight added a subscriber: awight. Apr 13 2017, 9:03 AM
GWicke moved this task from next to blocked on the Services board. Aug 1 2017, 2:58 PM
GWicke edited projects, added Services (blocked); removed Services (next).
ArielGlenn moved this task from Active to Up Next on the Dumps-Generation board. Sep 11 2017, 1:10 PM
ArielGlenn moved this task from Up Next to Backlog on the Dumps-Generation board. Dec 2 2017, 6:10 PM

I notice that this task has moved from "Up Next" to "Backlog". Any hope to see progress on restoring plain HTML dumps?

I moved it because next up is going to be moving to php7/stretch, and realistically the HTML stuff will take a back seat to this. It's not off the radar by any means though.

In the meantime I am rethinking the way these dumps ought to go. This is an alternative approach, still nascent: https://gerrit.wikimedia.org/r/#/c/413212/ I was going to copy over some notes onto this ticket later; meh, I'll copy them now:


It would be nice to have the html output readable in a nice way, so that we can just pick up the html from the xml without doing a bunch of processing. This can be done by using CDATA tags: <![CDATA[ HTML stuff goes here... ]]>
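A small Python sketch of what that could look like; the element names are illustrative, since the output schema hasn't been decided yet, and the only non-obvious bit is that a CDATA section cannot contain the literal "]]>" and so has to be split.

```python
# Sketch only: element names are illustrative, not a decided schema.
from xml.sax.saxutils import escape, quoteattr

def cdata(text):
    # a CDATA section may not contain the literal "]]>", so split it across
    # two adjacent CDATA sections (the standard workaround)
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

def page_element(title, revid, html=None, redirect_target=None):
    parts = ["<page>",
             "  <title>%s</title>" % escape(title),
             "  <revision>%d</revision>" % revid]
    if redirect_target is not None:
        # for redirects, store only the target title, not the rendered html
        parts.append("  <redirect target=%s/>" % quoteattr(redirect_target))
    else:
        parts.append("  <html>%s</html>" % cdata(html or ""))
    parts.append("</page>")
    return "\n".join(parts)
```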

It would be nice to do something with redirects. Storing the plain html in there seems dumb. Storing the title to which we redirect seems smart. Remember that the magic word can be localized, so we want to handle that too.
RestBase currently returns a 302 and a Location header with the new url if you request a page which has #REDIRECT in it. The body returned by RestBase is the html of the redirect page itself.
For an example, see redirect_sample.txt, produced in response to
curl --dump-header - -o - -X GET --header 'Accept: text/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/1.6.0"' 'https://el.wiktionary.org/api/rest_v1/page/html/BD' > redirect_sample.txt
I would like to see a <redirect></redirect> tag which contains the url, and which is somehow empty if the page is not a redirect. Check the XML spec for this.

It would be nice if these files could be used directly in offline readers. As such, it would be nice to put them in multistream bz2 files with an index, instead of the usual bz2 output. No images, true, but that's how it is.
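For what it's worth, a multistream bz2 file is just a concatenation of independent bz2 streams, so a rough Python sketch of writing one plus a byte-offset index might look like this (the file layout and index format here are made up for illustration):

```python
# Sketch: write groups of pages as independent bz2 streams, concatenated,
# plus a plain-text index of byte offsets so a reader can seek to one stream.
# The index format (offset<TAB>first title) is made up for illustration.
import bz2

def write_multistream(out_path, index_path, page_groups):
    # page_groups: iterable of (first_title, xml_text) chunks
    with open(out_path, "wb") as out, \
         open(index_path, "w", encoding="utf-8") as idx:
        offset = 0
        for first_title, xml_text in page_groups:
            block = bz2.compress(xml_text.encode("utf-8"))
            idx.write("%d\t%s\n" % (offset, first_title))
            out.write(block)
            offset += len(block)

def read_one_stream(path, offset):
    # seek to a stream's offset and decompress only that stream
    with open(path, "rb") as infile:
        infile.seek(offset)
        decomp = bz2.BZ2Decompressor()
        out = b""
        while not decomp.eof:
            chunk = infile.read(65536)
            if not chunk:
                break
            out += decomp.decompress(chunk)
        return out.decode("utf-8")
```

If that holds up, recombining per-piece files later should be a matter of byte-wise concatenation plus adjusting the index offsets.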

Need to decide if these ought to be produced in small pieces and then recombined (or whatever) later, for speed, for the bigger wikis and the two huge ones. Should see if bz2 multistreams will handle that easily or if it will be a PITA. Worst case I could write 7z of these, combine those (might be fast-ish?), then write bz2 multistream by conversion of the combined file.

Some size estimates of the bz2 files for pages-current (so that's main ns plus templates and other random crap):

  • enwiki articles current xml bz2: 13.7GB
  • dewiki articles current xml bz2: 4.4GB
  • commonswiki articles current xml bz2: 7.1GB
  • wikidatawiki articles current xml bz2: 33.2GB

One can argue that we ought to skip wikidatawiki and commons for these, as less useful (unsure). 13GB is still plenty to download though. Producing pieces seems better.

Notes on the RestBase api (a small fetch sketch follows this list):

  • You can pass in a specific revision if you like.
  • You can't request pages by page id, only by title; title is required even when providing a revision id.
  • Soft redirects (#REDIRECT) are converted to 302s with special etag headers consisting of revid/timeuuid. The body has the html of the redirect page but let's not parse that, ugh.
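Putting those notes together, a single content request might look roughly like the Python below; requests is just a stand-in for whatever client the dumper actually uses, and passing ?redirect=false, as suggested earlier in the thread, is the alternative to inspecting the 302.

```python
# Sketch of one content request, based on the notes above: title is always
# required, a revision id may be appended, and soft redirects come back as a
# 302 with a Location header. `requests` is just a stand-in client here.
import requests
from urllib.parse import quote

def fetch_page_html(domain, title, revid=None):
    path = "https://%s/api/rest_v1/page/html/%s" % (domain, quote(title, safe=""))
    if revid is not None:
        path += "/%d" % revid
    # don't follow the redirect; record where it points instead
    resp = requests.get(path, allow_redirects=False, timeout=60)
    if resp.status_code == 302:
        # soft redirect (#REDIRECT page): keep the target url, skip the body
        return {"redirect": resp.headers.get("location")}
    resp.raise_for_status()
    return {"html": resp.text, "etag": resp.headers.get("etag")}
```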

@ArielGlenn so these would include the output from RestBase, with parsoid-annotated DOM? That would be very helpful for all sorts of processing tasks.

I hope these dumps will be available soon; now that more wikis make use of Wikidata, the XML files are no longer sufficient to get all the data. And on Wiktionary more and more content is created dynamically by Lua modules, which makes it very difficult to extract.

Some more design decisions made as we go along (see updated patchset); a pseudocode sketch of the per-page loop follows the list:

  • Do not expect consistency in these dumps; we don't lock the db, retrieve everything, and then unlock. That would take way too long, and there are too many other users.
  • The MW page table is used to get the list of page titles we'll retrieve, and the latest rev id at the time.
  • The MW redirect table is used to check whether a page id refers to a redirect.
  • For each page id in the page table:
    • If the page id is in the redirect table, we'll write REDIRECT in the contents and add a redirect tag with target title to the xml output.
    • If we can prefetch the contents and the revision is current according to our page table dump, we'll just write it.
    • If the contents aren't available for prefetch, then we'll get it from RestBase, get the rev id out of RestBase and write the output.
    • If we can't get it from restbase, we'll write MISSING in the output.
    • It's possible that by the time we request a page from restbase, it will have been turned into a redirect. That's life. RestBase will serve the content of the target page and that's what we'll write.
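Here is that loop as a Python pseudocode sketch; all of the helpers are placeholders passed in as callables, since the real page-table reader, prefetch store, RestBase client, and XML writer live elsewhere. None of this is the actual dump script.

```python
# Pseudocode sketch of the per-page loop described above; all helpers are
# placeholder callables, not functions from the real dump scripts.

def dump_pages(pages, redirect_targets, prefetch, fetch_from_restbase,
               write_redirect, write_html, write_missing):
    """
    pages:               iterable of (page_id, title, latest_rev_id) from the page table
    redirect_targets:    dict of page_id -> target title, from the redirect table
    prefetch:            callable(title) -> (rev_id, html) or None (previous dump's db)
    fetch_from_restbase: callable(title) -> (rev_id, html) or None
    write_*:             callables that emit the XML output
    """
    for page_id, title, latest_rev in pages:
        if page_id in redirect_targets:
            # redirect: write the REDIRECT marker plus a tag with the target title
            write_redirect(title, redirect_targets[page_id])
            continue
        cached = prefetch(title)
        if cached is not None and cached[0] == latest_rev:
            # prefetched content is still current; reuse it as-is
            write_html(title, cached[0], cached[1])
            continue
        fetched = fetch_from_restbase(title)
        if fetched is not None:
            # take whatever revision RestBase reports, even if the page moved on
            write_html(title, fetched[0], fetched[1])
        else:
            # gone (or errored) between listing and fetching
            write_missing(title)
```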

While this won't produce a 100% consistent snapshot of a project, it will produce html or redirection info for any page title that isn't gone (deleted). That should be good for the vast majority of uses.

@ArielGlenn so these would include the output from RestBase, with parsoid-annotated DOM? That would be very helpful for all sorts of processing tasks.
I hope these dumps will be available soon; now that more wikis make use of Wikidata, the XML files are no longer sufficient to get all the data. And on Wiktionary more and more content is created dynamically by Lua modules, which makes it very difficult to extract.

P6766 has sample RestBase output for your perusal; I am tentatively wrapping it in CDATA in an xml file. I still need to look at a page with Wikidata properties embedded in it.

Additional content notes: in the case that a page consisting of an MW redirect points to a non-existent page, it will be omitted. That's easiest, and wouldn't really be a loss of content.

awight removed a subscriber: awight. Mar 15 2019, 3:41 PM