Isolate parser from database dependencies
Closed, Resolved, Public

Authored By

	Hippietrail
	Nov 18 2010, 2:35 AM

Description

Many people need to parse wikitext but due to its nature all attempts at alternative parsers are incomplete or have failed utterly.

The only parser known to "correctly" parse wikitext is Parser.php, part of the MediaWiki source.

But it's not possible to use this parser in your own code or as a standalone PHP script, because it calls the database, directly or indirectly, for various things such as parser options (which may depend on a user) and the localisation cache.

It would be a good thing if third parties, or even unit tests, could use the genuine MediaWiki parser without needing a MediaWiki install and database import.

It should be possible to pass a string literal to the parser and get HTML back.

Version: unspecified
Severity: enhancement

Details

Reference: bz25984

Related Objects

Mentioned In: T108163: [WLM] Redirect script from Wikipedia to Commons
Mentioned Here: rMWb7ed0276e560: Added missing GPLv2 headers in some places.

Event Timeline

• bzimport raised the priority of this task to Medium. Nov 21 2014, 11:11 PM

• bzimport added projects: MediaWiki-General, Parser.

• bzimport set Reference to bz25984.

• bzimport added a subscriber: Unknown Object (MLST).

Hippietrail created this task. Nov 18 2010, 2:35 AM

I don't think that'd be possible, since you need db access the moment you have wikitext with {{some template}} in it (to retrieve template). Same with [[link]] (so the parser can figure out if it should be a red link or not).

However, with that said, I just tried disabling all caches in my LocalSettings.php (and removing the db credentials, so I know that no db access took place), and I was able to parse a string literal from maintenance/eval.php as long as it didn't have any links or transclusions in it.

For reference, the LocalSettings.php I used in my test was:
<?php
$IP = dirname( __FILE__ );
$path = array( $IP, "$IP/includes", "$IP/languages" );
set_include_path( implode( PATH_SEPARATOR, $path ) . PATH_SEPARATOR . get_include_path() );
require_once( "$IP/includes/DefaultSettings.php" );
$wgReadOnly = true;
$wgMessageCacheType = CACHE_NONE;
$wgParserCacheType = CACHE_NONE;
$wgMainCacheType = CACHE_NONE;
$wgLocalisationCacheConf['storeClass'] = 'LCStore_Null';

(In reply to comment #1)

> I don't think that'd be possible, since you need db access the moment you have
> wikitext with {{some template}} in it (to retrieve template). Same with
> [[link]] (so the parser can figure out if it should be a red link or not).

You can always abstract DB access with something like

interface WikiAccess {

function getPageText( $title );
function getPageExistence( $title );
function getPageProps( $title );

}

The question is what will we achieve with this, because:

$ grep '\$wg' Parser.php | wc -l

This bug should be titled "Get rid of parser's external dependencies". And how are we going to untangle it from, say, $wgContLang? It depends on half of MW.

For getting templates and red/blue link info I suggest adding a layer of abstraction that the parser can call rather than calling directly to the database.

Parser users would then have the choice of implementing them or knowing not to try parsing template calls. In my case I have written code to do both existence checking and wikitext extraction directly from locally stored (and indexed) dump files.
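The abstraction suggested above can be sketched in a few lines. This is only an illustration, not MediaWiki code: it reuses the WikiAccess interface proposed earlier in this thread (with type hints added), while ArrayWikiAccess and linkColour() are hypothetical names invented here, backed by a plain PHP array rather than dump files.

```php
<?php
// Interface as proposed earlier in the thread; the type hints are an addition for this sketch.
interface WikiAccess
{
    public function getPageText(string $title): ?string;
    public function getPageExistence(string $title): bool;
    public function getPageProps(string $title): array;
}

// Hypothetical backend: pages are held in memory as title => wikitext.
final class ArrayWikiAccess implements WikiAccess
{
    private array $pages;

    public function __construct(array $pages)
    {
        $this->pages = $pages;
    }

    public function getPageText(string $title): ?string
    {
        // null signals "no such page" to the caller.
        return $this->pages[$title] ?? null;
    }

    public function getPageExistence(string $title): bool
    {
        return array_key_exists($title, $this->pages);
    }

    public function getPageProps(string $title): array
    {
        // This sketch stores no page properties.
        return [];
    }
}

// Hypothetical consumer: decide red vs. blue link without touching a database.
function linkColour(WikiAccess $wiki, string $title): string
{
    return $wiki->getPageExistence($title) ? 'blue' : 'red';
}
```

A dump-file or HTTP backend would only need to implement the same three methods; code that depends on WikiAccess would not have to change.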

ContentLang is harder, but should probably also be abstracted with an English default. The default strings could be automatically extracted to a text file included in the source tarball to make sure they're up to date. Parser users should also be able to implement their own ContentLang equivalent.

One problem I've found with ContentLang is it's not possible to instantiate one without a User. You either pass a user or the default constructor seems to call the database anyway to get the language settings for the default user. This would also need to be abstracted, which should not be difficult in principle since MediaWiki already serves mostly to anonymous users who are not logged in.

Getting rid of all external dependencies is probably a fair goal, but some might be fine. The first goal might be to get the Parser working from a MediaWiki tarball that has been unarchived but has not had its installer run, which is how I'm working on it on a machine with no web server or database software installed.

> For getting templates and red/blue link info I suggest adding a layer of
> abstraction that the parser can call rather than calling directly to the
> database.

You could make your own custom db backend that recognizes certain queries and calls your thingy, but that kind of sounds insane.

> One problem I've found with ContentLang is it's not possible to instantiate one
> without a User. You either pass a user or the default constructor seems to call
> the database anyway to get the language settings for the default user

That doesn't seem right. $wgContLang (which is what I assume you're referring to) does not depend on the user's language pref. I'm doubtful that $wgLang hits the db for anon users. Furthermore I managed to do $wgContLang->commaList(array('foo', 'bar')); on my local install without accessing the db.

> the default strings could be automatically extracted to a text file
> included in the source tarball to make sure they're up to date

$wgUseDatabaseMessages = false; does that

> Getting rid of all external dependencies is probably a fair goal but some might
> be fine.

I'm unconvinced that it'd be worth all the effort, given that it's not that beneficial to MediaWiki to do that (but I'm not planning to do these things, so it doesn't really matter if I see the benefit ;)

If you just want to make it work without installing db/apache/etc, you probably can make it work with just an "extension", but it'd be a bit "hacky"

Fixing this bug (isolating the parser) would solve so many problems. Now, is it doable? I think some of the best discussion I've seen yet is on this bug.

Which problems would that solve? It would make life easier for other folks who want to parse wikitext without MediaWiki (which would be very nice, but not exactly a super high priority in my mind). It also might be slightly cleaner architecturally, and it might help slightly with the /make a js parser thingy so wysiwyg thingies can be more easily implemented on the client side/ goal, but not significantly, as the parser would still be written in PHP and so couldn't just be plopped into a js library. Other than that, I'm not exactly sure what this would solve.

(In reply to comment #6)

> Which problems would that solve?

It would make it easier for third parties to use, yes, but that isn't the point. It would be easier to maintain and less "scary" for people to work on.

Maybe there is a limit to how much the parser can be isolated from MediaWiki, maybe it isn't 100% achievable, but achieving this isn't an ivory tower goal, the point isn't simply "architectural cleanliness", but something far more pragmatic: maintainability.

vi0oss wrote:

Created special hacky patch to use MediaWiki parser without actual database (https://github.com/vi/wiki_dump_and_read/blob/master/hacked_mediawiki/0001-Make-MediaWiki-1.19-fetch-content-from-HTTP.patch)

If the parser and database code were separated properly, it would have been simpler and less hacky.

Yes, it is hacky :)

Some ideas:

Indent with tabs, not spaces.
If you add a new global, it has to be defined in DefaultSettings
Names like hackTriggered are fine for your code, but would carry no meaning if it were integrated upstream.
Instead of downloading from a web server, load from the filesystem. Check for ../ attacks. (Ideally, there would be different classes depending if it was db-backed or filesystem-based)
Wikipage::checkForDownloadingHack() should itself return the (cached) content, instead of manually doing the $this->downloadedContent
No need to hack Parser::fetchTemplateAndTitle(), that can be redirected through setTemplateCallback().
Why do you need to change EditPage, if you're not doing page editing?
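For the filesystem idea and the ../ check in the list above, a minimal sketch might look like this. loadPageFromDir() and the Title.wiki file naming are assumptions made for illustration and are not part of MediaWiki or of the patch; the only real calls are PHP's realpath() and file_get_contents().

```php
<?php
// Hypothetical helper: load wikitext for a title from a directory of .wiki files.
// Returns null if the page is missing or the title tries to escape the directory.
function loadPageFromDir(string $baseDir, string $title): ?string
{
    $base = realpath($baseDir);
    if ($base === false) {
        return null; // base directory does not exist
    }

    // Assumed naming scheme: spaces become underscores, one file per page.
    $candidate = $base . DIRECTORY_SEPARATOR . str_replace(' ', '_', $title) . '.wiki';

    // realpath() resolves ../ segments and symlinks, and fails for missing files.
    $resolved = realpath($candidate);
    if ($resolved === false) {
        return null;
    }

    // Reject anything that resolved to a location outside the base directory.
    $prefix = $base . DIRECTORY_SEPARATOR;
    if (strncmp($resolved, $prefix, strlen($prefix)) !== 0) {
        return null;
    }

    $text = file_get_contents($resolved);
    return $text === false ? null : $text;
}
```

Comparing the resolved path against the resolved base directory, rather than searching the title for "..", also covers symlinks that point outside the directory.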

vi0oss wrote:

Actually, it was not intended for merging upstream. I implemented this primarily for myself, to be able to grab online wikis, save them in compressed form and use a local MediaWiki to view them, without any lengthy indexing phase. The result is at https://github.com/vi/wiki_dump_and_read ("wikiexport" is also a hacky hack).

> Indent with tabs, not spaces.

Is there a patch-checking utility to catch this and other common problems, as on some other projects?

> If you add a new global, it has to be defined in DefaultSettings

OK.

> Names like hackTriggered are fine for your code, but would carry no meaning
> if it were integrated upstream.

As I don't know the MediaWiki internals, such attempts will always be hacky unless "Bug 25984" is really closed. I can rename things properly, but it will remain hacky. I usually explicitly mention the word "hack" to tell users "beware, something may be wrong here".

> Instead of downloading from a web server, load from the filesystem.

Maybe, but that is less flexible. The goal is to make it easy to connect MediaWiki to other sources of page wikitext. The HTTP approach is portable; with the filesystem approach the only good way is FUSE.

> Check for ../ attacks.

That is the job of the server. https://github.com/vi/wiki_dump_and_read/blob/master/wikishelve_server.py just serves entries from Python's "shelve" database (a single file on the filesystem). And the whole thing is initially intended for local-only, read-only usage.

> Wikipage::checkForDownloadingHack() should return itself the (cached)
> content, instead of manually doing the $this->downloadedContent

Yes.

> No need to hack Parser::fetchTemplateAndTitle(), that can be redirected
> through setTemplateCallback().

I'm not a PHP/MediaWiki hacker, so I just did the first thing I managed to get working.

> Why do you need to change EditPage, if you're not doing page editing?

To be able to view source (sometimes things get broken => can still view content in source form).

> Ideally, there would be different classes depending if it was
> db-backed or filesystem-based

I think creating a good class structure to support DBBackend, FilesystemBackend and HttpBackend is a step towards resolving "Bug 25984".

(A Freenode #mediawiki user advised bumping this discussion.)
(I will post here again if/when I implement the improved fetch-from-HTTP patch.)

I would dearly love to see a version of this patch go upstream so that others who want to use the real live parser without a DB can see where to start.

Obviously having a proper abstraction layer between the parser and various db- / HTTP- / filesystem back end classes is the best way, but since the dev team is busy with bigger projects, having (a cleaned up version of) this starting point in the codebase will help people wanting 100% parse fidelity for offline viewers, data miners, etc.

Personally I've wanted an offline viewer that worked straight from the published dump files for years. (Other offline viewers like Kiwix need their own huge downloads in their own formats.) I had the code to extract the needed data from the dump files, but never succeeded with the next step of parser integration.

It was me who asked _Vi to participate in this discussion. He's the first one I bumped into over the years who has done some real work in this direction. Hack = prototype = as good a place to start as anywhere. (-:

Keisial wrote:

Patch for 0001-Make-MediaWiki-1.19-fetch-content-from-HTTP.patch

_Vi is also processing to a different format. :)

Did you see http://wiki-web.es/mediawiki-offline-reader/ ?

They are not straight dumps from dumps.wikimedia.org, although the original idea was that they would eventually be processed like that, publishing the index file along the xml dumps.
You could use the original files, treating them as a single bucket, but performance would be horrible with big dumps.

My approach was to use a new database type for reading the dumps, so it doesn't need an extra process or database.

Admittedly, it targeted the then-current MediaWiki 1.13, so it'd need an update in order to work with current MediaWiki versions (mainly things like new columns/tables).

Vi, I did some tests with your code using eswiki-20081126 dump. For that version I store the processed file + categories + indexes in less than 800M. In your case, the shelve file needs 2.4G (a little smaller than the decompressed xml dump, 2495471616 vs 2584170611).

I had to make a number of changes: to the patch so that it applies, to the interwikis so that wikipedia is treated as a local namespace, to paths... Also, the database contains references to /home/vi/usr/mediawiki_sa_vi/w/, but it mostly works.
The more noticeable problems are that images don't work and redirects are not followed.
Other features such as categories or special pages are also broken, but I assume that's expected?

Attached:

0001-Make-MediaWiki-1.19-fetch-content-from-HTTP.patch.patch (1 KB)

vi0oss wrote:

Improved the patch: 0001-Make-MediaWiki-1.20-gb7ed02-be-able-to-fetch-from-al.patch .

Now applies to master branch (b7ed0276e560389913c629d97a46aaa47f48798b)
Separate class "AlternativeSourceObtainerBackend"
Is not _expected_ to break existing functions unless wgAlternativeSourceObtainerUri is set
wgAlternativeSourceObtainerUri is properly registered in DefaultSettings
Some comments above the variables and functions
Red/blue links are supported now, at the expense of a massive number of requests to AltBackend for each link.

Not implemented:

Tabs instead of spaces
"No need to hack Parser::fetchTemplateAndTitle"

It should be usable for everything PHP can fopen.
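One way to soften the cost of the red/blue link checks mentioned above is to remember each answer, so that a title is only requested from the backend once per run. The sketch below is an assumption about how that could look; CachedExistence is a hypothetical class, not something in the patch, and the lookup callable stands in for whatever request AltBackend actually makes.

```php
<?php
// Hypothetical wrapper: remembers existence answers so each title is looked up once.
final class CachedExistence
{
    /** @var callable Takes a title string, returns whether the page exists. */
    private $lookup;

    /** @var array<string|int, bool> Answers seen so far, keyed by title. */
    private array $cache = [];

    public function __construct(callable $lookup)
    {
        $this->lookup = $lookup;
    }

    public function exists(string $title): bool
    {
        // array_key_exists() is used so that a cached "false" is not looked up again.
        if (!array_key_exists($title, $this->cache)) {
            $this->cache[$title] = (bool) ($this->lookup)($title);
        }
        return $this->cache[$title];
    }
}
```

This only removes repeated requests for the same title; a page that links to many different titles would still need one request per title unless the backend offered some way to ask about several at once.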

vi0oss wrote:

Patch to make MediaWiki obtain page sources from alternative locations

Attached:

0001-Make-MediaWiki-1.20-gb7ed02-be-able-to-fetch-from-al.patch (13 KB)

sumanah wrote:

_Vi - thanks for your work. By the way, want developer access?

https://www.mediawiki.org/wiki/Developer_access

vi0oss wrote:

> _Vi is also processing to a different format. :)
> I did some tests with your code using eswiki-20081126 dump.

The main thing is that with HTTP you can easily experiment with your own storage formats.

> In your case, the shelve file needs 2.4G

The file is expected to be stored on a compressed filesystem like reiser4/btrfs/fusecompress.

> Other features such as categories or special pages are also broken, but I
> assume that's expected?

The old patch is expected to be usable only for simple rendering of pages without extras. The new patch should at least not break things when AltBackend is turned off (still only basic viewing features).

> By the way, want developer access?

It is unlikely to be useful for me. /* Do you share dev access with anybody that easily? */ "Dev access" does not automatically lead to "knowledge about the system and good code quality", and I don't want to break things. If I come up with a patch, I ask on Freenode and/or attach it to a bug report.

sumanah wrote:

Yes, we do share dev access with anyone, and recommend it for anyone who has ever given us a patch. It's access to suggest patches directly into our git repository, but you can't break things, because a senior developer has to approve it before it gets merged. If you get and use dev access, you make it *easier* for us to review, comment on, and eventually merge the code, and you can comment on the patch in the case someone else takes and merges it.

vi0oss wrote:

> If you get and use dev access, you make it
> *easier* for us to review, comment on,

Done, https://www.mediawiki.org/wiki/Developer_access#User:Vi2

(Not sure what to do with it yet)

sumanah wrote:

Thanks, _Vi. You should have a temporary password in your email now. Initial login & password change steps:

https://labsconsole.wikimedia.org/wiki/Help:Access#Initial_log_in_and_password_change

How to suggest your future patches directly into our source code repository (we use Git for version control and Gerrit for code review), in case you want to do that:

https://www.mediawiki.org/wiki/Git/Tutorial

If the patch under discussion in this bug is just a hacky prototype for discussion, then it's fine to keep on discussing it here and attaching improved patches here.

Keisial wrote:

> > In your case, the shelve file needs 2.4G
>
> The file is expected to be stored on compressed filesystem like
> reiser4/btrfs/fusecompress.

How would you do that?
And even more, how would you *share* such file?

vi0oss wrote:

> How would you do that?

For example, in this way:
$ pv pages_talk_templates_dump.xml.xz | wikishelve_create.py shelve
800M 0:NN:NN [100KB/s] [============] 100%
$ fusecompress -o fc_c:bzip2,fc_b:512,allow-other store mount
$ pv shelve > mount/shelve
2.4G 0:NN:NN [1MB/s] [============] 100%
$ wikishelve_server.py mount/shelve 5077

> And even more, how would you *share* such file?

Don't share it; share "pages_talk_templates_dump.xml.xz" instead.

Note: it's better to discuss only the MediaWiki part here. For the storage part of "wiki_dump_and_read", it's better to create an issue on GitHub.

gabriel-wmde subscribed. Aug 10 2015, 12:43 PM

Restricted Application added a subscriber: Aklapper. Aug 10 2015, 12:43 PM

gabriel-wmde mentioned this in T108163: [WLM] Redirect script from Wikipedia to Commons. Aug 10 2015, 4:36 PM

Meno25 unsubscribed. Feb 22 2016, 5:42 PM

Nemo_bis added a project: OKR-Work. Mar 13 2016, 10:15 AM

Danny_B edited projects, added MediaWiki-Parser; removed OKR-Work, Parser, MediaWiki-General. Jul 10 2016, 1:26 AM

Danny_B removed a parent task: T28858: [DO NOT USE] Parser (tracking) [superseded by #MediaWiki-Parser].

Danny_B removed a subscriber: wikibugs-l-list.

See https://www.mediawiki.org/wiki/Parsoid. To the extent considered feasible and worthwhile, it supports standalone use with configurable injection of dynamic/MW functionality.

For specific issues, file under Parsoid

	F6967: 0001-Make-MediaWiki-1.20-gb7ed02-be-able-to-fetch-from-al.patch
	Nov 21 2014, 11:11 PM

	F6966: 0001-Make-MediaWiki-1.19-fetch-content-from-HTTP.patch.patch
	Nov 21 2014, 11:11 PM
