Syntax for stripping HTML and wiki markup
Open, Public

Description

Author: ui2t5v002

Description:
Similar to {{urlencode: }}, I'd like a parserfunction for stripping wikimarkup
and HTML from text. For instance:

The quick brown fox --> The quick brown fox
The [[quick]] [[brown]] [[fox]] --> The quick brown fox

CO<sub>2</sub> --> CO2

My specific application is for generating machine-readable COinS tags from
citation templates. For instance, if someone cites the book:

title = [[Aristotle for Everybody]]: Difficult Thought Made Easy
edition = 6<sup>th</sup> edition

which we have an article for, it shows up in the citation template with a link,
which is great. But in the machine-readable citation information, it needs to
become plain text:

Aristotle for Everybody: Difficult Thought Made Easy
6th edition
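
As an illustration only, the requested behaviour on the examples above could be sketched in Python roughly as follows. The name `strip_markup` and the regex-based approach are assumptions made for this sketch. It handles only simple `[[link]]`, `[[target|label]]` and HTML-style tags, and does not handle templates or nested markup.

```python
import re

# Hypothetical helper for illustration; not an existing MediaWiki function.
_PIPED_LINK = re.compile(r"\[\[[^\[\]|]*\|([^\[\]]*)\]\]")  # [[target|label]] -> label
_PLAIN_LINK = re.compile(r"\[\[([^\[\]|]*)\]\]")            # [[target]] -> target
_HTML_TAG = re.compile(r"</?[A-Za-z][^<>]*>")               # <sub>, </sup>, <br />


def strip_markup(text: str) -> str:
    """Remove simple wiki links and HTML-style tags, keeping their visible text."""
    text = _PIPED_LINK.sub(r"\1", text)
    text = _PLAIN_LINK.sub(r"\1", text)
    return _HTML_TAG.sub("", text)


print(strip_markup("The [[quick]] [[brown]] [[fox]]"))
print(strip_markup("CO<sub>2</sub>"))
print(strip_markup("[[Aristotle for Everybody]]: Difficult Thought Made Easy"))
print(strip_markup("6<sup>th</sup> edition"))
```

A real implementation would presumably reuse the wiki parser rather than regular expressions, since templates and nested links cannot be handled reliably this way.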

This would also be useful for templates where a parameter needs to be linked in
one place but not in another, or where the template links the parameter itself
but people often link it by accident, etc. It might be useful for automated
linking to section anchors with markup, too?

Test with <sub>sub</sub> and <sup>sup</sup>

has the anchor

#Test_with_sub_and_sup

for instance.
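
A minimal sketch of that anchor case, assuming the anchor is formed simply by dropping HTML-style tags and replacing whitespace with underscores. The function name `anchor_for_heading` is hypothetical, and the wiki software's real anchor generation applies further escaping that is not modelled here.

```python
import re


def anchor_for_heading(heading: str) -> str:
    """Hypothetical helper: strip HTML-style tags, then join words with underscores."""
    text = re.sub(r"</?[A-Za-z][^<>]*>", "", heading)
    return "#" + re.sub(r"\s+", "_", text.strip())


print(anchor_for_heading("Test with <sub>sub</sub> and <sup>sup</sup>"))
```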

I'm sure there are many other template-related functions that would be helped by
this, too.


Version: unspecified
Severity: enhancement
URL: http://www.mediawiki.org/wiki/Extension:Strip_Markup

bzimport added a subscriber: wikibugs-l.
bzimport set Reference to bz8161.
bzimport created this task. Via Legacy, Dec 5 2006, 4:35 PM
bzimport added a comment. Via Conduit, Dec 5 2006, 5:35 PM

robchur wrote:

I'd be concerned about the time this might require on large chunks of text.

bzimport added a comment. Via Conduit, Dec 5 2006, 5:44 PM

ui2t5v002 wrote:

(In reply to comment #1)

> I'd be concerned about the time this might require on large chunks of text.

If that's a limitation, could it just be limited to short strings? Does the
urlencode function have the same problem?

bzimport added a comment. Via Conduit, Dec 5 2006, 5:44 PM

robchur wrote:

URL-encoding is less work.

bzimport added a comment. Via Conduit, Dec 5 2006, 5:56 PM

ui2t5v002 wrote:

Does a similar function already exist for section anchors?

bzimport added a comment. Via Conduit, Dec 5 2006, 6:41 PM

robchur wrote:

Yes, but there'd still be the potential for some moron to shove a load of
wikitext into the parser function and increase the amount of processing time.

I could just be being paranoid, of course; Tim Starling's probably the best
person to consult about this...

bzimport added a comment. Via Conduit, Dec 5 2006, 7:29 PM

ui2t5v002 wrote:

(In reply to comment #5)

> Yes, but there'd still be the potential for some moron to shove a load of
> wikitext into the parser function and increase the amount of processing time.

Yeah. The applications I'm imagining are only short snippets of text, though,
so limiting it to 100 characters or so per instance would be fine.

But then do you have to worry about many multiple instances?

> I could just be being paranoid, of course; Tim Starling's probably the best
> person to consult about this...

Yes, I was mentioning the urlencode and anchor name functions so that their
processing time and server impact could be compared.
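
The two limits discussed above (a cap on input length per use, and a worry about many uses on one page) could be sketched as a small guard like the one below. The class name `StripBudget` and the per-page call limit are assumptions for illustration; only the figure of roughly 100 characters comes from this thread.

```python
class StripBudget:
    """Hypothetical per-page guard: caps input length per call and the number of calls."""

    def __init__(self, max_chars: int = 100, max_calls: int = 50) -> None:
        self.max_chars = max_chars  # about 100 characters, as suggested in the thread
        self.max_calls = max_calls  # assumed value, not from the thread
        self.calls = 0

    def allow(self, text: str) -> bool:
        """Return True if this input may be processed, counting the call if so."""
        if len(text) > self.max_chars or self.calls >= self.max_calls:
            return False
        self.calls += 1
        return True


budget = StripBudget(max_chars=100, max_calls=2)
print(budget.allow("6<sup>th</sup> edition"))  # short input, within the limits
print(budget.allow("x" * 101))                 # rejected: longer than max_chars
```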

bzimport added a comment. Via Conduit, Dec 5 2006, 7:41 PM

ssanbeg wrote:

Image alt text may be a better comparison, i.e. [[Image:wiki.png|some text]]
will parse "some text" for the caption, then strip the tags for the alt text.

I don't think you can directly strip wiki markup, so it would seem a bit
wasteful to parse that just to discard the results, but I don't think it would
be that much slower than normal parsing.

bzimport added a comment. Via Conduit, Dec 7 2006, 6:30 AM

ui2t5v002 wrote:

(In reply to comment #7)

> Image alt text may be a better comparison, i.e. [[Image:wiki.png|some text]]
> will parse "some text" for the caption, then strip the tags for the alt text.

Oh. You mean like:

[[Image:Ant.jpg|thumb|Here is an [[ant]] with {{carbon}}{{oxygen|2}} and
3.63&times;10<sup>24</sup> things]]

will have alt text of:

Here is an ant with CO2 and 3.63×1024 things

I hadn't thought of that. So, in actuality, we already have a function that
does *exactly* what I'm looking for?

We've had it for years, it's in use on a very large number of articles, multiple
times each, and any moron can come along and put inordinate amounts of complex
wikicode into it (http://en.wikipedia.org/wiki/User:Omegatron/Sandbox) and no
one's ever complained about it causing server load problems?

:-)

How easy would it be to make this into a user-accessible ParserFunction?
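
The parse-then-strip behaviour described for image alt text can be approximated outside the wiki with Python's standard `html.parser` module. This is only a sketch: the `rendered` string is an assumed rendering of the caption (the link target and the expansion of `{{carbon}}{{oxygen|2}}` are guesses), and `text_from_html` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect only the text nodes of an HTML fragment, dropping all tags."""

    def __init__(self) -> None:
        # convert_charrefs=True turns entities such as &times; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def text_from_html(html: str) -> str:
    """Return the plain text of an already-rendered HTML fragment."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered after the last tag
    return "".join(parser.parts)


# Assumed rendering of the caption above; the real parser output may differ.
rendered = ('Here is an <a href="/wiki/Ant" title="Ant">ant</a> with CO<sub>2</sub> and '
            '3.63&times;10<sup>24</sup> things')
print(text_from_html(rendered))
```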

bzimport added a comment. Via Conduit, Dec 7 2006, 4:28 PM

ssanbeg wrote:

(In reply to comment #8)

> (In reply to comment #7)
> > Image alt text may be a better comparison, i.e. [[Image:wiki.png|some text]]
> > will parse "some text" for the caption, then strip the tags for the alt text.
>
> Oh. You mean like:
>
> [[Image:Ant.jpg|thumb|Here is an [[ant]] with {{carbon}}{{oxygen|2}} and
> 3.63&times;10<sup>24</sup> things]]
>
> will have alt text of:
>
> Here is an ant with CO2 and 3.63×1024 things
>
> I hadn't thought of that. So, in actuality, we already have a function that
> does *exactly* what I'm looking for?
>
> We've had it for years, it's in use on a very large number of articles, multiple
> times each, and any moron can come along and put inordinate amounts of complex
> wikicode into it (http://en.wikipedia.org/wiki/User:Omegatron/Sandbox) and no
> one's ever complained about it causing server load problems?
>
> :-)

Yeah, that's my thought.

> How easy would it be to make this into a user-accessible ParserFunction?

Shouldn't be too hard. I don't think a parserfunction, though, since it's
harder to pass arbitrary text to them, and it would return text anyway.
Something like

<stripmarkup>Here is an [[ant]] with {{carbon}}{{oxygen|2}} and
3.63&times;10<sup>24</sup> things</stripmarkup>

would seem reasonable.

bzimport added a comment. Via Conduit, Dec 7 2006, 4:52 PM

ui2t5v002 wrote:

(In reply to comment #9)

> Shouldn't be too hard. I don't think a parserfunction, though, since it's
> harder to pass arbitrary text to them, and it would return text anyway.

I'm not sure what you mean by this, but a stripmarkup tag (or something shorter
to type) would make me just as happy. Just as long as I can do things like
<strip>{{{parameter}}}</strip> inside a template.

bzimport added a comment. Via Conduit, Dec 7 2006, 4:58 PM

ssanbeg wrote:

strip markup extension

I think that's a bit simpler to add random text, since you don't have to worry
about something like a stray | terminating the argument.

Here's a quick extension I just put together.

Attached: StripMarkup.php
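
The stray-pipe concern can be illustrated with a toy model in Python. This is not the attached StripMarkup.php and not how the wiki parser is actually written; the `#strip` function name, the `<strip>` tag name and both helper functions are hypothetical.

```python
import re


def parser_function_args(call: str) -> list:
    """Toy model: split the body of '{{#name: a | b}}' into arguments on '|'."""
    body = call[2:-2].split(":", 1)[1]
    return [part.strip() for part in body.split("|")]


def tag_body(source: str, tag: str = "strip") -> str:
    """Toy model: return everything between <tag> and </tag> as a single string."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", source, re.DOTALL)
    return match.group(1) if match else ""


print(parser_function_args("{{#strip: a | b}}"))  # the pipe splits the text in two
print(tag_body("<strip>a | b</strip>"))           # the pipe stays ordinary text
```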

bzimport added a comment. Via Conduit, Dec 7 2006, 5:10 PM

ui2t5v002 wrote:

(In reply to comment #11)

> I think that's a bit simpler to add random text, since you don't have to worry
> about something like a stray | terminating the argument.

Very good point. I agree that the pseudo-html tags are better.

mxn added a comment. Via Conduit, Feb 1 2007, 5:03 AM

Changed summary from "ParserFunction for stripping HTML and wiki markup" to
"Syntax for stripping HTML and wiki markup" to reflect Attachment #2831.

bzimport added a comment. Via Conduit, Apr 23 2007, 11:34 PM

ui2t5v002 wrote:

Not to clutter up this bug, but are there plans for testing this/implementing it
on en?

bzimport added a comment. Via Conduit, Apr 24 2007, 12:16 AM

ayg wrote:

Note that due to bug 2257, I believe this patch would not presently work for
template parameters, the intended use. Please correct me if I'm wrong.

bzimport added a comment. Via Conduit, Apr 24 2007, 3:12 PM

ssanbeg wrote:

(In reply to comment #15)

> Note that due to bug 2257, I believe this patch would not presently work for
> template parameters, the intended use. Please correct me if I'm wrong.

Most of the examples are like <strip>{{thing}}</strip>, which would work fine;
but I see there is one example like <strip>{{{thing}}}</strip>, which wouldn't
work with the XML tag, but should be doable with a parser function.
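
A toy Python model of the difference described above, assuming (as this comment and bug 2257 suggest) that a tag extension receives its content before template parameters are substituted, while a parser function receives its argument after substitution. The function names and the `edition` parameter are hypothetical, and this does not reproduce the real parser's internals.

```python
import re

_TAG = re.compile(r"</?[A-Za-z][^<>]*>")
_PARAM = re.compile(r"\{\{\{(\w+)\}\}\}")


def strip_tags(text: str) -> str:
    """Drop HTML-style tags, keeping the text between them."""
    return _TAG.sub("", text)


def expand(text: str, params: dict) -> str:
    """Substitute {{{name}}} with its value, leaving unknown names untouched."""
    return _PARAM.sub(lambda m: params.get(m.group(1), m.group(0)), text)


def via_tag(inner: str) -> str:
    """Assumed tag behaviour: the hook sees the raw, unsubstituted inner text."""
    return strip_tags(inner)


def via_parser_function(arg: str, params: dict) -> str:
    """Assumed parser-function behaviour: the argument is substituted first."""
    return strip_tags(expand(arg, params))


params = {"edition": "6<sup>th</sup> edition"}
print(via_tag("{{{edition}}}"))                      # nothing to strip yet
print(via_parser_function("{{{edition}}}", params))  # markup removed after substitution
```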

bzimport added a comment. Via Conduit, Apr 24 2007, 3:40 PM

ui2t5v002 wrote:

(In reply to comment #16)

> Most of the examples are like <strip>{{thing}}</strip>, which would work fine;
> but I see there is one example like <strip>{{{thing}}}</strip>, which wouldn't
> work with the XML tag, but should be doable with a parser function.

All of the things I want to use this for are inside templates, like the
<strip>{{{thing}}}</strip> style.

bzimport added a comment. Via Conduit, May 14 2009, 12:05 PM

Blindwanderer wrote:

*necromancy*
I contribute to a third-party wiki, and we use tooltips to enhance the user experience. The problem is that a tooltip is an HTML attribute, so all wiki markup has to be processed and all resulting HTML markup stripped. This wouldn't be a problem if we weren't using complex templates and Extension:VariablesExtension.

Here is an example page:
https://wiki.secondlife.com/wiki/PRIM_TEXTURE

It's annoying to have to supply and handle alternate text. I'd be more than willing to limit the execution time of this function if it could reduce the complexity of our code.

mxn added a subscriber: mxn. Via Web, Nov 24 2014, 8:55 PM
