SyntaxHighlight lost support for 30 languages (regression) by switching from GeSHi to Pygments
Open, LowestPublic

Description

Many languages (mostly obscure) were de-supported due to the switch from GeSHi to Pygments for syntax highlighting (T85794).
Fallbacks have been re-introduced for many of these languages. Report any new useful fallback mappings here.

For de-supported languages that were being used on wikis, please contribute language lexers to Pygments or report them as missing at https://bitbucket.org/birkenfeld/pygments-main/issues?status=new&status=open .
Note GeSHi included support for some languages which were not used in Wikimedia, and even some which were not real languages, so please do not raise bugs for all languages in GeSHi.

The next release of Pygments will include support for the following GeSHi languages which are not present in Pygments 2.0:

  • Algol
  • BNF
  • ...

Original bug report:

I have been told by syntaxhighlight-error-category that two languages are lost:

  1. bnf
  2. gettext

They are not mentioned in rESHG/SyntaxHighlight_GeSHi.lexers.php, but mw:Extension:SyntaxHighlight GeSHi#Supported languages promises they will work, and at least they are known upstream:

Event Timeline

PerfektesChaos updated the task description.
PerfektesChaos raised the priority of this task from to Needs Triage.
PerfektesChaos added a subscriber: PerfektesChaos.
Restricted Application added a subscriber: Aklapper. · Jul 15 2015, 2:36 PM
Reedy added a subscriber: Reedy. · Jul 15 2015, 2:38 PM
if ( $wmgUseGeSHi ) {
	include( $IP . '/extensions/SyntaxHighlight_GeSHi/SyntaxHighlight_GeSHi.php' );

	// GeSHi supports 215 languages. The top 20 languages account for more than
	// 80% of usage. The bottom 75 are not used at all. Since each supported
	// language gets an entry in ResourceLoader's start-up module, it makes
	// sense to be economical and drop support for those languages. (T93025)
	$wgGeSHiSupportedLanguages = array(
		"c", "cpp", "bash", "html4strict", "text", "java", "latex",
		"javascript", "python", "xml", "csharp", "php", "css", "asm", "sql",
		"pascal", "matlab", "html5", "haskell", "vb", "lisp", "ruby", "ada",
		"oracle11", "dos", "rsplus", "fortran", "d", "bnf", "ocaml", "pcre",
		"perl", "vhdl", "actionscript", "lua", "bibtex", "go", "bf", "cobol",
		"ini", "delphi", "arm", "scheme", "objc", "prolog", "actionscript3",
		"mysql", "qbasic", "asp", "algol68", "groovy", "erlang", "abap",
		"email", "powershell", "ecmascript", "glsl", "sas", "apache", "yaml",
		"java5", "vbnet", "reg", "cfm", "fsharp", "scala", "applescript",
		"gwbasic", "clojure", "pli", "robots", "tsql", "whois", "freebasic",
		"verilog", "llvm", "visualfoxpro", "sparql", "tcl", "plsql",
		"coffeescript", "scilab", "dot", "autoit", "boo", "mirc", "lolcode",
		"gnuplot", "eiffel", "j", "teraterm", "oorexx", "diff", "smalltalk",
		"cmake", "avisynth", "perl6", "xpp", "typoscript", "basic4gl", "make",
		"awk", "e", "gml", "jquery", "zxbasic", "systemverilog", "6502acme",
		"properties", "oracle8", "q", "purebasic", "pic16", "ldif", "rexx",
		"unicon", "urbi", "modula3", "mpasm", "locobasic", "progress",
		"visualprolog", "vala", "octave", "winbatch", "oz", "autohotkey",
		"cadlisp", "euphoria", "pycon", "oobas", "povray", "thinbasic",
		"68000devpac", "mmix", "modula2", "cil", "mxml", "io", "blitzbasic",
		"parigp", "oberon2",
	);

}
	// GeSHi supports 215 languages. The top 20 languages account for more than
	// 80% of usage. The bottom 75 are not used at all. Since each supported
	// language gets an entry in ResourceLoader's start-up module, it makes
	// sense to be economical and drop support for those languages. (T93025)

That's no longer true.

Thank you for the patch.

However, I did mention above the rESHG/SyntaxHighlight_GeSHi.lexers.php which claims to be generated automatically by some updateLexerList.php script.

There are 572 languages listed, but neither bnf nor gettext.

These ones and potentially others vanished somewhere, perhaps showing up with next upstream splash?

Change 224799 had a related patch set uploaded (by Alex Monk):
Re-enable all languages in GeSHi

https://gerrit.wikimedia.org/r/224799

jayvdb added a subscriber: jayvdb. · Jul 15 2015, 11:00 PM

The listing of supported languages on the MediaWiki page was not correct.
https://www.mediawiki.org/w/index.php?title=Extension:SyntaxHighlight_GeSHi&diff=1752036&oldid=1750595

Note that bnf should be supported due to aacd82820bdc2. However, I didn't find a suitable fallback for gettext.

Change 224799 merged by jenkins-bot:
Re-enable all languages in GeSHi

https://gerrit.wikimedia.org/r/224799

@jayvdb

Thank you for your hint wrt ebnf. I used it to improve one article.

  • However, the mapping ebnf ← bnf has no effect right now.
  • The mapping is not correct, btw.
    • While BNF encloses the metasyntactic variables in <angle brackets>, eBNF does not.
    • Assignment is made by ::= in BNF but with = only in eBNF.
    • There are more differences on bracket interpretation.
    • Therefore ebnf is throwing errors when applied to native BNF syntax.
    • You will find nice .err spans (and might restore red borders) on


<syntaxhighlight lang="ebnf">
<Program> ::= 'PROGRAM' <Identifier> 'BEGIN' <Statements> 'END' .
<Identifier> ::= <Letter> <IdentifierConsec>
<IdentifierConsec> ::= | <Letter or Digit> <IdentifierConsec>
<Letter or Digit> ::= <Letter> | <Digit>
</syntaxhighlight>

  • pygments.lexers.parsers.EbnfLexer claims alphanumeric identifiers only:


'identifier': [ (r'([a-zA-Z][\w \-]*)', Keyword) ]

  • One might upload a bracketed identifier and ::= variant for basic BNF at upstream.
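
The differences listed above (angle-bracketed identifiers and ::= for assignment) can be illustrated with a minimal standalone sketch. It uses only Python's re module, not the Pygments lexer API, and the token names (nonterminal, assign, and so on) are hypothetical labels chosen for this example; an actual upstream lexer would look different.

```python
import re

# Hypothetical token names for illustration; these are not Pygments token types.
BNF_TOKEN_RE = re.compile(
    r"""
    (?P<nonterminal><[^<>\n]+>)            # metasyntactic variable in <angle brackets>
  | (?P<assign>::=)                        # BNF assignment operator
  | (?P<alternative>\|)                    # separator between alternatives
  | (?P<terminal>'[^'\n]*'|"[^"\n]*")      # quoted terminal symbol
  | (?P<space>\s+)                         # whitespace, dropped below
  | (?P<other>.)                           # any other single character
    """,
    re.VERBOSE,
)


def tokenize_bnf(text):
    """Return (kind, value) pairs for a BNF snippet, skipping whitespace."""
    return [
        (m.lastgroup, m.group())
        for m in BNF_TOKEN_RE.finditer(text)
        if m.lastgroup != "space"
    ]
```

With this sketch, an identifier containing spaces such as <Letter or Digit> comes out as a single nonterminal token, which is the behaviour the plain BNF sample above needs.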

The other fellow: gettext is known to us as pot (what did they smoke?) or po and works fine now.

  • A mapping gettext → pot should be established, and all such mappings need to be recorded in rESHG/SyntaxHighlight_GeSHi.lexers.php
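
As a rough illustration of how such a fallback lookup could behave, here is a small Python sketch. It is not the extension's actual PHP code; FALLBACK_LEXERS and resolve_lexer are hypothetical names, and only the two table entries are taken from this thread.

```python
# Hypothetical alias table; only these two entries come from this task.
FALLBACK_LEXERS = {
    "gettext": "pot",  # GeSHi name mapped to the Pygments lexer alias
    "bnf": "ebnf",     # imperfect fallback: EBNF syntax differs from BNF
}


def resolve_lexer(lang, supported):
    """Return a supported lexer name for lang, or None if nothing fits."""
    name = lang.lower()
    if name in supported:
        return name
    fallback = FALLBACK_LEXERS.get(name)
    if fallback in supported:
        return fallback
    return None
```

A language with neither a native lexer nor a fallback resolves to None, which is the case that would land a page in the error category.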
jayvdb renamed this task from SyntaxHighlight lost 2 languages (regression) to SyntaxHighlight lost many languages (regression). · Jul 18 2015, 7:02 AM
jayvdb set Security to None.

Change 225527 had a related patch set uploaded (by John Vandenberg):
Map 'gettext' to 'pot'

https://gerrit.wikimedia.org/r/225527

Change 225527 merged by jenkins-bot:
Map 'gettext' to 'pot'

https://gerrit.wikimedia.org/r/225527

ori closed this task as Resolved. · Dec 12 2015, 11:45 AM
ori claimed this task.
Restricted Application added a subscriber: StudiesWorld. · Dec 12 2015, 11:45 AM

How has this regression been resolved?

Reedy added a comment. · Dec 12 2015, 7:30 PM

How has this regression been resolved?

Presumably by re-enabling all languages, and adding some mapping for others.

What is still missing?

ori added a comment. · Dec 12 2015, 7:52 PM

How has this regression been resolved?

It's not a regression. We migrated from an unmaintained syntax highlighting library (GeSHi) to one that is both mature and active (Pygments). Each library has its own set of lexers; Pygments' is decidedly more modern and comprehensive, and it includes many popular languages and dialects which GeSHi did not support, like Swift and ECMAScript 6. On the whole, we got about 400 new lexers and lost about 30. Good trade.

Lexer requests should be made upstream:
https://bitbucket.org/birkenfeld/pygments-main/issues?status=new&status=open
http://pygments.org/docs/lexerdevelopment/

How has this regression been resolved?

Presumably by re-enabling all languages

How were they re-enabled?

.. and adding some mapping for others.

The mappings that I added are typically not sufficient; they are still typically regressions, but better than nothing.
That is especially true for all of the assembly languages we lost, as they have lots of keywords which are now not recognised.

T105889#1456754 above indicates the bnf/ebnf mapping is problematic.
Typically bnf was used for describing other concepts, so they could be converted to ebnf without loss of clarity.
https://en.wikipedia.org/wiki/Talk:Backus%E2%80%93Naur_Form#bnf_syntax_highlighting_lost
bnf support has been added to Pygments.

What is still missing?

many examples of the regressions at

https://en.wikipedia.org/wiki/Category:Pages_with_syntax_highlighting_errors

My analysis indicates that the most commonly used lost languages are ALGOL 68 and PL/I
https://en.wikipedia.org/wiki/Talk:ALGOL_68#Upcoming_SyntaxHighlight_GeSHi_changes
https://en.wikipedia.org/wiki/Talk:PL/I#Syntax_highlighting

Algol has been added to Pygments, but still needs one more patch to be reasonable. The maintainers are trying to get it into the next release.
I don't see any patches related to PL/I support.

jayvdb changed the task status from Resolved to Declined. · Dec 12 2015, 8:20 PM

How has this regression been resolved?

It's not a regression. ... On the whole, we got about 400 new lexers and lost about 30. Good trade.

No disputing it was a good trade, but those 30 (it was higher, IIRC) are regressions, which this task is tracking. IMO this task should be Open, with a subtask for each upstream request for a missing lexer that was used on Wikimedia projects at least, or for any others raised by wikis that were using SyntaxHighlight, and should track those regressions to completion.

ori reopened this task as Open. · Dec 12 2015, 8:22 PM

@jayvdb, thanks. That makes sense to me.

Aklapper renamed this task from SyntaxHighlight lost many languages (regression) to SyntaxHighlight lost support for 30 languages (regression) by switching from GeSHi to Pygments. · Dec 12 2015, 8:55 PM
Aklapper triaged this task as Lowest priority.
Aklapper added a project: Regression.
jayvdb updated the task description. · Dec 12 2015, 9:00 PM
jayvdb removed a project: Patch-For-Review.
Arthur2e5 added a subscriber: Arthur2e5. · Edited · Apr 11 2016, 2:10 PM

autoconf => bash (technically m4 with [] as quotes and then shell, but I can't find an official m4 lexer)

Actually there is an m4 lexer at https://github.com/FabriceSalvaire/pygments-lexer/blob/master/m4.py, but to deal with the quotes you need to replace BACKTICKS with \[ in the regexes and ' with \]. Then chain it up with bash like how they did with HtmlPhpLexer.
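
The quote swap described above can be sketched in isolation. The following standalone Python scanner is only an illustration of the idea that the same quoting logic works for m4's default quotes and for autoconf's square brackets; it does not use the linked m4 lexer or Pygments, and find_quoted, M4_QUOTES and AUTOCONF_QUOTES are hypothetical names.

```python
M4_QUOTES = ("`", "'")        # default m4 quote characters
AUTOCONF_QUOTES = ("[", "]")  # quote characters used in autoconf input


def find_quoted(text, open_q, close_q):
    """Yield (start, end) spans of top-level quoted strings.

    Quotes nest, so the scanner tracks depth instead of using one regex.
    """
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == open_q:
            if depth == 0:
                start = i
            depth += 1
        elif ch == close_q and depth > 0:
            depth -= 1
            if depth == 0:
                yield (start, i + 1)
```

Calling find_quoted(source, *AUTOCONF_QUOTES) on a configure.ac fragment returns the spans of each bracketed argument, including nested brackets as part of the outer span.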

Restricted Application added a subscriber: TerraCodes. · Apr 19 2016, 4:00 PM

autoconf => bash (technically m4 with [] as quotes and then shell, but I can't find an official m4 lexer)

Actually there is an m4 lexer at https://github.com/FabriceSalvaire/pygments-lexer/blob/master/m4.py, but to deal with the quotes you need to replace BACKTICKS with \[ in the regexes and ' with \]. Then chain it up with bash like how they did with HtmlPhpLexer.

Are you aware of any page which was using lang=autoconf before the switch? i.e. on Wikimedia wikis or elsewhere.

I see only two Wikipedia pages with autoconf source (https://cs.wikipedia.org/wiki/Autoconf and https://pl.wikipedia.org/wiki/Autoconf ) and no Wikibooks on https://www.wikidata.org/wiki/Q1336937 ;-( , and none of those pages used lang=autoconf.

I don't believe that typical autoconf source files benefit from downgrading to lang=bash/sh, or lang=m4 if it existed in the core pygments library (I'll need to look at that m4 lexer in detail; it should be fun), as a typical configure.in/ac file contains more use of the specialised autoconf m4 macros than raw shell coding or complex m4 usage.

Specialised autoconf support would be a nice little microtask for anyone wanting to work on Pygments, as the language needs only some very simple rules around keywords to be quite useful, and there are almost no strict rules to autoconf syntax (or m4 for that matter), as in the worst case almost every part of syntax can be modified many times within the source code, making almost anything legal.
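
To give a feel for how little is needed, here is a minimal keyword-rule sketch in plain Python. It is not a Pygments lexer; the function name highlight_spans and the span kinds are hypothetical, and it assumes only that macro names carry the usual AC_, AM_ or AS_ prefixes and that dnl starts a comment running to the end of the line.

```python
import re

# Macro names with the common Autoconf/Automake/M4sh prefixes.
MACRO_RE = re.compile(r"\b(?:AC|AM|AS)_[A-Z0-9_]+\b")
# dnl discards the rest of the line, so treat it as a comment.
COMMENT_RE = re.compile(r"\bdnl\b.*$", re.MULTILINE)


def highlight_spans(source):
    """Return sorted (start, end, kind) spans for dnl comments and macro names."""
    comments = [(m.start(), m.end(), "comment") for m in COMMENT_RE.finditer(source)]

    def in_comment(pos):
        return any(start <= pos < end for start, end, _ in comments)

    macros = [
        (m.start(), m.end(), "macro")
        for m in MACRO_RE.finditer(source)
        if not in_comment(m.start())
    ]
    return sorted(comments + macros)
```

Macro names that appear inside a dnl comment are deliberately left as part of the comment span.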

TheDJ added a subscriber: TheDJ. · Dec 8 2016, 3:07 PM
TheDJ removed ori as the assignee of this task. · May 26 2017, 12:38 PM