Page MenuHomePhabricator

Fix nbsp arming for inward-pointing French quotes
Open, Needs TriagePublicBUG REPORT

Description

In French, guillemets are used pointing outwards and are supposed to be spaced out by (thin) spaces, as in

This is the « French » way.

For this reason, includes/parser/Sanitizer.php "arms" them with   (in the following represented by ):

This is the «▁French▁» way.

However, in a number of languages the guillemets are used pointing inwards, as in

This is the »Croatian« way.

Currently, the "arming" for these languages works like

This is the »Croatian«▁way.

Obviously, that is incorrect.

This can be a little annoying to work with in VisualEditor (see here), where   is shown as a gray rectangle with a pop-up hint, which affects cursor.
(A user filed a complaint about this on hrwiki, when he noticed some unusual line breaks.)

A minimal fix would be to replace

'/([«‹]) /u' => "\\1$space",

with

'/(?<!\w)([«‹]) /u' => "\\1$space",

Note: Another sanitizer, mediawiki/services/parsoid/src/Core/Sanitizer.php, will have to be synced.

Event Timeline

Change #1083885 had a related patch set uploaded (by Ponor; author: Ponor):

[mediawiki/core@master] Fix nbsp arming for inward-pointing French quotes

https://gerrit.wikimedia.org/r/1083885

It would probably be better just to make "french spacing" an optional feature, possibly controlled by the new CommunityConfiguration extension. It doesn't really do the right thing except in Francophone languages, and should probably just be disabled for (eg) Croatian.

This could also be a output transform, and removed from the "edit mode" parse, but that would probably require fixes in VisualEditor because they'd still probably want the "non breaking space" to display that way during editing. Unless VE is already just stripping these?

Ideally, yes. Though we benefit from the automatic &nbsp; armoring of our percentage sign that comes with that function, as in 42▁%.
That's why I'm hoping for a quick and minimal solution.

Any option is fine for French as long as:

  1. the result remains «[no brake space][content][no brake space]» in final rendering,
  2. the change is not noticeable (we fill in the configuration before releasing).

[From gerrit]

I understand that, and I'm asking for help, as we're used to ask on any collaborative site. I am questioning whether the CURRENT test cases are actually testing anything important. And whatever they are testing, they should be failing for any wiki that is not French, yet they are not. I'm okay with you guys adding your -1's, but someone will at some point have to work on this bug. So let's continue this discussion at phab, please

There are like 10,000 tests, so you will have to be more specific, but yes, broadly speaking the tests are testing important things. It wouldn't surprise me if a few are of questionable value, but the vast majority are important.

Some tests are marked as for a specific language, so a test marked as for french will only run in french mode.

Everytime you submit a patch, jenkins-bot runs all the tests and will mark the patch -1 if any of the tests don't work, so rest assured the tests are run regularly.

If its helpful, there is additional information about parser tests at https://www.mediawiki.org/wiki/Manual:Parser_tests

Per @cscott at Gerrit:

  • This is still waiting for a test case added to test/parser/parserTests.txt -- search for "french spaces". The test case "French spaces in wikitext" is a good start, but note also the tests in definition lists, and with surrounding span tags.

They are testing whether the characters » « are inside of some environments. Okay. But why? Why only these environments? What could possibly get broken there?

Now, since the thing I'm adding to '/([«‹]) /u' is "only if it isn't preceded by a word character", '/(?<!\w)([«‹]) /u' I am not really adding any new features, I'm only preventing the wrong ones to appear. The new rule does not break the old tests, shouldn't that be enough?

They are testing whether the characters » « are inside of some environments. Okay. But why? Why only these environments?

Because being exhaustive is hard so some representitive examples are chosen. Life's imperfect. Everyone is welcone to add additional tests if they feel so inclined.

I am not really adding any new features, I'm only preventing the wrong ones to appear. The new rule does not break the old tests, shouldn't that be enough?

No. Tests should verify that the wrong behaviour does not reappear. It is especially important in situations serious enough where a bug report is written so we dont end up fixing the same bug over and over again. See https://en.wikipedia.org/wiki/Regression_testing for more information on this concept.

There should be tests such that if your patch was removed it would cause a test failure.

The first eight lines of this task description offer two excellent test cases which could be included in a new parser test.

I will add that at the moment there are three different parts of the 'french spacing' issue: the legacy parser, Parsoid, and Visual Editor. We really need to verify that all three of these handle "french spaces" consistently after the change. But let's start with a test case demonstrating that your patch does what it claims to do.

Test cases are also documentation for future maintainers as to *why* code behaves the way it does, which is why the examples from the task description are so useful. They prevent regressions caused by future modifications to the code.

Change #1270156 had a related patch set uploaded (by SomeRandomDeveloper; author: SomeRandomDeveloper):

[mediawiki/services/parsoid@master] Sanitizer: Don't arm inward-pointing french quotes

https://gerrit.wikimedia.org/r/1270156