
Validate that the provided alphabet doesn’t include combinations like “ab, ac” etc.
Open, Needs Triage, Public

Description

Since we autogenerate combinations like “ab, ac” etc. for backlink markers, we want to display a warning when users enter them in the config settings. See the epic for details.

We are aware of two situations that are particularly relevant here:

  1. There are alphabets that actually contain such duplicates. E.g. hsb actually contains all of "CH", "C", "H", "DŹ", "D", and "Ź" as individual characters.
  2. We don't want admins to manually enter anything beyond the basic alphabet, as was done with the old message.

We don't know yet what the most correct solution for situation 1 is. It should be possible to have different solutions for the two situations: in situation 2 everything after the basic alphabet is a duplicate, while the duplicates in situation 1 are scattered throughout the alphabet. A superb idea came up here: we treat the first character as special and match it against the remaining string. If it appears again, it's clear that the admin tried to continue the sequence manually.
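
A minimal sketch of this first-character heuristic, in Python purely for illustration. It assumes the alphabet arrives as a whitespace-separated string of symbols; the function name `looks_manually_continued` is hypothetical and not part of Cite.

```python
def looks_manually_continued(alphabet: str) -> bool:
    """Return True if the first symbol shows up again inside a later symbol.

    Assumption: symbols are separated by whitespace, e.g. "a b c aa ab".
    """
    symbols = alphabet.split()
    if len(symbols) < 2:
        return False
    first, rest = symbols[0], symbols[1:]
    # "aa" or "ab" contain the first symbol "a" again, which suggests the
    # admin continued the sequence by hand (situation 2).
    return any(first in symbol for symbol in rest)
```

With this sketch, "a b c aa ab ac" is flagged, while "a b c" is not. An excerpt in the style of situation 1, such as "a b c d dź e h ch", also passes, because "a" never recurs even though "ch" is built from "c" and "h".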

Open questions:

  • Do we want the validation (only) client-side as part of the CommunityConfiguration page, or raise it as an error in other places as well?
    • The renderer could keep track of previous labels and simply skip duplicates. We could signal this to all users with the existing error reporting system in Cite (i.e. rendered as part of the article).
    • As above, but instead of skipping labels the renderer stops using the bogus alphabet and falls back to the default a…z.
    • We let the special page fail so the admin cannot save a bogus alphabet in the first place. This solves the issue in some situations, but not when an alphabet is provided manually via the $wg… configuration. However, that might be totally acceptable for all use-cases we care about.
    • …?
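
To make the first option a bit more concrete, here is a rough Python sketch of a renderer-side filter that remembers previous labels and skips repeats. The names `unique_labels` and `skipped` are hypothetical; how the skipped labels would be passed to Cite's existing error reporting is left open here.

```python
from typing import Iterable, Iterator, List


def unique_labels(labels: Iterable[str], skipped: List[str]) -> Iterator[str]:
    """Yield each label once; append every repeated label to `skipped`."""
    seen = set()
    for label in labels:
        if label in seen:
            # A repeat: remember it so it could be reported later.
            skipped.append(label)
            continue
        seen.add(label)
        yield label


skipped: List[str] = []
result = list(unique_labels(["a", "b", "aa", "ab", "aa", "ab", "aaa"], skipped))
```

In this example `result` keeps "a", "b", "aa", "ab", "aaa" in that order, and `skipped` collects the second "aa" and "ab".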

Event Timeline

As for the "why".

  • To illustrate the behavior, let's use the short alphabet "a b c" as an example.
    • Our code will generate "aa ab ac ba bb bc ca cb cc aaa aab aac aba abb abc …" and so on. There are 3^1 = 3 single letters, 3^2 = 9 two-letter combinations, 3^3 = 27 three-letter combinations, and so on.
    • Let's assume the user manually enters "a b c aa ab ac ba bb bc ca cb cc". How should our code know what this means? How should our code know what to do after it runs out of user-provided labels?
    • The standard algorithm we have in mind would continue this sequence with "aa ab ac aaa aab aac aba abb abc aca acb acc ba bb bc baa …" in a weird order with weird duplicates.
    • It might be possible to skip duplicates in the auto-generated sequence. This might work in some cases – or create unpredictable and unexpected results. It also makes the code quite complicated – for not much benefit.
  • As it often goes, it's easy to remove a limitation later in case it turns out that users really need the freedom. But it's hard to add a limitation later. Therefore we argue that we should start with the restriction in place.
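
The "a b c" example can be reproduced with a short Python sketch. It assumes the generator emits all one-symbol labels, then all two-symbol combinations, and so on, treating every configured entry as one symbol. The name `backlink_labels` is hypothetical and this is not the actual Cite code.

```python
from itertools import count, islice, product
from typing import Iterator, List, Sequence


def backlink_labels(symbols: Sequence[str]) -> Iterator[str]:
    """Yield labels of 1 symbol, then 2 symbols, then 3, and so on."""
    for length in count(1):
        for combo in product(symbols, repeat=length):
            yield "".join(combo)


# Basic alphabet: 3 single letters, then 9 pairs, then the triples.
basic: List[str] = "a b c".split()
basic_labels = list(islice(backlink_labels(basic), 18))

# Manually extended alphabet: each of the 12 entries counts as one symbol.
manual: List[str] = "a b c aa ab ac ba bb bc ca cb cc".split()
manual_labels = list(islice(backlink_labels(manual), len(manual) + 16))
continuation = manual_labels[len(manual):]
duplicates = [label for label in continuation if label in manual]
```

Under these assumptions, `continuation` starts with "aa ab ac aaa aab aac aba abb abc aca acb acc ba bb bc baa", and `duplicates` holds the six labels "aa ab ac ba bb bc" that the admin had already entered by hand.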