Investigate the parsing of config files for bad templates and categories
Closed, ResolvedPublic

Description

Acceptance Criteria:

Notes

  • The result should be two lists of category and template names.
  • Possible solutions: wiki parser, HTML DOM, regex.

Event Timeline

Lea_WMDE triaged this task as Medium priority. May 2 2018, 11:23 AM
Lea_WMDE created this task.
Lea_WMDE updated the task description. (Show Details)
  • ForTheCommonGood reads its own config file format, which partly uses regex to describe what should be done.
  • MTC! does some hardcoded things to normalize or remove stuff
  • CommonsHelper has the configuration coded into PHP arrays in the code files.
  • CommonsHelper2 reads the raw wikitext of the config files using action=raw and parses it manually using explode(), substr() and str_replace().

Using the HTML or DOM of the rendered page gives us a lot of unwanted overhead. Like CommonsHelper2 itself, I would go for the plain raw wikitext, fetched either via action=raw or via the API.
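
Fetching the raw wikitext through action=raw could look roughly like the sketch below. It is an illustration only, not CommonsHelper2's code: the function names and the example.org host are placeholders, and the only things taken as given are the `title` and `action=raw` parameters of `index.php`.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def raw_wikitext_url(index_php: str, title: str) -> str:
    """Build the index.php URL that returns a page's raw wikitext."""
    return index_php + "?" + urlencode({"title": title, "action": "raw"})


def fetch_raw_wikitext(index_php: str, title: str) -> str:
    """Download the raw wikitext (performs a network request)."""
    with urlopen(raw_wikitext_url(index_php, title)) as response:
        return response.read().decode("utf-8")


# Placeholder host and page title, for illustration only.
print(raw_wikitext_url("https://example.org/w/index.php", "Config Page"))
```

Keeping the URL construction separate from the download makes the first part easy to check without network access.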

Since the given wikitext is pretty well structured and minimalistic, in my opinion the best solution would be using regex. As a first step, we should split the config into the parts defined by the headlines and then grab the individual settings from each part according to its layout.
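
The two steps described above (split by headlines, then read the settings of each part) can be sketched with two regular expressions. The page layout used here, nested `== ... ==` headings followed by `* item` lists, and the section names in the sample are assumptions for illustration; the real config pages may be laid out differently.

```python
import re
from typing import Dict, List, Tuple

# A heading has the same number of '=' on both sides of its title.
HEADING = re.compile(r"^(=+)\s*([^=\s](?:.*?[^=\s])?)\s*\1\s*$")
# A setting is assumed to be a plain bullet item.
LIST_ITEM = re.compile(r"^\*\s*(.+?)\s*$")


def parse_config(wikitext: str) -> Dict[str, List[str]]:
    """Map each heading path (e.g. 'Categories/Bad') to its list items."""
    sections: Dict[str, List[str]] = {}
    stack: List[Tuple[int, str]] = []  # (heading level, title) of open sections
    for line in wikitext.splitlines():
        heading = HEADING.match(line)
        if heading:
            level = len(heading.group(1))
            # Close every section at the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, heading.group(2)))
            sections.setdefault("/".join(t for _, t in stack), [])
            continue
        item = LIST_ITEM.match(line)
        if item and stack:
            sections["/".join(t for _, t in stack)].append(item.group(1))
    return sections


# Hypothetical sample page, not copied from a real config.
sample = (
    "== Categories ==\n=== Bad ===\n* Foo\n* Bar baz\n"
    "== Templates ==\n=== Bad ===\n* Information\n"
)
config = parse_config(sample)
bad_categories = config.get("Categories/Bad", [])
bad_templates = config.get("Templates/Bad", [])
```

With this layout, the two lists asked for in the task description fall out as `bad_categories` and `bad_templates`.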

WMDE-Fisch added subscribers: Andrew-WMDE, thiemowmde.

Any comments on that, @thiemowmde @Andrew-WMDE? Otherwise this could be moved to Demo.

I would like to see one of these ForTheCommonGood config files. Can you provide a link?

I agree and would also go for the CommonsHelper2 approach, possibly using regular expressions or whatever a trivial string parser needs. Note that we are not going to "reinvent the wikitext parser". These config pages use a very, very small subset of wikitext. We can control this very well. The only thing we need to do is to make sure there is a proper way to report errors back to the users. It is very easy to introduce mistakes on these config pages. If this happens, instructions on how to fix it should be as clear and easy to follow as possible.
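
One way to report mistakes back to editors is to collect every line the parser does not understand, together with its line number and a hint on how to fix it. The sketch below assumes the config consists only of `== ... ==` headings and `* item` lines; the `ConfigProblem` type and the hint texts are made up for illustration.

```python
import re
from dataclasses import dataclass
from typing import List

HEADING = re.compile(r"^(=+)\s*([^=\s](?:.*?[^=\s])?)\s*\1\s*$")
LIST_ITEM = re.compile(r"^\*\s*\S")


@dataclass
class ConfigProblem:
    line_number: int  # 1-based, as shown in the wiki editor
    line: str         # the offending line, unchanged
    hint: str         # how to fix it


def find_problems(wikitext: str) -> List[ConfigProblem]:
    """Return one problem per line that is neither heading, item, nor blank."""
    problems = []
    for number, line in enumerate(wikitext.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or HEADING.match(stripped) or LIST_ITEM.match(stripped):
            continue
        if stripped.startswith("="):
            hint = "Heading needs the same number of '=' on both sides, e.g. '== Title =='."
        elif stripped == "*":
            hint = "List item is empty; add a name after '*' or remove the line."
        else:
            hint = "Line is neither a heading nor a '* item'; start it with '*' or remove it."
        problems.append(ConfigProblem(number, line, hint))
    return problems


for problem in find_problems("== Categories ==\n* Good\n== Broken =\nstray text\n*\n"):
    print(f"Line {problem.line_number}: {problem.line!r} - {problem.hint}")
```

Collecting all problems instead of stopping at the first one lets an editor fix the whole page in one pass.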

http://atlight.github.io/ftcg/localdata.html

Awesome, thanks a lot!

@WMDE-Fisch, my impression is that these ForTheCommonGood config files are way ahead of the CommonsHelper2 ones. The syntax is more powerful, and the config files are generally longer and appear much more closely maintained (they contain a lot of replacements CommonsHelper2 doesn't and can't do). However, I still suggest sticking to the CommonsHelper2 config for the MVP, for one particular reason: we really don't want to execute regular expressions anyone can edit. The CommonsHelper2 config syntax currently avoids this, as far as I can see.

If it later turns out we need a more complex syntax, I suggest starting our own JSON-based one (compare TemplateData). But we should still avoid user-defined regex, if possible. ;-)
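
To make the idea concrete, here is a rough sketch of what a JSON-based config without user-defined regex could look like. The schema (`badTemplates`, `badCategories`, `replacements` with `from`/`to`) is entirely hypothetical and is not TemplateData's format; the point is only that replacements are applied as literal strings, so nothing an editor writes is ever executed as a pattern.

```python
import json

# Hypothetical config; keys and values are invented for illustration.
CONFIG_JSON = """
{
  "badTemplates": ["Delete me"],
  "badCategories": ["Local only"],
  "replacements": [
    {"from": "{{Old license}}", "to": "{{New license}}"}
  ]
}
"""


def apply_replacements(wikitext: str, config: dict) -> str:
    """Apply each replacement as a literal string, never as a regex."""
    for rule in config["replacements"]:
        wikitext = wikitext.replace(rule["from"], rule["to"])
    return wikitext


config = json.loads(CONFIG_JSON)
print(apply_replacements("{{Old license}} some text", config))
```

Because `str.replace` treats its arguments literally, a rule such as `".*"` would only match those two characters rather than the whole text.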

! In T193620#4190563, @thiemowmde wrote:

@WMDE-Fisch, my impression is that these ForTheCommonGood config files are way ahead of the CommonsHelper2 ones. The syntax is more powerful, and the config files are generally longer and appear much more closely maintained (they contain a lot of replacements CommonsHelper2 doesn't and can't do). However, I still suggest sticking to the CommonsHelper2 config for the MVP, for one particular reason: we really don't want to execute regular expressions anyone can edit. The CommonsHelper2 config syntax currently avoids this, as far as I can see.

I agree, but for the reasons given we should still stay with what we are going for now in the MVP. Extending the code, or even changing the approach later, should not be too hard.

In my mind user-defined regex is essential for this system to be fully effective. Maybe not essential for an MVP, true. But a lack of proper replacement functionality in existing tools was one of the things that drove me to build powerful regex replacements into FtCG - it seriously minimises the amount of grunt work involved in performing transfers, in some cases reducing it to no manual work at all.

@WMDE-Fisch Sounds good! I'm also in favor of using the CommonsHelper2 approach for now at least.

Vvjjkkii renamed this task from Investigate the parsing of config files for bad templates and categories to zsdaaaaaaa. Jul 1 2018, 1:12 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed WMDE-Fisch as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed the point value for this task.
Vvjjkkii edited subscribers, added: WMDE-Fisch; removed: Aklapper.
Bodhisattwa renamed this task from zsdaaaaaaa to Investigate the parsing of config files for bad templates and categories. Jul 1 2018, 1:32 PM
Bodhisattwa closed this task as Resolved.
Bodhisattwa assigned this task to WMDE-Fisch.
Bodhisattwa lowered the priority of this task from High to Medium.
Bodhisattwa updated the task description. (Show Details)
Bodhisattwa added a subscriber: Aklapper.