
Create a bot to replace deprecated math syntax
Closed, Resolved (Public)

Description

Create and run a bot, Texvc2LaTeXBot, to carry out the automated task of replacing \and with \land, etc.

En wiki userpage https://en.wikipedia.org/wiki/Wikipedia:Bots/Texvc2LaTeXBot

Repository location https://phabricator.wikimedia.org/diffusion/TTEX/repository/master/

Event Timeline

SalixAlba created this task.

I've run the bot on a few pages on en-wiki and found a few problems.

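In this edit
https://en.wikipedia.org/w/index.php?title=User:Salix_alba/Angle_notation&diff=prev&oldid=846994088&diffmode=source
it replaced <math>1\ang \theta</math> with an odd control character, <math>1�ngle \theta</math>, because re.sub reads the \a at the start of the replacement \angle as an escape sequence.
The error was fixed by changing the replacement string in the regexp from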

mstr=re.sub(r'(?<!\\)\\ang(?=[^a-zA-Z])', u'\\angle',mstr)

to

mstr=re.sub(r'(?<!\\)\\ang(?=[^a-zA-Z])', r'\\angle',mstr)
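
The two replacement strings can be compared in isolation. This is a minimal sketch (the variable names are illustrative only): re.sub processes backslash escapes in the replacement, so \a becomes the BEL control character unless the backslash is doubled.

```python
import re

formula = r"1\ang \theta"
pattern = r"(?<!\\)\\ang(?=[^a-zA-Z])"

# '\\angle' is the six-character string \angle; re.sub expands \a to BEL (0x07).
broken = re.sub(pattern, '\\angle', formula)

# r'\\angle' keeps both backslashes; re.sub turns \\ into one literal backslash.
fixed = re.sub(pattern, r'\\angle', formula)

print(repr(broken))
print(repr(fixed))
```

Passing a function as the replacement would be another way to avoid escape processing altogether.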

It's missing cases where the equation is split over multiple lines, as in https://en.wikipedia.org/wiki/User:Salix_alba/Annuity

<center><math>
PV(0.12/12,5\times 12,\$100) = $100 \times a_{\overline{60}|0.01}
= \$4\,495.50
</math></center>

the bot does not replace the $100. This is fixed by using re.DOTALL in the math regexp, but that opens up the problem of more incorrect matches.
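
Both effects of re.DOTALL can be seen in a small sketch. The pattern below is a hypothetical stand-in, since the bot's actual math regexp is not quoted in this task.

```python
import re

text = "<center><math>\nPV = $100 \\times a\n= \\$4\\,495.50\n</math></center>"

# Hypothetical stand-in for the bot's math regexp.
single_line = re.compile(r"<math>(.*?)</math>")
multi_line = re.compile(r"<math>(.*?)</math>", re.DOTALL)

# Without DOTALL '.' stops at newlines, so the multi-line formula is missed.
print(single_line.findall(text))
# With DOTALL the whole formula body, including $100, is captured.
print(multi_line.findall(text))

# The downside: an unclosed <math> now runs on to the next </math>.
stray = "a <math> b\nplain text with 100%\n<math>x</math>"
swallowed = multi_line.findall(stray)
print(swallowed)
```

In the last case the plain text between the two tags ends up inside the "formula", which is the kind of incorrect match mentioned above.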

In https://en.wikipedia.org/w/index.php?title=User:Salix_alba/Affine_focal_set&diff=prev&oldid=846999619&diffmode=source

the bot misses one \bold:

i.e. points <math>\bold{x} = p + t\bold{A}</math> where

is replaced by

i.e. points <math>\bold{x} = p + t\mathbf{A}</math> where

I don't quite understand why it is failing here. I'm a little uneasy about the

page.text=page.text.replace(math[i],mstr)

line, as it is not guaranteed that the part of the file being replaced is the one the formula came from: str.replace changes every occurrence of math[i] in the page.

SalixAlba renamed this task from Create a bot to replace deprecated syntax to Create a bot to replace deprecated math syntax .Jun 22 2018, 6:57 AM

@SalixAlba Thanks for identifying and fixing potential problems. I have no idea why it doesn't replace \bold there. I do not know an easy solution that would guarantee that the last replace matches the same part of the file, but so far it did not cause any problems.
@Physikerwelt For the simple substitutions we are doing at the moment, the source code of the images should be identical, so we could do a comparison to make sure we don't do any replacement texvcjs doesn't do:

#!/usr/bin/env python
import re
import requests

s = r"\begin{align} A & B \\ \C & D\end{align}"

# Correct substitution: should print "identical".
# The replacement is a raw string so that re.sub emits one literal backslash.
mstr = re.sub(r'(?<!\\)\\C(?=[^a-zA-Z])', r'\\Complex', s)
# Wrong substitution: should print "different".
#mstr = re.sub(r'(?<!\\)\\C(?=[^a-zA-Z])', r'\\int', s)

# Check both formulas; the response header names the resource to render.
res_s    = requests.post('https://en.wikipedia.org/api/rest_v1/media/math/check/chem', data=[('q', s)])
res_mstr = requests.post('https://en.wikipedia.org/api/rest_v1/media/math/check/chem', data=[('q', mstr)])

# Fetch the SVG source of each rendering.
svg_s    = requests.get('https://en.wikipedia.org/api/rest_v1/media/math/render/svg/' + res_s.headers["x-resource-location"]).text
svg_mstr = requests.get('https://en.wikipedia.org/api/rest_v1/media/math/render/svg/' + res_mstr.headers["x-resource-location"]).text

if svg_s == svg_mstr:
    print("identical")
else:
    print("different")

would that convince you?

@SalixAlba I created a version of mathwikibot.py that should ensure that the same part of the file is being replaced. It is not extensively tested, I created a "finditer" branch so that we can keep using the master branch in case it doesn't work.
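
One way to make sure each substitution lands on the formula it was computed from is to rebuild the page from match positions instead of calling str.replace. The following is a minimal sketch of that idea, not the code on the finditer branch; MATH_RE and fix_formula are hypothetical stand-ins.

```python
import re

# Assumed pattern for math tags; the bot's real pattern may differ.
MATH_RE = re.compile(r"(<math[^>]*>)(.*?)(</math>)", re.DOTALL)

def fix_formula(mstr):
    # One of the substitutions discussed in this task, used as an example.
    return re.sub(r"(?<!\\)\\bold(?![a-zA-Z])", r"\\mathbf", mstr)

def replace_in_place(text):
    parts = []
    last = 0
    for m in MATH_RE.finditer(text):
        parts.append(text[last:m.start(2)])    # everything up to the formula body
        parts.append(fix_formula(m.group(2)))  # the rewritten body, at the same position
        last = m.end(2)
    parts.append(text[last:])                  # the remainder of the page
    return "".join(parts)

page = r"\bold{n} outside, <math>\bold{x} = p + t\bold{A}</math>"
print(replace_in_place(page))
```

Text outside the math tags is copied through unchanged, even when it happens to contain the same macro.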

Debenben closed subtask Restricted Task as Resolved.Jun 27 2018, 5:12 PM

I've not got trial approval on enwiki.

A couple of points. Doing ls -l

-rw-r--r-- 1 tools.texbot tools.texbot   256 Jun 30 15:35 inputlist.txt
-rwxr-xr-x 1 debenben     tools.texbot  2634 Jun 27 16:36 mathwikibot.py

shows that the owner of mathwikibot.py is debenben rather than tools.texbot. This means that I can't edit the script to do things like customise the edit summary. I've had to copy it to texbot2.py and run that.

A bit more serious: there seems to be a login problem. If I run it

tools.texbot@tools-bastion-03:~$ python texbot2.py 
Logging in to wikipedia:en as Texvc2LaTeXBot@texbot
Sleeping for 5.7 seconds, 2018-06-30 15:36:28
Page [[en:Talk:Continuum hypothesis]] saved
Logging in to wikipedia:en as Texvc2LaTeXBot@texbot
ERROR: Login failed (Aborted).
Password for user Texvc2LaTeXBot@texbot on wikipedia:en (no characters will be shown):

It seems to log in fine the first time and does the first edit successfully, but then it can't log in again.

@SalixAlba It seems the problem with the login came with the try catch block, maybe it goes away when we remove it. It seems like the bot is trying to login when it is already logged in.

I changed the ownership. To avoid those problems I'll see if I can give the texbot account access rights to our repository on phabricator, so we can simply do git push without having to switch accounts.

The problem seems to be in line

site=pywikibot.site.APISite.fromDBName(sitename)

if I replace it with

	site=pywikibot.Site(sitename[:2])

the login problem goes away.

I've still not worked out how to commit to the repository. Is there an appropriate help page?

Thanks for finding the problem with the login. The problem with the missing \and replacement could be that [^\\] doesn't match at the beginning of the string when there is no character to match.
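
The behaviour described here can be reproduced on the formula from the Affine focal set edit. This is a sketch under the assumption that the earlier pattern had the shape [^\\]\\macro; the bot's actual earlier pattern is not quoted in this task.

```python
import re

mstr = r"\bold{x} = p + t\bold{A}"

# Consuming guard: [^\\] needs a real character before the backslash,
# so a macro at position 0 of the formula is never matched.
consuming = re.sub(r"([^\\])\\bold(?![a-zA-Z])", r"\1\\mathbf", mstr)

# Zero-width guard: a negative lookbehind also succeeds at the start of the string.
lookbehind = re.sub(r"(?<!\\)\\bold(?![a-zA-Z])", r"\\mathbf", mstr)

print(consuming)
print(lookbehind)
```

With the consuming guard only the second \bold is replaced, which matches the result seen in that diff.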

For the repository you can do

ssh-keygen
#press enter until key is generated
cd ~/.ssh
#copy the public key in file *.pub
#go to phabricator account settings and add the key
git add .
git commit
#enter some commit message
git push

however you might only be able to commit as texbot and push as Salix alba and I don't know how we can change that.

Vvjjkkii renamed this task from Create a bot to replace deprecated math syntax to ehaaaaaaaa.Jul 1 2018, 1:03 AM
Vvjjkkii raised the priority of this task from Low to High.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: Aklapper, gerritbot.
SalixAlba renamed this task from ehaaaaaaaa to Create a bot to replace deprecated math syntax .Jul 1 2018, 6:17 AM
SalixAlba lowered the priority of this task from High to Medium.
SalixAlba updated the task description.

I managed to check out the project, but I'm not sure if I've got permission to push. Either that or I don't know what I'm doing!

I've changed regexp to

mstr=re.sub(r'(?<!\\)\\or(?![a-zA-Z])', u'\\lor',mstr)

rather than

mstr=re.sub(r'(?<!\\)\\or(?=[^a-zA-Z])', u'\\lor',mstr)

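The new version also matches <math>AAA \or</math>, where \or is the last thing in the formula and (?=[^a-zA-Z]) has no following character to match.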
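
The difference between the two lookaheads shows up on a formula that ends with the macro. A minimal sketch (raw replacement strings are used so that it also runs on Python 3; \oracle is a made-up macro name that only serves to show that longer names are left alone):

```python
import re

mstr = r"AAA \or"

# Positive lookahead: requires a following non-letter, so it fails at the end.
old = re.sub(r"(?<!\\)\\or(?=[^a-zA-Z])", r"\\lor", mstr)

# Negative lookahead: only requires that no letter follows, so the end is fine.
new = re.sub(r"(?<!\\)\\or(?![a-zA-Z])", r"\\lor", mstr)

# Macros that merely start with \or are still left untouched.
longer = re.sub(r"(?<!\\)\\or(?![a-zA-Z])", r"\\lor", r"\oracle \or b")

print(old)
print(new)
print(longer)
```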

I've a test page at https://en.wikipedia.org/wiki/User:Texvc2LaTeXBot/sandbox

Thanks for the fix with the negative lookahead.

I couldn't see you in the access policy of the repository, so I added you again, maybe that was the problem. The other problem is that the .git folder is owned by the texbot account that cannot push because I don't know how to add someone without a user account. If you commit, the new files in the .git folder only get write access for the account that committed, so if you do git push with your user account you might get "fatal: Unable to create '/data/project/texbot/.git/refs/remotes/origin/master.lock': Permission denied" error but the change is still pushed.

Cool, I can commit now. I've made a few test commits.

The latest has a temporary fix for the login problem. It is not the best, as it won't work with commons or enwikibooks etc.
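
Taking the first two characters of the site name only works for names like enwiki or dewiki. A possible direction is to split the database name into a language code and a project family. The helper below is purely hypothetical: the name split_dbname, the suffix table and the family names are assumptions that would need checking against the families pywikibot actually knows.

```python
# Database-name suffixes and an assumed family name for each.
FAMILY_SUFFIXES = [
    ("wikibooks", "wikibooks"),
    ("wikiversity", "wikiversity"),
    ("wiktionary", "wiktionary"),
    ("wikiquote", "wikiquote"),
    ("wikisource", "wikisource"),
    ("wikinews", "wikinews"),
    ("wikivoyage", "wikivoyage"),
    ("wiki", "wikipedia"),
]

# Projects without a language prefix need explicit entries (assumed values).
SPECIAL = {"commonswiki": ("commons", "commons")}

def split_dbname(dbname):
    """Split e.g. 'enwikibooks' into ('en', 'wikibooks')."""
    if dbname in SPECIAL:
        return SPECIAL[dbname]
    for suffix, family in FAMILY_SUFFIXES:
        if dbname.endswith(suffix) and len(dbname) > len(suffix):
            code = dbname[:-len(suffix)].replace("_", "-")
            return code, family
    raise ValueError("unrecognised database name: %r" % dbname)

print(split_dbname("enwikibooks"))
```

The resulting pair could then be passed on as pywikibot.Site(code, family); whether that avoids the login problem in the same way as the temporary fix is untested.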

CommunityTechBot lowered the priority of this task from Medium to Low.Jul 5 2018, 7:04 PM

The bot has bot-rights on dewiki, but saving with botflag=True doesn't set the botflag. Any idea?

Turns out: There is no problem with setting the botflag. I was expecting a fat B to show up in the version history, but this is only shown in recent changes

One failed page
https://en.wikipedia.org/w/index.php?title=User_talk:Bdmy&diff=prev&oldid=852026081&diffmode=source
Here the page had nowiki tags and unmatched math tags. This meant the bot matched a large proportion of the article and changed syntax outside of math tags.

I've committed a change which rejects pages with nowiki, and also skips saving pages when there is no change.
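
A pre-save check along these lines could look as follows. This is a sketch, not the committed change: should_save is a hypothetical name, and the count of opening versus closing math tags is an extra guard that is not described above.

```python
def should_save(original, updated):
    """Decide whether a page may be saved after the replacements were applied."""
    lowered = original.lower()
    if "<nowiki" in lowered:
        return False  # reject pages that contain nowiki tags
    if lowered.count("<math") != lowered.count("</math>"):
        return False  # extra guard (assumption): unmatched math tags
    return updated != original  # skip the save when nothing changed

print(should_save(r"<math>a \or b</math>", r"<math>a \lor b</math>"))
```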

@SalixAlba Thank you for finding the problem with the unmatched math tags and fixing it, also @Framawiki thanks for your pull request.

I somehow messed up both of them and then I just forced the push, so the changes should be there, but the version history was lost.

There is an interesting comment on

If number of pages are ~300, you could make replacement at 1 edit/minute without bot flag keeping relatively short time.

This is going to apply to most wikis: the number of occurrences is below 300 for all but 6 wikis. I don't know how many will allow a small number of edits without a bot flag.

Just a few clarifications:

  1. Phabricator is not the place to ask for bot consensus
  2. You have to ask the consensus on every wiki you will touch, as you have correctly done for it.wiki. Remember that every wiki has different communities and different rules! :)
  3. Your source code should always send the bot flag to every POST edit to declare that you are not a human. Set it and don't care about it even if your account is not flagged as bot in that wiki (in this case your request will be simply treated like a human one).

@SalixAlba Thank you for taking care of the botflag on enwiki, feel free to take responsibility and request a botflag on any other project you like.

Other things we still have to do:

  • test if the current code also works for wikibooks etc.
  • check if there are projects that use a syntax like {{#tag:math| a^2+b^2=c^2}} and replace the macros if needed

I am sorry if the ownership and permissions of the files I create are sometimes wrong. With the texbot account you should be able to change the permission by copying the file, deleting the original and moving the copy to the old location though.

One little bug. On https://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Mathematics/Archive/2012/Mar#Serif/sans-serif_for_math_expressions_in_running_text

<syntaxhighlight lang="CSS">
span.texhtml {
  font-family: 'DejaVu Serif', serif;
  font-size: 100%;
}
</syntaxhighlight>

Somehow this got matched by the bot, which replaced font-size: 100%; with font-size: 100\%;

Here is the diff https://en.wikipedia.org/w/index.php?title=Wikipedia_talk:WikiProject_Mathematics/Archive/2012/Mar&diff=868844022&oldid=791871671&diffmode=source

@SalixAlba thanks for catching that. This is something we overlooked:

  1. text <math><nowiki>this is a math formula</nowiki></math> text
  2. text <nowiki><math>this is not a math formula and the page gets rejected because math is inside nowiki tags</math></nowiki> text
  3. text <nowiki><math></nowiki>this is also not a math formula and is falsely treated like case 1) and not rejected<nowiki></math></nowiki> text

I think we should additionally search the mstr and reject everything where tags like nowiki are found, because they should not occur inside a typical formula.
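
A check on the formula body itself might be sketched like this. The list of tag names is an assumption chosen to cover the cases seen so far (nowiki, nested math, syntaxhighlight), and formula_is_suspicious is a hypothetical name.

```python
import re

# Opening or closing tags that should not appear inside a formula body.
TAG_RE = re.compile(r"</?\s*(nowiki|math|syntaxhighlight|source|pre)\b", re.IGNORECASE)

def formula_is_suspicious(mstr):
    """Return True if the text captured between math tags contains such a tag."""
    return TAG_RE.search(mstr) is not None

# Case 3 above: the captured "formula" starts with </nowiki> and ends with <nowiki>.
print(formula_is_suspicious("</nowiki>this is also not a math formula<nowiki>"))
# An ordinary formula with comparison signs is not flagged.
print(formula_is_suspicious(r"a < b \lor c > d"))
```

Note that this also rejects case 1, a real formula wrapped in nowiki tags, which is the trade-off accepted in the proposal above.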

I put some thought into the problem of replacing math generated by templates, as in https://de.wikiversity.org/wiki/Kommutative_Ringtheorie/Algebra-Homomorphismus_%C3%BCber_Ring/Definition where <math>K</math> is produced by the source code

{{
Definitionswort
|Prämath={{{K|K}}}
|Algebrahomomorphismus|
|msw=Algebrahomomorphismus
|SZ=,
}}

and came up with a solution that should fix a lot of those cases:

  • look for all templates and expand them
  • do the replacements of <math>\or</math> -> <math>\lor</math> in the expanded template and compare it to the expanded original
  • if they are not equal, apply the replacements \or -> \lor to the source code of the original template
  • expand the modified template and compare it to the expanded original
  • only replace the source code in the template if the expanded templates outside the math tags are equal and the svg rendering of all math tags produced by the template are equal.
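
The steps above can be sketched as a single decision function. This is not the code on the templateexpand branch: fix, expand and render are injected, hypothetical callables standing in for the macro replacement, the template expansion and the SVG rendering, and the toy versions at the bottom only exist to make the sketch runnable.

```python
import re

MATH_RE = re.compile(r"(<math[^>]*>)(.*?)(</math>)", re.DOTALL)

def rewrite_math(text, fix):
    # Apply fix() to formula bodies only.
    return MATH_RE.sub(lambda m: m.group(1) + fix(m.group(2)) + m.group(3), text)

def split_math(text):
    # Return the text with formula bodies blanked out, and the list of bodies.
    bodies = [m.group(2) for m in MATH_RE.finditer(text)]
    outside = MATH_RE.sub(lambda m: m.group(1) + m.group(3), text)
    return outside, bodies

def safe_template_rewrite(source, fix, expand, render):
    """Return the rewritten template source, or None if it must be left alone."""
    expanded = expand(source)
    if rewrite_math(expanded, fix) == expanded:
        return None                    # the expansion contains nothing to replace
    candidate = fix(source)            # apply the replacements to the raw source
    old_outside, old_math = split_math(expanded)
    new_outside, new_math = split_math(expand(candidate))
    if old_outside != new_outside or len(old_math) != len(new_math):
        return None                    # something outside the math tags changed
    if any(render(a) != render(b) for a, b in zip(old_math, new_math)):
        return None                    # a formula would render differently
    return candidate

# Toy stand-ins: {{M|X}} expands to <math>X</math>; rendering treats \or as \lor.
def toy_fix(text):
    return re.sub(r"(?<!\\)\\or(?![a-zA-Z])", r"\\lor", text)

def toy_expand(src):
    return re.sub(r"\{\{M\|(.*?)\}\}", lambda m: "<math>" + m.group(1) + "</math>", src)

accepted = safe_template_rewrite(r"a {{M|x \or y}} b", toy_fix, toy_expand, toy_fix)
rejected = safe_template_rewrite(r"{{M|x \or y}} \or", toy_fix, toy_expand, toy_fix)
print(accepted, rejected)
```

In the second call the replacement would also change text outside the math tags, so the template source is left untouched.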

I pushed the modifications to the bot script to a branch called templateexpand:

https://phabricator.wikimedia.org/diffusion/TTEX/browse/templateexpand/mathwikibot.py

So far it is untested due to the other problem we have: How to generate the list of pages for the bot to work on? Or in other words: We still need a database of the original source code of all math on all WMF projects. I guess expanding each and every template in all the xml dumps with the mediawiki API would be slow and inefficient. I had a look at the template expansion functionality of "wikiextractor" but that does not seem to be powerful enough to expand most templates properly.

Any ideas?

Physikerwelt claimed this task.

The bot has been created.