For example, "ch" and "sh" keep appearing together in reverted revisions on tr.wikipedia. They are typed as placeholders for the characters ç and ş, which is never done in normal writing.
I have been thinking about this problem for some time now. I think the best method is to check whether the words written with ch and sh are English, and to have this as a feature only for tr.wikipedia.
Note that ch and sh can legitimately appear on tr.wikipedia when English words are quoted, which is why a simple regex would not be optimal.
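A rough sketch of what such a feature could look like, purely as an illustration. The ENGLISH_WORDS set is a tiny hypothetical stand-in for a real English dictionary, and the function name suspicious_ch_sh_words is made up for this example; neither comes from an existing library.

```python
import re

# Hypothetical stand-in for a real English dictionary lookup.
ENGLISH_WORDS = {"chat", "china", "shell", "show", "church", "english"}

# Runs of Unicode letters (no digits, no underscores).
WORD_RE = re.compile(r"[^\W\d_]+")


def suspicious_ch_sh_words(text, english_words=ENGLISH_WORDS):
    """Return words that contain 'ch' or 'sh' but are not known English words."""
    suspicious = []
    for word in WORD_RE.findall(text):
        lowered = word.lower()
        has_digraph = "ch" in lowered or "sh" in lowered
        if has_digraph and lowered not in english_words:
            suspicious.append(word)
    return suspicious


# "chat" is in the assumed English word list, so only the other two are flagged.
print(suspicious_ch_sh_words("Chocuklar shimdi chat yapiyor"))
```

The count of such words (or their share of all added words) could then be used as the feature value, so that quoted English words do not count against an edit.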
Another example is the use of the Cyrillic "е", which is a different character from the Latin "e" but looks nearly identical.
>>> "e" == "е"
False
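One possible way to catch this kind of substitution is to flag words that mix Latin and Cyrillic letters. The sketch below relies only on the standard unicodedata module; the helper names scripts_in_word and is_mixed_script are made up for this example, and using the prefix of the Unicode character name as a script label is a simplifying assumption.

```python
import unicodedata


def scripts_in_word(word):
    """Return the set of scripts (LATIN, CYRILLIC) used by the letters in word."""
    scripts = set()
    for char in word:
        if not char.isalpha():
            continue
        # Assumption: the Unicode name prefix is a good enough script label.
        name = unicodedata.name(char, "")
        if name.startswith("LATIN"):
            scripts.add("LATIN")
        elif name.startswith("CYRILLIC"):
            scripts.add("CYRILLIC")
    return scripts


def is_mixed_script(word):
    """True if a single word contains both Latin and Cyrillic letters."""
    return len(scripts_in_word(word)) > 1


# "\u0435" is CYRILLIC SMALL LETTER IE, the look-alike of Latin "e".
print(is_mixed_script("p\u0435ople"))  # True
print(is_mixed_script("people"))       # False
```

Words written entirely in one script, including Turkish letters such as ç and ş, are not flagged, because their Unicode names also start with LATIN.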