
Wikisource script to find and replace across books
Open, Needs Triage, Public

Description

Rationale: As a Wikisource editor, there are times when a specific OCR script misfires and introduces the same error repeatedly across many pages of a book. It would be great to have a script that can search for and replace such errors across a whole book.

Event Timeline

@Soda: A good first task is a self-contained, non-controversial task with a clear approach. It should be well described, with pointers to help a completely new contributor. Given the current short task description and the lack of code references, I'm removing the good first task tag. Please add details on what exactly has to happen, where, and how for a new contributor, and then add the good first task project tag back. Thanks a lot in advance!

There is already @Pathoschild's [[m:TemplateScript]] gadget, which users can set up to run these sorts of replacements, either through an active search form (mounted at the top) or by adding a script to their sidebar through their common.js. One does need to know some regex to get this to work well, though.
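To make the regex point concrete, here is a minimal sketch of what a set of OCR fix-up rules might look like. It is written in plain Python rather than as a TemplateScript configuration, and the rule list and the `apply_rules` helper are hypothetical names chosen for illustration, not part of any existing gadget.

```python
import re

# Hypothetical rules for illustration: (pattern, replacement) pairs.
# Word boundaries (\b) keep a rule from firing inside longer words.
RULES = [
    (r"\btlie\b", "the"),      # "h" misread as "li"
    (r"\bwliich\b", "which"),  # same misreading inside "which"
]


def apply_rules(text, rules=RULES):
    """Apply each rule in order; return the new text and a hit count per pattern."""
    counts = {}
    for pattern, replacement in rules:
        text, hits = re.subn(pattern, replacement, text)
        counts[pattern] = hits
    return text, counts
```

The per-pattern hit counts are there so that an editor can spot a rule that matches far more (or fewer) times than expected before trusting it on a whole book.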

Note that search and replace can be a fraught kind of edit, and the search and replace terms have to be very finely tuned. I have seen both great and horrid examples of such search and replace on our works, and pages with a high-value page status should probably be safe from a script. (I wouldn't want to see it run on already proofread pages unless it forces a preview of the changes.) I have done script replacements for a range of works where (somewhat) reproducible errors exist, and I run those regularly as part of a cleanup script for all works, with care and continual tinkering, as there will always be another case to fix.
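One way to picture the safeguards mentioned here is a dry run that skips protected pages and only produces diffs for review. The sketch below assumes pages are available as plain dictionaries with `title`, `quality` and `text` keys (a made-up in-memory shape, not a real API), and it assumes that quality levels 3 (proofread) and 4 (validated) are the ones to leave alone. `preview_replacement` is a hypothetical helper.

```python
import difflib
import re

# Assumption: proofread (3) and validated (4) pages are never touched.
PROTECTED_LEVELS = frozenset({3, 4})


def preview_replacement(pages, pattern, replacement, protected=PROTECTED_LEVELS):
    """Return {title: unified diff} for pages that would change. Nothing is saved."""
    previews = {}
    for page in pages:
        if page["quality"] in protected:
            continue  # leave high-value pages alone
        new_text = re.sub(pattern, replacement, page["text"])
        if new_text == page["text"]:
            continue  # no match on this page
        diff = difflib.unified_diff(
            page["text"].splitlines(),
            new_text.splitlines(),
            fromfile=page["title"],
            tofile=page["title"] + " (proposed)",
            lineterm="",
        )
        previews[page["title"]] = "\n".join(diff)
    return previews
```

Because the function only returns diffs, any actual save step would have to be a separate, explicit action taken after the editor has read the preview.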

Other notes

  • Be very wary of works that reproduce text from earlier-era works, as these contain the spelling and literacy of their day. HENCE there needs to be an easy UNDO TOGGLE for each replacement being made in that view
  • part of the issue with this request is that the script has to iterate through a work whose pages may not all exist yet, so it needs to be triggerable after the text layer has been extracted from the work
  • I would like to see some sort of per-work Index: subpage capability where a range of replacements can be added, ideally with search and replace side by side on each line, so that search terms can be added iteratively as new replacements are found
  • there is also the potential to add formatting, as some works have reproducible formatting, for example the biographies in [[s:en:The Indian Biographical Dictionary (1915)]]
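The per-work rule list and the undo toggle from the notes above could be sketched roughly as follows. The `pattern => replacement` line format is an assumption made up for this example (the task does not specify one), and `parse_rules` and `apply_enabled` are hypothetical helpers. A pattern that itself contains `=>` would need a different separator.

```python
import re


def parse_rules(source):
    """Parse one 'pattern => replacement' rule per line.

    Blank lines and lines starting with '#' are skipped.
    """
    rules = []
    for line in source.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, sep, replacement = line.partition("=>")
        if not sep:
            raise ValueError(f"missing '=>' in rule: {line!r}")
        rules.append((re.compile(pattern.strip()), replacement.strip()))
    return rules


def apply_enabled(text, rules, disabled=frozenset()):
    """Apply rules in order, skipping any whose index is in `disabled`."""
    for index, (pattern, replacement) in enumerate(rules):
        if index not in disabled:
            text = pattern.sub(replacement, text)
    return text
```

Keeping the rules as an ordered list means a single rule can be switched off by index, for instance when a work deliberately preserves period spelling, without editing the shared list.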