
Migrate phetools from Toolforge GridEngine to Toolforge Kubernetes
Open, High, Public

Description

Kindly migrate your tool (https://grid-deprecation.toolforge.org/t/phetools) from Toolforge GridEngine to Toolforge Kubernetes.

Toolforge GridEngine is getting deprecated.
See: https://techblog.wikimedia.org/2022/03/14/toolforge-and-grid-engine/

Please note that a volunteer may perform this migration if this has not been done after some time.
If you have already migrated this tool, kindly mark this as resolved.

If you would rather shut down this tool, kindly do so and mark this as resolved.

Useful Resources:
Migrating Jobs from GridEngine to Kubernetes
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework#Grid_Engine_migration
Migrating Web Services from GridEngine to Kubernetes
https://wikitech.wikimedia.org/wiki/News/Toolforge_Stretch_deprecation#Move_a_grid_engine_webservice
Python
https://wikitech.wikimedia.org/wiki/News/Toolforge_Stretch_deprecation#Rebuild_virtualenv_for_python_users

Event Timeline

My apologies if this ticket comes as a surprise to you. In order to ensure WMCS can provide a stable, secure and supported platform, it's important we migrate away from GridEngine. I want to assure you that while it is WMCS's intention to shut down GridEngine as outlined in the blog post https://techblog.wikimedia.org/2022/03/14/toolforge-and-grid-engine/, a shutdown date for GridEngine has not yet been set. The goal of the migration is to move as many tools as possible onto kubernetes and ensure as smooth a transition as possible for everyone. Once the majority of tools have migrated, a discussion on a shutdown date becomes more appropriate. See T314664: [infra] Decommission the Grid Engine infrastructure.

As noted in https://techblog.wikimedia.org/2022/03/16/toolforge-gridengine-debian-10-buster-migration/ some use cases are already supported by kubernetes and should be migrated. If your tool can migrate, please do plan a migration. Reach out if you need help or find you are blocked by missing features. Most of all, WMCS is here to support you.

However, it's possible your tool needs a mixed runtime environment or some other feature that isn't yet present in the Jobs Framework (https://techblog.wikimedia.org/2022/03/18/toolforge-jobs-framework/). We'd love to hear of this or any other blocking issues so we can work with you once a migration path is ready. Thanks for your hard work as volunteers and for your help in this migration!

Oh, somehow missed that this task got created and assigned to me.

The short status is that I inherited phetools when the original contributor (Phe) went inactive in 2016 and the tool fell over (it was usurped to get it running again). The tool is a right royal mess of originally Python 2, PHP, and Bash code with multiple moving parts and strong interdependencies (relatively loosely coupled code-wise, but strongly interdependent through IPC and data exchange): custom IPC methods (a custom protocol over a socket to bridge PHP and Python), local databases, custom text-based file formats (a pseudo-database), a custom job queue and manager (it parses Grid Engine's accounting file to keep track of its own jobs' execution status), and, oh, a custom pywikibot library designed around pywikibot ca. 2015.

A few bits may be possible to just retarget at k8s now that the build service is available, but the bulk of it most likely needs to be rearchitected and rewritten to more modern standards (e.g. using standard libraries in Python 3 and modern pywikibot), with things like the custom file formats and job queues upstreamed, and the poking at Grid Engine's accounting file removed, before it can shed its Grid Engine dependency.

The likelihood of me finding the weeks to months of sustained time for this work in the near future is near nil. So this is likely to be one of the thorny cases to solve in the overall Grid->Kubernetes migration.

If anyone who actually speaks Python natively (I'm a Perl guy; my Python-fu is pretty weak) wants to help out, I welcome it with open arms.

This is a reminder that the tool for which this ticket is created is still running on the Grid.
The grid is deprecated and all remaining tools need to migrate to Toolforge Kubernetes.

We've sent several emails to maintainers as we continue to make the move away from the Grid.
Many of the issues that have held users back from moving away from the Grid have been addressed in
the latest updates to Build Service. See: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Changelog

You might find the following resources helpful in migrating your tool:

  1. https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service#Migrating_an_existing_tool
  2. https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service#Tutorials_for_popular_languages

Don't hesitate to reach out to us using this ticket or via any of our support channels.

If you have already migrated this tool, kindly mark this ticket as 'resolved'.
To do this, click on the 'Add Action' dropdown above the comment text box, select 'Change Status', then 'Resolved', and click 'Submit'.

Thank you!

Some initial thoughts:

I fear this migration will require major surgery, but… It is possible there is low-hanging fruit (parts that can be moved without ill effect), and it's just barely conceivable that we could get it mostly working on k8s with relatively small changes. If we could get it off Grid Engine with an acceptable level of brokenness before the hard deadline, it'd make things easier (migrating an unfamiliar codebase when you don't have a currently working and running version for reference is a right royal pain).

The keys there are going to be the Build Service, Python venvs, shared storage, and the PHP⟺Python IPC. The Toolforge docs for all of these have the usual level of suck, and I have real experience with none of them, so that part will be fun.

If the kinda-sorta-working approach doesn't work, and we have to perform major surgery for one or more reasons, we should probably start by doing some rearchitecting. Phetools is a bit of a mish-mash of all the tools @Phe made, so splitting it up into multiple more separated and focussed tools is probably a good idea. Alternatively, it might make sense to move the whole shebang to WMCS: we'd still pay a migration cost, but a single server is a less alien environment than Grid Engine ⇒ Jobs Framework. The problem there is that several of the tools really need a job queue / processing nodes.

Either way, the first step should be to set up a real Wikimedia Gitlab repo (currently the ca. 2016 code is partially in Phe's personal Github repo, no changes since then are in VCS, and a lot of the non-code, config, etc. exists only on Toolforge). I think setting up a clean repo and populating it as we migrate is the best approach. Dumping everything old in there is only going to create a mess and carry forward old crap.

The custom job management parses Grid Engine's accounting file, so that will definitely break on k8s. At a minimum we will have to disable that code; what that will break in turn I have no idea. The job status web pages, certainly, and I think the last time I went spelunking I found it would also break the spawning of new OCR and Match&Split jobs.

Another tricky bit might be the PHP⟺Python IPC. It's a custom text-based protocol over a UNIX domain socket (iirc). Whether that'll keep working depends on the Jobs Framework / k8s architecture. As of right now I have no idea whether that'll be a big deal or can simply be tweaked to keep working.

One central task is going to be figuring out the Build Service and Python venvs (the current code predates venv), plus all the supporting libraries (some of which are custom local builds). That should be entirely doable and just requires the time investment to read up and experiment.

Another is figuring out shared storage. Several of the tools have very large caches that need to be accessible from all running processes (NFS is currently a main cause of hangs). My knowledge of the shared storage situation on k8s is hazy, so I have no good idea how hard or easy this will be to figure out.

@Xover Could you add me as one of the maintainers (mainly to be able to see the current code)?

Also, do we have any statistics on how much the OCR endpoint is being used? If it is not that much, we could just have them switch over to the Wikimedia OCR equivalent (Tesseract, I think?)

> @Xover Could you add me as one of the maintainers (mainly to be able to see the current code)?

Done.

> Also, do we have any statistics on how much the OCR endpoint is being used? If it is not that much, we could just have them switch over to the Wikimedia OCR equivalent (Tesseract, I think?)

I don't have stats, and a lot of the users both could and should switch to Wikimedia OCR, but there are still a lot of users that explicitly prefer the Phetools OCR for various reasons (UI, interactive performance, output/quality, etc.).

But getting sensible usage stats is one reason I want to modernize and split the various tools into more standalone modules. The OCR and Match & Split are one thing, but there are some really obscure sub-tools in there that I'm pretty sure should be killed, but where I don't know what will break, and for whom, if I do. There are also some strange inter-dependencies between the parts of phetools that need careful mapping before we start tearing any of them out.

The statistics are probably the most independent part (least interdependent), the one with the most users, the one that can't easily be replaced, and the one in the most dire need of modernization. It is also probably the simplest to migrate to a more modern, standalone tool, but since you need to keep historical data and properly design the new stats, it's somewhat of a time-consuming project (which is why I haven't tackled it yet).

I have some time off work around Christmas and plan to start looking at this then, but probably only at what it takes to migrate to k8s and keep the most important bits running.

Quick status update

The web front end uses both PHP and Python, so it can't run on any of the standard k8s images.

The backend services are all written in Python, but use a lot of app-internal libraries, non-default upstream Python libraries (like pywikibot), and third-party binaries (like tesseract). Which means they can't just be fired up on any of the standard k8s images either.

All of which means we are going to have to get into the Toolforge Build Service to build a custom image that can run both the web frontend and the backend services. Which in turn means we are going to have to create a Gitlab repo for it, since the Build Service has to have a Git repo to work from. Because we have mixed runtimes (PHP + Python), the buildpacks' autodetection is not going to work. And because we have a lot of libraries and binaries as deps, we're going to have to figure out all the nitty-gritty details of how to customize Build Service builds, especially the parts that are documented even worse than the already poor baseline. The buildpacks (and the Build Service) also assume modern coding conventions (e.g. using Python venvs to manage deps), so in the process we are going to have to modernize the code at least somewhat.

All of which points to this being a long and tedious slog of a job: importing all our old grotty code into Git, reverse engineering all our deps for venv, and then iterating on the Build Service images to get them working. Oh, and to make this as annoying as possible, we can't work on these images locally because we need deps from apt, so we're going to have to do all of this on Toolforge, in production, in a tightly coupled code base.

The Gitlab repo is going to have to be tightly managed, unlike the phetools account on Toolforge, where first Phe and then subsequent admins have used it more like an ad hoc shell account than what is required for a k8s service (lots and lots of random temporary files in the home directory, lots of test and debug versions of code in the code directories, mixing of code and other stuff, etc.). If we bring the current mess with us we'll never get the Build Service to work reliably, and figuring out the migration will take ten times as long.

Some architecture notes

The web frontend expects to run on a "fat" image under lighttpd. It consists of some relatively light PHP files that mostly provide the interactive web interface on phetools. Most of the content served by PHP is pre-generated static assets from NFS, with some minor CGI-ish stuff. The exception is the job status pages, where the frontend communicates with other running jobs over a socket using a custom text-based protocol (ping, status, etc.). The PHP is still very minimal, so migrating it to a "thin" image shouldn't be a big problem. The Python parts are what provide the API for things like the OCR and Match&Split gadgets on-wiki. These parts use the same libraries as the actual jobs themselves, so they basically need everything the jobs need.

Communication between the frontend and backend jobs is established by each job writing its actual hostname and a port number it is listening on to a text file on shared NFS storage. The frontend reads this file, opens a socket to that host:port, sends a text command, waits for the answer, and sends it to the web browser. All the backend jobs use a common but app-custom library to implement this.
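
For reference, a minimal sketch of that handshake as I understand it; the discovery file path, its format, and the command set are illustrative only (the real ones live in the app's common library):

# Read the "<hostname> <port>" discovery file the job wrote to shared NFS.
# (Hypothetical path and format; the real ones are defined in phetools'
# common library.)
read -r host port < /data/project/phetools/ws_ocr_daemon.server

# Send a one-line text command and print the reply, roughly what the PHP
# status pages do.
printf 'status\n' | nc -w 5 "$host" "$port"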

Phetools consists of the webservice described above, a set of scheduled grid jobs, and a set of continuous grid jobs.

The webservice

The webservice lives in public_html and currently consists of the following:

public_html/
May 20  2014 commonPrint.css -> ../phe/common/css/commonPrint.css
Oct 28  2021 data/
Oct 27  2021 dbtest.php -> /data/project/phetools/phe/statistics/dbtest.php
May 30  2014 dummy_robot.php -> ../phe/dummy_robot/dummy_robot.php
May 30  2014 extract_text_layer.php -> ../phe/extract_text_layer/extract_text_layer.php
Nov 23  2021 graphs/
Aug  5  2014 hocr.php -> ../phe/hocr/hocr.php
May 15  2014 index.html -> ../phe/public_html/index.html
May  4  2020 jquery-3.5.1.min.js
Oct  6  2020 jquery-3.5.1.min.js.save
May 17  2014 log-irc -> ../log/log-irc/
May 30  2014 match_and_split.php -> ../phe/match_and_split/match_and_split.php
Apr 27  2016 not_transcluded -> ../tmp/transclusions/
Jun  5  2014 ocr.php -> ../phe/ocr/ocr.php
Nov  3  2021 proofread.html -> ../phe/statistics/proofread.html
Nov  3  2021 proofread_per_day.html -> ../phe/statistics/proofread_per_day.html
May 20  2014 screen.css -> ../phe/common/css/screen.css
May 20  2014 shared.css -> ../phe/common/css/shared.css
Jan 29  2020 sorttable.css -> ../phe/statistics/sorttable.css
Jan 29  2020 sorttable.js -> ../phe/statistics/sorttable.js
Oct 13  2020 statistics.js
May 20  2014 statistics.php -> ../phe/statistics/statistics.php
Oct 27  2021 statistics2.php -> /data/project/phetools/phe/statistics/statistics2.php
May 20  2014 stats.html -> ../phe/statistics/stats.html
Oct 27  2021 stats_table.css -> /data/project/phetools/phe/statistics/stats_table.css
May 26  2014 status.php -> ../phe/public_html/status.php
May 20  2014 transclusions.html -> ../phe/statistics/transclusions.html
Nov  3  2021 validated.html -> ../phe/statistics/validated.html
Nov  3  2021 validated_per_day.html -> ../phe/statistics/validated_per_day.html
May 30  2014 verify_match.php -> ../phe/verify_match/verify_match.php

Not all files are used; some are obsolete and stopped working long ago, and some are tests. The graphs subdir contains static images pregenerated by a scheduled job. The JS parts are pretty simple and don't really need a local jQuery (and since the privacy policy applies to Toolforge we might as well use upstream jQuery from cdnjs).

Scheduled jobs

The current scheduled jobs are:

crontab -l
### KUBERNETES MIGRATION IN PROGRESS ###
### Please do not add any more Grid Engine jobs ###
#
47 4 * * * jsub -N wikisource_stats -l h_vmem=1024M -o ~/log/wikisource_stats.out -e ~/log/wikisource_stats.err ~/phe/statistics/gen_stats.sh
28 4 * * * jsub -N phe_logrotate -o ~/log/logrotate.out -e ~/log/logrotate.err /usr/sbin/logrotate ~/phe/logrotate.conf -s ~/log/logrotate.status
# Do not reenable hocr_request. It will try to download and run OCR on every
# single DjVu and PDF file on every single wikisource + commons. Xover 7. nov. 2021
#47 * * * * jsub -N hocr_request -o ~/log/hocr_request.out -e ~/log/hocr_request.err -v PYTHONPATH=$HOME/phe python3 ~/phe/hocr/hocr_request.py -prepare_request
*/10 * * * * jlocal python3 ~/phe/jobs/sge_jobs.py >> ~/log/sge_jobs.out 2>>  ~/log/sge_jobs.err
# broken with stretch
#    27 4 * * *  jsub -N wsircdaemon -once -o ~/log/cron_irc.err -e ~/log/cron_irc.err -quiet python -u phe/ircbot/pyirclogs.py
# Nice try, but it can't work as run_service.sh use jsub which is not available
# in exec node
#  13 4 * * *  jsub -N restart_ws_ocr -once -o ~/log/cron_ws_ocr.out -e ~/log/cron_ws_ocr.err ~/phe/run_service.sh restart ws_ocr_daemon
17 5 * * * jsub -N not_transcluded -o ~/log/not_transcluded.out -e ~/log/not_transcluded.err python3 ~/phe/statistics/not_transcluded.py
33 5 * * * jsub -N ppdir -o ~/log/dbtest.out -e ~/log/dbtest.err python3 ~/phe/statistics/dbtest.py
45 5 * * * jsub -N restart_match_split -o ~/log/cron_match_split.out -e ~/log/cron_match_split.err ~/phe/run_service.sh restart match_and_split

The IRC stuff hasn't worked in ages and I see no need to try to revive it, so that's going the way of the dodo.

restart_match_split is voodoo that "fixes" the job in question frequently hanging; it should probably just be dropped so we can fix the underlying problem properly.

logrotate just calls the system logrotate with our config to rotate our own log files. We really should have a better logrotate facility built into Toolforge, but at least this one should be fairly straightforward to migrate to a k8s scheduled job.
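
For what it's worth, a sketch of that as a Jobs Framework scheduled job; flag names as of the current toolforge jobs CLI, and whether /usr/sbin/logrotate exists in the standard images is something to verify:

# --filelog writes stdout/stderr to log files in the tool home, roughly
# replacing jsub's -o/-e redirects.
toolforge jobs run logrotate \
    --image python3.11 \
    --schedule "28 4 * * *" \
    --filelog \
    --command "/usr/sbin/logrotate ~/phe/logrotate.conf -s ~/log/logrotate.status"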

sge_jobs is going to be a headache, because this is where we parse the Grid Engine accounting file to keep track of our own internally-scheduled jobs and update the custom DB that in essence implements a completely custom job queue. I don't think we have access to anything comparable on k8s, and to the degree we do, it would mean talking to the raw Kubernetes API. I don't fully understand this system or what depends on it, but I seem to recall that the last time it had trouble, other stuff started breaking. And the second we touch it we run the risk of adding garbage to that database, which isn't going to be easy to untangle without any real documentation.
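
If that parsing has to go, the nearest k8s-side equivalents I know of are polling the Jobs Framework, or the raw Kubernetes API, for job status; a sketch (job name illustrative):

# Jobs Framework view of the tool's jobs:
toolforge jobs list
toolforge jobs show some-ocr-job

# Or the raw Kubernetes API from inside the tool account, which is what a
# rewritten sge_jobs.py would probably have to consume:
kubectl get jobs -o json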

wikisource_stats isn't actually written in shell; it just has a thin shell wrapper to set PYTHONPATH for pywikibot (Py2 vs. Py3) and start the script with python3. It otherwise uses the same libraries and pywikibot as the rest of the main tools.

How to deal with stdout/stderr logging is going to be an issue for all these (and our backends) on k8s.

Grid Engine jobs

The real backend of phetools is a set of continuous Grid Engine jobs started from ~/phe/run_service.sh. The script offers start/stop/status/restart options, sets environment variables and paths, etc. The jobs there are currently:

run_service.sh
dummy_robot
ws_ocr_daemon
match_and_split
extract_text_layer
verify_match

dummy_robot I am not sure of the purpose of (it may just be a test job). The rest are the moving parts of the OCR and Match&Split services, and the data they generate is used in various other places. They share a custom common app framework, implement a lot of stuff that's now available upstream in pywikibot (lists of languages etc.), use the custom job queue system and massive caches, and write to the wikis using the phe-bot account (and as such should probably be migrated to OAuth, but since everything is very asynchronous that raises a whole host of other issues).
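
For reference, a continuous job like ws_ocr_daemon would presumably map onto the Jobs Framework roughly like this; the flags are from the current toolforge jobs CLI, the image name assumes a future buildservice image for this tool, and the resource numbers are pure guesses:

toolforge jobs run ws-ocr-daemon \
    --image tool-phetools/tool-phetools:latest \
    --continuous \
    --mem 2Gi \
    --cpu 1 \
    --command "ws-ocr-daemon"
# With a buildservice image the command can name a Procfile entry
# (assumed here to be "ws-ocr-daemon").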

Build Service image requirements

For our custom Build Service image, we're going to have to figure out how to meet at least the following requirements (in case someone has experience with this and can help):

  1. First prerequisite to getting anywhere is getting the relevant code into a Gitlab repo
  2. Due to the mixing of PHP and Python code in the frontend, we'll need a single image for the webservice and the various backend pieces.
    1. The image needs to provide a lighttpd-like environment for the webservice components
    2. It also needs to provide the full Python environment since the Python parts use the same common (custom) app framework as the backend jobs
  3. The image needs the full codebase of the app
  4. The image needs a fully configured pywikibot install
  5. The image needs upstream Python libraries
    1. Which precise libraries is TBD, and whether they come from pip or apt remains to be seen
    2. Here we also need to figure out how to set up venv sensibly
  6. The image needs to have upstream binaries
    1. At the very least tesseract and its language packs
    2. There may be others
    3. These are almost certainly apt
  7. The image needs access to the toolsdb, but not the wikireplicas
    1. It uses fairly primitive means right now, so having the connect info available in the Environment Service may actually be an improvement
  8. The image needs to be network accessible for the frontend webservice
    1. For communicating over the mentioned custom socket protocol
  9. The image needs to mount /data/project/phetools
  10. We need to figure out logging to replace the -o/-e redirects of stdout/stderr used in the current Grid Engine setup
    1. Keeping in mind that the logs are big and absolutely need to be rotated
  11. And the wildcard is the custom job queue stuff using Grid Engine's accounting file and a custom database
    1. We may have to rewrite this from scratch using ad hoc k8s jobs and the k8s API for tracking status
    2. Ideally we should switch to some upstream job queue system; which is a big job but may still be less effort than trying to reimplement it on top of k8s
    3. So hopefully we can get away with just disabling the job queue system, removing deps on it, and losing the functionality it provides until someone has the capacity to do it right (but this may not be possible in practice)
  12. And run_service.sh needs to be rewritten to work with the Toolforge Jobs Framework instead of Grid Engine, including dealing with custom resource allocation per job (CPU, RAM)

Vague as that list is in parts, I think it probably covers what we need to get started; a sketch of the repo skeleton it implies follows below. Once we have a functioning image we can start trying to migrate bits and pieces of the code to run on that image on k8s. At that point we're going to start facing issues with most of the code base being written in Python 2, so we'll have a lot of rewriting to modern standards to do. We'll also start running into deps between the various moving pieces that will have to be resolved. And while this is ongoing we will generate a lot of downtime for the tool. My big worry is corrupting data or inserting garbage into caches, databases, and statistics stores. A secondary concern is that since we write to the wikis, we can end up causing a mess on-wiki as well. And finally, since we're doing this "live", we may hit a point where it is not feasible to back off to the old code on Grid Engine and we may just have to power through until we get the new stuff working.
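
To make that a bit more concrete, here is a sketch of the minimum repo skeleton I think the Build Service will want, assuming the heroku-style buildpacks Toolforge documents; all file contents below are illustrative, and the Aptfile depends on the apt support mentioned in the changelog:

# requirements.txt: pip deps for the Python buildpack (illustrative)
cat > requirements.txt <<'EOF'
pywikibot
EOF

# Aptfile: apt deps, at minimum tesseract and its language packs
cat > Aptfile <<'EOF'
tesseract-ocr
tesseract-ocr-fra
EOF

# Procfile: one entry per process type (web frontend plus each daemon)
cat > Procfile <<'EOF'
web: lighttpd -D -f lighttpd.conf
ws-ocr-daemon: python3 ocr/ocr_daemon.py
EOF

# Then, from a bastion, build the image from the Gitlab repo:
toolforge build start https://gitlab.wikimedia.org/toolforge-repos/phetools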

Some hints, hopefully helpful.

> Due to the mixing of PHP and Python code in the frontend, we'll need a single image for the webservice and the various backend pieces.

We are working on adding better support for multi-component tools, but the actual implementation is not going to be around soon enough, so in the meantime I see some other options here:

  • Having two different tools, one with the frontend and one with the backend only, that the frontend calls (if I understood the architecture correctly); this will work in the long run.
  • You can have two different images, one for the frontend and one for the backend + jobs, by using the -n <image-name> parameter to toolforge build (see the sketch below). This might change slightly in the future (say, a year out) as we are trying to give better support for multi-component tools (e.g. frontend + api + async runners). Your current discovery process should still work, but might break at some point; by then, though, we would probably have the multi-component support and you can migrate to that (in the end that will be something like using a hostname like mybackend.mytool.svc.kubernetes.local)
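
If I read that right, the two-image variant would look roughly like the following; a sketch only, with illustrative image names, and the second repo URL is hypothetical (in practice the two builds would likely point at different repos or branches):

toolforge build start -n frontend https://gitlab.wikimedia.org/toolforge-repos/phetools
toolforge build start -n backend https://gitlab.wikimedia.org/toolforge-repos/phetools-backend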

> My big worry is corrupting data or inserting garbage into caches, databases, and statistics stores.

For caches, you can use object storage now (https://wikitech.wikimedia.org/wiki/Help:Object_storage_user_guide); that would be a better place for those :) (and while you migrate, you can use the object storage only for the new processes, so you avoid messing it up)
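
For illustration, pushing a cache object with an S3-compatible client might look like this; a sketch only, assuming s3cmd, with the endpoint and bucket name to be taken from the user guide rather than from here:

# Endpoint per the object storage user guide; the value below is an
# assumption, and the bucket name is illustrative.
ENDPOINT=object.eqiad1.wikimediacloud.org
s3cmd --host="$ENDPOINT" --host-bucket="$ENDPOINT" \
      put hocr_cache.tar.gz s3://phetools-cache/hocr_cache.tar.gz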

For databases, you can also create a new DB in Trove for the new setup, avoiding the risk of messing up production. The statistics can probably do something similar (depending on how they persist data).

I'm looking into migrating some of the usable aspects (statistics and match + split) of phetools into separate standalone tools. This might take a while, however, so I'd like to request that the tool be kept running after the Feb 14th shutdown deadline.

> I'm looking into migrating some of the usable aspects (statistics and match + split) of phetools into separate standalone tools. This might take a while, however, so I'd like to request that the tool be kept running after the Feb 14th shutdown deadline.

If all you're doing is forking bits and pieces of it into separate tools, you should be able to do that without having phetools running on Grid Engine. I presume "shutting down the tool" means preventing it from spawning Grid Engine jobs, not deleting or making inaccessible the tool's account or code.

However, I still plan to try to port phetools entire to the Jobs Framework++ when I can find the time to sit down with it, so having an extension for that purpose would be appreciated.

> If all you're doing is forking bits and pieces of it into separate tools, you should be able to do that without having phetools running on Grid Engine. I presume "shutting down the tool" means preventing it from spawning Grid Engine jobs, not deleting or making inaccessible the tool's account or code.

You are right, yes: all the code + home dir + data + db + ... will still be there; just the grid processes will be stopped and prevented from spawning new ones. There's no limit or deadline on the migration itself, only on running things on the Grid.

> However, I still plan to try to port phetools entire to the Jobs Framework++ when I can find the time to sit down with it, so having an extension for that purpose would be appreciated.

Much appreciated :)
We can only extend until the 14th of March; after that date we will shut the actual Grid down so nothing can continue running. Just to note again: all the data, code, DBs, etc. will still be there, only the grid processes will not work anymore.

> If all you're doing is forking bits and pieces of it into separate tools, you should be able to do that without having phetools running on Grid Engine. I presume "shutting down the tool" means preventing it from spawning Grid Engine jobs, not deleting or making inaccessible the tool's account or code.

I wasn't implying the deletion of the code; given that a lot of the community still depends on this, I wanted a working alternative while we explored migration options.

Based on looking at the code, and playing around with the buildpacks and the jobs framework, I'm a bit pessimistic that we will be able to migrate everything (especially the custom jobs mess) to the newer framework in a reasonable amount of time, which is why I'm exploring forking parts of the tool into separate pieces.

Thanks @Soda, your work is greatly appreciated. We can extend the stopping of the tool by one month to allow you to keep moving parts to different codebases/tools.

> Based on looking at the code, and playing around with the buildpacks and the jobs framework, I'm a bit pessimistic that we will be able to migrate everything (especially the custom jobs mess) to the newer framework in a reasonable amount of time, …

I have a theory that we don't actually need the custom job management stuff; we just need to patch around what breaks when it's disabled. As I recall, it mainly gives us job overviews (which are nice-to-have, not need-to-have), and we can live with sending stuff into the void and hoping it'll finish successfully. But, as you say, it's a bit of a mess to begin with, and it's been a while since I dug into that code, so I could be wrong.

Hi! The phetools repository seems empty (https://gitlab.wikimedia.org/toolforge-repos/phetools.git). I can try to help a bit, but I would need the code to be there so I can send MRs and similar (also, that makes the tool compliant with the Toolforge TOU).

Remember that the grid is being turned off next week (14th of March).

Cheers!

taavi subscribed.

The grid engine has been shut down, so I'm closing any remaining migration tasks as Declined. If you're still planning to migrate this tool, please re-open this task and add one or more active project tags to it. (If you need a project tag for your tool, those can be created via the Toolforge admin console.)

Soda added a project: All-and-every-Wikisource.

Boldly changing this to open. While the grid engine might have shut down, there is still a need to migrate usable functionality off phetools. (For example, the statistics component is actively being migrated (by me); see the wsstats Toolforge tool account.) I am not sure which project tag this should be categorised under, but I don't think Declined correctly reflects the current state.

Soda triaged this task as High priority. Mar 14 2024, 1:09 PM

I am still planning to migrate this tool, IRL is just being recalcitrant about giving me sustained time slots to work on it.

@dcaro The repo is empty because creating an empty repo is all I've had time to actually do so far. The code used to be in Phe's personal Github repo, but that code is now badly out of date: the running code has been hacked on and expanded by multiple people, in a fairly uncoordinated fashion, since 2016. IOW, getting the right bits of code into the new repo is actually a small project in itself.

But if you'd like I'd be happy to add you as an admin on the tool?

I've been able to fork off the statistics stuff to https://wsstats.toolforge.org (mostly). (I did look into migrating it as-is, but had difficulty even identifying which files reference what.)

In terms of priority, I think match and split is probably the next major piece that needs to be migrated/recreated.

@Soda Amazing! Thank you! A user request from French Wikisource: would it be possible to add plots for all the Wikisources to the website (a line per wiki)? And the data in a table format instead of charts (I guess just allowing a CSV download is fine)?

> @Soda Amazing! Thank you! A user request from French Wikisource: would it be possible to add plots for all the Wikisources to the website (a line per wiki)? And the data in a table format instead of charts (I guess just allowing a CSV download is fine)?

I wasn't able to get one line per wiki, but https://wsstats.toolforge.org/stats/all/alltime exists now, which gives an overview of all Wikisources. I will look into generating a phetools-like CSV once I'm confident that the rest of the daily data updates are working as expected :)

> I wasn't able to get one line per wiki, but https://wsstats.toolforge.org/stats/all/alltime exists now, which gives an overview of all Wikisources.

Thanks! Sadly it seems that the root page https://wsstats.toolforge.org/ links to https://wsstats.toolforge.org/stats/null/alltime (note the null). It would also be amazing to have, at the beginning, an "all Wikisources" section that plots the sum across all Wikisources.

> I will look into generating a phetools-like CSV once I'm confident that the rest of the daily data updates are working as expected :)

Amazing! Thank you!

Hi, is there any chance of MATCH&SPLIT being reintegrated into Wikisource soon? Many contributors (including myself) use it extensively to ease the process of matching pre-formatted text to scans, and we're desperately hoping to be able to use it again! Many thanks in advance!

> Hi, is there any chance of MATCH&SPLIT being reintegrated into Wikisource soon? Many contributors (including myself) use it extensively to ease the process of matching pre-formatted text to scans, and we're desperately hoping to be able to use it again! Many thanks in advance!

https://matchandsplit.toolforge.org now exists. (Consider it experimental for now) :)

> https://matchandsplit.toolforge.org now exists. (Consider it experimental for now) :)

I'm not sure how to use it. I submitted the link to a page for a test (https://fr.wikisource.org/wiki/La_g%C3%A9n%C3%A9ration_des_animaux) with a line to indicate where the match should start (MATCH:Page:Aristote - Traité de la génération des animaux - tome I.djvu/20), but so far nothing has happened.

> I'm not sure how to use it. I submitted the link to a page for a test (https://fr.wikisource.org/wiki/La_g%C3%A9n%C3%A9ration_des_animaux) with a line to indicate where the match should start (MATCH:Page:Aristote - Traité de la génération des animaux - tome I.djvu/20), but so far nothing has happened.

Looking at the logs, you set the language to "french" (instead of "fr"), which caused the bot to throw a fit :( I've clarified that in the form now. (Also, I've added a status page at https://matchandsplit.toolforge.org/status which should list all currently executing/queued tasks.)

> Looking at the logs, you set the language to "french" (instead of "fr"), which caused the bot to throw a fit :( I've clarified that in the form now. (Also, I've added a status page at https://matchandsplit.toolforge.org/status which should list all currently executing/queued tasks.)

OK thanks, MATCH ran successfully up until a discontinuity between the text and the scan (as expected).
However, the last match was inserted within a tag, which should not occur.
The original tag:
{{t2|LIVRE PREMIER}}

becomes:
{{t2|

[[Page:Aristote - Traité de la génération des animaux - tome I.djvu/278]]

LIVRE PREMIER}}

I then tried to run SPLIT but it does not appear to work (I entered the exact same information as for MATCH); there was no {"status":"recieved"} message as in the case of MATCH.

Update: SPLIT now shows the expected {"status":"recieved"} message but fails to run properly; it does not appear on the https://matchandsplit.toolforge.org/status page and the text is not distributed to the corresponding pages of the manuscript as it should be.

> Update: SPLIT now shows the expected {"status":"recieved"} message but fails to run properly; it does not appear on the https://matchandsplit.toolforge.org/status page and the text is not distributed to the corresponding pages of the manuscript as it should be.

It looks like the tool got stuck on a bad version. I've pushed a fix and manually triggered the job that got stuck.

> It looks like the tool got stuck on a bad version. I've pushed a fix and manually triggered the job that got stuck.

Unfortunately I tried SPLIT with another page (https://fr.wikisource.org/wiki/La_g%C3%A9n%C3%A9ration_des_animaux_tome_II) and the problem persists.

> Unfortunately I tried SPLIT with another page (https://fr.wikisource.org/wiki/La_g%C3%A9n%C3%A9ration_des_animaux_tome_II) and the problem persists.

The markup being submitted to the page was invalid, causing matchandsplit to throw errors. I have made a few edits to the page, and it should now work with matchandsplit.

> The markup being submitted to the page was invalid, causing matchandsplit to throw errors. I have made a few edits to the page, and it should now work with matchandsplit.

Thanks, this indeed solved the issue!

Hello, I'm a user of French Wikisource and I use Split to upload the works I prepare offline. To be clear, I use "Split" only (I don't use Match), with a syntax like:

[[Page:MyFile.djvu/136]]

line of text
line of text

[[Page:MyFile.djvu/137]]

line of text
line of text

Can the new version of Split handle this case?

@Cunegonde1 It should work. Which page is this? (I might be able to debug your issue better if I know the exact page.)

Thank you very much for your help. I just did a test and everything works perfectly. Thanks again for your work on this precious tool that is Match and Split.

I tested MATCH again on a text that has formatting tags and references (https://fr.wikisource.org/wiki/La_g%C3%A9n%C3%A9ration_des_animaux) and I get two weird behaviors:

1- Page breaks are inserted within formatting tags (as already reported in a previous post), which should not occur.
For example, the following text:
{{SéparateurDeTexte|2}}
{{t3|CHAPITRE V}}

becomes:
{{SéparateurDeTexte|2}}
{{t3|

[[Page:Aristote - Traité de la génération des animaux - tome I.djvu/320]]

CHAPITRE V}}

2- Many pages were not matched to the manuscript and instead, MATCH appended page breaks within the notes (see for example note 3, as well as many others below).

Do you think it might be possible to modify the code so that these behaviors no longer happen?
Many thanks in advance!