sweil
User

User Details

User Since
Jul 25 2022, 11:58 AM (199 w, 10 h)
Availability
Available
LDAP User
Unknown
MediaWiki User
Stefan Weil [ Global Accounts ]

Recent Activity

Tue, May 5

sweil added a comment to T403346: Add Arabic model to kraken OCR.

Both models are now installed on https://kraken-ocr.wmcloud.org/. Maybe someone who can read Arabic can test it.

Tue, May 5, 6:21 AM · Arabic-Sites, Community-Tech, Wikimedia OCR

Mon, May 4

sweil added a comment to T345055: Add kraken OCR engine to Wikimedia OCR.

After the upgrade to Debian trixie, the next step is upgrading wikimedia-ocr and kraken. This is currently not possible because the 20 GB disk space is not sufficient, even after removing most log files and lots of Debian packages (including Tesseract OCR). I think that a total capacity of 40 GB would be needed for a full installation with several kraken models.

Mon, May 4, 8:45 PM · Patch-For-Review, Wikimedia OCR
sweil added a comment to T424818: [Cloud VPS alert][wikisource] Puppet failure on kraken-ocr.wikisource.eqiad1.wikimedia.cloud.

A recreate with Debian trixie would be fine for me. Maybe this is easier than fixing the current installation.

Mon, May 4, 7:06 AM · cloud-services-team, Cloud-VPS, Wikimedia OCR, Essential-Work, Community-Tech

Sun, May 3

sweil added a subtask for T345055: Add kraken OCR engine to Wikimedia OCR: T403346: Add Arabic model to kraken OCR.
Sun, May 3, 7:50 AM · Patch-For-Review, Wikimedia OCR
sweil added a parent task for T403346: Add Arabic model to kraken OCR: T345055: Add kraken OCR engine to Wikimedia OCR.
Sun, May 3, 7:50 AM · Arabic-Sites, Community-Tech, Wikimedia OCR
sweil added a subtask for T345055: Add kraken OCR engine to Wikimedia OCR: T424818: [Cloud VPS alert][wikisource] Puppet failure on kraken-ocr.wikisource.eqiad1.wikimedia.cloud.
Sun, May 3, 7:48 AM · Patch-For-Review, Wikimedia OCR
sweil added a parent task for T424818: [Cloud VPS alert][wikisource] Puppet failure on kraken-ocr.wikisource.eqiad1.wikimedia.cloud: T345055: Add kraken OCR engine to Wikimedia OCR.
Sun, May 3, 7:48 AM · cloud-services-team, Cloud-VPS, Wikimedia OCR, Essential-Work, Community-Tech
sweil added a comment to T424818: [Cloud VPS alert][wikisource] Puppet failure on kraken-ocr.wikisource.eqiad1.wikimedia.cloud.

I'm afraid that I caused this issue with an update to Debian trixie. Maybe puppet was uninstalled accidentally. I cannot fix it, because the VM no longer accepts my SSH keys, and I have no other access, such as a VNC console.

Sun, May 3, 7:37 AM · cloud-services-team, Cloud-VPS, Wikimedia OCR, Essential-Work, Community-Tech

Jun 26 2024

sweil added a comment to T287460: Add Mannheim University OCR models.

Meanwhile german_print is the best model from UB Mannheim.

Jun 26 2024, 8:45 AM · User-Inductiveload, Wikimedia OCR
sweil added a comment to T322576: Extend API to get also hocr output.

As far as I know only Tesseract (and Kraken as soon as it is available) can produce hOCR output. Transkribus can produce PAGE XML which could be converted to hOCR.

Jun 26 2024, 8:32 AM · Wikimedia OCR
sweil created T368507: Wikimedia OCR is currently single threaded (blocking a 2nd access).
Jun 26 2024, 8:26 AM · Community-Tech, Wikimedia OCR
sweil added a comment to T330061: Rename Languages to Models in the OCR tool UI.

For Tesseract there exist language models, script models and models which are neither for a single language nor for a single script.

Jun 26 2024, 8:15 AM · Patch-For-Review, Wikimedia OCR

Nov 1 2023

sweil added a comment to T347562: Page view in non-default user interface should be translatable by Firefox (<html lang> attribute) .

Then let me rephrase my bug report: Wikipedia uses incorrect language attributes if a non-default user interface is selected.

Nov 1 2023, 3:52 PM · Upstream, MediaWiki-User-Interface, MediaWiki-Engineering, MediaWiki-Core-Skin-Architecture

Oct 25 2023

sweil added a comment to T347562: Page view in non-default user interface should be translatable by Firefox (<html lang> attribute) .

That is subjective (is it a page in German with a content block in French, or a page in French with various menu blocks in German?). The task description offers no reason why one interpretation would be more useful than the other, and changing the code would be a major effort (instead of tagging the content block(s) with the content language we'd have to tag all the non-content blocks with the UI language) so I'd decline this.

Oct 25 2023, 1:41 PM · Upstream, MediaWiki-User-Interface, MediaWiki-Engineering, MediaWiki-Core-Skin-Architecture

Oct 2 2023

sweil added a comment to T347562: Page view in non-default user interface should be translatable by Firefox (<html lang> attribute) .

Right, but the change would only affect users who are logged in. So it is a fix for Wikipedia authors, not for the majority of "normal" users.

Oct 2 2023, 7:25 PM · Upstream, MediaWiki-User-Interface, MediaWiki-Engineering, MediaWiki-Core-Skin-Architecture
sweil added a comment to T347562: Page view in non-default user interface should be translatable by Firefox (<html lang> attribute) .

The main content of the page uses the language of the selected Wikipedia, French for fr.wikipedia.org, German for de.wikipedia.org and so on. As long as the HTML tag specifies that language, translation programs will translate that content.

Oct 2 2023, 7:04 PM · Upstream, MediaWiki-User-Interface, MediaWiki-Engineering, MediaWiki-Core-Skin-Architecture

Sep 28 2023

sweil closed T346413: Install Kraken OCR (and web service) on a new Wikisource VPS, a subtask of T345055: Add kraken OCR engine to Wikimedia OCR, as Resolved.
Sep 28 2023, 11:38 AM · Patch-For-Review, Wikimedia OCR
sweil closed T346413: Install Kraken OCR (and web service) on a new Wikisource VPS as Resolved.

Meanwhile, Kraken is installed and configured, and the web service is online.

Sep 28 2023, 11:38 AM · Community-Tech, Wikimedia OCR
sweil created T347562: Page view in non-default user interface should be translatable by Firefox (<html lang> attribute) .
Sep 28 2023, 10:09 AM · Upstream, MediaWiki-User-Interface, MediaWiki-Engineering, MediaWiki-Core-Skin-Architecture

Sep 20 2023

sweil added a comment to T346854: Increase quota for wikisource project (for new OCR service).

Merci bien.

Sep 20 2023, 4:46 PM · Wikimedia OCR, Community-Tech, Cloud-VPS (Quota-requests)

Sep 19 2023

sweil added a comment to T346413: Install Kraken OCR (and web service) on a new Wikisource VPS.

@Samwilson, it looks like Wikimedia OCR currently does not handle more than a single OCR process at the same time. Is that correct? Doesn't that cause much waiting if the service is used heavily? Did users complain about slow OCR because of that?

Sep 19 2023, 3:16 PM · Community-Tech, Wikimedia OCR
sweil added a comment to T345055: Add kraken OCR engine to Wikimedia OCR.

Kraken also supports different models for the segmentation (region and line detection).
The segmentation model should be selectable from the web interface and the API, too.

Sep 19 2023, 3:07 PM · Patch-For-Review, Wikimedia OCR

Sep 18 2023

sweil added a comment to T345055: Add kraken OCR engine to Wikimedia OCR.

The current implementation offers 3 different models for the text recognition.
Is there a need for non-Latin scripts as well? Which ones? Arabic? Hebrew? Others?

Sep 18 2023, 12:41 PM · Patch-For-Review, Wikimedia OCR
sweil added a comment to T346413: Install Kraken OCR (and web service) on a new Wikisource VPS.

A virtual machine for tests with kraken should provide at least 4 VCPUs, 8 GiB RAM, 8 GB storage (minimum values). More VCPUs allow more parallel processing.

Sep 18 2023, 12:23 PM · Community-Tech, Wikimedia OCR

Aug 30 2023

sweil added a comment to T332611: Upgrade Wikimedia OCR to Bookworm and PHP 8.2.

Temporarily disabling this check for PHP 8.2 seems like a good idea to me. After all, it will still be run for PHP 8.1 and older versions.

Aug 30 2023, 12:41 PM · Community-Tech, Wikimedia OCR

Aug 28 2023

sweil updated the task description for T345055: Add kraken OCR engine to Wikimedia OCR.
Aug 28 2023, 8:29 AM · Patch-For-Review, Wikimedia OCR
sweil updated the task description for T345055: Add kraken OCR engine to Wikimedia OCR.
Aug 28 2023, 8:28 AM · Patch-For-Review, Wikimedia OCR
sweil created T345055: Add kraken OCR engine to Wikimedia OCR.
Aug 28 2023, 8:28 AM · Patch-For-Review, Wikimedia OCR
sweil added a comment to T284835: Install all data files for tesseract and use the appropriate ones.

You could use only the models from tessdata. They support the legacy OCR engine and include a fast model (derived from tessdata_best) for the LSTM OCR engine.

Aug 28 2023, 8:05 AM · Wikimedia OCR

Aug 16 2023

sweil renamed T316428: OCR is not working for Devanagari script in wikisource.org from OCR is not working for Devnagari script in wikisource.org to OCR is not working for Devanagari script in wikisource.org.
Aug 16 2023, 5:21 AM · Wikimedia OCR
sweil added a comment to T332450: Upgrade ws-export VMs to Bullseye.

Maybe it would be better to skip Bullseye and directly go to the current stable Bookworm?

Aug 16 2023, 5:20 AM · Community-Tech, WS Export
sweil added a comment to T332611: Upgrade Wikimedia OCR to Bookworm and PHP 8.2.

Meanwhile Debian Bookworm is the current stable version. It comes with PHP 8.2.

Aug 16 2023, 5:18 AM · Community-Tech, Wikimedia OCR
sweil created T344307: Add item description to Wikidata's notification e-mails for item changes.
Aug 16 2023, 4:49 AM · MediaWiki-Email, Wikidata

Aug 13 2023

sweil added a comment to T286656: Old English support in Tesseract.

Thanks. Adding Upstream as I assume this is about https://github.com/tesseract-ocr/tesseract (which is missing on https://www.mediawiki.org/wiki/Upstream_projects )

Aug 13 2023, 9:02 PM · Upstream, User-Inductiveload, Wikimedia OCR
sweil added a comment to T286656: Old English support in Tesseract.

A Tesseract model can be trained either with artificial data (which requires Old English texts and fonts) or with real page images and matching transcriptions. Do you have such data? If yes, I could try to train a Tesseract model.

Aug 13 2023, 8:54 PM · Upstream, User-Inductiveload, Wikimedia OCR

Jul 25 2022

sweil added a comment to T287460: Add Mannheim University OCR models.

Don't hesitate to ask me (the author of the mentioned models) if there remain any open questions.

Jul 25 2022, 12:05 PM · User-Inductiveload, Wikimedia OCR