
List loader: fetch html text from Wikisource
Open, Lowest, Public, Feature

Description

User request

I would like LinguaLibre to allow the production of audiobooks using texts from Wikisource or even other projects and then add the recordings at the correct places.

Avenue

  • Step 4 > adapt the « External tool » list loader to accept MediaWiki URLs
  • Convert the URL into an API query returning raw text
  • Inject the raw text as content to be recorded
  • T370618: when in poem-recording mode, treat the whole text as a single item to record.
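
As a rough illustration of the URL-to-API-query step, the sketch below turns a `/wiki/` page URL into an `api.php` query. It assumes the Wikimedia convention of `/wiki/<title>` for pages and `/w/api.php` for the API endpoint. By default it asks `action=parse` for rendered HTML (`prop=text`) rather than raw wikitext, since raw wikitext on Wikisource is often only template calls. The function name `wiki_url_to_api_query` is hypothetical; nothing here is taken from the Lingua Libre code base.

```python
from urllib.parse import unquote, urlencode, urlsplit


def wiki_url_to_api_query(page_url: str, prop: str = "text") -> str:
    """Build an action=parse API URL for a MediaWiki page URL.

    prop="text" asks for rendered HTML, prop="wikitext" for the raw source.
    Assumes the wiki serves pages under /wiki/ and its API under /w/api.php.
    """
    parts = urlsplit(page_url)
    prefix = "/wiki/"
    if not parts.path.startswith(prefix):
        raise ValueError(f"not a /wiki/ page URL: {page_url}")
    # Decode any percent-encoding so the title is encoded exactly once below.
    title = unquote(parts.path[len(prefix):])
    params = {
        "action": "parse",
        "page": title,
        "prop": prop,
        "format": "json",
        "formatversion": "2",
    }
    return f"{parts.scheme}://{parts.netloc}/w/api.php?{urlencode(params)}"


if __name__ == "__main__":
    print(wiki_url_to_api_query(
        "https://fr.wikisource.org/wiki/Élégies,_Marie_et_romances/Romances"
    ))
```

With `formatversion=2`, the rendered HTML should be found under `parse.text` in the JSON response; the caller still has to reduce that HTML to readable text.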

Exploration

Wikipedia has an easy path to extract plain text; it requires only light automatic formatting (or UTF-8).

Wikisource makes heavy use of cascading template inclusion, so the plain text of the target page is often just templates pointing to the sub-pages. This makes it difficult to integrate, even though it is the main request.

ws-export is active; we can exchange with its maintainers.

Observation

Wikisource: not straightforward. The development cost is significant.
Wikipedia: not straightforward either. Micro-templates are so ubiquitous that the extracted text is likely to be full of gaps that are only noticed when reading, which prevents quality recording.
I don't recommend attacking this issue. - Yug

Solution

@Poslovitch's Dicotheque successfully pulls text from Wikisource by fetching HTML. The Dicotheque code can be reused.
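
To make the "fetch HTML, then extract text" idea concrete, here is a minimal sketch that reduces rendered HTML to plain lines using only the Python standard library. It is not the Dicotheque code, which is not shown in this task; the class name `TextExtractor`, the helper `html_to_text`, and the choice of block-level tags are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, dropping <script>/<style> content."""

    SKIP = {"script", "style"}
    # Tags treated as line boundaries (an illustrative, incomplete list).
    BLOCK = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "tr"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace inside each line and drop empty lines.
        lines = (" ".join(l.split()) for l in "".join(self._chunks).splitlines())
        return "\n".join(l for l in lines if l)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

For a poem, keeping one output line per `<br>` or `<p>` preserves the verse breaks, which matters if the whole text is later recorded as a single item.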

Event Timeline

satdeep_gill changed the subtype of this task from "Task" to "Feature Request".
Yug renamed this task from Allow texts from Wikisource to be recorded with LinguaLibre to Recording: allow texts from Wikisource to be recorded. Jul 7 2022, 11:04 AM
Yug lowered the priority of this task from Low to Lowest. Jul 7 2022, 11:07 AM
Yug updated the task description. (Show Details)
Yug added subscribers: Samwilson, Tpt, Yug.

cc @Tpt, @Samwilson: hello.
On Wikisource and with ws-export, how do you get the full text for a root page such as https://fr.wikisource.org/wiki/Élégies,_Marie_et_romances/Romances, which only contains a hard-to-read template?
Even on a page itself I can't get the plain text.

Is there any code you could point us to and share?

Do you actually want plain text (i.e. without any formatting), or do you mean the full HTML of the page?

You could use ws-export to get the HTML, e.g. https://ws-export.wmcloud.org/?lang=fr&title=Élégies%2C_Marie_et_romances%2FRomances&format=htmlz
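
For programmatic use, the same kind of URL can be assembled from its parts. The sketch below only builds the URL; the `lang`, `title` and `format` parameters are the ones visible in the example link above, while the function name `ws_export_url` is hypothetical. Downloading and unpacking the result is left out.

```python
from urllib.parse import urlencode

WS_EXPORT_BASE = "https://ws-export.wmcloud.org/"


def ws_export_url(lang: str, title: str, fmt: str = "htmlz") -> str:
    """Build a ws-export download URL for one Wikisource work."""
    # urlencode percent-encodes the title, including "," and "/".
    query = urlencode({"lang": lang, "title": title, "format": fmt})
    return f"{WS_EXPORT_BASE}?{query}"


if __name__ == "__main__":
    print(ws_export_url("fr", "Élégies,_Marie_et_romances/Romances"))
```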

Yug changed the task status from Open to Stalled. Sep 25 2024, 4:41 PM

The problem is much more complex than expected.

  • Wikisource uses cascading templates and module rendering.
  • Target pages are nearly empty, mostly containing module calls.
  • The wikitext returned by the API is therefore the module/template calls, not the book's text.

Cannot proceed.

Does using WS Export not work for your use case? Is it the rate limiting? If you log in, you can bypass that.

Hello @Samwilson ,

I actually didn't notice your previous message:

Do you actually want plain text (i.e. without any formatting), or do you mean the full HTML of the page?

You could use ws-export to get the HTML, e.g. :

Maybe the following variant programmatic URL could work for our case :

From some reading, I note a query limit of 1 per minute (?); this would be problematic.

It would require either:

  • to whitelist queries from https://*.lingualibre.org

or

  • Given that our users will already be logged in via OAuth on the wiki(m|p)edia.org network, should we send some OAuth key to ws-export.wmcloud.org?

From some reading, I note a query limit of 1 per minute (?); this would be problematic.

Yep, the request limit is currently 2 requests per minute. Have you been hitting the limit, or are you just being cautious? We can definitely figure out a way for lingualibre to bypass it (or you can have the lingualibre tool log in and store its OAuth access token), but how many people are you expecting to use it? It seems like you'd probably want to cache the exported works rather than re-requesting them repeatedly.
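
One way to stay within such a limit on the client side is to cache exported works and space out upstream requests. The sketch below assumes the 2-requests-per-minute figure mentioned above; the class name `ThrottledCache` and its parameters are hypothetical, and the actual download function is injected so the logic can be tried without touching the network.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict


class ThrottledCache:
    """Cache results per key and allow at most max_requests per window."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        max_requests: int = 2,      # assumed limit: 2 requests ...
        window: float = 60.0,       # ... per 60 seconds
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fetch = fetch
        self._max = max_requests
        self._window = window
        self._clock = clock
        self._sleep = sleep
        self._cache: Dict[str, str] = {}
        self._sent: Deque[float] = deque()  # timestamps of upstream requests

    def get(self, key: str) -> str:
        # Cached works never count against the limit.
        if key in self._cache:
            return self._cache[key]
        now = self._clock()
        # Forget requests that have left the sliding window.
        while self._sent and now - self._sent[0] >= self._window:
            self._sent.popleft()
        if len(self._sent) >= self._max:
            # Wait until the oldest request falls out of the window.
            self._sleep(self._window - (now - self._sent[0]))
            self._sent.popleft()
        self._sent.append(self._clock())
        value = self._fetch(key)
        self._cache[key] = value
        return value
```

A local cache like this also means that a text corrected on Wikisource will not be picked up until the cached entry is dropped, so a real integration would likely need some way to invalidate an entry.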

I'm being cautious.
We only have about 4~6 Lingua Libre users per day, and fewer than one per day will use this Wikisource tool.
It won't be a burden on your service, but when our users fetch a text from Wikisource they typically redo the fetch to adjust it a bit.
So I'm being cautious, since I expect such hesitations and retries to reach 2~4 requests per minute about once a week.

No worries.

What is the "redo to adjust" process? Do you mean they'll get e.g. a chapter to read, notice some errors with it, go and fix those, and then get lingualibre to re-download that chapter? Because there's also caching within the tool, so that mightn't work that well.

Yug renamed this task from Recording: allow texts from Wikisource to be recorded to List loader: fetch texts from Wikisource. Nov 15 2024, 11:03 AM
Yug updated the task description. (Show Details)
Yug moved this task from Backlog to B-Word Lists and Generators on the Lingua-Libre board.
Yug added a subscriber: Poslovitch.
Yug renamed this task from List loader: fetch texts from Wikisource to List loader: fetch html text from Wikisource. Nov 15 2024, 11:19 AM
Yug changed the task status from Stalled to Open.