https://wikimediafoundation.org/wiki/User:Sthottingal-WMF
Maintainer and engineer for ContentTranslation, UniversalLanguageSelector, and general MediaWiki-Internationalization
The CX entrypoint is also duplicated if you click multiple times while the language selector is loading:
After the migration to node fetch, the error is still there:
TypeError: Cannot read properties of undefined (reading 'pages')
    at processResult (/srv/service/lib/mw/BatchedAPIRequest.js:85:23)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
The issue is resolved, and the root cause of the bad requests from the preq library is also resolved.
Browsers natively support hyphenation (breaking the word at the proper position) these days. There is no need to change the content for this. The following CSS example shows how to do it. I developed the hyphenation system for Indian languages, and that is what Chrome, Firefox, TeX, LibreOffice, InDesign, etc. use these days.
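A minimal illustration (the selector is just an example; the element needs the correct lang attribute so the browser picks the right hyphenation dictionary):

/* Let the browser hyphenate automatically at proper positions. */
p {
  hyphens: auto;
}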
Neither MinT nor Apertium uses a proxy. They were working, and then we added MT clients with proxy support. Then the clients without a proxy started showing this issue. This is not consistently reproducible, but it happens very frequently.
The issue is resolved now that the train is rolled back. Not closing, as we need to monitor this when the train runs with the backported patch.
It seems the backend issue is T359509: REST API calls suddenly all returning 400 and there is already a patch to be reviewed and merged:
mw-cli can help us to create many language wikis in a cloud instance.
So we can have http://en.mediawiki.mwdd.localhost:8080, http://ig.mediawiki.mwdd.localhost:8080 ..
Tangential note: https://ruralindiaonline.org/en/articles/in-2023-paribhasha-builds-a-peoples-archive-in-peoples-languages/ is a bad example because of the broken rendering of the scripts used in the title image. We should never do that.
There is a feature in Superset where we can embed any dashboard in any web page. That seems the easiest approach here. https://github.com/apache/superset/tree/master/superset-embedded-sdk
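A rough sketch of how a page would use that SDK, going by its README; the dashboard id, domain, and guest-token helper below are placeholders, and our backend would still have to issue the guest tokens:

import { embedDashboard } from '@superset-ui/embedded-sdk';

embedDashboard( {
	// Placeholder values: the embed id comes from Superset's "Embed dashboard" dialog.
	id: 'dashboard-embed-uuid',
	supersetDomain: 'https://superset.example.org',
	mountPoint: document.getElementById( 'dashboard-container' ),
	// Hypothetical helper that asks our own backend for a Superset guest token.
	fetchGuestToken: () => fetchGuestTokenFromBackend(),
	dashboardUiConfig: { hideTitle: true }
} );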
A screenshot illustrating reference misplacement with the current prototype, from https://en.wikipedia.org/wiki/Polar_bear:
The above patch is a quick run to identify the effort required to migrate away from servicerunner. It is not meant to be merged. My proposal is to modernize various parts of cxserver while still using servicerunner as the process manager, do these migrations in iterations, and at a later stage, when cxserver has no strong dependency on servicerunner other than as a process manager, replace it. Doing everything in one go is too risky, as cxserver is the backbone of our heavily used translation system.
The list of models for a language pair is provided in the API output of https://translate.wmcloud.org/api/languages
This is linked in the UI - See bottom links - API Spec
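A quick way for a client to read that list (assuming the endpoint simply returns JSON over a plain GET):

// Fetch the language/model listing and inspect which models serve a given pair.
const response = await fetch( 'https://translate.wmcloud.org/api/languages' );
const languages = await response.json();
console.log( languages );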
Additional information: this issue happens with the indictrans2-en-indic model. NLLB-200 gives the correct output.
This will allow the Language team to use this model server.
However, when inspecting the target language selector, you will notice that Santali (sat) is not listed.
$ uname -r
6.1.0-15-cloud-amd64
The actual failure can be reproduced by visiting https://cxserver.wikimedia.org/v2/page/sv/nn/Royal_Society_for_the_Protection_of_Birds
Page sv:Royal_Society_for_the_Protection_of_Birds could not be found. TypeError: item.dispose is not a function
The root cause is a regression from the recent cxserver upgrade. The fix is already in place at https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/978192 and is waiting for deployment.
From our past observations, especially during translation campaigns, many users participate, potentially creating low-quality articles. The review happens much later. Reviewers have also complained that they cannot review all these articles on time. When the review happens, articles get deleted, so the deletion happens weeks after the translation activity. Considering this, the chance that a new user already has a deleted translation at the moment they make an intentional or unintentional low-quality translation is small.
Hence, the proposed strict limit when a user has a deletion in the last 30 days might not have the expected effect. However, I support keeping it in place, but the user should be clearly told why stricter translation limits apply to them.
The current logic in CX for the CJK group of languages (including Chinese) is as follows: the tokens are characters instead of words, so 人口 has 2 tokens.
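A rough sketch of that counting rule (not the actual CX code; the language list here is only illustrative):

// Sketch only: count characters as tokens for CJK languages, words otherwise.
const CJK_LANGUAGES = new Set( [ 'zh', 'ja', 'ko' ] );

function countTokens( language, text ) {
	if ( CJK_LANGUAGES.has( language ) ) {
		// Each character is a token, so '人口' counts as 2 tokens.
		return Array.from( text.replace( /\s+/g, '' ) ).length;
	}
	// For other languages, tokens are whitespace-separated words.
	return text.trim().split( /\s+/ ).filter( Boolean ).length;
}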
@elukey, what do you mean by 'reaching out to you by next time'? Regarding the architecture of MinT and why it is not using Lift Wing, we had a discussion in the past; I don't think it is useful to repeat it. There is a reason why we put the models on people.wikimedia.org - it was per a recommendation from SRE, and this ticket was created to make it more reliable. We still need a public location for downloading the models, as MinT is not designed for WMF infrastructure alone.
We need a 2 TB scratch volume mounted too.
https://test.wikipedia.org/w/rest.php/coredev/v0/transform/wikitext/to/html/Oxygen looks good. If this can be exposed for all production wikis, we can definitely move to this endpoint.
It seems we need to continue with RESTBase for the time being, until a stable, well-documented API is available as a replacement, right?
http://parsoid-external-ci-access.beta.wmflabs.org - does this use the actual production wikis, or beta.wmflabs.org? If it is beta.wmflabs.org, then we will be limited by content and supported languages, right?
If you need access to pagebundles or the transform endpoints, then we have to figure something out.
I think we have a serious problem here.
At https://phabricator.wikimedia.org/T350219#9298055, @daniel wrote:
"Parsoid endpoints are not expected to work for external requests. So this is "working" as expected."
Fixed in sentencex version 0.5.1
@Sportzpikachu Thanks for the PR. Please note that jquery.i18n has a successor, banana.i18n, which is a framework-agnostic JS library. That is the library we are going to actively maintain. If your use case can use that library, that would be much better.
The model expects sentences; that is how it is trained. For example, a word like "Moon" can appear in many Latin-script languages as a proper noun, a reference to a book title, and so on. The prediction quality increases as more words are provided, because the model then knows more about the context of each word.
Thank you @isarantopoulos and @elukey !
We have the service in production: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_language_identification_prediction
@elukey If I understood that documentation correctly, even if the service requires an OAuth token, anonymous users can still use it with the applicable rate limiting. Am I right?
There would be use cases where a non-MediaWiki static web page uses this API, and for those this anonymous, rate-limited option should be sufficient.
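For example, a static page could call the prediction endpoint without any credentials roughly like this (a sketch: the model path and request body shape here are assumptions; the Lift Wing API reference linked above has the exact details):

// Sketch of an anonymous, rate-limited call from a static web page.
// '<model-name>' is a placeholder, and the body shape is assumed;
// check the Lift Wing API reference for the real path and payload.
const response = await fetch(
	'https://api.wikimedia.org/service/lw/inference/v1/models/<model-name>:predict',
	{
		method: 'POST',
		headers: { 'Content-Type': 'application/json' },
		body: JSON.stringify( { text: 'Text whose language we want to identify' } )
	}
);
const prediction = await response.json();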
Yes, references are moved to the end of the sentence, as also seen in the example below. Positioning the references at the correct place in the translation is slightly complicated and needs to be implemented.
In T340507#9242994, @isarantopoulos wrote: @santhosh Thanks for creating the model card!
Is there a client/system that will use this at the moment? If yes, is there an estimate on the amount of traffic we should be expecting? Main reason I am asking is so that we know the scaling requirements (if any) and also can validate via load testing.
By adding the following line to your common.js on Wikipedia, you can see the proof of concept:
importScript( 'User:Santhosh.thottingal/mint-section-translation.js' );
Example: https://en.wikipedia.org/wiki/User:Santhosh.thottingal/common.js
We now have a library for this, in JS and Python.
Not only styles, but spaces are replaced by .
Trying to reproduce the issue:
Hi @isarantopoulos I drafted the model card here: https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Language_Identification