https://wikimediafoundation.org/wiki/User:Sthottingal-WMF
Maintainer and engineer for ContentTranslation, UniversalLanguageSelector, and general MediaWiki internationalization.
The actual failure can be reproduced by visiting https://cxserver.wikimedia.org/v2/page/sv/nn/Royal_Society_for_the_Protection_of_Birds
Page sv:Royal_Society_for_the_Protection_of_Birds could not be found. TypeError: item.dispose is not a function
The root cause is a regression from the recent cxserver upgrade. A fix is already in place at https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/978192 and is waiting for deployment.
From our past observations, especially during translation campaigns, many users participate, potentially creating low-quality articles. The review happens much later. Reviewers have also complained that they cannot review all these articles in time. When the review happens, articles get deleted, so the deletion happens weeks after the translation activity. Considering this, the chance that a new user already has a deleted translation at the moment they make an intentional or unintentional low-quality translation is low.
Hence, the proposed strict limit when a user has a deletion in the last 30 days might not have the expected effect. However, I support keeping it in place, but it should be clearly communicated to the user why their translation limits are stricter.
The current logic in CX for the CJK group of languages (including Chinese) is as follows: tokens are characters instead of words, so 人口 has 2 tokens.
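To make that rule concrete, here is a minimal sketch (not the actual CX implementation; the language codes and helper name are illustrative):

```python
# Illustrative sketch of the token-counting rule described above, not the
# actual CX code: CJK text is counted per character, other languages per
# whitespace-separated word.
CJK_LANGUAGES = {"zh", "ja", "ko"}  # assumption: the CJK group handled this way

def count_tokens(text: str, language: str) -> int:
    if language in CJK_LANGUAGES:
        # Each non-space character is one token, so 人口 counts as 2 tokens.
        return sum(1 for ch in text if not ch.isspace())
    return len(text.split())

print(count_tokens("人口", "zh"))               # 2
print(count_tokens("population count", "en"))   # 2
```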
@elukey, what do you mean by 'reaching out to you by next time'? Regarding the architecture of MinT and why it is not using Lift Wing, we had that discussion in the past; I don't think it is useful to repeat it. There is a reason why we put the models on people.wikimedia.org: it was per a recommendation from SRE, and this ticket was created to make it more reliable. We still need a public location for downloading the models, as MinT is not designed for WMF infrastructure alone.
We need a 2 TB scratch volume mounted too.
https://test.wikipedia.org/w/rest.php/coredev/v0/transform/wikitext/to/html/Oxygen looks good. If this can be exposed for all production wikis, we can definitely move to this endpoint.
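For reference, a quick way to try that endpoint (a sketch; the endpoint is experimental and only the GET form shown here is assumed):

```python
# Fetch Parsoid HTML for the "Oxygen" page on test.wikipedia.org via the
# experimental coredev transform endpoint mentioned above (subject to change).
import requests

url = "https://test.wikipedia.org/w/rest.php/coredev/v0/transform/wikitext/to/html/Oxygen"
response = requests.get(url, timeout=30)
response.raise_for_status()
print(response.text[:500])  # beginning of the Parsoid HTML output
```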
It seems we need to continue with RESTBase for the time being, until a stable, well documented API is available as a replacement, right?
http://parsoid-external-ci-access.beta.wmflabs.org - Does this use an actual production wiki, or beta.wmflabs.org? If it is beta.wmflabs.org, then we will be limited in content and supported languages, right?
If you need access to pagebundles or the transform endpoints, then we have to figure something out.
I think we have a serious problem here.
At https://phabricator.wikimedia.org/T350219#9298055, @daniel wrote:
"Parsoid endpoints are not expected to work for external requests. So this is "working" as expected."
Fixed in sentencex version 0.5.1
@Sportzpikachu Thanks for the PR. Please note that jquery.i18n has a successor, banana.i18n, which is a framework-agnostic JS library and is the one we are actively going to maintain. If your use case can use that library, it would be much better.
The model expects sentences; that is how it was trained. For example, a word like "Moon" can appear in many Latin-script languages as a proper noun, as the title of a book, etc. Prediction quality increases as more words are provided, because the model then knows more about the context of the word.
Thank you @isarantopoulos and @elukey !
We have the service in production: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_language_identification_prediction
@elukey If I understood that documentation correctly, even if the service requires an OAuth token, anonymous users can still use it with the applicable rate limiting. Am I right?
There would be use cases where a non-MediaWiki static web page uses this API, and for those the anonymous, rate-limited option should be sufficient.
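For illustration, an anonymous call could look roughly like the sketch below. The exact endpoint path and the payload field name are assumptions on my part; the API reference linked above is authoritative.

```python
# Hedged sketch of an anonymous, rate-limited call to the Lift Wing language
# identification model. The endpoint path and the "text" payload key are
# assumptions based on the API reference linked above; verify before use.
import requests

ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/langid:predict"

def identify_language(text: str) -> dict:
    response = requests.post(ENDPOINT, json={"text": text}, timeout=30)
    response.raise_for_status()
    return response.json()

# Longer input gives the model more context, as noted above.
print(identify_language("The Moon is Earth's only natural satellite."))
```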
Yes, references are moved to the end of the sentence, as also seen in the example below. Placing references at the correct position in the translation is somewhat complicated and still needs to be implemented.
In T340507#9242994, @isarantopoulos wrote: @santhosh Thanks for creating the model card!
Is there a client/system that will use this at the moment? If yes, is there an estimate on the amount of traffic we should be expecting? Main reason I am asking is so that we know the scaling requirements (if any) and also can validate via load testing.
By adding the following line to your common.js on Wikipedia, you can see the proof of concept:
importScript( 'User:Santhosh.thottingal/mint-section-translation.js' );
Example: https://en.wikipedia.org/wiki/User:Santhosh.thottingal/common.js
We now have a library for this, in JS and Python.
Not only styles, but spaces are replaced by .
Trying to reproduce the issue:
Hi @isarantopoulos I drafted the model card here: https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Language_Identification
In T344982#9218576, @santhosh wrote: @daniel, @MSantos, what would be the web API for posting wikitext to /transform/wikitext/to/html? I could not see documentation for that at https://www.mediawiki.org/wiki/API:REST_API/Reference
In case it helps, the Language team had a very similar requirement for our machine translation service (MinT) and for CX/cxserver. We just published our sentence segmentation library in Python and JavaScript. It also keeps references clustered with the preceding sentence. It is designed to support a large number of languages, and custom rules per language are possible by design.
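A minimal usage sketch of the Python package (assuming the `sentencex` package on PyPI with a `segment(language, text)` helper returning an iterable of sentences; see its README for the exact API):

```python
# Sketch: segment English text with sentencex (pip install sentencex).
# Assumption: segment(language_code, text) yields sentence strings, and
# reference markers such as [1] stay attached to the preceding sentence.
from sentencex import segment

text = (
    "The Moon is Earth's only natural satellite. "
    "It orbits at an average distance of 384,400 km.[1] "
    "Its gravity produces the tides."
)

for sentence in segment("en", text):
    print(sentence)
```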
I did a temporary fix in the repository to unblock CI so that we are not blocked by this issue: https://gerrit.wikimedia.org/r/c/mediawiki/services/machinetranslation/+/958599 - it calls pytest directly instead of via tox, but the black and ruff checks are not present in it.
Hi, I looked into this issue again. Blubber not using a virtual environment is an important issue. As Debian Bookworm's pip restrictions are also coming up, we need to fix this.
We are already working on deploying this Flask app as a service on Lift Wing.
@abi_ Could you please include a screenshot in this ticket for a better understanding of the functionality and for historical reference? Thanks.
The session outline, links, suggested reading, and materials are given below. I will also have a presentation with this content.
I removed some unused files. The usage should be reduced from 11.7 GB to 8 GB now. The rest of the files cannot be deleted for now.
@elukey Not an answer to your question, but I am trying to assess the effort required here. Like everybody else, we are also constrained by people capacity :-)