[Newcomer track] Machine learning for Wikipedia
I am proposing a session at the Wikimania Hackathon on machine learning projects for Wikipedia.
During a 30-45 minute session, I would like to cover the following topics:

  • Some common guidelines about usage of machine learning at Wikipedia, explained with examples
    • TODO: documents related to this will be linked to this phab ticket.
  • A list of project ideas that can be a starting point for your hacking sessions.
  • A general overview of the technology that can be used, and the limitations that apply to Wikipedia projects

The session outline, links, suggested reading and materials are given below. I will also have a presentation with this content.

Machine learning

Wikimedia Engineering Architecture Principles are applicable:

Some principles to keep in mind:

As per the Developer Satisfaction Survey 2023, the majority of respondents indicated that English is not their first or primary language.
So I would personally focus on projects or ideas related to language diversity in this session.

Machine translation

Wikipedia now has a self-hosted machine translation service.

A test instance is available, and it has translation APIs too.

Try to build some cool applications using the machine translation API. You get free APIs.

  • How about a browser plugin that translates selected text using MinT?
  • A Wikipedia gadget/script that translates Wikipedia sections?
  • Chaining ASR -> MT -> TTS to do speech-to-speech translation?
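
The speech-to-speech idea can be sketched as a chain of three stages. This is a minimal illustration, not a MinT client: the stage functions `fake_asr`, `fake_mt` and `fake_tts` are hypothetical stand-ins that you would replace with calls to real ASR, MT and TTS services.

```python
from typing import Callable


def speech_to_speech(
    audio: bytes,
    asr: Callable[[bytes], str],
    mt: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> bytes:
    """Chain ASR -> MT -> TTS; every stage is supplied by the caller."""
    text = asr(audio)          # speech in the source language -> text
    translated = mt(text)      # source-language text -> target-language text
    return tts(translated)     # target-language text -> speech


# Stand-in stages, for illustration only (assumptions, not real services).
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_mt(text: str) -> str:
    return {"hello": "hola"}.get(text, text)


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


result = speech_to_speech(b"hello", fake_asr, fake_mt, fake_tts)
```

Keeping the stages as plain callables makes it easy to swap one service for another, or to test the pipeline without network access.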

However, this is a test instance, so we don't offer any uptime guarantees. But don't worry, this service can run on your laptops.
Just clone the repo from here, and run it:
Or download the prebuilt docker container and run it:

Example Docker deployment (remember to replace the tag with the latest one):

$ docker pull

$ docker run -dp 8089:8989

You now have an MT service supporting 35924 language pairs across 198 unique languages running on your laptop.

As you can see, MinT supports translating not only plain text, but also HTML, JSON, Markdown, SVG, etc.
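
To give a feel for what translating a structured format involves, here is a minimal sketch for JSON: walk the document and translate only the string values, leaving keys, numbers and booleans untouched. How MinT handles this internally is not described here, so treat this as the general idea only. The `translate` argument is a placeholder for a call to an MT API, and `demo_translate` is a hypothetical stand-in.

```python
from typing import Any, Callable


def translate_json(node: Any, translate: Callable[[str], str]) -> Any:
    """Return a copy of `node` with every string value translated.

    Keys, numbers, booleans and None are left as they are.
    """
    if isinstance(node, str):
        return translate(node)
    if isinstance(node, list):
        return [translate_json(item, translate) for item in node]
    if isinstance(node, dict):
        return {key: translate_json(value, translate) for key, value in node.items()}
    return node


# Hypothetical stand-in for a real MT call.
def demo_translate(text: str) -> str:
    return text.upper()


doc = {
    "title": "hello",
    "tags": ["a", "b"],
    "views": 3,
    "meta": {"note": "ok", "draft": False},
}
translated = translate_json(doc, demo_translate)
```

In a real application you would probably also want to skip values that should not be translated, such as URLs or identifiers.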

Can you host this service on your web server? Your university's web server? Or a Wikipedia chapter's server, to help distribute the computing cost for WMF? Seems like a good idea? Interested?
Read this document:


Text to speech

Wikipedia started a project called Phonos to read out IPA (a pronunciation representation) and to provide general TTS capabilities. However, depending on Google's paid TTS to support a large set of languages is not a good idea. See

Interested in exploring and trying out alternate options? See,
Meta recently announced its MMS speech models, and coqui-ai/TTS recently integrated them. There is a demo web application that uses coqui-ai/TTS and Meta's MMS speech models.

  • Do you think a TTS service can help Wikipedia projects? Do you have some project ideas with TTS? Does this TTS support your language?
  • If it does, did you try it? Is there a way to fine-tune it? If your language is not supported, can we add support for it?

Automatic speech recognition (MIT licensed).
There is whisper.cpp, which optimizes it to run on CPUs alone (and also on GPUs). Try it on your computer?

Does this ASR support your language? If it does, did you try it? Is there a way to fine-tune it? If your language is not supported, can we add support for it?
What are some applications of ASR in the Wikipedia context?

Compute optimization

Due to our open-source policies, we have restrictions on the kinds of GPUs we can use.
Even if we could use powerful GPUs, any optimization of inference saves energy and operational costs, and makes these technologies accessible to more people.
See how we optimized machine translation models to run on CPUs
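
The text above does not say which optimizations were applied. As a generic illustration of one common inference optimization, here is a minimal sketch of symmetric int8 weight quantization, which stores each weight in one byte instead of four. It is an assumption chosen for illustration, not a description of how the MT models were actually optimized.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: floats -> int8 values plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    # Round to the nearest step and clamp into the int8 range [-127, 127].
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The trade-off is a small loss of precision in exchange for a smaller model and cheaper integer arithmetic, which is what makes CPU-only inference more practical.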

Optical Character Recognition

Tesseract supports 100+ languages and various image formats.
An example OCR frontend with API running at
There is also a tool to OCR content present in Commons. It uses Tesseract, Google Cloud Vision OCR and Transkribus.

  • Does Tesseract work well for your language? Please try it and give feedback.
  • Build applications using OCR. Example: translate the text in a JPEG image to another language?

Transkribus is an AI-powered platform for text recognition, transcription and searching of historical documents.

Large Language Models

Do read:

  3. Thoughts on chatGPT and Wikimedia

For learning:

  3. Challenges and Applications of Large Language Models

Experimenting with LLMs is costly. At present, WMF does not have any LLM-based service.

ChatGPT plugin.

LLM support for languages is also not wide.

There are some internal experiments on using LLMs for some use cases. All are experimental.

Retrieval Augmented Generation and natural language question answering:

Wikidata knowledge graph to articles

  • Creating summaries of articles based on facts retrieved from Wikidata
  • Creating placeholder articles based on facts retrieved from Wikidata
### Instruction: Write a paragraph based on the given data below in fluent English.
Place name: Tenerife
area: 2,034 km²
known for: tourism
visitors: 6 million per year.
popular resorts: Puerto de la Cruz and Playa de las Américas.

### Response:
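
A prompt like the one above can be assembled from a set of facts. The sketch below is a minimal illustration: the `facts` mapping is written by hand here, whereas in a real project it would be filled from Wikidata, and the helper name `build_prompt` is hypothetical.

```python
from typing import Mapping

INSTRUCTION = (
    "### Instruction: Write a paragraph based on the given data below "
    "in fluent English."
)


def build_prompt(facts: Mapping[str, str]) -> str:
    """Format label/value facts into an instruction-style prompt."""
    lines = [INSTRUCTION]
    lines += [f"{label}: {value}" for label, value in facts.items()]
    lines += ["", "### Response:"]
    return "\n".join(lines)


# Hand-written facts for illustration; a real project would query Wikidata.
facts = {
    "Place name": "Tenerife",
    "area": "2,034 km²",
    "known for": "tourism",
    "visitors": "6 million per year.",
    "popular resorts": "Puerto de la Cruz and Playa de las Américas.",
}

prompt = build_prompt(facts)
print(prompt)
```

The resulting string can be sent to whichever LLM you are experimenting with; the generated paragraph should always be checked against the source facts.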

Related reading:


Wikimedia's huggingface profile:

