
Outreachy 32: Addressing the lusophone technological wishlist proposals - Visual Editor / Wikidata
Open, Needs Triage, Public

Description

Project title: Addressing the lusophone technological wishlist proposals

Brief summary: The Lusophone technological wishlist, in the context of the Lusophone wikis, is a survey that aims to identify which technological innovations, tools, and platforms could be modified to improve the user experience; that is, to identify and prioritize the most basic needs of the community of editors, readers, and researchers of the Wikimedia projects in Portuguese, so that they have a more productive and pleasant experience. This connects with the Wiki Experiences bucket of the WMF Annual Plan (Products & Technology). This project aims to implement one of the community wishes present in this wishlist, specifically wish #3 or #8, depending on the intern's experience and abilities. Wish #3 is to implement a check in the Visual Editor for duplicate references, using the reference identifier (ISBN, DOI, or URL), and to let the user reuse the existing reference. Wish #8 is to implement Wikidata support in Wikimedia Brasil's scoring tool wikiscore, allowing the community to run edit-a-thons and editing contests using Wikidata.

Expected outcomes: to implement wishlist #8 (improve wikiscore so that it counts points for Wikidata edits) and/or wishlist #3 (verify duplicate references automatically), depending on the intern's experience and abilities.

Skills required/preferred: Python, Django, Wikitext, MediaWiki APIs, Wikibase APIs, REST APIs, JavaScript

Mentors: @Ederporto @Arcstur

Rating: medium

Microtasks: T418285 and T418286. Applicants who submitted both tasks by Monday, April 6th, 4pm UTC, will receive feedback by Friday, April 10th, so that they can improve their applications. The project is closed to new applications.

Any other additional information for contributors: we'll meet over Google Meet and collaborate on code through GitHub.

NEW QUESTIONS

What WMF priority does this project align with? A Wishlist item? An Annual Plan objective?
It relates to community wishes 17 and 192, the Contributor Experiences (WE1) sub-bucket, and more specifically to Lusophone wishlist wishes 3 and 8.

Why are you proposing it? What needs are you aiming to meet? Is it for your Wiki chapter, your community, etc.? This is a wish requested by the Lusophone Wikimedia community, and we established fulfilling at least two (and hopefully more!) of those wishes as one of our projects this year.

What is the expected impact? What does success look like? How will this affect the needs you have identified? Success means a tool that helps organizers and individuals track and qualify meaningful contributions to Wikidata, especially in wiki contests and other events. The second project helps Wikipedia editors in general (not only on the Lusophone Wikipedia) reuse references within an article without duplicating them.

Recommendation

We strongly encourage mentors to request additional specific details to help weed out AI-generated applications from potential contributors. Consider adding prerequisites, and ensure that you communicate directly with contributors before making your selection.

Related Objects

Status | Assigned
Open | LGoto
Open | Arcstur
Open | Arcstur
Open | Arcstur
Open | EileenBlessing
Open | Essa237
Open | Ayush_khati1
Open | Six-shot54
Open | Supreetkaur0602
Open | Kenny4111
Open | Ankitaa05
Open | Shehrbano_Ali
Open | Olamidepeterojo
Open | Rayala_Venkata_Bhagya_Lakshmi
Open | Chidimma95
Open | NushrinaT
Open | Ilma-salsabil
Open | Bincyben
Open | Kathbonav
Open | PayalS3
Open | Adwivedii
Open | Anvitha098
Open | Sania231
Open | Rishannn
Open | Anushka0111

Event Timeline


Hi @Arcstur @Ederporto and fellow contributors
I am Supreet Kaur, an Outreachy contributor. I am greatly interested in contributing to the Lusophone technological wishlist project. I am currently working on the two microtasks. Looking forward to an amazing journey of learning and contributing here.

Hello @Arcstur , @Ederporto and everyone here.
I am an Outreachy applicant for this round and I am interested in contributing to this project. Looking forward to an amazing learning journey.

Hi @Arcstur and @Ederporto,

I explored wishlist #3 and created a small prototype to detect duplicate references by comparing normalized URLs.

It also suggests reusing an existing reference if a duplicate is found.

Here is my implementation: https://github.com/Ankita-kuntal/duplicate-reference-detector

I would appreciate any feedback on whether this approach aligns with the expected direction.

Thank you!
Ankita Kuntal

Hello @Arcstur and @Ederporto,

I’m an Outreachy applicant and I’ve completed both microtasks (T418285 and T418286).

I worked on:

  • A JavaScript solution to format article data into a human-readable output inside the HTML page
  • A Python script to read URLs from a CSV file and print their HTTP status codes with proper error handling
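For readers following along, the second microtask can be sketched roughly like this, using only the standard library (the CSV layout, one URL per row in the first column, is an assumption, as are the function names):

```python
import csv
import urllib.error
import urllib.request

def read_urls(path):
    """Read one URL per row from the first column of a CSV file."""
    with open(path, newline="") as f:
        return [row[0].strip() for row in csv.reader(f) if row and row[0].strip()]

def check_url(url, timeout=10):
    """Return the HTTP status code, or a short error label if the request fails."""
    try:
        req = urllib.request.Request(url, method="HEAD")  # only the status is needed
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code                       # 4xx/5xx responses still carry a status code
    except Exception as exc:                  # timeouts, DNS failures, malformed URLs
        return f"error: {type(exc).__name__}"
```

Typical usage would be `for url in read_urls("urls.csv"): print(url, check_url(url))`.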

I have experience working with JavaScript, Python, and building user-facing features, and I focused on writing clean, readable, and robust solutions.

I’ve submitted my work via email, recorded my contributions on Outreachy, and shared my GitHub repositories:

I’d really appreciate any feedback and would be happy to improve my work further.

Thank you!

Hello @Arcstur and @Ederporto,

I'm an Outreachy applicant and I've completed both microtasks (T418285 and T418286).

For Task 1, I created a JavaScript solution that formats the JSON data and displays it in a clear, human-readable format in the HTML page.

For Task 2, I developed a Python script that reads URLs from a CSV file and prints their HTTP status codes, including basic error handling for failed requests.

I focused on writing clean, readable, and reliable code.

I have submitted my work via email and also registered my contributions on the Outreachy website. You can find my work here:

Repository: https://github.com/EstherGyimah/Task-lusophone--Intern.git

I would appreciate any feedback and I’m open to improving my work.

Thank you.

Hello all
Has anyone started receiving feedback on their submission yet?

Hello @Supreetkaur0602 So far from my end, I have not received any feedback yet. Let's be patient; I believe with time our mentors will get back to us.

Hi everyone, we are receiving your emails, and feedback will be given to those who submit before Monday, April 6th, 4pm UTC. Feedback will only be given after this date.

You don't need to post your repositories or tasks here.

Kind regards.


Thank you for the update @Arcstur

Please use Zulip for general GSoC chat. Thanks.

Hi @Arcstur and @Ederporto

I am an Outreachy applicant. I have read the project description and I am happy to start contributing. I will share my progress with you.

Hello @Arcstur and @Ederporto, good day. I am an Outreachy applicant excited to contribute to the lusophone technological wishlist proposals project and microtasks.

Hi everyone,
For those who have already submitted, can you please help me with the submission email? I am unable to send an email to "tecnologia AT wmnobrasil.org" because of a wrong-format error.

I have sent it to tecnologia@wmnobrasil.org; please correct me if I am wrong.


Hello @Supreetkaur0602

Yes, this is the correct email to send it to.

Hi @Arcstur & @Ederporto 👋,

I'm Shehrbano Ali, a 20-year-old ML model developer. I am excited to apply for the Addressing the lusophone technological wishlist proposals project in this Outreachy cohort.

1. My Technical Journey: Python & ML
I am a self-taught developer. I have built some Python-based solo projects that also draw on my machine learning skills:

  1. Medical Report Analyzer
  2. Plant Disease Detector

These projects require a deep understanding of data structures, API integrations, and robust backend logic. I have spent the last year hardening my Python skills to handle complex system architectures, which aligns perfectly with the Python level required for this project.

2. Why I Chose This Project
I am drawn to this project because one of its wishes is also focused on Python.
I believe my core skills allow me to handle complex backend logic with ease, and I also have foundational knowledge of JavaScript to handle frontend requirements. With this mix of frontend awareness and deep backend Python/ML experience, I believe I am a strong fit for this Lusophone technological project.

3. My Goal for the Internship
My primary goals are to use my Python skills to:

  1. Automate contest scoring to make point counting fast and accurate
  2. Connect to Wikidata to automatically track what users are contributing
  3. Build requested features that help the Lusophone community grow
  4. Write clean code that is organized and easy for others to maintain

I'll reach out to you again soon with my completed and recorded microtasks (T418285 and T418286) before April 6th, 2026.

Thank you :)

Hello @Arcstur @Ederporto,

I have completed the microtasks:

I would really appreciate your feedback and any suggestions for improvement.

Thank you!

Hi @Arcstur and @Ederporto

I am an Outreachy applicant. I have read the project description and I am happy to start contributing to task T418284. I will share my progress with you.

Hello @Ederporto and @Arcstur, our mentors.
I am an Outreachy applicant, excited to contribute to the highlighted tasks and looking forward to your feedback on the next steps.


I have completed task 1 and now I will proceed to task 2. I was reading about the wishlist, and wish #8 really stood out to me. Right now, if someone edits Wikidata during a contest, those edits just don't count in Wikiscore, so their work basically goes unrecognized. That seems like such a fixable problem, and fixing it would actually motivate more people to contribute structured data. I am really excited to work on something like this.

This is a good starting issue. I've started working on it and will send the email soon too.
I'm really looking forward to the feedback as well.

Hi,

Thank you for the feedback! I have now tested my script on the full CSV dataset provided in the task. I improved handling for edge cases such as timeouts, invalid URLs, and connection errors so that the script runs without crashing.

Additionally, I updated the script to store the results in a CSV file for better usability.

I would appreciate any further feedback when you have time.

Hi @Arcstur, for Wish #3, are there any specific files or GitHub repos you'd recommend for understanding how it is structured?

Hi,

Thanks for pointing that out. I’d also be interested in exploring any recommended resources or repositories for Wishlist #3 to better understand its structure.

Looking forward to learning more about it.


Hi! I noticed you mentioned receiving feedback already.

I just wanted to confirm: has feedback started being shared, or will it still be provided after April 6 as mentioned earlier?

Thank you!

Hello @Arcstur , @Ederporto and my fellow contributors,

I have submitted Task 2, T418286. Apologies for the slight delay; my laptop ran into a technical issue caused by a conflict between a Samsung update and a Microsoft update, which took a few days to resolve.

Will be submitting Task 1 shortly as well.

Thank you!

Hi @Arcstur,

I’ve been exploring Wishlist #3 further and trying to understand the duplicate detection problem.

From what I understand, different identifiers like URLs, DOIs, and ISBNs can refer to the same source, which makes simple comparison insufficient.

It seems like a layered approach — normalizing within each identifier type and then handling cross-type matching — could be useful.

Would this be a good direction to think about? I’d appreciate any guidance when you have time.

One thing worth thinking about is that each type of identifier needs its own cleaning before comparing. For example, https://doi.org/10.1100/abc and 10.1100/abc are the same DOI, and 978-3-16-148410-0 and 9783161484100 are the same ISBN, just written differently. So the checker needs to clean and normalize each type separately before deciding whether two references are duplicates or not.

A trickier situation is when different identifier types refer to the same source, such as a URL and a DOI for the same article. Simple cleaning and normalization will not detect this, because the formats differ. Handling it would need some external lookup to see what each identifier actually points to. I'm curious whether cross-type matching is in scope for the first version, or whether same-type matching is the right starting point. It would be best to confirm with the mentors @Arcstur @Ederporto.

Another edge case I am thinking about is when some references on a Wikipedia page are not added through the Visual Editor. They may be written directly in the wikitext or added through scripts. If the checker only analyzes references processed by Visual Editor, it might miss duplicates. Should the checker rely only on what the Visual Editor detects, or should it analyze the full page wikitext? Would love to know how you are thinking about this @Arcstur @Ederporto.

Also, once a duplicate is detected, what should happen next? Should it work like Visual Editor's existing named reference feature, where it shows the user the existing reference and asks if they want to reuse it?
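As an illustration of the per-type cleaning discussed in this thread, a minimal same-type comparison might look like this (a sketch only; the normalization rules and function names are assumptions, not the planned Visual Editor implementation):

```python
import re

def normalize_doi(raw):
    """Strip resolver prefixes so 'https://doi.org/10.1100/abc' equals '10.1100/abc'."""
    return re.sub(r"^https?://(dx\.)?doi\.org/", "", raw.strip().lower())

def normalize_isbn(raw):
    """Drop hyphens and spaces so '978-3-16-148410-0' equals '9783161484100'."""
    return re.sub(r"[\s-]", "", raw.strip()).upper()  # keep a trailing check digit 'X' uppercase

def same_reference(a, b, kind):
    """Compare two identifiers of the same type after per-type cleaning."""
    normalize = {"doi": normalize_doi, "isbn": normalize_isbn}[kind]
    return normalize(a) == normalize(b)
```

Cross-type matching (URL vs. DOI) would sit on top of this, via an external lookup, as discussed above.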

Greetings to you, @Arcstur and @Ederporto.

I have been working through the project and paying attention to Wishlist #3.

To gain a clearer insight into the issue, I created a small interactive prototype that recreates duplicate reference detection with normalised identifiers (URL, DOI, ISBN) and a basic reuse flow.

🔗 Live Demo: https://wish3-duplicate-reference-prototype.vercel.app/
🔗 Repository: https://github.com/Jikugodwill/wish3-duplicate-reference-prototype

The prototype focuses on:

  • Normalizing identifiers
  • Finding duplicates on the fly
  • Recommending reuse of existing references

I approached it from a workflow point of view, to see how this would fit into the Visual Editor experience.

Regarding the microtasks:

  • JavaScript task: T418285 completed.
  • I am still working on T418286 (Python task), which I will submit soon.

It will be my pleasure to know how well this direction fits the planned implementation.

Thanks!

Hi @Arcstur,

Thank you for the detailed discussion so far — it has been very helpful.

I’ve been going through the ideas around Wishlist #3, especially about identifier normalization and duplicate detection. Based on this, I am trying to understand how a simple approach could work as a starting point.

For example, I am thinking of:

  • Normalizing identifiers (like cleaning URLs, DOIs, ISBNs)
  • Comparing within the same type first
  • Then exploring how cross-type matching might be handled later

Would this be a good direction to begin with for a basic implementation?

I’d really appreciate any guidance on how to approach this as a beginner.

Thank you!


Hi @Arcstur, @Ederporto,

I have submitted Task 1 & 2, (T418285 & T418286) and look forward to your feedback.

In the meantime, I am exploring more of Wishlist #3 and Wishlist #8. It would be great to have your guidance on my previous comment when you get a chance.

Thank you for your time!


Quick update:

I’ve now completed both microtasks:

🔗 Microtasks repository: https://github.com/Jikugodwill/microtask-python

Thanks!

Hi @mentor,

I implemented a simple prototype for duplicate reference detection based on Wishlist #3.

Approach

  • Normalized identifiers (DOI/URL) by removing prefixes like https://doi.org/
  • Compared references within the same type (starting simple)
  • Stored references using browser localStorage
  • Checked for duplicates before adding
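The check-before-add flow described above can be mirrored in Python for illustration (the class and method names here are hypothetical; the actual prototype stores references in browser localStorage):

```python
class ReferenceStore:
    """Keep references keyed by their normalized identifier; reject duplicates."""

    def __init__(self):
        self._refs = {}

    @staticmethod
    def _normalize(identifier):
        # Strip the DOI resolver prefix and whitespace; real code would branch per type.
        return identifier.strip().lower().removeprefix("https://doi.org/")

    def add(self, identifier, citation):
        """Add a citation; on a duplicate, return the existing one for reuse."""
        key = self._normalize(identifier)
        if key in self._refs:
            return False, self._refs[key]  # duplicate: offer the existing citation
        self._refs[key] = citation
        return True, citation
```

The second element of the return value is what a UI would show the user when proposing reuse.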

Prototype

What I learned

  • Importance of normalization before comparison
  • Handling edge cases like DOI vs URL formats
  • Basic duplicate detection logic

Next Steps

  • Add cross-type matching (DOI ↔ URL)
  • Improve UI and feedback
  • Explore API-based validation (CrossRef)

I would appreciate any feedback on whether this approach is aligned with the project direction.

Thank you!

@Shehrbano_Ali: Please ask general questions in Zulip instead.

Hello @Arcstur @Ederporto,

I would like to inform you that I have updated the README and improved code readability. Kindly consider these changes.

Thank you!!

The discussion here about identifier normalization is really interesting. With my background in JavaScript, I’ve been looking into how we can implement these checks in real-time within the Visual Editor without impacting the user's typing performance.

I've been researching ve.ui.MWReferenceDialog to understand how we might trigger a 'Potential Duplicate' alert when a user adds a URL. I'm curious to hear the mentors' thoughts: should we aim for a client-side check for immediate feedback, or would leveraging a MediaWiki API for normalization be more reliable for the first version? @Arcstur @Ederporto

Hi @Arcstur ,

I finished Task 2 and then updated it to use multithreading (ThreadPoolExecutor), so it checks multiple URLs at once instead of one by one.

It runs much faster now, especially with larger datasets. Feedback is welcome!
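For context, the concurrent pattern being described might look like this minimal standard-library sketch (function names are illustrative, not the applicant's actual code):

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def status_of(url, timeout=10):
    """Fetch one URL's HTTP status via HEAD; never raise, return a label instead."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return url, resp.status
    except Exception as exc:
        return url, f"error: {type(exc).__name__}"

def check_all(urls, workers=8):
    """Check many URLs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(status_of, urls))
```

`pool.map` preserves input order, which keeps the output easy to match back to the CSV rows.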

Hi @Arcstur and @Ederporto,

I have completed both microtasks and submitted my final application. Thank you for the opportunity to work on this project.

While exploring Wishlist #3, I’ve been thinking about how duplicate references could be handled more effectively in the Visual Editor. One idea I found interesting is normalizing identifiers such as URLs, DOIs, and ISBNs before comparison, so that the same reference written in different formats can still be detected as a duplicate.

For example, small differences like prefixes (https://, doi.org/) or formatting variations could lead to duplicates if not handled properly, so normalizing them first might improve accuracy.

I’m still learning and would really appreciate any feedback on whether this approach aligns with the expected direction for the project, or how I can improve my contributions.

Thank you for your time and guidance.

Thank you @Arcstur for the feedback. Here are the changes made:

Task 1: Replaced the hardcoded month names array with Intl.DateTimeFormat as suggested. This also makes it easy to switch to Portuguese month names which aligns well with the Lusophone context of this project.

Task 2: Simplified the code by removing the color-coded output, the retry logic, the summary, and the timestamped CSV, as these features were not required by the task. Retained features include specific-error classification, the User-Agent header, and duplicate detection. Refined the duplicate-detection logic after noticing that the previous logic was missing real duplicates. The updated approach correctly differentiates between URL differences that are meaningful and those that are not, which also connects to the core challenge described in Wishlist #3.
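To illustrate the idea of meaningful versus cosmetic URL differences, here is one possible normalization (a sketch; the choice of which query parameters count as tracking noise, and the function names, are assumptions):

```python
from urllib.parse import parse_qsl, urlencode, urlparse

# Query parameters that usually track the visitor rather than select content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "ref"}

def canonical_url(raw):
    """Normalize away differences that don't change what the URL points to."""
    p = urlparse(raw.strip())
    host = p.netloc.lower().removeprefix("www.")
    # Keep only parameters that affect the target, in a stable order.
    query = sorted((k, v) for k, v in parse_qsl(p.query) if k not in TRACKING_PARAMS)
    return p.scheme.lower(), host, p.path.rstrip("/"), urlencode(query)

def same_target(a, b):
    """True when two URLs point at the same resource after normalization."""
    return canonical_url(a) == canonical_url(b)
```

Under this rule, a trailing slash or a `utm_source` parameter is cosmetic, while a differing `page=` parameter is meaningful.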

Please let me know if you have any further feedback.

This comment was removed by NushrinaT.

Thank you @Arcstur for the valuable feedback
Here are the changes made.

Task 1: Replaced hardcoded month names with Intl.DateTimeFormat.
Also fixed a timezone issue: dates are now parsed as UTC to prevent the "off by one" bug in UTC-3 and other timezones.

Task 2: Removed complex CLI-tool features: no more argparse, command-line options, or concurrent processing.
Removed unnecessary classes: eliminated the URLStatusChecker class.
Removed advanced features: no retry logic, no statistics, no file saving.
Simplified requirements: only the requests library is needed now.

Thanks for the valuable input; I appreciate it.

I read through my task feedback, and it was such a great reality check. I used to think writing tons of comments made me a better coder, but the mentor reminded me that good code should just speak for itself.
I also learned a lot about tunnel vision and over-engineering. On Task 1, I was so focused on fixing a timezone bug that I hardcoded the month names, completely missing that using Intl.DateTimeFormat would solve both the bug and localization at once. And on Task 2, I built a GET fallback for 403 HEAD errors, not realizing that servers blocking HEAD almost always block GET too. It was a great reminder to truly understand how protocols work in the real world instead of just writing complex workarounds.
I have submitted the final application:)

Thank you @Arcstur for your feedback!

I have made the suggested updates:

Task 1: JSON Data Formatting

Fixed the date issue by handling the creation date in local time to avoid timezone-related shifts.

Task 2: URL Status Checker

Updated the implementation to use a HEAD request instead of GET, since only the status code is required.

Thanks again for the feedback.

Hi @Arcstur and @Ederporto,

Thanks for the detailed feedback on the tasks. I have made the necessary improvements to my microtasks and updated my Outreachy proposal here: https://phabricator.wikimedia.org/T423451. I would appreciate any further feedback or suggestions.

I have already submitted the final application.

Best regards,