
Proposal: Addressing the Lusophone Technological Wishlist Proposals Project
Closed, Declined (Public)

Description

Profile


Name: Supreet Kaur
GitHub: https://github.com/Supreetkaur1
Location: Punjab, India (UTC+5:30)
Availability: 40–45 hours/week (Full-time)
Working Hours: 11:30 AM – 8:30 PM IST

I am a Software Development Engineer with almost one year of professional experience at Amazon (Prime Video), where I worked on backend systems covering API design, large-scale data processing, and the reliability of user-facing pipelines. That work involved handling structured and semi-structured data, debugging production issues, and keeping data consistent across distributed systems.

Synopsis


The Lusophone Technological Wishlist is a community-driven initiative aimed at improving Wikimedia editing workflows and contributor experience.
This project focuses on two key areas:

a. Wishlist #3: Improving the citation workflow in VisualEditor by detecting duplicate references and encouraging reuse
b. Wishlist #8: Extending WikiScore to include Wikidata contributions, enabling better recognition of structured data work
Together, these improvements enhance both editor efficiency and fair evaluation of contributions in modern Wikimedia workflows.

I plan to prioritize Wishlist #8 (Wikidata integration for WikiScore) as my primary focus, and contribute to Wishlist #3 as a secondary goal depending on progress and mentor guidance.

Mentors


@Arcstur
@Ederporto


Understanding of Wishlist Items


Wishlist #3: Duplicate Reference Detection in VisualEditor
Editors often add duplicate references unintentionally because the same source can be cited in different formats (URL, DOI, ISBN), and each identifier can be written in several ways. Since VisualEditor does not actively compare references during insertion, this leads to redundant citations and makes articles harder to maintain.
The goal is to:
a. Detect duplicate references during citation insertion
b. Normalize identifiers for accurate comparison
c. Suggest reuse of existing references in a non-intrusive way
Wishlist #8: Wikidata Integration for WikiScore (Primary Focus)
WikiScore currently focuses on Wikipedia edits and does not fully account for Wikidata contributions. This creates a gap in edit-a-thons and contests where structured data contributions are significant.
The goal is to:
a. Fetch Wikidata contributions using APIs
b. Normalize and process structured edits
c. Integrate them into WikiScore’s scoring system

Technical Approach


I) Wishlist #8 – Wikidata Integration for WikiScore (Primary)
System Understanding

I will begin by analyzing WikiScore’s current architecture to understand how Wikipedia contributions are fetched and scored, and identify extension points for integrating Wikidata.
1. Data Fetching Layer
a. Use the MediaWiki Action API (the usercontribs list module) and Wikidata endpoints
b. Fetch user contributions relevant to structured data edits
c. Handle pagination, rate limits, and API reliability
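
As a rough illustration of the fetching step, the sketch below pages through list=usercontribs results by following the API's continuation data. Python is used only for illustration and is not a statement about WikiScore's own codebase. The function name iter_user_contribs, the injected fetch callable, and the chosen ucprop fields are assumptions; retries and rate-limit handling would live inside fetch.

```python
from typing import Callable, Dict, Iterator, Optional

# Assumed endpoint for Wikidata's Action API; a real `fetch` would GET this URL.
WIKIDATA_API = "https://www.wikidata.org/w/api.php"

# `fetch` takes the query parameters and returns the decoded JSON response.
# It is injected (hypothetical design) so the paging loop can be tested offline.
Fetch = Callable[[Dict[str, str]], dict]


def iter_user_contribs(
    fetch: Fetch,
    user: str,
    start: Optional[str] = None,
    end: Optional[str] = None,
    limit: int = 500,
) -> Iterator[dict]:
    """Yield every contribution of `user`, following API continuation."""
    params = {
        "action": "query",
        "format": "json",
        "list": "usercontribs",
        "ucuser": user,
        "uclimit": str(limit),
        "ucprop": "ids|title|timestamp|comment|sizediff|tags",
        "ucdir": "newer",  # oldest first, so start < end
    }
    if start:
        params["ucstart"] = start
    if end:
        params["ucend"] = end

    continuation: Dict[str, str] = {}
    while True:
        data = fetch({**params, **continuation})
        yield from data.get("query", {}).get("usercontribs", [])
        if "continue" not in data:
            break  # no more pages
        # Send back every key of the `continue` object with the next request.
        continuation = data["continue"]
```

A contest backend could call this once per participant with the event's start and end timestamps and pass the results on to the processing step.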
2. Data Processing Pipeline
a. Normalize contributions into a unified schema
b. Filter relevant actions such as:
i. Item creation
ii. Statement additions
iii. Label/description edits
iv. Reference updates
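
One possible way to filter these actions is to read the auto-generated edit summaries that Wikibase attaches to edits (for example "/* wbsetlabel-add:1|en */"). The sketch below is a minimal classifier under that assumption. The category names and the mapping table are hypothetical, and the list of summary prefixes is not exhaustive and would need checking against real contest data.

```python
import re
from typing import Optional

# Wikibase auto-summaries start with "/* <action>-<verb>:<args> */".
_AUTOCOMMENT = re.compile(r"/\*\s*(?P<action>[a-z]+)(?:-(?P<verb>[a-z-]+))?:")

# Hypothetical mapping from Wikibase action to a scoring category.
_CATEGORY_BY_ACTION = {
    "wbsetlabel": "label_or_description",
    "wbsetdescription": "label_or_description",
    "wbcreateclaim": "statement",
    "wbsetclaim": "statement",  # covers both created and updated statements
    "wbsetreference": "reference",
}


def classify_edit(comment: Optional[str]) -> Optional[str]:
    """Return a scoring category for an edit summary, or None if unrecognized."""
    match = _AUTOCOMMENT.match(comment or "")
    if match is None:
        return None
    action = match.group("action")
    verb = match.group("verb") or ""
    if action == "wbeditentity":
        # Only entity creation counts here; bulk updates are left unclassified.
        return "item_creation" if verb.startswith("create") else None
    return _CATEGORY_BY_ACTION.get(action)
```

Edits that return None would be excluded from scoring or handled by a fallback rule agreed with the mentors.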
3. Scoring Integration
a. Extend existing scoring logic to include Wikidata edits
b. Assign weights to different contribution types
c. Ensure consistency with existing Wikipedia scoring
Initial scoring will focus on clearly measurable actions, with weighting refined iteratively based on mentor feedback.
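
To make the weighting idea concrete, here is a minimal sketch of a weighted sum over classified edits. The category names and the numeric weights are placeholders invented for illustration; the actual values would be decided with the mentors and aligned with the existing Wikipedia scoring.

```python
from collections import Counter
from typing import Iterable, Mapping, Optional

# Placeholder weights (assumptions, not WikiScore's real configuration).
DEFAULT_WEIGHTS: Mapping[str, float] = {
    "item_creation": 5.0,
    "statement": 2.0,
    "reference": 2.0,
    "label_or_description": 1.0,
}


def score_contributions(
    categories: Iterable[Optional[str]],
    weights: Mapping[str, float] = DEFAULT_WEIGHTS,
) -> float:
    """Sum the weight of each classified edit; unknown categories score 0."""
    counts = Counter(c for c in categories if c is not None)
    return sum(weights.get(category, 0.0) * n for category, n in counts.items())
```

Keeping the weights in a plain mapping means an event organizer could override them per contest without touching the scoring function.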
4. Performance Considerations
a. Batch API requests to reduce overhead
b. Introduce caching for repeated queries
c. Optimize processing for large-scale events
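
The batching and caching ideas can be sketched as follows. The ucuser parameter accepts several usernames separated by "|" (up to 50 per request for regular clients), so participants can be grouped into batches. The helper names chunk_users and CachedFetcher are hypothetical, and the cache shown has no expiry, which a real deployment would likely need.

```python
from typing import Callable, Dict, Iterator, Sequence, Tuple


def chunk_users(users: Sequence[str], size: int = 50) -> Iterator[str]:
    """Yield pipe-joined batches of usernames suitable for `ucuser`."""
    for i in range(0, len(users), size):
        yield "|".join(users[i:i + size])


class CachedFetcher:
    """Wrap a fetch callable and reuse responses for identical parameters."""

    def __init__(self, fetch: Callable[[Dict[str, str]], dict]) -> None:
        self._fetch = fetch
        self._cache: Dict[Tuple[Tuple[str, str], ...], dict] = {}

    def __call__(self, params: Dict[str, str]) -> dict:
        # Sort items so parameter order does not change the cache key.
        key = tuple(sorted(params.items()))
        if key not in self._cache:
            self._cache[key] = self._fetch(params)
        return self._cache[key]
```

Batching reduces the number of requests per event, while the cache avoids refetching when the same scoring run is repeated with identical parameters.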

II) Wishlist #3 – VisualEditor Duplicate Reference Detection (Secondary)
Understanding the System

While exploring VisualEditor’s architecture, I studied how citation workflows are handled (e.g., via tools like Citoid) and how references are represented in the document model. This helps identify safe integration points for duplicate detection.
1. Reference Extraction & Normalization
a. Extract references from the VisualEditor document model
b. Normalize identifiers:
i. URLs → standardize protocol, remove trailing slashes
ii. DOIs → remove prefixes and normalize casing
iii. ISBNs → remove formatting inconsistencies
The initial implementation will focus on same-type matching (URL-URL, DOI-DOI, ISBN-ISBN), with cross-type matching explored later.
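
The normalization rules above could look roughly like the sketch below. VisualEditor itself is written in JavaScript, so this Python version is only meant to show the logic. Beyond the rules listed above, it makes a few extra assumptions that are open to discussion: http and https are treated as the same source, URL fragments are dropped, and ISBN-10 values are converted to ISBN-13 so both forms compare equal.

```python
import re
from urllib.parse import urlsplit, urlunsplit

_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_url(url: str) -> str:
    """Lowercase the host, force https, drop trailing slashes and fragments."""
    url = url.strip()
    if "://" not in url:
        url = "https://" + url  # assume a bare host/path is a web URL
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit(("https", parts.netloc.lower(), path, parts.query, ""))


def normalize_doi(doi: str) -> str:
    """Strip resolver/'doi:' prefixes; DOIs are case-insensitive, so lowercase."""
    return _DOI_PREFIX.sub("", doi.strip()).lower()


def normalize_isbn(isbn: str) -> str:
    """Keep digits (and X), converting ISBN-10 to ISBN-13 (an assumption)."""
    digits = re.sub(r"[^0-9Xx]", "", isbn).upper()
    if len(digits) == 10:
        core = "978" + digits[:9]
        # ISBN-13 check digit: weights alternate 1, 3 over the first 12 digits.
        total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
        return core + str((10 - total % 10) % 10)
    return digits
```

With these helpers, "http://Example.com/page/" and "https://example.com/page" normalize to the same string, which is what makes the same-type comparison reliable.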
2. Duplicate Detection Engine
a. Maintain an in-memory index of normalized identifiers
b. Use hash-based lookup for efficient comparison
c. Detect duplicates in real time during citation insertion
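
A minimal sketch of such an index, assuming identifiers have already been normalized: a dictionary keyed by (identifier type, normalized value) gives constant-time lookup on each insertion. The class name ReferenceIndex and the use of a reference name as the stored value are illustrative assumptions; the real implementation would be in JavaScript and would store whatever handle VisualEditor uses for an existing reference.

```python
from typing import Dict, Optional, Tuple


class ReferenceIndex:
    """In-memory index of references keyed by (kind, normalized identifier)."""

    def __init__(self) -> None:
        self._by_key: Dict[Tuple[str, str], str] = {}

    def find(self, kind: str, normalized_id: str) -> Optional[str]:
        """Return the name of an existing reference with this identifier."""
        return self._by_key.get((kind, normalized_id))

    def add(self, kind: str, normalized_id: str, ref_name: str) -> Optional[str]:
        """Register a reference.

        Returns the name of an already-indexed duplicate, or None if the
        identifier was new (in which case it is stored under `ref_name`).
        """
        existing = self.find(kind, normalized_id)
        if existing is None:
            self._by_key[(kind, normalized_id)] = ref_name
        return existing
```

Because the key includes the identifier type, this matches only same-type duplicates (URL-URL, DOI-DOI, ISBN-ISBN), in line with the initial scope.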
3. Integration & UX
a. Integrate into the VisualEditor reference dialog (e.g., citation insertion flow)
b. Provide non-intrusive feedback when duplicates are detected
c. Suggest reuse of existing references instead of creating new ones
4. Key Challenges
a. Handling incomplete or inconsistent identifiers
b. Maintaining performance in large articles
c. Ensuring seamless integration without disrupting user workflows

Timeline


The internship runs from May 18, 2026 to August 17, 2026 (13 weeks). I will work full-time (40–45 hours/week) and iterate based on mentor feedback.
The timeline below covers that period: Weeks 1–8 are dedicated to Wikidata integration for WikiScore (Wishlist #8, primary focus), and Weeks 9–13 to duplicate reference detection in VisualEditor (Wishlist #3, secondary focus).

Phase 1 (Primary): Wikidata Integration for WikiScore (Wishlist #8)

Week 1–2: Setup & System Understanding
a. Set up development environment
b. Study WikiScore architecture and scoring flow
c. Explore Wikidata APIs and contribution formats
d. Validate approach with mentors
Week 3–4: Data Fetching Layer
a. Implement API integration for Wikidata contributions
b. Handle pagination, rate limits, and failures
c. Begin initial data normalization
Week 5–6: Processing & Scoring Integration
a. Normalize contribution data into a unified schema
b. Implement scoring logic for key actions
c. Integrate into WikiScore pipeline
Week 7: Testing & Iteration
a. Test with real-world datasets (edit-a-thons, multiple users)
b. Handle edge cases (duplicates, reverted edits)
c. Refine scoring logic
Week 8: Stabilization & Documentation
a. Optimize performance
b. Clean up code and improve maintainability
c. Document architecture and usage

Phase 2 (Secondary): VisualEditor Duplicate Detection (Wishlist #3)

Week 9: Exploration & Design
a. Study VisualEditor reference handling
b. Identify integration points (e.g., reference dialog)
Week 10–11: Core Implementation
a. Implement normalization (URL, DOI, ISBN) and duplicate detection logic
b. Build in-memory indexing for references
Week 12: Integration & UX
a. Integrate into citation workflow
b. Add non-intrusive duplicate suggestions
Week 13: Testing & Finalization
a. Test with complex articles
b. Optimize performance
c. Final documentation and submission

I have designed this plan to deliver a functional MVP for Wishlist #8 early, followed by iterative improvements and secondary contributions to Wishlist #3 based on progress and mentor feedback.

Why I Am a Good Fit


My experience aligns closely with the challenges of this project.
As a software development engineer on the subtitles team at Prime Video, I have worked on backend systems for subtitle processing, where I handled duplication issues, inconsistent formatting, and data integrity across large-scale text datasets. This directly relates to Wishlist #3, where similar challenges arise in detecting and managing duplicate references.
Additionally, my work involved processing structured metadata and building API-driven workflows, which aligns with Wishlist #8’s requirement to fetch, process, and score structured Wikidata contributions.
While my primary experience is in backend systems, I have collaborated closely with frontend teams, giving me a strong understanding of how backend logic impacts user-facing workflows, an important consideration when working with tools like VisualEditor.
I aim to take a practical, iterative approach: focusing on delivering a correct and maintainable MVP first, and improving it based on feedback.

Post-Internship Contribution


After the internship, I plan to:
a. Continue maintaining and improving implemented features
b. Support new contributors in understanding the codebase
c. Contribute further to Wikimedia tools related to structured data and editor experience
d. Stay actively involved in the Wikimedia technical community

Event Timeline

Gopavasanth subscribed.

Thank you for your proposal and the effort you put into it. This year we received over 20 strong applications, and after a highly competitive review, we were unfortunately unable to offer you a slot.

Please don't see this as a failure. Many contributors who weren't selected for Outreachy have gone on to make a meaningful, lasting impact in the Wikimedia community, and we genuinely hope you'll stay engaged. You're very welcome to continue contributing outside of Outreachy. Our mentors and org admins are happy to help you get started or keep going.

We hope to see you around in the community.