
Proposal: Addressing the Lusophone Technological Wishlist Proposals Project
Closed, Declined (Public)

Description

Profile Information

Name: Ankita Kuntal
GitHub: https://github.com/Ankita-kuntal
Location: Agra, India
Timezone: UTC+5:30 (IST)
Availability: 40 hours/week (Full-time)
Preferred Working Hours: 4:00 PM – 12:00 AM IST

I am a third-year IT undergraduate at NIT Srinagar and a MERN stack developer, experienced in building full-stack applications with a focus on asynchronous data flows and frontend performance.

During the Outreachy contribution period, I completed two microtasks (T418285, T418286) in which I handled real-world data inconsistencies and improved request behavior. This work gave me familiarity with the Wikimedia ecosystem and its development workflow.

Synopsis

The Lusophone Technological Wishlist identifies improvements that enhance contributor workflows across Wikimedia projects.
This proposal focuses on Wishlist #3: Automatic Duplicate Reference Detection in the Visual Editor.

In the current workflow, contributors often add references without knowing whether the same source already exists in the article. This results in:

  • Duplicate citations
  • Cluttered reference sections
  • Reduced readability and maintainability

The goal of this project is to detect duplicate references in real time and guide users to reuse existing citations, reducing redundancy and improving editing efficiency.

Problem Statement

Contributors using the Visual Editor often cannot tell whether a source is already cited in the article, so the same reference gets added more than once.

The main challenge arises because the same reference can appear in different formats, for example:

  • URLs: protocol differences (http / https), presence of www, trailing slashes
  • DOIs: case differences or different representations (doi.org / direct identifier)
  • ISBNs: hyphenated vs non-hyphenated formats

Due to these variations, simple string comparison is not sufficient to detect duplicates.

Additionally, since this feature operates in a real-time editing environment, the solution must be:

  • Accurate (avoid false positives)
  • Efficient (minimal performance overhead)
  • Non-intrusive (should not disrupt editing)

The goal is therefore to detect a duplicate while the reference is being added and to guide the user toward reusing the existing one.

Mentors

Arcstur
Ederporto

Technical Approach

The solution is based on normalization, efficient matching, and real-time integration.

1. Normalization

To ensure reliable comparison, references are converted into a canonical form before matching:

  • URLs: remove protocol (http / https), www, and trailing slashes
  • DOIs: normalize case and extract the core identifier
  • ISBNs: remove hyphens and whitespace

This ensures that semantically identical references map to the same normalized representation.
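
The rules above can be sketched in JavaScript as follows. This is a minimal illustration, not the prototype's code: the function names are hypothetical, the DOI pattern is a simplified assumption that may keep trailing punctuation, and uppercasing the ISBN (so an ISBN-10 check digit "x" matches "X") is an addition beyond the rules listed above.

```javascript
// Hypothetical helpers for illustration; names are not taken from VisualEditor.

// URLs: drop the protocol, a leading "www.", and any trailing slashes.
function normalizeUrl(raw) {
  return raw
    .trim()
    .replace(/^https?:\/\//i, '')
    .replace(/^www\./i, '')
    .replace(/\/+$/, '');
}

// DOIs: extract the "10.<registrant>/<suffix>" part and lowercase it.
// Returns null when no DOI-like identifier is found.
function normalizeDoi(raw) {
  const match = raw.trim().match(/10\.\d{4,9}\/\S+/);
  return match ? match[0].toLowerCase() : null;
}

// ISBNs: remove hyphens and whitespace.
function normalizeIsbn(raw) {
  return raw.replace(/[\s-]+/g, '').toUpperCase();
}

console.log(normalizeUrl('http://www.example.org/page/') === normalizeUrl('https://example.org/page'));
console.log(normalizeDoi('https://doi.org/10.1000/XYZ123') === normalizeDoi('doi:10.1000/xyz123'));
console.log(normalizeIsbn('978-3-16-148410-0') === normalizeIsbn('9783161484100'));
```

Each comparison prints true, because both spellings reduce to the same canonical string.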

2. Duplicate Detection

  • Store the normalized form of every existing reference in a hash map
  • Normalize each incoming reference before comparison
  • Perform an average-case O(1) lookup to detect duplicates efficiently

This avoids rescanning every existing reference on each insertion, so detection stays fast even in large articles.
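
A minimal sketch of this lookup, assuming a JavaScript Map keyed by the normalized form. The ReferenceIndex class, its method names, and the "ref-1" identifier are hypothetical, and normalizeKey is a stand-in that only handles URLs so the example stays self-contained.

```javascript
// Stand-in normalizer for this sketch only: strips protocol, "www.", trailing slashes.
function normalizeKey(raw) {
  return raw
    .trim()
    .replace(/^https?:\/\//i, '')
    .replace(/^www\./i, '')
    .replace(/\/+$/, '');
}

// Hypothetical index mapping normalized keys to the name of the first reference seen.
class ReferenceIndex {
  constructor() {
    this.byKey = new Map();
  }

  // Register a reference; the first one registered for a key wins.
  add(rawValue, refName) {
    const key = normalizeKey(rawValue);
    if (!this.byKey.has(key)) {
      this.byKey.set(key, refName);
    }
  }

  // Return the name of an existing matching reference, or null if there is none.
  findDuplicate(rawValue) {
    return this.byKey.get(normalizeKey(rawValue)) ?? null;
  }
}

const index = new ReferenceIndex();
index.add('https://www.example.org/report/', 'ref-1');
console.log(index.findDuplicate('http://example.org/report')); // "ref-1"
console.log(index.findDuplicate('https://example.org/other')); // null
```

Returning the existing reference's name, rather than a boolean, gives the editor what it needs to offer reuse of that reference.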

3. Real-Time Integration

  • Trigger detection during reference insertion
  • Use debounced / asynchronous execution
  • Ensure non-blocking behavior in the Visual Editor
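
One way to keep the check from running on every keystroke is a trailing-edge debounce, sketched below. The debounce helper is a generic pattern, not VisualEditor code; the 300 ms delay and the checkForDuplicate callback are illustrative assumptions, and the hook into the reference dialog is deliberately not shown.

```javascript
// Generic trailing-edge debounce: fn runs once, delayMs after the last call.
function debounce(fn, delayMs) {
  let timer = null;
  return function debounced(...args) {
    if (timer !== null) {
      clearTimeout(timer);
    }
    timer = setTimeout(() => {
      timer = null;
      fn(...args);
    }, delayMs);
  };
}

// Hypothetical callback standing in for the real duplicate check.
const checkForDuplicate = debounce((value) => {
  console.log('checking', value);
}, 300);

// Three rapid inputs produce a single check, for the last value only.
checkForDuplicate('https://example.org/a');
checkForDuplicate('https://example.org/ab');
checkForDuplicate('https://example.org/abc');
```

Because the check runs from a timer callback, typing is never blocked while waiting for it.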

4. User Experience (UX)

  • Show non-intrusive alerts when duplicates are detected
  • Suggest reusing existing references
  • Keep interaction aligned with Visual Editor UI patterns

5. Edge Cases & Performance

  • Handle malformed or incomplete references
  • Reduce false positives for similar references
  • Cache normalized values to improve performance
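
The sketch below shows how malformed input and caching could be handled together for URLs. It assumes the standard URL constructor (available in browsers and Node.js); the safeNormalizeUrl name is hypothetical. Keeping the query string in the key is a design choice made here to avoid treating different pages on the same path as duplicates.

```javascript
// Cache of raw input -> normalized key (or null for unparseable input).
const normalizedCache = new Map();

// Returns a normalized key, or null when the input cannot be parsed as a URL.
function safeNormalizeUrl(raw) {
  if (typeof raw !== 'string' || raw.trim() === '') {
    return null;
  }
  const input = raw.trim();
  if (normalizedCache.has(input)) {
    return normalizedCache.get(input);
  }

  let key = null;
  try {
    // Assume https when the user typed a bare host such as "example.org/page".
    const hasScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(input);
    const url = new URL(hasScheme ? input : `https://${input}`);
    key = url.host.replace(/^www\./, '') + url.pathname.replace(/\/+$/, '') + url.search;
  } catch {
    key = null; // malformed or incomplete: skip detection instead of guessing
  }

  normalizedCache.set(input, key);
  return key;
}

console.log(safeNormalizeUrl('http://www.example.org/page/')); // "example.org/page"
console.log(safeNormalizeUrl('http://')); // null
```

Returning null for unparseable input lets the caller skip the duplicate check quietly rather than raise a false alert.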

Prototype / Prior Work

I built a working JavaScript prototype implementing the normalization
and detection logic:

The prototype supports URLs, DOIs, and ISBNs with O(1) Map-based lookup.
Building it taught me that normalization is harder than detection:
DOI formats alone have at least 4 common variations for the same identifier.

Timeline

The Outreachy internship runs from May 18, 2026 to August 17, 2026 (13 weeks).
I will be working 40 hours/week and sharing weekly updates with mentors.

Community Bonding (May 5 – May 17)

  • Set up MediaWiki + Visual Editor development environment
  • Explore reference workflows (creation, reuse, storage)
  • Analyze how references are stored and accessed internally
  • Identify integration points in Visual Editor
  • Finalize implementation plan with mentors

Week 1 (May 18 – May 24): Codebase Understanding

  • Trace full reference lifecycle (insert → store → reuse)
  • Study how references are represented internally
  • Identify edge cases in citation handling

Week 2 (May 25 – May 31): Data Preparation

  • Build utilities for:
    • Reference extraction
    • Parsing URLs, DOIs, ISBNs
  • Create structured representation for processing

Week 3 (June 1 – June 7): URL Normalization

  • Implement normalization for:
    • Protocol differences (http / https)
    • Domain prefixes (www)
    • Trailing slashes
  • Validate using real-world examples

Week 4 (June 8 – June 14): DOI & ISBN Normalization

  • Normalize:
    • DOIs → extract canonical identifier
    • ISBNs → remove formatting differences
  • Add unit tests for correctness

Week 5 (June 15 – June 21): Detection Engine (Core)

  • Implement hash-map based storage
  • Enable O(1) lookup for duplicate detection
  • Test across multiple reference formats

Week 6 (June 22 – June 28): Detection Refinement

  • Improve matching accuracy
  • Handle variations in formatting
  • Reduce false positives

Week 7 (June 29 – July 5): Integration (Phase 1)

  • Integrate detection into reference insertion workflow
  • Trigger checks when a new reference is added

Week 8 (July 6 – July 12): Integration (Phase 2)

  • Implement debounced / async execution
  • Ensure non-blocking UI behavior

Week 9 (July 13 – July 19): User Feedback

  • Add non-intrusive duplicate alerts
  • Suggest reuse of existing references

Week 10 (July 20 – July 26): UX Refinement

  • Improve interaction flow
  • Align with Visual Editor UI patterns

Week 11 (July 27 – August 2): Edge Cases

  • Handle malformed or incomplete references
  • Improve robustness

Week 12 (August 3 – August 9): Performance Optimization

  • Cache normalized references
  • Optimize for large articles
  • Benchmark detection performance
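
A rough micro-benchmark along these lines could look like the sketch below. It assumes a global performance.now() (browsers and recent Node.js versions); the synthetic keys and sizes are illustrative, and the measured time includes building each lookup string, so the result is only an upper bound on lookup cost.

```javascript
// Builds an index of `size` synthetic keys, then times `lookups` membership checks.
function benchmarkLookups(size, lookups) {
  const index = new Map();
  for (let i = 0; i < size; i++) {
    index.set(`example.org/article/${i}`, `ref-${i}`);
  }

  let hits = 0;
  const start = performance.now();
  for (let i = 0; i < lookups; i++) {
    if (index.has(`example.org/article/${i % size}`)) {
      hits++;
    }
  }
  const elapsedMs = performance.now() - start;

  return { hits, meanMicroseconds: (elapsedMs * 1000) / lookups };
}

const result = benchmarkLookups(10000, 100000);
console.log(`${result.hits} hits, ~${result.meanMicroseconds.toFixed(3)} microseconds per lookup`);
```

Numbers from a sketch like this vary by machine and engine, so they are useful mainly for comparing one approach against another on the same setup.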

Week 13 (August 10 – August 17): Finalization

  • Testing, bug fixes, and cleanup
  • Documentation and final submission

Why I am a strong fit for this project

I approach this problem with a clear focus on how it behaves in a real system, not just how it works in isolation. Detecting duplicates is not only about matching values, but about handling inconsistent data, avoiding false positives, and ensuring the solution works reliably in a real-time editor.

Through my work, I have consistently dealt with inconsistent inputs, asynchronous workflows, and real-time processing, which are central to building this feature correctly without impacting performance or user experience.

I focus on building solutions that are accurate, efficient, and practical to integrate, and I am comfortable iterating based on feedback to refine both correctness and usability.

I am confident in my ability to take this from a working idea to a well-integrated feature in the Visual Editor.

Event Timeline

Gopavasanth subscribed.

Thank you for your proposal and the effort you put into it. This year we received over 20 strong applications, and after a highly competitive review, we were unfortunately unable to offer you a slot.

Please don't see this as a failure. Many contributors who weren't selected for Outreachy have gone on to make a meaningful, lasting impact in the Wikimedia community, and we genuinely hope you'll stay engaged. You're very welcome to continue contributing outside of Outreachy. Our mentors and org admins are happy to help you get started or keep going:

We hope to see you around in the community.