Profile Information
Name: Ankita Kuntal
GitHub: https://github.com/Ankita-kuntal
Location: Agra, India
Timezone: UTC+5:30 (IST)
Availability: 40 hours/week (Full-time)
Preferred Working Hours: 4:00 PM – 12:00 AM IST
I am a third-year IT undergraduate at NIT Srinagar and a MERN stack developer, experienced in building full-stack applications with a focus on asynchronous data flows and frontend performance.
During the Outreachy contribution period, I completed two microtasks (T418285, T418286) in which I handled real-world data inconsistencies and improved request behavior, gaining familiarity with the Wikimedia ecosystem and its development workflow.
Synopsis
The Lusophone Technological Wishlist identifies improvements that enhance contributor workflows across Wikimedia projects.
This proposal focuses on Wishlist #3: Automatic Duplicate Reference Detection in the Visual Editor.
In the current workflow, contributors often add references without knowing whether the same source already exists in the article. This results in:
- Duplicate citations
- Cluttered reference sections
- Reduced readability and maintainability
The goal of this project is to detect duplicate references in real time and guide users to reuse existing citations, reducing redundancy and improving editing efficiency.
Problem Statement
In the Visual Editor, contributors often add references without knowing whether the same source already exists in the article. This leads to duplicate citations and cluttered reference sections.
The main challenge arises because the same reference can appear in different formats, for example:
- URLs: protocol differences (http / https), presence of www, trailing slashes
- DOIs: case differences or different representations (doi.org / direct identifier)
- ISBNs: hyphenated vs non-hyphenated formats
Due to these variations, simple string comparison is not sufficient to detect duplicates.
Additionally, since this feature operates in a real-time editing environment, the solution must be:
- Accurate (avoid false positives)
- Efficient (minimal performance overhead)
- Non-intrusive (should not disrupt editing)
The goal is to detect duplicate references in real time and guide users to reuse existing references, improving consistency and reducing redundancy.
Mentors
Technical Approach
The solution is based on normalization, efficient matching, and real-time integration.
1. Normalization
To ensure reliable comparison, references are converted into a canonical form before matching:
- URLs: remove protocol (http / https), www, and trailing slashes
- DOIs: normalize case and extract the core identifier
- ISBNs: remove hyphens and whitespace
This ensures that semantically identical references map to the same normalized representation.
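The normalization rules above could be sketched as follows. This is a minimal illustration, not the prototype's actual code: the function names (normalizeUrl, normalizeDoi, normalizeIsbn) are hypothetical, the DOI pattern is a common heuristic rather than a complete validator, and lowercasing only the URL host is an added assumption, made because URL paths can be case-sensitive.

```javascript
// Hypothetical helpers; names are illustrative and not taken from VisualEditor.

// URLs: drop the protocol, a leading "www.", and trailing slashes.
// Assumption: only the host is lowercased, since paths may be case-sensitive.
function normalizeUrl(raw) {
  const stripped = raw.trim()
    .replace(/^https?:\/\//i, '')
    .replace(/^www\./i, '')
    .replace(/\/+$/, '');
  const hostEnd = stripped.search(/[/?#]/);
  if (hostEnd === -1) {
    return stripped.toLowerCase();
  }
  return stripped.slice(0, hostEnd).toLowerCase() + stripped.slice(hostEnd);
}

// DOIs: extract the "10.xxxx/suffix" core and lowercase it
// (DOI names are case-insensitive). Returns null if no DOI is found.
function normalizeDoi(raw) {
  const match = raw.trim().match(/10\.\d{4,9}\/\S+/);
  return match ? match[0].toLowerCase() : null;
}

// ISBNs: remove hyphens and whitespace, then accept only 10- or 13-character forms.
function normalizeIsbn(raw) {
  const compact = raw.replace(/[\s-]/g, '').toUpperCase();
  return /^(\d{9}[\dX]|\d{13})$/.test(compact) ? compact : null;
}

console.log(normalizeUrl('https://www.Example.com/Path/')); // example.com/Path
console.log(normalizeDoi('https://doi.org/10.1000/XYZ123')); // 10.1000/xyz123
console.log(normalizeIsbn('978-3-16-148410-0')); // 9783161484100
```

This sketch does not treat an ISBN-10 and its ISBN-13 equivalent as the same book, and it does not verify check digits; both would need separate handling.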
2. Duplicate Detection
- Store normalized references in a hash map
- Normalize each incoming reference before comparison
- Perform an average-case O(1) lookup to detect duplicates efficiently
This avoids rescanning every existing reference on each insertion and keeps detection fast even for large articles.
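A minimal sketch of the lookup structure described above, assuming references have already been normalized into string keys. The class name ReferenceIndex, the method addOrFindDuplicate, and the "type:value" key format are hypothetical choices for illustration, not part of the Visual Editor API.

```javascript
// Hypothetical index: maps a normalized key to the id of the first
// reference that used it. Keys are assumed to be pre-normalized strings
// prefixed with their type (e.g. "url:", "doi:") so that a URL and an
// ISBN can never collide.
class ReferenceIndex {
  constructor() {
    this.byKey = new Map();
  }

  // Returns the id of the existing reference when the key is already known.
  // Otherwise registers the new reference and returns null.
  // A null key (normalization failed) is never treated as a duplicate.
  addOrFindDuplicate(key, refId) {
    if (key === null) {
      return null;
    }
    const existing = this.byKey.get(key);
    if (existing !== undefined) {
      return existing;
    }
    this.byKey.set(key, refId);
    return null;
  }
}

const index = new ReferenceIndex();
console.log(index.addOrFindDuplicate('url:example.com/article', 'ref-1')); // null
console.log(index.addOrFindDuplicate('url:example.com/article', 'ref-2')); // ref-1
```

Returning the existing reference id, rather than a boolean, gives the UI what it needs to suggest reusing that specific reference.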
3. Real-Time Integration
- Trigger detection during reference insertion
- Use debounced / asynchronous execution
- Ensure non-blocking behavior in the Visual Editor
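The debounced execution mentioned above could look roughly like this. It is a generic sketch using only standard timers; the function name debounce, the 300 ms delay, and the runDuplicateCheck callback are assumptions for illustration. If the Visual Editor codebase already provides a debounce utility, that would be the natural choice instead.

```javascript
// Generic debounce: delays calling fn until waitMs milliseconds have
// passed without another call, so rapid input triggers only one check.
function debounce(fn, waitMs) {
  let timer = null;
  return function (...args) {
    clearTimeout(timer);
    timer = setTimeout(() => fn.apply(this, args), waitMs);
  };
}

// Hypothetical usage: runDuplicateCheck stands in for the real detection step.
function runDuplicateCheck(value) {
  console.log('checking for duplicates of: ' + value);
}

const debouncedCheck = debounce(runDuplicateCheck, 300);

// Three rapid updates result in a single check, for the last value only.
debouncedCheck('https://example.com/a');
debouncedCheck('https://example.com/ab');
debouncedCheck('https://example.com/abc');
```

Because the check runs from a timer callback, it never executes inside the input event handler itself, which keeps typing and pasting responsive.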
4. User Experience (UX)
- Show non-intrusive alerts when duplicates are detected
- Suggest reusing existing references
- Keep interaction aligned with Visual Editor UI patterns
5. Edge Cases & Performance
- Handle malformed or incomplete references
- Reduce false positives for similar references
- Cache normalized values to improve performance
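One possible way to combine caching with defensive handling of malformed input is a small memoizing wrapper around any normalizer. The name memoizeNormalizer is hypothetical, and treating empty, non-string, or throwing inputs as "no key" (null) is an assumption about the desired behavior rather than a settled design.

```javascript
// Wraps a normalizer so that each distinct raw string is normalized once.
// Malformed input (non-string, empty, or input that makes the normalizer
// throw) yields null, meaning "cannot be compared", never a duplicate.
function memoizeNormalizer(normalize) {
  const cache = new Map();
  return function (raw) {
    if (typeof raw !== 'string' || raw.trim() === '') {
      return null;
    }
    if (cache.has(raw)) {
      return cache.get(raw);
    }
    let result;
    try {
      result = normalize(raw);
    } catch (err) {
      result = null;
    }
    cache.set(raw, result);
    return result;
  };
}

// Usage with a trivial stand-in normalizer.
const cachedLower = memoizeNormalizer((s) => s.trim().toLowerCase());
console.log(cachedLower('  Example.COM ')); // example.com
console.log(cachedLower('')); // null
```

In a long editing session the cache would need a size limit or would need to be cleared when the document closes; that is left out of this sketch.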
Prototype / Prior Work
I built a working JavaScript prototype implementing the normalization and detection logic:
- Live demo: https://ankita-kuntal.github.io/duplicate-reference-detector/
- Code: https://github.com/Ankita-kuntal/duplicate-reference-detector
The prototype supports URLs, DOIs, and ISBNs with O(1) Map-based lookup.
Building it taught me that normalization is harder than detection: DOI formats alone have at least 4 common variations for the same identifier.
Timeline
The Outreachy internship runs from May 18, 2026 to August 17, 2026 (13 weeks).
I will be working 40 hours/week and sharing weekly updates with mentors.
Community Bonding (May 5 – May 17)
- Set up MediaWiki + Visual Editor development environment
- Explore reference workflows (creation, reuse, storage)
- Analyze how references are stored and accessed internally
- Identify integration points in Visual Editor
- Finalize implementation plan with mentors
Week 1 (May 18 – May 24): Codebase Understanding
- Trace full reference lifecycle (insert → store → reuse)
- Study how references are represented internally
- Identify edge cases in citation handling
Week 2 (May 25 – May 31): Data Preparation
- Build utilities for:
  - Reference extraction
  - Parsing URLs, DOIs, ISBNs
- Create structured representation for processing
Week 3 (June 1 – June 7): URL Normalization
- Implement normalization for:
  - Protocol differences (http / https)
  - Domain prefixes (www)
  - Trailing slashes
- Validate using real-world examples
Week 4 (June 8 – June 14): DOI & ISBN Normalization
- Normalize:
  - DOIs → extract canonical identifier
  - ISBNs → remove formatting differences
- Add unit tests for correctness
Week 5 (June 15 – June 21): Detection Engine (Core)
- Implement hash-map based storage
- Enable O(1) lookup for duplicate detection
- Test across multiple reference formats
Week 6 (June 22 – June 28): Detection Refinement
- Improve matching accuracy
- Handle variations in formatting
- Reduce false positives
Week 7 (June 29 – July 5): Integration (Phase 1)
- Integrate detection into reference insertion workflow
- Trigger checks when a new reference is added
Week 8 (July 6 – July 12): Integration (Phase 2)
- Implement debounced / async execution
- Ensure non-blocking UI behavior
Week 9 (July 13 – July 19): User Feedback
- Add non-intrusive duplicate alerts
- Suggest reuse of existing references
Week 10 (July 20 – July 26): UX Refinement
- Improve interaction flow
- Align with Visual Editor UI patterns
Week 11 (July 27 – August 2): Edge Cases
- Handle malformed or incomplete references
- Improve robustness
Week 12 (August 3 – August 9): Performance Optimization
- Cache normalized references
- Optimize for large articles
- Benchmark detection performance
Week 13 (August 10 – August 17): Finalization
- Testing, bug fixes, and cleanup
- Documentation and final submission
Why I am a strong fit for this project
I approach this problem with a clear focus on how it behaves in a real system, not just how it works in isolation. Detecting duplicates is not only about matching values, but about handling inconsistent data, avoiding false positives, and ensuring the solution works reliably in a real-time editor.
Through my work, I have consistently dealt with inconsistent inputs, asynchronous workflows, and real-time processing, which are central to building this feature correctly without impacting performance or user experience.
I focus on building solutions that are accurate, efficient, and practical to integrate, and I am comfortable iterating based on feedback to refine both correctness and usability.
I am confident in my ability to take this from a working idea to a well-integrated feature in the Visual Editor.