Profile
Name: Anjali Dwivedi
Phabricator Username: Adwivedii
GitHub: https://github.com/Anjali286
Location: Noida, Uttar Pradesh, India
Timezone: UTC+5:30 (Asia/Kolkata)
Ideal working hours: 12 PM - 8 PM UTC+5:30 (approximately 40 hours per week)
Wishlist I am applying for
I am applying to implement Wishlist #3 from the Lusophone Technological Wishlist, which aims to automatically check for duplicate references.
I made a deliberate choice to work on this wishlist because my skills, microtask experience, and the research I have done during the contribution phase strongly align with it. The solution to duplicate reference detection is centered around identifier normalization, detection layer development within VisualEditor, and user-focused UX design.
I want to build something that genuinely helps editors: not just a proof of concept, but a feature that handles real-world edge cases and works reliably on actual Portuguese Wikipedia articles.
Timeline
Week 1 (May 18 - May 24)
Task:
Set up a local MediaWiki development environment with the VisualEditor and Cite extensions running.
Read and trace the full Cite dialog flow.
Understand how the Cite extension integrates with VE: study the DataModel, how references are stored during an editing session, and how the Re-use tab already reads existing references.
Goal: Know the codebase well enough to identify exactly where each part of the feature will live
Week 2 (May 25 - May 31)
Task:
Study Citoid's response format by testing the live API, and map the exact field names used for DOI, ISBN, and URL to the corresponding citation template fields.
Study Portuguese Wikipedia’s MediaWiki:Citoid-template-type-map.json to understand how field names differ from English Wikipedia.
Search Phabricator for prior discussions on duplicate detection.
Goal: Fully understand the data flow through the pipeline: what comes in, what it looks like, and where it goes
Week 3 (June 1 - June 7)
Task:
Write normalization functions for URL, DOI, and ISBN separately.
Write unit tests for all edge cases for each function.
Goal: Build three reliable, well-tested normalization functions that serve as the foundation for the entire system.
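A minimal sketch of what these three functions might look like is shown here. The function names, the list of tracking parameters, and the decision to ignore the URL scheme and fragment are all illustrative assumptions, not settled design; the real rules would come out of the Week 2 research and mentor feedback.

```javascript
// Hypothetical helpers: names and rules are assumptions for illustration.

// Lowercase a DOI and strip common resolver or "doi:" prefixes.
function normalizeDoi(raw) {
  const doi = raw.trim().toLowerCase()
    .replace(/^https?:\/\/(dx\.)?doi\.org\//, '')
    .replace(/^doi:\s*/, '');
  return doi.startsWith('10.') ? doi : null;
}

// Strip hyphens and spaces, and convert ISBN-10 to ISBN-13 so both
// forms of the same book compare equal.
function normalizeIsbn(raw) {
  const s = raw.replace(/[\s-]/g, '').toUpperCase();
  if (/^\d{13}$/.test(s)) return s;
  if (!/^\d{9}[\dX]$/.test(s)) return null;
  const core = '978' + s.slice(0, 9);
  let sum = 0;
  for (let i = 0; i < 12; i++) {
    sum += Number(core[i]) * (i % 2 === 0 ? 1 : 3);
  }
  return core + String((10 - (sum % 10)) % 10);
}

// Reduce a URL to host + path + meaningful query parameters.
// Assumption: scheme, "www.", fragment, trailing slashes and
// tracking parameters do not identify the source.
function normalizeUrl(raw) {
  let url;
  try {
    url = new URL(raw.trim());
  } catch (e) {
    return null; // not a parseable absolute URL
  }
  const host = url.hostname.toLowerCase().replace(/^www\./, '');
  const path = url.pathname.replace(/\/+$/, '');
  const query = [...url.searchParams.entries()]
    .filter(([key]) => !/^(utm_\w+|fbclid|gclid)$/i.test(key))
    .sort(([a], [b]) => a.localeCompare(b))
    .map(([key, value]) => `${key}=${value}`)
    .join('&');
  return host + path + (query ? '?' + query : '');
}
```

Under these assumptions, `normalizeIsbn('0-306-40615-2')` and `normalizeIsbn('978-0-306-40615-7')` produce the same key, which is the kind of edge case the unit tests would pin down.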
Week 4 (June 8 - June 14)
Task:
Make improvements suggested by the mentor based on Feedback #1
Build a reference extraction function that reads all existing references from the article, extracts their identifiers, normalizes them, and stores them in a map with identifier as key and array of reference numbers as value.
Goal: Address mentor feedback on the normalization functions and correctly read what is already in the article
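The extraction step could be sketched roughly as follows. The shape of the reference objects (`number`, `doi`, `isbn`, `url`) is a hypothetical simplification of whatever the VE DataModel actually exposes, and the default `normalize` callback is only a placeholder for the real normalization functions.

```javascript
// Hypothetical sketch: build a map from normalized identifier to the
// list of reference numbers that use it.
// `refs` is assumed to look like [{ number: 1, doi: '...', url: '...' }].
function buildIdentifierIndex(
  refs,
  normalize = (type, value) => value.trim().toLowerCase() // placeholder
) {
  const index = new Map();
  for (const ref of refs) {
    for (const type of ['doi', 'isbn', 'url']) {
      if (!ref[type]) continue;
      const normalized = normalize(type, ref[type]);
      if (!normalized) continue;
      // Prefix the key with its type so a DOI never collides with a URL.
      const key = `${type}:${normalized}`;
      if (!index.has(key)) index.set(key, []);
      index.get(key).push(ref.number);
    }
  }
  return index;
}

// Entries with more than one reference number are duplicates that
// already exist in the article.
function findExistingDuplicates(index) {
  return [...index.entries()].filter(([, numbers]) => numbers.length > 1);
}
```

Storing an array per key, rather than a single reference number, is what lets the same structure serve both lookups for a newly added reference and detection of duplicates already present.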
Week 5 (June 15 - June 21)
Task:
Test the reference extraction function against real article data.
Fix any field name mapping issues found during testing.
Ensure the map correctly stores arrays of reference numbers, so any existing duplicates in the article can also be detected.
Goal: Refine the reference extraction to handle real data
Week 6 (June 22 - June 28)
Task:
Integrate the duplicate detection into the Cite dialog.
Build the warning dialog using OOUI components.
Wire the one-click reuse action and test the full user flow end to end.
Goal: Integrate duplicate detection into the pipeline and verify its accuracy
Week 7 (June 29 - July 5)
Task:
Make improvements suggested by the mentor based on Feedback #2
Build a real-time autocomplete feature using an event listener on the identifier field, showing matching existing references in a dropdown.
Goal: As the editor types, matching existing references appear as suggestions.
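The matching logic behind the dropdown might look something like this sketch. The `identifier` field, the minimum query length, and the result limit are assumptions chosen for illustration; in the real feature this function would be called from the input's change listener and its results rendered with OOUI widgets.

```javascript
// Hypothetical sketch: return existing references whose identifier
// contains what the editor has typed so far.
// `existingRefs` is assumed to look like [{ number: 1, identifier: '...' }].
function suggestReferences(typed, existingRefs, limit = 5) {
  const query = typed.trim().toLowerCase();
  // Assumption: very short input is too noisy to be useful.
  if (query.length < 3) return [];
  return existingRefs
    .filter((ref) => ref.identifier.toLowerCase().includes(query))
    .slice(0, limit);
}
```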
Week 8 (July 6 - July 12)
Task:
Build the first layer of cross-identifier detection by extracting DOIs, ISBNs, and PMIDs from URLs using regex.
Test against real cases found in Portuguese Wikipedia articles.
Goal: Detect when identifiers embedded in URLs point to the same source as an existing reference
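As a rough illustration of this first layer, the sketch below pulls identifiers out of a URL with regular expressions. The patterns are deliberately simple assumptions (for example, the ISBN pattern only recognizes an `isbn` marker followed by the digits) and would need tuning against the real cases found on Portuguese Wikipedia.

```javascript
// Hypothetical sketch: extract a DOI, PMID, or ISBN embedded in a URL.
function extractIdentifiersFromUrl(rawUrl) {
  const found = {};
  let decoded;
  try {
    decoded = decodeURIComponent(rawUrl); // e.g. "%2F" inside a DOI
  } catch (e) {
    decoded = rawUrl; // malformed escape sequence: use the raw string
  }

  // DOIs start with "10.", a registrant code, a slash, then a suffix.
  const doi = decoded.match(/\b10\.\d{4,9}\/[^\s?#"<>]+/);
  if (doi) found.doi = doi[0].toLowerCase();

  // PubMed article URLs, in the current and the older URL layout.
  const pmid =
    decoded.match(/pubmed\.ncbi\.nlm\.nih\.gov\/(\d+)/i) ||
    decoded.match(/ncbi\.nlm\.nih\.gov\/pubmed\/(\d+)/i);
  if (pmid) found.pmid = pmid[1];

  // Assumption: the ISBN appears right after an "isbn" marker.
  const isbn = decoded.match(/isbn[=:\/]?((?:97[89])?\d{9}[\dXx])\b/i);
  if (isbn) found.isbn = isbn[1].toUpperCase();

  return found;
}
```

The extracted values can then be passed through the same normalization and lookup as identifiers entered directly, so a reference cited by URL can be matched with one cited by DOI.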
Week 9 (July 13 - July 19)
Task:
Build the second layer of cross-identifier detection by adding asynchronous background calls to the CrossRef API to resolve identifiers to their canonical records.
Implement graceful degradation: if an API call fails, times out, or returns an error, the system falls back to identifier-only matching without affecting the editing flow.
Integrate this into the duplicate detection pipeline as an additional matching layer.
Goal: Improve duplicate detection by resolving identifiers to a canonical record and identifying the same source across different identifiers.
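The graceful degradation described above could be sketched as follows. The function name, the 3-second timeout, and the injectable `fetchImpl` parameter are illustrative assumptions; the endpoint `https://api.crossref.org/works/{doi}` and the `message` wrapper in its response should be confirmed against the CrossRef REST API documentation before relying on them.

```javascript
// Hypothetical sketch: resolve a DOI to its CrossRef record, returning
// null on any failure so the caller can fall back to identifier-only
// matching instead of blocking the editor.
async function resolveCanonicalRecord(
  doi,
  { fetchImpl = fetch, timeoutMs = 3000 } = {}
) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetchImpl(
      'https://api.crossref.org/works/' + encodeURIComponent(doi),
      { signal: controller.signal }
    );
    if (!response.ok) return null; // HTTP error, e.g. unknown DOI
    const body = await response.json();
    return body && body.message ? body.message : null;
  } catch (e) {
    return null; // network failure, timeout (abort), or invalid JSON
  } finally {
    clearTimeout(timer);
  }
}
```

Because every failure path collapses to `null`, the detection pipeline only needs a single check: if no record comes back, it continues with the results of the earlier layers.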
Week 10 (July 20 - July 26)
Task:
Build a fallback for references without identifiers using title and author matching.
Implement two condition checks: title token overlap above a threshold, and at least one matching author last name.
Integrate this into the duplicate detection pipeline as an additional fallback layer.
Add a targeted message in the warning dialog for this case. Test the reuse flow end-to-end.
Goal: Provide a reliable fallback to detect likely duplicates for references without identifiers using title and author similarity.
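One possible shape for the two-condition check is sketched below. The overlap measure (Jaccard similarity over title tokens), the 0.8 threshold, the minimum token length, and the `title`/`authors` fields are all assumptions for illustration; the actual threshold would be chosen by testing on real references.

```javascript
// Hypothetical sketch of the title + author fallback.

// Lowercase and strip diacritics so "História" and "historia" match.
function fold(text) {
  return text.toLowerCase().normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Split a title into a set of tokens, dropping very short words.
function tokenize(title) {
  return new Set(fold(title).split(/[^a-z0-9]+/).filter((t) => t.length > 2));
}

// Jaccard similarity: shared tokens divided by all distinct tokens.
function titleOverlap(a, b) {
  const ta = tokenize(a);
  const tb = tokenize(b);
  if (ta.size === 0 || tb.size === 0) return 0;
  let shared = 0;
  for (const token of ta) if (tb.has(token)) shared++;
  return shared / (ta.size + tb.size - shared);
}

// Accept both "Silva, João" and "João Silva".
function lastName(author) {
  const name = author.includes(',')
    ? author.split(',')[0]
    : author.trim().split(/\s+/).pop();
  return fold(name.trim());
}

// Both conditions must hold: similar title AND a shared author last name.
function isLikelyDuplicate(refA, refB, threshold = 0.8) {
  const titleMatch = titleOverlap(refA.title, refB.title) >= threshold;
  const namesB = new Set(refB.authors.map(lastName));
  const authorMatch = refA.authors.some((a) => namesB.has(lastName(a)));
  return titleMatch && authorMatch;
}
```

Requiring both conditions keeps the fallback conservative: two different books by the same author, or two similarly titled works by different authors, would not trigger the warning.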
Week 11 (July 27 - August 2)
Task:
Make improvements suggested by the mentor based on Feedback #3
Deliberately test with hard cases: same domain but different articles, a DOI in the URL versus in the DOI field, ISBN-10 versus ISBN-13, different references pointing to the same source, and references without identifiers.
Goal: Ensure the feature is honest and reliable: it catches what it should, stays quiet when it should, and never misleads the editor.
Week 12 (August 3 - August 9)
Task:
Write a software documentation update for the automatic duplicate detection.
Write a user-centric summary of what the feature covers and what its known limitations are.
Submit patch for mentor review.
Goal: Practice clean code principles and write clear documentation
Week 13 (August 10 - August 17)
Task:
Perform final testing to ensure stability and reliability of each detection layer
Final submission.
Goal: Submit a working and reviewed patch that genuinely improves the editing experience for Portuguese Wikipedia editors and beyond.
Impact
- This feature was requested by the Portuguese community because editors, both newcomers and experienced ones, often create duplicate references when working on long articles without realizing it. By giving a helpful prompt at the moment of adding a reference, it prevents mistakes before they happen, instead of requiring others to fix them later.
- Although this was requested by the Portuguese community, the feature is built into VisualEditor, which is used across all Wikipedias. Once implemented, it will benefit editors on every Wikipedia. It will also resolve wish #192 from the global community.
- The normalization functions I build for URLs, DOIs, and ISBNs will be clean, well-tested, and reusable. They can support future features like sub-referencing (which Wikimedia Deutschland is already developing), reference quality checks, and other tools that involve comparing identifiers.
Why Me
- My Task 1 work was in JavaScript and HTML, which made me comfortable working in the kind of environment VE uses. I have spent the past few days studying the VisualEditor documentation and understanding how Citoid works, what the DataModel is, and how to query it.
- During my Task 2 microtask, I built normalization logic for URLs, including handling edge cases, and carefully considered how to reduce false positives by preserving the components that define the source and discarding the meaningless ones. The core problem is structurally the same as Wishlist #3, just applied in a different language and environment, so the underlying approach and reasoning transfer directly.
- I have gone through Wishlist #3 thoroughly and I understand who the users are, what they actually need, and what a good solution looks and feels like to them.
I don’t want to build just a demo, but a live feature that gives editors real relief. I am committed to solving their problem within VisualEditor itself and providing the best solution I can.
Post-Internship Plans
After the internship, my main focus will be to:
- work on the limitations of the duplicate detection feature.
- work on a full-article scan so that editors can check for duplicates across the entire article in one go, rather than only detecting them as new references are added.
In the longer term, I would like to extend my solution to resolve the global community wish #192. As part of that, I would like to add an "apply to all" checkbox and a dedicated reference section.
Time Commitment
I have no major commitments during the internship period and will dedicate full-time effort (40 hours/week) to this project with regular progress updates to mentors.