=Profile
Name: Supreet Kaur
GitHub: https://github.com/Supreetkaur1
Location: Punjab, India (IST – UTC+5:30)
Timezone: UTC+5:30
Availability: 40–45 hours per week (full-time)
Working Hours: 11:30 AM – 8:30 PM IST (UTC+5:30)
I am a Software Development Engineer with ~1 year of professional experience at Amazon, where I worked on backend systems involving API design, large-scale data processing, and the reliability of user-facing pipelines. My work has involved handling structured and semi-structured data, debugging production-level issues, and ensuring consistency across distributed systems.
# Synopsis
The Lusophone Technological Wishlist is a community-driven initiative aimed at improving Wikimedia editing workflows and contributor experience for Portuguese-language contributors by prioritizing impactful technical enhancements.
This project focuses on two key areas:
a. Wishlist #3: Improving the citation workflow in VisualEditor by detecting duplicate references and encouraging reuse
b. Wishlist #8: Extending WikiScore to include Wikidata contributions, enabling better recognition of structured data work
Together, these improvements enhance both editor efficiency and the fair evaluation of contributions in modern Wikimedia workflows. My goal is to design robust, performance-aware solutions that integrate cleanly into existing Wikimedia systems.
I plan to prioritize Wishlist #8 (Wikidata integration for WikiScore) as my primary focus and to contribute to Wishlist #3 as a secondary goal, depending on progress and mentor guidance.
Understanding of Wishlist Items
Wishlist #3: Duplicate Reference Detection in VisualEditor
Editors often unintentionally add duplicate references because the same source can be cited in different formats (URL, DOI, ISBN). Since VisualEditor does not actively compare new references against existing ones during insertion, this leads to redundant citations and reduced maintainability.
The goal is to:
a. Detect duplicate references during citation insertion
b. Normalize identifiers for accurate comparison
c. Suggest reuse of existing references in a non-intrusive way
Wishlist #8: Wikidata Integration for WikiScore (Primary Focus)
WikiScore currently focuses on Wikipedia edits and does not fully account for Wikidata contributions. This creates a gap in edit-a-thons and contests where structured data contributions are significant.
The goal is to:
a. Fetch Wikidata contributions using APIs
b. Normalize and process structured edits
c. Integrate them into WikiScore’s scoring system
Technical Approach
Wishlist #8 – Wikidata Integration for WikiScore (Primary)
System Understanding
I will begin by analyzing WikiScore’s current architecture to understand how Wikipedia contributions are fetched and scored, and to identify extension points for integrating Wikidata.
Data Fetching Layer
a. Use the MediaWiki Action API (usercontribs) and Wikidata endpoints
b. Fetch user contributions relevant to structured data edits
c. Handle pagination, rate limits, and API reliability
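As a rough illustration of the fetching layer, the sketch below pages through `list=usercontribs` on the Wikidata Action API by following the `continue` object returned with each response. The function names, the injectable `fetch` transport, the User-Agent string, and the `pause` parameter are assumptions made for this example; the real integration would live inside WikiScore's existing codebase and follow its conventions.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterator, Optional

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def http_get_json(url: str) -> dict:
    """Default transport: a single GET request decoded as JSON."""
    request = urllib.request.Request(
        url,
        # Illustrative User-Agent; a real tool should identify itself properly.
        headers={"User-Agent": "WikiScoreWikidataSketch/0.1 (illustrative)"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def iter_user_contribs(
    user: str,
    start: Optional[str] = None,
    end: Optional[str] = None,
    fetch: Callable[[str], dict] = http_get_json,
    pause: float = 0.0,
) -> Iterator[dict]:
    """Yield every contribution of `user` in the main namespace, page by page."""
    params: Dict[str, str] = {
        "action": "query",
        "list": "usercontribs",
        "ucuser": user,
        "ucnamespace": "0",  # items live in the main namespace on Wikidata
        "uclimit": "max",
        "ucprop": "ids|title|timestamp|comment|sizediff|tags",
        "ucdir": "newer",  # oldest first, so `start` precedes `end`
        "format": "json",
        "formatversion": "2",
    }
    if start:
        params["ucstart"] = start
    if end:
        params["ucend"] = end

    continuation: Dict[str, str] = {}
    while True:
        url = WIKIDATA_API + "?" + urllib.parse.urlencode({**params, **continuation})
        data = fetch(url)
        if "error" in data:
            raise RuntimeError(f"API error: {data['error'].get('code')}")
        yield from data.get("query", {}).get("usercontribs", [])
        continuation = data.get("continue") or {}
        if not continuation:
            break
        time.sleep(pause)  # simple politeness delay between pages
```

Passing the transport in as `fetch` keeps the pagination logic testable without network access and leaves room for retries or backoff to be added in one place.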
Data Processing Pipeline
a. Normalize contributions into a unified schema
b. Filter relevant actions such as:
i. Item creation
ii. Statement additions
iii. Label/description edits
iv. Reference updates
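One possible shape for this step is sketched below. It assumes that Wikidata's auto-generated edit summaries begin with a marker such as `/* wbsetclaim-create:... */`; that assumption, the `Contribution` fields, the action labels, and the module-to-action mapping are all illustrative and would need to be checked against real contribution data and mentor guidance.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Captures the API module and optional verb from an auto-summary marker,
# e.g. "/* wbsetclaim-create:2||1 */" -> ("wbsetclaim", "create").
SUMMARY_MARKER = re.compile(r"/\*\s*(wb[a-z]+)(?:-([a-z-]+))?:")


@dataclass(frozen=True)
class Contribution:
    """Unified record for one scored edit (hypothetical schema)."""
    user: str
    entity_id: str
    revision_id: int
    timestamp: str
    action: str
    size_diff: int


def classify_action(comment: str) -> Optional[str]:
    """Map an edit summary to an action label, or None if it is not scored."""
    match = SUMMARY_MARKER.search(comment or "")
    if not match:
        return None
    module, verb = match.group(1), match.group(2) or ""
    if module == "wbeditentity" and verb.startswith("create"):
        return "item_creation"
    if module in ("wbcreateclaim", "wbsetclaim") and verb == "create":
        return "statement_addition"
    if module in ("wbsetlabel", "wbsetdescription"):
        return "label_description_edit"
    if module == "wbsetreference":
        return "reference_update"
    return None


def normalize_contribution(raw: dict) -> Optional[Contribution]:
    """Convert a raw usercontribs entry into the unified schema."""
    action = classify_action(raw.get("comment", ""))
    if action is None:
        return None  # filtered out: not one of the relevant actions
    return Contribution(
        user=raw["user"],
        entity_id=raw["title"],
        revision_id=raw["revid"],
        timestamp=raw["timestamp"],
        action=action,
        size_diff=raw.get("sizediff", 0),
    )
```

Returning `None` for unrecognized summaries keeps filtering and normalization in one pass, so downstream scoring only ever sees records with a known action.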
Scoring Integration
a. Extend existing scoring logic to include Wikidata edits
b. Assign weights to different contribution types
c. Ensure consistency with existing Wikipedia scoring
Initial scoring will focus on clearly measurable actions, with weighting refined iteratively based on mentor feedback.
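To make the weighting idea concrete, here is a minimal sketch. The weight values, the action labels, and the `wikidata_factor` used to balance Wikidata points against Wikipedia points are placeholders invented for this example; the actual values would be agreed with mentors and aligned with how WikiScore already computes points.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Tuple

# Placeholder weights per action type (hypothetical values).
WIKIDATA_WEIGHTS: Dict[str, float] = {
    "item_creation": 3.0,
    "statement_addition": 1.0,
    "label_description_edit": 0.5,
    "reference_update": 1.0,
}


def score_contributions(
    contributions: Iterable[Tuple[str, str]],
    weights: Mapping[str, float] = WIKIDATA_WEIGHTS,
) -> Dict[str, float]:
    """Sum weighted points per user from (user, action) pairs."""
    totals: Dict[str, float] = defaultdict(float)
    for user, action in contributions:
        totals[user] += weights.get(action, 0.0)  # unknown actions score zero
    return dict(totals)


def combine_scores(
    wikipedia_points: Mapping[str, float],
    wikidata_points: Mapping[str, float],
    wikidata_factor: float = 1.0,
) -> Dict[str, float]:
    """Merge both sources so every participant appears exactly once."""
    users = set(wikipedia_points) | set(wikidata_points)
    return {
        user: wikipedia_points.get(user, 0.0)
        + wikidata_factor * wikidata_points.get(user, 0.0)
        for user in users
    }
```

Keeping the weights in a single table means contest organizers could adjust them per event without touching the scoring code.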
Performance Considerations
a. Batch API requests to reduce overhead
b. Introduce caching for repeated queries
c. Optimize processing for large-scale events
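A small sketch of the batching and caching ideas follows. It relies on the Action API accepting several user names in one `ucuser` value separated by `|` (the default of 50 per request reflects the usual limit for regular clients and should be confirmed against the API documentation). The helper names and the `CachedFetcher` class are hypothetical.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Split a sequence into consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def batched_ucuser_values(users: Sequence[str], batch_size: int = 50) -> List[str]:
    """Build one pipe-separated `ucuser` value per batch of participants."""
    return ["|".join(chunk) for chunk in chunked(users, batch_size)]


class CachedFetcher:
    """Wrap a transport so identical request URLs are fetched only once."""

    def __init__(self, fetch: Callable[[str], dict]) -> None:
        self._fetch = fetch
        self._cache: Dict[str, dict] = {}

    def __call__(self, url: str) -> dict:
        if url not in self._cache:
            self._cache[url] = self._fetch(url)
        return self._cache[url]
```

Caching by URL is only safe for time windows that have already closed, since contributions to a live contest keep changing; a real implementation would need an expiry or invalidation rule for ongoing events.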
Wishlist #3 – VisualEditor Duplicate Reference Detection (Secondary)
Understanding the System
While exploring VisualEditor’s architecture, I studied how citation workflows are handled (e.g., via tools like Citoid) and how references are represented in the document model. This helps identify safe integration points for duplicate detection.
Reference Extraction & Normalization
a. Extract references from the VisualEditor document model
b. Normalize identifiers:
i. URLs → standardize protocol and domain format, remove trailing slashes
ii. DOIs → strip resolver prefixes (e.g., https://doi.org/) and normalize casing
iii. ISBNs → remove hyphens and whitespace
The initial implementation will focus on same-type matching (URL-URL, DOI-DOI, ISBN-ISBN), with cross-type matching explored later.
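The normalization rules above could look roughly like the following. This is a language-neutral sketch written in Python for brevity; the VisualEditor implementation itself would be in JavaScript. Treating `http` and `https` as equivalent, dropping a leading `www.`, and ignoring ports and fragments are assumptions of this example rather than settled design decisions.

```python
import re
from urllib.parse import urlsplit, urlunsplit

DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_url(url: str) -> str:
    """Return a canonical form of a URL for duplicate comparison."""
    url = url.strip()
    if "://" not in url:
        url = "https://" + url  # assume a bare domain means a web URL
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    # Scheme is forced to https and the fragment is dropped on purpose.
    return urlunsplit(("https", host, path, parts.query, ""))


def normalize_doi(doi: str) -> str:
    """Strip resolver prefixes and lowercase (DOIs are case-insensitive)."""
    return DOI_PREFIX.sub("", doi.strip()).lower()


def normalize_isbn(isbn: str) -> str:
    """Remove hyphens and whitespace; uppercase a trailing check character."""
    return re.sub(r"[\s-]", "", isbn).upper()
```

For example, `normalize_url("http://www.Example.com/article/")` and `normalize_url("https://example.com/article")` produce the same string, so the two citations would be treated as the same source.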
Duplicate Detection Engine
a. Maintain an in-memory index of normalized identifiers
b. Use hash-based lookup for efficient comparison
c. Detect duplicates in real time during citation insertion
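A minimal sketch of such an index is shown below, again in Python for illustration only. It keys entries on an (identifier type, normalized value) pair, which gives same-type matching by construction. The class name, method names, and the use of a reference name as the stored value are hypothetical.

```python
from typing import Dict, Optional, Tuple


class ReferenceIndex:
    """In-memory index from normalized identifiers to existing references."""

    def __init__(self) -> None:
        # (kind, normalized value) -> name of the first reference that used it
        self._by_identifier: Dict[Tuple[str, str], str] = {}

    def add(self, ref_name: str, kind: str, normalized_value: str) -> None:
        """Register an existing reference; the earliest entry is kept."""
        self._by_identifier.setdefault((kind, normalized_value), ref_name)

    def find_duplicate(self, kind: str, normalized_value: str) -> Optional[str]:
        """Return the existing reference name, or None if the source is new."""
        return self._by_identifier.get((kind, normalized_value))


# Usage: build the index once from the article, then check each new citation.
index = ReferenceIndex()
index.add("ref-1", "doi", "10.1000/xyz123")
index.add("ref-2", "url", "https://example.com/article")
existing = index.find_duplicate("doi", "10.1000/xyz123")  # "ref-1"
```

Because each lookup is a single dictionary access, the cost per inserted citation stays roughly constant even in articles with many references; only the initial index build scales with article size.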
Integration & UX
a. Integrate into the VisualEditor reference dialog (e.g., citation insertion flow), running detection before the reference is created and without blocking the user
b. Provide non-intrusive feedback when duplicates are detected, highlighting the existing reference
c. Suggest reuse of existing references (e.g., a “reuse existing reference” action) instead of creating new ones
Key Challenges
a. Handling incomplete or inconsistent identifiers
b. Maintaining performance in large articles
c. Ensuring seamless integration without disrupting user workflows
d. Reconciling structural differences between Wikipedia and Wikidata edits
e. Working within API limitations and rate constraints
f. Keeping scoring consistent across platforms
Timeline
The internship runs from May 18, 2026 to August 17, 2026 (13 weeks). I will work full-time (~40 hours/week) and iterate based on mentor feedback.
MVP Goal (Primary – Wishlist #8)
Deliver a working integration that:
a. Fetches Wikidata contributions
b. Processes them into a structured format
c. Integrates them into WikiScore with a basic scoring system
Phase 1: Wikidata Integration for WikiScore (Weeks 1–8)
Week 1–2: Setup & System Understanding
a. Set up development environment
b. Study WikiScore architecture and scoring flow
c. Explore Wikidata APIs and contribution formats
d. Validate approach with mentors
Week 3–4: Data Fetching Layer
a. Implement API integration for Wikidata contributions
b. Handle pagination, rate limits, and failures
c. Begin initial data normalization
Week 5–6: Processing & Scoring Integration
a. Normalize contribution data into a unified schema
b. Implement scoring logic for key actions
c. Integrate into WikiScore pipeline
Week 7: Testing & Iteration
a. Test with real-world datasets (edit-a-thons, multiple users)
b. Handle edge cases (duplicates, reverted edits)
c. Refine scoring logic
Week 8: Stabilization & Documentation
a. Optimize performance
b. Clean up code and improve maintainability
c. Document architecture and usage
Phase 2: VisualEditor Duplicate Detection (Weeks 9–13)
Week 9: Exploration & Design
a. Study VisualEditor reference handling
b. Identify integration points (e.g., reference dialog)
Week 10–11: Core Implementation
a. Implement normalization and duplicate detection logic
b. Build in-memory indexing for references
Week 12: Integration & UX
a. Integrate into citation workflow
b. Add non-intrusive duplicate suggestions
Week 13: Testing & Finalization
a. Test with complex articles
b. Optimize performance
c. Final documentation and submission
Why I Am a Good Fit
My experience aligns closely with the challenges of this project.
At Amazon (Prime Video), I worked on backend systems for subtitle processing, where I handled duplication issues, inconsistent formatting, and data integrity across large-scale text datasets. This relates directly to Wishlist #3, where similar challenges arise in detecting and managing duplicate references.
My work also involved processing structured metadata and building API-driven workflows, which matches Wishlist #8’s requirement to fetch, process, and score structured Wikidata contributions.
While my primary experience is in backend systems, I have collaborated closely with frontend teams, which has given me a strong understanding of how backend logic affects user-facing workflows. This is an important consideration when working with tools like VisualEditor.
I aim to take a practical, iterative approach: delivering a correct and maintainable MVP first, then improving it based on feedback. I am particularly motivated by systems that improve collaboration and data quality at scale, which aligns strongly with Wikimedia’s mission.
Post-Internship Contribution
After the internship, I plan to:
a. Continue maintaining and improving implemented features
b. Support new contributors in understanding the codebase
c. Contribute further to Wikimedia tools related to structured data and editor experience
d. Stay actively involved in the Wikimedia technical community