1. Profile Summary
Name: Bincy Benny
Username: Bincyben / inc
Github: https://github.com/SaNitYgRiM
Location: Kerala, India
Timezone: UTC+5:30 (IST)
Ideal Working Hours: 3pm-12am IST (40+hrs/week)
Can adjust to mentors' time zones
2. Synopsis
This proposal aims to implement items from the Lusophone technological wishlist, specifically wishlist#3 (implement a check in the Visual Editor for duplicate references using the reference identifier (ISBN, DOI or URL) and let the user reuse the already used reference) and/or wishlist#8 (implement Wikidata support for Wikimedia Brasil's scoring tool, wikiscore, allowing the community to run edit-a-thons and edit contests using Wikidata). Both help the community of editors, readers and researchers of the Wikimedia projects in Portuguese.
I would like to implement both wishlists, #3 and #8, if time permits, to the best of my capabilities and with help from my mentors.
3. Observations, Understandings From What I've Read So Far
While exploring wishlist#3 and wishlist#8 further, I noticed that the contribution tasks given to us resembled the wishlists' objectives. There were quite a few articles to read, which I first had to translate to English, as well as sub-wishes under wishlist#3 such as #W17 and #W192. I jotted down on my notepad whatever points felt relevant to understanding the core of the wishlists.
Wishlist #3- duplicate reference
The problem lies within the Visual Editor (the visual, interactive editing tool), since it is fairly new and experimental. The Source Editor (which uses wikicode/wikitext markup) works fine but isn't as user-friendly or visually interactive as the Visual Editor. The internship task is to implement a check in the Visual Editor for duplicate references using ISBN, DOI or URL, and to let the user reuse existing references.
#W192->
- Duplicate references are found both in the JSON storage and in the UI.
- Fixing this would benefit both editors and data reusers who work with Wikidata, and would help reduce redundancy.
#W17->
- Clashes exist between the Source Editor and the Visual Editor.
- Citations in the Visual Editor currently get a number as their default name, which poses a problem in the Source Editor.
- Copy-pasting to a different article causes duplicates.
- When reading in the Source Editor, the editor can't tell which source is used.
- Sub-referencing faces the same problems.
- The Visual Editor doesn't let editors use a main reference with sub-references the way the Source Editor does with its 'details' tab, so editors have to use separate references even though they fall under the same main reference.
- So editors currently either use the Source Editor or use separate references for sub-references.
- Renaming of references should be resolved.
- Solutions for references should be applied to sub-references as well.
- The RefRenamer script was mentioned as a possible, good way to rename references.
Additionally, I tried editing Wikipedia articles in both visual and source mode. One thing I noticed: if I add two separate references to two separate words, then change the reference of one word to match the other (without using the reuse option), and then try to change it again to a different reference, the change is applied to the untouched word's reference as well! From then on, any change I make is automatically applied to both of them, even if I don't want that. Another issue to worry about!
Wishlist #8- Scoring Tool
- Implement Wikidata support for Wikimedia Brasil's wikiscore for edit-a-thons, edit contests, etc.
- Currently, scores for edits and contributions in contests that use Wikidata are calculated manually. Automatic tool support would let editors see their scores, which helps encourage them, and would reduce the judges' workload from manual scoring.
- The wishlist mentions a separate tool altogether to automate this scoring process.
4. Mentors
5. Contributions
Github repo: repo-link
5.1. Task 1
T418285
A JS script that manipulates JSON and formats it into a human-readable form. Used toLocaleDateString to format dates.
Github: Task-1-link
5.2. Task 2
T418286
A Python script that fetches and prints the status codes of the URLs in the given CSV file. Implemented HEAD requests, exception handling, etc.
Github: Task-2-link
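The Task 2 approach can be sketched roughly as follows. This is an illustrative sketch using only the standard library, not the submitted script: the column name `url`, the function names `head_status` and `check_urls`, and the 10-second timeout are all assumptions.

```python
import csv
import io
import urllib.error
import urllib.request


def head_status(url, timeout=10):
    """Return the HTTP status code of a HEAD request, or an error label."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise, but they still carry a status code.
        return err.code
    except (urllib.error.URLError, ValueError, OSError) as err:
        # Malformed URLs, DNS failures, timeouts and similar problems.
        return f"error: {err}"


def check_urls(csv_text, fetch=head_status, column="url"):
    """Yield (url, status) for every non-empty URL in the CSV text.

    The column name is an assumption about the CSV layout.
    """
    for row in csv.DictReader(io.StringIO(csv_text)):
        url = (row.get(column) or "").strip()
        if url:
            yield url, fetch(url)
```

Passing the fetch function as a parameter keeps the CSV handling testable without network access; for real use, the file contents would be read and passed to `check_urls`.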
5.3. Prototype
I have built a prototype for wishlist#3 to check how it would flow.
I explored the codebase before I started building the prototype. Since the Visual Editor has a 'reuse' tab, it must have its own reference list that we could use for Wishlist #3. So I inspected it and found that it does have an internalList. I went down a rabbit hole to find it in the codebase, which I finally did, and tried to make sense of it. I still need help understanding it! I also found out that Wikimedia uses the Citoid tool for cross-referencing and matching. I had planned to set up the Visual Editor locally, since it is described as standalone, but as time was ticking I built a simple dummy website instead.
The prototype normalizes both the user-input reference and the references already in the list (which hadn't been normalised). It can also convert ISBN-10 to ISBN-13, and it uses a lookup table over the references in the internalList to handle duplicates.
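As an illustration of the ISBN-10 to ISBN-13 conversion, here is a minimal Python sketch. It is not the prototype's actual code, and the function names are hypothetical. It assumes the standard conversion rule: prefix the first nine digits with 978 and recompute the check digit with alternating weights 1 and 3. It does not validate the ISBN-10 check digit.

```python
import re


def clean_isbn(raw):
    """Strip an 'ISBN' prefix, hyphens and spaces; uppercase a trailing X."""
    text = re.sub(r"^\s*isbn(?:-1[03])?:?\s*", "", raw, flags=re.IGNORECASE)
    return re.sub(r"[\s-]", "", text).upper()


def isbn10_to_isbn13(isbn10):
    """Convert a cleaned 10-character ISBN to its 13-digit form."""
    core = "978" + isbn10[:9]  # the old ISBN-10 check digit is dropped
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)


def normalize_isbn(raw):
    """Return a canonical ISBN-13 string, or None if the input is not an ISBN."""
    isbn = clean_isbn(raw)
    if re.fullmatch(r"\d{9}[\dX]", isbn):
        return isbn10_to_isbn13(isbn)
    if re.fullmatch(r"\d{13}", isbn):
        return isbn
    return None


print(normalize_isbn("ISBN 0-306-40615-2"))  # 9780306406157
```

Storing every ISBN in its 13-digit form means an ISBN-10 typed by the user and an ISBN-13 already in the list end up under the same lookup key.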
This is essentially the flow:
- The list is processed first (one time only) and the normalized IDs are stored.
- When the user inputs a reference, it is normalized and checked against the lookup table.
- If found, a toast message "ref already in use" is shown.
- If not, the Citoid API is called for data, which is normalized and inserted into both the list and the lookup table.
- A second check runs inside callforCitoidAPI() after normalizing. While trying different inputs to find bugs, I found that if, for example, the list already holds a reference with an ISBN and other details, and a user then inputs a URL pointing to the same source, a duplicate is created. Why? Because my function only normalized the input and checked whether that value was present, when it should also check whether any of the URL's metadata matches the refs in the table. So I added a check inside the API call function that looks for the metadata before pushing a new entry to the list. If a match is found, it abandons the new entry and merges the newly added field into the existing reference (I found this by accident, but it works!). There are cases where it doesn't: when the API returns no common metadata for two references, those duplicates become difficult to track, and a source can have multiple ISBNs of the same format too. Additional issues to think about.
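The flow above can be sketched in Python as follows. This is a simplified illustration, not the prototype's code: `ReferenceIndex`, `normalize_id` and the `fetch_metadata` callable are hypothetical names, and `fetch_metadata` merely stands in for the Citoid call (it is assumed to return a list of `(kind, value)` pairs). The URL rule, which drops the scheme, query string and trailing slash, is also an assumption and can conflate distinct pages. ISBN-10 to ISBN-13 conversion is left out for brevity.

```python
import re
from urllib.parse import urlsplit


def normalize_id(kind, value):
    """Reduce an identifier to a canonical key such as 'doi:10.1000/xyz'."""
    value = value.strip()
    if kind == "doi":
        value = re.sub(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", "",
                       value, flags=re.IGNORECASE).lower()
    elif kind == "isbn":
        value = re.sub(r"[^0-9Xx]", "", value).upper()
    elif kind == "url":
        parts = urlsplit(value)
        value = parts.netloc.lower() + parts.path.rstrip("/")
    return f"{kind}:{value}"


class ReferenceIndex:
    """Reference list plus a lookup table from normalized key to list index."""

    def __init__(self):
        self.refs = []    # each reference is a set of normalized keys
        self.lookup = {}  # normalized key -> index into self.refs

    def add(self, kind, value, fetch_metadata):
        key = normalize_id(kind, value)
        if key in self.lookup:  # first check: the input itself is known
            return "duplicate", self.refs[self.lookup[key]]
        # Hypothetical metadata lookup standing in for the Citoid call.
        keys = {key} | {normalize_id(k, v) for k, v in fetch_metadata(kind, value)}
        for known in keys:      # second check: any fetched identifier is known
            if known in self.lookup:
                index = self.lookup[known]
                self.refs[index] |= keys  # merge new fields into the old ref
                for new_key in keys:
                    self.lookup[new_key] = index
                return "merged", self.refs[index]
        self.refs.append(keys)
        for new_key in keys:
            self.lookup[new_key] = len(self.refs) - 1
        return "new", keys
```

The second check is what catches a URL that points to a source already stored under its DOI or ISBN. When the fetched metadata shares no identifier with the stored reference, the duplicate still slips through, as noted above.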
Possible Modifications Needed:
- Check whether the input reference is a DOI, ISBN, etc. before calling the normalizing functions.
- Add autofill or autocompletion when a match is found.
- Possibly use better storage methods.
- Apply this to both the 'automatic' and 'manual' tabs in the Visual Editor.
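For the first modification, detecting the identifier type before normalizing, a rough Python sketch could look like this. The patterns are simplified assumptions rather than complete validators, and `classify_identifier` is a hypothetical name.

```python
import re

# Simplified patterns; real-world DOIs and ISBNs have more edge cases.
DOI_PATTERN = re.compile(
    r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)?(10\.\d{4,9}/\S+)$", re.IGNORECASE
)
ISBN_PATTERN = re.compile(
    r"^(?:isbn(?:-1[03])?:?\s*)?([\d\s-]{9,16}[\dXx])$", re.IGNORECASE
)
URL_PATTERN = re.compile(r"^https?://\S+$", re.IGNORECASE)


def classify_identifier(text):
    """Return 'doi', 'isbn', 'url' or 'unknown' for a user-supplied string."""
    text = text.strip()
    # DOI is tested before URL so that https://doi.org/... links count as DOIs.
    if DOI_PATTERN.match(text):
        return "doi"
    match = ISBN_PATTERN.match(text)
    if match and len(re.sub(r"[\s-]", "", match.group(1))) in (10, 13):
        return "isbn"
    if URL_PATTERN.match(text):
        return "url"
    return "unknown"
```

With the type known up front, only the matching normalizing function needs to run, and an 'unknown' result could skip normalization and go straight to the metadata lookup.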
Github: Prototype-link
Live: RefNormaliser&DuplicateChecker
(When trying it out: the list contains a single dummy entry for an article, so don't worry when it is displayed along with the results of the input. The prototype displays the title and year of the used references.)
6. Proposed Internship Timeline
Weekly reports and discussions with mentors will take place, with feedback and modifications applied to the work done during the week and made ready for review the next week or earlier. Blog posts will be written biweekly. (The proposed timeline is subject to change after proper discussion with the mentors during the first week.)
| Time Period | Project Milestones | Outreachy - Blog Posts |
|---|---|---|
| Week 1 (May 18 - May 24) | Introductions and discussions with mentors about the different approaches (using the existing internalList that the reuse tab relies on in the Visual Editor, a lookup table for references, the Citoid API for cross-matching, etc.) and their tradeoffs. Clearing doubts about potential additional fixes highlighted in community discussions under the wishlists (naming issues, sub-references, collisions with instances in wishlist#3, etc.). Finding out more about wishlist#8 and arriving at the expected workflow with the mentors' guidance while going through the codebase properly. Week 1 will be dedicated to understanding what is expected, what needs to be done, the outcomes, and the core functionalities to cover. | Blog Post 1: Introduction to Outreachy Internship. |
| Week 2 (May 25 - May 31) | Starting with wishlist#3: normalisation/conversion of the different identifiers to canonical forms using regex, plus other needed checks. Implement ISBN-10 to ISBN-13 conversion, handle multiple ISBNs of the same format (a book can have several ISBN-13s or similar), and remove prefixes, dashes, trailing slashes and query parameters from URLs, DOIs, etc. Week 2 will be focussed on normalisation. | |
| Week 3 (June 1 - June 7) | Implement functions that read the internalList holding all references in the current article, pass the references already in use through the normalisation functions, and store them in sanitised form (needs to be done just once). Call the Citoid API if the data in the list isn't enough, create a lookup table for fast and clean retrieval of references, and merge all duplicates already in the list. Week 3 will be dedicated to conversion and storage of references. | Blog Post 2: Struggles while choosing projects and contributions in Outreachy Internship - A Not So Helpful Guide |
| Week 4 (June 8 - June 14) | User-focussed: build the functionality that normalises the user-input reference and searches the lookup table for duplicates. If none is found, send an API request to Citoid to get the required data for the identifiers and store it in the list after normalisation. Handle edge cases, including inputs that appear to be new references but turn out to have matching IDs in the lookup table once the API data arrives, and add additional checks. Week 4 is for handling the user-input reference. | |
| Week 5 (June 15 - June 21) | Handle duplicate references entered by the user with an alert or toast message, and let them reuse the existing reference after confirmation from the user's side, without hurting the user experience. Week 5 will focus on handling duplicates and UX. | Blog Post 3: First-time Contributing to Open Source |
| Week 6 (June 22 - June 28) | Testing the functionalities built so far, both separately and as a whole unit, testing them against a larger dataset of references, and making possible optimisations. Week 6 will be dedicated to testing and performance. | |
| Week 7 (June 29 - July 5) | Structure these functionalities so they can be reused in both the 'automatic' tab and the 'manual' tab, where duplicates are prone to happen. Week 7 will focus on modularity and reusability of the code. | Mid-point project progress blog post |
| Week 8 (July 6 - July 12) | Testing and any further optimisations needed, and finishing the documentation properly. Week 8, the final week for wishlist#3, will be spent on performance testing, cleanup and documentation. | |
| Week 9 (July 13 - July 19) | Starting with wishlist#8: implement a graphical interface, similar to wikiscore and built with Python, where score points for Wikidata-based contests are calculated and added up automatically and concurrently, without lag while users are in contests. Week 9 will focus on building the tool with basic functionality that calculates scores for simple contributions. | Blog Post 5: Experiencing working with real-production level codebase for the first-time! |
| Week 10 (July 20 - July 26) | Integrate functions that track all allowed contribution types during contests and add scores accurately in real time, visible to both editors and judges. Week 10 will be spent expanding the tool to cover more contribution types and real-time score tracking. | |
| Week 11 (July 27 - Aug 2) | Handle all cases (user traffic, overriding issues in real time, etc.). Edge cases in an actual contest environment are different, so the tool will need to be tested in a similar setting. Week 11 will be spent trying out the tool in a realistic simulation or environment and handling the issues encountered. | |
| Week 12 (Aug 3 - Aug 9) | Further optimisations, error handling and testing, plus proper documentation according to the project rules and standards. Week 12 will be fully dedicated to performance, rigorous testing and documentation. | |
| Week 13 (Aug 10 - Aug 16) | Finalising and checking that everything is done in a Lusophone-friendly way, and wrapping up the project! Week 13 is for fixing, checking, finalising and finishing up! YAY!! | Final project progress blog post |
7. Post Internship
I plan to stick around and keep contributing to this software, fixing things when needed. With this, I will no longer be a first-timer in the open source contributor space! I also plan to try contributing to my other favourite OSS projects with the newfound knowledge and courage I gained here.
Thanks to the guidance, feedback and discussions I got from the mentors as well as other applicants, I was able to draft this proposal. Thank you, everyone!