glam2commons (previously Single Image Batch Upload)
Open, Needs Triage, Public

Description

Project title: glam2commons (previously Single image batch upload)
Description: Currently, when somebody does a batch upload (uploading a large number of images released by an archive or museum (GLAM)), all images get uploaded, and many of them might never be used or are not that useful. The other option, uploading released images one by one, wastes a lot of time. A solution is to provide the metadata mapping that makes an upload possible, plus a framework that uses these mappings to upload a single image (or a small subset of images) from a GLAM. In the ideal final version this would allow a GLAM to have an "upload to Wikimedia Commons" button on its website.

This task is to build this framework. A separate metadata mapping is needed for each GLAM. You don't have to make these (although it would be good to make one): I will provide or create such a mapping for a Dutch archive (Nationaal Archief) or, if needed, for an English archive. However, we'll have to discuss and design a good way for these mappings to be added and maintained. The most suitable place for this to land is likely https://tools.wmflabs.org/. For authentication we could make use of OAuth. As this project will be building something new (without an existing code base), working together on designing a good code structure is part of the project.

The minimal viable product is a framework which offers the following:

  • A simple front end where an uploader selects a website/GLAM from which to upload and enters an identifier.
  • The tool then uses the metadata mapping for that specific GLAM to generate a wikitext description for the file.
  • The file then gets uploaded (using OAuth to upload from own account of user).
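The second step above can be sketched in a few lines of Python. This is only a minimal sketch, assuming a mapping is a plain dictionary from parameters of the Commons {{Information}} template to field names in a GLAM record. The field names (`beschrijving`, `datum`, ...), `EXAMPLE_MAPPING` and `build_wikitext` are hypothetical and not taken from any existing tool; the upload step itself is left out.

```python
# Hypothetical mapping: {{Information}} parameter -> field name in the GLAM record.
EXAMPLE_MAPPING = {
    "description": "beschrijving",
    "date": "datum",
    "source": "bron_url",
    "author": "fotograaf",
}


def build_wikitext(record, mapping, license_template):
    """Render a file description page from one GLAM record.

    Fields missing from the record are rendered as empty parameters.
    """
    lines = ["{{Information"]
    for parameter, field in mapping.items():
        lines.append("|%s=%s" % (parameter, record.get(field, "")))
    lines.append("}}")
    lines.append("")
    lines.append("{{%s}}" % license_template)
    return "\n".join(lines)


# Made-up record, only to show the shape of the input.
record = {
    "beschrijving": "Example description",
    "datum": "1965-04-12",
    "bron_url": "https://glam.example.org/item/123",
    "fotograaf": "Unknown",
}
print(build_wikitext(record, EXAMPLE_MAPPING, "Cc-zero"))
```

Keeping the mapping as data rather than code is one possible answer to the question of how mappings can be added and maintained, since a new GLAM would then only need a new dictionary.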

As that scope might be a bit small for the project, there are many potential extensions, of which at least some can and should be implemented during the project (the first 4 are almost must-haves, the rest are nice to have):

  • The ability to call the tool from toolname/GLAM/ID (API) and trigger an upload; this allows GLAMs to have an "upload to Wikimedia" button linking to that API
  • A template for the metadata mappings and easy way to update/edit them
  • Some landing structure on Commons: hidden category where the uploads land, request page for metadata mapping (updates)
  • License checking of uploaded files.
  • The ability to upload more than one image (think of 10-100 images max.): the user enters a search string instead of an ID and from there can select a set of images to upload.
  • Provide quite a few metadata mappings already.
  • Provide some library-like functions for standard parts of a metadata mapping. Think of: parsing dates, categories, connecting to Wikidata items, finding creator templates, file title generation. (These are really the extras, and there are plenty here.)
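The toolname/GLAM/ID entry point from the first extension could be handled roughly as follows. This is a sketch under assumptions: `KNOWN_GLAMS`, the key `nationaal-archief` and `parse_upload_path` are hypothetical names, and a real tool would hook this into its web framework's routing instead of parsing paths by hand.

```python
# Hypothetical set of pre-registered GLAM keys (one per metadata mapping).
KNOWN_GLAMS = {"nationaal-archief"}


def parse_upload_path(path):
    """Split a path of the form /<glam>/<id> into (glam, identifier).

    Raises ValueError for malformed paths or GLAMs without a mapping.
    """
    parts = [part for part in path.split("/") if part]
    if len(parts) != 2:
        raise ValueError("expected /<glam>/<id>, got %r" % path)
    glam, identifier = parts
    if glam not in KNOWN_GLAMS:
        raise ValueError("no metadata mapping registered for %r" % glam)
    return glam, identifier


print(parse_upload_path("/nationaal-archief/12345"))
```

Rejecting unknown GLAM keys early keeps the upload step from ever running without a matching metadata mapping.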

A quick hacky local script which does this for one archive can be found at https://github.com/basvb/single-image-batch-upload

Skills: Python plus some front end for a very simple web app (we were thinking of a pywikibot + Flask based web app on Tool Labs).
Estimated project time for a senior contributor: 2-3 weeks
Primary mentor: @Basvb (python, batch uploading experience, Commons)
Co-mentor: @tom29739 (Code-review, python and tools), and @zhuyifei1999 (labs, commons, python)
Microtasks: (please do both the (simple) upload to commons task and one other task)

  • Commons: Upload an image from a GLAM to Wikimedia Commons to get some feeling for licenses (!) and the things which have to be filled in for an uploaded file.

A more thorough description of the idea can be found at https://commons.wikimedia.org/wiki/User:Basvb/Ideas/Single_Image_Batch_Upload

Related Objects

Status | Assigned
Open | None
Open | None
Open | Kamsuri5
Open | None
Resolved | Infobliss
Declined | djff
Declined | Kamsuri5
Open | Infobliss
Resolved | Infobliss
Resolved | Infobliss
Declined | Basvb
Resolved | Infobliss
Resolved | Infobliss
Resolved | Infobliss
Resolved | Infobliss
Resolved | Infobliss
Resolved | Infobliss
Resolved | Infobliss
Resolved | Infobliss
Duplicate | Kamsuri5
Open | Kamsuri5
Basvb updated the task description. Feb 27 2017, 8:21 PM
Basvb updated the task description.
Basvb updated the task description.
Basvb added a comment. Feb 27 2017, 8:27 PM

@srishakatux and @Miriya52: I've updated the description and added some potential skills for a second mentor, covering the areas where I'd have trouble providing support and which are likely needed for this project.

@Lokal_Profil, everybody: Based on your experience with uploading single images using a prepared metadata-mapping do you see anything that I missed in the description?

Thank you @Basvb! Is anyone from Front-end-Standards-Group or Toolforge willing to be a co-mentor for this task for Outreach-Programs-Projects?

Hi @Basvb, I would like to work on this project. I have 2 years of experience in Python and this project looks perfect for me. Is there a chance of finding a co-mentor for this project? I can see you are willing to mentor it. Let's hope for the best.
Regards,
kapil kumar

Thank you @Kapilkd13 for your interest in working on this project! While we look for a second mentor, you are welcome to explore other projects which are listed here: https://www.mediawiki.org/wiki/Google_Summer_of_Code_2017 and other possible projects here: https://www.mediawiki.org/wiki/Google_Summer_of_Code_2017#Some_other_ideas

Sumit updated the task description. Mar 1 2017, 3:53 AM

@srishakatux and @Basvb: I'm willing to help with this project. However, I don't have all skills necessary to be a good co-mentor: I work in Python, but mostly for research analysis rather than software development. Once you find a developer who can provide software design guidance and code review, if you think some user research/user-centered design mentorship would be helpful, please ping me!

Hi, I'm an Outreachy Round 14 aspirant, and I was interested in taking up this project. Are there any mentors I can communicate with, who could also assign a few microtasks to get started with?

Basvb updated the task description. Mar 15 2017, 10:41 AM
Basvb added a comment. Edited Mar 15 2017, 10:55 AM

@Meghana95: Thank you for showing interest in this project. We are still looking for a second mentor for this task. See the other possible project linked by @srishakatux above.

@srishakatux: Would it be a good idea to provide some microtasks already? I want to prevent somebody from being disappointed after spending time on microtasks if there is a chance that this project will not participate in Outreachy/GSoC because it lacks a second mentor. I'm also struggling a bit with the fact that this would be a pretty stand-alone project (a new tool), so there is no real codebase yet to make micro-changes to. Microtasks could aim at getting to know the process (do a mini batch upload, request access to Tool Labs), at taking the first steps on the tool (but that depends heavily on how we want to structure the tool, so it's not really micro), or maybe at extending some related tools (the best way to grasp the skills of aspiring interns). Another option is that I try to lay some quick groundwork by providing a basic structure to build upon, but I thought it would be nice if the intern could also have a big influence on the general design choices for the structure of the tool.

@Capt_Swing: thanks for showing interest in the project, maybe you might be able to provide some input on the point above.

Thank you @Basvb and @Capt_Swing for your willingness to mentor this project, and thank you @Meghana95 for showing interest in working on this project :) I'm going to announce a call for a mentor for this project on wikitech-l in a bit.

In the meantime, @Meghana95, I would encourage you to take a look at some other projects and also keep an eye on this one.

Thanks everybody for your patience!!

Hello everyone, I am interested in this project for GSOC'17. Can you both, @Basvb and @Capt_Swing, guide me on how I can contribute to this project?

tskolm added a subscriber: tskolm. Mar 20 2017, 8:34 PM

Hello! This project is really interesting. Can I find here a task to start participation in Outreachy/GSOC?

Thanks @Kamsuri5 and @tskolm for your interest in working on this project! Nice to see so much enthusiasm for the project. We are still looking for a second mentor for this project to ensure that the intern gets the necessary support. While we look for a second mentor, you are welcome to explore other projects which are listed here: https://www.mediawiki.org/wiki/Google_Summer_of_Code_2017 and other possible projects here: https://www.mediawiki.org/wiki/Google_Summer_of_Code_2017#Some_other_ideas

I can help with this project. I have experience in Python for software development, and I maintain a few tools on Tool Labs. I've also used Wikimedia's OAuth in one of my tools.

Basvb added a comment. Mar 21 2017, 2:48 PM

@tom29739: That would be wonderful, would it be a good idea to have a small chat on IRC this evening or tomorrow to discuss how we see the project and some next steps?

@Basvb I hope you soon find your second mentor, as I am looking forward to contributing to this project in the upcoming GSOC/Outreachy.

> @tom29739: That would be wonderful, would it be a good idea to have a small chat on IRC this evening or tomorrow to discuss how we see the project and some next steps?

Sure, tomorrow evening (22nd March) would be fine. I'm tom29739 on freenode.

Sumit added a subscriber: Sumit. Mar 22 2017, 4:42 AM

> I can help with this project. I have experience in Python for software development, and I maintain a few tools on Tool Labs. I've also used Wikimedia's OAuth in one of my tools.

@tom29739 thanks for the help! @Meghana95 @Kamsuri5 @tskolm and any others interested you can submit your proposals for this project. Also make sure that you go through https://www.mediawiki.org/wiki/Outreach_programs/Selection_process and have certain prior contributions to be considered a strong candidate.

@Basvb please add one or two related microtasks that'd help you judge candidates for selection better.

Sumit updated the task description. Mar 22 2017, 4:44 AM

I've emailed Basvb privately some of my concerns on this project. If they are addressed I think I'm available for mentoring as well.

Thanks everybody for making this task ready to be mentored for GSOC/Outreachy. I would like to learn who would be the official mentors for this task now? I would like to invite official mentors from Google's dashboard/Outreachy's admin system to signup to be able to view/review applications. Thank you!

Basvb updated the task description. Mar 22 2017, 9:05 PM
Basvb added a comment. Mar 22 2017, 9:08 PM

@srishakatux: Is it possible to have two co-mentors? I've talked to both @tom29739 and @zhuyifei1999 who are willing to help mentor the project.
@Capt_Swing: Am I correct in the interpretation that your message was more of an offer to take a look once or twice at the user-perspective than as an intention to co-mentor?

@Basvb The more the merrier, and there will not be any problem :) It would be great if @tom29739 and @zhuyifei1999 could PM me their email addresses to receive an official invite.

Basvb updated the task description. Mar 22 2017, 9:20 PM

@Basvb I would like to work on this project. I have already uploaded a few images to the Commons. I had a few doubts regarding the Flask microtasks. Pardon me if these seem naive.

Regarding bullet 1: Are we looking at having a dropdown which lists all the GLAMs, from which potential users can choose one? Or do we take a string from the user for the name of the GLAM and then compare it against our list of GLAMs for a match? But that may be more computationally complex, I guess.

Regarding bullet 2: My understanding is that here we only decide on a naming convention for the individual uploads so that we have a meaningful title. Do we assume that the different parameters such as ID, GLAM, collection, description and file title (at source) will be available from the GLAM's API? Do such APIs even exist for all the GLAMs? If not, then how do we extract all the parameters?

This comment was removed by Kamsuri5.
Kamsuri5 added a comment. Edited Mar 24 2017, 2:57 AM

Hello everyone, now that the mentors have been finalized I want to take up this project for GSOC'17/Outreachy (round 14). I have gone through the details of the project and have started working on the microtasks.

According to my study of the project, we will mainly be dealing with the following problems:

  1. Overcoming the deficiencies of batch upload by implementing a single-image (or small-subset) upload.
  2. Easing the upload of images from GLAMs to Commons by introducing a button on the GLAM's end.
  3. Overcoming the hassle (download, metadata updates, etc.) required to upload an image from an external source.

@Basvb can you comment on my conclusions, so that I can be sure whether I am going in the right direction?

To overcome the second problem, every GLAM will have an upload button which will use our API to upload the images. I think we can get the GLAM ID through the button ID itself, along with which a request will be sent to our API. So a request to the API will consist of GLAM details and URL details, from which we will know which metadata mapping to use and proceed accordingly.

@Infobliss you are referring to having a common platform like the one mentioned by @Lokal_Profil which will deal with the third problem mentioned above; then I think we will have to fetch details of the GLAM through the URL or through GLAM APIs. @Basvb can you elaborate on this? It is a bit unclear, as GLAMs don't have APIs. So how will we be dealing with this?

Basvb added a comment. Mar 24 2017, 6:14 PM

@Capt_Swing: Thank you again for your thoughts on the project. The upcoming structured data is a good point to take into consideration. For the past few years I have personally tried to use the regular (structured) templates to convey information; these can then easily be transferred into the structured data format. Within the tool we'd have to keep in mind that, likely in a few years, the metadata mappings and some other parts should be changed to use structured data directly.

@Infobliss and @Kamsuri5: Very good to hear that you are interested, I'm curious to see your proposals, maybe it is a good idea to plan a (short) IRC session to talk a bit about what ideas we have and to ask questions for each other. Tom and Zhuyifei are regulars in Cloud-Services, I'll try to be there as often as possible as well.

On the specific questions: I'll be moving the microtasks into separate tasks with some more information, allowing us to discuss them in more depth. For the link/Flask: optimally there will be a simple front end with a drop-down list; besides that there will also be a link where the name of the GLAM + ID can be used to call the tool directly (with a button from the image page at the GLAM). There will be a predetermined name per GLAM, so it's not up to the user to enter the GLAM's name as free text.

On naming: the idea of the function is that, given an ID, the title of the work (sometimes not known), a description and the name of the GLAM, we create a standard format for the title: (title if known, else description) - ID - GLAM name.ext. More on this in the specific task.
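As an illustration of that naming rule, here is a minimal sketch. It assumes the rule reads as "use the title if there is one, otherwise the description, then append the ID and the GLAM name". `build_file_title` and its parameters are hypothetical names, and the sketch does not deal with characters that are not allowed in file titles or with overly long descriptions.

```python
def build_file_title(identifier, glam_name, extension, title=None, description=None):
    """Build '<title or description> - <ID> - <GLAM name>.<ext>'.

    If neither a title nor a description is known, the label part is dropped.
    """
    label = (title or description or "").strip()
    parts = [part for part in (label, str(identifier), glam_name) if part]
    return " - ".join(parts) + "." + extension.lstrip(".")


print(build_file_title("12345", "Nationaal Archief", "jpg", title="Harbour view"))
print(build_file_title("12345", "Nationaal Archief", "jpg", description="Ships in a harbour"))
```

Including the ID in every title keeps titles unique within a GLAM even when many works share the same title or description.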

On APIs: I've seen quite a few GLAM collections with an API. This is where we get all the information required for the upload, so a GLAM needs to have such an API for us to be able to make a metadata mapping (scraping or database dumps are potential options, but let's not focus on those at the start). Two quick examples are http://www.gahetna.nl/beeldbank-api/opensearch/?q=2.24.14.02&count=100&startIndex=1 and http://cultureelerfgoed.adlibsoft.com/harvest/wwwopac.ashx?database=images&search=pointer%201009%20and%20BE=%22interieur%22&limit=10&xmltype=grouped . The second bullet point under Misc is to make a list of some relevant GLAMs; let me see if I can make a small start with that to give a better idea of some relevant use cases. I'll include the relevant APIs.
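For the first example, the query URL can be assembled from its parts. The sketch below only builds the URL and does not send a request; the endpoint and the `q`, `count` and `startIndex` parameters are copied from the example link above, while `build_search_url` is a hypothetical helper name. Whether the endpoint is still reachable, and what it returns, is not covered here.

```python
from urllib.parse import urlencode

# Endpoint taken from the example link above.
BEELDBANK_ENDPOINT = "http://www.gahetna.nl/beeldbank-api/opensearch/"


def build_search_url(query, count=100, start_index=1):
    """Return the search URL for a query string, page size and start offset."""
    params = {"q": query, "count": count, "startIndex": start_index}
    return BEELDBANK_ENDPOINT + "?" + urlencode(params)


print(build_search_url("2.24.14.02"))
```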

Kamsuri5, those are indeed the main problems. For the first point, the issue is that there is often a large collection (hundreds of thousands of images) which could clutter Wikimedia Commons if all files are uploaded. Doing uploads one by one allows for more post-processing and for only the most relevant files to be uploaded.

Basvb updated the task description. Mar 24 2017, 7:25 PM
Mtmlan84 removed a subscriber: Mtmlan84.
Mtmlan84 added a subscriber: Hannolans.

@Basvb can you connect with me on IRC? I am kamsuri on freenode.

@Kapilkd13, @Meghana95, and @tskolm: We've now listed a few micro tasks. I'm curious if you are still interested in pursuing a proposal for this project or have found other interesting projects in the meantime?

djff added a subscriber: djff. Mar 27 2017, 4:54 AM

Hello @Basvb, thanks a lot for the heads up on this. I am interested in working on this and I have started doing the microtasks. When I complete the second microtask, should I paste the link to my repo here?

Basvb added a comment. Mar 27 2017, 8:17 AM

Hi @djff, good to see that you are interested in the project. You can claim the microtasks you are working on here in Phabricator (if there is not yet a subtask for it, you or I could create it). You can paste the link to your repo in the subtask ticket and in your proposal.

djff added a comment. Mar 27 2017, 1:37 PM

Thanks @Basvb, it looks like I was working on a task already claimed by someone else. Nonetheless, I will claim another task and upload what I have done so far on git asap. The sample batch upload you provided was of great help.

Basvb added a comment. Mar 27 2017, 5:38 PM

@djff, aah ok. Please also upload the work you did on the other task as it will also show your abilities.

@djff, @Kamsuri5 and @Infobliss: don't forget to publish your proposals soon to allow us some time for feedback.

The name "Single Image Batch Upload" is extremely confusing (and reading a related proposal confused me even more). From reading the task description, it seems the idea is more like "On-demand batch upload".

I understand that the name is confusing. It is, after all, a contradictio in terminis. I think that it is a good idea to move to a more descriptive name if/once the project starts.

I'm not sure whether the batch upload part is correct at all (regarding the name you propose). The main element is that the preparation work for batch uploading is done (a metadata-mapping is available) and that every person with a Commons account can use that to upload the images from a collection they are interested in. For a batch upload I however expect multiple files being uploaded at once (a batch). I'm open to suggestions on which few terms best summarize that description or maybe we should go with a totally unrelated (non-descriptive) name.

> The main element is that the preparation work for batch uploading is done (a metadata-mapping is available)

Once upon a time, people thought that this would consist of writing XSLT for the sake of GWToolset.

> and that every person with a Commons account can use that to upload the images from a collection they are interested in.

If everything is ready, then files should be uploaded immediately so that random users have quick access to them. It makes sense to delay the upload only if there is some work left to do and/or the users triggering the upload are experienced and committed users.

Basvb added a comment. Edited Mar 30 2017, 11:01 PM

>> The main element is that the preparation work for batch uploading is done (a metadata-mapping is available)

> Once upon a time, people thought that this would consist of writing XSLT for the sake of GWToolset.

Do you know what the lessons learned were, or the main reasons that this did not happen? I've personally never seen GWToolset as a tool for uploading one file at a time, and I've seen it as too complicated for end users.

>> and that every person with a Commons account can use that to upload the images from a collection they are interested in.

> If everything is ready, then files should be uploaded immediately so that random users have quick access to them. It makes sense to delay the upload only if there is some work left to do and/or the users triggering the upload are experienced and committed users.

There is almost always work to be done after uploading; the question is how much work has to be done and what the quality is without reviewing each file on its own. Currently there are a lot of batch-uploaded files without even any categories or other useful information, making them hard, if not impossible, for reusers to find. A good example of the work to be done is, I think, the Rijksmonumenten upload. I think we were fairly successful in stimulating improvements to the files, with a good workflow for identifying monuments and, with that, adding relevant categories. However, even in such a successful case, just because of the scale of the upload (400,000+ images), there are tens of thousands of images still needing some kind of fix. I've fixed huge parts of that semi-automatically or by hand. But as one of the uploaders I'm simply not able to commit to fixing all of those. So ideally these subsets would have been left out of the batch upload, for hand-picking the most useful ones and fixing those.

I also see two other possible use cases: 2. Not all files from a collection are in scope; sometimes cherry-picking in-scope images is better than uploading thousands of images that are out of scope or barely in scope. A final possibility (although this should likely have been done differently from the start) is that there are already thousands of images uploaded from the collection, and putting everything on Commons will introduce thousands of duplicates. In that case, not uploading these thousands of images but doing everything in batch from the start would've been better, of course. This is the case for the National Archive (NL) example above, with thousands of unstructured uploads (no way to test if your image is a duplicate).

(I can't speak for GWToolset but there are reports.) I see two points in your reply.

> for hand-picking the most useful ones

Because hand-picking is time-consuming, the target/audience would be collections of which only a (small) minority of files are wanted. The obvious goal is for the screening process to take less time than it would take to handle the non-screened files while reaching the same result.

> and fixing those

This is important because it acknowledges that not everything can be fixed (efficiently) *before* landing on Wikimedia Commons (categories are especially hard since they're eminently local). The question then is how much of the fixing would fall under this project's scope; you probably don't want this project to be about all possible file cleanup gadgets one can think of.

Aklapper changed the status of subtask T161599: [GSoC Proposal 2017] Single Image Batch Upload from Resolved to Declined. May 7 2017, 4:05 PM
Aklapper changed the status of subtask T161649: Outreachy/GSOC'17 proposal for Single Image Batch Upload from Resolved to Declined.
Zache added a subscriber: Zache. Edited May 10 2017, 11:00 AM

I can also see a third use case for the "upload this to Wikimedia Commons" button: it can be used for linking to the image on Commons if it already exists.

E.g.
1.) The user clicks the upload button
2.) The upload system checks whether the image is already on Commons
3.) If it exists, it says "HERE IS THE IMAGE IN COMMONS"; if not, it uploads it
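The three steps above could be sketched as follows. How the existence check and the upload are actually done is left open here: `exists_on_commons` and `upload` are hypothetical callables passed in by the caller, and the stand-ins in the demo only simulate Commons with an in-memory set.

```python
def upload_or_link(file_title, exists_on_commons, upload):
    """Return ('exists', title) if the file is already there, else upload it.

    exists_on_commons: callable taking a title and returning True/False.
    upload: callable taking a title and performing the upload.
    """
    if exists_on_commons(file_title):
        return "exists", file_title
    upload(file_title)
    return "uploaded", file_title


# Stand-ins for demonstration only: a set plays the role of Commons.
already_there = {"Harbour view - 12345 - Nationaal Archief.jpg"}
uploaded = []

print(upload_or_link("Harbour view - 12345 - Nationaal Archief.jpg",
                     already_there.__contains__, uploaded.append))
print(upload_or_link("Town hall - 67890 - Nationaal Archief.jpg",
                     already_there.__contains__, uploaded.append))
```

Passing the two operations in as callables keeps the decision logic testable without touching the real site.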

Anyway, the biggest point of the upload link is to make things more straightforward when a user is on the GLAM's website and wants to add something to Wikimedia sites. If somebody wants to test how the idea would feel, I made a small script at Hack4FI (a yearly Finnish cultural hackathon) which uploads images from Finna (Finland's national digital library; only Helsinki City Museum images are supported) to Wikimedia Commons beta.

And some notes.
*A simple front end where an uploader selects a website/GLAM from which to upload and enters an identifier.

I used the URL as the single identifier; with that I can figure out the GLAM and the ID on the server side.
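A minimal sketch of that URL-based approach, assuming each supported GLAM is recognised by its host name and that the record ID is the last path segment. Both assumptions, as well as `HOST_TO_GLAM`, `glam.example.org` and `identify_from_url`, are hypothetical; real GLAM sites may keep the ID in a query parameter instead.

```python
from urllib.parse import urlparse

# Hypothetical registry: host name of the GLAM website -> GLAM key.
HOST_TO_GLAM = {"glam.example.org": "example-glam"}


def identify_from_url(url):
    """Derive (glam, identifier) from the URL of a record page."""
    parsed = urlparse(url)
    glam = HOST_TO_GLAM.get(parsed.hostname)
    if glam is None:
        raise ValueError("unsupported website: %r" % url)
    # Assumption: the identifier is the last non-empty path segment.
    identifier = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    if not identifier:
        raise ValueError("no identifier found in %r" % url)
    return glam, identifier


print(identify_from_url("https://glam.example.org/record/abc123"))
```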

*The tool then uses the metadata mapping for that specific GLAM to generate a wikitext description for the file.

Most likely there are no simple metadata mappings, and you need to combine information from multiple sources. In my Finna to Commons demo I normalized the keywords from Finna using Finto (the Finnish ontology service) and matched the words from Finto with Wikidata pages. From Wikidata I could get Commons categories for those words.

Basvb renamed this task from Single Image Batch Upload to glam2commons (previously Single Image Batch Upload). May 21 2017, 1:02 PM
Basvb updated the task description.
Restricted Application added a subscriber: PokestarFan. Jul 25 2017, 9:45 AM