
glam2commons (previously Single Image Batch Upload): Write and deploy initial and usable version
Closed, ResolvedPublic

Description

Project title: glam2commons (previously Single image batch upload)
Description: Currently, when somebody does a batch upload (uploading a large set of images released by an archive or museum (GLAM)), all images get uploaded, and many of them may end up unused or not that useful. The other option, uploading released images one by one, wastes a lot of time. A solution would be to provide the metadata mapping which makes an upload possible, together with a framework which, using these mappings, can be used to upload a single image (or a small subset of images) from a GLAM. In the ideal final version this would allow a GLAM to have an "upload to Wikimedia Commons" button on their website.

This task is to build this framework. For each GLAM a separate metadata mapping is needed; you don't have to make these (although it would be good to make one): I will provide/create such a mapping for a Dutch archive (Nationaal Archief) or, if needed, for an English archive. However, we'll have to discuss and design a good way for these mappings to be added and maintained. The most suitable place for this to land is likely https://tools.wmflabs.org/. For authentication we could make use of OAuth. As this project will be building something new (without an existing code base), working together on designing a good code structure is part of the project.

The minimal viable product is a framework which offers the following (a rough sketch in code follows the list):

  • A simple front end where an uploader selects a website/GLAM from which to upload and enters an identifier.
  • The tool then uses the metadata mapping for that specific GLAM to generate a wikitext description for the file.
  • The file then gets uploaded (using OAuth, so the upload is made from the user's own account).
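
A minimal sketch of what this MVP could look like, assuming a Flask web app (as suggested under Skills below); the GLAM registry, the placeholder mapping function, and the stubbed-out upload step are illustrative assumptions, not the actual design:

```
# Minimal sketch of the MVP front end (assumption: a Flask app on Tool Labs).
# The GLAM registry and the mapping function are placeholders; the OAuth
# upload step is intentionally left as a comment.
from flask import Flask, request, render_template_string

app = Flask(__name__)


def nationaal_archief_mapping(identifier):
    """Placeholder mapping: a real one would query the GLAM's API for the
    given identifier and turn its metadata into a wikitext description."""
    return ("== {{int:filedesc}} ==\n"
            "{{Information\n"
            "|description={{en|1=Record " + identifier + " from the Nationaal Archief}}\n"
            "|source=Nationaal Archief, record " + identifier + "\n"
            "}}\n")


# One metadata mapping per supported GLAM, shown in the dropdown.
GLAMS = {"nationaal-archief": nationaal_archief_mapping}

FORM = """
<form method="post">
  <select name="glam">
    {% for slug in glams %}<option value="{{ slug }}">{{ slug }}</option>{% endfor %}
  </select>
  <input name="identifier" placeholder="Identifier at the GLAM">
  <button type="submit">Generate description</button>
</form>
{% if wikitext %}<pre>{{ wikitext }}</pre>{% endif %}
"""


@app.route("/", methods=["GET", "POST"])
def index():
    wikitext = None
    if request.method == "POST":
        mapping = GLAMS[request.form["glam"]]
        wikitext = mapping(request.form["identifier"])
        # The real tool would now fetch the image file and upload it to
        # Wikimedia Commons with this wikitext, authenticated via OAuth so
        # the upload is made from the user's own account.
    return render_template_string(FORM, glams=GLAMS, wikitext=wikitext)


if __name__ == "__main__":
    app.run(debug=True)
```

This deliberately stops at generating and previewing the wikitext; the OAuth handshake and the actual upload call are separate pieces.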

As that might be a bit small in scope for the project, there are a lot of potential extensions, of which at least some can and should be implemented during the project (the first four are almost must-haves, the rest are nice to have):

  • The ability to call the tool via toolname/GLAM/ID (an API) and trigger an upload; this allows GLAMs to have an "upload to Wikimedia" button linking to that API.
  • A template for the metadata mappings and an easy way to update/edit them (see the sketch after this list).
  • Some landing structure on Commons: a hidden category where the uploads land, a request page for metadata mappings (and updates to them).
  • License checking of uploaded files.
  • The ability to upload more than one image (think of 10-100 images max.): the user enters a search string instead of an ID and from there can select a set of images to upload.
  • Provide quite a few metadata mappings already.
  • Provide some library-like functions for standard parts of the metadata mappings. Think of: parsing dates, categories, connecting to Wikidata items, finding creator templates, file title generation. (These are really the extras, and there are plenty here.)
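
As a rough illustration of what a per-GLAM metadata mapping plus a couple of shared helper functions could look like (the class layout, the example API URL, and the field names are assumptions made for this sketch, not a settled design):

```
# Sketch of one possible shape for a per-GLAM metadata mapping and shared
# helpers. The API URL, the JSON field names, and the helper signatures are
# illustrative assumptions.
import requests


def parse_date(raw):
    """Shared helper: crude normalisation of a GLAM date string."""
    return raw.strip().replace("/", "-")


class ExampleArchiveMapping:
    """One mapping per GLAM: fetch the metadata for an identifier and turn
    it into a wikitext description (file title generation, category lookup,
    creator templates, etc. would be further shared helpers)."""

    glam_name = "Example Archive"
    api_url = "https://example.org/api/records/{identifier}.json"  # placeholder

    def fetch(self, identifier):
        response = requests.get(self.api_url.format(identifier=identifier),
                                timeout=30)
        response.raise_for_status()
        return response.json()

    def to_wikitext(self, record):
        return ("{{Information\n"
                "|description={{en|1=" + record.get("description", "") + "}}\n"
                "|date=" + parse_date(record.get("date", "")) + "\n"
                "|source=" + record.get("url", "") + "\n"
                "}}\n")
```

New GLAMs could then be added by dropping in another such mapping, with the update/edit workflow (on-wiki, in a git repo, or elsewhere) still to be decided.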

A quick hacky local script which does this for one archive can be found at https://github.com/basvb/single-image-batch-upload

Skills: Python + some frontend work for a very simple web app. (We were thinking of a pywikibot + Flask based web app on Tool Labs.)
Estimated project time for a senior contributor: 2-3 weeks
Primary mentor: @Basvb (Python, batch uploading experience, Commons)
Co-mentors: @tom29739 (code review, Python and tools) and @zhuyifei1999 (Labs, Commons, Python)
Microtasks: (please do both the (simple) upload-to-Commons task and one other task)

  • Commons: Upload an image from a GLAM to Wikimedia Commons to get some feeling for licenses (!) and the things which have to be filled in for an uploaded file.

A more thorough description of the idea can be found at https://commons.wikimedia.org/wiki/User:Basvb/Ideas/Single_Image_Batch_Upload

Related Objects

[Subtask list; task titles not captured here: 16 resolved (15 assigned to Infobliss, one unassigned), 3 declined (djff, Kamsuri5, Basvb), 1 duplicate (Kamsuri5).]

Event Timeline

(There are a very large number of changes, so older changes are hidden.)

@srishakatux and @Basvb: I'm willing to help with this project. However, I don't have all skills necessary to be a good co-mentor: I work in Python, but mostly for research analysis rather than software development. Once you find a developer who can provide software design guidance and code review, if you think some user research/user-centered design mentorship would be helpful, please ping me!

Hi, I'm an Outreachy Round 14 aspirant, and I was interested in taking up this project. Are there any mentors I can communicate with, who could also assign a few microtasks to get started with?

@Meghana95: Thank you for showing interest in this project. We are still looking for a second mentor for this task. See the other possible project linked by @srishakatux above.

@srishakatux: Would it be a good idea to provide some microtasks already? I want to prevent somebody from being disappointed after spending time on a microtask if there is a chance that this project will not take part in Outreachy/GSoC because a second mentor is missing. I'm also struggling a bit with the fact that this would be a pretty stand-alone project (a new tool), so there is no existing codebase to make small changes to. Microtasks could aim at getting to know the process (do a mini batch upload, request access to Tool Labs); at setting the first steps of the tool (but that's heavily dependent on how we want to structure the tool, so not really micro); or maybe at extending some related tools (probably the best way to gauge the skills of aspiring interns). Another option is that I try to lay some quick groundwork providing a basic structure to build upon, but I thought it would be nice if the intern could also have a big influence on the general design choices for the structure of the tool.

@Capt_Swing: thanks for showing interest in the project, maybe you might be able to provide some input on the point above.

Thank you @Basvb and @Capt_Swing for your willingness to mentor this project, and thank you @Meghana95 for showing interest in working on this project :) I'm going to announce a call for a mentor for this project on wikitech-l in a bit.

In the meanwhile, @Meghana95 I would encourage you to take a look at some other projects and also keep an eye on this one.

Thanks everybody for your patience!!

Hello everyone, I am interested in this project for GSoC '17. Can you both, @Basvb and @Capt_Swing, guide me on how I can contribute to this project?

Hello! This project is really interesting. Is there a task here that I can start with to participate in Outreachy/GSoC?

Thanks @Kamsuri5 and @tskolm for your interest in working on this project! Nice to see so much enthusiasm for the project. We are still looking for a second mentor for this project to ensure that the intern gets the necessary support. While we look for a second mentor, you are welcome to explore the other projects listed here: https://www.mediawiki.org/wiki/Google_Summer_of_Code_2017 and other possible projects here: https://www.mediawiki.org/wiki/Google_Summer_of_Code_2017#Some_other_ideas

I can help with this project. I have experience in Python for software development, and I maintain a few tools on Tool Labs. I've also used Wikimedia's OAuth in one of my tools.

@tom29739: That would be wonderful, would it be a good idea to have a small chat on IRC this evening or tomorrow to discuss how we see the project and some next steps?

@Basvb I hope you soon find your second mentor, as I am looking forward to contributing to this project in the upcoming GSoC/Outreachy.

> @tom29739: That would be wonderful, would it be a good idea to have a small chat on IRC this evening or tomorrow to discuss how we see the project and some next steps?

Sure, tomorrow evening (22nd March) would be fine. I'm tom29739 on freenode.

> I can help with this project. I have experience in Python for software development, and I maintain a few tools on Tool Labs. I've also used Wikimedia's OAuth in one of my tools.

@tom29739 thanks for the help! @Meghana95 @Kamsuri5 @tskolm and any others interested: you can submit your proposals for this project. Also make sure that you go through https://www.mediawiki.org/wiki/Outreach_programs/Selection_process and have some prior contributions to be considered a strong candidate.

@Basvb please add one or two related microtasks that'd help you judge candidates for selection better.

I've emailed Basvb privately some of my concerns on this project. If they are addressed I think I'm available for mentoring as well.

Thanks everybody for making this task ready to be mentored for GSoC/Outreachy. I would like to know who the official mentors for this task are now. I would like to invite the official mentors, from Google's dashboard/Outreachy's admin system, to sign up so that they can view/review applications. Thank you!

@srishakatux: Is it possible to have two co-mentors? I've talked to both @tom29739 and @zhuyifei1999 who are willing to help mentor the project.
@Capt_Swing: Am I correct in the interpretation that your message was more an offer to take a look once or twice from the user perspective than an intention to co-mentor?

@Basvb The more the merrier, and there will not be any problem :) It would be great if @tom29739 and @zhuyifei1999 could PM me their email addresses to receive an official invite.

@Basvb I would like to work on this project. I have already uploaded a few images to Commons. I have a few doubts regarding the Flask microtasks; pardon me if these seem naive.

Regarding bullet 1: Are we looking at having a dropdown listing all the GLAMs, from which potential users can choose one? Or do we take a string from the user for the name of the GLAM and then compare it against our list of GLAMs for a match? But that may be more computationally complex, I guess.

Regarding bullet 2: My understanding is that here we only decide on a naming convention for the individual uploads so that we have a meaningful title. Do we assume that the different parameters, such as ID, GLAM, collection, description, and file title (at source), will be available from the GLAM's API? Do such APIs even exist for all the GLAMs? If not, then how do we extract all the parameters?


Hello everyone, now that the mentors have been finalized, I want to take up this project for GSoC '17/Outreachy (round 14). I have gone through the details of the project and have started working on the microtasks.

According to my study of the project, we will mainly be dealing with the following problems:

  1. Overcome the deficiencies of batch uploading by implementing single-image (or small-subset) uploads.
  2. Ease the uploading of images from GLAMs to Commons by introducing a button on the GLAM's end.
  3. Overcome the hassle (download, metadata updating, etc.) required to upload an image from an external source.

@Basvb can you comment on my conclusions, so that I can be sure whether I am going in the right direction?

To overcome the second problem, every GLAM will have an upload button, which will use our API to upload the images. I think we can get the GLAM ID through the button ID itself, along with which a request will be sent to our API. So the request to the API will consist of GLAM details and URL details, from which we will know which metadata mapping to use and can proceed accordingly.

@Infobliss if you are referring to having a common platform like the one mentioned by @Lokal_Profil, which would deal with the third problem mentioned above, then I think we will have to fetch the details of the GLAM through the URL or through GLAM APIs. @Basvb can you elaborate on this? It is a bit unclear, as GLAMs don't have APIs. So how will we deal with this?

@Capt_Swing: Thank you again for your thoughts on the project. The upcoming structured data is a good point to take into consideration. For the past years I have personally tried to use the regular (structured) templates to convey information; these can then easily be transferred into the structured data format. Within the tool we'd have to keep in mind that in a few years the metadata mappings and some other parts will likely have to be changed to use structured data directly.

@Infobliss and @Kamsuri5: Very good to hear that you are interested; I'm curious to see your proposals. Maybe it is a good idea to plan a (short) IRC session to talk a bit about what ideas we have and to ask each other questions. Tom and Zhuyifei are regulars in Cloud-Services; I'll try to be there as often as possible as well.

On the specific questions: I'll be moving the microtasks into separate tasks with some more information, allowing us to discuss them in more depth. For the link/Flask question: optimally there will be a simple front end with a dropdown list; besides that, there will also be a link where the GLAM's name plus an ID can be used to call the tool directly (with a button from the image page at the GLAM). There will be a predetermined name per GLAM, so it's not up to the user to enter the GLAM's name as free text.

On naming: the idea of the function is that, given that you know an ID, the title of the work (sometimes not known), a description, and the name of the GLAM, we create a standard format for the title (if a title is available: title; else: description - ID - GLAM name, plus the file extension); more on this in the specific task.
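
A small sketch of one possible reading of that naming rule; the set of characters stripped out is an assumption (MediaWiki titles disallow characters such as # < > [ ] { } |):

```
# One possible reading of the naming rule above: use the work's title when it
# is known, otherwise fall back to "description - ID - GLAM name", stripping
# characters that are assumed to be invalid in Commons file titles.
FORBIDDEN_CHARS = set('#<>[]{}|')


def generate_file_title(glam_name, identifier, extension,
                        title=None, description=None):
    base = title if title else "%s - %s - %s" % (description, identifier, glam_name)
    cleaned = "".join(c for c in base if c not in FORBIDDEN_CHARS).strip()
    return "%s.%s" % (cleaned, extension)


# Example:
# generate_file_title("Nationaal Archief", "2.24.14.02-123", "jpg",
#                     description="Interieur of a church")
# -> "Interieur of a church - 2.24.14.02-123 - Nationaal Archief.jpg"
```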

On APIs: I've seen quite a few GLAM collections with an API; this is where we get all the information required for the upload, so a GLAM needs to have such an API for us to be able to make a metadata mapping (scraping or database dumps are potentially an option, but let's not focus on that at the start). Two quick examples are http://www.gahetna.nl/beeldbank-api/opensearch/?q=2.24.14.02&count=100&startIndex=1 and http://cultureelerfgoed.adlibsoft.com/harvest/wwwopac.ashx?database=images&search=pointer%201009%20and%20BE=%22interieur%22&limit=10&xmltype=grouped . The second bullet point under Misc is to make a list of some relevant GLAMs; let me see if I can make a small start with that to give a better idea of some relevant use cases. I'll include the relevant APIs.

@Kamsuri5: Those are indeed the main problems. For the first point, the issue is that often there is a large collection (hundreds of thousands of images) which could clutter Wikimedia Commons if all files are uploaded. Doing uploads one by one allows for more post-processing and for only the most relevant files to be uploaded.

@Basvb can you connect with me on IRC? I am kamsuri on freenode.

@Kapilkd13, @Meghana95, and @tskolm: We've now listed a few micro tasks. I'm curious if you are still interested in pursuing a proposal for this project or have found other interesting projects in the meantime?

Hello @Basvb, thanks a lot for the heads-up on this. I am interested in working on this and I have started doing the microtasks. When I complete the second microtask, am I to paste the link to my repo here?

Hi @djff, good to see that you are interested in the project. You can claim the microtasks you are working on here in Phabricator (if there is not yet a subtask for it, you or I could create one). You can paste the link to your repo in the subtask ticket and in your proposal.

Thanks @Basvb, it looks like I was working on a task already claimed by someone else. Nonetheless, I will claim another task and upload what I have done so far to git ASAP. The sample batch upload you provided was of great help.

@djff, ah ok. Please also upload the work you did on the other task, as it will also show your abilities.

@djff, @Kamsuri5 and @Infobliss: don't forget to publish your proposals soon to allow us some time for feedback.

The name "Single Image Batch Upload" is extremely confusing (and reading a related proposal confused me even more). From reading the task description, it seems the idea is more like "On-demand batch upload".

I understand that the name is confusing. It is, after all, a contradictio in terminis. I think it is a good idea to move to a more descriptive name if/once the project starts.

I'm not sure whether the batch-upload part of the name you propose is correct at all. The main element is that the preparation work for batch uploading is done (a metadata mapping is available) and that every person with a Commons account can use it to upload the images from a collection they are interested in. For a batch upload, however, I expect multiple files to be uploaded at once (a batch). I'm open to suggestions on which few terms best summarize that description, or maybe we should go with a totally unrelated (non-descriptive) name.

> The main element is that the preparation work for batch uploading is done (a metadata-mapping is available)

Once upon a time, people thought that this would consist of writing XSLT for the sake of GWToolset.

> and that every person with a Commons account can use that to upload the images from a collection they are interested in.

If everything is ready, then files should be uploaded immediately so that random users have quick access to them. It makes sense to delay the upload only if there is some work left to do and/or the users triggering the upload are experienced and committed users.

>> The main element is that the preparation work for batch uploading is done (a metadata-mapping is available)

> Once upon a time, people thought that this would consist of writing XSLT for the sake of GWToolset.

Do you know what the lessons learned were, or the main reasons that this did not happen? I've personally never seen GWToolset as a tool to upload one image at a time with, and I've found it too complicated for end users.

>> and that every person with a Commons account can use that to upload the images from a collection they are interested in.

> If everything is ready, then files should be uploaded immediately so that random users have quick access to them. It makes sense to delay the upload only if there is some work left to do and/or the users triggering the upload are experienced and committed users.

There is almost always work to be done after uploading; the question is how much work has to be done and what the quality is without reviewing each file on its own. Currently there are a lot of batch-uploaded files without even any categories or other useful information, making them hard, if not impossible, to find for reusers. A good example of the work to be done, I think, is the Rijksmonumenten upload. I think we were fairly successful in stimulating improvements to the files with a good workflow for identifying monuments and, with that, adding relevant categories. However, even in such a successful case, simply because of the scale of the upload (400,000+ images), there are tens of thousands of images still needing some kind of fix. I've fixed huge parts of that semi-automatically or by hand, but as one of the uploaders I'm simply not able to commit to fixing all of those. So ideally those subsets would have been left out of the batch upload, hand-picking the most useful ones and fixing those.

I also see two other possible use cases. 2. Not all files from a collection are in scope; sometimes cherry-picking in-scope images is better than uploading thousands of images that are out of, or barely in, scope. A final possibility (although this should likely have been handled differently from the start) is that there are already thousands of images uploaded from the collection, and putting everything on Commons would introduce thousands of duplicates. In that case not uploading those thousands of images, but doing everything in batch from the start, would of course have been better. This is the case for the National Archive (NL) example above, with thousands of unstructured uploads (no way to test whether your image is a duplicate).

(I can't speak for GWToolset but there are reports.) I see two points in your reply.

> for hand-picking the most useful ones

Because hand-picking is time-consuming, the target/audience would be collections of which only a (small) minority of files are wanted. The obvious goal is for the screening process to take less time than it would take to handle the non-screened files while reaching the same result.

> and fixing those

This is important because it acknowledges that not everything can be fixed (efficiently) *before* landing on Wikimedia Commons (categories are especially hard since they're eminently local). The question then is how much of the fixing would fall under this project's scope; you probably don't want this project to be about all possible file cleanup gadgets one can think of.

I can also see a third use case for the "upload this to Wikimedia Commons" button, which is that it can be used for linking to the image on Commons if it already exists.

E.g. (a rough sketch of the check in step 2 follows these steps):
1.) The user clicks the upload button.
2.) The upload system checks whether the image is already on Commons.
3.) If it exists, it says "HERE IS THE IMAGE IN COMMONS"; if not, it uploads it.
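
A rough sketch of how the check in step 2 could be done, assuming we match by SHA-1 hash through the public Commons API (list=allimages with the aisha1 parameter); the image URL in the usage comment is a placeholder:

```
# Sketch of step 2: check whether a file with the same content already exists
# on Commons by looking up its SHA-1 hash via the MediaWiki API.
import hashlib

import requests

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def find_existing_upload(image_bytes):
    """Return the Commons file title(s) matching this image, or [] if none."""
    sha1 = hashlib.sha1(image_bytes).hexdigest()
    params = {
        "action": "query",
        "list": "allimages",
        "aisha1": sha1,
        "format": "json",
    }
    data = requests.get(COMMONS_API, params=params, timeout=30).json()
    return [image["title"] for image in data["query"]["allimages"]]


# Usage (placeholder URL): download the image from the GLAM, then either link
# to the existing Commons file or continue with the upload.
# image_bytes = requests.get("https://example.org/some-image.jpg", timeout=30).content
# matches = find_existing_upload(image_bytes)
# if matches:
#     print("HERE IS THE IMAGE IN COMMONS:", matches[0])
```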

Anyway, the biggest point of the upload link is to make things more straightforward when a user is on the GLAM's website and wants to add something to Wikimedia sites. If somebody wants to test how the idea would feel, I made a small script at Hack4fi (a yearly Finnish cultural hackathon) which uploads images from Finna (Finland's national digital library; only Helsinki City Museum images are supported) to Wikimedia Commons beta.

And some notes:

> A simple front end where an uploader selects a website/GLAM from which to upload and enters an identifier.

I used the URL as a single identifier, and from that I can figure out the GLAM and the ID on the server side.

> The tool then uses the metadata mapping for that specific GLAM to generate a wikitext description for the file.

Most likely there are no simple metadata mappings; you need to combine information from multiple sources. In my Finna-to-Commons demo I normalized the keywords from Finna using Finto (the Finnish ontology service) and matched the words from Finto with Wikidata pages. From Wikidata I could get Commons categories for those words.
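
For anyone curious, a naive sketch of that last step (keyword to Commons category via Wikidata, using wbsearchentities and the Commons category property P373); the Finto normalisation is left out, and a real mapping would need much better disambiguation:

```
# Naive sketch: search Wikidata for a keyword and read the "Commons category"
# (P373) claim of the first hit. A real mapping would need the Finto
# normalisation described above plus proper disambiguation.
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def commons_category_for(keyword, language="en"):
    search = requests.get(WIKIDATA_API, params={
        "action": "wbsearchentities",
        "search": keyword,
        "language": language,
        "format": "json",
    }, timeout=30).json()
    if not search.get("search"):
        return None
    qid = search["search"][0]["id"]

    claims = requests.get(WIKIDATA_API, params={
        "action": "wbgetclaims",
        "entity": qid,
        "property": "P373",
        "format": "json",
    }, timeout=30).json()
    statements = claims.get("claims", {}).get("P373", [])
    if not statements:
        return None
    return statements[0]["mainsnak"].get("datavalue", {}).get("value")


# Example: commons_category_for("windmill") might return "Windmills".
```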

Basvb renamed this task from Single Image Batch Upload to glam2commons (previously Single Image Batch Upload).May 21 2017, 1:02 PM
Basvb updated the task description.

Hello! Thank you for featuring this project in the previous edition of Google Summer of Code. Did the student complete the project requirements? In any case, please help modify the task description and add/remove the tags accordingly. Thank you!

@Basvb: As T161332 and T164555 are resolved, should this task also be resolved? Or is there more work (what exactly?) left to do here?

@Basvb and others involved in this task: I wonder if it would be interesting, and viable, if glam2commons could be adapted in the future (think: starting 1 year or a bit less) to work with SDC General. APIs should be sufficiently ready by then.

If there is interest in pursuing that, I'll be happy to (help) create a separate task for that, and to provide advice and support for this.

@Basvb Should this project still live under #outreach-program-projects? Is there anything remaining in this task?

Aklapper renamed this task from glam2commons (previously Single Image Batch Upload) to glam2commons (previously Single Image Batch Upload): Write and deploy initial and usable version.Jul 18 2018, 9:26 AM
Aklapper assigned this task to Infobliss.

> @Basvb Should this project still live under #outreach-program-projects? Is there anything remaining in this task?

No replies, and all subtasks are resolved.
Hence I am closing this task. Feel free to reopen if I misunderstood, but if there is more to do, please file separate tasks in the glam2commons project instead, to not turn this into a never-ending task. :)