
Finalize document about tasks and subtasks involved in a COH batch upload project
Closed, Declined · Public · 12 Estimated Story Points

Description

Update the batch upload document based on the experience from the first batch uploads.

For the generic phases and tasks we could borrow from the open standard CRISP-DM and then sort and fill in our specialized tasks; see the overview on Wikipedia and a practical guide.

Using the images from GAR as an example, identify tasks, subtasks and possible checkboxes to set up a scalable workflow for batch uploads in the Connected Open Heritage project. Ideally, such a set of tasks and subtasks could be copy-pasted as a Connected-Open-Heritage-Batch-Uploads project containing e.g.

  • Data cleaning and transformation
    • Check already uploaded images on commons
    • Check and clean metadata file
    • Store raw image files safely
    • Review and link in mapping files from editors/domain experts

...
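The "Check already uploaded images on Commons" subtask above can be automated with the MediaWiki API's SHA-1 lookup (`list=allimages&aisha1=...`), which finds a file on Commons regardless of what it was renamed to. A minimal sketch (file paths are hypothetical):

```python
import hashlib
import json
import urllib.parse
import urllib.request

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def file_sha1(path):
    """Compute the SHA-1 of a local file, the hash MediaWiki stores per image."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def duplicates_on_commons(sha1):
    """Return titles of Commons files whose content matches the given SHA-1."""
    params = urllib.parse.urlencode({
        "action": "query",
        "list": "allimages",
        "aisha1": sha1,
        "format": "json",
    })
    with urllib.request.urlopen(f"{COMMONS_API}?{params}") as resp:
        data = json.load(resp)
    return [img["title"] for img in data["query"]["allimages"]]
```

Running `duplicates_on_commons(file_sha1("raw/IMG_0123.tif"))` before upload lets the batch skip or flag files that already exist under any name.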

  • Setup Institution templates on Commons
    • Cooperation template COH
    • Institution
    • Media uploaded from Institution

...

  • Setup Institution on Wikidata
    • Institution item
    • Properties etc

...

  • Manual test upload one image
    • Review by...

...

  • Scripts and documentation
    • Clone earlier project folders

...

Related Objects

Event Timeline

Working document is here: https://docs.google.com/document/d/1Qvs5T4yX4Hl6NuwV6RDPhFqHdlIRtlSmtAkLDurLg4I/edit?usp=sharing

We will rework it after the first batch upload and publish it on the Meta portal for the project.

I spoke to Amanda Bittaker in Esino Lario about her helping out with spreading learning material effectively. She mailed me today and I told her to wait until we've gotten this one up nicely:

Hi Amanda,

Yes, nice to meet you too! You remember correctly, I'd like to share as much useful information as possible with entry-level developers. I come from communications and taught myself programming, so I can hopefully add some perspective for others like me.

I primarily work with the project Connected Open Heritage (https://meta.wikimedia.org/wiki/Connected_Open_Heritage), and one of the project goals is to create scalable processes for batch uploading, data migration and getting new data into Commons.

We started with a detailed mapping of the practical steps involved in batch uploading photos from GLAM institutions, but it's not published as a wiki page yet. I also plan to publish "notebooks" (YuviPanda's baby on wmflabs), but they'll come after I've done a few projects, so that they are more mature pedagogically.

It would be great to get back to you in August once we have that up, and hear what you think about how to make the material as accessible as possible.

I appreciate that you got back to me, Amanda!

Regards,
/Mattias

One thing to consider is that apparently the creation of new non-trivial tasks is restricted to a product owner, who also prioritizes the tasks. Here's an example from the WMF Discovery Team:

"There should only be two ways that a task is added to the backlog column:

1. The product owner adds it during sprint prioritization, or the team lead adds it with permission/agreement from the product owner.
2. Someone on the sub-team discovers an urgent and nearly-trivial issue, which would take more time to explain and get permission to prioritize than it would to just fix."

Since we're a small team with many, often overlapping, projects and short funding periods, waste caused by confusion around this might have a big impact on efficiency and costs. I added a suggestion to address this in T155743.

Is there a timeline on this one, @Jopparn @Lokal_Profil? It's been here for 6 months now, and it will stay stuck in limbo unless we set up a meeting to go through the results and accept them.

Work is resumed starting with an internal workshop involving @Jopparn @Lokal_Profil and @Mattias_Ostmar-WMSE. Will continue.

Major restructuring of the document has been done based on the executed batch uploads.

However, there is still a lot of work left. We need to add more info in the early steps and rephrase and restructure parts of the older text, but this is not needed for the practical work with the coming batch upload.

Hence we decided that we will continue working on the document in a couple of months' time.

Jopparn renamed this task from Identify tasks and subtasks involved in a COH batch upload project to Finalize document about tasks and subtasks involved in a COH batch upload project.Feb 20 2017, 10:27 AM

I suggest that we look at CRISP-DM and see what we can learn and apply to document and report the decisions for data transformations etc. made in a batch upload.

Here is a practical user guide: http://lyle.smu.edu/~mhd/8331f03/crisp.pdf

Here's the outline for reference, copied from Locke Data:

  • Business understanding (timelines, expectations on and from GLAM...)
    • Understanding the business goal
    • Situation assessment
    • Translating the business goal into a data mining objective
    • Development of a project plan
  • Data understanding
    • Considering data requirements
    • Initial data collection, exploration, and quality assessment
  • Data preparation
    • Selection of required data
    • Data acquisition
    • Data integration and formatting […]
    • Data cleaning
    • Data transformation and enrichment […]

Not applicable [Modeling, Model evaluation, Model approval]

  • Deployment (the actual coding)
    • Create a report of findings
    • Planning and development of the deployment procedure
    • Deployment of the […] model
    • Distribution of the model results and integration in the organisation’s operational […] system
    • Development of a maintenance / update plan
    • Review of the project
    • Planning the next steps
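One lightweight way to produce the "report of findings" and document the transformation decisions CRISP-DM asks for is to log each change to a field together with its rationale while the upload script runs. This is only an illustrative sketch, not an existing COH tool; the class and field names are made up:

```python
import csv
import io


class TransformationLog:
    """Records each data transformation and its rationale for the project report."""

    def __init__(self):
        self.entries = []

    def record(self, field, before, after, rationale):
        """Log one transformation decision for a metadata field."""
        self.entries.append(
            {"field": field, "before": before, "after": after, "rationale": rationale}
        )

    def to_csv(self):
        """Render the log as CSV, ready to attach to the batch-upload report."""
        buf = io.StringIO()
        writer = csv.DictWriter(
            buf, fieldnames=["field", "before", "after", "rationale"]
        )
        writer.writeheader()
        writer.writerows(self.entries)
        return buf.getvalue()
```

For example, `log.record("date", "1.1.1960", "1960-01-01", "normalised to ISO 8601")` leaves a reviewable trace of why the source data no longer matches the uploaded metadata.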

My main problems in the Cyprus batch upload were clearly related to the following areas:

  • PROBLEM: Complex handling of categories requires the code to be written using OOP.
    • CONCLUSION: It should therefore be implemented from the beginning. New developers need an intro to batch-upload-related OOP thinking.
    • TODO: Create introduction session material based on BatchUploadTools as an example?
  • PROBLEM: Missing and problematic field values were mainly discovered only when coding the functional logic.
    • CONCLUSION: We need a process for, and more emphasis on, data quality assessment and documentation to guide the data transformation and enrichment work later on.
    • TODO: Rewrite "Tasks for a batch upload" to reflect the open standard CRISP-DM?
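As a possible starting point for the intro-session material, the OOP approach to category handling could be sketched roughly as below. The class and field names are hypothetical illustrations, not taken from BatchUploadTools; note how collecting unmapped values also feeds the data-quality assessment mentioned in the second conclusion:

```python
class CategoryRule:
    """Maps one metadata field's values to Commons category names."""

    def __init__(self, field, mapping):
        self.field = field      # metadata column, e.g. "municipality"
        self.mapping = mapping  # raw value -> category name

    def categories_for(self, record):
        """Return the categories this rule yields for one metadata record."""
        value = record.get(self.field)
        if value in self.mapping:
            return [self.mapping[value]]
        return []


class CategoryResolver:
    """Combines rules and reports field values that no rule could map."""

    def __init__(self, rules):
        self.rules = rules
        self.unmapped = []  # (field, value) pairs for the data-quality report

    def resolve(self, record):
        """Resolve all categories for a record, de-duplicated and sorted."""
        cats = []
        for rule in self.rules:
            found = rule.categories_for(record)
            if not found and record.get(rule.field):
                self.unmapped.append((rule.field, record[rule.field]))
            cats.extend(found)
        return sorted(set(cats))
```

Keeping the mapping logic in small objects like these means a new rule (monument type, photographer, decade) is one more `CategoryRule` instance rather than another branch in a monolithic script.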

Closing old task that has been replaced by later efforts.