Initiate the analysis of readers' engagement with images in Wikipedia
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Miriam
	Apr 14 2020, 11:27 AM

Description

Together with collaborators, start projects on analysing readers' engagement with images in Wikipedia:

Status	Assigned	Task
Resolved	Miriam	T250154 Initiate the analysis of readers' engagement with images in Wikipedia
Resolved	Daniram3	T252539 Internship: Understanding Readers' engagement with Wikipedia through Traffic Logs
Open	Miriam	T333944 Submit Two Papers on Images and Wikipedia

Weekly Updates:
Progressing on internship contract for the student who is going to work on the quantitative bit. Working with the collaborators on scoping down the project: https://docs.google.com/document/d/1d-sOWana7zOj26cPov4wpoUvsG9exDWI8S0Tl62KDoo/edit?usp=sharing

Weekly updates:

Qualitative: worked on surveying existing datasets from automatic generation of multilingual Q and A from Wikipedia. Collaborators worked on generating image history from Wikipedia dumps. Based on WikiLinkGraph, which allows to compute links additions and deletions for each revision of a page, the system does the same for image links. Progress is being tracked hree: https://docs.google.com/document/d/1CNQg3nVmfRKfKdo1utw2HtI19-qU38alKWof1CUoFQU/edit?usp=sharing

Weekly updates:

Quantitative: scoping down the project for the internship period. Focus on specific topics (education topic) and 1 or 2 research questions, leave others for later.
- How are people engaging with images?
- How does this change across different countries/different segments of countries having different development index levels?
Qualitative: collaborators retrieved lists of commonly asked questions. Next step is to match with QA datasets, and generate mulitple choice answers. This will be the root content for our experiment.

Weekly updates:

Quantitative: analysis of image percentage distribution by language, page length and page age. Worked on computing topics for all articles with images in 14 languages.
Qualitative: worked on worked on using word embeddings to retrieve the questions in Facebook's multilingual QA dataset which are more similar to the web queries in an open dataset. For word modeling, I used laser - which is multilingual by nature, and fasttext, which can be adapted to multilingual settings. This is useful to select the questions for our user experiment.

Weekly updates:

Qualitative: worked on exploring the questions in AI2 diagram dataset: https://allenai.org/data/diagrams These are schoolbooks questions about science that we could use to test how people learn through Wikipedia articles.
Quantitative: @Daniram3 is officially onboarded, with server and notebook access! We worked on exploring the data and on understanding what external data classification do we need to complete te project (country characteristics, image classifiers)

Weekly updates:

Qualitative: an undergrad from UW has joined the team. She will we be working on the experiment setup and data labeling.
Quantitative: after becoming more familiar with the datasets in the data lake, @Daniram3 has worked on extracting the first image view statistics. It looks like image click-through rate (imageviews/pageviews) sets at around 10% in average across countries.

Weekly updates:

Qualitative:
- analyzed the "reading comprehension" dataset from Allen AI. It contains questions about reading comprehension of Wikipedia paragraphs: https://allenai.org/data/quoref t
- the team decided to start crafting a few pilots for this experiment. Focusing on few selected articles, we will take questions from existing QA datasets, and generate questions manually. We will also try different versions of the interface.

Quantitative:
- @Danirama solved joining issues and worked on extracting image clicktrhough rate by joining image views with page views. One issue we found is that there is more than one image per article, so we need to normalize somehow by number of images.

Weekly updates:

*Qualitative:

finalized the first draft of the experiment design. Worked on classifying articles in the quoref dataset by topic and number of images

Quantitative:

@Danirama and I worked with analytics to solve sqoop tables updates issues and pyspark versioning issues. Now working on generating an image ctr metric which is aggregated by user, so to have a more statistically solid estimate of the image click through rate

Qualitative:

Generated example questions for the experiment design. The experiment will contain multiple choice questions on Wikipedia articles, with both visual and textual components.

Quantitative:

@Danirama worked on computing new metrics, including.
- Overall image ctr by counting only 1 image click per page - This is around 3%, so one order of magnitude higher than citation CTR
- Image CTR by page - This is also around 3% - meaning in average, 3 out of 100 pageviews we record 1 or more image clicks on the page
- Image CTR for each image - This is around 1.5% - meaning that there is a click 15 out of 1000 times an image is displayed in a Wikipedia article

Qualitative:

Finalized details of the first pilot for the "role of images in knowledge understanding" experiment: number of questions and variables to play with. We will manually design examples.

Quantitative:

@Danirama worked on refining metrics by.

Quantitative:

Mapped image CTR by country. Likely highly related to latency.
Mapped top topics by image CTR: visual arts, geography, transportation (?) (will look into that)
Identified country indicators which we want to use as predictors for CTR

Qualitative:

Defined the schema for question generation. Each team member will generate 20-30 questions for the final experiment.
Generated example questions here: https://docs.google.com/document/d/1x5Cej0GHfGoO_eVnrA-Lg2Zb2qkp8YVOYlbhdXRh4Ao/edit?usp=sharing

Initial analysis for the quantitative part is done and will be presented today.
Initial experimental design for the qualitative part is done and we are going to implement it next quarter