Fri, Jun 26
- Generated new metrics for image and text selection gender gap.
- Brainstorming on new metrics for image/text framing gender gap.
- Prepared the code for regression analysis to estimate gender gap in a more statistically solid way.
Resolving this for now. I was not able to do the stretch goal but will leave that task open hoping to work on it soon :)
- Mapped image CTR by country. Likely highly related to latency.
- Mapped top topics by image CTR: visual arts, geography, transportation (?) (will look into that)
- Identified country indicators which we want to use as predictors for CTR
- Almost there, filling up the last tables, polishing and restructuring parts of the text
- Discussed adding a ground truth to evaluate the effectiveness of the citation quality classifier
- Based on the "articles with unsourced statements" category
- Aiko passed her thesis defense :)
Tue, Jun 23
@Miriam -- here's the task that we talked about making so you could give this approach a try. What do you think? Does this sound doable? What timeline would you prefer?
Fri, Jun 19
- Finalized details of the first pilot for the "role of images in knowledge understanding" experiment: number of questions and variables to play with. We will manually design examples.
- Finalized pass on Methods, Future Work and Introduction
- Working on tables
Jun 12 2020
- Finalized paper narrative and upcoming to-dos for each member of the team. We need further computational results, which I will produce for now, when I have time, while we look for a student :)
- Generated example questions for the experiment design. The experiment will contain multiple choice questions on Wikipedia articles, with both visual and textual components.
- Finalized section 2 - related work
- Working on making a pass on the new sections contributed by Martin and Leila
- Working on adding comprehensive tables for better consumption of the taxonomy
- Revised the metric to measure article quality, so that a sentence counts as "correctly cited" when it is in a paragraph with a citation.
- Revised predictors to model "contribution inequality" using the Gini coefficient.
- Found that citation quality is higher when few editors are contributing to the article, similar to previous work
- Aiko defended her thesis today and will submit the final manuscript by the end of the month -- Congrats @AikoChou
- After a break, we would like to wrap up these results in a paper for a conference
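A minimal sketch of the Gini coefficient mentioned above as a "contribution inequality" predictor. It assumes the input is one contribution count per editor of a page (for example, number of edits); which quantity was actually used per editor is not stated in these notes, and the function name is hypothetical.

```python
from typing import Iterable


def gini(contributions: Iterable[float]) -> float:
    """Gini coefficient of non-negative values: 0 = perfectly equal, (n-1)/n = one editor did everything."""
    xs = sorted(contributions)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        # Assumption: treat "no editors" or "no contributions" as perfectly equal.
        return 0.0
    # Closed form on ascending values: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Toy inputs (made up): edits per editor on two hypothetical pages.
balanced_page = [10, 10, 10, 10]
dominated_page = [0, 0, 0, 40]
print(gini(balanced_page), gini(dominated_page))
```

A value near 0 means editors contributed about equally; a value near 1 means a few editors account for almost all contributions.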
Jun 8 2020
Jun 5 2020
- We submitted the paper to ACM MM. \o/
- We polished the code for the library, and published it here: https://github.com/OlehOnyshchak/pyWikiMM
- For the part of "work on dataset release", most of the work is done, but we need to allocate time to download and store the data, and possibly run some baseline experiments on that. Everything is ready, but we won't be able to actually release data this quarter. @leila I would resolve this task if that works for you.
- Finalized the "methods" section of the taxonomy
- Gathering literature for the "related work section" which I will start soon
Jun 4 2020
Oh this is great, thanks so much @Ottomata !
Jun 1 2020
Ignore the message above, I ran the queries again, and it indeed seems that the problem has been solved in the past few hours :) thanks so much!
@JAllemandou thanks. However, if I query the mediawiki_imagelinks table in wmf_raw for pages more recent than December 2019, e.g. https://en.wikipedia.org/wiki/Coronavirus_disease_2019, I get an empty response. Am I missing something? Thanks!
Thanks for this @JAllemandou !
May 29 2020
- Worked on finalizing the paper narrative with the rest of the team.
- Refined the data, results are similar.
- Computed the top-5 accuracy as the final metric for the classifiers. This metric is widely used in image classification competitions such as the ImageNet Large Scale Visual Recognition Challenge. It counts how many times the correct label is found among the top-5 predictions of the classifier.
- Top-5 accuracy is around 80% for the first version, and 81.5% for the improved one, with major gains on classes we have worked on this quarter. https://docs.google.com/spreadsheets/d/18Er84wdWIme_KMOrOYZZQxq5z0d9O4L0nZMMibzQ_rc/edit?usp=sharing
- I could close this task but I still hope to train a network from scratch by the end of the quarter :)
- Added performance results, motivation, applications, and image/page statistics to the paper draft. It's almost ready to go!
- Analyzed the "reading comprehension" dataset from Allen AI. It contains questions about reading comprehension of Wikipedia paragraphs: https://allenai.org/data/quoref
- The team decided to start crafting a few pilots for this experiment. Focusing on a few selected articles, we will take questions from existing QA datasets, and generate questions manually. We will also try different versions of the interface.
- Finalized the "content" subsection of the taxonomy, missing tables and references which I will add after feedback
- Will start working on the "rationale" section next week
- Refined the regression analysis, AUC on the test set is around 0.7. Still finding some inconsistencies, probably due to features' collinearity with ORES' quality score
- Aiko is defending soon, so the first wrap-up of all experiments is expected in the coming 2 weeks.
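For reference, the top-5 accuracy described in the classifier update above can be sketched as follows. This is an illustrative implementation on made-up scores, not the evaluation code used for the reported numbers; the function name and toy data are assumptions.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int = 5) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one sample")
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the highest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label in top_k:
            hits += 1
    return hits / len(labels)


# Toy example: 3 samples, 6 classes, made-up classifier scores.
scores = [
    [0.05, 0.12, 0.40, 0.20, 0.15, 0.08],  # true class 2 is ranked first
    [0.30, 0.25, 0.20, 0.15, 0.08, 0.02],  # true class 5 is ranked last: a miss
    [0.02, 0.08, 0.15, 0.30, 0.25, 0.20],  # true class 1 is ranked fifth: still a hit
]
labels = [2, 5, 1]
print(top_k_accuracy(scores, labels, k=5), top_k_accuracy(scores, labels, k=1))
```

With k=1 the same function gives plain accuracy, which makes it easy to see how much the top-5 criterion relaxes the task.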
May 26 2020
May 22 2020
- Worked on logistic regression to predict presence of page/images from people characteristics - more details+plots coming next week
- Worked on comparing gender gap to other gaps, such as occupational or geographic gap.
- retrained the model with the new, polished data
- improvements are +7% overall, and +15% for the classes where we have modified the data! https://docs.google.com/spreadsheets/d/18Er84wdWIme_KMOrOYZZQxq5z0d9O4L0nZMMibzQ_rc/edit?usp=sharing
- noticed that there is another minor data improvement: basically, there are some concepts whose data comes from ambiguous Commons categories. My plan is to remove those and re-train the model on the cleaner data. Will try to do this next week.
- Paper draft almost finalized, working on the last details and contextualizing the release of the library in the MM community, and its role in supporting existing research and opening new areas of research
- Link to the repository with the library: https://github.com/OlehOnyshchak/WikipediaMultimodalDownloader
- Finalized the "readers" subsection of the taxonomy, waiting for feedback
- Started working on content subsection
- Added regression analysis and discussed the role of kurtosis and skewness; suggested modifications to the way we sample editors (currently, Aiko uses only the top-10 editors to generate features)
- Presented the work at the weekly meeting and discussed the feedback afterwards
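A small sketch of the skewness and kurtosis features discussed above, assuming they are computed over per-editor contribution counts for a page. It uses population moments and reports excess kurtosis; the exact estimator used in the project is not stated here, so treat both choices as assumptions.

```python
from typing import Sequence, Tuple


def skewness_and_kurtosis(values: Sequence[float]) -> Tuple[float, float]:
    """Return (skewness, excess kurtosis) computed from population central moments."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness and kurtosis are undefined when all values are equal")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skew, excess_kurtosis


# Made-up contribution counts for two hypothetical pages.
even_page = [1, 2, 3, 4, 5]          # symmetric around the mean
long_tail_page = [1, 1, 1, 1, 50]    # one editor dominates
print(skewness_and_kurtosis(even_page))
print(skewness_and_kurtosis(long_tail_page))
```

Restricting the input to the top-10 editors truncates the tail of the distribution, which is one reason the sampling of editors matters for these two features.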
May 18 2020
Thanks @colewhite ! Closing this task. Thanks a lot all for your help :)
May 15 2020
polished the Commons categories related to the 30 concepts for which we have lower accuracy. Downloaded the new data on stat1005. Ready for model re-train.
First paper draft is ready, missing abstract and related work! Working on refining the sections, and packaging the software for release.
- Qualitative: worked on exploring the questions in AI2 diagram dataset: https://allenai.org/data/diagrams These are schoolbooks questions about science that we could use to test how people learn through Wikipedia articles.
- Quantitative: @Daniram3 is officially onboarded, with server and notebook access! We worked on exploring the data and on understanding what external data and classifiers we need to complete the project (country characteristics, image classifiers)
- Worked on creating the first paper draft, Sec 4 will be about the Taxonomy
- We decided to work collectively on "Objectives" and drop the "causes" column
- Built the structure for subsections of Section 4: Readers, Contributors, and Content
- Aggregated editors' characteristics at page level, resulting in features such as editors' contribution skewness
- Initiated the study of the impact of different factors (page length, quality, topic, and editors' features) on citation quality, based on logistic regression
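To make the logistic-regression setup above concrete, here is a self-contained sketch that fits a tiny logistic regression by gradient descent and scores it with AUC on a held-out split. Everything in it is an assumption for illustration: the two features are anonymous stand-ins, the data is synthetic with arbitrary coefficients, and none of the numbers relate to the actual citation quality results.

```python
import math
import random
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X: Sequence[Sequence[float]], y: Sequence[int],
                 lr: float = 0.5, epochs: int = 300) -> Tuple[List[float], float]:
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return w, b


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Synthetic data: two standardized stand-in features with arbitrary true coefficients.
rng = random.Random(0)
data = []
for _ in range(300):
    x = [rng.gauss(0, 1), rng.gauss(0, 1)]
    label = int(rng.random() < sigmoid(2.0 * x[0] - 1.0 * x[1]))
    data.append((x, label))
train, test = data[:200], data[200:]

w, b = fit_logistic([x for x, _ in train], [l for _, l in train])
test_scores = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) for x, _ in test]
test_auc = auc(test_scores, [l for _, l in test])
print(f"weights={w}, bias={b:.3f}, test AUC={test_auc:.3f}")
```

The sign and size of each fitted weight is what gets read as the "impact" of a factor, which is why collinear predictors make the interpretation unreliable even when AUC looks reasonable.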
May 14 2020
Thank you so much @colewhite and all!
May 13 2020
@KFrancis thanks for your kind confirmation!
And thanks @colewhite for helping out. According to your list, the last point should be @Nuria's approval. Please let me know if there is anything else I can help with!
May 12 2020
@KFrancis thanks! We discussed this case over email, and my understanding was that the signed letter of agreement already contains an NDA, so we do not need an additional one. Could you please confirm?
May 9 2020
May 8 2020
- Worked on paper topic proposals
- Worked on strengthening the statistical soundness through logistic regression-based analysis
- Extracted image quality score from all images of people in Wikipedia for all languages
started working on data refinement, checked the categories for which we get lower accuracy, and refined the Commons category list associated to those
none, deadline for paper submission postponed
Weekly updates: none for now
- Downloaded data about editors' characteristics
- Refined citation quality analysis over time
May 7 2020
May 4 2020
All to-dos from last week are finished. Working on the paper submission for the Open Source Competition at ACM MM: https://2020.acmmm.org/osc-proposals.html
- Quantitative: scoping down the project for the internship period. Focus on specific topics (education topic) and 1 or 2 research questions, leave others for later.
- How are people engaging with images?
- How does this change across different countries/different segments of countries having different development index levels?
- Qualitative: collaborators retrieved lists of commonly asked questions. Next step is to match with QA datasets, and generate multiple choice answers. This will be the root content for our experiment.
- Added a structure of the final delivery document
- Drafted the taxonomy table based on the proposed schema, for the Readers dimension: https://docs.google.com/spreadsheets/d/1XXCBHV3i8_YjDUtvenWVjWmI4i-d_oRSvzHSs8M6_WU/edit?usp=sharing
- Computed citation quality by section and topics in English Wikipedia
- Computed evolution of citation quality over time for different topics: Medicine, Politics, Economics. CQ improves substantially over time!
Apr 28 2020
Apr 27 2020
- summarized results for selection gaps: https://docs.google.com/spreadsheets/d/1dB3NFiPvcq4Zl70yVZNlZxy8r5y8DXoRPImkXaHN5lM/edit?usp=sharing
working on the following:
- Optimise how we handle icons so that the script works faster
- Add possibility to download only fields specified by user
- Create a docker container with the script
- Start writing supporting-paper for the software
- Qualitative: worked on surveying existing datasets from automatic generation of multilingual Q and A from Wikipedia. Collaborators worked on generating image history from Wikipedia dumps. Based on WikiLinkGraph, which computes link additions and deletions for each revision of a page, the system does the same for image links. Progress is being tracked here: https://docs.google.com/document/d/1CNQg3nVmfRKfKdo1utw2HtI19-qU38alKWof1CUoFQU/edit?usp=sharing
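The image-history idea above (additions and deletions of image links per revision) can be sketched as a set difference between consecutive revisions. This is only an illustration of the concept: it is not the collaborators' code nor WikiLinkGraph's, and the input format (revision id plus the set of image titles in that revision, oldest first) is an assumption.

```python
from typing import Iterable, List, Set, Tuple

Revision = Tuple[int, Iterable[str]]
Change = Tuple[int, Set[str], Set[str]]


def diff_image_links(revisions: Iterable[Revision]) -> List[Change]:
    """For each revision (oldest first), return (rev_id, images added, images removed)."""
    changes: List[Change] = []
    previous: Set[str] = set()
    for rev_id, images in revisions:
        current = set(images)
        # Added = present now but not before; removed = present before but not now.
        changes.append((rev_id, current - previous, previous - current))
        previous = current
    return changes


# Made-up revision history for one hypothetical page.
history = [
    (1, {"A.jpg"}),
    (2, {"A.jpg", "B.png"}),
    (3, {"B.png"}),
]
for rev_id, added, removed in diff_image_links(history):
    print(rev_id, sorted(added), sorted(removed))
```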
Weekly update: reviewed existing taxonomies, and added a candidate taxonomy layout to the document: https://docs.google.com/document/d/1GG0cPB5bZALLAmqpZdNQtmooC1CcfOS8WoMZ_F2DkOw/edit?usp=sharing
Apr 22 2020
Wiki workshop was successfully held remotely on April 21st 2020.
A script is ready that extracts, for a given list of articles:
- Article text
- Article images links
- Image captions on article
- Image descriptions from Commons
- Image's section headers
- Image features from ResNet
Everything is packed in a single "query" function, with tons of parameters to change the behaviour of the script if needed. Link: https://github.com/OlehOnyshchak/WikipediaDownloader
Progressing on internship contract for the student who is going to work on the quantitative bit. Working with the collaborators on scoping down the project: https://docs.google.com/document/d/1d-sOWana7zOj26cPov4wpoUvsG9exDWI8S0Tl62KDoo/edit?usp=sharing
Weekly update: Literature review in progress: https://docs.google.com/document/d/1GG0cPB5bZALLAmqpZdNQtmooC1CcfOS8WoMZ_F2DkOw/edit?usp=sharing
Next up - learning about the existing taxonomy material
Apr 14 2020
Apr 9 2020
So we did a few tests with the latest ROCm version.
- When the GPU saturates, there is no need to reboot, as killing the stalled processes is enough for the GPU to release the resources. This is a big improvement compared to the previous version!
- We found that the saturation is related to a VRAM usage problem
- We found a TensorFlow-native solution to dynamically allocate the memory used by a process on the GPU. Added to every TensorFlow script, it allows multiple users to run TensorFlow scripts on the GPU at the same time. More info here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU#Configure_your_Tensorflow_script
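As a rough illustration of the dynamic allocation setting mentioned above, the sketch below enables on-demand GPU memory growth. It assumes TensorFlow 2.1 or later and may differ from the exact snippet on the wikitech page linked above, which remains the reference for our cluster. The helper name is hypothetical, and the import is guarded only so the sketch runs on machines without TensorFlow.

```python
def enable_gpu_memory_growth() -> int:
    """Ask TensorFlow to grow GPU memory on demand; return the number of GPUs configured."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed here, so there is nothing to configure.
        return 0
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Must run before any op touches the GPU, otherwise TensorFlow raises RuntimeError.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


if __name__ == "__main__":
    print(f"memory growth enabled on {enable_gpu_memory_growth()} GPU(s)")
```

Calling this at the very top of a script keeps a single process from reserving all VRAM up front, which is what lets several users share the GPU.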
Apr 1 2020
Report available here: https://meta.wikimedia.org/wiki/Research:Prototypes_of_Image_Classifiers_Trained_on_Commons_Categories
It highlights milestones and areas of improvement to design our own in-house image classifiers, reports on accuracy and GPU performance, and links to some qualitative results of classification on a new set of images.
Mar 30 2020
MOUs signed and formal collaboration announcement sent on wiki-research-l! Resolving this task.
Mar 27 2020
Closing this task as per our discussion.
Writing report here: https://meta.wikimedia.org/wiki/Research:Prototypes_of_Image_Classifiers_Trained_on_Commons_Categories