Tue, Mar 30
Hi, after discussions on Slack I quickly calculated the percentage of image suggestions that contain an image in a placeholder category; please see below.
Mon, Mar 29
- Worked on image placeholder identification [T277828]: identified a category-based approach to label images as placeholders, and incorporated it into the algorithm. Trained a simple neural network to distinguish between normal images and placeholder images, reaching 88%+ accuracy
- Worked with Platform Engineering to fix bug T277875 and identified the source of the problem
- Helped Structured Data refine the design of the test for image recommendations POC results T273092
- Schedule and speakers are finalized
- We have music!
- Authors have been notified of their presentation format
- Started exploring ideas for metrics, met with @marcmiquel
Fri, Mar 26
- Met with the full team, including Google researchers, and identified the next steps and deadlines. On our end, we will work full force on data release and on setting up the contract with the org responsible for setting up the challenge.
- Started process to generate the contract.
- Started process for data release.
- Generated the full list of images to download on HDFS from the list of captioned images on the WIT dataset. From this, we will remove the ones that we shouldn't share according to the security review. Also did some geographic/topical analysis of the training data.
- Started the security review on ASANA.
Hi @gmodena one question, when I run the notebook for the February snapshot (similar to the Jan one) I get the following numbers for cebwiki:
cebwiki number of unillustrated articles: **1,435,202**
So, as @Cparle mentioned, 10 times more than what I see for the same snapshot on the gmodena.imagerec_prod table.
Wed, Mar 24
Tue, Mar 23
Fri, Mar 19
- After converting a toy dataset of images into TFRecords, @AikoChou has trained a small model on stat1008 for object classification. We were then able to successfully run model inference on Hadoop.
- We worked on improving the modeling part: @AikoChou has worked on training and evaluating different models for image quality classification using Keras.
- Qualitative: We had a lot of issues when generating the data through MTurk: we were able to gather only ~60 valid questions. We are going to work with students to generate this data. We now have a well-defined set of good articles from which questions can be formulated.
- Quantitative: @Daniram3, @tizianopiccardi and I are working on re-running experiments from the rejected WWW paper, on January data. I finished computing all features over the 4.5M images on English Wikipedia. We are now targeting ACM Multimedia on April 3rd as potential deadline for resubmission.
Resolving this, feel free to reopen if there are more TODOs here!
- Estimated language and geographic distribution (thanks to Isaac's https://github.com/geohci/wiki-region-groundtruth Wiki Region Groundtruth data) of WIT test data
- Defined the legal constraints for image data publication.
- Worked with the WIT team to figure out next steps and involvement on their end, set up continuous communication channels, and provided a detailed overview of timelines and commitments on our end.
- Gathered all reviews for second round of submissions
- Sent notifications for second round of submissions
- Generated list of the 23 accepted papers, and grouped them into thematic areas for the poster sessions.
- Advertised new registration form.
Gathered survey questions related to our knowledge gaps from various surveys used across different initiatives at the Foundation, and collected them here: https://docs.google.com/spreadsheets/d/1xZCOSyoZXr9oVTqZRPmR47Rq_gHDYNZuvH-XFySFzC4/edit#gid=0
Technical skills are still missing, but @leila mentioned that these questions have been asked as part of previous research.
@Aiko and I talked about this. We are going to work on 2 things:
- Generate a list of the existing placeholder images and see what are the categories that they are labeled with, and exclude images from those categories when querying for candidates
- Try to build a simple computer vision model that can automatically detect whether an image is a good candidate or not.
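The category-based exclusion in the first point can be sketched roughly as follows (the category names and data fields here are made-up placeholders; the real list would come from the placeholder-image survey described above):

```python
# Sketch of category-based placeholder filtering.
# PLACEHOLDER_CATEGORIES is a hypothetical example list, not the real one.
PLACEHOLDER_CATEGORIES = {
    "Image placeholders",
    "Photo requested",
}

def is_placeholder(image_categories):
    """Return True if the image belongs to any known placeholder category."""
    return bool(PLACEHOLDER_CATEGORIES & set(image_categories))

# Toy candidate list with assumed fields.
candidates = [
    {"title": "Example_photo.jpg", "categories": ["Churches in Rome"]},
    {"title": "No_image.svg", "categories": ["Image placeholders"]},
]
kept = [c for c in candidates if not is_placeholder(c["categories"])]
```

The set intersection keeps the check cheap even when the placeholder category list grows.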
Tue, Mar 16
Mar 11 2021
Just a follow-up on a few use cases from the Research team.
Mar 9 2021
Makes sense, thanks @kzimmerman!
@tizianopiccardi has extracted this data for January 2021.
Captions on English Wikipedia
If we exclude all GIF, TIFF and PNG images, English Wikipedia has 7,811,234 images. Among those, 3,645,913 have a caption: 46.7%.
Mar 8 2021
Mar 4 2021
Hi @Aklapper, we added a few tags. This is mainly at a Research stage for now, so I included it in the Research board and linked it to the parent epic task for image classification studies. Thanks for the heads up!
Feb 26 2021
Thanks SO much @Mstyles !
@MPhamWMF I ran Maryum's queries, joined them with the newest list of unillustrated articles from enwiki, then computed, for each article, the total pageviews in the month of January 2021. I updated the spreadsheet with the lists of unillustrated articles for the 4 groups of people, sorted by pageviews: https://docs.google.com/spreadsheets/d/1FFCqwo0XsC6jJG3t7CGMhhOuzY9IQpNQTVyW8IbCNRM/edit?usp=sharing
Feb 25 2021
@ArielGlenn thanks for clarifying this. I chatted with @Cormac on Slack and he explained how to get the image page_id from the current entity data. We can get this info from the "id" field. For example, a mediainfo slot with an id of M12345 corresponds to a page with an id 12345. Thanks both!
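The M-id-to-page_id mapping described above is a simple prefix strip; a minimal helper could look like this:

```python
def mediainfo_to_page_id(mid: str) -> int:
    """Map a MediaInfo entity id like 'M12345' to its page_id (12345)."""
    if not mid.startswith("M"):
        raise ValueError(f"not a MediaInfo id: {mid!r}")
    return int(mid[1:])
```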
Feb 24 2021
- people born in Africa
- people born in the Caribbean
- Australasian indigenous people
- people born in North America
Feb 23 2021
@ArielGlenn thanks a lot again for this!
Thanks SO much @ArielGlenn, I am also downloading those on our stats machine and will check them once they are in!
Feb 16 2021
Feb 15 2021
Feb 11 2021
Got it, @CBogen!
The tool will evaluate 500 unillustrated articles from each wiki
@Miriam I suspect you already have such list of unillustrated articles from those wikis - can you tell me where I can find that?
Feb 10 2021
No prob, thanks @Cormac - then we probably should avoid normalization in this case, @Aiko?
@Cparle we are thinking of normalizing the component scores before fitting the regression - is it possible to have the max and min value (score) that each component can take? This would help make a normalization that generalizes beyond the data you shared. Many thanks!
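The normalization being discussed is presumably a standard min-max rescale; a minimal sketch, assuming per-component bounds are known:

```python
def min_max_normalize(score: float, lo: float, hi: float) -> float:
    """Rescale a component score into [0, 1] given the component's
    theoretical min (lo) and max (hi)."""
    if hi == lo:
        # Degenerate component: every score is identical, so map to 0.
        return 0.0
    return (score - lo) / (hi - lo)
```

Using fixed theoretical bounds (rather than the observed min/max of one sample) is what would make the normalization generalizable beyond the shared data.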
Feb 8 2021
We summarized the actions we took to incorporate the community's feedback into the second version of the Taxonomy. Please find it here.
@CBogen as discussed earlier today with @Cparle, @AikoChou and I are working on analyzing the data. @AikoChou will create a phab task with the description and results of the analysis she is doing (we just talked about it a couple of hours ago, and it's now late evening for her, so we will likely have it tomorrow morning or later today). I hope that works!
We just chatted with @AikoChou -- since we used the globalimagelinks table to get image information both for Wikidata and Wikipedia, these counts might still include icons. She will work on removing icons and recalculating these numbers (they probably won't change much).
Feb 5 2021
Feb 2 2021
Hi @Dzahn! Aiko is not getting a wikimedia email address, unless needed. She will be working with us until June 30th, so if possible, she should have access until then. And yes, please put me as contact. Many thanks for working on this!
@AikoChou just joined as a contractor. Her first task will be to look into this while we wait for her to get server access.
Jan 26 2021
I updated the spreadsheet with percentages of metadata coverage taken from a sample of 20k images (or less, when the total number of candidates is less than 20k). Numbers are slightly lower, but the overall picture doesn't change.
Jan 22 2021
@MMiller_WMF I added another sheet to the coverage spreadsheet containing the coverage numbers for unillustrated articles, excluding, as requested, disambiguation pages, list articles, and year articles. Please let me know if anything else is needed. Feel free to resolve this task if not.
@MMiller_WMF, in the second sheet of this spreadsheet you can find data about image metadata coverage in local languages. You will find % data only, as I computed these stats on a random sample of 5k image suggestions for each wiki (due to time/computational constraints).
Although numbers won't change much, I want to get stats from a larger sample. I will now sample ~20k image suggestions per language and let the crawler run. I will then post the results here.
Hi @MPhamWMF, I modified the Wikidata query to include historical people only, and added the Wiki url - here is the result, let me know if this works! https://docs.google.com/spreadsheets/d/1FFCqwo0XsC6jJG3t7CGMhhOuzY9IQpNQTVyW8IbCNRM/edit?usp=sharing
Jan 21 2021
I joined the data from @dcausse 's query with our unillustrated article list for English Wikipedia, then queried the pageview API to get pageviews for December 2020, here you have the result: https://docs.google.com/spreadsheets/d/1FFCqwo0XsC6jJG3t7CGMhhOuzY9IQpNQTVyW8IbCNRM/edit?usp=sharing
@Mstyles at what granularity do you need pageview counts? We can use the webrequest or pageview_hourly tables from Hive if we want to access pageviews in the last 3 months, or the pageview API for data aggregated monthly, e.g.: https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Serena_Williams/monthly/2015100100/2015103100
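The per-article Pageview API endpoint above is parameterized; a small helper for building those request URLs (the actual fetch, e.g. with `requests`, is left out):

```python
def pageviews_monthly_url(project: str, article: str,
                          start: str, end: str) -> str:
    """Build a Wikimedia Pageview API URL for monthly per-article counts.

    start/end use the API's YYYYMMDDHH format, e.g. '2015100100'.
    """
    base = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
    return f"{base}/{project}/all-access/all-agents/{article}/monthly/{start}/{end}"

url = pageviews_monthly_url(
    "en.wikipedia", "Serena_Williams", "2015100100", "2015103100"
)
```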
Jan 20 2021
Hi @Gehel! Are you looking at English Wikipedia only? Or all wikis?
I can send you a list of QIDs for unillustrated enwiki pages which match the QIDs returned by a Wikidata query similar to this: https://w.wiki/v3P?
Jan 19 2021
@MMiller_WMF here you have an initial spreadsheet with coverage statistics: https://docs.google.com/spreadsheets/d/1IKi0mQ4MZRATVOPPaMr_tX6sOzhjL8INc0Pt_RwwUqs/edit?usp=sharing
I followed your schema, and just added one column which is "has at least one candidate", to give a better idea of the overall algorithm coverage.
Please let me know if this works!
Jan 18 2021
@Marshall - this is not very easy to do with the current version that I have, as I would need to parse the "instance of" property. Let me first pull the numbers with the algorithm as is, and then do this in a second iteration!
Jan 14 2021
Hi! Thanks @fgiunchedi for this info! This is what I have to get/download image URLs given the filename:
Take this url: https://upload.wikimedia.org/wikipedia/commons/thumb/a/a8/Tour_Eiffel_Wikimedia_Commons.jpg/600px-Tour_Eiffel_Wikimedia_Commons.jpg
- The first part is always the same: https://upload.wikimedia.org/wikipedia/commons/ - this should be replaced with the Swift URL
- The second part is the first character of the MD5 hash of the file name. For example, the MD5 hash of Tour_Eiffel_Wikimedia_Commons.jpg is a85d416ee427dfaee44b9248229a9cdd, so we get /a.
- The third part is the first two characters of the MD5 hash from above: /a8.
- The fourth part is the file name: /Tour_Eiffel_Wikimedia_Commons.jpg
- Then you have the thumbnail size, e.g. 600px, and again the file name /600px-Tour_Eiffel_Wikimedia_Commons.jpg
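The steps above can be sketched as a small helper that reconstructs the thumbnail URL from a filename (using the hash prefixes exactly as described):

```python
import hashlib

def commons_thumb_url(filename: str, width: int = 600) -> str:
    """Reconstruct the upload.wikimedia.org thumbnail URL for a Commons file.

    The path shards on the first one and two hex characters of the MD5
    hash of the (underscore-separated) file name.
    """
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return (
        "https://upload.wikimedia.org/wikipedia/commons/thumb/"
        f"{digest[0]}/{digest[:2]}/{filename}/{width}px-{filename}"
    )

url = commons_thumb_url("Tour_Eiffel_Wikimedia_Commons.jpg")
```

For the Swift case, only the base prefix would need to change, as noted above.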
Jan 12 2021
So I re-ran the unillustrated article detection using:
- The new per-wiki thresholds calculated as per the previous post
- An additional anti-join with the page_props table to discard all articles having a page image
I then ran the image matching algorithm on top of the new set of unillustrated articles. I eyeballed the results and they looked more consistent than the earlier version. Below are the quantitative / coverage results:
number of unillustrated articles: 273305
number of article items with Wikidata image: 15983
number of article items with Wikidata Commons category: 28324
number of article items with language links: 83995

number of unillustrated articles: 580284
number of article items with Wikidata image: 7028
number of article items with Wikidata Commons category: 26526
number of article items with language links: 121891

number of unillustrated articles: 867565
number of article items with Wikidata image: 49226
number of article items with Wikidata Commons category: 57548
number of article items with language links: 117138

number of unillustrated articles: 181867
number of article items with Wikidata image: 8337
number of article items with Wikidata Commons category: 21120
number of article items with language links: 69413

number of unillustrated articles: 951319
number of article items with Wikidata image: 10938
number of article items with Wikidata Commons category: 39457
number of article items with language links: 236592

number of unillustrated articles: 2922830
number of article items with Wikidata image: 36412
number of article items with Wikidata Commons category: 92072
number of article items with language links: 325534
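The anti-join with page_props can be illustrated in plain Python (the production run used Hive/Spark tables; the row shapes and the `page_image` property prefix here are assumptions for the sketch):

```python
# Illustrative anti-join: keep only unillustrated articles that have NO
# page image recorded in page_props.
unillustrated = [
    {"page_id": 1, "title": "A"},
    {"page_id": 2, "title": "B"},
    {"page_id": 3, "title": "C"},
]
page_props = [
    {"pp_page": 2, "pp_propname": "page_image_free"},
]

# Pages that already have a page image, per page_props.
pages_with_image = {
    row["pp_page"]
    for row in page_props
    if row["pp_propname"].startswith("page_image")
}

# Anti-join: discard any article whose page_id appears in that set.
candidates = [a for a in unillustrated if a["page_id"] not in pages_with_image]
```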
Work done by @Swagoel for icon detection:
Jan 11 2021
I reported and extended the analysis above on Meta:
Jan 8 2021
End of quarter updates:
- Qualitative: designed an MTurk task to expand the dataset of questions for reading comprehension. Generated a list of potentially "good" articles to be fed to this task: articles that are relatively long, have more than one image, and contain sections such as "description" or "characteristics". Annotated the articles with topic and popularity score to further filter out potentially unhelpful articles.
- Quantitative: @Daniram3 started analyzing metrics such as dwell time and session length (in time). Early results show that dwell time is significantly longer for pages with images (or image clicks), and session time is longer when browsing through pages with images (or when sessions include image clicks), even when controlling for page length and number of pages visited in a session. This is somewhat related to T265772
Dec 16 2020
@MMiller_WMF, I ran v3 on the other languages, here are the overall results:
@jcrespo originals would be great, too, I only thought of thumbnails because they generally require less space.
Hi @Ottomata, thanks for the ping! Getting a copy of Commons (thumbnails only would be fine) which is directly accessible via stat machines would be amazing! Adding @fkaelin as we also chatted about this during recent conversations about the pain points of image work.
Dec 11 2020
Here is the new file for English with the change you requested on the date field: https://drive.google.com/file/d/1kVB5krC9SyFxvJqwehWW8IDRPQ2NCR_b/view?usp=sharing
Below is the metadata coverage:
- missing descriptions: 27%
- missing captions: 95%
- missing categories: 0%
- missing depicts: 92%
Re-running this for Arabic Wikipedia. Could you clarify what you would like me to do? Would you like me to find image matches for unillustrated articles in Arabic and then generate metadata for those, or would you like me to check the presence of Arabic metadata on these image candidates?
Dec 10 2020
More insights on Wikidata images vs topics. @FRomeo_WMF might be of interest for you.
More analysis on Wikidata, since @CBogen asked a while ago.
Dec 9 2020
Hi @MMiller_WMF, I modified the code to parse the HTML of the Commons page (there must be a better way to do this, but for now this is what we have) - it now includes more descriptions (only 40% missing) and all the additional data you requested. Some copyright statements and dates are not available in structured data, and for now these are ignored. Please check the sample data attached and let me know if it looks good. If yes, I can run it at scale and give back more suggestions.
Dec 3 2020
Hi @MMiller_WMF, that's interesting, I used some code to parse the HTML of the Commons page, maybe there are different ways of marking the description and I missed it. I'll double check. I will have to work on extending the code to get the additional information you need. I will need a few days at least.
Nov 27 2020
Closing this for now as all points were addressed.
Nov 25 2020
@Swagoel could you try to connect to the notebooks now as I showed you earlier, and let us know if it works?
@Dzahn many thanks!!
Approximate coverage statistics (estimated from a sample of 50k articles with initial candidate suggestions extracted with V3):
- Coverage before filtering: 500k out of 3M unillustrated articles (17%)
- First round of filtering: removing invalid image candidates (flags, svgs, image placeholders): discards 55% of articles with suggestions, leaving 7.5% of unillustrated articles with potential candidates
- Second round of filtering: removing images that are on-wiki only: discards further 12% of articles with suggestions, leaving 5.3% of unillustrated articles with potential candidates
- Out of the remaining candidates, the metadata coverage is the following:
- missing descriptions: 59%
- missing captions: 96%
- missing categories: 0%
- missing structured data: 92%
Nov 24 2020
Refactored code is now available on stat1005.
stat1005.eqiad.wmnet:/home/mirrys/ImageRecommendation/V3:
- retrieve_image_candidates.ipynb
- prioritize_clean.ipynb
- retrieve_image_candidates.ipynb discovers unillustrated articles and finds potential images matches
- prioritize_clean.ipynb filters out bad image candidates and generates good image suggestions for unillustrated articles, together with image captions, descriptions, categories, and structured data when available
Nov 20 2020
I made a general clean-up of the methodology to:
- Exclude all .svg images from the potential suggestions
- Exclude all flags as suggested
There are suggestions for around 13k out of 50k articles here: https://drive.google.com/file/d/1aMlYXP8eKORx8V0m98dIUNcpmCNrADFI/view?usp=sharing