Miriam (Miriam Redi)
User

User Details

User Since
Sep 25 2017, 10:36 AM (185 w, 1 d)
Availability
Available
LDAP User
Miriam
MediaWiki User
Miriam (WMF) [ Global Accounts ]

Recent Activity

Tue, Mar 30

Miriam added a comment to T277828: Investigate placeholder image recommendation.

Hi, after discussions on slack I quickly calculated the percentage of image suggestions that contain an image in a placeholder category, please see below.

Tue, Mar 30, 3:19 PM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team

Mon, Mar 29

Miriam added a comment to T278681: Image Matching Structured Task: Research Q3-Q4.

Weekly updates:

  • Worked on image placeholder identification [T277828]: identified a category-based approach to label images as placeholders, and incorporated it into the algorithm. Trained a simple neural network to distinguish between normal images and placeholder images, with accuracy 88%+
  • Worked with Platform Engineering to fix bug T277875 and identified the source of the problem
  • Helped Structured Data refine the design of the test for image recommendations POC results T273092
Mon, Mar 29, 10:05 AM · Research (FY2020-21-Research-January-March)
Miriam created T278681: Image Matching Structured Task: Research Q3-Q4.
Mon, Mar 29, 10:00 AM · Research (FY2020-21-Research-January-March)
Miriam added a comment to T266655: Quantifying the importance of images in Wikipedia.

Weekly updates:

Mon, Mar 29, 9:58 AM · Research (FY2020-21-Research-January-March)
Miriam added a comment to T272385: WikiWorkshop 2021.

Weekly updates:

  • Schedule and speakers are finalized
  • We have music!
  • Authors have been notified of their presentation format
Mon, Mar 29, 9:53 AM · Research (FY2020-21-Research-January-March)
Miriam updated subscribers of T273968: Define Metrics for Survey-Based Knowledge Gaps.

Weekly updates:

  • Started exploring ideas for metrics, met with @marcmiquel
Mon, Mar 29, 9:51 AM · Research (FY2020-21-Research-January-March)

Fri, Mar 26

Miriam added a comment to T260634: Run a computer vision challenge.

Weekly updates:

  • Met with the full team, including Google researchers, and identified the next steps and deadlines. On our end, we will work full force on data release and on setting up the contract with the org responsible for setting up the challenge.
  • Started process to generate the contract.
  • Started process for data release.
Fri, Mar 26, 5:58 PM · Research (FY2020-21-Research-January-March)
Miriam added a comment to T278217: Release image data for training.

Weekly updates:

  • Generated the full list of images to download on HDFS from the list of captioned images on the WIT dataset. From this, we will remove the ones that we shouldn't share according to the security review. Also did some geographic/topical analysis of the training data.
  • Started the security review on ASANA.
Fri, Mar 26, 5:54 PM · Research (FY2020-21-Research-January-March)
Miriam added a comment to T277875: Are we sure all unillustrated articles are available via the API?.

Hi @gmodena one question, when I run the notebook for the February snapshot (similar to the Jan one) I get the following numbers for cebwiki:

cebwiki
number of unillustrated articles: **1,435,202**

So, as @Cparle mentioned, 10 times more than what I see for the same snapshot on the gmodena.imagerec_prod table.

Fri, Mar 26, 2:20 PM · Platform Team Workboards (Image Suggestion API), Image-Suggestion-API, Image-Recommendations
Miriam updated the task description for T278217: Release image data for training.
Fri, Mar 26, 10:40 AM · Research (FY2020-21-Research-January-March)

Wed, Mar 24

Miriam updated the task description for T278217: Release image data for training.
Wed, Mar 24, 4:30 PM · Research (FY2020-21-Research-January-March)

Tue, Mar 23

Miriam updated the task description for T278217: Release image data for training.
Tue, Mar 23, 9:47 AM · Research (FY2020-21-Research-January-March)
Miriam created T278217: Release image data for training.
Tue, Mar 23, 9:39 AM · Research (FY2020-21-Research-January-March)
Miriam updated the task description for T260634: Run a computer vision challenge.
Tue, Mar 23, 9:33 AM · Research (FY2020-21-Research-January-March)

Fri, Mar 19

Miriam added a comment to T276407: An End-to-End Image Classification Pipeline.

Weekly updates:

  • After converting a toy dataset of images into TFRecords, @AikoChou has trained a small model on stat1008 for object classification. We were then able to successfully run model inference on Hadoop.
  • We worked on improving the modeling part: @AikoChou has worked on training and evaluating different models for image quality classification using Keras.
Fri, Mar 19, 6:37 PM · Research (FY2020-21-Research-January-March), Structured-Data-Backlog, MachineVision
Miriam edited projects for T276407: An End-to-End Image Classification Pipeline, added: Research (FY2020-21-Research-January-March); removed Research.
Fri, Mar 19, 6:34 PM · Research (FY2020-21-Research-January-March), Structured-Data-Backlog, MachineVision
Miriam added a subtask for T256081: Image matching algorithm: T276407: An End-to-End Image Classification Pipeline.
Fri, Mar 19, 6:33 PM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team, Wikipedia-Android-App-Backlog
Miriam added a parent task for T276407: An End-to-End Image Classification Pipeline: T256081: Image matching algorithm.
Fri, Mar 19, 6:33 PM · Research (FY2020-21-Research-January-March), Structured-Data-Backlog, MachineVision
Miriam updated subscribers of T266655: Quantifying the importance of images in Wikipedia.

Weekly updates:

  • Qualitative: We had a lot of issues when generating the data through MTurk: we were able to gather only ~60 valid questions. We are going to work with students to generate this data. We now have a well-defined set of good articles from which questions can be formulated.
  • Quantitative: @Daniram3, @tizianopiccardi and I are working on re-running experiments from the rejected WWW paper, on January data. I finished computing all features over the 4.5M images on English Wikipedia. We are now targeting ACM Multimedia on April 3rd as potential deadline for resubmission.
Fri, Mar 19, 6:31 PM · Research (FY2020-21-Research-January-March)
Miriam closed T266271: Testing image recommendations with V3 as Resolved.

Resolving this, feel free to reopen if there are more TODOs here !

Fri, Mar 19, 6:24 PM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team
Miriam closed T266271: Testing image recommendations with V3, a subtask of T256081: Image matching algorithm, as Resolved.
Fri, Mar 19, 6:24 PM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team, Wikipedia-Android-App-Backlog
Miriam added a comment to T260634: Run a computer vision challenge.

Weekly updates:

  • Estimated language and geographic distribution (thanks to Isaac's https://github.com/geohci/wiki-region-groundtruth Wiki Region Groundtruth data) of WIT test data
  • Defined the legal constraints for image data publication.
  • Worked with the WIT team to figure out next steps and involvement on their end, set up continuous communication channels and provided a detailed overview of timelines and commitments on our end.
Fri, Mar 19, 6:22 PM · Research (FY2020-21-Research-January-March)
Miriam added a comment to T272385: WikiWorkshop 2021.

Weekly updates:

  • Gathered all reviews for second round of submissions
  • Sent notifications for second round of submissions
  • Generated list of the 23 accepted papers, and grouped them into thematic areas for the poster sessions.
  • Advertised new registration form.
Fri, Mar 19, 6:19 PM · Research (FY2020-21-Research-January-March)
Miriam updated the task description for T272385: WikiWorkshop 2021.
Fri, Mar 19, 6:16 PM · Research (FY2020-21-Research-January-March)
Miriam updated subscribers of T273968: Define Metrics for Survey-Based Knowledge Gaps.

Weekly updates:
Gathered survey questions related to our knowledge gaps from various surveys used across different initiatives at the Foundation, and collected them here: https://docs.google.com/spreadsheets/d/1xZCOSyoZXr9oVTqZRPmR47Rq_gHDYNZuvH-XFySFzC4/edit#gid=0
Technical skills is still missing but @leila mentioned that these questions have been asked as part of previous research.

Fri, Mar 19, 6:16 PM · Research (FY2020-21-Research-January-March)
Miriam added a parent task for T277828: Investigate placeholder image recommendation: T268352: Improve list of image candidates to discard.
Fri, Mar 19, 10:25 AM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team
Miriam added a subtask for T268352: Improve list of image candidates to discard: T277828: Investigate placeholder image recommendation.
Fri, Mar 19, 10:25 AM · Image-Recommendations, Research (FY2020-21-Research-October-December), Growth-Team
Miriam added a comment to T277828: Investigate placeholder image recommendation.

@Aiko and I talked about this. We are going to work on 2 things:

  1. Generate a list of the existing placeholder images, see which categories they are labeled with, and exclude images from those categories when querying for candidates
  2. Try to build a simple computer vision model that can automatically detect whether an image is a good candidate or not.
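The first item above (category-based exclusion) can be sketched roughly as follows. This is only an illustration, not the code used in the task: the category names, the `candidates` structure, and the function name are all hypothetical.

```python
# Hypothetical placeholder category names, for illustration only.
PLACEHOLDER_CATEGORIES = {"Image placeholders", "Replace this image"}


def drop_placeholder_candidates(candidates, placeholder_categories=PLACEHOLDER_CATEGORIES):
    """Keep only candidates that share no category with the placeholder set."""
    return [
        c for c in candidates
        if not placeholder_categories.intersection(c["categories"])
    ]


# Toy candidate list (made-up file names and categories).
candidates = [
    {"file": "Example_portrait.jpg", "categories": ["Portraits"]},
    {"file": "Replace_this_image.svg", "categories": ["Image placeholders", "SVG icons"]},
]
kept = drop_placeholder_candidates(candidates)
```

The second item (a learned detector) would complement this filter for placeholder images that are not categorized as such.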
Fri, Mar 19, 10:23 AM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team

Tue, Mar 16

Miriam closed T268350: Improve algorithm for unillustrated article selection, a subtask of T256081: Image matching algorithm, as Resolved.
Tue, Mar 16, 12:01 PM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team, Wikipedia-Android-App-Backlog
Miriam closed T268350: Improve algorithm for unillustrated article selection as Resolved.
Tue, Mar 16, 12:01 PM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team, Wikipedia-Android-App-Backlog

Mar 11 2021

Miriam updated subscribers of T276791: Configure the Hadoop cluster to use the GPUs available on some workers.

Just a follow-up on a few use cases from the Research team.

Mar 11 2021, 3:44 PM · Analytics, Machine-Learning-Team

Mar 9 2021

Miriam added a comment to T276849: [REQUEST] Caption and alternative text data related to image files.

Makes sense, thanks @kzimmerman!

Mar 9 2021, 7:49 PM · Research, Product-Analytics
Miriam updated subscribers of T276849: [REQUEST] Caption and alternative text data related to image files.

@tizianopiccardi has extracted this data for January 2021.
Captions on English Wikipedia
If we exclude all gif, tiff and png images, English Wikipedia has 7'811'234 images. Among those, 3'645'913 have a caption: 46.7%.

Mar 9 2021, 11:12 AM · Research, Product-Analytics

Mar 8 2021

Miriam updated subscribers of T274878: Estimate the number of images added to each Wiki in a month.
Mar 8 2021, 10:25 AM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team

Mar 4 2021

Miriam updated the task description for T215413: Image Classification Working Group.
Mar 4 2021, 9:50 AM · Analytics-Radar, Reading-Admin, SDC General, Wikidata, Multimedia, Discovery-Search, Research
Miriam added a comment to T276407: An End-to-End Image Classification Pipeline.

Hi @Aklapper, we added a few tags. This is mainly at a Research stage for now, so I included it in the Research board and linked it to the parent epic task for image classification studies. Thanks for the heads up!

Mar 4 2021, 9:48 AM · Research (FY2020-21-Research-January-March), Structured-Data-Backlog, MachineVision
Miriam triaged T276407: An End-to-End Image Classification Pipeline as Medium priority.
Mar 4 2021, 9:47 AM · Research (FY2020-21-Research-January-March), Structured-Data-Backlog, MachineVision
Miriam added a subtask for T215413: Image Classification Working Group: T276407: An End-to-End Image Classification Pipeline.
Mar 4 2021, 9:44 AM · Analytics-Radar, Reading-Admin, SDC General, Wikidata, Multimedia, Discovery-Search, Research
Miriam added a parent task for T276407: An End-to-End Image Classification Pipeline: T215413: Image Classification Working Group.
Mar 4 2021, 9:44 AM · Research (FY2020-21-Research-January-March), Structured-Data-Backlog, MachineVision

Feb 26 2021

Miriam added a comment to T274675: lists for 200 most viewed historical figures in different regions.

Thanks SO much @Mstyles !
@MPhamWMF I ran Maryum's queries, joined the results with the newest list of unillustrated articles from enwiki, then computed, for each article, the total pageviews in the month of January 2021. I updated the spreadsheet with the lists of unillustrated articles for the 4 groups of people, sorted by pageviews: https://docs.google.com/spreadsheets/d/1FFCqwo0XsC6jJG3t7CGMhhOuzY9IQpNQTVyW8IbCNRM/edit?usp=sharing

Feb 26 2021, 11:07 AM · Discovery-Search (Current work)

Feb 25 2021

Miriam added a comment to T259067: Set up generation of JSON dumps for Wikimedia Commons.

@ArielGlenn thanks for clarifying this. I chatted with @Cormac on Slack and he explained how to get the image page_id from the current entity data. We can get this info from the "id" field. For example, a mediainfo slot with an id of M12345 corresponds to a page with an id 12345. Thanks both!
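The mapping described above (a mediainfo id of M12345 corresponds to page id 12345) can be sketched as a small helper. The function name is hypothetical, and the input validation is an added assumption rather than something stated in the thread.

```python
def mediainfo_id_to_page_id(entity_id: str) -> int:
    """Convert a mediainfo entity id such as 'M12345' to the page id 12345."""
    # Assumption: ids are the letter 'M' followed only by decimal digits.
    if not entity_id.startswith("M") or not entity_id[1:].isdigit():
        raise ValueError(f"not a mediainfo id: {entity_id!r}")
    return int(entity_id[1:])


page_id = mediainfo_id_to_page_id("M12345")
```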

Feb 25 2021, 11:52 AM · Dumps-Generation, Structured-Data-Backlog (Current Work), Datasets-General-or-Unknown, Analytics-Radar, Product-Analytics

Feb 24 2021

Miriam updated subscribers of T274675: lists for 200 most viewed historical figures in different regions.

Hi @Mstyles I can take a look at this on Friday!
If it is easy for you or @dcausse, having the Pid/Qid pairs corresponding to the four filters required would be very helpful:

people born in Africa
people born in The Caribbean
Australasian indigenous people
people born in North America
Feb 24 2021, 10:27 AM · Discovery-Search (Current work)

Feb 23 2021

Miriam added a comment to T259067: Set up generation of JSON dumps for Wikimedia Commons.

@ArielGlenn thanks a lot again for this!

Feb 23 2021, 10:58 AM · Dumps-Generation, Structured-Data-Backlog (Current Work), Datasets-General-or-Unknown, Analytics-Radar, Product-Analytics
Miriam added a comment to T259067: Set up generation of JSON dumps for Wikimedia Commons.

Thanks SO much @ArielGlenn, I am also downloading those on our stats machine and will check them once they are in!

Feb 23 2021, 7:56 AM · Dumps-Generation, Structured-Data-Backlog (Current Work), Datasets-General-or-Unknown, Analytics-Radar, Product-Analytics

Feb 16 2021

Miriam created T274878: Estimate the number of images added to each Wiki in a month.
Feb 16 2021, 11:52 AM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team

Feb 15 2021

Miriam added a comment to T272109: Assess prevalence of Wikidata infoboxes.

Hi @MMiller_WMF @Tgr thanks for your comments!

Feb 15 2021, 10:46 AM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team, Wikipedia-Android-App-Backlog

Feb 11 2021

Miriam added a comment to T273062: [L] Create tool to manually test image recommendations POC results.

Got it, @CBogen!

Feb 11 2021, 3:48 PM · SDAW-MediaSearch (MediaSearch-ImageRecs), Structured-Data-Backlog (Current Work), Image-Recommendations
Miriam added a comment to T273062: [L] Create tool to manually test image recommendations POC results.

The tool will evaluate 500 unillustrated articles from each wiki

@Miriam I suspect you already have such list of unillustrated articles from those wikis - can you tell me where I can find that?

Feb 11 2021, 3:36 PM · SDAW-MediaSearch (MediaSearch-ImageRecs), Structured-Data-Backlog (Current Work), Image-Recommendations

Feb 10 2021

Miriam added a comment to T274225: Multivariate logistic regression on search scores.

No prob, thanks @Cormac - then we probably should avoid normalization in this case, @Aiko?

Feb 10 2021, 12:23 PM · SDAW-MediaSearch (MediaSearch-ImageRecs), Structured-Data-Backlog (Current Work), Image-Recommendations, Structured Data Engineering, WikibaseMediaInfo
Miriam added a comment to T271799: [L] Implement new search profile(s) based on image search signal results .

Moving this into blocked for the minute, as we now have a more complete dataset that's being analysed atm, and would like that to complete before this is implemented

Thanks @Cparle. Can you link to the blocking ticket?

@CBogen blocking ticket is T274225

Feb 10 2021, 9:32 AM · MW-1.36-notes (1.36.0-wmf.37; 2021-03-30), SDAW-MediaSearch (MediaSearch-ImageRecs), Structured-Data-Backlog (Current Work), Image-Recommendations, Structured Data Engineering, WikibaseMediaInfo
Miriam added a comment to T274225: Multivariate logistic regression on search scores.

@Cparle we are thinking of normalizing the component scores before fitting the regression - is it possible to have Max and Min value (score) that each component can take? this would help make a normalization that is generalizable beyond the data you shared. Many thanks!
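To illustrate why fixed bounds matter here: with a known minimum and maximum per component, min-max scaling does not depend on the particular sample that was shared. A minimal sketch, assuming hypothetical bounds and a hypothetical function name:

```python
def min_max_normalize(score: float, lower: float, upper: float) -> float:
    """Scale a component score to [0, 1] using fixed, known bounds."""
    if upper <= lower:
        raise ValueError("upper bound must be greater than lower bound")
    # Clip first so out-of-range scores cannot leave the [0, 1] interval.
    clipped = min(max(score, lower), upper)
    return (clipped - lower) / (upper - lower)


# Hypothetical component whose scores are assumed to range from 0 to 10.
scaled = min_max_normalize(5.0, 0.0, 10.0)
```

If the bounds were instead estimated from the data at hand, the scaling would shift whenever new data fell outside the observed range.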

Feb 10 2021, 9:30 AM · SDAW-MediaSearch (MediaSearch-ImageRecs), Structured-Data-Backlog (Current Work), Image-Recommendations, Structured Data Engineering, WikibaseMediaInfo

Feb 8 2021

Miriam closed T266653: Based on the feedback gathered in Q1, update the taxonomy and the paper, and release second version as Resolved.

We summarized the actions we took to incorporate the community's feedback into the second version of the Taxonomy. Please find it here.

Feb 8 2021, 5:48 PM · Research (FY2020-21-Research-October-December)
Miriam updated the task description for T266653: Based on the feedback gathered in Q1, update the taxonomy and the paper, and release second version .
Feb 8 2021, 5:47 PM · Research (FY2020-21-Research-October-December)
Miriam added a comment to T273602: Access to analytics-privatedata-users for Research contractor AikoChou.

@Ottomata @CDanis thanks both!

Feb 8 2021, 2:34 PM · Research, SRE, SRE-Access-Requests
Miriam updated subscribers of T271799: [L] Implement new search profile(s) based on image search signal results .

@CBogen as discussed earlier today with @Cparle, @AikoChou and I are working on analyzing the data. @AikoChou will create a phab task with the description and results of the analysis she is doing (we just talked about it a couple of hours ago, and it's now late evening for her, so we will likely have it tomorrow morning or later today). I hope that works!

Feb 8 2021, 12:58 PM · MW-1.36-notes (1.36.0-wmf.37; 2021-03-30), SDAW-MediaSearch (MediaSearch-ImageRecs), Structured-Data-Backlog (Current Work), Image-Recommendations, Structured Data Engineering, WikibaseMediaInfo
Miriam added a comment to T272109: Assess prevalence of Wikidata infoboxes.

We just chatted with @AikoChou -- since we used the globalimagelinks table to get image information both for Wikidata and Wikipedia, these counts might still include icons. She will work on removing icons and recalculate these numbers (they probably won't change much).

Feb 8 2021, 9:53 AM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team, Wikipedia-Android-App-Backlog

Feb 5 2021

Miriam created T273968: Define Metrics for Survey-Based Knowledge Gaps.
Feb 5 2021, 12:15 PM · Research (FY2020-21-Research-January-March)
Miriam updated the task description for T272385: WikiWorkshop 2021.
Feb 5 2021, 11:57 AM · Research (FY2020-21-Research-January-March)

Feb 2 2021

Miriam updated subscribers of T273602: Access to analytics-privatedata-users for Research contractor AikoChou.

Hi @Dzahn! Aiko is not getting a wikimedia email address, unless needed. She will be working with us until June 30th, so if possible, she should have access until then. And yes, please put me as contact. Many thanks for working on this!

Feb 2 2021, 9:30 PM · Research, SRE, SRE-Access-Requests
Miriam added a comment to T272109: Assess prevalence of Wikidata infoboxes.

@AikoChou just joined as a contractor. Her first task will be to look into this while we wait for her to get server access.

Feb 2 2021, 2:53 PM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team, Wikipedia-Android-App-Backlog
Miriam reassigned T272109: Assess prevalence of Wikidata infoboxes from Miriam to AikoChou.
Feb 2 2021, 11:55 AM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team, Wikipedia-Android-App-Backlog
Miriam created T273602: Access to analytics-privatedata-users for Research contractor AikoChou.
Feb 2 2021, 11:12 AM · Research, SRE, SRE-Access-Requests

Jan 26 2021

Miriam added a comment to T272107: Image match local-language metadata coverage.

I updated the spreadsheet with percentages of metadata coverage taken from a sample of 20k images (or less, when the total number of candidates is less than 20k). Numbers are slightly lower, but the overall picture doesn't change.
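The sampling rule described above (20k images, or all of them when fewer are available) can be sketched as follows. The function name, the fixed seed, and the in-memory list are assumptions for illustration; the actual sampling code is not shown in the task.

```python
import random


def sample_candidates(candidates, k=20_000, seed=0):
    """Return up to k candidates, or all of them if fewer than k exist."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(candidates, min(k, len(candidates)))


small_wiki = sample_candidates(list(range(50)))
large_wiki = sample_candidates(list(range(30_000)))
```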

Jan 26 2021, 12:13 PM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team, Wikipedia-Android-App-Backlog

Jan 22 2021

Miriam added a comment to T272106: Image matching algorithm coverage.

@MMiller_WMF I added another sheet to the coverage spreadsheet containing the coverage numbers for unillustrated articles, excluding, as requested, disambiguation pages, list articles, and year articles. Please let me know if anything else is needed. Feel free to resolve this task if not.

Jan 22 2021, 3:58 PM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team, Wikipedia-Android-App-Backlog
Miriam added a comment to T272107: Image match local-language metadata coverage.

@MMiller_WMF, in the second sheet of this spreadsheet you can find data about image metadata coverage in local languages. You will find % data only, as I computed these stats on a random sample of 5k image suggestions for each Wiki (due to time/computational constraints).
Although numbers won't change much, I want to get stats from a larger sample. I will now sample ~20k image suggestions per language, and let the crawler run. I will then post the results here.

Jan 22 2021, 11:30 AM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team, Wikipedia-Android-App-Backlog
Miriam added a comment to T272447: Extract a list of the 200 most viewed black historical figures from WDQS.

Hi @MPhamWMF, I modified the Wikidata query to include historical people only, and added the Wiki url - here is the result, let me know if this works! https://docs.google.com/spreadsheets/d/1FFCqwo0XsC6jJG3t7CGMhhOuzY9IQpNQTVyW8IbCNRM/edit?usp=sharing

Jan 22 2021, 10:10 AM · Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Jan 21 2021

Miriam added a comment to T272447: Extract a list of the 200 most viewed black historical figures from WDQS.

I joined the data from @dcausse 's query with our unillustrated article list for English Wikipedia, then queried the pageview API to get pageviews for December 2020, here you have the result: https://docs.google.com/spreadsheets/d/1FFCqwo0XsC6jJG3t7CGMhhOuzY9IQpNQTVyW8IbCNRM/edit?usp=sharing

Jan 21 2021, 7:15 PM · Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
Miriam added a comment to T272447: Extract a list of the 200 most viewed black historical figures from WDQS.

@Mstyles at what granularity do you need pageview counts? We can use the webrequest or pageview_hourly tables from Hive if we want to access pageviews in the last 3 months, or the pageview API for data aggregated monthly, e.g.: https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Serena_Williams/monthly/2015100100/2015103100
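As an illustration of the pageview API option, the sketch below only builds a request URL of the same shape as the example above; it makes no network call. The function name and default arguments are assumptions.

```python
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def per_article_url(project, article, start, end,
                    access="all-access", agent="all-agents",
                    granularity="monthly"):
    """Build a per-article pageviews URL; timestamps use the YYYYMMDDHH form."""
    # Titles use underscores for spaces; other special characters are percent-encoded.
    title = quote(article.replace(" ", "_"), safe="")
    return f"{BASE}/{project}/{access}/{agent}/{title}/{granularity}/{start}/{end}"


url = per_article_url("en.wikipedia", "Serena Williams", "2015100100", "2015103100")
```

The resulting URL can then be fetched with any HTTP client.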

Jan 21 2021, 9:42 AM · Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Jan 20 2021

Miriam added a comment to T272447: Extract a list of the 200 most viewed black historical figures from WDQS.

Hi @Gehel! Are you looking at English Wikipedia only? Or all Wikis?
I can send you a list of QIDs for unillustrated enwiki pages which match the QIDs returned by a Wikidata query similar to this: https://w.wiki/v3P?

Jan 20 2021, 2:08 PM · Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Jan 19 2021

Miriam created T272385: WikiWorkshop 2021.
Jan 19 2021, 3:05 PM · Research (FY2020-21-Research-January-March)
Miriam added a comment to T272106: Image matching algorithm coverage.

@MMiller_WMF here you have an initial spreadsheet with coverage statistics: https://docs.google.com/spreadsheets/d/1IKi0mQ4MZRATVOPPaMr_tX6sOzhjL8INc0Pt_RwwUqs/edit?usp=sharing
I followed your schema, and just added one column which is "has at least one candidate", to give a better idea of the overall algorithm coverage.
Please let me know if this works!

Jan 19 2021, 9:17 AM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team, Wikipedia-Android-App-Backlog

Jan 18 2021

Miriam added a comment to T272106: Image matching algorithm coverage.

@Marshall - this is not very easy to do with the current version that I have, as I should parse the "instance of" property. Let me first pull the numbers with the alg as is, and then do this on a second iteration!

Jan 18 2021, 10:42 AM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team, Wikipedia-Android-App-Backlog

Jan 14 2021

Miriam added a comment to T184744: Improve access to Commons image data for research and development.

Hi! thanks @fgiunchedi for this info! This is what I have to get/download image URLs given the filename:
Take this url: https://upload.wikimedia.org/wikipedia/commons/thumb/a/a8/Tour_Eiffel_Wikimedia_Commons.jpg/600px-Tour_Eiffel_Wikimedia_Commons.jpg

  • The first part is always the same: https://upload.wikimedia.org/wikipedia/commons/ this should be changed with the swift url
  • The second part is the first character of the MD5 hash of the file name. For example, the MD5 hash of Tour_Eiffel_Wikimedia_Commons.jpg is a85d416ee427dfaee44b9248229a9cdd, so we get /a.
  • The third part is the first two characters of the MD5 hash from above: /a8.
  • The fourth part is the file name: /Tour_Eiffel_Wikimedia_Commons.jpg
  • Then you have the thumbnail size, e.g. 600px, and again the file name /600px-Tour_Eiffel_Wikimedia_Commons.jpg
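The steps above can be sketched as follows. This is an illustration under stated assumptions, not the script referred to in the comment: the function name and defaults are hypothetical, and the `thumb/` path segment is taken from the example URL rather than from the list.

```python
import hashlib
from urllib.parse import quote


def commons_thumb_url(filename, width=600,
                      base="https://upload.wikimedia.org/wikipedia/commons"):
    """Build a Commons thumbnail URL from a file name and a pixel width."""
    name = filename.replace(" ", "_")
    # The two directory levels come from the MD5 hash of the file name.
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    encoded = quote(name)  # percent-encode special characters for the URL
    return f"{base}/thumb/{digest[0]}/{digest[:2]}/{encoded}/{width}px-{encoded}"


url = commons_thumb_url("Tour_Eiffel_Wikimedia_Commons.jpg")
```

Passing a different `base` would cover the case where the public prefix is swapped for the swift URL.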

My script:

Jan 14 2021, 12:04 PM · User-ArielGlenn

Jan 12 2021

Miriam added a comment to T268350: Improve algorithm for unillustrated article selection.

So I re-ran the unillustrated article detection using:

  • The new per-wiki thresholds calculated as per the previous post
  • An additional anti join with the page_props table to discard all articles having a page image
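The anti join in the second item can be illustrated in plain Python. This is a sketch only: the real job presumably ran against the cluster tables, the function name is hypothetical, and the choice of the `page_image` and `page_image_free` property names (the ones the PageImages extension writes to page_props) is an assumption about which rows were used.

```python
# Assumed property names marking a page as having a page image.
PAGE_IMAGE_PROPS = {"page_image", "page_image_free"}


def drop_pages_with_page_image(candidate_page_ids, page_props_rows):
    """Anti join: keep candidates with no page-image row in page_props.

    page_props_rows is an iterable of (pp_page, pp_propname) tuples.
    """
    illustrated = {
        pp_page for pp_page, pp_propname in page_props_rows
        if pp_propname in PAGE_IMAGE_PROPS
    }
    return [pid for pid in candidate_page_ids if pid not in illustrated]


# Toy data: page 2 has a page image, page 3 only has an unrelated property.
remaining = drop_pages_with_page_image(
    [1, 2, 3],
    [(2, "page_image_free"), (3, "wikibase_item")],
)
```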

I then ran the image matching algorithm on top of the new set of unillustrated articles. I eyeballed the results and they looked more consistent than the earlier version. Below are the quantitative / coverage results:
kowiki

number of unillustrated articles: 273305
number of articles items with Wikidata image: 15983
number of articles items with Wikidata Commons Category: 28324
number of articles items with Language Links: 83995

arwiki

number of unillustrated articles: 580284
number of articles items with Wikidata image: 7028
number of articles items with Wikidata Commons Category: 26526
number of articles items with Language Links: 121891

viwiki

number of unillustrated articles: 867565
number of articles items with Wikidata image: 49226
number of articles items with Wikidata Commons Category: 57548
number of articles items with Language Links: 117138

cswiki

number of unillustrated articles: 181867
number of articles items with Wikidata image: 8337
number of articles items with Wikidata Commons Category: 21120
number of articles items with Language Links: 69413

frwiki

number of unillustrated articles: 951319
number of articles items with Wikidata image: 10938
number of articles items with Wikidata Commons Category: 39457
number of articles items with Language Links: 236592

enwiki

number of unillustrated articles: 2922830
number of articles items with Wikidata image: 36412
number of articles items with Wikidata Commons Category: 92072
number of articles items with Language Links: 325534
Jan 12 2021, 4:16 PM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team, Wikipedia-Android-App-Backlog
Miriam added a comment to T268350: Improve algorithm for unillustrated article selection.

Work done by @Swagoel for icon detection:

Jan 12 2021, 11:58 AM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team, Wikipedia-Android-App-Backlog

Jan 11 2021

Miriam closed T266650: A map of visual knowledge gaps as Resolved.

I reported and extended the analysis above on Meta:
https://meta.wikimedia.org/wiki/Research:Map_of_Visual_Knowledge_Gaps

Jan 11 2021, 12:30 PM · Research (FY2020-21-Research-October-December)

Jan 8 2021

Miriam moved T266271: Testing image recommendations with V3 from FY2020-21-Research-October-December to FY2020-21-Research-January-March on the Research board.
Jan 8 2021, 3:35 PM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team
Miriam moved T266655: Quantifying the importance of images in Wikipedia from FY2020-21-Research-October-December to FY2020-21-Research-January-March on the Research board.
Jan 8 2021, 3:35 PM · Research (FY2020-21-Research-January-March)
Miriam updated subscribers of T266655: Quantifying the importance of images in Wikipedia.

End of quarter updates:

  • Qualitative: designed an MTurk task to expand the dataset of questions for reading comprehension. Generated a list of potentially "good" articles to be fed to this task: articles that are relatively long, have more than 1 image, and contain sections such as "description" or "characteristics". Annotated the articles with topic and popularity score to further filter out less useful articles.
  • Quantitative: @Daniram3 started analyzing metrics such as dwell time and session length (in time). Early results show that dwell time is significantly longer for pages with images (or image clicks), and session time is longer when browsing through pages with images (or when sessions include image clicks), even when controlling for page length and number of pages visited in a session. This is somewhat related to T265772.
Jan 8 2021, 3:32 PM · Research (FY2020-21-Research-January-March)
Miriam moved T268350: Improve algorithm for unillustrated article selection from FY2020-21-Research-October-December to FY2020-21-Research-January-March on the Research board.
Jan 8 2021, 3:11 PM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team, Wikipedia-Android-App-Backlog
Miriam moved T256081: Image matching algorithm from FY2020-21-Research-October-December to FY2020-21-Research-January-March on the Research board.
Jan 8 2021, 3:11 PM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team, Wikipedia-Android-App-Backlog

Dec 16 2020

Miriam added a comment to T266271: Testing image recommendations with V3.

@MMiller_WMF, I ran v3 on the other languages, here are the overall results:

Dec 16 2020, 6:14 PM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team
Miriam added a comment to T262668: WMF media storage must be adequately backed up in a remote location.

@jcrespo originals would be great, too, I only thought of thumbnails because they generally require less space.

Dec 16 2020, 9:42 AM · Data-Persistence-Backup, Epic, Goal, SRE, SRE-swift-storage
Miriam updated subscribers of T262668: WMF media storage must be adequately backed up in a remote location.

Hi @Ottomata, thanks for the ping! Getting a copy of Commons (thumbnails only would be fine) which is directly accessible via stat machines would be amazing! Adding @fkaelin as we also chatted about this during recent conversations about the pain points of image work.

Dec 16 2020, 9:33 AM · Data-Persistence-Backup, Epic, Goal, SRE, SRE-swift-storage

Dec 11 2020

Miriam added a comment to T266271: Testing image recommendations with V3.

Thanks, Marshall!
Here is the new file for English with the change you requested to the date field: https://drive.google.com/file/d/1kVB5krC9SyFxvJqwehWW8IDRPQ2NCR_b/view?usp=sharing
Below is the metadata coverage:

missing descriptions: 27%
missing captions: 95%
missing categories: 0%
missing depicts: 92%

Regarding running this for Arabic Wikipedia: could you clarify what you would like me to do? Would you like me to find image matches for unillustrated articles in Arabic and then generate metadata for those, or would you like me to check for the presence of Arabic metadata on these image candidates?

Dec 11 2020, 5:08 PM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team

Dec 10 2020

Miriam updated subscribers of T266650: A map of visual knowledge gaps.

More insights on Wikidata images vs topics. @FRomeo_WMF, this might be of interest to you.

Dec 10 2020, 1:49 PM · Research (FY2020-21-Research-October-December)
Miriam added a comment to T266650: A map of visual knowledge gaps.

More analysis on Wikidata, since @CBogen asked a while ago.

Dec 10 2020, 1:10 PM · Research (FY2020-21-Research-October-December)

Dec 9 2020

Miriam added a comment to T266271: Testing image recommendations with V3.

Hi @MMiller_WMF, I modified the code to parse the HTML of the Commons page (there must be a better way to do this, but for now this is what we have) - it now includes more descriptions (only 40% are still missing) and all the additional data you requested. Some copyright statements and dates are not available in structured data, and for now these are ignored. Please check the sample data attached and let me know if it looks good. If so, I can run it at large scale and share more suggestions.

Dec 9 2020, 12:48 PM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team

Dec 3 2020

Miriam added a comment to T266271: Testing image recommendations with V3.

Hi @MMiller_WMF, that's interesting. I used some code to parse the HTML of the Commons page; maybe there are different ways of marking the description and I missed some of them. I'll double-check. I will also have to extend the code to get the additional information you need, which will take at least a few days.

Dec 3 2020, 10:16 AM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team

Nov 27 2020

Miriam closed T268346: Restructure the code for V3 of image recommendation algorithm, a subtask of T256081: Image matching algorithm, as Resolved.
Nov 27 2020, 3:44 PM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team, Wikipedia-Android-App-Backlog
Miriam closed T268346: Restructure the code for V3 of image recommendation algorithm as Resolved.

Closing this for now as all points were addressed.

Nov 27 2020, 3:44 PM · Image-Recommendations, Research (FY2020-21-Research-October-December), Growth-Team

Nov 25 2020

Miriam added a comment to T267314: Access to analytics-privatedata-users for Research volunteer Swagoel.

@Swagoel could you try to connect to the notebooks now as I showed you earlier, and let us know if it works?

Nov 25 2020, 6:49 PM · Research, SRE, SRE-Access-Requests
Miriam added a comment to T267314: Access to analytics-privatedata-users for Research volunteer Swagoel.

@Dzahn many thanks!!

Nov 25 2020, 6:48 PM · Research, SRE, SRE-Access-Requests
Miriam reopened T267314: Access to analytics-privatedata-users for Research volunteer Swagoel as "Open".

Re-opening this for a quick request.
@herron could you add @Swagoel to the LDAP-group so that she can access the SWAP notebooks as well?

Nov 25 2020, 6:30 PM · Research, SRE, SRE-Access-Requests
Miriam added a comment to T268346: Restructure the code for V3 of image recommendation algorithm.

Approximate coverage statistics (estimated from a sample of 50k articles with initial candidate suggestions extracted with V3):

  • Coverage before filtering: 500k out of 3M unillustrated articles (17%)
  • First round of filtering: removing invalid image candidates (flags, SVGs, image placeholders): discards 55% of articles with suggestions, leaving 7.5% of unillustrated articles with potential candidates
  • Second round of filtering: removing images that are on-wiki only: discards a further 12% of articles with suggestions, leaving 5.3% of unillustrated articles with potential candidates
  • Out of the remaining candidates, the metadata coverage is the following:
    • missing descriptions: 59%
    • missing captions: 96%
    • missing categories: 0%
    • missing structured data: 92%
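
The filtering funnel above can be checked with a little arithmetic. This is only a rough sketch, assuming both discard percentages are measured against the articles that had suggestions before any filtering; that reading is an assumption, not something stated in the comment, and the variable names are illustrative.

```python
# Rough check of the coverage funnel reported above.
# Assumption: both discard rates are fractions of the articles that had
# suggestions before filtering (not of the previous step's output).
unillustrated = 3_000_000     # unillustrated articles (from the comment)
with_suggestions = 500_000    # articles with candidates before filtering

discard_invalid = 0.55        # flags, SVGs, image placeholders
discard_on_wiki_only = 0.12   # images that are on-wiki only

initial = with_suggestions / unillustrated
after_first = initial * (1 - discard_invalid)
after_second = initial * (1 - discard_invalid - discard_on_wiki_only)

for label, share in [
    ("before filtering", initial),
    ("after first round", after_first),
    ("after second round", after_second),
]:
    print(f"{label}: {share:.1%} of unillustrated articles")
```

Under this reading the first round lands on the reported 7.5%, and the second round comes out slightly above the reported 5.3%, which is plausible given that the figures are approximate estimates from a 50k sample.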
Nov 25 2020, 11:23 AM · Image-Recommendations, Research (FY2020-21-Research-October-December), Growth-Team

Nov 24 2020

Miriam added a comment to T268346: Restructure the code for V3 of image recommendation algorithm.

Refactored code is now available on stat1005.

stat1005.eqiad.wmnet:/home/mirrys/ImageRecommendation/V3:
- retrieve_image_candidates.ipynb
- prioritize_clean.ipynb
  • retrieve_image_candidates.ipynb discovers unillustrated articles and finds potential image matches
  • prioritize_clean.ipynb filters out bad image candidates and generates good image suggestions for unillustrated articles, together with image captions, descriptions, categories, and structured data when available
Nov 24 2020, 12:33 PM · Image-Recommendations, Research (FY2020-21-Research-October-December), Growth-Team
Miriam added a comment to T267314: Access to analytics-privatedata-users for Research volunteer Swagoel.

@herron, apologies, just saw this now. Thank you so much! I will work on onboarding @Swagoel in the coming days.

Nov 24 2020, 10:03 AM · Research, SRE, SRE-Access-Requests

Nov 20 2020

Miriam added a comment to T266271: Testing image recommendations with V3.

Hi Marshall,
I did a general cleanup of the methodology, basically to:

  1. Exclude all .svg images from the potential suggestions
  2. Exclude all flags as suggested

There are suggestions for around 13k out of 50k articles in here: https://drive.google.com/file/d/1aMlYXP8eKORx8V0m98dIUNcpmCNrADFI/view?usp=sharing
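
The two exclusion rules above could be sketched roughly as follows. This is a hypothetical illustration, not the code used to produce the file: the function name `is_valid_candidate`, the sample file names, and the idea of detecting flags by the word "flag" in the file name are all assumptions.

```python
def is_valid_candidate(filename: str) -> bool:
    """Return True if an image file name passes both exclusion rules."""
    name = filename.lower()
    # Rule 1: exclude all .svg images.
    if name.endswith(".svg"):
        return False
    # Rule 2: exclude flags. Assumption: a flag is detected by the word
    # "flag" in the file name; the real check may rely on other signals.
    if "flag" in name:
        return False
    return True


# Hypothetical sample of candidate file names.
candidates = ["Flag_of_France.png", "Logo.svg", "Eiffel_Tower_at_night.jpg"]
kept = [c for c in candidates if is_valid_candidate(c)]
print(kept)
```

A plain substring match like this would also drop unrelated names that happen to contain "flag", so a category-based check may be a safer choice in practice.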

Nov 20 2020, 5:58 PM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team
Miriam created T268352: Improve list of image candidates to discard.
Nov 20 2020, 3:44 PM · Image-Recommendations, Research (FY2020-21-Research-October-December), Growth-Team
Miriam created T268350: Improve algorithm for unillustrated article selection.
Nov 20 2020, 3:38 PM · Research (FY2020-21-Research-January-March), Image-Recommendations, Growth-Team, Wikipedia-Android-App-Backlog