diego (Diego S-T)
User

User Details

User Since
Aug 8 2017, 10:56 AM (106 w, 3 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
Diego (WMF) [ Global Accounts ]

Recent Activity

Wed, Aug 14

diego added a comment to T230348: What are your experiences with templates?.

You might be interested in this project: T221211

Wed, Aug 14, 8:31 AM · Wikimania-Hackathon-2019

Tue, Aug 13

diego updated the task description for T230059: Introduction to cross-lingual word-embeddings at Wikimania 2019.
Tue, Aug 13, 11:27 PM · Research, Wikimania-Hackathon-2019

Wed, Aug 7

leila awarded T230059: Introduction to cross-lingual word-embeddings at Wikimania 2019 a Love token.
Wed, Aug 7, 8:02 PM · Research, Wikimania-Hackathon-2019
diego created T230059: Introduction to cross-lingual word-embeddings at Wikimania 2019.
Wed, Aug 7, 6:51 PM · Research, Wikimania-Hackathon-2019

Mon, Aug 5

diego updated the task description for T229267: Plan for Research team's acitivities during Wikimania 2019.
Mon, Aug 5, 3:48 PM · Research-management, Research
diego updated the task description for T229267: Plan for Research team's acitivities during Wikimania 2019.
Mon, Aug 5, 3:34 PM · Research-management, Research
diego updated the task description for T229267: Plan for Research team's acitivities during Wikimania 2019.
Mon, Aug 5, 3:32 PM · Research-management, Research

Thu, Aug 1

diego created T229595: Literature review on mis/disformation on Wikipedia.
Thu, Aug 1, 4:58 PM · Research

Mon, Jul 29

diego added a comment to T229242: Explore ways to restrict suggestions to a given knowledge area.

About this:

Mon, Jul 29, 4:40 PM · CX-boost, Language-Team (Language-2019-July-September), WorkType-NewFunctionality, Design

Jul 11 2019

diego closed T210530: Expose section mappings via an API as Resolved.
Jul 11 2019, 4:00 PM · Research-Backlog
diego closed T210530: Expose section mappings via an API, a subtask of T203046: Output 1.4: Public test APIs corresponding to section recommendation algorithms, as Resolved.
Jul 11 2019, 4:00 PM · Epic, address-knowledge-gaps
diego added a comment to T210530: Expose section mappings via an API.

This has been solved here: https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_articles_across_languages/Inter_language_approach#Results
Check an example here: https://secrec.wmflabs.org/API/alignment/en/ja/Work

Jul 11 2019, 3:59 PM · Research-Backlog
diego removed a project from T215349: [Blog] Write a blog post about Crosslingual section alignment and recommendations. : Research.
Jul 11 2019, 3:52 PM · Research, Wikimedia-Blog-Content
diego removed a project from T186559: Provide data dumps in the Analytics Data Lake: Research.
Jul 11 2019, 3:51 PM · Analytics

Jul 10 2019

diego moved T215349: [Blog] Write a blog post about Crosslingual section alignment and recommendations. from In Progress to Staged on the Research board.
Jul 10 2019, 11:28 AM · Research, Wikimedia-Blog-Content
diego merged task T203047: Output 1.5: The first version of the algorithm that prioritizes missing sections into T227651: first version of the algorithm that prioritizes missing sections .
Jul 10 2019, 11:26 AM · Epic, address-knowledge-gaps
diego merged T203047: Output 1.5: The first version of the algorithm that prioritizes missing sections into T227651: first version of the algorithm that prioritizes missing sections .
Jul 10 2019, 11:26 AM · Research
diego closed T221211: Parameters matching on Templates: ML Exploration as Resolved.
Jul 10 2019, 11:24 AM · Language-Team (Language-2019-July-September), ContentTranslation, Research
diego added a comment to T221211: Parameters matching on Templates: ML Exploration .

@Pginer-WMF, I'm going to mark this task as resolved from my side, and we can continue the follow-up somewhere else, OK?

Jul 10 2019, 11:22 AM · Language-Team (Language-2019-July-September), ContentTranslation, Research
diego closed T227651: first version of the algorithm that prioritizes missing sections as Resolved.
Jul 10 2019, 11:17 AM · Research
diego added a comment to T227651: first version of the algorithm that prioritizes missing sections .

Considering the feedback obtained in T225136, we conclude that the prioritization should be adapted to the characteristics of the editor being assisted. We can split editors into two disjoint groups, generating two different types of recommendations:

Jul 10 2019, 11:17 AM · Research
diego closed T203046: Output 1.4: Public test APIs corresponding to section recommendation algorithms as Resolved.
Jul 10 2019, 11:06 AM · Epic, address-knowledge-gaps
diego created T227651: first version of the algorithm that prioritizes missing sections .
Jul 10 2019, 11:01 AM · Research

Jul 5 2019

diego updated subscribers of T221891: [REQUEST] En Wiki pageviews by topic. Rough cut..
Jul 5 2019, 3:30 PM · Product-Analytics
diego updated subscribers of T221891: [REQUEST] En Wiki pageviews by topic. Rough cut..

Hi all,
I think this use case highlights the need for a canonical (standardized) cross-lingual topic model that we could all use as the reference for all projects within the WMF.

Jul 5 2019, 3:30 PM · Product-Analytics
diego updated subscribers of T221891: [REQUEST] En Wiki pageviews by topic. Rough cut..
Jul 5 2019, 3:11 PM · Product-Analytics

Jun 13 2019

diego added a comment to T221211: Parameters matching on Templates: ML Exploration .

Oh, I see! That number is a distance, so 0 would be a perfect match and 1 no match at all. I've already set an upper bound of 0.45, so you will only see values lower than that.
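
As a minimal sketch of the thresholding described above (not the code used in the experiments), the filter below keeps only candidate pairs whose distance is below the 0.45 upper bound and sorts them best first. The function name, the triple layout, and the sample parameter names are hypothetical.

```python
# Upper bound on the distance, as mentioned in the comment above.
MAX_DISTANCE = 0.45


def filter_matches(candidates, max_distance=MAX_DISTANCE):
    """Keep (source_param, target_param, distance) triples below the bound.

    The triple layout is an assumption made for illustration.
    Results are sorted so that the closest matches come first.
    """
    kept = [c for c in candidates if c[2] < max_distance]
    return sorted(kept, key=lambda c: c[2])


if __name__ == "__main__":
    # Hypothetical distances, for illustration only.
    pairs = [("name", "nombre", 0.12), ("date", "lugar", 0.61)]
    print(filter_matches(pairs))
```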

Jun 13 2019, 3:11 PM · Language-Team (Language-2019-July-September), ContentTranslation, Research
diego added a comment to T221211: Parameters matching on Templates: ML Exploration .

If I understood correctly, you are asking why two identical strings do not have distance = 0. This is because there is no string-matching mechanism in this approach. Each language is trained separately, and the models are then aligned using some words or sentences that we know are equivalent. This is not necessarily bad, because some words are written exactly the same but mean different things in each language. However, in the examples you show, this is just part of the noise introduced by the model.
We could add a second step, for example using Levenshtein distance, that would take advantage of string similarity, but it would work only for languages that share a script. If we had some training data, we could learn how to mix these two approaches and how useful the latter would be.
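
The second step suggested above could look roughly like the sketch below. It assumes the embedding distance is already computed and lies in [0, 1], and it mixes it with a length-normalized Levenshtein distance using a fixed weight. The function names and the `weight` parameter are hypothetical; as the comment notes, a real mixing weight would have to be learned from training data.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_levenshtein(a, b):
    """Edit distance scaled to [0, 1] by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def combined_distance(embedding_distance, a, b, weight=0.5):
    """Weighted mix of embedding distance and string distance.

    `weight` is a placeholder assumption, not a tuned value.
    """
    string_distance = normalized_levenshtein(a, b)
    return weight * embedding_distance + (1 - weight) * string_distance
```

With this mix, two identical strings contribute a string distance of 0, which pulls their combined distance toward 0 even when the embedding model adds noise.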

Jun 13 2019, 1:56 PM · Language-Team (Language-2019-July-September), ContentTranslation, Research

Jun 10 2019

Groceryheist awarded T186559: Provide data dumps in the Analytics Data Lake a Love token.
Jun 10 2019, 7:52 PM · Analytics

May 30 2019

diego updated subscribers of T221211: Parameters matching on Templates: ML Exploration .

Hi,
I have created and uploaded the full experiments and aligned parameters for these languages:

["es", "en", "fr", "ar", "ru", "uk", "pt", "vi", "zh", "ru", "he", "it", "ta", "id", "fa", "ca"]
May 30 2019, 10:40 PM · Language-Team (Language-2019-July-September), ContentTranslation, Research

May 23 2019

diego added a comment to T221211: Parameters matching on Templates: ML Exploration .

Sorry, I put the wrong link to the experiments in the previous comment; it is now updated.

May 23 2019, 5:04 PM · Language-Team (Language-2019-July-September), ContentTranslation, Research

May 20 2019

diego added a comment to T221211: Parameters matching on Templates: ML Exploration .

You can find the results of the experiments here.

May 20 2019, 10:08 AM · Language-Team (Language-2019-July-September), ContentTranslation, Research
diego updated the task description for T221211: Parameters matching on Templates: ML Exploration .
May 20 2019, 9:59 AM · Language-Team (Language-2019-July-September), ContentTranslation, Research
diego updated the task description for T221211: Parameters matching on Templates: ML Exploration .
May 20 2019, 9:57 AM · Language-Team (Language-2019-July-September), ContentTranslation, Research

May 16 2019

diego triaged T221211: Parameters matching on Templates: ML Exploration as High priority.
May 16 2019, 4:21 PM · Language-Team (Language-2019-July-September), ContentTranslation, Research

Apr 29 2019

diego closed T218908: Submit a paper about Article Recommendation to RecSys'19 as Resolved.
Apr 29 2019, 3:59 PM · Research
diego updated the task description for T221211: Parameters matching on Templates: ML Exploration .
Apr 29 2019, 3:55 PM · Language-Team (Language-2019-July-September), ContentTranslation, Research
diego updated the task description for T221211: Parameters matching on Templates: ML Exploration .
Apr 29 2019, 3:55 PM · Language-Team (Language-2019-July-September), ContentTranslation, Research

Apr 22 2019

diego moved T221211: Parameters matching on Templates: ML Exploration from Staged to In Progress on the Research board.
Apr 22 2019, 4:27 PM · Language-Team (Language-2019-July-September), ContentTranslation, Research

Apr 17 2019

diego created T221211: Parameters matching on Templates: ML Exploration .
Apr 17 2019, 9:43 AM · Language-Team (Language-2019-July-September), ContentTranslation, Research

Apr 16 2019

diego moved T190772: Build the first version of section recommender by fusing the synonym and translator models from Staged to Done (current quarter) on the Research board.
Apr 16 2019, 10:16 PM · Research-2017-18-Q4, Research
diego moved T215349: [Blog] Write a blog post about Crosslingual section alignment and recommendations. from Time Sensitive to In Progress on the Research board.
Apr 16 2019, 10:16 PM · Research, Wikimedia-Blog-Content
diego moved T215347: Section Recommendation (Interlingual): User Feedback from Staged to Done (current quarter) on the Research board.
Apr 16 2019, 7:02 PM · Research
diego moved T215349: [Blog] Write a blog post about Crosslingual section alignment and recommendations. from Staged to Time Sensitive on the Research board.
Apr 16 2019, 7:02 PM · Research, Wikimedia-Blog-Content
diego moved T218908: Submit a paper about Article Recommendation to RecSys'19 from Staged to In Progress on the Research board.
Apr 16 2019, 7:02 PM · Research
diego added a comment to T215347: Section Recommendation (Interlingual): User Feedback.

From the feedback page (https://meta.wikimedia.org/wiki/Research_talk:Expanding_Wikipedia_articles_across_languages/Inter_language_approach/Feedback) set up by @Capt_Swing, we got the following two main points:

Apr 16 2019, 6:33 PM · Research
diego moved T215348: Improve the inter-lingual section recommender system from Staged to Done (current quarter) on the Research board.
Apr 16 2019, 6:27 PM · Research
diego moved T216588: Investigate whether data can be recovered for GII from Time Sensitive to Done (current quarter) on the Research board.
Apr 16 2019, 6:26 PM · Research
diego updated subscribers of T203043: Output 1.1: Improved section recommendation algorithm with user-feedback.

From the feedback page set up by @Capt_Swing, we got the following two main points:

Apr 16 2019, 6:23 PM · Epic, address-knowledge-gaps
diego added a comment to T190772: Build the first version of section recommender by fusing the synonym and translator models.

This has been done and tracked in T215348
Documentation can be found here: https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_articles_across_languages/Inter_language_approach#Section_Recommendation

Apr 16 2019, 6:06 PM · Research-2017-18-Q4, Research
diego updated the task description for T215348: Improve the inter-lingual section recommender system .
Apr 16 2019, 6:01 PM · Research
diego added a comment to T215348: Improve the inter-lingual section recommender system .

Fixed JSON format issues for the APIs. Now they are working correctly.

Apr 16 2019, 6:01 PM · Research
diego closed T206244: Prepare and give remote talk about Wikimedia projects and AI technology at WikiConference Seoul 2018. as Resolved.
Apr 16 2019, 5:59 PM · Research
diego closed T209597: Give a talk about Wikimedia Public Resources for Research at NYU Center for Data Science as Resolved.
Apr 16 2019, 5:59 PM · Research-outreach, Research

Apr 10 2019

diego updated the task description for T215348: Improve the inter-lingual section recommender system .
Apr 10 2019, 12:01 AM · Research

Apr 9 2019

diego added a comment to T215348: Improve the inter-lingual section recommender system .

The documentation for the updated version of the section recommender system can be found here:
https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_articles_across_languages/Inter_language_approach#Section_Recommendation

Apr 9 2019, 11:58 PM · Research
diego added a comment to T203046: Output 1.4: Public test APIs corresponding to section recommendation algorithms.

API up and running, please check the documentation here: https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_articles_across_languages/Inter_language_approach#Section_Recommendation

Apr 9 2019, 11:24 PM · Epic, address-knowledge-gaps
diego closed T219186: Peer Review for WebSci'19 as Resolved.
Apr 9 2019, 10:08 PM · Research

Mar 25 2019

diego created T219186: Peer Review for WebSci'19.
Mar 25 2019, 5:02 PM · Research
diego added a comment to T216588: Investigate whether data can be recovered for GII.

The solution to this problem was the following:

Mar 25 2019, 4:59 PM · Research
diego added a comment to T212824: notebook/stat server(s) running out of memory.

Finally it's not just me squeezing notebooks memory :)

Mar 25 2019, 4:45 PM · Patch-For-Review, Product-Analytics, User-Elukey, Operations, Analytics

Mar 21 2019

diego created T218908: Submit a paper about Article Recommendation to RecSys'19.
Mar 21 2019, 3:26 PM · Research

Feb 21 2019

diego added a comment to T215616: Improve interlingual links across wikis through Wikidata IDs.

I think we are talking about three different things:

Feb 21 2019, 12:31 PM · Research-Backlog, MediaWiki-General, Wikidata, Analytics

Feb 19 2019

diego added a comment to T215616: Improve interlingual links across wikis through Wikidata IDs.

@JAllemandou , yes. Having this by revision would be great!

Feb 19 2019, 7:10 PM · Research-Backlog, MediaWiki-General, Wikidata, Analytics

Feb 12 2019

diego added a comment to T215616: Improve interlingual links across wikis through Wikidata IDs.

@Tbayer , great. Thanks.

Feb 12 2019, 12:16 AM · Research-Backlog, MediaWiki-General, Wikidata, Analytics

Feb 11 2019

diego added a comment to T215616: Improve interlingual links across wikis through Wikidata IDs.

@jcrespo, the API works well for querying specific pages/entities, but not, for example, for finding out which pages that exist in X_wiki are missing from Y_wiki.
My point here is that the Wikidata identifier is currently the main identifier for a page/concept, and that this fact is not reflected in the DB structure. I understand that this might be due to historical reasons, but it would be good to think of a way for our DBs to make it easier to link content across wikis.

Feb 11 2019, 10:34 PM · Research-Backlog, MediaWiki-General, Wikidata, Analytics
diego added a comment to T215616: Improve interlingual links across wikis through Wikidata IDs.

@EBernhardson, this looks like exactly what I was looking for initially. Thank you very much for that.

Feb 11 2019, 7:51 PM · Research-Backlog, MediaWiki-General, Wikidata, Analytics
diego updated subscribers of T215616: Improve interlingual links across wikis through Wikidata IDs.
Feb 11 2019, 6:24 PM · Research-Backlog, MediaWiki-General, Wikidata, Analytics
diego added a comment to T215616: Improve interlingual links across wikis through Wikidata IDs.

Looks good @JAllemandou, thanks.
This is a good workaround, but IMHO we should have a structure or schema that makes this kind of task easier, especially for outside people without access to a cluster.

Feb 11 2019, 6:23 PM · Research-Backlog, MediaWiki-General, Wikidata, Analytics
diego added a comment to T213976: Workflow to be able to move data files computed in jobs from analytics cluster to production .

We do have one very large asset file at 1.9GB (word2vec embedding). I don't need that to be much bigger right now, but we're starting to discuss using embeddings more generally in the mid term and I don't have a good sense for how large they can become. @diego might have a better sense for how big these embeddings can be.

Feb 11 2019, 6:21 PM · Patch-For-Review, Research-Backlog, Operations, Discovery, Analytics
diego added a project to T215616: Improve interlingual links across wikis through Wikidata IDs: Wikidata.
Feb 11 2019, 1:02 PM · Research-Backlog, MediaWiki-General, Wikidata, Analytics
diego renamed T215616: Improve interlingual links across wikis through Wikidata IDs from Add (scoop) wikidatadawiki.wb_items_per_site MariaDB table to wmf_raw to Improve interlingual links across wikis through Wikidata IDs.
Feb 11 2019, 12:58 PM · Research-Backlog, MediaWiki-General, Wikidata, Analytics

Feb 8 2019

diego created T215616: Improve interlingual links across wikis through Wikidata IDs.
Feb 8 2019, 2:10 PM · Research-Backlog, MediaWiki-General, Wikidata, Analytics

Feb 7 2019

diego added a parent task for T215348: Improve the inter-lingual section recommender system : T203044: Output 1.2: Section recommendation algorithm in many languages.
Feb 7 2019, 9:27 PM · Research
diego added a subtask for T203044: Output 1.2: Section recommendation algorithm in many languages: T215348: Improve the inter-lingual section recommender system .
Feb 7 2019, 9:27 PM · Epic, address-knowledge-gaps
diego added a comment to T182849: Identify unhelpful file names on commons.

Check this notebook, apparently the number of white spaces is a pretty good indicator of the filename quality.

Feb 7 2019, 7:00 PM · Product-Analytics, Wikidata, Discovery-Analysis, SDC General

Feb 5 2019

diego created T215349: [Blog] Write a blog post about Crosslingual section alignment and recommendations. .
Feb 5 2019, 9:08 PM · Research, Wikimedia-Blog-Content
diego created T215348: Improve the inter-lingual section recommender system .
Feb 5 2019, 9:05 PM · Research
diego created T215347: Section Recommendation (Interlingual): User Feedback.
Feb 5 2019, 8:55 PM · Research
diego added a comment to T202490: Automate XML-to-parquet transformation for XML dumps (oozie job).

Thanks @JAllemandou !

Feb 5 2019, 8:34 PM · Patch-For-Review, Analytics-Kanban, Research, Analytics

Jan 15 2019

diego added a comment to T212493: Clean up staging db.

Hi! I'm not sure what this is, but you can certainly delete diego_tmp.
Thanks

Jan 15 2019, 6:01 PM · Analytics-Kanban, Analytics

Nov 28 2018

diego added a comment to T210433: Identify and release data on similar Wikidata items.

@bmansurov , eyeballing I can say:

Nov 28 2018, 12:48 AM · Research-Backlog

Nov 27 2018

diego added a comment to T210433: Identify and release data on similar Wikidata items.

Hey @bmansurov, the list in Spanish is over 11K. Maybe you could sample by cosine similarity and create a stratified sample. Doing 11K does not sound realistic to me.
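
One way to read the stratified-sampling suggestion above is sketched below: bucket the candidate pairs by cosine similarity into equal-width bins and draw a few pairs from each bin, so that the reviewed sample covers the whole similarity range. The triple layout, the number of bins, and the per-bin sample size are all assumptions made for illustration, and similarities are assumed to lie in [0, 1].

```python
import random
from collections import defaultdict


def stratified_sample(pairs, n_bins=5, per_bin=2, seed=0):
    """Sample (item_a, item_b, similarity) triples evenly across bins.

    Similarities are assumed to lie in [0, 1]; values outside that
    range are clamped into the first or last bin.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    bins = defaultdict(list)
    for pair in pairs:
        index = int(pair[2] * n_bins)
        index = max(0, min(index, n_bins - 1))
        bins[index].append(pair)

    sample = []
    for index in sorted(bins):
        members = bins[index]
        # Take at most `per_bin` pairs, or all of them if the bin is small.
        sample.extend(rng.sample(members, min(per_bin, len(members))))
    return sample
```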

Nov 27 2018, 6:26 PM · Research-Backlog

Nov 15 2018

diego created T209597: Give a talk about Wikimedia Public Resources for Research at NYU Center for Data Science.
Nov 15 2018, 3:30 PM · Research-outreach, Research

Nov 5 2018

diego added a comment to T208799: Add page_id column to wb_items_per_site .

@Krenair, I know that this field is needed in that table on the database located in analytics-store.eqiad.wmnet. I'm not sure what the procedure/dependencies are to do this, sorry.

Nov 5 2018, 11:43 PM · Wikidata
diego added a comment to T208799: Add page_id column to wb_items_per_site .

@Aklapper, I'm not sure which would be the proper tag. I don't see suggestions related to the MariaDB replicas.

Nov 5 2018, 11:40 PM · Wikidata
diego added a comment to T208799: Add page_id column to wb_items_per_site .

Hi @Aklapper ,
I'm referring to the MariaDB tables on analytics-store.eqiad.wmnet. I suppose that this requires a change in the schema.

Nov 5 2018, 11:37 PM · Wikidata
diego created T208799: Add page_id column to wb_items_per_site .
Nov 5 2018, 9:31 PM · Wikidata

Oct 18 2018

diego added a comment to T207096: Present the result of WtWRW at CEE Meeting.

Slides here: https://docs.google.com/presentation/d/1dGJdVEFrkmRjrqfGTqePkg7GIgPPKjZmeqIskwQYrCM/edit?usp=sharing

Oct 18 2018, 6:48 PM · Research

Oct 15 2018

diego created T207096: Present the result of WtWRW at CEE Meeting.
Oct 15 2018, 10:17 PM · Research

Oct 4 2018

diego renamed T206244: Prepare and give remote talk about Wikimedia projects and AI technology at WikiConference Seoul 2018. from Prepare and give talk about Wikimedia projects and AI technology for the Korean Chapter to Prepare and give remote talk about Wikimedia projects and AI technology at WikiConference Seoul 2018. .
Oct 4 2018, 6:11 PM · Research
diego created T206244: Prepare and give remote talk about Wikimedia projects and AI technology at WikiConference Seoul 2018. .
Oct 4 2018, 5:27 PM · Research

Oct 1 2018

diego updated the task description for T205650: Isaac: Systems and Programs onboarding.
Oct 1 2018, 5:31 PM · Research
diego updated the task description for T205650: Isaac: Systems and Programs onboarding.
Oct 1 2018, 5:30 PM · Research

Sep 26 2018

diego added a comment to T178249: Parameter for linking a new page to the Wikidata.

Kateryna is working on this: https://meta.wikimedia.org/wiki/Research:Matching_Red_Links_with_Wikidata_Items

Sep 26 2018, 10:48 PM · Wikidata

Sep 25 2018

diego added a comment to T190772: Build the first version of section recommender by fusing the synonym and translator models.

Makes sense! Thanks!

Sep 25 2018, 11:11 PM · Research-2017-18-Q4, Research
diego added a comment to T190772: Build the first version of section recommender by fusing the synonym and translator models.

@bmansurov, interesting. I've tried with 'uz' and also don't see anything repeated. Given that 'uz' is currently a single file, that makes me think it is something related to the parallelization.

Sep 25 2018, 9:41 PM · Research-2017-18-Q4, Research
diego added a comment to T190772: Build the first version of section recommender by fusing the synonym and translator models.

I'm cleaning my code and found that my parser produces duplicated output. Each row is present twice in the output. The two repeated rows are not adjacent, meaning that line 1 is not repeated in line 2, but in line X with X > 2. Can you please have a look here and try to guess what I am doing wrong?
I can certainly add a post-filter, but I would love to understand what is happening.
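
The post-filter mentioned above could be as small as the sketch below, which drops repeated rows even when the duplicates are not adjacent and keeps the first occurrence in its original position. It assumes rows are hashable (for example strings or tuples), and the function name is hypothetical. It only hides the symptom; it does not explain why the parser emits each row twice.

```python
def drop_duplicate_rows(rows):
    """Yield each row once, in first-seen order.

    Works for non-adjacent duplicates because every row seen so far
    is remembered in a set. Rows must be hashable.
    """
    seen = set()
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row
```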

Sep 25 2018, 6:59 PM · Research-2017-18-Q4, Research

Sep 6 2018

nettrom_WMF awarded T186559: Provide data dumps in the Analytics Data Lake a Love token.
Sep 6 2018, 11:02 PM · Analytics

Sep 1 2018

diego added a comment to T203263: Measure translation recommendations against the baseline.

Beyond my subjective opinion about these rankings, I'm not sure what I should evaluate here. I understand that in the paper there is already an evaluation methodology. Are you trying to measure something new that is not covered by that methodology?

Sep 1 2018, 12:27 AM · Research