
diego (Diego S-T)
User

User Details

User Since
Aug 8 2017, 10:56 AM (139 w, 23 h)
Availability
Available
LDAP User
Unknown
MediaWiki User
Diego (WMF) [ Global Accounts ]

Recent Activity

Wed, Apr 1

diego added a comment to T249078: Desired packages to be installed/upgraded on the PySpark cluster (jupyterhub).

Hey @elukey . Thanks for sharing, @Ottomata has talked about the general idea, but I was not aware of that document.

Wed, Apr 1, 6:45 AM · Scoring-platform-team, ORES, Analytics-Cluster, Analytics, Research
diego created T249078: Desired packages to be installed/upgraded on the PySpark cluster (jupyterhub).
Wed, Apr 1, 2:58 AM · Scoring-platform-team, ORES, Analytics-Cluster, Analytics, Research

Mon, Mar 30

diego added a comment to T224234: Research support for cross-wiki content propagation.

Weekly update:

Mon, Mar 30, 3:27 PM · Research (FY2019-20-Research-January-March), CX-boost, Language-Team (Language-2020-January-March)
diego added a comment to T243257: Start the first version of the Research Internships program. .

Week update:

Mon, Mar 30, 3:23 PM · Research (FY2019-20-Research-January-March)
diego added a comment to T243256: Measuring the consistency of information between Wikipedia articles and Wikidata items..

Weekly update:

  • No updates this week
Mon, Mar 30, 3:22 PM · Research (FY2019-20-Research-January-March)

Wed, Mar 18

diego added a comment to T243430: Basic service for mapping sections.

weekly update:

Wed, Mar 18, 3:23 PM · Language-Team (Language-2020-Focus-Sprint), Patch-For-Review, CX-boost

Wed, Mar 11

diego added a comment to T243430: Basic service for mapping sections.

As an experiment, I wrote a program that downloads the entire content translation parallel corpus from https://dumps.wikimedia.org/other/contenttranslation/20200214/ and finds all section title translations done so far. It took about 9 hours to parse the huge JSON files, and at the end we have a 31.7 GB SQLite database with this information.
Database: https://people.wikimedia.org/~santhosh/cx-section-titles-aligned.db

@santhosh, that file is only 30 MB. Where can I get the full 31.7 GB file?
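
For readers who want a feel for the approach quoted above, here is a minimal sketch of collecting section-title pairs into SQLite. It is not the program that produced the database: the record fields (`sourceLanguage`, `targetLanguage`, `sourceTitle`, `targetTitle`), the table layout, and the sample lines are all assumptions made for illustration, and the real dump format may differ.

```python
import json
import sqlite3

# Hypothetical input: one JSON object per line. The field names are assumed,
# not taken from the actual content translation dumps.
SAMPLE_LINES = [
    '{"sourceLanguage": "en", "targetLanguage": "es", "sourceTitle": "History", "targetTitle": "Historia"}',
    '{"sourceLanguage": "en", "targetLanguage": "es", "sourceTitle": "History", "targetTitle": "Historia"}',
    '{"sourceLanguage": "en", "targetLanguage": "fr", "sourceTitle": "History", "targetTitle": "Histoire"}',
]


def build_title_db(lines, conn):
    """Store each (language pair, title pair) once and count how often it occurs."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS titles (
               source_language TEXT,
               target_language TEXT,
               source_title TEXT,
               target_title TEXT,
               frequency INTEGER,
               PRIMARY KEY (source_language, target_language,
                            source_title, target_title)
           )"""
    )
    for line in lines:
        rec = json.loads(line)
        key = (
            rec["sourceLanguage"],
            rec["targetLanguage"],
            rec["sourceTitle"],
            rec["targetTitle"],
        )
        # Insert the pair with a zero count if it is new, then bump the count.
        conn.execute("INSERT OR IGNORE INTO titles VALUES (?, ?, ?, ?, 0)", key)
        conn.execute(
            """UPDATE titles SET frequency = frequency + 1
               WHERE source_language = ? AND target_language = ?
                 AND source_title = ? AND target_title = ?""",
            key,
        )
    conn.commit()


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    build_title_db(SAMPLE_LINES, db)
    for row in db.execute("SELECT * FROM titles ORDER BY target_title"):
        print(row)
```

Counting repeated pairs rather than storing every occurrence keeps the table compact, and the frequency can later serve as a rough confidence signal for a title translation.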

Wed, Mar 11, 8:50 PM · Language-Team (Language-2020-Focus-Sprint), Patch-For-Review, CX-boost

Mon, Mar 9

diego added a comment to T243256: Measuring the consistency of information between Wikipedia articles and Wikidata items..

Weekly update:

Mon, Mar 9, 4:12 PM · Research (FY2019-20-Research-January-March)
diego added a comment to T224234: Research support for cross-wiki content propagation.

Weekly update:

Mon, Mar 9, 4:12 PM · Research (FY2019-20-Research-January-March), CX-boost, Language-Team (Language-2020-January-March)
diego added a comment to T243257: Start the first version of the Research Internships program. .

Weekly update:

Mon, Mar 9, 4:11 PM · Research (FY2019-20-Research-January-March)

Mar 2 2020

diego added a comment to T243256: Measuring the consistency of information between Wikipedia articles and Wikidata items..

Weekly update

Mar 2 2020, 5:24 PM · Research (FY2019-20-Research-January-March)
diego added a comment to T224234: Research support for cross-wiki content propagation.

Weekly update:

Mar 2 2020, 5:09 PM · Research (FY2019-20-Research-January-March), CX-boost, Language-Team (Language-2020-January-March)
diego added a comment to T243257: Start the first version of the Research Internships program. .

Weekly update:

Mar 2 2020, 5:06 PM · Research (FY2019-20-Research-January-March)

Feb 27 2020

diego added a comment to T243430: Basic service for mapping sections.

Hi @santhosh! Yep, this is super useful. I was considering doing something similar, to find some ground truth for my approaches.

Feb 27 2020, 4:48 PM · Language-Team (Language-2020-Focus-Sprint), Patch-For-Review, CX-boost

Feb 21 2020

diego added a comment to T224234: Research support for cross-wiki content propagation.

Weekly update:

Feb 21 2020, 5:14 PM · Research (FY2019-20-Research-January-March), CX-boost, Language-Team (Language-2020-January-March)
diego added a comment to T243257: Start the first version of the Research Internships program. .

weekly update:

Feb 21 2020, 5:13 PM · Research (FY2019-20-Research-January-March)
diego added a comment to T243256: Measuring the consistency of information between Wikipedia articles and Wikidata items..

Weekly update:

Feb 21 2020, 5:13 PM · Research (FY2019-20-Research-January-March)

Feb 19 2020

diego updated the task description for T244819: wiki data for Global Innovation Index - 2019.
Feb 19 2020, 9:13 PM · Research
diego updated the task description for T244819: wiki data for Global Innovation Index - 2019.
Feb 19 2020, 9:12 PM · Research

Feb 18 2020

diego added a comment to T224234: Research support for cross-wiki content propagation.

I'll need approximately 3 weeks to finish this.

Feb 18 2020, 5:38 PM · Research (FY2019-20-Research-January-March), CX-boost, Language-Team (Language-2020-January-March)

Feb 17 2020

diego added a comment to T224234: Research support for cross-wiki content propagation.

weekly update:

Feb 17 2020, 6:48 PM · Research (FY2019-20-Research-January-March), CX-boost, Language-Team (Language-2020-January-March)
diego added a comment to T243257: Start the first version of the Research Internships program. .

weekly update:

Feb 17 2020, 6:43 PM · Research (FY2019-20-Research-January-March)
diego added a comment to T243256: Measuring the consistency of information between Wikipedia articles and Wikidata items..
  • Reviewing Corona Virus related cases.
Feb 17 2020, 6:42 PM · Research (FY2019-20-Research-January-March)

Feb 13 2020

diego added a comment to T224234: Research support for cross-wiki content propagation.

@elukey I've deleted 120 GB. I'm back down to 580 GB :)

Feb 13 2020, 3:27 PM · Research (FY2019-20-Research-January-March), CX-boost, Language-Team (Language-2020-January-March)
diego updated subscribers of T224234: Research support for cross-wiki content propagation.

Hey @elukey, for this task I need to download at least 50 language models, each of them around 8G, so I'll use around 400G. I'll do my best to make this work with that data on HDFS, but to start I need to have it on a local machine. I'm now using stat1007 for my experiments. Is it OK if I temporarily store the models there?

Feb 13 2020, 12:40 AM · Research (FY2019-20-Research-January-March), CX-boost, Language-Team (Language-2020-January-March)
diego added a comment to T224234: Research support for cross-wiki content propagation.

Hi @Theory42
I wouldn't say "employed team only". I'll share all the code I'm creating for this, but currently I can't think of tasks where I need help. Please feel free to contribute to the repo previously mentioned, and tell me if you need my help.

Feb 13 2020, 12:37 AM · Research (FY2019-20-Research-January-March), CX-boost, Language-Team (Language-2020-January-March)

Feb 11 2020

diego added a comment to T244819: wiki data for Global Innovation Index - 2019.

For the record, given that in geoeditors_edits_monthly we store information about countries using ISO 3166-1 alpha-2, we are losing Bonaire and Kosovo. The former has no code at all, and the latter has only Alpha-3.
We might want to consider using the full country name to avoid this kind of problem in the future.
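
To illustrate how rows get lost, here is a small sketch of aggregating by a two-letter code. The lookup table is a deliberately partial, illustrative mapping (not the real ISO 3166-1 table), the edit counts are made-up placeholders, and `totals_by_code` is a hypothetical helper, not code from the actual pipeline.

```python
# Illustrative lookup only; a stand-in for whatever name-to-code mapping is used.
NAME_TO_ALPHA2 = {"Chile": "CL", "Japan": "JP"}

# Made-up rows of (country name, edit count), used only to show the mechanism.
EDITS = [("Chile", 120), ("Japan", 300), ("Kosovo", 15), ("Bonaire", 4)]


def totals_by_code(rows, lookup):
    """Sum edits per two-letter code and report names that had no code."""
    totals = {}
    dropped = []
    for name, count in rows:
        code = lookup.get(name)
        if code is None:
            # Without a code there is no key to store the row under.
            dropped.append(name)
            continue
        totals[code] = totals.get(code, 0) + count
    return totals, dropped


if __name__ == "__main__":
    totals, dropped = totals_by_code(EDITS, NAME_TO_ALPHA2)
    print(totals)
    print("lost:", dropped)
```

Returning the dropped names alongside the totals makes the loss visible instead of silent, which is useful whichever key (code or full name) ends up being stored.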

Feb 11 2020, 2:37 AM · Research
diego created T244819: wiki data for Global Innovation Index - 2019.
Feb 11 2020, 1:38 AM · Research
diego added a comment to T224234: Research support for cross-wiki content propagation.

Update from last two weeks:

Feb 11 2020, 1:32 AM · Research (FY2019-20-Research-January-March), CX-boost, Language-Team (Language-2020-January-March)

Feb 10 2020

diego added a comment to T243257: Start the first version of the Research Internships program. .

weekly update:

Feb 10 2020, 5:32 PM · Research (FY2019-20-Research-January-March)
diego added a comment to T243256: Measuring the consistency of information between Wikipedia articles and Wikidata items..

Update from last two weeks:

Feb 10 2020, 5:31 PM · Research (FY2019-20-Research-January-March)

Feb 3 2020

diego created T244166: Mount /public/dumps for the recommendation-api Cloud VPS project.
Feb 3 2020, 7:52 PM · cloud-services-team (Kanban), VPS-Projects, Data-Services

Jan 31 2020

diego added a comment to T243972: bigdisk2 instace shows just 19G of HD.

Btw, is there a way to mount the Wikipedia dumps on those machines?

Jan 31 2020, 6:38 PM · Cloud-VPS

Jan 30 2020

diego edited projects for T243972: bigdisk2 instace shows just 19G of HD, added: Cloud-VPS; removed Cloud-Services.
Jan 30 2020, 8:06 PM · Cloud-VPS
diego created T243972: bigdisk2 instace shows just 19G of HD.
Jan 30 2020, 8:02 PM · Cloud-VPS

Jan 22 2020

diego added a comment to T227183: Generate template parameter alignments for the selected small wikis.

In the short term, the solution is to use the code as it was designed, to work with Spark.

Jan 22 2020, 5:47 PM · Patch-For-Review, Language-Team (Language-2020-January-March), CX-boost

Jan 21 2020

diego updated subscribers of T227183: Generate template parameter alignments for the selected small wikis.

Problem solved (thanks @elukey and @JAllemandou).

Jan 21 2020, 10:45 AM · Patch-For-Review, Language-Team (Language-2020-January-March), CX-boost
diego updated subscribers of T227183: Generate template parameter alignments for the selected small wikis.

Hey @Ottomata @JAllemandou, could you please check why the PySpark kernels are not working? I've been trying for a week with the different PySpark kernels on the notebook machines, but the notebook freezes on any command (even non-Spark commands). Pure Python works fine. Thx

Jan 21 2020, 8:17 AM · Patch-For-Review, Language-Team (Language-2020-January-March), CX-boost
diego triaged T243257: Start the first version of the Research Internships program. as High priority.
Jan 21 2020, 1:07 AM · Research (FY2019-20-Research-January-March)
diego added a comment to T243257: Start the first version of the Research Internships program. .

Weekly update: Gathered internship proposals within the team, and shared the requirements with @leila

Jan 21 2020, 1:07 AM · Research (FY2019-20-Research-January-March)
diego added a comment to T243256: Measuring the consistency of information between Wikipedia articles and Wikidata items..

Weekly Update: Preparing the dataset.

Jan 21 2020, 1:05 AM · Research (FY2019-20-Research-January-March)
diego edited projects for T224234: Research support for cross-wiki content propagation, added: Research (FY2019-20-Research-January-March); removed Research.
Jan 21 2020, 1:04 AM · Research (FY2019-20-Research-January-March), CX-boost, Language-Team (Language-2020-January-March)
diego created T243257: Start the first version of the Research Internships program. .
Jan 21 2020, 1:03 AM · Research (FY2019-20-Research-January-March)
diego created T243256: Measuring the consistency of information between Wikipedia articles and Wikidata items..
Jan 21 2020, 1:02 AM · Research (FY2019-20-Research-January-March)
diego claimed T224234: Research support for cross-wiki content propagation.
Jan 21 2020, 12:57 AM · Research (FY2019-20-Research-January-March), CX-boost, Language-Team (Language-2020-January-March)

Jan 10 2020

diego edited projects for T186558: Create a Historical Link Graph for Wikipedia, added: Research-Backlog; removed Research.
Jan 10 2020, 8:02 PM · Analytics, Research-Backlog, Data-release
diego edited projects for T215349: [Blog] Write a blog post about Crosslingual section alignment and recommendations. , added: Research-Backlog; removed Research.
Jan 10 2020, 8:01 PM · Research-Backlog
diego closed T229595: Literature review on mis/disformation on Wikipedia as Resolved.
Jan 10 2020, 7:58 PM · Research
diego closed T233384: Give a talk about "The Role of Wikipedia in the AI ecosystem" at the Catholic University of Chile as Resolved.
Jan 10 2020, 7:58 PM · Research

Dec 11 2019

diego added a comment to T224234: Research support for cross-wiki content propagation.

Hi @Pginer-WMF,
Have you already had a look at our Section Recommendation demo app? It currently works for 6 languages. Expanding it, and especially maintaining it for many languages, could be complex. However, a simplified version of that system, using a dump approach instead of an API (as we did with the template parameter alignment), could be feasible.

Dec 11 2019, 3:38 PM · Research (FY2019-20-Research-January-March), CX-boost, Language-Team (Language-2020-January-March)

Oct 25 2019

diego added a comment to T230059: Introduction to cross-lingual word-embeddings at Wikimania 2019.

@Aklapper done!

Oct 25 2019, 10:00 AM · Research, Wikimania-Hackathon-2019
diego closed T230059: Introduction to cross-lingual word-embeddings at Wikimania 2019 as Resolved.
Oct 25 2019, 9:59 AM · Research, Wikimania-Hackathon-2019

Oct 10 2019

diego added a comment to T229595: Literature review on mis/disformation on Wikipedia.

Here is a summary of the work done: https://meta.wikimedia.org/wiki/Research:Disinformation_Literature_Review

Oct 10 2019, 4:41 PM · Research

Oct 8 2019

diego added a comment to T234484: Add data quality metric: traffic variations per country.

@Nuria the entropy approach looks very cool, thanks for sharing.

Oct 8 2019, 9:12 PM · Patch-For-Review, Research, Analytics-Kanban, Analytics
diego added a comment to T234484: Add data quality metric: traffic variations per country.

We (research) will be supporting @ssingh on his work related to this problem, especially focused on censorship.

Oct 8 2019, 8:57 PM · Patch-For-Review, Research, Analytics-Kanban, Analytics
diego added a project to T234484: Add data quality metric: traffic variations per country: Research.
Oct 8 2019, 8:41 PM · Patch-For-Review, Research, Analytics-Kanban, Analytics

Sep 27 2019

diego updated the task description for T234007: Give a talk about the The Role of Wikipedia in the AI Ecosystem at Catholic University of Chile .
Sep 27 2019, 4:46 AM · Research
diego closed T234007: Give a talk about the The Role of Wikipedia in the AI Ecosystem at Catholic University of Chile as Resolved.
Sep 27 2019, 4:46 AM · Research
diego added a comment to T234007: Give a talk about the The Role of Wikipedia in the AI Ecosystem at Catholic University of Chile .

Slides

Sep 27 2019, 4:46 AM · Research
diego closed T234008: Give a talk about the Research Team at Wikimedia Chile as Resolved.
Sep 27 2019, 4:45 AM · Research
diego created T234008: Give a talk about the Research Team at Wikimedia Chile.
Sep 27 2019, 4:45 AM · Research
diego created T234007: Give a talk about the The Role of Wikipedia in the AI Ecosystem at Catholic University of Chile .
Sep 27 2019, 4:41 AM · Research

Sep 20 2019

diego updated the task description for T233384: Give a talk about "The Role of Wikipedia in the AI ecosystem" at the Catholic University of Chile.
Sep 20 2019, 2:54 AM · Research
diego renamed T233384: Give a talk about "The Role of Wikipedia in the AI ecosystem" at the Catholic University of Chile from Give a talk about "The Role of Wikipedia in the AI ecosystem" at the Catolic University of Chile to Give a talk about "The Role of Wikipedia in the AI ecosystem" at the Catholic University of Chile.
Sep 20 2019, 2:53 AM · Research
diego created T233384: Give a talk about "The Role of Wikipedia in the AI ecosystem" at the Catholic University of Chile.
Sep 20 2019, 2:53 AM · Research

Sep 16 2019

diego added a comment to T227183: Generate template parameter alignments for the selected small wikis.

@santhosh, I've never tried that (I understand that Docker files are a kind of virtual environment, but honestly, I've never used them). We can try, but remember that the person will need access to our Spark cluster. Do you know if the Docker environment can connect with YARN?

Sep 16 2019, 1:53 PM · Patch-For-Review, Language-Team (Language-2020-January-March), CX-boost

Aug 28 2019

diego added a comment to T215655: Generate edit totals by country by month/year.

@diego please review below as well since you worked with GII folks during the past iteration:

Looks ok to me.

Aug 28 2019, 9:10 PM · Patch-For-Review, Analytics-Kanban, Analytics

Aug 14 2019

diego added a comment to T230348: What are your experiences with templates?.

You might be interested in this project: T221211

Aug 14 2019, 8:31 AM · TCB-Team, Wikimania-Hackathon-2019

Aug 13 2019

diego updated the task description for T230059: Introduction to cross-lingual word-embeddings at Wikimania 2019.
Aug 13 2019, 11:27 PM · Research, Wikimania-Hackathon-2019

Aug 7 2019

leila awarded T230059: Introduction to cross-lingual word-embeddings at Wikimania 2019 a Love token.
Aug 7 2019, 8:02 PM · Research, Wikimania-Hackathon-2019
diego created T230059: Introduction to cross-lingual word-embeddings at Wikimania 2019.
Aug 7 2019, 6:51 PM · Research, Wikimania-Hackathon-2019

Aug 5 2019

diego updated the task description for T229267: Plan for Research team's acitivities during Wikimania 2019.
Aug 5 2019, 3:48 PM · Research-management, Research
diego updated the task description for T229267: Plan for Research team's acitivities during Wikimania 2019.
Aug 5 2019, 3:34 PM · Research-management, Research
diego updated the task description for T229267: Plan for Research team's acitivities during Wikimania 2019.
Aug 5 2019, 3:32 PM · Research-management, Research

Aug 1 2019

diego created T229595: Literature review on mis/disformation on Wikipedia.
Aug 1 2019, 4:58 PM · Research

Jul 29 2019

diego added a comment to T229242: Explore ways to restrict suggestions to a given knowledge area.

About this:

Jul 29 2019, 4:40 PM · Language-Team (Language-2020-January-March), CX-boost, WorkType-NewFunctionality

Jul 11 2019

diego closed T210530: Expose section mappings via an API as Resolved.
Jul 11 2019, 4:00 PM · Research-Backlog
diego closed T210530: Expose section mappings via an API, a subtask of T203046: Output 1.4: Public test APIs corresponding to section recommendation algorithms, as Resolved.
Jul 11 2019, 4:00 PM · Research, address-knowledge-gaps, Epic
diego added a comment to T210530: Expose section mappings via an API.

This has been solved here: https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_articles_across_languages/Inter_language_approach#Results
Check an example here: https://secrec.wmflabs.org/API/alignment/en/ja/Work

Jul 11 2019, 3:59 PM · Research-Backlog
diego removed a project from T215349: [Blog] Write a blog post about Crosslingual section alignment and recommendations. : Research.
Jul 11 2019, 3:52 PM · Research-Backlog
diego removed a project from T186559: Provide data dumps in the Analytics Data Lake: Research.
Jul 11 2019, 3:51 PM · Analytics

Jul 10 2019

diego moved T215349: [Blog] Write a blog post about Crosslingual section alignment and recommendations. from In Progress to Staged on the Research board.
Jul 10 2019, 11:28 AM · Research-Backlog
diego merged task T203047: Output 1.5: The first version of the algorithm that prioritizes missing sections into T227651: first version of the algorithm that prioritizes missing sections .
Jul 10 2019, 11:26 AM · Research, address-knowledge-gaps, Epic
diego merged T203047: Output 1.5: The first version of the algorithm that prioritizes missing sections into T227651: first version of the algorithm that prioritizes missing sections .
Jul 10 2019, 11:26 AM · Research
diego closed T221211: Parameters matching on Templates: ML Exploration as Resolved.
Jul 10 2019, 11:24 AM · Language-Team (Language-2019-July-September), ContentTranslation, Research
diego added a comment to T221211: Parameters matching on Templates: ML Exploration .

@Pginer-WMF, I'm going to mark this task as resolved from my side, and we can continue the follow-up somewhere else, ok?

Jul 10 2019, 11:22 AM · Language-Team (Language-2019-July-September), ContentTranslation, Research
diego closed T227651: first version of the algorithm that prioritizes missing sections as Resolved.
Jul 10 2019, 11:17 AM · Research
diego added a comment to T227651: first version of the algorithm that prioritizes missing sections .

Considering the feedback obtained in T225136, we conclude that the prioritization should be adapted to the characteristics of the editor being assisted. We can split editors into two disjoint groups, generating two different types of recommendations:

Jul 10 2019, 11:17 AM · Research
diego closed T203046: Output 1.4: Public test APIs corresponding to section recommendation algorithms as Resolved.
Jul 10 2019, 11:06 AM · Research, address-knowledge-gaps, Epic
diego created T227651: first version of the algorithm that prioritizes missing sections .
Jul 10 2019, 11:01 AM · Research

Jul 5 2019

diego updated subscribers of T221891: [REQUEST] En Wiki pageviews by topic. Rough cut..
Jul 5 2019, 3:30 PM · Product-Analytics
diego updated subscribers of T221891: [REQUEST] En Wiki pageviews by topic. Rough cut..

Hi all,
I think this use case highlights the need for a canonical (standardized) cross-lingual topic model that we could all use as the reference for all the projects within the WMF.

Jul 5 2019, 3:30 PM · Product-Analytics
diego updated subscribers of T221891: [REQUEST] En Wiki pageviews by topic. Rough cut..
Jul 5 2019, 3:11 PM · Product-Analytics

Jun 13 2019

diego added a comment to T221211: Parameters matching on Templates: ML Exploration .

Oh, I see! That number is a distance, so 0 would be a perfect match and 1 would be no match at all. I've already set an upper bound of 0.45, so you will only see values lower than that.

Jun 13 2019, 3:11 PM · Language-Team (Language-2019-July-September), ContentTranslation, Research
diego added a comment to T221211: Parameters matching on Templates: ML Exploration .

If I understood correctly, you are asking why two identical strings do not have distance = 0. This is because there is no string-matching mechanism in this approach. Every language is trained separately, and the models are then aligned using words or sentences that we know are equivalent. This is not necessarily bad, because you will find some words that are written exactly the same but mean different things in each language. However, in the examples that you show, this is just part of the noise introduced by the model.
We could add a second step, for example using Levenshtein distance, that would take advantage of string similarity, but it would work only for languages that share a script. If we had some training data, we could learn how to mix these two approaches and how useful the latter would be.
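
A rough sketch of what such a second step could look like follows. It is only an illustration under assumptions: the embedding distance is taken as a given number in [0, 1], the function names are hypothetical, and the mixing weight is a fixed placeholder rather than something learned from training data.

```python
def levenshtein(a, b):
    """Classic edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(
                min(
                    prev[j] + 1,               # deletion
                    cur[j - 1] + 1,            # insertion
                    prev[j - 1] + (ca != cb),  # substitution (free if equal)
                )
            )
        prev = cur
    return prev[-1]


def normalized_levenshtein(a, b):
    """Scale the edit distance to [0, 1] so it is comparable to the embedding distance."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def blended_distance(embedding_distance, a, b, weight=0.5):
    """Mix the two signals; `weight` is an assumed placeholder, not a tuned value."""
    return weight * embedding_distance + (1 - weight) * normalized_levenshtein(a, b)


if __name__ == "__main__":
    # Identical parameter names pull the blended distance towards 0,
    # even when the embedding distance alone is noisy.
    print(blended_distance(0.30, "name", "name"))
    print(blended_distance(0.30, "name", "nombre"))
```

With identical strings the string term is 0, so the blended value is simply the embedding distance scaled by the weight. For pairs in different scripts the string term carries no useful signal, which is why the weight would ideally depend on the language pair.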

Jun 13 2019, 1:56 PM · Language-Team (Language-2019-July-September), ContentTranslation, Research

Jun 10 2019

Groceryheist awarded T186559: Provide data dumps in the Analytics Data Lake a Love token.
Jun 10 2019, 7:52 PM · Analytics

May 30 2019

diego updated subscribers of T221211: Parameters matching on Templates: ML Exploration .

Hi,
I have created and uploaded the full experiments and aligned parameters for these languages:

["es", "en", "fr", "ar", "ru", "uk", "pt", "vi", "zh", "ru", "he", "it", "ta", "id", "fa", "ca"]
May 30 2019, 10:40 PM · Language-Team (Language-2019-July-September), ContentTranslation, Research

May 23 2019

diego added a comment to T221211: Parameters matching on Templates: ML Exploration .

Sorry, I put the wrong link to the experiments in the previous comment. It is now updated.

May 23 2019, 5:04 PM · Language-Team (Language-2019-July-September), ContentTranslation, Research

May 20 2019

diego added a comment to T221211: Parameters matching on Templates: ML Exploration .

You can find the results of the experiments here.

May 20 2019, 10:08 AM · Language-Team (Language-2019-July-September), ContentTranslation, Research
diego updated the task description for T221211: Parameters matching on Templates: ML Exploration .
May 20 2019, 9:59 AM · Language-Team (Language-2019-July-September), ContentTranslation, Research