fkaelin
User

User Details

User Since
Nov 12 2020, 6:16 PM (178 w, 6 d)
Availability
Available
LDAP User
Fabian Kaelin
MediaWiki User
FKaelin (WMF) [ Global Accounts ]

Recent Activity

Tue, Apr 16

fkaelin closed T348826: Integrate with WMF deployment pipeline as Declined.

Closing this. Deployment on CloudVPS is supported; blubber integration will be done when a kubernetes deployment is needed.

Tue, Apr 16, 3:10 PM · Research
fkaelin closed T348826: Integrate with WMF deployment pipeline, a subtask of T348820: Tooling to work with embeddings, as Declined.
Tue, Apr 16, 3:10 PM · Epic, Research
fkaelin closed T348367: Create a python package to compute wikitext embeddings in the WMF data infra as Resolved.

Done - code

Tue, Apr 16, 3:04 PM · Research
fkaelin closed T348367: Create a python package to compute wikitext embeddings in the WMF data infra, a subtask of T348819: Develop pipelines for research datasets - Q2, as Resolved.
Tue, Apr 16, 3:04 PM · Research (FY2023-24-Research-October-December)
fkaelin closed T348823: Tooling to create an index from a dataset of vectors as Resolved.
Tue, Apr 16, 3:02 PM · Research
fkaelin closed T348823: Tooling to create an index from a dataset of vectors, a subtask of T348820: Tooling to work with embeddings, as Resolved.
Tue, Apr 16, 3:02 PM · Epic, Research
fkaelin removed Due Date on T343061: Denylist for language agnostic revert risk model.
Tue, Apr 16, 2:57 PM · Research
fkaelin moved T343061: Denylist for language agnostic revert risk model from Staged to Backlog on the Research board.

Removing due date and moving to backlog to prioritize.

Tue, Apr 16, 2:56 PM · Research
fkaelin closed T342915: Generate training/evaluation datasets using airflow , a subtask of T341817: Standardize research pipelines - Dataset generation, as Resolved.
Tue, Apr 16, 2:54 PM · Epic, Research
fkaelin closed T342915: Generate training/evaluation datasets using airflow as Resolved.
Tue, Apr 16, 2:54 PM · Research
fkaelin added a comment to T343065: Scheduled risk observatory pipeline.

@Pablo can this ticket be closed as well, as the work was tracked with T341777?

Tue, Apr 16, 1:35 AM · Research (FY2023-24-Research-April-June)

Wed, Apr 3

fkaelin added a comment to T341777: Automate the data collection process.

@Pablo thanks for flagging. There was indeed an issue with the wikidiff table: it is an external hive table, and although the required data was on hdfs and triggered the risk observatory dag, the hive table itself was not being correctly updated, so no data was ingested. This is fixed now, and the dashboard shows data until Feb 24.

Wed, Apr 3, 11:10 AM · Research
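
The comment above does not say how the table was fixed. One common cause of this symptom with partitioned external hive tables is that new data directories land on hdfs without the matching partitions being registered in the metastore. Assuming that was the case here (it may not have been), the following minimal sketch builds the standard Hive DDL that registers one partition. The database name `some_db`, the partition key `snapshot`, and the hdfs path are hypothetical placeholders, not values from the task.

```python
def add_partition_statement(table, partition, location):
    """Build a Hive DDL statement that registers one partition of an external table."""
    # e.g. {"snapshot": "2024-02"} becomes "snapshot='2024-02'"
    spec = ", ".join(f"{key}='{value}'" for key, value in partition.items())
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )


# All names below are hypothetical placeholders.
statement = add_partition_statement(
    table="some_db.wikidiff",
    partition={"snapshot": "2024-02"},
    location="hdfs:///path/to/wikidiff/snapshot=2024-02",
)
print(statement)
```

The resulting string could be passed to `spark.sql(...)` or run in a hive client. `IF NOT EXISTS` keeps the statement safe to re-run from a scheduled job.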

Thu, Mar 21

fkaelin added a comment to T305688: Make HTML Dumps available in hadoop.

Pasting this reply from a slack thread for context

Thu, Mar 21, 3:15 PM · Data-Engineering (Q4 2024 April 1st - June 30th), Structured-Data-Backlog

Mar 5 2024

fkaelin updated the task description for T356729: Research API repository.
Mar 5 2024, 4:27 PM · Research

Mar 4 2024

fkaelin added a comment to T355440: PoC - general model training support (Cloud GPU).

Weekly updates

  • Interesting development with the ml team: there is a conversation with a European HPC infra provider about getting compute resources, and research projects are good candidates. Naturally this is relevant to this cloud GPU initiative, and research is very interested.
Mar 4 2024, 3:17 PM · Research (FY2023-24-Research-April-June)

Feb 29 2024

fkaelin added a comment to T354241: Check home/HDFS leftovers of nickifeajika.

These directories can be removed both on the stat clients and hdfs. Thanks!

Feb 29 2024, 1:55 PM · Data-Platform-SRE (2024.02.12 - 2024.03.03), Data-Engineering

Feb 27 2024

fkaelin closed T358613: Content_gap_metrics stage of knowledge_gaps job failing repeatedly as Resolved.

This is fixed (MR) and the data is available.

Feb 27 2024, 10:54 PM · Research, Movement-Metrics, Movement-Insights

Feb 12 2024

fkaelin updated subscribers of T357316: Develop pipelines for research datasets - Q3/Q4.
Feb 12 2024, 3:23 PM · Research (FY2023-24-Research-April-June)
fkaelin created T357316: Develop pipelines for research datasets - Q3/Q4.
Feb 12 2024, 3:14 PM · Research (FY2023-24-Research-April-June)
fkaelin added a comment to T355440: PoC - general model training support (Cloud GPU).

Weekly updates

  • Trained the simplification model using a 3 billion parameter base model (flan-t5-xl) on a single H100 (80GB). Results look promising.
  • Training for 2 epochs (~10h), running inference on test datasets (~6h), and downloading model weights: total cost ~$50
  • The fine-tuned model weights are on stat1008. Validated that inference on the currently available GPU in the WMF infra works (it is slow)
Feb 12 2024, 3:09 PM · Research (FY2023-24-Research-April-June)
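
As a rough check on the figures in the update above, the arithmetic below derives the hourly rate implied by the reported total cost. The rate itself is not stated in the update. The time spent downloading weights is not given, so it is left out and the result is an upper bound on the hourly rate.

```python
# Figures reported in the update above (all approximate).
train_hours = 10.0       # 2 epochs of training
inference_hours = 6.0    # inference on the test datasets
total_cost_usd = 50.0    # reported total cost

# The download time is not reported, so it is not counted here.
billed_hours = train_hours + inference_hours
implied_rate = total_cost_usd / billed_hours

print(f"{billed_hours:.0f} GPU hours, at most ~{implied_rate:.2f} USD per hour")
```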

Feb 6 2024

fkaelin created T356729: Research API repository.
Feb 6 2024, 2:24 AM · Research

Feb 5 2024

fkaelin closed T355226: Productionize geography gaps data, cultural model as Resolved.

The cultural geographical gap data is now in production, aggregated on the level of the WMF regions. The gap name is geography_cultural_wmf_region (e.g. see here). The documentation and the example intersections notebook are also updated.

Feb 5 2024, 10:15 PM · Research
fkaelin added a comment to T331156: Improve documentation of the metrics available in the knowledge gap index.

For completeness: the datasets are also documented for the hive tables (which are equivalent to the published datasets) that are only available internally; see datahub (SSO login required)

Feb 5 2024, 10:07 PM · Research
fkaelin closed T331156: Improve documentation of the metrics available in the knowledge gap index as Resolved.

This is done: Datasets.

Feb 5 2024, 10:04 PM · Research
fkaelin closed T331156: Improve documentation of the metrics available in the knowledge gap index, a subtask of T331155: Knowledge Gaps Metrics, as Resolved.
Feb 5 2024, 10:04 PM · Epic, Research
fkaelin added a comment to T355440: PoC - general model training support (Cloud GPU).

Weekly updates

  • initial experiments with lambda labs, using text simplification as use case (T354653)
  • tested with A100 (40GB) and H100 (80GB) to validate approach and get an estimate of the cost for fine-tuning runs.
  • for a model size that can be trained on WMF infra (T5 large, 700M params), 1 epoch takes ~24h in WMF infra. On lambda labs 1 epoch costs ~$6 (time depends on hardware: ~4h on an A100, ~2h on an H100).
  • next up: use a model (3B param model) that can't currently be fine-tuned using WMF infra, but can be served using WMF infra.
Feb 5 2024, 9:54 PM · Research (FY2023-24-Research-April-June)
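
To make the comparison in the update above concrete, this short calculation derives the speedup over WMF infra and the hourly price implied by the per-epoch figures. The derived values are not stated in the update. They follow only from its approximate numbers and should be read as rough estimates.

```python
# Approximate figures from the update above.
WMF_HOURS_PER_EPOCH = 24.0
CLOUD_HOURS_PER_EPOCH = {"A100": 4.0, "H100": 2.0}
CLOUD_COST_PER_EPOCH_USD = 6.0  # reported as roughly the same on both GPUs

# Speedup relative to training one epoch on WMF infra.
speedup = {gpu: WMF_HOURS_PER_EPOCH / hours for gpu, hours in CLOUD_HOURS_PER_EPOCH.items()}

# Hourly price implied by a fixed cost per epoch.
hourly_rate = {gpu: CLOUD_COST_PER_EPOCH_USD / hours for gpu, hours in CLOUD_HOURS_PER_EPOCH.items()}

for gpu in CLOUD_HOURS_PER_EPOCH:
    print(f"{gpu}: {speedup[gpu]:.0f}x faster, ~{hourly_rate[gpu]:.2f} USD per hour")
```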

Jan 29 2024

fkaelin added a comment to T355440: PoC - general model training support (Cloud GPU).

Weekly updates

Jan 29 2024, 3:32 AM · Research (FY2023-24-Research-April-June)

Jan 25 2024

fkaelin added a comment to T355859: NEW BUG REPORT: Error querying content_gap_metrics tables from Presto/Superset.

Also for reference: at some point I created a template superset dashboard which mirrors the content_gap_metric hive tables, here: https://superset.wikimedia.org/superset/dashboard/472. It is just a draft with example charts.

Jan 25 2024, 2:24 PM · Data-Platform-SRE, Data-Platform
fkaelin added a comment to T355859: NEW BUG REPORT: Error querying content_gap_metrics tables from Presto/Superset.

The issue seems to be that the superset ui can't render nested parquet structures for these queries, e.g. the metrics column is a struct containing a set of scalar fields, and the quantiles are nested structs themselves. The query itself works, but the UI can't render the result as is. If you formulate the query in a way that doesn't contain nested structs it works, for example:

Jan 25 2024, 2:16 PM · Data-Platform-SRE, Data-Platform
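
The example query from the comment above is not preserved in this feed. As a minimal sketch of the workaround it describes, the helper below builds a query that selects individual scalar fields of a struct column instead of the struct itself. The table and the `metrics` column appear elsewhere in this feed. The field names `article_count` and `pageviews_sum` are hypothetical placeholders, since the actual struct fields are not listed here.

```python
def flat_select(table, plain_cols, struct_col, struct_fields):
    """Build a SELECT that exposes struct fields as top-level scalar columns."""
    # "metrics.article_count AS metrics_article_count", and so on
    flattened = [f"{struct_col}.{field} AS {struct_col}_{field}" for field in struct_fields]
    columns = ", ".join(list(plain_cols) + flattened)
    return f"SELECT {columns} FROM {table}"


query = flat_select(
    table="content_gap_metrics.by_category",
    plain_cols=["wiki_db", "category", "time_bucket"],
    struct_col="metrics",
    struct_fields=["article_count", "pageviews_sum"],  # hypothetical field names
)
print(query)
```

Because every selected column is a scalar, the result contains no nested structs for the UI to render.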

Jan 24 2024

fkaelin closed T348348: Standardize usage of geographic entities for knowledge gaps as Resolved.

This work is done with this MR, which migrated the KG pipeline to using the canonical_data.countries table. That table now includes the wikidata qid of the country, which made it possible to replace the base region mapping file. For the cultural gap in particular, the "re-mapping" of some territories not in the canonical countries table was retained to expand the coverage of the gap.

Jan 24 2024, 5:19 PM · Research
fkaelin added a comment to T355226: Productionize geography gaps data, cultural model.

Example dataset for the cultural geographic gap (aggregated for wmf regions) for review: https://analytics.wikimedia.org/published/datasets/one-off/fab/content_gap/ . The code is merged, and for the next scheduled run the new gap will be published as well (same format as the linked file above). If needed, we can also easily re-run the previous pipeline to have the data sooner.

Jan 24 2024, 4:34 PM · Research
fkaelin reopened T355226: Productionize geography gaps data, cultural model as "Open".

Somehow I accidentally closed this.

Jan 24 2024, 4:23 PM · Research
fkaelin moved T348348: Standardize usage of geographic entities for knowledge gaps from Backlog to In Progress on the Research board.
Jan 24 2024, 4:03 PM · Research

Jan 23 2024

fkaelin closed T355226: Productionize geography gaps data, cultural model as Resolved.
Jan 23 2024, 2:31 PM · Research
fkaelin changed the status of T348348: Standardize usage of geographic entities for knowledge gaps from Open to In Progress.
Jan 23 2024, 2:29 PM · Research
fkaelin changed the status of T355226: Productionize geography gaps data, cultural model from Open to In Progress.
Jan 23 2024, 2:29 PM · Research

Jan 18 2024

fkaelin added a comment to T346463: Identify and label prefetch proxy data in our traffic.
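from pyspark.sql import functions as F
# One hour of pageview_actor data (assumes an active `spark` session)
pa = spark.table("wmf.pageview_actor").where("""year=2024 and month=1 and day=18 and hour=16""")
prefetch_fields = ['prefetch_sec_purpose', 'prefetch_purpose', 'prefetch_x_moz']
# Flag, per request, whether each prefetch field is present in x_analytics_map
cols = [F.col("x_analytics_map").getItem(f).isNotNull().alias(f) for f in prefetch_fields]
pa.groupBy(*cols).count().orderBy("count", ascending=False).show(1000, truncate=False)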
Jan 18 2024, 8:36 PM · Traffic, Movement-Insights, Data-Engineering

Jan 17 2024

fkaelin added a comment to T355226: Productionize geography gaps data, cultural model.

The content gaps for the geography gap using the cultural model are available on hive:

(spark.table("content_gap_metrics.by_category")
    .where("content_gap='geography_cultural_region'")
    .show())
+-------+--------------------+--------------------+--------------------+--------------------+-----------+
|wiki_db|            category|             metrics|           quantiles|         content_gap|time_bucket|
+-------+--------------------+--------------------+--------------------+--------------------+-----------+
| frwiki|         Afghanistan|{1, 499592, 434.8...|{{1, 1, 1, 1, 1},...|geography_cultura...|    2023-10|
| frwiki|             Albania|{13, 551487, 363....|{{1, 1, 1, 1, 1},...|geography_cultura...|    2023-10|
| frwiki|             Algeria|{13, 5211013, 503...|{{1, 1, 1, 1, 1},...|geography_cultura...|    2023-10|
Jan 17 2024, 6:09 PM · Research
fkaelin added a comment to T346463: Identify and label prefetch proxy data in our traffic.

Thanks for the updates @dr0ptp4kt, and nice that you are able to reproduce such a google proxy request.

Jan 17 2024, 4:47 PM · Traffic, Movement-Insights, Data-Engineering

Jan 15 2024

fkaelin closed T342917: Provide feedback for TrainWing design as Invalid.

TrainWing as originally planned will not be built; instead, it will be built incrementally within existing data engineering infrastructure. As such, I am marking this task as invalid.

Jan 15 2024, 4:29 PM · Research
fkaelin closed T348819: Develop pipelines for research datasets - Q2 as Resolved.

This work is completed, the pipelines that were added:

Jan 15 2024, 3:58 PM · Research (FY2023-24-Research-October-December)
fkaelin closed T348819: Develop pipelines for research datasets - Q2, a subtask of T341817: Standardize research pipelines - Dataset generation, as Resolved.
Jan 15 2024, 3:58 PM · Epic, Research
fkaelin closed T349615: Implement risk obsevatory pipeline as Resolved.

Completed with https://gitlab.wikimedia.org/repos/research/research-datasets/-/merge_requests/11

Jan 15 2024, 3:56 PM · Research
fkaelin closed T349615: Implement risk obsevatory pipeline, a subtask of T343065: Scheduled risk observatory pipeline, as Resolved.
Jan 15 2024, 3:56 PM · Research (FY2023-24-Research-April-June)
fkaelin closed T341818: Migrate and consolidate Research teams' code to Gitlab as Resolved.

The remaining known repos have been migrated to gitlab and the github repos archived (cc @Isaac @MGerlach).

Jan 15 2024, 2:31 PM · Research (FY2023-24-Research-October-December)
fkaelin updated the task description for T341818: Migrate and consolidate Research teams' code to Gitlab.
Jan 15 2024, 2:29 PM · Research (FY2023-24-Research-October-December)

Dec 18 2023

fkaelin created T353665: Remove nickifeajika from analytics-privatedata-users.
Dec 18 2023, 5:55 PM · Data-Platform-SRE (2024.02.12 - 2024.03.03)

Dec 8 2023

fkaelin added a comment to T346463: Identify and label prefetch proxy data in our traffic.

It so happens I sniped myself into looking into this after noticing a lot of Google IPs in trending streaming pages. Here are pyspark snippets with more results.

Dec 8 2023, 7:36 PM · Traffic, Movement-Insights, Data-Engineering

Nov 21 2023

fkaelin moved T345630: Update documentation for maintaining research web pages from In Progress to Needs Sign-off on the Research board.
Nov 21 2023, 6:08 PM · Research
fkaelin updated subscribers of T345630: Update documentation for maintaining research web pages.

@leila, should the redundant meta page be deleted, or redirected?

Nov 21 2023, 6:08 PM · Research
fkaelin updated the task description for T345630: Update documentation for maintaining research web pages.
Nov 21 2023, 6:05 PM · Research
fkaelin moved T348823: Tooling to create an index from a dataset of vectors from Staged to In Progress on the Research board.
Nov 21 2023, 5:10 PM · Research
fkaelin moved T349615: Implement risk obsevatory pipeline from Staged to In Progress on the Research board.
Nov 21 2023, 5:08 PM · Research
fkaelin closed T348822: Choose vector search framework as Resolved.

Benchmark code, analysis notebook

Nov 21 2023, 5:04 PM · Research
fkaelin closed T348822: Choose vector search framework, a subtask of T348820: Tooling to work with embeddings, as Resolved.
Nov 21 2023, 5:03 PM · Epic, Research
fkaelin closed T348821: Define requirements for List-Building & Add-A-Link as Resolved.

This is done with Embeddings At Scale and Evaluation of similarity search solutions

Nov 21 2023, 4:59 PM · Research
fkaelin closed T348821: Define requirements for List-Building & Add-A-Link, a subtask of T348820: Tooling to work with embeddings, as Resolved.
Nov 21 2023, 4:59 PM · Epic, Research
fkaelin changed Due Date from Oct 27 2023, 4:00 AM to Nov 24 2023, 5:00 AM on T345630: Update documentation for maintaining research web pages.
Nov 21 2023, 4:47 PM · Research
fkaelin moved T345630: Update documentation for maintaining research web pages from Staged to In Progress on the Research board.
Nov 21 2023, 4:47 PM · Research

Nov 20 2023

fkaelin updated subscribers of T350795: Add linting of research landing-page to gitlab CI .

@Jelto Unfortunately I don't know either what these logs looked like, and I am not familiar with npm/js. The pipeline output says only 1 file was tested, but eslint is for javascript, and there is only a single js file in this project. From my perspective this is a good start; we can refine it as needed in the future.

Nov 20 2023, 9:00 PM · Research, collaboration-services

Nov 16 2023

fkaelin moved T350389: Upgrade xgboost in knowledge_integrity from Staged to In Progress on the Research board.

@isarantopoulos the python 3.8 is merged, and here is the MR for the xgboost bump.

Nov 16 2023, 4:54 AM · Research, Machine-Learning-Team

Nov 15 2023

fkaelin closed T348825: Deployment on yarn as Resolved.

This is done.

Nov 15 2023, 6:24 PM · Research
fkaelin closed T348825: Deployment on yarn, a subtask of T348820: Tooling to work with embeddings, as Resolved.
Nov 15 2023, 6:24 PM · Epic, Research

Nov 8 2023

fkaelin moved T350795: Add linting of research landing-page to gitlab CI from Backlog to Support Needed on the Research board.
Nov 8 2023, 3:19 PM · Research, collaboration-services
fkaelin created T350795: Add linting of research landing-page to gitlab CI .
Nov 8 2023, 3:18 PM · Research, collaboration-services

Nov 7 2023

fkaelin closed T349614: Archeology on the notebooks / documentation as Resolved.

This is done

Nov 7 2023, 6:16 PM · Research
fkaelin closed T349614: Archeology on the notebooks / documentation, a subtask of T343065: Scheduled risk observatory pipeline, as Resolved.
Nov 7 2023, 6:16 PM · Research (FY2023-24-Research-April-June)
fkaelin assigned T350389: Upgrade xgboost in knowledge_integrity to MunizaA.
Nov 7 2023, 4:18 PM · Research, Machine-Learning-Team
fkaelin set Due Date to Nov 30 2023, 5:00 AM on T350389: Upgrade xgboost in knowledge_integrity.
Nov 7 2023, 4:17 PM · Research, Machine-Learning-Team

Nov 2 2023

Dzahn awarded T334511: Move research webpages to gitlab a Love token.
Nov 2 2023, 3:54 PM · GitLab (Pipeline Services Migration🐤), collaboration-services, Research

Oct 26 2023

fkaelin renamed T349755: Training pipeline for Revert Risk Language Agnostic (RRLA) model from Create an standardized training pipeline for Revert Risk Language Agnostic (RRLA) model to [Requesting Engineering Support] Training pipeline for Revert Risk Language Agnostic (RRLA) model.
Oct 26 2023, 2:42 PM · Knowledge-Integrity, Research

Oct 25 2023

fkaelin closed T348666: Add randomization to the revision order showed in Annotool, a subtask of T344016: Improvements to Annotool, as Resolved.
Oct 25 2023, 4:58 PM · Research
fkaelin closed T348666: Add randomization to the revision order showed in Annotool as Resolved.
Oct 25 2023, 4:58 PM · Research

Oct 24 2023

fkaelin changed Due Date from Oct 18 2023, 10:00 PM to Oct 25 2023, 10:00 PM on T348666: Add randomization to the revision order showed in Annotool.
Oct 24 2023, 4:33 PM · Research
fkaelin moved T348666: Add randomization to the revision order showed in Annotool from Backlog to In Progress on the Research board.
Oct 24 2023, 4:32 PM · Research
fkaelin closed T343063: Multilingual revert risk pipeline as Resolved.
Oct 24 2023, 4:31 PM · Research
fkaelin set Due Date to Dec 29 2023, 5:00 AM on T348826: Integrate with WMF deployment pipeline.
Oct 24 2023, 3:41 PM · Research
fkaelin assigned T348826: Integrate with WMF deployment pipeline to MunizaA.
Oct 24 2023, 3:40 PM · Research
fkaelin moved T344625: Create repo for Organizer Lab randomization from Backlog to Staged on the Research board.
Oct 24 2023, 3:04 PM · Research
fkaelin moved T348823: Tooling to create an index from a dataset of vectors from Backlog to Staged on the Research board.
Oct 24 2023, 3:01 PM · Research
fkaelin moved T348367: Create a python package to compute wikitext embeddings in the WMF data infra from Backlog to Staged on the Research board.
Oct 24 2023, 3:01 PM · Research
fkaelin assigned T348367: Create a python package to compute wikitext embeddings in the WMF data infra to MunizaA.
Oct 24 2023, 3:01 PM · Research
fkaelin assigned T348821: Define requirements for List-Building & Add-A-Link to MunizaA.
Oct 24 2023, 3:00 PM · Research
fkaelin set Due Date to Nov 10 2023, 5:00 AM on T348821: Define requirements for List-Building & Add-A-Link.
Oct 24 2023, 3:00 PM · Research
fkaelin set Due Date to Nov 30 2023, 5:00 AM on T348823: Tooling to create an index from a dataset of vectors.
Oct 24 2023, 2:58 PM · Research
fkaelin moved T348825: Deployment on yarn from Backlog to Staged on the Research board.
Oct 24 2023, 2:57 PM · Research
fkaelin set Due Date to Nov 10 2023, 5:00 AM on T348825: Deployment on yarn.
Oct 24 2023, 2:57 PM · Research
fkaelin moved T342916: Add new "Readability" gap to Knowledge Gaps pipeline from Staged to Backlog on the Research board.
Oct 24 2023, 2:56 PM · Research
fkaelin removed Due Date on T342916: Add new "Readability" gap to Knowledge Gaps pipeline.
Oct 24 2023, 2:55 PM · Research
fkaelin moved T349614: Archeology on the notebooks / documentation from Backlog to In Progress on the Research board.
Oct 24 2023, 2:53 PM · Research
fkaelin moved T349615: Implement risk obsevatory pipeline from Backlog to Staged on the Research board.
Oct 24 2023, 2:53 PM · Research
fkaelin set Due Date to Oct 24 2023, 4:00 AM on T349614: Archeology on the notebooks / documentation.
Oct 24 2023, 2:33 PM · Research
fkaelin set Due Date to Nov 30 2023, 5:00 AM on T349615: Implement risk obsevatory pipeline.
Oct 24 2023, 2:30 PM · Research
fkaelin added a comment to T341818: Migrate and consolidate Research teams' code to Gitlab.

Updates:

  • Work on T344625 will start in Q3 due to other higher priority tasks
Oct 24 2023, 1:31 PM · Research (FY2023-24-Research-October-December)
fkaelin added a comment to T348819: Develop pipelines for research datasets - Q2.

Updates

  • Created a new repo research-datasets to consolidate production research pipelines that produce datasets
  • Nearly completed analysis of risk-observatory notebooks/documentation (T349614)
Oct 24 2023, 1:24 PM · Research (FY2023-24-Research-October-December)
fkaelin created T349615: Implement risk obsevatory pipeline.
Oct 24 2023, 1:22 PM · Research
fkaelin created T349614: Archeology on the notebooks / documentation.
Oct 24 2023, 1:19 PM · Research
fkaelin added a comment to T349151: September 2023 Wikimedia movement metrics.

@Mayakp.wiki The content gap metrics are available for the September snapshots. Improvements:

  • the metrics are computed only for canonical wikis.
  • the partial time buckets are filtered out, i.e. there are no rows for 2023-10 in the September datasets
Oct 24 2023, 3:30 AM · Movement-Insights
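
The filtering of partial time buckets described in the comment above can be illustrated with a small sketch. This is not the pipeline's actual code, which is not shown in this feed. It assumes rows carry a `time_bucket` string in `YYYY-MM` form, as in the hive output elsewhere in this feed, and the function name and sample rows are hypothetical.

```python
def drop_partial_buckets(rows, snapshot):
    """Keep rows whose time_bucket (YYYY-MM) is not later than the snapshot month."""
    # Zero-padded YYYY-MM strings sort chronologically, so string comparison is enough.
    return [row for row in rows if row["time_bucket"] <= snapshot]


# Hypothetical sample rows for illustration.
rows = [
    {"wiki_db": "frwiki", "time_bucket": "2023-08"},
    {"wiki_db": "frwiki", "time_bucket": "2023-09"},
    {"wiki_db": "frwiki", "time_bucket": "2023-10"},  # partial bucket at snapshot time
]

kept = drop_partial_buckets(rows, snapshot="2023-09")
print([row["time_bucket"] for row in kept])
```

With a September snapshot, the 2023-10 row is dropped because that month was only partially observed when the snapshot was taken.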

Oct 19 2023

fkaelin assigned T348822: Choose vector search framework to MunizaA.
Oct 19 2023, 4:38 PM · Research

Oct 13 2023

fkaelin updated the task description for T341818: Migrate and consolidate Research teams' code to Gitlab.
Oct 13 2023, 3:16 AM · Research (FY2023-24-Research-October-December)