
mfossati (Marco Fossati)
Software Engineer, Structured Data

User Details

User Since
Jan 6 2022, 7:27 PM (100 w, 3 d)
Availability
Available
LDAP User
Marco Fossati
MediaWiki User
MFossati (WMF) [ Global Accounts ]

Recent Activity

Thu, Dec 7

mfossati added a comment to T350009: Coalesce SEAL output.

As discussed on Slack, I would suggest using dataframe.repartition(X) for the features datasets as data is relatively small and using coalesce impairs job scalability (and the number of files is far too big in comparison to the data size :).

Many thanks for the valuable suggestion! Implemented in this commit with a default value of 4.
Now output files dropped to 1k! 🎉
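For illustration, the suggested change amounts to something like the following minimal PySpark sketch. The function name, the output path argument, and the overwrite mode are assumptions and are not taken from the actual commit; only the use of repartition with a default of 4 comes from the discussion above. coalesce(n) avoids a shuffle and can therefore squeeze upstream computation into n tasks, whereas repartition(n) adds a shuffle so that upstream stages keep their parallelism.

```python
def write_features(df, output_path, num_partitions=4):
    """Write a features DataFrame as parquet with a bounded number of part files.

    `df` is expected to be a pyspark.sql.DataFrame. The function name and
    arguments are hypothetical, for illustration only.
    """
    # repartition() triggers a full shuffle, so upstream stages keep their
    # parallelism and only the final write runs with `num_partitions` tasks.
    (
        df.repartition(num_partitions)
        .write.mode("overwrite")  # write mode is an assumption
        .parquet(output_path)
    )
```

Each output folder written this way holds at most num_partitions part files plus the _SUCCESS marker.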

Thu, Dec 7, 11:20 AM · Patch-For-Review, Structured-Data-Backlog (Current Work), Image-Suggestions

Tue, Dec 5

mfossati added a subtask for T349641: [Investigation EPIC] Machine detection for media with copyright issues in Upload Wizard on Commons: T352748: [SPIKE] Image classifier prototype.
Tue, Dec 5, 10:31 AM · UploadWizard, Epic, Structured-Data-Backlog (Current Work)
mfossati added a parent task for T352748: [SPIKE] Image classifier prototype: T349641: [Investigation EPIC] Machine detection for media with copyright issues in Upload Wizard on Commons.
Tue, Dec 5, 10:31 AM · Structured-Data-Backlog
mfossati created T352748: [SPIKE] Image classifier prototype.
Tue, Dec 5, 10:30 AM · Structured-Data-Backlog
mfossati changed the status of T338013: [L] Create search index deltas by comparing to `discovery.cirrus_index_without_content` in hive, a subtask of T340437: [EPIC] Data pipelines maintenance FY23-24, from Open to In Progress.
Tue, Dec 5, 10:02 AM · Epic, Structured-Data-Backlog (Current Work), Image-Suggestions, Section-Topics, Maintenance-Worktype, Section-Level-Image-Suggestions
mfossati changed the status of T338013: [L] Create search index deltas by comparing to `discovery.cirrus_index_without_content` in hive from Open to In Progress.
Tue, Dec 5, 10:02 AM · Structured-Data-Backlog (Current Work), Image-Suggestions

Mon, Dec 4

mfossati updated subscribers of T350020: Access request to deleted image files in the production Swift cluster.

Also CC @fkaelin .

Mon, Dec 4, 10:00 AM · SRE, UploadWizard, Structured-Data-Backlog, SRE-Access-Requests
mfossati updated subscribers of T350020: Access request to deleted image files in the production Swift cluster.

@jcrespo , would it be possible to use the internal reverse proxy to directly download deleted images via HTTP like here?

No, that won't be possible. Backup access will only be possible through backup access, which is highly restrictive, and it doesn't use swift protocol. I wonder if what you may want is swift production access instead, which won't come with the same limitation as backup access, and that Matthew here, on CC, advised using instead of backups.

Thanks for the heads up about production VS backup. The backup access request was merely based on my understanding that deleted images were stored there. If production access is a better option, then let's definitely opt for it, CC @MatthewVernon.

In any case, we are still blocked on T&C ok

Is there anything we can do from our side to unblock?

Mon, Dec 4, 9:59 AM · SRE, UploadWizard, Structured-Data-Backlog, SRE-Access-Requests

Fri, Dec 1

mfossati added a comment to T350020: Access request to deleted image files in the production Swift cluster.

@jcrespo , would it be possible to use the internal reverse proxy to directly download deleted images via HTTP like here?

Fri, Dec 1, 4:59 PM · SRE, UploadWizard, Structured-Data-Backlog, SRE-Access-Requests
mfossati added a comment to T350009: Coalesce SEAL output.

Thanks for the ping @mfossati - 79k files are still quite a lot - would you mind telling me more about the data? (size, partition-scheme etc)?

  • The typical size is roughly 60 GB, with 207 folders holding 401 files (coalesce = 400 + _SUCCESS) each.
  • no explicit partitioning

Note that the Spark job responsible for this output is the largest and most complex we have among our team's data pipelines, and usually takes 1 day of computation to complete.
The current implementation writes one parquet per wiki, thus resulting in those 207 folders. Modifying this behavior is out of scope, as it would require a lot of work: I tried a simple solution that writes to a single parquet, which caused all sorts of trouble for the Spark executors.
Also, further reducing the coalesce value might increase the execution time, which isn't viable either.
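As a rough sanity check on these numbers, the upper bound on the file count is the number of folders times the part files per folder, plus one _SUCCESS marker per folder. The helper below is a hypothetical illustration and not part of the pipeline.

```python
def max_output_files(folders, partitions):
    """Upper bound on files: each folder holds `partitions` part files plus _SUCCESS."""
    return folders * (partitions + 1)


# 207 per-wiki folders with coalesce = 400
print(max_output_files(207, 400))  # 83007
# Same layout with 4 partitions per folder
print(max_output_files(207, 4))    # 1035
```

The bound of 83,007 is in the same range as the reported 79k. The gap would be explained if some folders hold fewer than 400 part files, which is an assumption here. With 4 partitions per folder the bound is 1,035, in line with the roughly 1k files reported later on the same task.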

Fri, Dec 1, 4:32 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Image-Suggestions
mfossati moved T347755: [M] Add a new step to specify the threshold of originality for own work from Code Review to Needs QA on the Structured-Data-Backlog (Current Work) board.
Fri, Dec 1, 12:44 PM · Structured-Data-Backlog (Current Work), UploadWizard
mfossati moved T350009: Coalesce SEAL output from Doing to Code Review on the Structured-Data-Backlog (Current Work) board.

Report

script          coalesce  files before  files after
sections.py     8         2049          9
embeddings.py   100       1025          101
features.py     400       807k          79k
Fri, Dec 1, 12:40 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Image-Suggestions

Thu, Nov 30

mfossati added a project to T347558: [S] Coalesce section alignment image suggestions output: Patch-For-Review.
Thu, Nov 30, 11:43 AM · Data-Engineering (Sprint 6), Patch-For-Review, Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions
mfossati moved T347558: [S] Coalesce section alignment image suggestions output from Doing to Code Review on the Structured-Data-Backlog (Current Work) board.

Report

script             coalesce  files before  files after  partition before  partition after
article_images.py  100       298k          101          wiki              none
recommendation.py  40        114k          41           wiki              none
Thu, Nov 30, 11:42 AM · Data-Engineering (Sprint 6), Patch-For-Review, Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions

Wed, Nov 29

mfossati moved T347752: [M] Add a new step for confirming to comply with guidelines when uploading someone else's work from Code Review to Needs QA on the Structured-Data-Backlog (Current Work) board.
Wed, Nov 29, 11:17 AM · MW-1.42-notes (1.42.0-wmf.9; 2023-12-12), Structured-Data-Backlog (Current Work), UploadWizard
mfossati moved T347749: [L] Redesign specifying the license and public domain information for someone else's work from Code Review to Needs QA on the Structured-Data-Backlog (Current Work) board.
Wed, Nov 29, 9:38 AM · Structured-Data-Backlog (Current Work), UploadWizard

Tue, Nov 28

mfossati moved T347757: [M] Add new steps for confirming to comply with guidelines when uploading own work from Code Review to Needs QA on the Structured-Data-Backlog (Current Work) board.
Tue, Nov 28, 6:22 PM · MW-1.42-notes (1.42.0-wmf.9; 2023-12-12), Structured-Data-Backlog (Current Work), UploadWizard
mfossati added a comment to T348220: Check for valid URLs in the file source text input for someone else's work.

We're explicitly asking the user to enter a website, although I acknowledge a full URL might not occur often.

With T347750#9332285 we'll actually ask for a link, thus increasing the chance of a URL.

Tue, Nov 28, 11:00 AM · UploadWizard, Structured-Data-Backlog
mfossati moved T349769: Revise the license flow for 3D files in the upload wizard from Code Review to Verify on Production on the Structured-Data-Backlog (Current Work) board.

Directly moving to verify on production: we agreed with @Etonkovidova that the latest patch is tiny enough.

Tue, Nov 28, 10:53 AM · MW-1.42-notes (1.42.0-wmf.9; 2023-12-12), UploadWizard, Structured-Data-Backlog (Current Work)
mfossati added a comment to T348220: Check for valid URLs in the file source text input for someone else's work.

@mfossati this sounds like a lot of work (parsing/regexing through the text, validating the URL(s), displaying an error message) for something that has very little possibility of actually happening, and very little negative impact if it does? Are we seeing issues with this somewhere?

Perhaps I made the task description too broad: what I essentially meant is just a regular expression that matches a website and a warning if none is found.
We're explicitly asking the user to enter a website, although I acknowledge a full URL might not occur often.
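A minimal sketch of that idea follows, in Python for illustration only (UploadWizard itself is not written in Python). The pattern, the function name, and the warning text are all assumptions; the pattern is deliberately loose and only checks that something website-like is present.

```python
import re

# Loose, hypothetical pattern: optional scheme, optional "www.",
# one or more dot-separated labels, and a top-level domain of 2+ letters.
WEBSITE_RE = re.compile(
    r"(?:https?://)?(?:www\.)?"
    r"(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}"
    r"(?:/\S*)?",
    re.IGNORECASE,
)


def source_warning(text):
    """Return a warning string if no website-like token is found, else None."""
    if WEBSITE_RE.search(text):
        return None
    return "No website found in the source field."


print(source_warning("Found on example.org/photos/123"))  # None
print(source_warning("My friend sent it to me"))  # No website found in the source field.
```

Returning a warning rather than raising an error matches the intent stated above: the input is still accepted, and the user is only nudged to add a website.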

Tue, Nov 28, 10:37 AM · UploadWizard, Structured-Data-Backlog

Mon, Nov 27

mfossati changed the status of T350009: Coalesce SEAL output, a subtask of T340437: [EPIC] Data pipelines maintenance FY23-24, from Open to In Progress.
Mon, Nov 27, 9:56 AM · Epic, Structured-Data-Backlog (Current Work), Image-Suggestions, Section-Topics, Maintenance-Worktype, Section-Level-Image-Suggestions
mfossati changed the status of T350009: Coalesce SEAL output from Open to In Progress.

Skipping estimation: this ticket can be tackled together with T347558: [S] Coalesce section alignment image suggestions output.

Mon, Nov 27, 9:56 AM · Patch-For-Review, Structured-Data-Backlog (Current Work), Image-Suggestions
mfossati changed the status of T347558: [S] Coalesce section alignment image suggestions output, a subtask of T340437: [EPIC] Data pipelines maintenance FY23-24, from Open to In Progress.
Mon, Nov 27, 9:54 AM · Epic, Structured-Data-Backlog (Current Work), Image-Suggestions, Section-Topics, Maintenance-Worktype, Section-Level-Image-Suggestions
mfossati changed the status of T347558: [S] Coalesce section alignment image suggestions output from Open to In Progress.
Mon, Nov 27, 9:54 AM · Data-Engineering (Sprint 6), Patch-For-Review, Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions

Fri, Nov 24

mfossati added a comment to T347755: [M] Add a new step to specify the threshold of originality for own work.
Weekly update

Back to work, continued where I left off.
Sent 3 incremental patches that are now ready for review.

Fri, Nov 24, 2:31 PM · Structured-Data-Backlog (Current Work), UploadWizard

Thu, Nov 23

mfossati moved T347755: [M] Add a new step to specify the threshold of originality for own work from Doing to Code Review on the Structured-Data-Backlog (Current Work) board.
Thu, Nov 23, 6:18 PM · Structured-Data-Backlog (Current Work), UploadWizard

Nov 2 2023

mfossati added a comment to T299947: Normalize pagelinks table.

Hey @Ladsgroup , we're using that table in the image suggestions production data pipeline, see this query.
While I'll try to monitor this ticket and mailing list announcements, it would be great if you could ping me before the breaking change rolls out, so that we can take action.
You could also leave a comment in T350007: [M] Adapt image suggestions to comply with breaking database schema changes if you prefer, thanks!

Nov 2 2023, 4:01 PM · Platform Engineering, MediaWiki-Page-derived-data

Oct 30 2023

mfossati added a subtask for T349641: [Investigation EPIC] Machine detection for media with copyright issues in Upload Wizard on Commons: T339224: Retrieve actual data from Commons deletion requests.
Oct 30 2023, 12:40 PM · UploadWizard, Epic, Structured-Data-Backlog (Current Work)
mfossati added a parent task for T339224: Retrieve actual data from Commons deletion requests: T349641: [Investigation EPIC] Machine detection for media with copyright issues in Upload Wizard on Commons.
Oct 30 2023, 12:40 PM · Structured-Data-Backlog
mfossati added a subtask for T349641: [Investigation EPIC] Machine detection for media with copyright issues in Upload Wizard on Commons: T350020: Access request to deleted image files in the production Swift cluster.
Oct 30 2023, 12:39 PM · UploadWizard, Epic, Structured-Data-Backlog (Current Work)
mfossati added a parent task for T350020: Access request to deleted image files in the production Swift cluster: T349641: [Investigation EPIC] Machine detection for media with copyright issues in Upload Wizard on Commons.
Oct 30 2023, 12:39 PM · SRE, UploadWizard, Structured-Data-Backlog, SRE-Access-Requests
mfossati created T350020: Access request to deleted image files in the production Swift cluster.
Oct 30 2023, 12:37 PM · SRE, UploadWizard, Structured-Data-Backlog, SRE-Access-Requests
mfossati closed T339224: Retrieve actual data from Commons deletion requests, a subtask of T337408: UX research to identify improvements for image upload process in order to reduce moderation burden, as Invalid.
Oct 30 2023, 11:50 AM · Epic, Structured-Data-Backlog (Current Work)
mfossati closed T339224: Retrieve actual data from Commons deletion requests as Invalid.

Closing as invalid: we plan to use actual data to train a machine learning model, as part of T349641: [Investigation EPIC] Machine detection for media with copyright issues in Upload Wizard on Commons.

Oct 30 2023, 11:50 AM · Structured-Data-Backlog
mfossati added a subtask for T340437: [EPIC] Data pipelines maintenance FY23-24: T350012: Schedule all data pipeline DAGs on Thursdays.
Oct 30 2023, 10:54 AM · Epic, Structured-Data-Backlog (Current Work), Image-Suggestions, Section-Topics, Maintenance-Worktype, Section-Level-Image-Suggestions
mfossati added a parent task for T350012: Schedule all data pipeline DAGs on Thursdays: T340437: [EPIC] Data pipelines maintenance FY23-24.
Oct 30 2023, 10:54 AM · Structured-Data-Backlog (Current Work), Section-Topics, Section-Level-Image-Suggestions, Image-Suggestions
mfossati created T350012: Schedule all data pipeline DAGs on Thursdays.
Oct 30 2023, 10:54 AM · Structured-Data-Backlog (Current Work), Section-Topics, Section-Level-Image-Suggestions, Image-Suggestions
mfossati updated subscribers of T350009: Coalesce SEAL output.

CC @JAllemandou .

Oct 30 2023, 10:43 AM · Patch-For-Review, Structured-Data-Backlog (Current Work), Image-Suggestions
mfossati added a subtask for T340437: [EPIC] Data pipelines maintenance FY23-24: T350009: Coalesce SEAL output.
Oct 30 2023, 10:42 AM · Epic, Structured-Data-Backlog (Current Work), Image-Suggestions, Section-Topics, Maintenance-Worktype, Section-Level-Image-Suggestions
mfossati added a parent task for T350009: Coalesce SEAL output: T340437: [EPIC] Data pipelines maintenance FY23-24.
Oct 30 2023, 10:42 AM · Patch-For-Review, Structured-Data-Backlog (Current Work), Image-Suggestions
mfossati created T350009: Coalesce SEAL output.
Oct 30 2023, 10:42 AM · Patch-For-Review, Structured-Data-Backlog (Current Work), Image-Suggestions
mfossati added a subtask for T340437: [EPIC] Data pipelines maintenance FY23-24: T350007: [M] Adapt image suggestions to comply with breaking database schema changes.
Oct 30 2023, 10:36 AM · Epic, Structured-Data-Backlog (Current Work), Image-Suggestions, Section-Topics, Maintenance-Worktype, Section-Level-Image-Suggestions
mfossati added a parent task for T350007: [M] Adapt image suggestions to comply with breaking database schema changes: T340437: [EPIC] Data pipelines maintenance FY23-24.
Oct 30 2023, 10:36 AM · Structured-Data-Backlog (Current Work), Image-Suggestions
mfossati created T350007: [M] Adapt image suggestions to comply with breaking database schema changes.
Oct 30 2023, 10:36 AM · Structured-Data-Backlog (Current Work), Image-Suggestions
mfossati renamed T340437: [EPIC] Data pipelines maintenance FY23-24 from [EPIC] Image suggestions and section topics data pipelines maintenance FY23-24 to [EPIC] Data pipelines maintenance FY23-24.
Oct 30 2023, 10:33 AM · Epic, Structured-Data-Backlog (Current Work), Image-Suggestions, Section-Topics, Maintenance-Worktype, Section-Level-Image-Suggestions
mfossati added a comment to T347755: [M] Add a new step to specify the threshold of originality for own work.
Weekly update

Made good progress, then spent some time debugging why code that handles warning messages for sub-radio buttons was never reached.
Note that the acceptance criteria arrived on Wed 25 and, in my opinion, increased the previously estimated effort.

Oct 30 2023, 10:28 AM · Structured-Data-Backlog (Current Work), UploadWizard

Oct 27 2023

mfossati closed T325316: [XL] Productionize section alignment model training as Resolved.
Weekly update

This week's SEAL run went fine. As a result, I checked SLIS output, which looks fine as well. Image suggestions completed, too. Closing!

Oct 27 2023, 4:29 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions, Machine-Learning-Team
mfossati closed T325316: [XL] Productionize section alignment model training, a subtask of T340437: [EPIC] Data pipelines maintenance FY23-24, as Resolved.
Oct 27 2023, 4:28 PM · Epic, Structured-Data-Backlog (Current Work), Image-Suggestions, Section-Topics, Maintenance-Worktype, Section-Level-Image-Suggestions

Oct 26 2023

mfossati added a comment to T347756: [L] Redesign choosing Creative Commons license for own work.

When we add more questions we will need to add a number to the front of this question. How can we make sure that gets done later?

I thought that we could make this an ordered list in the next iteration? Does that make sense (cc @mfossati @matthiasmullie)
edit: I could make it an ordered list now if you like, but then it'd be numbered even though there's only one item

I was thinking the same 😄

Oct 26 2023, 3:07 PM · MW-1.42-notes (1.42.0-wmf.5; 2023-11-14), Patch-For-Review, Structured-Data-Backlog (Current Work), UploadWizard

Oct 24 2023

mfossati added a comment to T337925: [L] Publish full image suggestions (and intermediate) dataset.

See also https://foundation.wikimedia.org/wiki/Legal:Data_publication_guidelines

Oct 24 2023, 8:56 AM · Image-Suggestions, Section-Topics, Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions

Oct 23 2023

mfossati moved T347590: [M] Upload Wizard release rights: Redesign own vs not own work step from Code Review to Blocked on the Structured-Data-Backlog (Current Work) board.

Review done. Moving to blocked, pending community feedback integration.

Oct 23 2023, 10:41 AM · MW-1.42-notes (1.42.0-wmf.5; 2023-11-14), Structured-Data-Backlog (Current Work), UploadWizard
mfossati changed the status of T347755: [M] Add a new step to specify the threshold of originality for own work, a subtask of T347596: Upload Wizard: Redesign own work license step, from Open to In Progress.
Oct 23 2023, 10:41 AM · Epic, Structured-Data-Backlog (Current Work), UploadWizard
mfossati changed the status of T347755: [M] Add a new step to specify the threshold of originality for own work from Open to In Progress.
Oct 23 2023, 10:41 AM · Structured-Data-Backlog (Current Work), UploadWizard

Oct 20 2023

mfossati updated subscribers of T347590: [M] Upload Wizard release rights: Redesign own vs not own work step.
Weekly update

The patch was reviewed by @matthiasmullie, who directly integrated feedback (thanks!).
Some work is still pending on tests that are currently failing.

Oct 20 2023, 4:35 PM · MW-1.42-notes (1.42.0-wmf.5; 2023-11-14), Structured-Data-Backlog (Current Work), UploadWizard
mfossati added a comment to T325316: [XL] Productionize section alignment model training.
Weekly update

The SEAL pipeline is currently running in production, waiting for the last DAG task to complete. A couple of quick hotfixes were needed to ensure a proper execution.

Oct 20 2023, 4:27 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions, Machine-Learning-Team
mfossati added a comment to T348845: [L] Metrics for UW improvements.

Here are my suggestions:

  1. match against deletion request opening reasons, closing reasons, and file deletion reasons (AKA edit messages or revision comments); see the data lake query to extract file deletion reasons
  2. expand all wikilink shorthands, such as COM:DW = COM:DERIV = Commons:Derivative_works. Note that COM prefixes always expand to Commons
  3. compile the list by looking at wikilinks and words frequencies, plus word clusters from T340546: [XL] Analysis of deletion requests on Commons. Most frequent are:
    • COM:FOP = COM:PANO = Commons:Freedom_of_panorama
    • COM:SS = COM:SCREENSHOT[S] = Commons:Screenshots
    • COM:ALBUM = Commons:ALBUM
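The shorthand expansion in step 2 and the frequency count in step 3 can be sketched as follows. The alias table only contains the examples listed above; the function names and the case-insensitive lookup are assumptions made for illustration.

```python
from collections import Counter

# Shorthand -> canonical page, taken from the examples in the list above.
ALIASES = {
    "COM:DW": "Commons:Derivative_works",
    "COM:DERIV": "Commons:Derivative_works",
    "COM:FOP": "Commons:Freedom_of_panorama",
    "COM:PANO": "Commons:Freedom_of_panorama",
    "COM:SS": "Commons:Screenshots",
    "COM:SCREENSHOT": "Commons:Screenshots",
    "COM:SCREENSHOTS": "Commons:Screenshots",
}


def expand(wikilink):
    """Expand a shorthand wikilink to its canonical form."""
    link = wikilink.strip()
    key = link.upper()  # case-insensitive lookup is an assumption
    if key in ALIASES:
        return ALIASES[key]
    # Unknown shorthand: COM prefixes always expand to Commons.
    if key.startswith("COM:"):
        return "Commons:" + link[len("COM:"):]
    return link


def count_reasons(wikilinks):
    """Count canonical pages so that aliases are tallied together."""
    return Counter(expand(link) for link in wikilinks)


links = ["COM:FOP", "COM:PANO", "COM:DW", "COM:ALBUM", "Commons:Screenshots"]
print(count_reasons(links).most_common(1))  # [('Commons:Freedom_of_panorama', 2)]
```

Expanding before counting matters because otherwise COM:FOP and COM:PANO would be tallied as two separate reasons.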
Oct 20 2023, 9:30 AM · Structured-Data-Backlog (Current Work), UploadWizard

Oct 17 2023

mfossati moved T347590: [M] Upload Wizard release rights: Redesign own vs not own work step from Doing to Code Review on the Structured-Data-Backlog (Current Work) board.
Oct 17 2023, 3:39 PM · MW-1.42-notes (1.42.0-wmf.5; 2023-11-14), Structured-Data-Backlog (Current Work), UploadWizard
mfossati updated the task description for T347590: [M] Upload Wizard release rights: Redesign own vs not own work step.
Oct 17 2023, 3:38 PM · MW-1.42-notes (1.42.0-wmf.5; 2023-11-14), Structured-Data-Backlog (Current Work), UploadWizard

Oct 16 2023

mfossati added a comment to T343844: NEW BUG REPORT fiwiki’s section-level image suggestions aren’t generated in production.

I can confirm Airflow variables are not updated after deployment.
Opened T348963: DagProperties don't automatically update Airflow variables, CC @xcollazo .

Oct 16 2023, 9:49 AM · Structured-Data-Backlog, Data-Engineering
mfossati updated subscribers of T348963: DagProperties don't automatically update Airflow variables.
Oct 16 2023, 9:48 AM · Structured-Data-Backlog, Data Products, Data-Engineering
mfossati created T348963: DagProperties don't automatically update Airflow variables.
Oct 16 2023, 9:46 AM · Structured-Data-Backlog, Data Products, Data-Engineering
mfossati added a comment to T325316: [XL] Productionize section alignment model training.

Thanks again for your fast reaction @xcollazo , much appreciated!
I can confirm the SEAL DAG is now deployed.

Oct 16 2023, 9:23 AM · Patch-For-Review, Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions, Machine-Learning-Team
mfossati updated subscribers of T348958: Bump memory to enable large artifacts sync on HDFS.
Oct 16 2023, 9:19 AM · Structured-Data-Backlog, Data Products, Data-Engineering
mfossati created T348958: Bump memory to enable large artifacts sync on HDFS.
Oct 16 2023, 9:19 AM · Structured-Data-Backlog, Data Products, Data-Engineering
mfossati added a comment to T347561: [Maintenance] Set up deletion jobs for Structured Data's data pipelines.

Hey @VirginiaPoundstone , I don't think there's any deadline from our side. However, please note that this was initially raised as one of the causes that put the Hadoop cluster under pressure, CC @JAllemandou .

Oct 16 2023, 9:02 AM · Data-Engineering (Sprint 6), Data Products, Structured-Data-Backlog

Oct 13 2023

mfossati added a comment to T325316: [XL] Productionize section alignment model training.

(one is 1.22 GB, the other 1.64)

!!

Yeah I know 😄 , the heavier one is a pre-trained machine learning model and the other one has a lot of machine learning dependencies, despite my slimming efforts.

Maybe we can add them manually to unblock you. Let me try.

Thanks a lot and no worries, not in a hurry.
Would it be viable to bump the memory for future deployments?

Oct 13 2023, 7:01 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions, Machine-Learning-Team
mfossati moved T325316: [XL] Productionize section alignment model training from Code Review to Verify on Production on the Structured-Data-Backlog (Current Work) board.

Moving away from code review, pending deployment & production monitoring.

Oct 13 2023, 6:05 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions, Machine-Learning-Team
mfossati added a comment to T325316: [XL] Productionize section alignment model training.

@xcollazo , could you please give us a hand? Java is OOMing, perhaps our artifacts are too heavy (one is 1.22 GB, the other 1.64)? Pasting what's happening below:

mfossati@deploy2002:/srv/deployment/airflow-dags/platform_eng$ scap deploy
17:59:22 Started deploy [airflow-dags/platform_eng@520fa55]
17:59:22 Deploying Rev: HEAD = 520fa55c45d057c39c977e7cbe652844e9f66c00
17:59:22 Started deploy [airflow-dags/platform_eng@520fa55]: (no justification provided)
17:59:22
== DEFAULT ==
:* an-airflow1004.eqiad.wmnet
17:59:23 airflow-dags/platform_eng: fetch stage(s): 100% (in-flight: 0; ok: 1; fail: 0; left: 0) |
17:59:24 airflow-dags/platform_eng: config_deploy stage(s): 100% (in-flight: 0; ok: 1; fail: 0; left: 0) |
17:59:49 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'airflow-dags/platform_eng', '-g', 'default', 'promote', '--refresh-config'] (ran as analytics-platform-eng@an-airflow1004.eqiad.wmnet) returned [1]: Could not chdir to home directory /nonexistent: No such file or directory
Registering scripts in directory '/srv/deployment/airflow-dags/platform_eng-cache/revs/520fa55c45d057c39c977e7cbe652844e9f66c00/scap/scripts'
Registering scripts in directory '/srv/deployment/airflow-dags/platform_eng-cache/revs/520fa55c45d057c39c977e7cbe652844e9f66c00/scap/scripts'
Executing check 'artifacts_sync'
Check 'artifacts_sync' failed: hdfsWrite: NewByteArray error:
java.lang.OutOfMemoryError: Java heap space
Traceback (most recent call last):
  File "/usr/lib/airflow/bin/artifact-cache", line 8, in <module>
    sys.exit(main())
  File "/usr/lib/airflow/lib/python3.10/site-packages/workflow_utils/artifact/cli.py", line 32, in main
    artifact.cache_put(force=args['--force'])
  File "/usr/lib/airflow/lib/python3.10/site-packages/workflow_utils/artifact/artifact.py", line 71, in cache_put
    cache.put(self, open_file)
  File "/usr/lib/airflow/lib/python3.10/site-packages/workflow_utils/artifact/cache.py", line 128, in put
    output.write(input_stream.read())
  File "pyarrow/io.pxi", line 359, in pyarrow.lib.NativeFile.write
  File "pyarrow/error.pxi", line 113, in pyarrow.lib.check_status
OSError: [Errno 12] HDFS Write failed. Detail: [errno 12] Cannot allocate memory
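The traceback shows cache.put passing the result of a single input_stream.read() to output.write, so the whole artifact (up to 1.64 GB here) is handed over as one buffer. As a hedged illustration only, and not necessarily how workflow_utils was or should be changed, copying in fixed-size chunks bounds the size of each individual write. The function name and the chunk size below are assumptions.

```python
import io
import shutil

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB per write; an arbitrary choice


def stream_copy(input_stream, output, chunk_size=CHUNK_SIZE):
    """Copy a binary stream to another in chunks instead of one big read()."""
    # copyfileobj reads and writes at most `chunk_size` bytes at a time.
    shutil.copyfileobj(input_stream, output, length=chunk_size)


# In-memory streams stand in for the local artifact and the HDFS file.
src = io.BytesIO(b"x" * 1000)
dst = io.BytesIO()
stream_copy(src, dst, chunk_size=256)
print(dst.getvalue() == b"x" * 1000)  # True
```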
Oct 13 2023, 6:02 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions, Machine-Learning-Team

Oct 9 2023

mfossati changed the status of T347590: [M] Upload Wizard release rights: Redesign own vs not own work step, a subtask of T347298: [Epic] Upload wizard Release rights step improvements on Commons, from Open to In Progress.
Oct 9 2023, 2:11 PM · Commons, UploadWizard, Epic, Structured-Data-Backlog (Current Work)
mfossati changed the status of T347590: [M] Upload Wizard release rights: Redesign own vs not own work step from Open to In Progress.
Oct 9 2023, 2:11 PM · MW-1.42-notes (1.42.0-wmf.5; 2023-11-14), Structured-Data-Backlog (Current Work), UploadWizard
mfossati updated the task description for T340546: [XL] Analysis of deletion requests on Commons.
Oct 9 2023, 10:42 AM · Structured-Data-Backlog (Current Work)
mfossati created P52865 Commons deletion requests top 10 wikilinks, words, and word clusters.
Oct 9 2023, 9:57 AM

Oct 5 2023

mfossati added a subtask for T347298: [Epic] Upload wizard Release rights step improvements on Commons: T348220: Check for valid URLs in the file source text input for someone else's work.
Oct 5 2023, 9:12 AM · Commons, UploadWizard, Epic, Structured-Data-Backlog (Current Work)
mfossati added a parent task for T348220: Check for valid URLs in the file source text input for someone else's work: T347298: [Epic] Upload wizard Release rights step improvements on Commons.
Oct 5 2023, 9:12 AM · UploadWizard, Structured-Data-Backlog
mfossati created T348220: Check for valid URLs in the file source text input for someone else's work.
Oct 5 2023, 9:11 AM · UploadWizard, Structured-Data-Backlog

Oct 3 2023

mfossati added a comment to T347832: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-09-18.

All data pipelines succeeded.
The required dataset is now available, row counts of ALIS, SLIS and the delta are all consistent with the previous run.

Oct 3 2023, 5:22 PM · Discovery-Search (Current work), Image-Suggestions, Structured-Data-Backlog
mfossati moved T325316: [XL] Productionize section alignment model training from Doing to Code Review on the Structured-Data-Backlog (Current Work) board.

First complete version & DAG ready for code review.

Oct 3 2023, 2:39 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions, Machine-Learning-Team
mfossati updated the task description for T325316: [XL] Productionize section alignment model training.
Oct 3 2023, 2:37 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions, Machine-Learning-Team
mfossati added a comment to T347832: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-09-18.

The upstream dependency is back, so I manually triggered all data pipelines.
Section topics and section alignment suggestions went fine, image suggestions running now.

Oct 3 2023, 9:15 AM · Discovery-Search (Current work), Image-Suggestions, Structured-Data-Backlog
mfossati closed T338939: [L] Publish data pipelines Python documentation as Resolved.

All merged, closing.

Oct 3 2023, 9:10 AM · Structured-Data-Backlog (Current Work), Section-Topics, Section-Level-Image-Suggestions, Image-Suggestions

Oct 2 2023

mfossati added a comment to T347832: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-09-18.

The wmf.wikidata_item_page_link/snapshot=2023-09-18 upstream dependency isn’t available yet.
Corresponding DAG sensors timed out, causing all DAGs to fail.

Oct 2 2023, 9:02 AM · Discovery-Search (Current work), Image-Suggestions, Structured-Data-Backlog

Sep 28 2023

mfossati added a comment to T337925: [L] Publish full image suggestions (and intermediate) dataset.

Note that dumps will be owned by Data Engineering, Data Products team.
Work on Dumps 2.0 is also underway.

Sep 28 2023, 1:15 PM · Image-Suggestions, Section-Topics, Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions
mfossati updated subscribers of T347524: Add integration tests to the PySpark jobs.

Hi there, for unit tests I can certainly suggest chispa, which was initially proposed by @MunizaA .
For integration tests I can't tell.

Sep 28 2023, 12:58 PM · Data Products (Data Product Sprint 04), Dumps 2.0
mfossati added a subtask for T340437: [EPIC] Data pipelines maintenance FY23-24: T347569: [L] Block search indices update and Cassandra tasks in case of no ALIS or SLIS data.
Sep 28 2023, 10:39 AM · Epic, Structured-Data-Backlog (Current Work), Image-Suggestions, Section-Topics, Maintenance-Worktype, Section-Level-Image-Suggestions
mfossati added a parent task for T347569: [L] Block search indices update and Cassandra tasks in case of no ALIS or SLIS data: T340437: [EPIC] Data pipelines maintenance FY23-24.
Sep 28 2023, 10:39 AM · Structured-Data-Backlog (Current Work), SDAW-MediaSearch, Section-Level-Image-Suggestions, Image-Suggestions
mfossati created T347569: [L] Block search indices update and Cassandra tasks in case of no ALIS or SLIS data.
Sep 28 2023, 10:39 AM · Structured-Data-Backlog (Current Work), SDAW-MediaSearch, Section-Level-Image-Suggestions, Image-Suggestions
mfossati added a subtask for T340437: [EPIC] Data pipelines maintenance FY23-24: T347566: [M] Send an alert in case of no ALIS or SLIS.
Sep 28 2023, 10:32 AM · Epic, Structured-Data-Backlog (Current Work), Image-Suggestions, Section-Topics, Maintenance-Worktype, Section-Level-Image-Suggestions
mfossati added a parent task for T347566: [M] Send an alert in case of no ALIS or SLIS: T340437: [EPIC] Data pipelines maintenance FY23-24.
Sep 28 2023, 10:32 AM · Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions, Image-Suggestions
mfossati created T347566: [M] Send an alert in case of no ALIS or SLIS.
Sep 28 2023, 10:31 AM · Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions, Image-Suggestions
mfossati added a subtask for T340437: [EPIC] Data pipelines maintenance FY23-24: T347561: [Maintenance] Set up deletion jobs for Structured Data's data pipelines.
Sep 28 2023, 10:07 AM · Epic, Structured-Data-Backlog (Current Work), Image-Suggestions, Section-Topics, Maintenance-Worktype, Section-Level-Image-Suggestions
mfossati added a parent task for T347561: [Maintenance] Set up deletion jobs for Structured Data's data pipelines: T340437: [EPIC] Data pipelines maintenance FY23-24.
Sep 28 2023, 10:07 AM · Data-Engineering (Sprint 6), Data Products, Structured-Data-Backlog
mfossati created T347561: [Maintenance] Set up deletion jobs for Structured Data's data pipelines.
Sep 28 2023, 10:06 AM · Data-Engineering (Sprint 6), Data Products, Structured-Data-Backlog
mfossati updated subscribers of T347558: [S] Coalesce section alignment image suggestions output.

CC @JAllemandou .

Sep 28 2023, 9:44 AM · Data-Engineering (Sprint 6), Patch-For-Review, Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions
mfossati added a subtask for T340437: [EPIC] Data pipelines maintenance FY23-24: T347558: [S] Coalesce section alignment image suggestions output.
Sep 28 2023, 9:42 AM · Epic, Structured-Data-Backlog (Current Work), Image-Suggestions, Section-Topics, Maintenance-Worktype, Section-Level-Image-Suggestions
mfossati added a parent task for T347558: [S] Coalesce section alignment image suggestions output: T340437: [EPIC] Data pipelines maintenance FY23-24.
Sep 28 2023, 9:42 AM · Data-Engineering (Sprint 6), Patch-For-Review, Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions
mfossati created T347558: [S] Coalesce section alignment image suggestions output.
Sep 28 2023, 9:42 AM · Data-Engineering (Sprint 6), Patch-For-Review, Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions

Sep 22 2023

KStoller-WMF awarded T346682: Generate section image recommendations for Vietnamese Wikipedia and other wikis a Love token.
Sep 22 2023, 4:44 PM · Growth-Team (Sprint 0 (Growth Team)), Structured-Data-Backlog, GrowthExperiments-Homepage, Growth-Structured-Tasks, GrowthExperiments-NewcomerTasks, Image-Suggestions
mfossati updated subscribers of T346682: Generate section image recommendations for Vietnamese Wikipedia and other wikis.

@Urbanecm_WMF , T343844: NEW BUG REPORT fiwiki’s section-level image suggestions aren’t generated in production looks fine now in terms of data.
Here's the current section-level image suggestions (SLIS) count for Vietnamese Wikipedia:

In [1]: isu = spark.read.table('analytics_platform_eng.image_suggestions_suggestions').where('snapshot="2023-09-11"')
In [2]: slis = isu.where(isu.section_index.isNotNull())
In [3]: slis.where(slis.wiki == 'viwiki').count()
Out[3]: 38210

Re the acceptance criterion of this ticket: we have at least 1 SLIS for 293 Wikipedias, not all. Here's the complete list with raw SLIS counts (sorry for the long paste!). You may want to skip Wikipedias with too few SLIS.

In [4]: slis.select(slis.wiki).distinct().count()
Out[4]: 293
In [5]: slis.groupBy(slis.wiki).count().orderBy('count', ascending=False).show(n=300)
wiki count
enwiki 237706
dewiki 129877
ruwiki 128102
frwiki 124932
eswiki 123535
itwiki 118022
ukwiki 95136
plwiki 90826
nlwiki 79564
huwiki 74956
hewiki 74620
cswiki 68765
nowiki 62103
cawiki 62063
jawiki 60551
ptwiki 58337
svwiki 57341
zhwiki 56971
arwiki 56372
fiwiki 55348
srwiki 52096
be_x_oldwiki 48349
trwiki 41341
elwiki 41114
idwiki 39665
bgwiki 39489
viwiki 38210
hrwiki 36195
rowiki 34542
shwiki 33347
dawiki 31348
simplewiki 30523
skwiki 30279
fawiki 28997
hywiki 24828
kowiki 24236
mkwiki 23015
glwiki 22539
mswiki 22387
lvwiki 21521
bewiki 21180
azwiki 20708
slwiki 19090
bswiki 19047
eowiki 18518
astwiki 18457
euwiki 18347
etwiki 18034
ltwiki 17549
bnwiki 14227
kkwiki 11855
thwiki 11848
afwiki 11295
hiwiki 10680
tawiki 10131
kawiki 9685
nnwiki 9249
sqwiki 8961
alswiki 8678
mlwiki 7545
uzwiki 7021
fywiki 5864
jvwiki 5744
urwiki 5573
iswiki 5193
lawiki 4765
ocwiki 4093
pnbwiki 3859
knwiki 3743
tewiki 3707
cywiki 3596
bawiki 3498
suwiki 3462
tlwiki 3326
pawiki 3279
scowiki 2911
lmowiki 2717
mrwiki 2715
mnwiki 2594
ttwiki 2504
swwiki 2373
anwiki 2357
tgwiki 2330
zh_classicalwiki 2218
gawiki 2204
ndswiki 2186
zh_yuewiki 2160
guwiki 2150
pswiki 2077
liwiki 2061
newiki 2039
mywiki 1934
roa_tarawiki 1927
lbwiki 1851
arzwiki 1806
hywwiki 1707
mtwiki 1668
kywiki 1508
siwiki 1252
bhwiki 1220
stqwiki 1153
kuwiki 1130
aswiki 1047
scnwiki 952
cowiki 907
vecwiki 882
iawiki 881
mgwiki 877
orwiki 869
minwiki 791
mwlwiki 755
nds_nlwiki 739
brwiki 711
ruewiki 644
wuuwiki 634
azbwiki 633
sawiki 630
kmwiki 609
iowiki 569
vepwiki 532
krcwiki 484
mznwiki 462
ckbwiki 437
yiwiki 428
hawiki 428
sowiki 413
gdwiki 406
sahwiki 400
koiwiki 393
rmwiki 383
novwiki 379
cebwiki 371
scwiki 339
barwiki 333
bclwiki 332
vlswiki 321
ladwiki 306
sdwiki 288
tkwiki 284
pmswiki 279
cvwiki 276
xmfwiki 273
cewiki 262
fiu_vrowiki 254
hifwiki 254
frrwiki 253
ilowiki 248
fowiki 241
map_bmswiki 239
maiwiki 236
extwiki 225
napwiki 207
oswiki 205
tcywiki 205
warwiki 201
dagwiki 182
bxrwiki 175
lezwiki 174
iewiki 163
igwiki 159
furwiki 140
bat_smgwiki 133
newwiki 125
htwiki 123
altwiki 117
hsbwiki 113
arywiki 109
lowiki 108
dtywiki 108
diqwiki 105
zeawiki 103
gomwiki 102
zh_min_nanwiki 92
gvwiki 85
cbk_zamwiki 85
tumwiki 81
niawiki 79
wawiki 76
awawiki 75
nsowiki 73
banwiki 73
avwiki 72
gorwiki 71
dvwiki 71
snwiki 69
mrjwiki 69
bjnwiki 63
skrwiki 63
ganwiki 60
papwiki 59
szlwiki 58
madwiki 57
fjwiki 56
avkwiki 55
szywiki 55
lldwiki 54
kswiki 52
gnwiki 50
lijwiki 50
shnwiki 48
smnwiki 47
lfnwiki 47
pcdwiki 46
glkwiki 45
inhwiki 45
kaawiki 45
emlwiki 44
pflwiki 42
sewiki 42
kbdwiki 37
dsbwiki 36
zawiki 36
pcmwiki 35
wowiki 34
myvwiki 34
xhwiki 31
smwiki 31
gurwiki 30
kbpwiki 30
gcrwiki 30
ugwiki 29
tyvwiki 29
nrmwiki 29
amwiki 28
satwiki 27
udmwiki 27
nqowiki 27
pamwiki 26
kwwiki 25
amiwiki 24
yowiki 24
hawwiki 23
shiwiki 23
frpwiki 22
bpywiki 22
mhrwiki 22
acewiki 20
trvwiki 20
kabwiki 19
roa_rupwiki 19
crhwiki 17
twwiki 17
quwiki 17
csbwiki 16
zuwiki 16
bowiki 15
lnwiki 14
omwiki 14
kshwiki 14
pntwiki 13
mnwwiki 12
nywiki 11
kvwiki 11
tywiki 11
blkwiki 11
xalwiki 10
vowiki 10
rwwiki 9
ltgwiki 9
bugwiki 8
hakwiki 8
pdcwiki 8
nahwiki 8
atjwiki 8
sswiki 7
miwiki 5
olowiki 5
tetwiki 5
jamwiki 5
ffwiki 4
lbewiki 4
bmwiki 4
tpiwiki 4
tiwiki 3
cuwiki 3
cdowiki 3
lgwiki 3
arcwiki 3
guwwiki 2
gagwiki 2
mniwiki 2
pagwiki 2
angwiki 2
gucwiki 2
gotwiki 1
adywiki 1
tswiki 1
tnwiki 1
taywiki 1
mdfwiki 1
kcgwiki 1
kiwiki 1
vewiki 1
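
As a rough illustration of skipping Wikipedias with too few SLIS, here is a minimal plain-Python sketch. The `MIN_SLIS` cutoff of 1000 and the helper name `wikis_with_enough_slis` are assumptions made up for this example, not part of the pipeline; the counts are a handful of rows copied from the listing above.

```python
# Hypothetical cutoff: the minimum number of SLIS a wiki needs to be kept.
MIN_SLIS = 1000

# A few (wiki, count) pairs copied from the listing above.
slis_counts = {
    "enwiki": 237706,
    "viwiki": 38210,
    "siwiki": 1252,
    "scnwiki": 952,
    "vewiki": 1,
}


def wikis_with_enough_slis(counts, min_slis=MIN_SLIS):
    """Return wikis whose SLIS count is at least min_slis, largest first."""
    kept = [wiki for wiki, count in counts.items() if count >= min_slis]
    return sorted(kept, key=counts.get, reverse=True)


selected = wikis_with_enough_slis(slis_counts)
print(selected)
```

The right cutoff is a product decision. In Spark, the same cut could presumably be applied by filtering the grouped counts on the `count` column.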
Sep 22 2023, 2:59 PM · Growth-Team (Sprint 0 (Growth Team)), Structured-Data-Backlog, GrowthExperiments-Homepage, Growth-Structured-Tasks, GrowthExperiments-NewcomerTasks, Image-Suggestions
mfossati reopened T343844: NEW BUG REPORT fiwiki’s section-level image suggestions aren’t generated in production as "Open".

@xcollazo @mforns , actually it's essential to understand the expected behavior of DagProperties: if they don't automatically update Airflow's variables, new deployments won't be sustainable.
Re-opening the ticket: from our side, we can check what happens in the next deployment, but it would be great to get your advice, too.

Sep 22 2023, 2:59 PM · Structured-Data-Backlog, Data-Engineering
mfossati closed T343844: NEW BUG REPORT fiwiki’s section-level image suggestions aren’t generated in production as Resolved.
  • Section topics:
In [1]: st = spark.read.parquet('/user/analytics-platform-eng/structured-data/section_topics/2023-09-11')
In [2]: st.where(st.wiki_db == 'fiwiki').count()
Out[2]: 2367759
  • Section alignment suggestions:
In [3]: ali = spark.read.parquet('/user/analytics-platform-eng/structured-data/section-alignment-suggestions/suggestions/2023-09-11')
In [4]: ali.where(ali.target_wiki_db == 'fiwiki').count()
Out[4]: 28518
  • Section-level image suggestions:
In [5]: isu = spark.read.table('analytics_platform_eng.image_suggestions_suggestions').where('snapshot="2023-09-11"')
In [6]: slis = isu.where(isu.section_index.isNotNull())
In [7]: slis.where(slis.wiki == 'fiwiki').count()
Out[7]: 55348

Awesome! Closing.

Sep 22 2023, 2:53 PM · Structured-Data-Backlog, Data-Engineering