xcollazo (Xabriel J. Collazo Mojica)
Sr. Software Engineer for Wikimedia

User Details

User Since
Jun 9 2022, 6:42 PM (40 w, 6 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
XCollazo-WMF [ Global Accounts ]

Recent Activity

Yesterday

xcollazo closed T332031: Upgrade platform_eng Airflow instance to 2.5.1 as Resolved.

Just opened T332820 and T332822 to take care of the remaining issues as time allows.

Wed, Mar 22, 8:23 PM · Structured-Data-Backlog (Current Work)
xcollazo created T332822: Investigate datahub stack trace on an-airflow1004.eqiad.wmnet.
Wed, Mar 22, 8:21 PM · Data Pipelines
xcollazo created T332820: Investigate dangling tables after Airflow 2.5.1 upgrade of an-airflow1004.eqiad.wmnet.
Wed, Mar 22, 8:16 PM · Data Pipelines
xcollazo moved T329282: [S] Exclude short sections from having image suggestions from Doing to Code Review on the Structured-Data-Backlog (Current Work) board.
Wed, Mar 22, 8:09 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions

Tue, Mar 21

xcollazo changed the status of T329282: [S] Exclude short sections from having image suggestions, a subtask of T311814: [EPIC] Section-level image suggestions data pipeline, from Open to In Progress.
Tue, Mar 21, 6:28 PM · Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions, Research-Backlog, Epic
xcollazo changed the status of T329282: [S] Exclude short sections from having image suggestions from Open to In Progress.
Tue, Mar 21, 6:28 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions
xcollazo updated the task description for T331456: [S] Filter out all .svg files from section-level image suggestions.
Tue, Mar 21, 6:22 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions
xcollazo moved T331456: [S] Filter out all .svg files from section-level image suggestions from Ready for Development to Code Review on the Structured-Data-Backlog (Current Work) board.
Tue, Mar 21, 6:22 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions
xcollazo added a comment to T331456: [S] Filter out all .svg files from section-level image suggestions.

@CBogen: Do we want to remove the SVGs only from section-level image suggestions, or from all suggestions?

Tue, Mar 21, 2:08 AM · Patch-For-Review, Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions

Mon, Mar 20

xcollazo changed the status of T331456: [S] Filter out all .svg files from section-level image suggestions from Open to In Progress.
Mon, Mar 20, 6:05 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions
xcollazo changed the status of T331456: [S] Filter out all .svg files from section-level image suggestions, a subtask of T311814: [EPIC] Section-level image suggestions data pipeline, from Open to In Progress.
Mon, Mar 20, 6:05 PM · Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions, Research-Backlog, Epic
xcollazo closed T328644: [M] Build the section-level image suggestions weighted tag dataset for search indices, a subtask of T311814: [EPIC] Section-level image suggestions data pipeline, as Resolved.
Mon, Mar 20, 5:37 PM · Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions, Research-Backlog, Epic
xcollazo closed T328644: [M] Build the section-level image suggestions weighted tag dataset for search indices, a subtask of T311829: [XL] Combine suggestions based on section topics with section alignment ones and convert notebook code into idiomatic data pipeline code, as Resolved.
Mon, Mar 20, 5:37 PM · Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions
xcollazo closed T328644: [M] Build the section-level image suggestions weighted tag dataset for search indices as Resolved.
Mon, Mar 20, 5:37 PM · Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions
xcollazo updated the task description for T328644: [M] Build the section-level image suggestions weighted tag dataset for search indices.
Mon, Mar 20, 5:36 PM · Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions
xcollazo added a comment to T329363: Upgrade Hadoop test cluster to Bullseye.

Also the anaconda-wmf package isn't available in bullseye

Mon, Mar 20, 3:31 PM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning

Fri, Mar 17

xcollazo closed T332182: GitLab CI acting funny for image_suggestions as Resolved.

Ok this one was fun. Explanation:

Fri, Mar 17, 8:53 PM · Structured-Data-Backlog (Current Work)
xcollazo changed the status of T332182: GitLab CI acting funny for image_suggestions from Open to In Progress.
Fri, Mar 17, 3:41 PM · Structured-Data-Backlog (Current Work)
xcollazo added a comment to T332182: GitLab CI acting funny for image_suggestions.

The culprit seems to be the definition of the bump_on_airflow_dags CI step. I deleted it temporarily and now MRs open up with a proper pipeline run triggered by the MR creation.

Fri, Mar 17, 3:40 PM · Structured-Data-Backlog (Current Work)
xcollazo added a comment to T328672: [M] Populate Hive tables that will feed Cassandra.

This has been merged into branch https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/tree/T311289-combined.

Fri, Mar 17, 2:27 PM · Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions
xcollazo added a comment to T332031: Upgrade platform_eng Airflow instance to 2.5.1.

No history was lost. Some dags have been renamed: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/commit/760f31789ee20f3e6e263fa4733ff51202fa52a0

So new dags were created when we deployed the last version of airflow-dags. In other words, the migration was not the problem.

Yet, there is some work to reconcile both histories if needed.

Fri, Mar 17, 1:53 PM · Structured-Data-Backlog (Current Work)

Thu, Mar 16

xcollazo closed T330688: Make sure our delta algorithm doesn't depend on successful past runs as Resolved.

This was deployed as part of T332031. Closing.

Thu, Mar 16, 6:52 PM · Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions, Epic
xcollazo closed T330688: Make sure our delta algorithm doesn't depend on successful past runs, a subtask of T311814: [EPIC] Section-level image suggestions data pipeline, as Resolved.
Thu, Mar 16, 6:52 PM · Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions, Research-Backlog, Epic
xcollazo changed the status of T332031: Upgrade platform_eng Airflow instance to 2.5.1 from Open to In Progress.
Thu, Mar 16, 6:51 PM · Structured-Data-Backlog (Current Work)
xcollazo updated subscribers of T332031: Upgrade platform_eng Airflow instance to 2.5.1.

Follow up items after Airflow 2.5.1 upgrade on platform_eng:

  • Seems like we lost history for 2 DAGs. One dag does have all history. @Antoine_Quhen is this something recoverable?
Thu, Mar 16, 6:23 PM · Structured-Data-Backlog (Current Work)
xcollazo added a comment to T332031: Upgrade platform_eng Airflow instance to 2.5.1.

Ok this has been done now.

Thu, Mar 16, 6:20 PM · Structured-Data-Backlog (Current Work)
xcollazo added a comment to T332031: Upgrade platform_eng Airflow instance to 2.5.1.

Preemptively paused all DAGs just now.

Thu, Mar 16, 3:13 PM · Structured-Data-Backlog (Current Work)

Wed, Mar 15

xcollazo added a comment to T310541: [NEEDS GROOMING] We should improve and automate python linting.

@xcollazo @Antoine_Quhen Does this apply to the airflow ci/cd that DE manages? Are there any improvements that we wish to adopt or expand?

If not, is this ticket still valid?

Wed, Mar 15, 7:13 PM · Data Pipelines
xcollazo added a comment to T330688: Make sure our delta algorithm doesn't depend on successful past runs.

Deployment is now blocked by the Airflow 2.5.1 upgrade (see T332031). We could just branch out for this deployment, but since the upgrade is slated for Thu Mar 16, it doesn't make much sense to pay the branching penalty for just one deployment.

Wed, Mar 15, 3:15 PM · Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions, Epic
xcollazo added a comment to T311825: [M] Create the section-level image suggestions Airflow DAG.

I propose that we just add this to the image-suggestions DAG rather than it having its own DAG. @mfossati @xcollazo @matthiasmullie what do you think?

Sounds good to me.

Wed, Mar 15, 2:51 PM · Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions
xcollazo created T332182: GitLab CI acting funny for image_suggestions.
Wed, Mar 15, 2:41 PM · Structured-Data-Backlog (Current Work)

Tue, Mar 14

xcollazo added a comment to T328670: Add section title column to image_suggestions.suggestions table schema.

Folks, on T328672, we are calling this column section_heading.

Tue, Mar 14, 4:04 PM · Cassandra, Section-Level-Image-Suggestions, Structured-Data-Backlog
xcollazo added a comment to T332031: Upgrade platform_eng Airflow instance to 2.5.1.

Scheduled for Thursday March 16 @ 16:00 UTC.

Tue, Mar 14, 3:40 PM · Structured-Data-Backlog (Current Work)
xcollazo created T332031: Upgrade platform_eng Airflow instance to 2.5.1.
Tue, Mar 14, 3:39 PM · Structured-Data-Backlog (Current Work)

Mon, Mar 13

xcollazo added a comment to T331647: Grant Hal deployment rights.

Hal needs to deploy to the platform-eng Airflow instance. So he needs platform-eng-deployers.

Mon, Mar 13, 6:04 PM · SRE, SRE-Access-Requests

Fri, Mar 10

xcollazo closed T331345: Install conda-analytics on Airflow servers as Resolved.

Confirmed that I can use spark3 from an-airflow1004.eqiad.wmnet:

Fri, Mar 10, 5:58 PM · Data Pipelines, Data-Engineering

Tue, Mar 7

xcollazo added a comment to T327970: Create airflow v2 instance and supporting repos for search platform.

Seems like we don't have a robust mechanism to share secrets. Airflow does provide hooks for integrating a secrets backend; see https://airflow.apache.org/docs/apache-airflow/2.1.4/security/secrets/secrets-backend/index.html#configuration.

@xcollazo we do have the ability to do this with our puppet stuff!
See docs here and here.

Tue, Mar 7, 4:31 PM · Discovery-Search (Current work), Data Pipelines, Data-Engineering-Planning
xcollazo added a comment to T327970: Create airflow v2 instance and supporting repos for search platform.

Re https://gerrit.wikimedia.org/r/894740, we should ask @mforns @Milimetric @JAllemandou about this. I think there might be a better way? Maybe the logic to get MW db hostnames and ports should be moved out of refinery python? Or, wmfdata-python uses the refinery bin/analytics-mysql CLI. Perhaps the whole thing should move out of refinery so it is installable without deploying refinery?

Hmm, in our case what we need is the ability to ship this information in a job into the yarn cluster. We use a tiny bit of python to push both the mysql credentials file and the .dblist files from mediawiki-config into the cluster and then have custom pyspark that processes them. Moving things out of refinery might help, but we don't use refinery here anyway.

Also Q. Would it be useful to have these set up as airflow connections? Our airflow puppet has the ability to configure airflow connections.

Not really, at least not for our use case. Airflow connections are about having airflow talk directly to those things, but here we have a job that runs inside spark and uses spark's mysql connector.

Tue, Mar 7, 3:56 PM · Discovery-Search (Current work), Data Pipelines, Data-Engineering-Planning

Mon, Mar 6

xcollazo created T331345: Install conda-analytics on Airflow servers.
Mon, Mar 6, 6:50 PM · Data Pipelines, Data-Engineering

Fri, Mar 3

xcollazo added a comment to T330688: Make sure our delta algorithm doesn't depend on successful past runs.

One issue we have is that the pipeline ran for snapshot=2023-02-20 while we were working on this task.

Fri, Mar 3, 3:38 PM · Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions, Epic
xcollazo added a comment to T330688: Make sure our delta algorithm doesn't depend on successful past runs.

Merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/253.

Fri, Mar 3, 3:33 PM · Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions, Epic

Wed, Mar 1

xcollazo added a comment to T330792: CodeReviewBot / gitlab-phabricator should parse a bug link in Gitlab Markdown in merge request.

My vote is for standardizing on Tnnn, as that is what I expect as a Phabricator user. It also avoids your ambiguity issue.

Wed, Mar 1, 7:34 PM · User-brennen, Release-Engineering-Team, GitLab (Integrations)
xcollazo added a comment to T330436: Load wmf.unique_editors_by_country_monthly into Druid for access in Turnilo & Superset.

Passing by to note that you can use wmf.unique_editors_by_country_monthly today in Superset by creating a dataset on top of the Hive table. I just did this, and generated this example world map dashboard based on it: https://superset.wikimedia.org/superset/dashboard/432/.

Wed, Mar 1, 7:31 PM · Data Pipelines, Data-Engineering-Planning
xcollazo added a comment to T330792: CodeReviewBot / gitlab-phabricator should parse a bug link in Gitlab Markdown in merge request.

This syntax is also supported right now:

Bug: #330688
Wed, Mar 1, 6:46 PM · User-brennen, Release-Engineering-Team, GitLab (Integrations)
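The two bug-link forms mentioned in the T330792 comments (`Bug: T330688` and `Bug: #330688`) could be normalized to the Tnnn form roughly as follows. This is only an illustrative sketch, not the CodeReviewBot / gitlab-phabricator implementation; the function name `extract_task_ids` and the regular expression are assumptions made for this example.

```python
import re
from typing import List

# Matches a "Bug:" trailer on its own line, in either form seen above:
#   Bug: T330688
#   Bug: #330688
# The pattern is an assumption for illustration, not the bot's actual rule.
BUG_TRAILER = re.compile(r"^Bug:\s*(?:T|#)(\d+)\s*$", re.MULTILINE)


def extract_task_ids(description: str) -> List[str]:
    """Return the referenced task IDs, normalized to the Tnnn form."""
    return [f"T{number}" for number in BUG_TRAILER.findall(description)]
```

Normalizing to Tnnn on the way in means downstream code only has to handle the one form that Phabricator users expect.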
xcollazo updated the task description for T330667: [M] Make sure DAGs are run in the correct order.
Wed, Mar 1, 2:05 AM · Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions, Epic
xcollazo closed T328641: Productionize the Airflow DAG of section alignment-based suggestions as Resolved.

Deployed to prod.

Wed, Mar 1, 2:04 AM · Patch-For-Review, Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions
xcollazo closed T328641: Productionize the Airflow DAG of section alignment-based suggestions, a subtask of T311814: [EPIC] Section-level image suggestions data pipeline, as Resolved.
Wed, Mar 1, 2:03 AM · Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions, Research-Backlog, Epic
xcollazo updated the task description for T328641: Productionize the Airflow DAG of section alignment-based suggestions.
Wed, Mar 1, 2:03 AM · Patch-For-Review, Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions

Tue, Feb 28

xcollazo updated the task description for T330792: CodeReviewBot / gitlab-phabricator should parse a bug link in Gitlab Markdown in merge request.
Tue, Feb 28, 8:09 PM · User-brennen, Release-Engineering-Team, GitLab (Integrations)
xcollazo created T330792: CodeReviewBot / gitlab-phabricator should parse a bug link in Gitlab Markdown in merge request.
Tue, Feb 28, 8:09 PM · User-brennen, Release-Engineering-Team, GitLab (Integrations)

Mon, Feb 27

xcollazo created T330688: Make sure our delta algorithm doesn't depend on successful past runs.
Mon, Feb 27, 6:24 PM · Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions, Epic
xcollazo created T330686: Wikidata dumps wikidata-20230220-all.json.bzip2 file missing from 20230220 dump.
Mon, Feb 27, 6:05 PM · Dumps-Generation
xcollazo added a comment to T328641: Productionize the Airflow DAG of section alignment-based suggestions.

Opened T330667 for following up on proper sensors for DAG run order.

Mon, Feb 27, 3:31 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions
xcollazo created T330667: [M] Make sure DAGs are run in the correct order.
Mon, Feb 27, 3:30 PM · Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions, Epic

Fri, Feb 24

xcollazo added a comment to T328644: [M] Build the section-level image suggestions weighted tag dataset for search indices.

This is being taken care of via T328672.

Fri, Feb 24, 9:46 PM · Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions
xcollazo added a comment to T328672: [M] Populate Hive tables that will feed Cassandra.

MR up for review at https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/merge_requests/10.

Fri, Feb 24, 9:44 PM · Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions
xcollazo updated the task description for T328672: [M] Populate Hive tables that will feed Cassandra.
Fri, Feb 24, 7:38 PM · Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions

Wed, Feb 22

xcollazo updated subscribers of T330234: Differential privacy airflow-dags merge request.

The only task that remains to be done that I am unable to do is putting the conda env that this script runs on into archiva or the airflow-dags hdfs file, which I don't currently have permission to access.

@Htriedman this step is done automatically when we deploy to the production analytics instance. It gets pulled from the URI you specify on the artifacts file. So as soon as your DAG gets merged, any of the folks with admin privilege on that instance can deploy to prod on your behalf (me included).

Wed, Feb 22, 7:39 PM · Data Pipelines (sprint 10), Data-Engineering

Tue, Feb 21

xcollazo awarded T329525: Create Presto test clusters with 10 new nodes and try reproduce issue a Like token.
Tue, Feb 21, 4:12 PM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10)

Feb 16 2023

xcollazo added a comment to T328641: Productionize the Airflow DAG of section alignment-based suggestions.

Addressed review suggestions to https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/228.

Feb 16 2023, 5:39 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions
xcollazo added a comment to T323036: [L] Exclude media topics from section topics dataset.

It's merged, and the bot agrees! 😄 Closing.

Feb 16 2023, 5:20 PM · Structured-Data-Backlog (Current Work), Section-Topics

Feb 14 2023

xcollazo added a comment to T328641: Productionize the Airflow DAG of section alignment-based suggestions.

(this is still waiting for review)

Feb 14 2023, 6:35 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions

Feb 13 2023

xcollazo closed T323597: [M] Exclude date format topics from section topics pipeline, a subtask of T311745: [EPIC] Section topics data pipeline, as Resolved.
Feb 13 2023, 9:43 PM · Data Pipelines, Research-Backlog, Structured-Data-Backlog (Current Work), Section-Topics, Epic
xcollazo closed T323597: [M] Exclude date format topics from section topics pipeline as Resolved.

Released new version via https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/234.

Feb 13 2023, 9:43 PM · Data Pipelines, Structured-Data-Backlog (Current Work), Section-Topics
xcollazo merged T329057: Change my 'full name' in GitLab into T113792: Change LDAP cn to something more useful (was Rename "Dzahn" to "Daniel Zahn" in Gerrit).
Feb 13 2023, 3:14 PM · Infrastructure-Foundations, LDAP, Gerrit
xcollazo merged task T329057: Change my 'full name' in GitLab into T113792: Change LDAP cn to something more useful (was Rename "Dzahn" to "Daniel Zahn" in Gerrit).
Feb 13 2023, 3:13 PM · GitLab
xcollazo added a comment to T329057: Change my 'full name' in GitLab.

All right, added some guidance to the onboarding template around the username field when registering with Wikitech.

Feb 13 2023, 3:12 PM · GitLab

Feb 10 2023

xcollazo moved T328641: Productionize the Airflow DAG of section alignment-based suggestions from Doing to Code Review on the Structured-Data-Backlog (Current Work) board.
Feb 10 2023, 9:40 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions
xcollazo added a comment to T328672: [M] Populate Hive tables that will feed Cassandra.

Let's do a debug session if you have the time @Cparle.

Feb 10 2023, 3:32 PM · Structured-Data-Backlog (Current Work), Section-Level-Image-Suggestions

Feb 9 2023

xcollazo updated the task description for T328641: Productionize the Airflow DAG of section alignment-based suggestions.
Feb 9 2023, 10:00 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions

Feb 8 2023

xcollazo added a comment to T329057: Change my 'full name' in GitLab.

IIRC, I didn't create my LDAP user while onboarding. Do we know who this feedback should go to?

I would love to know who created a Developer account for you if you did not create it yourself. I ask that as the account creation log indicates that the account was created directly via Wikitech as a new account and not via any system for tracking people creating accounts for others: https://wikitech.wikimedia.org/w/index.php?title=Special:Log&logid=953193.

Feb 8 2023, 7:36 PM · GitLab
xcollazo added a comment to T329057: Change my 'full name' in GitLab.

Thanks for looking into this folks. I understand this is not possible right now, and I do use gerrit quite often, so I'll close this.

Feb 8 2023, 2:14 AM · GitLab

Feb 7 2023

xcollazo added a comment to T328641: Productionize the Airflow DAG of section alignment-based suggestions.
Feb 7 2023, 9:51 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions
xcollazo created T329057: Change my 'full name' in GitLab.
Feb 7 2023, 2:49 PM · GitLab

Feb 6 2023

xcollazo updated the task description for T328641: Productionize the Airflow DAG of section alignment-based suggestions.
Feb 6 2023, 9:20 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions
xcollazo added a comment to T328641: Productionize the Airflow DAG of section alignment-based suggestions.
Feb 6 2023, 9:20 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions
xcollazo changed the status of T328641: Productionize the Airflow DAG of section alignment-based suggestions, a subtask of T311814: [EPIC] Section-level image suggestions data pipeline, from Open to In Progress.
Feb 6 2023, 3:37 PM · Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions, Research-Backlog, Epic
xcollazo changed the status of T328641: Productionize the Airflow DAG of section alignment-based suggestions from Open to In Progress.
Feb 6 2023, 3:37 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions

Feb 3 2023

xcollazo claimed T328641: Productionize the Airflow DAG of section alignment-based suggestions.
Feb 3 2023, 3:14 PM · Patch-For-Review, Structured-Data-Backlog (Current Work), Data Pipelines, Section-Level-Image-Suggestions

Feb 1 2023

xcollazo added a comment to T323597: [M] Exclude date format topics from section topics pipeline.

Merge of
https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/merge_requests/15
and
https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/merge_requests/17

Feb 1 2023, 4:45 PM · Data Pipelines, Structured-Data-Backlog (Current Work), Section-Topics

Jan 31 2023

xcollazo added a comment to T314592: Requesting membership of the analytics group in gerrit for 'snwachukwu', 'nokafor', and 'xcollazo'.

@xcollazo Does @JEbe-WMF need to create a similar request?

Jan 31 2023, 4:49 PM · Release-Engineering-Team (Blocking 🧱), Gerrit-Privilege-Requests, Data-Engineering-Radar

Jan 30 2023

xcollazo added a project to T323597: [M] Exclude date format topics from section topics pipeline: Data Pipelines.
Jan 30 2023, 5:02 PM · Data Pipelines, Structured-Data-Backlog (Current Work), Section-Topics

Jan 26 2023

xcollazo updated the task description for T327970: Create airflow v2 instance and supporting repos for search platform.
Jan 26 2023, 3:27 PM · Discovery-Search (Current work), Data Pipelines, Data-Engineering-Planning

Jan 24 2023

xcollazo added a comment to T323107: [M] Upgrade code base to Spark 3.

@xcollazo I think Spark 2 does not push this filter down because it does not infer filters from generators: the optimizer rule InferFiltersFromGenerate, which is present in Spark 3.1.2, does not seem to exist in Spark 2.4.4.

Jan 24 2023, 4:05 PM · Structured-Data-Backlog (Current Work), Data Pipelines, Section-Topics
xcollazo added a comment to T314592: Requesting membership of the analytics group in gerrit for 'snwachukwu', 'nokafor', and 'xcollazo'.

Great, thanks!

Jan 24 2023, 2:03 AM · Release-Engineering-Team (Blocking 🧱), Gerrit-Privilege-Requests, Data-Engineering-Radar

Jan 23 2023

xcollazo claimed T323597: [M] Exclude date format topics from section topics pipeline.
Jan 23 2023, 10:58 PM · Data Pipelines, Structured-Data-Backlog (Current Work), Section-Topics
xcollazo closed T323905: [M] Automate Airflow DAG release as Resolved.

I have left comments on the pipeline logic for possible future generalization of the solution so that other folks could benefit from it.

Jan 23 2023, 4:18 PM · Structured-Data-Backlog (Current Work), Data Pipelines
xcollazo added a comment to T323905: [M] Automate Airflow DAG release.

Partial automation has been implemented for section-topics as well via https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/merge_requests/10.

Jan 23 2023, 4:14 PM · Structured-Data-Backlog (Current Work), Data Pipelines
xcollazo added a comment to T323107: [M] Upgrade code base to Spark 3.

Nice find @MunizaA!

Jan 23 2023, 3:53 PM · Structured-Data-Backlog (Current Work), Data Pipelines, Section-Topics

Jan 20 2023

xcollazo added a comment to T323905: [M] Automate Airflow DAG release.

Partial automation has been implemented for image-suggestions via MR https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/merge_requests/9.

Jan 20 2023, 8:23 PM · Structured-Data-Backlog (Current Work), Data Pipelines

Jan 19 2023

xcollazo changed the status of T323905: [M] Automate Airflow DAG release from Open to In Progress.
Jan 19 2023, 3:50 PM · Structured-Data-Backlog (Current Work), Data Pipelines
xcollazo triaged T324125: Figure out a good place for static HDFS helper files for the structured data team. as Low priority.
Jan 19 2023, 3:49 PM · Structured-Data-Backlog, Section-Topics, Data Pipelines
xcollazo added a comment to T324125: Figure out a good place for static HDFS helper files for the structured data team..

@xcollazo , what about setting paths with VariableProperties, pretty much as we do with the conda artifact?
Something like helper = var_props.get('helper', '/path/to/hdfs')

Jan 19 2023, 3:48 PM · Structured-Data-Backlog, Section-Topics, Data Pipelines
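The `var_props.get('helper', '/path/to/hdfs')` suggestion in the T324125 comment follows an "override, else default" lookup pattern. A minimal sketch of that pattern is below. `VarPropsSketch` is a hypothetical stand-in written for this example, not the real VariableProperties class, and the override path is made up; `/path/to/hdfs` is the placeholder from the comment itself.

```python
from typing import Any, Mapping, Optional


class VarPropsSketch:
    """Hypothetical stand-in: return an override if one is set, else the default."""

    def __init__(self, overrides: Optional[Mapping[str, Any]] = None) -> None:
        # Copy the overrides so later changes by the caller do not leak in.
        self._overrides = dict(overrides or {})

    def get(self, key: str, default: Any) -> Any:
        return self._overrides.get(key, default)


# No override configured: the default path baked into the DAG is used.
var_props = VarPropsSketch()
helper = var_props.get("helper", "/path/to/hdfs")

# Override configured (made-up path): the DAG picks it up without a code change.
test_props = VarPropsSketch({"helper": "/tmp/test/helper"})
test_helper = test_props.get("helper", "/path/to/hdfs")
```

The appeal of this approach is that the helper file location stays a per-deployment setting rather than a constant in the pipeline code.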

Jan 18 2023

xcollazo closed T323614: [M] Reduce image_suggestion HDFS files footprint as Resolved.
Jan 18 2023, 7:57 PM · Data Pipelines, Structured-Data-Backlog (Current Work), Image-Suggestions
xcollazo added a comment to T323614: [M] Reduce image_suggestion HDFS files footprint.

Confirmed that the systemd timer is present on an-launcher1002:

xcollazo@an-launcher1002:~$ systemctl list-timers | grep drop-image-suggestions
Mon 2023-01-23 13:00:00 UTC  4 days left         n/a                          n/a                 drop-image-suggestions.timer                              drop-image-suggestions.service
Jan 18 2023, 7:57 PM · Data Pipelines, Structured-Data-Backlog (Current Work), Image-Suggestions
xcollazo added a comment to T323614: [M] Reduce image_suggestion HDFS files footprint.

The only remaining task here is the merging of https://gerrit.wikimedia.org/r/c/operations/puppet/+/870974/, which I hope will happen in the next day or so.

Jan 18 2023, 3:25 PM · Data Pipelines, Structured-Data-Backlog (Current Work), Image-Suggestions
xcollazo added a comment to T323614: [M] Reduce image_suggestion HDFS files footprint.

Just for fun:

Jan 18 2023, 3:22 PM · Data Pipelines, Structured-Data-Backlog (Current Work), Image-Suggestions
xcollazo added a comment to T325837: Do a one-off run of refinery-drop-older-than to delete old data from image_suggestions.

It took ~4 hours to run! This makes sense considering the number of partitions and files to move to the trash.

Jan 18 2023, 3:12 PM · Structured-Data-Backlog (Current Work), Data Pipelines, Image-Suggestions
xcollazo closed T326827: make user analytics-platform-eng available on the node that runs refinery systemd timers, a subtask of T325837: Do a one-off run of refinery-drop-older-than to delete old data from image_suggestions, as Resolved.
Jan 18 2023, 3:08 PM · Structured-Data-Backlog (Current Work), Data Pipelines, Image-Suggestions
xcollazo closed T326827: make user analytics-platform-eng available on the node that runs refinery systemd timers as Resolved.

Confirmed that user analytics-platform-eng and the keytab are available on an-launcher1002:

Jan 18 2023, 3:08 PM · Structured-Data-Backlog (Current Work), Data Pipelines, Image-Suggestions