
[XL] Let section alignment consume section topics output
Closed, Resolved (Public)

Description

Section-Topics already has an optional filter that handles media links. It can replace one section alignment component, namely the script that extracts section images.
This is an opportunity to remove duplicate behavior and consolidate shared logic.

NOTE: this ticket covers only one section alignment input; the model that learns alignments is a separate task.

Tasks

  • merge more fine-grained logic from section alignment's article_images.py into section topics handle_media
  • remove article_images.py from the section alignment pipeline
  • remove the corresponding task in the section alignment DAG
  • make sure section alignment's recommendation.py takes section topics' image dataset as input in the section alignment DAG
  • update tests
  • merge section-alignment suggestions into section topics
  • remove the section-alignment DAG
  • add section alignment suggestions to section topics' DAG
  • update image suggestions DAG

Event Timeline

Hi @mfossati, one question: will you need research support for this task?

Hey @Miriam, no. We might ping @MunizaA in case we need help on the section alignment code.

Given that no research work is needed for now, I'm going to remove the task from our backlog. Please add the Research tag back if you need the team's help with a specific component of it. Thanks!

MarkTraceur renamed this task from Let section alignment consume section topics output to [XL] Let section alignment consume section topics output. Jul 10 2024, 8:38 PM
mfossati changed the task status from Open to In Progress. Sep 20 2024, 2:47 PM
mfossati claimed this task.
mfossati added a subscriber: Cparle.

I think we're good for a round of review, CC @Cparle.

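isu = spark.read.table('analytics_platform_eng.image_suggestions_suggestions')
# Rows without a section index: article-level image suggestions
alis = isu.where(isu.section_index.isNull())
# Rows with a section index: section-level image suggestions
slis = isu.where(isu.section_index.isNotNull())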

alis.groupBy('snapshot').count().orderBy('snapshot').toPandas()
     snapshot     count
0  2024-09-30  24284047
1  2024-10-07  24287195
2  2024-10-14  24290046
3  2024-10-21  24302041
4  2024-10-28  24329950
5  2024-11-04  24339009

slis.groupBy('snapshot').count().orderBy('snapshot').toPandas()
     snapshot    count
0  2024-09-30  1414417
1  2024-10-07  1278966
2  2024-10-14  1353679
3  2024-10-21  1287597
4  2024-11-04  1354594

The 2024-11-04 snapshot includes the changes from this ticket. Raw counts seem in line with past runs, so I'm closing.
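For anyone who wants to put a number on "in line with past runs", here is a minimal sketch in plain Python. It compares the 2024-11-04 count against the mean of the earlier snapshots, using the counts copied from the two tables above. The helper name `deviation_from_past` and the choice of a simple mean as the baseline are assumptions made for illustration; they are not part of the pipeline.

```python
from statistics import mean

# Counts copied from the snapshot tables in this ticket.
alis_counts = {
    "2024-09-30": 24284047,
    "2024-10-07": 24287195,
    "2024-10-14": 24290046,
    "2024-10-21": 24302041,
    "2024-10-28": 24329950,
    "2024-11-04": 24339009,
}
slis_counts = {
    "2024-09-30": 1414417,
    "2024-10-07": 1278966,
    "2024-10-14": 1353679,
    "2024-10-21": 1287597,
    "2024-11-04": 1354594,
}


def deviation_from_past(counts, latest):
    """Relative difference between the latest count and the mean of earlier snapshots.

    Hypothetical helper, not part of the image suggestions code base.
    """
    past = [count for snapshot, count in counts.items() if snapshot != latest]
    baseline = mean(past)
    return (counts[latest] - baseline) / baseline


for name, counts in (("alis", alis_counts), ("slis", slis_counts)):
    print(f"{name}: {deviation_from_past(counts, '2024-11-04'):+.2%}")
```

Both deviations come out within a few percent of the past mean, which matches the eyeball check. A mean baseline is crude, and a stricter check could compare against the min/max range of earlier snapshots instead.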