
HTML diff dataset for SDS 1.2.3
Closed, ResolvedPublic

Description

The "who are moderators" SDS 1.2.3 (T371865) project requires HTML diff data, i.e. the HTML of a revision and the HTML of its parent revision. See more details in T378617.

As HTML datasets are not currently available in the data lake, this task tracks two initiatives:

  1. T380871: Create a one-off HTML diff dataset to unblock work on "who are moderators" SDS 1.2.3 for Q2.
  2. T380874: Request Data Engineering to prioritize adding an HTML dataset to the data lake.

Task 1 keeps progress on SDS 1.2.3 from being blocked, and task 2 is needed to accomplish the goal of this project.

Details

Due Date
Nov 24 2024, 11:00 PM

Event Timeline

Adding a quick snippet to look at the percentage of revisions whose parent is the immediately previous revision. It is generally above 99%; for enwiki it is 99.98%.

This means that with the incremental HTML dataset paired with an HTML snapshot from Enterprise, we can generate moderator actions for almost all revisions.

import polars as pl
from pyspark.sql import functions as F
from pyspark.sql.window import Window

wiki_dbs = ["frwiki", "dewiki", "enwiki", "arwiki"]
wikitext_df = (
    spark.table("wmf_dumps.wikitext_raw_rc2")
    .where(F.col("wiki_db").isin(*wiki_dbs))
)

# Position of each revision within its page, ordered by revision_id.
page_key = ["wiki_db", "page_id"]
w = Window.partitionBy(page_key).orderBy("revision_id")
df_with_pos = wikitext_df.withColumn("pos", F.row_number().over(w) - 1)

current_pos_df = df_with_pos.select(*page_key, "revision_id", "pos")

# Self-join to look up the position of each revision's parent revision.
df_with_pos_p = (
    df_with_pos.alias("df1")
    .join(
        current_pos_df.alias("df2"),
        on=(F.col("df1.wiki_db") == F.col("df2.wiki_db"))
        & (F.col("df1.page_id") == F.col("df2.page_id"))
        & (F.col("df1.revision_parent_id") == F.col("df2.revision_id")),
        how="left",
    )
    .select(F.col("df1.*"), F.col("df2.pos").alias("pos_p"))
)

# A distance of 1 means the parent is the immediately previous revision.
df_with_distance = df_with_pos_p.withColumn(
    "distance", F.when(F.col("pos_p").isNotNull(), F.col("pos") - F.col("pos_p"))
)

# distance_counts = df_with_distance.groupBy("distance").count().orderBy("distance")
# p = distance_counts.to_polars()
parent_is_previous = (
    df_with_distance.groupBy(
        "wiki_db", (F.col("distance") == 1).alias("parent_is_previous")
    )
    .count()
    .to_polars()
)
# Per-wiki share of revisions whose parent is the previous revision.
(
    parent_is_previous.filter(pl.col("parent_is_previous").is_not_null()).with_columns(
        (pl.col("count") / pl.sum("count").over("wiki_db")).alias("percentage")
    ).sort("wiki_db", "parent_is_previous")
)
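The position-join logic above can be illustrated on toy data without Spark. A minimal pure-Python sketch (hypothetical revision IDs, not real wiki data): sort a page's revisions by ID, assign positions, and check whether each revision's parent sits exactly one position earlier.

```python
def parent_is_previous_fraction(revisions):
    """revisions: list of (revision_id, revision_parent_id) tuples for one page.

    Returns the fraction of revisions (with a known parent) whose parent is
    the immediately previous revision, or None if no parent is resolvable.
    """
    ordered = sorted(revisions, key=lambda r: r[0])
    pos = {rev_id: i for i, (rev_id, _) in enumerate(ordered)}
    hits = total = 0
    for i, (_, parent_id) in enumerate(ordered):
        if parent_id not in pos:
            continue  # e.g. the page's first revision has no parent here
        total += 1
        if i - pos[parent_id] == 1:  # "distance" of 1, as in the Spark code
            hits += 1
    return hits / total if total else None

# Revision 40 points back to revision 20, skipping 30 (e.g. an edit to an
# older revision), so 2 of 3 resolvable parents are the previous revision.
revs = [(10, 0), (20, 10), (30, 20), (40, 20)]
print(parent_is_previous_fraction(revs))
```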

image.png (421×645 px, 59 KB)

@diego What's the decision on creating the one-off HTML dumps? From the description, there are two options. If we have decided to go with option 1, we should start right away. If that's the case, please move this task to In Progress.

My understanding is that we need to work on these two things in parallel: the first keeps work on SDS 1.2.3 from stalling, and the second fully accomplishes the goal of this project.
@fkaelin can provide more details.

diego changed the task status from Open to In Progress.Nov 5 2024, 6:11 PM
diego set Due Date to Nov 24 2024, 11:00 PM.Nov 7 2024, 8:26 AM
leila triaged this task as High priority.Nov 20 2024, 8:02 PM

Updates:

  • The one-off HTML dataset has been generated, see details here.
  • No updates on the request to deploy the incremental HTML pipeline.

> No updates on the request to deploy the incremental html pipeline.

I think there is an outstanding question in the MR.

Thanks @Ottomata, I didn't see that comment from @gmodena: "do you have an idea of how large the HTML content might be?" With the data from the one-off HTML dataset, we do have some numbers for the HTML size in bytes:

+-------+---------+---------------+------+---------------+---------------+---------------+
|wiki_db|max_bytes|25th percentile|median|75th percentile|95th percentile|99th percentile|
+-------+---------+---------------+------+---------------+---------------+---------------+
|svwiki |2840652  |20540          |28738 |46178          |202934         |558509         |
|nlwiki |5852151  |21006          |39471 |94467          |365683         |1271868        |
|enwiki |9772989  |58968          |122304|293487         |970911         |1886269        |
|jawiki |6805994  |54561          |121944|279853         |924963         |1883142        |
|zhwiki |4826631  |65752          |166430|400902         |1185292        |1901383        |
|dewiki |6938253  |23481          |46664 |116092         |516461         |1277333        |
|ruwiki |5904112  |51398          |90471 |182772         |668348         |1332989        |
|itwiki |4233212  |41875          |85334 |203483         |574650         |1077771        |
|plwiki |4326222  |32305          |57799 |128165         |513554         |1389121        |
|arzwiki|3577816  |28376          |39325 |61151          |215750         |774739         |
|eswiki |4479112  |42121          |80644 |191561         |736356         |1379022        |
|frwiki |5373803  |49261          |99789 |215959         |758912         |1703571        |
+-------+---------+---------------+------+---------------+---------------+---------------+
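For context, per-wiki percentiles like those in the table are straightforward to compute (in Spark this would typically go through `percentile_approx`). A minimal pure-Python sketch with hypothetical sizes, not real wiki data:

```python
import statistics

def size_percentiles(sizes_bytes):
    """Return (max, p25, p50, p75, p95, p99) for a list of HTML sizes in bytes."""
    # statistics.quantiles with n=100 yields the 1st..99th percentile cut
    # points; the "inclusive" method interpolates linearly between data points.
    q = statistics.quantiles(sizes_bytes, n=100, method="inclusive")
    return (max(sizes_bytes), q[24], q[49], q[74], q[94], q[98])

# Hypothetical sizes: 100 values from 1 KB to 100 KB.
sizes = list(range(1000, 101000, 1000))
print(size_percentiles(sizes))
```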

Updates:

  • An initial report for "who are moderators" was shared, which used the one-off dataset T380871
  • I am closing this task as resolved, but keeping the request for an HTML dataset open. T380874 is slated for Q4 at the earliest, more likely for the next FY. Depending on when and how the HTML dataset becomes available, we can create a new task for generating a diff version of it.