
Develop a reusable Metrics Platform schema fragment for translation workflows
Closed, Resolved · Public

Description

Need

In translation workflows (Content Translation, MinT for Readers, MinT in Translate Extension), we need to capture information about the translation source and the translation target at various points (for example: language code, page information, etc.). The current Metrics Platform schema doesn't fit this use case for two reasons:

  1. The need to capture information about two pages (source and target).
  2. In most cases, the interactions happen on a single special page, not on the actual pages themselves.

We could capture this information as part of action_context, but since it is needed across all of the translation workflows, a reusable schema fragment might be better.

Proposed solution

A reusable schema fragment to capture the following (not all events will use all the fields):

{
    "source_lang": "en",
    "target_lang": "hi",
    "source_type": "page",
    "target_type": "machine_translation",
    "source_id": 12345,
    "target_id": 45621,
    "modification_pct": 24.2
}

Possible values:

  • For source type:
    • page: when an actual page is being used for the translation
    • message_group: for the MinT in Translate Extension use case
    • message
  • For target type:
    • page
    • machine_translation (in case of MinT for Readers)
    • message

Event Timeline

@Pginer-WMF @ngkountas @abi_ @Wangombe I have listed the data that we usually need to capture for source & target. Please share if there is something else that I might be missing. We can evolve the schema later as well.

Thanks for capturing this in a ticket, @KCVelaga_WMF. The proposed parameters make sense to me. The only one that raises some questions is the target page.

For example, with MinT for Wiki Readers, translating the page Moon from English to Igbo will produce machine-translated content in Igbo. There may or may not be a page on Igbo Wikipedia for that topic. If there is one, the UI surfaces the option for users to read it. Is the equivalent page on Igbo Wikipedia what is expected to be captured as the "target page"?

If that is the case, we need to consider it optional, since it may not exist yet for many pages. Also, is the purpose to be able to discriminate the cases where an alternative human-created option exists, or do you have other uses for this parameter in mind?

@Pginer-WMF Yes, not all events will use all the fields. If the target is a machine translation (as in the case of MinT for Readers), then target_id is not relevant.

@VirginiaPoundstone There is no fixed timeline, but the sooner the better. For now, and until we have this, we will capture the required data as a JSON blob in action_context.

@VirginiaPoundstone What is a reasonable estimate from your side? We can plan future instrumentation accordingly.

VirginiaPoundstone added a subscriber: phuedx.

@KCVelaga_WMF adding this to our current sprint and hope to get a review done within the week.

@phuedx did a quick glance, but this requires a little more intellectual work. The data contract has a top level object called "page" with a structure (ID, title, namespace, etc). The proposal is not representing the pages in the same structure, so we want to take a moment to document options for representing it in the same way.

VirginiaPoundstone raised the priority of this task from Medium to High. Jul 22 2024, 3:36 PM

Just to let you all know that I'm looking into this.

Hi, sorry for the delay.
I've been studying your proposal, and have one question / alternative proposal which would not require creating an additional schema or schema fragment.

From the description of this task, I understand that we will be measuring a workflow, as opposed to an isolated event?
I understand that, depending on the context (Content Translation, MinT for Readers, MinT in Translate Extension), the information to capture is slightly different.
Some contexts have target page, some don't, right?

The question is: Are those pieces of information (source page and target page) all manifest at the same time, atomically?
Or, for instance, will we know about the source page first, and then about the target page (if any)?

I'm asking because maybe we could use 2 related events instead of 1 isolated event to capture the information about the source page and the target page respectively.
And use the existing MP base schema for both of them. For example:

  • We could fire one event whenever we know about the source page, and fill its standard page fields with the source page information.
  • Then whenever we know about the target page, we can fire another event, and fill its standard page fields with the target page information.
  • Both events would be labeled with the theme of this experiment, for instance setting the action field to "translation_workflow".
  • Each event would be labeled with the kind of page it's holding information for, either "source_page" or "target_page". This could be stored in the field action_subtype.
  • The field action_context could hold the context of the workflow: content_translation, mint_for_readers, or mint_in_translate_extension.
  • page_type could be stored using the action_source field.
  • modification_pct... hmmmm, there's no numeric field in the base schema that can hold this number properly. Well, I guess I will finish my proposal, and then we can come back to this.

To correlate both events generated by the same user and workflow, we could use either the performer's id, pageview_id or session_id, depending on what suits best.
The queries used to analyze the data would be a bit more complex:

-- Instead of this:
SELECT
    source_lang,
    target_lang,
    source_type,
    target_type,
    source_id,
    target_id,
    modification_pct
FROM our_table
;

-- It would be something like:
WITH source_events AS (
    SELECT *
    FROM our_table
    WHERE action_context = "source_page"
),
target_events AS (
    SELECT *
    FROM our_table
    WHERE action_context = "target_page"
)
SELECT
    src.page.content_language,
    trg.page.content_language,
    src.action_source,
    trg.action_source,
    src.page.id,
    trg.page.id,
    ??? -- modification_pct
FROM source_events AS src
JOIN target_events AS trg
ON (src.performer.pageview_id = trg.performer.pageview_id)
;
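The same pairing logic, sketched in Python for illustration. The field names (pageview_id, action_subtype) follow the labeling proposed above, and the events themselves are invented; this is not production code.

```python
def pair_events(events):
    """Pair source_page and target_page events that share a pageview_id."""
    by_pageview = {}
    for event in events:
        # Group events by pageview, keyed by which page they describe.
        by_pageview.setdefault(event["pageview_id"], {})[event["action_subtype"]] = event
    # Keep only pageviews where both halves of the workflow were captured.
    return [
        (group["source_page"], group["target_page"])
        for group in by_pageview.values()
        if "source_page" in group and "target_page" in group
    ]
```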

I can see the downside of this approach on the data analysis side.
On the other hand, with this approach we would:

  • Avoid having to create, maintain and decommission a schema.
  • Keep all the information about pages standard.
  • Honor the atomic approach of MP by issuing smaller, more atomic pieces of information, as opposed to bigger events.
  • MAYBE, if the translation workflow is more like a funnel than a single event, by capturing all stages of the funnel separately, we can analyse funnel rates?

In any case, we would have to solve the modification_pct issue, since we don't have any numeric field in the common fragment...
@phuedx Do you think we could add one?

Sorry for the long post! And if this doesn't make sense to y'all, please disagree! 🙏🏼

In any case, we would have to solve the modification_pct issue, since we don't have any numeric field in the common fragment...
@phuedx Do you think we could add one?

Would action_context suffice, i.e. SELECT CAST(action_context AS DECIMAL(4, 2)) AS modification_pct?

This would only work if the source_page and target_page labels are put in the action_subtype field, as you said. However, in your example query you're using the action_context field :)

@mforns: I just discussed this with @KCVelaga_WMF and confirmed that multiple pieces of information will be "all manifest at the same time, atomically"

Something that hasn't been made clear, and that I think will help you understand the challenge here, is that the request for a schema fragment is meant to simplify analysis, because without it the

{
    "source_lang": "en",
    "target_lang": "hi",
    "source_type": "page",
    "target_type": "machine_translation",
    "source_id": 12345,
    "target_id": 45621,
    "modification_pct": 24.2
}

would be JSON-encoded and stored as a string in action_context for events and then the analyst would have to parse the JSON string and extract individual values, rather than referring to easy-to-access and data-typed fields/columns.
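For illustration, this is the extra per-row work the analyst faces with the blob approach. The row below is made up; with the fragment, these would instead be directly addressable, typed columns.

```python
import json

# A row as it would land in the events table: the translation fields are
# hidden inside a JSON-encoded string in action_context.
row = {
    "action_context": '{"source_lang": "en", "target_lang": "hi", "modification_pct": 24.2}'
}

context = json.loads(row["action_context"])       # per-row parsing step
source_lang = context["source_lang"]              # untyped until parsed
modification_pct = float(context["modification_pct"])
```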

@mpopov Thanks for the clarifications!

I just discussed this with @KCVelaga_WMF and confirmed that multiple pieces of information will be "all manifest at the same time, atomically"

Cool, understood.

... without it, the <code> would be JSON-encoded and stored as a string in action_context for events and then the analyst would have to parse the JSON string and extract individual values, rather than referring to easy-to-access and data-typed fields/columns.

Yes, I imagined this was the issue. I was just proposing to use 2 events instead of 1, to capture the information of the source page and the target page respectively (see details in my comment above). This way, the existing fields in the base schema would suffice* to store all the mentioned pieces of information, and we would need neither a JSON blob nor a new schema. (*) Maybe we'd have problems storing modification_pct.

That said, I was thinking this approach would be nice, if the source page was chosen in a first step, and the target page appeared in a second step. This would be like treating this instrument like a funnel, and it would make sense to have 2 separate events sharing the same base schema.

However, since this is not the case, I think my idea is probably not the best approach, because it would split an otherwise atomic event, and make the experiment more prone to errors, and unnecessarily complex at query time. So, I will go ahead and prepare a proposal for a new fragment that can hold all the necessary information for this task. 👍

@mforns thanks for taking this up and sharing your thoughts.

I also wanted to share an example, which can be helpful to understand the use case overall and also might be helpful during developing the fragment.

In the Translate Extension (schema), when a user opens the translation interface, we have a session-initiation event, for which I have currently proposed a JSON blob in action_context to capture the following information:

{
    "source_lang": str,
    "target_lang": str,
    "source_id": int,
    "source_type": "message_group",
    // the following are specific to the Translate extension
    "is_mint_available": boolean,
    "translatable_count": int,
    "translated_count": int
}

The source and target languages and source_id all populate at the same time. The user already has a default source language and is on a page; they then click to translate to a language, which brings them to the translation interface.

Note: I have only proposed to have a fragment for fields that are applicable across all the translation workflows, and not those that are specific to only one. In the above example, the last three fields are not relevant to other translation tools.

Change #1061096 had a related patch set uploaded (by Mforns; author: Mforns):

[schemas/event/secondary@master] Add MP fragment schema for translation workflows

https://gerrit.wikimedia.org/r/1061096

Hey all!
As you can see in the gerritbot message above, I created a first version of the fragment.
I tried to include all the fields there, even the ones that are specific to just one translation workflow.
The reason is that we should try to avoid creating a new schema for each instrument / translation workflow.
If we can re-use existing schemas, we reduce the time-to-data.
Let me know your opinions!
🙏🏻

Thank you @mforns

That mostly looks good to me. I have added some minor comments on the patch, and added @ngkountas, @abi_ & @Wangombe for review as well.

@VirginiaPoundstone The patch is yet to be merged. The schema looks good, and @ngkountas also confirmed on the patch, so I think it can be merged.

Change #1061096 merged by jenkins-bot:

[schemas/event/secondary@master] Add MP fragment schema for translation workflows

https://gerrit.wikimedia.org/r/1061096

@mforns thank you.

A few questions/clarifications:

  • The schemaID in the instruments has been /analytics/product_metrics/web/base/1.2.0. Should that be changed to /fragment/analytics/product_metrics/translation/1.0.0?
  • The schema title in event stream config is currently analytics/product_metrics/web/base. Should that be changed to /fragment/analytics/product_metrics/translation?
  • There is no need to have a new stream name, is that right?
  • Any other changes to do?
  • How will the column names for custom data appear in the Hive table?

Hi @KCVelaga_WMF!

  • The schemaID in the instruments has been /analytics/product_metrics/web/base/1.2.0. Should that be changed to /fragment/analytics/product_metrics/translation/1.0.0?

We cannot use the new translation fragment as-is from the instrumentation code. We need to create a schema that contains it. I will do that and let you review it this week 👍

  • The schema title in event stream config is currently analytics/product_metrics/web/base. Should that be changed to /fragment/analytics/product_metrics/translation?

Is the configuration you're mentioning the one for mediawiki.product_metrics.mint_for_readers in line 1714? If so, it should be changed to match the path to the new schema that I will create. Also, I assume you'll need to add there the new fields (from the translation fragment) that you want to collect, as well.

  • There is no need to have a new stream name, is that right?

I think since we will use a new schema, all translation events will use a new stream indeed.

  • Any other changes to do?

I don't think so! I'll check whether we need to configure the import and refinement of this new stream into the data lake, or whether it will work automatically. In any case, this would be on our side.

  • How will the column names for custom data appear in the Hive table?

Once the import and refinement of the stream into the data lake are working, the process will fetch the schema and materialize the table in Hive automatically.

I'll post more comments once I have the new schema.
Cheers

Change #1071017 had a related patch set uploaded (by Mforns; author: Mforns):

[schemas/event/secondary@master] Add metrics platform schema for web translation workflows

https://gerrit.wikimedia.org/r/1071017

I think this should be it. I imagine the schema is for web only, right? Not app. Thanks :-)

Change #1071017 merged by jenkins-bot:

[schemas/event/secondary@master] Add metrics platform schema for web translation workflows

https://gerrit.wikimedia.org/r/1071017

@mforns

Is the configuration you're mentioning the one for mediawiki.product_metrics.mint_for_readers in line 1714? If so, it should be changed to match the path to the new schema that I will create. Also, I assume you'll need to add there the new fields (from the translation fragment) that you want to collect, as well.

Yes, that's right. I will add the new fields.

I think since we will use a new schema, all translation events will use a new stream indeed.

Got it.

I don't think so! I'll check whether we need to configure the import and refinement of this new stream into the data lake, or it will work automatically. In any case, this would be on our side.

Thank you.

I think this should be it. I imagine the schema is for web only, right? Not app. Thanks :-)

Yes, but both desktop and mobile web.

I also reviewed the newly created schema, thank you.