
xLab: New validation rules for selected contextual attributes and risk level when registering/modifying an instrument
Closed, Resolved | Public | 5 Estimated Story Points

Description

As part of T401384: FY25-26 SDS2.1.5 User Experience - Attribute Selection, we have to improve xLab validation so that it advises users when the selected contextual attributes modify the selected risk level while registering an instrument. It should also advise users when they add contextual attributes that may have implications for privacy or for the risk level that can be assigned to their instrument. All of this must be done according to the Data Collection Guidelines.

This task aims to implement the needed validation and guidance in xLab.

Affected contextual attributes

According to the Data Collection Guidelines, the relevant contextual attributes would be the following:

  • agent_ua_string: Considered as Personal Information (not available yet but we are working on it T385180: Implement agent.ua_string as contextual attribute)
  • performer_id: Considered as user ID
  • performer_name: Considered as username
  • page_id: Considered as long-term viewing history
  • page_title: Considered as long-term viewing history

Relevant criteria

The Data collection risk tiering grid defines the 5 criteria that any data collection must meet to be considered as Low risk. Some of these criteria are already implicitly met by the platform and xLab (the ones related to the data subject, data sender, data recipient and the retention period), so here we will focus on the one that is related to the collected data itself.

Depending on that data (which is collected via contextual attributes) the risk level could change to be considered as medium in the following cases:

  • The data collected does not include:
    • multiple items of unhashed personal information (not applicable, there is only one contextual attribute, agent_ua_string, that can be considered as personal information)
    • personal information + username/user ID or app ID:
      • Any combination of agent_ua_string + performer_id/performer_name/agent_app_install_id could make this criterion fail
      • Technically page_id/page_title + agent_app_install_id when the user is logged-in could make this criterion fail (not possible to know ahead of time but a warning message could be shown)
    • long-term viewing history + unique ID
      • Any combination of page_id/page_title + performer_id/performer_name could make this criterion fail
    • granular geographic data + unique ID (not applicable, there are no contextual attributes related to geographical data)
    • sensitive data (not applicable, there are no contextual attributes related to sensitive data)

There are also some cases where the instrument risk level will have to be considered as high:

  • The data collected includes agent_ua_string + performer_name/performer_id + page_id/page_title because two low risk criteria (see above) would fail:
    • Any combination of agent_ua_string + performer_id/performer_name/agent_app_install_id could make this criterion fail
    • Any combination of page_id/page_title + performer_id/performer_name could make this criterion fail
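
The criteria above can be read as a small decision rule. The following is a minimal sketch, not xLab's actual implementation: the names RiskTier, failed_criteria and minimum_risk_tier are hypothetical, and it assumes that each failed low-risk criterion raises the tier by one (one failure means medium, two failures mean high), which matches the combinations listed above.

```python
from enum import IntEnum


class RiskTier(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Attribute groups as described in the task (grouping names are hypothetical).
PERSONAL_INFO = {"agent_ua_string"}
USER_IDS = {"performer_id", "performer_name"}
APP_IDS = {"agent_app_install_id"}
VIEWING_HISTORY = {"page_id", "page_title"}


def failed_criteria(selected_attributes):
    """Return the low-risk criteria that the selected attributes fail."""
    selected = set(selected_attributes)
    failed = []
    # Criterion: no personal information + username/user ID or app ID.
    if selected & PERSONAL_INFO and selected & (USER_IDS | APP_IDS):
        failed.append("personal information + username/user ID or app ID")
    # Criterion: no long-term viewing history + unique ID.
    if selected & VIEWING_HISTORY and selected & USER_IDS:
        failed.append("long-term viewing history + unique ID")
    return failed


def minimum_risk_tier(selected_attributes):
    """Assumption: 0 failed criteria -> low, 1 -> medium, 2 -> high."""
    return RiskTier(1 + len(failed_criteria(selected_attributes)))
```

For example, minimum_risk_tier(["agent_ua_string", "performer_id", "page_id"]) fails both criteria and therefore yields RiskTier.HIGH. Note that page_id/page_title + agent_app_install_id is deliberately not treated as a failure here, because it only applies to logged-in users and is handled as a warning instead.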

Warning/error messages and user guidance

The main goal here should be to give advice to users when registering their instruments, specifically when filling in the fields that are related to contextual attributes and the risk level. The following are potential scenarios where xLab can take some action:

  • The user is selecting the contextual attributes:
    • xLab could show a message suggesting the required risk level based on the selected contextual attributes as they select them

Screenshot 2025-09-26 at 11.10.08.png (235×761 px, 35 KB)

    • xLab could show a warning message if page_id/page_title + agent_app_install_id are selected, because that would require a security and legal review if the user is logged in (as mentioned above, this is not possible to know ahead of time)
  • The user defines Risk assessment pending:
    • xLab wouldn't need to check anything else because the user hasn't finished configuring the instrument yet. They could modify the instrument later, add/remove contextual attributes and define the corresponding risk level. xLab should wait until then
  • The user chooses Low risk as the risk level:
    • xLab will check that the selected contextual attributes meet that risk level, according to the relevant criteria explained above. If not, an error message will be thrown and the user won't be able to register/modify the instrument until they fix this

Screenshot 2025-09-26 at 11.13.01.png (214×726 px, 36 KB)

  • The user chooses Medium risk as the risk level:
    • xLab could check whether there really is a combination of contextual attributes that requires that risk level. If there isn't, xLab could show an error message explaining this (taking the criteria above into account). The user would have to change the risk level to be able to register/modify the instrument
  • The user chooses High risk as the risk level:
    • xLab could check whether there really is a combination of contextual attributes that requires that risk level. If there isn't, xLab could show an error message explaining this (taking the criteria above into account). The user would have to change the risk level to be able to register/modify the instrument
  • The user is using a custom schema:
    • Some additional attributes could be collected via the custom schema so we should, at least, show a warning message about it so that instrument owners can check those attributes and whether the chosen risk level is the appropriate one

In any case, if the user doesn't yet have the corresponding Security and legal review, they will have to set the risk level to Risk assessment pending. That will allow them to register/modify the instrument, but it won't be possible to activate it until the Risk Level is set to a specific tier and the Security and legal review link is provided, if needed. The user will always be able to modify the instrument again to set the appropriate risk level and the corresponding link.
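
The activation rule in the paragraph above can be sketched as a single check. This is only an illustration under stated assumptions: the function name can_activate and the string values "pending", "low", "medium" and "high" are hypothetical and not taken from xLab's code.

```python
from typing import Optional


def can_activate(risk_level: str, review_link: Optional[str]) -> bool:
    """Decide whether an already registered instrument may be activated.

    Assumed risk_level values: "pending", "low", "medium", "high".
    """
    if risk_level == "pending":
        # Registering/modifying is allowed while pending, activating is not.
        return False
    if risk_level in ("medium", "high"):
        # A Security and legal review link is needed for these tiers.
        return bool(review_link and review_link.strip())
    # Low risk needs no review link.
    return risk_level == "low"
```

With this rule an instrument saved as Risk assessment pending can always be stored, and can_activate only becomes true once a concrete tier is chosen and, for medium or high risk, a non-empty review link is present.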

Acceptance criteria

Scenario 0: Before selection

  • Update the “Contextual attributes” field description to mention their impact on risk: “Collect extra information about the users who triggered the event and the wiki where the event occurred. Some attribute combinations will increase the risk level of this instrument. Learn more”.
  • We could include information about "risk-increasing" combinations in the contextual attributes' doc page, which would complement the field's description and also support selection. Maybe this is what was meant by the AC "Documentation should be updated to include a section on regulation and data collection guidelines".

Scenario 1: Users select contextual attributes that impact risk before selecting the Risk level.

  • Display an information message below the Risk level field, where it can inform the user’s selection. Example copy: “Based on the selected Contextual attributes, this instrument has a minimum risk level of “{{Tier:Risk}}” and it requires a Security and Legal review”. The message is more significant in the context of that field, as it supports user selection.

Scenario 2: Users have selected, or select, a risk level that's lower than what their latest attribute selection requires

In this case, regarding the selected attributes, we will consider the following:

  • Any combination of agent_ua_string + performer_id/performer_name/agent_app_install_id would require Tier 2: Medium risk as the selected risk level
  • Any combination of page_id/page_title + performer_id/performer_name would require Tier 2: Medium risk as the selected risk level
  • Any combination of agent_ua_string + performer_name/performer_id + page_id/page_title would require Tier 3: High risk as the selected risk level
  • As specified in the ticket, displaying an inline warning message under the Contextual attributes field sounds good. I'd suggest indicating something like: "The selected attributes increase the data collection risk level of this instrument. Please review the Risk level field selection" (because a corresponding error message will be displayed there)
  • The “Risk level” field should display an error state, as indicated in the task description. The copy could be simplified: “The contextual attributes selected increase the risk of this instrument to “{{Tier:Risk level}}””
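
A minimal sketch of the Scenario 2 check on the Risk level field follows. It assumes the minimum tier implied by the selected attributes has already been computed and is passed in as required_tier. The function name risk_level_error, the integer encoding of tiers and the "Tier 1: Low risk" label are assumptions; the error copy is the simplified text proposed above.

```python
from typing import Optional

# Tier 2 and Tier 3 labels come from the acceptance criteria;
# the Tier 1 label is an assumption made for completeness.
TIER_LABELS = {
    1: "Tier 1: Low risk",
    2: "Tier 2: Medium risk",
    3: "Tier 3: High risk",
}


def risk_level_error(selected_tier: Optional[int], required_tier: int) -> Optional[str]:
    """Return the error copy for the Risk level field, or None if it is valid.

    selected_tier is None while the user keeps "Risk assessment pending".
    """
    if selected_tier is None:
        # Pending: nothing to validate yet.
        return None
    if selected_tier >= required_tier:
        # An equal or higher tier is accepted without further checks.
        return None
    return (
        "The contextual attributes selected increase the risk of this "
        f'instrument to "{TIER_LABELS[required_tier]}"'
    )
```

When this function returns a message, the inline warning under the Contextual attributes field would be shown as well, pointing the user to the Risk level field.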

No specific scenario

  • The selection of a higher level of risk won't be validated, given that there might be other factors influencing this choice
  • A warning message is shown (along with the contextual attributes field) when page_id/page_title + agent_app_install_id are selected, regardless of the risk level the user selects (that combination could require a security and legal review if the user is logged in, which is not possible to know ahead of time)
  • A warning message (along with the Risk level field) is shown when the user selects a custom schema (additional and unknown attributes could be collected via that custom schema)
  • If the user doesn't yet have the corresponding Security and legal review, they will still be able to choose Risk assessment pending as the Risk level for their instrument. That will allow them to register/modify the instrument, but it won't be possible to activate it until the Risk Level is set to a specific tier and the Security and legal review link is provided, if needed. The user will always be able to modify the instrument again to set the appropriate risk level and the corresponding link (that's the current behaviour)
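
The two non-blocking warnings in this list could be collected roughly as follows. This is a sketch only: the function name regulation_warnings, the field keys "contextual_attributes" and "risk_level", and the message texts are all hypothetical placeholders, not the copy used in xLab.

```python
from typing import Iterable, List, Tuple

VIEWING_HISTORY = {"page_id", "page_title"}
APP_INSTALL_ID = "agent_app_install_id"


def regulation_warnings(
    selected_attributes: Iterable[str], uses_custom_schema: bool
) -> List[Tuple[str, str]]:
    """Return (field, message) pairs; warnings never block registration."""
    selected = set(selected_attributes)
    warnings = []
    if selected & VIEWING_HISTORY and APP_INSTALL_ID in selected:
        # Shown regardless of the chosen risk level, because whether the
        # end user is logged in cannot be known at configuration time.
        warnings.append((
            "contextual_attributes",
            "This combination may require a Security and legal review "
            "if the instrument is active for logged-in users.",
        ))
    if uses_custom_schema:
        # A custom schema may collect attributes that cannot be inspected here.
        warnings.append((
            "risk_level",
            "The custom schema may collect additional attributes. "
            "Please check that the chosen risk level is appropriate.",
        ))
    return warnings
```

Keeping warnings separate from errors reflects the acceptance criteria: these cases inform the instrument owner but do not prevent registering or modifying the instrument.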

Documentation

Reference

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
New validation rules and advice about contextual attributes and risk level | repos/data-engineering/test-kitchen!253 | sfaci | T401390-security-review-validation-rules | main

Event Timeline

Sfaci renamed this task from xLab: New validation rules regarding selected contextual attributes and risk level when registering an instrument to xLab: New validation rules for selected contextual attributes and risk level when registering/modifying an instrument. Aug 7 2025, 2:18 PM

Let's review the revised user experience of the regulation and write out the user experience together.

One consideration: should a user be able to register/start using an instrument or start an experiment if a legal review is pending?

Milimetric set the point value for this task to 5.
Milimetric moved this task from Incoming to Backlog on the Test Kitchen board.
Sfaci removed the point value 5 for this task.
Sfaci updated the task description. (Show Details)
Sfaci updated the task description. (Show Details)
Sfaci updated the task description. (Show Details)
Sfaci set the point value for this task to 5. Sep 18 2025, 11:20 PM

This is great @Sfaci. The (previously implemented) validation on moderate and high risk needing security review is great, and I like that I can save while pending, but that the instrument cannot be activated.

One note:

In any case, if the user doesn't have yet the corresponding Security and legal review, they will have to set the risk level as Risk assessment pending. That will allow them to register/modify the instrument but it won't be possible to activate it until the Security and legal review link is provided. The user will always be able to modify again the instrument to set the appropriate risk level and the corresponding link.

Security and legal review link is only needed if Moderate or High (so the user could activate it without security and legal review if the risk level is low). More concretely: "It won't be possible to activate it until the Risk Level is set to a specific tier, and the legal review link is provided, if needed"

Let me know if you need to review copy or how you present these warnings in the UI.

Let me know if you need to review copy or how you present these warnings in the UI.

Thanks @JVanderhoop-WMF! I have added your clarification. Things are clearer now.

Regarding how to present these warnings (and all the messages and errors in general), I'm not sure yet. While this is being reviewed I'm also exploring the implementation itself to see which options we have with Codex and the current design of xLab's UI. We have to combine some information/warning/error messages depending on the case.

  • long-term viewing history + unique ID
    • Any combination of page_id/page_title + performer_id could make this criterion fail

I would add performer_name to that, because that is also considered a unique ID.

Technically page_id/page_title + agent_app_install_id would also make this criterion fail, but only if the end user is logged in, which is not possible to know ahead of time when configuring the instrument. Could we show a warning/note to the instrument owner notifying them of potential risk elevation if their instrument will be active for logged-in users?

We don't need to consider the high risk as a potential scenario here because that one would be only reached when failing, at least, two or more of the low risk criteria:

Actually, it is possible. agent_ua_string + performer_name + page_title would be a high risk activity because it fails 2 low risk criteria:

  • data collected does not include personal information + username/user ID or app ID
  • data collected does not include long-term viewing history + unique ID

Thanks @mpopov for your feedback! Really valuable

Actually, it is possible. agent_ua_string + performer_name + page_title would be a high risk activity because it fails 2 low risk criteria:

Regarding the above, I have assumed that performer_name would be equivalent to performer_id and page_title to page_id, right?

Sfaci updated the task description. (Show Details)

Hey there! I think that the flow described in the ticket is already very good. Sharing some suggestions and alternatives for consideration:

Scenario 0: Before selection
It'd be great if we could anticipate the risk-related information to users. We could do this in the following ways:

  1. By updating the “Contextual attributes” field description to mention their impact on risk “Collect extra information about the users who triggered the event and the wiki where the event occurred. Some attribute combinations will increase the risk level of this instrument. Learn more”.
  2. We could include information about "risk-increasing" combinations in the contextual attributes' doc page, which would complement the field's description and also support selection. Maybe this is what was meant by the AC "Documentation should be updated to include a section on regulation and data collection guidelines".
  3. We could display the attributes’ risk level in the attribute selection menu, e.g., using Codex’s “supportingText” slot in MenuItem.

(This recommendation was based on a misunderstanding: if the risk is derived from the combination of attributes, including individual indicators might create confusion. Relying on the validation after selection sounds like a better idea.)

Scenario 1: Users select contextual attributes that impact risk before selecting the Risk level.

Providing a risk assessment message under the Contextual attributes field based on user selection sounds good. But I think it would be more relevant to:

  1. Display an information message below the Risk level field, where it can inform the user’s selection. Copy e.g., : “Based on the selected Contextual attributes, this instrument has a minimum risk level of “{{Tier:Risk}}” and it requires a Security and Legal review”. The message is more significant in the context of that field, as it supports user selection.

Scenario 2: Users had selected or select a risk level that's lower than their latest attribute selection

  1. As specified in the ticket, displaying an inline warning message under the Contextual attributes field sounds good. I'd suggest indicating something like: "The selected attributes increase the data collection risk level of this instrument. Please review the Risk level field selection" (because a corresponding error message will be displayed there)
  2. The “Risk level” field should display an error state, as indicated in the task description. The copy could be simplified: “The contextual attributes selected increase the risk of this instrument to “{{Tier:Risk level}}””

Regarding

The user chooses High risk as the risk level: xLab could check whether there really is a combination of contextual attributes that requires that risk level. If there isn't, xLab could show an error message explaining this (taking the criteria above into account). The user would have to change the risk level to be able to register/modify the instrument.

OR

An error message is shown (along with the risk level field) when the selected risk level is higher than what's needed (if the previous AC doesn't apply)

  1. I’d say we shouldn’t validate the selection of a higher level of risk, given that there might be other factors influencing this choice? Validating the cases where the selected risk is lower than required sounds sufficient, but of course this is just a proposal.

Please let me know if any of the points need clarification or visual support! Happy to go through this synchronously if that sounds better, @Sfaci. Thank you!

Thank you very much @Sarai-WMF!!

We could add an indicator of risk in the contextual attributes' doc page, which would complement the field's description and also support selection. Maybe this is what was meant by the AC "Documentation should be updated to include a section on regulation and data collection guidelines".

It's a good point, and yes, we wanted to add some details to the project's documentation. What you propose is interesting; we will add the related combinations of contextual attributes to that page in the project's documentation. The purpose of the mentioned AC was to create a specific documentation section for the Regulation section, explaining some details that aren't trivial, but I agree with you that we should also include something in the page we have about contextual attributes, mentioning the relevant combinations that can affect the risk level of the instrument.

I’d say we shouldn’t validate the selection of a higher level of risk, given that there might be other factors influencing this choice? Validating the cases where the selected risk is lower than required sounds sufficient, but of course this is just a proposal.

You are right, we will remove those messages. Some details about the instrument itself could be missed by xLab so it's better not to do anything about it.

Regarding the rest of your feedback, I think it's great and really valuable. We will incorporate it as is. Thanks!

sfaci updated https://gitlab.wikimedia.org/repos/data-engineering/mpic/-/merge_requests/253

Draft: New validation rules and advice about contextual attributes and risk level

cjming merged https://gitlab.wikimedia.org/repos/data-engineering/mpic/-/merge_requests/253

New validation rules and advice about contextual attributes and risk level

Change #1193136 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[operations/deployment-charts@master] xLab: Deploying v1.0.6 release to staging

https://gerrit.wikimedia.org/r/1193136

Sfaci updated the task description. (Show Details)

Change #1193136 merged by jenkins-bot:

[operations/deployment-charts@master] xLab: Deploying v1.0.6 release to staging

https://gerrit.wikimedia.org/r/1193136

Change #1193878 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[operations/deployment-charts@master] xLab: Deploying v1.0.6 release to production

https://gerrit.wikimedia.org/r/1193878

This ticket is again in "Needs Review" just to review the ACs related to its documentation updates (the implementation part is already on the way to staging/production environments):

Change #1193878 merged by jenkins-bot:

[operations/deployment-charts@master] xLab: Deploying v1.0.6 release to production

https://gerrit.wikimedia.org/r/1193878

hi @Sfaci - I reviewed the documentation and it looks really good to me! I consider the documentation ACs to be done 🎉

Hey there! I took a quick look at the changes related to this task (I didn't try out all attribute combos alternating selection order, nor tested in different browsers, but...): things look great!

The only nitpicky designy thing that I would suggest (for a follow-up ticket) would be reducing the number of messages displayed in the Regulation section. Right now, up to 3 messages could be displayed at the same time, which can be quite an information overload:

Screenshot 2025-10-07 at 19.19.23.png (1×2 px, 232 KB)

Ideas for your consideration:

  1. The storage time limit information could be included in a section description. Similarly to what xLab does with the 'Name' section, the 'Regulation' section could also be prefaced by an introduction. The information currently displayed as a static information message could be mentioned as part of that text.

Screenshot 2025-10-07 at 19.21.52.png (378×2 px, 83 KB)

  2. Present risk level heads-ups using a single message, for example (still quite dense, copy could be simplified):

Screenshot 2025-10-07 at 20.05.01.png (846×1 px, 139 KB)

The only nitpicky designy thing that I would suggest (for a follow-up ticket) would be reducing the number of messages displayed in the Regulation section. Right now, up to 3 messages could be displayed at the same time, which can be quite an information overload:
. . . .

Thanks @Sarai-WMF! It's a really good point!
I have filed a follow-up ticket to cover this work: T406729: xLab: Reduce the number of messages shown in the Regulation Section