Page MenuHomePhabricator

Fix EventLogging schemas that use array for items type
Closed, ResolvedPublic5 Estimate Story Points

Description

We're working on using the actual JSONSchemas to create EventLogging Hive tables. Some EventLogging schemas have array fields, but they specify the items schema of those fields as arrays, e.g.

CentralNoticeBannerHistory

"on": {
    "description": "Enabled languages among all user selected languages.",
    "type": "array",
    "required": false,
    "items": [
        {
            "type": "string",
            "description": "Language code",
            "required": false
        }
    ]
},

This makes it incompatible with Hive, where a given field can only have a single type. Can we edit the offending schemas so that they specify items as an object, e.g.

"on": {
    "description": "Enabled languages among all user selected languages.",
    "type": "array",
    "required": false,
    "items": {
        "type": "string",
        "description": "Language code",
        "required": false
    }
},

?

This was caused by an old bug in the EventLogging extension that prevented users from properly specifying array items types. We have fixed the bug, so we should be able to fix the schemas now.

IIUC, this should be compatible with existing events.

So far I've found:

Details

Related Gerrit Patches:
mediawiki/extensions/TemplateWizard : masterUpdate EventLogging schema revision number
mediawiki/extensions/CentralNotice : masterUpdate Schema:CentralNoticeBannerHistory version
mediawiki/extensions/EventLogging : masterAllow use of object schema for array `items`

Event Timeline

Ottomata renamed this task from MobileWikiAppiOSUserHistory schema uses array for items type to Fix EventLogging schemas that use array for items type.Mar 18 2019, 10:51 PM
Ottomata updated the task description. (Show Details)
Ottomata added a subscriber: AndyRussG.
chelsyx moved this task from Triage to Next Up on the Product-Analytics board.Mar 19 2019, 5:59 PM
chelsyx triaged this task as Medium priority.Mar 20 2019, 11:47 PM
Nuria added a subscriber: Nuria.Mar 21 2019, 4:15 PM

Can we have an ETA of when this work would be completed (since it is not a code patch it will not be visible here on ticket). Once fixed we can refine that data and make it available on hive using the schema.

Nuria added a comment.Mar 21 2019, 4:17 PM

FYI that this blocks some changes to improve refinement of data, cause until these schemas are changed they cannot be refined with the new code. Ping @AndyRussG

Nuria moved this task from Incoming to Data Quality on the Analytics board.Mar 21 2019, 4:18 PM

FYI that this blocks some changes to improve refinement of data, cause until these schemas are changed they cannot be refined with the new code. Ping @AndyRussG

Thanks, we'll take a look! :)

Hi @Ottomata , IIUC this only requires changes on the schema on meta wiki (MobileWikiAppiOSUserHistory), no change is needed on the app side, correct? Of course we will need to change to the updated revision number of the schema in the app, but since this should be compatible with existing events and there will always be some users never update their app, it seems changes on the app is not really needed. If this is the case, I can change the schema today; if changes on the client side is required, we have to wait till the next app release (a few weeks I guess).

This is an example of the JSON string that the app is currently sending to MobileWikiAppiOSUserHistory:

{
    event =     {
        "app_install_id" = "1375F38A-A9E7-4614-82B8-C62B18B73C8E";
        "event_dt" = "2019-03-21T11:13:43-07:00";
        "feed_disabled" = 0;
        "feed_enabled_list" =         {
            cr = 1;
            fa =             {
                on =                 (
                    zh,
                    en
                );
            };
            ns =             {
                on =                 (
                    zh,
                    en
                );
            };
            od =             {
                on =                 (
                    zh,
                    en
                );
            };
            pd = 1;
            pl =             {
                on =                 (
                    zh,
                    en
                );
            };
            rd =             {
                on =                 (
                    zh,
                    en
                );
            };
            rp = 1;
            tr =             {
                on =                 (
                    zh,
                    en
                );
            };
        };
        "is_anon" = 1;
        "measure_font_size" = 100;
        "measure_readinglist_itemcount" = 0;
        "measure_readinglist_listcount" = 1;
        "primary_language" = en;
        "readinglist_showdefault" = 0;
        "readinglist_sync" = 0;
        "search_tab" = 0;
        "session_id" = "B07810C7-29AF-4A47-AA7B-7CED42EAA662";
        theme = default;
        "trend_notify" = 0;
    };
    revision = 18222579;
    schema = MobileWikiAppiOSUserHistory;
    wiki = enwiki;
}

After change the schema to:

"on": {
    "description": "Enabled languages among all user selected languages.",
    "type": "array",
    "required": false,
    "items": {
        "type": "string",
        "description": "Language code",
        "required": false
    }
},

It is still compatible with the event JSON above, right? And is there going to be any changes in how it present in the database? i.e. is the hive struct still going to looks like:

"feed_enabled_list": 
{"cr":true,
"fa":{"on":["en","zh"],"off":null},
"ns":{"on":["en","zh"],"off":null},
"od":{"on":["en","zh"],"off":null},
"pd":true,
"pl":{"on":["en","zh"],"off":null},
"rd":{"on":["en","zh"],"off":null},
"rp":true,
"tr":{"on":["en","zh"],"off":null}}

IIUC this only requires changes on the schema on meta wiki

I believe so yes! Unless yall have reasons for specifying the items type as an array, I think there isn't a reason to do it. I believe the single array items object is correct for what you are sending, i.e. an array of strings.

@Ottomata I was trying to change the schema on meta wiki, but got this error message:

Error: Invalid node: expecting "array", got "object". Path: "Schema -> Properties (properties) -> feed_enabled_list -> Properties (properties) -> fa -> Properties (properties) -> on -> Items (items)"

Do you know what's wrong?

HUH! I wonder if this is why the items are arrays......is EventLogging extension enforcing this with its weirdo JSONSchema? Will check...

chelsyx moved this task from Next Up to Doing on the Product-Analytics board.Mar 22 2019, 12:35 AM

So, wow. EventLogging's schemaschema.json is forcing that the array items is itself an array, which only validates that each element in the instance array matches the item type in the schema items array. E.g.

items: [string] will validate ["a", 1,2,3, {"bad": "thing"}] just fine! EventLogging is very wrong about this. From https://tools.ietf.org/html/draft-zyp-json-schema-03:

5.5. items
This attribute defines the allowed items in an instance array, and
MUST be a schema or an array of schemas. The default value is an
empty schema which allows any value for items in the instance array.
When this attribute value is a schema and the instance value is an
array, then all the items in the array MUST be valid according to the
schema.
When this attribute value is an array of schemas and the instance
value is an array, each position in the instance array MUST conform
to the schema in the corresponding position for this array. This
called tuple typing. When tuple typing is used, additional items are
allowed, disallowed, or constrained by the "additionalItems"
(Section 5.6) attribute using the same rules as
"additionalProperties" (Section 5.4) for objects.

So, the only reason that this works at all is that additionalProperties is not explicitly set to false on the array type. Only the first element in the instance arrays are being validated.

We need to change schemaschema.json to allow objects. I'm trying to figure out how to do this, but it is so baked into the EventLogging code (WHY DID WE EVER GO WITH CUSTOM SCHEMA VALIDATION IN PHP AHHHHHHHGHGH) that I haven't been able to do this yet. Still trying.

Oof:

public function getSequenceChildRef( $i ) {
    TODO: make this conform to draft-03 by also allowing single object

https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/includes/JsonSchema.php#L454

I can make the EventLogging extension accept the proper items schema object format, but I can't seem to make it properly validate instances against it. I really don't think we should be implementing and debugging a custom JSONSchema validator implementation. Can't wait til we get rid of EventLoggging! AHHAHGH.

@Nuria, what do you think we should do? We could merge my code, which will allow us to set the proper JSONSchema for these events, but events won't be correctly validatable by the EventLogging extension's JsonSchema::validate() function. I think this is ok, since we don't use the EventLogging PHP extension code for validating events, just for validating schemas on the wiki. Having the JSONSchemas with the proper items object will allow the python eventlogging validator to properly validate array elements. This will also allow us to create the Hive schemas correctly. The downside is that some unused parts of EventLogging PHP will not work properly.

Change 498408 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/extensions/EventLogging@master] [WIP] Allow use of object schema for array items

https://gerrit.wikimedia.org/r/498408

Ottomata moved this task from Next Up to In Progress on the Analytics-Kanban board.

Sorry to hear that @Ottomata :( Let me know when you need any changes on my side.

Looking at code patch on vagrant.

Nuria added a comment.EditedMar 29 2019, 9:14 PM

So, from :

https://tools.ietf.org/html/draft-zyp-json-schema-03#section-5.5

"When this attribute value is an array of schemas and the instance
value is an array, each position in the instance array MUST conform
to the schema in the corresponding position for this array."

We will never be able to support this freedom in schemas that are persisted to hive, correct? As the implication here is that "items" can support an array in which every element (if specified that way) is of a different type, so ["a", 1,2,3, {"bad": "thing"}] is a "valid" items blurb. It is just not valid according to:

"items": [ {

    "type": "string",
    "description": "Language code",
    "required": false
} ]
Nuria added a comment.Mar 31 2019, 4:52 AM

So, wow. EventLogging's schemaschema.json is forcing that the array items is itself an array, which only validates that each element in the instance array matches the item type in the schema items array. E.g.
items: [string] will validate ["a", 1,2,3, {"bad": "thing"}] just fine!

But the reality of it is that php code is used to validate schemas , not data so this will never happen on our systems, correct?

Why don't we allow:

"items": [ {

    "type": "string",
    "description": "Language code",
    "required": false
}]

As long as this specification only has 1 element (which - please correct me if i got it wrong- is technically correct) for backwards compatibility

And we also move schemas to:

"items": {

    "type": "string",
    "description": "Language code",
    "required": false
}

which in our system is less prone to misinterpretations by schema users

I am trying to untangle code to see how we can support both, let me know if this sounds good, bad, crazy, both?

We will never be able to support this freedom in schemas that are persisted to hive, correct?

Correct. Strongly typed systems do not allow arrays with elements of varying types.

so ["a", 1,2,3, {"bad": "thing"}] is a "valid" items blurb. It is just not valid according to: ...

It is valid according to the array items. items: [{type:string}] will only validate that the first data element is a string.

But the reality of it is that php code is used to validate schemas , not data so this will never happen on our systems, correct?

The PHP code is enforcing that array items is an array. If schemas have a list of array items with a single entry (which is what the offending schemas I found have), the Python code will only ever validate that the first element data element for that array field is the proper type. Subsequent elements will not be checked.

I encountered this problem because I want to map from the EventLogging schemas to strongly typed systems (Spark, Hive, whatever). I don't want to build in another special case for EventLogging like 'IF schemas have a single array items element then use it as the type of all array elements'. We just shouldn't support array items. I guess that is what Ori (?) was trying to do here with this array items array enforcement, but got the implementation of the spec wrong. Why else would he actually enforce that items is an array of types?

Nuria added a comment.Apr 22 2019, 5:16 PM

@chelsyx after much thinking on whether we could support here here is what we have come to decide (let's us know how does this sound). We are going to deploy a change that will make for this the only valid way to describe array items:

"items": {
     "type": "string",
     "description": "Language code",
     "required": false
 }

So, schemas that specify items in the former way:

    {
        "type": "string",
        "description": "Language code",
        "required": false
    }
]

Will need to be updated so they can be persisted to hive as we are not able to support both specifications at the same time.

I don't think Chelsy is going to have much of an opinion here. The change to the code will only affect new schema versions, which we have to at least make for the couple of schemas I originally opened this task for. Let's go! :)

Change 498408 merged by jenkins-bot:
[mediawiki/extensions/EventLogging@master] Allow use of object schema for array items

https://gerrit.wikimedia.org/r/498408

@Nuria , like @Ottomata said, I don't have an opinion here. Just let me know if you need me to change the schema, or events sent from the client side :)

Ok @chelsyx, our fix is out. Can you try changing the schema again?

Ottomata updated the task description. (Show Details)May 3 2019, 8:59 PM
Ottomata added a subscriber: Niharika.

Ok @chelsyx, our fix is out. Can you try changing the schema again?

Done.

Niharika removed a subscriber: Niharika.May 6 2019, 3:37 PM

@AndyRussG this should be do-able now, can you alter your schema? If you don't mind, I'm happy to do it for you.

@Niharika If you are not the right person to ping about the TemplateWizard schema, could you point me to the right person? You are marked as the TemplateWizard maintainer. Perhaps @Samwilson?

Ottomata updated the task description. (Show Details)May 6 2019, 6:14 PM

@Ottomata Hi! I updated the CentralNotice schema. So, we'll also update the version number in the client code. Thanks!!

Change 508487 had a related patch set uploaded (by Samwilson; owner: Samwilson):
[mediawiki/extensions/TemplateWizard@master] Update EventLogging schema revision number

https://gerrit.wikimedia.org/r/508487

I've made the change in the TemplateWizard schema, and created a patch to use the new revision ID.

Ottomata updated the task description. (Show Details)May 7 2019, 1:13 PM

Change 510061 had a related patch set uploaded (by AndyRussG; owner: AndyRussG):
[mediawiki/extensions/CentralNotice@master] Update Schema:CentralNoticeBannerHistory version

https://gerrit.wikimedia.org/r/510061

Change 510061 merged by jenkins-bot:
[mediawiki/extensions/CentralNotice@master] Update Schema:CentralNoticeBannerHistory version

https://gerrit.wikimedia.org/r/510061

Nuria set the point value for this task to 5.May 14 2019, 8:36 PM
Nuria closed this task as Resolved.May 14 2019, 9:09 PM

@Nuria, @Ottomata, the update to version number for the CentralNoticeBannerHistory schema should have been deployed to production last week with the Mediawiki deploy train. However, we had to roll back CentralNotice changes due to another, unrelated issue, which we are still working to fix. So the updated schema is not yet being used on production.

If you like, we can push just the schema update out on a SWAT deploy today, or we could wait and deploy it with other stuff on next week's train. If I understand correctly, this issue is blocking other stuff for you. It's no problem for us to deploy it on a SWAT today if that's the case.

Thanks much!!!

Nuria added a comment.May 21 2019, 6:26 PM

we can wait, really. it is important but not urgent.

we can wait, really. it is important but not urgent.

Ok thanks!!!! :)

Andy thank you! For this task, the blocker really was just editing the schema on meta.wikimedia.org. Fixing the code to use the new schema version is great too, but it doesn't block the refinement of the data into Hive.

Change 508487 merged by jenkins-bot:
[mediawiki/extensions/TemplateWizard@master] Update EventLogging schema revision number

https://gerrit.wikimedia.org/r/508487