New Service Request SchemaTree
Open, Needs Triage, Public

Description

Description: The property suggester is used to provide property recommendations while editing Wikidata items. This request is about the deployment of a new backend (SchemaTree) for this service. The initial goal is to perform A/B testing to determine whether the new system provides better recommendations without sacrificing performance. If this is the case, we will switch completely to the new backend. The service is already running on Wikimedia Cloud VPS (project:schematreerecommender) and is used on https://wikidata.beta.wmflabs.org

See also https://phabricator.wikimedia.org/project/profile/2796/ and https://phabricator.wikimedia.org/T285098

Timeline: Ideally this service is in production before April 2022, as the volunteers have limited availability starting in April.

Point person:

Technologies: The backend service (the index) is written in the Go programming language. This service is called from the PropertySuggester MediaWiki extension. The reason for using Go for this backend service is that it greatly limits resource usage and allows for very fast processing of requests.

Technical details:
This service can be run on a virtual machine.
It should be noted that this service is completely stateless. At the start it loads an index from disk, which is then used to serve any future requests. This means:

  • That the service can be replicated easily.
  • That in case of a failure of the service, it can be restarted without consequences to future requests.

The service does not access any other services.

  • For the A/B testing, all event logging is done before this service is called.

We periodically need to recreate that index (similar to the current recommender, this will be done outside of the production environment).

  • This index is stored as a file and must be readable/retrievable by the service. The file is currently 85MB (compressed) and grows sublinearly with the number of Wikidata entities.

Given the statelessness of this service, a cache could be used.
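For illustration, here is a minimal Go sketch of the startup-load pattern described above. The type, the loadIndex/Recommend functions, the endpoint, and the file path are all invented for this sketch and are not the actual RecommenderServer API:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Index stands in for the in-memory SchemaTree: it is built once at
// startup and only read afterwards, which is what keeps the service
// stateless and trivially replicable/restartable.
type Index struct{}

// loadIndex deserializes the index file from disk (details omitted).
func loadIndex(path string) (*Index, error) {
	return &Index{}, nil
}

// Recommend returns suggested properties for the given input properties.
func (ix *Index) Recommend(props []string) []string {
	return nil
}

func main() {
	ix, err := loadIndex("/srv/index.bin")
	if err != nil {
		log.Fatalf("loading index: %v", err)
	}
	http.HandleFunc("/recommend", func(w http.ResponseWriter, r *http.Request) {
		// Every request is served purely from the in-memory index, so any
		// replica can answer it and a restart has no lasting consequences.
		props := r.URL.Query()["property"]
		json.NewEncoder(w).Encode(ix.Recommend(props))
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```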

Diagram:

diagrams-Copy of Page-3.drawio.png (680×744 px, 83 KB)

Event Timeline

Hi, if this service is to be used in the WMF production environment (and given the call graph, it will), it needs to run on kubernetes, and thus it will need to be built using our deployment pipeline first, and use the deployment-charts repository to define the deployment.

You also say the service will need to read an index file. How is that generated? Where do you plan to store / retrieve it? For an A/B test baking the file into the application images might be ok, but long-term we might need to move the data to a datastore instead.

Can you point me to the repository for this service so that I can give more specific advice?

@Joe we have created files for blubber before, I assume what is needed is very similar to that? I am not sure I know what the deployment-charts repository is. The code is under development at https://github.com/martaannaj/RecommenderServer/; the blubber configuration files are there as well.

@Addshore can you elaborate on how we could create the index file? Since this is not something we have to recreate often it might even be worth recreating the application image each time we recreate the index.

Ok so a few requirements:

  1. we need the repository to be on gerrit, and to include a .pipeline directory to be built using blubber/the deployment pipeline.
  2. you should probably base your image on Debian Bullseye and not Debian Stretch, but that can be done after the repository is on gerrit.

Once we have a working gerrit repository and a working deployment pipeline for it, we can consider deployment options.

@Joe for the base image, would you recommend our current approach of starting from an 'empty' image and downloading the latest go distribution ourselves, or should we take this one: https://docker-registry.wikimedia.org/golang1.17/tags/
I am not sure how stable these images are.

I didn't want to get into that, but yes, you should use our images, without indicating a tag (the pipeline will take care of using the most up-to-date version). As an example of a go program using our pipeline, see kask:

https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/kask/+/refs/heads/master/.pipeline/blubber.yaml

I can submit a patch to the github repo to amend the blubber.yaml file for you.

We'd be happy to receive a patch or pull request.

You also say the service will need to read an index file. How is that generated?

This is currently generated using this Go code: https://github.com/martaannaj/SchemaTreeBuilder

Where do you plan to store / retrieve it? For an A/B test baking the file into the application images might be ok, but long-term we might need to move the data to a datastore instead.

I would also propose including it in the image for now and thinking about loading it from some other data store in the future.
This index will likely only get updated once a month or so, as with the current property suggester data (but that lands in an SQL table).

In terms of loading it into the image, I'm not quite sure how best to do that.
I guess we don't want to load it into the git repo alongside the service itself.
Does anyone have a good idea?

@Addshore can you elaborate on how we could create the index file? Since this is not something we have to recreate often it might even be worth recreating the application image each time we recreate the index.

+1. For the initial version and A/B test I think we should aim to just include it in the image, as long as @Joe is okay with this.

@QChris I noticed the addition of the .gitreview file on gerrit. Is this file needed? If so, we would merge it into our github repository, so we can keep the active development there and synchronize with gerrit.

@Joe: would it be fine if we include the index file in the image for the A/B testing phase? If that is successful, we can see whether it makes sense to move it to a different location.

We are fixing the blubber configuration in the coming days. Could you tell us what the next steps are for deployment?

@Joe We have now made the changes to use the Bullseye distribution and the provided image with Go installed. See: https://github.com/martaannaj/RecommenderServer/pull/22/files
@QChris We just went ahead and added that file to our repository.

This is now running on beta and seems to work fine.

What can we do next? @Joe mentioned a 'deployment-charts repository', but we do not know what that is.

So a summary comment as promised!

@QChris I noticed the addition of the .gitreview file on gerrit. Is this file needed? If so, we would merge it into our github repository, so we can keep the active development there and synchronize with gerrit.

Sounds like this would work to me, and would be fine from the WMDE side of things.
@akosiaris would syncing the code from Github to Gerrit via a Gerrit change, whenever it needs to be updated, be acceptable on your side?

Yes, that would be ok; some other services use that approach too.

To summarize the current state:

Release Engineering would be the team for that. You can always send a gerrit change for integration/config (it should be pretty similar to 4bd7dc4ad5fb1) and they can review/merge/deploy.

Done

  • The 'deployment-charts repository' was also mentioned. Again, is there anything we can do on our side, or is this more something for service ops to handle?

This is almost entirely self-serve. Docs are at: https://wikitech.wikimedia.org/wiki/Deployment_pipeline/Migration/Tutorial#Creating_a_Helm_Chart

Once you have the chart from the above step, upload it to gerrit for review and we 'll work together there.

The other big topic is still how to include the index for the service.
Context: this index file is currently ~1.5GB and would need to be regenerated every now and again.
I see a couple of options, and I'm sure @akosiaris and team will have opinions.

  • Build one image with just the code, and have another repo with another blubber config etc. that then adds the index file into a new image.
  • Where can this index file be stored and retrieved? Is this a bit big to just shove in a gerrit repo?

That's a large file. It's a bad idea to stick it in a docker registry: it will make deployments slower, more resource-consuming, and brittle. We are already hitting various limits (at multiple places) with large files, and I can tell you that you don't want to be debugging them. Plus, conceptually, the place for a dataset that is used by a codebase isn't right next to the codebase.

To attack the problem from another angle: what is in that dataset? What format does it have, and why? Is it something we can alter? Would it make sense if it was stored in a datastore (e.g. SQL, Cassandra, etc.) and queried instead?

It's also a bad idea to stick it in a git repo (any git repo). In fact, gerrit won't even allow you to push this. It has (and for good reasons) a maxObjectSizeLimit of 100MB.

@Michaelcochez looks like we have a few next steps!

And then there are the questions around the index file:

  • A high-level description of what is in the file (I think this already exists elsewhere, but mentioning it here would be good)
  • What is the format?
  • Is it alterable, or do new versions just come as a totally regenerated file? (I believe the latter)

I believe the file is loaded into memory at runtime. It could be grabbed via HTTP from somewhere (service ops would have to help us know where this could be) at each container start?
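A hypothetical sketch of that option in Go; fetchIndex, the URL, and the destination path are placeholders, since where the file would actually be hosted is exactly the open question here:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// fetchIndex downloads the index file over HTTP at container start, before
// the server begins serving requests.
func fetchIndex(indexURL, dest string) error {
	resp, err := http.Get(indexURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("fetching index: unexpected status %s", resp.Status)
	}
	out, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer out.Close()
	// Stream to disk so the whole file is never held in memory twice.
	_, err = io.Copy(out, resp.Body)
	return err
}
```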

I merged the pull request on github now.

I do not have rights to push to the gerrit repository; it might just be my limited knowledge of how gerrit works.

I will look into the helm chart/CI setup soon.

questions around the index file:

This file is the serialization of the in-memory tree structure used for recommendation. The file is a compressed (gzipped) binary file. For serialization we use https://pkg.go.dev/encoding/gob . Given this, and the fact that changes in the tree structure can have a 'rippling effect', it is not possible (or at least extremely hard) to alter the file. This tree is a specifically crafted type of index; serving its data from an external database would be impossible or detrimental for performance, as it would require a lot of roundtrips.

The index file is loaded into memory once when the process starts. It could be loaded from 'anywhere' and does not even have to reside on disk necessarily.
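In code terms, the approach described above looks roughly like the following sketch; TreeNode is an invented stand-in for the actual SchemaTree types:

```go
package main

import (
	"compress/gzip"
	"encoding/gob"
	"os"
)

// TreeNode is an illustrative recursive structure; encoding/gob handles
// recursive pointer types like this natively.
type TreeNode struct {
	Property string
	Count    uint64
	Children []*TreeNode
}

// saveIndex writes the tree as a gzipped gob stream (errors from the
// deferred Close calls are ignored here for brevity).
func saveIndex(path string, root *TreeNode) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	zw := gzip.NewWriter(f)
	defer zw.Close()
	return gob.NewEncoder(zw).Encode(root)
}

// loadIndex reads the gzipped gob stream back into memory at startup.
func loadIndex(path string) (*TreeNode, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	zr, err := gzip.NewReader(f)
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	var root TreeNode
	if err := gob.NewDecoder(zr).Decode(&root); err != nil {
		return nil, err
	}
	return &root, nil
}
```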

I do not have rights to push to the gerrit repository; it might just be my limited knowledge of how gerrit works.

I believe you will still need to submit a patch / patches / merge commit to gerrit, and review it by +2ing it in the UI.
I imagine we will want to keep direct pushes to master / main turned off.

https://www.mediawiki.org/wiki/Gerrit/Tutorial

Perhaps this should be a topic to chat about in the next sync (or before if it is blocking things)?

I merged the pull request on github now.

I do not have rights to push to the gerrit repository; it might just be my limited knowledge of how gerrit works.

I've added you to the gerrit wikidata-propertysuggester-RecommenderServer group, you should have access now.

I will look into the helm chart/CI setup soon.

questions around the index file:

This file is the serialization of the in-memory tree structure used for recommendation. The file is a compressed (gzipped) binary file. For serialization we use https://pkg.go.dev/encoding/gob . Given this, and the fact that changes in the tree structure can have a 'rippling effect', it is not possible (or at least extremely hard) to alter the file. This tree is a specifically crafted type of index; serving its data from an external database would be impossible or detrimental for performance, as it would require a lot of roundtrips.

Despite the allure, shipping around serialized memory objects has many drawbacks as an approach. Most obvious are the security ones, and most languages indeed put wording in their respective frameworks to point that out. https://github.com/golang/go/issues/20221 has some hints. Python's pickle more or less points out the same. Really big hacks that have exfiltrated tons of private data have happened because of "serialization" vulnerabilities (e.g. the Equifax hack relied on an Apache Struts serialization vulnerability - https://nakedsecurity.sophos.com/2017/09/06/apache-struts-serialisation-vulnerability-what-you-need-to-know/)

There are more drawbacks, of course. For example, how do you do versioning of the dataset? It needs to always match the definitions of the Golang struct that it contains. Even simple changes in field names can cause unintended behavior: e.g. changing a field name means that data for it will silently be dropped when deserializing an older dataset and loading it into memory.
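To make the field-rename pitfall concrete, here is a small self-contained demo (the struct and field names are invented); the value stored under the old name is silently lost:

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// NodeV1 is the struct definition that wrote the dataset.
type NodeV1 struct {
	Label string
	Count int
}

// NodeV2 renamed Count to Frequency. Gob matches fields by name, so the
// old Count values no longer match anything and are dropped on decode.
type NodeV2 struct {
	Label     string
	Frequency int
}

func main() {
	var buf bytes.Buffer
	gob.NewEncoder(&buf).Encode(NodeV1{Label: "P31", Count: 42}) // error elided
	var n NodeV2
	gob.NewDecoder(&buf).Decode(&n) // error elided
	fmt.Printf("%+v\n", n)          // {Label:P31 Frequency:0} - the count is gone
}
```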

Thus the dataset needs to be strongly coupled with the application (that is, they need to be deployed in tandem), which is a bad pattern due to the size constraints I've explained above, not to mention the fact that gerrit currently won't allow you to even upload the file.

The index file is loaded into memory once when the process starts. It could be loaded from 'anywhere' and does not even have to reside on disk necessarily.

That's the thing. It can't be loaded from 'anywhere', because of the security issues and because of the strong coupling it has with the application itself.

A final question, regarding the external database roundtrips note. Almost all datastores (either RDBMS ones or NoSQL ones) have the ability to batch results, obviating the need for multiple roundtrips. As a result, e.g. many ORMs (Hibernate, Django, SQLAlchemy, Gorm) also support this (naming the functionality with various terms, but it's there). In fact, we've seen this before, and in most cases rewriting the queries to fetch hundreds or thousands of entities in one go, instead of issuing hundreds or thousands of queries, was trivial.
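As a sketch of the batching idea in Go (the table and column names are invented; the point is one query for a whole set of nodes instead of one query per node):

```go
package main

import (
	"database/sql"
	"strings"
)

// fetchChildren retrieves the children of many nodes in a single round
// trip by batching all parent IDs into one IN clause.
func fetchChildren(db *sql.DB, parentIDs []int64) (*sql.Rows, error) {
	if len(parentIDs) == 0 {
		return nil, nil // nothing to fetch
	}
	placeholders := strings.TrimSuffix(strings.Repeat("?,", len(parentIDs)), ",")
	args := make([]any, len(parentIDs))
	for i, id := range parentIDs {
		args[i] = id
	}
	return db.Query(
		"SELECT id, parent_id, property, cnt FROM nodes WHERE parent_id IN ("+placeholders+")",
		args...)
}
```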

There are also other well-known, proven, and safer ways of shipping around serialized structured data, e.g. protocol buffers (aka protobufs [1]).

Have any of the above been evaluated?

[1] https://developers.google.com/protocol-buffers

An update on the current status, mainly regarding the index file:

First, I made a mistake in my response above. The size of the file is a lot smaller than what I wrote: the binary version is currently around 75MB (and not 1.5GB).

Progress:

  • We implemented serialization using protocol buffers. Initial experiments seem promising. Store and load times appear to be only slightly slower compared to the native format. The on-disk size grew from 75MB to 250MB. However, using compression (bzip2) the protocol buffer version can be squeezed into 51MB, so that seems to go well (see the sketch after this list).
  • While rewriting the serialization code, I noticed that it was hard to maintain it in a separate project. Hence, I integrated a minimal version of the index creation into the codebase.
  • The previous index creation used the RDF dump as its data source. The new version uses the JSON dump. That has several benefits, mainly with regard to the needed preprocessing steps (or rather, avoiding them).
  • The Go version has been bumped from 1.17 to 1.18.
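For reference, loading such a file at startup might look like the sketch below. It assumes a protoc-generated message type; the pb import path and the SchemaTree message name are placeholders for whatever the branch actually generates:

```go
package main

import (
	"compress/bzip2"
	"io"
	"os"

	"google.golang.org/protobuf/proto"

	// Placeholder for the protoc-generated package of the index schema.
	pb "example.org/recommenderserver/proto"
)

// loadIndex decompresses and unmarshals the protobuf index file once at
// startup. The standard library supports bzip2 decompression (read-only).
func loadIndex(path string) (*pb.SchemaTree, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	raw, err := io.ReadAll(bzip2.NewReader(f))
	if err != nil {
		return nil, err
	}
	tree := &pb.SchemaTree{}
	if err := proto.Unmarshal(raw, tree); err != nil {
		return nil, err
	}
	return tree, nil
}
```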

These changes are not merged into main yet. Development is ongoing in https://github.com/martaannaj/RecommenderServer/tree/protobuffer_serialization
The following still needs to be done before merging.

  • test coverage for the new serialization format
  • checking whether gokart can be used with the latest Go somehow; it does not support the new generics capabilities of Go. Most of its functionality is covered by the other checking tools we use.

Regarding the option of using a batch of queries to an external database: the issue is that what we are creating is a specialized index, built specifically for what we need. What we perform is a tree traversal where at each node a new decision is made. To do something like this with a database, one would have to fire a series of queries where each query depends on all the results of the previous ones.
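Purely to illustrate that access pattern (this is not the actual SchemaTree algorithm; the node type reuses the TreeNode shape from the earlier sketch and the scoring rule is invented): each step has to inspect the current node before it knows which child to visit next, so a query-per-node datastore would pay one round trip per level:

```go
// walk descends the tree one node at a time. The choice of the next child
// depends on data held in the current node, which is why a naive
// database-backed version would need a dependent query at every level.
func walk(root *TreeNode, hasProperty map[string]bool) *TreeNode {
	n := root
	for len(n.Children) > 0 {
		best := n.Children[0]
		for _, c := range n.Children {
			// Invented rule: prefer children for properties the item
			// already has, breaking ties by support count.
			if hasProperty[c.Property] && c.Count > best.Count {
				best = c
			}
		}
		n = best // only now do we know which node to "fetch" next
	}
	return n
}
```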

Thanks for this update. Couple of comments:

  • Nice to see that the protocol buffer version can be squeezed down to 51MB. That size probably isn't going to cut it for a large-scale deployment of this, but it is going to be fine for an A/B test, which is what this is for. If it ends up staying around and needing to be scaled up, we can revisit this.
  • A 51MB object can be pushed to gerrit, so that blocker is lifted. Do take care not to commit that file too often, as that would make the repo huge and would cause stalls to image builds.

So, when this is merged and deemed ready, I think we can proceed with this.

Regarding the option of using a batch of queries to an external database: the issue is that what we are creating is a specialized index, built specifically for what we need. What we perform is a tree traversal where at each node a new decision is made. To do something like this with a database, one would have to fire a series of queries where each query depends on all the results of the previous ones.

Yes, the point is that, depending on a couple of things, like what type of tree you are using (binary search tree, AVL tree, B-tree, red-black tree, etc.) and the number of nodes in the tree, and given that these structures tend to have a complexity of O(log n), it quite possibly could be fine to bulk-fetch the entirety of a branch, from the starting point of the branch down to the leaf nodes, in one query. Depending on the value of n, that might mean that after issuing something like 2-5 requests, the tree would have been trimmed down enough to fetch the entirety of it in a single query and just finish up the remaining work locally. This can be especially useful for recommendations, where one probably wants multiple leaf nodes and not just one.
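As a sketch of what such a bulk fetch could look like, assuming a hypothetical adjacency-list table and a datastore with recursive CTE support (MariaDB 10.2+ / MySQL 8+):

```go
package main

import "database/sql"

// fetchSubtree pulls an entire branch in a single round trip using a
// recursive common table expression; table and column names are invented.
func fetchSubtree(db *sql.DB, rootID int64) (*sql.Rows, error) {
	return db.Query(`
		WITH RECURSIVE subtree AS (
			SELECT id, parent_id, property, cnt FROM nodes WHERE id = ?
			UNION ALL
			SELECT n.id, n.parent_id, n.property, n.cnt
			FROM nodes n JOIN subtree s ON n.parent_id = s.id
		)
		SELECT id, parent_id, property, cnt FROM subtree`, rootID)
}
```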

Any updates on this one? Per the previous comment we were waiting on a merge; has this been done?

Hi @akosiaris, apologies for the slow progress.
We have been updating several things, but I didn't follow up with the testing code needed for the new parts. I have blocked time in the coming days to proceed with this and then we can merge it in.

The testing code is now implemented, and we found two small issues with it. These have now been resolved, and the code has been simplified further.

Given this ticket: https://phabricator.wikimedia.org/T332953 I am uncertain whether it makes sense to merge things in now, or whether to wait for that to happen. @akosiaris what is your suggestion?

@Michaelcochez It probably makes sense to merge, but also comment on the ticket you linked to and get in touch with thcipriani there. Do your repos show up on that ticket? Are they currently on github?

I agree with @Dzahn. Don't couple the 2 things; let T332953 happen on its own timeline.