
[Tech spike] Experiment with storage implementations for toolinfo annotations
Closed, Resolved · Public · Spike

Description

We would like to be able to add data to a toolinfo record that is durable, versioned, and separate from the core toolinfo data. This will allow us to retain the "simple" functionality of a toolinfo record being solely controlled by its originator (either a crawled toolinfo.json file or a Toolhub user via API) while also enabling both the Toolhub system and community members to add additional data about a given tool.

  • We do not know the full extent of fields or data types that will be used across all annotations, so the design must anticipate schema changes.
  • User-facing search across annotation data, when needed, is expected to be done via Elasticsearch rather than SQL queries. This in turn means that annotations may use JSONSchemaField blob storage, if desired, to store a collection of data as a single annotation data point.
  • Adding/editing should generate new revisions of the (toolinfo, annotations) pair and be reflected in APIs for viewing history and diffs of toolinfo data.

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". · Jan 19 2022, 5:28 PM
Restricted Application added a project: User-bd808.
Restricted Application added a subscriber: Aklapper.

As I see it, with annotations, we will end up with four "flavors" of tool data that can be thought of as qualitatively different:

  1. Original data -- data from toolinfo.json records: maintainer-controlled, "one-source-of-truth", not publicly editable.
  2. VCS derived data -- metadata derived from the repository where the tool is hosted. Shares the same attributes with the original data, i.e. maintainer-controlled, "one-source-of-truth", not publicly editable.
  3. Maintainer controlled additional data -- only the tool maintainer can edit. E.g. a boolean flag indicating whether the project is actively looking for contributors.
  4. Publicly editable additional data -- anyone can edit. E.g., a boolean flag indicating whether the tool is broken.

Questions -

  1. For what reasons, if any, would we want to store "maintainer-controlled" data separate from "publicly editable" data?
  2. For what reasons, if any, would we want to store "VCS-derived" data separate from "original" data?
  3. Should any of the data that is currently "original" and thus "maintainer controlled" instead be publicly editable?
  4. The types of annotation data we have been discussing so far are rather vanilla (String, Boolean, List, Datetime...). Are there any concrete, more far-flung types that we could imagine needing in the future that would warrant NoSQL storage?
  5. What would be the MVP of this?

@Slst2020, @Raymond_Ndibe, and I talked about this task and @Slst2020's questions in a team call today. Recording some of that here so others can see it, and so that anyone involved in the discussion can correct my understanding.

Questions -

  1. For what reasons, if any, would we want to store "maintainer-controlled" data separate from "publicly editable" data?
  2. For what reasons, if any, would we want to store "VCS-derived" data separate from "original" data?

For both, access control is the primary reason. The access controls we are using on both the Django and Vue components of Toolhub work at the object level. It is certainly possible to imagine and implement finer grained access controls which work at the level of properties of objects, but things are easier today to implement and reason about based on instances of model objects. This pattern also maps relatively well to noun oriented API endpoints and their access controls.
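To make the object-level pattern concrete, here is a minimal sketch in plain Python rather than Django. The class names, the `can_change` helper, and the specific rules (only the originator edits the core record, any authenticated user edits annotations) are assumptions for illustration, not Toolhub's actual permission code.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    is_authenticated: bool = True


@dataclass
class Tool:
    name: str
    created_by: str  # username of the originator (hypothetical field)


@dataclass
class Annotations:
    tool: Tool


def can_change(user: User, obj) -> bool:
    """Object-level check: the answer depends on which object is edited,
    never on which individual property of that object is touched."""
    if isinstance(obj, Tool):
        # Assumed rule: only the originator may edit the core record.
        return user.is_authenticated and user.name == obj.created_by
    if isinstance(obj, Annotations):
        # Assumed rule: any logged-in user may edit annotations.
        return user.is_authenticated
    return False
```

Splitting the data into two objects is what lets each realm have one simple rule; a property-level scheme would need a rule per field inside a single object.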

I think it is still a debatable question whether we need more than 2 access control realms at this point. VCS derived data and other derived or externally collected data (Toolforge webservice request traffic for example) could be separated into a third control realm, but I think this would only be needed if we feel that there is high risk of vandalism in this data and/or a low amount of trust from the community if it is not under separate access control. There is one item in the previously proposed annotations, Official maintainer, which would likely trigger a need for another access control/trust realm. I don't personally think that the VCS derived data does.

An interesting related topic is where derived data is computed. We can certainly implement additional periodic job processes beyond our current crawler system. We could also decide that many/most/all things like this are better handled as external tools which submit data to Toolhub via its API. This could have benefits in ease of prototyping and implementation as well as attract more technical volunteers to contribute to the Toolhub ecosystem. In some areas it could also provide alternatives for activities that are challenging to do within Toolhub's production Kubernetes cluster hosting constraints (such as maintaining persistent clones of VCS repos for analysis if that is found to be desirable).

  3. Should any of the data that is currently "original" and thus "maintainer controlled" instead be publicly editable?

I would really like to make it easier for the community to collaborate with tool maintainers in completing the information for each tool. At the same time I would like it to be reasonably possible for a tool maintainer to publish a detailed toolinfo record without requiring human or API interaction with Toolhub directly beyond toolinfo.json URL registration.

I have been thinking about a possible "best of both" scenario where the annotation layer for a tool could have fields which duplicate fields from the core toolinfo. This pushes some editorial decisions up from the backend to frontend implementations, which must decide what to do if both the core toolinfo and its annotations have data in these duplicate fields.
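One possible frontend policy for duplicate fields is sketched below: prefer the core toolinfo value and fall back to the annotation only when the core value is empty. The `resolve_field` helper and this precedence rule are assumptions for illustration; a UI could just as reasonably show both values.

```python
def resolve_field(core: dict, annotations: dict, field: str):
    """Return the core toolinfo value when it is set, otherwise the annotation."""
    value = core.get(field)
    # Treat None, empty strings, and empty lists as "not provided".
    if value not in (None, "", []):
        return value
    return annotations.get(field)


core = {"title": "Demo", "user_docs_url": None}
annotations = {
    "title": "Community title",
    "user_docs_url": [{"url": "https://example.org/docs"}],
}
title = resolve_field(core, annotations, "title")
docs = resolve_field(core, annotations, "user_docs_url")
```

Here `title` comes from the core record because the maintainer supplied one, while `docs` comes from the annotations because the core field is empty.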

  4. The types of annotation data we have been discussing so far are rather vanilla (String, Boolean, List, Datetime...). Are there any concrete, more far-flung types that we could imagine needing in the future that would warrant NoSQL storage?

We are already using NoSQL storage for lists of strings and objects. I think we should be careful of overusing this convenience, but I do not see it as a bad thing. Today the keywords, url_alternates, for_wikis, sponsor, available_ui_languages, technology_used, developer_docs_url, user_docs_url, feedback_url, and privacy_policy_url fields for a Tool model are persisted as JSONSchemaField values. This makes them opaque to SQL queries (they are stored as blobs of JSON in the database), but accessible to Elasticsearch. These are also reversible decisions in that we can decide in the future to replace the stored blobs with foreign key relations to tables modeling rows in these lists and write Django migrations to translate data appropriately.
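A small illustration of what "opaque to SQL" means in practice. This sketch uses the standard library's sqlite3 module purely as a stand-in database, and the table and column layout are assumptions; it is not how Toolhub's JSONSchemaField is implemented.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tool (name TEXT PRIMARY KEY, keywords TEXT)")
conn.execute(
    "INSERT INTO tool VALUES (?, ?)",
    ("demo", json.dumps(["wiki", "bot"])),
)

# SQL sees one opaque string, so filtering on a single keyword matches nothing.
sql_hits = conn.execute(
    "SELECT name FROM tool WHERE keywords = ?", ("bot",)
).fetchall()

# Decoding the blob in application code recovers the list, which is the shape
# a search index would receive.
row = conn.execute("SELECT name, keywords FROM tool").fetchone()
document = {"name": row[0], "keywords": json.loads(row[1])}
```

Moving `keywords` to a related table later would make the SQL filter work, at the cost of a data migration that unpacks each blob into rows.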

  5. What would be the MVP of this?

I am hoping to answer these questions with my spike:

  • Is it difficult or inelegant to ensure that related models for separate access control realms are created when a Tool model is created?
  • Is it difficult or inelegant to return content from multiple models as a connected entity in an API response?
  • Is it difficult or inelegant to index content from multiple models as a connected entity in Elasticsearch?
  • Is it difficult or inelegant to record create, edit, and delete events for content from multiple models in our audit log system?
  • Is it difficult or inelegant to track revisions to content from multiple models in our data versioning system?
  • Is it difficult or inelegant to produce diffs between revisions to content from multiple models with our existing tooling?

My current approach to investigating these questions is to implement an Annotations model that adds a single "wikidata_qid" property. This will be connected to our existing Tool model with a OneToOne foreign key relationship, added to existing /api/tools/... endpoint responses as appropriate, indexed in Elasticsearch as a subobject to our existing Tool object, and editable via its own /api/tools/{tool_name}/annotations/ family of endpoints. This should allow me to explore all of the questions in a way that others can examine and verify as well.
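The shape of that experiment can be sketched without Django. The dataclasses below stand in for the two models, and `serialize_tool` stands in for the serializer; all names other than `wikidata_qid` are assumptions for illustration, and the real implementation is in the patches linked later in this task.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Annotations:
    wikidata_qid: Optional[str] = None


@dataclass
class Tool:
    name: str
    title: str
    # One Annotations record per Tool, created together with the Tool so
    # that the one-to-one relation always has a target.
    annotations: Annotations = field(default_factory=Annotations)


def serialize_tool(tool: Tool) -> dict:
    """Return the tool with its annotations nested as a sub-object."""
    return asdict(tool)


tool = Tool(name="demo", title="Demo")
before = serialize_tool(tool)
tool.annotations.wikidata_qid = "Q42"
after = serialize_tool(tool)
```

The same nested dictionary could feed both the API response and the search index document, which is what keeps the two views of a tool consistent.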

I have gotten far enough in building this out that I have new questions about how manipulating annotations should look from the audit log and edit history points of view.

The most straightforward thing to implement is to track it all completely separately, but that would also mean that we need a number of new API endpoints to patrol, diff, undo, revert, hide, and reveal Annotations separately from the existing Tool endpoints. Thinking from the point of view of the end users of both our API and UI, I think it makes more sense for the edit history of a Tool (GET /api/tools/{tool_name}/revisions/) to include edits to both the Tool model and the Annotations model. How the actual implementation of that would work needs a bit more thinking.

@Slst2020, @Raymond_Ndibe, and I talked this through today in a conference call. We came to the basic conclusion that Toolhub end users (people using our UI) should see edits to the "core" Tool model and edits to an associated Annotations model as being the same type of action. The differentiation between the two activities is an artifact of how we are enforcing access controls on the backend; it is not a material difference otherwise. What does this mean in practice?

  • Fetching a toolinfo record via GET /api/tools/{name}/ should return the combined tool + annotations data.
  • Results from GET /api/search/tools/ should return the combined tool + annotations data.
  • The /api/tools/{tool_name}/revisions/* family of endpoints (including diffs) should operate on the combined tool + annotations data.
    • This by extension should make the revisions, diffs, and patrolling actions work the same for either type of edit.
  • Audit log entries should look similar in the UI for an edit of either model, and both should lead to the tool details.
  • Editing in the UI should use the same "edit" button for both types of edits, with the distinction of how to persist the changes to the two models handled by our UI business logic.
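A rough sketch of what "operate on the combined tool + annotations data" could look like for diffs. The `combined` and `diff` helpers and the path-style output are assumptions for illustration; they are not Toolhub's existing diff tooling.

```python
def combined(tool: dict, annotations: dict) -> dict:
    """Build the single document that a revision would capture."""
    return {**tool, "annotations": dict(annotations)}


def diff(old: dict, new: dict, prefix: str = "") -> list:
    """List (path, old_value, new_value) for every changed leaf value."""
    changes = []
    for key in sorted(set(old) | set(new)):
        path = f"{prefix}/{key}"
        a, b = old.get(key), new.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            # Recurse so annotation edits appear alongside core edits.
            changes.extend(diff(a, b, path))
        elif a != b:
            changes.append((path, a, b))
    return changes


rev1 = combined({"name": "demo", "title": "Demo"}, {"wikidata_qid": None})
rev2 = combined({"name": "demo", "title": "Demo tool"}, {"wikidata_qid": "Q42"})
changes = diff(rev1, rev2)
```

Because both edits land in one list, a patroller reviewing `changes` does not need to know which backend model each change was persisted to.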

Change 758075 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[wikimedia/toolhub@main] search: Extract field mapping helpers for reuse

https://gerrit.wikimedia.org/r/758075

Change 758076 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[wikimedia/toolhub@main] search: generate Document fields from serializers

https://gerrit.wikimedia.org/r/758076

Change 758075 merged by jenkins-bot:

[wikimedia/toolhub@main] search: Extract field mapping helpers for reuse

https://gerrit.wikimedia.org/r/758075

Change 758076 merged by jenkins-bot:

[wikimedia/toolhub@main] search: generate Document fields from serializers

https://gerrit.wikimedia.org/r/758076

Change 759573 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[wikimedia/toolhub@main] annotations: Create an Annotations model for each Tool

https://gerrit.wikimedia.org/r/759573

Change 759574 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[wikimedia/toolhub@main] annotations: Add to ToolSerializer and search index

https://gerrit.wikimedia.org/r/759574

Change 759575 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[wikimedia/toolhub@main] annotations: Add permissions

https://gerrit.wikimedia.org/r/759575

Change 759576 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[wikimedia/toolhub@main] annotations: Editing API integration

https://gerrit.wikimedia.org/r/759576

Change 759573 merged by jenkins-bot:

[wikimedia/toolhub@main] annotations: Create an Annotations model for each Tool

https://gerrit.wikimedia.org/r/759573

Change 759574 merged by jenkins-bot:

[wikimedia/toolhub@main] annotations: Add to ToolSerializer and search index

https://gerrit.wikimedia.org/r/759574

Change 759575 merged by jenkins-bot:

[wikimedia/toolhub@main] annotations: Add permissions

https://gerrit.wikimedia.org/r/759575

Change 759576 merged by jenkins-bot:

[wikimedia/toolhub@main] annotations: Editing API integration

https://gerrit.wikimedia.org/r/759576

Change 770638 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Bump container version to 2022-03-15-002555-production

https://gerrit.wikimedia.org/r/770638

Change 770638 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Bump container version to 2022-03-15-002555-production

https://gerrit.wikimedia.org/r/770638