Page MenuHomePhabricator

[M] Design database model for section topics pipeline
Closed, ResolvedPublic

Description

User Story
As a Structured Data Engineer I need to design the data model for the section topics data pipeline

Why?

  • So that we can document our plans to make a technical decision on the most appropriate storage solution with Data Persistence

Done is:

  • Data model documented - see following section
  • Dataset size documented - ~716 M rows / 14 GB
  • Data access pattern documented - we opted for access through Hive only
  • Dataset refresh pattern documented - weekly frequency that follows Wikidata snapshots

See also this board.

Proposed data model

  • snapshot: string
  • wiki_db: string
  • page_namespace: int
  • revision_id: long
  • page_qid: string - the page Wikidata ID
  • page_id: long
  • page_title: string
  • section_index: int
  • section_title: string
  • topic_qid: string - the topic Wikidata ID
  • topic_title: string
  • topic_score: double - the topic relevance score
NOTE: this schema will result in denormalized data
NOTE: we don't have section identifiers, T313901: [SPIKE] Mint section identifiers will eventually include them
NOTE: no explicit partition columns are required, since downstream jobs will read all the data all the time in the long term

Example row

Hyperlinked for illustration purposes.

snapshotwiki_dbpage_namespacerevision_idpage_qidpage_idpage_titlesection_indexsection_titletopic_qidtopic_titletopic_score
2022-07-11enwiki01066420146Q446428710391760Work_(painting)1background and influencesQ543626Lazzaroni_(Naples)6.66

All available columns

  • describe wmf.mediawiki_wikitext_current:
+--------------------------+-----------------------+---------------------------------------------------------------------------------+
|         col_name         |       data_type       |                                     comment                                     |
+--------------------------+-----------------------+---------------------------------------------------------------------------------+
| page_id                  | bigint                | id of the page                                                                  |
| page_namespace           | int                   | namespace of the page                                                           |
| page_title               | string                | title of the page                                                               |
| page_redirect_title      | string                | title of the redirected-to page                                                 |
| page_restrictions        | array<string>         | restrictions of the page                                                        |
| user_id                  | bigint                | id of the user that made the revision (or null/0 if anonymous)                  |
| user_text                | string                | text of the user that made the revision (either username or IP)                 |
| revision_id              | bigint                | id of the revision                                                              |
| revision_parent_id       | bigint                | id of the parent revision                                                       |
| revision_timestamp       | string                | timestamp of the revision (ISO8601 format)                                      |
| revision_minor_edit      | boolean               | whether this revision is a minor edit or not                                    |
| revision_comment         | string                | Comment made with revision                                                      |
| revision_text_bytes      | bigint                | bytes number of the revision text                                               |
| revision_text_sha1       | string                | sha1 hash of the revision text                                                  |
| revision_text            | string                | text of the revision                                                            |
| revision_content_model   | string                | content model of the revision                                                   |
| revision_content_format  | string                | content format of the revision                                                  |
| snapshot                 | string                | Versioning information to keep multiple datasets (YYYY-MM for regular imports)  |
| wiki_db                  | string                | The wiki_db project                                                             |
| # Partition Information
| snapshot                 | string                | Versioning information to keep multiple datasets (YYYY-MM for regular imports)  |
| wiki_db                  | string                | The wiki_db project                                                             |
+--------------------------+-----------------------+---------------------------------------------------------------------------------+
  • describe wmf.wikidata_item_page_link:
+---------------------------------+-----------------------+-------------------------------------------------------------------------------------------+
|            col_name             |       data_type       |                                          comment                                          |
+---------------------------------+-----------------------+-------------------------------------------------------------------------------------------+
| item_id                         | string                | The wikidata item_id (Q32753077 for instance)                                             |
| wiki_db                         | string                | The db project of the page the wikidata item links to                                     |
| page_id                         | bigint                | The id of the page the wikidata item links to                                             |
| page_title                      | string                | The title of the page the wikidata item links to                                          |
| page_namespace                  | bigint                | The namespace of the page the wikidata item links to                                      |
| page_title_localized_namespace  | string                | The title with localized namespace header of the page the wikidata item links to          |
| snapshot                        | string                | Versioning information to keep multiple datasets (YYYY-MM-DD for regular weekly imports)  |
| # Partition Information
| snapshot                        | string                | Versioning information to keep multiple datasets (YYYY-MM-DD for regular weekly imports)  |
+---------------------------------+-----------------------+-------------------------------------------------------------------------------------------+
  • additional section attributes, given a section title in a given wiki:
    • count: bigint - the total occurrences
    • pos_start_mean: double - the mean start position
    • pos_end_mean: double - the mean end position

Event Timeline

Following today's architecture discussion, we will not be pursuing making the data available in MariaDB to be consumed publicly for now. But if we want to keep that option open, let's make sure that we identify a primary key for all tables in Hive (that is, a column or set of columns that make the row unique).

The primary key will be needed to be able to move the data from Hive to MariaDB using a diff approach later on. See T313410 for more context.

Looping in all the architecture design meeting invitees.

Added a data model proposal in the task description.
All available columns can be brought in depending on specific requirements. If you need any of them, please highlight and explain why.

Primary keys for the tables in the miro board

Note that I think that if we are de-normalizing then we should be using the local page id instead of the wikidata id to identify pages, because that simplifies things within mediawiki. We could create another lookup table if we want a client to be able query using the page's wikidata id

"main": wiki,page_id,section_title
page_title_lookup: wiki,page_id
topic_title_lookup: wiki, topic wikidata id

  • item_id should be string or null (the page is not necessarily connected to Wikidata)
  • page_title is title without namespace name, like in the other schemas? We might want to also have page_namespace then, even if it's going to be always 0 for now.
  • topic_title is the English Wikidata label? Or the label in the wiki's content language?
  • section_title is the wikitext of the section title? (Is it intentionally lowercased in the example?)

The main thing to figure out is how to reference the sections. The title is not unique, you can have an article structure like

== No free lunch theorem ==
=== Proof ===
== Ugly duckling theorem ==
=== Proof ===

so you need either numeric section IDs (1, 2, 3, 4) or hierarchic section IDs (where the two "Proof" sections are either 1.1 and 2.1, or ["No free lunch theorem", "Proof"] etc).

If we use numeric IDs (which is what MediaWiki APIs tend to use), do we match how MediaWiki uses them during transclusion?

John Doe
== Life ==
== Career ==
{{/bibliography}}
== Notes ==

.

John Doe/bibliography
== Books ==
== Scripts

This will result in an article with five sections, which MediaWiki will number 1, 2, T-1, T-2, 3.

Also, do we consider the part of the article until the first section title (the lead section as Wikipedia calls it) a section? In some MediaWiki contexts, such as edit APIs, it is referenced as section 0.

Also, if we use a non-numeric identifier, how do we make sure frontend clients can identify the section? E.g. == Writing ''War and Peace'' == will be rendered as <h2>Writing <i>War and Peace</i></h2> - if all a frontend client has to go by is the section title wikitext, identifying the section without making extra API calls won't be easy.


It's probably a good idea to include the page revision the topics were generated from.

mfossati updated the task description. (Show Details)

Thanks a lot @Tgr for your extensive feedback!

  • item_id should be string or null (the page is not necessarily connected to Wikidata)

Agreed & updated.

  • page_title is title without namespace name, like in the other schemas? We might want to also have page_namespace then, even if it's going to be always 0 for now.

Correct, page_namespace added.

  • topic_title is the English Wikidata label? Or the label in the wiki's content language?

It should be the latter, corresponds to wmf.wikidata_item_page_link.page_title, i.e., The title of the page the wikidata item links to

  • section_title is the wikitext of the section title? (Is it intentionally lowercased in the example?)

Correct. Section titles undergo normalization for better matching within a given wiki (and possibly across wikis? just thinking loud here). Matching is mainly used to count occurrences of a given (ambiguous) section title in a given wiki.

The main thing to figure out is how to reference the sections. The title is not unique
so you need either numeric section IDs (1, 2, 3, 4) or hierarchic section IDs (where the two "Proof" sections are either 1.1 and 2.1, or ["No free lunch theorem", "Proof"] etc).
If we use numeric IDs (which is what MediaWiki APIs tend to use), do we match how MediaWiki uses them during transclusion?
This will result in an article with five sections, which MediaWiki will number 1, 2, T-1, T-2, 3.

Thanks a lot for raising this. I opened a spike ticket: T313901: [SPIKE] Mint section identifiers

Also, do we consider the part of the article until the first section title (the lead section as Wikipedia calls it) a section? In some MediaWiki contexts, such as edit APIs, it is referenced as section 0.

Good question! We're using mwparserfromhell. Given a wikitext, current Research code is: mwparserfromhell.parse(wikitext).get_sections(levels=[2], include_headings=True). This should extract h2-level sections and won't include the lead section. We can modify the behavior, but the lead section potentially contains infobox templates and the like, so not sure we want to parse that as well, what do you think?

Also, if we use a non-numeric identifier, how do we make sure frontend clients can identify the section? E.g. == Writing ''War and Peace'' == will be rendered as <h2>Writing <i>War and Peace</i></h2> - if all a frontend client has to go by is the section title wikitext, identifying the section without making extra API calls won't be easy.

I agree, this further motivates the need for section identifiers.

It's probably a good idea to include the page revision the topics were generated from.

Doesn't hurt, added.

CBogen renamed this task from Design database model for section topics pipeline to [M] Design database model for section topics pipeline.Jul 27 2022, 4:23 PM
mfossati changed the task status from Open to In Progress.Jul 28 2022, 11:09 AM

@dcausse , any other relevant columns in mind? See the description

topic_title is the English Wikidata label? Or the label in the wiki's content language?

It should be the latter, corresponds to wmf.wikidata_item_page_link.page_title, i.e., The title of the page the wikidata item links to

As discussed a little bit in Slack, perhaps 'topic' is not really a great name to use for this. You are specifically trying to assign a wikidata item a wikipedia page section, correct? Perhaps section_wikidata_... (e.g. section_wikidata_title, section_wikidata_qid) or something like it is a better name? Something that specifically links it to a wikidata concept, as IIUC 'topic' is not really a wikidata concept. If you want to link wikipedia page sections to wikidata items, it'd be best to pick a name that indicates this is what you are doing. I'd personal even go as far as to call this page_section_... rather than just section, to be consistent with e.g. page_id, page_title, page_namespace, etc.

Another Q...will it ever be possible for a section to ever be linked to multiple wikidata items?

You are specifically trying to assign a wikidata item a wikipedia page section, correct?

Hey @Ottomata : no, in the proposed model topic_qid refers to a Wikidata identifier of a blue link found in the body of a given section.

Ah okay cool. So would section_link_qid or section_link_wikidata_qid be a good name?

Thanks for the suggestion, but if there are no other objections we're going to move forward with topic_qid, topic_title, and topic_score: they're consistent with our naming for this work elsewhere, and they're shorter.
We'll make sure the documentation adequately explains what this means.

We'll make sure the documentation adequately explains what this means.

Okay, thanks. Please proceed!

they're consistent with our naming for this work elsewhere

I've been noticing a naming pattern happening recently, and perhaps if I describe it we can avoid it in the future.

Non technical product managers come up with a working title for a product, project, value stream, or OKR. E.g. "section topics" "structured data", "generated data". They do this because they need a name to refer to a group of work, not because they have a technical understanding of the thing they want to build. Then, even if it shouldn't, this working project name bubbles down into technical naming. This is a problem, because then we have poorly defined concepts that engineers have to live and work with.

As engineers, we should be very specific and consistent when we assign technical names. We have to assume that a new hire tasked with working on a project years from now will not have any context as to why something was named "generated data". Project and team names die quickly; technical names live "forever".

In this case, my understanding is that "topic" is a brand new name for a concept that already has a technical name: a wikidata item.
Perhaps my understanding of how this name to be is totally wrong! Please forgive me if so :)

Anyway, please proceed with your plan, let's just be more specific in the future! :)

As engineers, we should be very specific and consistent when we assign technical names. We have to assume that a new hire tasked with working on a project years from now will not have any context as to why something was named "generated data". Project and team names die quickly; technical names live "forever".

I agree. I'll make sure the columns documentation will be tightly bound to the columns themselves, so that a describe statement will be enough for any engineer to properly grasp the schema.

In this case, my understanding is that "topic" is a brand new name for a concept that already has a technical name: a wikidata item.

Here it's actually a prefix that maps to a section blue link, so it has a broader scope: it includes a Wikidata item identifier (qid), a title (title), and a relevance score (score). I acknowledge we may replace the prefix topic_ with section_blue_link_, but that would be more expensive both in terms of naming consistency and string length.