==== **User Story**
> ==== As a Structured Data Engineer I need to design the data model for the section topics data pipeline
== Why?
- So that we can document our plans to make a technical decision on the most appropriate storage solution with Data Persistence
== Done is:
[] Data model documented [[ https://miro.com/app/board/uXjVOpiLkTQ=/ | here ]]
[] Dataset size documented
[] Data access pattern documented
[] Dataset refresh pattern documented
[] Information collected to create ticket as detailed here: https://wikitech.wikimedia.org/wiki/MariaDB#Database_creation_template
== Proposed data model
- snapshot: string - **partition column**
- wiki_db: string - **partition column**
- page_namespace: int
- revision_id: int
- page_qid: string or null - the page Wikidata ID
- page_id: int
- page_title: string
- section_title: string
- topic_qid: string - the topic Wikidata ID
- topic_title: string
- topic_score: int - the topic relevance score
NOTE: this schema will result in denormalized data
NOTE: we don't have section identifiers
=== Example row
Hyperlinked for illustration purposes.
| snapshot | wiki_db | page_namespace | revision_id | page_qid | page_id | page_title | section_title | topic_qid | topic_title | topic_score
| 2022-07-11 | enwiki | 0 | 1066420146 | [Q4464287](https://www.wikidata.org/wiki/Q4464287) | 10391760 | [Work_(painting)](https://en.wikipedia.org/wiki/Work_(painting)) | [background and influences](https://en.wikipedia.org/wiki/Work_(painting)#Background_and_influences) | [Q543626](https://www.wikidata.org/wiki/Q543626) | [Lazzaroni_(Naples)](https://en.wikipedia.org/wiki/Lazzaroni_(Naples)) | 2
=== All available columns
- `describe wmf.mediawiki_wikitext_current`:
```
+--------------------------+-----------------------+---------------------------------------------------------------------------------+
| col_name | data_type | comment |
+--------------------------+-----------------------+---------------------------------------------------------------------------------+
| page_id | bigint | id of the page |
| page_namespace | int | namespace of the page |
| page_title | string | title of the page |
| page_redirect_title | string | title of the redirected-to page |
| page_restrictions | array<string> | restrictions of the page |
| user_id | bigint | id of the user that made the revision (or null/0 if anonymous) |
| user_text | string | text of the user that made the revision (either username or IP) |
| revision_id | bigint | id of the revision |
| revision_parent_id | bigint | id of the parent revision |
| revision_timestamp | string | timestamp of the revision (ISO8601 format) |
| revision_minor_edit | boolean | whether this revision is a minor edit or not |
| revision_comment | string | Comment made with revision |
| revision_text_bytes | bigint | bytes number of the revision text |
| revision_text_sha1 | string | sha1 hash of the revision text |
| revision_text | string | text of the revision |
| revision_content_model | string | content model of the revision |
| revision_content_format | string | content format of the revision |
| snapshot | string | Versioning information to keep multiple datasets (YYYY-MM for regular imports) |
| wiki_db | string | The wiki_db project |
| # Partition Information
| snapshot | string | Versioning information to keep multiple datasets (YYYY-MM for regular imports) |
| wiki_db | string | The wiki_db project |
+--------------------------+-----------------------+---------------------------------------------------------------------------------+
```
- ` describe wmf.wikidata_item_page_link`:
```
+---------------------------------+-----------------------+-------------------------------------------------------------------------------------------+
| col_name | data_type | comment |
+---------------------------------+-----------------------+-------------------------------------------------------------------------------------------+
| item_id | string | The wikidata item_id (Q32753077 for instance) |
| wiki_db | string | The db project of the page the wikidata item links to |
| page_id | bigint | The id of the page the wikidata item links to |
| page_title | string | The title of the page the wikidata item links to |
| page_namespace | bigint | The namespace of the page the wikidata item links to |
| page_title_localized_namespace | string | The title with localized namespace header of the page the wikidata item links to |
| snapshot | string | Versioning information to keep multiple datasets (YYYY-MM-DD for regular weekly imports) |
| # Partition Information
| snapshot | string | Versioning information to keep multiple datasets (YYYY-MM-DD for regular weekly imports) |
+---------------------------------+-----------------------+-------------------------------------------------------------------------------------------+
```
- additional section attributes, given a section title in a given wiki:
- count: bigint - the total occurrences
- pos_start_mean: double - the mean start position
- pos_end_mean: double - the mean end position