User Story
As a Structured Data Engineer I need to design the data model for the section topics data pipeline
Why?
- So that we can document our plans to make a technical decision on the most appropriate storage solution with Data Persistence
Done is:
- Data model documented - see following section
- Dataset size documented - ~716 M rows / 14 GB
- Data access pattern documented - we opted for access through Hive only
- Dataset refresh pattern documented - weekly frequency that follows Wikidata snapshots
See also this board.
Proposed data model
- snapshot: string
- wiki_db: string
- page_namespace: int
- revision_id: long
- page_qid: string - the page Wikidata ID
- page_id: long
- page_title: string
- section_index: int
- section_title: string
- topic_qid: string - the topic Wikidata ID
- topic_title: string
- topic_score: double - the topic relevance score
Example row
Hyperlinked for illustration purposes.
snapshot | wiki_db | page_namespace | revision_id | page_qid | page_id | page_title | section_index | section_title | topic_qid | topic_title | topic_score |
2022-07-11 | enwiki | 0 | 1066420146 | Q4464287 | 10391760 | Work_(painting) | 1 | background and influences | Q543626 | Lazzaroni_(Naples) | 6.66 |
All available columns
- describe wmf.mediawiki_wikitext_current:
+--------------------------+-----------------------+---------------------------------------------------------------------------------+ | col_name | data_type | comment | +--------------------------+-----------------------+---------------------------------------------------------------------------------+ | page_id | bigint | id of the page | | page_namespace | int | namespace of the page | | page_title | string | title of the page | | page_redirect_title | string | title of the redirected-to page | | page_restrictions | array<string> | restrictions of the page | | user_id | bigint | id of the user that made the revision (or null/0 if anonymous) | | user_text | string | text of the user that made the revision (either username or IP) | | revision_id | bigint | id of the revision | | revision_parent_id | bigint | id of the parent revision | | revision_timestamp | string | timestamp of the revision (ISO8601 format) | | revision_minor_edit | boolean | whether this revision is a minor edit or not | | revision_comment | string | Comment made with revision | | revision_text_bytes | bigint | bytes number of the revision text | | revision_text_sha1 | string | sha1 hash of the revision text | | revision_text | string | text of the revision | | revision_content_model | string | content model of the revision | | revision_content_format | string | content format of the revision | | snapshot | string | Versioning information to keep multiple datasets (YYYY-MM for regular imports) | | wiki_db | string | The wiki_db project | | # Partition Information | snapshot | string | Versioning information to keep multiple datasets (YYYY-MM for regular imports) | | wiki_db | string | The wiki_db project | +--------------------------+-----------------------+---------------------------------------------------------------------------------+
- describe wmf.wikidata_item_page_link:
+---------------------------------+-----------------------+-------------------------------------------------------------------------------------------+ | col_name | data_type | comment | +---------------------------------+-----------------------+-------------------------------------------------------------------------------------------+ | item_id | string | The wikidata item_id (Q32753077 for instance) | | wiki_db | string | The db project of the page the wikidata item links to | | page_id | bigint | The id of the page the wikidata item links to | | page_title | string | The title of the page the wikidata item links to | | page_namespace | bigint | The namespace of the page the wikidata item links to | | page_title_localized_namespace | string | The title with localized namespace header of the page the wikidata item links to | | snapshot | string | Versioning information to keep multiple datasets (YYYY-MM-DD for regular weekly imports) | | # Partition Information | snapshot | string | Versioning information to keep multiple datasets (YYYY-MM-DD for regular weekly imports) | +---------------------------------+-----------------------+-------------------------------------------------------------------------------------------+
- additional section attributes, given a section title in a given wiki:
- count: bigint - the total occurrences
- pos_start_mean: double - the mean start position
- pos_end_mean: double - the mean end position