
{HuggingFace} Check if the schema is up to date (schema.yaml)
Closed, Resolved · Public · 2 Estimated Story Points

Description

Update schema.yaml to reflect latest status.
This can then be added to the README.md on Hugging Face so that the dataset card viewer renders correctly.

example:

---
- config_name: 20240916.en
  features:
  - name: name
    dtype: string
  - name: identifier
    dtype: int64
- …  # remaining fields follow our schema (https://api.enterprise.wikimedia.com/spec/spec.yaml)

Resources:

Event Timeline

The first changes are done.
Feedback from Albert:
After implementing the schema, I am finding some additional issues because the data is not aligned with the schema. For example:

  • I had to quote the field names false and true as 'false' and 'true', since YAML otherwise parses them as booleans (see the sketch after this comment)
  • I had to rename the field no_index to noindex (without the underscore)
  • I had to add the missing field event.date_published

I also discovered that the root field in_language is duplicated.
And now I am facing another missing field called "images"; I am investigating which parent field it belongs under.
So I was wondering if you have a newer version of the complete schema. Otherwise, I will keep iterating locally until I arrive at the complete one.
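
Regarding the quoting issue above: bare false and true in YAML parse as booleans rather than strings, which corrupts the feature names. A minimal illustration in Python, assuming PyYAML:

import yaml

# Unquoted false/true become booleans, not the strings the schema expects.
print(yaml.safe_load("name: false"))    # {'name': False}
print(yaml.safe_load("name: 'false'"))  # {'name': 'false'}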

More feedback:

Some dates cannot be properly parsed by Arrow and raise an error:

ArrowInvalid: Failed to parse string: '2024-04-28T02:46:52.311913Z' as a scalar of type timestamp[s]

I would suggest reverting to the string data type while the cause is investigated; timestamp support could be added later.
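
A minimal reproduction sketch, assuming pyarrow: timestamp[s] has whole-second resolution, so Arrow refuses to parse a string with fractional seconds into it, while a microsecond unit holds the value:

import pyarrow as pa

arr = pa.array(['2024-04-28T02:46:52.311913Z'])

# Second resolution cannot represent the .311913 fraction, so the cast fails.
try:
    arr.cast(pa.timestamp('s'))
except pa.ArrowInvalid as err:
    print(err)  # Failed to parse string ... as a scalar of type timestamp[s]

# Microsecond resolution (UTC, matching the trailing Z) parses cleanly.
print(arr.cast(pa.timestamp('us', tz='UTC')))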

Albert also added the missing field infoboxes.has_parts.images, as well as the nested section fields:

sections.has_parts.has_parts.has_parts.has_parts.name
sections.has_parts.has_parts.has_parts.has_parts.has_parts.links.images
sections.has_parts.has_parts.has_parts.has_parts.has_parts.has_parts
sections.has_parts.has_parts.has_parts.has_parts.has_parts.has_parts.links

SDelbecque-WMF renamed this task from Expand Hugging Face dataset card with metadata to Update schema.yaml. Sep 26 2024, 1:43 PM
SDelbecque-WMF updated the task description.
JArguello-WMF renamed this task from Update schema.yaml to Check if the schema is up to date (schema.yaml). Sep 26 2024, 2:07 PM
JArguello-WMF set the point value for this task to 2.

A small point: the has_parts node is a recursive data structure; it can contain child elements of its own type, including further has_parts nodes.

This is why we see sections.has_parts.has_parts.has_parts.has_parts, or potentially sections.has_parts.has_parts.has_parts.has_parts.has_parts.has_parts.has_parts.has_parts.has_parts.has_parts.has_parts.has_parts in some articles.

Explicitly documenting each level is not feasible: how many levels should we document (3, 4, 5, 6, or a million)?
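
Since Arrow, and therefore the datasets Features spec, cannot express truly recursive types, one workaround is to generate the nesting programmatically down to a fixed depth. A hypothetical sketch (the leaf field names are illustrative, not the full schema):

from datasets import Features, Sequence, Value

def has_parts_features(depth):
    # Leaf node: an illustrative subset of the real has_parts fields.
    node = {"name": Value("string"), "type": Value("string"), "value": Value("string")}
    # Wrap the node in itself `depth` times to emulate the recursion.
    for _ in range(depth):
        node = {
            "name": Value("string"),
            "type": Value("string"),
            "value": Value("string"),
            "has_parts": Sequence(node),
        }
    return node

features = Features({"sections": Sequence(has_parts_features(depth=6))})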

More feedback:

A field turned up in the data that is not defined in the README schema:

  • a data item with 4 fields was found:
    • content_url: string
    • width: int64
    • height: int64
    • alternative_text: string
  • however the expected schema contains only 3 fields:
    • content_url: Value(dtype='string', id=None)
    • width: Value(dtype='int64', id=None)
    • height: Value(dtype='int64', id=None)

Error in HF:

Error code:   DatasetGenerationError
Exception:    TypeError
Message:      Couldn't cast array of type
struct<identifier: int64, comment: string, is_minor_edit: bool, scores: struct<revertrisk: struct<probability: struct<false: double, true: double>, prediction: bool>>, editor: struct<identifier: int64, name: string, edit_count: int64, groups: list<item: string>, date_started: timestamp[s], is_patroller: bool, is_bot: bool, is_admin: bool, is_anonymous: bool, has_advanced_rights: bool>, number_of_characters: int64, size: struct<value: int64, unit_text: string>, noindex: bool, maintenance_tags: struct<pov_count: int64, citation_needed_count: int64, update_count: int64>, tags: list<item: string>, is_breaking_news: bool>
to
{'identifier': Value(dtype='int64', id=None), 'comment': Value(dtype='string', id=None), 'tags': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'is_minor_edit': Value(dtype='bool', id=None), 'is_flagged_stable': Value(dtype='bool', id=None), 'has_tag_needs_citation': Value(dtype='bool', id=None), 'scores': {'damaging': {'prediction': Value(dtype='bool', id=None), 'probability': {'false': Value(dtype='float64', id=None), 'true': Value(dtype='float64', id=None)}}, 'goodfaith': {'prediction': Value(dtype='bool', id=None), 'probability': {'false': Value(dtype='float64', id=None), 'true': Value(dtype='float64', id=None)}}, 'revertrisk': {'prediction': Value(dtype='bool', id=None), 'probability': {'false': Value(dtype='float64', id=None), 'true': Value(dtype='float64', id=None)}}}, 'editor': {'identifier': Value(dtype='int64', id=None), 'name': Value(dtype='string', id=None), 'edit_count': Value(dtype='int64', id=None), 'groups': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'is_bot': Value(dtype='bool', id=None), 'is_anonymous': Value(dtype='bool', id=None), 'is_admin': Value(dtype='bool', id=None), 'is_patroller': Value(dtype='bool', id=None), 'has_advanced_rights': Value(dtype='bool', id=None), 'date_started': Value(dtype='timestamp[s]', id=None)}, 'number_of_characters': Value(dtype='int64', id=None), 'size': {'value': Value(dtype='float64', id=None), 'unit_text': Value(dtype='string', id=None)}, 'event': {'identifier': Value(dtype='string', id=None), 'type': Value(dtype='string', id=None), 'date_created': Value(dtype='timestamp[s]', id=None)}, 'is_breaking_news': Value(dtype='bool', id=None), 'noindex': Value(dtype='bool', id=None), 'maintenance_tags': {'citation_needed_count': Value(dtype='int64', id=None), 'pov_count': Value(dtype='int64', id=None), 'clarification_needed_count': Value(dtype='int64', id=None), 'update_count': Value(dtype='int64', id=None)}}
Traceback:    Traceback (most recent call last):
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 2013, in _prepare_split_single
                  writer.write_table(table)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/arrow_writer.py", line 585, in write_table
                  pa_table = table_cast(pa_table, self._schema)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/table.py", line 2302, in table_cast
                  return cast_table_to_schema(table, schema)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/table.py", line 2261, in cast_table_to_schema
                  arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/table.py", line 2261, in <listcomp>
                  arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/table.py", line 1802, in wrapper
                  return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/table.py", line 1802, in <listcomp>
                  return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/table.py", line 2122, in cast_array_to_feature
                  raise TypeError(f"Couldn't cast array of type\n{_short_str(array.type)}\nto\n{_short_str(feature)}")
              TypeError: Couldn't cast array of type
              struct<identifier: int64, comment: string, is_minor_edit: bool, scores: struct<revertrisk: struct<probability: struct<false: double, true: double>, prediction: bool>>, editor: struct<identifier: int64, name: string, edit_count: int64, groups: list<item: string>, date_started: timestamp[s], is_patroller: bool, is_bot: bool, is_admin: bool, is_anonymous: bool, has_advanced_rights: bool>, number_of_characters: int64, size: struct<value: int64, unit_text: string>, noindex: bool, maintenance_tags: struct<pov_count: int64, citation_needed_count: int64, update_count: int64>, tags: list<item: string>, is_breaking_news: bool>
              to
              {'identifier': Value(dtype='int64', id=None), 'comment': Value(dtype='string', id=None), 'tags': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'is_minor_edit': Value(dtype='bool', id=None), 'is_flagged_stable': Value(dtype='bool', id=None), 'has_tag_needs_citation': Value(dtype='bool', id=None), 'scores': {'damaging': {'prediction': Value(dtype='bool', id=None), 'probability': {'false': Value(dtype='float64', id=None), 'true': Value(dtype='float64', id=None)}}, 'goodfaith': {'prediction': Value(dtype='bool', id=None), 'probability': {'false': Value(dtype='float64', id=None), 'true': Value(dtype='float64', id=None)}}, 'revertrisk': {'prediction': Value(dtype='bool', id=None), 'probability': {'false': Value(dtype='float64', id=None), 'true': Value(dtype='float64', id=None)}}}, 'editor': {'identifier': Value(dtype='int64', id=None), 'name': Value(dtype='string', id=None), 'edit_count': Value(dtype='int64', id=None), 'groups': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'is_bot': Value(dtype='bool', id=None), 'is_anonymous': Value(dtype='bool', id=None), 'is_admin': Value(dtype='bool', id=None), 'is_patroller': Value(dtype='bool', id=None), 'has_advanced_rights': Value(dtype='bool', id=None), 'date_started': Value(dtype='timestamp[s]', id=None)}, 'number_of_characters': Value(dtype='int64', id=None), 'size': {'value': Value(dtype='float64', id=None), 'unit_text': Value(dtype='string', id=None)}, 'event': {'identifier': Value(dtype='string', id=None), 'type': Value(dtype='string', id=None), 'date_created': Value(dtype='timestamp[s]', id=None)}, 'is_breaking_news': Value(dtype='bool', id=None), 'noindex': Value(dtype='bool', id=None), 'maintenance_tags': {'citation_needed_count': Value(dtype='int64', id=None), 'pov_count': Value(dtype='int64', id=None), 'clarification_needed_count': Value(dtype='int64', id=None), 'update_count': Value(dtype='int64', id=None)}}
              
              The above exception was the direct cause of the following exception:
              
              Traceback (most recent call last):
                File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 1391, in compute_config_parquet_and_info_response
                  parquet_operations, partial, estimated_dataset_info = stream_convert_to_parquet(
                File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 990, in stream_convert_to_parquet
                  builder._prepare_split(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1884, in _prepare_split
                  for job_id, done, content in self._prepare_split_single(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 2040, in _prepare_split_single
                  raise DatasetGenerationError("An error occurred while generating the dataset") from e
              datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
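
To catch this class of mismatch before the Hub's dataset viewer does, the same cast can be attempted locally on one shard. A sketch using the datasets internals seen in the traceback; the file names are hypothetical:

import json

import pyarrow.parquet as pq
from datasets import Features
from datasets.table import cast_table_to_schema

# Hypothetical inputs: one local parquet shard plus the features spec from the card.
table = pq.read_table("enwiki-00000.parquet")
with open("features.json") as f:
    features = Features.from_dict(json.load(f))

try:
    cast_table_to_schema(table, features.arrow_schema)
    print("shard matches the declared schema")
except TypeError as err:
    print("schema mismatch:", err)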

Luvo tested it for frwiki and the errors were fixed. He is now testing it for English, which is larger, so more errors may crop up. Target date: Monday.

JArguello-WMF renamed this task from Check if the schema is up to date (schema.yaml) to {HuggingFace} Check if the schema is up to date (schema.yaml). Feb 6 2025, 9:28 PM