Matcher fails parsing date data
Closed, Resolved · Public · BUG REPORT

Description

Steps to Reproduce:
run python3 wikidatarefisland/run.py --step match --input "scraped_data.jsonl" --output "matched_references.jsonl"

Actual Results:

python3 wikidatarefisland/run.py --step match --input "scraped_data.jsonl" --output "matched_references.jsonl"
Traceback (most recent call last):
  File "wikidatarefisland/run.py", line 97, in <module>
    main(sys.argv, __file__)
  File "wikidatarefisland/run.py", line 75, in main
    simple_pump.run(pipe, args.input_path, args.output_path)
  File "/home/tom/src/wikimedia/reference-island/wikidatarefisland/pumps/pump.py", line 20, in run
    output = pipe.flow(line)
  File "/home/tom/src/wikimedia/reference-island/wikidatarefisland/pipes/value_matcher_pipe.py", line 33, in flow
    if not any(match(potential_match) for match in filters):
  File "/home/tom/src/wikimedia/reference-island/wikidatarefisland/pipes/value_matcher_pipe.py", line 33, in <genexpr>
    if not any(match(potential_match) for match in filters):
  File "/home/tom/src/wikimedia/reference-island/wikidatarefisland/data_model/wikibase/value_matchers.py", line 77, in match_datetime
    return value in reference["extractedData"]
  File "/home/tom/src/wikimedia/reference-island/wikidatarefisland/data_model/wikibase/value_types.py", line 79, in __eq__
    date = isoparse(self.value)
  File "/home/tom/src/wikimedia/reference-island/venv/lib/python3.7/site-packages/dateutil/parser/isoparser.py", line 37, in func
    return f(self, str_in, *args, **kwargs)
  File "/home/tom/src/wikimedia/reference-island/venv/lib/python3.7/site-packages/dateutil/parser/isoparser.py", line 146, in isoparse
    return datetime(*components)
ValueError: day is out of range for month
Makefile:20: recipe for target 'data/matched_references.jsonl' failed

Expected Results:
No crash: the matcher should skip a statement whose date cannot be parsed and carry on with the rest of the input.

Event Timeline

Restricted Application added a subscriber: Aklapper.

It looks like the erroneous date (found by searching through the input statements for the next "time" value after the last successful output) was:

{"statement": {"pid": "P569", "datatype": "time", "value": {"time": "+1928-09-31T00:00:00Z", "timezone": 0, "before": 0, "after": 0, "precision": 11, "calendarmodel": "http://www.wikidata.org/entity/Q1985727"}}, "itemId": "Q8029141", "reference": {"referenceMetadata": {"P650": "393329", "P248": "Q17299517", "dateRetrieved": "2020-06-03 21:04:35", "P854": "https://rkd.nl/explore/artists/393329"}, "extractedData": ["1928-08-31"]}}

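Here you can see that the statement oddly contains the 31st of September (+1928-09-31), a date that does not exist, while the extracted data has 1928-08-31.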
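
One way the matcher could skip such a statement instead of crashing is sketched below. This is only an illustration, not the fix that was applied in reference-island: the helper names `parse_date_or_none` and `matches_any` are hypothetical, and the sketch uses the standard library's `datetime.date` rather than `dateutil`'s `isoparse`, assuming day-precision values like the one above.

```python
from datetime import date
from typing import Optional


def parse_date_or_none(wikibase_time: str) -> Optional[date]:
    """Return the calendar date of a time string, or None if it is not a valid date.

    Hypothetical helper for illustration; not part of reference-island.
    """
    # Statement times look like "+1928-09-31T00:00:00Z": drop the sign and the time part.
    date_part = wikibase_time.lstrip("+").split("T", 1)[0]
    try:
        year, month, day = (int(piece) for piece in date_part.split("-"))
        # date() raises ValueError for impossible dates such as 31 September.
        return date(year, month, day)
    except ValueError:
        return None


def matches_any(statement_time: str, extracted: list) -> bool:
    """Report no match (rather than raising) when the statement date is invalid."""
    wanted = parse_date_or_none(statement_time)
    if wanted is None:
        return False
    return any(parse_date_or_none(candidate) == wanted for candidate in extracted)


# The statement from this report is skipped instead of raising ValueError.
print(matches_any("+1928-09-31T00:00:00Z", ["1928-08-31"]))
```

Returning `None` for anything that is not a valid calendar date also means that values this sketch does not handle (for example negative years or a month of `00`) are skipped rather than matched, which may or may not be the desired behaviour.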