
Crawler edit warring with itself over https://toolhub.wikimedia.org/tools/toolforge-authors
Closed, Resolved / Public / Bug Report

Description

Screen Shot 2021-10-21 at 1.18.31 PM.png (505×721 px, 93 KB)

The crawler is supposed to notice when the same toolinfo record is submitted from multiple URLs and choose the oldest URL (first seen in the run) as the "winner". This is not happening for the record with name: toolforge-authors, which is present in both https://toolsadmin.wikimedia.org/tools/toolinfo/v1/toolinfo.json and https://authors.toolforge.org/toolinfo.json.

Current guess: the name in the record from toolsadmin is literally "toolforge.authors", which is later transformed to "toolforge-authors", and that transformed name never makes it into the "seen" collection used for the duplicate/edit war check.

Event Timeline

bd808 changed the task status from Open to In Progress. Oct 21 2021, 8:18 PM
bd808 claimed this task.
bd808 moved this task from Backlog to In Progress on the Toolhub board.

My guess looks to be correct. The relevant parts of the crawler look something like:

if toolinfo["name"] in seen:
    # T278065: Reject updates from multiple urls in same run
    logger.error(
        "Toolinfo %s already seen at %s",
        toolinfo["name"],
        seen[toolinfo["name"]],
    )
    expected_names.discard(toolinfo["name"])
    continue
seen[toolinfo["name"]] = run_url.url.url

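try:
    # The name is normalized inside from_toolinfo(), after the seen check above.
    obj, created, updated = Tool.objects.from_toolinfo(
        toolinfo,
        run_url.url.created_by,
        Tool.ORIGIN_CRAWLER,
        "Import from {}".format(run_url.url.url),
    )
    # (excerpt ends here; the except clause is not shown)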

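The conversion of "toolforge.authors" to "toolforge-authors" happens inside the Tool.objects.from_toolinfo call, which runs after the 'seen' tracking. The seen check therefore compares the raw names, so the two records look distinct even though both end up updating the same tool.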
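To illustrate the ordering problem, here is a minimal, self-contained sketch rather than the actual crawler code. It rests on several assumptions: `normalize_name` is a hypothetical stand-in for whatever transform from_toolinfo applies (here it only replaces "." with "-"), `crawl` and `RECORDS` are invented for the example, the record at authors.toolforge.org is assumed to already use the name "toolforge-authors", and the order in which the two URLs are visited is a guess.

```python
def normalize_name(name):
    """Hypothetical stand-in for the name transform done in from_toolinfo."""
    return name.replace(".", "-")


def crawl(url_records, normalize_first):
    """Simulate one crawler run; return (accepted, rejected) (name, url) pairs."""
    seen = {}
    accepted, rejected = [], []
    for url, toolinfo in url_records:
        raw = toolinfo["name"]
        # The choice of key used for the duplicate check is the whole difference.
        key = normalize_name(raw) if normalize_first else raw
        if key in seen:
            # Same tool already submitted by an earlier URL in this run.
            rejected.append((key, url))
            continue
        seen[key] = url
        # The stored name is always the normalized one, as in from_toolinfo.
        accepted.append((normalize_name(raw), url))
    return accepted, rejected


# Assumed record names and visiting order, for illustration only.
RECORDS = [
    (
        "https://toolsadmin.wikimedia.org/tools/toolinfo/v1/toolinfo.json",
        {"name": "toolforge.authors"},
    ),
    (
        "https://authors.toolforge.org/toolinfo.json",
        {"name": "toolforge-authors"},
    ),
]

# Raw-name check: both URLs are accepted and both write "toolforge-authors".
buggy_accepted, buggy_rejected = crawl(RECORDS, normalize_first=False)

# Normalized-name check: the second URL is rejected as a duplicate.
fixed_accepted, fixed_rejected = crawl(RECORDS, normalize_first=True)
```

With the raw-name check, both URLs pass and each run rewrites the same tool twice, which matches the edit war seen in the tool's history. Keying the seen collection on the normalized name lets the first URL win.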

bd808 renamed this task from Crawler edit waring with itself over https://toolhub.wikimedia.org/tools/toolforge-authors to Crawler edit warring with itself over https://toolhub.wikimedia.org/tools/toolforge-authors. Oct 21 2021, 8:40 PM

Change 732783 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[wikimedia/toolhub@main] crawler: Normalize toolinfo name before seen checks

https://gerrit.wikimedia.org/r/732783

Change 732783 merged by jenkins-bot:

[wikimedia/toolhub@main] crawler: Normalize toolinfo name before seen checks

https://gerrit.wikimedia.org/r/732783

Change 732832 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Bump container version to 2021-10-21-224232-production

https://gerrit.wikimedia.org/r/732832

Change 732832 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Bump container version to 2021-10-21-224232-production

https://gerrit.wikimedia.org/r/732832

bd808 moved this task from In Progress to Review on the Toolhub board.