Request creation of topic-curator VPS project
Closed, Invalid, Public

Description

Project Name: topic-curator

Developer account usernames of requestors: so9q

Purpose: Curate subgraphs (e.g. the 85 million scientific articles currently missing a main subject) and add P921 (main subject) in batches based on matching.

Brief description: This is a simple Python Flask application. I intend to use a database to record use of the tool. MariaDB should suffice.

How soon you are hoping this can be fulfilled: as soon as possible.

I installed the tool on Toolforge, but because of some configuration limit it does not work as expected.
See https://github.com/dpriskorn/WikidataTopicCurator/issues/5

I have a lot of experience with installing and maintaining Linux servers. I expect to be able to handle everything myself. See https://meta.wikimedia.org/wiki/User:So9q

License: AGPLv3+

Event Timeline

So9q updated the task description.

@So9q Moving the tool to a dedicated project is reasonable if it will actually fix something. But if your guess about URL length is correct, I'm not sure how a dedicated project would differ from Toolforge: your dedicated project would still need to use the HTTPS front proxy for Cloud VPS projects, which is just a slightly different configuration than the Toolforge front proxy. It would be nice to have a better understanding of the problem you are facing before rushing to make a dedicated project the solution.

With the changes from https://github.com/dpriskorn/WikidataTopicCurator/commit/d532547c74c4ca156d712299f580bc72e50f645a now in place on your Toolforge tool I can't use something like https://topic-curator.toolforge.org/articles?qid=Q1949144&limit=100&prefix=haswbstatement%3AP31%3DQ13442814+-haswbstatement%3AP921%3DQ1334131&affix=-inlabel%3Asyndrome to try and help debug the problem as the live code aborts with a deliberate failure.

If your assumption about URL truncation by the Toolforge proxy is true, I'm not sure how your tool would work from anywhere as the long URL is a 302 redirect targeting https://quickstatements.toolforge.org. Or am I thinking about this wrong and the problem is somehow in emitting the redirect header from your tool?

Sending all 66 items from https://bd808-test2.toolforge.org/articles?qid=Q1949144&limit=100&prefix=haswbstatement%3AP31%3DQ13442814+-haswbstatement%3AP921%3DQ1334131&affix=-inlabel%3Asyndrome to quickstatements seems to work as expected. I have that tool running a git clone pointed at https://github.com/dpriskorn/WikidataTopicCurator/commit/940ec123a3ce82eb002331ae20da0cd3c95e5a96 with the Toolforge specific abort code removed. The current HEAD of your repo fails with some validation problem.

url to qs: https://quickstatements.toolforge.org/#/v

That is a 3492 character URL pointing to quickstatements. Per https://nginx.org/en/docs/http/ngx_http_core_module.html#large_client_header_buffers the default buffer size should be 8k, so there is a limit that would eventually be hit, but a rather large number of parameters should fit. You could even squeeze in a few more by adjusting your URL encoding. There is technically no reason to encode | as %7C when contacting quickstatements. The example 3492 character URL is reduced to 2704 characters when using | directly. The large_client_header_buffers limit is applied inbound to the quickstatements tool, so as I understand your stated problem, this would not be changed in any way by moving your tool outside of Toolforge.
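To make the encoding point concrete, here is a minimal Python sketch. It is not taken from the tool's code: the command list is hypothetical, and the `item|property|value` layout joined by `||` is only an assumption about what the payload looks like. The 8k figure is the nginx default mentioned above.

```python
from urllib.parse import quote

# Assumed limit: the documented nginx default for large_client_header_buffers.
NGINX_DEFAULT_BUFFER_BYTES = 8 * 1024

# Hypothetical payload: 66 commands shaped like item|property|value,
# joined by "||". The real tool may build its payload differently.
commands = [f"Q{n}|P921|Q1949144" for n in range(1, 67)]
payload = "||".join(commands)

# Percent-encode everything: each "|" becomes the three characters "%7C".
encoded_pipes = quote(payload, safe="")

# Keep "|" literal: each separator stays a single character.
literal_pipes = quote(payload, safe="|")

saved = len(encoded_pipes) - len(literal_pipes)
print(f"with %7C: {len(encoded_pipes)} characters")
print(f"with |:   {len(literal_pipes)} characters ({saved} saved)")
print("fits in default buffer:", len(literal_pipes) < NGINX_DEFAULT_BUFFER_BYTES)
```

Each literal | saves two characters, so the saving grows linearly with the number of statements in the batch.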

So9q claimed this task.

Big thanks for taking a deeper look into this and for the suggestion to adjust the encoding. I'll close this as resolved for now while I dive deeper into the issue and try to make it work.
Locally on my machine everything works fine. I can send batches as large as 5000 items to QS with no problems, which translates to 10k qs-lines to edit.

I changed the tool, see https://topic-curator.toolforge.org/results?lang=en&qid=Q1949144&limit=100&prefix=haswbstatement%3AP31%3DQ13442814+-haswbstatement%3AP921%3DQ1334131&affix=-inlabel%3Asyndrome which now loads.

"If your assumption about URL truncation by the Toolforge proxy is true, I'm not sure how your tool would work from anywhere as the long URL is a 302 redirect targeting https://quickstatements.toolforge.org. Or am I thinking about this wrong and the problem is somehow in emitting the redirect header from your tool?"

That is exactly my theory: nginx refuses to send the request from Toolforge to QS because the header exceeds the maximum allowed size.

So if I control the nginx configuration of the Cloud VPS instance that topic-curator runs on, I can tweak it until it also works for large batches.

JJMC89 changed the task status from Resolved to Invalid. Feb 2 2024, 4:05 PM