New project: Petscan
Closed, Resolved, Public

Description

Project Name: petscan

Purpose:
Replacement for, and improvement of, the catscan2 and quick_intersection tools, possibly autolist as well.
The idea is a single VM (8 GB, 2-4 CPUs) running a web server written in C++, similar to WDQ.

Wikitech Username of requestor: magnus / Magnus Manske

Event Timeline

Are you aware of Catgraph? When it was developed and deployed, I was hoping that all resource and speed problems of catscan* would be gone forever, but it is not mentioned very often so I'm not sure if developers even know of its existence.

I am aware. Actually, at this point, the best option for category tree operations might be a Blazegraph instance, IMHO. But that's beside the point. Is there a catgraph instance running? The links I clicked on were all dead. If it is running, is it covering all wikis (including Commons etc.)? Is it up to date, as in lag measured in seconds? If the answer to any of these is "no", it is useless to me.

Catscan is popular, too popular for its own good, in terms of resources. I could try and port it to a VM "as is". But I'd be willing to invest the time to write something in C++ that is less resource-intensive, and would also allow for intersection with SPARQL query results etc. I have a webserver solution that I can port from WDQ.
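To illustrate the kind of intersection mentioned above, here is a minimal sketch in Python (not the C++ planned for the tool). It assumes both result sets are available as ascending lists of unique page IDs; the function name and the sample IDs are hypothetical, and nothing here reflects how the eventual tool is implemented.

```python
def intersect_sorted(a, b):
    """Return the values present in both ascending-sorted lists.

    Assumes each list is sorted and contains no duplicates
    (e.g. page IDs from a category scan and from a SPARQL result).
    """
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical sample data, for illustration only.
category_pages = [11, 42, 97, 105, 230]
sparql_pages = [42, 58, 105, 999]
print(intersect_sorted(category_pages, sparql_pages))  # [42, 105]
```

A single linear pass like this keeps memory use flat, which matters when the inputs are large category trees.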

You are right, the links don't work (sylvester.wmflabs.org doesn't resolve), and directly accessing the instance from within Labs gives:

scfc@toolsbeta-vagrant3-scfc:/srv/mediawiki-vagrant$ time curl "http://sylvester.catgraph.eqiad.wmflabs:8090/enwiki/traverse-successors+692675+6+&&+traverse-successors+691008+6"
FAILED! No such instance.

real    0m0.079s
user    0m0.023s
sys     0m0.005s
scfc@toolsbeta-vagrant3-scfc:/srv/mediawiki-vagrant$

(IIRC "No such instance" means: No dataset for, in this case, enwiki.) That's sad, but proves your point.

(Ceterum censeo this is functionality that should be provided by MediaWiki itself.)

Hi,

Is there a catgraph instance running? The links I clicked on were all dead. If it is running, is it covering all wikis (including Commons etc.)? Is it up to date, as in lag measured in seconds?

The link on https://wikitech.wikimedia.org/wiki/Nova_Resource:Catgraph seems to be wrong or not working. But CatGraph is online, including Commons. Currently the Commons graph is about 13 minutes old. See https://tools.wmflabs.org/cgstat (wait a few minutes for the page to gather all data).

You could use the JSONP interface, which is also used by the DeepCat gadget: https://tools.wmflabs.org/catgraph-jsonp/commonswiki_ns14/traverse-successors%20Category:Felis_silvestris_catus%2015%2070
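As a rough sketch, a URL of that shape could be assembled in Python as follows. The helper name `build_catgraph_url` is hypothetical, and the layout (graph name, then a percent-encoded, space-separated command) is inferred only from the example URL above. The meaning of the two trailing numbers is not stated in this thread, so they are passed through as opaque arguments.

```python
from urllib.parse import quote

# Base URL taken from the example in this thread.
BASE_URL = "https://tools.wmflabs.org/catgraph-jsonp"


def build_catgraph_url(graph, command, *args):
    """Build a catgraph-jsonp URL from a graph name, a command, and its arguments.

    The path layout is an assumption based on the single example URL above.
    """
    query = " ".join([command, *(str(arg) for arg in args)])
    # Encode spaces as %20 but keep ':' literal, matching the example URL.
    return f"{BASE_URL}/{graph}/{quote(query, safe=':')}"


url = build_catgraph_url(
    "commonswiki_ns14",
    "traverse-successors",
    "Category:Felis_silvestris_catus",
    15,
    70,
)
print(url)
```

This only constructs the request URL; it does not fetch anything or make assumptions about the response format.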

These all say "ns14". Does that mean they all store the tree of categories, but not the pages/files/etc. in them? Unless it does store page links, it is rather useless to me; I can get category trees reasonably cheap from the database.

Yes, this only stores the tree of categories.

I recently deleted the hostname 'sylvester.wmflabs.org' because it was not bound to an actual instance. Recreating it and binding it to sylvester.catgraph.wmflabs.org is certainly easy, if useful. It looks to me like the 'catgraph' project was mothballed and then later revived, which suggests that it's not getting the full attention that it needs.

I'm also happy to create a new project -- just let me know what you need.

I think the two do have some overlap, but I would really like to have the petscan project.

I assume I can create a VM with an externally visible host name myself then?

If all you need is http(s) then getting a public hostname is super easy, you just use the 'manage web proxies' panel. If you need a full-fledged public IP (e.g. for non-web services) then we can do that too, but I have to tinker with quotas.

Andrew claimed this task.

I created the Petscan project with Magnus as the only admin. Magnus, you can add other people and admins as you see fit; generally I like there to be at least two people responsible for a given project.

The CatGraph project is running and in use by several tools we developed at WMDE. A graph exists for every wiki available on Labs, including Commons. Carrying the page IDs for each wiki takes a sizable amount of RAM; our tools only use the categories, and community developers never showed any interest in using CatGraph, so I removed the graphs carrying the page IDs at some point. It is kind of frustrating to me personally, but I can't force anybody to use our work. After I'm back from vacation, I could add the graphs back with some effort, if there is interest.