Determine if templatetiger is abandoned
Open, High, Public

Description

While investigating large NFS users in Toolforge on 2020-05-21, @bd808 noticed that the files in $HOME/public_html/cache for the templatetiger tool were very old. He poked around a bit and found that the cron job to clean up the cache files for this tool, introduced in response to T208456: templatetiger is using 1.2TB out of 8T available, had been disabled at some point. When the script ran, it deleted all of the cache files, as they were all more than 90 days old.

Further poking around led to https://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Vorlagenauswertung, where it appears that the project has been marked as inactive.

@Bgwhite: Could you please stop filling this directory until we have found a solution for T184126.

If we, together with the WMF, are not able to solve T184126, the tool is in my eyes dead, and I would only keep it running with old data for some time.

The referenced ticket, T184126: Templatetiger-Updating: Lost connection to MySQL server during query, was closed for inactivity by @bd808. The code being used to load the multiple-gigabyte data files needs to be written in a different way. There is no reasonable way that ToolsDB can be tuned to allow the multiple hours of uninterrupted runtime that this style of data loading requires:

einspielen.sql
# ukwiki-2019-02-20.txt
# Date: 20190220
# DB: 20170501 sk: 20190220

DROP TABLE IF EXISTS `ukwiki_old`;
DROP TABLE IF EXISTS `ukwiki_sum_old`;
DROP TABLE IF EXISTS `ukwiki_para_old`;
DROP TABLE IF EXISTS `ukwiki_neu`;
DROP TABLE IF EXISTS `ukwiki_sum_neu`;
DROP TABLE IF EXISTS `ukwiki_para_neu`;

CREATE TABLE `ukwiki_neu` (
  `name_id` bigint(20) NOT NULL,
  `name` varchar(180) NOT NULL,
  `tp_nr` bigint(20) NOT NULL,
  `tp_name` varchar(100) NOT NULL,
  `entry_nr` bigint(20) NOT NULL,
  `entry_name` varchar(200) NOT NULL,
  `Value` varchar(900) NOT NULL,
  KEY `tp_name` (`tp_name`(30)),
  KEY `name` (`name`(15)),
  KEY `name_id` (`name_id`),
  KEY `entry_name` (`entry_name`(15)),
  KEY `Value` (`Value`(15))
) ENGINE=ARIA DEFAULT CHARSET=utf8;

LOAD DATA LOCAL INFILE '/data/project/templatetiger/public_html/dumps/ukwiki-2019-02-20.txt'
  IGNORE INTO TABLE `ukwiki_neu`
  FIELDS TERMINATED BY '\t' ESCAPED BY '\\'
  LINES TERMINATED BY '\n';

CREATE TABLE `ukwiki_sum_neu` (
  `tp_name` varchar(100) COLLATE utf8_bin NOT NULL,
  `sum` bigint(20) NOT NULL,
  PRIMARY KEY (`tp_name`),
  KEY `sum` (`sum`)
) ENGINE=ARIA DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

INSERT INTO `ukwiki_sum_neu`
  SELECT `tp_name`, COUNT(*) FROM `ukwiki_neu`
  WHERE `entry_nr` = '1'
  GROUP BY `tp_name` ORDER BY COUNT(*) DESC /* SLOW_OK */;

CREATE TABLE `ukwiki_para_neu` (
  `tp_name` varchar(100) COLLATE utf8_bin NOT NULL,
  `entry_name` varchar(200) COLLATE utf8_bin NOT NULL,
  `sum` bigint(20) NOT NULL,
  KEY `tp_name` (`tp_name`),
  KEY `entry_name` (`entry_name`)
) ENGINE=ARIA DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

INSERT INTO `ukwiki_para_neu`
  SELECT `tp_name`, `entry_name`, COUNT(*) FROM `ukwiki_neu`
  GROUP BY `tp_name`, `entry_name`;

RENAME TABLE `ukwiki` TO `ukwiki_old`, `ukwiki_sum` TO `ukwiki_sum_old`, `ukwiki_para` TO `ukwiki_para_old`;
RENAME TABLE `ukwiki_neu` TO `ukwiki`, `ukwiki_sum_neu` TO `ukwiki_sum`, `ukwiki_para_neu` TO `ukwiki_para`;

DROP TABLE IF EXISTS `ukwiki_old`;
DROP TABLE IF EXISTS `ukwiki_sum_old`;
DROP TABLE IF EXISTS `ukwiki_para_old`;

DELETE FROM `info` WHERE `lang` = 'ukwiki';
INSERT INTO `info` (`lang`, `date`, `sum`, `date_old`)
  VALUES ('ukwiki', '20190220', (SELECT SUM(`sum`) FROM `ukwiki_sum`), '20170501');

The open question now is whether @Kolossos is interested in reworking the data loading process or has given up on the project.

Event Timeline

The tool was historically useful to the editing communities of several wikis and it was even nominated for improvement in the 2015 Community Wishlist survey (T120767). I am not excited to see it die.

Due to the scale of data used by this project, both in the source files and in the computed database that powers it at runtime, I think it would be a good candidate for "graduation" from Toolforge into its own Cloud-VPS project. This would add overhead for the project maintainers in the form of managing their own instances, but it would also give the maintainers more flexibility in how their data is stored and processed. It is even possible to do something similar to what IABot has done, where the web interface of the tool remains on Toolforge but the database and import processing move to a dedicated project.

bd808 triaged this task as High priority. May 22 2020, 9:49 PM
bd808 updated the task description.
bd808 moved this task from Backlog to ToolsDB on the Data-Services board.
This comment was removed by Izno.

If I saw a chance to import the data, I would say it could be interesting to bring the project back to life.
Perhaps somebody has an idea.

Perhaps several small SQLite databases, one per template.
On the script side, I would nowadays use Python/pandas to produce much cleaner code. Perhaps it would make sense to store these pandas DataFrames one per template, as in the sketch below.
Perhaps something else would be a solution.
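
A rough sketch of that per-template SQLite idea, assuming pandas; the input path, output directory, and chunk size are placeholders, and the column names just follow the ukwiki_neu schema above:

import csv
import sqlite3
import pandas as pd

# Columns as in the ukwiki_neu table from einspielen.sql.
COLUMNS = ["name_id", "name", "tp_nr", "tp_name", "entry_nr", "entry_name", "value"]

def split_dump(dump_path, out_dir, chunksize=500_000):
    # Read the tab-separated dump in chunks so the whole file never sits in RAM.
    reader = pd.read_csv(dump_path, sep="\t", names=COLUMNS,
                         quoting=csv.QUOTE_NONE, chunksize=chunksize,
                         on_bad_lines="skip")
    for chunk in reader:
        # One SQLite file per template keeps each database small and independent.
        for tp_name, group in chunk.groupby("tp_name"):
            safe = "".join(c if c.isalnum() else "_" for c in str(tp_name))
            conn = sqlite3.connect(f"{out_dir}/{safe}.sqlite")
            group.to_sql("entries", conn, if_exists="append", index=False)
            conn.close()

split_dump("ukwiki-2019-02-20.txt", "per_template_dbs")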

I don't believe that I will administer a full server instance.
Templatetiger ran for many years with a good response from the community. It ran on smaller hardware on the Toolserver, and now it's not possible to run it at all, which is a little bit frustrating.

So first I need ideas for a solution; then I will check whether it makes sense to restart the project. Any help and any ideas are welcome.

@Kolossos looking at the code in the tool and the discussion that happened on T184126, it appears to me that the most recent problem is attempting to bulk load multiple gigabytes of data into ToolsDB in a single database transaction. That is just not going to work reliably on ToolsDB. I understand that this would be something that can easily be done on a local computer or with a dedicated database. The complication in ToolsDB is that there are hundreds of other users on the same database instance and no reasonable way to provide separate quality of service guarantees and fair resource allocations to each user and tool.

The cloud-services-team has hopes to provide different infrastructure to replace ToolsDB in the future, but today we are not able to make any estimates of when those services might be available. Our current ideal goal is to use OpenStack Trove to provide a "Database as a Service" solution which would allow each tool to have a dedicated database instance. This would provide much more isolation of users and allow us to establish quota limits and processes to request increased quotas for use cases such as Templatetiger.

Your immediate options, however, I think come down to either redesigning the data loading process to send data to the database in smaller chunks, with more attempts to detect and recover from interruptions, or moving to a dedicated Cloud-VPS project where you and your co-maintainers can manage and tune a local database. As mentioned in T253424#6159583, this could be a partial move where the Cloud VPS project only hosts the database service, with the rest of Templatetiger remaining on Toolforge. You could also consider moving all of the dump parsing and data loading to that Cloud VPS project to remove some of the other constraints you have been subject to regarding storage space for intermediate files and the speed of the shared NFS filesystem in Toolforge.
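
To make the first option concrete, here is a minimal sketch of chunked loading with retries, assuming pymysql; the host, credentials, batch size, and backoff policy are placeholders rather than a tested recipe:

import time
import pymysql

BATCH = 10_000
INSERT = ("INSERT INTO ukwiki_neu "
          "(name_id, name, tp_nr, tp_name, entry_nr, entry_name, Value) "
          "VALUES (%s, %s, %s, %s, %s, %s, %s)")

def connect():
    # Placeholder credentials; a real tool would read them from its replica.my.cnf.
    return pymysql.connect(host="tools.db.svc.wikimedia.cloud", user="...",
                           password="...", database="s51071__templatetiger_p")

def flush(conn, batch, retries=5):
    # Each batch is its own short transaction, so a dropped connection
    # loses at most one batch instead of a multi-hour load.
    for attempt in range(retries):
        try:
            with conn.cursor() as cur:
                cur.executemany(INSERT, batch)
            conn.commit()
            return conn
        except pymysql.OperationalError:
            time.sleep(2 ** attempt)  # back off, then reconnect and retry
            conn = connect()
    raise RuntimeError("batch still failing after %d retries" % retries)

def load_in_chunks(dump_path):
    conn = connect()
    batch = []
    with open(dump_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            row = line.rstrip("\n").split("\t")
            if len(row) == 7:  # skip malformed rows instead of aborting the load
                batch.append(row)
            if len(batch) >= BATCH:
                conn = flush(conn, batch)
                batch = []
        if batch:
            conn = flush(conn, batch)
    conn.close()

A real version would also need to record progress somewhere so that an interrupted run can resume without duplicating rows, since ukwiki_neu has no primary key to deduplicate on.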

Thanks @bd808 for your warm words about the project. Yes, running only a database on a Cloud-VPS server sounds more realistic to me than running a full web server. So I believe I will make a project request next weekend. I hope I'm able to configure the database; I don't have experience in this area, so I will need some help.

Is the old directory /data/project/templatetiger/public_html/dumps/ the right place to store the input data, or should it be on Cloud-VPS?
If so, @Bgwhite could start refilling this directory. Some files would also help me to make some tests (Python/pandas/Vaex) on my local machine.

> Thanks @bd808 for your warm words about the project. Yes, running only a database on a Cloud-VPS server sounds more realistic to me than running a full web server. So I believe I will make a project request next weekend. I hope I'm able to configure the database; I don't have experience in this area, so I will need some help.

Finding someone to help you with the database and underlying operating system maintenance would probably be a good idea. I'm sure there are some folks in the dewiki technical community who could help if you ask.

> Is the old directory /data/project/templatetiger/public_html/dumps/ the right place to store the input data, or should it be on Cloud-VPS?
> If so, @Bgwhite could start refilling this directory. Some files would also help me to make some tests (Python/pandas/Vaex) on my local machine.

Where to store your intermediate files is a really good question. I think you could start with them in the same location that you have used in the past. It would be nice if you could keep the size of the files stored on disk at any time as small as is reasonably possible; 50GiB or less would be ideal for storage in templatetiger's $HOME. If you need more space, a directory in /data/scratch/ would probably be a better option. Actually, /data/scratch/ might be the best place for your intermediate data files no matter what. You should be able to make a /data/scratch/templatetiger directory as your tool and then keep things there with fewer complaints from the Toolforge admins about the amount of disk you are using. Obviously the scratch share is still a finite resource, so it would be a good idea to delete files when you are done with them.

After thinking about my priorities, I think I will concentrate on other projects. @Bgwhite, a dump file for e.g. dewiki would still be nice for local experiments.

> After thinking about my priorities, I think I will concentrate on other projects.

@Kolossos The filesystem usage of the tool was fixed when I ran the cleanup script manually, but your tool is still using 7GB of storage on ToolsDB. Is this something that can be cleaned up too? The tool is still running at this point, but it looks like the data it is providing to its users is about 3 years old. Is data this far out of date still actively useful? It seems to serve about 1000 pages of content a day, but we don't track whether those requests come from "real" people or just various web crawlers.

Please keep s51071__templatetiger_p. Perhaps it can help me to rebuild the tool (maybe in Python/streamlit) so that I have a running dataset, even though I don't speak Ukrainian. Example: https://templatetiger.toolforge.org/tt-table4.php?template=Infobox%20Film&lang=ukwiki_neu
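
If it helps as a starting point, here is a very small hypothetical streamlit sketch in the spirit of tt-table4.php, assuming the per-template SQLite files from the earlier idea (paths and table names are made up):

import os
import sqlite3
import pandas as pd
import streamlit as st

st.title("Templatetiger (prototype)")
template = st.text_input("Template name", "Infobox_Film")

path = f"per_template_dbs/{template}.sqlite"
if os.path.exists(path):
    conn = sqlite3.connect(path)
    # Roughly what tt-table4.php shows: the raw parameter rows for one template.
    df = pd.read_sql("SELECT * FROM entries LIMIT 1000", conn)
    conn.close()
    st.dataframe(df)  # sortable, scrollable table in the browser
else:
    st.warning("No data for this template yet.")

Run it with "streamlit run app.py"; streamlit re-executes the script on every input change, so there is no extra web-server code to maintain.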