Find a method for `templatetiger` to sort large files
Closed, Resolved · Public

Description

51071     1522 20.7  0.2  99740 22168 ?        TN   20:22   6:37 sort /data/project/templatetiger/public_html/dumps/itwiki-2016-01-11.txt -k4,4 -k2,2 -k3,3n -k5,5n -t? -o /data/project/templatetiger/public_html/dumps/sort/itwiki-2016-01-11.txt -T /data/project/templatetiger

This is causing excessive load on the NFS server. Please either copy the files to the local /tmp, sort locally, then copy back, or use a database.
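A minimal sketch of the "copy to local /tmp, sort locally, copy back" approach, assuming GNU sort and a node-local /tmp. The function name `sort_via_local_tmp` and the scratch-directory layout are illustrative and not part of the tool; the sort keys are the ones from the job above, and the field separator is left as a parameter.

```shell
# Hypothetical helper: sort a large file on node-local disk instead of on NFS.
# Usage: sort_via_local_tmp INPUT OUTPUT SEPARATOR
sort_via_local_tmp() {
    src=$1      # input file (may live on NFS)
    dest=$2     # output file (may live on NFS)
    sep=$3      # single-character field separator
    # Private scratch directory on local disk (assumes /tmp is not NFS).
    work=$(mktemp -d "${TMPDIR:-/tmp}/sort.XXXXXX") || return 1

    # Step 1: one linear read from NFS.
    # Step 2: sort locally; -T keeps sort's temporary files local as well.
    # Step 3: one linear write back to NFS.
    cp "$src" "$work/in.txt" &&
        sort -t "$sep" -k4,4 -k2,2 -k3,3n -k5,5n \
            -T "$work" -o "$work/out.txt" "$work/in.txt" &&
        cp "$work/out.txt" "$dest"
    status=$?

    # Always clean up the scratch directory, even on failure.
    rm -rf "$work"
    return $status
}
```

Used this way, NFS sees only one sequential read and one sequential write; all of sort's intermediate I/O stays on the local disk.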

Event Timeline

valhallasw updated the task description.
valhallasw raised the priority of this task from to Needs Triage.
valhallasw added projects: Toolforge, Tools.
valhallasw added subscribers: valhallasw, Kolossos, Nosy.
Restricted Application added a project: Cloud-Services. · Jan 26 2016, 8:58 PM
Restricted Application added subscribers: StudiesWorld, Aklapper.

I have killed this job. The start time coincides with the NFS outage today (job was started at 01/26/2016 19:03:39 UTC), and -- although we're not sure if this was the main cause -- we don't want to risk NFS going down again.

qdeled another job just now. It's putting severe stress on NFS -- please do not run another job without first consulting.

Sorry, I read this too late.
I will now sort to "/tmp/sort/" and hope that's OK.

InnoDB does not seem to keep rows in a physical sort order without a primary or unique key, so I have to sort the data before I import it.

Is it true that "/tmp/" is local to a server randomly chosen by jstart? If so, how can I use these files later, in the next job, to import them into the database?

Why is everything so complicated? ...

Yes, /tmp/ is locally on the server (that's the point -- you don't want to hit the NFS server). You can either combine the jobs (so that they run on the same server), or you can write back the result file to NFS. Two linear 5GB writes are much better than swapping to NFS.

And, after reading the GNU sort documentation in more detail: why are you passing -T /data/project/templatetiger?

‘-T tempdir’
‘--temporary-directory=tempdir’
Use directory tempdir to store temporary files, overriding the TMPDIR environment variable. If this option is given more than once, temporary files are stored in all the directories given. If you have a large sort or merge that is I/O-bound, you can often improve performance by using this option to specify directories on different disks and controllers.

It might be that sort actually does an intelligent read-once write-once, but specifying NFS as temp directory is almost guaranteed to bring the NFS server to its knees.

Yeah, /data/project/templatetiger is full of sort temp files (sortXXXXXX), totalling about 9GB from the last run. /tmp is 'only' 15GB, so you might run into issues there.
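Since local scratch space is limited, a job could check the free space before it starts sorting. This is only a sketch: `has_room_for` is a hypothetical helper, and the default safety factor of 2 (one copy of the input plus roughly the same amount of temporary files) is an assumption, not a measured value.

```shell
# Hypothetical helper: succeed only if DIR has room for FACTOR times FILE.
# Usage: has_room_for FILE DIR [FACTOR]
has_room_for() {
    file=$1
    dir=$2
    factor=${3:-2}   # assumed margin: input copy plus sort temp files

    # Input size in KiB, rounded up, times the safety factor.
    bytes=$(wc -c < "$file") || return 1
    need_kb=$(( (bytes + 1023) / 1024 * factor ))

    # POSIX df: with -P -k, column 4 of the second line is the available KiB.
    free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')

    [ "$free_kb" -ge "$need_kb" ]
}
```

A job script could then bail out early, for example `has_room_for input.txt /tmp || exit 1`, instead of filling the disk halfway through a sort.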

It might help to reserve more memory, and to pass --buffer-size with e.g. a 1GB buffer size. That should significantly reduce the amount of swapping to temporary files.
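As a rough illustration of that suggestion, assuming GNU sort: `-S` is the short form of `--buffer-size`, and `-T` points the spill files at a local directory. The wrapper name `sort_buffered` and its argument order are made up for this sketch; the job would also need a matching memory reservation from the grid.

```shell
# Hypothetical wrapper: sort with a larger in-memory buffer and local spill files.
# Usage: sort_buffered BUFFER TMPDIR SEPARATOR INPUT OUTPUT
sort_buffered() {
    buffer=$1   # e.g. 1G, as suggested above
    tmpdir=$2   # node-local directory for temporary files, not NFS
    sep=$3      # single-character field separator
    src=$4
    dest=$5

    # sort does not create its temporary directory itself.
    mkdir -p "$tmpdir" || return 1

    # A larger buffer means fewer and bigger sorted runs, so less temp-file I/O.
    sort -S "$buffer" -T "$tmpdir" -t "$sep" \
        -k4,4 -k2,2 -k3,3n -k5,5n -o "$dest" "$src"
}
```

With a 1 GB buffer, a 5 GB input would still spill to disk, but in a handful of large runs rather than many small ones.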

For the record, I sent an e-mail on 7 Jun 2015 to the entire group of maintainers:

Hello Templatetiger maintainers,

One of your SGE jobs,

sort /data/project/templatetiger/public_html/dumps/ruwiki-2015-03-24.txt -k4,4 -k2,2 -k3,3n -k5,5n -t? -o /data/project/templatetiger/public_html/dumps/sort/ruwiki-2015-03-24.txt -T /data/project/templatetiger

was causing excessive use of NFS, causing performance issues. Yuvi also suspended all other jobs to prevent more issues.

Please discuss with Yuvi (CCed) what the best option for these kind of jobs is. (I'm just relaying his message; he had to leave shortly after suspending the jobs)

It seems @Kolossos did not receive this e-mail (broken @toolserver.org forward -- that may have been temporary, but you might want to consider setting a different email address in wikitech), but I did not get any other bounces. I did not receive any reply to my email, and also did not follow up at that time.

I will now try -T /tmp/sort, so the temporary directory should be on the local /tmp.
I code this stuff often after my regular job, so sometimes I'm fine if things seem to work, even if it's not optimal. Sorry if such things happen.

This killed NFS again just now, so I've temporarily disabled the tool.

I understand this is difficult - and thank you for your generally awesome work :) I've only disabled it temporarily so we can figure out a way of going forward that doesn't involve doing heavy operations on NFS - this makes tools unavailable for everyone else, which is a problem.

Options are:

  1. Use a database directly (perhaps add a primary key?)
  2. Request and set up your own project (outside of tools) that does not use NFS
  3. Other...

I don't fully know what templatetiger does, so I cannot offer full solutions. Can you tell us a bit more about why the sort is necessary, and what it is sorting?

I've temporarily disabled access for all 4 maintainers to the tool, but shall restore them shortly once discussion here gets going.

Just wanted to say #2 above should be pretty straightforward and is meant for this kind of case.

Thanks :) We are in Cloud-Services on irc if you have a minute to talk

scfc added a subscriber: scfc. · Jan 30 2016, 5:41 AM

I don't think it's very efficient for someone who works on this in his spare time to add the maintenance of a Labs project to his agenda :-).

My understanding was that Templatetiger imports template parameters/values from the dumps into public database tables, so sorting should not be necessary because the database will do what it needs to do itself.

(Ceterum censeo the functionality of Templatetiger should be part of MediaWiki itself: Instead of those "wait a month, load a dump into the database" cycles, there should be (jobqueued) hooks that after creating/editing/deleting a wiki page update the database's corresponding entries. The database could be on separate servers as all its data can be regenerated if necessary. It could then be replicated to Labs or made accessible via API calls for gadgets.)

@scfc:
To the first point: Yes, maintaining a Labs project would be too much overhead for me. Normally templatetiger is only a little PHP script and a database. The trouble with the sort script came later.

The database is normally fast enough when using indexes. That's the case for medium-sized Wikipedias like dewiki with 3 GB. The problems start with the large enwiki at 23 GB; there it makes a really noticeable difference for end users whether the data are clustered on the hard drives in the right way.

(The templatetiger database as a live service would be great, but it would be 7 years too late, and I still hope that this service becomes obsolete and is replaced by WikiData. Maybe such an extension would still be interesting for other projects, even if Semantic MediaWiki is more powerful.)

So now to my plan to solve the actual problem:
I would try to solve this by using "order by" inside the database. This would also make it easier to maintain than the two steps I have now. I cannot guarantee that the keys are unique, but I would try it.
So if I could get back access to the project, I would promise to never sort anything on the command line.

Yes, of course. The removal of access was mostly because we have no easy way to make sure a user reads a message, so the approach is heavy-handed. Sorry about that.

Coming back to using sort specifically -- just copying files to a host and back *should* be OK, as that's a linear copy of a few GB (so that should take maybe a minute to copy). However, it seems sort doesn't do this in a smart way. I'm planning to do some tests on this, but I'm not sure when I'll have time for it.

I'm not sure about the best solution for the enwiki situation -- 23GB is too big for /tmp, but maybe /srv has enough space.

Is this still an issue?

valhallasw renamed this task from templatetiger runs `sort` on huge NFS files to Find a method for `templatetiger` to sort large files. · May 27 2016, 12:40 PM
valhallasw triaged this task as Low priority.
valhallasw moved this task from Triage to Backlog on the Toolforge board. · May 27 2016, 1:16 PM
chasemp added a comment. · Dec 2 2016, 6:44 PM

FWIW scratch is not being served by the same server(s) as the regular tools project share anymore, and it has been increased to 3T. This has been hanging open for a while without update, so I'm going to resolve now until future work is understood.

chasemp closed this task as Resolved. · Dec 2 2016, 6:46 PM
chasemp claimed this task.