Reworking Net::SFTP to handle large file downloads
I’m writing an application that downloads access logs from our production servers and runs the AWStats package against them to create the statistics web pages. This process is set up as a Rake task that uses Net::SFTP, the library written by Jamis Buck that Capistrano also uses. There is also a front-end Rails application to manage each of the applications whose logs are to be retrieved. Everything was working great until I tried to grab a 550MB file from one of our servers. Net::SFTP choked as it ran out of memory.
It turns out that the command:
sftp.get_file log_file, local_file
ends up putting the whole file into memory, which is fine for small files, but not the large one that I was trying to download. Luckily it wasn’t too bad to refactor my class. Here’s the new code to achieve the same effect as the above sftp.get_file command.
# Look up the remote file's size so we know when to stop reading.
stat = sftp.stat(log_file)
offset = 0
file_length = stat.size
length = 64 * 1024 * 1024  # request 64MB per read

File.open(local_file, File::CREAT|File::TRUNC|File::RDWR, 0644) do |f|
  while offset < file_length
    sftp.open_handle(log_file) do |handle|
      # Fetch the next chunk, starting where the previous one ended.
      data = sftp.read(handle, :length => length, :offset => offset)
      f.write(data)
      offset += data.length
    end
  end
end

This downloads the file in 64MB increments, so only about that much data is held in memory at any time.
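The chunking loop itself can be tried without a server. The following is only a sketch: `copy_in_chunks` is a hypothetical helper (not part of Net::SFTP), and a StringIO stands in for the remote file. It also stops early if the source returns no data, which guards against looping forever when a file shrinks mid-download.

```ruby
require "stringio"

# Hypothetical helper: copy `total` bytes to `out` in chunks of at most
# `chunk_size` bytes. The block receives (offset, length) and must return
# the next chunk as a String, or nil when no more data is available.
def copy_in_chunks(total, out, chunk_size)
  offset = 0
  while offset < total
    data = yield(offset, chunk_size)
    break if data.nil? || data.empty?  # source ended earlier than expected
    out.write(data)
    offset += data.bytesize
  end
  offset  # number of bytes actually copied
end

# A StringIO stands in for the remote log file in this sketch.
source = StringIO.new("x" * 1000)
dest = StringIO.new

copied = copy_in_chunks(source.size, dest, 64) do |offset, length|
  source.seek(offset)
  source.read(length)
end
```

With Net::SFTP, the block would instead make the same `sftp.read(handle, :length => length, :offset => offset)` call used in the code above.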
Error working with large YAML files
As part of my Application migration project, I need to pre-populate the new database with zip code data. The Rails Recipes book (very useful) has a nice recipe on extracting fixtures from live data. So, I’ve created a zips.yml file that contains all the zip code data that I can insert into the new database. However, when I try to load the fixture using this very cool rake task from Technoweenie, the YAML library throws the following exception: SystemStackError: stack level too deep.
It is possible to work around the error by increasing the stack limit on the command line. On Mac OS X (and probably Linux/Unix), the following command can be run before running the rake task: ulimit -s 32768. This raises the stack limit for the current shell session to 32MB (the value is given in kilobytes), which should be enough, unless the YAML file is really large, I suppose.
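To avoid forgetting the ulimit step, the rake task could check the limit before loading the fixture. This is a rough sketch, assuming a Unix-like system where Ruby's `Process.getrlimit` is available; `stack_limit_ok?` and `WANTED_STACK_BYTES` are names made up for illustration. It only reports the limit and does not try to change it.

```ruby
# 32MB, the same value that `ulimit -s 32768` sets (ulimit takes kilobytes,
# Process.getrlimit reports bytes).
WANTED_STACK_BYTES = 32 * 1024 * 1024

# Hypothetical helper: true if the soft stack limit is at least `wanted` bytes.
def stack_limit_ok?(wanted = WANTED_STACK_BYTES)
  soft, _hard = Process.getrlimit(Process::RLIMIT_STACK)
  soft == Process::RLIM_INFINITY || soft >= wanted
end

unless stack_limit_ok?
  warn "Stack limit is below 32MB; run `ulimit -s 32768` before loading large fixtures"
end
```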
Does anyone know if somebody is working on fixing the YAML library to be nicer to the stack?