What constitutes a large data download and where should I store the data?
Large data storage is atypical for desktop environments and better suited to clusters with a large number of CPUs and storage capacity running to terabytes. It is best to download the data directly to the cluster and avoid an additional desktop-to-cluster transfer step.
This article discusses the challenges faced with large data -> http://en.oreilly.com/datascience/public/schedule/detail/15330.
What clusters are available at UAB?
Information about the various clusters at UAB is available here -> http://me.eng.uab.edu/wiki/index.php?title=HPC_-_High_Performance_Computing.
What are the typical post-download steps?
The Cheaha cluster provides the system-wide environment variables $UABGRID_PROJECT and $UABGRID_SCRATCH (http://me.eng.uab.edu/wiki/index.php?title=Cheaha#Scratch). After download, the data are typically moved to the location given by $UABGRID_PROJECT. The person handling the data download is responsible for asking the cluster administrator to set 'read only' access permissions on the data to prevent accidental modification.
How do I check the integrity of a data download?
The md5sum command generates a checksum for a file -> http://en.wikipedia.org/wiki/Md5sum. Run md5sum on each downloaded file and compare the result with the md5sum published by the data provider (e.g. dbGaP). A mismatch indicates that the downloaded data are corrupted or incomplete, including any discrepancy introduced on the client side.
Linux -> http://www.linuxmanpages.com/man1/md5sum.1.php
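The comparison described above can also be scripted. The following is a minimal Python sketch, not a replacement for md5sum; the function names md5_of_file and verify_download are hypothetical, and it assumes the provider publishes the checksum as a hexadecimal string.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, reading it in chunks
    so that large files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True when the file's MD5 matches the provider's value.

    expected_md5 is assumed to be a hex string copied from the provider;
    surrounding whitespace and letter case are ignored.
    """
    return md5_of_file(path) == expected_md5.strip().lower()
```

Calling verify_download with the path of a downloaded file and the checksum listed by the provider returns False when the file needs to be downloaded again.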
What are the security and backup issues on the UAB clusters?
The physical and network security of the clusters is handled by the cluster system administrator, Mike Hanby, and his IT team. For files on the networked file system, Unix file permissions can be set in the usual way to restrict read-write-execute privileges. Currently there is no regular backup on the clusters. Data can be backed up on request to the system admin. If the data change frequently, commit them to the version control system. What is a version control system? -> http://www.ssg.uab.edu/wiki/x/gAEv.
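As an illustration of restricting permissions on files you own, the sketch below removes the write bits from every file under a directory. It is only an assumption about how one might do this from Python; the function names are hypothetical, and on the cluster the administrator may still need to set permissions for project data.

```python
import os
import stat


def make_read_only(path):
    """Clear the owner, group, and other write bits on a single file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    os.chmod(path, mode & ~write_bits)


def make_tree_read_only(root):
    """Apply make_read_only to every regular file below root.

    Directories are left unchanged, so files can still be added or
    removed by anyone with write access to the directory itself.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            make_read_only(os.path.join(dirpath, name))
```

Leaving directories writable is a deliberate simplification here; a stricter setup would also remove write access from the directories.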
How do I manage the download process from an FTP server?
- What are the download commands?
The lftp program helps users automate the download process (http://lftp.yar.ru/lftp-man.html). Its mirror command can be used to clone/mirror an entire directory/file layout (http://www.softpanorama.org/WWW/mirroring_tools.shtml). If the download is interrupted or the connection is broken, lftp will automatically resume the download from the point of interruption. Alternatively, you can use wget (http://www.softpanorama.org/Utilities/wget.shtml); use its '-c' option to resume a partially downloaded file.
- How do I estimate the size of data on an FTP server?
After logging into the FTP server, the user can run the du (disk usage) command (http://www.linfo.org/du.html) to list directory and file sizes.
- How do I estimate the time required to download the data from an FTP server?
Once the download from the FTP server has started, the progress is shown at the command prompt in KB/sec. The user can estimate the remaining time by dividing the total size of the data by the current download speed. First convert the speed to the same unit as the size, e.g. 408 KB/sec / 1000 = 0.408 MB/sec, or about 0.41 MB/sec. Please note that download speeds depend on the client traffic at the FTP server. Too many clients connected to an FTP server can significantly reduce the download speed per client.
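The arithmetic above can be sketched in a few lines of Python. The function names are hypothetical, the 10,000 MB data size is an assumed example value, and the conversion uses 1000 KB per MB as in the text.

```python
def estimate_download_seconds(total_size_mb, rate_kb_per_sec):
    """Estimate download time in seconds from size (MB) and speed (KB/sec)."""
    if rate_kb_per_sec <= 0:
        raise ValueError("download rate must be positive")
    rate_mb_per_sec = rate_kb_per_sec / 1000  # e.g. 408 KB/sec -> 0.408 MB/sec
    return total_size_mb / rate_mb_per_sec


def format_duration(seconds):
    """Format a number of seconds as hours, minutes, and seconds."""
    hours, remainder = divmod(int(round(seconds)), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


# Assumed example: 10,000 MB of data at the 408 KB/sec rate quoted above.
print(format_duration(estimate_download_seconds(10000, 408)))  # 6h 48m 30s
```

Because the server speed changes with client traffic, the estimate is only as good as the rate observed at the moment it is taken.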
How do I introduce the elements of reproducible research during data download?
One of the key elements of reproducible research is the raw data obtained from a source, e.g. an ftp/http site. Documenting or recording every step of the data download is therefore a key part of reproducibility.
1. If downloading via a browser, take step-by-step screenshots of the navigation to the download page, and document the URLs used for the download.
2. Create a shell script containing the commands and associated options used to download the data (http://www.freeos.com/guides/lsst/ch02sec01.html). The commands can later be retrieved from the shell script.
3. Document the md5sums and whether the downloaded data passed the integrity check.
4. If different versions of the data were downloaded at various time points, create a unique compressed 'read-only' archive for every version of the data. Use a version control system (as discussed above) to track changes made to the data.
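One possible way to record the items listed above is to append a small JSON record to a log file after each download. This is a sketch under assumptions: the function names, the record fields, and the one-record-per-line log format are all hypothetical and not an established UAB convention.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_download_record(source_url, command, local_path, expected_md5=None):
    """Describe one download: where it came from, how, when, and its MD5."""
    digest = hashlib.md5()
    with open(local_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    observed = digest.hexdigest()
    matches = None
    if expected_md5 is not None:
        matches = observed == expected_md5.strip().lower()
    return {
        "source_url": source_url,
        "command": command,  # the exact command line used for the download
        "local_path": str(local_path),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "md5_observed": observed,
        "md5_expected": expected_md5,
        "md5_matches": matches,  # None when the provider gave no checksum
    }


def append_record(log_path, record):
    """Append the record to the log as one JSON object per line."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record, sort_keys=True) + "\n")
```

The log file itself can then be committed to the version control system alongside the download script.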
Administrator Contact Information
Cheaha cluster system admin: Mike Hanby (firstname.lastname@example.org).
SVN version control system admin: Vinodh Srinivasasainagendra (email@example.com).