The following code block is simply a dump from my bash history. My instructors, Vinodh and Aishwarya, were excellent tutors, guiding us step by step through the generalities of MutSig analyses. This specific script is written with the goal of counting tumor recurrences (n = 47) from a TCGA sample data set (n = 97k). You might find some errors in the processing steps; those will be due either to my beginner's knowledge of Linux syntax or to restricted access to the data. Overall, this is achieved by the following steps:


Code Block
# head and grep command
head -n1 practice_final_table1.txt | tr '\t' '\n' | nl | grep "Disease"
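To see what this pipeline does piece by piece, here is a minimal sketch on a toy table. The file name `toy_table.txt` and its column names are made up for illustration and are not the real TCGA columns; the `awk` variant at the end is just one possible way to capture the column number in a variable.

```shell
# Toy tab-separated table; file name and column names are hypothetical.
workdir=$(mktemp -d)
printf 'Sample_ID\tAge\tDisease Free Status\n' > "$workdir/toy_table.txt"
printf 'S1\t61\tRecurred/Progressed\n' >> "$workdir/toy_table.txt"

# head -n1 keeps only the header row, tr puts each column name on its own
# line, nl numbers those lines, and grep keeps the names containing "Disease".
head -n1 "$workdir/toy_table.txt" | tr '\t' '\n' | nl | grep "Disease"

# Same lookup with awk, storing the matching column number in a variable.
disease_col=$(head -n1 "$workdir/toy_table.txt" |
    awk -F'\t' '{ for (i = 1; i <= NF; i++) if ($i ~ /Disease/) print i }')
echo "Disease column: $disease_col"
```

On the toy table the matching column is 3; on the real table the same lookup is what pointed to column 39.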

3) Determine which column contains our tumor recurrence status, then execute the 'cut' command on the proper column (in this case, 39):

Code Block
cut -f 39 practice_final_table1.txt | sort | uniq -c
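A small sketch of what the tally looks like, using a made-up four-sample table (the file name, column names, and values are assumptions, and column 3 stands in for column 39). Note that the header and any empty values are counted as their own categories, which is a quick way to spot missing data.

```shell
# Toy table: column 3 stands in for the recurrence-status column.
workdir=$(mktemp -d)
printf 'Sample_ID\tAge\tStatus\nS1\t61\tRecurred/Progressed\nS2\t55\tDiseaseFree\nS3\t70\tRecurred/Progressed\nS4\t48\t\n' \
    > "$workdir/toy_table.txt"

# cut extracts the column, sort groups identical values together,
# and uniq -c prints each distinct value with its count.
cut -f3 "$workdir/toy_table.txt" | sort | uniq -c

# Count only the rows whose status is exactly "Recurred/Progressed".
recurred_count=$(cut -f3 "$workdir/toy_table.txt" | grep -c '^Recurred/Progressed$')
echo "Recurred/Progressed rows: $recurred_count"
```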

4) We then execute the 'awk' command and redirect its output to a .txt file for further processing. Because only rows whose column 39 equals "Recurred/Progressed" are printed, rows with missing values are dropped:

Code Block
awk -F'\t' '{if($39=="Recurred/Progressed") print $1;}' practice_final_table1.txt > recurrent_samples.txt
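As a possible variation, the column can be looked up by its header name instead of hard-coding the number, so the filter keeps working if the column order changes. This is only a sketch on a toy table: the file names and the header name "Status" are assumptions, not the real column name in practice_final_table1.txt.

```shell
# Toy table; "Status" is a hypothetical header name.
workdir=$(mktemp -d)
printf 'Sample_ID\tAge\tStatus\nS1\t61\tRecurred/Progressed\nS2\t55\tDiseaseFree\nS3\t70\tRecurred/Progressed\nS4\t48\t\n' \
    > "$workdir/toy_table.txt"

# On the header row, find the column named "Status"; on every other row,
# print the sample ID (column 1) when that column matches the wanted value.
awk -F'\t' -v name="Status" -v want="Recurred/Progressed" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
    col && $col == want { print $1 }
' "$workdir/toy_table.txt" > "$workdir/toy_recurrent_samples.txt"

# wc -l pads its output on some systems, so strip the spaces.
n_recurrent=$(wc -l < "$workdir/toy_recurrent_samples.txt" | tr -d ' ')
echo "Recurrent samples: $n_recurrent"
```

The header row and the sample with an empty status never match, so neither ends up in the output file.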

5) Run counts on the .txt file and .maf file in place, then check the columns of the .maf file:

Code Block
wc -l recurrent_samples.txt
wc -l mod_TCGA_mutations_significance_analysis.maf 
head -n1 mod_TCGA_mutations_significance_analysis.maf | tr '\t' '\n' | nl

6) Next, we execute another 'cut' command on column 16 to count the unique samples, then again on columns 5 and 6 to count the unique chromosomal locations:

Code Block
cut -f16 mod_TCGA_mutations_significance_analysis.maf | sort | uniq | wc -l
cut -f5,6 mod_TCGA_mutations_significance_analysis.maf | sort | uniq | wc -l
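One thing to keep in mind with these counts is that the header row is counted as a value too. The sketch below shows the difference on a toy file; the file name is made up, and column 2 stands in for the sample column (column 16 in the file used here).

```shell
# Toy .maf-like file: column 2 stands in for the sample barcode column.
workdir=$(mktemp -d)
printf 'Hugo_Symbol\tTumor_Sample_Barcode\nTP53\tS1\nKRAS\tS1\nTP53\tS2\n' > "$workdir/toy.maf"

# Counting straight from the file includes the header text as a "sample".
with_header=$(cut -f2 "$workdir/toy.maf" | sort | uniq | wc -l | tr -d ' ')

# tail -n +2 skips the first line, so only real data rows are counted.
# sort -u is shorthand for sort | uniq.
without_header=$(tail -n +2 "$workdir/toy.maf" | cut -f2 | sort -u | wc -l | tr -d ' ')

echo "unique values including header: $with_header"
echo "unique samples: $without_header"
```

If the real file has a single header line, the true sample count is one less than what the plain `cut | sort | uniq | wc -l` pipeline reports.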

7) Provided that your machine has access to Python, you can have Python match the two files and write out a data set for further analysis in one line. (WARNING: do not make the mistake I did, which you will notice below: writing another .maf file instead of a .txt file. This will cause an error in further processing.)

Code Block
python recurrent_samples.maf mod_TCGA_mutations_significance_analysis.maf
head adenocarcinoma_samples_mod_TCGA_mutations_significance_analysis.maf 
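The Python script used for the matching is not shown here, so the following is not that script. It is only a rough sketch of the same idea using awk instead: keep the rows of the mutation file whose sample ID appears in the list of recurrent samples. The toy file names and contents are made up, column 2 stands in for column 16, and it assumes the IDs in both files match exactly.

```shell
# Toy inputs: a sample list and a small .maf-like table (hypothetical names).
workdir=$(mktemp -d)
printf 'S1\nS3\n' > "$workdir/toy_recurrent_samples.txt"
printf 'Hugo_Symbol\tTumor_Sample_Barcode\nTP53\tS1\nKRAS\tS2\nEGFR\tS3\nTP53\tS3\n' > "$workdir/toy.maf"

# NR == FNR is only true while awk reads the first file.
awk -F'\t' '
    NR == FNR { keep[$1] = 1; next }   # first file: remember each sample ID
    FNR == 1 || ($2 in keep)           # second file: keep header and matching rows
' "$workdir/toy_recurrent_samples.txt" "$workdir/toy.maf" > "$workdir/toy_recurrent.maf"

# Check the result: number of mutation rows kept, and which samples they belong to.
kept_rows=$(tail -n +2 "$workdir/toy_recurrent.maf" | wc -l | tr -d ' ')
kept_samples=$(tail -n +2 "$workdir/toy_recurrent.maf" | cut -f2 | sort -u | tr '\n' ' ')
echo "rows kept: $kept_rows"
echo "samples kept: $kept_samples"
```

The header line is kept so that the filtered file still has its column names for later steps.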

8) 'Cut' the 16th column again, this time on the output file, to list the unique tumor recurrence samples it contains:

Code Block
cut -f16 adenocarcinoma_samples_mod_TCGA_mutations_significance_analysis.maf  | sort | uniq

9) Now move the output file to a new file name and/or your own personal directory for ease of editing. Then open the bash script that will run on the cluster, edit it to use your email address, and point it to your new file name. Also, note the space in the original bash script and retain it: simply paste the name of your own file (here "recurrent_samples.maf") over whatever is written at the end of the source file path:

Code Block
# key in *a
# make edits within vi
# escape and :wq the file to save
# ensure your job is running

10) Check your job and output for any errors!


11) Full script (including messy commands) below: 

Code Block
# MutSig2CV Code (5/23)
cd /data/project/ssg-big-data/akin_data/interns/walker
  727  ls
  728  cd recurrent
  729  ls
  730  head -nl practice_final_table1.txt | tr '\t' '\n' | nl | grep "Disease"
  731  head practice_final_table1.txt
  732  head -nl practice_final_table1.txt 
  733  head -n1 practice_final_table1.txt 
  734  head -n1 practice_final_table1.txt | tr '\t' '\n' | nl | grep "Disease"
  735  cut -f 39 practice_final_table1.txt | sort | uniq -c
  736  awk -F'\t' '{if($39=="Recurred/Progressed") print $1;}' practice_final_table1.txt > recurrent_samples.txt
  737  ls
  738  wc -l recurrent_samples.txt 
  739  wc -l mod_TCGA_mutations_significance_analysis.maf 
  740  head -n1 mod_TCGA_mutations_significance_analysis.maf | tr '\t' '\n' | n1
  741  head -n1 mod_TCGA_mutations_significance_analysis.maf | tr '\t' '\n' | nl
  742  cut -f16 mod_TCGA_mutations_significance_analysis.maf | head
  743  cut -f16 mod_TCGA_mutations_significance_analysis.maf | sort | uniq | wc -l
  744  cut -f5,6 mod_TCGA_mutations_significance_analysis.maf | sort | uniq | wc -l
  745  ls -lt
  746  python
  747  python recurrent_samples.txt
  748  python recurrent_samples.txt mod_TCGA_mutations_significance_analysis.maf
  749  ls
  750  head adenocarcinoma_samples_mod_TCGA_mutations_significance_analysis.maf 
  751  cut -f16 adenocarcinoma_samples_mod_TCGA_mutations_significance_analysis.maf  | sort | uniq | wc -l
  752  cut -f16 adenocarcinoma_samples_mod_TCGA_mutations_significance_analysis.maf  | sort | uniq
  753  mv adenocarcinoma_samples_mod_TCGA_mutations_significance_analysis.maf recurrent_samples.txt 
  754  vi
  755  sbatch
  756  vi
  757  sbatch
  758  squeue 
  759* mv adenocarcinoma_samples_mod_TCGA_mutations_significance_analysis.maf recurrent_samples.maf
  760  python recurrent_samples.txt mod_TCGA_mutations_significance_analysis.maf
  761  cut -f16 adenocarcinoma_samples.maf | sort | uniq | wc -l
  762  cut -f16 adenocarcinoma_samples_mod_TCGA_mutations_significance_analysis.maf | sort | uniq | wc -l
  763  mv adenocarcinoma_samples.maf recurrent_samples.maf
  764  mv adenocarcinoma_samples_mod_TCGA_mutations_significance_analysis.maf recurrent_samples.maf
  765  vi
  766  sbatch
  767  vi
  768  sbatch
  769  vi
  770  sbatch
  771  vi
  772  sbatch
  773  squeue
  774  history

# The numbers at the start of each line come from pasting the bash history output