The following code block is simply a dump from my bash history. My instructors, Vinodh and Aishwarya, were excellent tutors who guided us step by step through the general workflow of MutSig analyses. This script was written to count tumor recurrences (n = 47) in a TCGA sample data set (n = 97k). You might find some errors in the processing steps; those will be due either to my beginner's knowledge of Linux syntax or to restricted access to the data. Overall, the analysis follows these steps:

...

Code Block
title#head and grep command
languagebash
head -n1 practice_final_table1.txt | tr '\t' '\n' | nl | grep "Disease"
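
The header lookup above can also be done so that the column number lands in a shell variable instead of being read off the screen. The following is only a minimal sketch on a toy table: the file name `toy_table.txt`, its columns, and its values are made up for illustration and are not the real practice_final_table1.txt.

```shell
# Toy stand-in for the real table (hypothetical columns and values).
workdir=$(mktemp -d)
table="$workdir/toy_table.txt"
{
  printf 'Sample_ID\tAge\tDisease Free Status\n'
  printf 'S1\t61\tRecurred/Progressed\n'
  printf 'S2\t55\tDiseaseFree\n'
} > "$table"

# Read only the header row and print the number of each field
# whose name contains "Disease".
col=$(awk -F'\t' 'NR == 1 { for (i = 1; i <= NF; i++) if ($i ~ /Disease/) print i; exit }' "$table")
echo "Disease column: $col"
```

If more than one header contains "Disease", `col` will hold several numbers, one per line, so it is worth looking at the value before passing it to 'cut'.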

3) Determine which column contains our tumor recurrence status, then execute the 'cut' command on the proper column (in this case, 39):

Code Block
languagebash
cut -f 39 practice_final_table1.txt | sort | uniq -c
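
Note that 'cut | sort | uniq -c' also counts the header row and lumps empty cells into an unlabeled line. A small sketch of the same tally with the header skipped and empty cells labeled is shown here; the toy file and its two columns are assumptions for illustration, and the real status column is 39 rather than 2.

```shell
# Toy table (hypothetical): sample ID in column 1, status in column 2.
workdir=$(mktemp -d)
table="$workdir/toy_table.txt"
{
  printf 'Sample_ID\tStatus\n'
  printf 'S1\tRecurred/Progressed\n'
  printf 'S2\tDiseaseFree\n'
  printf 'S3\t\n'
  printf 'S4\tRecurred/Progressed\n'
} > "$table"

# Count each status value, skipping the header (NR > 1) and
# labeling empty cells so they are easy to spot in the output.
tally=$(awk -F'\t' 'NR > 1 { v = ($2 == "" ? "(missing)" : $2); n[v]++ }
                    END    { for (v in n) print v "\t" n[v] }' "$table" | LC_ALL=C sort)
echo "$tally"
```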

4) We then run the 'awk' command and redirect its output to a .txt file for further processing. Because only rows whose status is exactly "Recurred/Progressed" are printed, rows with missing values are dropped:

Code Block
languagebash
awk -F'\t' '{if($39=="Recurred/Progressed") print $1;}' practice_final_table1.txt > recurrent_samples.txt
ls
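
Hard-coding column 39 works for this table but breaks silently if the columns are ever reordered. One possible variation, sketched on a toy file, looks the column up by its header name first. The header name "Status", the file names, and the sample IDs are hypothetical; substitute the real header text from your own table.

```shell
# Toy table (hypothetical columns and values).
workdir=$(mktemp -d)
table="$workdir/toy_table.txt"
out="$workdir/recurrent_toy.txt"
{
  printf 'Sample_ID\tAge\tStatus\n'
  printf 'S1\t61\tRecurred/Progressed\n'
  printf 'S2\t55\tDiseaseFree\n'
  printf 'S3\t70\t\n'
  printf 'S4\t48\tRecurred/Progressed\n'
} > "$table"

# On the header row, find the column whose name equals "name";
# on every later row, print column 1 when that column equals "want".
awk -F'\t' -v name="Status" -v want="Recurred/Progressed" '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) c = i
            if (!c) { print "column not found: " name > "/dev/stderr"; exit 1 }
            next }
  $c == want { print $1 }
' "$table" > "$out"

cat "$out"
```

The header row and the row with the empty status never match `want`, so neither ends up in the output file.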

5) Count the lines in the .txt file and the .maf file, then list the numbered column names of the .maf file:

Code Block
languagebash
wc -l recurrent_samples.txt
wc -l mod_TCGA_mutations_significance_analysis.maf 
head -n1 mod_TCGA_mutations_significance_analysis.maf | tr '\t' '\n' | nl

6) Next, we run 'cut' on column 16 to count the unique samples, then on columns 5 and 6 to count the unique chromosomal locations:

Code Block
languagebash
cut -f16 mod_TCGA_mutations_significance_analysis.maf | sort | uniq | wc -l
cut -f5,6 mod_TCGA_mutations_significance_analysis.maf | sort | uniq | wc -l
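
One thing to keep in mind with these counts: if the file starts with a header row, the header text is counted as one of the "unique" values. The sketch below shows the difference on a toy file with three columns; the file and barcodes are made up, and the sample column is 2 here rather than 16.

```shell
# Toy MAF-like file (hypothetical rows; real MAF files have many more columns).
workdir=$(mktemp -d)
maf="$workdir/toy.maf"
{
  printf 'Hugo_Symbol\tTumor_Sample_Barcode\tChromosome\n'
  printf 'TP53\tTCGA-AA-0001\t17\n'
  printf 'KRAS\tTCGA-AA-0001\t12\n'
  printf 'EGFR\tTCGA-AA-0002\t7\n'
} > "$maf"

# Same count twice: once over the whole file, once with the header skipped.
with_header=$(cut -f2 "$maf" | sort -u | wc -l | tr -d ' ')
without_header=$(tail -n +2 "$maf" | cut -f2 | sort -u | wc -l | tr -d ' ')
echo "unique values including header: $with_header"
echo "unique samples: $without_header"
```

Here `sort -u` does the same job as `sort | uniq`, and `tail -n +2` starts reading at the second line.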

7) Provided that your machine has access to Python, you can have Python match the two files and write out a data set for further analysis in one line. WARNING: make sure you do not repeat my mistake, which, as you will notice below, was to write another .maf file instead of a .txt file; this causes an error in further processing:

Code Block
languagebash
python sample_subsetter.py recurrent_samples.maf mod_TCGA_mutations_significance_analysis.maf
head adenocarcinoma_samples_mod_TCGA_mutations_significance_analysis.maf 
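
The contents of sample_subsetter.py are not shown on this page, so the following is only a guess at the kind of matching it performs: keep the header row plus every mutation row whose sample ID appears in the sample list. It is sketched with 'awk' on toy files; the file names, barcodes, and the choice of column 2 for the sample ID are all assumptions (the real file uses column 16).

```shell
# Toy inputs (hypothetical): a sample list and a two-column MAF-like file.
workdir=$(mktemp -d)
samples="$workdir/samples.txt"
maf="$workdir/toy.maf"
subset="$workdir/subset.maf"

printf 'TCGA-AA-0001\nTCGA-AA-0003\n' > "$samples"
{
  printf 'Hugo_Symbol\tTumor_Sample_Barcode\n'
  printf 'TP53\tTCGA-AA-0001\n'
  printf 'KRAS\tTCGA-AA-0002\n'
  printf 'EGFR\tTCGA-AA-0003\n'
} > "$maf"

# While reading the first file (NR == FNR), remember each sample ID.
# While reading the second file, print the header row (FNR == 1)
# and any row whose sample column is in the remembered list.
awk -F'\t' -v col=2 '
  NR == FNR { keep[$1]; next }
  FNR == 1 || ($col in keep)
' "$samples" "$maf" > "$subset"

cat "$subset"
```

This is not a replacement for the course script, just a way to sanity-check its output: the number of unique samples in the subset should never exceed the number of lines in the sample list.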

8) 'Cut' the 16th column again, this time on the subset file, to list the unique samples with tumor recurrence:

Code Block
languagebash
cut -f16 adenocarcinoma_samples_mod_TCGA_mutations_significance_analysis.maf  | sort | uniq

9) Now rename the output file and/or move it to your own directory for ease of editing. Then open the bash script that will run on the cluster, change the email address to your own, and point the source file path at your new file name. Note the space in the original bash script and retain it; simply paste the name of your own file (here "recurrent_samples.maf") over whatever is written at the end of the source file path:

Code Block
languagebash
vi run.sh
# press a to enter insert mode
# make your edits in vi
# press Esc, then type :wq to save and quit
 
sbatch run.sh
 
#ensure your job is running
squeue

10) Check your job and output for any errors!
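
One way to do this check is to search the job's log for error messages. The sketch below assumes that run.sh does not set its own output file, in which case SLURM writes the log to slurm-<jobid>.out in the directory where 'sbatch' was run. The job ID and log contents here are made up so the example can run anywhere.

```shell
# Fake log file standing in for a real SLURM log (hypothetical job ID and text).
workdir=$(mktemp -d)
log="$workdir/slurm-12345.out"
printf 'Loading MAF\nERROR: file not found\nDone\n' > "$log"

# Print every line that mentions an error, exception, or traceback,
# prefixed with the file name and line number; "|| true" keeps the
# command from failing when nothing matches.
hits=$(grep -inHE 'error|exception|traceback' "$workdir"/slurm-*.out || true)
echo "$hits"
```

On the cluster itself, `sacct -j <jobid> --format=JobID,State,ExitCode` should also report whether the job completed or failed, provided job accounting is enabled there.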

 

11) Full script (including messy commands) below: 

Code Block
titleMutSig2CV Code (5/23)
languagebash
cd /data/project/ssg-big-data/akin_data/interns/walker
  727  ls
  728  cd recurrent
  729  ls
  730  head -nl practice_final_table1.txt | tr '\t' '\n' | nl | grep "Disease"
  731  head practice_final_table1.txt
  732  head -nl practice_final_table1.txt 
  733  head -n1 practice_final_table1.txt 
  734  head -n1 practice_final_table1.txt | tr '\t' '\n' | nl | grep "Disease"
  735  cut -f 39 practice_final_table1.txt | sort | uniq -c
  736  awk -F'\t' '{if($39=="Recurred/Progressed") print $1;}' practice_final_table1.txt > recurrent_samples.txt
  737  ls
  738  wc -l recurrent_samples.txt 
  739  wc -l mod_TCGA_mutations_significance_analysis.maf 
  740  head -n1 mod_TCGA_mutations_significance_analysis.maf | tr '\t' '\n' | n1
  741  head -n1 mod_TCGA_mutations_significance_analysis.maf | tr '\t' '\n' | nl
  742  cut -f16 mod_TCGA_mutations_significance_analysis.maf | head
  743  cut -f16 mod_TCGA_mutations_significance_analysis.maf | sort | uniq | wc -l
  744  cut -f5,6 mod_TCGA_mutations_significance_analysis.maf | sort | uniq | wc -l
  745  ls -lt
  746  python
  747  python sample_subsetter.py recurrent_samples.txt
  748  python sample_subsetter.py recurrent_samples.txt mod_TCGA_mutations_significance_analysis.maf
  749  ls
  750  head adenocarcinoma_samples_mod_TCGA_mutations_significance_analysis.maf 
  751  cut -f16 adenocarcinoma_samples_mod_TCGA_mutations_significance_analysis.maf  | sort | uniq | wc -l
  752  cut -f16 adenocarcinoma_samples_mod_TCGA_mutations_significance_analysis.maf  | sort | uniq
  753  mv adenocarcinoma_samples_mod_TCGA_mutations_significance_analysis.maf recurrent_samples.txt 
  754  vi run.sh
  755  sbatch run.sh
  756  vi run.sh
  757  sbatch run.sh
  758  squeue 
  759* mv adenocarcinoma_samples_mod_TCGA_mutations_significance_analysis.maf recurrent_samples.maf
  760  python sample_subsetter.py recurrent_samples.txt mod_TCGA_mutations_significance_analysis.maf
  761  cut -f16 adenocarcinoma_samples.maf | sort | uniq | wc -l
  762  cut -f16 adenocarcinoma_samples_mod_TCGA_mutations_significance_analysis.maf | sort | uniq | wc -l
  763  mv adenocarcinoma_samples.maf recurrent_samples.maf
  764  mv adenocarcinoma_samples_mod_TCGA_mutations_significance_analysis.maf recurrent_samples.maf
  765  vi run.sh
  766  sbatch run.sh
  767  vi run.sh
  768  sbatch run.sh
  769  vi run.sh
  770  sbatch run.sh
  771  vi run.sh
  772  sbatch run.sh
  773  squeue
  774  history


# The line numbers come from pasting the bash history output.