The following code snippets compute Fisher's exact test p-values for mutation frequency in any given 2x2 table (here split by tumor status, recurrent versus disease free, and by bacterial carriers versus non-carriers). The input comes from two .txt files of "sig_genes," one for disease-free and one for recurrent samples. The code merges them into one R data frame and adds columns for the resulting p, q, -log10(p), and Fisher's exact test values. The eventual goal is to generalize the process so that significant genes can be produced with more automation for any arbitrarily selected clinical or demographic factor.
First, read in the two files as tab-separated text, keeping strings as character vectors rather than converting them to factors, and supply column names:
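A minimal sketch of this step is shown here. The column names, the file contents, and the temporary file paths are all assumptions made for illustration: the real "sig_genes" files may have a different layout, and the counts and p/q values below are invented placeholders, not results.

```r
# Assumed column layout; the real sig_genes files may differ.
cols <- c("gene", "n_carrier", "n_noncarrier", "p", "q")

# Two tiny placeholder files so the sketch runs on its own (values are invented).
df_path  <- tempfile(fileext = ".txt")
rec_path <- tempfile(fileext = ".txt")
writeLines(c("TP53\t4\t10\t0.001\t0.01", "KRAS\t2\t12\t0.02\t0.08"), df_path)
writeLines(c("TP53\t9\t3\t0.0005\t0.004", "KRAS\t3\t11\t0.03\t0.09"), rec_path)

# Tab-separated, no header row, strings kept as character, names supplied by hand.
disease_free <- read.table(df_path, sep = "\t", header = FALSE,
                           stringsAsFactors = FALSE, col.names = cols)
recurrent    <- read.table(rec_path, sep = "\t", header = FALSE,
                           stringsAsFactors = FALSE, col.names = cols)

str(disease_free)
```

If the real files carry their own header row, `header = TRUE` would replace the `col.names` argument.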
Merge the two data frames by gene:
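The merge could look roughly like the sketch below. The data frames, their column names, and the `_df`/`_rec` suffixes are hypothetical stand-ins with invented counts. Note that `merge()` performs an inner join by default, so genes present in only one file are dropped.

```r
# Hypothetical stand-ins for the two data frames read from file (counts are invented).
disease_free <- data.frame(gene = c("TP53", "KRAS", "EGFR"),
                           n_carrier = c(4, 2, 1),
                           n_noncarrier = c(10, 12, 5),
                           stringsAsFactors = FALSE)
recurrent <- data.frame(gene = c("KRAS", "TP53", "BRAF"),
                        n_carrier = c(3, 9, 2),
                        n_noncarrier = c(11, 3, 6),
                        stringsAsFactors = FALSE)

# Inner join on gene; the suffixes keep the two sets of count columns apart.
merged <- merge(disease_free, recurrent, by = "gene",
                suffixes = c("_df", "_rec"))
print(merged)
```

Passing `all = TRUE` would keep genes found in only one file, with `NA` in the missing columns; those rows would then need handling before any test is run.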
Use the cbind() function to build a separate table containing only the selected columns from each data frame:
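As an illustration, the column selection might be sketched as follows. The `merged` data frame and its column names are assumptions with invented values; only the `gene_result` name comes from the description of the loop.

```r
# Hypothetical merged table (values are invented placeholders).
merged <- data.frame(gene = c("KRAS", "TP53"),
                     n_carrier_df = c(2, 4), n_noncarrier_df = c(12, 10),
                     p_df = c(0.02, 0.001),
                     n_carrier_rec = c(3, 9), n_noncarrier_rec = c(11, 3),
                     p_rec = c(0.03, 0.0005),
                     stringsAsFactors = FALSE)

# Bind the gene column to the count columns of each group, leaving the rest behind.
gene_result <- cbind(merged["gene"],
                     merged[c("n_carrier_df", "n_noncarrier_df")],
                     merged[c("n_carrier_rec", "n_noncarrier_rec")])
print(gene_result)
```

Because every piece here comes from the same merged data frame, the rows stay aligned by gene. Binding columns from two unmerged data frames would only be safe if their rows were already in the same gene order.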
This "for loop" accomplishes the following: 1) iterates over every row of the newly written table in the gene_result object, indexing each row as [i, ]; 2) creates columns for the disease-free groups (carriers and non-carriers) along with the counts needed for the Fisher's exact test and log_p values; 3) repeats the process for recurrent status; and 4) builds the fisher matrix object, a 2x2 table assembled from the finalized columns of each row, to which the fisher_result object applies the fisher.test() function. Finally, a table is written out with the variables of interest.
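The loop described above can be sketched as follows. This is one possible reading, not the original code: it assumes the 2x2 table for each gene has disease free/recurrent as rows and carriers/non-carriers as columns, and the count columns, the `fisher_matrix` name, and the output path are hypothetical. The counts are invented.

```r
# Hypothetical per-gene counts (invented); rows are genes.
gene_result <- data.frame(gene = c("KRAS", "TP53"),
                          n_carrier_df = c(2, 4), n_noncarrier_df = c(12, 10),
                          n_carrier_rec = c(3, 9), n_noncarrier_rec = c(11, 3),
                          stringsAsFactors = FALSE)
gene_result$fisher_p <- NA_real_

for (i in seq_len(nrow(gene_result))) {
  counts_i <- gene_result[i, ]
  # 2x2 table for this gene: status in rows, carrier group in columns.
  fisher_matrix <- matrix(c(counts_i$n_carrier_df,  counts_i$n_noncarrier_df,
                            counts_i$n_carrier_rec, counts_i$n_noncarrier_rec),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(status = c("disease_free", "recurrent"),
                                          group  = c("carrier", "noncarrier")))
  fisher_result <- fisher.test(fisher_matrix)
  gene_result$fisher_p[i] <- fisher_result$p.value
}

# -log10 transform of the Fisher p-value, then write the final table.
gene_result$log_p <- -log10(gene_result$fisher_p)
out_path <- tempfile(fileext = ".txt")
write.table(gene_result, out_path, sep = "\t", quote = FALSE, row.names = FALSE)
```

Using `seq_len(nrow(gene_result))` rather than `1:nrow(gene_result)` keeps the loop from running on an empty table.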
There are several opportunities for optimization and automation here. Many of the variables can be generalized or set to load into the environment automatically. The code itself could be generalized and simplified into a single function for any further analysis that compares somatic mutations between disease-free and recurrent samples.
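One way such a generalization might look is sketched below. The function name `fisher_by_gene`, its arguments, and the column names are hypothetical, and the Benjamini-Hochberg adjustment via `p.adjust()` is an assumption about how a q column could be derived, not necessarily how the original q-values were produced.

```r
# Hypothetical helper: `cols` names the four count columns in row-major order
# of the 2x2 table (row 1 left, row 1 right, row 2 left, row 2 right).
fisher_by_gene <- function(counts, cols) {
  stopifnot(length(cols) == 4, all(cols %in% names(counts)))
  p <- vapply(seq_len(nrow(counts)), function(i) {
    m <- matrix(unlist(counts[i, cols]), nrow = 2, byrow = TRUE)
    fisher.test(m)$p.value
  }, numeric(1))
  counts$fisher_p <- p
  counts$fisher_q <- p.adjust(p, method = "BH")  # assumed multiple-testing correction
  counts$log_p    <- -log10(p)
  counts
}

# Usage with invented counts; any pair of groupings could supply the four columns.
gene_counts <- data.frame(gene = c("KRAS", "TP53"),
                          n_carrier_df = c(2, 4), n_noncarrier_df = c(12, 10),
                          n_carrier_rec = c(3, 9), n_noncarrier_rec = c(11, 3),
                          stringsAsFactors = FALSE)
res <- fisher_by_gene(gene_counts,
                      c("n_carrier_df", "n_noncarrier_df",
                        "n_carrier_rec", "n_noncarrier_rec"))
print(res)
```

Passing the column names as an argument is what lets the same function serve other clinical or demographic splits without editing the loop body.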