IsoformSwitchAnalyzeR ikra test illumina PEでやってみる

2019/9/19

ikraのtest/illumina_PEのoutputでやってみる。

データのインポート

code:R

#Import quantifications

salmonQuant <- importIsoformExpression(parentDir = "~/ikra_data/test/Illumina_PE/salmon_output/")

Step 1 of 3: Identifying which algorithm was used...

The quantification algorithm used was: Salmon

Found 6 quantification file(s) of interest

Step 2 of 3: Reading data...

reading in files with read_tsv

1 2 3 4 5 6

Step 3 of 3: Normalizing FPKM/TxPM values via edgeR...

Done

参考：Question: Error in reading data on isoformswitchanalyzeR

設定matrix

code:bash

myDesign <- data.frame(

sampleID = colnames(salmonQuant$abundance)-1,

condition = gsub('_.*', '', colnames(salmonQuant$abundance)-1)

)

↓

code:bash

colnames(salmonQuant$abundance)

1 "isoform_id" "salmon_output_SRR7501481" "salmon_output_SRR7501482" "salmon_output_SRR7501483"

5 "salmon_output_SRR7501484" "salmon_output_SRR7501485" "salmon_output_SRR7501486"

https://gyazo.com/38caa5eb30441df87c9618d02842f7ed

これじゃダメ。csvのnameからとってこよう。

code:bash

myDesign <- data.frame(

sampleID = colnames(salmonQuant$abundance)-1,

condition = c("WT_1", "WT_3", "WT_2", "PID_1", "PID_2", "PID_3")

)

ひとまずこれ。

↓

code:bash

myDesign <- data.frame(

sampleID = colnames(salmonQuant$abundance)-1,

condition = c("WT", "WT", "WT", "PID", "PID", "PID")

)

switchAnalyzeRlist

“Comprehensive gene annotation” GTF →ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M22/gencode.vM22.annotation.gtf.gz

“Transcript sequences” fasta→ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M22/gencode.vM22.transcripts.fa.gz

↑

これもダメ。

GTF→~/ikra_data/test/Illumina_PE/gencode.vM19.annotation.gtf.gz

fasta→元からあった。~/ikra_data/test/Illumina_PE/gencode.vM19.transcripts.fa.gz

code:bash

### Create switchAnalyzeRlist

aSwitchList <- importRdata(

isoformCountMatrix = salmonQuant$counts,

isoformRepExpression = salmonQuant$abundance,

designMatrix = myDesign,

isoformExonAnnoation = "~/ikra_data/test/Illumina_PE/gencode.vM19.annotation.gtf.gz",

isoformNtFasta = "~/ikra_data/test/Illumina_PE/gencode.vM19.transcripts.fa.gz",

showProgress = TRUE

)

Step 1 of 6: Checking data...

Step 2 of 6: Obtaining annotation...

importing GTF (this may take a while)

Step 3 of 6: Calculating gene expression and isoform fraction...

99817 ( 75.04%) isoforms were removed since they were not expressed in any samples.

Step 4 of 6: Merging gene and isoform expression...

|=================================================================================================================| 100%

Step 5 of 6: Making comparisons...

|=================================================================================================================| 100%

Step 6 of 6: Making switchAnalyzeRlist object...

Done

警告メッセージ:

1: importRdata(isoformCountMatrix = salmonQuant$counts, isoformRepExpression = salmonQuant$abundance, で:

The experimental design seems to be of very low complexity - very few samples per replicate. Please check the supplied design matrixt to make sure no mistakes were made.

2: importRdata(isoformCountMatrix = salmonQuant$counts, isoformRepExpression = salmonQuant$abundance, で:

The number of comparisons (n=15) is unusually high.

- If this intended please note that with a large number of comparisons IsoformSwitchAnalyzeR might use quite a lot of memmory (aka running on a small computer might be problematic).

- If this was not intended please check the supplied design matrixt to make sure no mistakes were made.

3: importRdata(isoformCountMatrix = salmonQuant$counts, isoformRepExpression = salmonQuant$abundance, で:

!!! NB !!! NB !!! NB !!!NB !!! NB !!!

IsoformSwitchAnalyzeR is not made to work with conditions without indepdendet biological replicates and results will not be trustworthy!

At best data without replicates should be analyzed as a pilot study before investing in more replicates.

Plase consult the "Analysing experiments without replicates" and "What constitute an independent biological replicate?" sections of the vignette.

!!! NB !!! NB !!! NB !!!NB !!! NB !!!

4: importRdata(isoformCountMatrix = salmonQuant$counts, isoformRepExpression = salmonQuant$abundance, で:

The annotation and quantification (count/abundance matrix and isoform annotation) Seem to be slightly different.

Specifically:

373 isoforms were only found in the annotation

Please make sure this is on purpouse since differences will cause inaccurate quantification and thereby skew all analysis.

If you have quantified with Salmon this could be normal since it as default only keep one copy of identical sequnces (can be prevented using the --keepDuplicates option)

We strongly encurage you to go back and figure out why this is the case.

salmonにかけているversion、M19とそろえないとダメ！

リファレンス：Release M19 (GRCm38.p6)

code:R

aSwitchList

This switchAnalyzeRlist list contains:

33198 isoforms from 13667 genes

15 comparison from 6 conditions (in total 6 samples)

Feature analyzed:

1 "ORFs, ntSequence"

中身の確認

code:R

head(aSwitchList$isoformFeatures,2)

iso_ref gene_ref isoform_id gene_id condition_1 condition_2 gene_name

1 isoComp_00000001 geneComp_00000001 ENSMUST00000000001.4 ENSMUSG00000000001.4 PID_1 PID_2 Gnai3

2 isoComp_00000002 geneComp_00000002 ENSMUST00000000028.13 ENSMUSG00000000028.15 PID_1 PID_2 Cdc45

gene_biotype iso_biotype gene_overall_mean gene_value_1 gene_value_2 gene_stderr_1 gene_stderr_2

1 protein_coding protein_coding 122.54929 130.46133 128.2404 NA NA

2 protein_coding protein_coding 17.49478 33.04188 0.0000 NA NA

gene_log2_fold_change gene_q_value iso_overall_mean iso_value_1 iso_value_2 iso_stderr_1 iso_stderr_2

1 -0.02476949 NA 122.54929 130.4613 128.2404 NA NA

2 -11.69051666 NA 2.74833 0.0000 0.0000 NA NA

iso_log2_fold_change iso_q_value IF_overall IF1 IF2 dIF isoform_switch_q_value gene_switch_q_value PTC

1 -0.02476949 NA 1.00 1 1 0 NA NA FALSE

2 0.00000000 NA 0.25 0 NaN NaN NA NA FALSE

head(aSwitchList$exons,2)

GRanges object with 2 ranges and 2 metadata columns:

seqnames ranges strand | isoform_id gene_id

1 chr1 4522905-4523603 + | ENSMUST00000195361.1 ENSMUSG00000102269.1

2 chr1 4524446-4526737 + | ENSMUST00000195361.1 ENSMUSG00000102269.1

-------

seqinfo: 22 sequences from an unspecified genome; no seqlengths

head(aSwitchList$ntSequence,2)

A DNAStringSet instance of length 2

width seq names

1 4153 GCACACTACGGTCCATCTCCAACAACCGCAGTGTTGCCAGTGACC...CCTTTTTAACAAACTAGACACGTAAATTAATCTCTGCCACTCCA ENSMUST00000162897.1

2 3127 CCCATTTAGTGAAGAAACTGAAATATGGCCCACTCACACTGCTGG...AGTCAGAGAAACTAATCAAAATAAAGTTTGTGTTGTGACTGTTA ENSMUST00000027035.9

exampleSwitchListは、aSwitchListと同じ、switchAnalyzeRlistで例示用。

isoformSwitchAnalysisPart1

code:R

exampleSwitchList <- isoformSwitchAnalysisPart1(

switchAnalyzeRlist = aSwitchList,

dIFcutoff = 0.3, # Cutoff for finding switches - set high for short runtime in example data

pathToOutput = "~/RNA-seq/IsoformSwitchAnalyzeR/ikra_test_illumina_pe",

outputSequences = TRUE, # prevents creation of fasta files used for external sequence analysis (you should change this)

prepareForWebServers = FALSE # change to TRUE if you will use webservers for external sequence analysis

)

Step 1 of 3 : Detecting isoform switches...

isoformSwitchTestDEXSeq(switchAnalyzeRlist, reduceToSwitchingGenes = TRUE, でエラー:

A statistical test cannot be performed without replicates. Please remove all conditions with only 1 replicate and try again.

This can either be done before creating the switchAnalyzeRlist in the first place or with the subsetSwitchAnalyzeRlist() function

独自のデータを分析する場合、outputSequences = TRUEを設定し、pathToOutputを使用してファイルを保存する場所を指定。

↓

sampleのmyDesignは、

sampleID condition

1 Fibroblasts_0 Fibroblasts

2 Fibroblasts_1 Fibroblasts

3 hESC_0 hESC

4 hESC_1 hESC

5 iPS_0 iPS

6 iPS_1 iPS

で、conditionの右に数字がつかない！

スイッチング機能の数は、次のように簡単に要約できる

code:R

extractSwitchSummary(

aSwitchList,

dIFcutoff = 0.3 # the same cutoff to the summary function as used above

)

＜パート2＞

code:R

exampleSwitchList <- isoformSwitchAnalysisPart2(

switchAnalyzeRlist = aSwitchList,

dIFcutoff = 0.3, # Cutoff for finding switches - set high for short runtime in example data

n = 10, # if plotting was enabled, it would only output the top 10 switches

removeNoncodinORFs = TRUE, # Because ORF was predicted de novo

pathToCPC2resultFile = system.file("extdata/cpc2_result.txt" , package = "IsoformSwitchAnalyzeR"),

pathToPFAMresultFile = system.file("extdata/pfam_results.txt" , package = "IsoformSwitchAnalyzeR"),

pathToNetSurfP2resultFile = system.file("extdata/netsurfp2_results.csv.gz", package = "IsoformSwitchAnalyzeR"),

pathToSignalPresultFile = system.file("extdata/signalP_results.txt" , package = "IsoformSwitchAnalyzeR"),

codingCutoff = 0.725, # the coding potential cutoff we suggested for human

outputPlots = FALSE # keeps the function from outputting the plots from this example

)

exampleSwitchListAnalyzedも、可視化用に用意されたlistで中身の形式は同じ！

code:R

exampleSwitchListAnalyzed

This switchAnalyzeRlist list contains:

1469 isoforms from 530 genes

2 comparison from 4 conditions (in total 891 samples)

Switching features:

Comparison switchingIsoforms switchingGenes

1 COAD_ctrl vs COAD_cancer 692 369

2 LUAD_ctrl vs LUAD_cancer 422 218

3 combined 1010 530

Feature analyzed:

1 "Isoform Switch Identification, ORFs, Protein Domains, Signal Peptides, Alternative splicing, IDR, Switch Consequences, Coding Potential"