BS-seq解析
2019/9/17
三島合宿にて。ホテルにて開始。
DNAメチル化解析手法の1つであるReduced Representation Bisulfite Sequencing(RRBS)法のデータ解析手順についてやってみる。
RRBS法は制限酵素MspIを用いてプロモーター領域とCpGアイランドを濃縮し、対象領域を限定してバイサルファイトシークエンシングを行うことで、効率よくDNAメチル化を捉えるための手法。
rawdata・・・公共データベースからダウンロードしたデータを格納する。
tools・・・Homebrewでインストールできないツールを格納する。
ref・・・リファレンスゲノム配列データと遺伝子アノテーションデータを格納する。
インストール
cutadaptはトリミングツールであるTrim Galoreに必要。
bowtie2とSAMtoolsはマッピングで使用するBismarkの動作に必要。
cutadapt
conda install -c bioconda cutadapt -y
trim_galore
conda install -c bioconda trim-galore -y
bismark
conda install -c bioconda bismark -y
解析データダウンロード
サンプル
SRR1812639 血液サンプル 細胞種: CD19+ B-cells 慢性リンパ性白血病患者
SRR1812671 血液サンプル 細胞種: CD19+ B-cells 健常者
2サンプルだけ。
fasq-dump→fasterq-dump
code:bash
$ cd ~/rrbs/rawdata
$ time fasterq-dump --gzip SRR1812671
$ time fasterq-dump --gzip SRR1812639
<エラー>
code:bash
2019-09-17T13:21:26 fasterq-dump.2.9.3 err: param unknown while parsing argument list within application support module - Unknown argument '--gzip'
2019-09-17T13:21:26 fasterq-dump.2.9.3 err: ArgsMakeAndHandle() -> RC(rcApp,rcArgv,rcParsing,rcParam,rcUnknown)
fasterq-dumpには、--gzipはない。
↓
code:bash
$ cd ~/rrbs/rawdata
$ time fasterq-dump SRR1812671
$ time fasterq-dump SRR1812639
$ time fasterq-dump SRR1812671
spots read : 18,733,618
reads read : 18,733,618
reads written : 18,733,618
real 3m41.621s
user 0m59.881s
sys 0m11.614s
$ time fasterq-dump SRR1812639
spots read : 19,202,715
reads read : 19,202,715
reads written : 19,202,715
real 4m23.479s
user 1m1.652s
sys 0m12.463s
・圧縮と展開
code:bash
$ gzip SRR1812639.fastq
$ gzip SRR1812671.fastq
$ gunzip -c SRR1812639.fastq.gz > SRR1812639.fastq
$ gunzip -c SRR1812671.fastq.gz > SRR1812671.fastq
gunzipのオプションで、-kはない。代わりに標準出力に出す、-cを使う。
-c, --stdout write on standard output, keep original files unchanged
gunzipはkオプションをつけると展開前のファイルが残ったまま展開される。
リファレンス配列のデータをダウンロード
refディレクトリ内にシェルスクリプトを作る
作成
010_download-ucsc.sh
code:bash
set -euo pipefail
u1="ftp://hgdownload.soe.ucsc.edu"
u2="goldenPath/hg38/bigZips/analysisSet"
u3="hg38.analysisSet.chroms.tar.gz"
curl -O ${u1}/${u2}/${u3}
tar zxvf ${u3}
実行
code:bash
$ chmod a+x 010_download-ucsc.sh
$ time ./010_download-ucsc.sh
作成
020_concatinate.sh
code:bash
set -euo pipefail
d="hg38.analysisSet.chroms"
out="hg38.fasta"
cat \
$d/chr1.fa $d/chr2.fa $d/chr3.fa $d/chr4.fa $d/chr5.fa \
$d/chr6.fa $d/chr7.fa $d/chr8.fa $d/chr9.fa $d/chr10.fa \
$d/chr11.fa $d/chr12.fa $d/chr13.fa $d/chr14.fa $d/chr15.fa \
$d/chr16.fa $d/chr17.fa $d/chr18.fa $d/chr19.fa $d/chr20.fa \
$d/chr21.fa $d/chr22.fa $d/chrX.fa $d/chrY.fa $d/chrM.fa \
$d/chrEBV.fa > $out
実行
code:bash
$ chmod a+x 020_concatinate.sh
$ time ./020_concatinate.sh
real 0m6.062s
user 0m0.003s
sys 0m4.101s
遺伝子アノテーションファイルをダウンロード
メニューのToolsからTable Browserを選択。
設定
code:bash
assembly: Dec.2013(GRCh38/hg38)
group: Genes and Gene Predictions
track: NCBI RefSeq
table: UCSC RefSeq (refGene)
region: genome
output format: custom track
output file: refGene.bed
file type returned: plain text
refGene.bedをゲット。
【解析】
FastQCで確認
mkdir fastqc_raw
FastQCはそのままだと長いリードを解析する際に3'末端の塩基をまとめて解析してしまう。これを防ぐために--nogroupオプションを設定する。
code: bash
$ time fastqc --nogroup -o fastqc_raw/ -t 12 rawdata/*.fastq
Started analysis of SRR1812639.fastq
Started analysis of SRR1812671.fastq
Approx 5% complete for SRR1812639.fastq
Approx 5% complete for SRR1812671.fastq
Approx 10% complete for SRR1812671.fastq
Approx 10% complete for SRR1812639.fastq
Approx 15% complete for SRR1812671.fastq
Approx 15% complete for SRR1812639.fastq
Approx 20% complete for SRR1812671.fastq
Approx 20% complete for SRR1812639.fastq
Approx 25% complete for SRR1812671.fastq
Approx 25% complete for SRR1812639.fastq
Approx 30% complete for SRR1812671.fastq
Approx 30% complete for SRR1812639.fastq
Approx 35% complete for SRR1812671.fastq
Approx 35% complete for SRR1812639.fastq
Approx 40% complete for SRR1812671.fastq
Approx 40% complete for SRR1812639.fastq
Approx 45% complete for SRR1812671.fastq
Approx 45% complete for SRR1812639.fastq
Approx 50% complete for SRR1812671.fastq
Approx 50% complete for SRR1812639.fastq
Approx 55% complete for SRR1812671.fastq
Approx 55% complete for SRR1812639.fastq
Approx 60% complete for SRR1812671.fastq
Approx 60% complete for SRR1812639.fastq
Approx 65% complete for SRR1812671.fastq
Approx 65% complete for SRR1812639.fastq
Approx 70% complete for SRR1812671.fastq
Approx 70% complete for SRR1812639.fastq
Approx 75% complete for SRR1812671.fastq
Approx 75% complete for SRR1812639.fastq
Approx 80% complete for SRR1812671.fastq
Approx 80% complete for SRR1812639.fastq
Approx 85% complete for SRR1812671.fastq
Approx 85% complete for SRR1812639.fastq
Approx 90% complete for SRR1812671.fastq
Approx 90% complete for SRR1812639.fastq
Approx 95% complete for SRR1812671.fastq
Analysis complete for SRR1812671.fastq
Approx 95% complete for SRR1812639.fastq
Analysis complete for SRR1812639.fastq
real 1m32.540s
user 2m52.128s
sys 0m7.256s
・htmlファイルを確認
Per Base sequence qualityをみると後半のリードにクオリティスコアが低いものが含まれていることがわかる。また、一番下にあるAdapter Contentを確認するとアダプター配列が含まれているようだ。次のステップで取り除こう。
トリミング
fastqファイルからアダプター配列など不要な配列をトリミングしていく。
trim galoreはRRBS解析のためのオプションが選択できる。
-qでクオリティクオリティスコアのカットオフを設定
-rrbsオプションはトリミング対象のデータがRRBSのサンプルであることを設定
-oオプションでトリミング後のファイルが出力されるディレクトリを指定
code:bash
$ time trim_galore -q 20 -rrbs rawdata/SRR1812639.fastq -o trim/
$ time trim_galore -q 20 -rrbs rawdata/SRR1812671.fastq -o trim/
<エラー>
code:bash
$ time trim_galore -q 20 -rrbs rawdata/SRR1812639.fastq -o trim/
Multicore support not enabled. Proceeding with single-core trimming.
Path to Cutadapt set as: 'cutadapt' (default)
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt version: 2.5
single-core operation.
No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)
Output will be written into the directory: /home/kyamada/bioinformatics/DRY2/rrbs/trim/
AUTO-DETECTING ADAPTER TYPE
===========================
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> rawdata/SRR1812639.fastq <<)
Found perfect matches for the following adapter sequences:
Adapter type Count Sequence Sequences analysed Percentage
Illumina 380210 AGATCGGAAGAGC 1000000 38.02
Nextera 0 CTGTCTCTTATA 1000000 0.00
smallRNA 0 TGGAATTCTCGG 1000000 0.00
Using Illumina adapter for trimming (count: 380210). Second best hit was Nextera (count: 0)
Writing report to '/home/kyamada/bioinformatics/DRY2/rrbs/trim/SRR1812639.fastq_trimming_report.txt'
SUMMARISING RUN PARAMETERS
==========================
Input filename: rawdata/SRR1812639.fastq
Trimming mode: single-end
Trim Galore version: 0.6.4
Cutadapt version: 2.5
Number of cores used for trimming: 1
Quality Phred score cutoff: 20
Quality encoding type selected: ASCII+33
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length before a sequence gets removed: 20 bp
File was specified to be an MspI-digested RRBS sample. Read 1 sequences with adapter contamination will be trimmed a further 2 bp from their 3' end, and Read 2 sequences will be trimmed by 2 bp from their 5' end to remove potential methylation-biased bases from the end-repair reaction
Cutadapt seems to be fairly up-to-date (version 2.5). Setting -j 1
>> Now performing adaptive quality trimming with a Phred-score cutoff of: 20 <<<
This is cutadapt 2.5 with Python 3.7.4
Command line parameters: -j 1 -e 0.1 -q 20 -a X rawdata/SRR1812639.fastq
Processing reads on 1 core in single-end mode ...
8<- 00:01:05 9,980,000 reads @ 6.5 µs/read; 9.27 M reads/minute10000000 sequences processed -----=8 00:02:05 19,202,715 reads @ 6.5 µs/read; 9.18 M reads/minute Finished in 125.54 s (7 us/read; 9.18 M reads/minute).
=== Summary ===
Total reads processed: 19,202,715
Reads with adapters: 0 (0.0%)
Reads written (passing filters): 19,202,715 (100.0%)
Total basepairs processed: 1,920,271,500 bp
Quality-trimmed: 91,265,324 bp (4.8%)
Total written (filtered): 1,829,006,176 bp (95.2%)
=== Adapter 1 ===
Sequence: X; Type: regular 3'; Length: 1; Trimmed: 0 times.
>> Quality trimming completed <<<
19202715 sequences processed in total
Writing final adapter and quality trimmed output to SRR1812639_trimmed.fq
>> Now performing adapter trimming for the adapter sequence: 'AGATCGGAAGAGC' from file SRR1812639.fastq_qual_trimmed.fastq <<<
Do we ever get here?
Do we ever get here?
Do we ever get here?
Do we ever get here?
******************************************************************
-rrbsはオプションの書き方がおかしい?
→--rrbs
変更
code:bash
$ time trim_galore -q 20 --rrbs rawdata/SRR1812639.fastq -o trim/
$ time trim_galore -q 20 --rrbs rawdata/SRR1812671.fastq -o trim/
trim_galore version0.5.0
code:bash
$ time trim_galore -q 20 --rrbs rawdata/SRR1812639.fastq -o trim/
No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)
Path to Cutadapt set as: 'cutadapt' (default)
2.5
Cutadapt seems to be working fine (tested command 'cutadapt --version')
AUTO-DETECTING ADAPTER TYPE
===========================
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> rawdata/SRR1812639.fastq <<)
Found perfect matches for the following adapter sequences:
Adapter type Count Sequence Sequences analysed Percentage
Illumina 380210 AGATCGGAAGAGC 1000000 38.02
Nextera 0 CTGTCTCTTATA 1000000 0.00
smallRNA 0 TGGAATTCTCGG 1000000 0.00
Using Illumina adapter for trimming (count: 380210). Second best hit was Nextera (count: 0)
Writing report to 'trim/SRR1812639.fastq_trimming_report.txt'
SUMMARISING RUN PARAMETERS
==========================
Input filename: rawdata/SRR1812639.fastq
Trimming mode: single-end
Trim Galore version: 0.5.0
Cutadapt version: 2.5
Quality Phred score cutoff: 20
Quality encoding type selected: ASCII+33
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length before a sequence gets removed: 20 bp
File was specified to be an MspI-digested RRBS sample. Read 1 sequences with adapter contamination will be trimmed a further 2 bp from their 3' end, and Read 2 sequences will be trimmed by 2 bp from their 5' end to remove potential methylation-biased bases from the end-repair reaction
>> Now performing adaptive quality trimming with a Phred-score cutoff of: 20 <<<
This is cutadapt 2.5 with Python 3.7.4
Command line parameters: -f fastq -e 0.1 -q 20 -a X rawdata/SRR1812639.fastq
WARNING: Option --format is deprecated and ignored because the input file format is always auto-detected
Processing reads on 1 core in single-end mode ...
8<-- 00:01:02 9,980,000 reads @ 6.2 µs/read; 9.61 M reads/minute10000000 sequences processed ------>8 00:02:01 19,202,715 reads @ 6.3 µs/read; 9.46 M reads/minute Finished in 121.81 s (6 us/read; 9.46 M reads/minute).
=== Summary ===
Total reads processed: 19,202,715
Reads with adapters: 0 (0.0%)
Reads written (passing filters): 19,202,715 (100.0%)
Total basepairs processed: 1,920,271,500 bp
Quality-trimmed: 91,265,324 bp (4.8%)
Total written (filtered): 1,829,006,176 bp (95.2%)
=== Adapter 1 ===
Sequence: X; Type: regular 3'; Length: 1; Trimmed: 0 times.
>> Quality trimming completed <<<
19202715 sequences processed in total
Writing final adapter and quality trimmed output to SRR1812639_trimmed.fq
>> Now performing adapter trimming for the adapter sequence: 'AGATCGGAAGAGC' from file SRR1812639.fastq_qual_trimmed.fastq <<<
10000000 sequences processed
This is cutadapt 2.5 with Python 3.7.4
Command line parameters: -f fastq -e 0.1 -O 1 -a AGATCGGAAGAGC trim/SRR1812639.fastq_qual_trimmed.fastq
WARNING: Option --format is deprecated and ignored because the input file format is always auto-detected
Processing reads on 1 core in single-end mode ...
Finished in 147.66 s (8 us/read; 7.80 M reads/minute).
=== Summary ===
Total reads processed: 19,202,715
Reads with adapters: 11,224,908 (58.5%)
Reads written (passing filters): 19,202,715 (100.0%)
Total basepairs processed: 1,829,006,176 bp
Total written (filtered): 1,553,517,600 bp (84.9%)
=== Adapter 1 ===
Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 11224908 times.
No. of allowed errors:
0-9 bp: 0; 10-13 bp: 1
Bases preceding removed adapters:
A: 9.1%
C: 0.3%
G: 79.2%
T: 11.4%
none/other: 0.0%
Overview of removed sequences
length count expect max.err error counts
1 2321302 4800678.8 0 2321302
2 832614 1200169.7 0 832614
3 224014 300042.4 0 224014
4 137238 75010.6 0 137238
5 66585 18752.7 0 66585
6 105435 4688.2 0 105435
7 70597 1172.0 0 70597
8 102090 293.0 0 102090
9 113573 73.3 0 113325 248
10 81538 18.3 1 71222 10316
11 88151 4.6 1 77682 10469
12 104484 1.1 1 92874 11610
13 121051 0.3 1 106281 14770
14 94670 0.3 1 84277 10393
15 84245 0.3 1 74155 10090
16 91602 0.3 1 80815 10787
17 92466 0.3 1 82000 10466
18 113944 0.3 1 99786 14158
19 74002 0.3 1 70862 3140
20 86755 0.3 1 82192 4563
21 118690 0.3 1 113822 4868
22 96912 0.3 1 92540 4372
23 124146 0.3 1 118523 5623
24 113976 0.3 1 105412 8564
25 131454 0.3 1 126044 5410
26 100912 0.3 1 96356 4556
27 117761 0.3 1 112510 5251
28 137348 0.3 1 131400 5948
29 185930 0.3 1 177788 8142
30 1205931 0.3 1 1165725 40206
31 182313 0.3 1 176572 5741
32 143949 0.3 1 139283 4666
33 95314 0.3 1 92447 2867
34 125312 0.3 1 119314 5998
35 96950 0.3 1 92912 4038
36 111410 0.3 1 106439 4971
37 108387 0.3 1 103405 4982
38 103573 0.3 1 98397 5176
39 129323 0.3 1 123321 6002
40 89428 0.3 1 85818 3610
41 109504 0.3 1 104525 4979
42 103995 0.3 1 99004 4991
43 115193 0.3 1 109557 5636
44 102331 0.3 1 97064 5267
45 130497 0.3 1 122431 8066
46 147851 0.3 1 139244 8607
47 107782 0.3 1 102335 5447
48 132023 0.3 1 126083 5940
49 161562 0.3 1 153699 7863
50 90716 0.3 1 86221 4495
51 86436 0.3 1 80972 5464
52 110758 0.3 1 102892 7866
53 121669 0.3 1 113934 7735
54 62358 0.3 1 59043 3315
55 48516 0.3 1 45798 2718
56 78291 0.3 1 72865 5426
57 82613 0.3 1 73849 8764
58 89941 0.3 1 83816 6125
59 83799 0.3 1 74228 9571
60 52944 0.3 1 48196 4748
61 67208 0.3 1 63145 4063
62 57035 0.3 1 50944 6091
63 72484 0.3 1 64150 8334
64 43310 0.3 1 38889 4421
65 38774 0.3 1 33730 5044
66 50804 0.3 1 44576 6228
67 53664 0.3 1 46322 7342
68 54884 0.3 1 44802 10082
69 70520 0.3 1 52155 18365
70 89797 0.3 1 78583 11214
71 40573 0.3 1 35818 4755
72 19397 0.3 1 17252 2145
73 9965 0.3 1 9136 829
74 6638 0.3 1 6214 424
75 2474 0.3 1 2345 129
76 976 0.3 1 876 100
77 599 0.3 1 459 140
78 330 0.3 1 246 84
79 250 0.3 1 190 60
80 137 0.3 1 114 23
81 139 0.3 1 106 33
82 54 0.3 1 25 29
83 35 0.3 1 8 27
84 17 0.3 1 4 13
85 26 0.3 1 0 26
86 11 0.3 1 1 10
87 20 0.3 1 1 19
88 23 0.3 1 0 23
89 16 0.3 1 0 16
90 13 0.3 1 0 13
91 17 0.3 1 0 17
92 27 0.3 1 1 26
93 41 0.3 1 0 41
94 37 0.3 1 1 36
95 33 0.3 1 1 32
96 28 0.3 1 0 28
97 31 0.3 1 0 31
98 37 0.3 1 0 37
99 50 0.3 1 0 50
100 280 0.3 1 0 280
Successfully deleted temporary file SRR1812639.fastq_qual_trimmed.fastq
RUN STATISTICS FOR INPUT FILE: rawdata/SRR1812639.fastq
=============================================
19202715 sequences processed in total
Sequences were truncated to a varying degree because of deteriorating qualities (Phred score quality cutoff: 20): 5022874 (26.2%)
Sequences removed because they became shorter than the length cutoff of 20 bp: 112284 (0.6%)
RRBS reads trimmed by additional 2 bp when adapter contamination was detected: 11191704 (58.3%)
real 4m31.675s
user 5m12.597s
sys 0m23.569s
version 0.6.4(最新)でも同じエラー。。。
trim_galore version0.5.0
code:bash
$ time trim_galore -q 20 --rrbs rawdata/SRR1812671.fastq -o trim/
No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)
Path to Cutadapt set as: 'cutadapt' (default)
2.5
Cutadapt seems to be working fine (tested command 'cutadapt --version')
AUTO-DETECTING ADAPTER TYPE
===========================
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> rawdata/SRR1812671.fastq <<)
Found perfect matches for the following adapter sequences:
Adapter type Count Sequence Sequences analysed Percentage
Illumina 318494 AGATCGGAAGAGC 1000000 31.85
Nextera 0 CTGTCTCTTATA 1000000 0.00
smallRNA 0 TGGAATTCTCGG 1000000 0.00
Using Illumina adapter for trimming (count: 318494). Second best hit was Nextera (count: 0)
Writing report to 'trim/SRR1812671.fastq_trimming_report.txt'
SUMMARISING RUN PARAMETERS
==========================
Input filename: rawdata/SRR1812671.fastq
Trimming mode: single-end
Trim Galore version: 0.5.0
Cutadapt version: 2.5
Quality Phred score cutoff: 20
Quality encoding type selected: ASCII+33
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length before a sequence gets removed: 20 bp
File was specified to be an MspI-digested RRBS sample. Read 1 sequences with adapter contamination will be trimmed a further 2 bp from their 3' end, and Read 2 sequences will be trimmed by 2 bp from their 5' end to remove potential methylation-biased bases from the end-repair reaction
>> Now performing adaptive quality trimming with a Phred-score cutoff of: 20 <<<
This is cutadapt 2.5 with Python 3.7.4
Command line parameters: -f fastq -e 0.1 -q 20 -a X rawdata/SRR1812671.fastq
WARNING: Option --format is deprecated and ignored because the input file format is always auto-detected
Processing reads on 1 core in single-end mode ...
8=- 00:01:05 9,950,000 reads @ 6.3 µs/read; 9.49 M reads/minute10000000 sequences processed ------>8 00:02:00 18,733,618 reads @ 6.5 µs/read; 9.29 M reads/minute Finished in 120.98 s (6 us/read; 9.29 M reads/minute).
=== Summary ===
Total reads processed: 18,733,618
Reads with adapters: 0 (0.0%)
Reads written (passing filters): 18,733,618 (100.0%)
Total basepairs processed: 1,873,361,800 bp
Quality-trimmed: 96,459,207 bp (5.1%)
Total written (filtered): 1,776,902,593 bp (94.9%)
=== Adapter 1 ===
Sequence: X; Type: regular 3'; Length: 1; Trimmed: 0 times.
>> Quality trimming completed <<<
18733618 sequences processed in total
Writing final adapter and quality trimmed output to SRR1812671_trimmed.fq
>> Now performing adapter trimming for the adapter sequence: 'AGATCGGAAGAGC' from file SRR1812671.fastq_qual_trimmed.fastq <<<
10000000 sequences processed
This is cutadapt 2.5 with Python 3.7.4
Command line parameters: -f fastq -e 0.1 -O 1 -a AGATCGGAAGAGC trim/SRR1812671.fastq_qual_trimmed.fastq
WARNING: Option --format is deprecated and ignored because the input file format is always auto-detected
Processing reads on 1 core in single-end mode ...
Finished in 142.83 s (8 us/read; 7.87 M reads/minute).
=== Summary ===
Total reads processed: 18,733,618
Reads with adapters: 10,668,158 (56.9%)
Reads written (passing filters): 18,733,618 (100.0%)
Total basepairs processed: 1,776,902,593 bp
Total written (filtered): 1,556,882,310 bp (87.6%)
=== Adapter 1 ===
Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 10668158 times.
No. of allowed errors:
0-9 bp: 0; 10-13 bp: 1
Bases preceding removed adapters:
A: 9.8%
C: 0.4%
G: 76.4%
T: 12.9%
none/other: 0.5%
Overview of removed sequences
length count expect max.err error counts
1 2409525 4683404.5 0 2409525
2 957698 1170851.1 0 957698
3 264382 292712.8 0 264382
4 162708 73178.2 0 162708
5 81663 18294.5 0 81663
6 132941 4573.6 0 132941
7 78954 1143.4 0 78954
8 123051 285.9 0 123051
9 105056 71.5 0 104814 242
10 86282 17.9 1 84279 2003
11 110767 4.5 1 108324 2443
12 121759 1.1 1 118950 2809
13 132000 0.3 1 129118 2882
14 102061 0.3 1 99477 2584
15 86427 0.3 1 84172 2255
16 103805 0.3 1 100617 3188
17 104451 0.3 1 100769 3682
18 123672 0.3 1 119945 3727
19 82988 0.3 1 80404 2584
20 93271 0.3 1 90543 2728
21 128651 0.3 1 124904 3747
22 106234 0.3 1 102868 3366
23 119721 0.3 1 116071 3650
24 101770 0.3 1 98350 3420
25 116535 0.3 1 113036 3499
26 119282 0.3 1 115131 4151
27 137323 0.3 1 132144 5179
28 122017 0.3 1 117947 4070
29 208514 0.3 1 201455 7059
30 1229993 0.3 1 1196499 33494
31 191131 0.3 1 184964 6167
32 131369 0.3 1 127167 4202
33 108004 0.3 1 104139 3865
34 112313 0.3 1 107913 4400
35 108191 0.3 1 103743 4448
36 111852 0.3 1 107595 4257
37 97450 0.3 1 93888 3562
38 84053 0.3 1 80894 3159
39 108374 0.3 1 104444 3930
40 76714 0.3 1 74253 2461
41 92738 0.3 1 90013 2725
42 90268 0.3 1 87058 3210
43 89674 0.3 1 87336 2338
44 84105 0.3 1 81322 2783
45 95447 0.3 1 92006 3441
46 118647 0.3 1 115231 3416
47 78118 0.3 1 75903 2215
48 103302 0.3 1 100452 2850
49 103654 0.3 1 100494 3160
50 61709 0.3 1 59864 1845
51 54282 0.3 1 52600 1682
52 53946 0.3 1 52049 1897
53 62380 0.3 1 60698 1682
54 37254 0.3 1 36334 920
55 26753 0.3 1 25881 872
56 38148 0.3 1 36656 1492
57 37649 0.3 1 36251 1398
58 45037 0.3 1 42880 2157
59 35933 0.3 1 34443 1490
60 30108 0.3 1 28814 1294
61 36310 0.3 1 34335 1975
62 30575 0.3 1 24562 6013
63 37130 0.3 1 31115 6015
64 28226 0.3 1 24804 3422
65 21525 0.3 1 18038 3487
66 28783 0.3 1 22298 6485
67 33672 0.3 1 18905 14767
68 36310 0.3 1 18286 18024
69 45470 0.3 1 18123 27347
70 30177 0.3 1 21890 8287
71 9462 0.3 1 6938 2524
72 3734 0.3 1 2927 807
73 1318 0.3 1 1034 284
74 730 0.3 1 592 138
75 304 0.3 1 197 107
76 170 0.3 1 88 82
77 139 0.3 1 52 87
78 122 0.3 1 31 91
79 94 0.3 1 27 67
80 68 0.3 1 17 51
81 83 0.3 1 23 60
82 58 0.3 1 7 51
83 38 0.3 1 4 34
84 50 0.3 1 2 48
85 41 0.3 1 0 41
86 38 0.3 1 1 37
87 52 0.3 1 1 51
88 53 0.3 1 0 53
89 35 0.3 1 0 35
90 52 0.3 1 0 52
91 43 0.3 1 0 43
92 48 0.3 1 0 48
93 95 0.3 1 0 95
94 72 0.3 1 0 72
95 78 0.3 1 0 78
96 60 0.3 1 0 60
97 68 0.3 1 1 67
98 70 0.3 1 0 70
99 105 0.3 1 0 105
100 596 0.3 1 0 596
Successfully deleted temporary file SRR1812671.fastq_qual_trimmed.fastq
RUN STATISTICS FOR INPUT FILE: rawdata/SRR1812671.fastq
=============================================
18733618 sequences processed in total
Sequences were truncated to a varying degree because of deteriorating qualities (Phred score quality cutoff: 20): 4708759 (25.1%)
Sequences removed because they became shorter than the length cutoff of 20 bp: 133701 (0.7%)
RRBS reads trimmed by additional 2 bp when adapter contamination was detected: 10566437 (56.4%)
real 4m27.553s
user 5m6.533s
sys 0m23.652s
trim_galore後のfastqc
Per Base sequence qualityで先ほど後半のリードに含まれていたクオリティが低いものがトリミングされたことがわかる。また、Adapter Contentを確認するとアダプター配列がなくなったことが分かる。
code:bash
$ mkdir fastqc_trim
$ time fastqc --nogroup -o fastqc_trim/ -t 4 trim/*.fq
Started analysis of SRR1812639_trimmed.fq
Started analysis of SRR1812671_trimmed.fq
Approx 5% complete for SRR1812639_trimmed.fq
Approx 5% complete for SRR1812671_trimmed.fq
Approx 10% complete for SRR1812639_trimmed.fq
Approx 10% complete for SRR1812671_trimmed.fq
Approx 15% complete for SRR1812639_trimmed.fq
Approx 15% complete for SRR1812671_trimmed.fq
Approx 20% complete for SRR1812639_trimmed.fq
Approx 20% complete for SRR1812671_trimmed.fq
Approx 25% complete for SRR1812671_trimmed.fq
Approx 25% complete for SRR1812639_trimmed.fq
Approx 30% complete for SRR1812671_trimmed.fq
Approx 30% complete for SRR1812639_trimmed.fq
Approx 35% complete for SRR1812671_trimmed.fq
Approx 35% complete for SRR1812639_trimmed.fq
Approx 40% complete for SRR1812671_trimmed.fq
Approx 40% complete for SRR1812639_trimmed.fq
Approx 45% complete for SRR1812671_trimmed.fq
Approx 45% complete for SRR1812639_trimmed.fq
Approx 50% complete for SRR1812671_trimmed.fq
Approx 50% complete for SRR1812639_trimmed.fq
Approx 55% complete for SRR1812671_trimmed.fq
Approx 55% complete for SRR1812639_trimmed.fq
Approx 60% complete for SRR1812671_trimmed.fq
Approx 60% complete for SRR1812639_trimmed.fq
Approx 65% complete for SRR1812671_trimmed.fq
Approx 65% complete for SRR1812639_trimmed.fq
Approx 70% complete for SRR1812671_trimmed.fq
Approx 70% complete for SRR1812639_trimmed.fq
Approx 75% complete for SRR1812671_trimmed.fq
Approx 75% complete for SRR1812639_trimmed.fq
Approx 80% complete for SRR1812671_trimmed.fq
Approx 80% complete for SRR1812639_trimmed.fq
Approx 85% complete for SRR1812671_trimmed.fq
Approx 85% complete for SRR1812639_trimmed.fq
Approx 90% complete for SRR1812671_trimmed.fq
Approx 90% complete for SRR1812639_trimmed.fq
Approx 95% complete for SRR1812671_trimmed.fq
Approx 95% complete for SRR1812639_trimmed.fq
Analysis complete for SRR1812671_trimmed.fq
Analysis complete for SRR1812639_trimmed.fq
real 1m20.199s
user 2m32.637s
sys 0m5.585s
マッピング
DNAメチル化データの解析ではダウンロードしたリファレンス配列をそのままマッピングに使用することができない。
RRBBなどのバイサルファイト処理を行う実験では、メチル化されていないシトシン(C)がチミン(T)に変換されている。このため、リファレンス配列もバイサルファイト処理を考慮したものに変換する必要がある。
マッピングツールのBismarkにはこの変換を行ってくれるコマンドが含まれているのでこれを利用。
→bismark_genome_preparation
リファレンス配列のhg38.fastaに対して行う。
時間がかかる。
・マッピングの準備
code:bash
$ time bismark_genome_preparation ref/
・マッピング
使用するコア数を--multicoreオプションで指定
code:bash
$ mkdir map
$ time bismark ref/ trim/SRR1812639_trimmed.fq --multicore 30 -o map/
$ time bismark ref/ trim/SRR1812671_trimmed.fq --multicore 30 -o map/
→bamファイルとレポートファイルが出力される
DNAメチル化部位の抽出
マッピング後に出力されたbamファイルを用いる。
methylKitで使用するためにはbamファイルを染色体番号順にソートしておく。
↓
bamファイルのソートにはSamtools。
-@オプションは処理に使用するコア数を指定
code:bash
$ cd map
$ samtools sort -@ 4 SRR1812639_trimmed_bismark_bt2.bam SRR1812639_sort
$ samtools sort -@ 4 SRR1812671_trimmed_bismark_bt2.bam SRR1812671_sort
<エラー>
code:bash
bam_sort Use -T PREFIX / -o FILE to specify temporary and final output files Options:
-l INT Set compression level, from 0 (uncompressed) to 9 (best)
-m INT Set maximum memory per thread; suffix K/M/G recognized 768M -n Sort by read name
-t TAG Sort by value of TAG. Uses position as secondary index (or read name if -n is set)
-o FILE Write final output to FILE rather than standard output
-T PREFIX Write temporary files to PREFIX.nnnn.bam
--input-fmt-option OPT=VAL Specify a single input file format option in the form
of OPTION or OPTION=VALUE
-O, --output-fmt FORMAT[,OPT=VAL]... Specify output format (SAM, BAM, CRAM)
--output-fmt-option OPT=VAL Specify a single output file format option in the form
of OPTION or OPTION=VALUE
--reference FILE
Reference sequence FASTA FILE null -@, --threads INT
Number of additional threads to use 0 ↓
code:bash
$ samtools sort sample.bam > sample.sort.bam
$ time samtools sort -@ 4 SRR1812639_trimmed_bismark_bt2.bam > SRR1812639_sort.bam
$ time samtools sort -@ 4 SRR1812671_trimmed_bismark_bt2.bam > SRR1812671_sort.bam
$ time samtools sort -@ 4 SRR1812639_trimmed_bismark_bt2.bam > SRR1812639_sort.bam
real 0m27.354s
user 1m45.199s
sys 0m5.737s
ここからはRを使う。
methylKitとgenomation
methylKitはメチル化率の算出や遺伝子アノテーションとの関連づけを行うことができるパッケージ。
genomationはmethylKitの機能を補完するのに有用なパッケージ。
code:R
BiocManager::install("methylKit")
BiocManager::install("genomation")
library("methylKit")
library("genomation")
file1 <- "map/SRR1812639_sort.bam"
file2 <- "map/SRR1812671_sort.bam"
file.list <- list(file1,file2)
file1 <- "map/SRR1812639_trimmed_bismark_bt2.bam"
file2 <- "map/SRR1812671_trimmed_bismark_bt2.bam"
file.list <- list(file1,file2)
methylKitのprocessBismarkAln関数を用いてメチル化率の算出を実行。
treatmentでは同じグループのサンプルを指定できる。
今回はCLLを1、健常者を0
code:R
obj <- processBismarkAln(location=file.list,sample.id=list("CLL","normal"),
assembly="hg38",read.context="CpG",treatment=c(1,0))
Trying to process:
map/SRR1812639_sort.bam
using htslib.
Conversion Statistics:
total otherC considered (>95% C+T): 10626389
average conversion rate = 98.853014253034
total otherC considered (Forward) (>95% C+T): 5311718
average conversion rate (Forward) = 98.85287196459
total otherC considered (Reverse) (>95% C+T): 5314671
average conversion rate (Reverse) = 98.853156462417
Done.
Reading methylation percentage per base for sample: CLL
Trying to process:
map/SRR1812671_sort.bam
using htslib.
Conversion Statistics:
total otherC considered (>95% C+T): 9926020
average conversion rate = 98.561250116514
total otherC considered (Forward) (>95% C+T): 4961875
average conversion rate (Forward) = 98.555193239311
total otherC considered (Reverse) (>95% C+T): 4964145
average conversion rate (Reverse) = 98.567304224034
Done.
Reading methylation percentage per base for sample: normal
obj
https://gyazo.com/d1e8eefb7bb4a3b8121ea2352b8b1897
・strand、+はforward、ーはreverse。
https://gyazo.com/afdf6a1c652ab79da8a5b64e0364c995
カバレッジ:ゲノム上の位置あたりにマップされたリード数の平均。
https://gyazo.com/d876f6bf8e57824c43d5c68ef9c527a7
https://gyazo.com/6e5aac26770e23014a05aaaba16d41ae
https://gyazo.com/5dd692c2c633cc1c8d1391f869d751b0
objに格納された情報を使って、メチル化率の分布を確認
getMethylationStatsコマンドでヒストグラムを描画して分布を確認
code:R
getMethylationStats(obj1, plot=T)
pdf('histgram_1.pdf')
getMethylationStats(obj1, plot=T)
dev.off()
getMethylationStats(obj2, plot=T)
pdf('histgram_2.pdf')
getMethylationStats(obj2, plot=T)
dev.off()
https://gyazo.com/f9a47fc776ab27d10fab2941b5eeb90c
https://gyazo.com/6439f70e9e46ec8260855b6d59c4fc18
CpGサイトにおけるカバレッジの分布
code:R
getCoverageStats(obj1,plot=T)
pdf('coverage_cpg_1.pdf')
getCoverageStats(obj1,plot=T)
dev.off()
getCoverageStats(obj2,plot=T)
pdf('coverage_cpg_2.pdf')
getCoverageStats(obj2,plot=T)
dev.off()
https://gyazo.com/b0de72196142d755b15817e68bac6260
https://gyazo.com/b468de2372ca617c04c39654f0dea44d
サンプル間の比較などの解析を行うためには、各データを1つのオブジェクトにまとめる
code:R
meth <- unite(obj,destrand=F)
2つのサンプル間でメチル化が変化している位置を解析
calculateDiffMeth関数。
引数mc,coreで処理に使用するコア数を指定
↓
objで確認したときには約25万行あったが、diffでは約18万行のデータに。
code:R
diff <- calculateDiffMeth(meth,mc.cores=4)
two groups detected:
will calculate methylation difference as the difference of
treatment (group: 1) - control (group: 0)
NOTE: performing 'fast.fisher' instead of 'F' for two groups testing.
diff
https://gyazo.com/d04d36a6fe9d30a94c8d8dad3681f5bd
さらにgetMethylDiff関数を使って絞り込みを行う。
引数のtypeで「hyper」を指定するとメチル化が増加した箇所(=Hyper methylation)、「hypo」を指定するとメチル化が低下した箇所(=Hypo methylation )を抽出できる。
code:R
diff25p_hyper <- getMethylDiff(diff, difference=25, qvalue=0.01, type="hyper")
diff25p_hypo <- getMethylDiff(diff, difference=25, qvalue=0.01, type="hypo")
増加
https://gyazo.com/b5f02bb371e91a73913b90d64289b5be
減少
https://gyazo.com/4693e888a13562db5d9becf79cd3aba7
遺伝子アノテーションとの関連づけ
→refGene.bedを readTranscriptFeatures関数で読み込んで、annotateWithGeneParts関数でdiff25p_hyper、diff25p_hypoオブジェクトにそれぞれ遺伝子アノテーションを付与
メチル化に違いのある領域がどこに分布しているのかサマリーが表示。
annotateWithGeneParts関数を使用する際には、アノテーションを付与したいオブジェクトをGRangesオブジェクトに変換する。
code:R
gene.obj <- readTranscriptFeatures("ref/refGene.bed")
Reading the table...
Calculating intron coordinates...
Calculating exon coordinates...
Calculating TSS coordinates...
Calculating promoter coordinates...
Outputting the final GRangesList...
annotateWithGeneParts(as(diff25p_hyper,"GRanges"),gene.obj)
annotateWithGeneParts(as(diff25p_hypo,"GRanges"),gene.obj)
hyper
https://gyazo.com/c5c8d830b8cdeb411a137f0d7b72ad02
hypo
https://gyazo.com/00e3167244a0fa3cdc7ba493eea0c432
メチル化が増加した箇所と低下した箇所を具体的な遺伝子名と関連づける。
diff25p_hyperをmeth.diffが大きい順、diff25p_hypoをmeth.diffが負側に大きい順に並べ換える。
code:R
o1 <-order(-diff25p_hyper$meth.diff)
o2 <-order(diff25p_hypo$meth.diff)
o1
https://gyazo.com/1ae53af712bd0a93a8c41c4fcbb846b8
並べ替え用オブジェクトを作ったらannotateWithGeneParts関数を使ってアノテーション情報を付加。
転写開始領域(TSS)までの距離がdist.to.TSSとして保存されている。これを利用してTSSまでの距離が1000以内となるものだけを抽出する。
code:R
a1 <- annotateWithGeneParts(as(diff25p_hypero1,,"GRanges"),gene.obj) f1 <- abs(a1@dist.to.TSS$dist.to.feature) <= 1000
a2 <- annotateWithGeneParts(as(diff25p_hypoo2,,"GRanges"),gene.obj) f2 <- abs(a2@dist.to.TSS$dist.to.feature) <= 1000
hyper
https://gyazo.com/4bce2a38b003cd3014c86a7351c866ee
hypo
https://gyazo.com/03078a4be2aef88f778134ca12b6bdf2
https://gyazo.com/50c3a3992e574d497f00354c33f1da59