BS-seq解析 - mishima2019ykohki

BS-seq解析

2019/9/17

三島合宿にて。ホテルにて開始。

DNAメチル化解析手法の１つであるReduced Representation Bisulfite Sequencing(RRBS)法のデータ解析手順についてやってみる。

RRBS法は制限酵素MspIを用いてプロモーター領域とCpGアイランドを濃縮し、対象領域を限定してバイサルファイトシークエンシングを行うことで、効率よくDNAメチル化を捉えるための手法。

rawdata･･･公共データベースからダウンロードしたデータを格納する。

tools･･･Homebrewでインストールできないツールを格納する。

ref･･･リファレンスゲノム配列データと遺伝子アノテーションデータを格納する。

インストール

cutadaptはトリミングツールであるTrim Galoreに必要。

bowtie2とSAMtoolsはマッピングで使用するBismarkの動作に必要。

cutadapt

conda install -c bioconda cutadapt -y

trim_galore

conda install -c bioconda trim-galore -y

bismark

conda install -c bioconda bismark -y

解析データダウンロード

今回は慢性リンパ性白血病(CLL)B細胞のRRBSデータセット(3)の一部(https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE66121 )を使用

SRP055387→https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP055387

サンプル

SRR1812639　血液サンプル　細胞種: CD19+ B-cells 慢性リンパ性白血病患者

SRR1812671　血液サンプル　細胞種: CD19+ B-cells 健常者

2サンプルだけ。

fasq-dump→fasterq-dump

code:bash

$ cd ~/rrbs/rawdata

$ time fasterq-dump --gzip SRR1812671

$ time fasterq-dump --gzip SRR1812639

＜エラー＞

code:bash

2019-09-17T13:21:26 fasterq-dump.2.9.3 err: param unknown while parsing argument list within application support module - Unknown argument '--gzip'

2019-09-17T13:21:26 fasterq-dump.2.9.3 err: ArgsMakeAndHandle() -> RC(rcApp,rcArgv,rcParsing,rcParam,rcUnknown)

fasterq-dumpには、--gzipはない。

↓

code:bash

$ cd ~/rrbs/rawdata

$ time fasterq-dump SRR1812671

$ time fasterq-dump SRR1812639

#実行

$ time fasterq-dump SRR1812671

spots read : 18,733,618

reads read : 18,733,618

reads written : 18,733,618

real 3m41.621s

user 0m59.881s

sys 0m11.614s

$ time fasterq-dump SRR1812639

spots read : 19,202,715

reads read : 19,202,715

reads written : 19,202,715

real 4m23.479s

user 1m1.652s

sys 0m12.463s

・圧縮と展開

code:bash

#圧縮

$ gzip SRR1812639.fastq

$ gzip SRR1812671.fastq

#展開

$ gunzip -c SRR1812639.fastq.gz > SRR1812639.fastq

$ gunzip -c SRR1812671.fastq.gz > SRR1812671.fastq

gunzipのオプションで、-kはない。代わりに標準出力に出す、-cを使う。

-c, --stdout write on standard output, keep original files unchanged

gunzipはkオプションをつけると展開前のファイルが残ったまま展開される。

リファレンス配列のデータをダウンロード

refディレクトリ内にシェルスクリプトを作る

作成

010_download-ucsc.sh

code:bash

#!/bin/bash

set -euo pipefail

u1="ftp://hgdownload.soe.ucsc.edu"

u2="goldenPath/hg38/bigZips/analysisSet"

u3="hg38.analysisSet.chroms.tar.gz"

curl -O ${u1}/${u2}/${u3}

tar zxvf ${u3}

実行

code:bash

$ chmod a+x 010_download-ucsc.sh

$ time ./010_download-ucsc.sh

→ログ_010_download-ucsc.sh

作成

020_concatinate.sh

code:bash

#!/bin/bash

set -euo pipefail

d="hg38.analysisSet.chroms"

out="hg38.fasta"

cat \

$d/chr1.fa $d/chr2.fa $d/chr3.fa $d/chr4.fa $d/chr5.fa \

$d/chr6.fa $d/chr7.fa $d/chr8.fa $d/chr9.fa $d/chr10.fa \

$d/chr11.fa $d/chr12.fa $d/chr13.fa $d/chr14.fa $d/chr15.fa \

$d/chr16.fa $d/chr17.fa $d/chr18.fa $d/chr19.fa $d/chr20.fa \

$d/chr21.fa $d/chr22.fa $d/chrX.fa $d/chrY.fa $d/chrM.fa \

$d/chrEBV.fa > $out

実行

code:bash

$ chmod a+x 020_concatinate.sh

$ time ./020_concatinate.sh

real 0m6.062s

user 0m0.003s

sys 0m4.101s

遺伝子アノテーションファイルをダウンロード

→UCSC Genome Browsrから。https://genome.ucsc.edu/index.html

メニューのToolsからTable Browserを選択。

設定

code:bash

assembly: Dec.2013(GRCh38/hg38)

group: Genes and Gene Predictions

track: NCBI RefSeq

table: UCSC RefSeq (refGene)

region: genome

output format: custom track

output file: refGene.bed

file type returned: plain text

refGene.bedをゲット。

【解析】

FastQCで確認

mkdir fastqc_raw

FastQCはそのままだと長いリードを解析する際に3'末端の塩基をまとめて解析してしまう。これを防ぐために--nogroupオプションを設定する。

code: bash

$ time fastqc --nogroup -o fastqc_raw/ -t 12 rawdata/*.fastq

Started analysis of SRR1812639.fastq

Started analysis of SRR1812671.fastq

Approx 5% complete for SRR1812639.fastq

Approx 5% complete for SRR1812671.fastq

Approx 10% complete for SRR1812671.fastq

Approx 10% complete for SRR1812639.fastq

Approx 15% complete for SRR1812671.fastq

Approx 15% complete for SRR1812639.fastq

Approx 20% complete for SRR1812671.fastq

Approx 20% complete for SRR1812639.fastq

Approx 25% complete for SRR1812671.fastq

Approx 25% complete for SRR1812639.fastq

Approx 30% complete for SRR1812671.fastq

Approx 30% complete for SRR1812639.fastq

Approx 35% complete for SRR1812671.fastq

Approx 35% complete for SRR1812639.fastq

Approx 40% complete for SRR1812671.fastq

Approx 40% complete for SRR1812639.fastq

Approx 45% complete for SRR1812671.fastq

Approx 45% complete for SRR1812639.fastq

Approx 50% complete for SRR1812671.fastq

Approx 50% complete for SRR1812639.fastq

Approx 55% complete for SRR1812671.fastq

Approx 55% complete for SRR1812639.fastq

Approx 60% complete for SRR1812671.fastq

Approx 60% complete for SRR1812639.fastq

Approx 65% complete for SRR1812671.fastq

Approx 65% complete for SRR1812639.fastq

Approx 70% complete for SRR1812671.fastq

Approx 70% complete for SRR1812639.fastq

Approx 75% complete for SRR1812671.fastq

Approx 75% complete for SRR1812639.fastq

Approx 80% complete for SRR1812671.fastq

Approx 80% complete for SRR1812639.fastq

Approx 85% complete for SRR1812671.fastq

Approx 85% complete for SRR1812639.fastq

Approx 90% complete for SRR1812671.fastq

Approx 90% complete for SRR1812639.fastq

Approx 95% complete for SRR1812671.fastq

Analysis complete for SRR1812671.fastq

Approx 95% complete for SRR1812639.fastq

Analysis complete for SRR1812639.fastq

real 1m32.540s

user 2m52.128s

sys 0m7.256s

・htmlファイルを確認

Per Base sequence qualityをみると後半のリードにクオリティスコアが低いものが含まれていることがわかる。また、一番下にあるAdapter Contentを確認するとアダプター配列が含まれているようだ。次のステップで取り除こう。

トリミング

fastqファイルからアダプター配列など不要な配列をトリミングしていく。

trim galoreはRRBS解析のためのオプションが選択できる。

-qでクオリティクオリティスコアのカットオフを設定

-rrbsオプションはトリミング対象のデータがRRBSのサンプルであることを設定

-oオプションでトリミング後のファイルが出力されるディレクトリを指定

code:bash

$ mkdir trim #トリミング後の出力ファイル保存先を作成

$ time trim_galore -q 20 -rrbs rawdata/SRR1812639.fastq -o trim/

$ time trim_galore -q 20 -rrbs rawdata/SRR1812671.fastq -o trim/

＜エラー＞

code:bash

$ time trim_galore -q 20 -rrbs rawdata/SRR1812639.fastq -o trim/

Multicore support not enabled. Proceeding with single-core trimming.

Path to Cutadapt set as: 'cutadapt' (default)

Cutadapt seems to be working fine (tested command 'cutadapt --version')

Cutadapt version: 2.5

single-core operation.

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Output will be written into the directory: /home/kyamada/bioinformatics/DRY2/rrbs/trim/

AUTO-DETECTING ADAPTER TYPE

===========================

Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> rawdata/SRR1812639.fastq <<)

Found perfect matches for the following adapter sequences:

Adapter type Count Sequence Sequences analysed Percentage

Illumina 380210 AGATCGGAAGAGC 1000000 38.02

Nextera 0 CTGTCTCTTATA 1000000 0.00

smallRNA 0 TGGAATTCTCGG 1000000 0.00

Using Illumina adapter for trimming (count: 380210). Second best hit was Nextera (count: 0)

Writing report to '/home/kyamada/bioinformatics/DRY2/rrbs/trim/SRR1812639.fastq_trimming_report.txt'

SUMMARISING RUN PARAMETERS

==========================

Input filename: rawdata/SRR1812639.fastq

Trimming mode: single-end

Trim Galore version: 0.6.4

Cutadapt version: 2.5

Number of cores used for trimming: 1

Quality Phred score cutoff: 20

Quality encoding type selected: ASCII+33

Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected)

Maximum trimming error rate: 0.1 (default)

Minimum required adapter overlap (stringency): 1 bp

Minimum required sequence length before a sequence gets removed: 20 bp

File was specified to be an MspI-digested RRBS sample. Read 1 sequences with adapter contamination will be trimmed a further 2 bp from their 3' end, and Read 2 sequences will be trimmed by 2 bp from their 5' end to remove potential methylation-biased bases from the end-repair reaction

Cutadapt seems to be fairly up-to-date (version 2.5). Setting -j 1

>> Now performing adaptive quality trimming with a Phred-score cutoff of: 20 <<<

This is cutadapt 2.5 with Python 3.7.4

Command line parameters: -j 1 -e 0.1 -q 20 -a X rawdata/SRR1812639.fastq

Processing reads on 1 core in single-end mode ...

8<- 00:01:05 9,980,000 reads @ 6.5 µs/read; 9.27 M reads/minute10000000 sequences processed

-----=8 00:02:05 19,202,715 reads @ 6.5 µs/read; 9.18 M reads/minute

Finished in 125.54 s (7 us/read; 9.18 M reads/minute).

=== Summary ===

Total reads processed: 19,202,715

Reads with adapters: 0 (0.0%)

Reads written (passing filters): 19,202,715 (100.0%)

Total basepairs processed: 1,920,271,500 bp

Quality-trimmed: 91,265,324 bp (4.8%)

Total written (filtered): 1,829,006,176 bp (95.2%)

=== Adapter 1 ===

Sequence: X; Type: regular 3'; Length: 1; Trimmed: 0 times.

>> Quality trimming completed <<<

19202715 sequences processed in total

Writing final adapter and quality trimmed output to SRR1812639_trimmed.fq

>> Now performing adapter trimming for the adapter sequence: 'AGATCGGAAGAGC' from file SRR1812639.fastq_qual_trimmed.fastq <<<

Do we ever get here?

******************************************************************

-rrbsはオプションの書き方がおかしい？

→--rrbs

変更

code:bash

$ time trim_galore -q 20 --rrbs rawdata/SRR1812639.fastq -o trim/

$ time trim_galore -q 20 --rrbs rawdata/SRR1812671.fastq -o trim/

trim_galore version0.5.0

code:bash

$ time trim_galore -q 20 --rrbs rawdata/SRR1812639.fastq -o trim/

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)

2.5

Cutadapt seems to be working fine (tested command 'cutadapt --version')

AUTO-DETECTING ADAPTER TYPE

===========================

Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> rawdata/SRR1812639.fastq <<)

Found perfect matches for the following adapter sequences:

Adapter type Count Sequence Sequences analysed Percentage

Illumina 380210 AGATCGGAAGAGC 1000000 38.02

Nextera 0 CTGTCTCTTATA 1000000 0.00

smallRNA 0 TGGAATTCTCGG 1000000 0.00

Using Illumina adapter for trimming (count: 380210). Second best hit was Nextera (count: 0)

Writing report to 'trim/SRR1812639.fastq_trimming_report.txt'

SUMMARISING RUN PARAMETERS

==========================

Input filename: rawdata/SRR1812639.fastq

Trimming mode: single-end

Trim Galore version: 0.5.0

Cutadapt version: 2.5

Quality Phred score cutoff: 20

Quality encoding type selected: ASCII+33

Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected)

Maximum trimming error rate: 0.1 (default)

Minimum required adapter overlap (stringency): 1 bp

Minimum required sequence length before a sequence gets removed: 20 bp

>> Now performing adaptive quality trimming with a Phred-score cutoff of: 20 <<<

This is cutadapt 2.5 with Python 3.7.4

Command line parameters: -f fastq -e 0.1 -q 20 -a X rawdata/SRR1812639.fastq

WARNING: Option --format is deprecated and ignored because the input file format is always auto-detected

Processing reads on 1 core in single-end mode ...

8<-- 00:01:02 9,980,000 reads @ 6.2 µs/read; 9.61 M reads/minute10000000 sequences processed

------>8 00:02:01 19,202,715 reads @ 6.3 µs/read; 9.46 M reads/minute

Finished in 121.81 s (6 us/read; 9.46 M reads/minute).

=== Summary ===

Total reads processed: 19,202,715

Reads with adapters: 0 (0.0%)

Reads written (passing filters): 19,202,715 (100.0%)

Total basepairs processed: 1,920,271,500 bp

Quality-trimmed: 91,265,324 bp (4.8%)

Total written (filtered): 1,829,006,176 bp (95.2%)

=== Adapter 1 ===

Sequence: X; Type: regular 3'; Length: 1; Trimmed: 0 times.

>> Quality trimming completed <<<

19202715 sequences processed in total

Writing final adapter and quality trimmed output to SRR1812639_trimmed.fq

>> Now performing adapter trimming for the adapter sequence: 'AGATCGGAAGAGC' from file SRR1812639.fastq_qual_trimmed.fastq <<<

10000000 sequences processed

This is cutadapt 2.5 with Python 3.7.4

Command line parameters: -f fastq -e 0.1 -O 1 -a AGATCGGAAGAGC trim/SRR1812639.fastq_qual_trimmed.fastq

WARNING: Option --format is deprecated and ignored because the input file format is always auto-detected

Processing reads on 1 core in single-end mode ...

Finished in 147.66 s (8 us/read; 7.80 M reads/minute).

=== Summary ===

Total reads processed: 19,202,715

Reads with adapters: 11,224,908 (58.5%)

Reads written (passing filters): 19,202,715 (100.0%)

Total basepairs processed: 1,829,006,176 bp

Total written (filtered): 1,553,517,600 bp (84.9%)

=== Adapter 1 ===

Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 11224908 times.

No. of allowed errors:

0-9 bp: 0; 10-13 bp: 1

Bases preceding removed adapters:

A: 9.1%

C: 0.3%

G: 79.2%

T: 11.4%

none/other: 0.0%

Overview of removed sequences

length count expect max.err error counts

1 2321302 4800678.8 0 2321302

2 832614 1200169.7 0 832614

3 224014 300042.4 0 224014

4 137238 75010.6 0 137238

5 66585 18752.7 0 66585

6 105435 4688.2 0 105435

7 70597 1172.0 0 70597

8 102090 293.0 0 102090

9 113573 73.3 0 113325 248

10 81538 18.3 1 71222 10316

11 88151 4.6 1 77682 10469

12 104484 1.1 1 92874 11610

13 121051 0.3 1 106281 14770

14 94670 0.3 1 84277 10393

15 84245 0.3 1 74155 10090

16 91602 0.3 1 80815 10787

17 92466 0.3 1 82000 10466

18 113944 0.3 1 99786 14158

19 74002 0.3 1 70862 3140

20 86755 0.3 1 82192 4563

21 118690 0.3 1 113822 4868

22 96912 0.3 1 92540 4372

23 124146 0.3 1 118523 5623

24 113976 0.3 1 105412 8564

25 131454 0.3 1 126044 5410

26 100912 0.3 1 96356 4556

27 117761 0.3 1 112510 5251

28 137348 0.3 1 131400 5948

29 185930 0.3 1 177788 8142

30 1205931 0.3 1 1165725 40206

31 182313 0.3 1 176572 5741

32 143949 0.3 1 139283 4666

33 95314 0.3 1 92447 2867

34 125312 0.3 1 119314 5998

35 96950 0.3 1 92912 4038

36 111410 0.3 1 106439 4971

37 108387 0.3 1 103405 4982

38 103573 0.3 1 98397 5176

39 129323 0.3 1 123321 6002

40 89428 0.3 1 85818 3610

41 109504 0.3 1 104525 4979

42 103995 0.3 1 99004 4991

43 115193 0.3 1 109557 5636

44 102331 0.3 1 97064 5267

45 130497 0.3 1 122431 8066

46 147851 0.3 1 139244 8607

47 107782 0.3 1 102335 5447

48 132023 0.3 1 126083 5940

49 161562 0.3 1 153699 7863

50 90716 0.3 1 86221 4495

51 86436 0.3 1 80972 5464

52 110758 0.3 1 102892 7866

53 121669 0.3 1 113934 7735

54 62358 0.3 1 59043 3315

55 48516 0.3 1 45798 2718

56 78291 0.3 1 72865 5426

57 82613 0.3 1 73849 8764

58 89941 0.3 1 83816 6125

59 83799 0.3 1 74228 9571

60 52944 0.3 1 48196 4748

61 67208 0.3 1 63145 4063

62 57035 0.3 1 50944 6091

63 72484 0.3 1 64150 8334

64 43310 0.3 1 38889 4421

65 38774 0.3 1 33730 5044

66 50804 0.3 1 44576 6228

67 53664 0.3 1 46322 7342

68 54884 0.3 1 44802 10082

69 70520 0.3 1 52155 18365

70 89797 0.3 1 78583 11214

71 40573 0.3 1 35818 4755

72 19397 0.3 1 17252 2145

73 9965 0.3 1 9136 829

74 6638 0.3 1 6214 424

75 2474 0.3 1 2345 129

76 976 0.3 1 876 100

77 599 0.3 1 459 140

78 330 0.3 1 246 84

79 250 0.3 1 190 60

80 137 0.3 1 114 23

81 139 0.3 1 106 33

82 54 0.3 1 25 29

83 35 0.3 1 8 27

84 17 0.3 1 4 13

85 26 0.3 1 0 26

86 11 0.3 1 1 10

87 20 0.3 1 1 19

88 23 0.3 1 0 23

89 16 0.3 1 0 16

90 13 0.3 1 0 13

91 17 0.3 1 0 17

92 27 0.3 1 1 26

93 41 0.3 1 0 41

94 37 0.3 1 1 36

95 33 0.3 1 1 32

96 28 0.3 1 0 28

97 31 0.3 1 0 31

98 37 0.3 1 0 37

99 50 0.3 1 0 50

100 280 0.3 1 0 280

Successfully deleted temporary file SRR1812639.fastq_qual_trimmed.fastq

RUN STATISTICS FOR INPUT FILE: rawdata/SRR1812639.fastq

=============================================

19202715 sequences processed in total

Sequences were truncated to a varying degree because of deteriorating qualities (Phred score quality cutoff: 20): 5022874 (26.2%)

Sequences removed because they became shorter than the length cutoff of 20 bp: 112284 (0.6%)

RRBS reads trimmed by additional 2 bp when adapter contamination was detected: 11191704 (58.3%)

real 4m31.675s

user 5m12.597s

sys 0m23.569s

→trim_galore_0.5.0の環境作成

version 0.6.4（最新）でも同じエラー。。。

trim_galore version0.5.0

code:bash

$ time trim_galore -q 20 --rrbs rawdata/SRR1812671.fastq -o trim/

No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)

2.5

Cutadapt seems to be working fine (tested command 'cutadapt --version')

AUTO-DETECTING ADAPTER TYPE

===========================

Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> rawdata/SRR1812671.fastq <<)

Found perfect matches for the following adapter sequences:

Adapter type Count Sequence Sequences analysed Percentage

Illumina 318494 AGATCGGAAGAGC 1000000 31.85

Nextera 0 CTGTCTCTTATA 1000000 0.00

smallRNA 0 TGGAATTCTCGG 1000000 0.00

Using Illumina adapter for trimming (count: 318494). Second best hit was Nextera (count: 0)

Writing report to 'trim/SRR1812671.fastq_trimming_report.txt'

SUMMARISING RUN PARAMETERS

==========================

Input filename: rawdata/SRR1812671.fastq

Trimming mode: single-end

Trim Galore version: 0.5.0

Cutadapt version: 2.5

Quality Phred score cutoff: 20

Quality encoding type selected: ASCII+33

Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected)

Maximum trimming error rate: 0.1 (default)

Minimum required adapter overlap (stringency): 1 bp

Minimum required sequence length before a sequence gets removed: 20 bp

>> Now performing adaptive quality trimming with a Phred-score cutoff of: 20 <<<

This is cutadapt 2.5 with Python 3.7.4

Command line parameters: -f fastq -e 0.1 -q 20 -a X rawdata/SRR1812671.fastq

WARNING: Option --format is deprecated and ignored because the input file format is always auto-detected

Processing reads on 1 core in single-end mode ...

8=- 00:01:05 9,950,000 reads @ 6.3 µs/read; 9.49 M reads/minute10000000 sequences processed

------>8 00:02:00 18,733,618 reads @ 6.5 µs/read; 9.29 M reads/minute

Finished in 120.98 s (6 us/read; 9.29 M reads/minute).

=== Summary ===

Total reads processed: 18,733,618

Reads with adapters: 0 (0.0%)

Reads written (passing filters): 18,733,618 (100.0%)

Total basepairs processed: 1,873,361,800 bp

Quality-trimmed: 96,459,207 bp (5.1%)

Total written (filtered): 1,776,902,593 bp (94.9%)

=== Adapter 1 ===

Sequence: X; Type: regular 3'; Length: 1; Trimmed: 0 times.

>> Quality trimming completed <<<

18733618 sequences processed in total

Writing final adapter and quality trimmed output to SRR1812671_trimmed.fq

>> Now performing adapter trimming for the adapter sequence: 'AGATCGGAAGAGC' from file SRR1812671.fastq_qual_trimmed.fastq <<<

10000000 sequences processed

This is cutadapt 2.5 with Python 3.7.4

Command line parameters: -f fastq -e 0.1 -O 1 -a AGATCGGAAGAGC trim/SRR1812671.fastq_qual_trimmed.fastq

WARNING: Option --format is deprecated and ignored because the input file format is always auto-detected

Processing reads on 1 core in single-end mode ...

Finished in 142.83 s (8 us/read; 7.87 M reads/minute).

=== Summary ===

Total reads processed: 18,733,618

Reads with adapters: 10,668,158 (56.9%)

Reads written (passing filters): 18,733,618 (100.0%)

Total basepairs processed: 1,776,902,593 bp

Total written (filtered): 1,556,882,310 bp (87.6%)

=== Adapter 1 ===

Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 10668158 times.

No. of allowed errors:

0-9 bp: 0; 10-13 bp: 1

Bases preceding removed adapters:

A: 9.8%

C: 0.4%

G: 76.4%

T: 12.9%

none/other: 0.5%

Overview of removed sequences

length count expect max.err error counts

1 2409525 4683404.5 0 2409525

2 957698 1170851.1 0 957698

3 264382 292712.8 0 264382

4 162708 73178.2 0 162708

5 81663 18294.5 0 81663

6 132941 4573.6 0 132941

7 78954 1143.4 0 78954

8 123051 285.9 0 123051

9 105056 71.5 0 104814 242

10 86282 17.9 1 84279 2003

11 110767 4.5 1 108324 2443

12 121759 1.1 1 118950 2809

13 132000 0.3 1 129118 2882

14 102061 0.3 1 99477 2584

15 86427 0.3 1 84172 2255

16 103805 0.3 1 100617 3188

17 104451 0.3 1 100769 3682

18 123672 0.3 1 119945 3727

19 82988 0.3 1 80404 2584

20 93271 0.3 1 90543 2728

21 128651 0.3 1 124904 3747

22 106234 0.3 1 102868 3366

23 119721 0.3 1 116071 3650

24 101770 0.3 1 98350 3420

25 116535 0.3 1 113036 3499

26 119282 0.3 1 115131 4151

27 137323 0.3 1 132144 5179

28 122017 0.3 1 117947 4070

29 208514 0.3 1 201455 7059

30 1229993 0.3 1 1196499 33494

31 191131 0.3 1 184964 6167

32 131369 0.3 1 127167 4202

33 108004 0.3 1 104139 3865

34 112313 0.3 1 107913 4400

35 108191 0.3 1 103743 4448

36 111852 0.3 1 107595 4257

37 97450 0.3 1 93888 3562

38 84053 0.3 1 80894 3159

39 108374 0.3 1 104444 3930

40 76714 0.3 1 74253 2461

41 92738 0.3 1 90013 2725

42 90268 0.3 1 87058 3210

43 89674 0.3 1 87336 2338

44 84105 0.3 1 81322 2783

45 95447 0.3 1 92006 3441

46 118647 0.3 1 115231 3416

47 78118 0.3 1 75903 2215

48 103302 0.3 1 100452 2850

49 103654 0.3 1 100494 3160

50 61709 0.3 1 59864 1845

51 54282 0.3 1 52600 1682

52 53946 0.3 1 52049 1897

53 62380 0.3 1 60698 1682

54 37254 0.3 1 36334 920

55 26753 0.3 1 25881 872

56 38148 0.3 1 36656 1492

57 37649 0.3 1 36251 1398

58 45037 0.3 1 42880 2157

59 35933 0.3 1 34443 1490

60 30108 0.3 1 28814 1294

61 36310 0.3 1 34335 1975

62 30575 0.3 1 24562 6013

63 37130 0.3 1 31115 6015

64 28226 0.3 1 24804 3422

65 21525 0.3 1 18038 3487

66 28783 0.3 1 22298 6485

67 33672 0.3 1 18905 14767

68 36310 0.3 1 18286 18024

69 45470 0.3 1 18123 27347

70 30177 0.3 1 21890 8287

71 9462 0.3 1 6938 2524

72 3734 0.3 1 2927 807

73 1318 0.3 1 1034 284

74 730 0.3 1 592 138

75 304 0.3 1 197 107

76 170 0.3 1 88 82

77 139 0.3 1 52 87

78 122 0.3 1 31 91

79 94 0.3 1 27 67

80 68 0.3 1 17 51

81 83 0.3 1 23 60

82 58 0.3 1 7 51

83 38 0.3 1 4 34

84 50 0.3 1 2 48

85 41 0.3 1 0 41

86 38 0.3 1 1 37

87 52 0.3 1 1 51

88 53 0.3 1 0 53

89 35 0.3 1 0 35

90 52 0.3 1 0 52

91 43 0.3 1 0 43

92 48 0.3 1 0 48

93 95 0.3 1 0 95

94 72 0.3 1 0 72

95 78 0.3 1 0 78

96 60 0.3 1 0 60

97 68 0.3 1 1 67

98 70 0.3 1 0 70

99 105 0.3 1 0 105

100 596 0.3 1 0 596

Successfully deleted temporary file SRR1812671.fastq_qual_trimmed.fastq

RUN STATISTICS FOR INPUT FILE: rawdata/SRR1812671.fastq

=============================================

18733618 sequences processed in total

Sequences were truncated to a varying degree because of deteriorating qualities (Phred score quality cutoff: 20): 4708759 (25.1%)

Sequences removed because they became shorter than the length cutoff of 20 bp: 133701 (0.7%)

RRBS reads trimmed by additional 2 bp when adapter contamination was detected: 10566437 (56.4%)

real 4m27.553s

user 5m6.533s

sys 0m23.652s

trim_galore後のfastqc

Per Base sequence qualityで先ほど後半のリードに含まれていたクオリティが低いものがトリミングされたことがわかる。また、Adapter Contentを確認するとアダプター配列がなくなったことが分かる。

code:bash

$ mkdir fastqc_trim

$ time fastqc --nogroup -o fastqc_trim/ -t 4 trim/*.fq

Started analysis of SRR1812639_trimmed.fq

Started analysis of SRR1812671_trimmed.fq

Approx 5% complete for SRR1812639_trimmed.fq

Approx 5% complete for SRR1812671_trimmed.fq

Approx 10% complete for SRR1812639_trimmed.fq

Approx 10% complete for SRR1812671_trimmed.fq

Approx 15% complete for SRR1812639_trimmed.fq

Approx 15% complete for SRR1812671_trimmed.fq

Approx 20% complete for SRR1812639_trimmed.fq

Approx 20% complete for SRR1812671_trimmed.fq

Approx 25% complete for SRR1812671_trimmed.fq

Approx 25% complete for SRR1812639_trimmed.fq

Approx 30% complete for SRR1812671_trimmed.fq

Approx 30% complete for SRR1812639_trimmed.fq

Approx 35% complete for SRR1812671_trimmed.fq

Approx 35% complete for SRR1812639_trimmed.fq

Approx 40% complete for SRR1812671_trimmed.fq

Approx 40% complete for SRR1812639_trimmed.fq

Approx 45% complete for SRR1812671_trimmed.fq

Approx 45% complete for SRR1812639_trimmed.fq

Approx 50% complete for SRR1812671_trimmed.fq

Approx 50% complete for SRR1812639_trimmed.fq

Approx 55% complete for SRR1812671_trimmed.fq

Approx 55% complete for SRR1812639_trimmed.fq

Approx 60% complete for SRR1812671_trimmed.fq

Approx 60% complete for SRR1812639_trimmed.fq

Approx 65% complete for SRR1812671_trimmed.fq

Approx 65% complete for SRR1812639_trimmed.fq

Approx 70% complete for SRR1812671_trimmed.fq

Approx 70% complete for SRR1812639_trimmed.fq

Approx 75% complete for SRR1812671_trimmed.fq

Approx 75% complete for SRR1812639_trimmed.fq

Approx 80% complete for SRR1812671_trimmed.fq

Approx 80% complete for SRR1812639_trimmed.fq

Approx 85% complete for SRR1812671_trimmed.fq

Approx 85% complete for SRR1812639_trimmed.fq

Approx 90% complete for SRR1812671_trimmed.fq

Approx 90% complete for SRR1812639_trimmed.fq

Approx 95% complete for SRR1812671_trimmed.fq

Approx 95% complete for SRR1812639_trimmed.fq

Analysis complete for SRR1812671_trimmed.fq

Analysis complete for SRR1812639_trimmed.fq

real 1m20.199s

user 2m32.637s

sys 0m5.585s

マッピング

DNAメチル化データの解析ではダウンロードしたリファレンス配列をそのままマッピングに使用することができない。

RRBBなどのバイサルファイト処理を行う実験では、メチル化されていないシトシン(C)がチミン(T)に変換されている。このため、リファレンス配列もバイサルファイト処理を考慮したものに変換する必要がある。

マッピングツールのBismarkにはこの変換を行ってくれるコマンドが含まれているのでこれを利用。

→bismark_genome_preparation

リファレンス配列のhg38.fastaに対して行う。

時間がかかる。

・マッピングの準備

code:bash

$ time bismark_genome_preparation ref/

→ログ_bismark_preparation

・マッピング

使用するコア数を--multicoreオプションで指定

code:bash

$ mkdir map

$ time bismark ref/ trim/SRR1812639_trimmed.fq --multicore 30 -o map/

$ time bismark ref/ trim/SRR1812671_trimmed.fq --multicore 30 -o map/

→bismark_マッピング_ログ

→bamファイルとレポートファイルが出力される

DNAメチル化部位の抽出

マッピング後に出力されたbamファイルを用いる。

methylKitで使用するためにはbamファイルを染色体番号順にソートしておく。

↓

bamファイルのソートにはSamtools。

-@オプションは処理に使用するコア数を指定

code:bash

$ cd map

$ samtools sort -@ 4 SRR1812639_trimmed_bismark_bt2.bam SRR1812639_sort

$ samtools sort -@ 4 SRR1812671_trimmed_bismark_bt2.bam SRR1812671_sort

＜エラー＞

code:bash

bam_sort Use -T PREFIX / -o FILE to specify temporary and final output files

Usage: samtools sort options... in.bam

Options:

-l INT Set compression level, from 0 (uncompressed) to 9 (best)

-m INT Set maximum memory per thread; suffix K/M/G recognized 768M

-n Sort by read name

-t TAG Sort by value of TAG. Uses position as secondary index (or read name if -n is set)

-o FILE Write final output to FILE rather than standard output

-T PREFIX Write temporary files to PREFIX.nnnn.bam

--input-fmt-option OPT=VAL

Specify a single input file format option in the form

of OPTION or OPTION=VALUE

-O, --output-fmt FORMAT[,OPT=VAL]...

Specify output format (SAM, BAM, CRAM)

--output-fmt-option OPT=VAL

Specify a single output file format option in the form

of OPTION or OPTION=VALUE

--reference FILE

Reference sequence FASTA FILE null

-@, --threads INT

Number of additional threads to use 0

参考：SAMtoolsとリダイレクト

↓

code:bash

$ samtools sort sample.bam > sample.sort.bam

$ time samtools sort -@ 4 SRR1812639_trimmed_bismark_bt2.bam > SRR1812639_sort.bam

$ time samtools sort -@ 4 SRR1812671_trimmed_bismark_bt2.bam > SRR1812671_sort.bam

$ time samtools sort -@ 4 SRR1812639_trimmed_bismark_bt2.bam > SRR1812639_sort.bam

bam_sort_core merging from 4 files and 4 in-memory blocks...

real 0m27.354s

user 1m45.199s

sys 0m5.737s

ここからはRを使う。

methylKitとgenomation

methylKitはメチル化率の算出や遺伝子アノテーションとの関連づけを行うことができるパッケージ。

genomationはmethylKitの機能を補完するのに有用なパッケージ。

code:R

#install

BiocManager::install("methylKit")

BiocManager::install("genomation")

library("methylKit")

library("genomation")

#ソートしたbamファイルからメチル化率の算出を行う。

file1 <- "map/SRR1812639_sort.bam"

file2 <- "map/SRR1812671_sort.bam"

file.list <- list(file1,file2)

#これはソートできていないやつ

file1 <- "map/SRR1812639_trimmed_bismark_bt2.bam"

file2 <- "map/SRR1812671_trimmed_bismark_bt2.bam"

file.list <- list(file1,file2)

methylKitのprocessBismarkAln関数を用いてメチル化率の算出を実行。

treatmentでは同じグループのサンプルを指定できる。

今回はCLLを1、健常者を0

code:R

obj <- processBismarkAln(location=file.list,sample.id=list("CLL","normal"),

assembly="hg38",read.context="CpG",treatment=c(1,0))

#確認

Trying to process:

map/SRR1812639_sort.bam

using htslib.

Conversion Statistics:

total otherC considered (>95% C+T): 10626389

average conversion rate = 98.853014253034

total otherC considered (Forward) (>95% C+T): 5311718

average conversion rate (Forward) = 98.85287196459

total otherC considered (Reverse) (>95% C+T): 5314671

average conversion rate (Reverse) = 98.853156462417

Done.

Reading methylation percentage per base for sample: CLL

Trying to process:

map/SRR1812671_sort.bam

using htslib.

Conversion Statistics:

total otherC considered (>95% C+T): 9926020

average conversion rate = 98.561250116514

total otherC considered (Forward) (>95% C+T): 4961875

average conversion rate (Forward) = 98.555193239311

total otherC considered (Reverse) (>95% C+T): 4964145

average conversion rate (Reverse) = 98.567304224034

Done.

Reading methylation percentage per base for sample: normal

obj

https://gyazo.com/d1e8eefb7bb4a3b8121ea2352b8b1897

・strand、+はforward、ーはreverse。

https://gyazo.com/afdf6a1c652ab79da8a5b64e0364c995

・カバレッジ→Introduction to key concepts in Illumina sequencing data analysis‐ イルミナシーケンスデータ解析入門その前に

カバレッジ：ゲノム上の位置あたりにマップされたリード数の平均。

https://gyazo.com/d876f6bf8e57824c43d5c68ef9c527a7

https://gyazo.com/6e5aac26770e23014a05aaaba16d41ae

https://gyazo.com/5dd692c2c633cc1c8d1391f869d751b0

objに格納された情報を使って、メチル化率の分布を確認

getMethylationStatsコマンドでヒストグラムを描画して分布を確認

code:R

getMethylationStats(obj1, plot=T)

#保存

pdf('histgram_1.pdf')

getMethylationStats(obj1, plot=T)

dev.off()

getMethylationStats(obj2, plot=T)

#保存

pdf('histgram_2.pdf')

getMethylationStats(obj2, plot=T)

dev.off()

https://gyazo.com/f9a47fc776ab27d10fab2941b5eeb90c

https://gyazo.com/6439f70e9e46ec8260855b6d59c4fc18

CpGサイトにおけるカバレッジの分布

code:R

getCoverageStats(obj1,plot=T)

#保存

pdf('coverage_cpg_1.pdf')

getCoverageStats(obj1,plot=T)

dev.off()

getCoverageStats(obj2,plot=T)

#保存

pdf('coverage_cpg_2.pdf')

getCoverageStats(obj2,plot=T)

dev.off()

https://gyazo.com/b0de72196142d755b15817e68bac6260

https://gyazo.com/b468de2372ca617c04c39654f0dea44d

サンプル間の比較などの解析を行うためには、各データを1つのオブジェクトにまとめる

code:R

meth <- unite(obj,destrand=F)

2つのサンプル間でメチル化が変化している位置を解析

calculateDiffMeth関数。

引数mc,coreで処理に使用するコア数を指定

↓

objで確認したときには約25万行あったが、diffでは約18万行のデータに。

code:R

diff <- calculateDiffMeth(meth,mc.cores=4)

two groups detected:

will calculate methylation difference as the difference of

treatment (group: 1) - control (group: 0)

NOTE: performing 'fast.fisher' instead of 'F' for two groups testing.

diff

https://gyazo.com/d04d36a6fe9d30a94c8d8dad3681f5bd

さらにgetMethylDiff関数を使って絞り込みを行う。

引数のtypeで「hyper」を指定するとメチル化が増加した箇所(=Hyper methylation)、「hypo」を指定するとメチル化が低下した箇所(=Hypo methylation )を抽出できる。

code:R

diff25p_hyper <- getMethylDiff(diff, difference=25, qvalue=0.01, type="hyper")

diff25p_hypo <- getMethylDiff(diff, difference=25, qvalue=0.01, type="hypo")

増加

https://gyazo.com/b5f02bb371e91a73913b90d64289b5be

減少

https://gyazo.com/4693e888a13562db5d9becf79cd3aba7

遺伝子アノテーションとの関連づけ

→refGene.bedを readTranscriptFeatures関数で読み込んで、annotateWithGeneParts関数でdiff25p_hyper、diff25p_hypoオブジェクトにそれぞれ遺伝子アノテーションを付与

メチル化に違いのある領域がどこに分布しているのかサマリーが表示。

annotateWithGeneParts関数を使用する際には、アノテーションを付与したいオブジェクトをGRangesオブジェクトに変換する。

code:R

gene.obj <- readTranscriptFeatures("ref/refGene.bed")

Reading the table...

Calculating intron coordinates...

Calculating exon coordinates...

Calculating TSS coordinates...

Calculating promoter coordinates...

Outputting the final GRangesList...

annotateWithGeneParts(as(diff25p_hyper,"GRanges"),gene.obj)

annotateWithGeneParts(as(diff25p_hypo,"GRanges"),gene.obj)

hyper

https://gyazo.com/c5c8d830b8cdeb411a137f0d7b72ad02

hypo

https://gyazo.com/00e3167244a0fa3cdc7ba493eea0c432

メチル化が増加した箇所と低下した箇所を具体的な遺伝子名と関連づける。

diff25p_hyperをmeth.diffが大きい順、diff25p_hypoをmeth.diffが負側に大きい順に並べ換える。

code:R

o1 <-order(-diff25p_hyper$meth.diff)

o2 <-order(diff25p_hypo$meth.diff)

https://gyazo.com/1ae53af712bd0a93a8c41c4fcbb846b8

並べ替え用オブジェクトを作ったらannotateWithGeneParts関数を使ってアノテーション情報を付加。

転写開始領域(TSS)までの距離がdist.to.TSSとして保存されている。これを利用してTSSまでの距離が1000以内となるものだけを抽出する。

code:R

a1 <- annotateWithGeneParts(as(diff25p_hypero1,,"GRanges"),gene.obj)

f1 <- abs(a1@dist.to.TSS$dist.to.feature) <= 1000

head(a1@dist.to.TSSf1,)

a2 <- annotateWithGeneParts(as(diff25p_hypoo2,,"GRanges"),gene.obj)

f2 <- abs(a2@dist.to.TSS$dist.to.feature) <= 1000

head(a2@dist.to.TSSf2,)

hyper

https://gyazo.com/4bce2a38b003cd3014c86a7351c866ee

hypo

https://gyazo.com/03078a4be2aef88f778134ca12b6bdf2

nm 020780 →https://www.ncbi.nlm.nih.gov/gene/?term=nm+020780

https://gyazo.com/50c3a3992e574d497f00354c33f1da59