AWS Article: Preface - Open-BioInfo-yamaken

One-sentence summary

The performance of two Oxford Nanopore basecallers (Guppy and Dorado) was benchmarked on AWS.

Original text

code:txt

This blog post was contributed by Guilherme Coppini, Bioinformatician and Javier Quilez, Associate Director – Bioinformatics at G42 Healthcare; and Chris Seymour, Vice President of Advanced Platform Development at Oxford Nanopore; and Doruk Ozturk, Senior Solutions Architect, Container Technologies, and Michael Mueller, Senior Solutions Architect, Genomics at AWS and Stefan Dittforth, Senior Solutions Architect, Healthcare at AWS.

Update 2023-11-20: The source code for the automated deployment of the architecture described in section “Architecture” is now available as open source on GitHub.

Oxford Nanopore sequencers enables direct, real-time analysis of long DNA or RNA fragments. They work by monitoring changes to an electrical current as nucleic acids are passed through a protein nanopore. The resulting signal is decoded to provide the specific DNA or RNA sequence by virtue of compute-intensive algorithms called basecallers. This blog post presents the benchmarking results for two of those Oxford Nanopore basecallers — Guppy and Dorado — on AWS. This benchmarking project was conducted in collaboration between G42 Healthcare, Oxford Nanopore Technologies and AWS.

We ran Guppy and Dorado on 20 different Amazon Elastic Compute Cloud (Amazon EC2) instance types with GPU accelerators. The top performance was achieved on a p4d.24xlarge instance type which delivered 490 million samples/second with Dorado, and 250 million samples/second with Guppy. A sample is one measurement of the current flowing through the nanopore. Typically, the current signal is sampled at 10 times the speed at which the bases passing through the nanopore. For example, at a rate of 400 bases per second (bps) passing through the nanopore, the sampling rate is 4,000 samples per second. The Dorado basecaller outperformed Guppy by a factor of 3.8 x when performing methylation calling with the 5-hydroxymethylcytosine group (5hmCG). Our cost evaluations revealed that the g5.xlarge instance delivers the lowest cost for basecalling a whole human genome (WHG) with the Guppy tool.

Translation

code:txt

This blog post was contributed by Guilherme Coppini, Bioinformatician, and Javier Quilez, Associate Director of Bioinformatics, at G42 Healthcare; Chris Seymour, Vice President of Advanced Platform Development at Oxford Nanopore; and Doruk Ozturk, Senior Solutions Architect for Container Technologies, Michael Mueller, Senior Solutions Architect for Genomics, and Stefan Dittforth, Senior Solutions Architect for Healthcare, at AWS.

Update 2023-11-20: The source code for the automated deployment of the architecture described in the "Architecture" section is now available as open source on GitHub.

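Oxford Nanopore sequencers enable direct, real-time analysis of long DNA or RNA fragments. They work by monitoring changes in an electrical current as nucleic acids pass through a protein nanopore. Compute-intensive algorithms called basecallers decode the resulting signal into the corresponding DNA or RNA sequence. This blog post presents benchmarking results on AWS for two Oxford Nanopore basecallers (Guppy and Dorado). The benchmarking project was carried out as a collaboration between G42 Healthcare, Oxford Nanopore Technologies, and AWS.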

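We ran Guppy and Dorado on 20 Amazon Elastic Compute Cloud (Amazon EC2) instance types equipped with GPU accelerators. The top performance was achieved on the p4d.24xlarge instance type: 490 million samples/second with Dorado and 250 million samples/second with Guppy. A sample is a single measurement of the current flowing through the nanopore. Typically, the current signal is sampled at 10 times the rate at which bases pass through the nanopore. For example, when bases pass through the nanopore at 400 bases per second (bps), the sampling rate is 4,000 samples per second. Dorado outperformed Guppy by a factor of 3.8 when performing methylation calling with the 5-hydroxymethylcytosine group (5hmCG). The cost evaluation showed that the g5.xlarge instance delivers the lowest cost for basecalling a whole human genome (WHG) with Guppy.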

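Notes

Article updated on 2023-11-20 (the date of the update note; the excerpt does not give the original publication date)

Explanation of how nanopore sequencers work

Nanopore basecallers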

Dorado

Guppy

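Performance evaluation

In methylation calling (5hmCG), Dorado outperformed Guppy by a factor of 3.8 (in what? Speed? The excerpt does not name the metric; the only metric it reports elsewhere is throughput in samples/second)

Cost evaluation

The g5.xlarge instance gives the lowest cost for basecalling a whole human genome (WHG) with Guppy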
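
Worked example (not from the article)

The figures quoted in the excerpt can be combined into a small calculation. This is a minimal Python sketch, not part of the original post. All function and constant names are made up for illustration. It takes the post's "typically 10 samples per base" as a fixed ratio and treats the p4d.24xlarge peak throughputs as sustained rates, which is an assumption. Dataset sizes and hourly prices are not given in the excerpt, so they are left as caller-supplied parameters.

```python
# Sketch only: names are hypothetical; the numbers come from the quoted excerpt.

SAMPLES_PER_BASE = 10            # "sampled at 10 times the speed" of the bases (typical value)
DORADO_SAMPLES_PER_SEC = 490e6   # peak on p4d.24xlarge, as quoted
GUPPY_SAMPLES_PER_SEC = 250e6    # peak on p4d.24xlarge, as quoted


def sampling_rate(bases_per_sec, samples_per_base=SAMPLES_PER_BASE):
    """Signal sampling rate (samples/s) for a given translocation speed (bases/s)."""
    return bases_per_sec * samples_per_base


def basecalled_bases_per_sec(samples_per_sec, samples_per_base=SAMPLES_PER_BASE):
    """Convert basecaller throughput from samples/s to approximate bases/s."""
    return samples_per_sec / samples_per_base


def hours_to_basecall(total_bases, samples_per_sec, samples_per_base=SAMPLES_PER_BASE):
    """Wall-clock hours to basecall total_bases, assuming constant throughput."""
    total_samples = total_bases * samples_per_base
    return total_samples / samples_per_sec / 3600


def basecalling_cost(total_bases, samples_per_sec, price_per_hour):
    """Cost of a run; price_per_hour must be supplied by the caller (not in the excerpt)."""
    return hours_to_basecall(total_bases, samples_per_sec) * price_per_hour


if __name__ == "__main__":
    # The post's own example: 400 bases/s through the pore gives 4,000 samples/s.
    print(sampling_rate(400))
    # Dorado vs. Guppy peak throughput ratio on p4d.24xlarge (without methylation calling).
    print(DORADO_SAMPLES_PER_SEC / GUPPY_SAMPLES_PER_SEC)
    # Approximate bases/s that Dorado's peak throughput corresponds to.
    print(basecalled_bases_per_sec(DORADO_SAMPLES_PER_SEC))
```

The plain throughput ratio here is 1.96, which is distinct from the 3.8x figure the post reports for methylation calling. To compare instance types on cost, call `basecalling_cost` with the measured throughput and the hourly price of each instance.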