The Google File System (2003)

Read this a long time ago; re-reading it now

Abstract

Google File System: scalable distributed file system for large distributed data-intensive applications

It provides fault tolerance while running on inexpensive commodity hardware

It delivers high aggregate performance to a large number of clients

The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines.

It is concurrently accessed by hundreds of clients

Introduction

The design is driven by application workloads and the technological environment, so it departs from earlier file systems in several ways

Component failures are assumed to happen all the time

constant monitoring, error detection, fault tolerance, and automatic recovery are essential

Files are large

Multi-GB files are common

Updates are done by append

Most files are only ever appended to, not overwritten

Design overview

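Assumptions

The system is built from inexpensive components that fail often. It needs mechanisms to detect component failures and recover from them

The system handles large files (100 MB to multiple GB). The number of files is assumed to be modest.

Small files must be supported too, but the system is not optimized for them.

Reads come in two main kinds

large streaming read: 1 MB or more

small random read: reads a few KB at an arbitrary offset

Writes are mainly appends (sequential writes)

Small writes at arbitrary positions are supported, but they do not have to be efficient

The behavior when multiple clients append to the same file concurrently must be well defined.

atomicity with minimal synchronization overhead is essential

High sustained bandwidth matters more than low latency.

Most clients issue bulk requests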

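Interface

Provides a hierarchical file namespace, but is not POSIX-compliant

In addition to the standard operations, snapshot and record append are supported

snapshot: creates a copy at low cost

record append: guarantees atomicity when multiple clients append to the same file concurrently

Apparently handy for implementing multi-way merges and producer-consumer queues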

Architecture

single master, multiple chunkservers

Files are divided into fixed-size chunks

Each chunk is identified by an immutable, globally unique 64-bit chunk handle

The master assigns the chunk handle

A chunkserver stores each chunk as an ordinary Linux file

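Each chunk is replicated on multiple chunkservers: three replicas by default, and a different replication level can be set for different regions of the file namespace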

The master manages all metadata: namespaces, access control, the file-to-chunk mapping, chunk locations, and so on

It also manages system-wide activities such as chunk lease management, garbage collection of chunks, and chunk migration

The master periodically exchanges HeartBeat messages with each chunkserver to give it instructions and check its state

Clients talk to the master for metadata, but read and write data directly with the chunkservers.

Neither clients nor chunkservers cache file data

This is because the files are large relative to any client cache

Clients do cache metadata

Chunkservers still benefit from the Linux buffer cache. Not having to think about cache coherence keeps the system simple.

Single master

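The client computes a chunk index from the byte offset and the fixed chunk size, then sends the file name and chunk index to the master to ask which chunkservers hold the data

The answer is cached for a limited time, during which the client can access that chunk without contacting the master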
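
The lookup described above can be sketched roughly as follows. This is only an illustration, not the real GFS client: the names `chunk_index` and `LocationCache`, the `lookup` callback standing in for the RPC to the master, and the 60-second expiry are all assumptions made for this sketch.

```python
import time

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size reported in the paper


def chunk_index(byte_offset: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Translate a byte offset within a file into a chunk index."""
    if byte_offset < 0:
        raise ValueError("byte_offset must be non-negative")
    return byte_offset // chunk_size


class LocationCache:
    """Hypothetical client-side cache: (filename, chunk index) -> master's answer."""

    def __init__(self, lookup, ttl_seconds: float = 60.0, clock=time.monotonic):
        self._lookup = lookup      # stand-in for the request to the master
        self._ttl = ttl_seconds    # assumed expiry; the real value is not given here
        self._clock = clock
        self._entries = {}         # key -> (expiry time, cached answer)

    def locate(self, filename: str, byte_offset: int):
        key = (filename, chunk_index(byte_offset))
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]        # cache hit: no master round trip
        value = self._lookup(*key) # cache miss or expired: ask the master
        self._entries[key] = (now + self._ttl, value)
        return value
```

Two reads that fall inside the same 64 MB chunk share one cache entry, so only the first one reaches the master until the entry expires.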

Chunk size

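Set to 64 MB

Advantages of a large chunk size

Clients need to interact with the master less often

A client tends to spend a while working on one chunk, so its TCP connection to a single chunkserver stays open for longer, which reduces network overhead

Less metadata has to be stored on the master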
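
To get a feel for the first advantage, here is a small back-of-the-envelope calculation. It assumes a client reading a whole 1 GiB file sequentially and needing at most one location lookup per chunk; the function name `chunks_touched` is made up for this sketch.

```python
def chunks_touched(file_size: int, chunk_size: int) -> int:
    """Number of chunks covered by a sequential read of the whole file."""
    return -(-file_size // chunk_size)  # ceiling division


MIB = 1024 ** 2
GIB = 1024 ** 3

for size_mib in (1, 8, 64):
    # With 64 MiB chunks a 1 GiB file spans 16 chunks; with 1 MiB chunks, 1024.
    print(size_mib, chunks_touched(GIB, size_mib * MIB))
```

The same ratio applies to the amount of per-chunk metadata the master has to hold.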

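Disadvantages

A small file consists of only a few chunks

If a file has just one chunk and a large number of clients access it, the chunkservers holding that chunk become hot spots

For now this is worked around by raising the replication factor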

Metadata

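Three main types

file and chunk namespaces

file -> chunk mapping

locations of the chunk replicas

All of it is kept in memory

So it is fast

Because it is fast, the master can periodically scan all of its state efficiently

The scan is used for garbage collection, re-replication after chunkserver failures, and moving chunks to balance load and disk space usage

Running out of memory is a concern, but the metadata is small enough that it is not a problem in practice

If it does run short, just add more memory...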
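
A rough estimate shows why memory is not a practical limit. The paper states that the master keeps less than 64 bytes of metadata per 64 MB chunk; that figure comes from the paper rather than from these notes, and the sketch below treats it as an upper bound. It ignores namespace metadata and assumes files fill their chunks. The function name is made up for illustration.

```python
CHUNK_SIZE = 64 * 1024 * 1024     # 64 MB chunks
BYTES_PER_CHUNK_METADATA = 64     # upper bound quoted in the paper (assumption here)


def chunk_metadata_bytes(total_data_bytes: int) -> int:
    """Upper-bound estimate of master memory needed for per-chunk metadata."""
    chunks = -(-total_data_bytes // CHUNK_SIZE)  # ceiling division
    return chunks * BYTES_PER_CHUNK_METADATA


PIB = 1024 ** 5
GIB = 1024 ** 3

# 1 PiB of file data -> 2**24 chunks -> at most 1 GiB of chunk metadata.
print(chunk_metadata_bytes(PIB) / GIB)
```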

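The namespaces and the mapping are also persisted by the master, but the chunk locations are not

Persistence is via an operation log (local disk and remote machines)

The operation log also serves as a logical timeline that defines the order of concurrent operations

Losing the log would be bad, so the master responds only after the record has been saved both locally and remotely

On recovery, the master replays the log to restore its state

It takes checkpoints periodically to speed up replay

A checkpoint is stored in a compact B-tree-like form that can be mapped directly into memory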
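
The log, checkpoint, and replay cycle can be illustrated with a toy model. Everything here is an assumption made for the sketch: the class names, the three operation types, JSON as the record format, and a Python list standing in for the local disk plus remote copies. The real checkpoint format is B-tree-like, not JSON.

```python
import json


class MasterState:
    """Toy master metadata: path -> list of chunk handles."""

    def __init__(self):
        self.files = {}

    def apply(self, record: dict) -> None:
        op = record["op"]
        if op == "create":
            self.files[record["path"]] = []
        elif op == "add_chunk":
            self.files[record["path"]].append(record["handle"])
        elif op == "delete":
            self.files.pop(record["path"], None)
        else:
            raise ValueError(f"unknown op: {op}")


class OperationLog:
    def __init__(self):
        self.records = []             # stand-in for local disk + remote machines
        self.checkpoint = (0, "{}")   # (records covered, serialized state)

    def append(self, state: MasterState, record: dict) -> None:
        # Log first, then mutate the in-memory state.
        self.records.append(json.dumps(record))
        state.apply(record)

    def take_checkpoint(self, state: MasterState) -> None:
        self.checkpoint = (len(self.records), json.dumps(state.files))

    def recover(self) -> MasterState:
        # Load the latest checkpoint, then replay only the records after it.
        covered, snapshot = self.checkpoint
        state = MasterState()
        state.files = json.loads(snapshot)
        for line in self.records[covered:]:
            state.apply(json.loads(line))
        return state
```

Recovery cost depends on how many records were logged since the last checkpoint, not on the total history, which is the point of checkpointing.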

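Chunk locations are built by asking the chunkservers at master startup and whenever a chunkserver joins

Trying to persist this information on the master would make handling chunkserver failures and the like really painful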
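
A minimal sketch of a location table that is rebuilt purely from chunkserver reports, assuming each chunkserver sends the full list of chunk handles it holds. The class and method names are hypothetical, and the default target of three replicas follows the default mentioned earlier in these notes.

```python
from collections import defaultdict


class LocationTable:
    """Toy in-memory map of chunk handle -> set of chunkserver ids (never persisted)."""

    def __init__(self):
        self.replicas = defaultdict(set)

    def report(self, server_id: str, handles) -> None:
        """Replace everything known about one chunkserver with its latest report."""
        self.forget(server_id)
        for handle in handles:
            self.replicas[handle].add(server_id)

    def forget(self, server_id: str) -> None:
        """Drop a chunkserver that has failed or left the cluster."""
        for handle in list(self.replicas):
            self.replicas[handle].discard(server_id)
            if not self.replicas[handle]:
                del self.replicas[handle]

    def under_replicated(self, target: int = 3):
        """Handles with fewer live replicas than the target."""
        return sorted(h for h, s in self.replicas.items() if len(s) < target)
```

Because the chunkserver is the source of truth for what it actually stores, a failed or renamed server only requires dropping or replacing its entries; there is no persisted copy on the master to reconcile.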

Consistency Model