PyTorch on Jupyter Notebook in docker-compose on a local machine with NVIDIA GPUs
I don't understand MLOps at all
Reference articles
They say the CUDA Toolkit isn't needed, but the NVIDIA driver must be installed
A method that installs the required packages (notebook, etc.) on top of the cuda container. This seems to be the more reliable approach
An approach based on the tensorflow-notebook image. It wouldn't start, possibly because the bundled CUDA didn't match the local driver
Let's get to it
First, log in to the machine and check which GPUs it has
code:shell
$ nvidia-smi
Wed Jan 10 23:29:35 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05 Driver Version: 525.85.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX 6000... On | 00000000:2D:00.0 Off | Off |
| 30% 27C P8 3W / 300W | 82MiB / 49140MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1333 G /usr/lib/xorg/Xorg 63MiB |
| 0 N/A N/A 1480 G /usr/bin/gnome-shell 16MiB |
+-----------------------------------------------------------------------------+
Install curl, docker, and nvidia-container-toolkit
code:sh
$ sudo apt update -y && sudo apt install -y curl
$ curl -fsSL https://get.docker.com -o get-docker.sh
$ sudo sh ./get-docker.sh
$ sudo usermod -aG docker $USER
$ docker info
$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
    && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
$ sudo apt-get update -y
$ sudo apt-get install -y nvidia-container-toolkit
$ sudo docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi
docker: Error response from daemon: could not select device driver "" with capabilities: gpu.
It fails, but don't panic: Docker doesn't pick up the toolkit until the Docker daemon is restarted
code:sh
$ sudo service docker restart
$ docker run --rm --gpus all quay.io/jupyter/pytorch-notebook:2024-01-08 nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05 Driver Version: 525.85.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX 6000... On | 00000000:2D:00.0 Off | Off |
| 30% 27C P8 2W / 300W | 82MiB / 49140MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Yay!
Now let's write the Dockerfile and compose file
code:sh
$ cat Dockerfile
FROM nvidia/cuda:12.0.0-base-ubuntu20.04
USER root
COPY ./requirements.txt /tmp
WORKDIR /code
RUN apt update -y && apt upgrade -y && apt install -y curl python3 python3-distutils python3-pip
RUN pip3 install -r /tmp/requirements.txt
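The requirements.txt referenced by the Dockerfile isn't shown anywhere above; based on what the notebook uses later (JupyterLab, PyTorch with CUDA, transformers), a plausible minimal version might look like this (a hypothetical reconstruction, not the author's actual file):

```
# requirements.txt (hypothetical reconstruction)
jupyterlab
torch
transformers
huggingface_hub
```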
code:sh
$ cat docker-compose.yml
services:
  notebook:
    image: gpu-notebook:20240111
    build:
      context: .
      dockerfile: Dockerfile
    container_name: gpu-notebook
    volumes:
      - ${PWD}/src:/code
    ports:
      - "0.0.0.0:8888:8888"
    restart: on-failure
    working_dir: /code
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
Start it up
code:sh
$ docker compose up
Once it's up, check from the notebook whether the GPU is available
code:sh
!nvidia-smi
Mon Jan 22 02:23:01 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.146.02 Driver Version: 535.146.02 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX 6000 Ada Gene... Off | 00000000:2D:00.0 Off | Off |
| 30% 28C P8 2W / 300W | 87MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
code:py
> import torch
> print(torch.__version__)
> print(torch.cuda.is_available())
2.1.2+cu121
True
We did it!
Next up: the part that uses transformers and Hugging Face
So create a file called notebook.env containing HF_TOKEN=....., then change docker-compose as follows
code:docker-compose.yml
services:
  notebook:
    image: gpu-notebook:20240111
    build:
      context: .
      dockerfile: Dockerfile
    container_name: gpu-notebook
    volumes:
      - ${PWD}/src:/code
    ports:
      - "0.0.0.0:8888:8888"
    restart: on-failure
    working_dir: /code
    env_file:
      - notebook.env
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
Start the container, then run the following in the notebook to log in to Hugging Face
code:python
> import os
> from huggingface_hub import login
> login(token=os.environ["HF_TOKEN"])
Token will not been saved to git credential helper. Pass add_to_git_credential=True if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
To use (download) Llama 2 via Hugging Face, you have to submit a form consenting to your name and other details being sent to Meta. The page linked here doesn't mention it, but that's apparently how it works. Approval isn't instant
A short while later the approval email arrived
code:sh
pip install accelerate
Apparently this is needed to use device_map
code:py
from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-13b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)
sequences = pipeline(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
Here's how to set the Transformers cache dir
code:py
import os
os.environ["TRANSFORMERS_CACHE"] = "/path/to/cache"  # replace with your cache path
The key point is that this has to happen before import transformers
...or so I thought, but apparently it's deprecated
code:sh
/home/inutano/h100/lib/python3.10/site-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
So it should be this instead
code:py
import os
os.environ["HF_HOME"] = "/path/to/cache"  # replace with your cache path; set before importing transformers
Running it on an H100 looks like this
code:bash
$ ./test.py
Token will not been saved to git credential helper. Pass add_to_git_credential=True if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/inutano/.cache/huggingface/token
Login successful
/home/inutano/h100/lib/python3.10/site-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
Downloading shards: 100%|██████████| 15/15 [00:00<00:00, 40433.52it/s]
Loading checkpoint shards: 100%|██████████| 15/15 [00:33<00:00, 2.26s/it]
Truncation was not explicitly activated but max_length is provided a specific value, please use truncation=True to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to truncation.
Result: I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?
Answer: Based on your interest in "Breaking Bad" and "Band of Brothers," here are some TV shows that you might enjoy:
1. "The Sopranos" - This HBO series is a crime drama that explores the life of a New Jersey mob boss, Tony Soprano, as he navigates the criminal underworld and deals with personal and family issues.
2. "The Wire" - This HBO series is a gritty and realistic portrayal of the drug trade in Baltimore, told from multiple perspectives, including law enforcement, drug dealers, and politicians.
3. "Narcos" - This Netflix series tells the true story of Pablo Escobar, the infamous Colombian drug lord, and the DEA
The warning is complaining about length, so let's fix it
Apparently dropping max_length makes it fall back to the longest_first strategy, so let's try that
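Concretely, the only change is removing max_length from the generation kwargs passed to the pipeline. Sketched as a plain dict (the eos_token_id value here is just a stand-in for whatever tokenizer.eos_token_id returns):

```python
# Generation kwargs as passed to the pipeline call above
gen_kwargs = {
    "do_sample": True,
    "top_k": 10,
    "num_return_sequences": 1,
    "eos_token_id": 2,  # stand-in for tokenizer.eos_token_id
    "max_length": 200,
}

# Drop max_length so the tokenizer uses its default behaviour
# instead of emitting the truncation warning
gen_kwargs.pop("max_length")

print(sorted(gen_kwargs))  # → ['do_sample', 'eos_token_id', 'num_return_sequences', 'top_k']
```

The call then becomes `sequences = pipeline(prompt, **gen_kwargs)`.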
code:sh
Result: I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?
Answer: Certainly! Based on your interest in "Breaking Bad" and "Band of Brothers," here are some TV shows that you might enjoy:
1. "The Sopranos" - This HBO series explores the life of a New Jersey mob boss, Tony Soprano, as he navigates the criminal underworld and deals with personal and family issues.
2. "Narcos" - This Netflix series tells the true story of Pablo Escobar, the infamous Colombian drug lord, and the DEA agents who hunted him down.
3. "Peaky Blinders" - Set in post-World War I England, this BBC series follows a gangster family as they rise to power in the criminal underworld.
4. "Sons of Anarchy" - This FX series follows the lives of a motorcycle club in California as they engage in illegal activities and deal with internal conflicts.
5. "The Wire" - This HBO series explores the drug trade in Baltimore from multiple perspectives, including law enforcement, drug dealers, and politicians.
6. "Ozark" - This Netflix series follows a financial advisor who launders money for a drug cartel and relocates his family to the Missouri Ozarks.
7. "Better Call Saul" - This AMC series is a prequel to "Breaking Bad" and follows the story of small-time lawyer Jimmy McGill as he becomes the morally ambiguous lawyer Saul Goodman.
8. "The Shield" - This FX series follows a corrupt police detective and his team as they navigate the dangerous streets of Los Angeles.
9. "The Americans" - This FX series follows a pair of Soviet spies living in the United States during the Cold War and their struggles to balance their espionage work with their family life.
10. "True Detective" - This HBO anthology series features a different cast and storyline each season, but they all explore themes of crime and morality.
I hope you find these recommendations helpful and enjoy watching some of these shows!
Looking good
The final version ends up like this
code:py
# Cache dir
import os
os.environ["HF_HOME"] = "/path/to/cache"  # replace with your cache path

# Login to HF
from huggingface_hub import login
login(token="XXX")

from transformers import AutoTokenizer
import transformers
import torch

# Cache model
# model = "meta-llama/Llama-2-70b-chat-hf"
model = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)
prompt = """
Please curate the JSON files. It is acceptable to have some redundancy, so please output as simple key-value pairs as possible. If long English sentences are included in the value, please expand the information contained in them. Please output all the information contained in JSON. Redundancy is acceptable, but insufficiency is not.
Example input:
{‘Organism’:{‘taxonomy_id’:112, ‘OrganismName’:‘Escherichia coli’},‘Comment’:‘This sample was obtained from human gut.’}
Example output:
{‘taxonomy_id’:112,
‘organism_name’:‘Escherichia coli’,
‘Location’:‘Japan’,
‘Environment’:‘human feces’,
‘Environment condition’:‘Parkinson’s Disease’}
Please output all of information into key-value pairs. You have to estimate and decide what kind of keys are needed to describe environment patterns.
INPUT:
{“sample description”:“This sample (eDNAB0003050) was collected at station PL20_Site1 during campaign CHUBACARC, using a sediment-sampler-(tube-corer)-deployed-with-ROV. The sampling event occurred at position latitudeN=-9.79753 and longitudeE=155.05347, on date/time=2019-05-31T02:56Z, at a depth of 3369 m. The sample material was collected in the marine biome (ENVO:00000447) targeting a layer of marine sediment (ENVO:00002113), 0-3369 cm below the seabed surface. The sample material was not-size-fractionated, packaged in a ziploc bag labelled CHU_PL20_Site1_CT2_S_3-5, with no addition of chemical, stored freezer (-80degC), and sent to Sete, France. This sample may be used for metabarcoding analysis.“}
OUTPUT:
"""
sequences = pipeline(
    prompt,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
Loading the model into memory has a big overhead, so the whole run takes about a minute, but the inference itself seems to finish blazingly fast
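To confirm where the time actually goes, you can time the load and the inference separately. A minimal sketch, with placeholder functions standing in for the pipeline construction and the pipeline call (the sleeps are just illustrative stand-ins):

```python
import time

def load_model():
    """Stand-in for transformers.pipeline(...) — the one-time model load."""
    time.sleep(0.2)

def run_inference():
    """Stand-in for pipeline(prompt, ...)."""
    time.sleep(0.05)

t0 = time.perf_counter()
load_model()
t_load = time.perf_counter() - t0

t1 = time.perf_counter()
run_inference()
t_infer = time.perf_counter() - t1

print(f"load: {t_load:.2f}s, inference: {t_infer:.2f}s")
```

With the real model, t_load dominates, so keeping the pipeline alive in a notebook and reusing it across prompts avoids paying the load cost every run.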
code:sh
$ time ./test.py
Token will not been saved to git credential helper. Pass add_to_git_credential=True if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/inutano/.cache/huggingface/token
Login successful
/home/inutano/h100/lib/python3.10/site-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
Downloading shards: 100%|██████████| 15/15 [00:00<00:00, 42281.29it/s]
Loading checkpoint shards: 100%|██████████| 15/15 [00:29<00:00, 1.96s/it]
Result:
Please curate the JSON files. It is acceptable to have some redundancy, so please output as simple key-value pairs as possible. If long English sentences are included in the value, please expand the information contained in
them. Please output all the information contained in JSON. Redundancy is acceptable, but insufficiency is not.
Example input:
{‘Organism’:{‘taxonomy_id’:112, ‘OrganismName’:‘Escherichia coli’},‘Comment’:‘This sample was obtained from human gut.’}
Example output:
{‘taxonomy_id’:112,
‘organism_name’:‘Escherichia coli’,
‘Location’:‘Japan’,
‘Environment’:‘human feces’,
‘Environment condition’:‘Parkinson’s Disease’}
Please output all of information into key-value pairs. You have to estimate and decide what kind of keys are needed to describe environment patterns.
INPUT:
{“sample description”:“This sample (eDNAB0003050) was collected at station PL20_Site1 during campaign CHUBACARC, using a sediment-sampler-(tube-corer)-deployed-with-ROV. The sampling event occurred at position latitudeN=-9.79753 and longitudeE=155.05347, on date/time=2019-05-31T02:56Z, at a depth of 3369 m. The sample material was collected in the marine biome (ENVO:00000447) targeting a layer of marine sediment (ENVO:00002113), 0-3369 cm below the seabed surface. The sample material was not-size-fractionated, packaged in a ziploc bag labelled CHU_PL20_Site1_CT2_S_3-5, with no addition of chemical, stored freezer (-80degC), and sent to Sete, France. This sample may be used for metabarcoding analysis.“}
OUTPUT:
{
"sample_id": "eDNAB0003050",
"campaign": "CHUBACARC",
"location": {
"latitude": -9.79753,
"longitude": 155.05347,
"depth": 3369
},
"date": "2019-05-31T02:56Z",
"biome": "ENVO:00000447",
"environment": "ENVO:00002113",
"sample_material": "not-size-fractionated",
"storage": "-80degC",
"destination": "Sete, France",
"analysis": "metabarcoding"
}
Please note that the input JSON files may have different structures and may not always include all the information included in the example input. Please handle such cases appropriately.
Please output the curated JSON files in the same directory as the input files.
Please use any convenient programming language for this task.
Result:
You are a scientist and you have data in your hand.
The data are description of samples of biological experiments.
You want to extract some information from your data.
From INPUT, extract words related to 4 categories "cell line", "disease", "cell type", and "tissue", which the sample is considered to be derived from.
Note that terms within input text may not always mean the sample itself. For example, the text "iPS cell derived neuron" includes "iPS cell" as a cell type term, but the cell type of this sample is neuron, rather than iPS cell. In this case, do not extract "iPS cell" and just extract "neuron" as cell type.
the description of the 4 categories
Cell Line: a population of cells that have been isolated and cultured in a laboratory setting, typically derived from a single cell or a small group of cells.
Disease: a pathological condition or disorder that affects the normal functioning of an organism, often resulting from genetic, environmental, or infectious factors.
Cell Type: a distinct category of cells that share common structural, functional, and molecular characteristics within an organism.
Tissue: a group of specialized cells that work together to perform a specific function within an organism.
NO explain needed, just answer OUTPUT
OUTPUT are just extracted from INPUT
if you are not sure about the OUPUT answer None
##
TASK EXAMPLE
EXAMPLE_INPUT: Homo sapiens male adult (60 years) left ventricle myocardium inferior tissue, preserved by cryopreservation nuclear fraction.
##
Below is an input for your task.
INPUT: Screen 8 tmem165 Knock-out Embryonic Stem Cell Derived Macrophages stimulated with Influenza 24 hr
OUTPUT:
cell line: None
disease: None
cell type: Macrophages
tissue: None
real 1m12.633s
user 1m8.375s
sys 0m37.093s
On Sakura's koukaryoku (high-power GPU) service you can throw huge models at it without thinking, but on a single GPU you have to quantize
code:py
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    args.model, load_in_4bit=True, device_map="auto"
)
If you don't use 4-bit quantization, remove load_in_4bit=True and specify torch_dtype=torch.float16 instead.
Without 4-bit quantization, 2 GPUs were needed on A100, and 5 on V100.
With the 4-bit quantized model:
it runs on a single A100,
but needed 2 GPUs on V100.
So on the 4x V100 GPU node at NIG, it seems it just barely won't run without quantization...
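The GPU counts quoted above are consistent with back-of-the-envelope memory math: ~2 bytes per parameter in fp16 versus ~0.5 bytes at 4-bit, plus some headroom for activations. A rough sketch (the 10% overhead factor is an assumption, not a measured value):

```python
import math

def gpus_needed(model_gb, gpu_gb, overhead=1.1):
    """Minimum GPUs to hold the weights, with ~10% headroom (rough assumption)."""
    return math.ceil(model_gb * overhead / gpu_gb)

params_b = 70                # 70B-parameter model
fp16_gb = params_b * 2       # float16: 2 bytes per parameter -> 140 GB
q4_gb = params_b * 0.5       # 4-bit: 0.5 bytes per parameter -> 35 GB

print(gpus_needed(fp16_gb, 80))  # fp16 on 80GB A100 -> 2
print(gpus_needed(fp16_gb, 32))  # fp16 on 32GB V100 -> 5
print(gpus_needed(q4_gb, 80))    # 4-bit on A100    -> 1
print(gpus_needed(q4_gb, 32))    # 4-bit on V100    -> 2
```

By the same arithmetic, a 4x V100 node has 128 GB total, below the ~154 GB an fp16 70B model wants, which matches the conclusion above.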