PyTorch on Jupyter Notebook in docker-compose on a local machine with NVIDIA GPUs
I don't understand MLOps at all
Reference articles
They say the CUDA Toolkit isn't needed, but the NVIDIA driver must be installed
A method that installs the required packages (notebook, etc.) on top of the cuda container. This seems to be the more reliable approach
An approach based on the tensorflow-notebook image. It wouldn't start, possibly because the bundled CUDA didn't match the local driver
Let's get to it
First, log in to the machine and check which GPUs it has
code:shell
$ nvidia-smi
Wed Jan 10 23:29:35 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05 Driver Version: 525.85.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX 6000... On | 00000000:2D:00.0 Off | Off |
| 30% 27C P8 3W / 300W | 82MiB / 49140MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1333 G /usr/lib/xorg/Xorg 63MiB |
| 0 N/A N/A 1480 G /usr/bin/gnome-shell 16MiB |
+-----------------------------------------------------------------------------+
Install curl, docker, and nvidia-container-toolkit
code:sh
$ sudo apt update -y && sudo apt install -y curl
$ curl -fsSL https://get.docker.com -o get-docker.sh
$ sudo sh ./get-docker.sh
$ sudo usermod -aG docker $USER
$ docker info
$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
    && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
$ sudo apt-get update -y
$ sudo apt-get install -y nvidia-container-toolkit
$ sudo docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi
docker: Error response from daemon: could not select device driver "" with capabilities: gpu.
It fails, but don't panic: Docker doesn't pick up the toolkit until the Docker daemon is restarted
code:sh
$ sudo service docker restart
$ docker run --rm --gpus all quay.io/jupyter/pytorch-notebook:2024-01-08 nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05 Driver Version: 525.85.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX 6000... On | 00000000:2D:00.0 Off | Off |
| 30% 27C P8 2W / 300W | 82MiB / 49140MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Yay!
Now let's write the Dockerfile and compose file
code:sh
$ cat Dockerfile
FROM nvidia/cuda:12.0.0-base-ubuntu20.04
USER root
COPY ./requirements.txt /tmp
WORKDIR /code
RUN apt update -y && apt upgrade -y && apt install -y curl python3 python3-distutils python3-pip
RUN pip3 install -r /tmp/requirements.txt
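The requirements.txt referenced by the Dockerfile isn't shown anywhere above; based on what the notebook uses later (JupyterLab, PyTorch with CUDA, transformers), a plausible minimal version might look like this (a hypothetical reconstruction, not the author's actual file):

```
# requirements.txt (hypothetical reconstruction)
jupyterlab
torch
transformers
huggingface_hub
```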
code:sh
$ cat docker-compose.yml
services:
  notebook:
    image: gpu-notebook:20240111
    build:
      context: .
      dockerfile: Dockerfile
    container_name: gpu-notebook
    volumes:
      - ${PWD}/src:/code
    ports:
      - "0.0.0.0:8888:8888"
    restart: on-failure
    working_dir: /code
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
Start it up
code:sh
$ docker compose up
Once it's up, check from the notebook whether the GPU is available
code:sh
!nvidia-smi
Mon Jan 22 02:23:01 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.146.02 Driver Version: 535.146.02 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX 6000 Ada Gene... Off | 00000000:2D:00.0 Off | Off |
| 30% 28C P8 2W / 300W | 87MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
code:py
> import torch
> print(torch.__version__)
> print(torch.cuda.is_available())
2.1.2+cu121
True
We did it!
Next up: the part that uses transformers and Hugging Face
So create a file called notebook.env containing HF_TOKEN=....., then change docker-compose as follows
code:docker-compose.yml
services:
  notebook:
    image: gpu-notebook:20240111
    build:
      context: .
      dockerfile: Dockerfile
    container_name: gpu-notebook
    volumes:
      - ${PWD}/src:/code
    ports:
      - "0.0.0.0:8888:8888"
    restart: on-failure
    working_dir: /code
    env_file:
      - notebook.env
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
Start the container, then run the following in the notebook to log in to Hugging Face
code:python
> import os
> from huggingface_hub import login
> login(token=os.environ["HF_TOKEN"])
Token will not been saved to git credential helper. Pass add_to_git_credential=True if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
To use (download) Llama 2 via Hugging Face, you have to submit a form consenting to your name and other details being sent to Meta. The page linked here doesn't mention it, but that's apparently how it works. Approval isn't instant
A short while later the approval email arrived
code:sh
pip install accelerate
Apparently this is needed to use device_map
code:py
from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-13b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)
sequences = pipeline(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
Here's how to set the Transformers cache dir
code:py
import os
os.environ["TRANSFORMERS_CACHE"] = "/path/to/cache"  # replace with your cache path
The key point is that this has to happen before import transformers
...or so I thought, but apparently it's deprecated
code:sh
/home/inutano/h100/lib/python3.10/site-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
So it should be this instead
code:py
import os
os.environ["HF_HOME"] = "/path/to/cache"  # replace with your cache path; set before importing transformers
Running it on an H100 looks like this
code:bash
$ ./test.py
Token will not been saved to git credential helper. Pass add_to_git_credential=True if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/inutano/.cache/huggingface/token
Login successful
/home/inutano/h100/lib/python3.10/site-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
Downloading shards: 100%|██████████| 15/15 [00:00<00:00, 40433.52it/s]
Loading checkpoint shards: 100%|██████████| 15/15 [00:33<00:00, 2.26s/it]
Truncation was not explicitly activated but max_length is provided a specific value, please use truncation=True to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to truncation.
Result: I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?
Answer: Based on your interest in "Breaking Bad" and "Band of Brothers," here are some TV shows that you might enjoy:
1. "The Sopranos" - This HBO series is a crime drama that explores the life of a New Jersey mob boss, Tony Soprano, as he navigates the criminal underworld and deals with personal and family issues.
2. "The Wire" - This HBO series is a gritty and realistic portrayal of the drug trade in Baltimore, told from multiple perspectives, including law enforcement, drug dealers, and politicians.
3. "Narcos" - This Netflix series tells the true story of Pablo Escobar, the infamous Colombian drug lord, and the DEA
The warning is complaining about length, so let's fix it
Apparently dropping max_length makes it fall back to the longest_first strategy, so let's try that
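Concretely, the only change is removing max_length from the generation kwargs passed to the pipeline. Sketched as a plain dict (the eos_token_id value here is just a stand-in for whatever tokenizer.eos_token_id returns):

```python
# Generation kwargs as passed to the pipeline call above
gen_kwargs = {
    "do_sample": True,
    "top_k": 10,
    "num_return_sequences": 1,
    "eos_token_id": 2,  # stand-in for tokenizer.eos_token_id
    "max_length": 200,
}

# Drop max_length so the tokenizer uses its default behaviour
# instead of emitting the truncation warning
gen_kwargs.pop("max_length")

print(sorted(gen_kwargs))  # → ['do_sample', 'eos_token_id', 'num_return_sequences', 'top_k']
```

The call then becomes `sequences = pipeline(prompt, **gen_kwargs)`.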
code:sh
Result: I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?
Answer: Certainly! Based on your interest in "Breaking Bad" and "Band of Brothers," here are some TV shows that you might enjoy:
1. "The Sopranos" - This HBO series explores the life of a New Jersey mob boss, Tony Soprano, as he navigates the criminal underworld and deals with personal and family issues.
2. "Narcos" - This Netflix series tells the true story of Pablo Escobar, the infamous Colombian drug lord, and the DEA agents who hunted him down.
3. "Peaky Blinders" - Set in post-World War I England, this BBC series follows a gangster family as they rise to power in the criminal underworld.
4. "Sons of Anarchy" - This FX series follows the lives of a motorcycle club in California as they engage in illegal activities and deal with internal conflicts.
5. "The Wire" - This HBO series explores the drug trade in Baltimore from multiple perspectives, including law enforcement, drug dealers, and politicians.
6. "Ozark" - This Netflix series follows a financial advisor who launders money for a drug cartel and relocates his family to the Missouri Ozarks.
7. "Better Call Saul" - This AMC series is a prequel to "Breaking Bad" and follows the story of small-time lawyer Jimmy McGill as he becomes the morally ambiguous lawyer Saul Goodman.
8. "The Shield" - This FX series follows a corrupt police detective and his team as they navigate the dangerous streets of Los Angeles.
9. "The Americans" - This FX series follows a pair of Soviet spies living in the United States during the Cold War and their struggles to balance their espionage work with their family life.
10. "True Detective" - This HBO anthology series features a different cast and storyline each season, but they all explore themes of crime and morality.
I hope you find these recommendations helpful and enjoy watching some of these shows!
Looking good
The final version ends up like this
code:py
# Cache dir
import os
os.environ["HF_HOME"] = "/path/to/cache"  # replace with your cache path

# Login to HF
from huggingface_hub import login
login(token="XXX")

from transformers import AutoTokenizer
import transformers
import torch

# Cache model
# model = "meta-llama/Llama-2-70b-chat-hf"
model = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)
prompt = """
Please curate the JSON files. It is acceptable to have some redundancy, so please output as simple key-value pairs as possible. If long English sentences are included in the value, please expand the information contained in them. Please output all the information contained in JSON. Redundancy is acceptable, but insufficiency is not.
Example input:
{‘Organism’:{‘taxonomy_id’:112, ‘OrganismName’:‘Escherichia coli’},‘Comment’:‘This sample was obtained from human gut.’}
Example output:
{‘taxonomy_id’:112,
‘organism_name’:‘Escherichia coli’,
‘Location’:‘Japan’,
‘Environment’:‘human feces’,
‘Environment condition’:‘Parkinson’s Disease’}
Please output all of information into key-value pairs. You have to estimate and decide what kind of keys are needed to describe environment patterns.
INPUT:
{“sample description”:“This sample (eDNAB0003050) was collected at station PL20_Site1 during campaign CHUBACARC, using a sediment-sampler-(tube-corer)-deployed-with-ROV. The sampling event occurred at position latitudeN=-9.79753 and longitudeE=155.05347, on date/time=2019-05-31T02:56Z, at a depth of 3369 m. The sample material was collected in the marine biome (ENVO:00000447) targeting a layer of marine sediment (ENVO:00002113), 0-3369 cm below the seabed surface. The sample material was not-size-fractionated, packaged in a ziploc bag labelled CHU_PL20_Site1_CT2_S_3-5, with no addition of chemical, stored freezer (-80degC), and sent to Sete, France. This sample may be used for metabarcoding analysis.“}
OUTPUT:
"""
sequences = pipeline(
    prompt,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
Loading the model into memory has a big overhead, so the whole run takes about a minute, but the inference itself seems to finish blazingly fast
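To confirm where the time actually goes, you can time the load and the inference separately. A minimal sketch, with placeholder functions standing in for the pipeline construction and the pipeline call (the sleeps are just illustrative stand-ins):

```python
import time

def load_model():
    """Stand-in for transformers.pipeline(...) — the one-time model load."""
    time.sleep(0.2)

def run_inference():
    """Stand-in for pipeline(prompt, ...)."""
    time.sleep(0.05)

t0 = time.perf_counter()
load_model()
t_load = time.perf_counter() - t0

t1 = time.perf_counter()
run_inference()
t_infer = time.perf_counter() - t1

print(f"load: {t_load:.2f}s, inference: {t_infer:.2f}s")
```

With the real model, t_load dominates, so keeping the pipeline alive in a notebook and reusing it across prompts avoids paying the load cost every run.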
code:sh
$ time ./test.py
Token will not been saved to git credential helper. Pass add_to_git_credential=True if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/inutano/.cache/huggingface/token
Login successful
/home/inutano/h100/lib/python3.10/site-packages/transformers/utils/hub.py:124: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
Downloading shards: 100%|██████████| 15/15 [00:00<00:00, 42281.29it/s]
Loading checkpoint shards: 100%|██████████| 15/15 [00:29<00:00, 1.96s/it]
Result:
Please curate the JSON files. It is acceptable to have some redundancy, so please output as simple key-value pairs as possible. If long English sentences are included in the value, please expand the information contained in
them. Please output all the information contained in JSON. Redundancy is acceptable, but insufficiency is not.
Example input:
{‘Organism’:{‘taxonomy_id’:112, ‘OrganismName’:‘Escherichia coli’},‘Comment’:‘This sample was obtained from human gut.’}
Example output:
{‘taxonomy_id’:112,
‘organism_name’:‘Escherichia coli’,
‘Location’:‘Japan’,
‘Environment’:‘human feces’,
‘Environment condition’:‘Parkinson’s Disease’}
Please output all of information into key-value pairs. You have to estimate and decide what kind of keys are needed to describe environment patterns.
INPUT:
{“sample description”:“This sample (eDNAB0003050) was collected at station PL20_Site1 during campaign CHUBACARC, using a sediment-sampler-(tube-corer)-deployed-with-ROV. The sampling event occurred at position latitudeN=-9.79753 and longitudeE=155.05347, on date/time=2019-05-31T02:56Z, at a depth of 3369 m. The sample material was collected in the marine biome (ENVO:00000447) targeting a layer of marine sediment (ENVO:00002113), 0-3369 cm below the seabed surface. The sample material was not-size-fractionated, packaged in a ziploc bag labelled CHU_PL20_Site1_CT2_S_3-5, with no addition of chemical, stored freezer (-80degC), and sent to Sete, France. This sample may be used for metabarcoding analysis.“}
OUTPUT:
{
"sample_id": "eDNAB0003050",
"campaign": "CHUBACARC",
"location": {
"latitude": -9.79753,
"longitude": 155.05347,
"depth": 3369
},
"date": "2019-05-31T02:56Z",
"biome": "ENVO:00000447",
"environment": "ENVO:00002113",
"sample_material": "not-size-fractionated",
"storage": "-80degC",
"destination": "Sete, France",
"analysis": "metabarcoding"
}
Please note that the input JSON files may have different structures and may not always include all the information included in the example input. Please handle such cases appropriately.
Please output the curated JSON files in the same directory as the input files.
Please use any convenient programming language for this task.
Result:
You are a scientist and you have data in your hand.
The data are description of samples of biological experiments.
You want to extract some information from your data.
From INPUT, extract words related to 4 categories "cell line", "disease", "cell type", and "tissue", which the sample is considered to be derived from.
Note that terms within input text may not always mean the sample itself. For example, the text "iPS cell derived neuron" includes "iPS cell" as a cell type term, but the cell type of this sample is neuron, rather than iPS cell. In this case, do not extract "iPS cell" and just extract "neuron" as cell type.
the description of the 4 categories
Cell Line: a population of cells that have been isolated and cultured in a laboratory setting, typically derived from a single cell or a small group of cells.
Disease: a pathological condition or disorder that affects the normal functioning of an organism, often resulting from genetic, environmental, or infectious factors.
Cell Type: a distinct category of cells that share common structural, functional, and molecular characteristics within an organism.
Tissue: a group of specialized cells that work together to perform a specific function within an organism.
NO explain needed, just answer OUTPUT
OUTPUT are just extracted from INPUT
if you are not sure about the OUPUT answer None
##
TASK EXAMPLE
EXAMPLE_INPUT: Homo sapiens male adult (60 years) left ventricle myocardium inferior tissue, preserved by cryopreservation nuclear fraction.
##
Below is an input for your task.
INPUT: Screen 8 tmem165 Knock-out Embryonic Stem Cell Derived Macrophages stimulated with Influenza 24 hr
OUTPUT:
cell line: None
disease: None
cell type: Macrophages
tissue: None
real 1m12.633s
user 1m8.375s
sys 0m37.093s
On Sakura's koukaryoku (high-power GPU) service you can throw huge models at it without thinking, but on a single GPU you have to quantize
code:py
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    args.model, load_in_4bit=True, device_map="auto"
)
If you don't use 4-bit quantization, remove load_in_4bit=True and specify torch_dtype=torch.float16 instead.
Without 4-bit quantization, 2 GPUs were needed on A100, and 5 on V100.
With the 4-bit quantized model:
it runs on a single A100,
but needed 2 GPUs on V100.
So on the 4x V100 GPU node at NIG, it seems it just barely won't run without quantization...
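The GPU counts quoted above are consistent with back-of-the-envelope memory math: ~2 bytes per parameter in fp16 versus ~0.5 bytes at 4-bit, plus some headroom for activations. A rough sketch (the 10% overhead factor is an assumption, not a measured value):

```python
import math

def gpus_needed(model_gb, gpu_gb, overhead=1.1):
    """Minimum GPUs to hold the weights, with ~10% headroom (rough assumption)."""
    return math.ceil(model_gb * overhead / gpu_gb)

params_b = 70                # 70B-parameter model
fp16_gb = params_b * 2       # float16: 2 bytes per parameter -> 140 GB
q4_gb = params_b * 0.5       # 4-bit: 0.5 bytes per parameter -> 35 GB

print(gpus_needed(fp16_gb, 80))  # fp16 on 80GB A100 -> 2
print(gpus_needed(fp16_gb, 32))  # fp16 on 32GB V100 -> 5
print(gpus_needed(q4_gb, 80))    # 4-bit on A100    -> 1
print(gpus_needed(q4_gb, 32))    # 4-bit on V100    -> 2
```

By the same arithmetic, a 4x V100 node has 128 GB total, below the ~154 GB an fp16 70B model wants, which matches the conclusion above.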