YuisekinAI 2.0

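Goals

Develop and release a general-purpose generative AI foundation model that can comply with OSAID

**Released under the MIT License**

Parameter sizes: 0.5B, 1B, 3B, 7B

**Must run on llama.cpp and ollama**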

memo

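Use synthetic datasets actively, starting from pre-training

Train on corpora such as math and coding first

Then train on natural language

Both quantity and quality matter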
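The ordering in the memo above (math and coding first, natural language afterwards) can be sketched as a simple two-stage curriculum. This is only an illustration under assumptions: the `Corpus` class, the `kind` labels, the `STAGE_ORDER` mapping, and the `hypothetical-synthetic-math` entry are not part of the original notes, and a real pipeline would also need mixing ratios and token budgets per stage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Corpus:
    name: str
    kind: str  # assumed labels: "math", "code", or "natural_language"


# Assumption: math and code share the first stage, natural language is second.
STAGE_ORDER = {"math": 0, "code": 0, "natural_language": 1}


def curriculum(corpora):
    """Order corpora by stage; sorted() is stable, so ties keep input order."""
    return sorted(corpora, key=lambda c: STAGE_ORDER[c.kind])


corpora = [
    Corpus("Project Gutenberg", "natural_language"),
    Corpus("github-code-clean", "code"),
    Corpus("Aozora Bunko text", "natural_language"),
    Corpus("hypothetical-synthetic-math", "math"),  # placeholder, not a real dataset
]

plan = curriculum(corpora)
for position, corpus in enumerate(plan, start=1):
    print(position, corpus.kind, corpus.name)
```

Keeping the stage assignment in one mapping makes it easy to try a different order, for example moving math into its own earlier stage.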

Datasets

Datasets that can comply with OSAID

/YuisekinAI/Pre-training datasets

github-code-clean

Project Gutenberg

Aozora Bunko text

/YuisekinAI/Fine-tuning datasets

databricks-dolly-15k

mathoverflow-accepted

ja-stackoverflow

llm-japanese-dataset-vanilla

Anthropic HH-RLHF

ProsocialDialog

Models for generating synthetic datasets that can comply with OSAID

Model sizes released under a different license are excluded

0.5B, 1.5B, 7B, 14B, 32B

Granite 3.0 MoE

Excluding Llama-derived variants and the like

1.5B, 7B, 14B, 32B

Mistral Small 3

Tokenizers that can comply with OSAID

https://github.com/google/sentencepiece

Apache-2.0 license

https://github.com/mistralai/mistral-common

Apache-2.0 license

Architectures that can comply with OSAID

MistralForCausalLM

https://github.com/huggingface/transformers/blob/main/src/transformers/models/mistral/modeling_mistral.py

Apache-2.0 license

Qwen2ForCausalLM

https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2/modeling_qwen2.py

Apache-2.0 license
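To pick configurations that land near the target sizes (0.5B, 1B, 3B, 7B), it helps to estimate the parameter count of a Mistral-style decoder from its hyperparameters. The following is a rough sketch, assuming grouped-query attention without bias terms, a gated MLP with three projection matrices, and two RMSNorm weights per layer. The function name and arguments are illustrative, and Qwen2ForCausalLM adds bias terms on the query, key, and value projections, which this sketch ignores.

```python
def decoder_param_count(vocab_size, hidden_size, intermediate_size,
                        num_layers, num_heads, num_kv_heads,
                        tie_embeddings=False):
    """Approximate parameter count of a Mistral-style decoder-only model."""
    head_dim = hidden_size // num_heads
    kv_dim = num_kv_heads * head_dim

    # Attention: q_proj and o_proj are hidden x hidden, k_proj and v_proj are hidden x kv_dim.
    attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
    # Gated MLP: gate_proj, up_proj, down_proj.
    mlp = 3 * hidden_size * intermediate_size
    # Two RMSNorm weight vectors per layer.
    norms = 2 * hidden_size
    per_layer = attention + mlp + norms

    embeddings = vocab_size * hidden_size
    lm_head = 0 if tie_embeddings else embeddings
    final_norm = hidden_size
    return embeddings + lm_head + num_layers * per_layer + final_norm


# Sanity check against the published Mistral 7B v0.1 hyperparameters.
mistral_7b = decoder_param_count(
    vocab_size=32000, hidden_size=4096, intermediate_size=14336,
    num_layers=32, num_heads=32, num_kv_heads=8,
)
print(f"{mistral_7b:,}")  # 7,241,732,096
```

The same function can be called with smaller `hidden_size` and `num_layers` values to search for configurations near the smaller target sizes. Setting `tie_embeddings=True` matters most for small models, where the embedding table is a large share of the total.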

Pre-training libraries that can comply with OSAID

https://github.com/microsoft/DeepSpeed

Apache-2.0 license

https://github.com/huggingface/accelerate

Apache-2.0 license

https://github.com/bitsandbytes-foundation/bitsandbytes

MIT license
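The component licenses listed on this page can be kept in a small table and checked automatically. This is a minimal sketch, assuming an allowlist of Apache-2.0 and MIT, which are simply the only licenses that appear above. The allowlist and the `disallowed` helper are illustrative assumptions, not an official OSAID compliance test.

```python
# Assumption: treat only the licenses seen on this page as acceptable.
ALLOWED_LICENSES = {"Apache-2.0", "MIT"}

# Components and licenses as listed on this page.
COMPONENTS = {
    "google/sentencepiece": "Apache-2.0",
    "mistralai/mistral-common": "Apache-2.0",
    "transformers MistralForCausalLM": "Apache-2.0",
    "transformers Qwen2ForCausalLM": "Apache-2.0",
    "microsoft/DeepSpeed": "Apache-2.0",
    "huggingface/accelerate": "Apache-2.0",
    "bitsandbytes-foundation/bitsandbytes": "MIT",
}


def disallowed(components, allowed=ALLOWED_LICENSES):
    """Return the names of components whose license is not on the allowlist."""
    return sorted(name for name, lic in components.items() if lic not in allowed)


problems = disallowed(COMPONENTS)
print("all components allowed" if not problems else f"review needed: {problems}")
```

A check like this only compares license identifiers. Whether a dataset or model actually satisfies OSAID still needs a manual review.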