Visual document understanding
OCR
Image recognition
Image processing theory
Multimodal
Vision Language Model
FacTool: Factuality Detection in Generative AI - A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios
https://arxiv.org/pdf/2307.13528.pdf
VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding
https://speakerdeck.com/sansan_randd/visfocus-prompt-guided-vision-encoders-for-ocr-free-dense-document-understanding?slide=5
MatCha
https://huggingface.co/docs/transformers/main/model_doc/matcha
DePlot
https://huggingface.co/docs/transformers/main/model_doc/deplot
pix2struct
https://github.com/google-research/pix2struct
donut
https://github.com/clovaai/donut
DocLayout-YOLO
https://github.com/opendatalab/DocLayout-YOLO
docling
https://github.com/DS4SD/docling
Splitting a PDF into sections and registering them in an index with Azure AI Search
https://qiita.com/tmiyata25/items/13bf8321853e74f46c31
Parsing tables in PDFs with Table Transformer and GPT-4V
https://note.com/qunasys/n/nf9ee9a4e5d60?sub_rt=share_h
table transformer
https://github.com/microsoft/table-transformer?tab=readme-ov-file
https://huggingface.co/microsoft/table-transformer-structure-recognition
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
https://arxiv.org/pdf/2403.12895
Paddle OCR Documentation
https://paddlepaddle.github.io/PaddleOCR/latest/en/ppstructure/model_train/recovery_to_doc.html
Enhanced Table Extraction from documents with Form Recognizer
https://techcommunity.microsoft.com/blog/azure-ai-services-blog/enhanced-table-extraction-from-documents-with-form-recognizer/2058011
How to handle the "Excel documents" that trouble LLMs
https://zenn.dev/firstautomation/articles/aed95bce20e900
MarkItDown
https://github.com/microsoft/markitdown
https://x.com/gyakuse/status/1864603368699908606
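① Gemini is surprisingly good at text detection
② For PDFs without embedded text (images of printed text with very little noise), gpt-4o and Gemini are more accurate than Google Document AI
Both of these were quite unexpected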
Trying generative AI on a realistic client project: mahjong tile object detection
https://qiita.com/sakasegawa/items/676edabb25ed562d61d9
M3DOCRAG
Trying multimodal search with ColQwen2, which can search PDF documents as images
https://acro-engineer.hatenablog.com/entry/2024/12/25/120000
Moondream structured text, enhanced ocr, gaze detection
https://moondream.ai/blog/introducing-a-new-moondream-1-9b-and-gpu-support
NVIDIA and IQVIA Build Domain-Expert Agentic AI for Healthcare and Life Sciences
https://blogs.nvidia.com/blog/iqvia-agentic-ai-healthcare/
ReaderLM-v2
https://huggingface.co/jinaai/ReaderLM-v2#readerlm-v2
Memo: Trying to convert a PDF containing tables to Markdown
https://bwgift.hatenadiary.jp/entry/2025/03/27/233509