Visual Document Understanding

FacTool: Factuality Detection in Generative AI - A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios

https://arxiv.org/pdf/2307.13528.pdf

VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding

https://speakerdeck.com/sansan_randd/visfocus-prompt-guided-vision-encoders-for-ocr-free-dense-document-understanding?slide=5

MatCha

https://huggingface.co/docs/transformers/main/model_doc/matcha

DePlot

https://huggingface.co/docs/transformers/main/model_doc/deplot

pix2struct

https://github.com/google-research/pix2struct

donut

https://github.com/clovaai/donut

DocLayout-YOLO

https://github.com/opendatalab/DocLayout-YOLO

docling

https://github.com/DS4SD/docling

Splitting a PDF into sections and registering them in an index with Azure AI Search

https://qiita.com/tmiyata25/items/13bf8321853e74f46c31

Parsing tables in PDFs with Table Transformer and GPT-4V

https://note.com/qunasys/n/nf9ee9a4e5d60?sub_rt=share_h

table transformer

https://github.com/microsoft/table-transformer?tab=readme-ov-file

https://huggingface.co/microsoft/table-transformer-structure-recognition

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

https://arxiv.org/pdf/2403.12895

Paddle OCR Documentation

https://paddlepaddle.github.io/PaddleOCR/latest/en/ppstructure/model_train/recovery_to_doc.html

Enhanced Table Extraction from documents with Form Recognizer

https://techcommunity.microsoft.com/blog/azure-ai-services-blog/enhanced-table-extraction-from-documents-with-form-recognizer/2058011

How to handle the "Excel documents" that trouble LLMs

https://zenn.dev/firstautomation/articles/aed95bce20e900

MarkItDown

https://github.com/microsoft/markitdown

https://x.com/gyakuse/status/1864603368699908606

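① Gemini turns out to be surprisingly good at text detection

② For PDFs without embedded text (images of printed text with very little noise), gpt-4o and Gemini are more accurate than Google Document AI

Both of these points were quite surprising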

Trying a realistic client project with generative AI: mahjong tile object detection

https://qiita.com/sakasegawa/items/676edabb25ed562d61d9

M3DOCRAG

Trying multimodal search with ColQwen2, which can search PDF documents as images

https://acro-engineer.hatenablog.com/entry/2024/12/25/120000

Moondream: structured text, enhanced OCR, gaze detection

https://moondream.ai/blog/introducing-a-new-moondream-1-9b-and-gpu-support

NVIDIA and IQVIA Build Domain-Expert Agentic AI for Healthcare and Life Sciences

https://blogs.nvidia.com/blog/iqvia-agentic-ai-healthcare/

ReaderLM-v2

https://huggingface.co/jinaai/ReaderLM-v2#readerlm-v2

Memo: Trying to convert a PDF containing tables to Markdown

https://bwgift.hatenadiary.jp/entry/2025/03/27/233509

unstructured

https://zenn.dev/kun432/scraps/fa842dad2f8f97

Trying to create Markdown with figures using Docling, #2

https://bwgift.hatenadiary.jp/entry/2025/10/26/193245

Extract text from documents and images with Datalab Marker and OCR

https://replicate.com/blog/datalab-marker-and-ocr-fast-parsing

A Survey and Approach to Chart Classification

https://arxiv.org/abs/2307.04147

From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models

https://arxiv.org/pdf/2403.12027

MinerU

https://github.com/opendatalab/MinerU

Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation

https://arxiv.org/pdf/2510.17354