Extracting text from a PDF

There are several ways to do this

JavaScript

Use pdf.js

javascript - How to correctly extract text from a pdf using pdf.js - Stack Overflow

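Drawback: it does not recognize line breaks well

All you get is the coordinates of each piece of text

To reproduce line breaks, you have to analyze those coordinates yourself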
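The coordinate analysis mentioned above could look roughly like the following. This is only a sketch: it assumes items shaped like those returned by pdf.js `page.getTextContent()` (a `str` plus a `transform` array whose 5th and 6th elements are the x and y position), and both the helper name `itemsToLines` and the `tolerance` parameter are hypothetical. It only handles simple single-column layouts.

```javascript
// Group pdf.js-style text items into lines by comparing their y coordinates.
// `tolerance` (in PDF units) is a hypothetical tuning parameter for this sketch.
function itemsToLines(items, tolerance = 2) {
  // PDF user space has its origin at the bottom-left, so a larger y is higher on the page.
  const sorted = [...items].sort((a, b) => b.transform[5] - a.transform[5]);
  const lines = [];
  for (const item of sorted) {
    const y = item.transform[5];
    const last = lines[lines.length - 1];
    if (last && Math.abs(last.y - y) <= tolerance) {
      last.items.push(item); // close enough vertically: same line
    } else {
      lines.push({ y, items: [item] }); // start a new line
    }
  }
  // Within each line, order the items left to right and concatenate them.
  return lines
    .map(({ items }) =>
      items
        .sort((a, b) => a.transform[4] - b.transform[4])
        .map((item) => item.str)
        .join("")
    )
    .join("\n");
}

// Made-up sample data for illustration
const sample = [
  { str: "world", transform: [1, 0, 0, 1, 40, 700] },
  { str: "Hello ", transform: [1, 0, 0, 1, 10, 700.5] },
  { str: "second line", transform: [1, 0, 0, 1, 10, 680] },
];
console.log(itemsToLines(sample));
```

Multi-column pages or rotated text would need more than a y-coordinate comparison.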

Going through the viewer, the text can be extracted

UserScript that gets image data and text data from the PDF.js viewer

Code that retrieves only the text

code:js
 await (async () => {
   // Scroll through every page of the viewer and yield its canvas and text.
   async function* readPages() {
     let index = 0;
     while (true) {
       // _pages is an internal field of the viewer and may change between versions
       const page = PDFViewerApplication.pdfViewer._pages[index];
       if (!page) break;
       page.div.scrollIntoView();
       yield await new Promise((resolve) => {
         const timer = setInterval(() => {
           // Wait until the page has finished loading
           const canvas = page.div.getElementsByTagName("canvas")?.[0];
           if (!canvas) return;
           if (page.div.getElementsByClassName("loadingIcon").length > 0) return;
           clearInterval(timer);
           // Wait for rendering to finish before returning
           setTimeout(() => {
             const text = page.textLayer?.textContentItemsStr?.join?.("\n") ?? "";
             resolve({ canvas, text });
           }, 2000);
         }, 100);
       });
       index++;
     }
   }

   const pages = [];
   for await (const { text } of readPages()) {
     pages.push(text);
   }

   // Download the per-page text as a JSON file
   const a = document.createElement("a");
   a.href = URL.createObjectURL(new Blob([JSON.stringify(pages)], {
     type: "application/json"
   }));
   const title = document.title.replace?.(/\.pdf$/, "");
   a.download = `${title}.json`;
   document.body.append(a);
   a.click();
   a.remove();
 })();
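The script above saves a JSON array with one string per page. As a rough sketch of how that file might be consumed afterwards, the following joins the pages back into a single text. The helper name `pagesJsonToText` and the default separator are assumptions made for this example, not part of the UserScript.

```javascript
// Join the per-page strings from the downloaded JSON into one text.
// The form feed default separator is an arbitrary choice for this sketch.
function pagesJsonToText(json, separator = "\f") {
  const pages = JSON.parse(json);
  if (!Array.isArray(pages)) {
    throw new TypeError("expected a JSON array of page strings");
  }
  return pages.join(separator);
}

// Made-up sample data for illustration
console.log(pagesJsonToText('["page one", "page two"]', "\n---\n"));
```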

Linux terminal

Use pdftotext

$ pdftotext input.pdf

All the text contained in input.pdf is written to input.txt

input.txt is created in the same directory as input.pdf

$ pdftotext input.pdf output.txt

Use this form when you want to change the output file name/path

To extract only the text from pages 2 through 5, run pdftotext -f 2 -l 5 input.pdf

It handles line breaks well
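If pdftotext needs to be called from a script rather than typed by hand, a Node.js wrapper along these lines might work. It is a minimal sketch that assumes pdftotext is installed and on the PATH. The helper names `buildPdftotextArgs` and `runPdftotext` are hypothetical, and only the `-f` / `-l` options and the optional output path described above are used.

```javascript
// Build the argument list for pdftotext.
// -f and -l select the first and last page to convert.
function buildPdftotextArgs({ input, output, firstPage, lastPage }) {
  const args = [];
  if (firstPage !== undefined) args.push("-f", String(firstPage));
  if (lastPage !== undefined) args.push("-l", String(lastPage));
  args.push(input);
  // Without an output path, pdftotext writes next to the input file as <name>.txt
  if (output !== undefined) args.push(output);
  return args;
}

// Run pdftotext (assumed to be installed); rejects if the command fails.
async function runPdftotext(options) {
  const { execFile } = await import("node:child_process");
  const { promisify } = await import("node:util");
  await promisify(execFile)("pdftotext", buildPdftotextArgs(options));
}

// Show the arguments that would be passed for pages 2 to 5
console.log(buildPdftotextArgs({ input: "input.pdf", firstPage: 2, lastPage: 5 }).join(" "));
```

Calling `runPdftotext({ input: "input.pdf", output: "output.txt" })` would then correspond to the second command shown earlier.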

References

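＠IT：PDFファイルからテキストを抽出するには (How to extract text from a PDF file)

【 pdftotext 】コマンド――PDFファイルからテキストを抽出する：Linux基本コマンドTips（286） - ＠IT (The pdftotext command: extracting text from a PDF file, Linux Basic Command Tips No. 286)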

#2023-05-19 08:01:28

#2023-04-11 15:30:34

#2023-04-09 10:27:05

#2021-02-26 01:25:51

#2021-02-25 22:10:18

#2021-02-15 20:55:32