Converting HTML files exported from Evernote into text files

Evernote can export data in one of the following three formats:

enex

a single html

one html per note

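Opening a "one html per note" file shows that it is full of div and br tags, which makes it awkward to handle as a text file.

CSS declarations and similar markup also bloat the file contents.

I want files that contain only the text → process them with code.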

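Rough idea

Get the list of files in a specified folder. Among them, target only the files whose extension is html. Pick out just the needed part of the html body and generate the body text. Create a new file named after the original file's title + .txt and save the body text there.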
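The first and last steps of this plan (listing the folder, keeping only html files, and deciding the output name) can be sketched on their own. This is only an illustration under assumptions: `plan_conversions` is a hypothetical helper name, and the folder names passed to it are placeholders, not something fixed by the plan.

```python
from pathlib import Path


def plan_conversions(src_dir, dst_dir):
    """Pair each .html file in src_dir with the .txt path it would be saved to."""
    src = Path(src_dir)
    dst = Path(dst_dir)
    pairs = []
    for path in sorted(src.iterdir()):
        # Only files whose extension is .html are targets.
        if path.is_file() and path.suffix == ".html":
            # Original title + .txt, placed in the destination folder.
            pairs.append((path, dst / (path.stem + ".txt")))
    return pairs
```

Calling `plan_conversions("html", "text")` would return (source, destination) pairs without reading or writing anything, so the naming can be checked before the real conversion runs.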

Implementation is still to come.

I plan to write it in Python.

Reference: Python Tips: Python の標準ライブラリだけで HTML からテキストを抽出したい (extracting text from HTML using only the Python standard library) - Life with Python
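For comparison, the standard-library-only route could look roughly like the sketch below, built on `html.parser.HTMLParser`. It is not taken from the article above; `TextExtractor` and `html_to_text` are hypothetical names, and treating `<br>` and closing `</div>` as line breaks is an assumption about what reads well for Evernote exports.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping <style> and turning <br> into newlines."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside <style>

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self._skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "style" and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "div":
            # Assumption: each div is one visual line.
            self.parts.append("\n")

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

This avoids an extra dependency, but every tag rule has to be written by hand, which is one reason to prefer a parser library for anything more involved.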

I'll work through it while asking Bard questions.

Bard says to install bs4 with pip (the package name on PyPI is beautifulsoup4).

code:sample.py

# convertEnToText.py
# Convert the Evernote per-note html files in html/ into text files in text/.
import glob
import os
import re

import bs4

# Make sure the output folder exists before writing into it.
os.makedirs("text", exist_ok=True)

# Get the list of files in the html/ folder.
files = glob.glob("html/*")

# Process them one by one in a loop.
for file in files:
    # Only handle files whose extension is .html.
    if not file.endswith(".html"):
        continue

    # UTF-8 is assumed here for both the export and the output.
    with open(file, "r", encoding="utf-8") as f:
        html = f.read()
    soup = bs4.BeautifulSoup(html, "html.parser")

    # Convert br tags into newlines first.
    for br in soup.find_all("br"):
        br.replace_with("\n")

    # Remove unneeded tags (the first match of each name).
    for name in ("icons", "style", "note-attributes", "h1"):
        tag = soup.find(name)
        if tag:
            tag.decompose()
    for meta in soup.find_all("meta"):
        meta.decompose()

    body = soup.find("en-note")
    # File name without the folder part, with .html swapped for .txt.
    new_file = os.path.split(file)[1].replace(".html", ".txt")

    if body:
        body_inner_text = body.get_text("\n")
        # This produces a lot of useless newlines, so remove them.
        re_body = re.sub("\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n", "", body_inner_text)
        text = re.sub("\n\n\n", "\n\n", re_body)
    else:
        # No en-note element was found: write a placeholder.
        text = "none"

    with open(os.path.join("text", new_file), "w", encoding="utf-8") as f:
        f.write(text)
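The newline cleanup in the script relies on two fixed patterns, so it only matches runs of exactly that shape. A more general variant, offered as a sketch rather than a replacement, could squeeze any run of blank lines with a quantifier. `collapse_blank_lines` is a hypothetical helper name, and trimming trailing spaces on every line is an assumption about the desired output.

```python
import re


def collapse_blank_lines(text):
    """Squeeze runs of blank lines so at most one empty line remains."""
    # Drop spaces or tabs at the end of every line, so that
    # whitespace-only lines become truly empty.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Any run of three or more newlines becomes exactly two.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Trim leading/trailing newlines and end with a single one.
    return text.strip("\n") + "\n"
```

Applied to the result of `body.get_text("\n")`, this would leave paragraphs separated by a single empty line regardless of how many br or div tags the note contained.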