Rust Online Mokumoku-kai (Rustオンラインもくもく会) #41

#Rust #2020-05-16 #Rustオンラインもくもく会

https://rust-online.connpass.com/event/175819/

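Hello, this is my first time joining. Nice to meet you all.

There is a blog engine called Plume, and I'd like to make its search feature work with Japanese.

To get there, I'll start by looking into Tantivy, a full-text search engine.

Branch: #rust-mokumoku-41

Goal

At the start, searching for the word 「画像」 ("image") finds nothing in my development environment: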

https://gyazo.com/a43d85d01bd3206058c2f5ea97577d85

I want to make this searchable.

Search in Plume

Plume's search engine uses a tokenizer that splits on whitespace:

https://github.com/Plume-org/Plume/blob/3be842c6536c7c7df54d93c4db0c9e5662a0001b/plume-models/src/search/tokenizer.rs

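The plan for this tokenizer:

make it swappable

make it possible to tokenize with a Japanese tokenizer or with bigrams

But I'll switch the order and try plugging in a Japanese tokenizer first. I think that will keep me more motivated.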
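
To make the problem concrete, here is a minimal sketch in plain Rust (standard library only; these are not Plume's or Tantivy's actual tokenizers) of why splitting on whitespace never produces the token 「画像」 and what bigram tokenizing would produce instead. The function names `whitespace_tokens` and `bigrams` are made up for this illustration.

```rust
/// Split on whitespace only, as a whitespace tokenizer would.
fn whitespace_tokens(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_string).collect()
}

/// Character bigrams: every pair of adjacent characters inside each
/// whitespace-separated chunk. Single-character chunks yield nothing.
fn bigrams(text: &str) -> Vec<String> {
    text.split_whitespace()
        .flat_map(|chunk| {
            let chars: Vec<char> = chunk.chars().collect();
            chars
                .windows(2)
                .map(|pair| pair.iter().collect::<String>())
                .collect::<Vec<String>>()
        })
        .collect()
}

fn main() {
    let text = "画像を検索する";
    // Japanese has no spaces between words, so the whole sentence is one token.
    println!("{:?}", whitespace_tokens(text));
    // Bigrams include 「画像」, so a query for it can match.
    println!("{:?}", bigrams(text));
}
```

Bigrams need no dictionary, but they also index fragments such as 「像を」 that are not words, which is one reason to try a morphological tokenizer first.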

Tantivy

Plume uses Tantivy.

code:plume-models/src/search/searcher.rs
 pub fn create(path: &dyn AsRef<Path>) -> Result<Self> {
     let whitespace_tokenizer = tokenizer::WhitespaceTokenizer.filter(LowerCaser);
     let content_tokenizer = SimpleTokenizer
         .filter(RemoveLongFilter::limit(40))
         .filter(LowerCaser);
     let property_tokenizer = NgramTokenizer::new(2, 8, false).filter(LowerCaser);
     let schema = Self::schema();
     create_dir_all(path).map_err(|_| SearcherError::IndexCreationError)?;
     let index = Index::create(
         MmapDirectory::open(path).map_err(|_| SearcherError::IndexCreationError)?,
         schema,
     )
     .map_err(|_| SearcherError::IndexCreationError)?;
     {
         let tokenizer_manager = index.tokenizers();
         tokenizer_manager.register("whitespace_tokenizer", whitespace_tokenizer);
         tokenizer_manager.register("content_tokenizer", content_tokenizer);
         tokenizer_manager.register("property_tokenizer", property_tokenizer);
         // (excerpt ends here; the rest of the function is omitted)

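Given that, I either need to implement a tokenizer or plug in an existing one.

WhitespaceTokenizer is the whitespace-splitting tokenizer implemented in Plume.

SimpleTokenizer is Tantivy's own tokenizer, which splits on whitespace and punctuation.

They all call something named filter. What is it?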

tantivy::tokenizer::Tokenizer - Rust

Appends a token filter to the current tokenizer.

The method consumes the current TokenStream and returns a new one.

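Ah, I guess this is the kind of thing that can absorb spelling variants or add synonyms to the search targets.

For text in the Latin alphabet, using LowerCaser makes sense.

Japanese has kanji, hiragana, and katakana, the presence or absence of a long-vowel mark (コンピューター vs. コンピュータ), and plenty more. But I'll skip that today.

If no such filter exists yet, that's a chance to contribute to the Rust community.

As it turns out, in Tantivy 0.12.0 you can't call filter directly on a tokenizer.

You have to write TextAnalyzer::from(SimpleTokenizer).filter() instead.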
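
As a rough mental model of what a tokenizer followed by a chain of filters does, here is a sketch in plain Rust using iterator adaptors. It is not Tantivy's API: `content_tokens` is a hypothetical function, and the length cutoff only approximates what RemoveLongFilter does.

```rust
/// Hypothetical stand-in for "tokenizer + RemoveLongFilter + LowerCaser".
fn content_tokens(text: &str, max_len: usize) -> Vec<String> {
    text
        // Tokenizer step: split wherever a character is not alphanumeric.
        .split(|c: char| !c.is_alphanumeric())
        .filter(|token| !token.is_empty())
        // Filter 1: drop tokens whose byte length reaches the limit.
        .filter(|token| token.len() < max_len)
        // Filter 2: normalize case so "PLUME" and "plume" match.
        .map(|token| token.to_lowercase())
        .collect()
}

fn main() {
    println!("{:?}", content_tokens("Searching Images in PLUME!", 40));
}
```

Each filter takes the token stream produced by the previous step and returns a new one, which is why a filter for Japanese spelling variants could be slotted into the same chain.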

Lindera

To use a Japanese tokenizer with Tantivy, you use either Lindera or tantivy-tokenizer-tiny-segmenter.

Configurable tokenizer (stemming available for 17 Latin languages with third party support for Chinese (tantivy-jieba and cang-jie), Japanese (lindera and tantivy-tokenizer-tiny-segmenter) and Korean (lindera + lindera-ko-dic-builder))

I'll try Lindera.

The maintainer has written articles about it:

Rust初心者がRust製の日本語形態素解析器の開発を引き継いでみた (A Rust beginner took over development of a Japanese morphological analyzer written in Rust) - Qiita

LinderaをTantivyで使えるようにした (Making Lindera usable from Tantivy) - Qiita

I swapped content_tokenizer for LinderaTokenizer, and the build failed.

code:shell
    Compiling plume-models v0.4.0 (/home/kitaitimakoto/src/github.com/Plume-org/Plume/plume-models)
 error[E0277]: the trait bound `for<'a> lindera_tantivy::tokenizer::LinderaTokenizer: tantivy::tokenizer::Tokenizer<'a>` is not satisfied
   --> plume-models/src/search/searcher.rs:97:61
 97 |             tokenizer_manager.register("content_tokenizer", japanese_tokenizer);
    |                                                             ^^^^^^^^^^^^^^^^^^ the trait `for<'a> tantivy::tokenizer::Tokenizer<'a>` is not implemented for `lindera_tantivy::tokenizer::LinderaTokenizer`
 error: aborting due to previous error
 For more information about this error, try `rustc --explain E0277`.
 error: could not compile `plume-models`.

I bumped the tantivy version.

That error went away, but different errors appeared:

code:shell
 error[E0437]: type `TokenStreamImpl` is not a member of trait `Tokenizer`
   --> plume-models/src/search/tokenizer.rs:16:5
 16 |     type TokenStreamImpl = WhitespaceTokenStream<'a>;
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ not a member of trait `Tokenizer`
 warning: unused `#[macro_use]` import
   --> plume-models/src/lib.rs:17:1
 17 | #[macro_use]
    | ^^^^^^^^^^^^
    = note: `#[warn(unused_imports)]` on by default
 error[E0107]: wrong number of lifetime arguments: expected 0, found 1
   --> plume-models/src/search/tokenizer.rs:15:20
 15 | impl<'a> Tokenizer<'a> for WhitespaceTokenizer {
    |                    ^^ unexpected lifetime argument
 error[E0220]: associated type `TokenStreamImpl` not found for `Self`
   --> plume-models/src/search/tokenizer.rs:18:52
 18 |     fn token_stream(&self, text: &'a str) -> Self::TokenStreamImpl {
    |                                                    ^^^^^^^^^^^^^^^ associated type `TokenStreamImpl` not found
 error: aborting due to 3 previous errors
 Some errors have detailed explanations: E0107, E0220, E0437.
 For more information about an error, try `rustc --explain E0107`.
 error: could not compile `plume-models`.

After fixing these, the next error said that filter doesn't exist.

After fixing that, the next error was about Future.

code:shell
    Compiling plume-models v0.4.0 (/home/kitaitimakoto/src/github.com/Plume-org/Plume/plume-models)
 error[E0599]: no method named `map_err` found for opaque type `impl std::future::Future` in the current scope
    --> plume-models/src/search/searcher.rs:137:14
 137 |             .map_err(|_| SearcherError::IndexEditionError)?;
     |              ^^^^^^^ method not found in `impl std::future::Future`
     = help: items from traits can only be used if the trait is in scope
     = note: the following trait is implemented but not in scope; perhaps add a `use` for it:
             `use futures_util::future::try_future::TryFutureExt;`
 error: aborting due to previous error
 For more information about this error, try `rustc --explain E0599`.
 error: could not compile `plume-models`.

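Plume (Rocket) doesn't support async yet, so using Futures in just this one place feels painful.

Can I settle on a Tantivy version that both "doesn't use async" and "works with lindera-tantivy"?

lindera-tantivy seems to rely on fairly recent Tantivy APIs, so:

Every version of lindera-tantivy depends on Tantivy ^0.12.0.

What does ^ mean?

Apparently ^1.2.3 := >=1.2.3, <2.0.0 (Specifying Dependencies - The Cargo Book).

So Tantivy can't be lowered below 0.12.0...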
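
One detail worth spelling out: according to the Cargo Book, a caret requirement allows changes that keep the left-most non-zero component fixed, so ^0.12.0 means >=0.12.0, <0.13.0 rather than <1.0.0. The sketch below models that rule for full three-part versions only. It is a simplified illustration, not Cargo's implementation, and the names `parse_version`, `caret_upper_bound`, and `caret_matches` are hypothetical.

```rust
type Version = (u64, u64, u64);

/// Parse exactly "MAJOR.MINOR.PATCH"; anything else gives None.
fn parse_version(text: &str) -> Option<Version> {
    let mut parts = text.split('.').map(|part| part.parse::<u64>().ok());
    let major = parts.next()??;
    let minor = parts.next()??;
    let patch = parts.next()??;
    if parts.next().is_some() {
        return None;
    }
    Some((major, minor, patch))
}

/// Exclusive upper bound: bump the left-most non-zero component.
fn caret_upper_bound((major, minor, patch): Version) -> Version {
    if major > 0 {
        (major + 1, 0, 0)
    } else if minor > 0 {
        (0, minor + 1, 0)
    } else {
        (0, 0, patch + 1)
    }
}

/// Does `candidate` satisfy the caret requirement `^requirement`?
fn caret_matches(requirement: &str, candidate: &str) -> Option<bool> {
    let lower = parse_version(requirement)?;
    let version = parse_version(candidate)?;
    // Tuples compare lexicographically, which matches version ordering here.
    Some(version >= lower && version < caret_upper_bound(lower))
}

fn main() {
    for candidate in ["0.11.0", "0.12.0", "0.12.3", "0.13.0"] {
        println!("^0.12.0 allows {}: {:?}", candidate, caret_matches("0.12.0", candidate));
    }
}
```

Under this rule, Tantivy is pinned to the 0.12.x series as long as lindera-tantivy declares ^0.12.0: it can go neither below 0.12.0 nor up to 0.13.0.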

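IndexWriter.garbage_collect_files returns a Future. Could another method be used in its place?

For today I'll narrow the scope to a proof of concept for Japanese search and, for now, try going without garbage collection...

Result

It's searchable!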

https://gyazo.com/1af9ee0b5c615bc73968409e53ab6c50

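The resulting code is here: 3be842c...01bb7a1

Garbage collection

Tantivy has a method called ManagedDirectory.garbage_collect. It might be usable.

It needs the list of files currently in use as an argument. How can I find that out?

It looks like SegmentUpdate.list_files() is used for that.

Can I trace my way that far?

Come to think of it, there is no error as long as I don't write garbage_collect().await.

Since I'm not awaiting it, I have no way of knowing what the result is.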
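
For background on why an un-awaited call can be a no-op: a future created from an `async fn` in Rust does nothing until something polls it. The sketch below shows that with the standard library only, plus a tiny `block_on` that drives a future to completion from synchronous code. `collect_garbage` is a hypothetical stand-in and not Tantivy's method; whether the future Tantivy returns is lazy in the same way is an assumption I have not verified. The futures crate offers `futures::executor::block_on` for the same purpose.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Wakes the thread that is blocked inside `block_on`.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Minimal executor: poll the future on the current thread until it is ready.
fn block_on<F: Future>(future: F) -> F::Output {
    let mut future = pin!(future);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut context = Context::from_waker(&waker);
    loop {
        match future.as_mut().poll(&mut context) {
            Poll::Ready(output) => return output,
            Poll::Pending => thread::park(),
        }
    }
}

/// Hypothetical stand-in for an async garbage-collection call.
async fn collect_garbage(done: Arc<AtomicBool>) -> Result<(), String> {
    done.store(true, Ordering::SeqCst);
    Ok(())
}

/// Returns whether the work had run (before driving, after driving).
fn observe_laziness() -> (bool, bool) {
    let done = Arc::new(AtomicBool::new(false));
    // Calling the async fn only builds the future; the body has not run yet.
    let pending = collect_garbage(Arc::clone(&done));
    let ran_before = done.load(Ordering::SeqCst);
    // Driving the future runs the body and hands back its Result.
    let result = block_on(pending);
    assert!(result.is_ok());
    (ran_before, done.load(Ordering::SeqCst))
}

fn main() {
    println!("{:?}", observe_laziness());
}
```

Blocking on the future like this would keep the rest of Plume synchronous while still surfacing the Result, so map_err could be applied to the output rather than to the future itself.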

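Things I'd like to do later

Get garbage collection working

I focused on the tokenizer this time, so check whether other parts also need work for Japanese search

Make the tokenizer swappable

Add Lindera to the tokenizer options

Name normalization, handling of spelling variants, and so on

Make the corpus selectable

Is it using IPADIC right now?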