Bulk Insert, Search Query | Elasticsearch

#Elasticsearch

Bulk API

Load data in bulk from a file

code:bash
 # Download the sample file, then load it through the Bulk API
 curl -O https://download.elastic.co/demos/kibana/gettingstarted/8.x/accounts.zip
 unzip accounts.zip
 curl -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/account/_bulk?pretty' --data-binary @accounts.json
 # Check the data in the Kibana console. If the account index is listed, the load succeeded.
 GET /_cat/indices?v
 # health status index   uuid                   pri rep docs.count docs.deleted store.size pri.store.size
 # yellow open   account cPfNZzy4SSCiec40ugfWOw   1   1       1000            0    395.7kb        395.7kb
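For reference, the body sent to `_bulk` is NDJSON: each document takes two lines, an action line and a source line, and the body has to end with a newline. The following is a minimal sketch of building such a body in Python. The helper name `build_bulk_body` and the sample documents are made up for illustration and are not taken from accounts.json.

```python
import json


def build_bulk_body(docs):
    """Return an NDJSON string with one action line and one source line per document."""
    lines = []
    for doc_id, source in docs:
        # Action line: index the next line as a document with this _id.
        lines.append(json.dumps({"index": {"_id": str(doc_id)}}))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The Bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical documents, only shaped like the accounts data.
sample_docs = [
    (1, {"firstname": "Amber", "age": 32, "gender": "M"}),
    (2, {"firstname": "Hattie", "age": 36, "gender": "M"}),
]

bulk_body = build_bulk_body(sample_docs)
print(bulk_body, end="")
```

Saving this string to a file and passing it with `--data-binary @file` should behave like the accounts.json upload above, assuming the same index name in the URL.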

Query

Match query

code:bash
 # Match query: find records where firstname = 'Amber'
 POST account/_search
 {
   "query": {
     "match": {
       "firstname": "Amber"
     }
   }
 }

 # OR search: find records whose address contains any of "880", "Holmes", or "Lane" (the query text is split into terms on whitespace)
 POST account/_search
 {
   "query": {
     "match": {
       "address": "880 Holmes Lane"
     }
   }
 }

 # AND search: to require all terms instead of any of them, set the operator explicitly
 POST account/_search
 {
   "query": {
     "match": {
       "address": {
         "query": "880 Holmes Lane",
         "operator": "and"
       }
     }
   }
 }

 # minimum_should_match: require at least this many of the terms to match. Useful when AND is too strict and OR is too loose.
 POST account/_search
 {
   "query": {
     "match": {
       "address": {
         "query": "880 Holmes Lane",
         "minimum_should_match": 2
       }
     }
   }
 }
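To make the difference between OR, AND, and minimum_should_match concrete, here is a rough Python model of which documents would count as hits. It is only a sketch: it assumes that lowercasing and splitting on whitespace is close enough to the analyzer, it ignores scoring entirely, and the function names and address values are hypothetical.

```python
def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for the real analyzer.
    return text.lower().split()


def match(field_text, query, operator="or", minimum_should_match=None):
    """Return True if the field would be a hit under this simplified match model."""
    doc_terms = set(tokenize(field_text))
    query_terms = tokenize(query)
    hits = sum(1 for term in query_terms if term in doc_terms)
    if minimum_should_match is not None:
        return hits >= minimum_should_match
    if operator == "and":
        return hits == len(query_terms)
    return hits >= 1


# Hypothetical address values for illustration.
addresses = [
    "880 Holmes Lane",
    "171 Putnam Avenue",
    "105 Holmes Court",
    "702 Quentin Lane Holmes",
]

query = "880 Holmes Lane"
or_hits = [a for a in addresses if match(a, query)]
and_hits = [a for a in addresses if match(a, query, operator="and")]
msm_hits = [a for a in addresses if match(a, query, minimum_should_match=2)]

print("or :", or_hits)
print("and:", and_hits)
print("msm:", msm_hits)
```

In this model the OR search keeps every address that shares at least one term, AND keeps only the address containing all three, and `minimum_should_match: 2` lands in between.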

Match phrase query

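A match query does not take word order into account, whereas a match phrase query does. A hit has to match on both the terms and their order, so the results are narrower and more precise than with a match query.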

code:bash
 # match_phrase: "Kings" followed directly by "Place", with no word in between
 POST account/_search
 {
   "query": {
     "match_phrase": {
       "address": "Kings Place"
     }
   }
 }

 # match_phrase with slop: still order-aware, but slop sets how many positions the terms may be moved apart and still match
 POST account/_search
 {
   "query": {
     "match_phrase": {
       "address": {
         "query": "Kings Place",
         "slop": 1
       }
     }
   }
 }
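The effect of slop can be approximated in a few lines of Python. This is a simplified model, not the actual Lucene implementation: it assumes whitespace tokenization, assumes the phrase has no repeated terms, and treats slop as the allowed spread between term positions after each position is shifted back by the term's offset in the phrase. The function names and addresses are hypothetical.

```python
from itertools import product


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for the real analyzer.
    return text.lower().split()


def phrase_match(field_text, phrase, slop=0):
    """Simplified sloppy phrase check for phrases without repeated terms."""
    doc_terms = tokenize(field_text)
    query_terms = tokenize(phrase)
    # All positions at which each query term occurs in the document.
    positions = [
        [i for i, term in enumerate(doc_terms) if term == q] for q in query_terms
    ]
    if any(not p for p in positions):
        return False  # a query term is missing entirely
    for combo in product(*positions):
        # Shift each position by the term's offset in the phrase.
        # An exact phrase gives identical shifted values (spread 0).
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False


print(phrase_match("467 Kings Place", "Kings Place"))               # exact phrase
print(phrase_match("12 Kings Cross Place", "Kings Place"))          # one word in between
print(phrase_match("12 Kings Cross Place", "Kings Place", slop=1))  # allowed by slop 1
print(phrase_match("3 Place Kings", "Kings Place", slop=1))         # reversed order
print(phrase_match("3 Place Kings", "Kings Place", slop=2))         # reversed needs slop 2
```

Under this model one intervening word costs one unit of slop, while swapping two adjacent terms costs two, so `slop: 1` tolerates a gap but not a reversal.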

Range query

code:bash
 # age between 20 and 30, inclusive
 POST account/_search
 {
   "query": {
     "range": {
       "age": {
         "gte": 20,
         "lte": 30
       }
     }
   }
 }

Bool query

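must: returns documents that match; affects the score

must_not: returns documents that do not match; does not affect the score

should: documents that match score higher; documents that do not match still appear in the results

filter: restricts the results; does not affect the score (perhaps something like HAVING in SQL)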

code:bash
 # must: return documents that match (condition: gender = M)
 POST account/_search
 {
   "query": {
     "bool": {
       "must": [
         {
           "match": {
             "gender": "M"
           }
         }
       ]
     }
   }
 }

 # must_not: exclude documents that match (conditions: gender = M and state != IL)
 POST account/_search
 {
   "query": {
     "bool": {
       "must": [
         {
           "match": {
             "gender": "M"
           }
         }
       ],
       "must_not": [
         {
           "match": {
             "state": "IL"
           }
         }
       ]
     }
   }
 }

 # should: matching documents score higher, but the results are not restricted (conditions: gender = M and state != IL; documents with age < 30 rank higher)
 POST account/_search
 {
   "query": {
     "bool": {
       "must": [
         {
           "match": {
             "gender": "M"
           }
         }
       ],
       "must_not": [
         {
           "match": {
             "state": "IL"
           }
         }
       ],
       "should": [
         {
           "range": {
             "age": {
               "lt": 30
             }
           }
         }
       ]
     }
   }
 }

 # filter: restricts the results without affecting the score (conditions: gender = M and state != IL, further restricted to city = 'Coalmont')
 POST account/_search
 {
   "query": {
     "bool": {
       "must": [
         {
           "match": {
             "gender": "M"
           }
         }
       ],
       "must_not": [
         {
           "match": {
             "state": "IL"
           }
         }
       ],
       "should": [
         {
           "range": {
             "age": {
               "lt": 30
             }
           }
         }
       ],
       "filter": [
         {
           "match": {
             "city": "Coalmont"
           }
         }
       ]
     }
   }
 }
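As a way to check my understanding of the four clauses, here is a toy Python model of a bool query. It is a sketch under strong assumptions: each clause is a plain predicate, every matching must or should clause simply adds 1 to the score (real scores come from BM25), and the case of a query with only should clauses is not modelled. The names `bool_query` and `filter_`, and the documents, are hypothetical.

```python
def bool_query(docs, must=(), must_not=(), should=(), filter_=()):
    """Toy bool query: returns (score, doc) pairs, highest score first."""
    results = []
    for doc in docs:
        if not all(clause(doc) for clause in must):
            continue  # must: required, and counts toward the score
        if not all(clause(doc) for clause in filter_):
            continue  # filter: required, but never scored
        if any(clause(doc) for clause in must_not):
            continue  # must_not: excludes, never scored
        score = len(must) + sum(1 for clause in should if clause(doc))
        results.append((score, doc))
    # Python's sort is stable, so ties keep their original order.
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results


# Hypothetical documents for illustration.
docs = [
    {"name": "a", "gender": "M", "state": "CO", "city": "Coalmont", "age": 25},
    {"name": "b", "gender": "M", "state": "CO", "city": "Coalmont", "age": 40},
    {"name": "c", "gender": "M", "state": "IL", "city": "Coalmont", "age": 22},
    {"name": "d", "gender": "F", "state": "CO", "city": "Coalmont", "age": 28},
    {"name": "e", "gender": "M", "state": "TX", "city": "Dante", "age": 29},
]

is_male = lambda d: d["gender"] == "M"
in_il = lambda d: d["state"] == "IL"
under_30 = lambda d: d["age"] < 30
in_coalmont = lambda d: d["city"] == "Coalmont"

without_filter = bool_query(docs, must=[is_male], must_not=[in_il], should=[under_30])
with_filter = bool_query(
    docs, must=[is_male], must_not=[in_il], should=[under_30], filter_=[in_coalmont]
)

print([(score, d["name"]) for score, d in without_filter])
print([(score, d["name"]) for score, d in with_filter])
```

Adding the filter removes document `e` from the results but leaves the scores of the remaining documents unchanged, which is the behaviour the filter clause is meant to have.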

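How Elasticsearch scores results: BM25

I will not go into the details, but for now here is a summary in my own words.

Three factors raise the score:

1. The more often a search term appears in a document, the higher the score.

2. In addition, rarer search terms (those that appear in fewer documents) raise the score more (term weighting).

3. In addition, for the same number of matched terms, a shorter document scores higher (match density is taken into account).

The first two together make up the rule known as TF-IDF; adding the third apparently gives BM25, the scoring rule Elasticsearch uses.

The article below was the one that made this click for me intuitively. I have given up on understanding the formula, though. (lol)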

https://itdepends.hateblo.jp/entry/2020/01/05/112447
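The three factors can be seen in a small Python sketch of the textbook BM25 formula for a single term. This is an illustration only: the function name and the input numbers are made up, `k1 = 1.2` and `b = 0.75` are the usual BM25 defaults, and the exact constants Elasticsearch applies internally may differ from this form.

```python
import math


def bm25_term_score(freq, doc_len, avg_doc_len, doc_count, docs_with_term,
                    k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one document."""
    # Factor 2: rarer terms (small docs_with_term) get a larger idf weight.
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Factor 3: documents longer than average are penalised.
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Factor 1: higher term frequency raises the score, with diminishing returns.
    return idf * freq * (k1 + 1) / (freq + length_norm)


# Hypothetical numbers: 1000 documents, average field length of 10 terms.
base = bm25_term_score(freq=1, doc_len=10, avg_doc_len=10, doc_count=1000, docs_with_term=50)
more_frequent = bm25_term_score(freq=3, doc_len=10, avg_doc_len=10, doc_count=1000, docs_with_term=50)
rarer_term = bm25_term_score(freq=1, doc_len=10, avg_doc_len=10, doc_count=1000, docs_with_term=5)
shorter_doc = bm25_term_score(freq=1, doc_len=3, avg_doc_len=10, doc_count=1000, docs_with_term=50)

print(f"base          {base:.3f}")
print(f"more frequent {more_frequent:.3f}")
print(f"rarer term    {rarer_term:.3f}")
print(f"shorter doc   {shorter_doc:.3f}")
```

Each of the three variations scores higher than the base case, matching points 1 to 3 above.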

memo

In Elasticsearch 8.x the authentication setup has changed quite a bit, so I kept running into errors.

I therefore decided to use the same versions as the tutorial listed below:

Elasticsearch 7.12.3

Kibana 7.12.3

References

Elasticsearch 入門。その2 (Introduction to Elasticsearch, Part 2)