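Polars Expression

Written inside Polars Contexts

The same Expression can produce different results depending on the context it is evaluated in

Evaluated lazily

Reportedly fast because the query is optimized before it runs ref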

docs

user guide

Polars Expression Plugins

An expression is a formula that reads the way you would intuitively write it, like this

code:py

pl.col("weight") / (pl.col("height") ** 2)

Expression expansion

The following two are equivalent

code:py

pl.col("weight", "height").mean().name.prefix("avg_")

code:py

[
    pl.col("weight").mean().alias("avg_weight"),
    pl.col("height").mean().alias("avg_height"),
]

/mrsekut-book-4297141388/127 (4-3 Expression)

categorical

For columns whose values are limited to a fixed set of categories, such as red, yellow, and blue

https://docs.pola.rs/user-guide/expressions/categorical-data-and-enums

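Polars builds a dictionary internally, so this is reportedly faster than using plain strings

Two kinds

Enum

The categories are known in advance

pl.Enum(categories)

Categorical

The categories are unknown or not fixed

pl.Categorical()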

aggregation

df.group_by()

.agg() on the resulting GroupBy (there is no df.agg())

https://docs.pola.rs/user-guide/expressions/aggregation/

code:py

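from datetime import date

import polars as pl

def compute_age() -> pl.Expr:
    # Approximate age in years: current year minus birth year
    return date.today().year - pl.col("birthday").dt.year()

def avg_birthday(gender: str) -> pl.Expr:
    # Mean age over the rows of each group whose gender matches
    return (
        compute_age()
        .filter(pl.col("gender") == gender)
        .mean()
        .alias(f"avg {gender} birthday")
    )

# `dataset` is assumed to be a DataFrame with "state", "gender", and "birthday" columns
q = (
    dataset.lazy()
    .group_by("state")
    .agg(
        avg_birthday("M"),
        avg_birthday("F"),
        (pl.col("gender") == "M").sum().alias("# male"),
        (pl.col("gender") == "F").sum().alias("# female"),
    )
    .limit(5)
)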

Missing values

https://docs.pola.rs/user-guide/expressions/missing-data/#null-and-nan-values

df.null_count()

Number of missing values in each column

pl.Expr.is_null()

Filling in missing data

https://docs.pola.rs/user-guide/expressions/missing-data/#filling-missing-data

pl.Expr.fill_null()

code:py

pl.col("col2").fill_null(pl.lit(2)) # fill with a literal
pl.col("col2").fill_null(strategy="forward") # fill using a strategy
pl.col("col2").fill_null(pl.median("col2")) # fill with an expression
pl.col("col2").interpolate() # fill by interpolation

NaN

Distinct from missing data (null)

https://docs.pola.rs/user-guide/expressions/missing-data/#notanumber-or-nan-values

Converting empty strings to null

code:py

df = df.with_columns(
    pl.when(pl.col("col1") == "")
    .then(None)
    .otherwise(pl.col("col1"))
    .alias("col1")
)

Window functions

list

https://docs.pola.rs/user-guide/expressions/lists-and-arrays/

pl.Expr.explode() turns a list into rows

Lists can also be manipulated as they are, without exploding them

code:py

out = weather.with_columns(pl.col("temperatures").str.split(" ")).with_columns(
    pl.col("temperatures").list.head(3).alias("top3"),
    pl.col("temperatures").list.slice(-3, 3).alias("bottom_3"),
    pl.col("temperatures").list.len().alias("obs"),
)

array

https://docs.pola.rs/user-guide/expressions/lists-and-arrays/

pl.Array

Every row has the same number of elements

Feels more like a tuple than an array? mrsekut.icon

Python user-defined functions

https://docs.pola.rs/user-guide/expressions/user-defined-functions/

You can map with NumPy functions, for example

pl.Expr.map_elements()

pl.Expr.map_batches()

Also works with streaming

(Of course, you can also define your own functions)

code:py

import numpy as np

out = df.select(pl.col("values").map_batches(np.log))

Polars Struct type