Getting the most out of Dask
https://gyazo.com/aa70b9d7680ddea44aa9a5154a0e9e7e
https://gyazo.com/4c37503e105f05f646c535f2d66bf3da
Collections
Array: same interface as NumPy
DataFrame: same interface as pandas
Bag: stores arbitrary Python objects
Delayed: lazy evaluation
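Bag is the one collection not shown in the sections below, so here is a minimal sketch of it. The sample records and the threshold are hypothetical, and the synchronous scheduler is chosen only to keep the example simple.

```python
import dask.bag as db

# Hypothetical sample records; a Bag can hold any Python objects.
records = [
    {"name": "alice", "score": 3},
    {"name": "bob", "score": 8},
    {"name": "carol", "score": 5},
    {"name": "dave", "score": 9},
]

bag = db.from_sequence(records, npartitions=2)

# Nothing runs yet: filter and pluck only extend the task graph.
high_scorers = bag.filter(lambda r: r["score"] >= 5).pluck("name")

# compute() triggers the actual work and returns a plain list.
names = high_scorers.compute(scheduler="synchronous")
print(names)
```

As with the other collections, the work happens only when `compute()` is called.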
Array
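code: python
import h5py
import numpy as np
f = h5py.File('myfile.hdf5')
# '/small-data' is a placeholder dataset name; the original note omits how x is defined
x = np.array(f['/small-data'])
# keepdims=True lets the row means broadcast against x
x - x.mean(axis=1, keepdims=True)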
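code: python
import h5py
import dask.array as da
f = h5py.File('myfile.hdf5')
# '/big-data' and the chunk shape are placeholders; the original note omits how x is defined
x = da.from_array(f['/big-data'], chunks=(1000, 1000))
# build the whole expression lazily, then evaluate it once with compute()
(x - x.mean(axis=1, keepdims=True)).compute()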
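The snippets above need an HDF5 file on disk. As a sketch that runs without one, the same row-centering can be tried on a small in-memory array. The array contents and chunk shape here are arbitrary choices for illustration.

```python
import numpy as np
import dask.array as da

# Small in-memory stand-in for the HDF5 dataset (illustrative values).
data = np.arange(12, dtype=float).reshape(3, 4)

# Wrap it as a Dask array split into 2x2 chunks.
x = da.from_array(data, chunks=(2, 2))

# Lazy expression: subtract each row's mean from that row.
centered = x - x.mean(axis=1, keepdims=True)

# compute() evaluates the graph and returns a NumPy array.
result = centered.compute()
print(result)
```

Each row of `result` should average to zero, which is a quick way to check the centering.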
DataFrame
code: python
import pandas as pd
df = pd.read_csv('2015-*-*.csv')
df.groupby(df.user_id).value.mean()
code: python
import dask.dataframe as dd
df = dd.read_csv('2015-*-*.csv')
df.groupby(df.user_id).value.mean().compute()
Example run
code: python
import time
import numpy as np
x = np.random.random((100000, 2000))
t0 = time.time()
q, r = np.linalg.qr(x)
test = np.allclose(x, q.dot(r))
assert test
print(time.time() - t0)
code: python
import dask, time
import dask.array as da
x = da.random.random((100000, 2000), chunks=(10000, 2000))
t0 = time.time()
q, r = da.linalg.qr(x)
test = da.all(da.isclose(x, q.dot(r)))
assert test.compute()
print(time.time() - t0)
The NumPy version runs out of memory and fails with an error.
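A rough size estimate helps explain the difference. The sketch below only counts the bytes of the input array and of one chunk, using the shapes from the two examples and assuming float64 elements, which is what `random.random` produces.

```python
import numpy as np

# Shapes taken from the two examples above.
rows, cols = 100_000, 2_000
chunk_rows, chunk_cols = 10_000, 2_000

itemsize = np.dtype("float64").itemsize  # 8 bytes per element

full_bytes = rows * cols * itemsize
chunk_bytes = chunk_rows * chunk_cols * itemsize
n_chunks = (rows // chunk_rows) * (cols // chunk_cols)

print(f"full array : {full_bytes / 1e9:.1f} GB")
print(f"one chunk  : {chunk_bytes / 1e6:.0f} MB")
print(f"chunk count: {n_chunks}")
```

The NumPy version holds `x`, `q`, and `q.dot(r)` at once, each 1.6 GB, so its peak usage is several times the input size. The Dask version works on 160 MB chunks instead.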
Graph
code: python
def inc(i):
    return i + 1
def add(a, b):
    return a + b
x = 1
y = inc(x)
z = add(y, 10)
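As written, the code above runs immediately. One way to turn the same calls into a task graph is `dask.delayed`. This is a sketch of that idea, not necessarily how the original talk presented it.

```python
from dask import delayed

def inc(i):
    return i + 1

def add(a, b):
    return a + b

x = 1
# Wrapping the functions records the calls instead of running them.
y = delayed(inc)(x)
z = delayed(add)(y, 10)

# The graph is a mapping from task keys to tasks; list the keys.
for key in z.dask:
    print(key)

# compute() walks the graph: inc(1) first, then add(2, 10).
result = z.compute()
print(result)
```

`z.visualize()` can draw the graph, but it needs the optional graphviz dependency.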
Scheduler
https://gyazo.com/28a257f61b2738f1d7248226c8c432d1
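The same graph can be handed to different schedulers. The sketch below uses the single-threaded and threaded ones; `"processes"` and the distributed scheduler are other options. The `inc` helper and the small workload are illustrative only.

```python
import dask
from dask import delayed

def inc(i):
    return i + 1

# Five independent inc tasks feeding one sum task.
total = delayed(sum)([delayed(inc)(i) for i in range(5)])

# Single-threaded scheduler: handy for debugging.
r_sync = total.compute(scheduler="synchronous")

# Thread pool scheduler.
r_threads = total.compute(scheduler="threads")

# The scheduler can also be set for a whole block through the config.
with dask.config.set(scheduler="synchronous"):
    r_config = total.compute()

print(r_sync, r_threads, r_config)
```

All three calls return the same value. Only the way the tasks are executed changes.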