選択の効果 - 西尾泰和の外部脳

選択の効果

#セレクションバイアス

Rでみるセレクションバイアスの例(webarchve)

たとえ学歴と能力に相関があったとしても、能力の高い人だけを取り出した集団の中では学歴と能力が無相関になりえる

claude.iconこの文章は、学歴（偏差値）と能力の関係について、セレクションバイアス（選択バイアス）の観点から分析したものです。主な内容は以下の通りです。

母集団全体では、偏差値と能力には正の相関がある。

能力の高い人だけを選抜した集団では、偏差値と能力の相関はほとんどみられなくなる。

一方、偏差値の高い人だけを選抜した集団では、母集団と同様に偏差値と能力に正の相関がみられる。

ただし、偏差値による選抜は、独立変数（偏差値）の分散を小さくし、推定値の標準誤差を大きくする問題がある。

nishio.iconこの話の流れを再現するJupyter Notebookを生成して

claude.icon以下は、この記事の内容を再現するJupyter Notebookです。

code:py

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

import statsmodels.formula.api as smf

# シミュレーションのセッティング

np.random.seed(1)

N = 100000

ability_mean, ability_std = 0, 1

hensachi_mean, hensachi_std = 50, 10

corr = 0.3

# 多変量正規分布からのサンプリング

cov = [ability_std**2, corr*ability_std*hensachi_std,

corr*ability_std*hensachi_std, hensachi_std**2]

ability, hensachi = np.random.multivariate_normal(ability_mean, hensachi_mean, cov, N).T

# データフレームの作成

df = pd.DataFrame({'ability': ability, 'hensachi': hensachi})

# 散布図の作成

plt.figure(figsize=(6, 6))

sns.scatterplot(x='hensachi', y='ability', data=df, alpha=0.5)

plt.xlabel('偏差値')

plt.ylabel('能力')

plt.show()

# 能力によるセレクション

ability_threshold = 2

df'ability_selected' = df'ability' >= ability_threshold

plt.figure(figsize=(6, 6))

sns.scatterplot(x='hensachi', y='ability', data=df, hue='ability_selected', alpha=0.5)

plt.xlabel('偏差値')

plt.ylabel('能力')

plt.show()

# 偏差値によるセレクション

hensachi_threshold = 70

df'hensachi_selected' = df'hensachi' >= hensachi_threshold

plt.figure(figsize=(6, 6))

sns.scatterplot(x='hensachi', y='ability', data=df, hue='hensachi_selected', alpha=0.5)

plt.xlabel('偏差値')

plt.ylabel('能力')

plt.show()

# 回帰分析

model_all = smf.ols('ability ~ hensachi', data=df).fit()

model_ability = smf.ols('ability ~ hensachi', data=df[df'ability_selected']).fit()

model_hensachi = smf.ols('ability ~ hensachi', data=df[df'hensachi_selected']).fit()

print(model_all.summary())

print(model_ability.summary())

print(model_hensachi.summary())

https://gyazo.com/abf389bb48d7d7a34aa1f1294885be49

code::

OLS Regression Results

==============================================================================

Dep. Variable: ability R-squared: 0.090

Model: OLS Adj. R-squared: 0.090

Method: Least Squares F-statistic: 9896.

Date: Wed, 29 May 2024 Prob (F-statistic): 0.00

Time: 17:04:08 Log-Likelihood: -1.3664e+05

No. Observations: 100000 AIC: 2.733e+05

Df Residuals: 99998 BIC: 2.733e+05

Df Model: 1

Covariance Type: nonrobust

==============================================================================

coef std err t P>|t| 0.025 0.975

------------------------------------------------------------------------------

Intercept -1.4885 0.015 -97.242 0.000 -1.519 -1.459

hensachi 0.0298 0.000 99.481 0.000 0.029 0.030

==============================================================================

Omnibus: 2.020 Durbin-Watson: 2.001

Prob(Omnibus): 0.364 Jarque-Bera (JB): 2.011

Skew: 0.011 Prob(JB): 0.366

Kurtosis: 3.006 Cond. No. 261.

==============================================================================

Notes:

1 Standard Errors assume that the covariance matrix of the errors is correctly specified.

https://gyazo.com/6a30ccea7cefe3cea9b3a1566d3238fe

code::

OLS Regression Results

==============================================================================

Dep. Variable: ability R-squared: 0.021

Model: OLS Adj. R-squared: 0.020

Method: Least Squares F-statistic: 48.42

Date: Wed, 29 May 2024 Prob (F-statistic): 4.48e-12

Time: 17:04:08 Log-Likelihood: -761.44

No. Observations: 2282 AIC: 1527.

Df Residuals: 2280 BIC: 1538.

Df Model: 1

Covariance Type: nonrobust

==============================================================================

coef std err t P>|t| 0.025 0.975

------------------------------------------------------------------------------

Intercept 2.0831 0.042 49.665 0.000 2.001 2.165

hensachi 0.0050 0.001 6.958 0.000 0.004 0.006

==============================================================================

Omnibus: 659.594 Durbin-Watson: 2.020

Prob(Omnibus): 0.000 Jarque-Bera (JB): 1778.136

Skew: 1.527 Prob(JB): 0.00

Kurtosis: 6.062 Cond. No. 344.

==============================================================================

Notes:

1 Standard Errors assume that the covariance matrix of the errors is correctly specified.

https://gyazo.com/71d0b2e89f33ec93038671bfe8e9a1e1

code::

OLS Regression Results

==============================================================================

Dep. Variable: ability R-squared: 0.010

Model: OLS Adj. R-squared: 0.009

Method: Least Squares F-statistic: 22.32

Date: Wed, 29 May 2024 Prob (F-statistic): 2.44e-06

Time: 17:04:08 Log-Likelihood: -3223.3

No. Observations: 2301 AIC: 6451.

Df Residuals: 2299 BIC: 6462.

Df Model: 1

Covariance Type: nonrobust

==============================================================================

coef std err t P>|t| 0.025 0.975

------------------------------------------------------------------------------

Intercept -1.3937 0.441 -3.157 0.002 -2.259 -0.528

hensachi 0.0282 0.006 4.725 0.000 0.017 0.040

==============================================================================

Omnibus: 1.860 Durbin-Watson: 1.924

Prob(Omnibus): 0.395 Jarque-Bera (JB): 1.804

Skew: 0.067 Prob(JB): 0.406

Kurtosis: 3.030 Cond. No. 1.59e+03

==============================================================================

Notes:

1 Standard Errors assume that the covariance matrix of the errors is correctly specified.

2 The condition number is large, 1.59e+03. This might indicate that there are

strong multicollinearity or other numerical problems.

claude.icon提示された3つの回帰分析の結果から、以下のような解釈ができます。

全体のサンプルを使った分析 (N=100,000)

偏差値(hensachi)と能力(ability)の間には正の関係がみられる。

偏差値が1上がると、能力は約0.03ポイント高くなる。

モデルの説明力 (R-squared) は0.090で、それほど高くない。

能力が高い (ability≧2) サンプルのみを使った分析 (N=2,282)

偏差値と能力の関係は全体サンプルに比べて弱くなっている。

偏差値が1上がると、能力は約0.005ポイントしか高くならない。

モデルの説明力は0.021とかなり低い。

偏差値が高い (hensachi≧70) サンプルのみを使った分析 (N=2,301)

偏差値と能力の関係は、全体サンプルとほぼ同じ程度である。

偏差値が1上がると、能力は約0.028ポイント高くなる。

ただし、モデルの説明力は0.010と非常に低い。

これらの結果から、以下のようなことが示唆されます。

能力の高い人々だけを集めたサンプルでは、偏差値と能力の関係がほとんどみられなくなる（セレクションバイアスの影響）。

一方、偏差値の高い人々だけを集めたサンプルでは、全体サンプルと同様の関係が維持される。

ただし、偏差値によるセレクションは、サンプルサイズを小さくし、偏差値のばらつきを小さくするため、モデルの説明力を低下させる。

つまり、特定の属性によってセレクションされたサンプルを使った分析では、母集団全体の関係性を正しく捉えられない可能性があります。特に、分析の対象となる変数（ここでは能力）によるセレクションが行われている場合、その影響は大きいといえます。

#統計 #意外な現象