The limits of one's ability must be broken through by the strength of one's will.
Life is a self-fulfilling prophecy.


```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor

np.random.seed(0)

# Load the data
df = pd.read_csv('/winequality-red.csv')

# The target variable is 'quality'.
Y = df['quality']
X = df[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
        'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
        'pH', 'sulphates', 'alcohol']]

# Split the data into train and test sets:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

# Build the model with the random forest regression algorithm:
model = RandomForestRegressor(max_depth=6, random_state=0, n_estimators=10)
model.fit(X_train, Y_train)
```
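The snippet stops at `model.fit`; a hedged sketch of the natural next step is scoring the fitted forest on the held-out split. Synthetic data stands in for `winequality-red.csv` here so the sketch runs on its own:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 11))            # 11 features, like the wine data
Y = X[:, 0] * 2 + rng.normal(size=300)    # synthetic stand-in for 'quality'

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)

model = RandomForestRegressor(max_depth=6, random_state=0, n_estimators=10)
model.fit(X_train, Y_train)

# Evaluate on the 20% held-out set
pred = model.predict(X_test)
print('test MSE:', mean_squared_error(Y_test, pred))
```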

Notes

  • Pick the categorical variables you want to encode as dummies.
  • Use pd.get_dummies to convert each categorical variable into several dummy columns.
```python
categorical_vars = ['peak', 'business_line', 'gender', 'age_level']
# all_vars is assumed to hold every feature name; the rest are continuous
continuous_vars = set.difference(set(all_vars), set(categorical_vars))

cate_list = []
for i in categorical_vars:
    print(i)
    fe_dummy = pd.get_dummies(X[i])
    cate_list.append(fe_dummy)

dummy_all = pd.concat(cate_list, axis=1)
dummy_all.head()
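The same result can usually be obtained in one call by passing the column list to pd.get_dummies via its columns parameter; a minimal sketch on a made-up DataFrame (the column values here are illustrative, not from the post):

```python
import pandas as pd

# Illustrative data; in the post these columns come from X
X = pd.DataFrame({
    'gender': ['F', 'M', 'F', 'M'],
    'age_level': ['young', 'old', 'old', 'young'],
    'income': [50, 60, 55, 70],
})

# columns= limits encoding to the listed categorical variables;
# continuous columns such as 'income' pass through unchanged
dummy_all = pd.get_dummies(X, columns=['gender', 'age_level'])
print(dummy_all.columns.tolist())
```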

Key Points

  • Convert your time-series data into a matrix with a moving window, so that each row has exactly the number of inputs (n_steps_in) and outputs (n_steps_out) you defined.
  • After training the model, I compute the MSE for each output step; unsurprisingly, the further ahead the step, the larger its MSE.
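The windowing step above can be sketched as follows (the function name split_sequence is my own, not from the post):

```python
import numpy as np

def split_sequence(seq, n_steps_in, n_steps_out):
    """Slide a window over seq, yielding (inputs, outputs) rows."""
    X, y = [], []
    for i in range(len(seq) - n_steps_in - n_steps_out + 1):
        X.append(seq[i:i + n_steps_in])
        y.append(seq[i + n_steps_in:i + n_steps_in + n_steps_out])
    return np.array(X), np.array(y)

series = np.arange(10)                                  # 0..9
X, y = split_sequence(series, n_steps_in=3, n_steps_out=2)
print(X.shape, y.shape)   # each row: 3 inputs, 2 outputs
```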
Read more »

Description

Propensity Score Matching is a sample-matching method: when random sampling is not possible, it can effectively control for confounding factors between groups and avoid selection bias between the two samples.

Algorithm

Reduce X to one dimension with a dimension-reduction step: obtain the propensity score of every sample through logistic regression, then match the samples on that score. The most commonly used approach is Nearest Neighbor Matching (NNM).
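A minimal sketch of that pipeline on synthetic data, using scikit-learn's LogisticRegression for the scores and NearestNeighbors for the matching (all data and names here are illustrative, not from the post):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                          # covariates
t = (X[:, 0] + rng.normal(size=200) > 0).astype(int)   # treatment flag

# Step 1: reduce X to one dimension -- the propensity score
ps = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]

# Step 2: nearest-neighbor matching on the score
treated = ps[t == 1].reshape(-1, 1)
control = ps[t == 0].reshape(-1, 1)
nn = NearestNeighbors(n_neighbors=1).fit(control)
_, idx = nn.kneighbors(treated)   # idx[i] = matched control for treated i

print(len(treated), 'treated units matched to controls')
```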

  I bought a new Mac, but my Hexo blog was deployed on my work computer, so I looked into how to sync the existing deployment to the new machine so that I can post from both devices. After reading many articles and trying many approaches, I finally got it working by following this article step by step; if you need the same setup, it is worth a look:

利用Hexo在多台电脑上提交和更新github pages博客

Sample Size Calculation

```python
import scipy.stats as stats

def sample_size_calculation(mu, sigma, MDE, alpha=0.05, beta=0.2):
    # Two-sample size per group for a relative MDE (a lift of mu * MDE)
    return 2 * (sigma**2) * ((stats.norm.ppf(1 - alpha/2) + stats.norm.ppf(1 - beta))**2) / ((mu * MDE)**2)
```
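A quick sanity check of that formula (the function is restated so the sketch runs on its own; the input numbers are assumed, not from the post):

```python
import math
import scipy.stats as stats

def sample_size_calculation(mu, sigma, MDE, alpha=0.05, beta=0.2):
    return 2 * (sigma**2) * ((stats.norm.ppf(1 - alpha/2) + stats.norm.ppf(1 - beta))**2) / ((mu * MDE)**2)

# Detect a 5% relative lift on a metric with mean 10 and std 2,
# at 5% significance and 80% power
n = sample_size_calculation(mu=10, sigma=2, MDE=0.05)
print(math.ceil(n), 'samples per group')   # roughly 252
```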

Minimum Detectable Effect

```python
import numpy as np
from scipy.stats import norm

sample_size = 1000
alpha = 0.05
z = norm.isf(alpha / 2)
estimated_variance = ds.y.var()  # variance of the outcome in the data
detectable_effect_size = z * np.sqrt(2 * estimated_variance / sample_size)
```
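Note that the snippet above only accounts for the significance threshold. If you also want a power guarantee (here 80%), the usual form adds a z term for beta, mirroring the sample-size formula. A runnable sketch with an assumed variance (the post takes it from ds.y.var()):

```python
import numpy as np
from scipy.stats import norm

sample_size = 1000
alpha, beta = 0.05, 0.2
estimated_variance = 4.0  # assumed here; the post uses ds.y.var()

z = norm.isf(alpha / 2) + norm.isf(beta)   # significance + power terms
mde = z * np.sqrt(2 * estimated_variance / sample_size)
print(round(mde, 3))
```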

Jordan B. Peterson is a psychology professor at the University of Toronto. His book 12 Rules for Life is one of my favorite books.

  • Tell the truth.
  • Do not do things that you hate.
  • Act so that you can tell the truth about how you act.
  • Pursue what is meaningful, not what is expedient.
  • If you have to choose, be the one who does things, instead of the one who is seen to do things.
  • Pay attention.
  • Assume that the person you are listening to might know something you need to know.
  • Listen to them hard enough so that they will share it with you.
  • Plan and work diligently to maintain the romance in your relationships.
  • Be careful who you share good news with.
  • Be careful who you share bad news with.