-R- Selecting the regularization parameter

The Lasso estimate depends on the value of the regularization parameter λ, so different values of λ naturally yield different Lasso estimates.
In particular, differences in the Lasso estimate translate directly into different variable-selection results, so choosing the regularization parameter well is important.

In the previous post we applied the Lasso to the US crime data.
Continuing from there, running 10-fold cross-validation for each candidate value of the regularization parameter and plotting the results gives the figure below.

In the figure, the y-axis is the squared-error criterion (the cross-validation value) and the x-axis is the logarithm of the regularization parameter. The bars extending from the red points show the standard error of the ten validation errors computed by 10-fold cross-validation, and the vertical dashed lines mark the positions of the selected regularization parameters.
 
The model corresponding to the regularization parameter that minimizes the CV value (log λ ≈ 3, i.e. λ = 15.16; see lambda.min below) is selected as the optimal model.
 
To produce the figure above, continue from the previous post and enter the following.
 
First, compute the cross-validation values.
> res.cv <- cv.glmnet(x=X, y=y)
 
Next, plot how the CV value changes with λ.
> plot(res.cv, xlab="Logarithmic value of regularization parameter", ylab="Mean-squared error")
 
Print the value of the regularization parameter at which the CV value is minimized.
> res.cv$lambda.min
[1] 15.15854
 
Print the value of the regularization parameter selected by the one-standard-error rule.

> res.cv$lambda.1se
[1] 155.1523
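The meaning of lambda.min and lambda.1se can be sketched in a few lines of Python. The arrays below are hypothetical CV results, not glmnet's internals; only the selection logic is the point:

```python
import numpy as np

# Hypothetical CV results on a descending lambda grid (illustration only).
lambdas = np.array([200.0, 100.0, 50.0, 25.0, 12.0, 6.0, 3.0])
cverr = np.array([9.0, 7.0, 5.5, 5.0, 4.8, 5.1, 5.6])   # mean CV error per lambda
se = np.array([0.6, 0.5, 0.5, 0.4, 0.4, 0.5, 0.6])      # standard error per lambda

# lambda.min: the lambda with the smallest CV error.
i_min = int(np.argmin(cverr))
lam_min = lambdas[i_min]

# lambda.1se: the largest (most regularized) lambda whose CV error is
# within one standard error of the minimum.
threshold = cverr[i_min] + se[i_min]
lam_1se = lambdas[cverr <= threshold].max()

print(lam_min, lam_1se)  # 12.0 25.0
```

With real cv.glmnet output, res.cv$cvm, res.cv$cvsd and res.cv$lambda play the roles of cverr, se and lambdas.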
 

-R- Installing the glmnet package

Launch R from the terminal.

$ R
 
R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin17.7.0 (64-bit)
 
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
 
  Natural language support but running in an English locale
 
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
 
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
 
During startup - Warning messages:
1: Setting LC_COLLATE failed, using "C" 
2: Setting LC_TIME failed, using "C" 
3: Setting LC_MESSAGES failed, using "C" 
4: Setting LC_MONETARY failed, using "C" 
 

If you try to load glmnet when it is not installed, R replies that no such package exists, as follows.

> library(glmnet)
Error in library(glmnet) : there is no package called ‘glmnet’
 

Install it as follows.

> install.packages("glmnet")
 
You are then prompted, as follows, to select a CRAN mirror.
Installing package into ‘/usr/local/lib/R/3.6/site-library’
(as ‘lib’ is unspecified)
--- Please select a CRAN mirror for use in this session ---
Secure CRAN mirrors 
 
 1: 0-Cloud [https]                   2: Algeria [https]                
 3: Australia (Canberra) [https]      4: Australia (Melbourne 1) [https]
 5: Australia (Melbourne 2) [https]   6: Australia (Perth) [https]      
 7: Austria [https]                   8: Belgium (Ghent) [https]        
 9: Brazil (PR) [https]              10: Brazil (RJ) [https]            
11: Brazil (SP 1) [https]            12: Brazil (SP 2) [https]          
13: Bulgaria [https]                 14: Chile [https]                  
15: China (Hong Kong) [https]        16: China (Guangzhou) [https]      
17: China (Lanzhou) [https]          18: China (Shanghai) [https]       
19: Colombia (Cali) [https]          20: Czech Republic [https]         
21: Denmark [https]                  22: Ecuador (Cuenca) [https]       
23: Ecuador (Quito) [https]          24: Estonia [https]                
25: France (Lyon 1) [https]          26: France (Lyon 2) [https]        
27: France (Marseille) [https]       28: France (Montpellier) [https]   
29: France (Paris 2) [https]         30: Germany (Erlangen) [https]     
31: Germany (Göttingen) [https]      32: Germany (Münster) [https]      
33: Germany (Regensburg) [https]     34: Greece [https]                 
35: Hungary [https]                  36: Iceland [https]                
37: Indonesia (Jakarta) [https]      38: Ireland [https]                
39: Italy (Padua) [https]            40: Japan (Tokyo) [https]          
41: Japan (Yonezawa) [https]         42: Korea (Busan) [https]          
43: Korea (Gyeongsan-si) [https]     44: Korea (Seoul 1) [https]        
45: Korea (Ulsan) [https]            46: Malaysia [https]               
47: Mexico (Mexico City) [https]     48: Norway [https]                 
49: Philippines [https]              50: Serbia [https]                 
51: Spain (A Coruña) [https]         52: Spain (Madrid) [https]         
53: Sweden [https]                   54: Switzerland [https]            
55: Turkey (Denizli) [https]         56: Turkey (Mersin) [https]        
57: UK (Bristol) [https]             58: UK (London 1) [https]          
59: USA (CA 1) [https]               60: USA (IA) [https]               
61: USA (KS) [https]                 62: USA (MI 1) [https]             
63: USA (MI 2) [https]               64: USA (OR) [https]               
65: USA (TN) [https]                 66: USA (TX 1) [https]             
67: Uruguay [https]                  68: (other mirrors)                


Here, select 40: Japan (Tokyo).

Selection: 40
 
The installation* then begins.
 

* During the installation, output like the following is displayed.

also installing the dependencies ‘iterators’, ‘foreach’
 
Content type 'application/x-gzip' length 290575 bytes (283 KB)
==================================================
downloaded 283 KB
 
Content type 'application/x-gzip' length 360705 bytes (352 KB)
==================================================
downloaded 352 KB
 
Content type 'application/x-gzip' length 3862714 bytes (3.7 MB)
==================================================
downloaded 3.7 MB
 
* installing *source* package ‘iterators’ ...
** package ‘iterators’ successfully unpacked and MD5 sums checked
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (iterators)
* installing *source* package ‘foreach’ ...
** package ‘foreach’ successfully unpacked and MD5 sums checked
** using staged installation
** R
** demo
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (foreach)
* installing *source* package ‘glmnet’ ...
** package ‘glmnet’ successfully unpacked and MD5 sums checked
** using staged installation
** libs
gfortran  -fPIC  -g -O2  -c glmnet5dpclean.f -o glmnet5dpclean.o
clang -I"/usr/local/Cellar/r/3.6.0_3/lib/R/include" -DNDEBUG   -I/usr/local/opt/gettext/include -I/usr/local/opt/readline/include -I/usr/local/include  -fPIC  -g -O2  -c glmnet_init.c -o glmnet_init.o
clang -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/usr/local/Cellar/r/3.6.0_3/lib/R/lib -L/usr/local/opt/gettext/lib -L/usr/local/opt/readline/lib -L/usr/local/lib -o glmnet.so glmnet5dpclean.o glmnet_init.o -L/usr/local/opt/gcc/lib/gcc/9/gcc/x86_64-apple-darwin17/9.1.0 -L/usr/local/opt/gcc/lib/gcc/9 -lgfortran -lquadmath -lm -L/usr/local/Cellar/r/3.6.0_3/lib/R/lib -lR -lintl -Wl,-framework -Wl,CoreFoundation
ld: warning: text-based stub file /System/Library/Frameworks//CoreFoundation.framework/CoreFoundation.tbd and library file /System/Library/Frameworks//CoreFoundation.framework/CoreFoundation are out of sync. Falling back to library file for linking.
installing to /usr/local/lib/R/3.6/site-library/00LOCK-glmnet/00new/glmnet/libs
** R
** data
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (glmnet)
 
The downloaded source packages are in
‘/private/var/folders/sd/hzc6sg0j5tj5p340n4h992_r0000gn/T/RtmphxoZOU/downloaded_packages’
 

-gnuplot- Installing gnuplot (with AquaTerm)

Installing gnuplot on macOS via Homebrew leaves 'qt' as the only terminal type, which is a real inconvenience.
Previously, it was enough to install via Homebrew with

$ brew install gnuplot --with-aquaterm

but at present this fails with an error, and AquaTerm can no longer be installed together with gnuplot.

Without AquaTerm, you cannot export to .eps files, which is a problem, so I installed gnuplot by the following method.

First, remove the gnuplot that was installed via Homebrew.

Check what Homebrew has installed.

$ brew list

(In practice, many packages besides gnuplot will be listed; the example here shows only gnuplot.)


After confirming that gnuplot is present, remove it with the following command.

$ brew uninstall gnuplot
 
Next, obtain the source code from the gnuplot homepage (at this point I chose ver. 5.2, so the downloaded file is gnuplot-5.2.7.tar). Extract gnuplot-5.2.7.tar and move it to a suitable directory (in the following, assume it was moved to a directory named xxx).

Change into the extracted directory (xxx):
$ cd /Users/xxx/gnuplot-5.2.7


Enter the following command.

$ ./configure --with-readline=builtin --with-aquaterm
 
Entering this command produces a long stream of output like the following (the middle is omitted in this example).
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... ./install-sh -c -d
checking for gawk... no
checking for mawk... no
checking for nawk... no
checking for awk... awk
... <- Omission
gnuplot will install the following additional materials:
 
  cfg file for epslatex terminal: yes
  TeX *.sty for lua/tikz terminal: yes
  TeX files will be installed in /usr/local/texlive/texmf-local/tex/latex/gnuplot
                               (use --with-texdir=DIR to change)
  Help file: ${datarootdir}/gnuplot/5.2/gnuplot.gih
  PostScript prologue files: ${datarootdir}/gnuplot/5.2/PostScript/
 
Next, run make.
$ make
 
make also produces a long stream of output, like the following (the middle is omitted in this example).
/Applications/Xcode.app/Contents/Developer/usr/bin/make  all-recursive
Making all in config
... <- Omission
make[3]: Nothing to be done for `all'.
cp -p ./Gnuplot.app-defaults Gnuplot
make[2]: Nothing to be done for `all-am'.
 

Next, run the following command. At this point you are asked for a password; enter the password you use when logging in as an administrator.

mini:gnuplot-5.2.7 hide$ sudo make install
Password:


Here too, a fair amount of output streams by and the installation completes (the middle is omitted in the example).

Making install in config
make[2]: Nothing to be done for `install-exec-am'.

 

make[2]: Nothing to be done for `install-data-am'.
... <- Omission
make[3]: Nothing to be done for `install-exec-am'.
 .././install-sh -c -d '/usr/local/share/gnuplot/5.2/app-defaults'
 /usr/bin/install -c -m 644 Gnuplot '/usr/local/share/gnuplot/5.2/app-defaults'
 .././install-sh -c -d '/usr/local/share/gnuplot/5.2'
 /usr/bin/install -c -m 644 colors_default.gp colors_podo.gp colors_mono.gp gnuplotrc '/usr/local/share/gnuplot/5.2'
make[2]: Nothing to be done for `install-exec-am'.
 
make[2]: Nothing to be done for `install-data-am'.
 
To confirm, launch gnuplot.

$ gnuplot
 
G N U P L O T
Version 5.2 patchlevel 7    last modified 2019-05-29 
 
Copyright (C) 1986-1993, 1998, 2004, 2007-2018
Thomas Williams, Colin Kelley and many others
 
faq, bugs, etc:   type "help FAQ"
immediate help:   type "help"  (plot window: hit 'h')
 
Terminal type is now 'aqua'
 

If, as above,

Terminal type is now 'aqua'

is displayed, the installation succeeded.

The method above succeeded on macOS High Sierra, but failed on Mojave...

-Python- Principal component analysis

Notes on principal component analysis.
 

To perform principal component analysis, use the scikit-learn package and create an instance with PCA from sklearn.decomposition.

The following example performs principal component analysis on the Davis data.

 

The Davis data (Davis.csv) is assumed to be saved in the same directory as the Jupyter Notebook.

The Davis data is read with pd.read_csv from the pandas package.

Each row of columns 1 and 2 of the data array corresponds to a data point {\bf{x}}_{i} = ( w_{i}, h_{i} ), where w_{i} is the weight [kg] of the i-th person and h_{i} is the height [cm].


Load the packages.

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> import pandas as pd
 
Use PCA from sklearn.
>>> from sklearn.decomposition import PCA
 
Read the data using pandas. The .csv file* to be read is assumed to be in the directory where the REPL is running, so rewrite the path as needed.
>>> dat = pd.read_csv('Davis.csv').values
 
Convert height to [m] and take the logarithm of the values.
>>> logdat = np.log(np.c_[dat[:,1],dat[:,2]/100].astype('float'))

Plot the data.
>>> plt.plot(logdat[:,0], logdat[:,1], '.'); plt.show()
[<matplotlib.lines.Line2D object at 0x11df9dac8>]

[Figure: scatter plot of the log-transformed Davis data]

Perform principal component analysis on the loaded data.
>>> pca = PCA()
>>> pca.fit(logdat)
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
>>> pca.components_
array([[ 0.99672116,  0.08091309],
       [ 0.08091309, -0.99672116]])
>>> 
In the code above, pca.components_ gives the principal components (one per row).

The data point at index 11 is removed as an outlier (in the corresponding row of Davis.csv, F,166,57, weight and height appear to have been recorded swapped).
>>> clean_logdat = np.delete(logdat, 11, axis=0)
 
Perform principal component analysis on the data with the outlier (index 11) removed.
>>> pca = PCA() 
>>> pca.fit(clean_logdat) 
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
>>> pca.components_
array([[ 0.97754866,  0.21070979],
       [-0.21070979,  0.97754866]])
>>> 
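A quick sketch of what pca.components_ contains: its rows are the principal directions, which are orthonormal, and pca.transform is just the projection of the centered data onto those rows. The data below is synthetic (not the Davis set), purely for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 2-D data with one dominant direction (illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.c_[x, 0.3 * x + 0.1 * rng.normal(size=200)]

pca = PCA()
pca.fit(data)
W = pca.components_          # rows = principal directions

# The rows are orthonormal (unit length, mutually orthogonal).
print(np.allclose(W @ W.T, np.eye(2)))

# transform() projects the centered data onto those directions.
scores = (data - data.mean(axis=0)) @ W.T
print(np.allclose(scores, pca.transform(data)))
```

In the Davis example above, the first row of pca.components_ is therefore the direction of largest variance in the (log weight, log height) plane.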
 

----------
* Contents of the .csv file (Davis.csv) read by the code above
sex,weight,height,repwt,repht
M,77,182,77,180
F,58,161,51,159
F,53,161,54,158
M,68,177,70,175
F,59,157,59,155
M,76,170,76,165
M,76,167,77,165
M,69,186,73,180
M,71,178,71,175
M,65,171,64,170
M,70,175,75,174
F,166,57,56,163
F,51,161,52,158
F,64,168,64,165
F,52,163,57,160
F,65,166,66,165
M,92,187,101,185
F,62,168,62,165
M,76,197,75,200
F,61,175,61,171
M,119,180,124,178
F,61,170,61,170
M,65,175,66,173
M,66,173,70,170
F,54,171,59,168
F,50,166,50,165
F,63,169,61,168
F,58,166,60,160
F,39,157,41,153
M,101,183,100,180
F,71,166,71,165
M,75,178,73,175
M,79,173,76,173
F,52,164,52,161
F,68,169,63,170
M,64,176,65,175
F,56,166,54,165
M,69,174,69,171
M,88,178,86,175
M,65,187,67,188
F,54,164,53,160
M,80,178,80,178
F,63,163,59,159
M,78,183,80,180
M,85,179,82,175
F,54,160,55,158
M,73,180,NA,NA
F,49,161,NA,NA
F,54,174,56,173
F,75,162,75,158
M,82,182,85,183
F,56,165,57,163
M,74,169,73,170
M,102,185,107,185
M,64,177,NA,NA
M,65,176,64,172
F,66,170,65,NA
M,73,183,74,180
M,75,172,70,169
M,57,173,58,170
M,68,165,69,165
M,71,177,71,170
M,71,180,76,175
F,78,173,75,169
M,97,189,98,185
F,60,162,59,160
F,64,165,63,163
F,64,164,62,161
F,52,158,51,155
M,80,178,76,175
F,62,175,61,171
M,66,173,66,175
F,55,165,54,163
F,56,163,57,159
F,50,166,50,161
F,50,171,NA,NA
F,50,160,55,150
F,63,160,64,158
M,69,182,70,180
M,69,183,70,183
F,61,165,60,163
M,55,168,56,170
F,53,169,52,175
F,60,167,55,163
F,56,170,56,170
M,59,182,61,183
M,62,178,66,175
F,53,165,53,165
F,57,163,59,160
F,57,162,56,160
M,70,173,68,170
F,56,161,56,161
M,84,184,86,183
M,69,180,71,180
M,88,189,87,185
F,56,165,57,160
M,103,185,101,182
F,50,169,50,165
F,52,159,52,153
F,55,155,NA,154
F,55,164,55,163
M,63,178,63,175
F,47,163,47,160
F,45,163,45,160
F,62,175,63,173
F,53,164,51,160
F,52,152,51,150
F,57,167,55,164
F,64,166,64,165
F,59,166,55,163
M,84,183,90,183
M,79,179,79,171
F,55,174,57,171
M,67,179,67,179
F,76,167,77,165
F,62,168,62,163
M,83,184,83,181
M,96,184,94,183
M,75,169,76,165
M,65,178,66,178
M,78,178,77,175
M,69,167,73,165
F,68,178,68,175
F,55,165,55,163
M,67,179,NA,NA
F,52,169,56,NA
F,47,153,NA,154
F,45,157,45,153
F,68,171,68,169
F,44,157,44,155
F,62,166,61,163
M,87,185,89,185
F,56,160,53,158
F,50,148,47,148
M,83,177,84,175
F,53,162,53,160
F,64,172,62,168
F,62,167,NA,NA
M,90,188,91,185
M,85,191,83,188
M,66,175,68,175
F,52,163,53,160
F,53,165,55,163
F,54,176,55,176
F,64,171,66,171
F,55,160,55,155
F,55,165,55,165
F,59,157,55,158
F,70,173,67,170
M,88,184,86,183
F,57,168,58,165
F,47,162,47,160
F,47,150,45,152
F,55,162,NA,NA
F,48,163,44,160
M,54,169,58,165
M,69,172,68,174
F,59,170,NA,NA
F,58,169,NA,NA
F,57,167,56,165
F,51,163,50,160
F,54,161,54,160
F,53,162,52,158
F,59,172,58,171
M,56,163,58,161
F,59,159,59,155
F,63,170,62,168
F,66,166,66,165
M,96,191,95,188
F,53,158,50,155
M,76,169,75,165
F,54,163,NA,NA
M,61,170,61,170
M,82,176,NA,NA
M,62,168,64,168
M,71,178,68,178
F,60,174,NA,NA
M,66,170,67,165
M,81,178,82,175
M,68,174,68,173
M,80,176,78,175
F,43,154,NA,NA
M,82,181,NA,NA
F,63,165,59,160
M,70,173,70,173
F,56,162,56,160
F,60,172,55,168
F,58,169,54,166
M,76,183,75,180
F,50,158,49,155
M,88,185,93,188
M,89,173,86,173
F,59,164,59,165
F,51,156,51,158
F,62,164,61,161
M,74,175,71,175
M,83,180,80,180
M,81,175,NA,NA
M,90,181,91,178
M,79,177,81,178

-Python- Estimating test error by cross-validation

Notes on estimating test error by K-fold cross-validation.

The K-fold cross-validation procedure is as follows.

  • Learning method and loss: an algorithm \mathcal{A} with tuning parameter \lambda, and a loss \ell ({\bf{z}}, h)
  • Input: data {\bf{z}}_{1}, {\bf{z}}_{2}, \cdots, {\bf{z}}_{n}
  1. Split the input data into K groups D_{1}, D_{2}, \cdots, D_{K} of roughly equal size.
  2. For k = 1, 2, \cdots, K, repeat the following.
    1. Compute h_{\lambda, k} = \mathcal{A} (D^{(k)}, \lambda), where D^{(k)} = D \setminus D_{k}.
    2. Compute \widehat{Err} ( h_{\lambda, k}) = \cfrac{1}{m}\sum_{{\bf{z}} \in D_{k}} \ell ({\bf{z}}, h_{\lambda, k}), where m = |D_{k}|.
  3. \widehat{Err} ( \mathcal{A}; \lambda) = \cfrac{1}{K}\sum_{k = 1}^{K} \widehat{Err} ( h_{\lambda, k} )
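The steps above can be sketched directly in Python. This is a minimal illustration of the procedure itself; A and loss are hypothetical stand-ins for the algorithm and loss of the pseudocode, not any particular library API:

```python
import numpy as np

def kfold_cv_error(A, loss, z, K, lam):
    """Estimate the test error of algorithm A(D, lam) by K-fold cross-validation."""
    n = len(z)
    fold = np.arange(n) % K                      # fold label of each data point
    errs = []
    for k in range(K):
        D_k = z[fold == k]                       # held-out group
        D_rest = z[fold != k]                    # training data D \ D_k
        h = A(D_rest, lam)                       # step 2-1: fit on D^(k)
        errs.append(np.mean([loss(zi, h) for zi in D_k]))  # step 2-2: fold error
    return float(np.mean(errs))                  # step 3: average over the K folds

# Toy check: the "model" is the training mean, the loss is squared error.
z = np.arange(10, dtype=float)
err = kfold_cv_error(lambda D, lam: D.mean(),
                     lambda zi, h: (zi - h) ** 2,
                     z, K=5, lam=None)
print(err)  # 9.375
```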

The regression function is estimated with a decision tree; DecisionTreeRegressor from sklearn.tree is used for the estimation.

Load the packages.

>>> import numpy as np
>>> import scipy as sp
>>> import matplotlib.pyplot as plt
>>> from sklearn.tree import DecisionTreeRegressor


The settings are 100 data points and 10-fold CV (cross-validation).

>>> n = 100; K = 10 
 
Generate the data. x is drawn uniformly from the interval [-2, 2].
>>> x = np.random.uniform(-2, 2, n)
>>> y = np.sin(2 * np.pi * x) / x + np.random.normal(scale = 0.5, size = n)
 
Assign the data to folds and compute the cross-validation errors.
>>> cv_idx = np.tile(np.arange(K), int(np.ceil(n/K)))[: n]
>>> maxdepths = np.arange(2, 10)
>>> cverr = np.array([])
>>> for mp in maxdepths:
...     cverr_lambda = np.array([])
...     for k in range(K):
...             tr_idx = (cv_idx!=k)
...             te_idx = (cv_idx==k)
...             cvx = x[tr_idx]; cvy = y[tr_idx]
...             dtreg = DecisionTreeRegressor(max_depth=mp)
...             dtreg.fit(np.array([cvx]).T, cvy)
...             ypred = dtreg.predict(np.array([x[te_idx]]).T)
...             cverr_lambda = np.append(cverr_lambda, np.mean((y[te_idx] - ypred)**2/2))
...     cverr = np.append(cverr, np.mean(cverr_lambda))
... 
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
.
. *1 (omitted)
.
DecisionTreeRegressor(criterion='mse', max_depth=9, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
>>> plt.scatter(maxdepths, cverr,c='k')  # plot the CV errors
<matplotlib.collections.PathCollection object at 0x122d2cc18>
>>> plt.xlabel("max depth"); plt.ylabel('cv error')
Text(0.5, 0, 'max depth')
Text(0, 0.5, 'cv error')
>>> plt.show()
 

Explanation of the code above

  • The candidate tree depths are set with maxdepths = np.arange(2, 10)
  • The data split for CV is done with cvx = x[tr_idx]; cvy = y[tr_idx]
  • The decision-tree fit is done with dtreg.fit(np.array([cvx]).T, cvy)
  • Prediction is done with ypred = dtreg.predict(np.array([x[te_idx]]).T), and each fold's error with np.mean((y[te_idx] - ypred)**2/2)
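For reference, here is the same depth-selection loop as a self-contained script (no plotting; np.random.default_rng with a fixed seed is used in place of the transcript's unseeded global state):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n, K = 100, 10

# Same data-generating process as above.
x = rng.uniform(-2, 2, n)
y = np.sin(2 * np.pi * x) / x + rng.normal(scale=0.5, size=n)

cv_idx = np.tile(np.arange(K), int(np.ceil(n / K)))[:n]  # fold labels
maxdepths = np.arange(2, 10)

cverr = []
for mp in maxdepths:
    fold_errs = []
    for k in range(K):
        tr, te = cv_idx != k, cv_idx == k
        dtreg = DecisionTreeRegressor(max_depth=mp)
        dtreg.fit(x[tr].reshape(-1, 1), y[tr])          # fit on the K-1 training folds
        ypred = dtreg.predict(x[te].reshape(-1, 1))     # predict the held-out fold
        fold_errs.append(np.mean((y[te] - ypred) ** 2 / 2))
    cverr.append(np.mean(fold_errs))

best_depth = int(maxdepths[int(np.argmin(cverr))])
print(best_depth)
```

The depth with the smallest CV error is then the one selected from the plot.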


The Jupyter Notebook for this example is available on GitHub in Statistics.ipynb, under "Estimation of test error by cross validation method".

 

----------
The *1 portion omitted from the REPL output above is displayed as follows.

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
(The same DecisionTreeRegressor(...) output repeats for max_depth = 5 through 9; omitted here.)

 

*1:(y[te_idx]-ypred)**2/2

-Python- Loss function, training error, and test error

This is a note with an example implementation that plots the training error for 10 datasets of 20 data points each, alongside the test error.

First, load the packages.

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> import scipy.stats
>>> 
 


Set the parameter range.

>>> par = np.linspace(-3,3,50)


Define the test error. For data z drawn from N(0,1) and squared loss (z - θ)²/2, the expected loss is (1 + θ²)/2.

>>> te_err = (1+par**2)/2
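
The curve (1+par**2)/2 is the expected loss E[(z - θ)²/2] for z ~ N(0,1). As a quick sanity check not in the original post (θ = 1.5 is an arbitrary choice), this closed form can be verified by Monte Carlo simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5                                 # arbitrary parameter value
z = rng.normal(size=1_000_000)              # z ~ N(0, 1)
empirical = np.mean((z - theta) ** 2 / 2)   # Monte Carlo estimate of the test error
theoretical = (1 + theta ** 2) / 2          # closed-form expected loss
print(empirical, theoretical)
```

With a million samples the two values agree to about two decimal places.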


Compute and plot the training error for each of the 10 datasets.

>>> for i in range(10):
...     z = np.random.normal(size=20) # generate a dataset
...     # training error
...     trerr = np.mean(np.subtract.outer(z,par)**2/2, axis=0)
...     plt.plot(par,trerr,'b--',linewidth=2)
... 
[<matplotlib.lines.Line2D object at 0x11f86b898>]
[<matplotlib.lines.Line2D object at 0x11f86b9e8>]
[<matplotlib.lines.Line2D object at 0x11f86bd30>]
[<matplotlib.lines.Line2D object at 0x11f87e0b8>]
[<matplotlib.lines.Line2D object at 0x11f87e400>]
[<matplotlib.lines.Line2D object at 0x11f87e748>]
[<matplotlib.lines.Line2D object at 0x11f87ea90>]
[<matplotlib.lines.Line2D object at 0x11f87edd8>]
[<matplotlib.lines.Line2D object at 0x11f888160>]
[<matplotlib.lines.Line2D object at 0x11f8884a8>]


Next, set the axis labels.

>>> plt.xlabel("par")
Text(0.5, 0, 'par')
>>>
>>> plt.ylabel("training/test errors")
Text(0, 0.5, 'training/test errors')
>>>
 
Plot the test error and display the figure.
>>> plt.plot(par, te_err,'r-',linewidth=4) # plot the test error
[<matplotlib.lines.Line2D object at 0x11f8889e8>]
>>> plt.show() # display the figure
>>> 
 
This produces a graph like the one below.
 
The graph shows the test error (red solid line) and the training errors (blue dashed lines), with one training-error curve for each of the 10 datasets.
The goal is to find the parameter that minimizes the test error (solid line); since that function is unknown, we approximate it by the training error computed from the data (dashed lines) and minimize the approximation instead.
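
For squared loss, the training-error minimizer actually has a closed form: it is the sample mean of the data. A short sketch, not from the original post, illustrating this on a fine grid:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=20)                 # one dataset of 20 points
par = np.linspace(-3, 3, 601)           # grid with spacing 0.01
trerr = np.mean(np.subtract.outer(z, par) ** 2 / 2, axis=0)  # training error on the grid
best = par[np.argmin(trerr)]            # grid minimizer
print(best, z.mean())                   # the minimizer is close to the sample mean
```

The true test-error minimizer is θ = 0, so the gap between the sample mean and 0 is exactly the estimation error induced by the finite dataset.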

By default, np.outer in the code computes the outer product of two vectors. Writing np.subtract.outer instead replaces the multiplication in the outer product with subtraction, so element (i, j) of the result is z[i] - par[j]. The operations that can be combined with np.ufunc.outer in this way are listed below.


ufunc     operation
add       addition
subtract  subtraction
multiply  multiplication
divide    division
maximum   maximum
minimum   minimum
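
To make the np.ufunc.outer behavior concrete, here is a minimal sketch: np.subtract.outer(z, par) gives the same result as subtracting with broadcasting.

```python
import numpy as np

z = np.array([1.0, 2.0, 3.0])
par = np.array([0.0, 1.0])
out = np.subtract.outer(z, par)   # out[i, j] = z[i] - par[j]
same = z[:, None] - par           # equivalent via broadcasting
print(out)                        # shape (3, 2)
```

Either form works in the training-error computation above; np.subtract.outer simply makes the "all pairs" intent explicit.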

The Jupyter Notebook for the example above is available on GitHub in Statistics.ipynb, under "Loss function, Training error and Test error".
 

-Python- Covariance and correlation coefficient

This is a note on computing the sample covariance and sample correlation coefficient from data with NumPy's np.cov and np.corrcoef.
These functions take a data matrix of size (dimension × number of samples), which is why the data matrix is transposed below.

>>> import numpy as np
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> # number of data, dimension
... iris.data.shape
(150, 4)
>>> # Variance-covariance matrix (Transposition of data matrix)
... np.cov(iris.data.T)
array([[ 0.68569351, -0.042434  ,  1.27431544,  0.51627069],
       [-0.042434  ,  0.18997942, -0.32965638, -0.12163937],
       [ 1.27431544, -0.32965638,  3.11627785,  1.2956094 ],
       [ 0.51627069, -0.12163937,  1.2956094 ,  0.58100626]])
>>> # Correlation coefficient matrix (Transposition of data matrix)
... np.corrcoef(iris.data.T)
array([[ 1.        , -0.11756978,  0.87175378,  0.81794113],
       [-0.11756978,  1.        , -0.4284401 , -0.36612593],
       [ 0.87175378, -0.4284401 ,  1.        ,  0.96286543],
       [ 0.81794113, -0.36612593,  0.96286543,  1.        ]])

 

The Jupyter Notebook for the example above is available on GitHub in Statistics.ipynb, under "Covariance and Correlation coefficient".
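
As a sketch of what np.cov computes (not part of the original post; the random data here is for illustration only), the unbiased sample covariance matrix can be reproduced by hand from centered data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # 100 samples, 3 dimensions
C = np.cov(X.T)                           # np.cov expects variables in rows, hence the transpose
Xc = X - X.mean(axis=0)                   # center each variable
C_manual = Xc.T @ Xc / (X.shape[0] - 1)   # unbiased sample covariance (divide by n - 1)
print(np.allclose(C, C_manual))
```

This shows why the transpose matters: passing X without .T would treat each of the 100 samples as a variable and return a 100 × 100 matrix.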