-gnuplot- Installing gnuplot (with AquaTerm)

When gnuplot is installed on macOS via Homebrew, the only available terminal type is 'qt', which is quite inconvenient.
Previously, installing via Homebrew with

$ brew install gnuplot --with-aquaterm

was all that was needed, but at this point that command fails with an error for some reason, and AquaTerm can no longer be installed together with gnuplot.

Without AquaTerm, writing out .eps files does not work, which is a problem, so I installed gnuplot by the following procedure instead.

First, remove the gnuplot installed via Homebrew.

List what Homebrew has installed:

$ brew list
gnuplot

(In practice, many packages other than gnuplot will likely appear; the example above shows only gnuplot.)


After confirming that gnuplot is listed, remove it with the following command.

$ brew uninstall gnuplot
 
Next, obtain the source code from the gnuplot home page (at this point I chose ver. 5.2, so the downloaded file is gnuplot-5.2.7.tar).
Extract gnuplot-5.2.7.tar and move it to a suitable directory (the explanation below assumes it was moved to a directory named xxx).

Change into the directory (xxx) containing the extracted gnuplot-5.2.7:
$ cd /Users/xxx/gnuplot-5.2.7


Enter the following command.

$ ./configure --with-readline=builtin --with-aquaterm

Entering this command produces a long stream of output like the following (the middle is omitted in the example below).
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... ./install-sh -c -d
checking for gawk... no
checking for mawk... no
checking for nawk... no
checking for awk... awk
... (omitted)
gnuplot will install the following additional materials:
 
  cfg file for epslatex terminal: yes
  TeX *.sty for lua/tikz terminal: yes
  TeX files will be installed in /usr/local/texlive/texmf-local/tex/latex/gnuplot
                               (use --with-texdir=DIR to change)
  Help file: ${datarootdir}/gnuplot/5.2/gnuplot.gih
  PostScript prologue files: ${datarootdir}/gnuplot/5.2/PostScript/
 
Next, run make.
$ make

make also produces a long stream of output like the following (the middle is omitted in the example below).
/Applications/Xcode.app/Contents/Developer/usr/bin/make  all-recursive
Making all in config
... (omitted)
make[3]: Nothing to be done for `all'.
cp -p ./Gnuplot.app-defaults Gnuplot
make[2]: Nothing to be done for `all-am'.
 

Next, run the following command. You are asked for a password at this point; enter the password you use when logging in as an administrator.

mini:gnuplot-5.2.7 hide$ sudo make install
Password:


Here too, some output scrolls by as follows and the installation completes (the middle is omitted in the example).

Making install in config
make[2]: Nothing to be done for `install-exec-am'.

 

make[2]: Nothing to be done for `install-data-am'.
... (omitted)
make[3]: Nothing to be done for `install-exec-am'.
 .././install-sh -c -d '/usr/local/share/gnuplot/5.2/app-defaults'
 /usr/bin/install -c -m 644 Gnuplot '/usr/local/share/gnuplot/5.2/app-defaults'
 .././install-sh -c -d '/usr/local/share/gnuplot/5.2'
 /usr/bin/install -c -m 644 colors_default.gp colors_podo.gp colors_mono.gp gnuplotrc '/usr/local/share/gnuplot/5.2'
make[2]: Nothing to be done for `install-exec-am'.
 
make[2]: Nothing to be done for `install-data-am'.
 
To verify, launch gnuplot:

$ gnuplot
 
G N U P L O T
Version 5.2 patchlevel 7    last modified 2019-05-29 
 
Copyright (C) 1986-1993, 1998, 2004, 2007-2018
Thomas Williams, Colin Kelley and many others
 
faq, bugs, etc:   type "help FAQ"
immediate help:   type "help"  (plot window: hit 'h')
 
Terminal type is now 'aqua'
 

If, as above,

Terminal type is now 'aqua'

is displayed, the installation succeeded.

The procedure above succeeded on macOS High Sierra, but it did not work on Mojave...

-Python- Principal component analysis

Notes on principal component analysis.
 

To perform principal component analysis, use the scikit-learn package and create an instance of the PCA class from sklearn.decomposition.

The example below performs principal component analysis on the Davis data.

 

The Davis data (Davis.csv) is assumed to be saved in the same directory as the Jupyter Notebook.

The Davis data is read with pd.read_csv from the pandas package.

Each row of columns 1 and 2 of the data array corresponds to a data point \mathbf{x}_{i} = (w_{i}, h_{i}), where w_{i} is the weight [kg] and h_{i} the height [cm] of the i-th person.


Load the packages.

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> import pandas as pd
 
Use PCA from sklearn.
>>> from sklearn.decomposition import PCA
 
Read the data with pandas. The .csv file* that is read is assumed to be in the directory where the REPL is running, so rewrite the path as necessary.
>>> dat = pd.read_csv('Davis.csv').values
 
Convert the height to [m] and take logarithms.
>>> logdat = np.log(np.c_[dat[:,1],dat[:,2]/100].astype('float'))

Plot the data.
>>> plt.plot(logdat[:,0], logdat[:,1], '.'); plt.show()
[<matplotlib.lines.Line2D object at 0x11df9dac8>]

[Figure: scatter plot of log(weight) vs. log(height) for the Davis data]

Perform principal component analysis on the loaded data.
>>> pca = PCA()
>>> pca.fit(logdat)
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
>>> pca.components_
array([[ 0.99672116,  0.08091309],
       [ 0.08091309, -0.99672116]])
>>> 
In the code above, pca.components_ holds the principal components, one direction per row.

The data point at index 11 is removed as an outlier (in Davis.csv, the weight and height of this record appear to be swapped).
>>> clean_logdat = np.delete(logdat, 11, axis=0)
 
Run principal component analysis again on the data with the outlier (index 11) removed.
>>> pca = PCA() 
>>> pca.fit(clean_logdat) 
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
>>> pca.components_
array([[ 0.97754866,  0.21070979],
       [-0.21070979,  0.97754866]])
>>> 
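As a follow-up sketch (not part of the original notes): the fitted pca object from above can also report how much variance each component explains and project the data onto the principal axes. explained_variance_ratio_ and transform are standard scikit-learn PCA members.

ratios = pca.explained_variance_ratio_   # fraction of total variance per component
scores = pca.transform(clean_logdat)     # coordinates of each point along the principal axes
print(ratios, scores[:3])                # inspect the first three people's scores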
 

----------
* Contents of the .csv file (Davis.csv) read by the code above:
sex,weight,height,repwt,repht
M,77,182,77,180
F,58,161,51,159
F,53,161,54,158
M,68,177,70,175
F,59,157,59,155
M,76,170,76,165
M,76,167,77,165
M,69,186,73,180
M,71,178,71,175
M,65,171,64,170
M,70,175,75,174
F,166,57,56,163
F,51,161,52,158
F,64,168,64,165
F,52,163,57,160
F,65,166,66,165
M,92,187,101,185
F,62,168,62,165
M,76,197,75,200
F,61,175,61,171
M,119,180,124,178
F,61,170,61,170
M,65,175,66,173
M,66,173,70,170
F,54,171,59,168
F,50,166,50,165
F,63,169,61,168
F,58,166,60,160
F,39,157,41,153
M,101,183,100,180
F,71,166,71,165
M,75,178,73,175
M,79,173,76,173
F,52,164,52,161
F,68,169,63,170
M,64,176,65,175
F,56,166,54,165
M,69,174,69,171
M,88,178,86,175
M,65,187,67,188
F,54,164,53,160
M,80,178,80,178
F,63,163,59,159
M,78,183,80,180
M,85,179,82,175
F,54,160,55,158
M,73,180,NA,NA
F,49,161,NA,NA
F,54,174,56,173
F,75,162,75,158
M,82,182,85,183
F,56,165,57,163
M,74,169,73,170
M,102,185,107,185
M,64,177,NA,NA
M,65,176,64,172
F,66,170,65,NA
M,73,183,74,180
M,75,172,70,169
M,57,173,58,170
M,68,165,69,165
M,71,177,71,170
M,71,180,76,175
F,78,173,75,169
M,97,189,98,185
F,60,162,59,160
F,64,165,63,163
F,64,164,62,161
F,52,158,51,155
M,80,178,76,175
F,62,175,61,171
M,66,173,66,175
F,55,165,54,163
F,56,163,57,159
F,50,166,50,161
F,50,171,NA,NA
F,50,160,55,150
F,63,160,64,158
M,69,182,70,180
M,69,183,70,183
F,61,165,60,163
M,55,168,56,170
F,53,169,52,175
F,60,167,55,163
F,56,170,56,170
M,59,182,61,183
M,62,178,66,175
F,53,165,53,165
F,57,163,59,160
F,57,162,56,160
M,70,173,68,170
F,56,161,56,161
M,84,184,86,183
M,69,180,71,180
M,88,189,87,185
F,56,165,57,160
M,103,185,101,182
F,50,169,50,165
F,52,159,52,153
F,55,155,NA,154
F,55,164,55,163
M,63,178,63,175
F,47,163,47,160
F,45,163,45,160
F,62,175,63,173
F,53,164,51,160
F,52,152,51,150
F,57,167,55,164
F,64,166,64,165
F,59,166,55,163
M,84,183,90,183
M,79,179,79,171
F,55,174,57,171
M,67,179,67,179
F,76,167,77,165
F,62,168,62,163
M,83,184,83,181
M,96,184,94,183
M,75,169,76,165
M,65,178,66,178
M,78,178,77,175
M,69,167,73,165
F,68,178,68,175
F,55,165,55,163
M,67,179,NA,NA
F,52,169,56,NA
F,47,153,NA,154
F,45,157,45,153
F,68,171,68,169
F,44,157,44,155
F,62,166,61,163
M,87,185,89,185
F,56,160,53,158
F,50,148,47,148
M,83,177,84,175
F,53,162,53,160
F,64,172,62,168
F,62,167,NA,NA
M,90,188,91,185
M,85,191,83,188
M,66,175,68,175
F,52,163,53,160
F,53,165,55,163
F,54,176,55,176
F,64,171,66,171
F,55,160,55,155
F,55,165,55,165
F,59,157,55,158
F,70,173,67,170
M,88,184,86,183
F,57,168,58,165
F,47,162,47,160
F,47,150,45,152
F,55,162,NA,NA
F,48,163,44,160
M,54,169,58,165
M,69,172,68,174
F,59,170,NA,NA
F,58,169,NA,NA
F,57,167,56,165
F,51,163,50,160
F,54,161,54,160
F,53,162,52,158
F,59,172,58,171
M,56,163,58,161
F,59,159,59,155
F,63,170,62,168
F,66,166,66,165
M,96,191,95,188
F,53,158,50,155
M,76,169,75,165
F,54,163,NA,NA
M,61,170,61,170
M,82,176,NA,NA
M,62,168,64,168
M,71,178,68,178
F,60,174,NA,NA
M,66,170,67,165
M,81,178,82,175
M,68,174,68,173
M,80,176,78,175
F,43,154,NA,NA
M,82,181,NA,NA
F,63,165,59,160
M,70,173,70,173
F,56,162,56,160
F,60,172,55,168
F,58,169,54,166
M,76,183,75,180
F,50,158,49,155
M,88,185,93,188
M,89,173,86,173
F,59,164,59,165
F,51,156,51,158
F,62,164,61,161
M,74,175,71,175
M,83,180,80,180
M,81,175,NA,NA
M,90,181,91,178
M,79,177,81,178

-Python- Estimating the test error by cross-validation

Notes on estimating the test error by K-fold cross-validation.

The K-fold cross-validation computation proceeds as follows (a scikit-learn shortcut is sketched right after this list).

  • Learning method and loss: an algorithm \mathcal{A} with tuning parameter \lambda, and a loss l(\mathbf{z}, h)
  • Input: data \mathbf{z}_{1}, \mathbf{z}_{2}, \cdots, \mathbf{z}_{n}
  1. Split the input data into K groups D_{1}, D_{2}, \cdots, D_{K} with roughly equal numbers of elements.
  2. For k = 1, 2, \cdots, K, repeat the following.
    1. Compute h_{\lambda, k} = \mathcal{A} (D^{(k)}, \lambda), where D^{(k)} = D \setminus D_{k}.
    2. Compute \widehat{Err} ( h_{\lambda, k}) = \cfrac{1}{m}\sum_{\mathbf{z} \in D_{k}} l ({\bf{z}}, h_{\lambda, k}), where m = |D_{k}|.
  3. \widehat{Err} ( \mathcal{A}; \lambda) = \cfrac{1}{K}\sum_{k = 1}^{K} \widehat{Err} ( h_{\lambda, k} )
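As referenced above, the same procedure can also be written with scikit-learn's built-in helpers. This is only a minimal sketch under the same data-generating setup as this section, not the hand-rolled loop used below; cross_val_score with scoring='neg_mean_squared_error' returns the negative MSE per fold, so it is rescaled to match the loss (y - ypred)**2 / 2.

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

n, K = 100, 10
x = np.random.uniform(-2, 2, n)
y = np.sin(2 * np.pi * x) / x + np.random.normal(scale=0.5, size=n)

# Negative MSE on each of the K folds; dividing by -2 converts it to (y - ypred)**2 / 2.
scores = cross_val_score(DecisionTreeRegressor(max_depth=3),
                         x.reshape(-1, 1), y,
                         scoring='neg_mean_squared_error',
                         cv=KFold(n_splits=K))
cv_error = -np.mean(scores) / 2   # K-fold CV estimate of the test error
print(cv_error)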

The regression function is estimated with a decision tree, so DecisionTreeRegressor from sklearn.tree is used for the estimation.

Load the packages.

>>> import numpy as np
>>> import scipy as sp
>>> import matplotlib.pyplot as plt
>>> from sklearn.tree import DecisionTreeRegressor


The settings are 100 data points and 10-fold CV (cross-validation).

>>> n = 100; K = 10 
 
Generate the data. The inputs follow the uniform distribution on the interval [-2, 2].
>>> x = np.random.uniform(-2, 2, n)
>>> y = np.sin(2 * np.pi * x) / x + np.random.normal(scale = 0.5, size = n)
 
Split the data into groups and compute the CV error for each candidate tree depth.
>>> cv_idx = np.tile(np.arange(K), int(np.ceil(n/K)))[: n]
>>> maxdepths = np.arange(2, 10)
>>> cverr = np.array([])
>>> for mp in maxdepths:
...     cverr_lambda = np.array([])
...     for k in range(K):
...             tr_idx = (cv_idx!=k)   # training folds
...             te_idx = (cv_idx==k)   # test fold
...             cvx = x[tr_idx]; cvy = y[tr_idx]
...             dtreg = DecisionTreeRegressor(max_depth=mp)
...             dtreg.fit(np.array([cvx]).T, cvy)
...             ypred = dtreg.predict(np.array([x[te_idx]]).T)
...             cverr_lambda = np.append(cverr_lambda, np.mean((y[te_idx] - ypred)**2/2))
...     cverr = np.append(cverr, np.mean(cverr_lambda))
... 
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
.
. *1 (omitted)
.
DecisionTreeRegressor(criterion='mse', max_depth=9, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
>>> plt.scatter(maxdepths, cverr,c='k')  # plot the CV error
<matplotlib.collections.PathCollection object at 0x122d2cc18>
>>> plt.xlabel("max depth"); plt.ylabel('cv error')
Text(0.5, 0, 'max depth')
Text(0, 0.5, 'cv error')
>>> plt.show()
 

Notes on the code above:

  • The candidate tree depths are set by maxdepths = np.arange(2, 10).
  • The data split for CV is done by cvx = x[tr_idx]; cvy = y[tr_idx].
  • The decision tree is fitted by dtreg.fit(np.array([cvx]).T, cvy).
  • Prediction is done by ypred = dtreg.predict(np.array([x[te_idx]]).T), and the fold's test loss is np.mean((y[te_idx] - ypred)**2/2).


The Jupyter Notebook for the example above can be viewed in Statistics.ipynb on GitHub, under "Estimation of test error by cross validation method".

 

----------
The *1 part omitted from the REPL output above is displayed as follows.

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
(The same repr is echoed once for each of the K = 10 folds at every max_depth from 2 to 9, i.e. 80 times in total; only the max_depth value changes.)

 


-Python- Loss function, training error, and test error

Notes on an implementation that plots the training error for 10 data sets of 20 points each.

First, load the packages.

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> import scipy.stats
>>> 
 


Set the parameter range.

>>> par = np.linspace(-3,3,50)


Set the test error. For data z \sim N(0, 1) with squared loss (z - \theta)^{2}/2, the expected loss at parameter \theta is (E[z^{2}] + \theta^{2})/2 = (1 + \theta^{2})/2.

>>> te_err = (1+par**2)/2


Plot the training errors.

>>> for i in range(10):
...     z = np.random.normal(size=20) # generate a data set
...     # training error
...     trerr = np.mean(np.subtract.outer(z,par)**2/2, axis=0)
...     plt.plot(par,trerr,'b--',linewidth=2)
... 
[<matplotlib.lines.Line2D object at 0x11f86b898>]
[<matplotlib.lines.Line2D object at 0x11f86b9e8>]
[<matplotlib.lines.Line2D object at 0x11f86bd30>]
[<matplotlib.lines.Line2D object at 0x11f87e0b8>]
[<matplotlib.lines.Line2D object at 0x11f87e400>]
[<matplotlib.lines.Line2D object at 0x11f87e748>]
[<matplotlib.lines.Line2D object at 0x11f87ea90>]
[<matplotlib.lines.Line2D object at 0x11f87edd8>]
[<matplotlib.lines.Line2D object at 0x11f888160>]
[<matplotlib.lines.Line2D object at 0x11f8884a8>]


Next, set the axis labels.

>>> plt.xlabel("par")
Text(0.5, 0, 'par')
>>>
>>> plt.ylabel("training/test errors")
Text(0, 0.5, 'training/test errors')
>>>
 
Plot the test error and draw the figure.
>>> plt.plot(par, te_err,'r-',linewidth=4) # plot the test error
[<matplotlib.lines.Line2D object at 0x11f8889e8>]
>>> plt.show() # draw
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/matplotlib/font_manager.py:1241: UserWarning: findfont: Font family ['ipaexg'] not found. Falling back to DejaVu Sans.
  (prop.get_family(), self.defaultFamily[fontext]))
 
objc[1137]: Class FIFinderSyncExtensionHost is implemented in both /System/Library/PrivateFrameworks/FinderKit.framework/Versions/A/FinderKit (0x7fff933a3cd0) and /System/Library/PrivateFrameworks/FileProvider.framework/OverrideBundles/FinderSyncCollaborationFileProviderOverride.bundle/Contents/MacOS/FinderSyncCollaborationFileProviderOverride (0x1204bccd8). One of the two will be used. Which one is undefined.
>>> 
 
A graph like the following is then drawn.

The graph shows the test error (red solid line) and the training errors (blue dashed lines).
A training error curve is plotted for each of the 10 data sets.
The goal is to find the parameter that minimizes the true function (solid line); this is approximated by minimizing the function computed from the data (dashed lines).

In the code, np.outer by default computes the outer product of two vectors. Writing np.subtract.outer replaces the multiplication in the outer product by subtraction, producing the array of all pairwise differences. The operations that can be swapped in via np.ufunc.outer are listed in the table below, and a small check follows the table.


ufunc      operation
--------   --------------
add        addition
subtract   subtraction
multiply   multiplication
divide     division
maximum    maximum
minimum    minimum
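As a small check of the pairwise computation above (with made-up numbers): for a vector a of length m and a vector b of length n, np.subtract.outer(a, b) returns the m x n array whose (i, j) entry is a[i] - b[j].

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([10.0, 20.0, 30.0])
print(np.subtract.outer(a, b))   # entry (i, j) is a[i] - b[j]
# [[ -9. -19. -29.]
#  [ -8. -18. -28.]]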

The Jupyter Notebook for the example above can be viewed in Statistics.ipynb on GitHub, under "Loss function, Training error and Test error".
 

-Python- Covariance and correlation coefficients

Notes on using NumPy's np.cov and np.corrcoef in Python to compute the sample covariance and the sample correlation coefficients from data.
These functions take a data matrix of size (data dimension) x (number of data points), which is why the transposed data matrix is passed below.

>>> import numpy as np
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> # number of data, dimension
... iris.data.shape
(150, 4)
>>> # Variance-covariance matrix (Transposition of data matrix)
... np.cov(iris.data.T)
array([[ 0.68569351, -0.042434  ,  1.27431544,  0.51627069],
       [-0.042434  ,  0.18997942, -0.32965638, -0.12163937],
       [ 1.27431544, -0.32965638,  3.11627785,  1.2956094 ],
       [ 0.51627069, -0.12163937,  1.2956094 ,  0.58100626]])
>>> # Correlation coefficient matrix (Transposition of data matrix)
... np.corrcoef(iris.data.T)
array([[ 1.        , -0.11756978,  0.87175378,  0.81794113],
       [-0.11756978,  1.        , -0.4284401 , -0.36612593],
       [ 0.87175378, -0.4284401 ,  1.        ,  0.96286543],
       [ 0.81794113, -0.36612593,  0.96286543,  1.        ]])
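As a quick sanity check (not in the original): np.cov divides by n - 1 by default, so the matrix above should match the unbiased sample covariance computed by hand.

import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data                          # shape (150, 4): samples x dimensions
Xc = X - X.mean(axis=0)                       # center each column
cov_manual = Xc.T @ Xc / (X.shape[0] - 1)     # unbiased sample covariance
print(np.allclose(cov_manual, np.cov(X.T)))   # expected: True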

 

The Jupyter Notebook for the example above can be viewed in Statistics.ipynb on GitHub, under "Convariance and Correlation coefficient".

-Python- Quantiles

Denote by z_{\alpha} the upper \alpha point of the standard normal distribution N(0, 1).
In Python,

sp.stats.norm.ppf

gives the quantiles of the normal distribution; the upper point z_{\alpha} is obtained as sp.stats.norm.ppf(1 - \alpha).

The Python session looks like the following.

>>> import scipy as sp
>>> from scipy.stats import norm

The 0.7 quantile of N(0, 1):

>>> sp.stats.norm.ppf(0.7)
0.5244005127080407
The 0.7 quantile of N(1, 2^2):
>>> sp.stats.norm.ppf(0.7, loc = 1, scale = 2)
2.0488010254160813
The upper 0.05 point of N(0, 1):
>>> alpha = 0.05
>>> sp.stats.norm.ppf(1 - alpha)
1.6448536269514722
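The same ppf pattern works for other distributions in scipy.stats as well; as a sketch (the value in the comment is approximate), the upper 0.05 point of the t-distribution with 10 degrees of freedom:

from scipy.stats import t

alpha = 0.05
print(t.ppf(1 - alpha, df=10))   # upper 0.05 point of t(10), about 1.81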

 

The Jupyter Notebook for the example above can be viewed in Statistics.ipynb on GitHub, under "Quantile".

-Python- Expected value and variance

In Python, functions for generating data from probability distributions are provided in np.random and scipy.stats.
The following functions are used to generate samples (a minimal usage sketch follows the list).

Normal distribution

  • np.random.normal(loc = 0.0, scale = 1.0, size = None)
The option loc is the mean, scale the standard deviation, and size the number of samples.


Standard normal distribution

  • np.random.randn(d0, d1, d2, ...)
The standard normal distribution is the normal distribution with mean 0 and variance 1. Samples from it are stored in an array of shape (d0, d1, d2, ...).


Uniform distribution

  • np.random.uniform(low = 0.0, high = 1.0, size = None)
The option low is the minimum, high the maximum, and size the number of samples.

Uniform distribution on the interval [0, 1]
  • np.random.rand(d0, d1, d2, ...)
Samples from the uniform distribution on [0, 1] are stored in an array of shape (d0, d1, d2, ...).
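As referenced above, a minimal usage sketch of the four generators (the shapes and parameter values are arbitrary examples):

import numpy as np

a = np.random.normal(loc=1.0, scale=2.0, size=5)    # N(1, 2^2), 5 samples
b = np.random.randn(2, 3)                           # N(0, 1) samples in a 2x3 array
c = np.random.uniform(low=-2.0, high=2.0, size=5)   # uniform on [-2, 2], 5 samples
d = np.random.rand(2, 3)                            # uniform on [0, 1] in a 2x3 array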


The Python session looks like the following.
Generate 100 data points from the normal distribution with mean 1 and standard deviation 2.

>>> import numpy as np
>>> x = np.random.normal(1, 2, 100)
 
Compute the mean of the data.
>>> x.mean()
1.150042657481204
 
Another way to compute the mean of the data is the following.
>>> np.mean(x)
1.150042657481204
Naturally, the same value is obtained.
 
Compute the fraction of the data with | x - E[x] | \leq sd(x).
>>> np.mean(np.abs(x - np.mean(x)) <= np.std(x))
0.67
 
Compute the fraction of the data with | x - E[x] | \leq 2 \times sd(x).
>>> np.mean(np.abs(x - np.mean(x)) <= 2 * np.std(x))
0.98
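For comparison (a sketch, not in the original): the theoretical probabilities behind the roughly 0.68 and 0.95 fractions observed above can be computed with scipy.stats.norm.cdf.

from scipy.stats import norm

print(norm.cdf(1) - norm.cdf(-1))   # P(|z| <= 1), about 0.6827
print(norm.cdf(2) - norm.cdf(-2))   # P(|z| <= 2), about 0.9545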

 

The Jupyter Notebook for the example above can be viewed in Statistics.ipynb on GitHub, under "Expected value and Variance".