Introduction to sklearn
scikit-learn is a simple and efficient tool for data mining and data analysis.
It depends on NumPy, SciPy, and matplotlib.
By functionality, it mainly covers the following areas:
Classification
Regression
Clustering
Dimensionality reduction
Model selection
The modules used most often are clustering, classification (svm, tree, linear regression, etc.), decomposition, preprocessing, and metrics.
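For instance, the metrics module provides ready-made evaluation functions. A minimal sketch, with made-up label arrays just for illustration:

from sklearn import metrics

y_true = [0, 1, 2, 2]  # hypothetical ground-truth labels
y_pred = [0, 2, 2, 2]  # hypothetical predicted labels
print(metrics.accuracy_score(y_true, y_pred))    # fraction of correct predictions
print(metrics.confusion_matrix(y_true, y_pred))  # per-class error breakdown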
cluster
Reading the sklearn.cluster API reveals two main kinds of content: classes implementing the various clustering methods, such as cluster.KMeans, and functions that can be called directly to perform clustering, such as:
sklearn.cluster.k_means(X, n_clusters, init='k-means++', precompute_distances='auto', n_init=10, max_iter=300, verbose=False, tol=0.0001, random_state=None, copy_x=True, n_jobs=1, algorithm='auto', return_n_iter=False)
So in practice there are correspondingly two ways to use it.
sklearn.cluster offers 9 clustering methods:
AffinityPropagation: affinity propagation
AgglomerativeClustering: agglomerative (hierarchical) clustering
Birch
DBSCAN (a minimal sketch is given right after this list)
FeatureAgglomeration: feature agglomeration
KMeans: k-means clustering
MiniBatchKMeans
MeanShift
SpectralClustering: spectral clustering
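All of these classes share the same estimator interface: construct, fit, then read labels_. As a minimal sketch, DBSCAN could be applied to random data like this (the eps and min_samples values here are arbitrary illustrations, not tuned recommendations):

import numpy as np
from sklearn.cluster import DBSCAN

data = np.random.rand(100, 3)
db = DBSCAN(eps=0.3, min_samples=5)  # eps and min_samples chosen arbitrarily here
db.fit(data)
labels = db.labels_  # noise points get the label -1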
Let's take the most familiar method, KMeans, as an example.
First, build a KMeans clusterer with the class constructor.
In the API, the KMeans constructor is:
sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')
Meaning of the parameters:
n_clusters: the number of clusters, i.e., how many groups you want
init: the method for choosing the initial cluster centers
n_init: how many times the algorithm is run with different initial centers (the best result is kept)
max_iter: the maximum number of iterations (the k-means algorithm is iterative)
tol: the tolerance, i.e., the convergence condition for the k-means criterion
verbose: verbosity mode, i.e., how much progress information is printed (usually left at its default)
precompute_distances: whether to precompute distances
random_state: the random state used to generate the initial cluster centers
copy_x: a flag for whether the data may be modified; if True, the data is copied and the original is left untouched
n_jobs: the parallelism setting
algorithm: which k-means implementation to use: 'auto', 'full', or 'elkan', where 'full' is the classical EM-style algorithm
Here is a simple example:
import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(100, 3)  # generate random data: 100 samples, 3 features
# suppose we want a clusterer with 3 clusters
estimator = KMeans(n_clusters=3)  # construct the clusterer
estimator.fit(data)  # cluster
label_pred = estimator.labels_  # get the cluster labels (note: labels_, with an s)
centroids = estimator.cluster_centers_  # get the cluster centers
inertia = estimator.inertia_  # get the final value of the clustering criterion
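Once fitted, the estimator can also assign previously unseen samples to the learned clusters. A small sketch (new_points here is hypothetical):

new_points = np.random.rand(5, 3)           # hypothetical unseen samples
new_labels = estimator.predict(new_points)  # index of the nearest cluster center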
Using the k_means function directly:
import numpy as np
from sklearn import cluster

data = np.random.rand(100, 3)  # generate random data: 100 samples, 3 features
k = 3  # suppose we want 3 clusters
[centroid, label, inertia] = cluster.k_means(data, k)
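Per the signature quoted earlier, passing return_n_iter=True makes the function additionally return the iteration count of the best run; a minimal sketch:

[centroid, label, inertia, best_n_iter] = cluster.k_means(data, k, return_n_iter=True)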
classification
Commonly used classification methods include:
KNN (k-nearest neighbors): sklearn.neighbors
Logistic regression: sklearn.linear_model.LogisticRegression
SVM (support vector machine): sklearn.svm
Naive Bayes: sklearn.naive_bayes
Decision tree: sklearn.tree (a minimal sketch appears after the logistic regression example below)
Neural network: sklearn.neural_network
Let's take KNN (specifically, Nearest Neighbors Classification) as an example to see how these methods are used:
import numpy as np
from sklearn import neighbors, datasets

# import some data to play with
iris = datasets.load_iris()
n_neighbors = 15
X = iris.data[:, :2]  # we only take the first two features. We could
                      # avoid this ugly slicing by using a two-dim dataset
y = iris.target

weights = 'distance'  # can also be set to 'uniform'
clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
clf.fit(X, y)

# if you have test data, just predict with the following functions;
# for example, xx, yy is constructed test data
h = 0.02  # step size of the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])  # Z is the label_pred
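To visualize the resulting decision boundary, Z is typically reshaped back onto the mesh and plotted. A short sketch, assuming matplotlib is available:

import matplotlib.pyplot as plt

Z = Z.reshape(xx.shape)  # back to the 2-D mesh layout
plt.pcolormesh(xx, yy, Z)  # colored decision regions
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k')  # training points on top
plt.show()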
Another example is SVM:
from sklearn import svm

X = [[0, 0], [1, 1]]
y = [0, 1]
# build a support vector classifier
clf = svm.SVC()
# fit the training data to learn the model parameters
clf.fit(X, y)
# predict on the test points [2., 2.] and [3., 3.]
res = clf.predict([[2., 2.], [3., 3.]])
# print the predicted values
print(res)
# get support vectors
print("support vectors:", clf.support_vectors_)
# get indices of support vectors
print("indices of support vectors:", clf.support_)
# get number of support vectors for each class
print("number of support vectors for each class:", clf.n_support_)
SVM also has a corresponding regression model, SVR:
from sklearn import svm

X = [[0, 0], [2, 2]]
y = [0.5, 2.5]
clf = svm.SVR()
clf.fit(X, y)
res = clf.predict([[1, 1]])
print(res)
Logistic regression:
from sklearn import linear_model

X = [[0, 0], [1, 1]]
y = [0, 1]
logreg = linear_model.LogisticRegression(C=1e5)
# we create an instance of the logistic regression classifier and fit the data
logreg.fit(X, y)
res = logreg.predict([[2, 2]])
print(res)
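The other classifiers listed above follow exactly the same fit/predict pattern. For instance, a minimal decision tree sketch on the same toy data (the data are just for illustration):

from sklearn import tree

X = [[0, 0], [1, 1]]
y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf.fit(X, y)
print(clf.predict([[2, 2]]))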
preprocessing
What I use most from this module is scaling. There are several scaler types, including:
StandardScaler
MaxAbsScaler
MinMaxScaler
RobustScaler
Normalizer
plus other preprocessing operations.
Each has a corresponding function that can be called directly: scale(), maxabs_scale(), minmax_scale(), robust_scale(), normalize().
import numpy as np
from sklearn import preprocessing

X = np.random.rand(3, 4)
# using a scaler object
scaler = preprocessing.MinMaxScaler()
X_scaled = scaler.fit_transform(X)
# using the scale function directly
X_scaled_convenient = preprocessing.minmax_scale(X)
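The practical difference between the two forms: a scaler object remembers the statistics it learned from one dataset and can apply the same transformation to later data, which the one-shot function cannot. A minimal sketch (X_train and X_test here are hypothetical):

X_train = np.random.rand(3, 4)  # hypothetical training data
X_test = np.random.rand(2, 4)   # hypothetical test data
scaler = preprocessing.MinMaxScaler()
scaler.fit(X_train)  # learn min/max from the training data only
X_test_scaled = scaler.transform(X_test)  # reuse those statistics on the test data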
decomposition
NMF
import numpy as np
from sklearn.decomposition import NMF

X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])
model = NMF(n_components=2, init='random', random_state=0)
model.fit(X)

print(model.components_)          # the factor matrix H
print(model.reconstruction_err_)  # the reconstruction error
print(model.n_iter_)              # the actual number of iterations
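NMF factorizes X ≈ WH: components_ holds H, and transform returns W, so the two can be multiplied to inspect the reconstruction. A short sketch continuing the example above:

W = model.transform(X)   # the coefficient matrix W
H = model.components_    # the factor matrix H
X_approx = np.dot(W, H)  # approximate reconstruction of X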
PCA
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])
model = PCA(n_components=2)
model.fit(X)

print(model.components_)                # the principal axes in feature space
print(model.n_components_)              # the number of components kept
print(model.explained_variance_)        # variance explained by each component
print(model.explained_variance_ratio_)  # fraction of total variance per component
print(model.mean_)                      # the per-feature empirical mean
print(model.noise_variance_)            # the estimated noise variance
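To actually project the data onto the principal components, use transform (or fit_transform in one step); continuing the example above:

X_reduced = model.transform(X)  # project X onto the principal axes
print(X_reduced.shape)          # (6, 2), since n_components=2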
datasets
sklearn itself also provides several common datasets, such as iris, diabetes, digits, covtype, kddcup99, boston, and breast_cancer. Each can be loaded with a function like sklearn.datasets.load_iris, which returns a dataset object. The data and labels are obtained as follows:
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
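As a convenience, most of these loaders also accept return_X_y=True to return the arrays directly (available in sklearn 0.18 and later, so treat this as an assumption about your installed version):

X, y = load_iris(return_X_y=True)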