A Complete Guide to scikit-learn: From the Basics to Advanced Applications, with Machine Learning Algorithms Explained and Hands-On Case Studies

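Introduction

Scikit-learn (often abbreviated sklearn) is one of the most popular machine learning libraries for Python. It is built on scientific computing libraries such as NumPy, SciPy, and Matplotlib, and it provides a concise, uniform set of tools for many common machine learning tasks. It is useful both to beginners and to practitioners building more complex machine learning systems. This guide goes from the basics to advanced usage, covering the core skills needed to apply Scikit-learn in real projects.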

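1. Introduction to Scikit-learn

Scikit-learn is an open-source machine learning library for Python. It was first developed by David Cournapeau in 2007 as a Google Summer of Code project and has since become one of the most widely used tools in data science. Its main characteristics are:

Simple and efficient: a consistent API design that is easy to learn and use
Comprehensive: supports classification, regression, clustering, dimensionality reduction, model selection, and other machine learning tasks
Well documented: detailed official documentation with many example scripts
Active community: a large user community and ongoing maintenance and releases

Scikit-learn's functionality is organized into six main areas: classification, regression, clustering, dimensionality reduction, model selection, and preprocessing. Together these cover the main application scenarios of classical machine learning.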

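2. Installation and Environment Setup

Before using Scikit-learn, install the required libraries and tools.

2.1 Installing Scikit-learn

Scikit-learn and the other libraries used in this guide can be installed with pip or conda:

# Install with pip
pip install numpy pandas scikit-learn matplotlib seaborn

# Install with conda
conda install numpy pandas scikit-learn matplotlib seaborn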

2.2 Verifying the Installation

After installation, run the following code to check that everything was installed correctly:

import sklearn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

print("Scikit-learn version:", sklearn.__version__)
print("NumPy version:", np.__version__)
print("Pandas version:", pd.__version__)

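2.3 Recommended Development Environment

For beginners, Jupyter is a good development environment. It offers an interactive programming experience that makes experimenting and visualizing easy:

pip install jupyterlab
# or
conda install jupyterlab

Start JupyterLab:

jupyter lab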

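3. Scikit-learn Core Concepts

3.1 Datasets

The first step in machine learning is getting data. Scikit-learn ships with several small datasets for practice, such as the Iris dataset, and provides fetchers for larger ones such as the California housing dataset. The Boston housing dataset used in many older tutorials was removed in scikit-learn 1.2.

from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X = iris.data    # features
y = iris.target  # labels

# Inspect the dataset
print("Feature names:", iris.feature_names)
print("Target classes:", iris.target_names)
print("Dataset shape:", X.shape)
print("First 5 rows:\n", X[:5])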

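3.2 Data Preprocessing

Before data is used for training, it usually needs preprocessing, such as standardization and missing-value handling. Scikit-learn provides several tools that simplify this step.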
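
Missing-value handling is mentioned above but not shown. The following is a minimal sketch using SimpleImputer from sklearn.impute; the toy matrix X_missing and the choice of mean imputation are assumptions made for illustration, not a recommendation for every dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy matrix with two missing entries (np.nan)
X_missing = np.array([[1.0, 2.0],
                      [np.nan, 4.0],
                      [5.0, np.nan]])

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X_missing)

print("Column means learned during fit:", imputer.statistics_)
print("Imputed data:\n", X_filled)
```

Other strategies such as "median" or "most_frequent" may suit skewed or categorical columns better.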

3.2.1 Standardization

from sklearn.preprocessing import StandardScaler

# Create the scaler
scaler = StandardScaler()

# Standardize the data (zero mean, unit variance per feature)
X_scaled = scaler.fit_transform(X)

# Compare the data before and after standardization
print("First 5 rows of the original data:\n", X[:5])
print("First 5 rows of the standardized data:\n", X_scaled[:5])

3.2.2 Train/Test Split

from sklearn.model_selection import train_test_split

# Split the dataset into a training set (80%) and a test set (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

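3.3 Scikit-learn API Design Principles

Scikit-learn's API follows a consistency principle, so different algorithms are used in very similar ways. The main methods are:

fit(): train the model
predict(): make predictions with a trained model
transform(): transform data (for example feature extraction or standardization)
fit_transform(): fit to the data, then transform it

Because of this uniform interface, trying a different algorithm usually means changing only a few lines of code.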
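
As a rough illustration of that uniformity, the sketch below runs three unrelated classifiers through the same fit and score calls. The particular models and the variable names (X_iris, X_tr, scores, and so on) are choices made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42
)

# Transformer: fit on the training split only, then transform both splits
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

# Estimators: identical fit / score calls regardless of the algorithm
models = {
    "logistic_regression": LogisticRegression(max_iter=200),
    "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=5),
    "decision_tree": DecisionTreeClassifier(random_state=42),
}
scores = {}
for name, estimator in models.items():
    estimator.fit(X_tr_s, y_tr)
    scores[name] = estimator.score(X_te_s, y_te)  # mean accuracy
    print(f"{name}: {scores[name]:.3f}")
```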

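4. Supervised Learning Algorithms

Supervised learning is one of the most common machine learning settings. It trains on labeled data and covers two broad problem types: classification and regression.

4.1 Classification Algorithms

Classification means identifying which category an object belongs to. Common applications include spam detection and image recognition.

4.1.1 Logistic Regression

Logistic regression is a simple classification algorithm. Despite the word "regression" in its name, it is a classification method.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Create the logistic regression model
log_reg = LogisticRegression(max_iter=200)

# Train the model
log_reg.fit(X_train, y_train)

# Predict on the test set
y_pred = log_reg.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("\nClassification report:\n", classification_report(y_test, y_pred, target_names=iris.target_names))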
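
To make the "classification, not regression" point concrete, here is a small sketch for the binary case: the model computes a linear score and passes it through the sigmoid function to get a class probability. The synthetic dataset from make_classification and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class problem (illustrative data only)
X_bin, y_bin = make_classification(n_samples=200, n_features=4, random_state=0)
bin_clf = LogisticRegression(max_iter=200).fit(X_bin, y_bin)

# Linear score z = w . x + b, then sigmoid(z) = P(class 1 | x)
z = X_bin @ bin_clf.coef_.ravel() + bin_clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

p_sklearn = bin_clf.predict_proba(X_bin)[:, 1]
pred_manual = (z > 0).astype(int)  # z > 0 is the same as p > 0.5

print("Max probability difference:", np.abs(p_manual - p_sklearn).max())
print("Predictions agree:", np.array_equal(pred_manual, bin_clf.predict(X_bin)))
```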

4.1.2 Decision Tree

A decision tree is an intuitive classification method that splits the data into classes by asking a sequence of questions about the features.

from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# Create the decision tree model
dt_clf = DecisionTreeClassifier(max_depth=3, random_state=42)

# Train the model
dt_clf.fit(X_train, y_train)

# Predict on the test set
y_pred = dt_clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Visualize the tree
plt.figure(figsize=(15, 10))
tree.plot_tree(dt_clf, filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.show()

4.1.3 Random Forest

A random forest is an ensemble method. It builds many decision trees and combines their predictions, which usually improves accuracy and reduces overfitting compared with a single tree.

from sklearn.ensemble import RandomForestClassifier

# Create the random forest model
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_clf.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Feature importances
feature_importance = pd.DataFrame({
    'Feature': iris.feature_names,
    'Importance': rf_clf.feature_importances_
}).sort_values('Importance', ascending=False)
print("\nFeature importances:\n", feature_importance)

4.1.4 Support Vector Machine (SVM)

The support vector machine is a powerful classification algorithm that works particularly well on high-dimensional data.

from sklearn.svm import SVC

# Create the SVM model
svm_clf = SVC(kernel='linear', C=1.0, random_state=42)

# Train the model
svm_clf.fit(X_train, y_train)

# Predict on the test set
y_pred = svm_clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

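4.2 Regression Algorithms

Regression means predicting a continuous value associated with an object. Common applications include predicting drug response and predicting stock prices.

4.2.1 Linear Regression

Linear regression is the simplest regression algorithm. It tries to find a linear relationship between the features and the target variable.

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the California housing dataset (downloaded and cached on first use).
# load_boston, used in older tutorials, was removed in scikit-learn 1.2.
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the linear regression model
lin_reg = LinearRegression()

# Train the model
lin_reg.fit(X_train, y_train)

# Predict on the test set
y_pred = lin_reg.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean squared error (MSE):", mse)
print("R² score:", r2)

# Coefficients and intercept
print("\nCoefficients:", lin_reg.coef_)
print("Intercept:", lin_reg.intercept_)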

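4.2.2 Ridge Regression

Ridge regression is a regularized version of linear regression. It adds an L2 penalty on the coefficients to reduce overfitting.

from sklearn.linear_model import Ridge

# Create the ridge regression model
ridge_reg = Ridge(alpha=1.0)

# Train the model
ridge_reg.fit(X_train, y_train)

# Predict on the test set
y_pred = ridge_reg.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean squared error (MSE):", mse)
print("R² score:", r2)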
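
The effect of the L2 penalty can be checked by hand. The sketch below assumes no intercept term, in which case minimizing ||y - Xw||² + alpha·||w||² has the closed-form solution w = (XᵀX + alpha·I)⁻¹Xᵀy. The toy data and the names X_toy, y_toy, and true_w are made up for this example.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Illustrative synthetic regression data
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_toy = X_toy @ true_w + rng.normal(scale=0.1, size=50)

alpha = 1.0

# Closed-form ridge solution (no intercept)
w_closed = np.linalg.solve(
    X_toy.T @ X_toy + alpha * np.eye(X_toy.shape[1]), X_toy.T @ y_toy
)

# Ordinary least squares for comparison (alpha = 0)
w_ols, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

# The same model fitted by scikit-learn
ridge_toy = Ridge(alpha=alpha, fit_intercept=False).fit(X_toy, y_toy)

print("Closed form :", w_closed)
print("Ridge.coef_ :", ridge_toy.coef_)
print("Coefficient norm, OLS vs ridge:", np.linalg.norm(w_ols), np.linalg.norm(w_closed))
```

The ridge coefficient vector has a norm no larger than the least-squares one, which is the shrinkage the penalty is meant to produce.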

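4.2.3 Lasso Regression

Lasso regression is another regularized linear regression. It adds an L1 penalty to reduce overfitting, and it can produce sparse models.

from sklearn.linear_model import Lasso

# Create the Lasso regression model
lasso_reg = Lasso(alpha=0.1)

# Train the model
lasso_reg.fit(X_train, y_train)

# Predict on the test set
y_pred = lasso_reg.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean squared error (MSE):", mse)
print("R² score:", r2)

# Inspect the coefficients; some of them may be exactly 0
print("\nCoefficients:", lasso_reg.coef_)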

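5. Unsupervised Learning Algorithms

Unsupervised learning trains on unlabeled data. Its two main task types are clustering and dimensionality reduction.

5.1 Clustering Algorithms

Clustering divides data into groups so that points within a group are similar to each other and points in different groups are dissimilar.

5.1.1 K-means Clustering

K-means is one of the most widely used clustering algorithms. It tries to partition the data into K clusters, where each cluster center is the mean of the points in that cluster.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic data
X, y = make_blobs(n_samples=300, centers=4, random_state=42)

# Create the K-means model
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)

# Fit the model
kmeans.fit(X)

# Get the clustering result
labels = kmeans.labels_
centers = kmeans.cluster_centers_

# Visualize the clusters
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50, alpha=0.7)
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.8, marker='X')
plt.title('K-means clustering result')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()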
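
The statement that each center is the mean of its cluster can be checked directly. This is a sketch on the same kind of synthetic blobs; the names X_blobs, km, and manual_centers are introduced here for illustration. On well-separated data the fitted centers should match the per-cluster means closely, and the inertia is the sum of squared distances from each point to its assigned center.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_blobs, _ = make_blobs(n_samples=300, centers=4, random_state=42)
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_blobs)

# Recompute each center as the mean of the points assigned to it
manual_centers = np.array(
    [X_blobs[km.labels_ == k].mean(axis=0) for k in range(4)]
)

# Inertia: sum of squared distances from each point to its own center
assigned = km.cluster_centers_[km.labels_]
manual_inertia = ((X_blobs - assigned) ** 2).sum()

print("Largest center difference:", np.abs(manual_centers - km.cluster_centers_).max())
print("Inertia (manual vs fitted):", manual_inertia, km.inertia_)
```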

5.1.2 Hierarchical Clustering

Hierarchical clustering builds a hierarchy of clusters. It comes in two forms: agglomerative (bottom-up) and divisive (top-down). Scikit-learn's AgglomerativeClustering implements the agglomerative form.

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Create the hierarchical clustering model
agg_clustering = AgglomerativeClustering(n_clusters=4)

# Fit the model and get cluster labels
labels = agg_clustering.fit_predict(X)

# Visualize the clusters
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50, alpha=0.7)
plt.title('Hierarchical clustering result')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

# Plot the dendrogram
plt.figure(figsize=(12, 8))
linkage_matrix = linkage(X, method='ward')
dendrogram(linkage_matrix)
plt.title('Hierarchical clustering dendrogram')
plt.xlabel('Sample index')
plt.ylabel('Distance')
plt.show()

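5.2 Dimensionality Reduction Algorithms

Dimensionality reduction lowers the number of features while keeping as much of the important information in the data as possible.

5.2.1 Principal Component Analysis (PCA)

PCA is one of the most widely used dimensionality reduction techniques. It applies a linear transformation that re-expresses the data along a set of linearly uncorrelated directions.

from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

# Load the handwritten digits dataset
digits = load_digits()
X = digits.data
y = digits.target

# Create a PCA model that reduces the data to 2 dimensions
pca = PCA(n_components=2)

# Reduce the data
X_pca = pca.fit_transform(X)

# Visualize the result
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', alpha=0.7)
plt.colorbar(scatter)
plt.title('PCA result')
plt.xlabel('Principal component 1')
plt.ylabel('Principal component 2')
plt.show()

# Explained variance ratio
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Cumulative explained variance ratio:", sum(pca.explained_variance_ratio_))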
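
Two components are convenient for plotting but usually keep only part of the variance. The sketch below shows one way to pick the number of components from the cumulative explained variance; the 95% threshold and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits, _ = load_digits(return_X_y=True)

# Fit PCA with all components and accumulate the explained variance
pca_full = PCA().fit(X_digits)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 95%
n_needed = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# A float n_components in (0, 1) asks PCA to make the same choice itself
pca_95 = PCA(n_components=0.95).fit(X_digits)

print("Components needed for 95% variance:", n_needed)
print("Components chosen by PCA(n_components=0.95):", pca_95.n_components_)
```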

5.2.2 t-SNE

t-SNE is a nonlinear dimensionality reduction technique that is especially suited to visualizing high-dimensional data.

from sklearn.manifold import TSNE

# Create a t-SNE model that reduces the data to 2 dimensions
tsne = TSNE(n_components=2, random_state=42)

# Reduce the data
X_tsne = tsne.fit_transform(X)

# Visualize the result
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis', alpha=0.7)
plt.colorbar(scatter)
plt.title('t-SNE result')
plt.xlabel('t-SNE dimension 1')
plt.ylabel('t-SNE dimension 2')
plt.show()

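6. Model Evaluation and Optimization

6.1 Cross-Validation

Cross-validation estimates how well a model generalizes. It splits the data into several subsets (folds) and, in turn, uses each fold as the validation set while training on the rest.

from sklearn.model_selection import cross_val_score

# Create the model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Evaluate with 5-fold cross-validation
# (X and y are the digits data loaded in the PCA section)
cv_scores = cross_val_score(model, X, y, cv=5)

print("Score per fold:", cv_scores)
print("Mean cross-validation score:", cv_scores.mean())
print("Standard deviation of scores:", cv_scores.std())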
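
To show what the fold rotation described above looks like when written out, the sketch below repeats the procedure by hand with StratifiedKFold, which is the splitter cross_val_score uses for classifiers when cv is an integer. The Iris data, the logistic regression model, and the variable names are choices made for this example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_cv, y_cv = load_iris(return_X_y=True)
cv_model = LogisticRegression(max_iter=1000)

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_cv, y_cv):
    fold_model = clone(cv_model)  # fresh, unfitted copy for each fold
    fold_model.fit(X_cv[train_idx], y_cv[train_idx])
    manual_scores.append(fold_model.score(X_cv[val_idx], y_cv[val_idx]))

auto_scores = cross_val_score(cv_model, X_cv, y_cv, cv=5)

print("Manual loop    :", np.round(manual_scores, 4))
print("cross_val_score:", np.round(auto_scores, 4))
```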

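6.2 Hyperparameter Tuning

Hyperparameters are settings chosen before training. Adjusting them can improve model performance.

6.2.1 Grid Search

Grid search finds the best parameters by exhaustively trying every combination in a given grid.

from sklearn.model_selection import GridSearchCV

# Use a classification dataset. X_train and y_train may still hold the
# regression split from section 4.2, which a classifier cannot be fitted on.
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Create the model
rf = RandomForestClassifier(random_state=42)

# Create the grid search object
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1)

# Run the grid search
grid_search.fit(X_train, y_train)

# Print the best parameters and the corresponding score
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)

# Predict with the best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Test set accuracy:", accuracy)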

6.2.2 Random Search

Random search samples randomly from the parameter space. For the same compute budget it is often more efficient than grid search, because it does not have to evaluate every combination.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Define the parameter distributions
# (randint(low, high) samples integers from low to high - 1)
param_dist = {
    'n_estimators': randint(50, 200),
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': randint(2, 11)
}

# Create the random search object
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    random_state=42,
    n_jobs=-1
)

# Run the random search
random_search.fit(X_train, y_train)

# Print the best parameters and the corresponding score
print("Best parameters:", random_search.best_params_)
print("Best cross-validation score:", random_search.best_score_)

# Predict with the best model
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Test set accuracy:", accuracy)

6.3 Model Evaluation Metrics

Different machine learning tasks call for different evaluation metrics.

6.3.1 Classification Metrics

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)
import seaborn as sns

# Use the Iris dataset
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)

# Compute the metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
# Multiclass ROC AUC needs the full probability matrix and a multi_class strategy
roc_auc = roc_auc_score(y_test, y_pred_proba, multi_class='ovr')

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 score:", f1)
print("ROC AUC (one-vs-rest):", roc_auc)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.title('Confusion matrix')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()
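
Precision and recall can be read straight off the confusion matrix, which makes the metrics above easier to interpret. The sketch below uses a small hand-made set of labels (y_true and y_hat are hypothetical) and compares the manual calculation with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels for a 3-class problem
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])
y_hat = np.array([0, 1, 1, 1, 0, 2, 2, 1, 2, 0])

# Rows are true classes, columns are predicted classes
cm_toy = confusion_matrix(y_true, y_hat)
true_positives = np.diag(cm_toy)

precision_manual = true_positives / cm_toy.sum(axis=0)  # TP / predicted count
recall_manual = true_positives / cm_toy.sum(axis=1)     # TP / actual count

# 'weighted' averaging weights each class by its number of true samples
support = cm_toy.sum(axis=1)
precision_weighted = np.average(precision_manual, weights=support)

print("Confusion matrix:\n", cm_toy)
print("Per-class precision:", precision_manual)
print("Per-class recall   :", recall_manual)
print("Weighted precision :", precision_weighted)
```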

6.3.2 Regression Metrics

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Use the California housing dataset
# (load_boston was removed in scikit-learn 1.2)
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Compute the metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean squared error (MSE):", mse)
print("Root mean squared error (RMSE):", rmse)
print("Mean absolute error (MAE):", mae)
print("R² score:", r2)

# Plot predictions against true values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.7)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.title('Predicted vs. true values')
plt.xlabel('True value')
plt.ylabel('Predicted value')
plt.show()

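7. Feature Engineering

Feature engineering is a key step in machine learning. It covers techniques such as feature selection, feature extraction, and feature transformation.

7.1 Feature Selection

Feature selection picks the most relevant subset of the original features, which can improve model performance and reduce computational cost.

from sklearn.feature_selection import SelectKBest, f_classif, RFE

# Use the Iris dataset
X, y = iris.data, iris.target

# Method 1: select the K best features with SelectKBest
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

# Indices of the selected features
selected_features = selector.get_support(indices=True)
print("Selected feature indices:", selected_features)
print("Selected feature names:", [iris.feature_names[i] for i in selected_features])

# Method 2: recursive feature elimination (RFE)
model = LogisticRegression(max_iter=200)
rfe = RFE(estimator=model, n_features_to_select=2)
X_rfe = rfe.fit_transform(X, y)

# Feature ranking (1 = selected) and the features RFE kept
rfe_selected = rfe.get_support(indices=True)
print("\nFeature ranking:", rfe.ranking_)
print("Selected feature indices:", rfe_selected)
print("Selected feature names:", [iris.feature_names[i] for i in rfe_selected])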

7.2 Feature Extraction

Feature extraction derives new features from raw data. It is commonly used for text and image data.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Example text data
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Bag-of-words model
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(corpus)
print("Vocabulary:", vectorizer.get_feature_names_out())
print("Bag-of-words representation:\n", X_counts.toarray())

# TF-IDF model
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)
print("\nTF-IDF representation:\n", X_tfidf.toarray())

7.3 Feature Transformation

Feature transformation converts the original features into a form better suited to machine learning models.

from sklearn.preprocessing import OneHotEncoder, LabelEncoder, PolynomialFeatures

# Example categorical data
data = np.array([['red'], ['blue'], ['green'], ['red'], ['blue']])

# One-hot encoding
# (sparse_output requires scikit-learn >= 1.2; older versions used sparse=False)
encoder = OneHotEncoder(sparse_output=False)
one_hot_encoded = encoder.fit_transform(data)
print("Original data:\n", data)
print("One-hot encoded result:\n", one_hot_encoded)

# Label encoding
label_encoder = LabelEncoder()
label_encoded = label_encoder.fit_transform(data.ravel())
print("\nLabel encoded result:", label_encoded)

# Polynomial features
X = np.array([[1, 2], [3, 4], [5, 6]])
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print("\nOriginal data:\n", X)
print("Polynomial features:\n", X_poly)

8. Case Studies

8.1 Handwritten Digit Recognition

In this case study we use Scikit-learn to build a handwritten digit recognizer.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns

# Load the data
digits = load_digits()
X = digits.data
y = digits.target

# Inspect the data shapes
print("Data shape:", X.shape)
print("Label shape:", y.shape)

# Show some sample images
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for i, ax in enumerate(axes.ravel()):
    ax.imshow(digits.images[i], cmap='binary')
    ax.set_title(f"Label: {digits.target[i]}")
    ax.axis('off')
plt.tight_layout()
plt.show()

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("\nClassification report:\n", classification_report(y_test, y_pred))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion matrix')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

# Show some misclassified samples
incorrect = np.where(y_pred != y_test)[0]
if len(incorrect) > 0:
    fig, axes = plt.subplots(2, 5, figsize=(12, 6))
    for i, ax in enumerate(axes.ravel()):
        if i < len(incorrect):
            idx = incorrect[i]
            # idx indexes the test split, so take the image from X_test
            ax.imshow(X_test[idx].reshape(8, 8), cmap='binary')
            ax.set_title(f"True: {y_test[idx]}, Pred: {y_pred[idx]}")
        ax.axis('off')
    plt.tight_layout()
    plt.show()

8.2 Breast Cancer Prediction

In this case study we use Scikit-learn to build a breast cancer prediction model.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report, roc_curve, auc,
                             precision_recall_curve, average_precision_score)

# Load the data
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

print("Number of features:", len(cancer.feature_names))
print("Number of samples:", X.shape[0])
print("Malignant samples:", np.sum(y == 0))
print("Benign samples:", np.sum(y == 1))

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize: fit the scaler on the training set only, so that no
# information from the test set leaks into preprocessing
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Create and train the model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("\nClassification report:\n", classification_report(y_test, y_pred, target_names=['malignant', 'benign']))

# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(10, 8))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('Receiver operating characteristic curve')
plt.legend(loc="lower right")
plt.show()

# Precision-recall curve
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
average_precision = average_precision_score(y_test, y_pred_proba)

plt.figure(figsize=(10, 8))
plt.plot(recall, precision, color='blue', lw=2, label=f'Precision-recall curve (AP = {average_precision:.2f})')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-recall curve')
plt.legend(loc="lower left")
plt.show()

# Feature importance, measured by the size of each coefficient
feature_importance = pd.DataFrame({
    'Feature': cancer.feature_names,
    'Coefficient': model.coef_[0]
})
# Rank by absolute value: a large negative coefficient matters as much as a large positive one
order = feature_importance['Coefficient'].abs().sort_values(ascending=False).index
top10 = feature_importance.reindex(order).head(10)

plt.figure(figsize=(12, 8))
plt.barh(top10['Feature'], top10['Coefficient'])
plt.title('Top 10 features by absolute coefficient')
plt.xlabel('Coefficient')
plt.ylabel('Feature')
plt.show()

8.3 Customer Segmentation

In this case study we use Scikit-learn for customer segmentation, a typical application of unsupervised learning.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# We assume a customer dataset; here we generate a simulated one
np.random.seed(42)

# Create simulated customer data
n_customers = 500
age = np.random.normal(40, 15, n_customers)
income = np.random.normal(50000, 15000, n_customers)
spending_score = np.random.normal(50, 25, n_customers)

# Keep the values within plausible ranges
age = np.clip(age, 18, 80)
income = np.clip(income, 15000, 150000)
spending_score = np.clip(spending_score, 1, 100)

# Build the DataFrame
customer_data = pd.DataFrame({
    'Age': age,
    'Income': income,
    'SpendingScore': spending_score
})

print("First 5 rows of customer data:")
print(customer_data.head())

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(customer_data)

# Find a suitable number of clusters
inertia = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(scaled_data)
    inertia.append(kmeans.inertia_)

# Plot the elbow curve
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), inertia, marker='o')
plt.title('Elbow method for choosing the number of clusters')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

# Choose the number of clusters from the elbow plot; here we assume 4
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
clusters = kmeans.fit_predict(scaled_data)

# Add the cluster labels to the original data
customer_data['Cluster'] = clusters

# Analyze the characteristics of each cluster
cluster_analysis = customer_data.groupby('Cluster').mean()
print("\nMean feature values per cluster:")
print(cluster_analysis)

# Reduce to 2 dimensions with PCA for visualization
pca = PCA(n_components=2)
principal_components = pca.fit_transform(scaled_data)

# DataFrame with the principal components and cluster labels
pca_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])
pca_df['Cluster'] = clusters

# Visualize the clusters
plt.figure(figsize=(10, 8))
scatter = plt.scatter(pca_df['PC1'], pca_df['PC2'], c=pca_df['Cluster'], cmap='viridis', s=50, alpha=0.7)
plt.title('Customer segmentation result')
plt.xlabel('Principal component 1')
plt.ylabel('Principal component 2')
plt.colorbar(scatter)
plt.show()

# Visualize the clusters on the original features
plt.figure(figsize=(12, 10))

# Age vs Income
plt.subplot(2, 2, 1)
plt.scatter(customer_data['Age'], customer_data['Income'], c=customer_data['Cluster'], cmap='viridis', s=50, alpha=0.7)
plt.title('Age vs Income')
plt.xlabel('Age')
plt.ylabel('Income')

# Age vs SpendingScore
plt.subplot(2, 2, 2)
plt.scatter(customer_data['Age'], customer_data['SpendingScore'], c=customer_data['Cluster'], cmap='viridis', s=50, alpha=0.7)
plt.title('Age vs Spending Score')
plt.xlabel('Age')
plt.ylabel('Spending score')

# Income vs SpendingScore
plt.subplot(2, 2, 3)
plt.scatter(customer_data['Income'], customer_data['SpendingScore'], c=customer_data['Cluster'], cmap='viridis', s=50, alpha=0.7)
plt.title('Income vs Spending Score')
plt.xlabel('Income')
plt.ylabel('Spending score')

plt.tight_layout()
plt.show()

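9. Advanced Techniques

9.1 Pipeline

A pipeline is a powerful Scikit-learn tool that chains several processing steps into a single object, which keeps code shorter and easier to manage.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.decomposition import PCA

# Create a pipeline with preprocessing and a model
pipe = Pipeline([
    ('scaler', StandardScaler()),                       # standardization
    ('pca', PCA(n_components=2)),                       # dimensionality reduction
    ('svm', SVC(kernel='rbf', C=1.0, random_state=42))  # SVM classifier
])

# Train and predict with the pipeline
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Pipeline accuracy:", accuracy)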
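
One practical benefit of a single pipeline object is that it can be tuned as a whole: parameters of a step are addressed as step name, two underscores, parameter name, and the scaler is refitted inside every cross-validation fold. The following is a sketch under assumed settings; the dataset, the grid values, and the variable names are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_bc_tr, X_bc_te, y_bc_tr, y_bc_te = train_test_split(
    X_bc, y_bc, test_size=0.2, random_state=42, stratify=y_bc
)

pipe_cv = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf")),
])

# "svm__C" means: parameter C of the step named "svm"
param_grid_pipe = {
    "svm__C": [0.1, 1.0, 10.0],
    "svm__gamma": ["scale", 0.01],
}

search = GridSearchCV(pipe_cv, param_grid_pipe, cv=5)
search.fit(X_bc_tr, y_bc_tr)
test_acc = search.score(X_bc_te, y_bc_te)

print("Best parameters:", search.best_params_)
print("Test accuracy:", test_acc)
```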

9.2 Custom Transformers

Scikit-learn lets you write custom transformers that can be used inside a pipeline.

from sklearn.base import BaseEstimator, TransformerMixin

# Define a custom transformer
class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, feature_index=0):
        self.feature_index = feature_index

    def fit(self, X, y=None):
        # Nothing to learn; return self so the transformer can be chained
        return self

    def transform(self, X, y=None):
        # Apply log(1 + x) to one feature, working on a float copy
        X_transformed = np.array(X, dtype=float)
        X_transformed[:, self.feature_index] = np.log1p(X_transformed[:, self.feature_index])
        return X_transformed

# Use the custom transformer
custom_transformer = CustomTransformer(feature_index=0)
X_transformed = custom_transformer.fit_transform(X)

# Compare the data before and after the transformation
print("First row of the original data:", X[0])
print("First row of the transformed data:", X_transformed[0])
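
For a stateless transformation like this one, writing a class is not the only option. As a sketch of an alternative, FunctionTransformer wraps a plain function and ColumnTransformer applies it to selected columns. The toy array X_small is made up for the example. Note that ColumnTransformer places the transformed columns first in its output, so the column order is only preserved here because column 0 is the one being transformed.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# Hypothetical data: column 0 is skewed, column 1 is left untouched
X_small = np.array([[0.0, 10.0],
                    [1.0, 20.0],
                    [3.0, 30.0]])

log_first_column = ColumnTransformer(
    transformers=[("log", FunctionTransformer(np.log1p), [0])],
    remainder="passthrough",  # keep the other columns unchanged
)
X_logged = log_first_column.fit_transform(X_small)

print(X_logged)
```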

9.3 Ensemble Learning

Ensemble learning combines several base learners to improve a model's performance and stability.

9.3.1 Voting Classifier

from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Create the individual classifiers
clf1 = LogisticRegression(max_iter=1000, random_state=42)
clf2 = RandomForestClassifier(n_estimators=100, random_state=42)
clf3 = GaussianNB()
clf4 = SVC(probability=True, random_state=42)

# Create the voting classifier
voting_clf = VotingClassifier(
    estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3), ('svc', clf4)],
    voting='soft'  # soft voting averages predicted probabilities
)

# Train the model
voting_clf.fit(X_train, y_train)

# Predict on the test set
y_pred = voting_clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Voting classifier accuracy:", accuracy)

# Compare the performance of each classifier
for clf in (clf1, clf2, clf3, clf4, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(f"{clf.__class__.__name__} accuracy: {accuracy_score(y_test, y_pred):.4f}")

9.3.2 Stacking

from sklearn.ensemble import StackingClassifier

# Define the base learners
estimators = [
    ('lr', LogisticRegression(max_iter=1000, random_state=42)),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('svc', SVC(probability=True, random_state=42))
]

# Define the meta-learner
final_estimator = LogisticRegression()

# Create the stacking classifier
stacking_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=final_estimator,
    cv=5
)

# Train the model
stacking_clf.fit(X_train, y_train)

# Predict on the test set
y_pred = stacking_clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Stacking classifier accuracy:", accuracy)

9.4 Handling Imbalanced Data

Many real-world problems have imbalanced classes. Scikit-learn offers a few ways to deal with this.

from sklearn.datasets import make_classification
from sklearn.utils import resample
from sklearn.metrics import roc_auc_score

# Create an imbalanced dataset
X_imb, y_imb = make_classification(
    n_samples=1000,
    n_features=20,
    n_classes=2,
    weights=[0.9, 0.1],  # 90% class 0, 10% class 1
    random_state=42
)

print("Class distribution of the original dataset:", np.bincount(y_imb))

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_imb, y_imb, test_size=0.2, random_state=42)

# Method 1: class weights
model_weighted = LogisticRegression(class_weight='balanced', random_state=42)
model_weighted.fit(X_train, y_train)
y_pred_weighted = model_weighted.predict(X_test)
y_pred_proba_weighted = model_weighted.predict_proba(X_test)[:, 1]

print("\nWith class weights:")
print("Accuracy:", accuracy_score(y_test, y_pred_weighted))
print("ROC AUC:", roc_auc_score(y_test, y_pred_proba_weighted))

# Method 2: upsample the minority class
# Separate the training set into majority and minority classes
X_train_minority = X_train[y_train == 1]
y_train_minority = y_train[y_train == 1]
X_train_majority = X_train[y_train == 0]
y_train_majority = y_train[y_train == 0]

# Upsample the minority class
X_train_minority_upsampled, y_train_minority_upsampled = resample(
    X_train_minority, y_train_minority,
    replace=True,                     # sample with replacement
    n_samples=len(X_train_majority),  # match the majority class size
    random_state=42
)

# Combine the upsampled minority class with the majority class
X_train_upsampled = np.vstack((X_train_majority, X_train_minority_upsampled))
y_train_upsampled = np.hstack((y_train_majority, y_train_minority_upsampled))

print("\nClass distribution of the training set after upsampling:", np.bincount(y_train_upsampled))

# Train a model on the upsampled data
model_upsampled = LogisticRegression(random_state=42)
model_upsampled.fit(X_train_upsampled, y_train_upsampled)
y_pred_upsampled = model_upsampled.predict(X_test)
y_pred_proba_upsampled = model_upsampled.predict_proba(X_test)[:, 1]

print("\nWith upsampling:")
print("Accuracy:", accuracy_score(y_test, y_pred_upsampled))
print("ROC AUC:", roc_auc_score(y_test, y_pred_proba_upsampled))
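
It can help to see what class_weight='balanced' computes. Per the scikit-learn documentation, each class receives the weight n_samples / (n_classes * count of that class). The sketch below reproduces this on a made-up 90/10 label vector (y_skewed is hypothetical) and compares it with compute_class_weight.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 90 samples of class 0, 10 samples of class 1
y_skewed = np.array([0] * 90 + [1] * 10)
classes = np.unique(y_skewed)

# 'balanced' heuristic: n_samples / (n_classes * samples_per_class)
weights_manual = len(y_skewed) / (len(classes) * np.bincount(y_skewed))

weights_sklearn = compute_class_weight(
    class_weight="balanced", classes=classes, y=y_skewed
)

print("Manual weights :", weights_manual)
print("sklearn weights:", weights_sklearn)
```

The rare class ends up with a weight nine times larger than the common one, so each of its samples counts nine times as much in the loss.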

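10. Learning Resources and Community

10.1 Official Resources

Scikit-learn official documentation: https://scikit-learn.org/stable/documentation.html
- Provides the API reference, user guide, and tutorials
- Includes many code examples and best practices
Scikit-learn GitHub repository: https://github.com/scikit-learn/scikit-learn
- Get the latest source code
- Submit bug reports or feature requests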

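10.2 Recommended Books

"Machine Learning with PyTorch and Scikit-Learn"
- Author: Sebastian Raschka
- A comprehensive introduction to using PyTorch and Scikit-Learn
- Suitable for both machine learning newcomers and professionals
"Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow"
- Author: Aurélien Géron
- A practice-oriented machine learning book
- Covers topics from the basics to advanced material
"Introduction to Machine Learning with Python"
- Authors: Andreas C. Müller & Sarah Guido
- Focuses on using Scikit-learn
- Suitable for beginners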

10.3 Online Courses

Coursera - "Machine Learning" by Andrew Ng
- Not specific to Scikit-learn, but covers the foundations of machine learning
- URL: https://www.coursera.org/learn/machine-learning
Udemy - "Machine Learning A-Z: Hands-On Python & R In Data Science"
- Includes hands-on projects that use Scikit-learn
- URL: https://www.udemy.com/course/machinelearning/
edX - "Data Science: Machine Learning"
- A Harvard University course covering machine learning basics
- URL: https://www.edx.org/course/data-science-machine-learning

10.4 Communities and Forums

Stack Overflow: https://stackoverflow.com/questions/tagged/scikit-learn
- Ask and answer questions about Scikit-learn
Reddit - r/MachineLearning: https://www.reddit.com/r/MachineLearning/
- Machine learning news, discussion, and shared resources
Scikit-learn mailing list: https://mail.python.org/mailman/listinfo/scikit-learn
- Talk with the developers and other users