Which Data Analysis Tasks Is scikit-learn Suited For? A Hands-On Guide from Classification and Regression to Clustering and Dimensionality Reduction
Introduction: Where scikit-learn Fits in Data Analysis
scikit-learn is one of the most popular machine learning libraries in the Python ecosystem, giving data scientists and analysts a complete toolchain from data preprocessing to model evaluation. Built on top of NumPy, SciPy, and matplotlib, this open-source library is known for its clean API, broad range of algorithm implementations, and excellent documentation.
This article explores the data analysis tasks scikit-learn is suited for, covering the core areas of classification, regression, clustering, and dimensionality reduction, with working code examples throughout. Whether you are a beginner or an experienced data analyst, this guide will help you get the most out of scikit-learn.
1. Classification Tasks: Predicting Discrete Classes
1.1 Overview of Classification Tasks
Classification is one of the most common machine learning tasks: the goal is to assign inputs to predefined categories. scikit-learn offers many classification algorithms, including logistic regression, support vector machines, decision trees, and random forests.
1.2 A Complete Classification Example with scikit-learn
```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Preprocess the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train several classifiers
classifiers = {
    'Logistic Regression': LogisticRegression(max_iter=200),
    'SVM': SVC(kernel='rbf', C=1.0),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

# Evaluate each model
for name, clf in classifiers.items():
    clf.fit(X_train_scaled, y_train)
    y_pred = clf.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"\n{name} accuracy: {accuracy:.4f}")
    print("Classification report:")
    print(classification_report(y_test, y_pred, target_names=iris.target_names))
    print("Confusion matrix:")
    print(confusion_matrix(y_test, y_pred))
```
1.3 Key Techniques for Classification Tasks
The importance of feature engineering:
- Feature scaling: use StandardScaler or MinMaxScaler so that all features live on a comparable scale
- Feature selection: use SelectKBest or RFE to keep only the most relevant features
- Dimensionality reduction: use PCA or t-SNE to reduce the number of features (a combined sketch follows this list)
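As a quick illustration of how these pieces fit together, here is a minimal sketch (not part of the example above) that chains StandardScaler, SelectKBest, and a logistic regression in a single Pipeline on the iris data; the choice of k=2 is arbitrary and only for demonstration.
```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Chain scaling, univariate feature selection, and a classifier,
# so the same transformations are applied consistently at train and test time.
feat_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectKBest(score_func=f_classif, k=2)),  # keep the 2 highest-scoring features (illustrative)
    ('clf', LogisticRegression(max_iter=200))
])
feat_pipe.fit(X_train, y_train)
print(f"Test accuracy with 2 selected features: {feat_pipe.score(X_test, y_test):.4f}")
```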
Model evaluation metrics:
- Accuracy: the most intuitive metric, but it can be misleading on imbalanced data
- Precision and recall: more reliable when classes are imbalanced
- F1 score: the harmonic mean of precision and recall
- ROC curve and AUC: evaluate binary classifier performance (see the sketch after this list)
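To make the ROC/AUC point concrete, here is a minimal sketch on the breast cancer dataset that ships with scikit-learn (chosen only because it is binary; it is not part of the iris example above). ROC analysis needs probability scores rather than hard class predictions.
```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# A binary classification dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features so logistic regression converges cleanly
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# ROC/AUC are computed from predicted probabilities of the positive class
y_score = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_score)
print(f"AUC: {roc_auc_score(y_test, y_score):.4f}")
```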
Cross-validation:
```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(classifiers['Random Forest'], X, y, cv=5)
print(f"Cross-validation scores: {scores}")
print(f"Mean score: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")
```
2. Regression Tasks: Predicting Continuous Values
2.1 Overview of Regression Tasks
Regression predicts continuous numeric outputs such as house prices, sales figures, or temperatures. scikit-learn provides linear regression, ridge regression, Lasso, support vector regression, and many other algorithms.
2.2 A Complete Regression Example with scikit-learn
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Load the dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Preprocess the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train several regression models
regressors = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=0.1),
    'Random Forest Regressor': RandomForestRegressor(n_estimators=100, random_state=42)
}

# Evaluate each model
results = {}
for name, reg in regressors.items():
    reg.fit(X_train_scaled, y_train)
    y_pred = reg.predict(X_test_scaled)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    results[name] = {'MSE': mse, 'MAE': mae, 'R2': r2}
    print(f"\n{name}:")
    print(f"  Mean squared error (MSE): {mse:.4f}")
    print(f"  Mean absolute error (MAE): {mae:.4f}")
    print(f"  Coefficient of determination (R²): {r2:.4f}")

# Visualize predictions against true values
plt.figure(figsize=(12, 8))
for i, (name, reg) in enumerate(regressors.items()):
    y_pred = reg.predict(X_test_scaled)
    plt.subplot(2, 2, i + 1)
    plt.scatter(y_test, y_pred, alpha=0.5)
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
    plt.xlabel('True values')
    plt.ylabel('Predicted values')
    plt.title(name)
plt.tight_layout()
plt.show()
```
2.3 Key Techniques for Regression Tasks
Regularization techniques:
- Ridge regression: L2 regularization, useful when features are highly collinear
- Lasso regression: L1 regularization, also performs implicit feature selection
- ElasticNet: combines L1 and L2 regularization (see the sketch after this list)
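A minimal ElasticNet sketch, reusing X_train_scaled, y_train, X_test_scaled, and y_test from the section 2.2 example; the alpha and l1_ratio values are illustrative rather than tuned.
```python
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score

# l1_ratio balances the L1 term (1.0 = pure Lasso) against the L2 term (0.0 = pure Ridge)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42)
enet.fit(X_train_scaled, y_train)  # variables from the section 2.2 example
y_pred = enet.predict(X_test_scaled)

print(f"ElasticNet R²: {r2_score(y_test, y_pred):.4f}")
# Coefficients driven exactly to zero by the L1 part act as implicit feature selection
print(f"Number of zero coefficients: {(enet.coef_ == 0).sum()}")
```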
Model selection and tuning:
```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1, 10, 100],
    'fit_intercept': [True, False]
}
grid_search = GridSearchCV(Ridge(), param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train_scaled, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")
```
3. Clustering Tasks: Discovering Structure in Data
3.1 Overview of Clustering Tasks
Clustering is an unsupervised learning task whose goal is to group similar data points together. scikit-learn offers K-means, hierarchical (agglomerative) clustering, DBSCAN, and several other algorithms.
3.2 A Complete Clustering Example with scikit-learn
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs, load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score, adjusted_rand_score
from sklearn.decomposition import PCA

# Generate synthetic data
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Standardize the data
X_scaled = StandardScaler().fit_transform(X)

# Apply different clustering algorithms
clustering_algorithms = {
    'K-means': KMeans(n_clusters=4, random_state=42),
    'DBSCAN': DBSCAN(eps=0.3, min_samples=5),
    'Agglomerative': AgglomerativeClustering(n_clusters=4)
}

# Visualize the results
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes[0, 0].scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis')
axes[0, 0].set_title('Original data')

for idx, (name, algorithm) in enumerate(clustering_algorithms.items()):
    labels = algorithm.fit_predict(X_scaled)

    # Compute evaluation metrics (the silhouette score needs at least 2 clusters)
    if len(set(labels)) > 1:
        silhouette = silhouette_score(X_scaled, labels)
        ari = adjusted_rand_score(y_true, labels)
    else:
        silhouette = ari = 0

    row = (idx + 1) // 2
    col = (idx + 1) % 2
    axes[row, col].scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
    axes[row, col].set_title(f'{name}\nSilhouette: {silhouette:.3f}, ARI: {ari:.3f}')

plt.tight_layout()
plt.show()

# Use a real dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Reduce to 2D with PCA for plotting, then cluster the original features
pca = PCA(n_components=2)
X_iris_pca = pca.fit_transform(X_iris)

kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans.fit_predict(X_iris)

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(X_iris_pca[:, 0], X_iris_pca[:, 1], c=y_iris, cmap='viridis')
plt.title('True classes')
plt.subplot(1, 2, 2)
plt.scatter(X_iris_pca[:, 0], X_iris_pca[:, 1], c=kmeans_labels, cmap='viridis')
plt.title('K-means clustering result')
plt.show()
```
3.3 Key Techniques for Clustering Tasks
A guide to choosing a clustering algorithm:
- K-means: spherical clusters, computationally efficient, requires K to be specified in advance
- DBSCAN: clusters of arbitrary shape, robust to noise, determines the number of clusters automatically (see the sketch after this list)
- Hierarchical clustering: produces a full cluster hierarchy, best suited to small datasets
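To illustrate the DBSCAN points, here is a small sketch on make_moons, a non-spherical dataset chosen only for this illustration; it shows how the number of clusters and the noise points fall out of the fitted labels rather than being specified up front.
```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two interleaving half-moons: non-spherical clusters where K-means struggles
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
X = StandardScaler().fit_transform(X)

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # the label -1 marks noise points
n_noise = int(np.sum(labels == -1))
print(f"Clusters found: {n_clusters}, noise points: {n_noise}")
```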
Clustering evaluation metrics:
- Silhouette coefficient: measures cluster cohesion and separation; ranges over [-1, 1], higher is better
- Adjusted Rand Index (ARI): measures agreement between the clustering and the true labels
- Mutual information (MI): measures the information overlap between the clustering and the true labels (a small sketch follows this list)
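A tiny sketch of the mutual-information family of scores with toy label vectors; adjusted mutual information (AMI) additionally corrects the raw MI for chance, and both are invariant to how the cluster labels are permuted.
```python
from sklearn.metrics import adjusted_mutual_info_score, normalized_mutual_info_score

# Toy example: two label assignments of the same six points
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [1, 1, 0, 0, 0, 0]

print(f"AMI: {adjusted_mutual_info_score(labels_true, labels_pred):.4f}")
print(f"NMI: {normalized_mutual_info_score(labels_true, labels_pred):.4f}")
```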
Choosing the number of clusters:
```python
# Use the elbow method to choose K for K-means
inertias = []
silhouettes = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)
    if len(set(kmeans.labels_)) > 1:
        silhouettes.append(silhouette_score(X_scaled, kmeans.labels_))
    else:
        silhouettes.append(0)

# Plot the elbow curve and the silhouette scores
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.plot(K_range, inertias, 'bo-')
ax1.set_xlabel('K')
ax1.set_ylabel('Inertia')
ax1.set_title('Elbow method')
ax2.plot(K_range, silhouettes, 'ro-')
ax2.set_xlabel('K')
ax2.set_ylabel('Silhouette coefficient')
ax2.set_title('Silhouette analysis')
plt.show()
```
4. Dimensionality Reduction: Simplifying the Feature Space
4.1 Overview of Dimensionality Reduction
Dimensionality reduction decreases the number of features while preserving as much of the important information as possible. scikit-learn provides PCA, t-SNE, truncated SVD, and several other algorithms (UMAP is not part of scikit-learn itself but is available through the scikit-learn-compatible umap-learn package).
4.2 A Complete Dimensionality Reduction Example with scikit-learn
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits, load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load the handwritten digits dataset
digits = load_digits()
X, y = digits.data, digits.target

# Standardize the data
X_scaled = StandardScaler().fit_transform(X)

# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f"PCA explained variance ratio: {pca.explained_variance_ratio_}")
print(f"PCA cumulative explained variance: {np.cumsum(pca.explained_variance_ratio_)}")

# t-SNE (on a subset to speed up computation)
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_scaled[:500])
y_tsne = y[:500]

# LDA (linear discriminant analysis)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)

# Visualize the results
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

scatter1 = axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', alpha=0.6)
axes[0].set_title('PCA projection')
plt.colorbar(scatter1, ax=axes[0])

scatter2 = axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_tsne, cmap='tab10', alpha=0.6)
axes[1].set_title('t-SNE projection')
plt.colorbar(scatter2, ax=axes[1])

scatter3 = axes[2].scatter(X_lda[:, 0], X_lda[:, 1], c=y, cmap='tab10', alpha=0.6)
axes[2].set_title('LDA projection')
plt.colorbar(scatter3, ax=axes[2])

plt.tight_layout()
plt.show()

# Explained-variance analysis with a full PCA
pca_full = PCA(n_components=None)
pca_full.fit(X_scaled)
cumsum = np.cumsum(pca_full.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1

plt.figure(figsize=(10, 6))
plt.plot(range(1, len(cumsum) + 1), cumsum, 'b-')
plt.axhline(y=0.95, color='r', linestyle='--')
plt.axvline(x=d, color='g', linestyle='--')
plt.xlabel('Number of principal components')
plt.ylabel('Cumulative explained variance')
plt.title('PCA explained variance analysis')
plt.text(d, 0.5, f' {d} components retain 95% of the variance', fontsize=12)
plt.grid(True)
plt.show()
```
4.3 Key Techniques for Dimensionality Reduction
Choosing a dimensionality reduction method:
- PCA: linear, maximizes retained variance, suited to continuous features
- t-SNE: non-linear, preserves local structure, mainly used for visualization
- LDA: supervised, maximizes between-class separation, suited to classification tasks
- TruncatedSVD: works directly on sparse matrices such as text data (see the sketch after this list)
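A minimal TruncatedSVD sketch on a tiny made-up corpus: TfidfVectorizer produces a sparse matrix, and TruncatedSVD (the classic LSA setup for text) can reduce it without densifying, because unlike PCA it does not center the data.
```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny made-up corpus, just to produce a sparse TF-IDF matrix
corpus = [
    "machine learning with scikit-learn",
    "dimensionality reduction of sparse text features",
    "truncated svd works directly on sparse matrices",
    "pca requires dense centered data",
]
X_sparse = TfidfVectorizer().fit_transform(corpus)  # SciPy sparse matrix
print(f"TF-IDF matrix shape: {X_sparse.shape}")

# TruncatedSVD accepts the sparse input as-is
svd = TruncatedSVD(n_components=2, random_state=42)
X_reduced = svd.fit_transform(X_sparse)
print(f"Reduced shape: {X_reduced.shape}")
print(f"Explained variance ratio: {svd.explained_variance_ratio_}")
```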
Evaluating dimensionality reduction:
- Explained variance ratio: how much information the reduced representation retains
- Reconstruction error: how much information is lost by the reduction (see the sketch after this list)
- Visual inspection: scatter plots to check whether classes or clusters separate cleanly
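A short reconstruction-error sketch that reuses X_scaled (the standardized digits data) from the section 4.2 example: inverse_transform maps the reduced representation back to the original space, and the mean squared difference quantifies the information lost for a given number of components.
```python
import numpy as np
from sklearn.decomposition import PCA

# Reuses X_scaled (standardized digits) from the example in section 4.2
for n in (2, 10, 30):
    pca = PCA(n_components=n)
    X_reduced = pca.fit_transform(X_scaled)
    # Map the reduced representation back to the original feature space
    X_reconstructed = pca.inverse_transform(X_reduced)
    error = np.mean((X_scaled - X_reconstructed) ** 2)
    print(f"{n:2d} components -> mean squared reconstruction error: {error:.4f}")
```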
Tuning dimensionality reduction parameters:
```python
# PCA tuning: determine the optimal number of components
def find_optimal_pca_components(X, max_components=50):
    pca = PCA(n_components=None)
    pca.fit(X)
    cumsum = np.cumsum(pca.explained_variance_ratio_)

    # Smallest number of components that explains 95% of the variance
    optimal_components = np.argmax(cumsum >= 0.95) + 1

    plt.figure(figsize=(10, 6))
    plt.plot(range(1, len(cumsum) + 1), cumsum, 'bo-')
    plt.axhline(y=0.95, color='r', linestyle='--', label='95% variance')
    plt.axvline(x=optimal_components, color='g', linestyle='--',
                label=f'Optimal number of components: {optimal_components}')
    plt.xlabel('Number of principal components')
    plt.ylabel('Cumulative explained variance')
    plt.title('Choosing the number of PCA components')
    plt.legend()
    plt.grid(True)
    plt.show()

    return optimal_components

optimal_n_components = find_optimal_pca_components(X_scaled)
print(f"Optimal number of principal components: {optimal_n_components}")
```
5. Model Selection and Evaluation: Ensuring Model Reliability
5.1 Cross-Validation
Cross-validation is an essential way to estimate how well a model generalizes; scikit-learn offers several cross-validation strategies.
```python
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Plain cross-validation
clf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
print(f"Cross-validation accuracy: {scores}")
print(f"Mean accuracy: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")

# An explicit K-fold strategy
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores_kfold = cross_val_score(clf, X, y, cv=kfold)
print(f"KFold cross-validation: {scores_kfold.mean():.4f}")

# Stratified K-fold (recommended for classification tasks)
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_stratified = cross_val_score(clf, X, y, cv=stratified_kfold)
print(f"Stratified cross-validation: {scores_stratified.mean():.4f}")
```
5.2 Hyperparameter Tuning
scikit-learn provides grid search and randomized search for optimizing model hyperparameters.
```python
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import randint

# Grid search
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X, y)
print(f"Grid search best parameters: {grid_search.best_params_}")
print(f"Grid search best score: {grid_search.best_score_:.4f}")

# Randomized search
param_dist = {
    'n_estimators': randint(50, 250),
    'max_depth': [None] + list(range(5, 31)),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10)
}
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=50,
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1,
    verbose=1
)
random_search.fit(X, y)
print(f"Randomized search best parameters: {random_search.best_params_}")
print(f"Randomized search best score: {random_search.best_score_:.4f}")
```
5.3 Model Evaluation Metrics
scikit-learn offers a rich set of evaluation metrics for different scenarios.
```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    mean_squared_error, r2_score,
    silhouette_score, adjusted_rand_score,
    roc_curve, auc, precision_recall_curve
)

# Classification metrics
y_true = [0, 1, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

print("Classification metrics:")
print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall: {recall_score(y_true, y_pred):.4f}")
print(f"F1 score: {f1_score(y_true, y_pred):.4f}")

# Regression metrics
y_true_reg = [3.0, -0.5, 2.0, 7.0]
y_pred_reg = [2.5, 0.0, 2.0, 8.0]

print("\nRegression metrics:")
print(f"Mean squared error: {mean_squared_error(y_true_reg, y_pred_reg):.4f}")
print(f"R²: {r2_score(y_true_reg, y_pred_reg):.4f}")

# Clustering metrics (toy example; the true labels stand in for a 1-D feature matrix here)
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

print("\nClustering metrics:")
print(f"Silhouette coefficient: {silhouette_score(np.array(labels_true).reshape(-1, 1), labels_pred):.4f}")
print(f"Adjusted Rand Index: {adjusted_rand_score(labels_true, labels_pred):.4f}")
```
6. Data Preprocessing: Building Machine Learning Pipelines
6.1 Standardization and Normalization
```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Example data
data = np.array([[1.0, -100.0, 1000.0],
                 [2.0, -200.0, 2000.0],
                 [3.0, -300.0, 3000.0]])

# Standardization (z-score)
scaler_std = StandardScaler()
data_std = scaler_std.fit_transform(data)
print("Standardized:")
print(data_std)

# Normalization (min-max)
scaler_minmax = MinMaxScaler()
data_minmax = scaler_minmax.fit_transform(data)
print("\nMin-max normalized:")
print(data_minmax)

# RobustScaler (robust to outliers)
scaler_robust = RobustScaler()
data_robust = scaler_robust.fit_transform(data)
print("\nRobustScaler result:")
print(data_robust)
```
6.2 Handling Missing Values
```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Data containing missing values
data_with_nan = np.array([[1, 2, np.nan],
                          [4, np.nan, 6],
                          [7, 8, 9],
                          [np.nan, 5, 3]])

# Mean imputation
imputer_mean = SimpleImputer(strategy='mean')
data_mean = imputer_mean.fit_transform(data_with_nan)
print("Mean imputation:")
print(data_mean)

# Median imputation
imputer_median = SimpleImputer(strategy='median')
data_median = imputer_median.fit_transform(data_with_nan)
print("\nMedian imputation:")
print(data_median)

# KNN imputation
imputer_knn = KNNImputer(n_neighbors=2)
data_knn = imputer_knn.fit_transform(data_with_nan)
print("\nKNN imputation:")
print(data_knn)
```
6.3 Encoding Categorical Features
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Categorical example data
data = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['large', 'small', 'large', 'small', 'large'],
    'price': [10, 20, 30, 15, 25]
})

# Label encoding
label_encoder = LabelEncoder()
data['color_encoded'] = label_encoder.fit_transform(data['color'])
print("Label encoding:")
print(data[['color', 'color_encoded']])

# One-hot encoding
onehot_encoder = OneHotEncoder(sparse_output=False)
color_encoded = onehot_encoder.fit_transform(data[['color']])
color_df = pd.DataFrame(color_encoded,
                        columns=onehot_encoder.get_feature_names_out(['color']))
print("\nOne-hot encoding:")
print(color_df)
```
6.4 Building a Machine Learning Pipeline
```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Example data
data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'salary': [50000, 60000, 70000, 80000, 90000],
    'city': ['Beijing', 'Shanghai', 'Beijing', 'Shanghai', 'Beijing'],
    'purchased': [0, 1, 0, 1, 1]
})

X = data[['age', 'salary', 'city']]
y = data['purchased']

# Define the preprocessing steps
numeric_features = ['age', 'salary']
categorical_features = ['city']

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Build the full pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)

# Evaluate the model
score = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {score:.4f}")

# Predict on new data with the same pipeline
new_data = pd.DataFrame({
    'age': [28, 32],
    'salary': [55000, 65000],
    'city': ['Beijing', 'Shanghai']
})
predictions = pipeline.predict(new_data)
print(f"Predictions for new data: {predictions}")
```
7. Advanced Topics: Feature Selection and Ensemble Learning
7.1 Feature Selection
```python
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=2, random_state=42)

# Method 1: feature selection based on univariate statistical tests
selector_kbest = SelectKBest(score_func=f_classif, k=10)
X_kbest = selector_kbest.fit_transform(X, y)
print(f"Shape after KBest selection: {X_kbest.shape}")
print(f"KBest scores: {selector_kbest.scores_}")

# Method 2: recursive feature elimination (RFE)
estimator = RandomForestClassifier(n_estimators=100, random_state=42)
selector_rfe = RFE(estimator, n_features_to_select=10, step=1)
X_rfe = selector_rfe.fit_transform(X, y)
print(f"\nShape after RFE selection: {X_rfe.shape}")
print(f"RFE ranking: {selector_rfe.ranking_}")

# Method 3: model-based feature selection
selector_model = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold='median'
)
X_model = selector_model.fit_transform(X, y)
print(f"\nShape after model-based selection: {X_model.shape}")
```
7.2 Ensemble Learning
```python
from sklearn.ensemble import (VotingClassifier, BaggingClassifier,
                              AdaBoostClassifier, RandomForestClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Base classifiers
clf1 = LogisticRegression(random_state=42, max_iter=1000)
clf2 = RandomForestClassifier(n_estimators=100, random_state=42)
clf3 = SVC(kernel='rbf', probability=True, random_state=42)

# Hard voting
voting_hard = VotingClassifier(
    estimators=[('lr', clf1), ('rf', clf2), ('svc', clf3)],
    voting='hard'
)

# Soft voting
voting_soft = VotingClassifier(
    estimators=[('lr', clf1), ('rf', clf2), ('svc', clf3)],
    voting='soft'
)

# Bagging (the base learner argument is named `estimator` in scikit-learn >= 1.2)
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=5),
    n_estimators=100,
    random_state=42
)

# AdaBoost
adaboost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),
    n_estimators=100,
    random_state=42
)

# Train and evaluate with cross-validation (X, y come from the feature-selection example above)
for name, clf in [('Hard Voting', voting_hard), ('Soft Voting', voting_soft),
                  ('Bagging', bagging), ('AdaBoost', adaboost)]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name} cross-validation accuracy: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")
```
8. A Real-World Case: An End-to-End Data Analysis Workflow
8.1 A Complete Case Study: Customer Churn Prediction
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import matplotlib.pyplot as plt
import seaborn as sns

# Simulated customer data
np.random.seed(42)
n_samples = 1000
data = pd.DataFrame({
    'age': np.random.randint(18, 70, n_samples),
    'tenure': np.random.randint(1, 72, n_samples),
    'monthly_charges': np.random.uniform(20, 120, n_samples),
    'total_charges': np.random.uniform(100, 8000, n_samples),
    'contract': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_samples),
    'internet_service': np.random.choice(['DSL', 'Fiber optic', 'No'], n_samples),
    'churn': np.random.choice([0, 1], n_samples, p=[0.7, 0.3])
})

# Exploratory data analysis
print("Basic information:")
print(data.info())
print("\nDescriptive statistics:")
print(data.describe())
print("\nChurn distribution:")
print(data['churn'].value_counts(normalize=True))

# Feature engineering
X = data.drop('churn', axis=1)
y = data['churn']

# Numeric and categorical features
numeric_features = ['age', 'tenure', 'monthly_charges', 'total_charges']
categorical_features = ['contract', 'internet_service']

# Build the preprocessing pipeline
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Build the full pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42, stratify=y)

# Hyperparameter tuning
param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [None, 10, 20],
    'classifier__min_samples_split': [2, 5],
    'classifier__min_samples_leaf': [1, 2]
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation AUC: {grid_search.best_score_:.4f}")

# Final evaluation on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)[:, 1]

print("\nTest-set performance:")
print(f"AUC: {roc_auc_score(y_test, y_pred_proba):.4f}")
print("\nClassification report:")
print(classification_report(y_test, y_pred))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Not churned', 'Churned'],
            yticklabels=['Not churned', 'Churned'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion matrix')
plt.show()

# Feature importance analysis
if hasattr(best_model.named_steps['classifier'], 'feature_importances_'):
    # Recover feature names from the fitted one-hot encoder
    ohe = best_model.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot']
    cat_feature_names = ohe.get_feature_names_out(categorical_features)
    all_feature_names = numeric_features + list(cat_feature_names)

    importances = best_model.named_steps['classifier'].feature_importances_
    feature_importance_df = pd.DataFrame({
        'feature': all_feature_names,
        'importance': importances
    }).sort_values('importance', ascending=False)

    plt.figure(figsize=(10, 6))
    sns.barplot(data=feature_importance_df.head(10), x='importance', y='feature')
    plt.title('Top 10 Feature Importances')
    plt.xlabel('Importance')
    plt.tight_layout()
    plt.show()
```
8.2 Case Summary and Best Practices
- Data exploration: always understand the data distribution and quality issues first
- Pipelines: use Pipeline to keep preprocessing and model training consistent
- Cross-validation: use cross-validation to guard against overfitting
- Hyperparameter tuning: search for the best parameter combination systematically
- Model interpretation: analyze feature importances to understand the model's decisions
9. Summary and Recommendations
9.1 Where scikit-learn Fits
scikit-learn is well suited to the following data analysis tasks:
- Classification: binary and multi-class problems, text classification, image classification
- Regression: numeric prediction, time-series forecasting, demand forecasting
- Clustering: customer segmentation, anomaly detection, image segmentation
- Dimensionality reduction: data visualization, feature compression, noise filtering
- Model selection: hyperparameter tuning, model evaluation, feature selection
9.2 Best Practices for Using scikit-learn
- Start simple: try simple models (such as linear models) before moving to complex ones
- Preprocess the data: always clean and scale the data
- Cross-validate: use cross-validation to assess generalization
- Pipeline your workflow: use Pipeline to keep experiments reproducible
- Document your work: record experiments and results so they can be reproduced later
9.3 Further Learning Resources
- Official documentation: scikit-learn.org
- User guide: detailed explanations of concepts with examples
- API reference: complete documentation of every function and class
- Example gallery: a rich collection of real-world examples
With this guide you should now have a solid overview of how scikit-learn is applied across different data analysis tasks. Remember that practice is the best teacher: try these techniques on real datasets and build up your own data analysis skills step by step.