Exploring the Strengths and Weaknesses of scikit-learn Random Forest Models: Performance and Selection Considerations in Real Machine Learning Projects
1. Overview of the Random Forest Model
Random Forest is an ensemble learning method proposed by Leo Breiman in 2001. It builds multiple decision trees and combines their outputs to improve overall accuracy and control overfitting. The underlying idea is the "wisdom of the crowd": combining many weak learners (decision trees) yields a strong learner.
The core characteristics of random forests include (a minimal hand-rolled sketch follows this list):
- Randomness: training data for each tree is drawn by bootstrap sampling
- Random feature selection: a random subset of features is considered at each node split
- Parallel construction: the trees are independent of one another and can be built in parallel
- Result aggregation: predictions are combined by majority voting (classification) or averaging (regression)
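To make the bootstrap-plus-voting idea concrete, here is a minimal hand-rolled sketch of bagging with majority voting. It is an illustration of the concept only, not how scikit-learn implements it internally:

```python
# Conceptual sketch of bagging + majority voting (illustration only)
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

rng = np.random.RandomState(0)
trees = []
for _ in range(25):
    # Bootstrap sample: draw n rows with replacement
    idx = rng.randint(0, len(X), len(X))
    # max_features='sqrt' gives each split a random feature subset
    tree = DecisionTreeClassifier(max_features='sqrt',
                                  random_state=rng.randint(1_000_000))
    trees.append(tree.fit(X[idx], y[idx]))

# Majority vote across trees (binary labels, so a mean > 0.5 is a majority)
votes = np.stack([t.predict(X) for t in trees])
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("Training accuracy of the hand-rolled ensemble:", (ensemble_pred == y).mean())
```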
In scikit-learn, random forests are implemented by two classes: RandomForestClassifier (for classification) and RandomForestRegressor (for regression).
2. Random Forest in scikit-learn
scikit-learn provides an easy-to-use random forest implementation. Let's start with basic usage:
```python
# Import the required libraries
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification example
# Create synthetic data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train a random forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf_classifier.predict(X_test)
print(f"Classification accuracy: {accuracy_score(y_test, y_pred):.4f}")

# Regression example
# Create synthetic data
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15,
                       noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train a random forest regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf_regressor.predict(X_test)
print(f"Regression MSE: {mean_squared_error(y_test, y_pred):.4f}")
```
2.1 Key Parameters
The scikit-learn random forest exposes several important parameters that can be tuned for the problem at hand:
```python
# Create a random forest classifier, showing the key parameters
rf_classifier = RandomForestClassifier(
    n_estimators=100,      # number of trees, default 100
    criterion='gini',      # split criterion, 'gini' or 'entropy'
    max_depth=None,        # maximum tree depth, None means unlimited
    min_samples_split=2,   # minimum samples required to split an internal node
    min_samples_leaf=1,    # minimum samples required at a leaf node
    max_features='sqrt',   # features considered per split ('auto' was removed in scikit-learn 1.3)
    bootstrap=True,        # whether to use bootstrap sampling
    oob_score=False,       # whether to estimate generalization accuracy with out-of-bag samples
    n_jobs=None,           # number of parallel jobs, -1 uses all processors
    random_state=42,       # random seed
    verbose=0,             # verbosity level
    warm_start=False,      # whether to reuse trees from a previous fit
    class_weight=None      # class weights
)
```
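One of these parameters deserves a note: setting oob_score=True gives a nearly free estimate of generalization accuracy from the out-of-bag samples, without holding out a separate validation set. A minimal sketch:

```python
# Out-of-bag (OOB) evaluation: each tree is trained on a bootstrap sample,
# so roughly a third of the rows are "out of bag" for that tree and can
# serve as a built-in validation set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X, y)
print(f"OOB accuracy estimate: {rf.oob_score_:.4f}")
```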
3. Strengths of the Random Forest Model
3.1 High Accuracy
By ensembling many decision trees, a random forest usually delivers higher predictive accuracy. Any single tree may perform poorly on some subset of the data, but combining many trees reduces the overall error.
```python
# Compare the accuracy of a single decision tree and a random forest
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Create data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Single decision tree
dt = DecisionTreeClassifier(random_state=42)
dt_scores = cross_val_score(dt, X, y, cv=5)
print(f"Mean accuracy of a single decision tree: {dt_scores.mean():.4f}")

# Random forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_scores = cross_val_score(rf, X, y, cv=5)
print(f"Mean accuracy of the random forest: {rf_scores.mean():.4f}")
```
3.2 Resistance to Overfitting
Random sampling and random feature selection substantially lower the risk of overfitting. Each tree is trained on a random subset of the data, which reduces the model's dependence on any particular data pattern.
```python
# Compare decision tree and random forest performance on training and test sets
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Create data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a decision tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_train_acc = accuracy_score(y_train, dt.predict(X_train))
dt_test_acc = accuracy_score(y_test, dt.predict(X_test))
print(f"Decision tree - train accuracy: {dt_train_acc:.4f}, test accuracy: {dt_test_acc:.4f}")

# Train a random forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_train_acc = accuracy_score(y_train, rf.predict(X_train))
rf_test_acc = accuracy_score(y_test, rf.predict(X_test))
print(f"Random forest - train accuracy: {rf_train_acc:.4f}, test accuracy: {rf_test_acc:.4f}")
```
3.3 Handling High-Dimensional Data
Random forests can handle a large number of input variables and do not require feature scaling, which makes them effective on high-dimensional data.
```python
# Use a random forest on high-dimensional data
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import time

# Create high-dimensional data
X, y = make_classification(n_samples=1000, n_features=500, n_informative=50, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the random forest and time it
start_time = time.time()
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
end_time = time.time()

# Evaluate
y_pred = rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on high-dimensional data (500 features): {accuracy:.4f}")
print(f"Training time: {end_time - start_time:.2f} s")
```
3.4 Assessing Feature Importance
Random forests provide a built-in measure of feature importance, which is very helpful for understanding the data and for feature engineering.
```python
# Use a random forest to assess feature importance
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import numpy as np

# Create data in which the first 5 features are informative
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=0, random_state=42)

# Train the random forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get the feature importances
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")
for i in range(X.shape[1]):
    print(f"Feature {indices[i]}: {importances[indices[i]]:.4f}")

# Visualize the feature importances
plt.figure(figsize=(12, 6))
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.tight_layout()
plt.show()
```
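The feature_importances_ above are impurity-based and are known to favor high-cardinality or continuous features. A common complement is permutation importance on held-out data, sketched here with sklearn.inspection.permutation_importance:

```python
# Permutation importance: measure how much shuffling one feature's values
# degrades held-out performance.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)

# Print the top 5 features by mean importance
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"Feature {i}: {result.importances_mean[i]:.4f} (±{result.importances_std[i]:.4f})")
```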
3.5 Handling Missing Values
Random forests are relatively insensitive to missing values and can maintain good accuracy. Although scikit-learn's implementation does not handle missing values directly, several strategies can be used; strategy 1 (median imputation) is shown below, and a second strategy follows it.
```python
# Strategies for handling missing values
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create data with missing values
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X = pd.DataFrame(X)

# Randomly introduce missing values (about 10% of the cells)
mask = np.random.random(X.shape) < 0.1
X[mask] = np.nan
print(f"Number of missing values: {X.isnull().sum().sum()}")

# Strategy 1: fill missing values with the median
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y, test_size=0.3, random_state=42)

# Train the random forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Evaluate
y_pred = rf.predict(X_test)
print(f"Accuracy after imputing missing values: {accuracy_score(y_test, y_pred):.4f}")
```
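The code above labels median imputation "strategy 1". A second common strategy is to keep the imputation but also add binary "was missing" indicator columns, which lets the trees split on missingness itself. A minimal sketch using SimpleImputer's add_indicator option:

```python
# Strategy 2: impute and also append "was missing" indicator columns
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.1] = np.nan  # ~10% missing cells

# add_indicator=True appends one binary column per feature that had NaNs
imputer = SimpleImputer(strategy='median', add_indicator=True)
X_imputed = imputer.fit_transform(X)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(rf, X_imputed, y, cv=5)
print(f"CV accuracy with indicator columns: {scores.mean():.4f}")
```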
3.6 Parallelization
Because each tree is built independently, random forest training can be parallelized, taking full advantage of multi-core processors.
```python
# Compare serial and parallel random forest training times
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import time

# Create a larger dataset
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Serial training (n_jobs=1)
start_time = time.time()
rf_serial = RandomForestClassifier(n_estimators=100, n_jobs=1, random_state=42)
rf_serial.fit(X_train, y_train)
serial_time = time.time() - start_time
print(f"Serial training time: {serial_time:.2f} s")

# Parallel training (n_jobs=-1 uses all available CPU cores)
start_time = time.time()
rf_parallel = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rf_parallel.fit(X_train, y_train)
parallel_time = time.time() - start_time
print(f"Parallel training time: {parallel_time:.2f} s")
print(f"Speedup: {serial_time/parallel_time:.2f}x")
```
3.7 Capturing Non-Linear Relationships
Random forests can capture complex non-linear relationships among features, which is why they perform well in many real-world applications.
```python
# Demonstrate the random forest's ability to fit non-linear relationships
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

# Create non-linear data
np.random.seed(42)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

# Test data
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]

# Train the models
lr = LinearRegression()
lr.fit(X, y)
dt = DecisionTreeRegressor(max_depth=5)
dt.fit(X, y)
rf = RandomForestRegressor(n_estimators=100, max_depth=5)
rf.fit(X, y)

# Predict
y_lr = lr.predict(X_test)
y_dt = dt.predict(X_test)
y_rf = rf.predict(X_test)

# Visualize the results
plt.figure(figsize=(12, 6))
plt.scatter(X, y, c='k', label='data')
plt.plot(X_test, y_lr, label='Linear regression', color='blue')
plt.plot(X_test, y_dt, label='Decision tree', color='red')
plt.plot(X_test, y_rf, label='Random forest', color='green')
plt.xlabel("X")
plt.ylabel("y")
plt.title("Fitting non-linear data with different models")
plt.legend()
plt.show()
```
4. Weaknesses of the Random Forest Model
4.1 High Computational Cost
Compared with a single decision tree, a random forest needs considerably more computation, especially with many trees or large datasets.
```python
# Compare decision tree and random forest training times
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import time

# Create datasets of different sizes
datasets = []
for n_samples in [1000, 5000, 10000, 20000]:
    X, y = make_classification(n_samples=n_samples, n_features=20, random_state=42)
    datasets.append((X, y, n_samples))

# Compare training times
print("Dataset size\tDecision tree (s)\tRandom forest (s)\tRatio")
for X, y, n_samples in datasets:
    # Decision tree
    start_time = time.time()
    dt = DecisionTreeClassifier(random_state=42)
    dt.fit(X, y)
    dt_time = time.time() - start_time

    # Random forest
    start_time = time.time()
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X, y)
    rf_time = time.time() - start_time

    print(f"{n_samples}\t\t{dt_time:.4f}\t\t{rf_time:.4f}\t\t{rf_time/dt_time:.2f}x")
```
4.2 Poor Interpretability
The black-box nature of a random forest makes it harder to interpret than a single decision tree. Feature importances offer some insight, but the model cannot provide the clear decision path that a single tree can.
```python
# Compare the interpretability of a decision tree and a random forest
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

# Create data
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)

# Train a decision tree
dt = DecisionTreeClassifier(max_depth=3, random_state=42)
dt.fit(X, y)

# Train a random forest
rf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
rf.fit(X, y)

# Visualize the decision tree
plt.figure(figsize=(20, 10))
plot_tree(dt, filled=True, feature_names=[f'Feature {i}' for i in range(X.shape[1])],
          class_names=['Class 0', 'Class 1'])
plt.title("Decision tree visualization")
plt.show()

# Print the random forest's feature importances
print("Random forest feature importances:")
for i, importance in enumerate(rf.feature_importances_):
    print(f"Feature {i}: {importance:.4f}")

# Try visualizing a single tree from the random forest
plt.figure(figsize=(20, 10))
plot_tree(rf.estimators_[0], filled=True,
          feature_names=[f'Feature {i}' for i in range(X.shape[1])],
          class_names=['Class 0', 'Class 1'])
plt.title("One tree from the random forest")
plt.show()
```
4.3 Large Memory Footprint
A random forest must store many decision trees, so its memory footprint is large, especially when the trees are numerous or deep.
```python
# Compare the memory footprint of a decision tree and a random forest
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import sys
import pickle

# Create data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Train a decision tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X, y)

# Train random forests with different numbers of trees
rf_models = []
for n_estimators in [10, 50, 100, 200]:
    rf = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    rf.fit(X, y)
    rf_models.append((rf, n_estimators))

# Measure the serialized model size
def get_model_size(model):
    return sys.getsizeof(pickle.dumps(model))

dt_size = get_model_size(dt)
print(f"Decision tree model size: {dt_size / 1024:.2f} KB")
print("Random forest model sizes:")
for rf, n_estimators in rf_models:
    rf_size = get_model_size(rf)
    print(f"{n_estimators} trees: {rf_size / 1024:.2f} KB ({rf_size/dt_size:.2f}x the decision tree)")
```
4.4 Sensitivity to Imbalanced Data
With imbalanced classes, a random forest tends to favor the majority class. Class weights can mitigate this, but they require extra tuning.
```python
# Show random forest behavior on imbalanced data
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
import numpy as np

# Create imbalanced data
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=42)
print(f"Class distribution: {np.bincount(y)}")

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Random forest without class weighting
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print("\nResults without class weighting:")
print(classification_report(y_test, y_pred))
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))

# Random forest with balanced class weights
rf_balanced = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
rf_balanced.fit(X_train, y_train)
y_pred_balanced = rf_balanced.predict(X_test)
print("\nResults with balanced class weights:")
print(classification_report(y_test, y_pred_balanced))
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred_balanced))
```
4.5 Complex Hyperparameter Tuning
A random forest has several hyperparameters to tune, such as the number of trees, maximum depth, and minimum samples per split. Finding the best combination can require substantial experimentation and compute; a cheaper successive-halving alternative is sketched after the code below.
```python
# Hyperparameter tuning for a random forest
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import train_test_split
import time

# Create data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the parameter grid ('auto' is no longer a valid max_features value)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None]
}

# Grid search
print("Grid search...")
start_time = time.time()
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=3,
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train, y_train)
grid_time = time.time() - start_time
print(f"Grid search best parameters: {grid_search.best_params_}")
print(f"Grid search best score: {grid_search.best_score_:.4f}")
print(f"Grid search time: {grid_time:.2f} s")

# Randomized search
print("\nRandomized search...")
start_time = time.time()
random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_grid,
    n_iter=50,  # try only 50 parameter combinations
    cv=3,
    n_jobs=-1,
    verbose=1,
    random_state=42
)
random_search.fit(X_train, y_train)
random_time = time.time() - start_time
print(f"Randomized search best parameters: {random_search.best_params_}")
print(f"Randomized search best score: {random_search.best_score_:.4f}")
print(f"Randomized search time: {random_time:.2f} s")
print(f"Randomized search was {grid_time/random_time:.2f}x faster than grid search")
```
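The cheaper alternative mentioned above: scikit-learn also ships successive halving, which evaluates many candidates on small budgets and promotes only the best to larger ones. Note that HalvingRandomSearchCV is still flagged experimental and requires an explicit enabling import; a minimal sketch:

```python
# Successive halving: evaluate many candidates cheaply, keep only the best,
# and re-evaluate the survivors with a larger resource budget.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (required enabling import)
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

param_distributions = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
}
search = HalvingRandomSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    factor=3,        # keep roughly the top 1/3 of candidates each round
    random_state=42,
    n_jobs=-1,
)
search.fit(X, y)
print(f"Best parameters: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.4f}")
```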
5. Random Forests in Real Projects
5.1 Financial Risk Assessment
Random forests perform well in financial risk assessment and are used for tasks such as credit scoring and fraud detection.
```python
# Financial risk assessment example - credit scoring
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, confusion_matrix
from sklearn.preprocessing import StandardScaler

# Create synthetic credit data
np.random.seed(42)
n_samples = 10000

# Features
age = np.random.normal(40, 10, n_samples)
income = np.random.normal(50000, 15000, n_samples)
debt_to_income = np.random.normal(0.3, 0.1, n_samples)
credit_history = np.random.randint(1, 31, n_samples)  # years of credit history
late_payments = np.random.poisson(2, n_samples)       # number of late payments

# Create the target variable (good credit = 1, bad credit = 0)
# using rules that induce correlations with the features
credit_score = (
    0.2 * (age / 50) +
    0.3 * (income / 60000) -
    0.25 * debt_to_income +
    0.15 * (credit_history / 20) -
    0.1 * (late_payments / 10)
)
credit_good = (credit_score + np.random.normal(0, 0.2, n_samples)) > 0.5

# Build the DataFrame
data = pd.DataFrame({
    'age': age,
    'income': income,
    'debt_to_income': debt_to_income,
    'credit_history': credit_history,
    'late_payments': late_payments,
    'credit_good': credit_good.astype(int)
})

print("Data preview:")
print(data.head())
print(f"\nShare of good credit: {data['credit_good'].mean():.2%}")

# Prepare the data
X = data.drop('credit_good', axis=1)
y = data['credit_good']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the random forest
rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train_scaled, y_train)

# Predict
y_pred = rf.predict(X_test_scaled)
y_pred_proba = rf.predict_proba(X_test_scaled)[:, 1]

# Evaluate
print("\nModel evaluation:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"AUC: {roc_auc_score(y_test, y_pred_proba):.4f}")
print("\nConfusion matrix:")
print(confusion_matrix(y_test, y_pred))

# Feature importances
feature_importances = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
print("\nFeature importances:")
print(feature_importances)
```
5.2 Medical Diagnosis
Random forests are also widely used in medical diagnosis, for tasks such as disease prediction and patient risk stratification.
```python
# Medical diagnosis example - diabetes prediction
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Create synthetic medical data
np.random.seed(42)
n_samples = 2000

# Features
age = np.random.normal(50, 12, n_samples)
bmi = np.random.normal(28, 5, n_samples)
glucose = np.random.normal(100, 25, n_samples)
blood_pressure = np.random.normal(70, 10, n_samples)
insulin = np.random.normal(80, 30, n_samples)
pregnancies = np.random.poisson(2, n_samples)
diabetes_pedigree = np.random.gamma(2, 0.5, n_samples)

# Create the target variable (diabetic = 1, non-diabetic = 0)
# using rules that induce correlations with the features
diabetes_risk = (
    0.15 * (age / 60) +
    0.25 * (bmi / 35) +
    0.3 * (glucose / 140) +
    0.1 * (blood_pressure / 80) +
    0.1 * (insulin / 100) +
    0.05 * (pregnancies / 5) +
    0.05 * diabetes_pedigree
)
has_diabetes = (diabetes_risk + np.random.normal(0, 0.15, n_samples)) > 0.6

# Build the DataFrame
data = pd.DataFrame({
    'age': age,
    'bmi': bmi,
    'glucose': glucose,
    'blood_pressure': blood_pressure,
    'insulin': insulin,
    'pregnancies': pregnancies,
    'diabetes_pedigree': diabetes_pedigree,
    'has_diabetes': has_diabetes.astype(int)
})

print("Data preview:")
print(data.head())
print(f"\nDiabetes prevalence: {data['has_diabetes'].mean():.2%}")

# Prepare the data
X = data.drop('has_diabetes', axis=1)
y = data['has_diabetes']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the random forest
rf = RandomForestClassifier(n_estimators=100, max_depth=10, class_weight='balanced', random_state=42)
rf.fit(X_train_scaled, y_train)

# Cross-validation
cv_scores = cross_val_score(rf, X_train_scaled, y_train, cv=5, scoring='roc_auc')
print(f"\nCross-validated AUC: {cv_scores.mean():.4f} (±{cv_scores.std():.4f})")

# Predict
y_pred = rf.predict(X_test_scaled)
y_pred_proba = rf.predict_proba(X_test_scaled)[:, 1]

# Evaluate
print("\nModel evaluation:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"AUC: {roc_auc_score(y_test, y_pred_proba):.4f}")
print("\nConfusion matrix:")
print(confusion_matrix(y_test, y_pred))

# Feature importances
feature_importances = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

# Visualize the feature importances
plt.figure(figsize=(10, 6))
plt.barh(feature_importances['feature'], feature_importances['importance'])
plt.xlabel('Importance')
plt.title('Feature importances for diabetes prediction')
plt.tight_layout()
plt.show()

# Partial dependence plot - glucose level vs. diabetes probability
# (plot_partial_dependence was removed; recent scikit-learn uses PartialDependenceDisplay)
from sklearn.inspection import PartialDependenceDisplay

features = ['glucose']  # more features can be added
PartialDependenceDisplay.from_estimator(rf, X_train_scaled, features,
                                        feature_names=list(X.columns), grid_resolution=100)
plt.suptitle('Partial dependence of diabetes probability on glucose level')
plt.tight_layout()
plt.subplots_adjust(top=0.9)
plt.show()
```
5.3 Recommender Systems
Random forests can also power recommender systems by predicting a user's preference or rating for an item.
```python
# Recommender system example - movie rating prediction
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt

# Create synthetic movie-rating data
np.random.seed(42)
n_users = 1000
n_movies = 500
n_ratings = 20000

# User features
user_ages = np.random.randint(18, 65, n_users)
user_genders = np.random.binomial(1, 0.5, n_users)    # 0: female, 1: male
user_occupations = np.random.randint(0, 21, n_users)  # 21 occupations

# Movie features
movie_years = np.random.randint(1980, 2023, n_movies)
movie_genres = np.random.randint(0, 19, n_movies)       # 19 genres
movie_durations = np.random.randint(80, 180, n_movies)  # runtime in minutes

# Generate ratings
user_ids = np.random.randint(0, n_users, n_ratings)
movie_ids = np.random.randint(0, n_movies, n_ratings)

# Base ratings derived from the user and movie features
base_ratings = (
    2.0 +                                                # base score
    0.3 * np.sin(user_ages[user_ids] / 10) +             # age effect
    0.2 * user_genders[user_ids] +                       # gender effect
    0.1 * (user_occupations[user_ids] / 20) +            # occupation effect
    0.2 * ((movie_years[movie_ids] - 1980) / 40) +       # release-year effect
    0.3 * np.sin(movie_genres[movie_ids] * np.pi / 9) +  # genre effect
    0.1 * ((movie_durations[movie_ids] - 80) / 100)      # runtime effect
)

# Add some random noise
ratings = np.clip(base_ratings + np.random.normal(0, 0.7, n_ratings), 1, 5).round(1)

# Build the DataFrame
data = pd.DataFrame({
    'user_id': user_ids,
    'movie_id': movie_ids,
    'rating': ratings
})

# User feature table
user_features = pd.DataFrame({
    'user_id': range(n_users),
    'age': user_ages,
    'gender': user_genders,
    'occupation': user_occupations
})

# Movie feature table
movie_features = pd.DataFrame({
    'movie_id': range(n_movies),
    'year': movie_years,
    'genre': movie_genres,
    'duration': movie_durations
})

# Merge the features
data = data.merge(user_features, on='user_id')
data = data.merge(movie_features, on='movie_id')

print("Data preview:")
print(data.head())
print(f"\nRating statistics:\n{data['rating'].describe()}")

# Prepare the data
X = data.drop(['user_id', 'movie_id', 'rating'], axis=1)
y = data['rating']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the random forest regressor
rf = RandomForestRegressor(n_estimators=100, max_depth=15, random_state=42)
rf.fit(X_train, y_train)

# Predict
y_pred = rf.predict(X_test)

# Evaluate
print("\nModel evaluation:")
print(f"MSE: {mean_squared_error(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.4f}")

# Feature importances
feature_importances = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

# Visualize the feature importances
plt.figure(figsize=(10, 6))
plt.barh(feature_importances['feature'], feature_importances['importance'])
plt.xlabel('Importance')
plt.title('Feature importances for movie rating prediction')
plt.tight_layout()
plt.show()

# Predict a rating for a new user/movie pair
def predict_rating(user_age, user_gender, user_occupation, movie_year, movie_genre, movie_duration):
    # Build a single-row feature DataFrame
    features = pd.DataFrame({
        'age': [user_age],
        'gender': [user_gender],
        'occupation': [user_occupation],
        'year': [movie_year],
        'genre': [movie_genre],
        'duration': [movie_duration]
    })
    # Predict the rating
    return rf.predict(features)[0]

# Example predictions
print("\nExample predictions:")
print(f"Predicted rating of a 25-year-old man (occupation=5) for a 2020 action film (120 min): "
      f"{predict_rating(25, 1, 5, 2020, 1, 120):.2f}")
print(f"Predicted rating of a 40-year-old woman (occupation=10) for a 1995 drama (150 min): "
      f"{predict_rating(40, 0, 10, 1995, 5, 150):.2f}")
```
5.4 Anomaly Detection
Random forests can also be used for anomaly detection, identifying data points that deviate from normal patterns.
```python
# Anomaly detection example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Create normal data
np.random.seed(42)
n_normal = 1000
normal_data = np.random.multivariate_normal(
    mean=[0, 0],
    cov=[[1, 0.5], [0.5, 1]],
    size=n_normal
)

# Create anomalous data
n_anomaly = 50
anomaly_data = np.random.uniform(low=-6, high=6, size=(n_anomaly, 2))

# Combine the data
X = np.vstack([normal_data, anomaly_data])
y = np.array([0] * n_normal + [1] * n_anomaly)  # 0: normal, 1: anomaly

# Visualize the data
plt.figure(figsize=(10, 6))
plt.scatter(normal_data[:, 0], normal_data[:, 1], c='blue', label='normal')
plt.scatter(anomaly_data[:, 0], anomaly_data[:, 1], c='red', label='anomaly')
plt.title("Distribution of normal and anomalous data")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()

# Method 1: IsolationForest (a random-forest variant designed for anomaly detection)
print("Anomaly detection with IsolationForest:")
iso_forest = IsolationForest(contamination=0.05, random_state=42)
y_pred_iso = iso_forest.fit_predict(X)
# IsolationForest returns -1 for anomalies and 1 for normal points; map to 1/0
y_pred_iso = np.where(y_pred_iso == -1, 1, 0)
print(classification_report(y, y_pred_iso, target_names=['normal', 'anomaly']))
print("Confusion matrix:")
print(confusion_matrix(y, y_pred_iso))

# Method 2: a standard random forest classifier
print("\nAnomaly detection with a standard random forest:")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42, stratify=y)

# Train the random forest
rf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
rf.fit(X_train, y_train)

# Predict
y_pred_rf = rf.predict(X_test)
print(classification_report(y_test, y_pred_rf, target_names=['normal', 'anomaly']))
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred_rf))

# Build a grid to visualize the decision boundary
h = .02  # grid step
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Predict over the whole grid
Z = rf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Visualize the decision boundary
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=plt.cm.Paired)
plt.title("Random forest decision boundary for anomaly detection")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
```
6. Factors to Consider When Choosing a Random Forest
6.1 Dataset Size and Number of Features
Random forests work on datasets of many sizes, but their behavior varies with scale. Dataset size and feature count should inform both whether to use a random forest and how to set its parameters.
```python
# Compare random forest behavior across dataset sizes
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import time

# Define dataset sizes to test
dataset_sizes = [100, 1000, 5000, 10000, 20000]
n_features = 20
results = []

for size in dataset_sizes:
    print(f"\nDataset size: {size}")
    # Create data
    X, y = make_classification(n_samples=size, n_features=n_features, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Train the random forest and time it
    start_time = time.time()
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    train_time = time.time() - start_time

    # Predict and time it
    start_time = time.time()
    y_pred = rf.predict(X_test)
    predict_time = time.time() - start_time

    # Compute the accuracy
    accuracy = accuracy_score(y_test, y_pred)

    # Record the results
    results.append({
        'size': size,
        'train_time': train_time,
        'predict_time': predict_time,
        'accuracy': accuracy
    })
    print(f"Training time: {train_time:.4f} s")
    print(f"Prediction time: {predict_time:.4f} s")
    print(f"Accuracy: {accuracy:.4f}")

# Visualize the results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Training time
sizes = [r['size'] for r in results]
train_times = [r['train_time'] for r in results]
ax1.plot(sizes, train_times, 'o-')
ax1.set_title('Dataset size vs. training time')
ax1.set_xlabel('Dataset size')
ax1.set_ylabel('Training time (s)')
ax1.grid(True)

# Accuracy
accuracies = [r['accuracy'] for r in results]
ax2.plot(sizes, accuracies, 'o-')
ax2.set_title('Dataset size vs. accuracy')
ax2.set_xlabel('Dataset size')
ax2.set_ylabel('Accuracy')
ax2.grid(True)

plt.tight_layout()
plt.show()
```
6.2 Computational Resource Constraints
Random forests are computationally expensive, particularly with many trees or large datasets, so the available compute should be weighed when choosing them.
```python
# Compare memory use and training time across tree counts and depths
# (requires the third-party psutil package)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import time
import psutil
import os

def get_memory_usage():
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / (1024 * 1024)  # MB

# Create data
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

# Configurations to test
n_estimators_list = [10, 50, 100, 200]
max_depth_list = [5, 10, 20, None]
results = []

for n_estimators in n_estimators_list:
    for max_depth in max_depth_list:
        print(f"\nTrees: {n_estimators}, max depth: {max_depth}")

        # Record the initial memory
        initial_memory = get_memory_usage()

        # Train the model and time it
        start_time = time.time()
        rf = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            random_state=42,
            n_jobs=-1  # use all CPU cores
        )
        rf.fit(X, y)
        train_time = time.time() - start_time

        # Record the final memory
        final_memory = get_memory_usage()
        memory_used = final_memory - initial_memory

        # Record the results
        results.append({
            'n_estimators': n_estimators,
            'max_depth': max_depth,
            'train_time': train_time,
            'memory_used': memory_used
        })
        print(f"Training time: {train_time:.4f} s")
        print(f"Memory used: {memory_used:.2f} MB")

# Find the best configurations
best_time = min(results, key=lambda x: x['train_time'])
best_memory = min(results, key=lambda x: x['memory_used'])

print(f"\nFastest configuration: trees={best_time['n_estimators']}, "
      f"max depth={best_time['max_depth']}, time={best_time['train_time']:.4f} s")
print(f"Most memory-efficient configuration: trees={best_memory['n_estimators']}, "
      f"max depth={best_memory['max_depth']}, memory={best_memory['memory_used']:.2f} MB")
```
6.3 Interpretability Requirements
If a project places a premium on interpretability, the trade-off between the random forest's high accuracy and its limited explainability must be weighed.
```python
# Compare the interpretability of a random forest with other models
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create data
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
feature_names = [f'Feature {i}' for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the models
models = {
    'Logistic regression': LogisticRegression(random_state=42),
    'Decision tree': DecisionTreeClassifier(max_depth=3, random_state=42),
    'Random forest': RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{name} accuracy: {accuracy:.4f}")

# Compare interpretability
print("\nInterpretability comparison:")

# Logistic regression coefficients
lr = models['Logistic regression']
print("\nLogistic regression coefficients:")
for i, coef in enumerate(lr.coef_[0]):
    print(f"{feature_names[i]}: {coef:.4f}")

# Decision tree rules
dt = models['Decision tree']
print("\nDecision tree rules:")
tree_rules = export_text(dt, feature_names=feature_names)
print(tree_rules[:500] + "..." if len(tree_rules) > 500 else tree_rules)

# Random forest feature importances
rf = models['Random forest']
print("\nRandom forest feature importances:")
for i, importance in enumerate(rf.feature_importances_):
    print(f"{feature_names[i]}: {importance:.4f}")

# Try to explain a single prediction of the random forest
print("\nRandom forest explanation for a single sample:")
sample_idx = 0
sample = X_test[sample_idx:sample_idx+1]
prediction = rf.predict(sample)[0]
prediction_proba = rf.predict_proba(sample)[0]
print(f"Sample features: {sample[0]}")
print(f"Predicted class: {prediction}")
print(f"Predicted probabilities: {prediction_proba}")

# Trace the decision path through one tree
from sklearn.tree import _tree

def get_decision_path(tree, feature_names, sample):
    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]

    def recurse(node, depth):
        indent = "  " * depth
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            name = feature_name[node]
            threshold = tree_.threshold[node]
            if sample[0][tree_.feature[node]] <= threshold:
                print(f"{indent}{name} <= {threshold:.4f}")
                recurse(tree_.children_left[node], depth + 1)
            else:
                print(f"{indent}{name} > {threshold:.4f}")
                recurse(tree_.children_right[node], depth + 1)
        else:
            value = tree_.value[node]
            print(f"{indent}Class: {np.argmax(value)}, probabilities: {value[0] / value[0].sum()}")

    recurse(0, 0)

# Show the decision path through the first tree of the forest
print("Decision path in tree 0:")
get_decision_path(rf.estimators_[0], feature_names, sample)
```
6.4 Accuracy Requirements
If high predictive accuracy is required, a random forest is usually a good choice, but it should still be compared against other models.
```python
# Compare the accuracy of random forests with other models
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.model_selection import cross_val_score
import time

# Classification comparison
print("Classification model comparison:")
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

classifiers = {
    'Logistic regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(random_state=42),
    'k-NN': KNeighborsClassifier(),
    'Neural network': MLPClassifier(max_iter=1000, random_state=42)
}

clf_results = []
for name, clf in classifiers.items():
    start_time = time.time()
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    elapsed_time = time.time() - start_time
    clf_results.append({
        'model': name,
        'accuracy': scores.mean(),
        'std': scores.std(),
        'time': elapsed_time
    })
    print(f"{name}: accuracy={scores.mean():.4f} (±{scores.std():.4f}), time={elapsed_time:.4f} s")

# Regression comparison
print("\nRegression model comparison:")
X, y = make_regression(n_samples=1000, n_features=20, random_state=42)

regressors = {
    'Linear regression': LinearRegression(),
    'Random forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
    'SVM': SVR(),
    'k-NN': KNeighborsRegressor(),
    'Neural network': MLPRegressor(max_iter=1000, random_state=42)
}

reg_results = []
for name, reg in regressors.items():
    start_time = time.time()
    scores = cross_val_score(reg, X, y, cv=5, scoring='neg_mean_squared_error')
    elapsed_time = time.time() - start_time
    reg_results.append({
        'model': name,
        'mse': -scores.mean(),
        'std': scores.std(),
        'time': elapsed_time
    })
    print(f"{name}: MSE={-scores.mean():.4f} (±{scores.std():.4f}), time={elapsed_time:.4f} s")

# Pick the best models
best_clf = max(clf_results, key=lambda x: x['accuracy'])
best_reg = min(reg_results, key=lambda x: x['mse'])
print(f"\nBest classifier: {best_clf['model']}, accuracy: {best_clf['accuracy']:.4f}")
print(f"Best regressor: {best_reg['model']}, MSE: {best_reg['mse']:.4f}")
```
6.5 Training Time Constraints
Under strict training-time limits, the forest's parameters may need adjusting, for example by using fewer trees or shallower depths.
```python
# Compare training time and performance across parameter configurations
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import time

# Create data
X, y = make_classification(n_samples=10000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Parameter configurations to test
param_configs = [
    {'n_estimators': 10, 'max_depth': 5},
    {'n_estimators': 50, 'max_depth': 5},
    {'n_estimators': 100, 'max_depth': 5},
    {'n_estimators': 10, 'max_depth': 10},
    {'n_estimators': 50, 'max_depth': 10},
    {'n_estimators': 100, 'max_depth': 10},
    {'n_estimators': 10, 'max_depth': None},
    {'n_estimators': 50, 'max_depth': None},
    {'n_estimators': 100, 'max_depth': None}
]

results = []
for params in param_configs:
    print(f"\nConfiguration: {params}")

    # Train the model and time it
    start_time = time.time()
    rf = RandomForestClassifier(random_state=42, n_jobs=-1, **params)
    rf.fit(X_train, y_train)
    train_time = time.time() - start_time

    # Predict and time it
    start_time = time.time()
    y_pred = rf.predict(X_test)
    predict_time = time.time() - start_time

    # Compute the accuracy
    accuracy = accuracy_score(y_test, y_pred)

    # Record the results
    results.append({
        'params': params,
        'train_time': train_time,
        'predict_time': predict_time,
        'accuracy': accuracy
    })
    print(f"Training time: {train_time:.4f} s")
    print(f"Prediction time: {predict_time:.4f} s")
    print(f"Accuracy: {accuracy:.4f}")

# Find the best configuration under different time budgets
time_limits = [1, 5, 10, 30, 60]  # seconds
for time_limit in time_limits:
    # Among configurations within the budget, pick the most accurate
    valid_configs = [r for r in results if r['train_time'] <= time_limit]
    if valid_configs:
        best_config = max(valid_configs, key=lambda x: x['accuracy'])
        print(f"\nBest configuration within {time_limit} s:")
        print(f"Parameters: {best_config['params']}")
        print(f"Training time: {best_config['train_time']:.4f} s")
        print(f"Accuracy: {best_config['accuracy']:.4f}")
    else:
        print(f"\nNo configuration finished training within {time_limit} s")
```
6.6 Deployment Environment Constraints
When considering a random forest, the constraints of the deployment environment, such as memory, CPU, and storage, also matter (a compressed-persistence sketch follows the code below).
```python
# Simulate random forest behavior under different deployment environments
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import pickle
import time

# Create data
X, y = make_classification(n_samples=10000, n_features=20, random_state=42)

# Define deployment environments
environments = {
    'Low-end': {'memory_mb': 512, 'cpu_cores': 1, 'storage_mb': 100},
    'Mid-range': {'memory_mb': 2048, 'cpu_cores': 4, 'storage_mb': 500},
    'High-end': {'memory_mb': 8192, 'cpu_cores': 16, 'storage_mb': 2000}
}

# Random forest configurations to test
param_configs = [
    {'n_estimators': 10, 'max_depth': 5},
    {'n_estimators': 50, 'max_depth': 10},
    {'n_estimators': 100, 'max_depth': 20},
    {'n_estimators': 200, 'max_depth': None}
]

results = []
for env_name, env_specs in environments.items():
    print(f"\nDeployment environment: {env_name}")
    print(f"Memory limit: {env_specs['memory_mb']} MB")
    print(f"CPU cores: {env_specs['cpu_cores']}")
    print(f"Storage limit: {env_specs['storage_mb']} MB")

    for params in param_configs:
        print(f"\nConfiguration: {params}")
        suitable = True

        # Train the model and time it (cap n_jobs at the environment's core count)
        start_time = time.time()
        rf = RandomForestClassifier(random_state=42, n_jobs=env_specs['cpu_cores'], **params)
        rf.fit(X, y)
        train_time = time.time() - start_time

        # Serialize the model and check its size
        model_size = len(pickle.dumps(rf)) / (1024 * 1024)  # MB

        # Check the storage constraint
        if model_size > env_specs['storage_mb']:
            suitable = False
            print(f"Model size ({model_size:.2f} MB) exceeds the storage limit "
                  f"({env_specs['storage_mb']} MB)")

        # Record the results
        results.append({
            'environment': env_name,
            'params': params,
            'train_time': train_time,
            'model_size': model_size,
            'suitable': suitable
        })

        if suitable:
            print(f"Training time: {train_time:.4f} s")
            print(f"Model size: {model_size:.2f} MB")
            print("Fits this environment")
        else:
            print("Does not fit this environment")

# Find the best configuration for each environment
for env_name, env_specs in environments.items():
    env_results = [r for r in results if r['environment'] == env_name and r['suitable']]
    if env_results:
        # Pick the configuration with the shortest training time
        fastest = min(env_results, key=lambda x: x['train_time'])
        print(f"\nFastest configuration for {env_name}:")
        print(f"Parameters: {fastest['params']}")
        print(f"Training time: {fastest['train_time']:.4f} s")
        print(f"Model size: {fastest['model_size']:.2f} MB")
    else:
        print(f"\nNo configuration fits {env_name}")
```
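For the storage-constrained environments above, one practical lever is compressed persistence: a compressed joblib dump usually shrinks a trained forest considerably compared with a plain pickle. A minimal sketch (the file names are illustrative):

```python
# Compressed persistence for storage-constrained deployments
import os
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

joblib.dump(rf, 'rf_plain.joblib')
joblib.dump(rf, 'rf_compressed.joblib', compress=3)  # zlib compression level 3

for path in ['rf_plain.joblib', 'rf_compressed.joblib']:
    print(f"{path}: {os.path.getsize(path) / (1024 * 1024):.2f} MB")
```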
7. Real-World Cases and Code Examples
7.1 A Complete Random Forest Classification Project
Below is a complete random forest classification project, covering data loading, preprocessing, model training, evaluation, and tuning.
```python
# A complete random forest classification project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
import pickle

# 1. Load the data
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

print("Dataset info:")
print(f"Samples: {X.shape[0]}")
print(f"Features: {X.shape[1]}")
print(f"Class distribution: {y.value_counts().to_dict()}")
print("\nFirst 5 rows:")
print(X.head())

# 2. Exploratory analysis
# Descriptive statistics
print("\nDescriptive statistics:")
print(X.describe())

# Feature correlations
plt.figure(figsize=(12, 10))
correlation = X.corr()
sns.heatmap(correlation, cmap='coolwarm', annot=False)
plt.title('Feature correlation heatmap')
plt.tight_layout()
plt.show()

# 3. Preprocessing
# Check for missing values
print(f"\nNumber of missing values: {X.isnull().sum().sum()}")

# Feature selection
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]
print(f"\nTop 10 selected features: {list(selected_features)}")

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.3,
                                                    random_state=42, stratify=y)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Model training
# Baseline model
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_scaled, y_train)

# 5. Model evaluation
# Predict
y_pred = rf.predict(X_test_scaled)
y_pred_proba = rf.predict_proba(X_test_scaled)[:, 1]

# Metrics
print("\nModel evaluation:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"AUC: {roc_auc_score(y_test, y_pred_proba):.4f}")

# Classification report
print("\nClassification report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=data.target_names, yticklabels=data.target_names)
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Confusion matrix')
plt.show()

# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'Random forest (AUC = {roc_auc_score(y_test, y_pred_proba):.4f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend()
plt.show()

# Feature importances
feature_importances = pd.DataFrame({
    'feature': selected_features,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='feature', data=feature_importances)
plt.title('Feature importances')
plt.tight_layout()
plt.show()

# 6. Model tuning
# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Grid search
print("\nStarting grid search...")
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train_scaled, y_train)

# Best parameters
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

# Model with the best parameters
best_rf = grid_search.best_estimator_

# Evaluate the best model
y_pred_best = best_rf.predict(X_test_scaled)
y_pred_proba_best = best_rf.predict_proba(X_test_scaled)[:, 1]

print("\nBest model evaluation:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_best):.4f}")
print(f"AUC: {roc_auc_score(y_test, y_pred_proba_best):.4f}")

# Cross-validation
cv_scores = cross_val_score(best_rf, X_train_scaled, y_train, cv=5, scoring='accuracy')
print(f"\nCross-validated accuracy: {cv_scores.mean():.4f} (±{cv_scores.std():.4f})")

# 7. Save the model
model_data = {
    'model': best_rf,
    'scaler': scaler,
    'selector': selector,
    'selected_features': selected_features
}
with open('breast_cancer_rf_model.pkl', 'wb') as f:
    pickle.dump(model_data, f)
print("\nModel saved as 'breast_cancer_rf_model.pkl'")

# 8. Load the model and make predictions
def predict_cancer(features):
    """
    Predict breast cancer with the saved model.

    Arguments:
    features -- dict or DataFrame containing all 30 features

    Returns:
    the predicted class and probabilities
    """
    # Load the model
    with open('breast_cancer_rf_model.pkl', 'rb') as f:
        model_data = pickle.load(f)

    model = model_data['model']
    scaler = model_data['scaler']
    selector = model_data['selector']

    # Convert to a DataFrame
    if isinstance(features, dict):
        features = pd.DataFrame([features])

    # Feature selection
    features_selected = selector.transform(features)

    # Feature scaling
    features_scaled = scaler.transform(features_selected)

    # Predict (in this dataset, 0 = malignant, 1 = benign)
    prediction = model.predict(features_scaled)[0]
    probability = model.predict_proba(features_scaled)[0]

    return {
        'prediction': 'malignant' if prediction == 0 else 'benign',
        'malignant_prob': probability[0],
        'benign_prob': probability[1]
    }

# Example prediction
sample = X.iloc[0].to_dict()
result = predict_cancer(sample)
print(f"\nExample prediction:")
print(f"True label: {'malignant' if y.iloc[0] == 0 else 'benign'}")
print(f"Prediction: {result['prediction']}")
print(f"Malignant probability: {result['malignant_prob']:.4f}")
print(f"Benign probability: {result['benign_prob']:.4f}")
```
7.2 A Complete Random Forest Regression Project
Below is a complete random forest regression project using the California housing dataset (the code fetches California housing, not the long-deprecated Boston dataset).
```python
# A complete random forest regression project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
import pickle

# 1. Load the data
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

print("Dataset info:")
print(f"Samples: {X.shape[0]}")
print(f"Features: {X.shape[1]}")
print(f"Target statistics: {y.describe().to_dict()}")
print("\nFirst 5 rows:")
print(X.head())

# 2. Exploratory analysis
# Descriptive statistics
print("\nDescriptive statistics:")
print(X.describe())

# Target distribution
plt.figure(figsize=(10, 6))
sns.histplot(y, kde=True)
plt.title('House price distribution')
plt.xlabel('House price (in $100,000)')
plt.ylabel('Count')
plt.show()

# Relationship between each feature and the target
plt.figure(figsize=(15, 10))
for i, feature in enumerate(X.columns):
    plt.subplot(3, 3, i+1)
    plt.scatter(X[feature], y, alpha=0.5)
    plt.title(f'{feature} vs. price')
    plt.xlabel(feature)
    plt.ylabel('Price')
plt.tight_layout()
plt.show()

# Feature correlations
plt.figure(figsize=(12, 10))
correlation = X.corr()
sns.heatmap(correlation, cmap='coolwarm', annot=True, fmt='.2f')
plt.title('Feature correlation heatmap')
plt.tight_layout()
plt.show()

# 3. Preprocessing
# Check for missing values
print(f"\nNumber of missing values: {X.isnull().sum().sum()}")

# Feature selection
selector = SelectKBest(f_regression, k=6)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]
print(f"\nTop 6 selected features: {list(selected_features)}")

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.3, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Model training
# Baseline model
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train_scaled, y_train)

# 5. Model evaluation
# Predict
y_pred = rf.predict(X_test_scaled)

# Metrics
print("\nModel evaluation:")
print(f"MSE: {mean_squared_error(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.4f}")
print(f"R²: {r2_score(y_test, y_pred):.4f}")

# Predicted vs. actual values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Predicted vs. actual values')
plt.show()

# Residual plot
residuals = y_test - y_pred
plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='-')
plt.xlabel('Predicted')
plt.ylabel('Residual')
plt.title('Residual plot')
plt.show()

# Feature importances
feature_importances = pd.DataFrame({
    'feature': selected_features,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='feature', data=feature_importances)
plt.title('Feature importances')
plt.tight_layout()
plt.show()

# 6. Model tuning
# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Grid search
print("\nStarting grid search...")
grid_search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train_scaled, y_train)

# Best parameters
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best cross-validation score (MSE): {-grid_search.best_score_:.4f}")

# Model with the best parameters
best_rf = grid_search.best_estimator_

# Evaluate the best model
y_pred_best = best_rf.predict(X_test_scaled)
print("\nBest model evaluation:")
print(f"MSE: {mean_squared_error(y_test, y_pred_best):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_best)):.4f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred_best):.4f}")
print(f"R²: {r2_score(y_test, y_pred_best):.4f}")

# Cross-validation
cv_scores = cross_val_score(best_rf, X_train_scaled, y_train, cv=5, scoring='neg_mean_squared_error')
print(f"\nCross-validated MSE: {-cv_scores.mean():.4f} (±{cv_scores.std():.4f})")

# 7. Save the model
model_data = {
    'model': best_rf,
    'scaler': scaler,
    'selector': selector,
    'selected_features': selected_features
}
with open('california_housing_rf_model.pkl', 'wb') as f:
    pickle.dump(model_data, f)
print("\nModel saved as 'california_housing_rf_model.pkl'")

# 8. Load the model and make predictions
def predict_house_price(features):
    """
    Predict a house price with the saved model.

    Arguments:
    features -- dict or DataFrame containing all 8 features

    Returns:
    the predicted price
    """
    # Load the model
    with open('california_housing_rf_model.pkl', 'rb') as f:
        model_data = pickle.load(f)

    model = model_data['model']
    scaler = model_data['scaler']
    selector = model_data['selector']

    # Convert to a DataFrame
    if isinstance(features, dict):
        features = pd.DataFrame([features])

    # Feature selection
    features_selected = selector.transform(features)

    # Feature scaling
    features_scaled = scaler.transform(features_selected)

    # Predict
    return model.predict(features_scaled)[0]

# Example prediction
sample = X.iloc[0].to_dict()
predicted_price = predict_house_price(sample)
actual_price = y.iloc[0]
print(f"\nExample prediction:")
print(f"Actual price: ${actual_price * 100000:.2f}")
print(f"Predicted price: ${predicted_price * 100000:.2f}")
print(f"Difference: ${abs(predicted_price - actual_price) * 100000:.2f}")
```
8. Conclusion
Random forests are a powerful and flexible machine learning algorithm that performs well in many applications. From the discussion above we can draw the following conclusions:
8.1 Summary of Strengths
- High accuracy: by ensembling many decision trees, random forests usually deliver high predictive accuracy, especially on complex non-linear relationships.
- Robustness: random forests are fairly robust to outliers and noisy data, and do not overfit easily.
- Feature importance assessment: random forests provide feature importance estimates, which help with understanding the data and with feature engineering.
- High-dimensional data: random forests can handle a large number of input variables and do not require feature scaling.
- Parallelizability: forest construction can be parallelized, making full use of multi-core processors.
8.2 Limitations
- High computational cost: compared with a single decision tree, a random forest needs far more computation, especially with many trees or large datasets.
- Poor interpretability: the black-box nature of a random forest makes it harder to interpret than a single tree; it cannot provide a clear decision path.
- Large memory footprint: a random forest must store many decision trees, which takes considerable memory, especially with many or very deep trees.
- Sensitivity to imbalanced data: with imbalanced classes, random forests tend to favor the majority class; class weights can mitigate this.
- Complex hyperparameter tuning: random forests have several hyperparameters, and finding a good combination can require substantial experimentation and compute.
8.3 Practical Recommendations
- Data preprocessing: although random forests place light demands on preprocessing, sensible data cleaning and feature engineering can still improve performance.
- Parameter tuning: tuning via grid search or randomized search can improve random forest performance markedly.
- Feature selection: with very many features, consider feature selection to reduce their number and improve efficiency and interpretability.
- Imbalanced data: when classes are imbalanced, consider class weights or over-/under-sampling techniques.
- Model explanation: where explanations are required, combine feature importances and partial dependence plots, or consider explanation methods such as SHAP or LIME.
- Deployment: when deploying a random forest, weigh its memory footprint and prediction latency; a trade-off between accuracy and efficiency may be necessary.
8.4 Future Directions
As a classic machine learning algorithm, the random forest still has room for improvement in several directions:
- Hybrids with other algorithms: combining random forests with deep learning, reinforcement learning, and other approaches may yield more powerful models.
- Better interpretability: new methods for explaining random forests would broaden their use in high-stakes decision-making.
- Efficient implementations: more efficient implementations, particularly for very large datasets.
- Adaptive parameter tuning: adaptive tuning methods that reduce manual hyperparameter work.
- Incremental learning: random forest variants that support incremental learning in streaming environments (scikit-learn's warm_start, sketched below, is a partial step in this direction).
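While scikit-learn does not offer true streaming random forests, its existing warm_start parameter already lets a fitted forest grow additional trees without retraining the ones it has. A minimal sketch:

```python
# warm_start is not true incremental learning, but it grows an existing
# forest with extra trees while keeping the already-trained ones.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

rf = RandomForestClassifier(n_estimators=50, warm_start=True, random_state=42)
rf.fit(X, y)
print(f"Trees after first fit: {len(rf.estimators_)}")

rf.n_estimators += 50   # request 50 more trees
rf.fit(X, y)            # only the new trees are trained
print(f"Trees after second fit: {len(rf.estimators_)}")
```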
In summary, the random forest is a highly practical machine learning algorithm with applications across many domains. By understanding its strengths, weaknesses, and suitable use cases, we can apply it more effectively to real problems. As machine learning continues to evolve, random forests and their variants will keep playing an important role in data science and artificial intelligence.