Exploring and Visualizing Multidimensional Distributions in the Iris Dataset | AI Coding Club Knowledge Base Selection


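This article is a selection from the AI Coding Club knowledge base. The author is Zhao Zhiqian (赵芷谦), a student at Hunan University.

AI Coding Club (ACC) is a co-creation and co-learning community for developers of the new era. It shares the latest news, applications, and practice in AI programming, and through exchange, learning, and co-creation aims to help more people become 10x developers.

🔗 Knowledge base: https://sourl.cn/kkaBuR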


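Practice Materials ========

To help programming beginners follow this article and get started with the project more easily, we recommend coding with Trae, which provides technical support and answers questions along the way.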



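Syntax Basics Required for This Practice =============

2.1 Python Basics and Environment Setup

Module imports (import):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

Package manager: install pandas, seaborn, matplotlib, scikit-learn, and any other required libraries with pip.
Development environment: the code is suited to Jupyter Notebook or an IDE such as PyCharm or VS Code.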
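
Before running anything, it can help to confirm that the libraries are actually installed. The following is a minimal sketch that uses only the standard library; the helper name installed_versions and the REQUIRED list are illustrative and not part of the original article.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as published on PyPI (note: scikit-learn, not sklearn).
REQUIRED = ["pandas", "seaborn", "matplotlib", "scikit-learn"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'not installed'}")
```

Any package reported as not installed can then be added with pip install followed by its name.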

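2.2 Data Processing Basics (Pandas)

More ways to create a DataFrame

From a dictionary

data = {'sepal_length': [5.1, 4.9], 'sepal_width': [3.5, 3.0]}
df = pd.DataFrame(data)

Dictionary keys become the column names, and the values become the column data.

From a file

df = pd.read_csv('iris.csv')  # read_csv handles CSV; use pd.read_excel / pd.read_json for Excel / JSON

Suited to importing external data.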

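Column operations

Delete a column

df.drop('target', axis=1, inplace=True)  # modifies df in place; the column is removed permanently

Rename a column

df.rename(columns={'old_name': 'new_name'}, inplace=True)

Conditional filtering

setosa_df = df[df['target'] == 'setosa']  # keep only the rows of one class

Type conversion

df['sepal_length'] = df['sepal_length'].astype(float)  # force the column to float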
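
The column operations above can be chained on a small table to see their combined effect. This is a sketch on a hypothetical four-row DataFrame; the column names sepal_len and note are made up for illustration, and the chained (non-inplace) style is one possible choice rather than the article's own.

```python
import pandas as pd

# Hypothetical four-row table; the column names are illustrative only.
df = pd.DataFrame({
    "sepal_len": ["5.1", "4.9", "7.0", "6.3"],  # numbers stored as text
    "target": ["setosa", "setosa", "versicolor", "virginica"],
    "note": ["a", "b", "c", "d"],
})

clean = (
    df.drop(columns="note")                           # delete a column
      .rename(columns={"sepal_len": "sepal_length"})  # rename a column
      .astype({"sepal_length": float})                # text -> float
)
setosa_df = clean[clean["target"] == "setosa"]        # keep one class

print(clean.dtypes)
print(setosa_df)
```

Because none of the calls use inplace=True, the original df is left untouched and each step returns a new DataFrame.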

2.3 Data Visualization

Advanced Seaborn configuration

Palette setup

sns.set_palette("husl")  # use a predefined color palette

Faceted plots

g = sns.FacetGrid(df, col="target")  # one facet per class
g.map(sns.histplot, "sepal_length")

Combined charts

sns.violinplot(x='target', y='sepal_length', data=df)
sns.swarmplot(x='target', y='sepal_length', data=df, color='black')  # overlay the individual points

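Matplotlib chart tuning

Figure size

plt.figure(figsize=(12, 6), dpi=150)  # size in inches and resolution in dots per inch

Multiple subplots

fig, axes = plt.subplots(2, 2)  # create a 2x2 grid of subplots
axes[0, 0].plot(x, y)  # x and y stand for your own data

Saving a chart

plt.savefig('plot.png', bbox_inches='tight')  # export the figure with tight margins

Grid lines

plt.grid(True, linestyle='--', alpha=0.6)  # dashed, semi-transparent grid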
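
The four Matplotlib settings above can be combined into one runnable script. The sketch below makes up its own data (sine and cosine curves), because the x and y in the snippets are not defined in the text, and it saves the figure to plot.png instead of opening a window.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data; not part of the original article.
x = np.linspace(0, 2 * np.pi, 100)
curves = {
    "sin(x)": np.sin(x),
    "cos(x)": np.cos(x),
    "sin(2x)": np.sin(2 * x),
    "cos(2x)": np.cos(2 * x),
}

# Figure size, resolution and a 2x2 grid of subplots in one call.
fig, axes = plt.subplots(2, 2, figsize=(12, 6), dpi=150)

# axes is a 2x2 NumPy array; .flat walks through it row by row.
for ax, (label, y) in zip(axes.flat, curves.items()):
    ax.plot(x, y)
    ax.set_title(label)
    ax.grid(True, linestyle="--", alpha=0.6)  # dashed, semi-transparent grid

fig.tight_layout()
fig.savefig("plot.png", bbox_inches="tight")
plt.close(fig)
```

Working with the fig and ax objects directly, rather than the plt.* state machine, makes it explicit which subplot each setting applies to.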

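Getting Started =======

3.1 Purpose of the Code

Load and preprocess the iris dataset in code, then use violin plots to show how each feature is distributed across the iris classes, so that the differences between classes can be observed and compared directly.

3.2 Analyzing the Dataset

The iris dataset is a classic multiclass dataset widely used for teaching and practice in machine learning and data analysis. The code loads it with the load_iris function from the sklearn.datasets module and runs a basic exploratory data analysis (EDA): the shape of the dataset, the feature names, the class labels, and summary statistics of the data.

Trae prompt:

Analyze the iris dataset https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#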
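
The EDA steps listed in this section (shape, feature names, class labels, summary statistics) are described but not shown. A minimal sketch of what they might look like follows; the variable names features and class_counts are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Shape, feature names and class labels.
print(iris.data.shape)
print(iris.feature_names)
print(list(iris.target_names))

# Summary statistics (count, mean, std, min, quartiles, max) per feature.
features = pd.DataFrame(iris.data, columns=iris.feature_names)
print(features.describe())

# Number of samples in each class, indexed by the numeric label.
class_counts = pd.Series(iris.target).value_counts().sort_index()
print(class_counts)
```

The class counts show that the dataset is balanced, so no class dominates the plots that follow.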


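3.3 Implementation Steps (with Trae)

3.3.1 Environment Setup and Data Loading

Import the dependencies:

pandas: data processing
seaborn and matplotlib: data visualization
scikit-learn: loading the built-in dataset

Trae prompt: Analyze the iris dataset and explain how to import the necessary libraries

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris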


Load the dataset:

iris.data: feature data (4 columns, 150 samples)
iris.target: class labels (0, 1, 2)

iris = load_iris()

3.3.2 Data Preprocessing

Build the DataFrame:

Convert the NumPy array to a Pandas DataFrame and attach the feature names
Add a target column to hold the class labels

Trae prompt: How to convert the dataset to a DataFrame

df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

Label mapping:

Convert the numeric labels (0/1/2) to readable class names

df['target'] = df['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
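
The mapping can also be derived from the dataset itself instead of being typed by hand. The sketch below is a variant of the step above, not the article's own code; it builds the dictionary from iris.target_names and checks that no label was left unmapped.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Build the code -> name mapping from the dataset's own class names.
label_map = {code: str(name) for code, name in enumerate(iris.target_names)}
df["target"] = pd.Series(iris.target).map(label_map)

# map() yields NaN for any code missing from the dictionary, so check for gaps.
unmapped = int(df["target"].isna().sum())
print(label_map)
print("unmapped rows:", unmapped)
print(df["target"].value_counts())
```

Checking for NaN matters because Series.map does not raise an error when a key is missing from the dictionary.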

3.3.3 Data Visualization

Set the plotting style:

whitegrid: a white background with grid lines, which makes the chart easier to read.

sns.set(style="whitegrid")

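Draw the violin plots:

Parameters:
x='target': the horizontal axis is the iris class.
y: the vertical axis is the value of one feature.
data=df: the data source.

Reading a violin plot:
The width shows the density of the data at that value; the white dot in the middle is the median, and the thick black bar is the interquartile range (IQR).

plt.figure(figsize=(10, 6))
# All four calls draw on the same axes, so the four sets of violins are overlaid
sns.violinplot(x='target', y='sepal length (cm)', data=df)
sns.violinplot(x='target', y='sepal width (cm)', data=df)
sns.violinplot(x='target', y='petal length (cm)', data=df)
sns.violinplot(x='target', y='petal width (cm)', data=df)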

Chart polish:

Add a title and axis labels, then display the chart.

plt.title('Violin Plot of Iris Dataset Features')
plt.xlabel('Species')
plt.ylabel('Feature Values')
plt.show()
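
The median and interquartile range that a violin plot marks can also be computed directly, which helps when reading the chart. The sketch below does this for petal length only; the choice of feature and the column names q1, median, q3 and iqr are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = pd.Series(iris.target).map(
    {code: str(name) for code, name in enumerate(iris.target_names)}
)

# First quartile, median and third quartile of one feature per species.
quartiles = (
    df.groupby("target")["petal length (cm)"]
      .quantile([0.25, 0.5, 0.75])
      .unstack()
)
quartiles.columns = ["q1", "median", "q3"]

# The IQR is the distance between the third and first quartiles.
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]
print(quartiles)
```

Each row of the result corresponds to one violin: the median column is the central marker and the q1 to q3 span is the thick bar.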

3.3.4 Results and Analysis

Complete code

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()

# Convert the dataset to a DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Map the numeric labels to class names
df['target'] = df['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Set the Seaborn plotting style
sns.set(style="whitegrid")

# Draw the violin plots of the feature distributions (all on the same axes, so they overlap)
plt.figure(figsize=(10, 6))
sns.violinplot(x='target', y='sepal length (cm)', data=df)
sns.violinplot(x='target', y='sepal width (cm)', data=df)
sns.violinplot(x='target', y='petal length (cm)', data=df)
sns.violinplot(x='target', y='petal width (cm)', data=df)

# Set the chart title and axis labels
plt.title('Violin Plot of Iris Dataset Features')
plt.xlabel('Species')
plt.ylabel('Feature Values')

# Display the chart
plt.show()

Output chart

[Figure: output of the complete code above]

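Key observations:

Sepal width: setosa has the highest median; its spread is comparable to that of the other two classes rather than narrower
Petal length and width: setosa is clearly smaller than the other two classes, while versicolor and virginica partly overlap
Class separability: the petal features (length, width) separate the classes better than the sepal features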
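
These observations can be cross-checked numerically rather than by eye. The sketch below summarizes every feature per class and then tests whether the petal-length ranges of the classes overlap; using the min/max range as the overlap criterion is a simplifying assumption made here, not the article's method.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = pd.Series(iris.target).map(
    {code: str(name) for code, name in enumerate(iris.target_names)}
)

# Per-class median, spread and range for every feature.
summary = df.groupby("target").agg(["median", "std", "min", "max"])
print(summary.round(2))

petal = summary["petal length (cm)"]
# Two classes are cleanly separated on a feature when their ranges do not overlap.
setosa_separate = bool(petal.loc["setosa", "max"] < petal.loc["versicolor", "min"])
upper_overlap = bool(petal.loc["versicolor", "max"] >= petal.loc["virginica", "min"])
print("setosa separated from versicolor:", setosa_separate)
print("versicolor overlaps virginica:", upper_overlap)
```

The std column of the summary is a quick way to compare how concentrated each class is on a given feature.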

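3.3.5 Technical Details and Possible Improvements

Technical highlights:

seaborn.violinplot draws distribution plots for several features with little code
The map method gives the labels readable names, which improves interpretability

Potential improvements:

Chart tuning: use facets (FacetGrid) or side-by-side subplots to avoid overlap
Interactive visualization: use the plotly library to generate interactive charts for dynamic exploration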

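Improved version. Trae prompt:

- Chart tuning: use facets (FacetGrid) or side-by-side subplots to avoid overlap.
- Interactive visualization: use the plotly library to generate interactive charts for dynamic exploration.
Improve the code in the two respects above.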

  
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
import plotly.express as px

# Load the iris dataset
iris = load_iris()

# Convert the dataset to a DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Map the numeric labels to class names
df['target'] = df['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Reshape to long format (one row per sample and feature) so each feature gets its own facet
long_df = df.melt(id_vars='target', var_name='feature', value_name='value')

# Set the Seaborn plotting style
sns.set(style="whitegrid")

# Draw the violin plots with FacetGrid, one facet per feature
# (mapping all four features onto facets split by class would still overlay them)
g = sns.FacetGrid(long_df, col='feature', col_wrap=2, sharey=False, height=4, aspect=1)
g.map_dataframe(sns.violinplot, x='target', y='value',
                order=['setosa', 'versicolor', 'virginica'])

# Set the chart title and axis labels
g.fig.suptitle('Violin Plot of Iris Dataset Features', y=1.02)
g.set_axis_labels('Species', 'Feature Values')

# Display the chart
plt.show()

# Generate an interactive chart with Plotly, again one facet per feature
fig = px.violin(long_df, x='target', y='value', color='target',
                facet_col='feature', facet_col_wrap=2, box=True, points='all',
                labels={'target': 'Species', 'value': 'Feature Values'})
fig.update_layout(title='Interactive Violin Plot of Iris Dataset Features')
fig.show()
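
Giving each feature its own panel, so that the violins no longer overlap, relies on the data being in long format: one row per sample and feature instead of one column per feature. The sketch below shows only that reshaping step with DataFrame.melt; the column names feature and value are illustrative choices.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = pd.Series(iris.target).map(
    {code: str(name) for code, name in enumerate(iris.target_names)}
)

# Wide: one column per feature. Long: one row per (sample, feature) pair.
long_df = df.melt(id_vars="target", var_name="feature", value_name="value")

print(df.shape, "->", long_df.shape)
print(long_df.head())
```

With the feature name stored in a column, plotting libraries can split the chart by that column, for example through the col argument of seaborn's FacetGrid or the facet_col argument of plotly express.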


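3.3.6 Summary

Through data preprocessing and visualization, this practice shows how the feature distributions differ between the classes of the iris dataset. A violin plot conveys both the density of the data and its summary statistics, which makes it well suited to comparing multidimensional data. The same approach extends to feature exploration in other classification tasks.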

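Exploration =====

Trae prompt

Quick exploration: use seaborn's pairplot or plotly interactive charts
Model explanation: prefer SHAP values or feature importance
High-dimensional data: try nonlinear dimensionality reduction with UMAP/t-SNE
Reporting: combine box plots with violin plots
Explore the iris dataset further according to the requirements above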

  
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
import plotly.express as px
from sklearn.ensemble import RandomForestClassifier
import shap
import umap  # provided by the umap-learn package
from sklearn.manifold import TSNE

# Load the iris dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target
df['target'] = df['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Quick exploration: seaborn pairplot
sns.pairplot(df, hue='target')
plt.show()

# Quick exploration: plotly interactive chart
fig = px.scatter_matrix(df, dimensions=iris.feature_names, color='target')
fig.show()

# Model explanation: SHAP values
X = df.drop('target', axis=1)
y = df['target']
model = RandomForestClassifier()
model.fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
# For a multiclass model the layout of shap_values may differ between shap versions
shap.summary_plot(shap_values, X)

# High-dimensional data: nonlinear dimensionality reduction with UMAP
reducer = umap.UMAP()
embedding = reducer.fit_transform(X)
plt.figure()
plt.scatter(embedding[:, 0], embedding[:, 1], c=iris.target, cmap='viridis')
plt.colorbar()
plt.title('UMAP projection of the Iris dataset')
plt.show()

# High-dimensional data: nonlinear dimensionality reduction with t-SNE
tsne = TSNE(n_components=2, random_state=42)
tsne_embedding = tsne.fit_transform(X)
plt.figure()
plt.scatter(tsne_embedding[:, 0], tsne_embedding[:, 1], c=iris.target, cmap='viridis')
plt.colorbar()
plt.title('t-SNE projection of the Iris dataset')
plt.show()

# Reporting: box plot + violin plot combination
plt.figure(figsize=(12, 8))
for i, feature in enumerate(iris.feature_names):
    plt.subplot(2, 2, i + 1)
    # Draw the violin first, then a narrow box plot on top, so both stay visible
    sns.violinplot(x='target', y=feature, data=df, inner=None, color='0.8')
    sns.boxplot(x='target', y=feature, data=df, width=0.2)
plt.tight_layout()
plt.show()
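
The prompt also mentions feature importance as an alternative to SHAP values, which the generated code does not show. A minimal sketch using the impurity-based importances of a random forest follows; the number of trees and the fixed random_state are arbitrary choices made here for repeatability.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)

# 200 trees and random_state=0 are arbitrary; they only make the run repeatable.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, iris.target)

# Impurity-based importances: one non-negative weight per feature, summing to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances.round(3))
```

The ranking gives a quick second opinion on the violin plots: features whose distributions differ most between classes tend to receive the largest weights.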