Drawing Volcano Plots in Python: A Visualization Tutorial from Scratch

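Background

A volcano plot is a key tool in gene expression analysis. The X axis shows each gene's log2 fold change (the difference in expression between two groups of samples), and the Y axis shows statistical significance (usually the negative log10 of the p-value). Together they let you see at a glance which genes are significantly up-regulated or down-regulated in the experimental group relative to the control group.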

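Reading a volcano plot

[Figure: simulated volcano plot divided into regions A to F]

A simple simulated volcano plot helps show what the chart actually conveys.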

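X axis: Log2 Fold Change

The X axis shows the log2 fold change, which measures how much a gene's expression differs between the experimental group and the control group. Points on the right (log2 fold change > 0) are genes up-regulated in the experimental group; points on the left (log2 fold change < 0) are genes down-regulated in the experimental group.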

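Y axis: -Log10(P-value)

The Y axis shows -log10(p-value), which measures the statistical significance of the change. The higher a point sits, the lower its p-value and the more significant the change in expression. Points near the bottom have high p-values, meaning the change is not significant.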
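As a small worked example of the two axes, the sketch below converts one gene's mean expression values and p-value into volcano-plot coordinates. The function name `volcano_coordinates` and the input numbers are hypothetical, chosen only for illustration.

```python
import math

def volcano_coordinates(mean_experimental, mean_control, p_value):
    """Return (log2 fold change, -log10 p-value) for a single gene."""
    log2_fc = math.log2(mean_experimental / mean_control)
    neg_log10_p = -math.log10(p_value)
    return log2_fc, neg_log10_p

# Hypothetical gene: expression rises from 20 to 80 (a 4-fold increase), p = 0.001
x, y = volcano_coordinates(mean_experimental=80.0, mean_control=20.0, p_value=0.001)
print(f"log2FC = {x:.2f}, -log10(p) = {y:.2f}")
```

A 4-fold increase lands at about 2 on the X axis, and a p-value of 0.001 lands at about 3 on the Y axis, so this gene would sit in the upper right of the plot.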

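Regions

Region A (top left): genes that are down-regulated with a small p-value, that is, significantly down-regulated. Their expression is lower in the experimental group than in the control group, and the difference is statistically significant.

Region B (top center): significant but small differences. These genes have very low p-values, so the statistical evidence is strong, but their fold change is close to 0, so expression differs little between the two groups.

Region C (top right): genes that are up-regulated with a small p-value, that is, significantly up-regulated. Their expression is higher in the experimental group than in the control group, and the difference is statistically significant.

Region D (bottom left): genes with high p-values that are down-regulated but not significantly. Their expression is lower in the experimental group, but the difference is not statistically significant.

Region E (bottom center): changes that are neither significant nor large. Expression barely differs between the two groups, and statistical significance is low.

Region F (bottom right): genes with high p-values that are up-regulated but not significantly. Their expression is higher in the experimental group, but the difference is not statistically significant.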
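The six regions can be expressed as a small lookup, sketched below. The boundaries are an assumption: the figure does not state where its regions begin and end, so the sketch borrows the cutoffs used in the plotting code of this tutorial (|log2FC| > 1 and p < 0.05). The function name `classify_region` is hypothetical.

```python
def classify_region(log2_fc, p_value, fc_threshold=1.0, p_threshold=0.05):
    """Map one gene to region A-F using assumed cutoffs (not taken from the figure)."""
    row = "top" if p_value < p_threshold else "bottom"
    if log2_fc < -fc_threshold:
        column = "left"
    elif log2_fc > fc_threshold:
        column = "right"
    else:
        column = "center"
    regions = {
        ("top", "left"): "A",      # significantly down-regulated
        ("top", "center"): "B",    # significant, but small change
        ("top", "right"): "C",     # significantly up-regulated
        ("bottom", "left"): "D",   # down-regulated, not significant
        ("bottom", "center"): "E", # small change, not significant
        ("bottom", "right"): "F",  # up-regulated, not significant
    }
    return regions[(row, column)]

print(classify_region(log2_fc=-2.0, p_value=0.001))
print(classify_region(log2_fc=0.2, p_value=0.001))
```

With these cutoffs, only regions A and C count as differentially expressed in the code that follows; region B genes are significant by p-value alone and are treated as not significant.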

Implementation

Loading the data


          
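import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# Use Times New Roman and render minus signs correctly on the axes
plt.rcParams['font.family'] = 'Times New Roman'
plt.rcParams['axes.unicode_minus'] = False

# Load the gene expression table (reading .xlsx files requires the openpyxl package)
df = pd.read_excel('基因数据.xlsx')
df.head()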

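[Figure: output of df.head()]

The data contains each gene's log2 fold change (log2FC) and p-value (p-value), which describe how the gene's expression differs between the experimental and control groups; the code below also uses a GeneNames column to label genes. A positive log2FC means the gene is up-regulated and a negative log2FC means it is down-regulated. The p-value gives the statistical significance of the difference: the smaller the p-value, the more significant the result.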
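If you do not have the Excel file, a synthetic table with the same three column names lets you follow along. This is only a sketch: the values are random, the gene names are made up, and the way p-values are tied to fold change is an arbitrary choice that merely produces a volcano-like shape.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n_genes = 500

# Random log2 fold changes centered on zero
log2fc = rng.normal(loc=0.0, scale=1.5, size=n_genes)

# Arbitrary rule: larger |log2FC| pushes the p-value toward zero.
# The lower bound 1e-6 keeps p strictly positive so -log10(p) stays finite.
p_values = rng.uniform(1e-6, 1.0, size=n_genes) ** (1.0 + np.abs(log2fc))

df_demo = pd.DataFrame({
    'GeneNames': [f'gene_{i:04d}' for i in range(n_genes)],  # hypothetical names
    'log2FC': log2fc,
    'p-value': p_values,
})
print(df_demo.head())
```

Assigning `df = df_demo` would let the plotting code in the following sections run unchanged, although the gene names in the labeling steps would need to be replaced with names from this table.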

Basic volcano plot


          
# Thresholds for calling a gene differentially expressed
threshold_log2FC = 1  # example threshold for log2 fold change
threshold_pvalue = 0.05

# Conditions for 'Significant Up', 'Significant Down', and 'Not Significant'
conditions = [
    (df['log2FC'] > threshold_log2FC) & (df['p-value'] < threshold_pvalue),
    (df['log2FC'] < -threshold_log2FC) & (df['p-value'] < threshold_pvalue),
    (df['p-value'] >= threshold_pvalue)
]
choices = ['Significant Up', 'Significant Down', 'Not Significant']

# Create the Significance column; genes with p < 0.05 but |log2FC| <= 1
# match no condition and fall through to the default
df['Significance'] = np.select(conditions, choices, default='Not Significant')

# Set up the figure
plt.figure(figsize=(10, 6), dpi=1200)

# Scatter plot colored by significance class
sns.scatterplot(x='log2FC', y=-np.log10(df['p-value']), hue='Significance', data=df,
                palette={'Significant Up': 'red', 'Significant Down': 'blue', 'Not Significant': 'gray'})

# Horizontal line at the p-value threshold
plt.axhline(y=-np.log10(threshold_pvalue), color='black', linestyle='--', label='p=0.05')

# Axis labels and title
plt.xlabel('Log2 Fold Change')
plt.ylabel('-Log10(P-value)')
plt.title('Volcano Plot with Significance')

# Display the legend, then save and show the figure
plt.legend(title='Significance')
plt.savefig("1.pdf", format='pdf', bbox_inches='tight')
plt.show()

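[Figure: basic volcano plot with significance classes]

This code draws a volcano plot of differential gene expression. The X axis is the log2 fold change (up or down) and the Y axis is statistical significance (-log10(p-value)). Genes are split into three classes: significantly up-regulated (red), significantly down-regulated (blue), and not significant (gray). A dashed line at the p-value threshold (p = 0.05) marks the significance cutoff. Note that genes with p < 0.05 but |log2FC| <= 1 are also assigned to the not-significant class.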
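Before plotting, it can help to check how many genes end up in each class. The sketch below applies the same thresholds to a tiny hand-made table (the five genes and their values are hypothetical) and counts the classes with `value_counts`.

```python
import numpy as np
import pandas as pd

# Hypothetical five-gene table with the same column names as the tutorial data
df_demo = pd.DataFrame({
    'GeneNames': ['g1', 'g2', 'g3', 'g4', 'g5'],
    'log2FC': [2.3, -1.8, 0.4, 1.5, -0.2],
    'p-value': [0.001, 0.01, 0.002, 0.20, 0.70],
})

threshold_log2FC = 1
threshold_pvalue = 0.05

is_significant = df_demo['p-value'] < threshold_pvalue
conditions = [
    is_significant & (df_demo['log2FC'] > threshold_log2FC),
    is_significant & (df_demo['log2FC'] < -threshold_log2FC),
]
choices = ['Significant Up', 'Significant Down']

# Everything that matches neither condition becomes 'Not Significant'
df_demo['Significance'] = np.select(conditions, choices, default='Not Significant')

counts = df_demo['Significance'].value_counts()
print(counts)
```

Gene g3 shows the fall-through case: its p-value is small, but its fold change is below the cutoff, so it is counted as not significant.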

Adding gene labels


          
# Thresholds for calling a gene differentially expressed
threshold_log2FC = 1  # example threshold for log2 fold change
threshold_pvalue = 0.05

# Conditions for 'Significant Up', 'Significant Down', and 'Not Significant'
conditions = [
    (df['log2FC'] > threshold_log2FC) & (df['p-value'] < threshold_pvalue),
    (df['log2FC'] < -threshold_log2FC) & (df['p-value'] < threshold_pvalue),
    (df['p-value'] >= threshold_pvalue)
]
choices = ['Significant Up', 'Significant Down', 'Not Significant']

# Create the Significance column based on the conditions
df['Significance'] = np.select(conditions, choices, default='Not Significant')

# List of genes to label
genes_to_label = ['LOC_Os01g50030.1', 'LOC_Os06g40940.3', 'LOC_Os03g03720.1']  # Replace these with your genes of interest

# Set up the figure
plt.figure(figsize=(10, 6), dpi=1200)

# Scatter plot colored by significance class
sns.scatterplot(x='log2FC', y=-np.log10(df['p-value']), hue='Significance', data=df,
                palette={'Significant Up': 'red', 'Significant Down': 'blue', 'Not Significant': 'gray'})

# Horizontal line at the p-value threshold
plt.axhline(y=-np.log10(threshold_pvalue), color='black', linestyle='--', label='p=0.05')

# Axis labels and title
plt.xlabel('Log2 Fold Change')
plt.ylabel('-Log10(P-value)')
plt.title('Volcano Plot with Gene Labels')

# Annotate the specified genes (filtering rows first avoids relying on a default integer index)
for _, row in df[df['GeneNames'].isin(genes_to_label)].iterrows():
    plt.text(row['log2FC'], -np.log10(row['p-value']), row['GeneNames'],
             horizontalalignment='left', size='medium', color='black', weight='semibold')

# Display the legend, then save and show the figure
plt.legend(title='Significance')
plt.savefig("2.pdf", format='pdf', bbox_inches='tight')
plt.show()

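[Figure: volcano plot with gene labels]

Compared with the basic plot, this version annotates specific genes, which makes the plot more informative and highlights the genes you care about.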
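The list of genes to label above is written by hand. One possible alternative, sketched here as an assumption rather than as part of the original workflow, is to pick the genes with the smallest p-values automatically. The helper name `pick_genes_to_label` and the demo values are hypothetical.

```python
import pandas as pd

def pick_genes_to_label(df, n=3, p_col='p-value', name_col='GeneNames'):
    """Return the names of the n genes with the smallest p-values."""
    return df.nsmallest(n, p_col)[name_col].tolist()

# Hypothetical table with the same column names as the tutorial data
df_demo = pd.DataFrame({
    'GeneNames': ['g1', 'g2', 'g3', 'g4', 'g5'],
    'log2FC': [2.3, -1.8, 0.4, 1.5, -0.2],
    'p-value': [0.001, 0.01, 0.002, 0.20, 0.70],
})

genes_to_label = pick_genes_to_label(df_demo, n=3)
print(genes_to_label)
```

The returned list can be passed straight to the annotation loop. Ranking by p-value alone ignores fold change, so you may prefer to filter to significant genes first.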

Adding threshold lines


          
# Thresholds for calling a gene differentially expressed
threshold_log2FC = 1  # example threshold for log2 fold change
threshold_pvalue = 0.05

# Conditions for 'Significant Up', 'Significant Down', and 'Not Significant'
conditions = [
    (df['log2FC'] > threshold_log2FC) & (df['p-value'] < threshold_pvalue),
    (df['log2FC'] < -threshold_log2FC) & (df['p-value'] < threshold_pvalue),
    (df['p-value'] >= threshold_pvalue)
]
choices = ['Significant Up', 'Significant Down', 'Not Significant']

# Create the Significance column based on the conditions
df['Significance'] = np.select(conditions, choices, default='Not Significant')

# List of genes to label
genes_to_label = ['LOC_Os01g50030.1', 'LOC_Os06g40940.3', 'LOC_Os03g03720.1']  # Replace these with your genes of interest

# Set up the figure
plt.figure(figsize=(10, 6), dpi=1200)

# Scatter plot colored by significance class
sns.scatterplot(x='log2FC', y=-np.log10(df['p-value']), hue='Significance', data=df,
                palette={'Significant Up': 'red', 'Significant Down': 'blue', 'Not Significant': 'gray'})

# Threshold lines: one horizontal (p-value) and two vertical (log2 fold change)
plt.axhline(y=-np.log10(threshold_pvalue), color='black', linestyle='--', label='p=0.05')
plt.axvline(x=threshold_log2FC, color='black', linestyle='--', label='Log2 FC=1', alpha=0.5)  # right
plt.axvline(x=-threshold_log2FC, color='black', linestyle='--', label='Log2 FC=-1', alpha=0.5)  # left

# Axis labels and title
plt.xlabel('Log2 Fold Change')
plt.ylabel('-Log10(P-value)')
plt.title('Volcano Plot with Gene Labels')

# Annotate the specified genes
for _, row in df[df['GeneNames'].isin(genes_to_label)].iterrows():
    plt.text(row['log2FC'], -np.log10(row['p-value']), row['GeneNames'],
             horizontalalignment='left', size='medium', color='black', weight='semibold')

# Display the legend, then save and show the figure
plt.legend(title='Significance')
plt.savefig("3.pdf", format='pdf', bbox_inches='tight')
plt.show()

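[Figure: volcano plot with gene labels and threshold lines]

Compared with the previous version, this one improves readability further by adding threshold lines: two vertical lines at log2 fold change = ±1, which make it quick to spot genes whose expression changed substantially.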
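To make the threshold lines concrete, the short calculation below converts them back to ordinary units: what fold change the vertical lines at ±1 represent, and where the horizontal p = 0.05 line sits on the Y axis. The variable names follow the tutorial code; the derived names are only for illustration.

```python
import math

threshold_log2FC = 1
threshold_pvalue = 0.05

# A log2 fold change of +1 is a doubling; -1 is a halving
fold_up = 2 ** threshold_log2FC
fold_down = 2 ** -threshold_log2FC

# Height of the horizontal significance line on the -log10 scale
y_cutoff = -math.log10(threshold_pvalue)

print(f"right line: {fold_up}x, left line: {fold_down}x, horizontal line at y = {y_cutoff:.3f}")
```

So a gene must at least double or halve its expression, and sit above roughly 1.3 on the Y axis, to be colored red or blue.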

Custom gene labels


          
# Thresholds for calling a gene differentially expressed
threshold_log2FC = 1  # example threshold for log2 fold change
threshold_pvalue = 0.05

# Significance conditions
conditions = [
    (df['log2FC'] > threshold_log2FC) & (df['p-value'] < threshold_pvalue),
    (df['log2FC'] < -threshold_log2FC) & (df['p-value'] < threshold_pvalue),
    (df['p-value'] >= threshold_pvalue)
]
choices = ['Significant Up', 'Significant Down', 'Not Significant']

df['Significance'] = np.select(conditions, choices, default='Not Significant')

# Gene mapping dictionary: original gene name -> custom label
gene_labels = {
    'LOC_Os01g50030.1': 'CPuORF25',
    'LOC_Os06g40940.3': 'GDH',
    'LOC_Os03g03720.1': 'G3PD'
}

# Set up the figure
plt.figure(figsize=(10, 6), dpi=1200)

# Scatter plot colored by significance class
sns.scatterplot(x='log2FC', y=-np.log10(df['p-value']), hue='Significance', data=df,
                palette={'Significant Up': 'red', 'Significant Down': 'blue', 'Not Significant': 'gray'})

# Threshold lines: one horizontal (p-value) and two vertical (log2 fold change)
plt.axhline(y=-np.log10(threshold_pvalue), color='black', linestyle='--', label='p=0.05')
plt.axvline(x=threshold_log2FC, color='black', linestyle='--', label='Log2 FC=1', alpha=0.5)  # right
plt.axvline(x=-threshold_log2FC, color='black', linestyle='--', label='Log2 FC=-1', alpha=0.5)  # left

# Axis labels and title
plt.xlabel('Log2 Fold Change')
plt.ylabel('-Log10(P-value)')
plt.title('Volcano Plot with Custom Gene Labels')

# Annotate the mapped genes with their custom labels
for _, row in df[df['GeneNames'].isin(list(gene_labels))].iterrows():
    plt.text(row['log2FC'], -np.log10(row['p-value']), gene_labels[row['GeneNames']],
             horizontalalignment='left', size='medium', color='black', weight='semibold')

# Display the legend, then save and show the figure
plt.legend(title='Significance')
plt.savefig("4.pdf", format='pdf', bbox_inches='tight')
plt.show()

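[Figure: volcano plot with custom gene labels and threshold lines]

The final version combines custom gene labels, significance threshold lines, and annotations for specific genes, making the plot easier to read and the key genes easier to pick out.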
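Because the four code listings repeat the same classification and plotting steps, they could be collected into one reusable function. The sketch below is one possible way to do that, not the author's code: the function name `volcano_plot` and its parameters are hypothetical, it draws with plain matplotlib `scatter` calls instead of `sns.scatterplot`, and the demo table is made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; drop in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

PALETTE = {'Significant Up': 'red', 'Significant Down': 'blue', 'Not Significant': 'gray'}

def volcano_plot(df, fc_threshold=1.0, p_threshold=0.05, labels=None, ax=None):
    """Draw a volcano plot from columns 'GeneNames', 'log2FC', 'p-value'.

    labels: optional dict mapping gene name -> text to draw next to that gene.
    """
    data = df.copy()  # leave the caller's DataFrame untouched
    neg_log10_p = -np.log10(data['p-value'])
    significant = data['p-value'] < p_threshold
    data['Significance'] = np.select(
        [significant & (data['log2FC'] > fc_threshold),
         significant & (data['log2FC'] < -fc_threshold)],
        ['Significant Up', 'Significant Down'],
        default='Not Significant',
    )

    if ax is None:
        _, ax = plt.subplots(figsize=(10, 6))

    # One scatter call per class so each class gets its own legend entry
    for category, color in PALETTE.items():
        mask = data['Significance'] == category
        ax.scatter(data.loc[mask, 'log2FC'], neg_log10_p[mask],
                   color=color, s=20, label=category)

    # Threshold lines
    ax.axhline(-np.log10(p_threshold), color='black', linestyle='--')
    ax.axvline(fc_threshold, color='black', linestyle='--', alpha=0.5)
    ax.axvline(-fc_threshold, color='black', linestyle='--', alpha=0.5)

    # Optional gene annotations
    for gene, text in (labels or {}).items():
        for _, row in data[data['GeneNames'] == gene].iterrows():
            ax.text(row['log2FC'], -np.log10(row['p-value']), text,
                    horizontalalignment='left', weight='semibold')

    ax.set_xlabel('Log2 Fold Change')
    ax.set_ylabel('-Log10(P-value)')
    ax.legend(title='Significance')
    return ax

# Hypothetical demo data
df_demo = pd.DataFrame({
    'GeneNames': ['g1', 'g2', 'g3', 'g4', 'g5'],
    'log2FC': [2.3, -1.8, 0.4, 1.5, -0.2],
    'p-value': [0.001, 0.01, 0.002, 0.20, 0.70],
})
ax = volcano_plot(df_demo, labels={'g1': 'GeneA'})
```

To write the figure to disk, call `ax.figure.savefig(...)` on the returned axes with the same arguments the listings pass to `plt.savefig`.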

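Summary

This tutorial draws its volcano plots with matplotlib and seaborn, general-purpose visualization libraries that are powerful, flexible, and well suited to custom figures. Python also has a library aimed specifically at bioinformatics analysis, bioinfokit. It supports volcano plots and offers further bioinformatics tools, so it can simplify the plotting steps shown here and support other parts of an analysis. Interested readers may want to explore it on their own.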
