Plotting the Number of HLA Alleles from IMGT Stats

The human leukocyte antigen (HLA) system is a gene complex encoding the major histocompatibility complex (MHC) proteins in humans. The complex is located in the 6p21.3 region on the short arm of chromosome 6. These genes are highly polymorphic, meaning that many different HLA alleles exist. More information on HLA and HLA alleles can be found at hla.alleles.org and the IMGT/HLA Database. This notebook plots the rapidly increasing number of HLA alleles over time, using data parsed directly from the IMGT/HLA Statistics page. The notebook file, described below, can be downloaded here.

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime

%matplotlib inline
plt.style.use('ggplot')
In [2]:
# Read every table on the IMGT/HLA statistics page and keep the second one (index 1)
# (pd.read_html needs an HTML parser such as lxml installed)
url = 'https://www.ebi.ac.uk/ipd/imgt/hla/stats.html'
df = pd.read_html(url, encoding='utf-8', skiprows=1, header=0)[1]
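Selecting the table by position (`[1]`) depends on the page layout staying the same. A minimal sketch of a more defensive selection, assuming the table of interest is the one containing the `Month` and `Alleles` columns used later in this notebook. `pick_table` is a hypothetical helper, not part of pandas, and the small frames are made-up stand-ins for the list that `pd.read_html` returns:

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame containing every required column.

    Hypothetical helper: `tables` is a list of DataFrames, such as the
    list returned by pd.read_html.
    """
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"no table has columns {sorted(required)}")


# Made-up stand-ins for the parsed page (not real IMGT/HLA data).
tables = [
    pd.DataFrame({"Locus": ["A", "B"], "Count": [1, 2]}),
    pd.DataFrame({"Month": ["2000-01", "2000-04"], "Alleles": [10, 12]}),
]

stats = pick_table(tables, ["Month", "Alleles"])
```

If the page is ever reorganised, this version fails with a clear error instead of silently loading the wrong table.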
In [3]:
#parse the HLA dictionary release dates
df['Month'] = df['Month'].apply(lambda x: datetime.strptime(x, '%Y-%m'))
#reset dates as index
df.set_index('Month', inplace=True)
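The same parsing can be done without a Python-level lambda. A sketch using `pd.to_datetime`, assuming, as the `'%Y-%m'` format string above implies, that `Month` holds strings such as `2000-01`. The values are invented for illustration:

```python
import pandas as pd

# Invented sample rows, not real IMGT/HLA statistics.
sample = pd.DataFrame({
    "Month": ["2000-01", "2000-04", "2000-07"],
    "Alleles": [10, 12, 15],
})

# Vectorized parse; malformed strings raise an error by default.
sample["Month"] = pd.to_datetime(sample["Month"], format="%Y-%m")

# Use the release dates as the index, returning a new frame.
sample = sample.set_index("Month")
```

The result has a `DatetimeIndex`, which is what lets pandas label the x-axis by year in the plots.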
In [4]:
#plot just the number of alleles over time
#each dot represents a release
plt.figure(figsize=(12,5))
plt.title("Number of HLA Alleles Over Time")
plt.ylabel("Number of Alleles")
df['Alleles'].plot(kind='line', color='r', alpha=0.8, marker='.', linestyle='')
plt.xlabel('Year')
legend = plt.legend(loc='upper left', frameon=True, shadow=True)
frame = legend.get_frame()
frame.set_facecolor('white')
In [5]:
#plot with 'component entries'
plt.figure(figsize=(12,5))
plt.title("IMGT HLA Database Growth Over Time")
plt.ylabel("Count")
df['Alleles'].plot(kind='line', color='r', alpha=0.8, marker='.', linestyle='')
df['Component Entries'].plot(kind='line', color='b', alpha=0.5, marker='.', linestyle='')
plt.xlabel('Year')
legend = plt.legend(loc='upper left', frameon=True, shadow=True)
frame = legend.get_frame()
frame.set_facecolor('white')
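The plots show cumulative totals. To see how fast the database grows, the year-on-year increase can be derived from the same date-indexed column. A sketch under the assumption that `Alleles` is a cumulative count that never decreases. The numbers are invented:

```python
import pandas as pd

# Invented cumulative counts, indexed by release date.
index = pd.to_datetime(
    ["2000-01", "2000-07", "2001-01", "2001-07", "2002-01"], format="%Y-%m"
)
alleles = pd.Series([10, 15, 21, 30, 42], index=index, name="Alleles")

# Largest (i.e. final) count reported in each calendar year.
per_year = alleles.groupby(alleles.index.year).max()

# Increase over the previous year's final count; the first year has no
# predecessor, so it is dropped.
added = per_year.diff().dropna().astype(int)
```

On the real data, `df['Alleles']` would take the place of `alleles`.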
References:

Robinson J, Halliwell JA, Hayhurst JH, Flicek P, Parham P, Marsh SGE
The IPD and IMGT/HLA database: allele variant databases
Nucleic Acids Research (2015) 43:D423-431

Robinson J, Malik A, Parham P, Bodmer JG, Marsh SGE
IMGT/HLA - a sequence database for the human major histocompatibility complex
Tissue Antigens (2000) 55:280-287
