Covid-19 Genome Analysis

Image credit: Daniel Diaz

To see the original Jupyter Notebook, check it out on Github Here. I have included a version of the notebook below.

Covid-19 Genome Analysis Notebook

GOAL: Plot Covid-19 Mutations

Covid-19 Genomes


I want to play around with coronavirus genome. I found this really cool article here where they talked about one-hot encoding a genome and plotting it. I want to see if I could apply that to the corona virus. My results are below.

from Bio import Entrez
from Bio import SeqIO
import os
import re
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
import numpy as np
import as cm
import matplotlib.pyplot as plt

Search NCBI Data Online

Manually compile a list of genome ids we will use. These genome IDs will be used to download the genebank files from the NCBI database. = ""  # Always tell NCBI who you are
# List of genomes
# MN908947 -> Ref Genome
genbank_ids = ['MN908947', 'LC534418.1', 'MN985325.1', 'MN988713.1', 'MT077125.1', 'MT093571.1', 'MT044258.1', 'MT039888.1', 'MT039887.1', 'MT027062.1', 'MT019531.1']
# Ref Genome: NC_045512
for g in genbank_ids:
    handle = Entrez.efetch(db="nucleotide", id=g, rettype="gb", retmode="text")
    text =
    # print(text)
    assert text is not None

Download all the genomes from our search and store them as .gb files.

  • store them in the generated/ folder
count = 0
for genome_id in genbank_ids:
    filename = f'{os.path.abspath(".")}/generated/genBankRecord_{genome_id}.gb'
    # Lets not download the genome if we already have it.
    if os.path.isfile(filename):

    record = Entrez.efetch(db="nucleotide", id=genome_id, rettype="gb", retmode="text")

    with open(filename, 'w') as f:
    # Im noticing some really wonky file permissions with jupyter notebook and pycharm
    # going to force it to be this:
    # Dont forget this is python3 specific syntax
    os.chmod(filename, 0o666)


Download ref genome for sars-covid-2

Which btw was made with an Illumina Genome Sequencer! :) Nice

Note: I also source controlled this file, so there is no need to download it.

def download_ref_genome(): = ""  # Always tell NCBI who you are
    record = Entrez.efetch(db="nucleotide", id="MN908947", rettype="gb", retmode="text")
    filename = f'{os.path.abspath(".")}/ref_genome/'
    with open(filename, 'w') as f:
        content =
    print('File Written:{}'.format(filename))

# Note I have noticed a weird behavior with pycharm + jupyter notebook where you wont see the
# file locally unless you click out of pycharm then back in.

# download_ref_genome()

Load Genome Functions

def load_ref_genome():
    ref_genome_path = f"{os.path.abspath('.')}/ref_genome/"
    ref_genome_seq = None
    for seq_record in SeqIO.parse("./ref_genome/", "genbank"):
        ref_genome_seq = seq_record.seq
    return ref_genome_seq
# Generic Load Functions
def load_genomes(folder_path=os.path.abspath('.')+"/generated/"):
    f = []
    print(f"Using folder {folder_path} to load genomes")
    for (root, dirnames, filenames) in os.walk(folder_path):
        for filename in filenames:
    genomes = []
    for genome_file in f:
        print(f"Getting genome from file {genome_file}")
        for seq_record in SeqIO.parse(genome_file, "genbank"):
            genome_seq = str(seq_record.seq)
            # Lets assume theres one genome per file
    return genomes

Munge the genomes

Not all the genomes are the same len. I am attempting to make them the same size as the ref genome, and adding a z whenever i do any padding.

Eventually I would want to figure out a better way to align the genomes, or take the different sizes into account as that is prob significant.

def munged_genomes():
    g = load_genomes()
    ref_g = g[0]

    munged_genomes = []
    for genome in g:
        if len(genome) < len(ref_g):
            # Need to extend genome so len matches
            m_genome = genome.ljust(len(ref_g), 'z')
        elif len(genome) > len(ref_g):

    # Sanity Check
    l = len(ref_g)
    for x in munged_genomes:
        assert len(x) == l

    return munged_genomes
# Compute the munged genome sequences
munged_genome_sequences = munged_genomes()
Using folder /Users/ddiaz/src/corona/generated/ to load genomes
Getting genome from file /Users/ddiaz/src/corona/generated/
Getting genome from file /Users/ddiaz/src/corona/generated/
Getting genome from file /Users/ddiaz/src/corona/generated/
Getting genome from file /Users/ddiaz/src/corona/generated/
Getting genome from file /Users/ddiaz/src/corona/generated/
Getting genome from file /Users/ddiaz/src/corona/generated/
Getting genome from file /Users/ddiaz/src/corona/generated/
Getting genome from file /Users/ddiaz/src/corona/generated/
Getting genome from file /Users/ddiaz/src/corona/generated/
Getting genome from file /Users/ddiaz/src/corona/generated/
Getting genome from file /Users/ddiaz/src/corona/generated/

Set up functions to transform the genome into its one-hot encoded form.

# Setup One Hot Encoding Function

# One hot encode a DNA sequence string
# non 'acgt' bases (n) are 0000
# returns a L x 4 numpy array

label_encoder = LabelEncoder()
# z is also used when genomes are unequal in size.['a','c','g','t','z']))

def string_to_array(my_string):
    my_string = my_string.lower()
    my_string = re.sub('[^acgt]', 'z', my_string)
    my_array = np.array(list(my_string))
    return my_array

def one_hot_encoder(my_array):
    integer_encoded = label_encoder.transform(my_array)
    # n_values is very important, it ensures all the ecoded genomes are nx5
    # note: n_values was deprecated, have to use categories instead
    onehot_encoder = OneHotEncoder(sparse=False, dtype=int, categories=[range(5)])
    integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
    onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
    onehot_encoded = np.delete(onehot_encoded, -1, 1)
    return onehot_encoded

# Test one hot encoder
def test_one_hot_encoder():
    test_sequence = 'AACGCGGTTNN'
    test_sequence_hot = one_hot_encoder(string_to_array(test_sequence))
    expected_sequence_hot =   [[1, 0, 0, 0],
                               [1, 0, 0, 0],
                               [0, 1, 0, 0],
                               [0, 0, 1, 0],
                               [0, 1, 0, 0],
                               [0, 0, 1, 0],
                               [0, 0, 1, 0],
                               [0, 0, 0, 1],
                               [0, 0, 0, 1],
                               [0, 0, 0, 0],
                               [0, 0, 0, 0]]

    # Lets check this function is working as expected
    assert np.array_equal(test_sequence_hot, expected_sequence_hot)

One Hot Encode ref genome.

# r = ref_genome_seq[0:5]
ref_seq_hot = one_hot_encoder(string_to_array(str(load_ref_genome())))

# DEV Note: Because there are no unknown letters in the ref genome, the output hot encoding is only 3 positions instead of
# 4 like in the test above.
# We can force this to be 4, by updating the one hot encoding function to take in a categories variable, which is what we have done.
# print(ref_seq_hot)

Quick note:

Notice then above array is only 3 bits

[1 0 0]
A 1 in the first position means A.
A 1 in the second position means C.
A 1 in the third position means G.
And all zeros means T

Let’s create a generic function that does the genome load and one hot encoding

def get_one_hot_genome_encoding(munged_genome_sequence):
    :returns A one hot encode genome
    genome_seq = str(munged_genome_sequence)
    seq_hot = one_hot_encoder(string_to_array(genome_seq))
    return seq_hot

Lets get all the one hot encodings for the gb files we have

def get_one_hot_genome_encodings(munged_genome_sequences):
    result = []
    for genome in munged_genome_sequences:
        r = get_one_hot_genome_encoding(genome)
    return result
one_hot_encoded_genomes = get_one_hot_genome_encodings(munged_genome_sequences)

# print("Array of one hot encoded genomes")
# print(one_hot_encoded_genomes)

Next step is to try and generate the mutation diagram

  • Each row will be an different seq.
  • Below is a test mutation diagram.
g1_ref = [[1,0,0,0],[0,1,0,0],[0,0,1,0]]
g2 = [[1,0,0,0],[0,1,0,0],[0,0,1,0]]
# This one has a mutation in the 3rd letter
g3 = [[1,0,0,0],[0,1,0,0],[0,0,0,1]]

# Two options, we can either compare to the ref genome, or compare to an array that represents
# the most common letter in that position

# Lets start with the former, then move to the latter
# A final zero bit (black) will be ref genome
# A one (white) will be mutation from ref genome

# X = np.random.random((100, 100)) # sample 2D array
# plt.imshow(X, cmap="gray")

# this is our basline, all zeros
genome_chart = [[0]*len(g1_ref)]
# print(genome_chart)

def compare(ref_genome, new_genome):
    i = 0
    result = []
    for i in range(len(ref_genome)):
        if np.array_equal(ref_genome[i], new_genome[i]):
    return result

result1 = compare(g1_ref,g2)
result2 = compare(g1_ref,g3)

plt.imshow(genome_chart, cmap="gray")
[0, 0, 0]
[0, 0, 1]
[[0, 0, 0], [0, 0, 0], [0, 0, 1]]


Ok let’s pause.

Above we created 3 sample genomes that were only 3 letters long. one was the ref genome. Only one genome had a mutation in the last position. So if we compare that to the chart, the top row is the ref genome, its all black becuase its the ref. THe bottom row is the mutated genome, with the last column, last row being white as thats were the mutation is.

# Now lets do the same thing to the 3 genomes we have.
g1 = ref_seq_hot
gN = one_hot_encoded_genomes
# imchart requires floats
genome_chart = [[float(0)]*len(g1)]
def compare_with_fudging(ref_genome, new_genome):
    # TODO: this func shouldnt be needed anymore cause we did some munging earlier
    i = 0
    result = []
    for i in range(len(ref_genome)):
        # check if arrays are equal, if not write a 1
        # sometimes arrays arent equal. for now, lets just assume thats a mutation
            if np.array_equal(ref_genome[i], new_genome[i]):
        except IndexError:
    return result

for genome in gN[:]:
    c = compare_with_fudging(g1,genome)

# for some reason i am getting a type error when i do the full array, for now , lets justt do a partial.
# genome_chart_debug = [genome_chart[0][:500], genome_chart[1][:500], genome_chart[2][:500]]
genome_chart_debug = []
for arr in genome_chart:

Plot the genome mutation matrix

# plt.figure(figsize = (15,5))
plt.imshow(np.real(genome_chart_debug), cmap="gray", interpolation="nearest", aspect='auto')



Above you will see 50 base pairs (Position 250->300). The black bar at the top being the ref genome, and the white lines representing mutations.

What can be improved

  • Right now I am only using a few genomes, and only a subset of the genome, so the chart still looks funky, and is not the checkerboard I was expecting.
  • Also, I did not take into consideration actually aligning the genome.
  • I also added some fudge factor for unequal genome lengths.
  • Everything is being compared against the ref genome. It would probably be more interesting to derive the most common letter for that position, then label one wild, and one mutation.

Daniel Diaz
Daniel Diaz
Sr. DevOps Engineer

My interests include Python, Kubernetes and Machine Learning.