Nested enumerated for loops to comprehension list

This code is slow for several reasons:

  • it is (probably) executed in CPython and written in pure Python which is a slow interpreter not designed for this kind of numerical code;
  • sim_func is a generic way to compare various kind of elements but is also very inefficient (allocations, hashing, exception handling and string manipulation).

The code cannot be parallelized easily and so vectorized numpy. However, you can use Numba to speed it up. It will worth it only if the input string are quite big or this processing is executed a lot of time. If this is not the case, please use a more appropriate programming language (eg. C, C++, D, Rust, etc.) or a native Python module dedicated for that.

Here is the optimized Numba code:

s1 = "sentence1"
s2 = "sentevfers2"
gap = 1  # Assume this is an integer

def NeedlemanWunschDP(dist_mat, s1, s2):
    for i in range(1, len(s1)+1):
        for j in range(1, len(s2)+1):
            match = dist_mat[i - 1, j - 1] + (s1[i-1] == s2[j-1])
            delete = dist_mat[i - 1, j] - gap
            insert = dist_mat[i, j - 1] - gap
            dist_mat[i, j] = max(match, delete, insert)

dist_mat = numpy.empty(
    (len(s1) + 1, len(s2) + 1),

# DP initialization
for i in range(len(s1) + 1):
    dist_mat[i, 0] = -(i * gap)

# DP initialization
for j in range(len(s2) + 1):
    dist_mat[0, j] = -(j * gap)

# Transform the strings to fast integer arrays
tmp_s1 = numpy.array([ord(e) for e in s1], dtype=numpy.int64)
tmp_s2 = numpy.array([ord(e) for e in s2], dtype=numpy.int64)
# Needleman-Wunsch DP calculation
NeedlemanWunschDP(dist_mat, tmp_s1, tmp_s2)
distance = dist_mat[dist_mat.shape[0] - 1, dist_mat.shape[1] - 1]

The compilation time of NeedlemanWunschDP takes about 400 ms on my machine but the resulting code is more than 1800 times faster on huge strings.

CLICK HERE to find out more related problems solutions.

Leave a Comment

Your email address will not be published.

Scroll to Top