This code is slow for several reasons:
- it is (probably) executed in CPython and written in pure Python which is a slow interpreter not designed for this kind of numerical code;
sim_funcis a generic way to compare various kind of elements but is also very inefficient (allocations, hashing, exception handling and string manipulation).
The code cannot be parallelized easily and so vectorized numpy. However, you can use Numba to speed it up. It will worth it only if the input string are quite big or this processing is executed a lot of time. If this is not the case, please use a more appropriate programming language (eg. C, C++, D, Rust, etc.) or a native Python module dedicated for that.
Here is the optimized Numba code:
s1 = "sentence1" s2 = "sentevfers2" gap = 1 # Assume this is an integer @njit def NeedlemanWunschDP(dist_mat, s1, s2): for i in range(1, len(s1)+1): for j in range(1, len(s2)+1): match = dist_mat[i - 1, j - 1] + (s1[i-1] == s2[j-1]) delete = dist_mat[i - 1, j] - gap insert = dist_mat[i, j - 1] - gap dist_mat[i, j] = max(match, delete, insert) dist_mat = numpy.empty( (len(s1) + 1, len(s2) + 1), dtype=numpy.int64, ) # DP initialization for i in range(len(s1) + 1): dist_mat[i, 0] = -(i * gap) # DP initialization for j in range(len(s2) + 1): dist_mat[0, j] = -(j * gap) # Transform the strings to fast integer arrays tmp_s1 = numpy.array([ord(e) for e in s1], dtype=numpy.int64) tmp_s2 = numpy.array([ord(e) for e in s2], dtype=numpy.int64) # Needleman-Wunsch DP calculation NeedlemanWunschDP(dist_mat, tmp_s1, tmp_s2) distance = dist_mat[dist_mat.shape - 1, dist_mat.shape - 1] print(distance)
The compilation time of
NeedlemanWunschDP takes about 400 ms on my machine but the resulting code is more than 1800 times faster on huge strings.
CLICK HERE to find out more related problems solutions.