Problem Description
Problem
Given two strings $s$ and $t$, $t$ is a substring of $s$ if $t$ is contained as a contiguous collection of symbols in $s$ (as a result, $t$ must be no longer than $s$).
The position of a symbol in a string is the total number of symbols found to its left, including itself (e.g., the positions of all occurrences of ‘U’ in “AUGCUUCAGAAAGGUCUUACG” are 2, 5, 6, 15, 17, and 18). The symbol at position $i$ of $s$ is denoted by $s[i]$.
A substring of $s$ can be represented as $s[j:k]$, where $j$ and $k$ represent the starting and ending positions of the substring in $s$; for example, if $s$ = “AUGCUUCAGAAAGGUCUUACG”, then $s[2:5]$ = “UGCU”.
The location of a substring $s[j:k]$ is its beginning position $j$; note that $t$ will have multiple locations in $s$ if it occurs more than once as a substring of $s$ (see the Sample below).
Given: Two DNA strings $s$ and $t$ (each of length at most 1 kbp).
Return: All locations of $t$ as a substring of $s$.
Sample Dataset
GATATATGCATATACTT
ATAT
Sample Output
2 4 10
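The position convention above is 1-based with both ends included, which differs from Python's 0-based, end-exclusive slices. The following is a minimal sketch of the conversion; the helper name `substring_1based` is made up for illustration and is not part of the problem statement.

```python
def substring_1based(s, j, k):
    """Return s[j:k] in the problem's notation: 1-based, both ends included."""
    # Position j is index j - 1 in Python; the end-exclusive slice stops at k.
    return s[j - 1:k]


s = "AUGCUUCAGAAAGGUCUUACG"
print(substring_1based(s, 2, 5))  # UGCU

# 1-based positions of every 'U', matching the example in the statement.
print([i + 1 for i, c in enumerate(s) if c == "U"])  # [2, 5, 6, 15, 17, 18]
```

This is why a match starting at 0-based index `i` must be reported as location `i + 1`.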
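Solution
Given the input size, it is enough to compare $t$ against $s$ at every starting position, one by one. With $n = |s|$ and $m = |t|$, the time complexity is $O(mn)$.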
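A minimal sketch of this position-by-position comparison, assuming the two strings are already in memory; the function name `find_all_naive` is chosen here for illustration only.

```python
def find_all_naive(s, t):
    """Return the 1-based locations of t in s by checking every start index."""
    n, m = len(s), len(t)
    # Each slice comparison costs up to m character comparisons,
    # and there are n - m + 1 candidate starts, hence O(mn) overall.
    return [i + 1 for i in range(n - m + 1) if s[i:i + m] == t]


print(*find_all_naive("GATATATGCATATACTT", "ATAT"))  # 2 4 10
```

Overlapping occurrences (locations 2 and 4 in the sample) are found because every start index is tested independently.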
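If the strings are long and highly repetitive, as nucleic acid sequences often are, the KMP algorithm gives a worthwhile speedup.
Roughly, KMP builds a failure function over the pattern. When a mismatch occurs at pattern index i (0-based), the first Fail[i] characters of the pattern are already known to match the text immediately before the mismatch, so matching resumes at pattern index Fail[i] without moving backward in the text. The time complexity is $O(n + m)$ for text length $n$ and pattern length $m$, which is $O(n)$ here because $m \le n$.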
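To make the meaning of the failure function concrete, the sketch below computes it directly from its definition (the longest proper prefix that is also a suffix) rather than with the linear-time KMP construction. It is deliberately slow and only meant for inspecting small patterns; the names `longest_border` and `fail_table_naive` are illustrative, and the table layout (one entry per prefix length, starting with the empty prefix) is an assumption chosen to match a 0-based implementation.

```python
def longest_border(prefix):
    """Length of the longest proper prefix of `prefix` that is also its suffix."""
    for k in range(len(prefix) - 1, 0, -1):
        if prefix[:k] == prefix[-k:]:
            return k
    return 0


def fail_table_naive(pattern):
    """Entry i is the border length of pattern[:i]; entry 0 is for the empty prefix."""
    return [0] + [longest_border(pattern[:i]) for i in range(1, len(pattern) + 1)]


print(fail_table_naive("ATAT"))  # [0, 0, 0, 1, 2]
```

For the sample pattern, the last entry is 2: after a full match of "ATAT", the trailing "AT" can be reused as the start of the next candidate match, which is how the overlapping locations 2 and 4 are both found.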
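Reference Code
Only the KMP-optimized version is given here. In Python, take care with indices near the end of the list so that they do not run out of range; in C/C++ and similar languages, a sufficiently large preallocated array sidesteps this concern.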
def getFail(matSeq):
    # failLink[i] is the length of the longest proper prefix of matSeq[:i]
    # that is also a suffix of it; the list ends up with len(matSeq) + 1 entries.
    failLink = [0, 0]
    for i in range(1, len(matSeq)):
        j = failLink[i]
        while j and matSeq[i] != matSeq[j]:
            j = failLink[j]
        if matSeq[i] == matSeq[j]:
            failLink.append(j + 1)
        else:
            failLink.append(0)
    return failLink


def strFindAll(querySeq, matSeq, failLink, fo):
    j = 0  # number of pattern characters currently matched
    m = len(matSeq)
    for i in range(len(querySeq)):
        # After a full match (j == m) or on a mismatch, fall back along failLink.
        while j and (j == m or matSeq[j] != querySeq[i]):
            j = failLink[j]
        if j < m and matSeq[j] == querySeq[i]:
            j = j + 1
        if j == m:
            # The match ends at 0-based index i, so its 1-based start is i - m + 2.
            fo.write("%d " % (i - m + 2))


with open("rosalind_subs.txt", "r") as f:
    query = f.readline().rstrip()
    mat = f.readline().rstrip()

link = getFail(mat)
with open("out.txt", "w") as fo:
    strFindAll(query, mat, link, fo)
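One way to sanity-check the contents of out.txt is to compare them against an independent search. The sketch below swaps in Python's built-in `str.find` instead of KMP, restarting one character after each hit so that overlapping occurrences are kept; the function name `find_all_builtin` is an illustrative choice.

```python
def find_all_builtin(s, t):
    """Return 1-based locations of t in s using str.find, overlaps included."""
    locations = []
    start = s.find(t)
    while start != -1:
        locations.append(start + 1)  # convert the 0-based index to a 1-based location
        # Resume one character later (not len(t) later) to keep overlapping matches.
        start = s.find(t, start + 1)
    return locations


print(*find_all_builtin("GATATATGCATATACTT", "ATAT"))  # 2 4 10
```

If the numbers written by the KMP code differ from this list on the same input, the indexing offset is the first place to look.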
