【Rosalind】Finding a Motif in DNA

Latest recommended article published on 2026-06-20 22:17:21

This content is licensed under CC 4.0 BY-SA.

Tags

#algorithms #bioinformatics

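This post explains how to find every position at which one DNA string occurs as a substring of another. It gives a Python implementation of the KMP algorithm, which reports all occurrences of the pattern in the main string in $O(n + m)$ time, where $n$ and $m$ are the lengths of the main string and the pattern. The post covers the problem description, the solution approach, and the code, with attention to handling DNA sequences in bioinformatics and to applying an optimized algorithm.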

Problem Description

Problem

Given two strings $s$ and $t$, $t$ is a substring of $s$ if $t$ is contained as a contiguous collection of symbols in $s$ (as a result, $t$ must be no longer than $s$).

The position of a symbol in a string is the total number of symbols found to its left, including itself (e.g., the positions of all occurrences of ‘U’ in “AUGCUUCAGAAAGGUCUUACG” are 2, 5, 6, 15, 17, and 18). The symbol at position $i$ of $s$ is denoted by $s[i]$.

A substring of $s$ can be represented as $s[j:k]$, where $j$ and $k$ represent the starting and ending positions of the substring in $s$; for example, if $s$ = “AUGCUUCAGAAAGGUCUUACG”, then $s[2:5]$ = “UGCU”.

The location of a substring $s[j:k]$ is its beginning position $j$; note that $t$ will have multiple locations in $s$ if it occurs more than once as a substring of $s$ (see the Sample below).

Given: Two DNA strings $s$ and $t$ (each of length at most 1 kbp).

Return: All locations of $t$ as a substring of $s$.

Sample Dataset

GATATATGCATATACTT
ATAT

Sample Output

2 4 10
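
The problem uses 1-based positions and an inclusive end index, whereas Python slices are 0-based with an exclusive end. The following is a small sketch of how the two conventions map onto each other; the helper names `substring` and `positions_of` are made up for illustration and do not come from Rosalind.

```python
def substring(s, j, k):
    """Return s[j:k] in the problem's notation (1-based, end inclusive)."""
    # Shift the start down by one; the exclusive Python end equals the inclusive k.
    return s[j - 1:k]


def positions_of(s, symbol):
    """Return the 1-based positions of every occurrence of a single symbol."""
    return [i + 1 for i, c in enumerate(s) if c == symbol]


rna = "AUGCUUCAGAAAGGUCUUACG"
print(substring(rna, 2, 5))    # UGCU
print(positions_of(rna, "U"))  # [2, 5, 6, 15, 17, 18]
```

Both results agree with the examples given in the problem statement, so any 0-based index found in Python only needs `+ 1` before it is reported as a location.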

Solution

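Given the input size (both strings are at most 1 kbp), simply comparing $t$ against every starting position of $s$ is fast enough; this takes $O(mn)$ time, where $n$ is the length of $s$ and $m$ is the length of $t$.
If the pattern is long, and because nucleotide sequences contain many repeats, the KMP algorithm can be used instead and gives a noticeable speedup.
Roughly, KMP builds a failure function over the pattern: Fail[i] is the length of the longest proper prefix of the first i pattern characters that is also a suffix of them. If a mismatch occurs after i pattern characters have matched, the first Fail[i] pattern characters are already known to match the text, so the comparison resumes at pattern index Fail[i] (0-based) instead of starting over. The total running time is $O(n + m)$.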
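
To make the failure function concrete, the sketch below computes it directly from its definition by brute force. This is only an illustration of what the table contains, not the linear-time construction KMP actually uses, and the names `border_length` and `border_table` are hypothetical.

```python
def border_length(prefix):
    """Length of the longest proper prefix of `prefix` that is also its suffix."""
    for k in range(len(prefix) - 1, 0, -1):
        if prefix[:k] == prefix[-k:]:
            return k
    return 0


def border_table(pattern):
    """Entry i describes the first i characters, so there are len(pattern) + 1 entries."""
    return [border_length(pattern[:i]) for i in range(len(pattern) + 1)]


print(border_table("ATAT"))  # [0, 0, 0, 1, 2]
```

The last entry, 2, says that after a full match of ATAT the final "AT" can be reused as the start of the next candidate match. That is how the overlapping occurrences at locations 2 and 4 of the sample are both found without rescanning the text.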

Reference Code

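Only the KMP version is given here. In Python, take care with the index at the end of the list so that it never goes out of range (the failure table has one more entry than the pattern has characters); in C/C++ and similar languages, a preallocated array makes this less of a concern.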

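def getFail(matSeq):
    # failLink[i] is the length of the longest proper prefix of matSeq[:i]
    # that is also a suffix of it; the list ends up with len(matSeq) + 1 entries.
    failLink = [0, 0]
    for i in range(1, len(matSeq)):
        j = failLink[i]
        while j and matSeq[i] != matSeq[j]:
            j = failLink[j]
        if matSeq[i] == matSeq[j]:
            failLink.append(j + 1)
        else:
            failLink.append(0)
    return failLink

def strFindAll(querySeq, matSeq, failLink):
    # Return the 1-based start position of every occurrence of matSeq in querySeq.
    positions = []
    j = 0  # number of pattern characters currently matched
    m = len(matSeq)
    for i in range(len(querySeq)):
        # After a full match (j == m) or on a mismatch, fall back along failLink.
        # Checking j == m first keeps matSeq[j] from going out of range.
        while j and (j == m or matSeq[j] != querySeq[i]):
            j = failLink[j]
        if j < m and matSeq[j] == querySeq[i]:
            j = j + 1
        if j == m:
            # The match ends at 0-based index i, so it starts at i - m + 1 (0-based).
            positions.append(i - m + 2)
    return positions

# The with blocks close both files automatically.
with open("rosalind_subs.txt", "r") as f:
    query = f.readline().rstrip()
    mat = f.readline().rstrip()

link = getFail(mat)
with open("out.txt", "w") as fo:
    fo.write(" ".join(str(p) for p in strFindAll(query, mat, link)))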
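
For inputs of this size, the brute-force scan mentioned in the solution section is also sufficient. The following is a minimal sketch of that approach rather than the author's code; the function name `naive_locations` is made up for illustration. It can also serve as a cross-check for the KMP output.

```python
def naive_locations(s, t):
    """Return the 1-based start positions of t in s by testing every offset."""
    n, m = len(s), len(t)
    locations = []
    # Only offsets that leave room for all m characters need to be tried;
    # the range is empty when t is longer than s.
    for start in range(n - m + 1):
        if s[start:start + m] == t:
            locations.append(start + 1)  # convert to a 1-based location
    return locations


print(*naive_locations("GATATATGCATATACTT", "ATAT"))  # 2 4 10
```

Each of the up to $n - m + 1$ offsets costs at most $m$ character comparisons, which gives the $O(mn)$ bound; with both strings capped at 1 kbp this is at most about a million comparisons.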