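- 1 Chinese data preprocessing
- a. Segment the Chinese text into words
- b. Use regular expressions to generalize special token types such as numbers, times, dates, and URLs
- c. Do not keep organization names bundled as a single token, since splitting them makes information extraction easier. For example, 同济大学土木工程学院 is better split into "同济大学" and "土木工程学院"
- 2 English data preprocessing
- a. Convert all uppercase letters to lowercase
- b. Separate punctuation from words with spaces
- c. Apply the same generalization as in step 1b
- 3 Word alignment (tool: GIZA++)
- 4 Phrase translation table construction (phrase extraction, probability estimation)
- 5 Decoding (beam search)
Note: it also helps to convert full-width characters to half-width and to keep the placeholder name for each generalized type consistent, among other things.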
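
The preprocessing steps above can be sketched as follows. This is a minimal illustration, not a reference pipeline: the placeholder tokens (`__url__`, `__date__`, `__time__`, `__num__`), the regular expressions, and the function names are all assumptions chosen for the example, and NFKC normalization is used as one convenient way to map full-width characters to half-width.

```python
import re
import unicodedata

# Hypothetical patterns and placeholder names; real corpora need broader rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
DATE_RE = re.compile(r"\d{4}[-/年]\d{1,2}[-/月]\d{1,2}日?")
TIME_RE = re.compile(r"\d{1,2}:\d{2}(?::\d{2})?")
NUM_RE = re.compile(r"\d+(?:\.\d+)?")
PUNCT_RE = re.compile(r"([^\w\s])")  # placeholders contain only \w characters


def to_halfwidth(text):
    """Map full-width characters to half-width using NFKC normalization.

    NFKC also folds other compatibility characters, which is assumed
    to be acceptable for this sketch.
    """
    return unicodedata.normalize("NFKC", text)


def generalize(text):
    """Replace URLs, dates, times, and numbers with placeholder tokens."""
    # Order matters: URLs, dates, and times contain digits, so plain
    # numbers are replaced last.
    text = URL_RE.sub(" __url__ ", text)
    text = DATE_RE.sub(" __date__ ", text)
    text = TIME_RE.sub(" __time__ ", text)
    text = NUM_RE.sub(" __num__ ", text)
    return text


def preprocess_english(line):
    """Lowercase, generalize, and separate punctuation from words."""
    line = to_halfwidth(line).lower()
    line = generalize(line)
    line = PUNCT_RE.sub(r" \1 ", line)
    return " ".join(line.split())


def preprocess_chinese(line):
    """Normalize width and generalize; word segmentation is a separate step."""
    return " ".join(generalize(to_halfwidth(line)).split())
```

Using the same placeholder names for both languages keeps the generalized tokens aligned one-to-one, which is what the note about consistent naming is asking for.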
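- Ext 1: Chinese word segmentation
- jieba (reading the source code directly):
a. Dynamic-programming segmentation based on a prefix dictionary
    jieba.cut(self, sentence, cut_all=False, HMM=True, use_paddle=False)
    '''
    With cut_all=False and HMM=False, jieba runs dynamic-programming segmentation over the prefix dictionary.
    FREQ is the word-frequency dictionary. The dictionary file lists triples (word, frequency, part-of-speech tag), for example:
    AT&T 3 nz
    B超 3 n
    c# 3 nz
    C# 3 nz
    c++ 3 nz
    C++ 3 nz
    get_DAG builds the directed acyclic graph (DAG) of all dictionary words found in the sentence.
    calc finds the maximum-probability path by dynamic programming; route[i] holds
    (best log-probability of segmenting sentence[i:], end index of the first word on that path).
    '''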
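    # xrange is provided by jieba._compat (it is range on Python 3)
    def get_DAG(self, sentence):
        self.check_initialized()
        DAG = {}
        N = len(sentence)
        for k in xrange(N):  # visit every start position k
            tmplist = []
            i = k
            frag = sentence[k]
            while i < N and frag in self.FREQ:
                if self.FREQ[frag]:
                    tmplist.append(i)  # edge k -> i: sentence[k:i + 1] is a dictionary word
                i += 1
                frag = sentence[k:i + 1]
            if not tmplist:
                tmplist.append(k)  # no word starts at k: fall back to the single character
            DAG[k] = tmplist
        return DAG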
    # Find the best path by dynamic programming over the DAG
    def calc(self, sentence, DAG, route):
        N = len(sentence)
        route[N] = (0, 0)
        logtotal = log(self.total)
        for idx in xrange(N - 1, -1, -1):  # iterate backwards from the last character
            route[idx] = max((log(self.FREQ.get(sentence[idx:x + 1]) or 1) -
                              logtotal + route[x + 1][0], x) for x in DAG[idx])
    # Main segmentation routine
    def __cut_DAG_NO_HMM(self, sentence):
        DAG = self.get_DAG(sentence)
        route = {}
        self.calc(sentence, DAG, route)
        x = 0
        N = len(sentence)
        buf = ''
        while x < N:
            y = route[x][1] + 1  # end of the best word starting at x
            l_word = sentence[x:y]
            if re_eng.match(l_word) and len(l_word) == 1:
                # buffer single letters and digits so consecutive ones are merged
                buf += l_word
                x = y
            else:
                if buf:
                    yield buf
                    buf = ''
                yield l_word
                x = y
        if buf:
            yield buf
            buf = ''
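
To see the DAG construction and the right-to-left dynamic program work end to end, here is a standalone sketch on a toy dictionary. It mirrors the logic of `get_DAG` and `calc` but is not jieba's code: the word counts are invented, and the helper names (`build_prefix_dict`, `get_dag`, `best_route`, `segment`) are hypothetical. As in a prefix dictionary, every proper prefix of a word is stored with count 0 so that the inner `while` loop can keep extending a fragment.

```python
from math import log

# Toy counts, invented for illustration only.
WORD_FREQ = {"北京": 50, "大学": 40, "北京大学": 30, "生": 10, "大学生": 20}


def build_prefix_dict(word_freq):
    """Return (freq, total); proper prefixes of words are added with count 0."""
    freq = {}
    for word, count in word_freq.items():
        freq[word] = count
        for end in range(1, len(word)):
            freq.setdefault(word[:end], 0)
    return freq, sum(word_freq.values())


def get_dag(sentence, freq):
    """dag[k] lists every i such that sentence[k:i + 1] is a dictionary word."""
    dag = {}
    n = len(sentence)
    for k in range(n):
        ends = []
        i = k
        frag = sentence[k]
        while i < n and frag in freq:
            if freq[frag]:
                ends.append(i)
            i += 1
            frag = sentence[k:i + 1]
        dag[k] = ends or [k]  # unknown character: single-character edge
    return dag


def best_route(sentence, dag, freq, total):
    """route[idx] = (best log-probability of sentence[idx:], end of first word)."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    log_total = log(total)
    for idx in range(n - 1, -1, -1):
        route[idx] = max(
            (log(freq.get(sentence[idx:x + 1]) or 1) - log_total + route[x + 1][0], x)
            for x in dag[idx]
        )
    return route


def segment(sentence, word_freq=WORD_FREQ):
    """Follow the best route from position 0 and collect the words."""
    freq, total = build_prefix_dict(word_freq)
    dag = get_dag(sentence, freq)
    route = best_route(sentence, dag, freq, total)
    words, x = [], 0
    while x < len(sentence):
        y = route[x][1] + 1
        words.append(sentence[x:y])
        x = y
    return words
```

With these toy counts, `segment("北京大学生")` prefers 北京 / 大学生 over 北京大学 / 生, because the product of the two word probabilities is larger on that path.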
b. HMM model
With cut_all=False and HMM=True, segmentation also uses an HMM, decoded with the Viterbi algorithm.
An HMM is written $\lambda = (A, B, \pi)$, with state sequence $I$ and corresponding observation sequence $O$. Three classic problems are defined over these parameters:
- Evaluation: compute the probability $P(O \mid \lambda)$ of the observation sequence $O$ under the model $\lambda$
- Learning: given the observation sequence $O$, estimate the parameters $\lambda$ that maximize $P(O \mid \lambda)$
- Decoding: given the model $\lambda$ and the observation sequence $O$, find the state sequence $I$ that maximizes $P(I \mid O)$
Segmentation is the third problem. The state set is $Q = \{B, M, E, S\}$ (begin, middle, end of a word, and single-character word), $\lambda$ is estimated from training data, the observation sequence is the sequence of Chinese characters, and the goal is the most probable state sequence $I$.
Let the character sequence be C = "我们一起去打羽毛球", with $c_i$ its $i$-th character, and let each tag $t_i$ take a value in $\{B, M, E, S\}$. The goal is the conditional probability
$$\max P(t_1, \dots, t_n \mid c_1, \dots, c_n)$$
Limited-history (first-order Markov) assumption:
$$P(t_i \mid t_{i-1} t_{i-2} \dots t_1) = P(t_i \mid t_{i-1})$$
Output-independence assumption:
$$P(c_1, \dots, c_n \mid t_1, \dots, t_n) = P(c_1 \mid t_1) P(c_2 \mid t_2) \cdots P(c_n \mid t_n)$$
By Bayes' rule, and because $P(c_1, \dots, c_n)$ does not depend on the tags,
$\max P(t_1, \dots, t_n \mid c_1, \dots, c_n)$ is equivalent to $\max P(c_1, \dots, c_n \mid t_1, \dots, t_n) P(t_1, \dots, t_n)$
Applying the two assumptions gives:
$$\max P(c_1, \dots, c_n \mid t_1, \dots, t_n) P(t_1, \dots, t_n) = \max P(t_1) P(c_1 \mid t_1) \prod_{i=2}^{n} P(t_i \mid t_{i-1}) P(c_i \mid t_i)$$
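
Because implementations usually store log-probabilities, it can help to restate the objective in log space, where the product becomes a sum that Viterbi maximizes one position at a time. The table symbol $V_i(s)$ is notation introduced here for illustration (the best log score of any tag sequence for $c_1, \dots, c_i$ that ends in state $s$); it is not taken from the original text.

```latex
\begin{aligned}
\hat{t}_{1:n} &= \arg\max_{t_1,\dots,t_n}\Big[\log P(t_1) + \log P(c_1 \mid t_1)
  + \sum_{i=2}^{n}\big(\log P(t_i \mid t_{i-1}) + \log P(c_i \mid t_i)\big)\Big] \\
V_1(s) &= \log P(s) + \log P(c_1 \mid s) \\
V_i(s) &= \max_{s'}\big[V_{i-1}(s') + \log P(s \mid s')\big] + \log P(c_i \mid s),
  \qquad i = 2, \dots, n
\end{aligned}
```

The best final score is $\max_{s \in \{E, S\}} V_n(s)$, since a sentence can only end on the last character of a word or on a single-character word.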
The code is as follows:
    jieba.cut(self, sentence, cut_all=False, HMM=True, use_paddle=False)
    '''
    With cut_all=False and HMM=True, the Viterbi algorithm is used for segmentation.
    '''
    # Main segmentation routine
    def __cut_DAG(self, sentence):
        DAG = self.get_DAG(sentence)
        route = {}
        self.calc(sentence, DAG, route)
        x = 0
        buf = ''
        N = len(sentence)
        while x < N:
            y = route[x][1] + 1
            l_word = sentence[x:y]
            if y - x == 1:  # single character: buffer it
                buf += l_word
            else:
                if buf:
                    if len(buf) == 1:
                        yield buf
                        buf = ''
                    else:
                        # buf is not a dictionary word (frequency 0 or missing): segment it with Viterbi
                        if not self.FREQ.get(buf):
                            recognized = finalseg.cut(buf)
                            for t in recognized:
                                yield t
                        else:
                            for elem in buf:
                                yield elem
                        buf = ''
                yield l_word
            x = y
        if buf:
            if len(buf) == 1:
                yield buf
            elif not self.FREQ.get(buf):
                recognized = finalseg.cut(buf)
                for t in recognized:
                    yield t
            else:
                for elem in buf:
                    yield elem
    # Viterbi algorithm (all values are log-probabilities, so they are added, not multiplied)
    # start_p: initial state probabilities, i.e. P(t_1)
    # trans_p: transition matrix, from state t_i to state t_j
    # emit_p: emission matrix, probability of observing o_i in state t_i
    # MIN_FLOAT and PrevStatus are module-level constants in jieba.finalseg
    def viterbi(obs, states, start_p, trans_p, emit_p):
        V = [{}]  # tabular
        path = {}
        for y in states:  # init
            V[0][y] = start_p[y] + emit_p[y].get(obs[0], MIN_FLOAT)
            path[y] = [y]
        for t in xrange(1, len(obs)):
            V.append({})
            newpath = {}
            for y in states:
                em_p = emit_p[y].get(obs[t], MIN_FLOAT)
                (prob, state) = max(
                    [(V[t - 1][y0] + trans_p[y0].get(y, MIN_FLOAT) + em_p, y0) for y0 in PrevStatus[y]])
                V[t][y] = prob
                newpath[y] = path[state] + [y]
            path = newpath
        # a sentence can only end in state E or S
        (prob, state) = max((V[len(obs) - 1][y], y) for y in 'ES')
        return (prob, path[state])
    # HMM-based segmentation
    def __cut(sentence):
        global emit_P
        prob, pos_list = viterbi(sentence, 'BMES', start_P, trans_P, emit_P)
        begin, nexti = 0, 0
        # print pos_list, sentence
        for i, char in enumerate(sentence):
            pos = pos_list[i]
            if pos == 'B':
                begin = i
            elif pos == 'E':
                yield sentence[begin:i + 1]
                nexti = i + 1
            elif pos == 'S':
                yield char
                nexti = i + 1
        if nexti < len(sentence):
            yield sentence[nexti:]
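
The Viterbi decoding and the tag-to-word step can be tried on a toy model as sketched below. Everything numeric here is invented for illustration: the start, transition, and emission probabilities are hypothetical and far smaller than a trained model, and the names `PREV_STATES`, `START_P`, `TRANS_P`, `EMIT_P`, and `cut` are this example's own. `MIN_FLOAT` stands in for log(0), and `PREV_STATES` lists, for each tag, the tags allowed to precede it in the BMES scheme.

```python
from math import log

MIN_FLOAT = -3.14e100  # stand-in for log(0)

# For each tag, the tags that may come directly before it.
PREV_STATES = {"B": "ES", "M": "MB", "E": "BM", "S": "SE"}


def logs(probs):
    """Convert a dict of probabilities to log-probabilities."""
    return {key: log(value) for key, value in probs.items()}


# Hypothetical toy parameters (not trained values).
START_P = {"B": log(0.6), "M": MIN_FLOAT, "E": MIN_FLOAT, "S": log(0.4)}
TRANS_P = {
    "B": logs({"E": 0.7, "M": 0.3}),
    "M": logs({"E": 0.6, "M": 0.4}),
    "E": logs({"B": 0.5, "S": 0.5}),
    "S": logs({"B": 0.5, "S": 0.5}),
}
EMIT_P = {
    "B": logs({"我": 0.4, "打": 0.4, "们": 0.1, "球": 0.1}),
    "M": {},
    "E": logs({"们": 0.5, "球": 0.4, "去": 0.1}),
    "S": logs({"去": 0.6, "我": 0.2, "打": 0.2}),
}


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (best log score, best tag path) for the observation sequence."""
    V = [{y: start_p[y] + emit_p[y].get(obs[0], MIN_FLOAT) for y in states}]
    path = {y: [y] for y in states}
    for t in range(1, len(obs)):
        V.append({})
        newpath = {}
        for y in states:
            em_p = emit_p[y].get(obs[t], MIN_FLOAT)
            prob, state = max(
                (V[t - 1][y0] + trans_p[y0].get(y, MIN_FLOAT) + em_p, y0)
                for y0 in PREV_STATES[y]
            )
            V[t][y] = prob
            newpath[y] = path[state] + [y]
        path = newpath
    # A sentence can only end in E (end of word) or S (single-character word).
    prob, state = max((V[len(obs) - 1][y], y) for y in "ES")
    return prob, path[state]


def cut(sentence):
    """Turn the decoded BMES tags into words."""
    _, tags = viterbi(sentence, "BMES", START_P, TRANS_P, EMIT_P)
    begin, nexti = 0, 0
    for i, char in enumerate(sentence):
        if tags[i] == "B":
            begin = i
        elif tags[i] == "E":
            yield sentence[begin:i + 1]
            nexti = i + 1
        elif tags[i] == "S":
            yield char
            nexti = i + 1
    if nexti < len(sentence):
        yield sentence[nexti:]
```

With these toy parameters, `list(cut("我们去打球"))` decodes the tags B E S B E and yields 我们 / 去 / 打球.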
- Ext 2: Evaluation metrics
- a. Human evaluation (scoring)
- b. F-measure (precision, recall)
$$P = \frac{\text{correct length}}{\text{target string length}}, \quad R = \frac{\text{correct length}}{\text{source string length}}$$
$$F = \frac{2 P R}{P + R}$$
- c. BLEU (more general)
$$\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$
where
$$BP = \begin{cases} 1, & c > r \\ e^{1 - \frac{r}{c}}, & c \leq r \end{cases}, \quad w_n = \frac{1}{N}$$
Here $c$ is the length of the output, $r$ is the length of the reference, and $p_n$ is the modified n-gram precision.
Taking the logarithm gives
$$\log \mathrm{BLEU} = \min\left(0, 1 - \frac{r}{c}\right) + \sum_{n=1}^{N} w_n \log p_n$$
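
A minimal sentence-level sketch of the BLEU computation follows, assuming a single reference, uniform weights $w_n = 1/N$, clipped (modified) n-gram precision, and the standard brevity penalty $e^{1 - r/c}$ for $c \leq r$. The function names are this example's own, and the choice to return 0 when any $p_n$ is 0 (where $\log p_n$ is undefined) is a simplification; practical toolkits typically apply smoothing and compute the score at corpus level.

```python
from collections import Counter
from math import exp, log


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision p_n of the candidate against one reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with uniform weights w_n = 1 / max_n."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # log(0) is undefined; no smoothing in this sketch
    log_bp = min(0.0, 1.0 - r / c)  # log of the brevity penalty
    log_score = log_bp + sum(log(p) for p in precisions) / max_n
    return exp(log_score)
```

For example, scoring the tokens of "the cat sat" against "the cat sat on the mat" with `max_n=2` gives perfect unigram and bigram precision, so the score is determined entirely by the brevity penalty $e^{1 - 6/3} = e^{-1}$.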