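How can we generate random text? The obvious approach is to pick each letter or space at random with equal probability. Text generated this way is, of course, meaningless, and looks like this: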
vLtcNoXSseBeQHvTalbSbqVzFfLnaczP UrNImTyRnnMtQmZkATgEdLJP LYCsWnJavHDTMoqAAPxkmSuTgPPEdZBIOAzYPebkXw
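As a minimal sketch of this uniform approach (the helper name `uniform_random_text` and the 53-character alphabet are assumptions made for illustration, not something taken from the book), each character is drawn independently and with equal probability:

```cpp
#include <cstddef>
#include <random>
#include <string>

// Hypothetical helper: build n characters, each drawn with equal probability
// from the 52 upper/lower-case letters plus the space character.
std::string uniform_random_text(std::size_t n, std::mt19937 &rng)
{
    static const std::string alphabet =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ";
    std::uniform_int_distribution<std::size_t> pick(0, alphabet.size() - 1);
    std::string out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; i++)
        out += alphabet[pick(rng)];   // no context: every draw is independent
    return out;
}
```

Calling `uniform_random_text(100, rng)` with a seeded `std::mt19937` yields a 100-character string of the same flavor as the sample above.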
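How do we generate more interesting text? The key observation is that most events occur in context. Today's temperature, for example, is correlated with the temperatures of the past few days, and under normal conditions it does not swing wildly. The same holds for English words: if the current letter is Q, the next letter is very likely to be U. By making each letter a random function of the one or more letters that precede it, we can generate more interesting text.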
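To make "a random function of the previous letter" concrete, here is a small sketch for the simplest case, where only one preceding letter is used as context. The helper name `sample_next` is hypothetical, and it scans the text linearly instead of using a suffix array, so it only illustrates the idea:

```cpp
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Hypothetical helper: choose the next letter by picking, uniformly at random,
// one of the letters that follow `prev` somewhere in `text`.
// Returns '\0' if `prev` never occurs with a successor.
char sample_next(const std::string &text, char prev, std::mt19937 &rng)
{
    std::vector<char> followers;
    for (std::size_t i = 0; i + 1 < text.size(); i++)
        if (text[i] == prev)
            followers.push_back(text[i + 1]);
    if (followers.empty())
        return '\0';
    std::uniform_int_distribution<std::size_t> pick(0, followers.size() - 1);
    return followers[pick(rng)];
}
```

Because a letter that follows `prev` often appears in `followers` often, frequent successors are chosen proportionally more often. In a text where every `q` is followed by `u`, `sample_next(text, 'q', rng)` can only return `'u'`.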
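Generating random text at the letter level
The code below generates random text at the letter level. With k set to 4, most of the generated words are real English words. The program relies on function overloading (the two skip functions), so it has to be compiled as C++.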
#include <stdio.h>
#include <stdlib.h>  // qsort, rand, srand
#include <string.h>
#include <time.h>
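int k = 4;                  // order: number of preceding letters used as context
char inputchars[5000000];   // the input text
char *a[5000000];           // suffix array: a[i] points at the i-th character read
int nword = 0;              // number of characters read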
// Advance p by n characters.
char *skip(char *p, int n)
{
    return p + n;
}
const char *skip(const char *p, int n)
{
    return p + n;
}
// Compare the first k characters of p and q.
int charncmp(const char *p, const char *q)
{
    int n = k;
    while (n && *p == *q)
    {
        n--; p++; q++;
    }
    return (n == 0) ? 0 : (*p - *q);
}
// qsort callback: a and b point at elements of the suffix array.
int sortcmp(const void *a, const void *b)
{
    return charncmp(*(const char *const *)a, *(const char *const *)b);
}
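/*
 * @brief Binary search for the first entry of a[l..u] whose first k
 *        characters equal those of phrase; returns -1 if there is none.
 */
int bsearch(const char *phrase, int l, int u)
{
    while (l <= u)
    {
        int m = l + (u - l) / 2;
        int t = charncmp(a[m], phrase);
        if (t < 0)
            l = m + 1;
        else if (t > 0)
            u = m - 1;
        else if (m > 0 && charncmp(a[m-1], phrase) == 0)
            u = m - 1;   // an equal entry lies to the left, keep searching there
        else
            return m;
    }
    return -1;
}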
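int main()
{
    // Read the input one character at a time; a[i] points at character i.
    // Line breaks are dropped, so the last word of a line runs into the
    // first word of the next line.
    a[0] = inputchars;
    while (nword < (int)sizeof(inputchars) - k - 1 && scanf("%c", a[nword]) == 1)
    {
        if (*a[nword] == '\r' || *a[nword] == '\n')
            continue;
        a[nword+1] = a[nword] + 1;
        nword++;
    }
    if (nword == 0)
        return 1;   // empty input: nothing to generate from
    int i;
    // Terminate the text with k NUL characters.
    for (i = 0; i < k; i++)
        a[nword][i] = 0;
    // Sort the suffix array by the first k characters of each suffix.
    qsort(a, nword, sizeof(a[0]), sortcmp);
    srand(time(NULL));
    // Pick a random starting phrase and print its k characters.
    char *phrase = a[rand() % nword];
    int charsleft;
    for (i = 0; i < k; i++)
        printf("%c", *skip(phrase, i));
    // Generate up to 300 more characters.
    for (charsleft = 300; charsleft > 0; charsleft--)
    {
        int u = bsearch(phrase, 0, nword-1);
        if (u == -1)
        {
            printf("charsleft: %d\n", charsleft);
            break;
        }
        // Count the entries that start with the current k-character phrase.
        for (i = 0; u+i < nword && charncmp(phrase, a[u+i]) == 0; i++) ;
        // Several candidates may exist; pick one at random.
        char *p = a[u + rand() % i];
        phrase = skip(p, 1);
        if (*skip(phrase, k-1) == 0)   // reached the end of the input
            break;
        printf("%c", *skip(phrase, k-1));
    }
    printf("\n");
    return 0;
}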
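This looks interesting, and it is indeed an important finding.
At the letter level the output consists mostly of English words, but the sentences do not mean much. Some words are glued together (for example "singleEurope"), presumably because the program discards line breaks when it reads the input.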
carry of Arguing or their guilt and Pope.In a singleEurope integrations a sing claimcond both flies, alter, Sc(iv) whent an advocated more lives who show we sex and psychiavels and
loveredand in Ameritaint or itcause men around good the originalds rated State United not addictator of people for the being inspire man son todefy and how about, it is a revealistings. Welfare action, thelessness) that mean othereason be else.<#>Caligula's
life an in or not years, in thebirths as Vessence states
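The selection step in the program above first counts the matching entries and then indexes one of them at random. An alternative is to choose while scanning, in a single pass: the i-th matching entry replaces the current choice with probability 1/i, which leaves every match equally likely. The sketch below shows this on a sorted array of integer keys; the helper name `pick_from_run` and the use of `int` keys are assumptions made to keep the example self-contained.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: keys is sorted and first < keys.size(). Scan the run of
// entries equal to keys[first] and return the index of one of them, chosen
// uniformly at random, without knowing the length of the run in advance.
std::size_t pick_from_run(const std::vector<int> &keys, std::size_t first,
                          std::mt19937 &rng)
{
    std::size_t chosen = first;
    for (std::size_t i = 0;
         first + i < keys.size() && keys[first + i] == keys[first]; i++)
    {
        // The (i+1)-th match replaces the current choice with probability 1/(i+1).
        std::uniform_int_distribution<std::size_t> coin(0, i);
        if (coin(rng) == 0)
            chosen = first + i;
    }
    return chosen;
}
```

For `keys = {1, 2, 2, 2, 3}` and `first = 1`, the result is always one of the indices 1, 2, or 3, each with probability 1/3.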
Generating random text at the word level
The same approach works at the word level. Text generated this way reads much like real English sentences.
#include <stdio.h>
#include <stdlib.h>  // qsort, rand, srand
#include <string.h>
#include <time.h>
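int k = 4;                  // order: number of preceding words used as context
char inputchars[5000000];   // the input text, stored as NUL-terminated words
char *word[1000000];        // word[i] points at the i-th word; this is also a suffix array
int nword = 0;              // number of words read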
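// Compare the first k words of p and q; words are NUL-terminated and
// stored back to back.
int wordncmp(const char *p, const char *q)
{
    int n = k;
    for (; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)
            return 0;
    return *p - *q;
}
// qsort callback: a and b point at elements of the word array.
int sortcmp(const void *a, const void *b)
{
    return wordncmp(*(const char *const *)a, *(const char *const *)b);
}
// Advance p past n words.
char *skip(char *p, int n)
{
    while (n)
    {
        if (*p++ == 0) n--;
    }
    return p;
}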
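// Binary search for the first entry of word[l..u] whose first k words
// equal those of phrase; returns -1 if there is none.
int bsearch(const char *phrase, int l, int u)
{
    while (l <= u)
    {
        int m = l + (u - l) / 2;
        int t = wordncmp(word[m], phrase);
        if (t < 0)
            l = m + 1;
        else if (t > 0)
            u = m - 1;
        else if (m > 0 && wordncmp(word[m-1], phrase) == 0)
            u = m - 1;   // an equal entry lies to the left, keep searching there
        else
            return m;
    }
    return -1;
}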
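int main()
{
    // Read the input word by word. Words are stored back to back, each
    // NUL-terminated, and word[i] points at the start of the i-th word.
    // The input is assumed to fit in inputchars.
    const int maxwords = (int)(sizeof(word) / sizeof(word[0])) - 1;
    word[0] = inputchars;
    while (nword < maxwords && scanf("%s", word[nword]) == 1)
    {
        word[nword+1] = word[nword] + strlen(word[nword]) + 1;   // start of the next word
        nword++;
    }
    if (nword == 0)
        return 1;   // empty input: nothing to generate from
    int i;
    // Terminate the text with k empty words.
    for (i = 0; i < k; i++)
        word[nword][i] = 0;
    // Sort the pointers by the first k words they point at.
    qsort(word, nword, sizeof(word[0]), sortcmp);
    srand(time(NULL));
    // Pick a random starting phrase and print its k words.
    char *phrase = word[rand() % nword];
    int wordsleft;
    for (i = 0; i < k; i++)
        printf("%s ", skip(phrase, i));
    // Generate up to 100 more words.
    for (wordsleft = 100; wordsleft > 0; wordsleft--)
    {
        int u = bsearch(phrase, 0, nword-1);
        if (u == -1)
        {
            printf("wordsleft: %d\n", wordsleft);
            break;
        }
        // Count the entries that start with the current k-word phrase.
        for (i = 0; u+i < nword && wordncmp(phrase, word[u+i]) == 0; i++) ;
        // Several candidates may exist; pick one at random.
        char *p = word[u + rand() % i];
        phrase = skip(p, 1);
        if (*skip(phrase, k-1) == 0)   // reached the end of the input
            break;
        printf("%s ", skip(phrase, k-1));
    }
    printf("\n");
    return 0;
}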
Below is a passage produced by this generator. According to the book, order-2 text gives the most convincing imitation of English.
and Cacambao leave Eldorado laden with treasure beyond the wildest dreams of Europeans, to begin their search for Cunθgonde. <#> Another incident happens in Italy where Candide
is with Martin. Martin is the closest character in the tale to Voltaire himself. He forever bursts Candide's optimistic bubbles and is the pessimist influence in his life had really been simple, he had just deluded himself into thinking so. <#> Upon reflection
Clamence realised his greatest crime which gave rise to his vulnerability). He believed that a man from making his own choice in high positions. After being elected in 1981, since the National Assembly
As an aside, the suffix array really is a powerful technique.
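What makes it useful here is that sorting the suffixes places every occurrence of the same k-character phrase next to each other, so a binary search finds all of them at once. The sketch below illustrates this with standard-library calls instead of the raw pointers used in the programs above; the helper name `followers_of` and the use of `std::string` offsets are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: list the characters that follow each occurrence of
// `phrase` in `text`, using a sorted array of suffix offsets.
std::vector<char> followers_of(const std::string &text, const std::string &phrase)
{
    const std::size_t k = phrase.size();
    // One entry per suffix. Sorting by the first k characters groups all
    // occurrences of the same k-character phrase together.
    std::vector<std::size_t> suffix(text.size());
    for (std::size_t i = 0; i < suffix.size(); i++)
        suffix[i] = i;
    std::sort(suffix.begin(), suffix.end(),
              [&](std::size_t x, std::size_t y) {
                  return text.compare(x, k, text, y, k) < 0;
              });
    // Binary search for the first suffix whose k-character prefix is >= phrase.
    auto it = std::lower_bound(suffix.begin(), suffix.end(), phrase,
              [&](std::size_t x, const std::string &p) {
                  return text.compare(x, k, p) < 0;
              });
    // Walk the run of matching suffixes and collect the character after each.
    std::vector<char> out;
    for (; it != suffix.end() && text.compare(*it, k, phrase) == 0; ++it)
        if (*it + k < text.size())
            out.push_back(text[*it + k]);
    return out;
}
```

For the text `abracadabra`, the phrase `ab` occurs twice and is followed by `r` both times, so a generator in state `ab` would always emit `r` next.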
Reference: Programming Pearls, Chapter 15




