数据压缩原理与应用 Huffman编码

最新推荐文章于 2024-10-06 15:31:47 发布

原创最新推荐文章于 2024-10-06 15:31:47 发布 · 1.4k 阅读

4 ·

本内容遵循CC 4.0 BY-SA版权协议

标签

#数据压缩 #应用 #编码 #数据

编程设计专栏收录该内容

6 篇文章

订阅专栏

本文介绍了Huffman编码的基本原理，它是无失真编码的一种，利用概率统计模型进行编码，使得高概率符号编码短，低概率符号编码长，达到数据压缩目的。文章详细阐述了实验流程，包括读取文件、统计频率、建立Huffman树、生成码字和输出文件，并展示了实验结果，指出Huffman编码对符号分布不均的文件压缩效果更佳。

一.Huffman编码的原理

Huffman Coding (霍夫曼编码)是一种无失真编码的编码方式，Huffman 编码是可变字长编码(VLC)的一种。
Huffman 编码基于信源的概率统计模型，它的基本思路是出现概率大的信源符号编长码，出现概率小的信源符号编短码，从而使平均码长最小。
在计算机数据处理中，霍夫曼编码使用变长编码表对源符号（如文件中的一个字母）进行编码，其中变长编码表是通过一种评估来源符号出现机率的方法得到的，出现机率高的字母使用较短的编码，反之出现机率低的则使用较长的编码，这便使编码之后的字符串的平均长度、期望值降低，从而达到无损压缩数据的目的。
在程序实现中常使用一种叫做树的数据结构实现 Huffman 编码，由它编出的码是即时码。

huffman编码流程
统计符号的发生概率
将频率从小到大排序
每一次选出最小的两个值，作为二叉树的两个叶子节点，将和作为它们的根节点，这两个叶子节点不再参与比较，新的根节点参与比较
重复第三步
将形成的二叉树的左节点标 0，右节点标 1，把从最上面的根节点到最下面的叶子节点途中遇到的 0，1 序列串起来，就得到了各个符号的编码

举例：
这里写图片描述

huffman节点

     /*huffman二叉树*/
    typedef struct huffman_node_tag
    {
        unsigned char isLeaf;//判断是否为叶节点
        unsigned long count;//字符数，字母出现的频率，节点代表的符号加权和
        struct huffman_node_tag *parent;//当前节点的父节点
        union
        {
            struct
            {
                struct huffman_node_tag *zero, *one;//若是叶子，则该项表示信源符号，若不是叶子，则此项为该结点左右孩子的指针
            };
            unsigned char symbol;
        };
    } huffman_node;

huffman编码

    typedef struct huffman_code_tag
    {
        /* The length of this code in bits. */
        /*码字的长度(单位:位)*/
        unsigned long numbits;

        /* The bits that make up this code. The first
           bit is at position 0 in bits[0]. The second
           bit is at position 1 in bits[0]. The eighth
           bit is at position 7 in bits[0]. The ninth
           bit is at position 0 in bits[1]. */
           /* 码字, 码字的第 1 位存于 bits[0]的第 1 位,
           码字的第 2 位存于 bits[0]的第的第 2 位,
           码字的第 8 位存于 bits[0]的第的第 8 位,
           码字的第 9 位存于 bits[1]的第的第 1 位 */
        unsigned char *bits;
    } huffman_code;

huffman表格

     /*对输出的表格进行结构化 针对huff_run main函数 */
    //输出的表格有些变量可以直接写进结构体里，如此表格可以很快输出，再加入新变量也不会很复杂
    typedef struct huffman_statistics_result
    {
        float freq[256];
        unsigned long numbits[256];
        unsigned char bits[256][100];
    }huffman_stat;

二.实验流程

这里写图片描述

三.代码分析

1.读取待编码的文件

首先对命令行的参数进行定义

    static void
    usage(FILE* out)//命令行参数的设置
    {
        fputs("Usage: huffcode [-i<input file>] [-o<output file>] [-d|-c]\n"
              "-i - input file (default is standard input)\n"//输入文件
              "-o - output file (default is standard output)\n"//输出文件
              "-d - decompress\n"//解压缩操作
              "-c - compress (default)\n"//压缩操作
              "-m - read file into memory, compress, then write to file (not default)\n",//读内存
              "-t - output huffman statistics\n",//输出huffman统计表格
              out);
    }

主函数，对命令行参数进行设计

    int
    main(int argc, char** argv)//命令行参数
    {
        char memory = 0;//内存操作的标识符
        char compress = 1;//解压缩的标识符
        int opt;//命令行参数选项
        const char *file_in = NULL, *file_out = NULL;
        const char *file_out_table = NULL;
        FILE *in = stdin;
        FILE *out = stdout;
        FILE * outTable = NULL;

            /* Get the command line arguments. */
        while((opt = getopt(argc, argv, "i:o:cdhvmt:")) != -1) //读取命令行参数的选项
        {
            switch(opt)
            {
            case 'i':
                file_in = optarg;//选择i，输入文件
                break;
            case 'o':
                file_out = optarg;//选择o，输出文件
                break;
            case 'c':
                compress = 1;//选择c，压缩
                break;
            case 'd':
                compress = 0;//选择d，解压缩
                break;
            case 'h':
                usage(stdout);//选择h，输出参数用法说明
                return 0;
            case 'v':
                version(stdout);//选择v，输出版本号信息
                return 0;
            case 'm'://选择m，内存操作
                memory = 1;
                break;
            case 't'://选t，输出统计数据表格
                file_out_table = optarg;            
                break;
            default:
                usage(stderr);
                return 1;
            }
        }

        /* If an input file is given then open it. */
        if(file_in)
        {
        //（省略）
        }

        /* If an output file is given then create it. */
        if(file_out)
        {
        //（省略）
        }

        //by yzhang for huffman statistics
        if(file_out_table)
        {
            outTable = fopen(file_out_table, "w");
            if(!outTable)
            {
                fprintf(stderr,
                    "Can't open output file '%s': %s\n",
                    file_out_table, strerror(errno));//strerror函数为通过标准错误的标号，获得错误的描述字符串 ，将单纯的错误标号转为字符串描述
                return 1;
            }
        }
        //end by yzhang
        //（省略）
    }

使用库函数中的getopt解析命令行参数，命令行参数可设置成
这里写图片描述
然后打开文件
先定义一个数值，因为符号范围在0-255，所以统计频率和码字编码时的数值为256。

    #define MAX_SYMBOLS 256
    typedef huffman_node* SymbolFrequencies[MAX_SYMBOLS];//频率
    typedef huffman_code* SymbolEncoder[MAX_SYMBOLS];//码字

    int 
    huffman_encode_file(FILE *in, FILE *out, FILE *out_Table) //对文件进行Huffman编码
     /*step1:changed by yzhang for huffman statistics from (FILE *in, FILE *out) to (FILE *in, FILE *out, FILE *out_Table)*/
    {
        SymbolFrequencies sf;
        SymbolEncoder *se;
        huffman_node *root = NULL;
        int rc;
        unsigned int symbol_count;//文件中各字节出现的次数
        huffman_stat hs;

        /* Get the frequency of each symbol in the input file. */
        symbol_count = get_symbol_frequencies(&sf, in); //统计第一遍扫面后，文件中各符号出现的频率
        huffST_getSymFrequencies(&sf,&hs,symbol_count);
        /* Build an optimal table from the symbolCount. */
        se = calculate_huffman_codes(&sf); //根据得到的符号频率建立一棵Huffman树和码表
        root = sf[0];//将sf[0]设为根结点

        //step3:add by yzhang for huffman statistics... output the statistics to file
        huffST_getcodeword(se, &hs);//获得码字
        output_huffman_statistics(&hs,out_Table);//建立码表
        //end by yzhang

        /* Scan the file again and, using the table
           previously built, encode it into the output file. */
        /*再一次扫描文件，使用之前构建的表，将其编码到输出文件中*/
        rewind(in);//回到文件开头，准备第二遍扫描文件
        rc = write_code_table(out, se, symbol_count);//先在输出文件中写入码表
        if(rc == 0)
            rc = do_file_encode(in, out, se);//写完后根据码表进行编码
        /* Free the Huffman tree. */
        free_huffman_tree(root);
        free_encoder(se);
        return rc;
    }

2.统计各字节出现的频率

    static unsigned int
    get_symbol_frequencies(SymbolFrequencies *pSF, FILE *in)
    {
        int c;
        unsigned int total_count = 0;//初始化

        /* Set all frequencies to 0. */
        init_frequencies(pSF);//将所有符号的频率设为0

        /* Count the frequency of each symbol in the input file. */
        /*计算输入文件中每个符号的频率*/
        while((c = fgetc(in)) != EOF)//如果符号存在
        {
            unsigned char uc = c;
            if(!(*pSF)[uc])
                (*pSF)[uc] = new_leaf_node(uc);//新建一个叶节点
            ++(*pSF)[uc]->count;//叶节点的出现次数+1
            ++total_count;//总数+1
        }

        return total_count;
    }

上述统计频率的函数中调用了new_leaf_node 函数，新建叶节点

    static huffman_node*
    new_leaf_node(unsigned char symbol)//增加叶节点
    {
        huffman_node *p = (huffman_node*)malloc(sizeof(huffman_node));
        p->isLeaf = 1;//新建的是叶节点，所以isLeaf设为1
        p->symbol = symbol;//节点存储的信源符号
        p->count = 0;//因为还没统计概率所以数目设为0
        p->parent = 0;//未知父节点，同样设为空
        return p;
    }

3.建立huffman树

    static SymbolEncoder*
    calculate_huffman_codes(SymbolFrequencies * pSF)//对统计后的码字频率按升序进行排序
    {
        unsigned int i = 0;
        unsigned int n = 0;
        huffman_node *m1 = NULL, *m2 = NULL;//m1存放左子节点的地址，m2则为右子节点

    //输出未排序前统计的频率   
    #if 1
        printf("BEFORE SORT\n");
        print_freqs(pSF);   
    #endif

        /* Sort the symbol frequency array by ascending frequency. */
        /*按升序对符号频率数组进行排序*/
        qsort((*pSF), MAX_SYMBOLS, sizeof((*pSF)[0]), SFComp); /*qsort是排序函数，可自定义排序方式，SFComp为升序排列*/

    //输出排序后的统计频率
    #if 1   
        printf("AFTER SORT\n");
        print_freqs(pSF);
    #endif

        /* Get the number of symbols. */
        /* 统计下信源符号数，因为符号范围在0-255，但不一定完全包含256个字符*/
        for(n = 0; n < MAX_SYMBOLS && (*pSF)[n]; ++n)
            ;

        /*建立huffman树*/
        for(i = 0; i < n - 1; ++i)
        {
            /* Set m1 and m2 to the two subsets of least probability. */
                m1 = (*pSF)[0];//m1存放的是左孩子
            m2 = (*pSF)[1];//m2存放的是右孩子

            /* Replace m1 and m2 with a set {m1, m2} whose probability
             * is the sum of that of m1 and m2. */
            (*pSF)[0] = m1->parent = m2->parent =
                new_nonleaf_node(m1->count + m2->count, m1, m2);//调用了new_nonleaf_node，将原先的左孩子和右孩子相加的值作为父节点的值                                                                //而父节点作为现在的左子节点
            (*pSF)[1] = NULL;//右子节点设为NULL，即无

            /* Put newSet into the correct count position in pSF. */
            qsort((*pSF), n, sizeof((*pSF)[0]), SFComp);//将现在n-1个数据重新进行排序
        }

        /* Build the SymbolEncoder array from the tree. */
        pSE = (SymbolEncoder*)malloc(sizeof(SymbolEncoder));//分配码字节点指针数组的内存空间
        memset(pSE, 0, sizeof(SymbolEncoder));//初始化该数组
        build_symbol_encoder((*pSF)[0], pSE);//建立所有的码字节点
        return pSE;
    }

在 qsort((*pSF), MAX_SYMBOLS, sizeof((*pSF)[0]), SFComp) 中，排序方式是由SFComp决定，下面分析SFComp

    static int
    SFComp(const void *p1, const void *p2) //自定义的排序顺序函数，把节点数组由小到大排序
    {
        const huffman_node *hn1 = *(const huffman_node**)p1;
        const huffman_node *hn2 = *(const huffman_node**)p2;
        //把两个排序指针设为自定义的树节点类型
        /* Sort all NULLs to the end. */
        if(hn1 == NULL && hn2 == NULL)//如果两个节点都为空，则返回0
            return 0;
        if(hn1 == NULL)//若第一个节点为空，说明第二个节点大于第一个节点，返回1
            return 1;//
        if(hn2 == NULL)//若第二个节点为空，说明小于第一个节点，返回-1
            return -1;

        if(hn1->count > hn2->count)//若二者指向的数值，一大于二，返回1
            return 1;
        else if(hn1->count < hn2->count)//若二者指向的数值，二大于一，返回-1
            return -1;

        return 0;
    }

建立huffman树时，得到父节点时调用了 new_nonleaf_node 函数，因为父节点不是叶子，所以函数也不相同

    static huffman_node*
    new_nonleaf_node(unsigned long count, huffman_node *zero, huffman_node *one)
    {
        huffman_node *p = (huffman_node*)malloc(sizeof(huffman_node));
        p->isLeaf = 0;//新建的是中间节点，所以isLeaf设为0
        p->count = count;//符号次数为父节点的符号出现次数
        p->zero = zero;//对应的左孩子
        p->one = one;//对应的右孩子
        p->parent = 0;//父节点为空

        return p;
    }

举例， new_nonleaf_node
这里写图片描述
isleaf=0,count=count1+count2,parent=NULL,zero=0x1H,one=0x2H。

4.生成码字

本实验中huffman的编码是从叶到根部的编码，所以在生成码字时需要进行倒序处理
这里写图片描述
如上图所示，右下角的从叶到根的编码是1101，而生成的码字为1011

    static void
    build_symbol_encoder(huffman_node *subtree, SymbolEncoder *pSF)
    {
        if(subtree == NULL)//如果树为空，返回
            return;

        if(subtree->isLeaf)
            (*pSF)[subtree->symbol] = new_code(subtree);//如果是叶节点，则对叶节点进行编码
        else
        {
            build_symbol_encoder(subtree->zero, pSF);
            build_symbol_encoder(subtree->one, pSF);
        }//如果不是，那么先访问左节点，到了叶节点之后再访问右节点
        //深度遍历
    }

对叶节点进行的编码

    static huffman_code*
    new_code(const huffman_node* leaf)
    {
        /* Build the huffman code by walking up to
         * the root node and then reversing the bits,
         * since the Huffman code is calculated by
         * walking down the tree. */
        /** 通过走到根节点然后反转位来构建huffman代码，因为霍夫曼代码是通过树来计算的。*/
        unsigned long numbits = 0;//遍历过的字节
        unsigned char* bits = NULL;
        huffman_code *p;//码字节点

        while(leaf && leaf->parent)/* leaf=NULL，字符为空，无法编码。leaf !=0: 当前字符存在，应该编码 */
                                   /* leaf->parent=NULL，到达根结点，编码结束。leaf->parent !=0: 当前字符的编码仍未完成，即未完成由叶至根的该字符的编码过程 */
        {
            huffman_node *parent = leaf->parent;//当前结点的父亲
            unsigned char cur_bit = (unsigned char)(numbits % 8);//当前byte
            unsigned long cur_byte = numbits / 8;//当前bit

            /* If we need another byte to hold the code,
               then allocate it. */
            if(cur_bit == 0)//一个字节满则需要重新分配字节
            {
                size_t newSize = cur_byte + 1; //新的字节数为当前字节数+1，size_t 即为 unsigned int 类型
                bits = (char*)realloc(bits, newSize);/*extern void *realloc(void *mem_address, unsigned int newsize); 返回的void*，所以要进行强制类型转换*/
                bits[newSize - 1] = 0; /* Initialize the new byte. *///新增加的字节设为0
            }

            /* If a one must be added then or it in. If a zero
             * must be added then do nothing, since the byte
             * was initialized to zero. */
            /*如果leaf是其parent的左孩子，则无需改变bits[cur_byte]中的二进制数字，因为初始化值是0. 
            *如果是右孩子，则需要位运算，按照左0右1的原则，在bit[cur_byte]相应位置变为1*/
            //先把1右移到当前位（cur_bit）位置，再把当前字节（bits[cur_byte]）与移位后的1做或操作
            if(leaf == parent->one)
                bits[cur_byte] |= 1 << cur_bit;

            ++numbits;//增加bit数
            leaf = parent;//写下一个叶子，当前叶子即为父节点
        }

        if(bits)/*如果编码里头含有1, 则需要进行反转, 如果全为0, 
                 * 反转后和反转前是一样的, 就没必要反转了
                 */
            reverse_bits(bits, numbits);

        p = (huffman_code*)malloc(sizeof(huffman_code));
        p->numbits = numbits; /* 设置对指定的字符进行编码需要的位数 */
        p->bits = bits;/* 用来存储编码的区间 */
        return p;
    }

生成码字的时候需要反转

    /*反转数组的前((numbits/8)+1)个字符*/
    static void
    reverse_bits(unsigned char* bits, unsigned long numbits)
    {
        unsigned long numbytes = numbytes_from_numbits(numbits);//所占最多的字节数
        unsigned char *tmp =
            (unsigned char*)alloca(numbytes);// 分配内存
        /*_alloca是在栈(stack)上申请空间,用完马上就释放，设置一个tmp作为numbytes的临时变量，给这个tmp开numbytes同样大的空间*/
        unsigned long curbit;//当前比特数
        long curbyte = 0;//即将要反转的字节所在的的数组下标 

        memset(tmp, 0, numbytes);/*memset是计算机中C/C++语言函数。
        *将s所指向的某一块内存中的前n个字节的内容全部设置为ch指定的ASCII值， 
        *第一个值为指定的内存地址，块的大小由第三个参数指定，
        *这个函数通常为新申请的内存做初始化工作， 其返回值为指向s的指针。
        *此处为给tmp全部赋值为0*/

        for(curbit = 0; curbit < numbits; ++curbit)
        {
            unsigned int bitpos = curbit % 8;// 向左移动的位数

            if(curbit > 0 && curbit % 8 == 0)//如果一个字节已经反转完，则到下一个字节
                ++curbyte;
    /*get_bit()第二个参数是0时，则为bit[0]最右一位；为numbits-curbit-1时，则为bit[numbytes-1]的最左一位。向左移位，使之可以按位或*/
            tmp[curbyte] |= (get_bit(bits, numbits - curbit - 1) << bitpos);
        }

        memcpy(bits, tmp, numbytes);//将tmp内的数据给bits
    }

    /*bit->byte,不足8位补齐*/
    static unsigned long
    numbytes_from_numbits(unsigned long numbits)
    {
        return numbits / 8 + (numbits % 8 ? 1 : 0);
    }
    //举例，若numbits=9，需要2字节，9/8+1=1+1=2
    //若numbits=5，需要1字节，5/8+1=0+1=1

5.输出文件

首先是输出码表，对文件进行编码

    static int
    write_code_table(FILE* out, SymbolEncoder *se, unsigned int symbol_count)
    {
        unsigned long i, count = 0;

        /* Determine the number of entries in se. */
        //统计实际字符个数
        for(i = 0; i < MAX_SYMBOLS; ++i)
        {
            if((*se)[i])
                ++count;
        }

        /* Write the number of entries in network byte order. */
        i = htonl(count);    
        if(fwrite(&i, sizeof(i), 1, out) != 1)
            return 1;

        /* Write the number of bytes that will be encoded. */
        symbol_count = htonl(symbol_count);
        if(fwrite(&symbol_count, sizeof(symbol_count), 1, out) != 1)
            return 1;
    //写入码表
        /* Write the entries. */
        for(i = 0; i < MAX_SYMBOLS; ++i)
        {
            huffman_code *p = (*se)[i];
            if(p)
            {
                unsigned int numbytes;
                /* Write the 1 byte symbol. */
                fputc((unsigned char)i, out);//先写入符号
                /* Write the 1 byte code bit length. */
                fputc(p->numbits, out);//再写入码长
                /* Write the code bytes. */
                numbytes = numbytes_from_numbits(p->numbits);
                if(fwrite(p->bits, 1, numbytes, out) != numbytes)
                    return 1;//最后写入码字
            }
        }

        return 0;
    }

    static int
    do_file_encode(FILE* in, FILE* out, SymbolEncoder *se)
    {
        unsigned char curbyte = 0;
        unsigned char curbit = 0;
        int c;
        while((c = fgetc(in)) != EOF)
        {
            unsigned char uc = (unsigned char)c;
            huffman_code *code = (*se)[uc];
            unsigned long i;
            //读取字符数，找到当前字符uc对应的code
            for(i = 0; i < code->numbits; ++i)
            {
                /* Add the current bit to curbyte. */
                curbyte |= get_bit(code->bits, i) << curbit;
                //将当前的bit放进编码字节对应的位置
                /* If this byte is filled up then write it
                 * out and reset the curbit and curbyte. */
                 //如果该字节被填满，则写入该字节并重新设置curbit和curbyte。
                 //针对每一个字节，都书写，且重新赋初值
                if(++curbit == 8)
                {
                    fputc(curbyte, out);
                    curbyte = 0;
                    curbit = 0;
                }
            }
        }

        /*
         * If there is data in curbyte that has not been
         * output yet, which means that the last encoded
         * character did not fall on a byte boundary,
         * then output it.
         */
         //如果存在尚未输出的数据，则表示字节未满，补满
        if(curbit > 0)
            fputc(curbyte, out);

        return 0;
    }

这里写图片描述
编码结果可以看出符号数为256（100H），后面四字节对应码长，即文件所占空间，后面是码字。

四.实验结果

原文件和编码后的文件，以及编码后数据存储的文件
这里写图片描述

文件类型	平均码长	信源熵(bit/sym)	原文件大小（kB）	压缩后文件大小（kB）	压缩比
jpg	7.996698	7.9787	63	64	0.98
mp4	7.996456	-7.97934	49533	49511	1.00
rar	8.000008	8.000008	463	464	0.99
doc	4.36568	4.348527	90	51	1.76
avi	6.13318	6.0958935	22315	17109	1.30
xlsx	7.228231	-7.180883	13	13	1.00
ppt	7.473907	-7.4465188	1256	1175	1.07
png	7.997136	-7.9784109	94	95	0.99
pdf	7.989494	-7.9746842	2837	2834	1.001
flv	7.595216	-7.5853566	22822	16428	1.39

下图是这十个文件的字符发生概率统计图，纵坐标为字符发生概率，横坐标为符号值
这里写图片描述

结合表格和统计图可以看出Huffman编码的平均码长基本接近于信源熵，且信源熵基本没有超过8bit/sym，因为符号值范围在0-255，所以熵值不会超过8，无失真编码中码长的下限即为信源熵。
对于这十个文件，压缩比最大的是doc文件，而相应的信源熵和平均码长也是最小的，分析图像可以看出，符号发生的概率不均匀。综上所述，Huffman编码对于符号分布不均的文件压缩效果较好，因为huffman是变长编码，不同概率，码长不同，而若字符概率都相近就没法按照大概率短码，小概率长码的原则进行编码，所以编码效果较差，有些文件编码后文件大小反而大于原文件，就是因为字符出现概率接近等概，如png和jpg。