Python (1): Supporting Chinese characters in Python code, reading XML files, and comparing text strings

Latest recommended article published on 2024-09-02 07:14:24

This content is licensed under the CC 4.0 BY-SA license.

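This article discusses the encoding problem that arises when Python source code contains Chinese characters, and how declaring the encoding at the top of the file avoids it. It then shows how to read an XML file with the ElementTree library, using the VOC2007 format as an example. Finally, it covers text string comparison in Python, in particular how decoding makes strings containing Chinese characters comparable.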

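Summary:

By default, Python 2 does not accept Chinese (or any other non-ASCII) characters in source code, not even in comments.

But, and note the "but", there is a way around this.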

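1. Chinese characters in Python source code

The cause, of course, is encoding. For details see:
https://www.python.org/dev/peps/pep-0263/
PEP 263 identifies the problem and also gives the solution (you can declare the encoding of your source file):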


> Defining the Encoding
>     Python will default to ASCII as standard encoding if no other
>     encoding hints are given.
>     ***To define a source code encoding, a magic comment must
>     be placed into the source files either as first or second
>     line in the file, such as:***
>           # coding=<encoding name>
>     or (using formats recognized by popular editors)
>           #!/usr/bin/python
>           # -*- coding: <encoding name> -*-
>     or
>           #!/usr/bin/python
>           # vim: set fileencoding=<encoding name> :
>     More precisely, the first or second line must match the regular
>     expression "^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)".
>     The first group of this
>     expression is then interpreted as encoding name. If the encoding
>     is unknown to Python, an error is raised during compilation. There
>     must not be any Python statement on the line that contains the
>     encoding declaration.  If the first line matches the second line
>     is ignored.
> 
>     To aid with platforms such as Windows, which add Unicode BOM marks
>     to the beginning of Unicode files, the UTF-8 signature
>     '\xef\xbb\xbf' will be interpreted as 'utf-8' encoding as well
>     (even if no magic encoding comment is given).
> 
>     If a source file uses both the UTF-8 BOM mark signature and a
>     magic encoding comment, the only allowed encoding for the comment
>     is 'utf-8'.  Any other encoding will cause an error.
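
The matching rule quoted above can be tried out directly. The following is a minimal sketch, assuming Python 3 and using only the standard `re` module; the helper name `declared_encoding` is hypothetical and not part of PEP 263 or of Python itself. It applies the quoted regular expression to the first two lines of a source text.

```python
import re

# Regular expression copied from the PEP 263 text quoted above.
CODING_RE = re.compile(r"^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source_text):
    """Return the encoding named on line 1 or 2 of source_text, or None.

    Hypothetical helper for illustration only; the interpreter performs
    this check itself when it compiles a file.
    """
    for line in source_text.split("\n")[:2]:
        match = CODING_RE.match(line)
        if match:
            # The first group of the expression is the encoding name.
            return match.group(1)
    return None
```

For instance, `declared_encoding("# coding=utf-8\nkk = 1")` picks up the declaration, while a declaration on the third line or later is ignored, as the PEP describes.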

So you can declare the encoding of your source code on the first or second line of the file.
For example, test.py:

    # coding=utf-8
    kk = '文字'
    print kk

Run it:

    $ python test.py

Output:

    文字

Without the "# coding=utf-8" line:

    kk = '文字'
    print kk

the output is:

    SyntaxError: Non-ASCII character '\xe6' in file nouse2.py on line 9, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
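
The `'\xe6'` in the error message is the first byte of the UTF-8 encoding of 文. A small sketch, assuming Python 3 (where `str` is Unicode text and `bytes` is raw data), makes this visible; the helper `is_ascii_only` is a hypothetical name used only for illustration.

```python
# Assumes Python 3: str holds Unicode text, bytes holds raw data.
text = "文字"

# The bytes a UTF-8 editor writes to disk for this text.
raw = text.encode("utf-8")

# The first byte is the one reported in the SyntaxError above.
first_byte = raw[:1]


def is_ascii_only(data):
    """Return True if the byte string can be read as plain ASCII."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False
```

Because `raw` contains bytes outside the ASCII range, a parser that assumes ASCII (the Python 2 default described in the PEP) has to reject the file unless an encoding is declared.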

2. Reading XML files in Python

The XML file to read uses the VOC2007 annotation format:

    <annotation>
        <folder>VOC2007</folder>
        <filename>000001.jpg</filename>
        <source>
            <database>The VOC2007 Database</database>
            <annotation>PASCAL VOC2007</annotation>
            <image>flickr</image>
            <flickrid>341012865</flickrid>
        </source>
        <owner>
            <flickrid>Fried Camels</flickrid>
            <name>Jinky the Fruit Bat</name>
        </owner>
        <size>
            <width>353</width>
            <height>500</height>
            <depth>3</depth>
        </size>
        <segmented>0</segmented>
        <object>
            <name>dog</name>
            <pose>Left</pose>
            <truncated>1</truncated>
            <difficult>0</difficult>
            <bndbox>
                <xmin>48</xmin>
                <ymin>240</ymin>
                <xmax>195</xmax>
                <ymax>371</ymax>
            </bndbox>
        </object>
        <object>
            <name>person</name>
            <pose>Left</pose>
            <truncated>1</truncated>
            <difficult>0</difficult>
            <bndbox>
                <xmin>8</xmin>
                <ymin>12</ymin>
                <xmax>352</xmax>
                <ymax>498</ymax>
            </bndbox>
        </object>
    </annotation>

How to read it:
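
    import xml.etree.ElementTree as ET  # XML parsing library
    import os
    import cPickle
    import numpy as np

    def readxml(filename):
        # Load and parse the XML file; tree wraps the root <annotation> node.
        tree = ET.parse(filename)
        # Find every <object> node directly under the root.
        objs = tree.findall('object')
        num_objs = len(objs)  # number of annotated objects
        # Iterate over the index and content of each object.
        for ix, obj in enumerate(objs):
            # The bounding-box node is named <bndbox> in the VOC2007 file.
            bbox = obj.find('bndbox')
            x1 = float(bbox.find('xmin').text)
            y1 = float(bbox.find('ymin').text)
            x2 = float(bbox.find('xmax').text)
            y2 = float(bbox.find('ymax').text)
            # ****** (the rest is omitted in the original)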
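
Since the listing above is cut off, here is a self-contained sketch of the same idea, assuming Python 3 and parsing a shortened in-memory copy of the sample annotation instead of a file. The function name `read_boxes` and its return shape (a list of name and coordinate tuples) are illustrative choices, not the author's or Faster R-CNN's interface.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the VOC2007 sample above; only the fields used here.
SAMPLE_XML = """<annotation>
  <filename>000001.jpg</filename>
  <object>
    <name>dog</name>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
  <object>
    <name>person</name>
    <bndbox><xmin>8</xmin><ymin>12</ymin><xmax>352</xmax><ymax>498</ymax></bndbox>
  </object>
</annotation>"""


def read_boxes(xml_text):
    """Return [(name, (xmin, ymin, xmax, ymax)), ...] for each <object>."""
    root = ET.fromstring(xml_text)  # root is the <annotation> element
    boxes = []
    for obj in root.findall("object"):
        bbox = obj.find("bndbox")
        coords = tuple(
            float(bbox.find(tag).text)
            for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((obj.find("name").text, coords))
    return boxes
```

To read from disk instead, `ET.parse(filename).getroot()` would take the place of `ET.fromstring(xml_text)`.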

3. Comparing text strings in Python

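Some background first:
The project is vehicle detection with faster-rcnn, so I wrote my own annotation tool in MATLAB. It stores parameters such as name, color, and pose as Chinese characters, whereas faster-rcnn expects the VOC annotation format, in which all of these parameters are in English. I therefore modified the faster-rcnn data-loading interface to convert the Chinese class names into English strings.
The first line of the XML files I annotated reads:

    <?xml version="1.0" encoding="utf-8"?>

so the files are UTF-8 encoded.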
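Suppose I use the method from section 2 to get the content of a node:

    name = obj.find('name').text

If the content of name in the XML is "宠物" (pet), then name holds 宠物.
type() then shows the type of name:

    print type(name)

Output:

    <type 'unicode'>

In Python 2, unicode is the type for decoded text, as opposed to str, which holds raw bytes; look it up for the details.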
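
The same check can be reproduced without the annotation files. The sketch below assumes Python 3, where the text type is called `str` rather than `unicode`; the one-element XML document is made up for illustration and only mirrors the header shown earlier.

```python
import xml.etree.ElementTree as ET

# Assumes Python 3. Hypothetical one-object document with the same
# UTF-8 header as the annotation files; passed to the parser as bytes.
xml_bytes = (
    '<?xml version="1.0" encoding="utf-8"?>'
    "<object><name>宠物</name></object>"
).encode("utf-8")

# .text returns the decoded content of the <name> node.
name = ET.fromstring(xml_bytes).find("name").text

# In Python 3 the text type is str; Python 2 reports unicode here.
name_type = type(name).__name__
```

In both versions the parser has already decoded the UTF-8 bytes, so the value it hands back is text, not raw bytes.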
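Now suppose I want to test whether the name is "宠物" and, if so, set name to "chongwu". The following code does that:

    name = obj.find('name').text  # name is u"宠物", a unicode object
    if name == '宠物'.decode('utf-8'):  # the literal '宠物' is a byte str, so decode it to unicode first
        name = 'chongwu'

'宠物'.decode('utf-8') interprets the bytes of the literal as UTF-8 and returns a unicode object, so both sides have the same type and can be compared.
(Before decoding, '宠物' is <type 'str'>, a byte string. Python 2 treats a non-ASCII str and a unicode object as unequal and issues a UnicodeWarning, so the comparison never matches.)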
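
The conversion step can be sketched as a small lookup. This assumes Python 3, where a byte string is `bytes` and text is `str`, and where the two never compare equal. `LABEL_MAP` and `to_english` are hypothetical names, and the only mapping included is the one pair mentioned in this article.

```python
# Assumes Python 3: bytes must be decoded before comparing with str.
# Hypothetical table; the article only gives this single pair.
LABEL_MAP = {"宠物": "chongwu"}


def to_english(label):
    """Map a Chinese label to its English name, leaving others unchanged."""
    if isinstance(label, bytes):
        # Same step as '宠物'.decode('utf-8') in the Python 2 code above.
        label = label.decode("utf-8")
    return LABEL_MAP.get(label, label)


# A label as it would look if read as raw UTF-8 bytes.
raw_label = "宠物".encode("utf-8")
```

A dictionary keeps the comparison in one place once more classes are added; whether the remaining labels (color, pose, and so on) fit the same table depends on the annotation tool, which the article does not show.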