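Abstract:
By default, Python 2 does not accept Chinese characters in source code, not even in comments.
But, and note the "but"...
1. Chinese characters in Python source.
The cause is, of course, the source encoding (Python 3 defaults to UTF-8, so this applies to Python 2, which the examples here use). Details are in PEP 263:
https://www.python.org/dev/peps/pep-0263/
The PEP identifies the problem and gives the fix (you can declare the encoding of your source file):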
> Defining the Encoding
> Python will default to ASCII as standard encoding if no other
> encoding hints are given.
> ***To define a source code encoding, a magic comment must
> be placed into the source files either as first or second
> line in the file, such as:***
> # coding=<encoding name>
> or (using formats recognized by popular editors)
> #!/usr/bin/python
> # -*- coding: <encoding name> -*-
> or
> #!/usr/bin/python
> # vim: set fileencoding=<encoding name> :
> More precisely, the first or second line must match the regular
> expression "^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)".
> The first group of this
> expression is then interpreted as encoding name. If the encoding
> is unknown to Python, an error is raised during compilation. There
> must not be any Python statement on the line that contains the
> encoding declaration. If the first line matches the second line
> is ignored.
>
> To aid with platforms such as Windows, which add Unicode BOM marks
> to the beginning of Unicode files, the UTF-8 signature
> '\xef\xbb\xbf' will be interpreted as 'utf-8' encoding as well
> (even if no magic encoding comment is given).
>
> If a source file uses both the UTF-8 BOM mark signature and a
> magic encoding comment, the only allowed encoding for the comment
> is 'utf-8'. Any other encoding will cause an error.
So you can declare the source encoding in the file itself, on the first or second line.
For example, test.py:
# coding=utf-8
kk='文字'
print kk
Run it:
$ python test.py
Output:
文字
Without the "# coding=utf-8" line:
kk='文字'
print kk
the output is:
SyntaxError: Non-ASCII character '\xe6' in file nouse2.py on line 9, but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
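The matching rule quoted from PEP 263 can be tried out directly. The following is only a rough sketch, written for Python 3 rather than the Python 2 used in this post: `declared_encoding` is a hypothetical helper name, and it simplifies the PEP (for instance, it does not raise an error when a BOM is combined with a non-UTF-8 comment).

```python
import codecs
import re

# Pattern quoted from PEP 263; the first group is the encoding name.
CODING_RE = re.compile(r"^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source_bytes):
    """Return the encoding declared in the first two lines, or None.

    Hypothetical helper for illustration; not part of any library.
    """
    # A UTF-8 BOM counts as a utf-8 declaration on its own.
    if source_bytes.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # Only the first or second line may carry the magic comment.
    for line in source_bytes.splitlines()[:2]:
        match = CODING_RE.match(line.decode("ascii", "replace"))
        if match:
            return match.group(1)
    return None


print(declared_encoding(b"# coding=utf-8\nkk = 1\n"))                      # utf-8
print(declared_encoding(b"#!/usr/bin/python\n# -*- coding: gbk -*-\n"))    # gbk
print(declared_encoding(b"kk = 1\nprint(kk)\n"))                           # None
```

A declaration on the third line or later is ignored, which is why the comment has to sit at the very top of the file.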
2. Reading an XML file in Python
The XML file to read is in the VOC2007 format:
<annotation>
<folder>VOC2007</folder>
<filename>000001.jpg</filename>
<source>
<database>The VOC2007 Database</database>
<annotation>PASCAL VOC2007</annotation>
<image>flickr</image>
<flickrid>341012865</flickrid>
</source>
<owner>
<flickrid>Fried Camels</flickrid>
<name>Jinky the Fruit Bat</name>
</owner>
<size>
<width>353</width>
<height>500</height>
<depth>3</depth>
</size>
<segmented>0</segmented>
<object>
<name>dog</name>
<pose>Left</pose>
<truncated>1</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>48</xmin>
<ymin>240</ymin>
<xmax>195</xmax>
<ymax>371</ymax>
</bndbox>
</object>
<object>
<name>person</name>
<pose>Left</pose>
<truncated>1</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>8</xmin>
<ymin>12</ymin>
<xmax>352</xmax>
<ymax>498</ymax>
</bndbox>
</object>
</annotation>
How to read it:
import xml.etree.ElementTree as ET  # XML parsing library
import os
import cPickle
import numpy as np

def readxml(filename):
    tree = ET.parse(filename)  # load and parse the XML file; tree wraps the root node
    objs = tree.findall('object')  # find all <object> nodes under the root
    num_objs = len(objs)  # number of annotated objects
    for ix, obj in enumerate(objs):  # iterate over the index and content of objs
        bbox = obj.find('bndbox')  # the bounding-box node is named <bndbox> in the file above
        x1 = float(bbox.find('xmin').text)
        y1 = float(bbox.find('ymin').text)
        x2 = float(bbox.find('xmax').text)
        y2 = float(bbox.find('ymax').text)
******
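Since the listing above is cut off, here is a small self-contained sketch of the same idea that can be run as-is. It is written for Python 3, parses a trimmed copy of the sample annotation from a string rather than a file, and `read_objects` is a hypothetical name, not the function from the original project.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the VOC2007 sample shown earlier in this section.
SAMPLE_XML = """<annotation>
  <filename>000001.jpg</filename>
  <object>
    <name>dog</name>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
  <object>
    <name>person</name>
    <bndbox><xmin>8</xmin><ymin>12</ymin><xmax>352</xmax><ymax>498</ymax></bndbox>
  </object>
</annotation>"""


def read_objects(root):
    """Collect (name, [xmin, ymin, xmax, ymax]) for every <object> under root."""
    results = []
    for obj in root.findall("object"):
        name = obj.find("name").text
        bndbox = obj.find("bndbox")
        # Read the four corner coordinates in a fixed order.
        box = [float(bndbox.find(tag).text)
               for tag in ("xmin", "ymin", "xmax", "ymax")]
        results.append((name, box))
    return results


root = ET.fromstring(SAMPLE_XML)
for name, box in read_objects(root):
    print(name, box)
```

For a file on disk, `ET.parse(filename).getroot()` gives the same kind of root element to pass in.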
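3. Comparing Chinese strings in Python.
A bit of background first:
The project is vehicle detection built on faster-rcnn. I wrote my own annotation tool in MATLAB, and it stores parameters such as name, color, and pose as Chinese text. faster-rcnn, however, expects the VOC annotation format, in which all of these parameters are English. So I modified the data-loading interface of faster-rcnn to convert the Chinese class names into English strings.
The first line of the XML files I annotated is:
<?xml version="1.0" encoding="utf-8"?>
so the files are UTF-8 encoded.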
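Suppose I use the method from section 2 to read the content of a node:
name = obj.find('name').text
If the content of name in the XML is "宠物", then name holds 宠物.
type() shows what kind of object name is:
print type(name)
Output:
<type 'unicode'>
What unicode means is left for you to look up.
Now suppose I want to check whether name is "宠物" and, if so, set it to "chongwu". The following code does that:
name = obj.find('name').text  # name is u"宠物", a unicode object
if name == '宠物'.decode('utf-8'):  # the literal '宠物' is a str (bytes), so decode it to unicode first
    name = 'chongwu'
'宠物'.decode('utf-8') decodes the UTF-8 bytes of the literal into a unicode object, so both sides are unicode and can be compared.
(Before decoding, '宠物' is <type 'str'>, so comparing it with a unicode value does not succeed.)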
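When more than one class name has to be converted, a lookup table may be easier to maintain than a chain of if statements. The sketch below is one possible way to do it, not the code from the original project: `NAME_MAP` and `to_english` are hypothetical names, and it targets Python 3, where a plain string literal is already Unicode and only byte strings need decoding.

```python
# -*- coding: utf-8 -*-

# Hypothetical mapping from Chinese class names to English labels.
NAME_MAP = {u"宠物": "chongwu"}


def to_english(name):
    """Map a Chinese class name to its English label; pass others through."""
    # Byte strings must be decoded first so both sides are Unicode text.
    if isinstance(name, bytes):
        name = name.decode("utf-8")
    return NAME_MAP.get(name, name)


raw = u"宠物".encode("utf-8")      # UTF-8 bytes, as read from a raw file
print(raw == u"宠物")              # False: bytes and text never compare equal
print(to_english(raw))             # chongwu
print(to_english(u"dog"))          # dog
```

The decoding step plays the same role as '宠物'.decode('utf-8') above: the comparison only works once both values are Unicode text.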
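This post covered the encoding problem with Chinese characters in Python source and how declaring the encoding at the top of the file avoids the error. It then showed how to read an XML file with the ElementTree library, using the VOC2007 format as the example. Finally, it discussed comparing strings in Python, in particular how decoding makes the comparison work when Chinese text is involved.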
