Python FAQ: charset detection

22 Apr 2010

Q:
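I am handling data submitted through a URL.

On a PC, most browsers submit data encoded as UTF-8.
"北京西" (Beijing West) looks like this: "%E5%8C%97%E4%BA%AC%E8%A5%BF"

On a phone, Mobile IE submits data encoded as GBK.
"北京西" looks like this: "%B1%B1%BE%A9%CE%F7"

My program needs to handle both kinds of input uniformly. How can I determine
which encoding was used, and then convert everything to unicode?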
===================

A:
>>> import chardet, urllib
>>> chardet.detect(urllib.unquote_plus('%E5%8C%97%E4%BA%AC%E8%A5%BF'))
{'confidence': 0.87624999999999997, 'encoding': 'utf-8'}
>>> chardet.detect(urllib.unquote_plus('%B1%B1%BE%A9%CE%F7'))
{'confidence': 0.98999999999999999, 'encoding': 'GB2312'}

chardet
http://chardet.feedparser.org/
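
The session above is Python 2. As a rough sketch for Python 3, and using a different technique from chardet, the two inputs can also be told apart with the standard library alone: percent-decode to bytes, try strict UTF-8 first, and fall back to GBK if that fails. The helper name decode_query_value and its fallback parameter are made up for this illustration; they are not part of chardet or urllib.

```python
from urllib.parse import unquote_to_bytes


def decode_query_value(raw, fallback="gbk"):
    """Percent-decode a query value and return it as str.

    Hypothetical helper: tries strict UTF-8 first, then the fallback codec.
    """
    # unquote_to_bytes leaves '+' untouched, so map it to a space first,
    # which is what unquote_plus does for form data.
    data = unquote_to_bytes(raw.replace("+", " "))
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # The bytes are not valid UTF-8; assume the legacy encoding instead.
        return data.decode(fallback)


pc = decode_query_value("%E5%8C%97%E4%BA%AC%E8%A5%BF")   # UTF-8 input
mobile = decode_query_value("%B1%B1%BE%A9%CE%F7")         # GBK input
print(pc == mobile)  # True: both decode to the same text
```

This is a heuristic. Some short GBK byte sequences also happen to be valid UTF-8, so when the set of possible encodings is larger or the input is very short, a statistical detector such as chardet may be the safer choice.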

^.^ - http://goo.gl/GDIO
