Decode escaped characters in a URL

83

I have a list containing URLs with escaped characters. Those characters were set by urllib2.urlopen when it retrieved the HTML page:

http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=edit
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=history
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&variant=zh

Is there a way to convert them back to their unescaped form in Python?

P.S.: The URLs are encoded in UTF-8.

python escaping

— Tony
source

144

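From the official documentation:

urllib.unquote(string)

Replace %xx escapes by their single-character equivalent.

Example: unquote('/%7Econnolly/') yields '/~connolly/'.

Then decode the resulting byte string, for example with .decode('utf-8').

Update: For Python 3, write the following: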

import urllib.parse
urllib.parse.unquote(url)

See the Python 3 documentation.
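As a minimal sketch of how this plays out on the URLs from the question, assuming Python 3 and that the escapes are UTF-8 as the asker states (the variable names `urls` and `decoded` are only illustrative):

```python
from urllib.parse import unquote

urls = [
    "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=edit",
    "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=history",
    "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&variant=zh",
]

# In Python 3, unquote() decodes the %XX escapes as UTF-8 by default,
# so the result is already a str containing the Chinese characters.
decoded = [unquote(u) for u in urls]

for u in decoded:
    print(u)
```

In Python 3 no separate decode step should be needed, because unquote returns text rather than bytes.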

— Ignacio Vazquez-Abrams
source

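As I said above, unquote shows sample.com/index.php?title=\xe9\xa6\x96\xe9\xa1\xb5&action=edi ... Maybe I did not explain myself well here, but the URL contains Chinese characters, and I want to decode it to the original characters, not to the unquoted bytes.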

— Tony

3

@dyoser You need to put that in your question.

— Chris Harper

@root45 This is a comment on an answer, so it is fine here. Thanks for your appreciation.

— Tony

11

Note that for Python 3, this is urllib.parse.unquote

— tayfun, 2015

4

For Python 3 it is also available as urllib.request.unquote

— Ben

30

If you are using Python 3, you can use:

import urllib.parse
urllib.parse.unquote(url)

— Vladir Parrado Cruz
source

It is also available as urllib.request.unquote

— Ben

11

Or urllib.unquote_plus, which also turns plus signs into spaces:

>>> import urllib
>>> urllib.unquote('erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29')
'erythrocyte+membrane+protein+1,+PfEMP1+(VAR)'
>>> urllib.unquote_plus('erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29')
'erythrocyte membrane protein 1, PfEMP1 (VAR)'
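The session above is Python 2. A rough Python 3 equivalent, assuming the same input string (the names `s`, `kept_plus` and `spaces` are only illustrative), might look like this:

```python
from urllib.parse import unquote, unquote_plus

s = "erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29"

# unquote() only resolves %XX escapes and leaves '+' untouched.
kept_plus = unquote(s)

# unquote_plus() additionally converts '+' to a space, which matches
# how HTML form values and query strings are usually encoded.
spaces = unquote_plus(s)

print(kept_plus)
print(spaces)
```

Which one fits depends on where the string came from: unquote_plus is usually the right choice for query-string values, while a literal plus sign in a path should be left alone.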

— dli
source

7

You can use urllib.unquote

— Klaus Byskov Pedersen
source

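When I use unquote (which I was already doing, by the way), it shows this string: sample.com/index.php?title=\xe9\xa6\x96\xe9\xa1\xb5&action=edi. I know those are Chinese characters, but how can I see them? I guess this is Unicode, right?

— Tony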

That is already in your question. Those are UTF-8 bytes; you can convert them to a Unicode string with b"\xe9\xa6\x96\xe9\xa1\xb5".decode("utf-8") (now using somewhat more modern Python syntax).

— tripleee
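To make the bytes-versus-text distinction in this comment concrete, here is a small sketch assuming Python 3. It uses urllib.parse.unquote_to_bytes to reproduce the raw bytes that Python 2's unquote returned, then decodes them explicitly (the names `raw` and `text` are only illustrative):

```python
from urllib.parse import unquote_to_bytes

# Resolve the %XX escapes but keep the result as raw bytes,
# similar to what the Python 2 unquote() call produced.
raw = unquote_to_bytes("%E9%A6%96%E9%A1%B5")

# Interpret those bytes as UTF-8 to get the actual characters.
text = raw.decode("utf-8")

print(raw)
print(text)
```

The \xe9\xa6\x96... output in the comments is just the repr of those UTF-8 bytes; decoding them is what yields the readable characters.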

5

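import re

def unquote(url):
  # Decode each run of %XX escapes as UTF-8 bytes (Python 3), so that
  # multi-byte characters come out whole instead of one chr() per byte.
  return re.sub(r'(?:%[0-9a-fA-F]{2})+',
                lambda m: bytes.fromhex(m.group(0).replace('%', '')).decode('utf-8'), url)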

— Mistercx
source

8

Why hand-roll this with a regular expression and a lambda when a built-in library already does what you need, and probably does it better?

— Brad Koch

6

Cool solution! urllib2 is not part of the standard Python distribution. re is.

— cxxl, 2014