>I am trying to fetch the source code of a PHP web page through a proxy, but it shows non-printable characters. The output I get is as follows:
"Date: Tue, 09 Feb 2016 10:29:14 GMTServer: Apache/2.4.9 (Unix) OpenSSL/1.0.1g PHP/5.5.11 mod_perl/2.0.8-dev Perl/v5.16.3X-Powered-By: PHP/5.5.11Set-Cookie: PHPSESSID=jmqasueos33vqoe6dbm3iscvg0; path=/Expires: Thu, 19 Nov 1981 08:52:00 GMTCache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0Pragma: no-cacheContent-Encoding: gzipVary: Accept-EncodingContent-Length: 577Keep-Alive: timeout=5, max=99Connection: Keep-AliveContent-Type: text/htmlTMo @ G 7 (P H H DS =U = U ]˻ _ Ycl T *> eg Z V N f :6 IkZ77 A nG W ɗ RGY Oc'-ο ƜO ~? V $ l4 + n ]。W TLJSx/| n #> r ; l H 4 f '' SY y 7 "
How can I decode this with Python? I tried using
decd=zlib.decompress(data, 16+zlib.MAX_WBITS)
but it does not give the decoded data.
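The proxy I am using works fine for several other web applications. It shows non-printable characters only for some web applications. How do I decode them?
Since I am using a proxy, I do not want to use get() and urlopen() or any other request made from the Python program.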
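One obvious way is to extract the compressed data from the response and decompress it using GzipFile().read(). This method of splitting the response might be prone to failure, but here goes: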
# Python 2 example
from gzip import GzipFile
from StringIO import StringIO

# Raw HTTP response: headers, a blank line, then the gzip-compressed body
http = 'HTTP/1.1 200 OK\r\nServer: nginx\r\nDate: Tue, 09 Feb 2016 12:02:25 GMT\r\nContent-Type: application/json\r\nContent-Length: 115\r\nConnection: close\r\nContent-Encoding: gzip\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\n\r\n\x1f\x8b\x08\x00\xa0\xda\xb9V\x02\xff\xab\xe6RPPJ\xaf\xca,(HMQ\xb2R()*M\xd5Q\x00\x89e\xa4&\xa6\xa4\x16\x15\x03\xc5\xaa\x81\'\xa0\x80G~q\t\x90\xa7\x94QRR\x90\x94\x99\xa7\x97_\x94\xae\x04\x94\xa9\x85(\xcfM-\xc9\xc8\x07\x99\xa0\xe4\xee\x1a\xa2\x04\x11\xcb/\xcaL\xcf\xcc\x03\x89\x19Z\x1a\xe9\x19\x9aY\xe8\x19\xea\x19*q\xd5r\x01\x00\r(\xafRu\x00\x00\x00'
# Everything after the first blank line (\r\n\r\n) is the body
body = http.split('\r\n\r\n', 1)[1]
print GzipFile(fileobj=StringIO(body)).read()
Output
{
  "gzipped": true,
  "headers": {
    "Host": "httpbin.org"
  },
  "method": "GET",
  "origin": "192.168.1.1"
}
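For readers on Python 3, the same split-and-gunzip idea might look like the sketch below. It is not part of the original answer: the sample response is built locally with `gzip.compress` rather than captured from a proxy, and `gunzip_http_body` is a hypothetical helper name. It assumes the whole response is already in memory as `bytes` and that the body is not chunked.

```python
import gzip
import io

# Hypothetical sample data: a small gzip-compressed body behind minimal headers.
payload = b'{"gzipped": true}'
compressed = gzip.compress(payload)
headers = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Encoding: gzip\r\n"
    b"Content-Length: %d\r\n"
    b"\r\n"
) % len(compressed)
raw_response = headers + compressed


def gunzip_http_body(raw):
    """Split a raw HTTP response at the first blank line and gunzip the body."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        # No blank line means we cannot tell where the headers end.
        raise ValueError("no header/body separator found")
    return gzip.GzipFile(fileobj=io.BytesIO(body)).read()


decoded = gunzip_http_body(raw_response)
print(decoded.decode("utf-8"))
```

Using `partition` on the first `\r\n\r\n` keeps any later occurrence of that byte sequence inside the compressed body intact.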
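If you feel it is necessary to parse the complete HTTP response message then, inspired by this answer, here is a rather roundabout way to do it. It involves constructing an httplib.HTTPResponse directly from the raw HTTP response, using that to create a urllib3.response.HTTPResponse, and then accessing the decompressed data: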
# Python 2 example
import httplib
from cStringIO import StringIO
from urllib3.response import HTTPResponse

http = 'HTTP/1.1 200 OK\r\nServer: nginx\r\nDate: Tue, 09 Feb 2016 12:02:25 GMT\r\nContent-Type: application/json\r\nContent-Length: 115\r\nConnection: close\r\nContent-Encoding: gzip\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\n\r\n\x1f\x8b\x08\x00\xa0\xda\xb9V\x02\xff\xab\xe6RPPJ\xaf\xca,(HMQ\xb2R()*M\xd5Q\x00\x89e\xa4&\xa6\xa4\x16\x15\x03\xc5\xaa\x81\'\xa0\x80G~q\t\x90\xa7\x94QRR\x90\x94\x99\xa7\x97_\x94\xae\x04\x94\xa9\x85(\xcfM-\xc9\xc8\x07\x99\xa0\xe4\xee\x1a\xa2\x04\x11\xcb/\xcaL\xcf\xcc\x03\x89\x19Z\x1a\xe9\x19\x9aY\xe8\x19\xea\x19*q\xd5r\x01\x00\r(\xafRu\x00\x00\x00'

# Minimal stand-in for a socket: httplib only needs makefile()
class DummySocket(object):
    def __init__(self, data):
        self._data = StringIO(data)

    def makefile(self, *args, **kwargs):
        return self._data

# Parse the status line and headers from the raw bytes
response = httplib.HTTPResponse(DummySocket(http))
response.begin()
# urllib3 decodes the body according to Content-Encoding
response = HTTPResponse.from_httplib(response)
print(response.data)
Output
{
  "gzipped": true,
  "headers": {
    "Host": "httpbin.org"
  },
  "method": "GET",
  "origin": "192.168.1.1"
}
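A rough Python 3 counterpart that stays within the standard library is sketched below, on the assumption that only gzip needs handling. It swaps urllib3 for a manual check of the `Content-Encoding` header, so it is a different technique from the answer above, not a port of it. `FakeSocket` and `decode_raw_response` are hypothetical names, and the sample response is generated locally.

```python
import gzip
import io
from http.client import HTTPResponse


class FakeSocket:
    """Hypothetical stand-in for a socket: HTTPResponse only calls makefile()."""

    def __init__(self, data):
        self._file = io.BytesIO(data)

    def makefile(self, *args, **kwargs):
        return self._file


def decode_raw_response(raw):
    """Parse raw HTTP response bytes and return (status, decoded body)."""
    response = HTTPResponse(FakeSocket(raw))
    response.begin()  # reads the status line and headers
    encoding = response.getheader("Content-Encoding", "").lower()
    body = response.read()  # honours Content-Length
    if encoding == "gzip":
        body = gzip.decompress(body)
    return response.status, body


# Hypothetical sample data built locally instead of captured from a proxy.
payload = b'{"gzipped": true}'
compressed = gzip.compress(payload)
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Encoding: gzip\r\n"
    b"Content-Length: %d\r\n"
    b"\r\n"
) % len(compressed) + compressed

status, body = decode_raw_response(raw)
print(status, body.decode("utf-8"))
```

Letting `http.client` parse the headers avoids hand-splitting the message, at the cost of the fake socket workaround.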
Although gzip uses zlib, when Content-Encoding is set to gzip there is an extra header before the compressed stream, which the zlib.decompress call does not interpret correctly.
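To make the header issue concrete, here is a small Python 3 sketch using locally generated data (the `payload` value is made up). It shows that `zlib.decompress` with its default `wbits` rejects a gzip stream, while `16 + zlib.MAX_WBITS` tells zlib to expect the gzip header and trailer. It also shows that even the gzip-aware call fails when HTTP headers are still attached to the data. That may be what happened in the question, though this is a guess and the thread does not confirm it.

```python
import gzip
import zlib

# Hypothetical body, compressed the way a server would for Content-Encoding: gzip.
payload = b"hello from a gzip-encoded body"
body = gzip.compress(payload)

# wbits = 16 + MAX_WBITS: zlib expects a gzip header and trailer.
via_zlib = zlib.decompress(body, 16 + zlib.MAX_WBITS)

# Default wbits expects a zlib header, so the gzip magic bytes are rejected.
try:
    zlib.decompress(body)
    default_ok = True
except zlib.error:
    default_ok = False

# HTTP headers left in front of the body break the gzip-aware call as well.
try:
    zlib.decompress(b"HTTP/1.1 200 OK\r\n\r\n" + body, 16 + zlib.MAX_WBITS)
    with_headers_ok = True
except zlib.error:
    with_headers_ok = False

print(via_zlib == payload, default_ok, with_headers_ok)
```

Either route (the gzip module or zlib with the gzip `wbits` value) needs the body alone, so the headers have to be stripped first.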
Put the data into a file-like object and pass it through the gzip module. For example:
# Python 2 example; data must hold only the gzip-compressed body
import cStringIO
import gzip

mydatafile = cStringIO.StringIO(data)
gzipper = gzip.GzipFile(fileobj=mydatafile)
decdata = gzipper.read()
This comes from my by now rather old Python 2.x HTTP library:
- https://github.com/mementum/httxlib/blob/master/httxlib/httxcompression.py