How to decode source code that is compressed with gzip in Python

>I am trying to get the source code of a PHP web page through a proxy, but it shows unprintable characters. The output I get is as follows:

"Date: Tue, 09 Feb 2016 10:29:14 GMT
Server: Apache/2.4.9 (Unix) OpenSSL/1.0.1g PHP/5.5.11 mod_perl/2.0.8-dev Perl/v5.16.3
X-Powered-By: PHP/5.5.11
Set-Cookie: PHPSESSID=jmqasueos33vqoe6dbm3iscvg0; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Content-Encoding: gzip
Vary: Accept-Encoding
Content-Length: 577
Keep-Alive: timeout=5, max=99
Connection: Keep-Alive
Content-Type: text/html
TMo @ G 7 (P H H DS =U = U ]˻ _ Ycl T *> eg Z V N f :6 IkZ77 A nG W ɗ RGY Oc'-ο ƜO ~? V $ l4 + n ].W TLJSx/| n #> r ; l H 4 f '' SY y 7 "

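How can I decode this using Python? I tried using

decd = zlib.decompress(data, 16 + zlib.MAX_WBITS)

but it does not give the decoded data.

The proxy I am using works fine for several other web applications. It shows unprintable characters only for some web applications. How do I decode them?

Since I am using a proxy, I do not want to use get() and urlopen() or any other request from the Python program.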

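One obvious way is to extract the compressed data from the response and decompress it using GzipFile().read(). Splitting the response this way could be prone to failure, but here goes: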

# Python 2
from gzip import GzipFile
from StringIO import StringIO
# Raw HTTP response: headers, a blank line, then the gzip-compressed body
http = 'HTTP/1.1 200 OK\r\nServer: nginx\r\nDate: Tue, 09 Feb 2016 12:02:25 GMT\r\nContent-Type: application/json\r\nContent-Length: 115\r\nConnection: close\r\nContent-Encoding: gzip\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\n\r\n\x1f\x8b\x08\x00\xa0\xda\xb9V\x02\xff\xab\xe6RPPJ\xaf\xca,(HMQ\xb2R()*M\xd5Q\x00\x89e\xa4&\xa6\xa4\x16\x15\x03\xc5\xaa\x81\'\xa0\x80G~q\t\x90\xa7\x94QRR\x90\x94\x99\xa7\x97_\x94\xae\x04\x94\xa9\x85(\xcfM-\xc9\xc8\x07\x99\xa0\xe4\xee\x1a\xa2\x04\x11\xcb/\xcaL\xcf\xcc\x03\x89\x19Z\x1a\xe9\x19\x9aY\xe8\x19\xea\x19*q\xd5r\x01\x00\r(\xafRu\x00\x00\x00'
# Everything after the first blank line is the body
body = http.split('\r\n\r\n', 1)[1]
print GzipFile(fileobj=StringIO(body)).read()

Output

{
  "gzipped": true,
  "headers": {
    "Host": "httpbin.org"
  },
  "method": "GET",
  "origin": "192.168.1.1"
}
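For readers on Python 3, where StringIO and the print statement above no longer apply, the same split-and-decompress idea might look roughly like the sketch below. It assumes the raw response is available as bytes; `build_sample_response` and `split_and_gunzip` are hypothetical helper names, and the sample response is generated locally rather than taken from the proxy.

```python
import gzip
import io


def build_sample_response():
    """Build a raw HTTP response with a gzip body (a stand-in for proxy data)."""
    payload = b'{"gzipped": true, "method": "GET"}'
    body = gzip.compress(payload)
    head = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: application/json\r\n"
        b"Content-Encoding: gzip\r\n"
        b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
        b"\r\n"
    )
    return head + body


def split_and_gunzip(raw):
    """Split the raw response at the first blank line and gunzip the body."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line between headers and body")
    # Wrap the body bytes in a file-like object, as in the answer above
    return gzip.GzipFile(fileobj=io.BytesIO(body)).read()


raw = build_sample_response()
print(split_and_gunzip(raw).decode("utf-8"))
```

Using `partition` on bytes avoids decoding the compressed body as text, which would corrupt it.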

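If you feel the need to parse the complete HTTP response message then, inspired by this answer, here is a rather roundabout way to do it: construct an httplib.HTTPResponse directly from the raw HTTP response, use it to create a urllib3.response.HTTPResponse, and then access the decompressed data: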

# Python 2
import httplib
from cStringIO import StringIO
from urllib3.response import HTTPResponse
http = 'HTTP/1.1 200 OK\r\nServer: nginx\r\nDate: Tue, 09 Feb 2016 12:02:25 GMT\r\nContent-Type: application/json\r\nContent-Length: 115\r\nConnection: close\r\nContent-Encoding: gzip\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\n\r\n\x1f\x8b\x08\x00\xa0\xda\xb9V\x02\xff\xab\xe6RPPJ\xaf\xca,(HMQ\xb2R()*M\xd5Q\x00\x89e\xa4&\xa6\xa4\x16\x15\x03\xc5\xaa\x81\'\xa0\x80G~q\t\x90\xa7\x94QRR\x90\x94\x99\xa7\x97_\x94\xae\x04\x94\xa9\x85(\xcfM-\xc9\xc8\x07\x99\xa0\xe4\xee\x1a\xa2\x04\x11\xcb/\xcaL\xcf\xcc\x03\x89\x19Z\x1a\xe9\x19\x9aY\xe8\x19\xea\x19*q\xd5r\x01\x00\r(\xafRu\x00\x00\x00'
class DummySocket(object):
    """Minimal socket stand-in: httplib only needs makefile()."""
    def __init__(self, data):
        self._data = StringIO(data)
    def makefile(self, *args, **kwargs):
        return self._data
response = httplib.HTTPResponse(DummySocket(http))
response.begin()  # parse the status line and headers
# urllib3 decodes the body according to Content-Encoding
response = HTTPResponse.from_httplib(response)
print(response.data)

Output

{
  "gzipped": true,
  "headers": {
    "Host": "httpbin.org"
  },
  "method": "GET",
  "origin": "192.168.1.1"
}
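A Python 3 variant of this idea can probably stay within the standard library: `http.client.HTTPResponse` parses the status line and headers from a socket-like object, and `gzip.decompress` handles the body. The sketch below is an assumption-laden illustration, not the answer's original code; `FakeSocket` and `decode_raw_response` are hypothetical names, and the sample response is built locally.

```python
import gzip
import http.client
import io


class FakeSocket:
    """Minimal socket stand-in: HTTPResponse only calls makefile()."""

    def __init__(self, data):
        self._file = io.BytesIO(data)

    def makefile(self, *args, **kwargs):
        return self._file


def decode_raw_response(raw):
    """Parse a raw HTTP response and gunzip the body if it is gzip-encoded."""
    response = http.client.HTTPResponse(FakeSocket(raw))
    response.begin()  # parse the status line and headers
    encoding = (response.getheader("Content-Encoding") or "").lower()
    body = response.read()  # honours Content-Length
    if encoding == "gzip":
        body = gzip.decompress(body)
    return response.status, body


# Locally generated sample standing in for the data seen through the proxy
payload = b'{"gzipped": true}'
gz_body = gzip.compress(payload)
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Encoding: gzip\r\n"
    b"Content-Length: " + str(len(gz_body)).encode("ascii") + b"\r\n"
    b"\r\n"
) + gz_body

status, body = decode_raw_response(raw)
print(status, body.decode("utf-8"))
```

Letting the HTTP parser find the body boundary is somewhat more robust than splitting on the first blank line by hand, since it respects Content-Length.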

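Although gzip uses zlib, when Content-Encoding is set to gzip there is an additional header in front of the compressed stream, which a plain zlib.decompress call does not interpret correctly.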
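To make the header issue concrete, here is a small Python 3 check on locally generated data (not the proxy output from the question). It suggests that `zlib.decompress` with default settings rejects a gzip stream, that `16 + zlib.MAX_WBITS` accepts it when given the body alone, and that the same call fails again if the HTTP headers are still attached, which may be what happened in the question.

```python
import gzip
import zlib

payload = b"hello from a gzip-encoded body"
gz_body = gzip.compress(payload)  # gzip header + deflate stream + trailer

# Default wbits expects a zlib header, so the gzip header is rejected
try:
    zlib.decompress(gz_body)
    default_ok = True
except zlib.error:
    default_ok = False

# 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
via_wbits = zlib.decompress(gz_body, 16 + zlib.MAX_WBITS)

# With the HTTP headers still in front, the gzip magic bytes are not
# at the start of the data, so decompression fails again
raw = b"HTTP/1.1 200 OK\r\nContent-Encoding: gzip\r\n\r\n" + gz_body
try:
    zlib.decompress(raw, 16 + zlib.MAX_WBITS)
    with_headers_ok = True
except zlib.error:
    with_headers_ok = False

print(default_ok, via_wbits == payload, with_headers_ok)
```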

Put the data in a file-like object and pass it through the gzip module. For example:

# Python 2; data holds the gzip-compressed body only
import cStringIO
import gzip
mydatafile = cStringIO.StringIO(data)
gzipper = gzip.GzipFile(fileobj=mydatafile)
decdata = gzipper.read()

Taken from my by now rather old Python 2.x HTTP library:

  • https://github.com/mementum/httxlib/blob/master/httxlib/httxcompression.py