HTTP requests with a timeout, a maximum size, and connection pooling

Question

14 views · May 20, 2023

Anonymous · May 19, 2022

I'm looking for a way to make HTTP requests in Python (2.7) that meets three requirements:

timeout (for reliability)
maximum content size (for security)
connection pooling (for performance)

I have looked at nearly every Python HTTP library, but none of them meets all three requirements. For example:

urllib2: works well, but has no connection pooling

import urllib2
import json
r = urllib2.urlopen('https://github.com/timeline.json', timeout=5)
content = r.read(100+1)
if len(content) > 100: 
    print 'too large'
    r.close()
else:
    print json.loads(content)
r = urllib2.urlopen('https://github.com/timeline.json', timeout=5)
content = r.read(100000+1)
if len(content) > 100000: 
    print 'too large'
    r.close()
else:
    print json.loads(content)

requests: no maximum size

import json
import requests
r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)
r.headers['content-length'] # does not exist for this request, and cannot be trusted anyway
content = r.raw.read(100000+1)
print content # this is still gzipped, so len(content) is not the real size
print json.loads(content) # fails: content is gzipped, so pretty useless
print r.json() # does not work anymore since raw.read was used

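urllib3: I could not get the "read" method to work, even with a 50 MB file...

httplib: httplib.HTTPConnection is not a pool (it holds only one connection)

I can hardly believe that urllib2 is the best HTTP library I can use! So if anyone knows a library that can do this, or how to do it with one of the libraries above...

EDIT:

The best solution I found, thanks to Martijn Pieters (StringIO does not slow down even for huge files, whereas string concatenation slows down a lot):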

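import requests
from StringIO import StringIO

maxsize = 100000  # largest body, in decoded bytes, that we accept
r = requests.get('https://github.com/timeline.json', stream=True)
size = 0
ctt = StringIO()
for chunk in r.iter_content(2048):
    size += len(chunk)
    ctt.write(chunk)
    if size > maxsize:
        # stop downloading as soon as the limit is exceeded
        r.close()
        raise ValueError('Response too large')
content = ctt.getvalue()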

1 Answer

Anonymous · Answer 1 · 2022-05-19T20:57:58+00:00

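You can do this with requests just fine. However, you need to know that the raw object is part of the urllib3 internals, and you need to make use of the extra argument that the HTTPResponse.read() call supports, which lets you specify that you want to read decoded data: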

import json
import requests
r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)
# decode_content=True makes urllib3 decompress the body while reading
content = r.raw.read(100000+1, decode_content=True)
if len(content) > 100000:
    raise ValueError('Too large a response')
print content
print json.loads(content)

Alternatively, you can set the decode_content flag on the raw object before reading:

import json
import requests
r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)
r.raw.decode_content = True  # later reads return decompressed data
content = r.raw.read(100000+1)
if len(content) > 100000:
    raise ValueError('Too large a response')
print content
print json.loads(content)

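If you don't like reaching into the urllib3 internals like that, use response.iter_content() to iterate over the decoded content in chunks. This also uses the underlying HTTPResponse (through its .stream() generator version):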

import json
import requests
r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)
maxsize = 100000
content = ''
for chunk in r.iter_content(2048):
    content += chunk
    if len(content) > maxsize:
        # abort the download once the decoded body exceeds the limit
        r.close()
        raise ValueError('Response too large')
print content
print json.loads(content)

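There is a subtle difference in how the two approaches handle compressed data sizes. r.raw.read(100000+1) will only ever read 100k bytes of compressed data, and it is the uncompressed data that is tested against your maximum size. The iter_content() method will read more uncompressed data in the case where the compressed stream is larger than the uncompressed data.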
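To see why the compressed and the uncompressed size have to be treated separately, here is a small standalone sketch. It uses only the standard library and does not touch requests or urllib3; the helper name decompress_capped and the sample payload are made up for illustration. It is written for Python 3, because gzip.compress is not available on 2.7.

```python
import gzip
import zlib


def decompress_capped(compressed, max_size):
    """Gunzip `compressed`, refusing to produce more than max_size bytes.

    Hypothetical helper for illustration only, not part of requests or urllib3.
    """
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # Ask for at most max_size + 1 bytes so that an overflow is detectable
    # without ever holding the full decompressed body in memory.
    data = decoder.decompress(compressed, max_size + 1)
    if len(data) > max_size:
        raise ValueError('Response too large')
    return data


# A highly repetitive body: one million bytes that gzip shrinks to a tiny
# fraction of that, far below a 100000-byte limit on the compressed stream.
payload = b'a' * 1000000
compressed = gzip.compress(payload)
```

A limit applied to len(compressed) would let this body through, while a limit applied to the decoded bytes rejects it. That is the reason the size check in the examples above is done on decoded content.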

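Neither method allows r.json() to work, because neither sets the response._content attribute. You can set it manually, of course. But since the .raw.read() and .iter_content() calls already give you access to the content in question, there is really no need.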
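For the third requirement in the question, connection pooling, one possible way to combine everything is to send the request through a shared requests.Session, which keeps connections to a host open for reuse. The following is a minimal sketch rather than a definitive implementation: fetch_json is a hypothetical helper name, the body is assumed to be UTF-8 encoded JSON, and io.BytesIO stands in for the StringIO buffer so that the code also runs on Python 3.

```python
import io
import json


def fetch_json(session, url, max_size=100000, timeout=5):
    """Fetch `url` and parse it as JSON, rejecting bodies over max_size bytes.

    `session` is expected to behave like requests.Session. Reusing the same
    session across calls is what provides the connection pooling.
    Hypothetical helper; assumes the body is UTF-8 encoded JSON.
    """
    response = session.get(url, timeout=timeout, stream=True)
    buf = io.BytesIO()
    size = 0
    try:
        # iter_content yields decoded (decompressed) chunks, so `size`
        # counts the real content size, not the bytes on the wire.
        for chunk in response.iter_content(2048):
            size += len(chunk)
            if size > max_size:
                raise ValueError('Response too large')
            buf.write(chunk)
    finally:
        # Always release the connection, whether or not the body was accepted.
        response.close()
    return json.loads(buf.getvalue().decode('utf-8'))


# Intended usage (requires the requests package and network access):
#   import requests
#   session = requests.Session()
#   data = fetch_json(session, 'https://github.com/timeline.json')
```

Note that the timeout in requests applies to connecting and to each wait for data from the server, not to the download as a whole, so a slow but steady response is bounded by the size limit rather than by the timeout.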