Download and unzip a .zip file without writing to disk
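I have managed to get my first Python script working. It downloads a list of .zip files from a URL, then extracts the ZIP files and writes them to disk.
I am now at a loss as to how to achieve the next step.
My primary goal is to download and extract the zip files and pass their contents (CSV data) over a TCP stream. If I can get away with it, I would prefer not to write any of the zip or extracted files to disk at all.
Here is my current script, which works but unfortunately has to write the files to disk.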
    import urllib, urllister
    import zipfile
    import urllib2
    import os
    import time
    import pickle

    # check for extraction directories existence
    if not os.path.isdir('downloaded'):
        os.makedirs('downloaded')

    if not os.path.isdir('extracted'):
        os.makedirs('extracted')

    # open logfile for downloaded data and save to local variable
    if os.path.isfile('downloaded.pickle'):
        downloadedLog = pickle.load(open('downloaded.pickle'))
    else:
        downloadedLog = {'key':'value'}

    # remove entries older than 5 days (to maintain speed)

    # path of zip files
    zipFileURL = "http://www.thewebserver.com/that/contains/a/directory/of/zip/files"

    # retrieve list of URLs from the webservers
    usock = urllib.urlopen(zipFileURL)
    parser = urllister.URLLister()
    parser.feed(usock.read())
    usock.close()
    parser.close()

    # only parse urls
    for url in parser.urls:
        if "PUBLIC_P5MIN" in url:

            # download the file
            downloadURL = zipFileURL + url
            outputFilename = "downloaded/" + url

            # check if file already exists on disk
            if url in downloadedLog or os.path.isfile(outputFilename):
                print "Skipping " + downloadURL
                continue

            print "Downloading ",downloadURL
            response = urllib2.urlopen(downloadURL)
            zippedData = response.read()

            # save data to disk
            print "Saving to ",outputFilename
            output = open(outputFilename,'wb')
            output.write(zippedData)
            output.close()

            # extract the data
            zfobj = zipfile.ZipFile(outputFilename)
            for name in zfobj.namelist():
                uncompressed = zfobj.read(name)

                # save uncompressed data to disk
                outputFilename = "extracted/" + name
                print "Saving extracted file to ",outputFilename
                output = open(outputFilename,'wb')
                output.write(uncompressed)
                output.close()

                # send data via tcp stream

                # file successfully downloaded and extracted store into local log and filesystem log
                downloadedLog[url] = time.time();
                pickle.dump(downloadedLog, open('downloaded.pickle', "wb" ))
My suggestion would be to use a StringIO object. It emulates a file but lives entirely in memory, so you could do something like this:
    # get_zip_data() gets a zip archive containing 'foo.txt', reading 'hey, foo'

    import zipfile
    from StringIO import StringIO

    zipdata = StringIO()
    zipdata.write(get_zip_data())
    myzipfile = zipfile.ZipFile(zipdata)
    foofile = myzipfile.open('foo.txt')
    print foofile.read()

    # output: "hey, foo"
Or more simply (apologies to Vishal):
    myzipfile = zipfile.ZipFile(StringIO(get_zip_data()))
    for name in myzipfile.namelist():
        [ ... ]
In Python 3, use BytesIO instead of StringIO, because the zip data is bytes rather than text:
    import zipfile
    from io import BytesIO

    filebytes = BytesIO(get_zip_data())
    myzipfile = zipfile.ZipFile(filebytes)
    for name in myzipfile.namelist():
        [ ... ]
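To tie this back to the original goal of forwarding the extracted contents over TCP, here is a minimal Python 3 sketch. The helper name send_zip_members is hypothetical, and it assumes the receiver just wants each member's raw bytes back to back; the question does not specify a wire format, so no framing or delimiters are added.

```python
import io
import socket
import zipfile


def send_zip_members(zip_bytes, sock):
    """Send every file in an in-memory zip archive over a connected socket.

    Hypothetical helper: zip_bytes is the raw archive (for example the
    result of response.read()) and sock is an already-connected socket.
    Nothing is written to disk. Returns the number of payload bytes sent.
    """
    total = 0
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries, they carry no data
            data = archive.read(name)  # decompressed bytes, held in memory
            sock.sendall(data)
            total += len(data)
    return total
```

In the real script, sock could come from socket.create_connection((host, port)), with a host and port of your choosing, and zip_bytes would be the zippedData read from the HTTP response.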
Below is the code snippet I used to fetch a zipped CSV file, in case it helps:
Python 2:
    from StringIO import StringIO
    from zipfile import ZipFile
    from urllib import urlopen

    resp = urlopen("http://www.test.com/file.zip")
    myzip = ZipFile(StringIO(resp.read()))
    for line in myzip.open(file).readlines():
        print line
Python 3:
    from io import BytesIO
    from zipfile import ZipFile
    from urllib.request import urlopen
    # or: requests.get(url).content

    resp = urlopen("http://www.test.com/file.zip")
    myzip = ZipFile(BytesIO(resp.read()))
    for line in myzip.open(file).readlines():
        print(line.decode('utf-8'))
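Here file is a string holding the name of a member inside the archive. To find the actual string to pass, call namelist() on the ZipFile object. For instance: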
    resp = urlopen('http://mlg.ucd.ie/files/datasets/bbc.zip')
    myzip = ZipFile(BytesIO(resp.read()))
    myzip.namelist()
    # ['bbc.classes', 'bbc.docs', 'bbc.mtx', 'bbc.terms']
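Since the question is about CSV data, the names returned by namelist() can also be used to pick out the CSV members and parse them straight from memory. The sketch below is only an illustration: the function name iter_csv_rows is made up, and it assumes the files end in .csv and are UTF-8 encoded.

```python
import csv
import io
import zipfile


def iter_csv_rows(zip_bytes, suffix=".csv"):
    """Yield (member_name, row) for every CSV row in an in-memory archive.

    Hypothetical helper: assumes CSV members end with `suffix` and are
    encoded as UTF-8. Other members are skipped.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(suffix):
                continue
            with archive.open(name) as raw:
                # ZipFile.open() yields bytes, so wrap it to decode lazily.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                for row in csv.reader(text):
                    yield name, row
```

Each row comes back as a list of strings, so it can be filtered or re-serialized before being sent on, without any intermediate file.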