zstream fails for concatenated gz files
Original Reporter info from Mantis: crlab @neurolabusc1
- Reporter name: Chris Rorden
Description:
The gzip format allows multiple streams ("members") to be concatenated; a conforming decompressor treats such a file as if the members had been decompressed and concatenated into one file (1). In other words, a single file can be split into parts, each part compressed with gzip, and the resulting .gz streams appended in sequential order into a single file. When decompressed, the result should be byte-identical to the original input. This strategy is used by the bgzf format (2, 3) and by mgzip (4); the advantages include faster random access and faster parallel compression and decompression, at a small cost in compression efficiency. Files created by bgzf and mgzip are fully compatible with the gzip standard, but the Pascal zstream unit only appears to read the first member.
1.) https://en.wikipedia.org/wiki/Gzip
2.) http://samtools.github.io/hts-specs/SAMv1.pdf
3.) https://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html
4.) https://pypi.org/project/mgzip/
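A minimal sketch of the behaviour described above, using only the standard library: two gzip members concatenated into one stream must decompress to the full payload, while a reader that stops after the first member (as zstream does) recovers only the first part. The payload and sizes here are illustrative, not taken from the report.

```python
import gzip
import zlib

# Compress two halves of a payload as separate gzip members,
# then concatenate the results into a single .gz stream.
payload = b"A" * 100_000 + b"B" * 100_000
combined = gzip.compress(payload[:100_000]) + gzip.compress(payload[100_000:])

# A conforming decompressor processes every member:
# gzip.decompress() does, and returns the original payload.
assert gzip.decompress(combined) == payload

# A reader that stops at the end of the first member
# would only recover the first 100000 bytes.
d = zlib.decompressobj(wbits=31)  # wbits=31: gzip wrapper, single member
first_member = d.decompress(combined) + d.flush()
assert len(first_member) == 100_000
```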
Steps to reproduce:
Create a blocked gzip file. For the example, I compressed a 1147232 byte file to 284864 bytes using the following Python code:
import mgzip

fnm = 'img.nii'
# Read the uncompressed input file.
fh = open(fnm, "rb")
data = fh.read()
fh.close()
# Write it as a blocked gzip file: many small gzip members of
# up to blocksize uncompressed bytes each.
gh = mgzip.open(fnm + ".gz", "wb", compresslevel=9, blocksize=10**5)
gh.write(data)
gh.close()
Now try to extract this with zstream. It will only identify 200000 bytes, and attempting to read beyond that raises an exception.
Mantis conversion info:
- Mantis ID: 36822
- OS: Darwin
- OS Build: 10.11.6
- Build: 3.0.4 [2018/09/30]
- Platform: MacBook 2012 Retina 13"
- Version: 3.0.4
- Fixed in version: 3.3.1
- Fixed in revision: 49421 (#affefb6c)
- Target version: 4.0.0