fix(performance): Use bounded HTTP Range requests for indexed BAM queries #1998
TechIsCool wants to merge 1 commit into samtools:develop
Conversation
When reading indexed BAM files from remote URLs (HTTP, S3, etc.), seeking to a chunk offset would then read unbounded to EOF. For small queries against large files, this downloads far more data than needed.

This adds bgzf_seek_limit(), which accepts the chunk end offset from the BAM index, enabling bounded Range requests (bytes=X-Y) instead of unbounded ones (bytes=X-) in the libcurl backend.

Changes:
- hfile.h/hfile.c: Add readahead_limit field and setter
- bgzf.h/bgzf.c: Add bgzf_seek_limit(), which passes the limit to hfile
- hfile_libcurl.c: Use CURLOPT_RANGE with bounds when a limit is set
- hts.c: Call bgzf_seek_limit() with the chunk end in hts_itr_next()

The limit is cleared after each hseek(), so it only affects reads immediately following a seek.

Signed-off-by: David Beck <techiscool@gmail.com>
This changes the public …. This problem has already been fixed in the ….
Thanks for the review. I am unsure whether the PR could exist without impacting the ABI. For context, we use ….

Benchmarks

I tested ….
For small regions, GET counts are similar but S3 transfers ~5x more data (fixed 1 MB chunks). For larger regions, the 1 MB chunking causes 44 separate requests vs 1 bounded request.

Reproduction

```shell
IFACE=$(ls /sys/class/net/ | grep -v lo | head -1)
REGIONS="1:1000000-1000100 2:5000000-5000100 3:10000000-10000100 5:50000000-50000100 7:117188547-117188800"
BAM="1000genomes/phase3/data/NA12878/exome_alignment/NA12878.mapped.ILLUMINA.bwa.CEU.exome.20121211.bam"

test_region() {
    local bam=$1 bai=$2 region=$3
    local before=$(cat /sys/class/net/$IFACE/statistics/rx_bytes)
    local start=$(date +%s.%N)
    samtools view -X "$bam" "$bai" "$region" >/dev/null 2>&1
    local elapsed=$(echo "$(date +%s.%N) - $start" | bc)
    echo "$region: ${elapsed}s, $(($(cat /sys/class/net/$IFACE/statistics/rx_bytes) - before)) bytes"
}

echo "=== S3 ==="
for r in $REGIONS; do test_region "s3://$BAM" "s3://$BAM.bai" "$r"; done
echo "=== HTTPS ==="
for r in $REGIONS; do test_region "https://s3.amazonaws.com/$BAM" "https://s3.amazonaws.com/$BAM.bai" "$r"; done
```

Additional finding: On slower links, …. Adding ….
Problem
When reading remote BAM files with an index, htslib seeks to each chunk's start offset but issues unbounded Range requests. The server advertises gigabytes of Content-Length even though we only need kilobytes:
```
bytes=8224425-
bytes=1631423494-
bytes=7287649006-
```

The client terminates early, but "early termination" isn't free: data already in flight still transfers. We have also found that being specific about what is needed improves S3 responsiveness.
Solution
The BAM index already contains chunk end offsets. Pass them through to the HTTP layer.
EC2 Benchmark
(35 MB/s bandwidth)
Environment: EC2 m8azn.medium (up to 25 Gbps bandwidth), us-east-1
Test file:
s3://1000genomes/.../NA12878.mapped.ILLUMINA.bwa.CEU.exome.20121211.bam (17.3 GB)

Measurement: Wall clock time + actual wire transfer via /sys/class/net/<iface>/statistics/rx_bytes

It appears S3 optimizes resource allocation for bounded requests, leading to much faster responses. The time improvement exceeds the bandwidth savings, suggesting that S3 can serve bounded requests more efficiently.
Per-Request Comparison (5 regions)
Unbounded (Range: bytes=X-) vs bounded (Range: bytes=X-Y)

Local Benchmark
(2 MB/s bandwidth)
Environment: macOS M1 (up to 100 Mbps), California
Test file:
s3://1000genomes/.../NA12878.mapped.ILLUMINA.bwa.CEU.exome.20121211.bam (17.3 GB)

Reproduction