As part of the contributing guidelines I am creating this issue to start the conversation before pushing a PR with the fix, which I will include the fix in this issue as well.
I also looked for other conversations, I found this one from 2020, with no response: #524
In our QUIC stack, connections would complete handshake but then fail on early 1-RTT sends because quiche’s BoringSSL backend selected TLS13-specific AES-GCM AEAD entrypoints (EVP_aead_aes_gcm_tls13), which on the resolved boring backend rejected that runtime nonce/counter usage with INVALID_NONCE; this surfaced from quiche as CryptoFail, and in tokio-quiche appeared as send_on_path failure followed by local transport close code 1, causing reconnect churn. Switching quiche’s AES-GCM selection to the generic AEAD entrypoints (EVP_aead_aes_gcm) resolved the nonce failure and restored stable connections.
Findings:
- In tokio-quiche, that failure is observed in the worker send path at qconn.send_on_path(...), then it closes with transport InternalError (1) if no local error is already set , see source here:
|
if send_buf.len() > segment_size.unwrap_or(usize::MAX) { |
|
// Never let the buffer be longer than segment size, for GSO to |
|
// function properly. |
|
send_buf = &mut send_buf[..segment_size.unwrap_or(usize::MAX)]; |
|
} |
|
|
|
// On the first call to `select_path()` a path will be chosen based on |
|
// the local address the connection initially landed on. Once a path is |
|
// selected following calls to `select_path()` will return it, until it |
|
// is reset at the start of the next write cycle. |
|
// |
|
// The path is then passed to `send_on_path()` which will only generate |
|
// packets meant for that path, this way a single GSO buffer will only |
|
// contain packets that belong to the same network path, which is |
|
// required because the from/to addresses for each `sendmsg()` call |
|
// apply to the whole GSO buffer. |
|
let (from, to) = self.select_path(qconn).unzip(); |
|
|
|
match qconn.send_on_path(send_buf, from, to) { |
|
Ok((packet_size, info)) => { |
|
let _ = send_info.get_or_insert(info); |
|
|
|
self.write_state.bytes_written += packet_size; |
|
self.write_state.num_pkts += 1; |
|
|
|
let from = send_info.as_ref().map(|info| info.from); |
|
let to = send_info.as_ref().map(|info| info.to); |
|
|
|
self.write_state.selected_path = from.zip(to); |
|
|
|
self.write_state.has_pending_data = true; |
|
|
|
Ok(packet_size) |
|
}, |
|
|
|
Err(QuicheError::Done) => { |
|
// Flush the current buffer to network. If no other path needs |
|
// to be flushed to the network also yield the work loop task. |
|
// |
|
// Otherwise the write loop will start again and the next path |
|
// will be selected. |
|
let has_pending_paths = self.write_state.pending_paths.len() > 0; |
|
|
|
// Keep writing if there are paths left to try. |
|
self.write_state.has_pending_data = has_pending_paths; |
|
|
|
Ok(0) |
|
}, |
|
|
|
Err(e) => { |
|
let error_code = if let Some(local_error) = qconn.local_error() { |
|
local_error.error_code |
|
} else { |
- The send path calls into quiche packet encryption (encrypt_pkt -> seal_with_u64_counter):
|
extra_in, |
|
)?; |
|
|
|
encrypt_hdr(&mut header, pn_len, payload.as_ref(), aead)?; |
|
|
|
Ok(payload_offset + ciphertext_len) |
|
} |
|
|
|
pub fn encode_pkt_num( |
|
pn: u64, pn_len: usize, b: &mut octets::OctetsMut, |
|
) -> Result<()> { |
|
match pn_len { |
|
1 => b.put_u8(pn as u8)?, |
|
|
|
2 => b.put_u16(pn as u16)?, |
- In quiche’s BoringSSL backend, AEAD algorithm selection uses TLS13-specific AES-GCM entrypoints:
|
#[repr(C)] |
|
pub(crate) struct AES_KEY { |
|
rd_key: [u32; 4 * (14 + 1)], |
|
rounds: c_int, |
|
} |
|
|
|
impl Algorithm { |
|
fn get_evp_aead(self) -> *const EVP_AEAD { |
|
match self { |
- The actual crypto failure is emitted from the seal call (EVP_AEAD_CTX_seal_scatter) which maps to CryptoFail on non-success:
|
|
|
let rc = unsafe { |
|
EVP_AEAD_CTX_open( |
|
&self.ctx, // ctx |
|
buf.as_mut_ptr(), // out |
|
&mut out_len, // out_len |
|
max_out_len, // max_out_len |
|
nonce[..].as_ptr(), // nonce |
|
nonce.len(), // nonce_len |
|
buf.as_ptr(), // inp |
|
buf.len(), // in_len |
|
ad.as_ptr(), // ad |
|
ad.len(), // ad_len |
|
) |
|
}; |
|
|
|
if rc != 1 { |
|
return Err(Error::CryptoFail); |
|
} |
|
|
|
Ok(out_len) |
|
} |
|
|
|
pub fn seal_with_u64_counter( |
|
&mut self, counter: u64, ad: &[u8], buf: &mut [u8], in_len: usize, |
|
extra_in: Option<&[u8]>, |
|
) -> Result<usize> { |
|
let tag_len = self.alg.tag_len(); |
|
|
|
let mut out_tag_len = tag_len; |
|
|
|
let (extra_in_ptr, extra_in_len) = match extra_in { |
|
Some(v) => (v.as_ptr(), v.len()), |
|
|
|
None => (std::ptr::null(), 0), |
|
}; |
|
|
|
// Make sure all the outputs combined fit in the buffer. |
|
if in_len + tag_len + extra_in_len > buf.len() { |
|
return Err(Error::CryptoFail); |
|
} |
|
|
|
let nonce = make_nonce(&self.nonce, counter); |
|
|
|
let rc = unsafe { |
I have tested this patch locally and it works, however, it could have some potential impacts on decisions later.
The fix is as follows:
|
impl Algorithm { |
|
fn get_evp_aead(self) -> *const EVP_AEAD { |
|
match self { |
|
Algorithm::AES128_GCM => unsafe { EVP_aead_aes_128_gcm_tls13() }, |
|
Algorithm::AES256_GCM => unsafe { EVP_aead_aes_256_gcm_tls13() }, |
|
Algorithm::ChaCha20_Poly1305 => unsafe { |
|
EVP_aead_chacha20_poly1305() |
|
}, |
|
} |
|
} |
|
} |
|
|
Changes to:
`impl Algorithm {
fn get_evp_aead(self) -> *const EVP_AEAD {
match self {
Algorithm::AES128_GCM => unsafe { EVP_aead_aes_128_gcm() },
Algorithm::AES256_GCM => unsafe { EVP_aead_aes_256_gcm() },
Algorithm::ChaCha20_Poly1305 => unsafe {
EVP_aead_chacha20_poly1305()
},
}
}
}`
And:
|
fn EVP_aead_aes_128_gcm_tls13() -> *const EVP_AEAD; |
|
|
|
fn EVP_aead_aes_256_gcm_tls13() -> *const EVP_AEAD; |
|
|
|
fn EVP_aead_chacha20_poly1305() -> *const EVP_AEAD; |
|
|
Changes to:
`extern "C" {
fn EVP_aead_aes_128_gcm() -> *const EVP_AEAD;
fn EVP_aead_aes_256_gcm() -> *const EVP_AEAD;
fn EVP_aead_chacha20_poly1305() -> *const EVP_AEAD;`
PR: #2412
As part of the contributing guidelines I am creating this issue to start the conversation before pushing a PR with the fix, which I will include the fix in this issue as well.
I also looked for other conversations, I found this one from 2020, with no response: #524
In our QUIC stack, connections would complete handshake but then fail on early 1-RTT sends because quiche’s BoringSSL backend selected TLS13-specific AES-GCM AEAD entrypoints (EVP_aead_aes_gcm_tls13), which on the resolved boring backend rejected that runtime nonce/counter usage with INVALID_NONCE; this surfaced from quiche as CryptoFail, and in tokio-quiche appeared as send_on_path failure followed by local transport close code 1, causing reconnect churn. Switching quiche’s AES-GCM selection to the generic AEAD entrypoints (EVP_aead_aes_gcm) resolved the nonce failure and restored stable connections.
Findings:
quiche/tokio-quiche/src/quic/io/worker.rs
Lines 562 to 614 in b30f9e7
quiche/quiche/src/packet.rs
Lines 711 to 725 in b30f9e7
quiche/quiche/src/crypto/boringssl.rs
Lines 23 to 31 in b30f9e7
quiche/quiche/src/crypto/boringssl.rs
Lines 96 to 140 in b30f9e7
I have tested this patch locally and it works, however, it could have some potential impacts on decisions later.
The fix is as follows:
quiche/quiche/src/crypto/boringssl.rs
Lines 29 to 40 in b30f9e7
Changes to:
And:
quiche/quiche/src/crypto/boringssl.rs
Lines 323 to 328 in b30f9e7
Changes to:
PR: #2412