fix: keep dispatching messages in a batch when one of them errors or throws #5057

Open
alberti42 wants to merge 1 commit into emacs-lsp:master from alberti42:fix/lsp--create-filter-function

@alberti42 (Contributor) commented May 3, 2026

Summary

lsp--create-filter-function is the function that turns the raw bytes
arriving from the language server's process into JSON messages and hands
them to the rest of lsp-mode. A single read from the server very often
delivers more than one message at once, so the filter parses what it can
and then dispatches each message in turn.

There are two ways the existing code can lose messages that were already
parsed and ready to dispatch:

  1. A framing error escapes the parsing loop. The LSP wire format
    prefixes each message with Content-Length: N\r\n\r\n headers and
    then N bytes of JSON body. If the parsing loop's view of the byte
    stream becomes mis-aligned and the next iteration tries to read
    headers from the middle of a body, the code signals an error
    (Unable to find Content-Length header.). That error escapes the
    whole filter and takes the list of already-parsed messages with it.
  2. A throw from a message handler abandons the rest of the batch.
    When a message handler eventually triggers a (throw 'lsp-done …) or
    (throw 'input …) — these are caught higher up by code like
    lsp-request-while-no-input — the throw passes back through the
    filter on its way up. The filter currently dispatches messages with
    mapc, and mapc is killed by the throw mid-loop. Messages later in
    the batch never run.

Both are independent of the contents of the messages. The most visible
symptom is a server that gets stuck waiting for a reply to a request
(such as window/workDoneProgress/create or workspace/configuration)
that the client did receive and parse but never dispatched.

This PR makes both paths preserve the already-parsed messages:

  • A condition-case around the parsing loop catches a framing error,
    logs it, resets the framing state, and falls through so the messages
    parsed before the error still get dispatched.
  • The dispatch step replaces mapc with a per-message dolist that
    catches 'lsp-done and 'input throws locally, finishes dispatching
    the rest of the batch, and then re-issues the saved throw so it still
    reaches its original target.

No public APIs change. The normal path is unchanged. The only difference
a user can observe is that batches are no longer truncated when one
message in them fails.

Background

Three pieces of context that the rest of the PR depends on.

How messages reach the filter

lsp-mode connects to a language server over a pipe (or a TCP socket).
Whenever Emacs has bytes available from that connection, it calls the
process filter — the function returned by lsp--create-filter-function
— with the bytes that just arrived.

Two things to keep in mind:

  • A single call to the filter can contain part of one message, several
    whole messages, or some mix. The filter has to buffer leftover bytes
    across calls and split out complete messages as it goes.
  • The filter is the only place where parsing happens. If the filter
    loses a message — by throwing it away, by leaving it in the buffer, or
    by signalling an error before dispatching it — that message is gone;
    there is no retry from the network layer.

LSP wire framing

Each message on the wire looks like this:

Content-Length: 123\r\n
\r\n
{ … 123 bytes of JSON … }

The filter walks through the byte buffer alternating between two states:
"reading headers" and "reading a body of known length." When a body is
complete, it parses the JSON, appends the result to a local messages
list, resets the framing state, and goes back to "reading headers" for
whatever bytes are left.
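
The two-state walk described above can be sketched as follows. This is a minimal illustration, not lsp-mode's actual parser: `my/split-framed` is a hypothetical helper, it treats the buffer as an ASCII string so that character counts equal byte counts, and it returns raw body strings instead of parsed JSON.

```elisp
;; Hypothetical helper for illustration only; not part of lsp-mode.
;; Assumes BUFFER is ASCII, so string length equals byte length.
(defun my/split-framed (buffer)
  "Return (BODIES . LEFTOVER) for the framed messages found in BUFFER."
  (let ((pos 0) (bodies nil) (done nil))
    (while (not done)
      ;; State 1: reading headers.  Look for the blank line that ends them.
      (let ((hdr-end (string-match "\r\n\r\n" buffer pos)))
        (if (null hdr-end)
            (setq done t)               ; headers incomplete: wait for more bytes
          (let ((headers (substring buffer pos hdr-end)))
            (unless (string-match "Content-Length: *\\([0-9]+\\)" headers)
              (error "Unable to find Content-Length header."))
            ;; State 2: reading a body of known length.
            (let* ((len (string-to-number (match-string 1 headers)))
                   (body-start (+ hdr-end 4))
                   (body-end (+ body-start len)))
              (if (> body-end (length buffer))
                  (setq done t)         ; body incomplete: wait for more bytes
                (push (substring buffer body-start body-end) bodies)
                (setq pos body-end)))))))
    ;; Whatever was not consumed must be kept for the next read.
    (cons (nreverse bodies) (substring buffer pos))))

;; One complete message followed by the start of another:
(my/split-framed "Content-Length: 2\r\n\r\n{}Content-Le")
;; → (("{}") . "Content-Le")
```

On the next read, a caller would prepend the leftover string to the newly arrived bytes and call the helper again. If the buffer ever starts in the middle of a body, the header search finds no Content-Length and the error fires, which is the situation bug 1 is about.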

catch / throw (briefly)

Emacs Lisp's (throw 'TAG VAL) jumps up the call stack to the nearest
matching (catch 'TAG …), discarding every function call in between. If
no matching catch is on the stack, Emacs raises a no-catch error.

lsp-mode uses (throw 'lsp-done …) and (throw 'input …) to break
out of synchronous wait loops in callers like lsp-request-while-no-input.
Crucially, the catches for these tags are higher in the stack than the
filter — so when the throw fires inside a message handler, it travels up
through the filter on its way to the catch.
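
A small stand-alone illustration of that unwinding, using only built-in forms. The function names (`my/wait`, `my/filter`, `my/handler`) are hypothetical stand-ins for the synchronous caller, the process filter, and a response callback; none of this is lsp-mode code.

```elisp
(defvar my/log nil "Trace of what ran, newest first.")

(defun my/handler ()
  "Stand-in for a response callback that ends a synchronous wait."
  (throw 'lsp-done 'reply))

(defun my/filter ()
  "Stand-in for the process filter: call the handler, then do more work."
  (push 'filter-start my/log)
  (my/handler)
  (push 'filter-end my/log))   ; skipped: the throw unwinds this frame

(defun my/wait ()
  "Stand-in for a caller that waits synchronously for a reply."
  (catch 'lsp-done
    (my/filter)
    'timed-out))

(setq my/log nil)
(my/wait)   ; → reply
my/log      ; → (filter-start)

;; With no matching catch on the stack, the throw becomes an error:
(condition-case err
    (throw 'no-such-tag 1)
  (no-catch (car err)))   ; → no-catch
```

The catch in `my/wait` gets its value, but the second `push` in `my/filter` never runs. Any work the filter had planned after calling the handler is lost in the same way.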

The two bugs

Bug 1 — a framing error escapes the parsing loop

The parsing loop pulls header lines with the equivalent of:

(or (string-match-p "Content-Length" chunk)
    (error "Unable to find Content-Length header."))

Under normal operation, every successful body parse leaves chunk
positioned right at the start of the next Content-Length line, so the
search always succeeds.

Under abnormal operation, it can fail. The most concrete way: an earlier
JSON body fails to parse, the inner error handler logs the failure but
the framing state ends up in a position where the next iteration looks
for headers in what is actually mid-body bytes. The string-match-p
call returns nil, the (error …) fires, and the error escapes both the
inner condition-case (which only covers JSON parse) and the outer
while loop.

The damage: at the moment the error fires, the local list messages may
already hold one or more complete, parsed JSON objects from earlier in
the same read. Because the error escapes the whole filter function, the
dispatch step that would normally run them never runs. Those messages
are silently dropped.

If one of the dropped messages is a server-initiated request that
expects a reply (for example window/workDoneProgress/create), the
server hangs waiting for a response that the client has already thrown
away.
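
A toy model of this failure, under simplifying assumptions: `my/toy-filter` is hypothetical, each list element stands for one frame, and the symbol `bad` stands for bytes that the header search cannot make sense of. The real filter is more involved, but the control flow is the same.

```elisp
(defvar my/dispatched nil "Messages that reached the dispatch step.")

(defun my/toy-filter (frames)
  "Parse FRAMES, then dispatch them.  The symbol `bad' is a misaligned frame."
  (let (messages)
    ;; Parsing loop: accumulates messages in a local list.
    (dolist (frame frames)
      (if (eq frame 'bad)
          (error "Unable to find Content-Length header.")
        (push frame messages)))
    ;; Dispatch step: only reached if the loop above finishes.
    (dolist (msg (nreverse messages))
      (push msg my/dispatched))))

(setq my/dispatched nil)
(condition-case err
    (my/toy-filter '("reply-1" "window/workDoneProgress/create" bad))
  (error (error-message-string err)))
;; → "Unable to find Content-Length header."

my/dispatched   ; → nil: both parsed messages were lost with the error
```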

Bug 2 — a throw from a message handler abandons the rest of the batch

After parsing, the filter dispatches messages with:

(mapc (lambda (msg)
        (lsp--parser-on-message msg workspace))
      (nreverse messages))

Suppose the batch contains two messages: a response to a client request
and a server-initiated window/workDoneProgress/create. mapc runs the
lambda on the response first. The response's callback is the one
registered by lsp-request-while-no-input; that callback ends with
(throw 'lsp-done '_). The throw walks up the stack toward the catch in
lsp-request-while-no-input.

The catch is several frames above the filter. To get there, the throw
unwinds — among other frames — the lambda and the call to mapc itself.
mapc does not survive the throw: it is mid-iteration, but it never
gets to start the next iteration. The second message in the batch
(window/workDoneProgress/create) is never dispatched.

The catch in lsp-request-while-no-input does what it was designed to
do, the user's wait completes, and from lsp-request-while-no-input's
point of view everything is fine. From the server's point of view, it
sent a request that needs a reply and never got one. The server hangs.

The same shape of failure applies to (throw 'input …), which is used
elsewhere in lsp-mode to cancel work when user input arrives.

This bug compounds with bug 1 and with the id-collision bug fixed in
PR 1 (described under Related work): any of the three is enough to
silently drop a server-initiated request, and any dropped
server-initiated request that expects a reply can leave the server stuck.
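
The abandoned batch can be reproduced in isolation with a few lines. This is a sketch with hypothetical names: `my/on-message` stands in for lsp--parser-on-message, and the two symbols stand in for the two messages of the example above.

```elisp
(defvar my/seen nil "Messages that were dispatched, newest first.")

(defun my/on-message (msg)
  "Record MSG; the response's callback ends the synchronous wait."
  (push msg my/seen)
  (when (eq msg 'response)
    (throw 'lsp-done '_)))

(setq my/seen nil)
;; The catch plays the role of the one in the synchronous caller.
(catch 'lsp-done
  (mapc #'my/on-message '(response work-done-progress-create)))

my/seen   ; → (response): the second message was never dispatched
```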

The fix

Fix 1 — wrap the parsing loop and salvage already-parsed messages

Wrap the framing while loop in a condition-case. If any error
escapes the loop, log it, reset framing state for the next read, and
fall through to the dispatch step. The local messages list is
preserved, so any messages parsed before the error still get
dispatched.

(condition-case framing-err
    (while (not (s-blank? chunk))
      … existing parsing loop …)
  (error
   (lsp-warn "[lsp-filter] framing-error-salvage: salvaged %d parsed message(s); error: %S"
             (length messages) framing-err)
   (setf leftovers nil
         body-length nil
         body-received 0
         body nil)))

Two notes:

  • The state reset (leftovers, body-length, body-received, body)
    is the same reset the parsing loop already does after a successful
    body parse. Doing it on the error path puts the filter into a known
    state for the next read instead of leaving it half-way through a
    message it could not finish parsing.
  • The inner condition-case around JSON parsing is left in place. JSON
    parse errors continue to be reported with the existing lsp-warn
    message; only framing errors that escape that inner handler are now
    caught.
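
The salvage behaviour can be seen in a toy model. As before, this is an illustrative sketch rather than the patched filter: `my/toy-filter-salvaging` is hypothetical, each list element stands for one frame, and `bad` stands for misaligned bytes. Frames after the bad one are not parsed, mirroring the reset of the leftover bytes in the real fix.

```elisp
(defun my/toy-filter-salvaging (frames)
  "Parse FRAMES and return the list of messages that were dispatched."
  (let (messages dispatched)
    (condition-case err
        ;; Parsing loop, now guarded.
        (dolist (frame frames)
          (if (eq frame 'bad)
              (error "Unable to find Content-Length header.")
            (push frame messages)))
      ;; Log and fall through; MESSAGES keeps what was parsed so far.
      (error
       (message "salvaged %d parsed message(s); error: %S"
                (length messages) err)))
    ;; Dispatch step: now reached on both the normal and the error path.
    (dolist (msg (nreverse messages))
      (push msg dispatched))
    (nreverse dispatched)))

(my/toy-filter-salvaging '("a" "b" bad "c"))   ; → ("a" "b")
```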

Fix 2 — per-message dispatch with throw queueing

Replace mapc with a dolist that wraps each individual call to
lsp--parser-on-message in (catch 'lsp-done …) and (catch 'input …).
If a handler throws either tag, the local catch receives it, the loop
records the tag and value, and the loop continues with the next message.
After all messages in the batch have been dispatched, the saved throw
is re-issued so the original target catch — for example the one in
lsp-request-while-no-input — still receives it.

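(let ((sentinel (cons nil nil))   ; unique marker for "returned normally"
      queued-tag queued-value)    ; last intercepted throw, if any
  (dolist (msg (nreverse messages))
    (let ((r (catch 'lsp-done
               (let ((r2 (catch 'input
                           (lsp--parser-on-message msg workspace)
                           sentinel)))   ; reached only without a throw
                 ;; r2 is not the sentinel: the handler threw 'input.
                 (unless (eq r2 sentinel)
                   (setq queued-tag 'input queued-value r2))
                 sentinel))))
      ;; r is not the sentinel: the handler threw 'lsp-done.
      (unless (eq r sentinel)
        (setq queued-tag 'lsp-done queued-value r))))
  ;; Batch fully dispatched: re-issue the saved throw, if any.
  (when queued-tag
    (throw queued-tag queued-value)))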

How it works:

  • A fresh cons cell sentinel is created at the start. A normal return
    from the inner forms is signalled by returning that exact cons; any
    other value must have come from a throw. Using a fresh cons (via
    (cons nil nil)) means eq reliably distinguishes normal return
    from a thrown value, even if a handler throws something that happens
    to look like the sentinel.
  • Each message's dispatch is wrapped in two nested catches, one for
    'input (innermost) and one for 'lsp-done (outer). If either
    throws, the corresponding catch returns the thrown value instead of
    the sentinel; the loop records queued-tag and queued-value and
    moves on to the next message.
  • After the loop, if a throw was queued, it is re-issued by the final
    (throw queued-tag queued-value). The catch higher up the stack —
    the one the original throw was aimed at — still receives the throw,
    just slightly delayed: every message in the same batch is dispatched
    first.
  • Only 'lsp-done and 'input are caught here. These are the two tags
    lsp-mode uses for synchronous-wait coordination. Any other throw
    (or any error) is left to propagate as before, so this change does
    not silently swallow unrelated control flow.
  • If two messages in the same batch each throw, the later throw
    replaces the earlier one. Without the patch, the first throwing
    message wins because it aborts the batch; the patch prefers the last
    throwing message instead, which is consistent with finishing the
    batch before exiting.
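
The queueing behaviour can be exercised without lsp-mode. In this sketch `my/handle` is a hypothetical stand-in for lsp--parser-on-message, and each message is a list (NAME TAG VALUE) whose handler throws VALUE to TAG when TAG is non-nil. The dispatch loop itself follows the shape of the snippet above.

```elisp
(defvar my/ran nil "Names of dispatched messages, newest first.")

(defun my/handle (msg)
  "Record MSG's name, then throw if MSG carries a tag."
  (let ((name (nth 0 msg)) (tag (nth 1 msg)) (value (nth 2 msg)))
    (push name my/ran)
    (when tag (throw tag value))))

(defun my/dispatch-batch (messages)
  "Dispatch every message, then re-issue the last intercepted throw."
  (let ((sentinel (cons nil nil))
        queued-tag queued-value)
    (dolist (msg messages)
      (let ((r (catch 'lsp-done
                 (let ((r2 (catch 'input
                             (my/handle msg)
                             sentinel)))
                   (unless (eq r2 sentinel)
                     (setq queued-tag 'input queued-value r2))
                   sentinel))))
        (unless (eq r sentinel)
          (setq queued-tag 'lsp-done queued-value r))))
    (when queued-tag
      (throw queued-tag queued-value))))

(setq my/ran nil)
(catch 'input
  (catch 'lsp-done
    (my/dispatch-batch '((a lsp-done first) (b) (c input second)))))
;; → second   (the later throw replaced the earlier one)

(reverse my/ran)   ; → (a b c): every message in the batch was dispatched
```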

Why this is preferable to the alternatives

A few alternatives to consider:

  • Catch-and-rethrow at the outer caller instead of in the filter.
    This would require changes to every caller that uses 'lsp-done or
    'input. The filter is the single point that sees the whole batch,
    so handling the rescue here is the smallest, most local change.
  • Process messages one at a time on the next event-loop tick. This
    would avoid the batch-abandon problem at the cost of changing
    lsp-mode's observable timing in many places — synchronous waits
    would need extra ticks to complete, and the order of side effects
    could shift. Far more invasive than the local fix.
  • Catch all throws, not just 'lsp-done and 'input. Catching
    unknown throw tags risks hiding control flow we did not anticipate,
    including ones that other parts of the code may legitimately use to
    exit. The two tags caught here are the ones we know are used by
    synchronous-wait callers and that are the source of the abandon-batch
    symptom.
  • Convert the handlers to not throw across the filter. This would
    require redesigning the way lsp-mode does synchronous waiting. A
    large change for the same effect; the local fix is enough.

Compatibility

  • No API change. lsp--create-filter-function keeps the same
    signature and the same observable behaviour for batches that contain
    no errors and no throws.
  • Throws still reach their original catch. Any existing
    (catch 'lsp-done …) or (catch 'input …) still receives the throw
    with the original value, just after the rest of the batch has been
    dispatched.
  • Errors raised outside the parsing loop are not caught here. The added
    condition-case only catches errors that escape the parsing loop, and
    the inner JSON-parse condition-case is unchanged.
  • lsp-warn is the standard way lsp-mode reports recoverable
    problems. The new framing-error log uses it rather than introducing
    a new diagnostic channel.

Reproducing the bugs before the patch

Bug 2 is the easier of the two to reproduce reliably:

  • Use a server that issues both server-initiated requests
    (window/workDoneProgress/create, workspace/configuration, …) and
    responses to client requests in close succession. ltex-ls-plus does
    this naturally during interactive editing.
  • Enable lsp-completion-enable and use Corfu with corfu-auto t and
    a short corfu-auto-delay.
  • Type for several seconds. Watch the server's status — if it goes
    silent (no diagnostics, no progress notifications), look at the wire
    log: you will find a server-initiated request (often
    window/workDoneProgress/create) for which the client never sent a
    response.
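
One possible init-file fragment for the setup described in these steps. The delay value is an arbitrary example, and turning on lsp-log-io is my assumption about how to capture the wire log; adjust both to taste. The settings are meant to be evaluated before the server is started.

```elisp
;; Sketch of a reproduction setup; the values are examples, not requirements.
(setq lsp-completion-enable t)      ; completion requests go through lsp-mode

(setq corfu-auto t                  ; pop up completions while typing
      corfu-auto-delay 0.05)        ; short delay, so requests fire often

(setq lsp-log-io t)                 ; log client/server traffic for inspection
```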

Bug 1 is harder to reproduce on demand because it depends on a
specific framing-state misalignment. I have observed it as a one-off
event during long sessions; it surfaces in *Messages* as
(error "Unable to find Content-Length header.") followed by the
server going silent. With the patch, the same misalignment is logged
as a framing-error-salvage warning and the messages parsed before the
error are still dispatched.

With both fixes applied, server hangs of the kind described above stop
appearing in my testing.

Origin

This issue surfaced while debugging lsp-ltex-plus (an lsp-mode
client for the ltex-ls-plus grammar/spell server). ltex-ls-plus
issues server-initiated requests interleaved with responses to client
requests during interactive editing, which made the abandon-batch case
happen reliably. To unblock users while a proper fix is discussed
upstream, lsp-ltex-plus has been shipping the patch as a local
override of lsp--create-filter-function since release vX.Y.Z
(https://github.com/alberti42/emacs-ltex-plus/releases). This is
mentioned only as context — the bug is in lsp-mode regardless of
which client is connected, and the fix belongs upstream so other
clients benefit without each one having to ship its own override.

Related work

This is the third of three independent fixes to how lsp-mode handles
messages when there are many requests waiting for responses at the same
time. The other two are:

  • PR 1 — make lsp--parser-on-message classify messages by their
    method field before their id. Today, when a server-initiated
    request happens to use an id that matches a pending client request,
    it is mis-routed as a response. PR 1 fixes that.
  • PR 2 — make lsp-request-while-no-input ignore callbacks that
    arrive after the function has stopped waiting. Today, a late response
    callback can fire after the surrounding (catch 'lsp-done …) has
    already gone, causing a stray no-catch error in unrelated code.
    PR 2 fixes that.
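
To make the "method before id" idea concrete, here is a minimal sketch of how a JSON-RPC message can be classified from its fields alone. It is not the code from PR 1: `my/classify` is a hypothetical function, and plists stand in for decoded JSON objects.

```elisp
(defun my/classify (msg)
  "Classify MSG, a plist standing in for a decoded JSON-RPC message."
  (cond
   ;; Anything with a method comes from the peer's own initiative,
   ;; even if its id happens to match a pending client request.
   ((plist-member msg :method)
    (if (plist-member msg :id) 'request 'notification))
   ;; An id without a method is a reply to something we sent.
   ((plist-member msg :id) 'response)
   (t 'unknown)))

(my/classify '(:id 7 :method "workspace/configuration"))  ; → request
(my/classify '(:id 7 :result nil))                         ; → response
(my/classify '(:method "window/logMessage"))               ; → notification
```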

Each PR addresses a separate problem and is independently useful; they
are split apart to keep review focused. PRs 1 and 2 do not depend on
this one — applying any subset of the three is safe.

@jcs090218 (Member) commented May 11, 2026

Can you rebase this and simply comment changes you made (with GitHub review)? The git diff isn't very useful with large files. 😓 Thank you!

alberti42 force-pushed the fix/lsp--create-filter-function branch 5 times, most recently from 4ed5c31 to 971d2d9 (May 11, 2026 20:41)

@alberti42 (Contributor, Author) commented

@jcs090218 Thanks for looking into this PR!

  • I rebased onto the latest master.
  • I renamed a few variables to be more aligned with the naming conventions in lsp-mode.el; no
    logic change.
  • I rewrote a few code comments to align with the conventions used in lsp-mode.el (as best I could
    judge by grepping through the file and looking at how things were handled).
  • I added more inline code comments explaining the why and the what. I put them in the source rather
    than as GitHub review comments so they stay useful for future readers without requiring going
    through the PR. If you'd also like GitHub review comments alongside, I'm happy to add those too.

> The git diff isn't very useful with large files. 😓

I see the problem: a lot of the noise comes from indentation changes after wrapping the parsing loop
in (condition-case …). Can you try the diff with ?w=1, which hides whitespace-only changes? You
can directly click on the link:

https://github.com/emacs-lsp/lsp-mode/pull/5057/files?w=1

On my screen that hides all the indentation noise and leaves just the real logic. If it still isn't
enough, I'm happy to split the PR into two commits, one with the real changes, one with the
indentation only; but hopefully ?w=1 does the trick.

A single read from the server often delivers multiple framed messages.
Two failure modes could lose messages in the same batch:

1. A framing-parse error (e.g. when JSON parsing of an earlier body
   leaves the framing state inconsistent and the next iteration enters
   header parsing on mid-body bytes) would escape the parsing loop with
   an uncaught `(error "Unable to find Content-Length header.")', losing
   any messages already accumulated in the local list.

2. A `(throw 'lsp-done ...)' or `(throw 'input ...)' from inside a
   message's response handler would terminate `mapc' before subsequent
   messages were dispatched.  These tags are caught higher in the call
   stack (e.g. in `lsp-request-while-no-input'), and the design assumes
   the catch handles the throw locally — but `mapc' is between the
   handler and the catch, and gets killed in transit.

This commit:

- Wraps the framing loop in `condition-case' that salvages already-parsed
  messages before resetting framing state for the next read.
- Replaces `mapc' with a per-message `dolist' that catches `'lsp-done'
  and `'input' throws, queues them, dispatches the remaining messages,
  and re-issues the saved throw after the loop completes.

The original target catch still receives the throw — just delayed until
after the batch is fully drained.
alberti42 force-pushed the fix/lsp--create-filter-function branch from 971d2d9 to 4e523c9 (May 13, 2026 20:32)