
Fix dead surface hook#3404

Merged
YaLTeR merged 4 commits into
niri-wm:mainfrom
cmeissl:fix/dead_surface_hook
Feb 10, 2026

Conversation

@cmeissl
Contributor

@cmeissl cmeissl commented Feb 6, 2026

So, it looks like niri ends up adding dmabuf hooks for already destroyed surfaces. I haven't looked too much into why it does this, but my guess is that toplevel_destroyed somehow gets called after the surface is destroyed. The toplevel_destroyed handler tries to install the hook, which inserts the surface into the tracking list. The surface might then never get removed again, which keeps a reference to the last buffer/dmabuf alive.
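
To illustrate the failure mode described above, here is a minimal sketch of a hook table that refuses to register dead surfaces. All names (`Surface`, `HookTable`, `replay_surface_first`) are hypothetical stand-ins using only the standard library; this is not niri's or Smithay's actual code, just a model of the idea under the assumption that the table holds a strong handle per surface.

```rust
use std::cell::Cell;
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical stand-in for a Wayland surface handle (not Smithay's `WlSurface`).
#[derive(Debug)]
pub struct Surface {
    pub id: u32,
    alive: Cell<bool>,
}

impl Surface {
    pub fn new(id: u32) -> Rc<Self> {
        Rc::new(Self { id, alive: Cell::new(true) })
    }

    pub fn is_alive(&self) -> bool {
        self.alive.get()
    }

    /// Models the protocol object being destroyed.
    pub fn mark_destroyed(&self) {
        self.alive.set(false);
    }
}

/// Hypothetical hook table: surface id -> (strong handle, hook id).
#[derive(Default)]
pub struct HookTable {
    hooks: HashMap<u32, (Rc<Surface>, u64)>,
    next_hook: u64,
}

impl HookTable {
    /// Installs a hook unless the surface is already dead.
    /// Returns `false` when the request was rejected.
    pub fn add_hook(&mut self, surface: &Rc<Surface>) -> bool {
        if !surface.is_alive() {
            return false;
        }
        self.next_hook += 1;
        self.hooks.insert(surface.id, (Rc::clone(surface), self.next_hook));
        true
    }

    /// Called from the surface destruction path.
    /// Returns whether a hook was actually present.
    pub fn remove_hook(&mut self, surface: &Surface) -> bool {
        self.hooks.remove(&surface.id).is_some()
    }

    pub fn len(&self) -> usize {
        self.hooks.len()
    }
}

/// Replays the problematic ordering: the surface is destroyed first,
/// then a late toplevel handler tries to re-install the hook.
pub fn replay_surface_first() -> usize {
    let mut table = HookTable::default();
    let surface = Surface::new(1);
    table.add_hook(&surface);

    // Surface destruction runs first and removes the hook.
    surface.mark_destroyed();
    table.remove_hook(&surface);

    // The late handler's request is rejected, so nothing is left behind.
    table.add_hook(&surface);
    table.len()
}
```

Without the liveness check in `add_hook`, the last call in `replay_surface_first` would leave one entry in the table with nobody left to remove it.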

I will also open a PR in smithay to prevent this, but that will be a breaking change and might take a bit longer.
So I am opening this PR here first to see if it really fixes the issue and maybe get a short term fix.

fixes #1869

(Draft PR in smithay: Smithay/smithay#1921)

@YaLTeR
Member

YaLTeR commented Feb 7, 2026

With this PR, I'm consistently getting pairs of:

ERROR niri::handlers::compositor: tried to remove dmabuf pre-commit hook but there was none
ERROR niri::handlers::compositor: tried to add dmabuf pre-commit hook for a dead surface

by running gtk4-demo in a terminal and then Ctrl-C-ing it. This suggests that in this case WlSurface::destroyed() runs before toplevel_destroyed(). Is this intended on the Smithay side? When I wrote the code, I definitely did not intend for this ordering to be possible. If it is intended, I'll need to think about whether anything else needs adjusting.

@YaLTeR
Member

YaLTeR commented Feb 7, 2026

Note that I also found some destruction order weirdness when working on wlr-screencopy: https://github.com/YaLTeR/niri/blob/549148d27779d024255a84535b42b947f1c2a113/src/protocols/screencopy.rs#L445-L472

I don't know if this is relevant here though, or if toplevel_destroyed() just generally is expected to sometimes run after WlSurface::destroyed()

@cmeissl
Contributor Author

cmeissl commented Feb 7, 2026

> With this PR, I'm consistently getting pairs of:
>
> ERROR niri::handlers::compositor: tried to remove dmabuf pre-commit hook but there was none
> ERROR niri::handlers::compositor: tried to add dmabuf pre-commit hook for a dead surface
>
> by running gtk4-demo in a terminal and then Ctrl-C-ing it. This suggests that in this case WlSurface::destroyed() runs before toplevel_destroyed(). Is this intended on the Smithay side? When I wrote the code, I definitely did not intend for this ordering to be possible. If it is intended, I'll need to think about whether anything else needs adjusting.

I just checked with kitty and the WAYLAND_DEBUG log looks as expected: the toplevel is destroyed before the surface. So, yeah, this looks fishy.

@cmeissl
Contributor Author

cmeissl commented Feb 7, 2026

Hm, dispatch_all_clients does set the HANDLE, so it looks like it might be possible that some destructors land in pending while others are delayed. This sounds like it could mess with the destruction order. I will check whether putting all of them in the pending list fixes the order.

@cmeissl
Contributor Author

cmeissl commented Feb 8, 2026

Okay, after a lot of debugging it looks like some clients destroy their resources in the correct order but do not wait for the destruction to complete. On client disconnect, libwayland then seems to call destroy in id order, so the surface destructor is called before the toplevel destructor. The same seems to be true when a client is disconnected because of a protocol error.
I will take another look, but if this is correct I am not sure what we can do in smithay or wayland-rs.
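
The ordering effect described above can be modeled with a small sketch. It assumes, as reported in this comment, that implicit cleanup walks the client's objects in ascending id order. The `Kind` and `Resource` types and the concrete ids are hypothetical and chosen for illustration only; they are not libwayland or Smithay types.

```rust
/// Hypothetical object kinds for a single toplevel window.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Surface,
    XdgSurface,
    Toplevel,
}

/// Hypothetical model of one live protocol object owned by a client.
#[derive(Debug, Clone, Copy)]
pub struct Resource {
    pub id: u32,
    pub kind: Kind,
}

/// A client typically creates the wl_surface first, then the xdg_surface,
/// then the xdg_toplevel, so the ids ascend in that order.
/// The ids themselves are illustrative.
pub fn typical_client() -> Vec<Resource> {
    vec![
        Resource { id: 3, kind: Kind::Surface },
        Resource { id: 4, kind: Kind::XdgSurface },
        Resource { id: 5, kind: Kind::Toplevel },
    ]
}

/// Order in which a well-behaved client destroys its objects explicitly:
/// role objects first, the underlying surface last.
pub fn explicit_destroy_order() -> Vec<Kind> {
    vec![Kind::Toplevel, Kind::XdgSurface, Kind::Surface]
}

/// Order in which destructors run during implicit cleanup on disconnect,
/// assuming ascending object id as described in the discussion.
pub fn implicit_cleanup_order(mut live: Vec<Resource>) -> Vec<Kind> {
    live.sort_by_key(|r| r.id);
    live.into_iter().map(|r| r.kind).collect()
}
```

Under these assumptions, `implicit_cleanup_order(typical_client())` yields the surface before the toplevel, which is the reverse of `explicit_destroy_order()`. Handlers therefore need to tolerate both orders.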

@YaLTeR
Member

YaLTeR commented Feb 8, 2026

Thanks. If this is the case, then let's clearly document it in Smithay/wayland-rs (e.g. in toplevel_destroyed() docs say that it will sometimes be called after WlSurface::destroyed(), etc.)

@cmeissl
Contributor Author

cmeissl commented Feb 8, 2026

> Thanks. If this is the case, then let's clearly document it in Smithay/wayland-rs (e.g. in toplevel_destroyed() docs say that it will sometimes be called after WlSurface::destroyed(), etc.)

I fear there is not much we can do, and it makes sense that there can't be any guarantee of destruction order when the client just exits. (Technically we could try to delay destruction in a way that keeps protocol order, but that sounds like a nightmare to get right.) Anyway, I opened a PR to at least warn about this in the handlers where smithay forwards the destruction callbacks: Smithay/smithay#1924

@cmeissl
Contributor Author

cmeissl commented Feb 8, 2026

@YaLTeR I looked through the code and it looks like surface/toplevel destruction should be fine with these changes.
I checked with kitty and glmark2-es2-wayland and the dmabuf count seems to be stable now.

@YaLTeR
Member

YaLTeR commented Feb 8, 2026

What was actually causing the VRAM leak? So we have self.niri.dmabuf_pre_commit_hook.insert(surface, hook) that is never cleared out. The hook closure doesn't seem to reference anything. Is it the strong reference to WlSurface (for the hashmap key) that kept some userdata alive with the dmabufs?

@cmeissl
Contributor Author

cmeissl commented Feb 8, 2026

> What was actually causing the VRAM leak? So we have self.niri.dmabuf_pre_commit_hook.insert(surface, hook) that is never cleared out. The hook closure doesn't seem to reference anything. Is it the strong reference to WlSurface (for the hashmap key) that kept some userdata alive with the dmabufs?

Yes, afaict the strong reference to the surface is what keeps the GPU resource alive. The surface still has a reference to the texture according to my valgrind logs. Not using the dmabuf hooks (actually, not inserting the surface into the hook list) or using weak surface references also fixes the issue (at least the issue I was able to reproduce).
I actually implemented the weak surface approach first, but did it this way to keep the logic in niri as is.
Using weak references also has the downside that you would have to constantly check for dead surfaces.
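
The trade-off between strong and weak references can be sketched with the standard library's `Rc` and `Weak`. The `Texture` and `Surface` types here are hypothetical placeholders, not Smithay types; dropping a `Texture` stands in for releasing the GPU memory.

```rust
use std::rc::{Rc, Weak};

/// Hypothetical GPU-backed buffer; dropping it stands in for freeing VRAM.
pub struct Texture {
    pub bytes: usize,
}

/// Hypothetical surface that owns its last attached buffer.
pub struct Surface {
    pub last_buffer: Texture,
}

fn make_surface() -> Rc<Surface> {
    Rc::new(Surface { last_buffer: Texture { bytes: 4096 } })
}

/// Strong tracking: the list itself keeps the surface, and therefore its
/// texture, alive even after every other owner is gone.
pub fn strong_list_keeps_alive() -> bool {
    let surface = make_surface();
    let probe = Rc::downgrade(&surface);
    let tracked: Vec<Rc<Surface>> = vec![Rc::clone(&surface)];

    drop(surface); // every owner other than the tracking list is gone
    let still_alive = probe.upgrade().is_some();

    drop(tracked);
    still_alive
}

/// Weak tracking: entries do not extend the lifetime, but stale entries
/// stay in the list until someone prunes them.
pub fn weak_list_len_before_and_after_prune() -> (usize, usize) {
    let surface = make_surface();
    let mut tracked: Vec<Weak<Surface>> = vec![Rc::downgrade(&surface)];

    drop(surface); // the surface and its texture are freed right here
    let before = tracked.len();

    // This is the "constantly check for dead surfaces" cost.
    tracked.retain(|weak| weak.upgrade().is_some());
    (before, tracked.len())
}
```

The strong list matches the leak discussed here, while the weak list avoids it at the price of a pruning pass. Removing the entry on destruction and rejecting dead surfaces keeps the strong-reference design without that pass.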

@cmeissl
Contributor Author

cmeissl commented Feb 8, 2026

I have also found that it might not be consistent: sometimes the destruction handlers run in the correct order.
imo this is a race in the client when it does not wait for destruction. alacritty, for example, does a roundtrip before exiting according to the wayland debug log, while kitty seems to just exit after calling destroy on the resources. It looks like the latter is racy and can result in libwayland-server doing the implicit clean-up. I also reproduced the same using glmark2-es2-wayland, so it does not seem to affect only a single client.

@YaLTeR
Member

YaLTeR commented Feb 8, 2026

Thanks, I'll merge it a bit later. I'll rearrange some conditionals

cmeissl and others added 4 commits February 10, 2026 08:41
remove pre-commit hook when surface is destroyed
this re-uses the already existing remove_default_dmabuf_pre_commit_hook
during surface destruction instead of just removing the hook from the
stored list of hooks.

do not add dmabuf pre-commit hook for destroyed surfaces
this prevents surfaces getting stored indefinitely in case
some logic tries to add the hook for an already
destroyed surface.

align surface/toplevel destruction order for client destruction
resource destruction has undefined order in case
the client does not explicitly destroy the resources
and wait for destruction to complete.
the same applies for clients exiting unexpectedly.

rearrange some things
@YaLTeR YaLTeR force-pushed the fix/dead_surface_hook branch from f457499 to 11134b9 Compare February 10, 2026 05:46
@YaLTeR YaLTeR enabled auto-merge (squash) February 10, 2026 05:47
@YaLTeR YaLTeR merged commit 6d5c5f1 into niri-wm:main Feb 10, 2026
13 checks passed
@YaLTeR
Member

YaLTeR commented Feb 10, 2026

Thanks

@cmeissl cmeissl deleted the fix/dead_surface_hook branch February 10, 2026 07:47
Roodnt pushed a commit to Roodnt/niri that referenced this pull request Mar 3, 2026
* remove pre-commit hook when surface is destroyed

this re-uses the already existing remove_default_dmabuf_pre_commit_hook
during surface destruction instead of just removing the hook from the
stored list of hooks.

* do not add dmabuf pre-commit hook for destroyed surfaces

this prevents surfaces getting stored indefinitely in case
some logic tries to add the hook for an already
destroyed surface.

* align surface/toplevel destruction order for client destruction

resource destruction has undefined order in case
the client does not explicitly destroy the resources
and wait for destruction to complete.
the same applies for clients exiting unexpectedly.

* rearrange some things

---------

Co-authored-by: Ivan Molodetskikh <yalterz@gmail.com>

Development

Successfully merging this pull request may close these issues.

Video memory not released after closing certain apps

2 participants