Misc. bug: Optimize GPU token gen perf when context depth is passed (llama-bench)

### Name and Version

llama-bench tool,
e.g. prompt processing 0 (PP0), token generation 128 (TG128), context depth 8192 (D8192), 3 iterations (NITER3)

### Operating systems

_No response_

### Which llama.cpp modules do you know to be affected?

_No response_

### Command line

```shell

```

### Problem description & steps to reproduce

See the token-generation graphs showing performance at context depth 8192 (D8192).
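
The issue does not give an exact command line, so the following is only a sketch of how the configuration above might be reproduced. It assumes the shorthand maps onto the `llama-bench` flags `-p` (prompt tokens), `-n` (generated tokens), `-d` (context depth) and `-r` (repetitions). The model path and the `bench_cmd` helper are hypothetical. The helper only prints the command so it can be inspected before running.

```shell
# Hypothetical model path; replace with a real GGUF file.
MODEL="${MODEL:-model.gguf}"

# Assumed mapping of the shorthand used in this issue:
#   PP0 -> -p 0, TG128 -> -n 128, D8192 -> -d 8192, NITER3 -> -r 3
bench_cmd() {
  echo "llama-bench -m $1 -p 0 -n 128 -d 8192 -r 3"
}

# Print the command; it is not executed here.
bench_cmd "$MODEL"
```

Running the same command with `-d 0` in place of `-d 8192` would give a baseline for comparing token generation speed with and without context depth.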

### First Bad Commit

_No response_

### Relevant log output

<details>
<summary>Logs</summary>


```console

```
</details>

Misc. bug: Optimize GPU token gen perf when context depth is passed (llama-bench) #188

Name and Version

Operating systems

Which llama.cpp modules do you know to be affected?

Command line

Problem description & steps to reproduce

First Bad Commit

Relevant log output

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Misc. bug: Optimize GPU token gen perf when context depth is passed (llama-bench) #188

Description

Name and Version

Operating systems

Which llama.cpp modules do you know to be affected?

Command line

Problem description & steps to reproduce

First Bad Commit

Relevant log output

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions