Mirror of https://github.com/onestardao/WFGY.git, synced 2026-04-28 11:40:07 +00:00.
Commit 039454bc54 (parent dcc0515aef): Update exllama.md, 1 file changed, 17 insertions(+), 0 deletions(-).
# ExLLaMA: Guardrails and Fix Patterns
<details>
<summary><strong>🧭 Quick Return to Map</strong></summary>

<br>

> You are in a sub-page of **LocalDeploy_Inference**.
> To reorient, go back here:
>
> - [**LocalDeploy_Inference** — on-prem deployment and model inference](./README.md)
> - [**WFGY Global Fix Map** — main Emergency Room, 300+ structured fixes](../README.md)
> - [**WFGY Problem Map 1.0** — 16 reproducible failure modes](../../README.md)
>
> Think of this page as a desk within a ward.
> If you need the full triage and all prescriptions, return to the Emergency Room lobby.
</details>

ExLLaMA (along with its successors ExLLaMA2 and ExLLaMA-HF) is a highly optimized CUDA inference backend used under **TextGen WebUI** and in custom pipelines.
It can run very large models (65B+) on limited VRAM, but it often becomes unstable when models are sharded, quantized, or paired with retrieval layers.
This guide stabilizes ExLLaMA with structural guardrails.
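Before applying guardrails, it helps to sanity-check whether a model should fit at all; many "instability" reports are simply VRAM exhaustion. The sketch below is illustrative back-of-envelope arithmetic, not part of ExLLaMA's API; the `overhead` factor is an assumption covering KV cache, activations, and fragmentation, and should be tuned for your context length.

```python
def weight_vram_gb(params_billions: float, bits_per_weight: float,
                   overhead: float = 1.2) -> float:
    """Rough VRAM estimate (in GB) for a quantized model's weights.

    overhead (assumption): headroom for KV cache, activations,
    and memory fragmentation; raise it for long contexts.
    """
    bytes_per_weight = bits_per_weight / 8.0
    return params_billions * bytes_per_weight * overhead

# A 65B model at 4-bit quantization lands near 39 GB (fits on two
# 24 GB cards), while fp16 weights alone would need roughly 156 GB.
print(round(weight_vram_gb(65, 4), 1))
print(round(weight_vram_gb(65, 16), 1))
```

If the estimate exceeds your total VRAM even before retrieval layers are added, fix the budget first; no guardrail below will rescue an out-of-memory configuration.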