Mirror of https://github.com/LostRuins/koboldcpp.git (synced 2026-04-28 03:30:20 +00:00)
Repository contents:
- .devops
- .github/workflows
- .dockerignore
- .gitignore
- CMakeLists.txt
- convert-pth-to-ggml.py
- download-pth.py
- expose.cpp
- flake.lock
- flake.nix
- ggml.c
- ggml.h
- LICENSE
- llama_for_kobold.py
- llamalib.dll
- main.cpp
- main.exe
- Makefile
- quantize.cpp
- quantize.exe
- quantize.sh
- README.md
- utils.cpp
- utils.h
llama-for-kobold
A hacky little script from Concedo that exposes llama.cpp function bindings, allowing it to be used via a simulated Kobold API endpoint.
It's not very usable, as there is a fundamental flaw with llama.cpp that causes generation delay to scale linearly with the original prompt length. Nobody knows why or really cares much, so I'm just going to publish whatever I have at this point.
If you care, please contribute to this discussion which, if resolved, will actually make this viable.
Considerations
- Don't want to use pybind11 due to dependencies on MSVC
- ZERO or MINIMAL changes to main.cpp - do not move their function declarations elsewhere!
- Leave main.cpp UNTOUCHED. We want to be able to update the repo and pull any changes automatically.
- No dynamic memory allocation! Set up structs with FIXED (known) shapes and sizes for ALL output fields. Python will ALWAYS provide the memory, we just write to it.
- No external libraries or dependencies. That means no Flask, Pybind and whatever. All You Need Is Python.
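The "Python provides the memory" rule can be sketched with nothing but the standard-library ctypes module. This is a rough illustration, not the layout actually used by llama_for_kobold.py or expose.cpp: the struct name, its fields, the buffer size, and the stand-in function are all hypothetical, and the native call is faked in Python so the snippet runs without llamalib.dll.

```python
import ctypes

MAX_TEXT_BYTES = 4096  # hypothetical fixed capacity for the generated text


class GenerationOutputs(ctypes.Structure):
    """Hypothetical output struct: every field has a fixed, known size."""
    _fields_ = [
        ("status", ctypes.c_int),
        ("text", ctypes.c_char * MAX_TEXT_BYTES),
    ]


def fake_native_generate(prompt: bytes, out: GenerationOutputs) -> None:
    """Stand-in for the native side, which would receive a pointer to `out`.

    With a real library this would be a call through ctypes.CDLL, passing
    ctypes.byref(out); the export name is not assumed here.
    """
    reply = b"echo: " + prompt
    # Truncate so the text plus its NUL terminator always fits the buffer.
    out.text = reply[: MAX_TEXT_BYTES - 1]
    out.status = 1


# Python allocates and owns the struct; the callee only writes into it.
outputs = GenerationOutputs()
fake_native_generate(b"Hello", outputs)
result_text = outputs.text.decode("utf-8", errors="replace")
print(outputs.status, result_text)
```

Because the struct size is known up front, nothing has to be freed across the language boundary, at the cost of a hard cap on output length.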
Usage
- Windows binaries are provided in the form of llamalib.dll, but if you feel worried, go ahead and rebuild it yourself.
- Weights are not included. You can use the llama.cpp quantize.exe to generate them from your official weight files (or download them from...places).
- To run, simply clone the repo and run llama_for_kobold.py [ggml_quant_model.bin] [port], and then connect with Kobold or Kobold Lite.
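
For a feel of what a simulated endpoint built only on the Python standard library might look like, here is a minimal sketch. The /api/v1/generate route and the {"results": [{"text": ...}]} response shape are assumptions modeled on KoboldAI's public API, not taken from llama_for_kobold.py, and generate_text is a hypothetical placeholder for the model call.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def generate_text(prompt: str, max_length: int) -> str:
    """Hypothetical placeholder for the model call: echoes part of the prompt."""
    return prompt[:max_length]


class KoboldLikeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/api/v1/generate":  # assumed route
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        text = generate_text(payload.get("prompt", ""),
                             int(payload.get("max_length", 80)))
        body = json.dumps({"results": [{"text": text}]}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def demo() -> dict:
    """Start the server on a free local port and send it one request."""
    server = HTTPServer(("127.0.0.1", 0), KoboldLikeHandler)  # port 0: OS picks
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        request = urllib.request.Request(
            f"http://127.0.0.1:{server.server_port}/api/v1/generate",
            data=json.dumps({"prompt": "Hello there", "max_length": 5}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        # Ignore any proxy settings so the request stays on localhost.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(request, timeout=5) as response:
            return json.loads(response.read())
    finally:
        server.shutdown()
        server.server_close()


reply = demo()
print(reply)
```

A real client such as Kobold Lite would take the place of the urllib request, pointing at whatever port was passed on the command line.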