Mirror of https://github.com/kvcache-ai/ktransformers.git
Fix the following runtime error with the --no-use_cuda_graph option:

Traceback (most recent call last):
  File "/home/aubrey/miniforge3/envs/kt/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/aubrey/miniforge3/envs/kt/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/aubrey/miniforge3/envs/kt/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 282, in run_engine
    engine.loop()
  File "/home/aubrey/miniforge3/envs/kt/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 234, in loop
    self.model_runner.run(self.batch, self.query_manager)
  File "/home/aubrey/miniforge3/envs/kt/lib/python3.11/site-packages/ktransformers/server/balance_serve/inference/model_runner.py", line 220, in run
    self.output.logits[0] = self.output.logits[0][self.input[cuda_graph_idx].minibatch.logits_start]
                            ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
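The faulting line in model_runner.py gathers rows of the logits tensor with the `logits_start` index tensor, and, as the CUDA message notes, the illegal access is only reported later at an unrelated API call. Below is a minimal sketch of the kind of defensive indexing that surfaces such a failure at its source; `safe_gather_logits` and its signature are hypothetical illustrations, not the project's actual fix.

```python
import torch

def safe_gather_logits(logits: torch.Tensor, logits_start: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: gather rows of `logits` at `logits_start` with eager checks."""
    # Indexing a CUDA tensor with indices on the wrong device, or out of range,
    # typically shows up later as an asynchronous "illegal memory access".
    idx = logits_start.to(device=logits.device, dtype=torch.long)
    if idx.numel() > 0:
        lo, hi = int(idx.min()), int(idx.max())
        if lo < 0 or hi >= logits.shape[0]:
            raise IndexError(
                f"logits_start out of range: [{lo}, {hi}] for {logits.shape[0]} rows"
            )
    return logits[idx]
```

When chasing errors like this one, running once with the environment variable CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the traceback points at the launch that actually faulted rather than a later call.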
__init__.py
attention.py
balance_serve_attention.py
base_operator.py
cpuinfer.py
dynamic_attention.py
experts.py
flashinfer_batch_prefill_wrapper.py
flashinfer_wrapper.py
gate.py
layernorm.py
linear.py
mlp.py
models.py
RoPE.py
triton_attention.py
triton_attention_prefill.py