vocab : refactor tokenizer to reduce init overhead (#9449)

* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci


* remove unused files

* remove unused fields to avoid unused-field build error

* avoid linker symbol error

* Update src/llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Zhenwei Jin authored on 2024-09-28 20:10:58 +08:00, committed by GitHub
parent 9a913110cf
commit 6102037bbb
5 changed files with 238 additions and 141 deletions

src/llama.cpp

@@ -6464,6 +6464,8 @@ static void llm_load_vocab(
     }
     GGML_ASSERT(vocab.id_to_token.size() == vocab.token_to_id.size());
 
+    vocab.init_tokenizer();
+
     // determine the newline token: LLaMA "<0x0A>" == 10 == '\n', Falcon 193 == '\n'
     if (vocab.type == LLAMA_VOCAB_TYPE_SPM) {
         // For Fill-In-the-Middle (FIM)/infill models which were converted
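The hunk above shows the core of the change: the tokenizer is now constructed once, when the vocabulary is loaded, rather than on every tokenize call. Below is a minimal C++ sketch of that init-once pattern; only `vocab.init_tokenizer()` comes from the diff, while `llm_tokenizer`, `llm_tokenizer_spm`, the vocab fields, and the `tokenize()` helper are illustrative assumptions, not the actual llama.cpp definitions.

```cpp
// Sketch of the init-once tokenizer pattern (assumed names, not llama.cpp's).
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

struct llm_tokenizer {                  // state that depends only on the vocab
    virtual ~llm_tokenizer() = default;
};

struct llm_tokenizer_spm : llm_tokenizer {
    explicit llm_tokenizer_spm(const std::unordered_map<std::string, int> & token_to_id) {
        // expensive, vocab-dependent setup (merge tables, tries, ...) done once
        (void) token_to_id;
    }
};

struct llama_vocab_sketch {
    std::unordered_map<std::string, int> token_to_id;
    std::vector<std::string>             id_to_token;
    std::unique_ptr<llm_tokenizer>       tokenizer;   // built once, reused per call

    void init_tokenizer() {
        // called once after the token tables are filled (cf. llm_load_vocab)
        tokenizer = std::make_unique<llm_tokenizer_spm>(token_to_id);
    }
};

// Per-call tokenization borrows the prebuilt tokenizer instead of constructing
// a fresh one on every invocation, which is where the init overhead used to go.
static std::vector<int> tokenize(const llama_vocab_sketch & vocab, const std::string & text) {
    assert(vocab.tokenizer && "init_tokenizer() must run during model load");
    (void) text;
    return {};                           // real tokenization omitted in this sketch
}

int main() {
    llama_vocab_sketch vocab;
    vocab.token_to_id = {{"<s>", 1}};
    vocab.id_to_token = {"<unk>", "<s>"};
    vocab.init_tokenizer();              // pay the setup cost once, at load time
    tokenize(vocab, "hello world");
}
```

This only illustrates the pattern; the real per-tokenizer types and the member that `init_tokenizer()` fills live in the changed files of this commit.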