Local LLM-assisted text completion.
![image](https://private-user-images.githubusercontent.com/1991296/380711734-a950e38c-3b3f-4c46-94fe-0d6e0f790fc6.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg5OTc1OTUsIm5iZiI6MTczODk5NzI5NSwicGF0aCI6Ii8xOTkxMjk2LzM4MDcxMTczNC1hOTUwZTM4Yy0zYjNmLTRjNDYtOTRmZS0wZDZlMGY3OTBmYzYucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDIwOCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAyMDhUMDY0ODE1WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NDViZmMzNzM4ZGI3NWIwNzFkNDI2MDQ3MTEyYjJhZjUwMDQ5NDA4Yzc0NmM1YjRhOTA1YzVkNzIyOWIxODE1ZCZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.7izNCPbyQff9V28JxdhcARQh0T2__pifmgHi4yCLcig)
- Auto-suggest on cursor movement in `Insert` mode
- Toggle the suggestion manually by pressing `Ctrl+F`
- Accept a suggestion with `Tab`
- Accept the first line of a suggestion with `Shift+Tab`
- Control max text generation time
- Configure the scope of context around the cursor
- Ring context with chunks from open and edited files and yanked text
- Supports very large contexts even on low-end hardware via smart context reuse
- Display performance stats
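The "ring context" idea above can be sketched roughly as a bounded buffer of text chunks with eviction of the oldest entry. This is a simplified illustration only, not llama.vim's actual implementation (the plugin's chunk gathering, deduplication, and server-side cache reuse are more involved):

```python
from collections import deque

class RingContext:
    """Toy sketch of a ring buffer of extra-context chunks
    (from open/edited files and yanks). When the buffer is full,
    the oldest chunk is evicted."""

    def __init__(self, max_chunks=64):
        self.chunks = deque(maxlen=max_chunks)
        self.evicted = 0  # how many chunks fell off the ring this session

    def add(self, chunk):
        if chunk in self.chunks:  # skip exact duplicates
            return
        if len(self.chunks) == self.chunks.maxlen:
            self.evicted += 1     # oldest chunk is about to be dropped
        self.chunks.append(chunk)

    def extra_context(self):
        # concatenated chunks sent alongside the FIM request
        return "\n".join(self.chunks)

ring = RingContext(max_chunks=2)
ring.add("chunk A")
ring.add("chunk B")
ring.add("chunk C")  # evicts "chunk A"
```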
With vim-plug:

```vim
Plug 'ggml-org/llama.vim'
```

With Vundle:

```sh
cd ~/.vim/bundle
git clone https://github.com/ggml-org/llama.vim
```

Then add `Plugin 'llama.vim'` to your `.vimrc` in the `vundle#begin()` section.

With lazy.nvim:

```lua
{
    'ggml-org/llama.vim',
}
```
You can customize llama.vim by setting the `g:llama_config` variable.

Examples:

- Disable the inline info:

  ```vim
  " put before llama.vim loads
  let g:llama_config = { 'show_info': 0 }
  ```

- Same thing, but set the option directly:

  ```vim
  let g:llama_config.show_info = v:false
  ```

- Disable auto FIM completion with lazy.nvim:

  ```lua
  {
      'ggml-org/llama.vim',
      init = function()
          vim.g.llama_config = {
              auto_fim = false,
          }
      end,
  }
  ```

Please refer to `:help llama_config` or the source for the full list of options.
The plugin requires a llama.cpp server instance to be running at `g:llama_config.endpoint`.

Install with Homebrew:

```sh
brew install llama.cpp
```

Alternatively, build from source or use the latest binaries: https://github.com/ggerganov/llama.cpp/releases
Here are the recommended settings, depending on the amount of VRAM that you have:

- More than 16GB VRAM:

  ```sh
  llama-server \
      -hf ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF \
      --port 8012 -ngl 99 -fa -ub 1024 -b 1024 \
      --ctx-size 0 --cache-reuse 256
  ```

- Less than 16GB VRAM:

  ```sh
  llama-server \
      -hf ggml-org/Qwen2.5-Coder-3B-Q8_0-GGUF \
      --port 8012 -ngl 99 -fa -ub 1024 -b 1024 \
      --ctx-size 0 --cache-reuse 256
  ```

- Less than 8GB VRAM:

  ```sh
  llama-server \
      -hf ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF \
      --port 8012 -ngl 99 -fa -ub 1024 -b 1024 \
      --ctx-size 0 --cache-reuse 256
  ```
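The VRAM-to-model mapping above can be summarized with a small helper. This is a hypothetical illustration of the recommendations, not part of the plugin; the exact boundary behavior (e.g. exactly 16GB) is a judgment call:

```python
def pick_model(vram_gb: float) -> str:
    """Map available VRAM (GB) to the recommended Qwen2.5-Coder
    model repo from the table above (illustrative helper only)."""
    if vram_gb > 16:
        return "ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF"
    if vram_gb > 8:
        return "ggml-org/Qwen2.5-Coder-3B-Q8_0-GGUF"
    return "ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF"

print(pick_model(24))  # the 7B variant for a 24GB GPU
```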
Use `:help llama` for more details.
The plugin requires FIM-compatible models: HF collection
![image](https://private-user-images.githubusercontent.com/1991296/376671627-8f5748b3-183a-4b7f-90e1-9148f0a58883.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg5OTc1OTUsIm5iZiI6MTczODk5NzI5NSwicGF0aCI6Ii8xOTkxMjk2LzM3NjY3MTYyNy04ZjU3NDhiMy0xODNhLTRiN2YtOTBlMS05MTQ4ZjBhNTg4ODMucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDIwOCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAyMDhUMDY0ODE1WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9ZDI1NWNhMWZjZWVkMDI2ZTcwM2MwMjEyMTcxMmZmYTdlNWMwOTRhMWQ0ZTY5ZWQxNzY1YmQ1MmMzYjkwZjc1ZCZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.nN6RJgsPEic6tMwyW7zf-ZW03Mg-3rDqLx5lq8GG490)
![image](https://private-user-images.githubusercontent.com/1991296/378362882-0ccb93c6-c5c5-4376-a5a3-cc99fafc5eef.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg5OTc1OTUsIm5iZiI6MTczODk5NzI5NSwicGF0aCI6Ii8xOTkxMjk2LzM3ODM2Mjg4Mi0wY2NiOTNjNi1jNWM1LTQzNzYtYTVhMy1jYzk5ZmFmYzVlZWYucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDIwOCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAyMDhUMDY0ODE1WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9ZTE4NDRjOGJhNmYwYmMwNzczOTAyZmI1Yzg0OGM4M2IzZjIwM2ExZWFhNDQwYjdkMzVjY2U0MTFjNmQ5NTNiYiZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.1SAjxsA0RoIoUs0K_LYPDlvK68V0OcnmuvFHLhvsDFM)
The orange text is the generated suggestion. The green text contains performance stats for the FIM request: the currently used context is `15186` tokens and the maximum is `32768`. There are `30` chunks in the ring buffer with extra context (out of `64`). So far, `1` chunk has been evicted in the current session and there are `0` chunks in queue. The newly computed prompt tokens for this request were `260` and the generated tokens were `24`. It took `1245 ms` to generate this suggestion after entering the letter `c` on the current line.
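Two derived numbers help interpret stats like these. Using the sample values from the screenshot above (static numbers, not live data):

```python
# Sample stats from the screenshot above
ctx_used, ctx_max = 15186, 32768   # current vs. maximum context tokens
gen_tokens, gen_ms = 24, 1245      # generated tokens and elapsed time

fill = ctx_used / ctx_max              # fraction of the context window in use
speed = gen_tokens / (gen_ms / 1000)   # generation speed in tokens/s

print(f"context fill: {fill:.0%}")   # context fill: 46%
print(f"speed: {speed:.1f} tok/s")   # speed: 19.3 tok/s
```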
llama.vim-0-lq.mp4
Demonstrates that the global context is accumulated and maintained across different files and showcases the overall latency when working in a large codebase.
The plugin aims to be simple and lightweight while still providing high-quality, performant local FIM completions, even on consumer-grade hardware. Read more about how this is achieved in the following links:
- Initial implementation and technical description: ggerganov/llama.cpp#9787
- Classic Vim support: ggerganov/llama.cpp#9995