Insights: ggerganov/llama.cpp
September 15, 2024 – September 22, 2024
Overview
32 Releases published by 1 person
- b3759 published Sep 15, 2024
- b3760 published Sep 15, 2024
- b3761 published Sep 15, 2024
- b3763 published Sep 16, 2024
- b3764 published Sep 16, 2024
- b3766 published Sep 16, 2024
- b3765 published Sep 16, 2024
- b3767 published Sep 16, 2024
- b3770 published Sep 16, 2024
- b3771 published Sep 16, 2024
- b3772 published Sep 16, 2024
- b3774 published Sep 17, 2024
- b3775 published Sep 17, 2024
- b3777 published Sep 17, 2024
- b3778 published Sep 17, 2024
- b3779 published Sep 17, 2024
- b3781 published Sep 18, 2024
- b3782 published Sep 18, 2024
- b3783 published Sep 18, 2024
- b3785 published Sep 18, 2024
- b3786 published Sep 19, 2024
- b3787 published Sep 19, 2024
- b3788 published Sep 20, 2024
- b3789 published Sep 20, 2024
- b3790 published Sep 20, 2024
- b3795 published Sep 20, 2024
- b3797 published Sep 21, 2024
- b3798 published Sep 21, 2024
- b3799 published Sep 21, 2024
- b3800 published Sep 22, 2024
- b3801 published Sep 22, 2024
- b3802 published Sep 22, 2024
40 Pull requests merged by 21 people
- CUDA: enable Gemma FA for HIP/Pascal (#9581) merged Sep 22, 2024
- llama: remove redundant loop when constructing ubatch (#9574) merged Sep 22, 2024
- RWKV v6: RWKV_WKV op CUDA implementation (#9454) merged Sep 22, 2024
- ggml-alloc : fix list of allocated tensors with GGML_ALLOCATOR_DEBUG (#9573) merged Sep 21, 2024
- Update CUDA graph on scale change plus clear nodes/params (#9550) merged Sep 21, 2024
- CI: Provide prebuilt windows binary for hip (#9467) merged Sep 21, 2024
- quantize : improve type name parsing (#9570) merged Sep 20, 2024
- sync : ggml (#9567) merged Sep 20, 2024
- CUDA: fix sum.cu compilation for CUDA < 11.7 (#9562) merged Sep 20, 2024
- examples : flush log upon ctrl+c (#9559) merged Sep 20, 2024
- Perplexity input data should not be unescaped (#9548) merged Sep 20, 2024
- server : clean-up completed tasks from waiting list (#9531) merged Sep 19, 2024
- Imatrix input data should not be unescaped (#9543) merged Sep 19, 2024
- ggml : fix n_threads_cur initialization with one thread (#9538) merged Sep 18, 2024
- scripts : verify py deps at the start of compare (#9520) merged Sep 18, 2024
- llama : use reserve/emplace_back in sampler_sample (#9534) merged Sep 18, 2024
- bugfix: structured output response_format does not match openai (#9527) merged Sep 18, 2024
- server : fix OpenSSL build by removing invalid LOG_INFO references (#9529) merged Sep 18, 2024
- [SYCL] set context default value to avoid memory issue, update guide (#9476) merged Sep 18, 2024
- llama-bench: correct argument parsing error message (#9524) merged Sep 17, 2024
- add env variable for parallel (#9513) merged Sep 17, 2024
- Fixed n vocab (#9511) merged Sep 17, 2024
- threadpool: skip polling for unused threads (#9461) merged Sep 17, 2024
- llama.cpp: Add a missing header for cpp23 (#9508) merged Sep 17, 2024
- IBM Granite Architecture (#9412) merged Sep 17, 2024
- llama: public llama_n_head (#9512) merged Sep 17, 2024
- ggml : move common CPU backend impl to new header (#9509) merged Sep 16, 2024
- llama : rename n_embed to n_embd in rwkv6_time_mix (#9504) merged Sep 16, 2024
- ggml: link MATH_LIBRARY not by its full path (#9339) merged Sep 16, 2024
- convert : identify missing model files (#9397) merged Sep 16, 2024
- cmake : do not hide GGML options + rename option (#9465) merged Sep 16, 2024
- IQ4_NL sgemm + Q4_0 AVX optimization (#9422) merged Sep 16, 2024
- Implement OLMoE architecture (#9462) merged Sep 16, 2024
- Support MiniCPM3. (#9322) merged Sep 16, 2024
- main: option to disable context shift (#9484) merged Sep 16, 2024
- metal : handle zero-sized allocs (#9466) merged Sep 16, 2024
- nix: update flake.lock (#9488) merged Sep 16, 2024
- common : reimplement logging (#9418) merged Sep 15, 2024
- gguf-split : add basic checks (#9499) merged Sep 15, 2024
- CMake: correct order of sycl flags (#9497) merged Sep 15, 2024
12 Pull requests opened by 11 people
- llama : add reranking support (#9510) opened Sep 16, 2024
- docs: update server streaming mode documentation (#9519) opened Sep 17, 2024
- llama: (proposal) propagating the results of `graph_compute` to the user interface (#9525) opened Sep 17, 2024
- musa: enable building fat binaries, enable unified memory, and disable Flash Attention on QY1 (MTT S80) (#9526) opened Sep 18, 2024
- Implementations for Q4_0_8_8 quantization based functions - AVX512 version of ggml_gemm_q4_0_8x8_q8_0 (#9532) opened Sep 18, 2024
- add solar pro support (#9541) opened Sep 18, 2024
- server: disable context shift (#9544) opened Sep 19, 2024
- baby-llama : use unnamed namespace in baby_llama_layer (#9557) opened Sep 20, 2024
- CUDA: Enable K-shift operation for -ctk q8_0 (limited) (#9571) opened Sep 20, 2024
- [SYCL] add missed dll file in package (#9577) opened Sep 21, 2024
- Revert "[SYCL] fallback mmvq" (#9579) opened Sep 21, 2024
- nix: update flake.lock (#9586) opened Sep 22, 2024
54 Issues closed by 15 people
- Add theme Rose Pine (#9584) closed Sep 22, 2024
- Bug: Gemma2 9B FlashAttention is offloaded to CPU on AMD (HIP) (#9580) closed Sep 22, 2024
- Add a new `llama_load_model_from_buffer()` method to compliment `llama_load_model_from_file()` (#6311) closed Sep 22, 2024
- Bug - Can't build vulkan backend on RISC-V platform anymore (#8488) closed Sep 22, 2024
- Add lightweight tests for LoRA (#8708) closed Sep 22, 2024
- Bug: 2 tests fail (#8906) closed Sep 22, 2024
- Bug: Inference fails with "llama_get_logits_ith: invalid logits id 7, reason: no logits" in ollama (#8911) closed Sep 22, 2024
- Bug: Quantized kv cache caused performance drop on Apple silicon (#8918) closed Sep 22, 2024
- build ERROR: Failed building wheel for pyyaml (#8919) closed Sep 22, 2024
- Bug: Latest version of convert_hf_to_gguf not compatible with gguf 0.9.1 from pip (#8925) closed Sep 22, 2024
- ERROR: Can't Compile llama.cpp on Mac OS Sequoia (September 2024 update) (#9575) closed Sep 21, 2024
- Bug: Flash attention reduces vulkan performance by ~50% (#9572) closed Sep 21, 2024
- How to convert a finetuned non-LLM model in .pt format into .gguf? #8790 (#8791) closed Sep 21, 2024
- Bug: Quantizing HuggingFaceM4/Idefics3-8B-Llama3 fails with error (#8902) closed Sep 21, 2024
- Bug: Unreachable code warnings (#8904) closed Sep 21, 2024
- Bug: KV cache load/save is slow (#8915) closed Sep 21, 2024
- Bug: Update to "convert_hf_to_gguf.py" (#8920) closed Sep 21, 2024
- Bug: server crash when changing LoRA scale while using CUDA (#9451) closed Sep 21, 2024
- Bug: Llama-Quantize Not Working with Capital Letters (T^T) (#9569) closed Sep 20, 2024
- Bug: Fail to compile after commit 202084d31d4247764fc6d6d40d2e2bda0c89a73a (#9554) closed Sep 20, 2024
- Bug: llama-cli does not show the results of the performance test when SIGINT (#9558) closed Sep 20, 2024
- Feature Request: Processing one token takes the same amount of time as processing 40 tokens (CUDA/MMVQ) (#8869) closed Sep 20, 2024
- openblas not working. (#8882) closed Sep 20, 2024
- Bug: task ids not removed from waiting_tasks for /v1/chat/completions call (#9528) closed Sep 19, 2024
- Huge performance degradation using latest branch on Intel Core Ultra 7 155H (#8328) closed Sep 19, 2024
- Bug: The quantization model suffers from infinite replies and does not stop (#8861) closed Sep 19, 2024
- Bug: [SYCL] silently failed on windows (#9540) closed Sep 18, 2024
- Bug: llama-cli generates incoherent output with full gpu offload (#9535) closed Sep 18, 2024
- Bug: llama-server structured output response_format does not match openai docs (#9522) closed Sep 18, 2024
- Support BitNet b1.58 ternary models (#5761) closed Sep 18, 2024
- Feature Request: server : make chat_example available through /props endpoint (#8694) closed Sep 18, 2024
- Feature Request: multiple queues or multiple threads to load model files. (#8796) closed Sep 18, 2024
- Mistral-large-instruction-2407 cannot be quantified (#8807) closed Sep 18, 2024
- Bug: Failed to build ggml-mainline-vulkan_autogen (#8844) closed Sep 18, 2024
- Bug: Gemma 2 incoherent output when using quantized k cache without Flash Attention (#8853) closed Sep 18, 2024
- Bug: llama-server: when result doesn't fit in max_tokens, finished_reason should be length (#8856) closed Sep 18, 2024
- Bug: Crash in Release Mode when built with Xcode 16 (& since Xcode 15.3) (#9514) closed Sep 17, 2024
- Bug: Last 2 Chunks In Streaming Mode Come Together In Firefox (#9502) closed Sep 17, 2024
- Can't load a Q4 model on 12gb vram (#9517) closed Sep 17, 2024
- Bug: ERROR-hf-to-gguf (#9483) closed Sep 16, 2024
- Bug: Build failure in master on Ubuntu 24.04 with CUDA enabled (#9473) closed Sep 16, 2024
- Bug: Missing Sanity Check in convert_hf_to_gguf.py (#9245) closed Sep 16, 2024
- Add support for OLMoE-1B-7B / 7B (#9380) closed Sep 16, 2024
- Bug: ggml_backend_metal_buffer_type_alloc_buffer: error: failed to allocate buffer (#9460) closed Sep 16, 2024
- Bug: [SYCL] linker fails with undefined reference to symbol (#9490) closed Sep 16, 2024
- When using GPU (OpenCL), the reply speed is slower and all replies are incorrect?? (#7661) closed Sep 16, 2024
- Bug: Unable to load grammar from `json.gbnf` example (#7991) closed Sep 16, 2024
- Bug: FPE (Floating Point Exception) in gguf_init_from_file due to division by zero (#8816) closed Sep 16, 2024
- Feature Request: Bare metal build (#8820) closed Sep 16, 2024
- Feature Request: Please provide a Linux Vulkan binary (#8825) closed Sep 16, 2024
- llama : reimplement logging (#8566) closed Sep 15, 2024
- Bug: can not merge gguf, gguf_init_from_file: invalid magic characters '' (#9498) closed Sep 15, 2024
- Bug: Unable to quantise Uncensored Mistral NeMo Model (#9363) closed Sep 15, 2024
22 Issues opened by 22 people
- Bug: false sharing in threadpool makes ggml_barrier() needlessly slow (#9588) opened Sep 22, 2024
- Bug: passing `tfs_z` crashes the server (#9587) opened Sep 22, 2024
- Feature Request: Support Jina V3 arch (#9585) opened Sep 21, 2024
- Bug: Templates are swapped for Mistral and Llama 2 in llama-server when using --chat-template (#9583) opened Sep 21, 2024
- Bug: Vulkan not compile (#9582) opened Sep 21, 2024
- Bug: ROCM 7900xtx output random garbage with qwen1.5/14B after recent update (#9568) opened Sep 20, 2024
- Bug: gguf pypi package corrupts environment (#9566) opened Sep 20, 2024
- Bug: Release version less accurate than Debug version consistently (#9564) opened Sep 20, 2024
- Bug: Model isn't loading (#9563) opened Sep 20, 2024
- [CANN]Bug: Can't compile ggml/src/CMakeFiles/ggml.dir/ggml-cann/acl_tensor.cpp.o (#9560) opened Sep 20, 2024
- Bug: Unreadable output from android example project (#9555) opened Sep 20, 2024
- Feature Request: Support GRIN-MoE by Microsoft (#9552) opened Sep 19, 2024
- Bug: KV quantization fails when using vulkan (#9551) opened Sep 19, 2024
- Bug: Build fails on i386 systems (#9545) opened Sep 19, 2024
- Error compiling using CUDA on Jetson Orin nx (#9533) opened Sep 18, 2024
- Bug: Lower performance in pre-built binary llama-server, Since llama-b3681-bin-win-cuda-cu12.2.0-x64 (#9530) opened Sep 18, 2024
- Bug: duplicate vulkan devices being detected on windows (#9516) opened Sep 17, 2024
- metal : increase GPU duty-cycle during inference (#9507) opened Sep 16, 2024
- Bug: Lower performance in SYCL vs IPEX LLM. (#9505) opened Sep 16, 2024
- Bug: llama-bench: split-mode flag doesn't recognize argument 'none' (#9501) opened Sep 16, 2024
67 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
- vocab: refactor tokenizer to reduce the overhead of creating multi times tokenizer (#9449) commented on Sep 20, 2024 • 11 new comments
- Support video understanding (#9165) commented on Sep 17, 2024 • 3 new comments
- Update clip.cpp (#9482) commented on Sep 18, 2024 • 2 new comments
- ggml: Add run-time detection of neon, i8mm and sve (#9331) commented on Sep 20, 2024 • 2 new comments
- IBM Granite MoE Architecture (#9438) commented on Sep 17, 2024 • 1 new comment
- server : add Hermes-3 tool call support (WIP) (#9254) commented on Sep 19, 2024 • 1 new comment
- Add Intel Advanced Matrix Extensions (AMX) support to ggml (#8998) commented on Sep 18, 2024 • 1 new comment
- Merging #7568 with #7430 (Implementing LLaMA 3 torch to gguf conversion) (#7651) commented on Sep 20, 2024 • 1 new comment
- server : ability to disable context shift (#9390) commented on Sep 21, 2024 • 0 new comments
- Bug: Mac build failed using make (#9157) commented on Sep 21, 2024 • 0 new comments
- Feature Request: Add support for Phi-3.5 MoE and Vision Instruct (#9119) commented on Sep 21, 2024 • 0 new comments
- Bug: llama-server api first query very slow (#9492) commented on Sep 21, 2024 • 0 new comments
- Feature Request: Support for Qwen2-VL (#9246) commented on Sep 21, 2024 • 0 new comments
- Phi 3 medium/small support (#7439) commented on Sep 21, 2024 • 0 new comments
- Bug: NikolayKozloff/madlad400-10b-mt-Q8_0-GGUF works with llama-cli but doesn't work with llama-server (#9030) commented on Sep 21, 2024 • 0 new comments
- obtain attention matrices during inference, similar to the output_attentions=True parameter in the transformers package (#9122) commented on Sep 21, 2024 • 0 new comments
- Bug: cannot create std::vector larger than max_size() (#9391) commented on Sep 21, 2024 • 0 new comments
- Feature Request: Pixtral by Mistral support (pixtral-12b-240910) (#9440) commented on Sep 20, 2024 • 0 new comments
- llama : store token ids in the KV Cache (#9113) commented on Sep 20, 2024 • 0 new comments
- Refactor: decide the future of llama_tensor_get_type() (#8736) commented on Sep 20, 2024 • 0 new comments
- llama : add test for saving/loading sessions to the CI (#2631) commented on Sep 17, 2024 • 0 new comments
- Improve `cvector-generator` (#8724) commented on Sep 21, 2024 • 0 new comments
- Bug: Throughput (tokens/sec) does not scale with increasing batch sizes in Intel GPUs (#9097) commented on Sep 22, 2024 • 0 new comments
- Bug: High CPU usage and bad output with flash attention on ROCm (#8893) commented on Sep 22, 2024 • 0 new comments
- Bug: Llava not working on android (#8436) commented on Sep 22, 2024 • 0 new comments
- llama : fix K-shift with quantized K (wip) (#5653) commented on Sep 20, 2024 • 0 new comments
- Server: enable lookup decoding (#6828) commented on Sep 18, 2024 • 0 new comments
- added implementation of DRY sampler (#6839) commented on Sep 22, 2024 • 0 new comments
- Changes for the existing quant strategies / FTYPEs and new ones (#8836) commented on Sep 19, 2024 • 0 new comments
- Add lora test workflow (WIP) (#9058) commented on Sep 20, 2024 • 0 new comments
- server: add repeat penalty sigmoid (#9076) commented on Sep 15, 2024 • 0 new comments
- llama : initial Mamba-2 support (#9126) commented on Sep 18, 2024 • 0 new comments
- convert : refactor rope_freqs generation (#9396) commented on Sep 18, 2024 • 0 new comments
- naming : normalize the name of callback-related identifiers (#9405) commented on Sep 16, 2024 • 0 new comments
- Question: How to generate an MPS gputrace (#6506) commented on Sep 17, 2024 • 0 new comments
- changelog : `libllama` API (#9289) commented on Sep 17, 2024 • 0 new comments
- llama : support sliding window attention (#3377) commented on Sep 17, 2024 • 0 new comments
- Bug: On a 3 GPU System [A-C] not using CUDA_VISIBLE_DEVICES but using tensor split [1,1,0] should not allocate ANY memory on GPU C (#8827) commented on Sep 17, 2024 • 0 new comments
- Bug: GGML_ASSERT(llama_add_eos_token(model) != 1) failed llama-server critical error with flan-t5 models (#8990) commented on Sep 17, 2024 • 0 new comments
- Bug: Couldn't load GGUF file into Transformers (#9021) commented on Sep 17, 2024 • 0 new comments
- Vulkan adreno error (#9064) commented on Sep 17, 2024 • 0 new comments
- llama : support reranking API endpoint and models (#8555) commented on Sep 16, 2024 • 0 new comments
- llama : speed-up grammar sampling (#4218) commented on Sep 16, 2024 • 0 new comments
- Bug: andriod compiling bug, with vulkan open (#9489) commented on Sep 16, 2024 • 0 new comments
- Feature Request: T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge (#8485) commented on Sep 16, 2024 • 0 new comments
- Bug: src/llama.cpp:15099: Deepseek2 does not support K-shift (#8862) commented on Sep 16, 2024 • 0 new comments
- Feature Request: Support vulkan when building on Android (#8933) commented on Sep 16, 2024 • 0 new comments
- Feature Request: RDMA support for rpc back ends (#9493) commented on Sep 15, 2024 • 0 new comments
- Bug: There is an issue to execute llama-baby-llama. (#9478) commented on Sep 15, 2024 • 0 new comments
- server : improvements and maintenance (#4216) commented on Sep 15, 2024 • 0 new comments
- Feature Request: NPU Support (#9181) commented on Sep 20, 2024 • 0 new comments
- Bug: MinGW build fails to load models with "error loading model: PrefetchVirtualMemory unavailable" (#9311) commented on Sep 20, 2024 • 0 new comments
- Feature Request: InternVL2 Support ? (#8848) commented on Sep 20, 2024 • 0 new comments
- Encounter the "newline in constant" error while compiling with MSVC (#8334) commented on Sep 20, 2024 • 0 new comments
- Bug: llama3.1 8B GGUF parallel inferring process leads to endless repeating results (#9104) commented on Sep 20, 2024 • 0 new comments
- llama : refactor llama_vocab (#9369) commented on Sep 19, 2024 • 0 new comments
- Bug: llama_print_timings seems to accumulate load_time/total_time in `llama-bench` (#9286) commented on Sep 19, 2024 • 0 new comments
- Feature Request: Support Codestral Mamba (#8519) commented on Sep 19, 2024 • 0 new comments
- BF16 has no CUDA support (#8941) commented on Sep 19, 2024 • 0 new comments
- Bug: Unable to load phi3:3B(2.2GB) model on Apple M1 Pro (#9049) commented on Sep 19, 2024 • 0 new comments
- [CANN]Feature Request: Support OrangeAIPRO 310b CANN (#9481) commented on Sep 18, 2024 • 0 new comments
- Support speculative decoding in `server` example (#5877) commented on Sep 18, 2024 • 0 new comments
- Running Lllava in interactive mode just Quits after generating response without waiting for next prompt. (#3593) commented on Sep 18, 2024 • 0 new comments
- llama : tool for evaluating quantization results per layer (#2783) commented on Sep 18, 2024 • 0 new comments
- metal : compile-time kernel args and params (#4085) commented on Sep 18, 2024 • 0 new comments
- ggml : unified CMake build (#6913) commented on Sep 18, 2024 • 0 new comments
- How to utilize GPU on Android to accelerate inference? (#8705) commented on Sep 18, 2024 • 0 new comments