
Reproduce temp=0 llama.cpp results with some consistency. #28

Closed · Sixzero opened this issue Jan 2, 2024 · 5 comments


Sixzero commented Jan 2, 2024

We need a way to detect what causes the differences between the two implementations.

The task is to get the same, or at least very similar, results at temp=0. We ran some tests with the new .gguf files, since that format has seen such wide adoption.
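For context: at temperature 0, sampling degenerates to an argmax over the logits, so decoding is deterministic and any divergence between the two implementations has to come from the numerics rather than the sampler. A minimal sketch of temperature sampling (not Llama2.jl's actual internals):

# Dividing the logits by T sharpens the softmax; as T → 0 the
# distribution collapses onto the argmax, so temp=0 is deterministic.
function sample_token(logits::Vector{Float32}, T::Float32)
    T == 0 && return argmax(logits)             # greedy path
    p = exp.((logits .- maximum(logits)) ./ T)  # numerically stable softmax
    p ./= sum(p)
    r = rand(Float32)                           # draw from the categorical
    c = 0f0
    for (i, prob) in enumerate(p)
        c += prob
        c >= r && return i
    end
    return length(p)
end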

Llama2.jl test:

using Llama2
model = load_gguf_model("/path/to/llama-2-7b-chat.Q4_K_S.gguf");
sample(model, "Tim was happy."; temperature = 0.0f0)  # temp 0 → greedy decoding

llama.cpp .gguf test:
./main -m /Users/lukasmayrhofer/Downloads/llama-2-7b-chat.Q4_K_S.gguf --samplers "temp" --temp 0 -p "Tim was happy."

Current Llama2.jl results:

Tim was happy. Einzelnes, but he was also very proud of his son. He had always known that Tim was special, and he was thrilled to see him finally getting the recognition he deserved.\nAs the two of them sat in the stands, watching the game, Tim couldn't help but feel a sense of pride and joy. He was so grateful to have" ⋯ 667 bytes ⋯ ". \"I'm lucky to have you too.\"\nAs they walked out of the restaurant, Tim felt a sense of contentment and happiness. He knew that he had a wonderful son, and he was grateful for every moment they spent together. He was proud of Tim, and he knew that he would always be there to support and encourage him, no matter what.

Current llama.cpp results:

Tim was happy.
He had just received a new job offer and he was excited to start his new career. He had been searching for a new opportunity for months, and now it seemed like all his hard work had paid off.
As he walked into the office building, he couldn't help but feel a sense of pride. He had worked hard to get where he was, and he knew that this new job would be a great opportunity for him.
Tim took a deep breath as he entered the office. He was greeted by a friendly receptionist who offered him a warm smile. "Hello there," she said. "Welcome to Tim's new workplace."
Tim felt a sense of excitement as he walked through the office. He couldn't wait to meet his new colleagues and start working on his new projects. He knew that this was going to be a great opportunity for him, and he was eager to get started. [end of text]

We need an efficient way to pinpoint what causes the differences between the two.
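Since both runs are deterministic at temp=0, one cheap way to localize the problem is to compare the greedy token streams and report the first position where they disagree; the logits at that step are where the numerics start to drift. A small hypothetical helper (not part of either codebase), given the two token-id sequences:

# First index where two greedy token streams disagree; n + 1 when one
# stream is a strict prefix of the other; `nothing` when identical.
function first_divergence(a::Vector{Int}, b::Vector{Int})
    n = min(length(a), length(b))
    for i in 1:n
        a[i] != b[i] && return i
    end
    return length(a) == length(b) ? nothing : n + 1
end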


krishvishal commented Apr 2, 2024

Could something like google/gemma.cpp#23 be happening here?

Basically, the way the quantization is implemented seems to result in lower performance on some types of architectures.


cafaxo commented Apr 2, 2024

Exactly. It seems that quantizing the hidden state to q8_0 is not a good idea (see ggerganov/llama.cpp#4755; it is unfortunate that the bot closed it).
We should rewrite our quantized vecdot routines to do the calculations in fp16 or fp32. The challenge is to do this without degrading the speed of the vecdots too much.
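To make the trade-off concrete, here is a minimal sketch contrasting the two strategies. The block layout is hypothetical (a single fp16 scale with 32 int8 quants, q8_0-style); it is not Llama2.jl's actual representation or API:

# Hypothetical q8_0-style block: one fp16 scale `d`, 32 int8 quants `qs`.
struct Q8Block
    d::Float16
    qs::Vector{Int8}
end

# Quantize a 32-element slice of the hidden state to q8_0.
# This rounding is the lossy step that perturbs the activations.
function quantize_q8(x::AbstractVector{Float32})
    d = max(maximum(abs, x) / 127, eps(Float32))  # avoid division by zero
    Q8Block(Float16(d), Int8.(round.(clamp.(x ./ d, -127, 127))))
end

# Current approach: quantize the activations too, then do an integer dot.
# Fast, but carries the activation-quantization error into every product.
dot_q8(a::Q8Block, b::Q8Block) =
    Float32(a.d) * Float32(b.d) * Float32(sum(Int32.(a.qs) .* Int32.(b.qs)))

# Proposed approach: keep the activations in fp32 and dequantize the
# weight block on the fly. Slower, but only the weights are lossy.
dot_f32(w::Q8Block, x::AbstractVector{Float32}) =
    Float32(w.d) * sum(Float32.(w.qs) .* x)

x = randn(Float32, 32)               # fp32 hidden-state slice
w = quantize_q8(randn(Float32, 32))  # stand-in for a quantized weight block
@show dot_q8(quantize_q8(x), w) dot_f32(w, x)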

@jan-wassenberg

FWIW, we (gemma.cpp) are actually using fp32.


cafaxo commented Apr 22, 2024

With 42001c5, the zero-temperature behavior now better matches the Metal backend of llama.cpp:

Llama2.jl (at 42001c5):

julia> sample(model, "The Julia programming language."; temperature=0.0f0)
 The Julia programming language. Julia is a high-level, high-performance dynamic programming language for technical computing. It is designed to be easy to use, fast, and efficient. Julia is also highly extensible, with a large and growing ecosystem of packages and libraries.
Julia is a high-level, high-performance dynamic programming language for technical computing. It is designed to be easy to use, fast, and efficient. Julia is also highly extensible, with a large and growing ecosystem of packages and libraries.
## Installation

### Installing Julia

#### Installing Julia from the Julia website

llama.cpp (at ggerganov/llama.cpp@637e9a8):

 The Julia programming language. Julia is a high-level, high-performance dynamic programming language for technical computing. It is designed to be easy to use, fast, and efficient. Julia is also highly extensible, with a large and growing ecosystem of libraries and packages.
Julia is a high-level, high-performance dynamic programming language for technical computing. It is designed to be easy to use, fast, and efficient. Julia is also highly extensible, with a large and growing ecosystem of libraries and packages.
Julia is a high-level, high-performance dynamic programming language for technical computing. It is designed to be easy to use, fast, and efficient. Julia is also highly extensible, with a large and growing ecosystem of libraries and packages.
Julia is a high-level, high-performance dynamic programming language for technical computing. It is designed to be easy to use, fast, and efficient. Julia is also highly extensible, with a large and growing ecosystem of libraries and packages. Julia is a high-level, high-performance dynamic programming language for technical computing. It is designed to be easy to use, fast, and efficient. Julia is also highly extensible, with a large and growing ecosystem of libraries and packages.

This is using the llama-2-7b.Q4_K_S.gguf model.


cafaxo commented Jul 18, 2024

This is now fixed with the new vecdot routines: 587d270.

cafaxo closed this as completed on Jul 18, 2024.