ggml: Support OpenMP for multi-thread processing #7606

Merged (9 commits) on Jun 3, 2024

Conversation

msy-kato (Contributor) commented May 29, 2024

Currently, ggml multi-thread processing is implemented with pthread_create and pthread_join. This is simple and straightforward, but has the following issues:

  • It is difficult to analyze performance per TID (e.g., with perf), because threads are created each time a token is generated.
  • Thread affinity and core binding settings are not flexible.

Therefore, I suggest adding an option to use OpenMP, a general-purpose multi-threading library. It is optional, and by default the previous implementation is used.
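For illustration only, here is a minimal, hypothetical sketch of the direction (names like `compute_thread` and `n_threads` are illustrative, not the actual ggml code): typical OpenMP runtimes keep the worker team alive between parallel regions, so threads are not created and joined for every token.

```c
#include <stdio.h>
#include <omp.h>

/* stand-in for the per-thread graph work ggml would perform */
static void compute_thread(int ith, int nth) {
    printf("worker %d of %d\n", ith, nth);
}

int main(void) {
    const int n_threads = 4;

    /* one parallel region per graph compute; typical OpenMP runtimes reuse the
       same worker threads across regions instead of spawning them per token */
    #pragma omp parallel num_threads(n_threads)
    {
        compute_thread(omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}
```

Built with `-fopenmp` (GCC/Clang), this also exposes thread affinity control through the standard environment variables OMP_PROC_BIND and OMP_PLACES, which relates to the second point above.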

Below are the results measured on an AWS m7i.24xlarge (Intel Xeon 8488).

### build
$ cmake -DLLAMA_OPENMP=ON -B build -S .
$ cmake --build build -j$(($(nproc)/2))

### run
$ export OMP_PROC_BIND=TRUE # if needed
$ main --model models/llama-2-7b-chat.Q4_0.gguf --temp 0.1 --threads XX --prompt 'AI is going to' --n-predict 256 --seed 0
| Threads | Pthread [token/sec] | OpenMP [token/sec] | OpenMP, OMP_PROC_BIND=TRUE [token/sec] |
|---------|---------------------|--------------------|----------------------------------------|
| 6       | 11.07               | 11.24              | 7.11                                   |
| 12      | 19.28               | 19.62              | 13.37                                  |
| 24      | 26.81               | 27.11              | 22.92                                  |
| 36      | 26.73               | 28.94              | 28.76                                  |
| 42      | 26.08               | 27.21              | 30.54                                  |
| 48      | 25.18               | 26.72              | 30.97                                  |


The OpenMP version performs slightly better than the pthread implementation, and setting OMP_PROC_BIND=TRUE improves performance at higher thread counts (roughly 36 threads and above).

It would also be useful for users to have additional OpenMP options for tuning on their systems.
Any comments are welcome!

Related: #7342, #7526

@github-actions bot added the build (Compilation issues) and ggml (changes relating to the ggml tensor library for machine learning) labels on May 29, 2024
github-actions bot commented May 29, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 533 iterations 🚀

Details (performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8792.69ms p(95)=21057.25ms fails=, finish reason: stop=481 truncated=52
  • Prompt processing (pp): avg=103.53tk/s p(95)=446.09tk/s
  • Token generation (tg): avg=32.2tk/s p(95)=47.34tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=feat-openmp-support-draft commit=e0b077d4daf9a3003e80bcddafda2da34bbcd1da

(Benchmark charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing — "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 533 iterations".)

@mofosyne added the Review Complexity : Medium label (generally requires more time to grok, but manageable by beginner to medium expertise level) on May 29, 2024
slaren (Collaborator) commented May 29, 2024

Here is the same test with NKVO that I have been running in the other thread pools PRs:

| GPU | Model | NKVO | Test | t/s master | t/s feat-openmp-support-draft | Speedup |
|-----|-------|------|------|------------|-------------------------------|---------|
| RTX 3090 Ti | llama 7B Q4_0 | 0 | pp512 | 4539.81 | 4534.23 | 1.00 |
| RTX 3090 Ti | llama 7B Q4_0 | 0 | tg128 | 153.90  | 155.25  | 1.01 |
| RTX 3090 Ti | llama 7B Q4_0 | 1 | pp512 | 502.87  | 527.40  | 1.05 |
| RTX 3090 Ti | llama 7B Q4_0 | 1 | tg128 | 20.85   | 62.17   | 2.98 |

I have also tested with CPU only, on native Windows with MSVC, and on an Apple MBP with an M3 Max, and I haven't found any performance regressions. So I have to say congratulations, this is the first thread pool attempt that achieves a very large reduction in the thread launch overhead, without any significant downsides. I think we could use OMP by default.

Some concerns:

  • The value of num_threads is not always respected by OMP, and when this happens the result is a deadlock. The OMP logic is fairly complicated and it may depend on factors outside of our control. We may need to update the number of threads in the shared state according to the value returned by omp_get_num_threads.
  • It may be better to avoid using OMP at all if n_threads == 1, and instead use the main thread. The OMP implementations that I tested seem to be smart enough to do this already, but I don't think this is guaranteed, and it would be more reliable to handle this case ourselves and avoid the OMP overhead entirely. This case happens when fully offloading a model to a GPU backend, so it is fairly common.
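A minimal sketch of what that fallback could look like (hypothetical names, not the actual patch):

```c
#include <omp.h>

/* stand-in for the per-thread graph work */
static void compute_thread(int ith, int nth) {
    (void) ith; (void) nth;
}

static void compute_graph(int n_threads) {
    if (n_threads == 1) {
        /* fully-offloaded case: run directly on the calling thread and
           skip the OpenMP runtime entirely */
        compute_thread(0, 1);
        return;
    }

    #pragma omp parallel num_threads(n_threads)
    {
        compute_thread(omp_get_thread_num(), omp_get_num_threads());
    }
}
```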

netrunnereve (Contributor) commented:

> I have also tested with CPU only, on native Windows with MSVC, and on an Apple MBP with an M3 Max, and I haven't found any performance regressions. So I have to say congratulations, this is the first thread pool attempt that achieves a very large reduction in the thread launch overhead, without any significant downsides. I think we could use OMP by default.

On an old 4 core/8 thread Xeon v2 there's a small regression with four threads but it's the same as master with eight. Typically with these threadpool implementations the speedup only happens when you use a lot of threads and thus have to deal with the overhead.
master

| model | size | params | backend | threads | test | t/s |
|-------|------|--------|---------|---------|------|-----|
| llama 7B Q6_K | 5.53 GiB | 7.24 B | CPU | 4 | pp512 | 4.38 ± 0.00 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | CPU | 4 | tg128 | 3.51 ± 0.00 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | CPU | 8 | pp512 | 4.64 ± 0.02 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | CPU | 8 | tg128 | 3.52 ± 0.00 |

PR

| model | size | params | backend | threads | test | t/s |
|-------|------|--------|---------|---------|------|-----|
| llama 7B Q6_K | 5.53 GiB | 7.24 B | CPU | 4 | pp512 | 4.34 ± 0.00 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | CPU | 4 | tg128 | 3.30 ± 0.08 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | CPU | 8 | pp512 | 4.64 ± 0.00 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | CPU | 8 | tg128 | 3.52 ± 0.00 |

PR with OMP_PROC_BIND=TRUE

| model | size | params | backend | threads | test | t/s |
|-------|------|--------|---------|---------|------|-----|
| llama 7B Q6_K | 5.53 GiB | 7.24 B | CPU | 8 | pp512 | 4.64 ± 0.01 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | CPU | 8 | tg128 | 3.49 ± 0.00 |

slaren (Collaborator) commented May 30, 2024

I tried with a very small stories260k model, since that should exacerbate the overhead of starting the threads, but it is still consistently faster with this PR (except with 1 thread). Is it really a given that waking up a thread needs to be more expensive than starting a thread?

| CPU | Model | Threads | Test | t/s master | t/s omp | Speedup |
|-----|-------|---------|------|------------|---------|---------|
| i9-13900K | stories260k F32 | 1 | pp512 | 6920.29  | 6882.03  | 0.99 |
| i9-13900K | stories260k F32 | 1 | tg128 | 6937.02  | 6865.42  | 0.99 |
| i9-13900K | stories260k F32 | 2 | pp512 | 12365.50 | 12665.08 | 1.02 |
| i9-13900K | stories260k F32 | 2 | tg128 | 4687.61  | 5891.87  | 1.26 |
| i9-13900K | stories260k F32 | 3 | pp512 | 17501.55 | 17597.04 | 1.01 |
| i9-13900K | stories260k F32 | 3 | tg128 | 3988.88  | 5272.38  | 1.32 |
| i9-13900K | stories260k F32 | 4 | pp512 | 22587.09 | 22778.87 | 1.01 |
| i9-13900K | stories260k F32 | 4 | tg128 | 3454.17  | 4633.26  | 1.34 |
| i9-13900K | stories260k F32 | 5 | pp512 | 27142.85 | 27477.98 | 1.01 |
| i9-13900K | stories260k F32 | 5 | tg128 | 3028.37  | 4052.77  | 1.34 |
| i9-13900K | stories260k F32 | 6 | pp512 | 30735.90 | 31353.86 | 1.02 |
| i9-13900K | stories260k F32 | 6 | tg128 | 2669.83  | 3668.35  | 1.37 |
| i9-13900K | stories260k F32 | 7 | pp512 | 31723.15 | 35381.52 | 1.12 |
| i9-13900K | stories260k F32 | 7 | tg128 | 2388.36  | 3295.09  | 1.38 |
| i9-13900K | stories260k F32 | 8 | pp512 | 32668.71 | 31819.11 | 0.97 |
| i9-13900K | stories260k F32 | 8 | tg128 | 2140.98  | 2944.40  | 1.38 |
With OMP_PROC_BIND=TRUE:

| CPU | Model | Threads | Test | t/s master | t/s feat-openmp-support-draft | Speedup |
|-----|-------|---------|------|------------|-------------------------------|---------|
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 1 | pp512 | 6883.01  | 6835.37  | 0.99 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 1 | tg128 | 6896.21  | 6854.63  | 0.99 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 2 | pp512 | 12351.67 | 12220.15 | 0.99 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 2 | tg128 | 4809.30  | 5899.53  | 1.23 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 3 | pp512 | 17397.55 | 17597.46 | 1.01 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 3 | tg128 | 3950.32  | 5340.62  | 1.35 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 4 | pp512 | 22494.21 | 22735.69 | 1.01 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 4 | tg128 | 3442.58  | 4758.60  | 1.38 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 5 | pp512 | 26679.56 | 27395.33 | 1.03 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 5 | tg128 | 2996.16  | 4205.13  | 1.40 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 6 | pp512 | 29370.43 | 31436.20 | 1.07 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 6 | tg128 | 2682.35  | 3848.70  | 1.43 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 7 | pp512 | 31625.77 | 35164.59 | 1.11 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 7 | tg128 | 2351.82  | 3361.92  | 1.43 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 8 | pp512 | 30794.54 | 35432.06 | 1.15 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 8 | tg128 | 2100.06  | 3045.57  | 1.45 |

@slaren slaren marked this pull request as ready for review May 30, 2024 08:01
@slaren slaren requested a review from ggerganov May 30, 2024 08:01
msy-kato (Contributor, Author) commented May 30, 2024

@slaren Thank you for the comments and performance tests! I have updated the code based on your feedback.

> The value of num_threads is not always respected by OMP, and when this happens the result is a deadlock. The OMP logic is fairly complicated and it may depend on factors outside of our control. We may need to update the number of threads in the shared state according to the value returned by omp_get_num_threads.

As you say, deadlocks can occur because the number of threads specified in the num_threads clause is not always honored (this happens especially with OMP_DYNAMIC=TRUE). On the other hand, I'm not sure what happens when n_threads in state_shared is updated inside the parallel section (omp_get_num_threads() returns the right value only within the parallel section).

Instead, how about limiting the number of threads by omp_get_max_threads() in ggml_graph_plan, as in this implementation? (A sketch of the idea follows the list below.)
I have confirmed that deadlock does not occur under the following conditions:

  • n_threads is greater than the number of physical cores
  • n_threads is greater than the thread limit (omp_get_thread_limit())

Without this modification, deadlock occurred in the above cases.
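A hypothetical sketch of that clamp (illustrative names only, not the merged code):

```c
#include <omp.h>

/* sketch: cap the requested thread count by what the OpenMP runtime reports
   it can provide, before the parallel region in graph compute is entered */
static int plan_n_threads(int requested) {
    const int max_threads = omp_get_max_threads();
    return requested > max_threads ? max_threads : requested;
}
```

(As noted in the next comment, the merged change instead reads omp_get_num_threads() inside the parallel region, since omp_get_max_threads() is only an upper bound on the team size the runtime may actually grant.)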

> It may be better to avoid using OMP at all if n_threads == 1, and instead use the main thread. The OMP implementations that I tested seem to be smart enough to do this already, but I don't think this is guaranteed, and it would be more reliable to handle this case ourselves and avoid the OMP overhead entirely. This case happens when fully offloading a model to a GPU backend, so it is fairly common.

Thanks, I checked here:

> The thread that encountered the parallel construct becomes the master thread of the new team, with a thread number of zero for the duration of the new parallel region.

If n_threads is 1, the main thread becomes the master thread of the OpenMP team, so the cost of waking up a worker thread may not be a problem. However, the change is simple, so it is possible to branch only when n_threads == 1.

slaren (Collaborator) commented May 30, 2024

I changed it to update n_threads from omp_get_num_threads in an omp single section, because I don't think that we can rely on the value returned by omp_get_max_threads. I also removed the call to omp_get_thread_limit since it was causing issues with MSVC, and I think it is not strictly necessary.
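Roughly, that pattern looks like the following sketch (hypothetical names, not the exact diff):

```c
#include <omp.h>

struct state_shared {
    int n_threads;   /* thread count the workers use to split the graph */
};

/* stand-in for the per-thread graph work */
static void compute_thread(int ith, struct state_shared * shared) {
    (void) ith; (void) shared;
}

static void compute_graph(struct state_shared * shared) {
    #pragma omp parallel num_threads(shared->n_threads)
    {
        #pragma omp single
        {
            /* the runtime may grant fewer threads than requested; record the
               real team size so no worker waits for a thread that never runs */
            shared->n_threads = omp_get_num_threads();
        }
        /* the implicit barrier at the end of `single` guarantees every thread
           sees the updated count before starting work */
        compute_thread(omp_get_thread_num(), shared);
    }
}
```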

msy-kato (Contributor, Author) commented:
Thanks for the correction!

slaren force-pushed the feat-openmp-support-draft branch from e0e880a to 5970a26 on May 30, 2024 09:10
slaren (Collaborator) commented May 30, 2024

  • I was wrong about support on macOS: Apple clang does not support OpenMP. It is still possible to use it by installing libomp with brew and hacking the compiler command line somewhat, but the performance is not good. I have disabled OpenMP on macOS in the make build, and cmake automatically detects that OpenMP is not available.
  • Thread sanitizer requires building libomp with LIBOMP_TSAN_SUPPORT to use with OpenMP. As a workaround, I have disabled OpenMP in the thread sanitizer CI.

@github-actions bot added the devops (improvements to build systems and github actions) label on May 30, 2024
slaren (Collaborator) commented Jun 3, 2024

@ggerganov PTAL and let's merge this if you don't have any objections; it will significantly improve performance in some cases and will allow me to continue work on #6210.

ggerganov (Owner) left a comment:


Sorry for the delay - I have missed the request for review

Review comment on ggml.c (outdated, resolved)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@slaren slaren merged commit a5735e4 into ggerganov:master Jun 3, 2024
66 of 70 checks passed
xatier commented Jun 3, 2024

We need to install libgomp1 in server.Dockerfile, otherwise it won't start.

# ldd /server
        linux-vdso.so.1 (0x000079740aaa5000)
        libcurl.so.4 => /lib/x86_64-linux-gnu/libcurl.so.4 (0x000079740a9f6000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x000079740a200000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000079740a519000)
        libgomp.so.1 => not found
...

apt install libgomp1

https://github.com/ggerganov/llama.cpp/blob/master/.devops/server.Dockerfile#L19

xatier commented Jun 8, 2024

Fixed in #7780

BodhiHu commented Aug 6, 2024

FYI, this change throws an "omp.h not found" error on Ubuntu 22.04.
