ggml: Support OpenMP for multi-thread processing #7606

Merged (9 commits) on Jun 3, 2024

Conversation

msy-kato (Contributor) commented May 29, 2024

Currently, ggml multi-thread processing is implemented with pthread_create and pthread_join. This is simple and straightforward, but has the following issues:

  • It is difficult to analyze performance per TID (e.g., with perf), because threads are created each time a token is generated.
  • Thread affinity and core binding settings are not flexible.

Therefore, I suggest adding an option to use OpenMP, a general-purpose multi-threading library. It is optional, and by default the previous implementation is used.
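For illustration only, here is a minimal, hypothetical sketch of the direction (names like `compute_thread` and `n_threads` are illustrative, not the actual ggml code): typical OpenMP runtimes keep the worker team alive between parallel regions, so threads are not created and joined for every token.

```c
#include <stdio.h>
#include <omp.h>

/* stand-in for the per-thread graph work ggml would perform */
static void compute_thread(int ith, int nth) {
    printf("worker %d of %d\n", ith, nth);
}

int main(void) {
    const int n_threads = 4;

    /* one parallel region per graph compute; typical OpenMP runtimes reuse the
       same worker threads across regions instead of spawning them per token */
    #pragma omp parallel num_threads(n_threads)
    {
        compute_thread(omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}
```

Built with `-fopenmp` (GCC/Clang), this also exposes thread affinity control through the standard environment variables OMP_PROC_BIND and OMP_PLACES, which relates to the second point above.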

Below are the results measured on an AWS m7i.24xlarge (Intel Xeon 8488).

### build
$ cmake -DLLAMA_OPENMP=ON -B build -S .
$ cmake --build build -j$(($(nproc)/2))

### run
$ export OMP_PROC_BIND=TRUE # if needed
$ main --model models/llama-2-7b-chat.Q4_0.gguf --temp 0.1 --threads XX --prompt 'AI is going to' --n-predict 256 --seed 0
| Threads | Pthread [token/sec] | OpenMP [token/sec] | OpenMP, OMP_PROC_BIND=TRUE [token/sec] |
|---------|---------------------|--------------------|----------------------------------------|
| 6       | 11.07               | 11.24              | 7.11                                   |
| 12      | 19.28               | 19.62              | 13.37                                  |
| 24      | 26.81               | 27.11              | 22.92                                  |
| 36      | 26.73               | 28.94              | 28.76                                  |
| 42      | 26.08               | 27.21              | 30.54                                  |
| 48      | 25.18               | 26.72              | 30.97                                  |


The OpenMP version performs slightly better than the pthread implementation, and setting OMP_PROC_BIND=TRUE improves performance at higher thread counts (roughly 36 threads and above).

It would also be useful for users to have additional OpenMP options for tuning on their systems.
Any comments are welcome!

Related: #7342, #7526

@github-actions bot added the build (Compilation issues) and ggml (changes relating to the ggml tensor library for machine learning) labels on May 29, 2024
github-actions bot commented May 29, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 533 iterations 🚀

Details (performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8792.69ms p(95)=21057.25ms fails=, finish reason: stop=481 truncated=52
  • Prompt processing (pp): avg=103.53tk/s p(95)=446.09tk/s
  • Token generation (tg): avg=32.2tk/s p(95)=47.34tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=feat-openmp-support-draft commit=e0b077d4daf9a3003e80bcddafda2da34bbcd1da

(Benchmark charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing — "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 533 iterations".)

@mofosyne added the Review Complexity : Medium label (generally requires more time to grok, but manageable by beginner to medium expertise level) on May 29, 2024
slaren (Collaborator) commented May 29, 2024

Here is the same test with NKVO that I have been running in the other thread pools PRs:

| GPU | Model | NKVO | Test | t/s master | t/s feat-openmp-support-draft | Speedup |
|-----|-------|------|------|------------|-------------------------------|---------|
| RTX 3090 Ti | llama 7B Q4_0 | 0 | pp512 | 4539.81 | 4534.23 | 1.00 |
| RTX 3090 Ti | llama 7B Q4_0 | 0 | tg128 | 153.90  | 155.25  | 1.01 |
| RTX 3090 Ti | llama 7B Q4_0 | 1 | pp512 | 502.87  | 527.40  | 1.05 |
| RTX 3090 Ti | llama 7B Q4_0 | 1 | tg128 | 20.85   | 62.17   | 2.98 |

I have also tested with CPU only, on native Windows with MSVC, and on an Apple MBP with an M3 Max, and I haven't found any performance regressions. So I have to say congratulations, this is the first thread pool attempt that achieves a very large reduction in the thread launch overhead, without any significant downsides. I think we could use OMP by default.

Some concerns:

  • The value of num_threads is not always respected by OMP, and when this happens the result is a deadlock. The OMP logic is fairly complicated and it may depend on factors outside of our control. We may need to update the number of threads in the shared state according to the value returned by omp_get_num_threads.
  • It may be better to avoid using OMP at all if n_threads == 1, and instead use the main thread. The OMP implementations that I tested seem to be smart enough to do this already, but I don't think this is guaranteed, and it would be more reliable to handle this case ourselves and avoid the OMP overhead entirely. This case happens when fully offloading a model to a GPU backend, so it is fairly common.
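A minimal sketch of what that fallback could look like (hypothetical names, not the actual patch):

```c
#include <omp.h>

/* stand-in for the per-thread graph work */
static void compute_thread(int ith, int nth) {
    (void) ith; (void) nth;
}

static void compute_graph(int n_threads) {
    if (n_threads == 1) {
        /* fully-offloaded case: run directly on the calling thread and
           skip the OpenMP runtime entirely */
        compute_thread(0, 1);
        return;
    }

    #pragma omp parallel num_threads(n_threads)
    {
        compute_thread(omp_get_thread_num(), omp_get_num_threads());
    }
}
```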

netrunnereve (Contributor) commented:

> I have also tested with CPU only, on native Windows with MSVC, and on an Apple MBP with an M3 Max, and I haven't found any performance regressions. So I have to say congratulations, this is the first thread pool attempt that achieves a very large reduction in the thread launch overhead, without any significant downsides. I think we could use OMP by default.

On an old 4 core/8 thread Xeon v2 there's a small regression with four threads but it's the same as master with eight. Typically with these threadpool implementations the speedup only happens when you use a lot of threads and thus have to deal with the overhead.
master

| model | size | params | backend | threads | test | t/s |
|-------|------|--------|---------|---------|------|-----|
| llama 7B Q6_K | 5.53 GiB | 7.24 B | CPU | 4 | pp512 | 4.38 ± 0.00 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | CPU | 4 | tg128 | 3.51 ± 0.00 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | CPU | 8 | pp512 | 4.64 ± 0.02 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | CPU | 8 | tg128 | 3.52 ± 0.00 |

PR

| model | size | params | backend | threads | test | t/s |
|-------|------|--------|---------|---------|------|-----|
| llama 7B Q6_K | 5.53 GiB | 7.24 B | CPU | 4 | pp512 | 4.34 ± 0.00 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | CPU | 4 | tg128 | 3.30 ± 0.08 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | CPU | 8 | pp512 | 4.64 ± 0.00 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | CPU | 8 | tg128 | 3.52 ± 0.00 |

PR with OMP_PROC_BIND=TRUE

| model | size | params | backend | threads | test | t/s |
|-------|------|--------|---------|---------|------|-----|
| llama 7B Q6_K | 5.53 GiB | 7.24 B | CPU | 8 | pp512 | 4.64 ± 0.01 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | CPU | 8 | tg128 | 3.49 ± 0.00 |

slaren (Collaborator) commented May 30, 2024

I tried with a very small stories260k model, since that should exacerbate the overhead of starting the threads, but it is still consistently faster with this PR (except with 1 thread). Is it really a given that waking up a thread needs to be more expensive than starting a thread?

| CPU | Model | Threads | Test | t/s master | t/s omp | Speedup |
|-----|-------|---------|------|------------|---------|---------|
| i9-13900K | stories260k F32 | 1 | pp512 | 6920.29  | 6882.03  | 0.99 |
| i9-13900K | stories260k F32 | 1 | tg128 | 6937.02  | 6865.42  | 0.99 |
| i9-13900K | stories260k F32 | 2 | pp512 | 12365.50 | 12665.08 | 1.02 |
| i9-13900K | stories260k F32 | 2 | tg128 | 4687.61  | 5891.87  | 1.26 |
| i9-13900K | stories260k F32 | 3 | pp512 | 17501.55 | 17597.04 | 1.01 |
| i9-13900K | stories260k F32 | 3 | tg128 | 3988.88  | 5272.38  | 1.32 |
| i9-13900K | stories260k F32 | 4 | pp512 | 22587.09 | 22778.87 | 1.01 |
| i9-13900K | stories260k F32 | 4 | tg128 | 3454.17  | 4633.26  | 1.34 |
| i9-13900K | stories260k F32 | 5 | pp512 | 27142.85 | 27477.98 | 1.01 |
| i9-13900K | stories260k F32 | 5 | tg128 | 3028.37  | 4052.77  | 1.34 |
| i9-13900K | stories260k F32 | 6 | pp512 | 30735.90 | 31353.86 | 1.02 |
| i9-13900K | stories260k F32 | 6 | tg128 | 2669.83  | 3668.35  | 1.37 |
| i9-13900K | stories260k F32 | 7 | pp512 | 31723.15 | 35381.52 | 1.12 |
| i9-13900K | stories260k F32 | 7 | tg128 | 2388.36  | 3295.09  | 1.38 |
| i9-13900K | stories260k F32 | 8 | pp512 | 32668.71 | 31819.11 | 0.97 |
| i9-13900K | stories260k F32 | 8 | tg128 | 2140.98  | 2944.40  | 1.38 |
With OMP_PROC_BIND=TRUE:

| CPU | Model | Threads | Test | t/s master | t/s feat-openmp-support-draft | Speedup |
|-----|-------|---------|------|------------|-------------------------------|---------|
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 1 | pp512 | 6883.01  | 6835.37  | 0.99 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 1 | tg128 | 6896.21  | 6854.63  | 0.99 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 2 | pp512 | 12351.67 | 12220.15 | 0.99 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 2 | tg128 | 4809.30  | 5899.53  | 1.23 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 3 | pp512 | 17397.55 | 17597.46 | 1.01 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 3 | tg128 | 3950.32  | 5340.62  | 1.35 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 4 | pp512 | 22494.21 | 22735.69 | 1.01 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 4 | tg128 | 3442.58  | 4758.60  | 1.38 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 5 | pp512 | 26679.56 | 27395.33 | 1.03 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 5 | tg128 | 2996.16  | 4205.13  | 1.40 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 6 | pp512 | 29370.43 | 31436.20 | 1.07 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 6 | tg128 | 2682.35  | 3848.70  | 1.43 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 7 | pp512 | 31625.77 | 35164.59 | 1.11 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 7 | tg128 | 2351.82  | 3361.92  | 1.43 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 8 | pp512 | 30794.54 | 35432.06 | 1.15 |
| 13th Gen Intel(R) Core(TM) i9-13900K | llama ?B all F32 (guessed) | 8 | tg128 | 2100.06  | 3045.57  | 1.45 |

@slaren slaren marked this pull request as ready for review May 30, 2024 08:01
@slaren slaren requested a review from ggerganov May 30, 2024 08:01
msy-kato (Contributor, Author) commented May 30, 2024

@slaren Thank you for the comments and performance tests! I have updated the code based on your feedback.

> The value of num_threads is not always respected by OMP, and when this happens the result is a deadlock. The OMP logic is fairly complicated and it may depend on factors outside of our control. We may need to update the number of threads in the shared state according to the value returned by omp_get_num_threads.

As you say, deadlocks can occur because the number of threads specified in the num_threads clause is not always honored (this happens especially with OMP_DYNAMIC=TRUE). On the other hand, I'm not sure what happens when n_threads in state_shared is updated inside the parallel section (omp_get_num_threads() returns the right value only within the parallel section).

Instead, how about limiting the number of threads by omp_get_max_threads() in ggml_graph_plan, as in this implementation? (A sketch of the idea follows the list below.)
I have confirmed that deadlock does not occur under the following conditions:

  • n_threads is greater than the number of physical cores
  • n_threads is greater than the thread limit (omp_get_thread_limit())

Without this modification, deadlock occurred in the above cases.
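A hypothetical sketch of that clamp (illustrative names only, not the merged code):

```c
#include <omp.h>

/* sketch: cap the requested thread count by what the OpenMP runtime reports
   it can provide, before the parallel region in graph compute is entered */
static int plan_n_threads(int requested) {
    const int max_threads = omp_get_max_threads();
    return requested > max_threads ? max_threads : requested;
}
```

(As noted in the next comment, the merged change instead reads omp_get_num_threads() inside the parallel region, since omp_get_max_threads() is only an upper bound on the team size the runtime may actually grant.)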

> It may be better to avoid using OMP at all if n_threads == 1, and instead use the main thread. The OMP implementations that I tested seem to be smart enough to do this already, but I don't think this is guaranteed, and it would be more reliable to handle this case ourselves and avoid the OMP overhead entirely. This case happens when fully offloading a model to a GPU backend, so it is fairly common.

Thanks, I checked here:

> The thread that encountered the parallel construct becomes the master thread of the new team, with a thread number of zero for the duration of the new parallel region.

If n_threads is 1, the main thread becomes the master thread of the OpenMP team, so the cost of waking up a worker thread may not be a problem. However, the change is simple, so it is possible to branch only when n_threads == 1.

slaren (Collaborator) commented May 30, 2024

I changed it to update n_threads from omp_get_num_threads in an omp single section, because I don't think that we can rely on the value returned by omp_get_max_threads. I also removed the call to omp_get_thread_limit since it was causing issues with MSVC, and I think it is not strictly necessary.
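Roughly, that pattern looks like the following sketch (hypothetical names, not the exact diff):

```c
#include <omp.h>

struct state_shared {
    int n_threads;   /* thread count the workers use to split the graph */
};

/* stand-in for the per-thread graph work */
static void compute_thread(int ith, struct state_shared * shared) {
    (void) ith; (void) shared;
}

static void compute_graph(struct state_shared * shared) {
    #pragma omp parallel num_threads(shared->n_threads)
    {
        #pragma omp single
        {
            /* the runtime may grant fewer threads than requested; record the
               real team size so no worker waits for a thread that never runs */
            shared->n_threads = omp_get_num_threads();
        }
        /* the implicit barrier at the end of `single` guarantees every thread
           sees the updated count before starting work */
        compute_thread(omp_get_thread_num(), shared);
    }
}
```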

msy-kato (Contributor, Author) commented:
Thanks for the correction!

slaren force-pushed the feat-openmp-support-draft branch from e0e880a to 5970a26 on May 30, 2024 09:10
slaren (Collaborator) commented May 30, 2024

  • I was wrong about support on macOS: Apple clang does not support OpenMP. It is still possible to use it by installing libomp with brew and hacking the compiler command line somewhat, but the performance is not good. I have disabled OpenMP on macOS in the make build, and cmake automatically detects that OpenMP is not available.
  • Thread sanitizer requires building libomp with LIBOMP_TSAN_SUPPORT to use with OpenMP. As a workaround, I have disabled OpenMP in the thread sanitizer CI.

@github-actions bot added the devops (improvements to build systems and github actions) label on May 30, 2024
slaren (Collaborator) commented Jun 3, 2024

@ggerganov PTAL and let's merge this if you don't have any objections; it will significantly improve performance in some cases and will allow me to continue work on #6210.

ggerganov (Owner) left a comment:


Sorry for the delay - I have missed the request for review

Review comment on ggml.c (outdated, resolved)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@slaren slaren merged commit a5735e4 into ggerganov:master Jun 3, 2024
66 of 70 checks passed
xatier commented Jun 3, 2024

We need to install libgomp1 in server.Dockerfile, otherwise it won't start.

# ldd /server
        linux-vdso.so.1 (0x000079740aaa5000)
        libcurl.so.4 => /lib/x86_64-linux-gnu/libcurl.so.4 (0x000079740a9f6000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x000079740a200000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000079740a519000)
        libgomp.so.1 => not found
...

apt install libgomp1

https://github.com/ggerganov/llama.cpp/blob/master/.devops/server.Dockerfile#L19

xatier commented Jun 8, 2024

Fixed in #7780

BodhiHu commented Aug 6, 2024

FYI, this change throws an "omp.h not found" error on Ubuntu 22.04.
