
New QuantLib benchmark for machine comparison #1962

Merged: 12 commits merged into lballabio:master from amd-jadutoit:jdt-new-benchmark on May 17, 2024

Conversation

@amd-jadutoit (Contributor)

Many thanks to @klausspanderen for help and guidance on this work.

QuantLib is one of the few open source production quality quant libraries. As such, platforms such as Phoronix and others have used the QuantLib benchmark as a measure of how well different CPUs run "financial workloads". This is only meaningful insofar as the QL benchmark resembles an overnight risk run (the main computational workload that investment banks select hardware for).

This PR modifies the benchmark so that it is easier to compare the performance of modern servers on overnight risk workloads.


boring-cyborg bot commented May 3, 2024

Thanks for opening this pull request! It might take a while before we look at it, so don't worry if there seems to be no feedback. We'll get to it.

@CLAassistant commented May 3, 2024

CLA assistant check
All committers have signed the CLA.

@amd-jadutoit (Contributor, Author)

@klausspanderen it seems that the CI workflows need to be run on this PR. Do you know who to contact about that?

@klausspanderen (Contributor)

@amd-jadutoit looking into the CI builds it seems that we have a couple of linker issues in the CI workflows, which we need to fix first.

@amd-jadutoit (Contributor, Author)

> @amd-jadutoit looking into the CI builds it seems that we have a couple of linker issues in the CI workflows, which we need to fix first.

Yes, I saw. I didn't realise there was an automake build system in addition to CMake. I'm looking into that now, along with the Windows failures. It will probably take me a few days.

@lballabio (Owner)

Thanks! No hurry—the next release will be in July.

@amd-jadutoit (Contributor, Author)

@klausspanderen @lballabio I've pushed an updated PR:

  • I've reverted the change in master CMakeLists.txt
  • I've fixed the automake build path, both normal and unity. Unity seems a little slow, but it does compile successfully on my system
  • I've set the default benchmark size to 3, which means just running the benchmark executable with no command line arguments will complete in a minute or two.

I'm not clear on why the Windows CMake target was failing ... there is nothing Windows-specific in the test-suite CMakeLists file. I hope that once the CI pipelines have run this time I'll get a hint as to what is wrong.

@lballabio (Owner)

It looks like splitting the test suite into library + executable breaks the detection of the test cases? I wouldn't be opposed to duplication here if that's what it takes...

Many thanks to Klaus Spanderen for help and guidance on
this work.

QuantLib is one of the few open source production quality quant libraries.
As such, platforms such as Phoronix and others have used the QuantLib
benchmark as a measure of how well different CPUs run "financial workloads".
This is only meaningful insofar as the QL benchmark resembles an overnight
risk run (the main computational workload that investment banks
select hardware for).

The original QL benchmark was sequential, which is clearly unrealistic.
In addition, it was rather light on more modern 2- and 3-factor models,
most of the tests it ran were rather short, it seemed to have little in
the way of barriers or products using AMC, and it computed a score
based on FLOP counts.

For the problems with using FLOPS as the only system performance metric,
please see the HPCG benchmark and discussion on their website.

My experience is that wall time of a large risk run is a metric that many
investment banks can get behind when comparing hardware.  This suggests
that the benchmark should in some way resemble a large risk run.  We
can at best approximate this rather loosely, since the makeup of a risk run
is highly dependent on the products, models, numerical techniques, etc., that
an organisation has.  We've chosen to focus on the following main features:
 * Work is expressed as tasks
 * A large fixed-size hopper of tasks is given to the machine. The number of
   tasks is independent of the machine being tested
 * The tasks are not all the same: they do different things, have different
   runtimes (some are very short, some are long), and use different numerical
   techniques
 * Tasks are all single threaded and all independent
 * The metric of performance is how quickly the system can complete all the tasks

This way, as long as we run the same number of tasks, the performance of machines
can be compared.

There is a potential tail effect here: the larger the number of tasks, the
smaller the tail effect. There is instrumentation to calculate the tail
effect; in my testing, for benchmark sizes S and up, the effect is small
(less than 3%). I schedule the longest-running tasks first (the approximate
relative runtime of each task is hard-coded into the benchmark); a sketch of
the overall scheduling idea follows below.
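
To make the task model concrete, here is a minimal sketch (all names are invented for illustration; this is not the PR's actual code) of a fixed hopper of independent single-threaded tasks, sorted longest-first and drained by a pool of worker threads, with wall time as the score:

```cpp
#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

// Illustrative stand-in for one benchmark task: the work itself plus a
// hard-coded relative runtime estimate used only for scheduling.
struct Task {
    std::function<void()> run;
    double relativeRuntime;
};

int main() {
    // The hopper is filled with the same fixed task list on every machine.
    std::vector<Task> hopper = /* ... machine-independent task list ... */ {};

    // Schedule the longest-running tasks first to shrink the tail effect.
    std::sort(hopper.begin(), hopper.end(),
              [](const Task& a, const Task& b) {
                  return a.relativeRuntime > b.relativeRuntime;
              });

    // Each worker repeatedly grabs the next unclaimed task; tasks are
    // single-threaded and independent, so no further coordination is needed.
    std::atomic<std::size_t> next{0};
    auto worker = [&] {
        for (std::size_t i; (i = next++) < hopper.size();)
            hopper[i].run();
    };

    const auto start = std::chrono::steady_clock::now();
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i)
        pool.emplace_back(worker);
    for (auto& t : pool)
        t.join();
    const std::chrono::duration<double> wall =
        std::chrono::steady_clock::now() - start;

    // The metric: how quickly the whole fixed workload was completed.
    std::printf("%zu tasks completed in %.2f s\n", hopper.size(), wall.count());
    return 0;
}
```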

The selection of work in each task is somewhat arbitrary.  We selected a broad range
of tests from the test suite, trying to cover as many different parts of the library
as we could. There is no right way to do this really.

It is also important to check that we get the right answer.  There is no point getting
the wrong answer faster.  Vendor compilers can make quite aggressive
optimisations; for example, the Intel Compiler 2023 enables fast-math by
default at -O3, which causes some tests to fail.

The boost test framework adds another layer of complexity.  If one "hacks" entry points
to the test functions by declaring their symbol names and calling them directly,
then all the checking logic inside boost is disabled: no exceptions are raised for
failures, and it's impossible to know whether the test passed or failed.
Conversely, if one calls the boost test framework and asks it to
execute the tests (which enables all the checks and exceptions) then a substantial overhead
is introduced.

We worked around this by running each test exactly once through the boost framework, trapping
exceptions and terminating as needed.  All other executions of a test/task called the
symbols directly, bypassing the boost test framework; a sketch of the idea
follows below.  This seems to me a reasonable compromise.  Note that there is
no way to reliably run BOOST_AUTO_TEST_CASE_TEMPLATE tests without the boost
framework, so these tests are out of scope for the benchmark.
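
As a rough sketch of that compromise (the test function, its registration, and main() below are invented stand-ins, not the benchmark's actual wiring):

```cpp
// Hybrid scheme sketch: one validation pass through Boost.Test, then
// direct calls to the test symbol for all subsequent (timed) executions.
#define BOOST_TEST_ALTERNATIVE_INIT_API
#include <boost/test/included/unit_test.hpp>

// Invented stand-in for one of the test suite's test functions.
void testTask() { BOOST_CHECK(2.0 + 2.0 == 4.0); }

bool initTests() {
    boost::unit_test::framework::master_test_suite()
        .add(BOOST_TEST_CASE(&testTask));
    return true;
}

int main(int argc, char* argv[]) {
    // Validation pass: run once through the Boost framework so that
    // failures are checked, reported, and become a non-zero exit code.
    int rc = boost::unit_test::unit_test_main(&initTests, argc, argv);
    if (rc != 0)
        return rc;

    // Timed passes: call the symbol directly. As described above, Boost's
    // checking logic is effectively disabled on this path, which is why the
    // validation pass must succeed first.
    for (int i = 0; i < 100; ++i)
        testTask();
    return 0;
}
```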

We also included extensive logging information, which was very useful in debugging performance
problems.

Lastly, it's important that people on small machines can run the benchmark with
relative ease, while on large machines the system is kept busy for around two
minutes or more.  We've approached this through the --size parameter.  The
recommended benchmark size for large machines (e.g. dual-socket servers with
~100 cores per socket) is S.  Benchmark sizes of M, L and above are reserved
for future growth.  Smaller systems should use benchmark sizes of XS, XXS or
XXXS.

It is crucial, once this patch is merged, that these benchmark sizes are NOT
CHANGED, since changing them would make it impossible to compare machines.
Machines can only be compared if they ran the same workload, i.e. they had the
same number of tasks.  We can introduce more T-shirt sizes at the front or end
of the list, but the existing T-shirt sizes in the benchmark must remain fixed.
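
To make that requirement concrete, a purely illustrative mapping (the multipliers below are invented, not the benchmark's real values) might look like this; the invariant is that existing entries never change, and growth happens only by appending new sizes:

```cpp
#include <map>
#include <string>

// Invented numbers for illustration only. Each T-shirt size must map to a
// fixed, machine-independent workload forever; new sizes may be added at
// either end of the list, but existing entries must never be redefined.
const std::map<std::string, unsigned> sizeMultiplier = {
    {"XXXS", 1}, {"XXS", 2}, {"XS", 4},
    {"S", 8},    {"M", 16},  {"L", 32},  // M, L reserved for future growth
};
```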
Use this to build the test-suite and benchmark executables.
This avoids having to do any additional compilation for the benchmark.
Instead, the same objects that were used to build the test suite are
used to build the benchmark.

This also more or less obviates the need for a CMake BUILD_BENCHMARK
variable.

The test tolerance is very strict, and even small changes in floating point
order cause the test to fail (the actual error is ~5e-15 rather than the
required 1e-15).  It is highly unlikely that failure in this case indicates
invalid assembly.  This patch allows both AOCC with -O3 -zopt -amdveclib and
ICPX with -O3 -xCORE-AVX512 -fp-model=precise to pass.
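
As a standalone illustration (not from the test suite) of why evaluation order matters at these tolerances: fast-math lets the compiler reassociate floating-point sums, and reassociation alone can flip the last bits of a result:

```cpp
#include <cstdio>

int main() {
    double a = 1e16, b = -1e16, c = 1.0;
    std::printf("%.17g\n", (a + b) + c); // prints 1
    std::printf("%.17g\n", a + (b + c)); // prints 0: b + c rounds back to -1e16
    return 0;
}
```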
@amd-jadutoit (Contributor, Author)

@lballabio @klausspanderen I hope this will pass now

  • Automake now works my end with and without static libs configured. It turns out that automake does not rebuild all the test object files when the same sources are passed to test-suite or benchmark, so this simplifies things considerably
  • On Windows, there were a few unused variables in the benchmark program which were treated as errors. The benchmark failed to build, which stopped the test suite from being built as well (I'm not sure why)
  • The final set of failures were due to the static library of test file objects. It turns out that when you link a static lib of boost unit test objects against the boost unit test driver object, no undefined symbols are found and the static library is essentially ignored (the linker only pulls objects from an archive when they resolve an undefined symbol, and auto-registered tests are never referenced directly). Hence, a working test executable with no tests. This is fixed now (at least on my system).

@coveralls
coveralls commented May 16, 2024

Coverage Status

Coverage remained the same at 72.497% when pulling 56a4fbc on
amd-jadutoit:jdt-new-benchmark into 7178092 on lballabio:master.

@lballabio (Owner)

Thanks—I added a small fix for "make dist" but it works now.

We don't have a CI build running the benchmark. Do you think I should add one?

@amd-jadutoit (Contributor, Author)

> Thanks—I added a small fix for "make dist" but it works now.
>
> We don't have a CI build running the benchmark. Do you think I should add one?

Thanks very much for the fix!

Yes please, I think that would be wise. As I discovered, when making changes to the build system it's sometimes easy to mess something up, and then an executable might not get built. Since the cost of building the benchmark is just a re-link of existing test-suite objects, it's negligible.

Running the benchmark with no arguments will run through all the 80+ tasks 3 times, which is pretty quick even on a small machine. The process should exit successfully, i.e. main() should return 0; if anything goes wrong, main() should return non-zero.

I don't think a separate CI pipe/build is necessary. We could add this as another executable to run on some existing builds, provided we make sure we build the benchmark as well. On CMake builds we should always build the benchmark if we build the test suite. On automake I think we need to explicitly enable it.

@lballabio lballabio added this to the Release 1.35 milestone May 17, 2024
@lballabio lballabio enabled auto-merge May 17, 2024 13:57
@lballabio lballabio merged commit 14377c2 into lballabio:master May 17, 2024
42 checks passed

boring-cyborg bot commented May 17, 2024

Congratulations on your first merged pull request!
