
New QuantLib benchmark for machine comparison #1962

Merged
merged 12 commits, May 17, 2024

Commits on May 3, 2024

  1. enable papi 6.0 or higher

    klausspanderen authored and amd-jadutoit committed May 3, 2024
    05e7775

  2. 5ce41a1

  3. remove papi linker

    klausspanderen authored and amd-jadutoit committed May 3, 2024
    9581d6a

  4. adjust configure

    klausspanderen authored and amd-jadutoit committed May 3, 2024
    12e62b4

  5. 0259c40

  6. .

    klausspanderen authored and amd-jadutoit committed May 3, 2024
    cdfa72d

Commits on May 15, 2024

  1. Overhaul QuantLib benchmark

    Many thanks to Klaus Spanderen for help and guidance on
    this work.
    
    QuantLib is one of the few open-source, production-quality quant libraries.
    As such, platforms such as Phoronix and others have used the QuantLib
    benchmark as a measure of how well different CPUs run "financial workloads".
    This is only meaningful insofar as the QL benchmark resembles an overnight
    risk run (the main computational workload that investment banks
    select hardware for).
    
    The original QL benchmark was sequential, which is clearly unrealistic.
    In addition, it was rather light on more modern 2- and 3-factor models;
    most of the tests it ran were rather short; it had little in the way of
    barriers or products using AMC; and it computed a score based on FLOP counts.
    
    For a discussion of the problems with using FLOPS as the only system
    performance metric, see the HPCG benchmark and the discussion on its website.
    
    My experience is that wall time of a large risk run is a metric that many
    investment banks can get behind when comparing hardware.  This suggests
    that the benchmark should in some way resemble a large risk run.  We
    can at best approximate this rather loosely, since the makeup of a risk run
    is highly dependent on the products, models, numerical techniques, etc.,
    that an organisation has. We've chosen to focus on the following main features:
     * Work is expressed as tasks
     * A large fixed-size hopper of tasks is given to the machine. The number of
       tasks is independent of the machine being tested
     * The tasks are not the same: they do different things, have different runtimes,
       some are very short, some are long, they run different numerical techniques, etc.
     * Tasks are all single threaded and all independent
     * The metric of performance is how quickly the system can complete all the tasks
    
    This way, as long as we run the same number of tasks, the performance of machines
    can be compared.
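
    As a minimal sketch of this scheme (the task bodies, counts and thread
    pool here are illustrative stand-ins, not the benchmark's actual code):

        #include <algorithm>
        #include <atomic>
        #include <chrono>
        #include <cstdio>
        #include <functional>
        #include <thread>
        #include <vector>

        int main() {
            // Fixed-size hopper: the number of tasks does not depend on the machine.
            std::vector<std::function<void()>> tasks(1000, [] {
                volatile double x = 0.0;              // stand-in for a pricing task
                for (int i = 0; i < 100000; ++i) x += i * 1e-9;
            });

            std::atomic<std::size_t> next{0};
            auto worker = [&] {                       // each task is single threaded
                for (std::size_t i; (i = next++) < tasks.size();)
                    tasks[i]();
            };

            auto t0 = std::chrono::steady_clock::now();
            unsigned n = std::max(1u, std::thread::hardware_concurrency());
            std::vector<std::thread> pool;
            for (unsigned i = 0; i < n; ++i) pool.emplace_back(worker);
            for (auto& t : pool) t.join();
            auto t1 = std::chrono::steady_clock::now();

            // The performance metric: wall time to complete all the tasks.
            std::printf("wall time: %.3f s\n",
                        std::chrono::duration<double>(t1 - t0).count());
        }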
    
    There is a potential tail effect here: the larger the number of tasks, the smaller
    the tail effect. There is instrumentation to calculate the tail effect; in my testing,
    for benchmark sizes S and up, the effect is small (less than 3%). I schedule the
    longest-running tasks first (the approx. relative runtime of each task is hard-coded
    into the benchmark).
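
    A sketch of that ordering, with invented relative runtimes:

        #include <algorithm>
        #include <string>
        #include <vector>

        struct Task {
            std::string name;
            double relativeRuntime;   // hard-coded estimate; values illustrative
        };

        void scheduleLongestFirst(std::vector<Task>& tasks) {
            // Dispatch the most expensive tasks first, so no long task is
            // left running alone at the end of the run (the "tail").
            std::sort(tasks.begin(), tasks.end(),
                      [](const Task& a, const Task& b) {
                          return a.relativeRuntime > b.relativeRuntime;
                      });
        }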
    
    The selection of work in each task is somewhat arbitrary. We selected a broad range
    of tests from the test suite, trying to cover as many different parts of the library
    as we could. There is really no single right way to do this.
    
    It is also important to check that we get the right answer. There is no point getting
    the wrong answer faster. Vendor compilers can make quite aggressive optimisations;
    for example, the Intel Compiler 2023 enables fast-math by default at O3, which causes
    some tests to fail.
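
    One concrete illustration (not from the benchmark itself): fast-math lets
    the compiler reassociate floating-point sums, and reassociation alone can
    change a result well beyond a 1e-15 tolerance:

        #include <cstdio>

        int main() {
            // Floating-point addition is not associative; fast-math permits
            // the compiler to reorder it anyway, e.g. to vectorise a reduction.
            double a = 1e16, b = -1e16, c = 1.0;
            std::printf("%g vs %g\n", (a + b) + c, a + (b + c));  // prints 1 vs 0
        }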
    
    The boost test framework adds another layer of complexity. If one "hacks" entry points
    to the test functions by declaring their symbol names and calling them directly,
    then all the checking logic inside boost is disabled: no exceptions are raised for
    failures, and it's impossible to know whether the test passed or failed.
    Conversely, if one calls the boost test framework and asks it to
    execute the tests (which enables all the checks and exceptions) then a substantial overhead
    is introduced.
    
    We worked around this by running each test exactly once through the boost framework, trapping
    exceptions and terminating as needed.  All other executions of a test/task called the
    symbols directly, bypassing the boost test framework.  This seems to me to give
    a reasonable compromise.  Note that there is no way to reliably run
    BOOST_AUTO_TEST_CASE_TEMPLATE tests without the boost framework, so these tests
    are out of scope for the benchmark.
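
    A sketch of the direct-call approach, assuming the struct-with-test_method
    shape that BOOST_AUTO_TEST_CASE generates (the namespace and test name
    below are illustrative):

        // Declare the symbol that BOOST_AUTO_TEST_CASE(testFdAmericanGreeks)
        // generated inside the test suite, without including any Boost headers.
        namespace QuantLibTests {
            namespace AmericanOptionTests {
                struct testFdAmericanGreeks { void test_method(); };
            }
        }

        void runTaskDirectly() {
            // Calling test_method() directly bypasses the Boost.Test framework:
            // fast, but BOOST_CHECK failures go unreported, which is why each
            // test is validated once through the framework first.
            QuantLibTests::AmericanOptionTests::testFdAmericanGreeks{}.test_method();
        }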
    
    We also included extensive logging information, which was very useful in debugging performance
    problems.
    
    Lastly, it's important that people on small machines can run the benchmark with relative
    ease, while on large machines the system is kept busy for around two minutes or more.
    We've approached this through the --size parameter. The recommended benchmark size for
    large machines (e.g. dual-socket servers with ~100 cores per socket) is S. Benchmark
    sizes of M, L and above are reserved for future growth. Smaller systems should use
    benchmark sizes of XS, XXS or XXXS.
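
    A hypothetical sketch of how such a --size flag can pin the workload
    (the multipliers here are invented; the real task counts live in the
    benchmark source):

        #include <map>
        #include <stdexcept>
        #include <string>

        // Illustrative only: maps each T-shirt size to a fixed number of task
        // copies, so the workload depends on --size but never on the machine.
        unsigned tasksForSize(const std::string& size) {
            static const std::map<std::string, unsigned> sizes = {
                {"XXXS", 1}, {"XXS", 4}, {"XS", 16}, {"S", 64},
                {"M", 256}, {"L", 1024}   // reserved for future growth
            };
            auto it = sizes.find(size);
            if (it == sizes.end())
                throw std::invalid_argument("unknown benchmark size: " + size);
            return it->second;
        }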
    
    It is crucial, once this patch is merged, that these benchmark sizes are NOT CHANGED,
    since that would make it impossible to compare machines. Machines can only be compared
    if they ran the same workload, i.e. they had the same number of tasks. We can introduce
    more T-shirt sizes at the front or end of the list, but the existing T-shirt sizes in
    the benchmark must remain fixed.
    amd-jadutoit committed May 15, 2024
    600a0e0
  2. Create a CMake Object library for the test-suite objects.

    Use this to build the test-suite and benchmark executables.
    This avoids having to do any additional compilation for the benchmark.
    Instead, the same objects that were used to build the test suite are
    used to build the benchmark.
    
    This also more or less obviates the need for a CMake BUILD_BENCHMARK
    variable.
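
    A sketch of the object-library pattern, with illustrative target names:

        # Compile the test-suite sources once, into an object library...
        add_library(ql_test_objects OBJECT ${QL_TEST_SOURCES})
        target_link_libraries(ql_test_objects
            PUBLIC ql_library Boost::unit_test_framework)

        # ...then reuse the same objects for both executables (no recompilation).
        add_executable(test-suite testsuite-main.cpp)
        target_link_libraries(test-suite PRIVATE ql_test_objects)

        add_executable(benchmark benchmark-main.cpp)
        target_link_libraries(benchmark PRIVATE ql_test_objects)
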
    amd-jadutoit committed May 15, 2024
    b7f6194

  3. e18ed01

  4. Relax tolerance on AmericanOption test

    The test tolerance is very strict, and even small changes in floating-point
    ordering cause the test to fail (the actual error is ~5e-15 rather than the
    required 1e-15). It is highly unlikely that a failure in this case indicates
    invalid assembly.
    This patch allows both AOCC with -O3 -zopt -amdveclib and ICPX with -O3 -xCORE-AVX512
    -fp-model=precise to pass.
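
    For context, a minimal sketch of the kind of absolute-error check involved
    (the function name is illustrative):

        #include <cmath>

        // Illustrative only: at 1e-15 an absolute tolerance sits at the edge
        // of double precision, so reordered arithmetic (error ~5e-15) fails
        // the check even though the result is effectively correct.
        bool withinTolerance(double calculated, double expected, double tolerance) {
            return std::fabs(calculated - expected) <= tolerance;
        }
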
    amd-jadutoit committed May 15, 2024
    fec4e07

Commits on May 16, 2024

  1. Fix "make dist"

    lballabio committed May 16, 2024
    24a2386

Commits on May 17, 2024

  1. 56a4fbc