Upstream of AOCL 2.2.1 changes. #448
Merged
Details: - Made extra explicit the fact that: (a) multithreading in BLIS is disabled by default; and (b) even with multithreading enabled, the user must specify multithreading at runtime in order to observe parallelism. Thanks to M. Zhou for suggesting these clarifications in flame#292. - Also made explicit that only the environment variable and global runtime API methods are available when using the BLAS API. If the user wishes to use the local runtime API (specify multithreading on a per-call basis), one of the native BLIS APIs must be used.
Details: - Replaced the existing --enable-export-all / --disable-export-all configure option with --export-shared=[public|all], with the 'public' instance of the latter corresponding to --disable-export-all and the 'all' instance corresponding to --enable-export-all. Nothing else semantically about the option, or its default, has changed.
Details: - Adjusted the zen sub-configuration's cache blocksizes for float, scomplex, and dcomplex based on the existing values for double. (The previous values were taken directly from the haswell subconfig, which targets Intel Haswell/Broadwell/Skylake systems.)
Details: - Added a new markdown document, docs/Performance.md, which reports performance of a representative set of level-3 operations across a variety of hardware architectures, comparing BLIS to OpenBLAS and a vendor library (MKL on Intel/AMD, ARMPL on ARM). Performance graphs, in pdf and png formats, reside in docs/graphs. - Updated README.md to link to new Performance.md document. - Minor updates to CREDITS, docs/Multithreading.md. - Minor updates to matlab scripts in test/3/matlab.
Details: - Fixed a few broken section links in the Contents section.
Details: - Fixed some incorrect labels associated with the pdf/png graphs, apparently the result of copy-pasting.
Details: - Updated ReleaseNotes.md in preparation for next version.
Details: - Defined GFLOPS as billions of floating-point operations per second, and reworded the sentence after about normalization.
Details: - Added targets to test/3/Makefile that link against a BLAS library built by Eigen. It appears, however, that Eigen's BLAS library does not support multithreading. (It may be that multithreading is only available when using the native C++ APIs.) - Updated runme.sh with a few Eigen-related tweaks. - Minor tweaks to docs/Performance.md.
Details: - Modified bli_blas.h so that: - By default, if the BLAS layer is enabled at configure-time, BLAS prototypes are also enabled within blis.h; - But if the user #defines BLIS_DISABLE_BLAS_DEFS prior to including blis.h, BLAS prototypes are skipped over entirely so that, for example, the application or some other header pulled in by the application may prototype the BLAS functions without causing any duplication. - Updated docs/BuildSystem.md to document the feature above, and related text.
clang -dumpversion gives 4.2.1 for all clang versions, as clang was originally compatible with gcc 4.2.1. Apple clang version and clang version are two different things, and the real clang version cannot be deduced from the Apple clang version programmatically. Rely on Wikipedia to map Apple clang to clang version. Also fixes assembly detection with clang: clang 3.8 can't build knl as it doesn't recognize zmm0.
Details: - Use compile-time implementations of Eigen in test_gemm.c via new EIGEN cpp macro, defined on command line. (Linking to Eigen's BLAS library is not necessary.) However, as of Eigen 3.3.7, Eigen only parallelizes the gemm operation and not hemm, herk, trmm, trsm, or any other level-3 operation. - Fixed a bug in trmm and trsm drivers whereby the wrong function (bli_does_trans()) was being called to determine whether the object for matrix A should be created for a left- or right-side case. This was corrected by changing the function to bli_is_left(), as is done in the hemm driver. - Added support for running Eigen test drivers from runme.sh.
Details: - Adjusted test/3/Makefile so that the test drivers are linked against Eigen's BLAS library for hemm, herk, trmm, and trsm. We have to do this since Eigen's headers don't define implementations to the standard BLAS APIs. - Simplified #included headers in hemm, herk, trmm, and trsm source driver files, since nothing specific to Eigen is needed at compile-time for those operations.
Export macros can't support both shared and static at the same time. When blis is built with both shared and static, the headers assume that shared is used at link time and dllimport the symbols with the __imp_ prefix. To use the headers with static libraries, a user can give -DBLIS_EXPORT= to import the symbols without the __imp_ prefix.
Details: - Fixed the Makefile in test/3 so that it no longer incorrectly labels the matlab output variables from Eigen-linked hemm, herk, trmm, and trsm driver output as "vendor". (The gemm drivers were already correctly outputting matlab variables containing the "eigen" label.)
Details: - Updated matlab scripts in test/3/matlab to optionally plot/display Eigen performance curves. Whether Eigen is plotted is determined by a new boolean function parameter, with_eigen. - Updated runme.m scratchpad to reflect the latest invocations of the plot_panel_4x5() function (with Eigen plotting enabled).
Details: - Updated the Haswell, SkylakeX, and Epyc performance graphs in docs/graphs to report on Eigen implementations, where applicable. Specifically, Eigen implements all level-3 operations sequentially; however, of those operations, it only provides multithreaded gemm. Thus, mt results for symm/hemm, syrk/herk, trmm, and trsm are omitted. Thanks to Sameer Agarwal for his help configuring and using Eigen. - Updated docs/Performance.md to note the new implementation tested. - CREDITS file update.
Details: - Added/updated a few more details, mostly regarding Eigen.
Details: - Updated the level-3 performance graphs in docs/graphs with new Eigen results, this time using a development version cloned from their git mirror on March 27, 2019 (version 3.3.90). Performance is improved over 3.3.7, though still noticeably short of BLIS/MKL in most cases. - Very minor updates to docs/Performance.md and matlab scripts in test/3/matlab.
Details: - Renamed kernels/armv8a/3/bli_gemm_armv8a_opt_4x4.c to kernels/armv8a/3/bli_gemm_armv8a_asm_d6x8.c. This follows the naming convention used by other kernel sets, most notably haswell.
Change void*-typed function pointers to void_fp. - Updated all instances of void* variables that store function pointers to variables of a new type, void_fp. Originally, I wanted to define the type of void_fp as "void (*void_fp)( void )"--that is, a pointer to a function with no return value and no arguments. However, once I did this, I realized that gcc complains with incompatible pointer type (-Wincompatible-pointer-types) warnings every time any such a pointer is being assigned to its final, type-accurate function pointer type. That is, gcc will silently typecast a void* to another defined function pointer type (e.g. dscalv_ker_ft) during an assignment from the former to the latter, but the same statement will trigger a warning when typecasting from a void_fp type. I suspect an explicit typecast is needed in order to avoid the warning, which I'm not willing to insert at this time. - Added a typedef to bli_type_defs.h defining void_fp as void*, along with a commented-out version of the aborted definition described above. (Note that POSIX requires that void* and function pointers be interchangeable; it is the C standard that does not provide this guarantee.) - Comment updates to various _oapi.c files.
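The trade-off described above can be sketched in standard C. This is a minimal illustration, not BLIS code: `generic_fp`, `scal_ker_ft`, `my_dscalv`, `set_kernel`, and `get_scal_kernel` are hypothetical names, and the kernel signature is simplified. It shows the "aborted" definition (a pointer to a function with no arguments and no return value) together with the explicit casts it requires on the way in and out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; real BLIS kernel types take more arguments. */
typedef void (*generic_fp)(void);                         /* type-erased function pointer */
typedef void (*scal_ker_ft)(int n, double alpha, double *x);

/* A toy "kernel": x := alpha * x. */
void my_dscalv(int n, double alpha, double *x)
{
    for (int i = 0; i < n; ++i)
        x[i] *= alpha;
}

/* A one-slot kernel table that stores the pointer in type-erased form. */
static generic_fp kernel_slot = NULL;

void set_kernel(generic_fp fp)
{
    kernel_slot = fp;
}

scal_ker_ft get_scal_kernel(void)
{
    assert(kernel_slot != NULL);
    /* Explicit cast back to the type-accurate pointer. Without it, gcc
       reports -Wincompatible-pointer-types on the conversion. */
    return (scal_ker_ft)kernel_slot;
}
```

Storing requires a cast as well, e.g. `set_kernel((generic_fp)my_dscalv);`. Converting between function pointer types and back is defined by the C standard as long as the call goes through the original type, which is the guarantee that a `void*` typedef does not have outside POSIX.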
Details: - Added more details and clarifying language to implications of 1m and the recycling of microkernels between microarchitectures.
Details: - Fixed a minor bug in flatten-headers.py whereby the script, upon encountering a #include directive for the root header file, would erroneously recurse and inline the contents of that root header. The script has been modified to avoid recursion into any headers that share the same name as the root-level header that was passed into the script. (Note: this bug didn't actually manifest in BLIS, so it's merely a precaution for usage of flatten-headers.py in other contexts.)
Details:
- Changed the default installation prefix from $HOME/lib to /usr/local.
- Modified the way configure internally handles the prefix, libdir, includedir, and sharedir (and also added an --exec-prefix option). The defaults to these variables are set as follows:
    prefix: /usr/local
    exec_prefix: ${prefix}
    libdir: ${exec_prefix}/lib
    includedir: ${prefix}/include
    sharedir: ${prefix}/share
  The key change, aside from the addition of exec_prefix and its use to define the default to libdir, is that the variables are substituted into config.mk with quoting that delays evaluation, meaning the substituted values may contain unevaluated references to other variables (namely, ${prefix} and ${exec_prefix}). This more closely follows GNU conventions, including those used by GNU autoconf, and also allows make to override any one of the variables *after* configure has already been run (e.g. during 'make install').
- Updates to build/config.mk.in pursuant to above changes.
- Updates to output of 'configure --help' pursuant to above changes.
- Updated docs/BuildSystem.md to reflect the new default installation prefix, as well as mention EXECPREFIX and SHAREDIR.
- Changed the definitions of the UNINSTALL_OLD_* variables in the top-level Makefile to use $(wildcard ...) instead of 'find'. This was motivated by the new way of handling prefix and friends, which leads to the 'find' command being run on /usr/local (by default), which can take a while and almost never yields any benefit (since the user will very rarely use the uninstall-old targets).
- Removed periods from the end of descriptive output statements (i.e., non-verbose output) since those statements often end with file or directory paths, which get confusing to read when punctuated by a period.
- Trivial change to 'make showconfig' output.
- Removed my name from 'configure --help'. (Many have contributed to it over the years.)
- In configure script, changed the default state of threading_model variable from 'no' to 'off' to match that of debug_type, where there are similarly more than two valid states. ('no' is still accepted if given via the --enable-debug= option, though it will be standardized to 'off' prior to config.mk being written out.)
- Minor variable name change in flatten-headers.py that was intended for 32812ff.
- CREDITS file update.
Details: - Somehow the variable name change (root_file_name -> root_inputname) in flatten-headers.py mentioned in the commit log entry for 89a70cc didn't make it into the actual commit. This commit applies that change.
Details: - Added preprocessor branches to test/3/test_gemm.c to explicitly support row-stored matrices. Column-stored matrices are also still supported (and are the default for now). (This is mainly residual work left over from the initial integration of Eigen into the test drivers, so if we ever want to test Eigen with row-stored matrices, the code will be ready to use, even if it is not yet integrated into the Makefile in test/3.)
…mats and non Transpose/Conjugate Matrices. Failure was seen in libflame function (FLASH_UDdate_UT_inc) due to typecasting double complex pointer as double pointer. Change-Id: If6e2f4663575450a13a9a07dddd5622628f5c6b0
This will ensure early return in case full gemm processing is not needed. Depending on which dimension is found to be zero, the following actions are taken: If 'c' has a zero dimension, no further processing is required. If alpha is zero or if 'a' or 'b' has a zero dimension, we perform a scalm operation instead of gemm. (c = alpha*a + beta*b) Change-Id: Icc031944fc4e80138adf991974547f2d57ab570b AMD-Internal: [CPUPL-904]
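A minimal sketch of such an early-return check follows. It assumes column-major storage and assumes that the scalm fallback amounts to C := beta * C, the usual BLAS convention when the product term vanishes. The names `scal_matrix` and `gemm_early_return` are hypothetical and do not correspond to functions in the commit.

```c
#include <assert.h>
#include <stdbool.h>

/* Scale an m x n column-major matrix in place: C := beta * C.
   When beta is zero, C is overwritten rather than multiplied. */
void scal_matrix(int m, int n, double beta, double *c, int ldc)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            c[i + j * ldc] = (beta == 0.0) ? 0.0 : beta * c[i + j * ldc];
}

/* Returns true if the call was fully handled here, so that the caller
   can skip the full gemm. C is m x n, A is m x k, B is k x n. */
bool gemm_early_return(int m, int n, int k,
                       double alpha, double beta, double *c, int ldc)
{
    assert(m >= 0 && n >= 0 && k >= 0);

    /* C has a zero dimension: nothing to compute or update. */
    if (m == 0 || n == 0)
        return true;

    /* alpha is zero, or A and B share a zero dimension: the product
       term contributes nothing, so only the scaling of C remains. */
    if (alpha == 0.0 || k == 0) {
        scal_matrix(m, n, beta, c, ldc);
        return true;
    }

    return false; /* full gemm processing is needed */
}
```

A caller would invoke `gemm_early_return()` first and fall through to the regular code path only when it returns false.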
Change-Id: Icad0ff1c1858c1762792ba8f2c5c3e846909cbb5
…o amd-staging-rome-2.2
Details: - Optimized saxpyf kernel with fuse_factor=5 and iter_unroll=2. - Modified framework files of sgemv to remove dependency on cntx variable. - Updated cntx_init file of zen2 to choose optimized kernels. - Modified BLAS interface call for SGEMV to reduce framework overhead. - Currently these changes are applicable for zen2 configuration. Change-Id: Iabc36ae640e82e65f8764f3c6dee513ad64b22fd Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com> AMD-Internal: [CPUPL-707]
Added traces from the blas/cblas APIs down to the kernels for dgemm and sgemm. Traces are disabled by default; users need to enable them in their local workspace (see the aocl_dtl/aocldtlcf.h file). AMD Internal : CPUPL-806 Change-Id: I83b310509fb1a599c114387192bcf882ef0480f9
Change-Id: I0f902e32085058ec618d08470793f5e5e49719b3
…rements Multiple trace levels allow the user to limit traces to a chosen depth of nested calls, which also reduces file size requirements. Also optimized auto trace output to reduce file size by removing thread IDs from individual lines. AMD Internal: [CPUPL-806] Change-Id: I28e08a5bdf1b147469d8ce290ff7cde7f74481bd
Added a BLIS-specific extension to AOCL DTL, which adds support for printing the input matrix sizes from the BLIS library. AMD Internal: [CPUPL-806] Change-Id: I80ed779d65f9b1c48466137fc2f05629fa2fb561
Ported this library to Windows 10 using CMake scripts and Visual Studio 2019 with the clang compiler. AMD internal:[CPUPL-657] Change-Id: Ie701f52ebc0e0585201ba703b6284ac94fc0feb9
Details: - Fixed an innocuous bug that manifested when running the testsuite on extremely small matrices with randomization via the "powers of 2 in narrow precision range" option enabled. When the randomization function emits a perfect 0.0 to fill a 1x1 matrix, the testsuite will then compute 0.0/0.0 during the normalization process, which leads to NaN residuals. The solution entails smarter implementations of randv, randnv, randm, and randnm, each of which will compute the 1-norm of the vector or matrix in question. If the object has a 1-norm of 0.0, the object is re-randomized until the 1-norm is not 0.0. Thanks to Kiran Varaganti for reporting this issue (flame#413). - Updated the implementation of randm_unb_var1() so that it loops over a call to the randv_unb_var1() implementation directly rather than calling it indirectly via randv(). This was done to avoid the overhead of multiple calls to norm1v() when randomizing the rows/columns of a matrix. - Updated comments. Change-Id: I0e3d65ff97b26afde614da746e17ed33646839d1
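The re-randomization loop can be sketched as follows. This is an illustration only: `rand_pow2_or_zero`, `norm1v_sketch`, and `randv_nonzero` are hypothetical names, and the generator is a simplified stand-in for the testsuite's "powers of 2 in narrow precision range" option.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified generator: returns 0.0 or a signed power of two in
   {1, 1/2, 1/4, 1/8}. The exact 0.0 is what triggers the problem. */
double rand_pow2_or_zero(void)
{
    int r = rand() % 8;                          /* 0..7 */
    if (r == 0)
        return 0.0;
    double mag = 1.0 / (double)(1 << (r % 4));   /* 1, 1/2, 1/4, 1/8 */
    return (r < 4) ? mag : -mag;
}

/* 1-norm of a vector: sum of absolute values. */
double norm1v_sketch(int n, const double *x)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += (x[i] < 0.0) ? -x[i] : x[i];
    return sum;
}

/* Randomize x, repeating until its 1-norm is nonzero, so that a later
   normalization never computes 0.0/0.0. */
void randv_nonzero(int n, double *x)
{
    assert(n > 0 && x != NULL);
    do {
        for (int i = 0; i < n; ++i)
            x[i] = rand_pow2_or_zero();
    } while (norm1v_sketch(n, x) == 0.0);
}
```

For a 1x1 object the loop simply retries whenever the generator emits an exact zero; for larger objects a retry is rarely needed, since every element would have to be zero.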
Details: - Added a new API which computes a matrix-matrix product with general matrices but updates only the upper or lower triangular part of the result matrix: cblas_?gemmt() and ?gemmt_(). - These routines are similar to the ?gemm routines, but they only access and update a triangular part of the square result matrix. - Added DGEMMT functionality by reusing GEMM kernels. - Created a new folder for GEMMT under l3, and added GEMMT specific framework code. - Modified cntl_create routine to choose a different macro kernel for GEMMT. - Added routines to copy the lower/upper triangular part of a block to the buffer. - Defined BLIS, BLAS and CBLAS interface APIs for GEMMT. - Added test_gemmt.c to test folder and updated the Makefile. - Added a macro 'CBLAS' in test_gemm.c to call CBLAS APIs. Change-Id: Ie00c1a15b9c654b65c687a9ca781cbc6f9641791
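For reference, the semantics of the operation can be written as a naive triple loop. The sketch below assumes column-major storage and no transposition of A or B; `uplo_sketch_t` and `ref_dgemmt` are hypothetical names and not the interfaces added by this commit.

```c
#include <assert.h>

typedef enum { UPLO_LOWER, UPLO_UPPER } uplo_sketch_t; /* hypothetical */

/* Reference gemmt: C := beta*C + alpha*A*B, touching only the chosen
   triangle of the n x n matrix C. A is n x k and B is k x n. */
void ref_dgemmt(uplo_sketch_t uplo, int n, int k,
                double alpha, const double *a, int lda,
                const double *b, int ldb,
                double beta, double *c, int ldc)
{
    assert(n >= 0 && k >= 0);

    for (int j = 0; j < n; ++j) {
        /* Row range of column j that lies inside the triangle. */
        int i_begin = (uplo == UPLO_LOWER) ? j : 0;
        int i_end   = (uplo == UPLO_LOWER) ? n : j + 1;

        for (int i = i_begin; i < i_end; ++i) {
            double ab = 0.0;
            for (int p = 0; p < k; ++p)
                ab += a[i + p * lda] * b[p + j * ldb];
            c[i + j * ldc] = beta * c[i + j * ldc] + alpha * ab;
        }
    }
}
```

Elements outside the selected triangle are neither read nor written, which is the property that distinguishes gemmt from a full gemm.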
…lso supports complex data types. Details: - Added framework code for GEMMT SUP. - Implemented SUP for GEMMT using similar techniques as native path. - Moved update routines to frame/util folder. - Ported update routines for complex datatypes. Change-Id: I17adfd0586d07f5a23dca6a07b2d48f4c9fcf71c Signed-off-by: Meghana Vankadari <Meghana.Vankadari@amd.com>, Dipal M Zambare <DipalMadhukar.Zambare@amd.com>, Mangala V <managala.v@amd.com>
…vironment. 1) Added a dcomplex-based zdotc_ version as a function with an additional parameter. 2) The (single, double, complex) datatype functions are retained as macros. 3) This modification handles the ZDOTC_ invocation from Fortran-based applications for 'double complex' datatypes. 4) The modifications are placed under the macro 'AOCL_F2C'. 5) Blis and Blas test suites verified ALL PASS with GCC and Flang, with and without the 'AOCL_F2C' macro, on an Ubuntu machine. 6) Added BLIS_EXPORT_BLAS to make the APIs visible when linking the dll. Change-Id: I4ada39a73f416e3794708f5b55e947342c261117 Signed-off-by: Meghana <Meghana.Vankadari@amd.com>, Nagendra <Nagendra.PrasadM@amd.com> AMD-Internal: [SWLCSG-177]
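The "additional parameter" convention can be illustrated with a small sketch. It assumes the f2c-style calling convention in which a complex function result is written through a leading pointer argument instead of being returned by value. The type `dcomplex_sketch` and both function names are hypothetical, and only positive increments are handled.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double real; double imag; } dcomplex_sketch; /* hypothetical */

/* Value-returning form: sum over i of conj(x[i]) * y[i]. */
dcomplex_sketch zdotc_by_value(int n, const dcomplex_sketch *x, int incx,
                               const dcomplex_sketch *y, int incy)
{
    assert(incx > 0 && incy > 0);
    dcomplex_sketch s = { 0.0, 0.0 };
    for (int i = 0; i < n; ++i) {
        const dcomplex_sketch xi = x[i * incx];
        const dcomplex_sketch yi = y[i * incy];
        s.real += xi.real * yi.real + xi.imag * yi.imag;
        s.imag += xi.real * yi.imag - xi.imag * yi.real;
    }
    return s;
}

/* f2c-style form: the result comes back through the extra first
   parameter, and the scalar arguments are passed by reference. */
void zdotc_hidden_result(dcomplex_sketch *result, const int *n,
                         const dcomplex_sketch *x, const int *incx,
                         const dcomplex_sketch *y, const int *incy)
{
    assert(result != NULL);
    *result = zdotc_by_value(*n, x, *incx, y, *incy);
}
```

Which of the two forms a Fortran caller expects depends on the compiler's ABI for complex return values, which is why such a change would be guarded by a macro.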
Details: - Since GEMM kernel prefers row-storage, if input C matrix is in col-major order, entire operation is transposed. In that case uplo(c) needs to be toggled before kernel-variant selection. - disabled "bli_gemmsup_ref_var1n2m_opt_cases" inside gemmtsup. - Updated version number to 2.2.1 Change-Id: I0a85df1141fc4a98d98ea4e0c3d42db8602fa69b
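The reason uplo(c) must be toggled can be seen in a small sketch: transposing a matrix view swaps its row and column strides, and the stored triangle flips with it. The names `tri_view_t`, `transpose_view`, `in_triangle`, and `offset_of` are hypothetical and only illustrate the idea.

```c
#include <assert.h>

typedef enum { TRI_LOWER, TRI_UPPER } tri_t;              /* hypothetical */
typedef struct { int rs; int cs; tri_t uplo; } tri_view_t; /* strides + uplo */

/* Transpose a view: swap the strides and toggle the triangle. */
tri_view_t transpose_view(tri_view_t v)
{
    tri_view_t t = { v.cs, v.rs,
                     (v.uplo == TRI_LOWER) ? TRI_UPPER : TRI_LOWER };
    return t;
}

/* Is element (i, j) part of the stored triangle of the view? */
int in_triangle(tri_view_t v, int i, int j)
{
    assert(i >= 0 && j >= 0);
    return (v.uplo == TRI_LOWER) ? (i >= j) : (i <= j);
}

/* Memory offset of element (i, j) under the view's strides. */
int offset_of(tri_view_t v, int i, int j)
{
    return i * v.rs + j * v.cs;
}
```

Element (i, j) of the original view and element (j, i) of the transposed view refer to the same memory location and agree on triangle membership, so forgetting the toggle would update the wrong triangle.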
Details: - BLIS test application throws an error when built with the dynamic library: "Undefined reference to bli_abort". This happens because bli_abort is hidden and cannot be linked from outside. Annotated the prototype with BLIS_EXPORT_BLAS to make it public. Change-Id: I0d7aec046e8871ba6491024694ed06f883b005ac AMD Internal: [CPUPL-1030]
…nels Change-Id: Ib309aba0cb08161877fd1a720ed65222d3b303f3
Details: - Since C is triangular, in order to maintain load balance among threads, we need to use weighted range partitioning. Change-Id: I03d8ff71ac7af843acd787f1389b5907b56453ee
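A simplified sketch of weighted range partitioning for a lower-triangular n x n matrix follows. It is not the routine used in BLIS: `partition_lower_tri` is a hypothetical name, the partitioning is over single columns rather than blocks, and the greedy boundary choice is only one of several reasonable options.

```c
#include <assert.h>
#include <stddef.h>

/* Split columns [0, n) of a lower-triangular n x n matrix among nt
   threads so that each contiguous range holds a similar number of
   stored elements. Column j holds n - j stored elements. On return,
   thread t owns columns [start[t], start[t+1]); start has nt+1 entries. */
void partition_lower_tri(int n, int nt, int *start)
{
    assert(n >= 0 && nt > 0 && start != NULL);

    long total = (long)n * (n + 1) / 2;  /* stored elements overall */
    long acc = 0;                        /* elements in columns [0, j) */
    int j = 0;

    start[0] = 0;
    for (int t = 1; t < nt; ++t) {
        long target = total * t / nt;    /* ideal cumulative work */
        while (j < n && acc < target) {
            acc += n - j;
            ++j;
        }
        start[t] = j;
    }
    start[nt] = n;
}
```

For n = 8 and two threads, the split lands at column 3, giving 21 and 15 stored elements, whereas an equal-width split at column 4 would give 26 and 10.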
Details: - Unlike default path, storage scheme of C is not always row-major in SUP. - Whenever C is col-major, the temporary buffer 'ct' is also chosen to be col-major. - Since update routines only support row-major order, a transpose is induced for c and ct buffers before passing them to update routine. Change-Id: I3fea10860f39632df7540c9399786e7aa1cfba37
Details: - If there are any zero rows or columns along the edges of an MCxNC block of C, shrink the dimensions to avoid "no-op" iterations. - For the lower-triangle kernel variant, added a flag to determine if a block that is strictly below the diagonal is reached. Once such a block is reached, the flag is set; all the blocks below it are also strictly below the diagonal, and the flag is used to make that decision. - For the upper-triangle kernel variant, whenever a block that is strictly below the triangle is reached, break the for loop and go to the next iteration of the JR loop, because all the blocks below it will also be strictly below the diagonal and are filled with zeroes, which requires no computation. Change-Id: I606b0f900509aab6ed7ff30cefee9d7207b7b010
The testsuite covers all combinations of upper, lower, transpose and API formats. AMD Internal: [CPUPL-1021] Change-Id: I2a1d79eba1dcaf4217fd9c2c346bd6173b80a782
Details: - Problem: If row major, the first four elements of the last column of output matrix C were not updated. If col major, the first four elements of the last row of output matrix C were not updated. - Solution: Updating elements after computation is done on the right offset in bli_dgemmsup_rv_haswell_asm_5x8() Change-Id: I588c60f2f3cd5f51e475cfc140e3bf0e9d5a4dae
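The block tests that drive this kind of skipping can be sketched as a classification of each block against the diagonal. The sketch below considers the lower triangle (elements with i >= j) and uses square bs x bs blocks; `blk_pos_t`, `classify_lower`, and `count_active_blocks` are hypothetical names, and the loop structure is far simpler than the macro-kernel loops in the commit.

```c
#include <assert.h>

typedef enum { BLK_OUTSIDE, BLK_DIAGONAL, BLK_INSIDE } blk_pos_t; /* hypothetical */

/* Classify an m x n block whose top-left element is (i0, j0) against
   the lower triangle, i.e. the elements with i >= j. */
blk_pos_t classify_lower(int i0, int j0, int m, int n)
{
    assert(m > 0 && n > 0);
    if (i0 + m - 1 < j0)
        return BLK_OUTSIDE;   /* largest row < smallest column: no work */
    if (i0 >= j0 + n - 1)
        return BLK_INSIDE;    /* smallest row >= largest column: dense */
    return BLK_DIAGONAL;      /* block intersects the diagonal */
}

/* Count the blocks that need any computation when an n x n
   lower-triangular C is swept in bs x bs blocks. */
int count_active_blocks(int n, int bs)
{
    assert(n >= 0 && bs > 0);
    int active = 0;
    for (int j0 = 0; j0 < n; j0 += bs) {
        for (int i0 = 0; i0 < n; i0 += bs) {
            int m = (n - i0 < bs) ? n - i0 : bs;  /* edge blocks shrink */
            int w = (n - j0 < bs) ? n - j0 : bs;
            if (classify_lower(i0, j0, m, w) != BLK_OUTSIDE)
                ++active;
        }
    }
    return active;
}
```

Once a block in a column is classified as inside, every block below it in the same column is inside as well, which is what allows a flag or an early loop exit to replace the per-block test.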
…ixed" This reverts commit 725bf5a. Reason for revert: <INSERT REASONING HERE> Change-Id: I7dd6b84731f091c8b39080ed9321a708fa5f11d8
Ported GEMMT changes to Windows. AMD Internal : [CPUPL-1061] Change-Id: I587d1789cd29ea18b04f8ab43e5742b4d902067a
jeffhammond changed the title from "Upstream of AOCL 2.2.1 chagnes." to "Upstream of AOCL 2.2.1 changes." on Sep 23, 2020
fgvanzee added a commit that referenced this pull request on Nov 2, 2020
Details: - Removed a few flags that slipped into the recent merge of #448 which *may* be causing breakage. This commit moves amd_config.mk back to the state it is in, more or less, in the 'master' branch.
fgvanzee added a commit that referenced this pull request on Nov 14, 2020
Merged contributions from AMD's AOCL BLIS (#448).
Details:
- Added support for level-3 operation gemmt, which performs a gemm on only the lower or upper triangle of a square matrix C. For now, only the conventional/large code path will be supported (in vanilla BLIS). This was accomplished by leveraging the existing variant logic for herk. However, some of the infrastructure to support a gemmtsup is included in this commit, including
  - A bli_gemmtsup() front-end, similar to bli_gemmsup().
  - A bli_gemmtsup_ref() reference handler function.
  - A bli_gemmtsup_int() variant chooser function (with variant calls commented out).
- Added support for inducing complex domain gemmt via the 1m method.
- Added gemmt APIs to the BLAS and CBLAS compatibility layers.
- Added gemmt test module to testsuite.
- Added standalone gemmt test driver to 'test' directory.
- Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md.
- Added a C++ template header (blis.hh) containing a BLAS-inspired wrapper to a set of polymorphic CBLAS-like function wrappers defined in another header (cblas.hh). These two headers are installed if running the 'install' target with INSTALL_HH set to 'yes'. (Also added a set of unit tests that exercise blis.hh, although they are disabled for now because they aren't compatible with out-of-tree builds.) These files now live in the 'vendor' top-level directory.
- Various updates to 'zen' and 'zen2' subconfigurations, particularly within the context initialization functions.
- Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and various minor updates to dotv and scalv kernels. Also added various sup kernels contributed by AMD to kernels/zen/3. However, these kernels are (for now) not yet used, in part because they caused AppVeyor clang failures, and also because I have not found time to review and vet them.
- Output the python found during configure into the definition of PYTHON in build/config.mk (via build/config.mk.in).
- Added early-return checks (A, B, or C with zero dimension; alpha = 0) to bli_gemm_front.c.
- Implemented explicit beta = 0 handling for the sgemm ukernel in bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent bug surfaced because the gemmt module verifies its computation using gemm with its beta parameter set to zero, which, on a cortexa15 system, caused the gemm kernel code to unconditionally multiply the uninitialized C data by beta. The C matrix likely contained non-numeric values such as NaN, which then would have resulted in a false failure.
- Fixed a bug whereby the implementation for bli_herk_determine_kc(), in bli_l3_blocksize.c, was inadvertently being defined in terms of helper functions meant for trmm. This bug was probably harmless since the trmm code should have also done the right thing for herk.
- Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in kernels/zen/3/bli_gemm_small.c since those macros are not used in vanilla BLIS.
- Added cpp guard to definition of bli_mem_clear() in bli_mem.h to accommodate C++'s stricter type checking.
- Added cpp guard to test/*.c drivers that facilitate compilation on Windows systems.
- Various whitespace changes.
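The beta = 0 issue mentioned above comes down to IEEE arithmetic: 0.0 * NaN is NaN, so a kernel that always computes beta*C + alpha*AB propagates whatever garbage an uninitialized C holds. A minimal sketch of the explicit handling is shown below; `ukr_update` is a hypothetical name, and the tile layout (column-major, with the product held in a dense m x n buffer) is an assumption for illustration.

```c
#include <assert.h>

/* Final update step of a gemm microkernel on an m x n tile:
   C := beta*C + alpha*AB, where ab holds the m x n product in
   column-major order with leading dimension m. */
void ukr_update(int m, int n, double alpha, const double *ab,
                double beta, double *c, int ldc)
{
    assert(m >= 0 && n >= 0);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double t = alpha * ab[i + j * m];
            if (beta == 0.0)
                c[i + j * ldc] = t;                  /* never read C */
            else
                c[i + j * ldc] = beta * c[i + j * ldc] + t;
        }
    }
}
```

With the explicit branch, a C tile that starts out holding NaN or Inf is simply overwritten when beta is zero, which is the behavior BLAS callers rely on.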
The following features and optimizations were added in this release.
Please do the needful and let us know if you have any questions.