
Fortran underscores #264

Closed
jkd2016 opened this issue Oct 9, 2018 · 31 comments

@jkd2016

jkd2016 commented Oct 9, 2018

I'm currently adding BLIS to our electronic structure code Elk (elk.sourceforge.net).

There is an issue with calling the BLIS API subroutines, namely they don't have leading underscores as required by Fortran.

For example, to set the number of threads at run-time, I can use

call mkl_set_num_threads(nthd)
call openblas_set_num_threads(nthd)

with MKL and OpenBLAS, but

call bli_thread_set_num_threads(nthd)

doesn't work because bli_thread_set_num_threads_ is not the name in the library.

Any ideas on how to resolve this in a portable way?

@jeffhammond
Member

You could use Fortran 2003's C interoperability features but I suspect that this is just a trivial bug that someone will fix quickly.

@devinamatthews
Member

Not really a bug since we haven't made any effort to make anything other than regular BLAS functions available in Fortran. F2003 is one option, or you could:

  1. Set the BLIS_NUM_THREADS (or OMP_NUM_THREADS) environment variable, either before the program starts or during the run but before the first BLAS call.
  2. Write a super simple wrapper in C that exports a Fortran interface.

@jkd2016
Author

jkd2016 commented Oct 9, 2018

Actually Elk controls how many threads each nesting level gets. When it reaches the deepest nesting levels and there are still idle threads, it assigns them to BLAS or LAPACK routines.

Hence it has to have access to the MKL, OpenBLAS or BLIS threading routines during runtime.

I would like to request that these and other BLIS routines be made Fortran-compatible. The other libraries appear to have both conventions included (eg. zgemm and zgemm_).

I understand it's a bit annoying to have to do this, but it would certainly be useful for Elk and other Fortran codes to be able to access the BLIS API.

@rvdg
Collaborator

rvdg commented Oct 9, 2018

Interesting. We have always viewed the BLIS native APIs as being for C and C++ users. It is good to see that Fortran users are open to more flexible interfaces as well.

I'm not sure where on the priority list this should fall... We'll huddle.

@fgvanzee
Member

fgvanzee commented Oct 9, 2018

Thanks for your comments, everyone.

@jkd2016 As much as I would like, in principle, to provide you with a ready-to-use solution that suits your needs, we have taken many steps to keep BLIS completely divorced from Fortran dependencies. That is why we shy away from anything that causes BLIS to rely on a Fortran compiler. And while it would be possible to create Fortran-compatible API wrappers in C99 for all of BLIS's native APIs, doing so would be a huge time sink and probably not used by very many people. (I don't say this to imply that your needs are not important, but rather because we have to strategically invest our limited resources for maximum impact.)

I think it would be best to wait until the right person volunteered to contribute this kind of code. If someone else wrote it, I might be willing to include it in the BLIS source distribution as long as it was completely optional and as long as most of the ongoing technical support was provided by others.

I would like to request that these and other BLIS routines be made Fortran-compatible. The other libraries appear to have both conventions included (eg. zgemm and zgemm_).

If you would like us to provide BLAS APIs without an underscore (in addition to with an underscore) I would be happy to add this to my to-do list. However, I don't think that is what you are interested in. (Please correct me if I'm mistaken.)

@jeffhammond Thanks for opening this issue.

@jkd2016
Author

jkd2016 commented Oct 9, 2018

Adding the wrapper routine "bli_thread_set_num_threads_" wouldn't suddenly make BLIS depend on Fortran. All the BLAS routines in BLIS already have the underscore.

If BLIS and libFLAME are AMD's counterpart to MKL, then there is certainly incentive to be able to link to Fortran codes. We have found that zgemm applied to a 15000x15000 matrix with multi-threaded BLIS on a 32-core Ryzen 2990WX processor is about twice as fast as MKL (43 seconds to 80 seconds). Hence the interest in getting it to work with Elk and our motivation for buying AMD hardware.

Where MKL wins (even on AMD CPUs) is multi-threaded LAPACK. The eigenvalue solver zheev scales almost linearly with the number of threads whereas BLIS+libFLAME-LAPACK shows no improvement. For a 5000x5000 matrix zheev with MKL takes 45 seconds on 32 cores and BLIS+libFLAME takes 411 seconds.

In short, there are good reasons to make BLIS Fortran-friendly without making it Fortran-dependent.

@fgvanzee
Member

fgvanzee commented Oct 9, 2018

Adding the wrapper routine "bli_thread_set_num_threads_" wouldn't suddenly make BLIS depend on Fortran. All the BLAS routines in BLIS already have the underscore.

Sorry, I was sloppily arguing multiple points simultaneously. I agree that adding a bli_thread_set_num_threads_() wrapper would not be all that difficult. Somehow, I missed that part and thought you were asking for Fortran-compatible wrappers for all computational functions in BLIS. Also, I completely understand your point about the nuance between Fortran dependency and Fortran compatibility. The comments on Fortran dependency were aimed at any hypothetical solution to your request that required the BLIS build system to compile, for example, a Fortran compatibility module. So, sorry for all of that confusion on my part.

If BLIS and libFLAME are AMD's counterpart to MKL, then there is certainly incentive to be able to link to Fortran codes. We have found that zgemm applied to a 15000x15000 matrix with multi-threaded BLIS on a 32-core Ryzen 2990WX processor is about twice as fast as MKL (43 seconds to 80 seconds).

That is impressive. I'm actually surprised by those results, and now I understand a little better why you are playing the part of squeaky wheel.

Where MKL wins (even on AMD CPUs) is multi-threaded LAPACK. The eigenvalue solver zheev scales almost linearly with the number of threads whereas BLIS+libFLAME-LAPACK shows no improvement. For a 5000x5000 matrix zheev with MKL takes 45 seconds on 32 cores and BLIS+libFLAME takes 411 seconds.

libflame has advanced Hermitian eigenvalue decomposition (and general SVD) based on the QR method similar in spirit to zheev(), but that functionality was last tuned for SSE-era hardware and thus does not perform well on modern architectures, so I am not surprised here either. Rewriting or substantially updating much of libflame has always been on my long-term radar, but I'm not yet at a place where I can meaningfully pivot there. If I had to guess, what I am more likely to do is (sometime within the next year or so) begin implementing some key LAPACK operations directly within BLIS--probably Cholesky, LU, and QR factorizations. This will make it much easier to perform certain optimizations that are not available if you restrict yourself to the BLAS interface. (libflame depends on BLAS and currently only supports the BLAS API for that purpose.)

In short, there are good reasons to make BLIS Fortran-friendly without making it Fortran-dependent.

I don't disagree.

How much mileage would you get if I provided Fortran wrappers to only the runtime thread-related API functions? (I'm primarily thinking of bli_thread_set_num_threads_() and friends.) Specifically, what I'm offering is a set of interfaces that (a) append underscores to the native function names, and (b) use pass-by-reference (pointer) calling conventions. Would this be sufficient for now?

@jeffhammond
Member

I will discuss the Fortran API for BLIS on #265...

@jkd2016
Author

jkd2016 commented Oct 10, 2018

How much mileage would you get if I provided Fortran wrappers to only the runtime thread-related API functions? (I'm primarily thinking of bli_thread_set_num_threads_() and friends.) Specifically, what I'm offering is a set of interfaces that (a) append underscores to the native function names, and (b) use pass-by-reference (pointer) calling conventions. Would this be sufficient for now?

This would be great: I just need bli_thread_set_num_threads_ for Elk at the moment, but I'll leave it to your judgment to add other routines (Elk uses mkl_set_dynamic(.false.) to switch off dynamic thread allocation).

In future it would be interesting to try some of the non-BLAS vector operations in BLIS. Elk relies heavily on complex vector operations some of which are not available in BLAS. One pressing example is to calculate many successive complex dot-products: right now I have to call zdotc multiple times and it would be good to dispense with the associated overhead.

In the longer term, multi-threaded LAPACK optimised for AMD hardware would be most welcome, particularly for complex matrices of about 10000x10000 elements.

I performed a few benchmarks on our machines. We just bought a 32 core Ryzen 2990WX with 64GB RAM for testing but we also have a cluster with Intel Xeon E5-2680 nodes. Here is how BLIS, OpenBLAS and MKL compare for zgemm with 15000x15000 matrices:

# AMD 2990wx, BLIS(ZEN)
cores      time (seconds)
  1          856
  2          435
  4          234
  8          119
  16         70
  32         42


# AMD 2990wx, OpenBLAS
cores      time (seconds)
  1          867
  2          452
  4          227
  8          118
  16         67
  32         41

# AMD 2990wx, MKL
cores      time (seconds)
  1          1771
  2          886
  4          449
  8          232
  16         135
  32         80

# Intel Xeon E5-2680 v3, MKL
cores      time (seconds)
  1          624
  2          315
  4          158
  8          80
  16         44
  24         35

# Intel Xeon E5-2680 v3, BLIS
cores      time (seconds)
  1          722
  2          372
  4          218
  8          103
  16         58
  24         43

As you can see, BLIS and OpenBLAS are virtually identical and are faster than MKL on AMD hardware. MKL is somewhat faster than BLIS on the Xeon processors. The Intel processors are faster overall, probably because the amount of memory being thrown around is quite large (~10GB).

@fgvanzee
Member

(Elk uses mkl_set_dynamic(.false.) to switch off dynamic thread allocation).

Could you fill me in on the implications of this? I am not a regular user of non-BLAS MKL functions.

In future it would be interesting to try some of the non-BLAS vector operations in BLIS.

Which operations are you referring to here? Things like dotxv and xpbyv?

Elk relies heavily on complex vector operations some of which are not available in BLAS. One pressing example is to calculate many successive complex dot-products: right now I have to call zdotc multiple times and it would be good to dispense with the associated overhead.

Are all of these dot products the same length? Are any of the vectors reused across dot products? If yes to both, you can cast the collection of operations as one or more matrix multiplications.

I performed a few benchmarks on our machines. We just bought a 32 core Ryzen 2990WX with 64GB RAM for testing but we also have a cluster with Intel Xeon E5-2680 nodes. Here is how BLIS, OpenBLAS and MKL compare for zgemm with 15000x15000 matrices:

Thanks for sharing these results. I'm not surprised by MKL outperforming BLIS on Intel hardware, and frankly, nor should anyone else be surprised. It's not a fair fight--not even close. Intel is a company with essentially limitless resources, in money and talent, and they have intimate knowledge of how their hardware works since, well, they designed it. This puts them in a uniquely advantageous position to produce high-performance software. And us? We're basically the equivalent of a scrappy startup company that sneezed in the direction of a university. Most importantly, BLIS can do plenty of things that MKL or any other BLAS library cannot, so we aren't particularly upset about being 10% slower for many/most operations.

One last comment: Measuring in raw time is somewhat foreign to us. We usually measure performance in units of gigaflops (gflops), which allows us to characterize performance as a fraction of the theoretical peak rate of computation. (Time in seconds only works as a vehicle for comparison when you have a second implementation against which to compare.) We also look at a range of problem sizes (e.g. 100 to 2000 in increments of 100). This allows the reader to see how quickly performance asymptotes towards the attained peak. We have standalone test drivers that generate these gflops numbers, so it's relatively easy to get up and running with them. Granted, our method is detail-oriented and geared towards those who live and breathe this stuff, so it is not for everyone.

@loveshack
Contributor

I haven't kept up with the situation in current compilers, but after many years' experience of that sort of thing, I'd definitely recommend the f2003-ish features. They may not apply here, but there are potential gotchas to assuming you can just use names with different case/underscoring in general.

[Glad to see the packaging I'm getting into Fedora will have a consumer for the BLIS interface in the ELK package.]

@jeffhammond
Member

The Intel processors are faster overall, probably because the amount of memory being thrown around is quite large (~10GB).

I don't know why memory would have anything to do with DGEMM performance. DGEMM is not bandwidth-limited for 15000x15000 matrices.

The likely explanation here is that Xeon v3 processors aka Haswell uarch support dual-ported 256-bit FMAs whereas Zen appears to have dual-ported 128-bit FMAs.


Thanks for sharing these results. I'm not surprised by MKL outperforming BLIS on Intel hardware, and frankly, nor should anyone else be surprised. It's not a fair fight--not even close. Intel is a company with essentially limitless resources, in money and talent, and they have intimate knowledge of how their hardware works since, well, they designed it. This puts them in a uniquely advantageous position to produce high-performance software.

Indeed. I made similar comments on https://scicomp.stackexchange.com/a/20727/150.

Any time somebody outside Intel beats MKL by a nontrivial amount, I report it to the MKL team. It is fantastic for any open-source project to get within 10% of MKL.

Most importantly, BLIS can do plenty of things that MKL or any other BLAS library cannot, so we aren't particularly upset about being 10% slower for many/most operations.

And this is why Intel funds BLIS development.


Disclaimer

I work for Intel.

@fgvanzee
Member

The likely explanation here is that Xeon v3 processors aka Haswell uarch support dual-ported 256-bit FMAs whereas Zen appears to have dual-ported 128-bit FMAs.

100% agree with this.

@jkd2016
Author

jkd2016 commented Oct 10, 2018

(Elk uses mkl_set_dynamic(.false.) to switch off dynamic thread allocation).

Could you fill me in on the implications of this? I am not a regular user of non-BLAS MKL functions.

This tells MKL not to dynamically adjust the number of threads. Elk chooses the number of threads based on how many are idle at the time of the call to MKL.

In future it would be interesting to try some of the non-BLAS vector operations in BLIS.

Which operations are you referring to here? Things like dotxv and xpbyv?

For example, there is a heavily used routine in which a complex constant times a real vector is accumulated in a complex vector. Currently this is done by two calls to daxpy but it would be more efficient to have a BLAS-like routine to do this.

Another example is a routine which requires an operation like

A := alpha*x*y**H + A                                                   (1)

The BLAS routine zher2 performs only the symmetrised version:

A := alpha*x*y**H + conjg( alpha )*y*x**H + A                           (2)

so we had to hand-code (1) instead.

Elk relies heavily on complex vector operations some of which are not available in BLAS. One pressing example is to calculate many successive complex dot-products: right now I have to call zdotc multiple times and it would be good to dispense with the associated overhead.

Are all of these dot products the same length? Are any of the vectors reused across dot products? If yes to both, you can cast the collection of operations as one or more matrix multiplications.

The vectors are all different but of the same length. You could use a matrix multiplication A^H B, but only the diagonal part would be needed. This is one of the most heavily used parts of Elk (called millions of times in a typical calculation).

I performed a few benchmarks on our machines. We just bought a 32 core Ryzen 2990WX with 64GB RAM for testing but we also have a cluster with Intel Xeon E5-2680 nodes. Here is how BLIS, OpenBLAS and MKL compare for zgemm with 15000x15000 matrices:

Thanks for sharing these results. I'm not surprised by MKL outperforming BLIS on Intel hardware, and frankly, nor should anyone else be surprised. It's not a fair fight--not even close. Intel is a company with essentially limitless resources, in money and talent, and they have intimate knowledge of how their hardware works since, well, they designed it. This puts them in a uniquely advantageous position to produce high-performance software. And us? We're basically the equivalent of a scrappy startup company that sneezed in the direction of a university. Most importantly, BLIS can do plenty of things that MKL or any other BLAS library cannot, so we aren't particularly upset about being 10% slower for many/most operations.

Actually I was very impressed with the performance of BLIS, particularly the scaling with the number of threads. If you can produce the same scaling with libFLAME LAPACK + BLIS BLAS we would certainly use it. Right now the optimal library choice on the Ryzen is MKL LAPACK + BLIS BLAS. This gives the highest performance for Elk.

One last comment: Measuring in raw time is somewhat foreign to us. We usually measure performance in units of gigaflops (gflops), which allows us to characterize performance as a fraction of the theoretical peak rate of computation. (Time in seconds only works as a vehicle for comparison when you have a second implementation against which to compare.) We also look at a range of problem sizes (e.g. 100 to 2000 in increments of 100). This allows the reader to see how quickly performance asymptotes towards the attained peak. We have standalone test drivers that generate these gflops numbers, so it's relatively easy to get up and running with them. Granted, our method is detail-oriented and geared towards those who live and breathe this stuff, so it is not for everyone.

Understood. We typically run Elk on up to 1000 cores with MPI + OpenMP + MKL (or BLIS) hybrid, nested parallelism on a distributed file system. It's a complex parallel environment to have to write code for and I'm amazed it works as well as it does. We don't really count gflops but just like the code to run as fast as possible so we don't have to wait for results.

I'll add a BLIS interface to Elk for the next official release. Thanks for all your help.

@jkd2016
Author

jkd2016 commented Oct 10, 2018

The Intel processors are faster overall, probably because the amount of memory being thrown around is quite large (~10GB).

I don't know why memory would have anything to do with DGEMM performance. DGEMM is not bandwidth-limited for 15000x15000 matrices.

The likely explanation here is that Xeon v3 processors aka Haswell uarch support dual-ported 256-bit FMAs whereas Zen appears to have dual-ported 128-bit FMAs.

I think you're right (the calculation used zgemm not dgemm). I tried several smaller sizes and the results were the same: the Xeon cores are about 30% faster even with the lower clock speed.

Any time somebody outside Intel beats MKL by a nontrivial amount, I report it to the MKL team. It is fantastic for any open-source project to get within 10% of MKL.

Our Ryzen machine also has an NVIDIA Titan V graphics card. Using CUDA NVBLAS the same 15000x15000 zgemm matrix multiplication takes 6.6 seconds, which includes the copying between CPU and GPU memory; this is about 5 times faster than 24 Xeon cores with MKL. The Ryzen 2990WX + Titan V combination gives a lot of double-precision compute power per dollar, the tricky bit will be getting them to do useful work simultaneously on a complicated code like Elk.

@fgvanzee
Member

This tells MKL not to dynamically adjust the number of threads. Elk chooses the number of threads based on how many are idle at the time of the call to MKL.

Got it, thanks.

Another example is a routine which requires an operation like

A := alpha*x*y**H + A                                                   (1)

That looks like BLIS's zger to me (or zgerc, as BLAS calls it). Am I missing something? Or are some of the operands real?

Related to this: We just presented mixed-domain/mixed-precision (mixed-datatype) functionality, albeit only for gemm, at the BLIS Retreat in September. I'm currently making final preparations to commit this new code to github. You can find slides of my presentation here. The slides reference a link to single-core performance of all 128 mixed-datatype combinations here, with comparison to a "dumb wrapper" that tries to do the same datatype mixing as best as possible but through BLAS (in this case, OpenBLAS).

I'll add a BLIS interface to Elk for the next official release. Thanks for all your help.

Sure thing. I'll let you know when I have the Fortran-compatible thread APIs committed.

@jkd2016
Author

jkd2016 commented Oct 10, 2018

This tells MKL not to dynamically adjust the number of threads. Elk chooses the number of threads based on how many are idle at the time of the call to MKL.

Got it, thanks.

Another example is a routine which requires an operation like

A := alpha*x*y**H + A                                                   (1)

That looks like BLIS's zger to me (or zgerc, as BLAS calls it). Am I missing something? Or are some of the operands real?

We specifically needed A := x^* y^t + A and not A := x y^H + A, which represents an inner product not an outer product. This was some years ago now but I recall that first conjugating x and y and then using zgerc was slower than rolling our own. I should probably try again with BLIS...

Related to this: We just presented mixed-domain/mixed-precision (mixed-datatype) functionality, albeit only for gemm, at the BLIS Retreat in September. I'm currently making final preparations to commit this new code to github. You can find slides of my presentation here. The slides reference a link to single-core performance of all 128 mixed-datatype combinations here, with comparison to a "dumb wrapper" that tries to do the same datatype mixing as best as possible but through BLAS (in this case, OpenBLAS).

That could be useful. There is a little bit of mixed precision in Elk where memory is an issue, but we're usually quite wary about lowering precision in the code. Probably just paranoia...

@fgvanzee
Member

We specifically needed A := x^* y^t + A and not A := x y^H + A, which represents an inner product not an outer product.

I'm unfamiliar with x^*. What operation is that?

I'm also still unclear what operation you are trying to perform. You first gave the examples:

Another example is a routine which requires an operation like

A := alpha*x*y**H + A                                                   (1)

The BLAS routine zher2 performs only the symmetrised version:

A := alpha*x*y**H + conjg( alpha )*y*x**H + A                           (2)

And it sounded like what you wanted (1) was close to (2) but not quite, and (2) is zher2. So it led me to believe that you wanted some level-2 operation involving a matrix and two vectors.

Maybe it would be clearer to define the dimensions of all the objects in your desired operation. For example, an outer product A := x y^H + A involves an m x n matrix A updated by an m x 1 vector x multiplied by an n x 1 vector y (prior to Hermitian-transpose).

@devinamatthews
Member

I believe what he means is zgerc but with the first vector conjugated instead of the second.

@jkd2016
Author

jkd2016 commented Oct 10, 2018

I believe what he means is zgerc but with the first vector conjugated instead of the second.

Correct. I could have been a bit clearer.

By ^* I mean conjugation and ^t means transpose; this follows LaTeX notation.

@rvdg
Collaborator

rvdg commented Oct 10, 2018

A := x^* y^T + A means

A^T := y x^H + A^T

if I am right. This means this can be handled by switching the strides in the BLIS API.

@fgvanzee
Member

fgvanzee commented Oct 10, 2018

if I am right. This means this can be handled by switching the strides in the BLIS API.

Maybe so, but absolutely no one should have to do this at the user level. I didn't create BLIS's APIs so that people would continue to think about dorking with strides in order to achieve their goals. I created them to free people of thinking in such terms.

@jkd2016 The functionality you want is implemented natively within BLIS's ger operation. BLIS's ger allows you to conjugate x, y, both, or neither independent of the transpose on y. BLIS's ger is more general-purpose than BLAS. (In other words: BLIS is not BLAS. BLIS is much more than BLAS with a BLAS compatibility layer thrown in as an afterthought.)

Caveat: ger in BLIS is not optimized for the complex domain for most architectures because most of the time, we don't provide the complex domain assembly/intrinsics implementation for axpyv, which is the fundamental kernel used by ger. (Hint to @rvdg: complex level-1v and level-1f kernels are important, and yet we give them no love.)

@rvdg
Collaborator

rvdg commented Oct 10, 2018

Hint to @field: optimizing level-1 and 2 have long been on my wish list... You just don't read my poker face very well.

@fgvanzee
Member

@rvdg Nice try, but you're not going to convince Mr. Complex that it's his fault that we haven't spent enough time on the complex domain.

(BTW, you should use my actual handle, @fgvanzee. You probably just called attention from some poor unsuspecting github user who has no idea why he needs to pay attention to this thread.)

@jkd2016
Author

jkd2016 commented Oct 10, 2018

@jkd2016 The functionality you want is implemented natively within BLIS's ger operation. BLIS's ger allows you to conjugate x, y, both, or neither independent of the transpose on y. BLIS's ger is more general-purpose than BLAS. (In other words: BLIS is not BLAS. BLIS is much more than BLAS with a BLAS compatibility layer thrown in as an afterthought.)

That would certainly do the job.

Here is what is in Elk at the moment (sorry about the Fortran):

call omp_hold(n,nthd)
!$OMP PARALLEL DEFAULT(SHARED) &
!$OMP PRIVATE(k,z1,a1,b1) &
!$OMP NUM_THREADS(nthd)
!$OMP DO
do j=1,n
  k=(j-1)*ld
  z1=y(j)
  if (abs(dble(z1)).gt.eps) then
    if (abs(aimag(z1)).gt.eps) then
! complex prefactor
      call zaxpy(j-1,z1,x,1,a(k+1),1)
      a(k+j)=dble(a(k+j))+dble(z1*x(j))
    else
! real prefactor
      a1=dble(z1)
      call daxpy(2*(j-1),a1,x,1,a(k+1),1)
      a(k+j)=dble(a(k+j))+a1*dble(x(j))
    end if
  else if (abs(aimag(z1)).gt.eps) then
! imaginary prefactor
    b1=aimag(z1)
    a(k+1:k+j-1)=a(k+1:k+j-1)+b1*cmplx(-aimag(x(1:j-1)),dble(x(1:j-1)),8)
    a(k+j)=dble(a(k+j))-b1*aimag(x(j))
  end if
end do
!$OMP END DO
!$OMP END PARALLEL
call omp_free(nthd)

The vector x is conjugated external to this routine. This saves two complex conjugations per element over using zgerc. The routine also checks whether the prefactor is complex, real or imaginary. It also invokes parallelism but only if there are any idle threads. This routine is nested four OpenMP levels down with MPI above that. However, there is no possibility of thread over-subscription, which is what often happens when things are left to the compiler.

We find it is good practice to keep checking whether optimisations like these, which were an advantage a few years ago, remain so today with better compilers and libraries.

@fgvanzee
Member

fgvanzee commented Oct 10, 2018

This saves two complex conjugations per element over using zgerc.

It is things like this (no conjugation allowed on vector x) that motivated us to formulate the BLIS APIs. The original BLAS treats the complex domain as somewhat of a second-class citizen. Why was "conjugation without transposition" left out of the set of valid values of trans arguments? And why were the zsyr and zsyr2 operations not defined along with the other level-2 operations? These design decisions in the BLAS were mistakes to which BLIS finally provides sensible solutions.

We find it is good practice to keep checking whether optimisations like these, which were an advantage a few years ago, remain so today with better compilers and libraries.

I'm glad you continued to keep your ear to the ground on this, and I'm delighted that we might make your code a bit cleaner.

@fgvanzee
Member

@jkd2016 Would you care to share your full name so I can acknowledge you in the git commit log entry and in the CREDITS file? (This offer to cite contributing individuals is standard procedure on our part.)

@jkd2016
Author

jkd2016 commented Oct 11, 2018

@jkd2016 Would you care to share your full name so I can acknowledge you in the git commit log entry and in the CREDITS file? (This offer to cite contributing individuals is standard procedure on our part.)

Thanks. It's Kay Dewhurst (Max Planck Institute, Halle, Germany).

@jkd2016
Author

jkd2016 commented Oct 11, 2018

This saves two complex conjugations per element over using zgerc.

It is things like this (no conjugation allowed on vector x) that motivated us to formulate the BLIS APIs. The original BLAS treats the complex domain as somewhat of a second-class citizen. Why was "conjugation without transposition" left out of the set of valid values of trans arguments? And why were the zsyr and zsyr2 operations not defined along with the other level-2 operations? These design decisions in the BLAS were mistakes to which BLIS finally provides sensible solutions.

We find it is good practice to keep checking whether optimisations like these, which were an advantage a few years ago, remain so today with better compilers and libraries.

I'm glad you continued to keep your ear to the ground on this, and I'm delighted that we might make your code a bit cleaner.

So the above Fortran code would be replaced with

call omp_hold(maxthds,nthd)
call bli_thread_set_num_threads(nthd)
call bli_zger(BLIS_CONJUGATE,BLIS_NO_CONJUGATE,n,n,zone,x,1,y,1,a,1,1)
call bli_thread_set_num_threads(1)
call omp_free(nthd)

...which is neater and faster to boot.

Also, Elk needs a similar operation when x and y are the same vectors. I see that I could use bli_zher for this (there is no BLAS analogue).

I'm starting to like this a lot.

@fgvanzee
Member

fgvanzee commented Oct 11, 2018

Also, Elk needs a similar operation when x and y are the same vectors. I see that I could use bli_zher for this (there is no BLAS analogue).

Yes, BLIS zher implements A := A + alpha * conjx(x) * conjx(x)^H, which is a superset of what zher from the BLAS provides.

I'm starting to like this a lot.

Glad to hear. I always thought that people would like BLIS the more they used it, especially if/when they strayed away from the BLAS interfaces.

call bli_thread_set_num_threads(nthd)

The code is ready to try out in the master branch (starting with commit 667d392). The new interfaces are prototyped in frame/compat/blis/thread/b77_thread.h. The one you seem to be interested in is:

void bli_thread_set_num_threads_( const f77_int* nt );

Here, f77_int is the (signed) integer defined for the BLAS compatibility API. The size of this integer is determined when BLIS is configured. (BLIS supports both 32-bit and 64-bit integers for the BLAS API.) You would, of course, use regular Fortran integers in your own code; f77_int is merely meant to be the C-equivalent. But just be sure to use the same size integer that was determined by BLIS at configure-time:

configure: the BLAS compatibility layer is enabled.
configure: the CBLAS compatibility layer is disabled.
configure: the internal integer size is automatically determined.
configure: the BLAS/CBLAS interface integer size is 32-bit.

Please let us know how it goes. We're counting on you to test these functions out! :) (Nobody here at UT uses Fortran test drivers.)

@devinamatthews
Member

I'm closing this in favor of #265. For bli_thread_set_num_threads by itself I endorse the F2003 solution, since it is the simplest change.
