Pretty (and) fast matrix multiplication
Combining C++ and BLAS to get readable and efficient code
Assume we have matrices $A$, $B$, and $C$ of dimension $N \times N$ and want to compute $$ C = A\,B\,, $$
or, in components, $$ C_{ij} = \sum_{k=0}^{N-1} A_{ik} B_{kj} \quad \forall i, j \in [0, N-1]\,. $$
The naïve implementation
```cpp
#include <vector>

int constexpr N = 2000;

int main() {
    // assumed to be stored in column-major format
    std::vector<double> A(N*N), B(N*N), C(N*N);

    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            for (int k = 0; k < N; ++k) {
                // C_ij = sum_k A_ik * B_kj (vectors are zero-initialised)
                C[i + N * j] += A[i + N * k] * B[k + N * j];
            }
        }
    }

    return 0;
}
```
This code is quite readable – it is not too hard to see it computes a matrix product – but it is certainly not pretty.
Incidentally, it is also not fast.
This can be seen by, e.g., using the `<chrono>` library:
```cpp
#include <vector>
#include <chrono>
#include <iostream>

int constexpr N = 2000;

int main() {
    // assumed to be stored in column-major format
    std::vector<double> A(N*N), B(N*N), C(N*N);

    auto timerStart = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            for (int k = 0; k < N; ++k) {
                C[i + N * j] += A[i + N * k] * B[k + N * j];
            }
        }
    }
    auto timerStop = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(timerStop - timerStart);
    std::cout << "Run time: " << duration.count() << "ms\n";

    return 0;
}
```
Running on my machine, this results in
```
Run time: 27493ms
```
A cache-friendlier version
Pulling a page from the book “Matrix Computations”, by Golub and van Loan1, we should reorder those loops to make better use of the CPU cache. In particular, the code
```cpp
#include <vector>
#include <chrono>
#include <iostream>

int constexpr N = 2000;

int main() {
    // assumed to be stored in column-major format
    std::vector<double> A(N*N), B(N*N), C(N*N);

    auto timerStart = std::chrono::high_resolution_clock::now();
    for (int j = 0; j < N; ++j) {
        for (int k = 0; k < N; ++k) {
            for (int i = 0; i < N; ++i) {
                C[i + N * j] += A[i + N * k] * B[k + N * j];
            }
        }
    }
    auto timerStop = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(timerStop - timerStart);
    std::cout << "Run time: " << duration.count() << "ms\n";

    return 0;
}
```
accesses the elements of both `A` and `C` sequentially, while `B[k + N * j]` is constant within the inner loop.
The new execution time after this simple change is
```
Run time: 5250ms
```
So the code is now faster, but still not pretty.
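As an aside, the loop-invariant read of `B[k + N * j]` can be hoisted into a local variable to make the access pattern explicit. This is just a sketch of the same jki ordering (no separate timing is shown for it); optimising compilers will usually perform this hoist on their own.

```cpp
// Same jki ordering as above, with the loop-invariant element of B hoisted out.
for (int j = 0; j < N; ++j) {
    for (int k = 0; k < N; ++k) {
        double const b_kj = B[k + N * j];          // constant within the inner loop
        for (int i = 0; i < N; ++i) {
            C[i + N * j] += A[i + N * k] * b_kj;   // unit-stride access to A and C
        }
    }
}
```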
Here comes Eigen
You may be asking yourself at this point: “What about Eigen?” Using Eigen, the code is considerably clearer,
```cpp
#include <chrono>
#include <iostream>
#include <Eigen/Core>

int constexpr N = 2000;

int main() {
    Eigen::MatrixXd A(N, N), B(N, N), C(N, N);

    auto timerStart = std::chrono::high_resolution_clock::now();
    C = A * B;
    auto timerStop = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(timerStop - timerStart);
    std::cout << "Run time: " << duration.count() << "ms\n";

    return 0;
}
```
and also faster:
```
Run time: 1020ms
```
From our initial version, we have gained a speedup factor of almost $30$ and a “pretty” factor of $+ \infty$.
This is, of course, not the end of the story. Eigen is a well-known library that has been around for a while now. If I were here to tell people to use it, I’d be a few years too late.
But how fast can we really go?
Those of you who are into the numerical linear algebra business are undoubtedly aware of BLAS. It is a FORTRAN library2 that provides a standard framework for Basic Linear Algebra Subprograms (hence the name).
However, BLAS was standardised back in the 1970s, and its subroutines are meant to be as generic as they can be.
In particular, the matrix multiplication routine for double-precision floating-point numbers can be used to perform any of the following operations,
$$C = \alpha A\,B + \beta C\,,$$
$$C = \alpha A^T\,B + \beta C\,,$$
$$C = \alpha A\,B^T + \beta C\,,$$
$$C = \alpha A^T\,B^T + \beta C\,,$$
as well as multiplications of blocks of $A$ and $B$. (Hermitian conjugation is also a possibility when complex matrices are used.)
Of course, with great ~~power~~ generality comes ~~great responsibilities~~ a large number of parameters. Compounding that with the fact that in FORTRAN variables are always passed by reference, our simple code would, using BLAS, read
```cpp
#include <vector>
#include <chrono>
#include <iostream>

// dgemm_ is declared by hand from the BLAS documentation; see footnote 2
extern "C" {
void dgemm_(char const *transa, char const *transb, int const *m,
            int const *n, int const *k, double const *alpha, double const *a,
            int const *lda, double const *b, int const *ldb, double const *beta,
            double *c, int const *ldc);
}

int constexpr N = 2000;
char constexpr ta = 'N';

int main() {
    // assumed to be stored in column-major format
    std::vector<double> A(N*N), B(N*N), C(N*N);

    double alpha = 1.0;
    double beta = 0.0;

    auto timerStart = std::chrono::high_resolution_clock::now();

    dgemm_(&ta, &ta, &N, &N, &N, &alpha, A.data(), &N, B.data(), &N, &beta, C.data(), &N);

    auto timerStop = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(timerStop - timerStart);
    std::cout << "Run time: " << duration.count() << "ms\n";

    return 0;
}
```
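To illustrate the generality mentioned above: computing $C = A^T\,B$ with the same routine is only a matter of changing the first transpose flag. The snippet below is a sketch and not part of the timed code; note that the leading-dimension argument for `A` stays `N`, since it describes how `A` is stored, not how $\mathrm{op}(A)$ is used.

```cpp
// Sketch: C = A^T * B with the very same routine -- only the first flag changes.
char const tA = 'T';
char const tB = 'N';
dgemm_(&tA, &tB, &N, &N, &N, &alpha, A.data(), &N, B.data(), &N, &beta, C.data(), &N);
```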
The details of what each parameter does are not important for this post, and the reader is encouraged to have a look at BLAS’ documentation. But let us get down to business: how long does this matrix multiplication take?
For this test, I have used the OpenBLAS implementation of BLAS.
```
Run time: 297ms
```
Note that OpenBLAS uses OpenMP by default, when available. For this test I ran the executable with `OMP_NUM_THREADS=1` in order to have a fair comparison with the other methods.
This is a speedup factor, compared to the naïve implementation, of $\sim 90$! But it came at the cost of calling a cryptically named function with 13 parameters. So long, readability!
…or not?
Lazy evaluation to the rescue
The “problem” with `dgemm_`, or one of them, is that the matrix $C$ which holds the result is passed as an argument.
Assume we have a matrix structure
```cpp
struct Matrix {
    std::vector<double> mat;

    Matrix(int N)
        : mat{std::vector<double>(N*N)}
    { }
};
```
A “standard” implementation of `operator*` would have the signature
```cpp
Matrix operator*(Matrix const& A, Matrix const& B);
```
In this case, the product $A\,B$, potentially done with BLAS, would have to be written to some temporary matrix that is then returned by the function.
This temporary would have to be (re-)allocated every time `operator*` is called via, e.g., `A * B`.
In practice the code would do something like `tmp` $\leftarrow$ `A * B` and then `C` $\leftarrow$ `tmp`, i.e., the contents of the temporary matrix would be copied or moved into `C`.
After that step, `tmp` is de-allocated.
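For concreteness, such an eager `operator*` might look roughly like the sketch below. This is not code from the post; it assumes that `Matrix` also stores its dimension in a `size` member (as the version further below does) and that `dgemm_` is declared as in footnote 2.

```cpp
// Eager version: every product allocates a temporary result matrix.
Matrix operator*(Matrix const& A, Matrix const& B) {
    int const N = A.size;              // assumes a size member, as added below
    Matrix tmp(N);                     // freshly allocated on every call
    char const ta = 'N';
    double const alpha = 1.0, beta = 0.0;
    dgemm_(&ta, &ta, &N, &N, &N, &alpha, A.mat.data(), &N,
           B.mat.data(), &N, &beta, tmp.mat.data(), &N);
    return tmp;                        // later moved (or copied) into C
}
```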
What we would like here is to have both `operator*` and `operator=` working together, such that all the information about the matrices involved, $A$, $B$, and $C$, is available to `dgemm_`.
One way of accomplishing this is to make `operator*` return something that knows about the matrices being multiplied, but that does not actually perform the multiplication.
```cpp
#include <vector>

// dgemm_ is declared as shown in footnote 2

struct Matrix; // forward declaration

struct mul_op {
    Matrix const& A;
    Matrix const& B;
};

struct Matrix {
    std::vector<double> mat;
    int const size;

    Matrix(int N)
        : mat{std::vector<double>(N*N)}
        , size{N}
    { }

    Matrix& operator=(mul_op mul);
};

mul_op operator*(Matrix const& A, Matrix const& B) { return mul_op{A, B}; }

char constexpr ta = 'N';
double constexpr alpha = 1.0;
double constexpr beta = 0.0;

Matrix& Matrix::operator=(mul_op mul) {
    int const N = mul.A.size;

    dgemm_(&ta, &ta, &N, &N, &N, &alpha, mul.A.mat.data(), &N,
           mul.B.mat.data(), &N, &beta, this->mat.data(), &N);

    return *this;
}
```
Let us unpack this. `struct mul_op` is merely a container that holds references to two matrices.
Because C++ is strongly typed, we could have similar `struct`s for different operations involving two matrices without problems, as long as their names are different.
Note that `operator*` now simply returns a `mul_op` and does nothing further: no actual multiplication takes place within it!
When `operator=` is called, it receives a `mul_op` that carries the information necessary for the multiplication.
`operator=` itself then calls `dgemm_` and passes its own storage as the output-matrix parameter.
Using the code above, we can finally run
```cpp
#include <chrono>
#include <iostream>

// ... Matrix, mul_op, and operator* as defined above ...

int constexpr N = 2000;

int main() {
    Matrix A(N), B(N), C(N);

    auto timerStart = std::chrono::high_resolution_clock::now();
    C = A * B;
    auto timerStop = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(timerStop - timerStart);
    std::cout << "Run time: " << duration.count() << "ms\n";

    return 0;
}
```
and get
```
Run time: 286ms
```
We get the crazy efficiency of BLAS combined with code that is just as readable as if it used Eigen.
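One caveat is worth flagging: because the multiplication only happens inside `Matrix::operator=`, using `auto` for the result gives you the proxy object rather than a matrix (Eigen's expression templates have the same gotcha).

```cpp
Matrix C(N);
C = A * B;        // Matrix::operator=(mul_op) runs dgemm_ here

auto D = A * B;   // D is a mul_op holding references to A and B;
                  // no multiplication has been performed yet
```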
I took a particularly simple case here, the product of two square matrices, as an example. The idea shown in this last section can, with a bit of effort, be expanded to deal with the general case handled by BLAS, $C = \alpha\,\mathrm{op}(A)\,\mathrm{op}(B) + \beta C$, where $\mathrm{op}(X)$ can be $X$, $X^T$, or $X^\dagger$.
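A minimal sketch of that direction, using hypothetical names that are not part of the code above, would let the proxy carry the transpose flags and scalars as well, with `Matrix` declaring a matching `operator=` overload:

```cpp
// Hypothetical extension: the proxy records everything dgemm_ needs for
// C = alpha * op(A) * op(B) + beta * C (real matrices, so op is 'N' or 'T').
struct gemm_op {
    Matrix const& A;
    Matrix const& B;
    char transA = 'N';
    char transB = 'N';
    double alpha = 1.0;
    double beta  = 0.0;
};

// Assumes Matrix also declares: Matrix& operator=(gemm_op op);
Matrix& Matrix::operator=(gemm_op op) {
    int const N = op.A.size;
    dgemm_(&op.transA, &op.transB, &N, &N, &N, &op.alpha, op.A.mat.data(), &N,
           op.B.mat.data(), &N, &op.beta, this->mat.data(), &N);
    return *this;
}
```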
It is important to keep in mind that Eigen already provides a way to use BLAS functions within itself; see the Eigen documentation on using BLAS/LAPACK as backends. Eigen also makes extensive use of lazy evaluation to achieve the flexibility and readability it is well known for.
- Matrix Computations (Golub, Gene H. and Van Loan, Charles F.), The Johns Hopkins University Press, 2012. ↩︎
- I am not aware of C headers that declare the FORTRAN functions directly. CBLAS is a wrapper around BLAS that allows the user to choose between row- and column-major storage formats for the matrices; BLAS itself only supports column-major. My usual approach is to simply write the function declaration myself from the BLAS documentation for the functions I need. The declaration for `dgemm_` that is used in this post is:

```cpp
extern "C" {
    void dgemm_(char const *transa, char const *transb, int const *m,
                int const *n, int const *k, double const *alpha, double const *a,
                int const *lda, double const *b, int const *ldb, double const *beta,
                double *c, int const *ldc);
}
```
↩︎