Cholesky decomposition

Cholesky decomposition is a numerical method used in linear algebra to decompose a Hermitian, positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose, known as the Hermitian transpose. This decomposition is particularly useful in various scientific and engineering applications, including solving linear systems, optimization problems, and simulation.

Given a Hermitian, positive-definite matrix A, the Cholesky decomposition expresses as a product of lower triangular matrix L and the conjugate transpose of L, as shown as follows:

Out-of-order execution

This decomposition provides a more efficient and numerically stable way to solve linear equation systems than methods like LU (lower-upper) decomposition. Multiple methods exist for implementing Cholesky decomposition; however, for the sake of simplicity, we choose a serial algorithm.

Kernel for Cholesky decomposition

void cho_dec(int* L, 
                const int* A){   

    int dataA[N][N];
    int dataL[N][N] = {{0}};
    int i,j,k;

    LoadA:
    for (int i = 0; i < N; ++i) {
        #pragma HLS loop_tripcount max=N
        for (int j = 0; j <= i; ++j) {
            #pragma HLS loop_tripcount max=N
            dataA[i][j] = A[i * N + j];
        }
    }

    int tmp0 = sqrt(dataA[0][0]);

    // Compute first column 
    dataL[0][0] = tmp0;

    Column_first:
    for ( i = 1; i < N; ++i) {
        #pragma HLS loop_tripcount max=N-1

        dataL[i][0] = dataA[i][0] / tmp0;
    }

    //compute other columns
    int tmp1 = 0;
    int tmp2 = 0;

    Column_other:
    for ( j = 1; j < N; ++j){
        #pragma HLS loop_tripcount max=N-1
        tmp1 = 0;
        // Compute diagonal elements
        Diag:
        for(int k = 0; k < j; k++){
            #pragma HLS loop_tripcount max=N-1
            tmp1 += dataL[j][k]*dataL[j][k];
        }
        dataL[j][j] = sqrt(dataA[j][j] - tmp1);

        // Compute other elements
        if (j < N-1){
            Rows:
            for( i = j+1; i < N; i++){
                #pragma HLS loop_tripcount max=N-2
                tmp2=0;

                Vec_mul:
                for( k = 0; k < j; k++){
                    #pragma HLS loop_tripcount max=N-2
                    tmp2 += dataL[i][k]*dataL[j][k];
                }
                dataL[i][j] = (dataA[i][j] - tmp2)/dataL[j][j];
            }
        }
    }

    // Store L
    StoreL:
    for (int i = 0; i < N; ++i) {
        #pragma HLS loop_tripcount max=N
        for (int j = 0; j < N; ++j) {
            #pragma HLS loop_tripcount max=N
            L[i * N + j] = dataL[i][j];
        }
    }  
}

The kernel begins by loading the input matrix A into the local matrix A_data for optimized memory access.

The main computation involves Cholesky decomposition, focusing on deriving the lower triangular matrix dataL from the input matrix dataA. The computation is carried out in two phases:

In the first phase (Column_first loop), the algorithm calculates the elements of the first column of dataL based on the first column of dataA.
The second phase, within the Column_other loop, extends the computation to the remaining columns of dataL. Starting from the second column, each column calculates the diagonal elements and subsequently computes other elements in the column. The diagonal elements are determined by subtracting the sum of the squares of previously computed elements in the column from the corresponding diagonal elements in the input matrix. The calculation of non-diagonal elements involves subtracting the contribution of previously computed elements from the corresponding element in the input matrix and then scaling the result based on the already computed diagonal element.

The final step involves storing the local matrix dataL back into global memory.