
Parallelized Linear Algebra in Practice: Polynomial Regression, BLAS, and F#


Introduction

Most of my write-ups come from things I do day to day, with curiosity pushing me to dig deeper. While working on an implementation of multiple polynomial regression recently, I hit performance bottlenecks as the number of $X$ variables grew, which got me thinking about ways to improve the underlying matrix operations. After all, at the heart of any regression problem is linear algebra, specifically matrix multiplication and solving linear systems. Matrices are not first-class citizens on the .NET side, but libraries such as Math.NET Numerics and FSharp.Stats provide implementations that wrap arrays into a matrix type.

This led me to explore parallelized linear algebra using BLAS (Basic Linear Algebra Subprograms) and integrating it with F# in this case. This post outlines the mathematical foundations, practical implementations, and performance gains I’ve found.

Enter BLAS: Optimized Linear Algebra

Modern processors support data-level and task-level parallelism. Many matrix operations naturally lend themselves to parallel execution because:

  1. They consist of independent arithmetic operations (e.g., dot products),
  2. They can be blocked and distributed across CPU cores or SIMD (Single Instruction, Multiple Data) units.

Matrix multiplication is a prime example: the simple algorithm scales poorly with data size, yet the operation benefits immensely from parallelization.

This is exactly why libraries like BLAS are critical. These libraries provide highly optimized routines such as DGEMM for general matrix multiplication that leverage multithreading, SIMD, and hardware-specific tuning for maximum speed.
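To see why the independent dot products parallelize so naturally, here is a toy F# sketch (plain jagged arrays, no library) that computes each output row of a matrix product on a separate core via Array.Parallel. It is only an illustration of the idea, not how BLAS implementations actually work:

// Each row of C = A * B is an independent set of dot products,
// so the rows can be computed on different cores with no coordination.
let parallelMatMul (a: float[][]) (b: float[][]) =
    let cols = b.[0].Length
    a
    |> Array.Parallel.map (fun aRow ->
        Array.init cols (fun j ->
            let mutable sum = 0.0
            for l in 0 .. b.Length - 1 do
                sum <- sum + aRow.[l] * b.[l].[j]
            sum))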

A Brief History of BLAS

BLAS was first developed in the 1970s and 1980s with the goal of creating a standardized, high-performance interface for low-level linear algebra routines, the core tools used throughout scientific computing.

The development of BLAS was organized into three hierarchical levels:

  • Level 1: Vector operations (dot product, scaling),
  • Level 2: Matrix-vector operations (matrix-vector multiplication),
  • Level 3: Matrix-matrix operations (C := AB + C), which are the most computationally intensive and benefit the most from parallelism.
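To make the three levels concrete, here is a minimal F# sketch using the Math.NET Numerics API (each expression corresponds conceptually to a Level 1, Level 2, and Level 3 routine; which BLAS call actually runs depends on the active provider):

open MathNet.Numerics.LinearAlgebra

let v1 = Vector<double>.Build.Dense([| 1.0; 2.0; 3.0 |])
let v2 = Vector<double>.Build.Dense([| 4.0; 5.0; 6.0 |])
let A  = Matrix<double>.Build.Random(3, 3)
let B  = Matrix<double>.Build.Random(3, 3)

let dot = v1.DotProduct(v2)   // Level 1: vector-vector (DDOT-style)
let mv  = A * v1              // Level 2: matrix-vector (DGEMV-style)
let mm  = A * B               // Level 3: matrix-matrix (DGEMM-style)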

Level 3 BLAS

Let's take a look at one of the most important and widely used routines in Level 3 BLAS: GEMM, short for General Matrix-Matrix multiplication. The operation is defined as: \(C \leftarrow \alpha op(A) \cdot op(B) + \beta C\)

You can refer to this guide, which outlines the low-level functions of the subprograms.

Where:

  • A, B, and C are matrices,
  • α and β are scalar values,
  • op(A) and op(B) may refer to the original matrix or its transpose, depending on how the routine is called.

This operation looks simple but is incredibly powerful, and it is highly optimized in every major BLAS implementation.
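As a quick sanity check on the notation, here is the operation instantiated with small concrete matrices, taking α = 1, β = 0, and no transposition, so it reduces to plain C = AB:

\[
A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix},\quad
B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}
\;\Rightarrow\;
C = 1 \cdot AB + 0 \cdot C = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}
\]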

GEMM routines are not only mathematically central, but also the most optimized due to their potential for high data reuse. This brings us to an important question: Why are some matrix routines so much faster than others, even when they perform the same number of operations? The answer lies in memory.

Why Blocked Algorithms Matter: Data Movement vs Computation

Modern matrix multiplication isn't just about arithmetic; it's about memory.

Computers use a memory hierarchy, from fast, small registers at the top, followed by cache, then slower RAM, and finally disk.

  • Operations on data already in registers are very fast.
  • But when data has to be fetched from slower memory (like RAM), latency becomes the bottleneck and can dominate the execution time.

Below is a look at the memory hierarchy and the approximate cost of loading data from each level:
Figure: Memory Hierarchy Latency

Note how accessing main memory can take 200+ cycles, compared to just 3 cycles from L1 cache or even 0 cycles from registers.

This is why minimizing the memory footprint of your algorithm is key. If your algorithm works on data that doesn't fit in fast memory (i.e., the cache), it will spend much of its time waiting to retrieve data from slower memory.

Why Naive Matrix Multiplication Is Slow

A naive matrix-matrix multiplication (using three nested loops) can lead to excessive data movement. Each operation reads entire rows and columns from main memory, which is much slower than using fast registers or cache. This repetitive reading of data causes inefficiency.
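To make the access pattern concrete, here is the classic triple loop over flat row-major arrays (a sketch for square matrices); notice how the inner loop strides down a column of B, touching a different cache line on almost every iteration:

// Naive i-j-k matrix multiplication on flat, row-major arrays.
// Each row of A is read contiguously, but B is walked column-wise,
// so once n*n doubles no longer fit in cache, B keeps getting
// re-fetched from main memory.
let naiveMatMul (a: float[]) (b: float[]) (n: int) =
    let c = Array.zeroCreate (n * n)
    for i in 0 .. n - 1 do
        for j in 0 .. n - 1 do
            let mutable sum = 0.0
            for l in 0 .. n - 1 do
                sum <- sum + a.[i * n + l] * b.[l * n + j]
            c.[i * n + j] <- sum
    c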

Blocked Matrix Multiplication

This is where blocked matrix multiplication comes in. Instead of operating on full rows or columns, we break the matrix into smaller, manageable submatrices (blocks). These submatrices are small enough to fit into faster memory (the CPU cache), allowing us to reuse data more efficiently.

By doing so, we minimize the need to fetch data from slower RAM and avoid the penalties of excessive memory access.

As noted in the LAPACK documentation, reordering data in this way can increase performance by orders of magnitude, despite doing the same number of floating-point operations.
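Here is a minimal sketch of the idea in F#, again over flat row-major arrays for square matrices. The block size is arbitrary here; production BLAS kernels tune it (and add SIMD micro-kernels) per CPU:

// Blocked (tiled) matrix multiplication: work on blockSize x blockSize tiles
// so the tiles of A, B and C stay resident in cache and are reused many times.
let blockedMatMul (a: float[]) (b: float[]) (n: int) (blockSize: int) =
    let c = Array.zeroCreate (n * n)
    for ii in 0 .. blockSize .. n - 1 do
        for ll in 0 .. blockSize .. n - 1 do
            for jj in 0 .. blockSize .. n - 1 do
                // Accumulate the (ii, jj) tile of C using the (ii, ll) tile of A
                // and the (ll, jj) tile of B.
                for i in ii .. (min (ii + blockSize) n) - 1 do
                    for l in ll .. (min (ll + blockSize) n) - 1 do
                        let aVal = a.[i * n + l]
                        for j in jj .. (min (jj + blockSize) n) - 1 do
                            c.[i * n + j] <- c.[i * n + j] + aVal * b.[l * n + j]
    c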

Evaluating Algorithm Efficiency

To evaluate how efficient an algorithm is at utilizing memory, we can compute a ratio known as q:

\[q = \frac{\text{# of flops}}{\text{# of memory references}}\]

The higher the q, the more useful work is done per unit of data movement. For example:

  • Level 1 BLAS: q ≈ 0.67
  • Level 2 BLAS: q ≈ 2
  • Level 3 BLAS: q ≈ n/2
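The n/2 figure for Level 3 comes from a simple count: multiplying two n × n matrices costs about 2n³ flops, while the matrices themselves hold only on the order of 4n² values that have to move through memory (read A, B, and C, write C):

\[
q \approx \frac{2n^3}{4n^2} = \frac{n}{2}
\]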

This is why algorithms that use Level 3 BLAS routines are highly sought after. They can be executed closer to memory bandwidth limits, minimizing data movement and maximizing computation.

How GEMM Applies to Polynomial Regression

In multiple polynomial regression, a key computational step involves solving:

$\beta = (X^T X)^{-1} X^T y$

Here’s where GEMM enters the picture:

  • Computing $X^TX$ is a matrix-matrix multiplication, a classic GEMM operation.
  • It’s often followed by solving a linear system, which may rely on other optimized BLAS routines.

Because the size of matrix $X$ can grow rapidly as the number of samples or polynomial degree increases, efficient computation of $X^TX$ becomes critical.
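As a rough illustration (a single predictor for brevity, ignoring centering and scaling), the polynomial design matrix and its Gram matrix $X^TX$ can be built like this with Math.NET; the final multiplication is exactly the step that lands in a GEMM call:

open MathNet.Numerics.LinearAlgebra

// Polynomial design matrix [1, x, x^2, ..., x^d] built from raw samples.
let designMatrix (xs: float[]) (degree: int) =
    Matrix<double>.Build.Dense(xs.Length, degree + 1, fun i j -> xs.[i] ** float j)

let xs = Array.init 5000 (fun i -> float i / 500.0)   // 5000 samples in [0, 10)
let X = designMatrix xs 5
let gram = X.TransposeThisAndMultiply(X)              // X^T X, dispatched to the provider's GEMM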

By delegating this step to GEMM, we benefit from:

  • Cache-optimized block processing
  • Multithreading
  • SIMD acceleration
  • Hardware-specific tuning

This makes even large polynomial regression tasks tractable, turning a potential bottleneck into a fast, reliable operation.

OpenBLAS and Intel MKL are two of the most performant CPU-based BLAS libraries, with OpenBLAS often rivaling MKL in benchmarks.

Bringing in F#

In the F# world, Math.NET Numerics is the go-to library for this kind of computational linear algebra. It provides solid support for linear algebra and related numerical routines.

Let's run a matrix multiplication. Taking a look at the Math.NET documentation, we find that it exposes a low-level method:

void MatrixMultiply(Double[] x, int rowsX, int columnsX, Double[] y, int rowsY, int columnsY, Double[] result)

The docs go on to say:

Multiplies two matrices.

result = x * y

This is a simplified version of the BLAS GEMM routine with alpha set to 1.0 and beta set to 0.0, and x and y are not transposed.
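In practice you rarely call this low-level routine yourself; multiplying two dense matrices through the high-level API ends up there. A minimal sketch:

open MathNet.Numerics.LinearAlgebra

// The * operator on dense matrices ultimately dispatches to the active
// provider's MatrixMultiply, i.e. a GEMM with alpha = 1.0 and beta = 0.0.
let A = Matrix<double>.Build.Random(500, 300)
let B = Matrix<double>.Build.Random(300, 400)
let C = A * B

If you do call the array-based method directly, keep in mind that Math.NET stores dense matrices in column-major order, so the flat arrays must be laid out accordingly.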

Of course, things change when the provider is switched to, say, MKL.

What Happens Under the Hood: How Math.NET Calls MKL

When we multiply two matrices in Math.NET with MKL enabled, the operation eventually reaches a native function via interop. Specifically, the managed method:

public void MatrixMultiply(...) 

defined in MklLinearAlgebraProvider delegates to:

SafeNativeMethods.d_matrix_multiply(...)

This function is declared using [DllImport], linking to a native C wrapper that Math.NET ships with its MKL provider. Internally, that native function wraps and calls Intel MKL’s GEMM routine (cblas_dgemm) for fast, hardware-optimized matrix multiplication.

This interop layer allows Math.NET to expose high-performance BLAS routines through a simple and safe managed interface.
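For intuition, a hand-rolled binding of your own would look roughly like the sketch below: a hypothetical P/Invoke straight to a CBLAS dgemm entry point (here OpenBLAS; the library name is platform-dependent). This is not Math.NET's actual wrapper, just an illustration of the same interop mechanism:

open System.Runtime.InteropServices

// Hypothetical direct P/Invoke to cblas_dgemm, for illustration only.
// Standard CBLAS constants: column-major layout = 102, no transpose = 111.
[<DllImport("libopenblas", CallingConvention = CallingConvention.Cdecl)>]
extern void cblas_dgemm(
    int layout, int transA, int transB,
    int m, int n, int k,
    double alpha, double[] a, int lda,
    double[] b, int ldb,
    double beta, double[] c, int ldc)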

Performance Evaluation in Polynomial Regression

To test the impact of switching from Math.NET’s managed backend to an MKL-powered BLAS backend, I benchmarked a core operation in polynomial regression: solving the normal equations using Cholesky decomposition.

The normal equations involve solving: \(\beta = (X^T X)^{-1} X^T y\)

This is common in ordinary least squares regression, and it becomes computationally heavy when the feature matrix X is large.

Here’s the F# benchmark I used:

open System.Diagnostics
open MathNet.Numerics.LinearAlgebra
// LinearAlgebraControl lives under MathNet.Numerics.Providers.LinearAlgebra in recent Math.NET versions
open MathNet.Numerics.Providers.LinearAlgebra

let timeOperation name iterations f =
    // Warm-up call so JIT compilation doesn't skew the averaged timings
    f() |> ignore
    
    let sw = Stopwatch.StartNew()
    for _ in 1..iterations do
        f() |> ignore
    sw.Stop()
    let avgTime = float sw.ElapsedMilliseconds / float iterations
    printfn $"{name,-30} {avgTime,8:F2} ms (avg over {iterations} runs)"

let regressionBenchmark () =
    printfn "\nLinear Regression Benchmark:"
    
    let n, p = 5000, 50 
    let X = Matrix<double>.Build.Random(n, p)
    let y = Vector<double>.Build.Random(n)
    
    let runTest name setup =
        setup()
        timeOperation name 3 (fun () ->
            X.TransposeThisAndMultiply(X).Cholesky().Solve(X.TransposeThisAndMultiply(y)))
    
    runTest "Managed Regression" (fun () -> LinearAlgebraControl.UseManaged())
    runTest "MKL Regression" (fun () -> LinearAlgebraControl.UseNativeMKL())

regressionBenchmark()

Results

Linear Regression Benchmark:
Managed Regression                15.33 ms (avg over 3 runs)
MKL Regression                     0.33 ms (avg over 3 runs)

The results speak for themselves: using Intel MKL via Math.NET Numerics gave us a >45× speedup on this matrix-heavy regression problem.

Keep in mind that F# and .NET are already fast; this shows just how much of a difference a native BLAS can make when the math gets serious.

Conclusion

Memory layout is crucial for performance. Ideal performance is achieved when data is structured in a CPU-friendly way, that is, when it allows the processor to predict access patterns and preload data efficiently. The more predictable your code, the more your CPU can help you by reducing costly memory access.

Matrix multiplication might seem simple, but its computational cost scales quickly with matrix size. As we’ve seen, choosing the right algorithm, and more importantly the right backend, can drastically improve performance.

Libraries like BLAS, and implementations like Intel MKL, allow us to write high-level, expressive code (in languages like F#) without sacrificing low-level speed. By understanding how memory, computation, and linear algebra interact, we can build faster, more efficient numerical software.

