Curious to revisit an earlier post about computations over lists versus arrays in Scala, and to see how the environment handles heavier floating point operations with OpenCL acceleration under the hood, I rigged together this very simple benchmark on scalacl/0.2.Beta10:
import scalacl._
import scala.math._
implicit val context = new ScalaCLContext
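def testLists(num: Int): Float = {
  // List.range builds the same 0 until num sequence without the deprecated List.fromArray
  val a = List.range(0, num)
  val start = System.nanoTime
  val result = a.map(s => cos(s / 100.0f).toFloat)
  val end = System.nanoTime
  // Average nanoseconds per element
  ((end - start).toFloat) / num
}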
def testArrays(num: Int): Float = {
  val a = Array.range(0, num)
  val start = System.nanoTime
  val result = a.map(s => cos(s / 100.0f).toFloat)
  val end = System.nanoTime
  ((end - start).toFloat) / num
}
def testParallelLists(num: Int): Float = {
  // `until` keeps the element count at num, matching the other two tests
  val r = (0 until num).cl
  val a = r.toCLArray
  val start = System.nanoTime
  val result = a.map(s => cos(s / 100.0f).toFloat)
  val end = System.nanoTime
  ((end - start).toFloat) / num
}
def testSuite() = {
  val n = 10000000
  println(testLists(n))
  println(testArrays(n))
  println(testParallelLists(n))
}
testSuite()
$ JAVA_OPTS="-Xmx1g" scala ListFlopParallel.scala
1050.4584
49.255398
22.513
I’m using an NVIDIA GeForce 330M GPU with 48 CUDA cores, and I suspect that there’s a significant overhead cost associated with shuttling the data between main memory and the GPU (and back). The three numbers above are nanoseconds per element for lists, arrays, and CLArrays, in that order. Despite the overhead, there’s still a ~2x speedup over plain arrays (roughly 49 ns versus 23 ns per element) from pushing these floating point computations onto the GPU. Exciting stuff!
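One caveat on the numbers: each test runs exactly once on a cold JVM, so JIT compilation time is likely mixed into the measurements, and the list test, which runs first, may be paying more than its share. Here is a rough sketch of a warmed-up timing helper using only plain Scala collections. The names `nanosPerElement`, `warmups`, and `runs` are made up for this illustration and are not part of ScalaCL, and I have not rerun the GPU test this way.

```scala
import scala.math.cos

// Hypothetical helper (not part of ScalaCL): evaluates `body` a few times
// untimed so the JIT has a chance to warm up, then returns the best
// observed time in nanoseconds per element.
def nanosPerElement(num: Int, warmups: Int = 3, runs: Int = 5)(body: => Any): Double = {
  var i = 0
  while (i < warmups) { body; i += 1 }
  var best = Long.MaxValue
  i = 0
  while (i < runs) {
    val start = System.nanoTime
    body
    val elapsed = System.nanoTime - start
    if (elapsed < best) best = elapsed
    i += 1
  }
  best.toDouble / num
}

val n = 1000000
val list = List.range(0, n)
val array = Array.range(0, n)

println("List:  " + nanosPerElement(n) { list.map(s => cos(s / 100.0f).toFloat) })
println("Array: " + nanosPerElement(n) { array.map(s => cos(s / 100.0f).toFloat) })
```

Taking the best of several runs is only one possible choice; a mean or median would be just as defensible for a quick comparison like this.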
Unfortunately, scalacl/0.2.Beta11 can’t yet convert functions that capture external symbols, so the code above doesn’t run on it. Still, I’m looking forward to seeing where this project goes. In the meantime, I plan to use good functional design principles and leave myself room to hook into OpenCL hardware acceleration later on.
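As a minimal sketch of what "leaving room" could look like, assuming the numeric work can be grouped behind a small interface: the trait `Kernels` and the object `CpuKernels` below are hypothetical names invented for this example, and only the CPU version is implemented. A GPU-backed implementation would be a second object extending the same trait; I am not showing one because I don't want to guess at a ScalaCL API that is still changing.

```scala
import scala.math.cos

// Hypothetical interface: callers ask for a named numeric operation and
// do not care where it executes.
trait Kernels {
  def scaledCos(input: Array[Int]): Array[Float]
}

// Plain CPU implementation using the standard library.
object CpuKernels extends Kernels {
  def scaledCos(input: Array[Int]): Array[Float] =
    input.map(s => cos(s / 100.0f).toFloat)
}

// Application code depends only on the trait, so a different backend
// can be passed in later without touching this function.
def run(kernels: Kernels, num: Int): Array[Float] =
  kernels.scaledCos(Array.range(0, num))

println(run(CpuKernels, 5).mkString(", "))
```

The operation is exposed as a whole method rather than as a generic `map` taking a function value, on the assumption that an accelerated backend needs to see the body of the computation rather than an opaque closure.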