There's been a lot of buzz lately about GPGPU, or General-Purpose computation on Graphics Processing Units, and rightly so. I recently finished a course on the subject at my university and thought I'd share some code I came up with for the OpenCL part.
The first thing you need is the OpenCL SDK and runtime for your platform. I provide the following links: Intel CPUs; AMD APUs, CPUs and GPUs; Nvidia GPUs.
Once one of these is set up, you are ready to start developing OpenCL-accelerated software. To make it easier to play around with this, I created a skeleton that lets me write short programs invoked by something like this:
    #include "opencl_suit.h"
    #include "bitonic_sort.h"

    int main(int argc, char** argv) {
        CLHost clProgram;
        clProgram.loadProgram("bitonic.cl", NULL);
        CustomApp clApp(&clProgram, "bitonic_sort");
        clApp.run();
    }
This very naive start creates what I call a CLHost, which takes care of the OpenCL context and of compiling the CL source code at runtime, as specified by the parameters to the loadProgram member function. The CustomApp then runs a single kernel, with its arguments and everything else specified in the implementation of the class. In this case it runs a parallel multi-pass bitonic sort.
A complete run of the program can look like this.
    mikael@afrodite(bitonic)> ./bitonic
    Initializing OpenCL Context...
    clGetPlatformIDs: CL_SUCCESS
    clGetDeviceIDs (get number of devices): CL_SUCCESS (answer: 1)
    clGetDeviceIDs (create device list): CL_SUCCESS
    clCreateContext: CL_SUCCESS
    Loading the program from file: "/home/mikael/programmering/bitonic/bitonic.cl"...
    == LOADED FOLLOWING PROGRAM ==
    __kernel void bitonic_sort(__global int* data, const unsigned int j, const unsigned int k)
    {
        unsigned int i = get_global_id(0);
        unsigned int ixj = i^j;
        int datai = data[i];
        int dataixj = data[ixj];
        if ((ixj)>i)
        {
            if ((i&k)==0 && datai > dataixj)
            {
                data[i] = dataixj;
                data[ixj] = datai;
            }
            if ((i&k)!=0 && datai < dataixj)
            {
                data[i]=dataixj;
                data[ixj]=datai;
            }
        }
    }
    == END OF PROGRAM ==
    clCreateProgramWithSource: CL_SUCCESS
    Building the program...
    clBuildProgram: CL_SUCCESS
    clGetProgramBuildInfo: CL_SUCCESS
    clGetProgramBuildInfo: CL_SUCCESS
    == BUILD LOG ==
    Build started
    Kernel <bitonic_sort> was successfully vectorized
    Done.
    == END OF LOG ==
    ...Done building.
    clCreateKernel: CL_SUCCESS
    clCreateBuffer: CL_SUCCESS
    clSetKernelArg: CL_SUCCESS
    Running the kernel...
    Time spent executing according to OpenCL: 145.26 ms.
    clEnqueueReadBuffer: CL_SUCCESS
    The array was: sorted.
    Releasing OpenCL App Memory
    Done releasing OpenCL App Memory
    Releasing OpenCL Host memory
    OpenCL Host memory released
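The kernel in the log handles only one (j, k) pass, so the host has to launch it repeatedly to get a full sort. Below is a minimal CPU-only sketch of how that pass loop might look. The names bitonic_step and bitonic_sort_host are made up for illustration and are not taken from the actual code. The inner loop over i stands in for the work-items of one kernel launch, and the sketch assumes the array length is a power of two.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// CPU stand-in for one work-item of the compare-exchange kernel.
// i plays the role of get_global_id(0) on the device.
inline void bitonic_step(std::vector<int>& data, unsigned i, unsigned j, unsigned k) {
    const unsigned ixj = i ^ j;          // partner index for this pass
    if (ixj > i) {                       // only the lower index of each pair acts
        const bool ascending = (i & k) == 0;
        if (ascending ? data[i] > data[ixj] : data[i] < data[ixj])
            std::swap(data[i], data[ixj]);
    }
}

// Hypothetical host-side driver: one simulated kernel launch per (k, j) pair.
// Assumption: data.size() is a power of two.
// Returns the number of passes, i.e. how many launches a real host would enqueue.
inline unsigned bitonic_sort_host(std::vector<int>& data) {
    const unsigned n = static_cast<unsigned>(data.size());
    assert(n != 0 && (n & (n - 1)) == 0 && "size must be a power of two");
    unsigned passes = 0;
    for (unsigned k = 2; k <= n; k <<= 1) {          // size of the bitonic sequences
        for (unsigned j = k >> 1; j > 0; j >>= 1) {  // compare distance within them
            for (unsigned i = 0; i < n; ++i)         // on the GPU: one work-item per i
                bitonic_step(data, i, j, k);
            ++passes;
        }
    }
    return passes;
}
```

Within a pass every element belongs to exactly one pair, which is why the work-items can run in parallel without stepping on each other. On a real device each pass would presumably be a separate kernel launch with updated j and k arguments, since all work-items of one pass must finish before the next pass starts.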
As I said, this is a rather naive first implementation with a couple of flaws, including but not limited to the following. Every CustomApp runs only one kernel. The separation of what is actually specific to each app when it comes to initialization isn't perfect. There is no good way to return results apart from void pointers. The object-oriented design of the suite doesn't mix well with things like GLUT. These are things I might fix when I figure out how, and when the need arises.
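One possible direction for the void pointer problem, sketched here purely as an assumption and not as the skeleton's actual design, is to make the result type a template parameter of the app base class. TypedApp and SortApp are hypothetical names, and the sorting is done on the CPU so the example stays self-contained.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical interface: Result replaces the void pointer,
// so the caller gets a typed value back from run().
template <typename Result>
class TypedApp {
public:
    virtual ~TypedApp() = default;
    virtual Result run() = 0;
};

// Stand-in app that sorts on the CPU. A real app would enqueue its
// kernel and read the device buffer back inside run().
class SortApp : public TypedApp<std::vector<int>> {
public:
    explicit SortApp(std::vector<int> input) : values_(std::move(input)) {}

    std::vector<int> run() override {
        std::sort(values_.begin(), values_.end());
        return values_;
    }

private:
    std::vector<int> values_;
};
```

With something like this, a caller could write `std::vector<int> sorted = SortApp({3, 1, 2}).run();` and never cast. The trade-off is that apps with different result types no longer share a single base class.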
You can find the complete code here.