
How to run OpenCL file (example fast.cl) in OpenCV library?

asked 2017-03-19 22:20:02 -0600 by minhntu

Hi guys, does anyone know how to run OpenCL source code from the OpenCV library? I see that there are many kernels in a single file, and I am confused about how to set up the arguments to run them. Thank you very much for your help.


1 answer


answered 2017-03-19 22:48:54 -0600 by Tetragramm

Take a look HERE for an explanation of the Transparent API.
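For example, a minimal sketch of what the Transparent API looks like in practice (the file names and blur parameters here are placeholders):

    #include <opencv2/opencv.hpp>

    int main()
    {
        // Using UMat instead of Mat is all the T-API needs: OpenCV
        // dispatches to its built-in OpenCL kernels automatically when
        // the library was compiled with OpenCL support.
        cv::UMat img, gray;
        cv::imread("input.jpg", cv::IMREAD_COLOR).copyTo(img);
        cv::cvtColor(img, gray, cv::COLOR_BGR2GRAY);
        cv::GaussianBlur(gray, gray, cv::Size(7, 7), 1.5);
        cv::imwrite("output.jpg", gray);
        return 0;
    }

You never set kernel arguments yourself; the library binds them internally when it launches the kernels from the .cl files.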


Comments

I want to run an OpenCL file (for example fast.cl at opencv310\sources\modules\features2d\src\opencl\fast.cl) in the OpenCV library. But I see that this file has many kernels, and each kernel has a different number of arguments. How can I set up the arguments to run this file?

minhntu (2017-03-20 05:06:10 -0600)

You don't need to. You just run the normal FAST function with UMat as the arguments.
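For example, a minimal sketch (the file name and threshold are placeholders):

    #include <opencv2/opencv.hpp>
    #include <vector>

    int main()
    {
        cv::UMat gray;
        cv::imread("chessboard.jpg", cv::IMREAD_GRAYSCALE).copyTo(gray);

        std::vector<cv::KeyPoint> keypointsCorners;
        // Because gray is a UMat, this call takes the OpenCL path
        // (the kernels in fast.cl) when OpenCL is available.
        cv::FAST(gray, keypointsCorners, 20, true);
        return 0;
    }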

Tetragramm (2017-03-20 17:52:11 -0600)

So that means when I change Mat to UMat, the FAST function will run on the GPU? Do you know what the fast.cl file in the OpenCV library is used for?

minhntu (2017-03-21 01:31:35 -0600)

I'm not entirely sure what you're asking. Here is where the fast.cl kernels are called from.

Again, you don't need to do anything but use UMat matrices and have compiled OpenCV with OpenCL support.
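If you want to confirm both conditions at runtime, here is a quick sketch using the cv::ocl module:

    #include <opencv2/core/ocl.hpp>
    #include <iostream>

    int main()
    {
        // haveOpenCL(): an OpenCL runtime and device were found.
        // useOpenCL(): T-API dispatch is currently enabled.
        std::cout << "OpenCL available: " << cv::ocl::haveOpenCL() << std::endl;
        std::cout << "OpenCL enabled:   " << cv::ocl::useOpenCL() << std::endl;
        return 0;
    }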

Tetragramm (2017-03-21 17:40:50 -0600)

It means I only need to change from

Mat img = imread("chessboard.jpg", IMREAD_UNCHANGED);
FAST(gray, keypointsCorners, thresholdCorner, true);

to

cv::ocl::Device(context.device(0)); 
UMat img, gray; 
imread("chessboard.jpg", 1).copyTo(img); 
FAST(gray, keypointsCorners, thresholdCorner, true);

to have OpenCL code, and this code will run on the GPU? And while the FAST() function is executing, will it call the fast.cl file internally?

minhntu (2017-03-21 20:50:56 -0600)

Try to use the code tag in the editor to make it readable.

Yes, so long as your OpenCV was compiled with OpenCL support, it will run on the GPU. Though do note that in the snippet here, you don't actually fill gray; I assume you just left that out for brevity.
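For completeness, one way to fill gray in that snippet (the cvtColor step is my assumption about what was intended):

    UMat img, gray;
    imread("chessboard.jpg", IMREAD_COLOR).copyTo(img);
    cvtColor(img, gray, COLOR_BGR2GRAY);  // fill gray before detecting
    FAST(gray, keypointsCorners, thresholdCorner, true);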

Tetragramm (2017-03-21 20:57:07 -0600)

When I change Mat to UMat, my code runs on the GPU, but it is slower than on the CPU. Also, when I rename the fast.cl file in the path opencv310\sources\modules\features2d\src\opencl\fast.cl, the code still runs normally, so it seems the fast.cl file was not called while the code was executing. How do we know that, when changing Mat to UMat, the code will run in parallel on the GPU?

minhntu (2017-03-22 04:39:54 -0600)

Just to make sure, you are compiling the code? Not using the installer? The installer is not (I think) compiled with OpenCL, so you will see no benefit.

Tetragramm (2017-03-22 21:10:34 -0600)

When I run the code below:

    // context is created earlier in the full program; this line is my
    // assumption about how (it was not shown in the original snippet).
    cv::ocl::Context context;
    context.create(cv::ocl::Device::TYPE_GPU);

    cout << context.ndevices() << " GPU devices are detected." << endl;
    for (int i = 0; i < context.ndevices(); i++)
    {
        cv::ocl::Device device = context.device(i);
        cout << "name                 : " << device.name() << endl;
        cout << "available            : " << device.available() << endl;
        cout << "imageSupport         : " << device.imageSupport() << endl;
        cout << "OpenCL_C_Version     : " << device.OpenCL_C_Version() << endl;
        cout << endl;
    }

It showed these results:

    1 GPU devices are detected.
    name                 : Quadro K2000
    available            : 1
    imageSupport         : 1
    OpenCL_C_Version     : OpenCL C 1.2

And while the code was running, it took 11026 ms with Mat and 29340 ms with UMat for the same work. Do you know why that is?

minhntu (2017-03-23 01:56:34 -0600)

How many iterations? The first iteration may be much slower because of initializing the context and memory. That looks like a decent amount though.

Also, that's a really old card. What CPU do you have? If it's anything recent, it'll be faster than the GPU. Not because of the processing, but just the memory transfer. If you keep the same data on the GPU and do lots of work on it, it's much better than transferring back and forth.
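As a sketch of that idea (the operations and file name are arbitrary, just to show chaining work on the device):

    #include <opencv2/opencv.hpp>

    int main()
    {
        cv::UMat img, gray, edges;
        cv::imread("big_image.jpg", cv::IMREAD_COLOR).copyTo(img);  // one upload to the GPU

        // All intermediates are UMat, so no host/device copies occur
        // between these calls.
        cv::cvtColor(img, gray, cv::COLOR_BGR2GRAY);
        cv::GaussianBlur(gray, gray, cv::Size(5, 5), 1.5);
        cv::Canny(gray, edges, 50, 150);

        cv::Mat result;
        edges.copyTo(result);  // one download back to the host at the end
        return 0;
    }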

Tetragramm (2017-03-23 19:17:02 -0600)

I use while(1) to iterate infinitely. My PC specifications:

- Processor: Intel Xeon CPU E5-1650 v2 @ 3.5 GHz.
- RAM: 16 GB.
- GPU: NVIDIA Quadro K2000.

Here is a video of my FAST corner detector code running on the PC; it shows the source code, the running time, and the CPU and GPU performance. When I run this code on an Odroid XU4 (Samsung processor with four A15 cores at 2 GHz and four A7 cores at 1.3 GHz, 2 GB RAM, Mali T628 GPU), using Mat is still slower than using UMat. Link: https://www.youtube.com/watch?v=PXKJMepjuHg&feature=youtu.be

minhntu (2017-03-23 21:11:54 -0600)

Umm, if you iterate infinitely, then how do your time measurements work? Use a fixed number of iterations to make sure the two tests are equal.

For example, on my machine, 10000 iterations of FAST takes 4.89s on CPU and 3.28s on GPU.

    // (assumes: using namespace cv; using namespace std;
    //  using namespace std::chrono;)
    Mat im = imread("result.png");
    UMat im2;

    cvtColor(im, im, COLOR_BGR2GRAY);
    im.copyTo(im2);

    vector<KeyPoint> kps;
    FAST(im, kps, 20, true);  // warm-up run, not timed

    auto start = high_resolution_clock::now();
    for (int i = 0; i < 10000; ++i)
    {
        FAST(im, kps, 20, true);   // Mat input: CPU path
    }
    auto stop = high_resolution_clock::now();
    cout << "CPU code is " << duration_cast<nanoseconds>(stop - start).count() / 1.0e9 << "\n\n";

    start = high_resolution_clock::now();
    for (int i = 0; i < 10000; ++i)
    {
        FAST(im2, kps, 20, true);  // UMat input: OpenCL path
    }
    stop = high_resolution_clock::now();
    cout << "GPU code is " << duration_cast<nanoseconds>(stop - start).count() / 1.0e9 << "\n";
Tetragramm (2017-03-23 23:12:23 -0600)

I understood the reason. Running code on the GPU is faster than on the CPU when the volume of parallel computation on the GPU is big enough and the total time to transfer data from the CPU to the GPU, process it on the GPU, and transfer the data back from the GPU to the CPU is less than the running time on the CPU. However, I am wondering whether the data transfer from the CPU to the GPU happens at the command im.copyTo(im2) or at the command FAST()? And after the first iteration of FAST(), is the output data still on the GPU, or is it transferred back to the CPU before the next iteration of FAST() begins?

minhntu (2017-03-24 06:56:37 -0600)

The image copy is in the copyTo function, but there is some memory transfer in FAST to bring the keypoints back from the GPU.

Tetragramm (2017-03-24 18:39:11 -0600)

When I run the FAST code using a camera, the running time on the CPU is less than on the GPU. I understand that we incur a one-time overhead in loading the memory for the GPU implementation for each frame. Can I check whether, subsequently (from the second frame onwards), this memory load can be done in parallel with the computation to improve performance? Otherwise, there will be an additional overhead for each new image frame that we are processing.

minhntu (2017-03-26 20:41:33 -0600)

That's certainly a thing to try.

Another thing is how large your image is. If it's small, the overhead of copying and launching the kernels will outweigh the benefits. Larger images suffer from this less.
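If you want to experiment with overlapping the upload and the computation, here is a rough sketch using std::thread (whether the copy actually overlaps the kernel depends on the OpenCL driver; the camera index and threshold are placeholders):

    #include <opencv2/opencv.hpp>
    #include <thread>
    #include <vector>

    int main()
    {
        cv::VideoCapture cap(0);
        cv::Mat frame, nextFrame;
        cv::UMat current, next;

        cap >> frame;
        frame.copyTo(current);  // upload the first frame

        std::vector<cv::KeyPoint> kps;
        while (cap.read(nextFrame))
        {
            // Upload frame N+1 on a worker thread while FAST runs on frame N.
            std::thread uploader([&] { nextFrame.copyTo(next); });
            cv::FAST(current, kps, 20, true);
            uploader.join();
            std::swap(current, next);  // reuse buffers for the next iteration
        }
        return 0;
    }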

Tetragramm (2017-03-26 20:53:22 -0600)

Can I check whether, subsequently (from the second frame onwards), this memory load can be done in parallel with the computation to improve performance?

minhntu (2017-03-26 21:25:06 -0600)
