First use of UMat does not run on GPU?

I am experiencing some very strange behaviour using UMats. My intent is to speed up my algorithms by running OpenCV library functions on my GPU (an AMD HD 7850, which is OpenCL capable). To test this, I load a set of seven images and run a bilateral filter or a Sobel operation on them.

However, it seems that every time I use one of these functions with a new set of parameters, it is executed on the CPU first. Only from the second use of the same parameters onwards does my program use the GPU. I compiled this with VS 2013 and OpenCV 3.0 gold.
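
(To rule out device selection issues, I also print which device the OpenCL path has picked, right after enabling OpenCL in main(). This is just a small sketch based on my reading of the 3.0 cv::ocl headers:)

    // Sketch: report which device OpenCV's OpenCL path has selected
    // (placed right after ocl::setUseOpenCL(true) in main()).
    if (cv::ocl::useOpenCL())
    {
        cv::ocl::Device dev = cv::ocl::Device::getDefault();
        cout << "OpenCL device: " << dev.name()
             << " (" << dev.vendorName() << ")" << endl;
    }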

For example, using the same bilateral filter on all images:

#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/core/ocl.hpp>

#include <iostream>
#include <chrono>
#include <string>
#include <vector>

using namespace std;
using namespace cv;

int main()
{
    cout << "Have OpenCL?: " << cv::ocl::haveOpenCL() << endl; // returns true, OCL available
    ocl::setUseOpenCL(true);

    // Load images test1.jpg, ..., test7.jpg
    vector<UMat> images;
    for (int i = 1; i <= 7; i++)
    {
        string filename = "test" + to_string(i) + ".jpg";
        UMat input = imread(filename).getUMat(ACCESS_READ);
        images.push_back(input);
    }

    for (int i = 0; i < 7; i++)
    {
        chrono::high_resolution_clock::time_point begin = chrono::high_resolution_clock::now();

        // ---------------------- Critical section --------------------------
        UMat result;
        bilateralFilter(images.at(i), result, 0, 10, 3);
        // ------------------------------------------------------------------

        chrono::high_resolution_clock::time_point end = chrono::high_resolution_clock::now();
        long long ms = chrono::duration_cast<chrono::milliseconds>(end - begin).count();
        cout << ms << " ms" << endl;
    }
}

Output:

2251 ms
5 ms
5 ms
5 ms
5 ms
5 ms
5 ms

GPU utilization does go up, but only after about 2 seconds, i.e. after the first iteration has completed. However, when using a different set of parameters each time:

        // ...
        UMat result;
        bilateralFilter(images.at(i), result, 0, i * 10, 3);
        // ...

Output:

2148 ms
2098 ms
1803 ms
1699 ms
1826 ms
1760 ms
1766 ms

And all of it is executed on my CPU. Those calls are also extremely slow: using Mat instead of UMat, the same operations take only about 40 ms each. I guess there is some back-and-forth between the program and OpenCL before the library decides to fall back to the CPU.
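
(For reference, the Mat comparison I mention is literally the same loop with plain Mat; a rough sketch, where matImages is just a vector<Mat> filled with imread() the same way and the timing code around it is identical:)

        // Same measurement with plain Mat instead of UMat; "matImages"
        // is a vector<Mat> loaded with imread() just like "images" above.
        Mat result;
        bilateralFilter(matImages.at(i), result, 0, i * 10, 3);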

The same behaviour shows up when using Sobel:

        // ...
        UMat result;
        if (i == 0)
            cv::Sobel(images.at(i), result, CV_32F, 1, 0, 5);
        else if (i == 1)
            cv::Sobel(images.at(i), result, CV_32F, 0, 1, 5);
        else
            cv::Sobel(images.at(i), result, CV_32F, 1, 1, 5);
        // ...

The first three operations are executed on the CPU. Iterations 4 to 7 then finish on the GPU almost immediately, with GPU utilization going up again (they use the same parameter set as iteration 3). Output:

687 ms
567 ms
655 ms
0 ms
0 ms
1 ms
0 ms

Is this a bug? Am I doing something wrong? Just applying each operation once at the start of the program to prevent this (sketched below) feels very hacky. I also don't know for how long a parameter combination stays "cached" (I use this word because I have no idea what actually happens in the background). Has anyone else experienced this problem?
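
(For clarity, the warm-up workaround I have in mind would look roughly like this, run once at program start. warmup is just a dummy image with a made-up size, allocated only for this purpose, so this is a sketch of the idea rather than code I actually want to keep:)

    // "Warm-up" sketch: run each operation once with every parameter set
    // I will use later, so the first real call no longer pays the penalty.
    // "warmup" is a dummy image with a made-up size, used only here.
    UMat warmup(480, 640, CV_8UC3, Scalar::all(0));
    UMat sink;
    for (int i = 0; i < 7; i++)
        bilateralFilter(warmup, sink, 0, i * 10, 3); // one pass per sigma
    Sobel(warmup, sink, CV_32F, 1, 0, 5);
    Sobel(warmup, sink, CV_32F, 0, 1, 5);
    Sobel(warmup, sink, CV_32F, 1, 1, 5);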

I really want to use UMat, since it gives me a huge speedup if I only count the iterations that actually run on my GPU.