
CUDA Canny Edge Detector is slower than cv::Canny

asked 2017-04-19 12:35:48 -0600


Hello there.

This is my first post here. I have started learning OpenCV and its CUDA capabilities. I wrote a simple program that reads an input image, resizes it, detects edges with both cv::Canny and the CUDA Canny Edge Detector object, and logs the results to a .txt file. My image is 960x585 and 66.3 KB. I used the C++ standard chrono library to measure the time spent on edge detection, and the results show that the time spent on the GPU is far greater than the time spent on the CPU. My code and results are given below. Are my results normal, or am I doing something very wrong?

Laptop specs;

8 GB RAM

Intel i7-4700MQ CPU @ 2.40 GHz

NVIDIA Geforce GT 745M GPU

#include <stdio.h>
#include <opencv2/core/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/opencv.hpp>
#include <chrono>
#include <fstream>


#define SIZE 25

int main()
{
    // Load the source image as a single-channel grayscale image.
    cv::Mat ImageHost = cv::imread("C:\\Users\\Heisenberg\\Desktop\\revan.jpg", cv::IMREAD_GRAYSCALE);

    cv::Mat ImageHostArr[SIZE];

    cv::cuda::GpuMat ImageDev;
    cv::cuda::GpuMat ImageDevArr[SIZE];

    // Upload the source image to GPU memory once, outside the benchmark loops.
    ImageDev.upload(ImageHost);

    // Build a series of progressively larger images on the CPU ...
    for (int n = 1; n < SIZE; n++)
        cv::resize(ImageHost, ImageHostArr[n], cv::Size(), 0.5 * n, 0.5 * n, cv::INTER_LINEAR);

    // ... and the same series on the GPU.
    for (int n = 1; n < SIZE; n++)
        cv::cuda::resize(ImageDev, ImageDevArr[n], cv::Size(), 0.5 * n, 0.5 * n, cv::INTER_LINEAR);

    cv::Mat Detected_EdgesHost[SIZE];
    cv::cuda::GpuMat Detected_EdgesDev[SIZE];

    std::ofstream File1, File2;

    File1.open("C:\\Users\\Heisenberg\\Desktop\\canny_cpu.txt");
    File2.open("C:\\Users\\Heisenberg\\Desktop\\canny_gpu.txt");

    std::cout << "Process started... \n" << std::endl;

    // CPU benchmark: time cv::Canny for each image size.
    for (int n = 1; n < SIZE; n++) {
        auto start = std::chrono::high_resolution_clock::now();
        cv::Canny(ImageHostArr[n], Detected_EdgesHost[n], 2.0, 100.0, 3, false);
        auto finish = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> elapsed_time = finish - start;
        File1 << "Image Size: " << ImageHostArr[n].rows * ImageHostArr[n].cols << "  " << "Elapsed Time: " << elapsed_time.count() * 1000 << " msecs" << "\n" << std::endl;
    }

    // GPU benchmark: time the CUDA Canny edge detector for each image size.
    cv::Ptr<cv::cuda::CannyEdgeDetector> canny_edg = cv::cuda::createCannyEdgeDetector(2.0, 100.0, 3, false);

    for (int n = 1; n < SIZE; n++) {
        auto start = std::chrono::high_resolution_clock::now();
        canny_edg->detect(ImageDevArr[n], Detected_EdgesDev[n]);
        auto finish = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> elapsed_time = finish - start;
        File2 << "Image Size: " << ImageDevArr[n].rows * ImageDevArr[n].cols << "  " << "Elapsed Time: " << elapsed_time.count() * 1000 << " msecs" << "\n" << std::endl;
    }
    std::cout << "Process ended... \n" << std::endl;

    return 0;
}

1 answer


answered 2017-04-21 02:37:20 -0600 by StevenPuttemans

Keep in mind that when comparing CPU and GPU implementations there are several bottlenecks to consider, which can lead to faster processing on the CPU than on the GPU.

  1. The OpenCV Canny operator is one of those functions that has been heavily optimized using TBB, SSE, AVX, ..., which makes the edge detector run faster than real time at any normal resolution. On top of that, your CPU has 8 logical cores to process on, which gives you a huge speed benefit at the CPU level. (A quick way to check which optimizations your build actually uses is sketched right after this list.)
  2. There is always a bottleneck in pushing data to and from the GPU, and you are currently timing that as well. Since you push a single image to the GPU, process it, and then get the data back, this is a big bottleneck. One of the best solutions is to create a batch of data, push it to GPU memory, and retrieve all results at once.
  3. Finally, OpenCV performs a GPU initialization on the first actual GPU function call. You should therefore only time iterations 2 through N, and average over those, to get a correct GPU timing.
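
To check the first point on your own build, you can print OpenCV's build information and hardware-capability flags. This is only a minimal, self-contained sketch using the standard cv::getBuildInformation() / cv::checkHardwareSupport() core API; nothing in it is specific to the benchmark code above:

#include <iostream>
#include <opencv2/core.hpp>
#include <opencv2/core/utility.hpp>

int main()
{
    // Full CMake configuration: shows whether TBB, IPP, OpenMP, SSE/AVX, CUDA, ... were enabled at build time.
    std::cout << cv::getBuildInformation() << std::endl;

    // Runtime flags: optimized dispatch, thread count used by cv::parallel_for_,
    // and whether the CPU actually supports the relevant vector instruction sets.
    std::cout << "Optimized code: " << cv::useOptimized() << std::endl;
    std::cout << "Threads:        " << cv::getNumThreads() << std::endl;
    std::cout << "SSE2:           " << cv::checkHardwareSupport(CV_CPU_SSE2) << std::endl;
    std::cout << "AVX:            " << cv::checkHardwareSupport(CV_CPU_AVX) << std::endl;
    return 0;
}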

Comments

Hello Steven. Thanks for the brief reply.

1) So you are saying that even if I did not compile OpenCV with TBB support, there is still built-in optimization in the cv::Canny function. Am I right?

2) I understand exactly what you are pointing at, but how can I implement this in OpenCV? Sorry, I am a total novice.

3) I guess

for (int n = 1; n < SIZE; n++) {
    if (n >= 2) {
        auto start = std::chrono::high_resolution_clock::now();
        canny_edg->detect(ImageDevArr[n], Detected_EdgesDev[n]);
        auto finish = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> elapsed_time = finish - start;
    }
}

this will get the job done, right?

Darth Revan ( 2017-04-23 01:32:25 -0600 )

There is also something I want to add: the program was first run on my laptop with the specifications given above, and the results were as I described. I then ran it on my desktop with the specifications below:

16 GB RAM

Intel i7-6700 CPU @ 3.4 GHz

NVIDIA GeForce GTX 970 GPU

and the results were what I expected to see: the GPU time is smaller than the CPU time. The results are given below. What are the possible reasons for differences like these between my laptop and my desktop?

Darth Revan ( 2017-04-23 01:33:27 -0600 )

GPU Time

Size: 535366 Time: 6.72341 ms

Size: 1204050 Time: 8.9964 ms

Size: 2141464 Time: 10.1835 ms

Size: 3346910 Time: 14.5982 ms

Size: 4818294 Time: 18.7926 ms

Size: 6557012 Time: 24.2421 ms

Size: 8565856 Time: 30.0094 ms

Size: 10842732 Time: 36.3906 ms

Size: 13384150 Time: 43.9781 ms

Size: 16192902 Time: 48.8068 ms

Size: 19273176 Time: 56.3619 ms

Size: 22621482 Time: 66.0788 ms

Size: 26232934 Time: 70.6756 ms

Size: 30111720 Time: 79.3919 ms

Size: 34263424 Time: 89.1915 ms

Size: 38683160 Time: 100.978 ms

Size: 43364646 Time: 112.416 ms

Size: 48313466 Time: 123.697 ms

Darth Revan ( 2017-04-23 01:33:52 -0600 )

CPU Time

Size: 535366 Time: 6.62997 ms

Size: 1204050 Time: 9.85992 ms

Size: 2141464 Time: 13.6559 ms

Size: 3346910 Time: 19.8349 ms

Size: 4818294 Time: 27.9398 ms

Size: 6557012 Time: 35.1151 ms

Size: 8565856 Time: 42.5579 ms

Size: 10842732 Time: 50.725 ms

Size: 13384150 Time: 62.5316 ms

Size: 16192902 Time: 76.4147 ms

Size: 19273176 Time: 96.0429 ms

Size: 22621482 Time: 118.537 ms

Size: 26232934 Time: 120.848 ms

Size: 30111720 Time: 151.584 ms

Size: 34263424 Time: 178.844 ms

Size: 38683160 Time: 183.334 ms

Size: 43364646 Time: 210.917 ms

Size: 48313466 Time: 218.264 ms

Darth Revan ( 2017-04-23 01:34:13 -0600 )

Hello Darth, let me take the time to respond to your new questions/remarks!

  • No, TBB support is disabled by default, so if you did not enable it, it will not be used; however, OpenCV will still use whatever optimizations were auto-selected. You can only know for sure by looking at your CMake output.
  • For batch processing on the GPU you will need CUDA knowledge; I am a novice myself in that field. (A rough sketch of streaming work to the GPU is given after this comment.)
  • No, you need to call a GPU function once, outside the loop, and only then time inside the loop. As written, iteration 2 would still carry the initialization overhead.
  • The differences, I guess, are mainly due to the architecture and the different graphics cards.
StevenPuttemans ( 2017-04-24 04:51:33 -0600 )
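
As a rough illustration of the batch/stream idea from the second bullet above: the cv::cuda::Stream overloads of upload(), detect() and download() let you enqueue all transfers and detections first and synchronize only once at the end. This is only a sketch reusing ImageHostArr and SIZE from the question's code; truly asynchronous copies additionally require page-locked host memory (e.g. cv::cuda::HostMem):

cv::cuda::Stream stream;
cv::Ptr<cv::cuda::CannyEdgeDetector> canny = cv::cuda::createCannyEdgeDetector(2.0, 100.0, 3, false);

cv::cuda::GpuMat d_img[SIZE], d_edges[SIZE];
cv::Mat h_edges[SIZE];

for (int n = 1; n < SIZE; n++) {
    d_img[n].upload(ImageHostArr[n], stream);     // host -> device copy enqueued on the stream
    canny->detect(d_img[n], d_edges[n], stream);  // Canny detection enqueued on the same stream
    d_edges[n].download(h_edges[n], stream);      // device -> host copy enqueued on the stream
}
stream.waitForCompletion();                       // single synchronization point for the whole batch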

First of all I apologize for not being able to upvote your answers since my karma is just one point :)

  1. I understood the point about TBB.
  2. On the other hand, are there any functions for batch processing on the GPU in the OpenCV library, or do I need to code the whole concept myself? Is the block/thread concept of CUDA related to batch processing?
  3. So if I understand you correctly, you mean this:
Darth Revan ( 2017-04-24 13:01:52 -0600 )
canny_edg->detect(ImageDevArr[1], Detected_EdgesDev[1]);   // warm-up call, not timed

for (int n = 1; n < SIZE; n++) {
    auto start = std::chrono::high_resolution_clock::now();
    canny_edg->detect(ImageDevArr[n], Detected_EdgesDev[n]);
    auto finish = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed_time = finish - start;
    File2 << "Image Size: " << ImageDevArr[n].rows * ImageDevArr[n].cols << "  " << "Elapsed Time: " << elapsed_time.count() * 1000 << " msecs" << "\n" << std::endl;
}
Darth Revan ( 2017-04-24 13:02:49 -0600 )

4) I was also thinking the same.

Darth Revan ( 2017-04-24 13:10:23 -0600 )

To answer again with some more detail:

  • About 2: you will probably need to do that yourself; I have no idea whether OpenCV even provides something like that out of the box.
  • About 3: yes, the code now looks correct for getting an accurate timing.
StevenPuttemans ( 2017-04-25 02:31:39 -0600 )
