Why is general-purpose OpenCV so fast compared to a specialised implementation?

asked 2018-10-21 11:39:40 -0500

I was looking through some old OpenCV university work, and cleaning it up so I could use it in the future.

For our work we had been asked not to use certain OpenCV functions and instead to write our own, to gain a better understanding of how those functions work. In one such case I had written my own function to take an image and find its "Polar Co-ordinates" (according to the comments I left for myself). The idea was to take the derivatives of the image and calculate the directions of change, so that I would end up with a "gradient magnitude" and a "gradient angle" of the image. I also wanted the mean value of the "gradient magnitude" image, for processing that came later.

I wrote a single function to do all of the above with one full iteration of the original image, basically doing everything at the same time.

I found that obtaining the same result with OpenCV functions alone would require 4 separate calls:

cv::Sobel(original_image, grad_mag_x, -1, 1, 0, 3);
cv::Sobel(original_image, grad_mag_y, -1, 0, 1, 3);
cv::cartToPolar(grad_mag_x, grad_mag_y, grad_mag_total, grad_angle);
auto mean = cv::mean(grad_mag_total)[0];

That means two passes over the original image (one per Sobel direction), then one pass over the two gradient images together for cartToPolar, and a final pass over the magnitude image for the mean.

So I went ahead and timed my implementation against OpenCV's. I found my code running an order of magnitude slower with no optimisation, and only about twice as slowly when compiled for speed.

Why is this the case? Surely something specialised that runs through n by n data once would be faster than something running through an n by n set of data 4 times?

What makes me most curious is that I would have thought looping through an n x n image multiple times would thrash the hardware cache and cause more frequent reads from main memory, making the whole process orders of magnitude slower.

Of course, if you read all of this and think that my implementation really should be faster then I shall take a good hard look at what I wrote all those years ago and make it better.


Comments

Because Intel likely wrote an x64-optimized, parallel implementation using extensions like SSE, and possibly GPU computation. When you get an order-of-magnitude speed-up, it's likely that a GPU is operating behind the scenes. You won't beat OpenCV's version, and matching it is difficult at best.

I recommend that you read through the book Learning OpenCV 3: http://shop.oreilly.com/product/06369...

The first chapter of Learning OpenCV 3 details the origins of the API.

sjhalayka ( 2018-10-21 12:40:06 -0500 )

Even if your OpenCV build is not running on the GPU, its algorithms are heavily optimized. Furthermore, they can take advantage of CPU-specific instructions (SIMD extensions such as SSE and AVX).

If you want your implementation to run faster, here are a few ideas:

  • Parallelize your methods (the TBB library is quite simple to use)
  • Precalculate whatever you can
  • Use direct pointer access instead of bounds-checked element access
  • Tailor the algorithm to the task

For example, in this case you can do everything in one pass through the image:

  for each pixel p:
      Gx=...
      Gy=...
      magn=sqrt(Gx*Gx+Gy*Gy)
      magn_total+=magn
  magn_total/=num_pixels

Anyway, implementing image-processing functions yourself - and trying to optimize them - will teach you a lot!

kbarni ( 2018-10-22 07:35:30 -0500 )