TBB not being used for CMat memory copy, just IPP

asked 2017-12-28 09:25:56 -0500

Boogaloo

I'm using OpenCV 3.4.0 built with WITH_TBB enabled, but stepping through the copyTo and clone functions, I see that they only have a code path for IPP.

A single-threaded memcpy is slow for 2K and 4K frames, and I am surprised that TBB isn't used. Should it be?

Any advice would be appreciated.



My guess is that the memcpy operation is limited by memory bandwidth (nowadays memory transfer is the limiting factor; see the different levels of CPU cache), so parallelization should not improve the operation.

You can always implement your own copyTo function if you want.
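One way to check this on a given machine is to write a row-striped copy and time it against a single memcpy. The sketch below is only an illustration: it uses std::thread on a raw byte buffer rather than TBB or OpenCV's own parallel framework, and the name parallelCopy and its parameters are hypothetical, not OpenCV API. It also assumes both buffers are continuous (no padding between rows).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical helper (not part of OpenCV): copies `rows` rows of `rowBytes`
// bytes each from src to dst. The rows are split into contiguous stripes and
// each stripe is copied by its own thread. Assumes continuous buffers.
void parallelCopy(const unsigned char* src, unsigned char* dst,
                  std::size_t rows, std::size_t rowBytes, unsigned numThreads)
{
    assert(src != nullptr && dst != nullptr);
    numThreads = std::max(1u, numThreads);

    // Round up so that every row is assigned to some stripe.
    const std::size_t rowsPerThread = (rows + numThreads - 1) / numThreads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        const std::size_t first = t * rowsPerThread;
        if (first >= rows) break;  // no rows left for this thread
        const std::size_t count = std::min(rowsPerThread, rows - first);
        workers.emplace_back([=] {
            std::memcpy(dst + first * rowBytes,
                        src + first * rowBytes,
                        count * rowBytes);
        });
    }
    for (auto& w : workers) w.join();
}
```

Timing this against a plain memcpy for 2K and 4K frame sizes, with different thread counts, would show whether extra threads help or whether the copy is already bandwidth-bound on that hardware.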

Eduardo (2017-12-28 10:17:27 -0500)

I understand that in modern CPUs each core has its own PCIe channel, which is why when I tested OCV 3.1 and TBB was used to transfer large memory CMats it was much faster than single-threaded. Did TBB get removed from the CMat copy source in 3.3 onwards?

Boogaloo (2017-12-29 07:20:13 -0500)

What is CMat? There is only cv::Mat and cv::UMat in OpenCV.

All the changes are versioned on GitHub if you want to check.

Eduardo (2017-12-29 18:10:55 -0500)

I meant cv::Mat, sorry for the confusion.

I remember seeing parallel_for in use for Mat memory copy back when I used OCV 3.1. I thought it was used for Mat copying, but now I am not sure. Digging through all the source code would be dull work; is there a way to quickly find every place in OCV that uses parallel_for?
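A plain text search over a local checkout of the sources is probably the quickest way to answer that. As a rough C++17 sketch (the helper name findUses and the extension filter are hypothetical choices for illustration; the search string would be `parallel_for_`, the name of OpenCV's parallel loop function):

```cpp
#include <cassert>
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: walks `root` recursively and returns one "path:line"
// entry for every line in a .cpp/.hpp/.h file that contains `needle`.
std::vector<std::string> findUses(const fs::path& root, const std::string& needle)
{
    assert(!needle.empty());  // an empty needle would match every line

    std::vector<std::string> hits;
    for (const auto& entry : fs::recursive_directory_iterator(root)) {
        if (!entry.is_regular_file()) continue;
        const std::string ext = entry.path().extension().string();
        if (ext != ".cpp" && ext != ".hpp" && ext != ".h") continue;

        std::ifstream in(entry.path());
        std::string line;
        for (std::size_t n = 1; std::getline(in, line); ++n) {
            if (line.find(needle) != std::string::npos)
                hits.push_back(entry.path().string() + ":" + std::to_string(n));
        }
    }
    return hits;
}
```

Calling `findUses("path/to/opencv/modules", "parallel_for_")` on a checkout (the path here is a placeholder) would list the call sites, which could then be compared between release tags.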

Given the large size of Mat data with FHD and 4K images, and that CPU cores have their own memory channels, using parallel_for would boost performance in apps using hi-res frames.

I tried with Intel IPP, but the performance was only 1% better than a straight fast memcpy using _128 instructions on aligned memory transfer.
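For anyone wanting to reproduce this kind of comparison, a baseline throughput figure for plain memcpy is a useful reference point. The following is a minimal sketch, not the measurement code used above: the helper name memcpyThroughputGBs is hypothetical, and it times std::memcpy only, so an IPP or hand-written copy would need to be substituted into the loop to compare.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: returns the average throughput, in GB/s, of copying
// `bytes` bytes with std::memcpy, repeated `iterations` times.
double memcpyThroughputGBs(std::size_t bytes, int iterations)
{
    assert(bytes > 0 && iterations > 0);
    std::vector<unsigned char> src(bytes, 1), dst(bytes, 0);

    // Warm-up pass so first-touch page faults are not part of the timing.
    std::memcpy(dst.data(), src.data(), bytes);

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        std::memcpy(dst.data(), src.data(), bytes);
        // Feed a copied byte back into the source to discourage the
        // compiler from dropping or merging the copies.
        src[0] = static_cast<unsigned char>(dst[0] + i);
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;

    return (static_cast<double>(bytes) * iterations) / elapsed.count() / 1e9;
}
```

For a 4K three-channel 8-bit frame, `bytes` would be 3840 * 2160 * 3. Results vary with CPU, memory configuration, and compiler flags, so a 1% difference is only meaningful if it is stable across repeated runs.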

Boogaloo (2017-12-30 05:14:55 -0500)

Here is the code for copyTo() in OpenCV 3.1.

Eduardo (2017-12-30 08:54:55 -0500)

Thanks. No sign of TBB there, just IPP, so I must have seen the TBB parallel_for somewhere else... think I need a personal memory upgrade! I'm reading up on fast memory transfers. Intel has a lot of options, but some people report that rep sto is the fastest due to microcode improvements. I will have to test it with a modified copyTo and will post if the results are interesting.

Boogaloo (2017-12-30 13:26:04 -0500)