Wolf's profile - activity

2019-11-13 22:32:38 -0500 received badge  Notable Question (source)
2017-11-17 04:12:05 -0500 received badge  Popular Question (source)
2016-02-15 10:18:31 -0500 received badge  Nice Question (source)
2015-05-21 02:58:30 -0500 asked a question Are `void writeScalar(const uchar*);` declarations obsolete?

I see you have noted that using cudaMemcpyToSymbol is generally a bad choice if you want things to be asynchronous and thread-safe, as I have mentioned here, and that you have removed the uses of cudaMemcpyToSymbol from the GPU setTo methods:

I believe the `void writeScalar(const uchar*);` and similar declarations at

are now obsolete in the 2.4 branch, aren't they?

2015-03-31 10:07:13 -0500 asked a question Best practice for CUDA streams --> How to get OpenCV GPU module to work asynchronously??

I am using CUDA via the OpenCV GPU module. It works well, but I am not sure whether some further performance improvement is possible in my application.

I use multiple threads and multiple CUDA streams, i.e. gpu::Streams. All calls (upload, download and processing) are issued asynchronously on the streams. Once a thread has finished enqueuing the operations for its image, it sleeps and waits for the stream to synchronize. After that, the processing of the image is done.

However, I recently noticed that cudaMalloc and cudaFree are synchronous methods, i.e. they wait for all streams (of all threads?) to synchronize before the action is carried out. In my case I create an empty GpuMat and a stream when the processing of an image starts, and then start the upload (enqueueUpload) and the processing on that stream. When the processing of the image is done, the GpuMat goes out of scope, i.e. the device memory is released. So cudaMalloc and cudaFree are called for every image. I guess this causes my entire program to behave mostly synchronously?

What would be the best practice here for processing images in an asynchronous pipeline? Allocate a number of GpuMat images at startup and then only use those images for copying? That would be okay, but it does not seem very nice, because then one would again have to track which images are in use and which are not, a chore I was glad to be rid of thanks to the reference counting of cv::Mat and cv::gpu::GpuMat. Are there better methods?
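
One possible way to preallocate and still keep reference-counting convenience is a small pool that hands out shared handles whose deleter returns the buffer to the pool instead of freeing it. The following is only a minimal sketch under stated assumptions: `BufferPool` and `Buffer` are hypothetical names, and a plain `std::vector<unsigned char>` stands in for a preallocated cv::gpu::GpuMat so that the example compiles without CUDA.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a device image; in the real pipeline this
// would be a preallocated cv::gpu::GpuMat (assumption, not OpenCV API).
using Buffer = std::vector<unsigned char>;

// Hypothetical pool: all buffers are allocated once, up front.
class BufferPool : public std::enable_shared_from_this<BufferPool> {
public:
    static std::shared_ptr<BufferPool> create(std::size_t count, std::size_t bytes) {
        return std::shared_ptr<BufferPool>(new BufferPool(count, bytes));
    }

    // Hands out a free buffer, or nullptr if all buffers are in use.
    // The handle's deleter puts the buffer back instead of freeing it.
    std::shared_ptr<Buffer> acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_.empty()) return nullptr;
        std::unique_ptr<Buffer> buf = std::move(free_.back());
        free_.pop_back();
        std::shared_ptr<BufferPool> self = shared_from_this();  // keeps the pool alive
        return std::shared_ptr<Buffer>(buf.release(), [self](Buffer* b) {
            std::lock_guard<std::mutex> relock(self->mutex_);
            self->free_.push_back(std::unique_ptr<Buffer>(b));
        });
    }

    // Number of buffers currently not handed out.
    std::size_t available() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return free_.size();
    }

private:
    BufferPool(std::size_t count, std::size_t bytes) {
        free_.reserve(count);  // returning a buffer never reallocates the list
        for (std::size_t i = 0; i < count; ++i)
            free_.push_back(std::unique_ptr<Buffer>(new Buffer(bytes)));
    }

    mutable std::mutex mutex_;
    std::vector<std::unique_ptr<Buffer>> free_;
};
```

A worker thread would call `acquire()` when it starts on an image and simply let the handle go out of scope when it is done. No allocation or release happens in the steady state, so the synchronizing calls would be confined to startup and shutdown.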


Getting the OpenCV GPU module to work truly asynchronously appears to be an issue even beyond what I already mentioned. It would be nice if the API were clearer about that (as the CUDA API itself should be!).


When calling cv::gpu::warpPerspective with properly preallocated cv::gpu::GpuMat matrices and a proper cv::gpu::Stream, I think one would assume that it behaves asynchronously. However, internally it calls cudaMemcpyToSymbol (and there are more functions that do so), which is in fact a synchronous call. So even when it is called with preallocated matrices and a proper stream, the result is an at least partly synchronous call. Moreover, it also causes all other currently active streams on the GPU to synchronize.

Are there any ideas/plans how to deal with that?

2015-03-31 09:49:16 -0500 received badge  Scholar (source)
2014-09-08 07:03:34 -0500 asked a question imencode for non CV_8UC3 matrices

Hi guys,

Is there a generic way to use imencode for any image depth and number of channels? I understand that the basic functionality is intended for CV_8UC3, i.e. color images. CV_8UC1 often works as well, and some formats also support 16-bit depth.

However, is there a way to persist binary data of arbitrary types, e.g. CV_32SC6? Even completely uncompressed binary output would be OK. The YAML file I/O is not suitable for megapixel images.

Thanks in advance

2014-02-07 04:57:59 -0500 commented question cv::gpu::blur returns zero Mat when Mat is processed in place

Thanks for the reply. The doc of FilterEngine mentions this. Maybe the doc of blur etc. should also mention it, or the functions should assert on it. It took me quite a while to find this error when porting from CPU to GPU.

2014-02-06 05:13:55 -0500 answered a question imshow() and waitKey()

Try cv::startWindowThread() after opening the window with cv::namedWindow(). It makes your window more responsive if there is no waitKey around.

2014-02-06 05:09:41 -0500 asked a question cv::gpu::blur returns zero Mat when Mat is processed in place

With cv::blur on CPU you can process cv::Mat in place like:

// have defined and filled cv::Mat my_mat
cv::blur( my_mat, my_mat, cv::Size( 4, 4 ) );

If I try the same thing with cv::gpu::GpuMat and cv::gpu::blur like

// have defined and filled cv::gpu::GpuMat my_mat
cv::gpu::blur( my_mat, my_mat, cv::Size( 4, 4 ) );

my_mat is now just all zeros, i. e. black.

If I create an empty GpuMat header, pass it to blur as the destination and assign the result back to my_mat, everything works again like the CPU example:

// have defined and filled cv::gpu::GpuMat my_mat
cv::gpu::GpuMat tmp;
cv::gpu::blur( my_mat, tmp, cv::Size( 4, 4 ) );
my_mat = tmp;

I think this is a bug in cv::gpu::blur, isn't it?

I am using Ubuntu 12.04 64-bit with CUDA 5.5 and OpenCV 2.4.7.

2014-01-23 06:16:05 -0500 received badge  Student (source)
2014-01-21 06:43:33 -0500 received badge  Editor (source)
2014-01-21 06:22:36 -0500 asked a question cv::Mat::copyTo and cv::gpu::GpuMat::copyTo show different behaviour

The doc of Mat::copyTo (void Mat::copyTo(OutputArray m, InputArray mask) const) says:

When the operation mask is specified, and the Mat::create call shown above reallocated the matrix, the newly allocated matrix is initialized with all zeros before copying the data.

I called the method with an empty Mat and a mask for both CPU and GPU. The CPU variant initializes all values to zero, while the GPU variant apparently does not.

For clarity:

My Code CPU1:

// have defined and filled cv::Mat lo_input, lo_mask;
cv::Mat lo_mat;
lo_input.copyTo( lo_mat, lo_mask );

--> works perfectly: lo_mat now contains the values of lo_input where lo_mask is non-zero, and zero everywhere else

My Code GPU1:

// have defined and filled cv::gpu::GpuMat lo_input, lo_mask;
cv::gpu::GpuMat lo_mat;
lo_input.copyTo( lo_mat, lo_mask );

--> does not cause a seg fault, and lo_mat has a size and type, so create is apparently called on lo_mat. However, where lo_mask is zero, lo_mat contains random fragments of previously released matrices, so its values are apparently not initialized to zero

My Code GPU2:

// have defined and filled cv::gpu::GpuMat lo_input, lo_mask;
cv::gpu::GpuMat lo_mat( lo_input.size(), lo_input.type() );
lo_mat.setTo( cv::Scalar::all( 0 ) );
lo_input.copyTo( lo_mat, lo_mask );

--> works perfectly, like CPU1

Is this difference in the behaviours of cv::Mat::copyTo and cv::gpu::GpuMat::copyTo intended or a bug?

I am using Ubuntu 12.04 64-bit with CUDA 5.5 and OpenCV 2.4.7.
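
To make the expected semantics explicit, here is a minimal sketch of the behaviour the Mat::copyTo doc describes, written in plain C++ without OpenCV. `maskedCopy` is a hypothetical helper, and flat `std::vector`s stand in for the matrices. It is not the OpenCV implementation, only an illustration of "zero-fill on reallocation, then copy where the mask is non-zero".

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper mirroring the documented masked-copy semantics:
// if dst has to be (re)allocated, it is zero-filled first; afterwards
// src[i] is copied to dst[i] wherever mask[i] is non-zero.
void maskedCopy(const std::vector<int>& src,
                const std::vector<unsigned char>& mask,
                std::vector<int>& dst) {
    if (src.size() != mask.size())
        throw std::invalid_argument("maskedCopy: src and mask sizes differ");
    if (dst.size() != src.size())
        dst.assign(src.size(), 0);  // newly allocated destination starts as all zeros
    for (std::size_t i = 0; i < src.size(); ++i)
        if (mask[i] != 0) dst[i] = src[i];
}
```

Under this reading, an empty destination ends up with zeros wherever the mask is zero (the CPU1 result), while a destination that already has the right size keeps its old values in those places.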