Dagmor's profile - activity

2019-12-27 07:50:22 -0600 received badge  Popular Question (source)
2018-07-19 09:13:06 -0600 received badge  Notable Question (source)
2016-08-08 05:36:40 -0600 received badge  Popular Question (source)
2014-11-21 09:57:37 -0600 received badge  Nice Answer (source)
2014-11-19 11:39:12 -0600 commented question gpu::convolve and gpu::filter2D vs cv::filter2D, opencv 2.4.9

Did you ever figure out how to make this work? Seems like a pretty huge issue. If nothing else, it needs better documentation.

2014-09-13 03:29:12 -0600 commented answer resize and remap functions utterly wrong

Wowza! Great reply.

I hear you on the resize; it certainly can be ambiguous. But I ran through the logical options, and the opencv resize results still didn't make sense.

Sounds like remap needs a flag to turn off the weighting table and get more accurate results. I just made my own version of the remap function, and it only takes ~30% more time on a 10000x10000 float image (using TBB and a 64-core machine). Another oddity: on the same run I tried the "convertMaps" function, but the "performant" maps HURT performance by a factor of 4 (making remap more than twice as slow as my version)!

Not sure if/how opencv uses threads, but this suggests the performance gain(?) may not be worth it for the typical user.
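
For the curious, my replacement is nothing clever - just full float weights, one TBB task per row. A simplified sketch, not my exact code (remapFloat is a made-up name; it assumes CV_32FC1 data and maps that stay inside the image):

#include <opencv2/opencv.hpp>
#include <tbb/parallel_for.h>
#include <algorithm>

// Hand-rolled bilinear remap for CV_32FC1 images, using full float
// weights instead of a quantized interpolation table.
static void remapFloat(const cv::Mat &src, cv::Mat &dst,
                       const cv::Mat &colMap, const cv::Mat &rowMap)
{
    dst.create(rowMap.size(), CV_32FC1);
    tbb::parallel_for(0, rowMap.rows, [&](int r)
    {
        for(int c = 0; c < rowMap.cols; ++c)
        {
            float fr = rowMap.at<float>(r, c);
            float fc = colMap.at<float>(r, c);
            // Clamp so the 2x2 neighborhood stays inside the image.
            int r0 = std::min(static_cast<int>(fr), src.rows - 2);
            int c0 = std::min(static_cast<int>(fc), src.cols - 2);
            float dr = fr - r0, dc = fc - c0;
            float top = (1-dc)*src.at<float>(r0,   c0) + dc*src.at<float>(r0,   c0+1);
            float bot = (1-dc)*src.at<float>(r0+1, c0) + dc*src.at<float>(r0+1, c0+1);
            dst.at<float>(r, c) = (1-dr)*top + dr*bot;
        }
    });
}

It gets called just like the snippet in my question: remapFloat(test, output, colMap, rowMap);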

2014-09-13 03:14:57 -0600 received badge  Supporter (source)
2014-09-12 20:48:33 -0600 received badge  Student (source)
2014-09-12 17:42:34 -0600 asked a question resize and remap functions utterly wrong

As far as I can tell, both the remap and resize functions are implemented incorrectly (at least with the default bilinear interpolation). Consider the following code:

#include <opencv2/opencv.hpp>
#include <cstdio>

using namespace std;
using namespace cv;

int main(int argc, char **argv)
{
    // 2x2 test image: [1 2; 3 4]
    int inputSize=2;
    Mat test(inputSize,inputSize,CV_32FC1);
    test.at<float>(0,0)=1.0;
    test.at<float>(0,1)=2.0;
    test.at<float>(1,0)=3.0;
    test.at<float>(1,1)=4.0;

    // Upscale 2x2 -> 4x4 with bilinear interpolation.
    int size=4;
    Mat output;
    resize(test,output,Size(size,size),0.0,0.0,INTER_LINEAR);
    for(int ridx=0;ridx<size;++ridx)
    {
        for(int cidx=0;cidx<size;++cidx)
            printf("%f ",output.at<float>(ridx,cidx));
        printf("\n");
    }
    printf("\n");

    // Remap with maps that hit the source corners exactly and
    // step evenly (by 1/3) in between.
    Mat rowMap(size,size,CV_32FC1),colMap(size,size,CV_32FC1);
    for(int ridx=0;ridx<size;++ridx)
        for(int cidx=0;cidx<size;++cidx)
        {
            rowMap.at<float>(ridx,cidx) = ridx*(static_cast<float>(inputSize-1)/(size-1));
            colMap.at<float>(ridx,cidx) = cidx*(static_cast<float>(inputSize-1)/(size-1));
        }
    remap(test,output,colMap,rowMap,INTER_LINEAR);

    // Print the row map, the column map, then the remap result.
    for(int ridx=0;ridx<size;++ridx)
    {
        for(int cidx=0;cidx<size;++cidx)
            printf("%f ",rowMap.at<float>(ridx,cidx));
        printf("\n");
    }
    printf("\n");
    for(int ridx=0;ridx<size;++ridx)
    {
        for(int cidx=0;cidx<size;++cidx)
            printf("%f ",colMap.at<float>(ridx,cidx));
        printf("\n");
    }
    printf("\n");
    for(int ridx=0;ridx<size;++ridx)
    {
        for(int cidx=0;cidx<size;++cidx)
            printf("%f ",output.at<float>(ridx,cidx));
        printf("\n");
    }
    return 0;
}

And output:

1.000000 1.250000 1.750000 2.000000
1.500000 1.750000 2.250000 2.500000
2.500000 2.750000 3.250000 3.500000
3.000000 3.250000 3.750000 4.000000

0.000000 0.000000 0.000000 0.000000
0.333333 0.333333 0.333333 0.333333
0.666667 0.666667 0.666667 0.666667
1.000000 1.000000 1.000000 1.000000

0.000000 0.333333 0.666667 1.000000
0.000000 0.333333 0.666667 1.000000
0.000000 0.333333 0.666667 1.000000
0.000000 0.333333 0.666667 1.000000

1.000000 1.343750 1.656250 2.000000
1.687500 2.031250 2.343750 2.687500
2.312500 2.656250 2.968750 3.312500
3.000000 3.343750 3.656250 4.000000

The results for resize and remap should both step smoothly in 1/3 intervals - I don't know what on earth opencv is doing here. To me, these results seem completely inaccurate. Please enlighten me!
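
For reference, working the bilinear interpolation out by hand with those maps (on this image the value at fractional source coordinates (r,c) is just 1 + c + 2r), the remap output should be:

1.000000 1.333333 1.666667 2.000000
1.666667 2.000000 2.333333 2.666667
2.333333 2.666667 3.000000 3.333333
3.000000 3.333333 3.666667 4.000000

which is not what remap returns above.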

2014-09-02 16:06:20 -0600 received badge  Scholar (source)
2014-08-29 08:54:16 -0600 commented answer GPU Cuda initialization much slower with opencv libraries

Thanks Steve - it says I need >50 points to mark my own answer as a solution, so I'll have to come back later and do it. Yeah, removing the GUI stuff was the last piece that gave back all the performance I needed, woohoo!

2014-08-29 05:06:11 -0600 received badge  Teacher (source)
2014-08-29 02:11:46 -0600 received badge  Self-Learner (source)
2014-08-28 16:17:38 -0600 answered a question GPU Cuda initialization much slower with opencv libraries

Problem solved!

For anyone wanting to know how to speed up opencv initialization, here ya go:

  1. Compile binary CUDA kernels for your compute capability (mine was 3.5) (~20% of the extra time)
  2. Compile the library statically instead of dynamically (~40% of the extra time)
  3. Remove all GUI dependencies (~40% of the extra time)

These three changes took my start time from ~7.5 seconds down to ~0.7 seconds (almost the same as without opencv at all). Here are the cmake flags I changed to do the above:

CUDA_ARCH_BIN=3.5
CUDA_ARCH_PTX=
BUILD_SHARED_LIBS=off
CMAKE_CXX_FLAGS=-fPIC
WITH_QT=off
WITH_VTK=off
WITH_GTK=off
WITH_OPENGL=off
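
For reference, that works out to a single configure line roughly like this (the ../opencv source path is just a placeholder for wherever your checkout lives):

cmake -D CUDA_ARCH_BIN=3.5 -D CUDA_ARCH_PTX= -D BUILD_SHARED_LIBS=OFF \
      -D CMAKE_CXX_FLAGS=-fPIC -D WITH_QT=OFF -D WITH_VTK=OFF \
      -D WITH_GTK=OFF -D WITH_OPENGL=OFF ../opencv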

Hope this helps someone out in the future - there's certainly little information out there about this.

2014-08-28 12:48:58 -0600 commented question GPU Cuda initialization much slower with opencv libraries

Another update - compiling with static libs shaved off another 3 seconds! I'm getting there...

2014-08-28 09:14:24 -0600 commented question GPU Cuda initialization much slower with opencv libraries

Another update - opencv 2.4.9 is actually slightly slower, by about 0.1 seconds, so that didn't help.

2014-08-27 12:37:11 -0600 commented question GPU Cuda initialization much slower with opencv libraries

Reporting back: setting CUDA_ARCH_BIN dropped about 2 seconds off the initialization time, but I'm still looking at a ~6 second startup lag. If opencv isn't compiling PTX now, what on earth is it doing?

2014-08-26 12:05:35 -0600 commented question GPU Cuda initialization much slower with opencv libraries

Another thought: the documentation leads me to believe that the cuda code is precompiled by default only for compute capabilities 1.1 and 1.3, and that if I add CUDA_ARCH_BIN=3.5 to the CMake defines, it'll precompile the cuda kernels for my K20c. I'm trying it out and will report back if it helps.
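
In CMake terms, that's roughly (../opencv standing in for the source directory):

cmake -D CUDA_ARCH_BIN=3.5 ../opencv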

2014-08-26 10:18:31 -0600 received badge  Editor (source)
2014-08-26 09:52:40 -0600 commented question GPU Cuda initialization much slower with opencv libraries

Thanks for the reply Steven. Unfortunately, I don't have the luxury of that startup lag being acceptable. According to the opencv documentation, the lag could be JIT PTX compilation, and setting CUDA_DEVCODE_CACHE should cache the compiled PTX code for future runs, but that feature does not seem to be working. Has anyone ever even tried this? Google fails me (or maybe I fail Google).

2014-08-25 17:26:42 -0600 asked a question GPU Cuda initialization much slower with opencv libraries

Hello all,

Prereqs for posting, my environment: Linux x86_64, OpenCV 2.4.6.1, CUDA 5.0, Tesla Kepler K20c GPU

I've got a simple C++ application to benchmark cuda performance. It makes and times the following calls once each in order:

float *someMemory = nullptr;  // declared up front - only the four calls below are timed
cudaSetDevice(0);
cudaMalloc(&someMemory, sizeof(float)*1024*1024);
cudaFree(someMemory);
cudaDeviceReset();

With just the cuda libraries linked, this takes tens of milliseconds for each call except for the malloc, which is about 0.25 seconds. Fine...no biggie, it's all part of GPU startup costs.
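
(The timing itself is nothing fancy - roughly the sketch below; timeMs is just a made-up helper around std::chrono, not a CUDA or opencv call.)

#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

// Hypothetical helper: wall-clock time of a single call, in milliseconds.
template <typename F>
static double timeMs(F f)
{
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    float *someMemory = nullptr;
    printf("setDevice: %8.1f ms\n", timeMs([]  { cudaSetDevice(0); }));
    printf("malloc:    %8.1f ms\n", timeMs([&] { cudaMalloc(&someMemory, sizeof(float)*1024*1024); }));
    printf("free:      %8.1f ms\n", timeMs([&] { cudaFree(someMemory); }));
    printf("reset:     %8.1f ms\n", timeMs([]  { cudaDeviceReset(); }));
    return 0;
}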

Here's the weird part - if I include libopencv_gpu.so and libopencv_core.so in the linker list (-lopencv_gpu -lopencv_core), without changing the code whatsoever, those timings go through the roof. The cudaSetDevice call takes ~2.5 seconds, and the malloc takes ~5 seconds. Calls after that seem to be just as fast, but a ~7.5 second startup cost is ridiculous considering it's only ~0.5 seconds without the opencv libraries.

Another oddity: taking out libopencv_gpu and leaving just the core library still has an effect - the set device call still takes ~2.5 seconds, and the malloc takes ~0.7 seconds. What gives?

This affects more than my benchmark app, and it is repeatable. Does anyone have any insight into how opencv is destroying my startup performance? I tried setting CUDA_DEVCODE_CACHE to /tmp/devcode, thinking the lag was PTX compilation, but nothing was created in the directory - am I using it wrong?

Any help would be great. Thanks!