GPU Code Not Working Question

asked Sep 20 '12 by bharath422

updated Sep 21 '12 by Kirill Kornyakov

Hi,

(This is a continuation from the post "Pseudocode for custom GPU computation")

Following is my GPU OpenCV code. I tried two things: 1) set all pixels of the output to zero, and 2) return a monochrome image by thresholding the pixel values of the input image. But in both cases, instead of the result I expect, I get a grayscale image that is just peppered noise. My code is below:


//The size of the video frame is 480 * 640
//funcam.cpp
#include <iostream>
#include <opencv2/opencv.hpp>
#include <opencv2/gpu/gpu.hpp>

using namespace std;
using namespace cv;
using namespace cv::gpu;

// defined in funcam_cuda.cu
extern "C" void func(uchar* srcptr, uchar* dstptr, int step, int cols, int rows, int dststep);

void callKernel(const GpuMat& src, const GpuMat& dst)
{
   uchar* p = src.data;
   uchar* p2 = dst.data;
   func(p, p2, src.step, src.cols, src.rows, dst.step);
}

int main(int, char**) 
{
   VideoCapture cap(0);
   if(!cap.isOpened()) return -1;

   int frameH    = (int) cap.get(CV_CAP_PROP_FRAME_HEIGHT);
   int frameW    = (int) cap.get(CV_CAP_PROP_FRAME_WIDTH);
   cout << frameH << " " << frameW << endl;

   CvSize size = cvSize(frameH, frameW);
   Mat frame;
   Mat input;
   Mat output;
   GpuMat d_frame;
   GpuMat d_output;

   for(;;)
   {
      cap >> frame;
      if (frame.empty())
         break;

      //convert to grayscale
      cvtColor(frame, input, CV_BGR2GRAY);

      // memory Copy from Host to Device
      d_frame.upload(input);

      // Call CUDA kernel
      d_output.create(size, CV_8UC1);
      callKernel(d_frame, d_output);

      // memory Copy from Device to Host
      d_output.download(output);

      imshow("output", output);
      if(waitKey(30) >= 0)
         break;
   }
    return 0;
}

//funcam_cuda.cu
__global__ void funcKernel(uchar* srcptr, uchar* dstptr, int step, int cols, int rows, int dststep) 
{
   int rowInd = blockIdx.y * blockDim.y + threadIdx.y;
   int colInd = blockIdx.x * blockDim.x + threadIdx.x;
   uchar* rowsrcPtr = srcptr + rowInd*step;
   uchar* rowdstPtr = dstptr + rowInd*dststep;
   uchar pixVal = rowsrcPtr[colInd];
// rowdstPtr[colInd] = (pixVal > 60 ? 255 : 0);
   rowdstPtr[colInd] = 0;
}

extern "C"
void func(uchar* srcptr, uchar* dstptr, int step, int cols, int rows, int dststep)
{
   dim3 grDim (16, 12);
   dim3 blDim (40, 40);
   funcKernel<<< grDim, blDim >>>(srcptr, dstptr, step, cols, rows, dststep);
   cudaThreadSynchronize(); //Is this reqd?
}

1 answer

2

answered Sep 21 '12 by Vladislav Vinogradov

Hello!

Your main error is:

CvSize size = cvSize(frameH, frameW);

The cvSize() signature is cvSize(int width, int height); the first parameter must be the width:

CvSize size = cvSize(frameW, frameH);

Also, dim3 blDim (40, 40); is not a good block size: 40*40 = 1600 threads exceeds the per-block limit (512 on older devices, 1024 on newer ones), so the kernel launch fails. Good sizes are 16x16 or 32x8. cudaThreadSynchronize is deprecated; use cudaDeviceSynchronize instead.

// .cu file

#include <cuda_runtime.h>

__global__ void funcKernel(const unsigned char* srcptr, unsigned char* dstptr, 
                           size_t srcstep, size_t dststep, int cols, int rows)
{
    int rowInd = blockIdx.y * blockDim.y + threadIdx.y;
    int colInd = blockIdx.x * blockDim.x + threadIdx.x;

    if (rowInd >= rows || colInd >= cols)
        return;

    const unsigned char* rowsrcPtr = srcptr + rowInd * srcstep;
    unsigned char* rowdstPtr = dstptr + rowInd * dststep;

    unsigned char pixVal = rowsrcPtr[colInd];

    rowdstPtr[colInd] = (pixVal > 60 ? 255 : 0);
}

int divUp(int a, int b)
{
    return (a + b - 1) / b;
}

void func(const unsigned char* srcptr, unsigned char* dstptr, 
          size_t srcstep, size_t dststep, int cols, int rows)
{
    dim3 blDim(32, 8);
    dim3 grDim(divUp(cols, blDim.x), divUp(rows, blDim.y));

    funcKernel<<<grDim, blDim>>>(srcptr, dstptr, srcstep, dststep, cols, rows);

    cudaDeviceSynchronize();
}

Comments

You're a genius! Works like a charm! Thanks a ton! I'll probably have more questions, and I'm hoping you can help me out. Also, I am wondering why you say 16x16 is the best decomposition. Is it because each block can take only up to 512 threads?

bharath422 (Sep 21 '12)

Threads are executed in groups of 32 threads, called warps. So it's better if the block size is divisible by 32.

Oh cool, forgot about that. Isn't it also that the maximum threads per block is 512, so 40*40 > 512? Also, is the frames per second of the output video the same as the input? Is there a way to measure this?

bharath422 (Sep 28 '12)


Stats

Asked: Sep 20 '12

Seen: 1,002 times

Last updated: Sep 21 '12