
GPU Code Not Working Question

asked 2012-09-20 11:55:57 -0600 by bharath422

updated 2012-09-21 08:14:28 -0600 by Kirill Kornyakov

Hi,

(This is a continuation of the post "Pseudocode for custom GPU computation".)

Following is my OpenCV GPU code. I tried two things: 1) just setting all pixels of the output to zero, and 2) returning a monochrome image based on a threshold on the pixel values of the input image. In both cases, when I run it I don't get the result I want; I get a grayscale image that is just salt-and-pepper noise. My code is below:


//The size of the video frame is 480 * 640
//funcam.cpp

#include <iostream>

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/gpu/gpu.hpp>

using namespace std;
using namespace cv;
using namespace cv::gpu;

// defined in funcam_cuda.cu
extern "C" void func(uchar* srcptr, uchar* dstptr, int step, int cols, int rows, int dststep);

void callKernel(const GpuMat& src, const GpuMat& dst)
{
   uchar* p = src.data;
   uchar* p2 = dst.data;
   func(p, p2, src.step, src.cols, src.rows, dst.step);
}

int main(int, char**) 
{
   VideoCapture cap(0);
   if(!cap.isOpened()) return -1;

   int frameH    = (int) cap.get(CV_CAP_PROP_FRAME_HEIGHT);
   int frameW    = (int) cap.get(CV_CAP_PROP_FRAME_WIDTH);
   cout << frameH << " " << frameW << endl;

   CvSize size = cvSize(frameH, frameW);
   Mat frame;
   Mat input;
   Mat output;
   GpuMat d_frame;
   GpuMat d_output;

   for(;;)
   {
      cap >> frame;
      if (frame.empty())
         break;

      //convert to grayscale
      cvtColor(frame, input, CV_BGR2GRAY);

      // memory Copy from Host to Device
      d_frame.upload(input);

      // Call CUDA kernel
      d_output.create(size, CV_8UC1);
      callKernel(d_frame, d_output);

      // memory Copy from Device to Host
      d_output.download(output);

      imshow("output", output);
      if(waitKey(30) >= 0)
         break;
   }
   return 0;
}

//funcam_cuda.cu
__global__ void funcKernel(uchar* srcptr, uchar* dstptr, int step, int cols, int rows, int dststep) 
{
   int rowInd = blockIdx.y * blockDim.y + threadIdx.y;
   int colInd = blockIdx.x * blockDim.x + threadIdx.x;
   uchar* rowsrcPtr = srcptr + rowInd*step;
   uchar* rowdstPtr = dstptr + rowInd*dststep;
   uchar pixVal = rowsrcPtr[colInd];
   // rowdstPtr[colInd] = (pixVal > 60 ? 255 : 0);
   rowdstPtr[colInd] = 0;
}

extern "C"
void func(uchar* srcptr, uchar* dstptr, int step, int cols, int rows, int dststep)
{
   dim3 grDim (16, 12);
   dim3 blDim (40, 40);
   funcKernel<<< grDim, blDim >>>(srcptr, dstptr, step, cols, rows, dststep);
   cudaThreadSynchronize(); //Is this reqd?
}

1 answer

answered 2012-09-21 01:30:02 -0600 by Vladislav Vinogradov

Hello!

Your main error is:

CvSize size = cvSize(frameH, frameW);

The cvSize() signature is cvSize(int width, int height); the first parameter must be the width:

CvSize size = cvSize(frameW, frameH);

Also, dim3 blDim (40, 40); is not a good block size: 40 * 40 = 1600 threads is more than a block can hold (512 threads on compute capability 1.x, 1024 on 2.x and later), so the kernel launch fails. Good sizes are 16x16 or 32x8. cudaThreadSynchronize is deprecated; use cudaDeviceSynchronize instead.

// .cu file

#include <cuda_runtime.h>

__global__ void funcKernel(const unsigned char* srcptr, unsigned char* dstptr, 
                           size_t srcstep, size_t dststep, int cols, int rows)
{
    int rowInd = blockIdx.y * blockDim.y + threadIdx.y;
    int colInd = blockIdx.x * blockDim.x + threadIdx.x;

    if (rowInd >= rows || colInd >= cols)
        return;

    const unsigned char* rowsrcPtr = srcptr + rowInd * srcstep;
    unsigned char* rowdstPtr = dstptr + rowInd * dststep;

    unsigned char pixVal = rowsrcPtr[colInd];

    rowdstPtr[colInd] = (pixVal > 60 ? 255 : 0);
}

int divUp(int a, int b)
{
    return (a + b - 1) / b;
}

void func(const unsigned char* srcptr, unsigned char* dstptr, 
          size_t srcstep, size_t dststep, int cols, int rows)
{
    dim3 blDim(32, 8);
    dim3 grDim(divUp(cols, blDim.x), divUp(rows, blDim.y));

    funcKernel<<<grDim, blDim>>>(srcptr, dstptr, srcstep, dststep, cols, rows);

    cudaDeviceSynchronize();
}
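
The host side (not shown above) also has to be updated to the new func signature, plus the cvSize fix in main(). A minimal sketch, assuming the same GpuMat buffers as in the question:

// funcam.cpp (sketch): the declaration must match the .cu definition above
void func(const unsigned char* srcptr, unsigned char* dstptr,
          size_t srcstep, size_t dststep, int cols, int rows);

void callKernel(const GpuMat& src, GpuMat& dst)
{
    // GpuMat::step is the row pitch in bytes, which is what the kernel's
    // pointer arithmetic expects for CV_8UC1 data
    func(src.data, dst.data, src.step, dst.step, src.cols, src.rows);
}

// ... and in main(), width comes first:
// CvSize size = cvSize(frameW, frameH);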

Comments

You're a genius! Works like a charm! Thanks a ton! I'll probably have more questions, and I'm hoping you can help me out. Also, I am wondering why you say 16x16 is the best decomposition. Is it because each block can take only up to 512 threads?

bharath422 ( 2012-09-21 12:05:55 -0600 )

Threads are executed in groups of 32 threads, called warps, so it's better if the block size is a multiple of 32.

Vladislav Vinogradov ( 2012-09-22 09:59:56 -0600 )
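
For reference, these limits can be queried at runtime with cudaGetDeviceProperties; a small standalone sketch (not part of the answer above):

// limits.cu (sketch): print the limits that constrain block size
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // device 0

    printf("warp size:             %d\n", prop.warpSize);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dimensions:  %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);

    return 0;
}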

Oh cool, forgot about that. Isn't it also that the maximum threads per block is 512, so 40 * 40 > 512? Also, is the frame rate of the output video the same as the input? Is there a way to measure this?

bharath422 ( 2012-09-28 12:30:34 -0600 )
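
On measuring the frame rate: one simple approach (a sketch, not from the thread) is to time each iteration of the capture loop on the host with OpenCV's tick counters, inside the for(;;) loop of funcam.cpp above:

// inside the for(;;) loop of funcam.cpp (sketch)
int64 t0 = getTickCount();

cap >> frame;
cvtColor(frame, input, CV_BGR2GRAY);
d_frame.upload(input);
d_output.create(size, CV_8UC1);
callKernel(d_frame, d_output);
d_output.download(output);

double seconds = (getTickCount() - t0) / getTickFrequency();
cout << "frame time: " << seconds * 1000 << " ms, fps: " << 1.0 / seconds << endl;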
