GPU Code Not Working Question

asked Sep 20 '12 by bharath422

updated Sep 21 '12 by Kirill Kornyakov

Hi,

(This is a continuation from the post "Pseudocode for custom GPU computation")

Following is my GPU OpenCV code. I tried two things: 1) set all pixels of the output to zero, and 2) return a monochrome image by thresholding the pixel values of the input image. But in both cases, instead of the result I expect, I get a grayscale image that is just peppered noise. My code is below:


//The size of the video frame is 480 * 640
//funcam.cpp
#include <iostream>
#include <opencv2/opencv.hpp>
#include <opencv2/gpu/gpu.hpp>

using namespace std;
using namespace cv;
using namespace cv::gpu;

// defined in funcam_cuda.cu
extern "C" void func(uchar* srcptr, uchar* dstptr, int step, int cols, int rows, int dststep);

void callKernel(const GpuMat& src, const GpuMat& dst)
{
   uchar* p = src.data;
   uchar* p2 = dst.data;
   func(p, p2, src.step, src.cols, src.rows, dst.step);
}

int main(int, char**) 
{
   VideoCapture cap(0);
   if(!cap.isOpened()) return -1;

   int frameH    = (int) cap.get(CV_CAP_PROP_FRAME_HEIGHT);
   int frameW    = (int) cap.get(CV_CAP_PROP_FRAME_WIDTH);
   cout << frameH << " " << frameW << endl;

   CvSize size = cvSize(frameH, frameW);
   Mat frame;
   Mat input;
   Mat output;
   GpuMat d_frame;
   GpuMat d_output;

   for(;;)
   {
      cap >> frame;
      if (frame.empty())
         break;

      //convert to grayscale
      cvtColor(frame, input, CV_BGR2GRAY);

      // memory Copy from Host to Device
      d_frame.upload(input);

      // Call CUDA kernel
      d_output.create(size, CV_8UC1);
      callKernel(d_frame, d_output);

      // memory Copy from Device to Host
      d_output.download(output);

      imshow("output", output);
      if(waitKey(30) >= 0)
         break;
   }
    return 0;
}

//funcam_cuda.cu
__global__ void funcKernel(uchar* srcptr, uchar* dstptr, int step, int cols, int rows, int dststep) 
{
   int rowInd = blockIdx.y * blockDim.y + threadIdx.y;
   int colInd = blockIdx.x * blockDim.x + threadIdx.x;
   uchar* rowsrcPtr = srcptr + rowInd*step;
   uchar* rowdstPtr = dstptr + rowInd*dststep;
   uchar pixVal = rowsrcPtr[colInd];
// rowdstPtr[colInd] = (pixVal > 60 ? 255 : 0);
   rowdstPtr[colInd] = 0;
}

extern "C"
void func(uchar* srcptr, uchar* dstptr, int step, int cols, int rows, int dststep)
{
   dim3 grDim (16, 12);
   dim3 blDim (40, 40);
   funcKernel<<< grDim, blDim >>>(srcptr, dstptr, step, cols, rows, dststep);
   cudaThreadSynchronize(); //Is this reqd?
}

1 answer

2

answered Sep 21 '12 by Vladislav Vinogradov

Hello!

Your main error is:

CvSize size = cvSize(frameH, frameW);

The cvSize() signature is cvSize(int width, int height); the first parameter must be the width:

CvSize size = cvSize(frameW, frameH);

Also, dim3 blDim (40, 40); is not a good block size: 40*40 = 1600 threads exceeds the per-block limit (512 on older devices, 1024 on newer ones), so the kernel launch fails. Good sizes are 16x16 or 32x8. cudaThreadSynchronize is deprecated; use cudaDeviceSynchronize instead.

// .cu file

#include <cuda_runtime.h>

__global__ void funcKernel(const unsigned char* srcptr, unsigned char* dstptr, 
                           size_t srcstep, size_t dststep, int cols, int rows)
{
    int rowInd = blockIdx.y * blockDim.y + threadIdx.y;
    int colInd = blockIdx.x * blockDim.x + threadIdx.x;

    if (rowInd >= rows || colInd >= cols)
        return;

    const unsigned char* rowsrcPtr = srcptr + rowInd * srcstep;
    unsigned char* rowdstPtr = dstptr + rowInd * dststep;

    unsigned char pixVal = rowsrcPtr[colInd];

    rowdstPtr[colInd] = (pixVal > 60 ? 255 : 0);
}

int divUp(int a, int b)
{
    return (a + b - 1) / b;
}

void func(const unsigned char* srcptr, unsigned char* dstptr, 
          size_t srcstep, size_t dststep, int cols, int rows)
{
    dim3 blDim(32, 8);
    dim3 grDim(divUp(cols, blDim.x), divUp(rows, blDim.y));

    funcKernel<<<grDim, blDim>>>(srcptr, dstptr, srcstep, dststep, cols, rows);

    cudaDeviceSynchronize();
}

Comments

You're a genius! Works like a charm! Thanks a ton! I'll probably have more questions, and I'm hoping you can help me out. Also, I am wondering why you say 16x16 is the best decomposition. Is it because each block can take only up to 512 threads?

bharath422 (Sep 21 '12)

Threads are executed in groups of 32 threads, called warps. So it's better if the block size is divisible by 32.

Oh cool, forgot about that. Isn't it also that the maximum threads per block is 512, so 40*40 > 512? Also, is the frames per second of the output video the same as the input? Is there a way to measure this?

bharath422 (Sep 28 '12)


Stats

Asked: Sep 20 '12

Seen: 1,002 times

Last updated: Sep 21 '12