Does Farneback optical flow destroy AMD cards, or is it my code?

asked 2017-04-20 12:50:23 -0500

MikeSZ

updated 2017-04-21 02:10:44 -0500

berak

I have had a serious problem with code that I have been developing for a few months. It is built on OpenCL, and I use only one OpenCV function, calcOpticalFlowFarneback, which is also accelerated with OpenCL; the rest of the code consists of several not very complex kernels.

The issue is that it has destroyed two graphics cards in five months: an XFX R9 270 and an MSI RX 470. The first was left in a state where it could only be "used" without the drivers, and pink dots appeared in the text during PC boot. The second remained usable for a while, with sporadic hangs and black / white / pink screens, until it failed completely.

Interestingly, this only happened with 1280x720 videos (I have not tested higher resolutions); with 320x240 and 640x480 there were no failures. The second time, I had forgotten how the previous failure had come about. I just ran the code with a video of this resolution and, a second later, after some results were shown, the system failed. The first time it was a black screen; the second time, a hang.

I used OpenCV 3.0 at first, then 3.2.

At the moment I do not have any cards to test with, and I do not want to risk another one either.

Does anyone have any ideas? Is the problem calcOpticalFlowFarneback at these resolutions? Or my kernels?

The host code is mostly flow control. With OpenCV 3.0 there are copies/writes; with 3.2 I avoid them, and the UMats use my buffers. The most complex kernels are these (the rest only do grayscale conversion, type conversion, etc.):

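__kernel void kernel_SumAndDiv(__global uchar4 *imageIn, __global float4 *imageSum, __global uchar4 *imageOut, uint count)
{
    // One work-item per pixel. The row stride is taken from the global work
    // size, so the NDRange must match the image dimensions exactly.
    const uint x = get_global_id(0);
    const uint y = get_global_id(1);
    const uint width = get_global_size(0);
    const uint pos = x + y * width;
    // Add the new frame to the running sum.
    float4 color = imageSum[pos];
    color = color + convert_float4(imageIn[pos]);
    imageSum[pos] = color;
    // Write the sum divided by count, clamped to the 8-bit range.
    imageOut[pos] = convert_uchar4(clamp(color / (float)count, 0.0f, 255.0f));
}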

__kernel void kernel_Add(__global uchar4* srcImage, __global uchar4* framesBuffer, __global float4* cumulImage,  __global uchar4* dstImage, const int framePos)
{
    // One work-item per pixel; image size is taken from the global work size.
    const uint x = get_global_id(0);
    const uint y = get_global_id(1);
    const uint width = get_global_size(0);
    const uint height = get_global_size(1);
    const uint pixelPos = x + y * width;
    // Offset of this pixel inside frame slot framePos of the frame buffer.
    const uint bufferFramePos = width * height * framePos;
    const uint bufferPixelPos = bufferFramePos + pixelPos;
    uchar4 srcColor = srcImage[pixelPos];
    float4 cumulColor = cumulImage[pixelPos];
    cumulColor = cumulColor + convert_float4(srcColor);
    // Store the frame in its slot and update the running sum.
    framesBuffer[bufferPixelPos] = srcColor;
    cumulImage[pixelPos] = cumulColor;
    // Write the sum divided by the number of frames added so far.
    dstImage[pixelPos] = convert_uchar4(clamp(cumulColor / (float)(framePos+1), 0.0f, 255.0f));
}

__kernel void kernel_InsertAndUpdate(__global uchar4* srcImage, __global uchar4* framesBuffer, __global float4* cumulImage, __global uchar4* dstImage,  const int framePos, const int frameCount)
{
    // One work-item per pixel; image size is taken from the global work size.
    const uint x = get_global_id(0);
    const uint y = get_global_id(1);
    const uint width = get_global_size(0);
    const uint height = get_global_size(1);
    const uint pixelPos = x + y * width;
    // Offset of this pixel inside frame slot framePos of the frame buffer.
    const uint bufferFramePos = width * height * framePos;
    const uint bufferPixelPos = bufferFramePos + pixelPos;
    uchar4 srcColor = srcImage[pixelPos];
    float4 srcColorf = convert_float4(srcColor);
    float4 cumulColor = cumulImage[pixelPos];
    float4 bufferColor = convert_float4(framesBuffer[bufferPixelPos]);
    // Replace the frame stored in this slot: subtract it from the sum, add the new one.
    cumulColor = cumulColor - bufferColor + srcColorf;
    framesBuffer[bufferPixelPos] = srcColor;
    cumulImage[pixelPos] = cumulColor ...
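To check the kernels' output and indexing away from the GPU, the same running average can be sketched on the CPU. This is a minimal single-channel sketch, not the asker's host code: `SlidingMean`, `push`, and `filled` are hypothetical names, and dividing by `frameCount` once the buffer is full is an assumption, because the end of kernel_InsertAndUpdate is not shown. Using `.at()` makes an out-of-range slot index throw instead of writing past the end of the buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for one 8-bit channel; all names are illustrative.
struct SlidingMean {
    std::size_t pixels;        // width * height
    std::size_t frameCount;    // capacity of the frame buffer, in frames
    std::size_t filled = 0;    // frames stored so far (<= frameCount)
    std::size_t framePos = 0;  // slot the next frame is written to
    std::vector<std::uint8_t> frames;  // frameCount * pixels, like framesBuffer
    std::vector<float> cumul;          // pixels, like cumulImage

    SlidingMean(std::size_t pixels_, std::size_t frameCount_)
        : pixels(pixels_), frameCount(frameCount_),
          frames(pixels_ * frameCount_, 0), cumul(pixels_, 0.0f) {
        assert(pixels_ > 0 && frameCount_ > 0);
    }

    // Stores one frame and returns the current mean image.
    std::vector<std::uint8_t> push(const std::vector<std::uint8_t>& src) {
        assert(src.size() == pixels);
        const bool full = (filled == frameCount);
        // While filling, divide by frames added so far (as kernel_Add does);
        // once full, divide by frameCount (assumed, see the note above).
        const std::size_t n = full ? frameCount : filled + 1;
        const std::size_t base = framePos * pixels;
        std::vector<std::uint8_t> dst(pixels);
        for (std::size_t i = 0; i < pixels; ++i) {
            // .at() throws std::out_of_range on a bad index.
            std::uint8_t& slot = frames.at(base + i);
            float sum = cumul.at(i) + static_cast<float>(src[i]);
            if (full) sum -= static_cast<float>(slot);  // drop the replaced frame
            slot = src[i];
            cumul.at(i) = sum;
            const float mean = sum / static_cast<float>(n);
            dst[i] = static_cast<std::uint8_t>(std::min(std::max(mean, 0.0f), 255.0f));
        }
        if (!full) ++filled;
        framePos = (framePos + 1) % frameCount;
        return dst;
    }
};
```

Feeding the same frames to this reference and to the kernels at 640x480 and at 1280x720 would show whether the results diverge only at the larger size, without putting a GPU at risk for the CPU half of the comparison.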

Comments

The only reason graphics cards break down is overheating. Your problem is NOT in how you have written the code, but rather in:

  • Your system not being able to cool your card and inner casing
  • Your system power supply not being powerful enough to feed a hard-working graphics card

It's impossible for plain software kernels to break a hardware part like this.

StevenPuttemans (2017-04-21 02:21:57 -0500)

I also use the cards for gaming (AAA games) and have never come close to the maximum operating temperature defined by the manufacturer; in the worst case, only the older card reached about 73 degrees C. I keep the fans regulated with MSI Afterburner, which I also use to watch GPU/CPU usage during the tests. The power supply is more than enough for the hardware (1000W, 46A). The system is well ventilated by 4 additional fans, with an ambient temperature of about 25 C. The second card had only about 3 weeks of use. The error occurred almost immediately, less than 5 seconds after processing started, I think. Both times. The second time I had forgotten how the previous one had failed; I remembered it in the worst way, by seeing it happen again while trying the latest version of the code. I hate myself.

MikeSZ (2017-04-21 08:34:07 -0500)

Both cards were well within proper temperatures: less than 60 degrees C for the first, barely 53 C for the second. GPU usage was about 50%, if I remember correctly. This is just one process among 50 others, some far more demanding (e.g., super-resolution), but both failures happened with this one. The most I can think of is a memory or memory controller failure, since one of the kernels makes several passes (up to 5) over a buffer of several MB that contains several consecutive frames, with each work-item processing one pixel in each frame on each pass; this happens for every new input frame of the video. But with 640x480 it never failed!!

MikeSZ (2017-04-21 08:37:07 -0500)
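As a rough size check on the buffers mentioned in this comment, the arithmetic can be written out. This is only a sketch: the function names are hypothetical, the element sizes follow from the kernel signatures in the question (uchar4 is 4 bytes, float4 is 16 bytes), and the number of buffered frames is not given in the thread, so `frames` is left as a parameter.

```cpp
#include <cstddef>

// Bytes needed by the frame buffer: one uchar4 (4 bytes) per pixel per frame.
constexpr std::size_t framesBufferBytes(std::size_t width, std::size_t height,
                                        std::size_t frames) {
    return width * height * frames * 4;
}

// Bytes needed by the accumulator image: one float4 (16 bytes) per pixel.
constexpr std::size_t cumulImageBytes(std::size_t width, std::size_t height) {
    return width * height * 16;
}

// Per-frame and accumulator sizes at the two resolutions from the question.
static_assert(framesBufferBytes(640, 480, 1) == 1228800, "640x480 frame slot");
static_assert(framesBufferBytes(1280, 720, 1) == 3686400, "1280x720 frame slot");
static_assert(cumulImageBytes(640, 480) == 4915200, "640x480 accumulator");
static_assert(cumulImageBytes(1280, 720) == 14745600, "1280x720 accumulator");
```

Every buffer is exactly three times larger at 1280x720 than at 640x480. The totals for the actual frame count could then be compared with the size passed to buffer creation on the host and with the device's CL_DEVICE_MAX_MEM_ALLOC_SIZE, as reported by clGetDeviceInfo.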

Yes, I know it is VERY strange. I asked the same question on AMD's developer community forum; one reply asked whether I could share the code, but unfortunately I cannot share all of it.

MikeSZ (2017-04-21 08:45:10 -0500)