Helping the compiler generate better code

asked 2019-06-03 12:47:54 -0600

CK1966

I found a set of routines in the imgcodecs module of OpenCV where a fairly simple code change improves performance by 2x on the Power platform but has little to no effect on x64. My question is: should changes like the one below be made to help the compiler? I realize that the compiler "could" eventually be improved to generate better code on its own.

template<class dataType>
inline void cvtBGR2Gray( const dataType* rgb, dataType* gray,
                         Size& size, int ncn, int _swap_rb )
{
    int i;
#if 0
    for( i = 0; i < size.width; i++, rgb += ncn )
    {
        int t = descale( rgb[_swap_rb]*cB + rgb[1]*cG + rgb[_swap_rb^2]*cR, SCALE );
        gray[i] = (dataType)t;
    }
#else
    if (_swap_rb)
    {
        for( i = 0; i < size.width; i++, rgb += ncn )
        {
            int t = descale( rgb[0]*cR + rgb[1]*cG + rgb[2]*cB, SCALE );
            gray[i] = (dataType)t;
        }
    }
    else
    {
        for( i = 0; i < size.width; i++, rgb += ncn )
        {
            int t = descale( rgb[0]*cB + rgb[1]*cG + rgb[2]*cR, SCALE );
            gray[i] = (dataType)t;
        }
    }
#endif
}

void icvCvt_BGRA2Gray_8u_C4C1R( const uchar* rgba, int rgba_step,
                                 uchar* gray, int gray_step,
                                 Size size, int _swap_rb )
{
   _swap_rb = _swap_rb ? 2 : 0;
   for( ; size.height--; gray += gray_step )
   {
       cvtBGR2Gray<uchar>(rgba, gray, size, 4, _swap_rb);

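       // cvtBGR2Gray takes the pointer by value, so rgba still points at the
       // start of the row here; advance by the full row stride.
       rgba += rgba_step;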
   }
}

// Similar changes to icvCvt_BGR2Gray_8u_C3C1R and icvCvt_BGRA2Gray_16u_CnC1R
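
For readers who want to try this outside OpenCV, here is a minimal standalone sketch of the two loop shapes. The names grayIndexed, grayUnswitched and variantsAgree are made up for illustration, and the constants are assumed stand-ins (BT.601 luma weights in 14-bit fixed point), not copied from the OpenCV headers. The first variant picks the channel with a run-time index inside the loop; the second hoists the test out of the loop so each body uses constant indices, which is presumably what makes it easier to vectorize.

```cpp
#include <cstdint>
#include <vector>

// Assumed stand-ins for the constants used in the question.
constexpr int SCALE = 14;
constexpr int cR = static_cast<int>(0.299 * (1 << SCALE) + 0.5);
constexpr int cG = static_cast<int>(0.587 * (1 << SCALE) + 0.5);
constexpr int cB = static_cast<int>(0.114 * (1 << SCALE) + 0.5);

// Round-to-nearest right shift for fixed-point results.
constexpr int descale(int x, int n) { return (x + (1 << (n - 1))) >> n; }

// Variant A: the channel index depends on a run-time value (swap is 0 or 2).
void grayIndexed(const std::uint8_t* px, std::uint8_t* gray,
                 int width, int ncn, int swap)
{
    for (int i = 0; i < width; i++, px += ncn)
        gray[i] = static_cast<std::uint8_t>(
            descale(px[swap] * cB + px[1] * cG + px[swap ^ 2] * cR, SCALE));
}

// Variant B: the test is hoisted out of the loop; both bodies use
// constant channel indices.
void grayUnswitched(const std::uint8_t* px, std::uint8_t* gray,
                    int width, int ncn, int swap)
{
    if (swap)
    {
        for (int i = 0; i < width; i++, px += ncn)
            gray[i] = static_cast<std::uint8_t>(
                descale(px[0] * cR + px[1] * cG + px[2] * cB, SCALE));
    }
    else
    {
        for (int i = 0; i < width; i++, px += ncn)
            gray[i] = static_cast<std::uint8_t>(
                descale(px[0] * cB + px[1] * cG + px[2] * cR, SCALE));
    }
}

// Runs both variants on the same 3-pixel, 4-channel row and reports
// whether they produce identical output.
bool variantsAgree(int swap)
{
    const std::vector<std::uint8_t> row = {10, 20, 30, 255,
                                           200, 100, 50, 255,
                                           0, 255, 128, 255};
    const int width = 3, ncn = 4;
    std::vector<std::uint8_t> a(width), b(width);
    grayIndexed(row.data(), a.data(), width, ncn, swap);
    grayUnswitched(row.data(), b.data(), width, ncn, swap);
    return a == b;
}
```

Calling variantsAgree(0) and variantsAgree(2) is a quick way to confirm the rewrite does not change results before timing the two versions on each platform.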

Comments

The size of the binaries is a concern, and this doubles the size of the code for basically all the cvtColors. There may be other ways, though. Could you see what happens when you change the function signature to ... ncn, const int _swap_rb )? Marking the variable as constant may help the compiler achieve the same speed increase.

Tetragramm (2019-06-03 19:59:03 -0600)

Unfortunately, changing the variable to a const int still does NOT get the compiler to output fast code.

CK1966 (2019-06-04 13:23:07 -0600)
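
A likely reason the const qualifier makes no difference: top-level const on a by-value parameter is not part of the function's type, and the value is still only known at run time, so the compiler probably learns nothing it could not already see from the loop body. The small sketch below illustrates the language rule. pickChannel is a hypothetical helper, not an OpenCV function.

```cpp
#include <type_traits>

// The declaration and the definition below name the same function:
// top-level const on a by-value parameter only restricts the function body.
int pickChannel(const unsigned char* px, int swap);

int pickChannel(const unsigned char* px, const int swap)
{
    // swap cannot be modified here, but its value is still a run-time input.
    return px[swap];
}

// The two function types are identical as far as the language is concerned.
static_assert(std::is_same<void(int), void(const int)>::value,
              "top-level const does not change the function type");
```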

Well, you can make the change for your personal use and submit it as a pull request. But PowerPC is not a particularly common architecture, so it may not be accepted.

Tetragramm (2019-06-04 18:22:19 -0600)

I'll be submitting future patches to improve support for the Power8/9 architecture, including performance, so these types of things are important. My goal is to have negligible effects on other platforms if possible.

CK1966 (2019-06-05 06:56:16 -0600)

I found a different approach that gets the speed improvements: 2x on Power8/9 and 5% on x64. The auto-vectorizer is able to do a better job. Plus, there is a small reduction in size.

CK1966 (2019-06-07 13:21:06 -0600)
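
The comment above does not say what the different approach was. One way to get a single loop with constant indices, offered only as a guess and not as the author's actual change, is to swap the two outer weights once before the loop instead of swapping the channel indices inside it. In the sketch below, graySwappedWeights is a hypothetical name and the constants are assumed stand-ins (BT.601 weights in 14-bit fixed point).

```cpp
#include <cstdint>
#include <utility>

// Assumed stand-ins for the constants used in the question.
constexpr int SCALE = 14;
constexpr int cR = static_cast<int>(0.299 * (1 << SCALE) + 0.5);
constexpr int cG = static_cast<int>(0.587 * (1 << SCALE) + 0.5);
constexpr int cB = static_cast<int>(0.114 * (1 << SCALE) + 0.5);

// Round-to-nearest right shift for fixed-point results.
constexpr int descale(int x, int n) { return (x + (1 << (n - 1))) >> n; }

// One loop body, constant channel indices: the run-time choice is moved
// into the weights, which are fixed before the loop starts.
void graySwappedWeights(const std::uint8_t* px, std::uint8_t* gray,
                        int width, int ncn, bool swapRB)
{
    int w0 = cB;  // weight applied to channel 0
    int w2 = cR;  // weight applied to channel 2
    if (swapRB)
        std::swap(w0, w2);

    for (int i = 0; i < width; i++, px += ncn)
        gray[i] = static_cast<std::uint8_t>(
            descale(px[0] * w0 + px[1] * cG + px[2] * w2, SCALE));
}
```

Because there is only one copy of the loop, this shape would also be consistent with the reported size reduction, though that remains to be measured.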

Then please do make a pull request on GitHub.

Tetragramm (2019-06-07 19:27:09 -0600)