Helping the compiler generate better code
I found a set of routines in the imgcodecs module of OpenCV where a fairly simple code change improves performance by 2x on the Power platform but has little to no effect on x64. My question is: should changes like the one below be made to help the compiler? I realize that the compiler could eventually be made to generate better code on its own.
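    template<class dataType>
    inline void cvtBGR2Gray( const dataType* rgb, dataType* gray,
                             Size& size, int ncn, int _swap_rb )
    {
        int i;
    #if 0
        // Original loop: the channel index depends on _swap_rb (0 or 2),
        // so the loads use a run-time index on every iteration.
        for( i = 0; i < size.width; i++, rgb += ncn )
        {
            int t = descale( rgb[_swap_rb]*cB + rgb[1]*cG + rgb[_swap_rb^2]*cR, SCALE );
            gray[i] = (dataType)t;
        }
    #else
        // Proposed change: test _swap_rb once, outside the loop, so each
        // loop body uses fixed channel indices.
        if (_swap_rb)
        {
            // Red and blue swapped: channel 0 is red, channel 2 is blue.
            for( i = 0; i < size.width; i++, rgb += ncn )
            {
                int t = descale( rgb[0]*cR + rgb[1]*cG + rgb[2]*cB, SCALE );
                gray[i] = (dataType)t;
            }
        }
        else
        {
            // Default order: channel 0 is blue, channel 2 is red.
            for( i = 0; i < size.width; i++, rgb += ncn )
            {
                int t = descale( rgb[0]*cB + rgb[1]*cG + rgb[2]*cR, SCALE );
                gray[i] = (dataType)t;
            }
        }
    #endif
    }

    void icvCvt_BGRA2Gray_8u_C4C1R( const uchar* rgba, int rgba_step,
                                    uchar* gray, int gray_step,
                                    Size size, int _swap_rb )
    {
        // Normalize the flag to a channel index: 2 when swapping, else 0.
        _swap_rb = _swap_rb ? 2 : 0;
        for( ; size.height--; gray += gray_step )
        {
            cvtBGR2Gray<uchar>(rgba, gray, size, 4, _swap_rb);
            // cvtBGR2Gray takes the pointer by value, so advance a full row here.
            rgba += rgba_step - size.width*4;
        }
    }
    // Similar changes to icvCvt_BGR2Gray_8u_C3C1R and icvCvt_BGRA2Gray_16u_CnC1R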
The size of the binaries is a concern, and this change doubles the size of the code for basically all the cvtColors. There may be other ways, though. Could you try and see what happens when you change the function signature to:
... ncn, const int _swap_rb )
Marking the variable as constant may help the compiler achieve the same speed increase.
Unfortunately, changing the variable to a const int still does NOT get the compiler to generate fast code.
Well, you can make the change for your personal use and submit it as a pull-request. But PowerPC is not a particularly common architecture, so it may not be accepted.
I'll be submitting future patches to improve Power8/9 support, including performance, so these types of things are important. My goal is to have negligible effects on other platforms if possible.
I found a different approach that gets the speed improvements: 2x on Power8/9 and 5% on x64. The auto-vectorizer is able to do a better job, and there is also a small reduction in code size.
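The post above does not show what the different approach was. Purely as an illustration, here is one way to keep a single loop with fixed channel indices: swap the two coefficients once before the loop instead of swapping the indices inside it. This is a sketch, not necessarily what was submitted. The names grayIndexed, grayCoeffSwap and variantsAgree are made up for this example, and SCALE, cR, cG, cB and descale() are assumed stand-ins (ordinary 14-bit fixed-point luma weights) for the definitions the thread's code relies on. Whether a given compiler vectorizes this better has to be measured; nothing here was benchmarked.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Assumed stand-ins for the names used in the thread's code.
constexpr int SCALE = 14;
constexpr int cR = static_cast<int>(0.299 * (1 << SCALE) + 0.5);
constexpr int cG = static_cast<int>(0.587 * (1 << SCALE) + 0.5);
constexpr int cB = static_cast<int>(0.114 * (1 << SCALE) + 0.5);

// Fixed-point rounding shift (assumed definition).
inline int descale(int x, int n) { return (x + (1 << (n - 1))) >> n; }

// Same arithmetic as the original loop: the channel index depends on
// swap_rb, which must already be normalized to 0 or 2.
template <class T>
void grayIndexed(const T* rgb, T* gray, int width, int ncn, int swap_rb)
{
    assert(ncn >= 3 && (swap_rb == 0 || swap_rb == 2));
    for (int i = 0; i < width; i++, rgb += ncn)
        gray[i] = static_cast<T>(
            descale(rgb[swap_rb] * cB + rgb[1] * cG + rgb[swap_rb ^ 2] * cR, SCALE));
}

// Hypothetical alternative: pick the coefficients once, then run a single
// loop whose channel indices are compile-time constants.
template <class T>
void grayCoeffSwap(const T* rgb, T* gray, int width, int ncn, int swap_rb)
{
    assert(ncn >= 3);
    int c0 = cB, c2 = cR;            // weights for channel 0 and channel 2
    if (swap_rb) std::swap(c0, c2);  // red/blue swapped: exchange the weights
    for (int i = 0; i < width; i++, rgb += ncn)
        gray[i] = static_cast<T>(
            descale(rgb[0] * c0 + rgb[1] * cG + rgb[2] * c2, SCALE));
}

// Runs both variants on the same synthetic 4-channel row and reports
// whether every output pixel matches.
inline bool variantsAgree(int swap_rb)
{
    const int width = 64, ncn = 4;
    std::vector<unsigned char> row(width * ncn);
    for (std::size_t k = 0; k < row.size(); k++)
        row[k] = static_cast<unsigned char>((k * 37 + 11) & 0xFF);
    std::vector<unsigned char> a(width), b(width);
    grayIndexed(row.data(), a.data(), width, ncn, swap_rb);
    grayCoeffSwap(row.data(), b.data(), width, ncn, swap_rb);
    return a == b;
}
```

Calling variantsAgree(0) and variantsAgree(2) is a quick way to confirm that a rewrite like this leaves the output unchanged before timing it on each platform. Compared with the two-loop version, this keeps one loop body, so it does not add to the binary size.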
Then please do make a pull request on GitHub.