I'm working on improving the work-group reduce functions in a Linux OpenCL stack for GPGPUs. The work-group functions are briefly described in [1]; in short, they let the work-items (threads) of a work-group cooperate on an add/min/max reduction, and every work-item in the group receives the group's result. For instance, work_group_reduce_add with a local size of 4 maps the input (1, 2, 3, 4, 5, 6, 7, 8) to {(10, 10, 10, 10), (26, 26, 26, 26)}, while work_group_reduce_min with a local size of 2 maps the same input to {(1, 1), (3, 3), (5, 5), (7, 7)}.
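To make those semantics concrete, here is a minimal host-side sketch in Python. It only emulates the behaviour described above and is not how an OpenCL implementation performs the reduction; the helper name emulate_work_group_reduce is hypothetical, and I'm assuming the global size is a multiple of the local size.

```python
from functools import reduce
import operator


def emulate_work_group_reduce(values, local_size, op):
    """Emulate a work-group reduce on the host (hypothetical helper).

    The input is split into consecutive groups of `local_size` items,
    each group is reduced with `op`, and the result is broadcast to
    every work-item of that group.
    """
    if len(values) % local_size != 0:
        raise ValueError("global size must be a multiple of the local size")
    groups = []
    for start in range(0, len(values), local_size):
        group = values[start:start + local_size]
        result = reduce(op, group)              # one value per work-group
        groups.append((result,) * local_size)   # seen by every work-item
    return groups


data = [1, 2, 3, 4, 5, 6, 7, 8]
add_result = emulate_work_group_reduce(data, 4, operator.add)
min_result = emulate_work_group_reduce(data, 2, min)
```

With the input above, add_result holds two groups of four identical sums and min_result holds four groups of two identical minima, matching the two examples.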
Are you aware of any particular algorithms that would benefit from work-group reduce add/min/max? Perhaps something SIMD-oriented that nevertheless requires a moderate amount of communication between threads.
[1] https://software.intel.com/en-us/articles/using-opencl-20-work-group-functions