Use:
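// descriptors1GPU holds the query descriptors, descriptors2GPU the train descriptors (both GpuMat)
gpu::BFMatcher_GPU matcher(NORM_L2);
GpuMat trainIdxMat, distanceMat, allDist;
// With k = 2, trainIdxMat is filled as 1 x nQuery of CV_32SC2 and distanceMat as 1 x nQuery of CV_32FC2
matcher.knnMatchSingle(descriptors1GPU, descriptors2GPU, trainIdxMat, distanceMat, allDist, 2);
// Raw pointers to the result buffers (int2 / float2 are CUDA vector types)
int2* trainIdx = trainIdxMat.ptr<int2>();
float2* distance = distanceMat.ptr<float2>();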
trainIdx[i] contains the indices of the two nearest matches for the i-th query descriptor, and distance[i] contains the corresponding distances.
I.e. row i of descriptors1GPU matches row trainIdx[i].x of descriptors2GPU with distance distance[i].x (the nearest match), and row trainIdx[i].y of descriptors2GPU with distance distance[i].y (the second nearest).
Both pointers refer to GPU memory, so you can pass trainIdx and distance directly to your own CUDA kernels; do not dereference them in host code.
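As an illustration of what such a kernel might do with these two arrays, here is a minimal host-side sketch in plain C++ of a ratio test over the k = 2 results. The ratio test itself, the names acceptMatch, filterMatches and maxRatio, and the Int2 / Float2 structs (stand-ins for CUDA's int2 / float2) are assumptions made for the example, not part of OpenCV. In a real kernel the loop would be replaced by one thread per query, with i computed from blockIdx.x * blockDim.x + threadIdx.x.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-ins for CUDA's int2 / float2 vector types.
struct Int2   { int x, y; };
struct Float2 { float x, y; };

// Per-query logic a kernel thread could run: keep the nearest match only if
// it is clearly better than the second nearest; return -1 otherwise.
inline int acceptMatch(const Int2& idx, const Float2& dist, float maxRatio)
{
    return (dist.x < maxRatio * dist.y) ? idx.x : -1;
}

// Host reference loop over all queries. In a kernel, each thread would
// handle one i, guarded by a bounds check against the number of queries.
std::vector<int> filterMatches(const std::vector<Int2>& trainIdx,
                               const std::vector<Float2>& distance,
                               float maxRatio)
{
    assert(trainIdx.size() == distance.size());
    std::vector<int> accepted(trainIdx.size());
    for (std::size_t i = 0; i < trainIdx.size(); ++i)
        accepted[i] = acceptMatch(trainIdx[i], distance[i], maxRatio);
    return accepted;
}
```

accepted[i] then holds the train index matched to query i, or -1 if the match was rejected as ambiguous. To run this logic on real results on the host, the two GpuMat buffers would first have to be downloaded to CPU memory.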