Slow initial call in a batch of CUDA sparse PyrLK optical flow operations
I'm writing a program that registers a frame against a set of previous frames by using optical flow to track key points. The keyframes are stored in a circular buffer, and optical flow is called against the oldest frame in the buffer first, moving towards the newer frames. I'm running this on a Windows 7 x64 machine with NVIDIA driver 353.90 and a GTX Titan X.
Because of the architecture of the program, there may be a delay between batches of operations while new images are loaded, etc. That is, the stream queue looks something like this (a code sketch of one such batch follows the list):
upload
opt flow (20 ms)
opt flow (1 ms)
opt flow (1 ms)
opt flow (1 ms)
opt flow (1 ms)
download
upload
opt flow (20 ms)
opt flow (1 ms)
opt flow (1 ms)
.........
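In code, one batch looks roughly like the sketch below. It uses the OpenCV 3.x cv::cuda::SparsePyrLKOpticalFlow API; the Keyframe struct, the trackBatch name and the buffer handling are simplified stand-ins for what my actual program does.

    #include <vector>
    #include <opencv2/core/cuda.hpp>
    #include <opencv2/cudaoptflow.hpp>

    // Hypothetical keyframe entry: image and key points already resident on the GPU.
    struct Keyframe
    {
        cv::cuda::GpuMat image;   // 8-bit grayscale keyframe
        cv::cuda::GpuMat points;  // 1xN CV_32FC2 key points to track
    };

    // One batch: upload the new frame, track it against every keyframe
    // (oldest to newest), then download the results.
    void trackBatch(const std::vector<Keyframe>& keyframes,
                    const cv::Mat& newFrame,
                    cv::cuda::SparsePyrLKOpticalFlow& flow,
                    cv::cuda::Stream& stream)
    {
        cv::cuda::GpuMat d_newFrame, d_trackedPts, d_status;
        d_newFrame.upload(newFrame, stream);                    // upload

        for (const Keyframe& kf : keyframes)
        {
            // The first calc() of each batch costs ~20 ms, the rest ~1 ms.
            flow.calc(kf.image, d_newFrame, kf.points,
                      d_trackedPts, d_status, cv::noArray(), stream);
        }

        cv::Mat trackedPts;
        d_trackedPts.download(trackedPts, stream);              // download
        stream.waitForCompletion();  // only here so I can measure the batch
    }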
I'm running all of this on a single stream; however, for the sake of measuring time, I'm calling stream.waitForCompletion() after each operation. Ideally, once this is working correctly, I'll be able to remove all of the synchronization.
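For reference, each individual timing above is taken roughly like this (same names as in the sketch above; the waitForCompletion() is exactly the synchronization I intend to remove later):

    int64 t0 = cv::getTickCount();
    flow.calc(kf.image, d_newFrame, kf.points,
              d_trackedPts, d_status, cv::noArray(), stream);
    stream.waitForCompletion();   // block until this call has finished on the GPU
    double ms = (cv::getTickCount() - t0) * 1000.0 / cv::getTickFrequency();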
I'm also aware that a first launch can take longer because the driver compiles the code. However, I was under the impression that this only affects the very first launch, not the first launch in every batch of launches.
Is there any way to reduce that 20 ms first call to optical flow to something more reasonable?
Should I set up two streams, so that the memory transfers are on one and optical flow has a dedicated stream of its own?
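Concretely, the two-stream layout I have in mind would look something like this (untested sketch, reusing the hypothetical Keyframe struct from above; the events only order the copies against the flow calls):

    void trackBatchTwoStreams(const std::vector<Keyframe>& keyframes,
                              const cv::Mat& newFrame,
                              cv::cuda::SparsePyrLKOpticalFlow& flow)
    {
        cv::cuda::Stream copyStream, flowStream;
        cv::cuda::Event uploadDone, flowDone;

        cv::cuda::GpuMat d_newFrame, d_trackedPts, d_status;
        d_newFrame.upload(newFrame, copyStream);   // H2D copy on the transfer stream
        uploadDone.record(copyStream);
        flowStream.waitEvent(uploadDone);          // flow calls wait for the upload

        for (const Keyframe& kf : keyframes)
            flow.calc(kf.image, d_newFrame, kf.points,
                      d_trackedPts, d_status, cv::noArray(), flowStream);

        flowDone.record(flowStream);
        copyStream.waitEvent(flowDone);            // download waits for the last calc()

        cv::Mat trackedPts;
        d_trackedPts.download(trackedPts, copyStream);
    }

(I realize that for the copies to truly overlap with the kernels the host buffers would also need to be pinned, e.g. via cv::cuda::HostMem, but that seems like a separate issue from the 20 ms first call.)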
[EDIT] I've tested whether it could be a WDDM driver command-queue issue similar to this: https://devtalk.nvidia.com/default/to.... by manually flushing the queue with a cudaEventQuery on one of my events; however, this doesn't seem to change anything. If I remove the synchronization, the second call to optical flow costs 20 ms instead.
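For completeness, the flush I tried looks like this (sketch; stream is the cv::cuda::Stream from the code above, and StreamAccessor just exposes the underlying cudaStream_t):

    #include <cuda_runtime.h>
    #include <opencv2/core/cuda_stream_accessor.hpp>

    cudaEvent_t flushEvent;
    cudaEventCreateWithFlags(&flushEvent, cudaEventDisableTiming);
    cudaEventRecord(flushEvent, cv::cuda::StreamAccessor::getStream(stream));
    cudaEventQuery(flushEvent);   // non-blocking poll, intended only to force WDDM to submit its batch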