After scouring the internet for information on using multiple NVIDIA CUDA-enabled devices in the same program to get a performance boost, I found that there wasn't much available, and that there were several unanswered questions on here from users struggling with the same problem.
So I am answering my own question with a blog post, partly in the hope of earning some internet points. The post explains how to share data between threads that each manage a CUDA device. I hope someone finds it useful, and sorry if I broke the forum rules. link text
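
For anyone who wants the general shape of the idea without leaving the thread, here is a minimal sketch of the pattern, not the code from the blog. It assumes one host thread per device, a shared input buffer that every thread only reads, and a mutex around the one value they all write. The names `SharedState`, `process_slice`, `device_worker` and `run_on_devices` are made up for illustration, and `process_slice` is a plain CPU stand-in for whatever you would actually run on the GPU. In a real program each thread would call `cudaSetDevice(device_id)` first so that its later CUDA calls go to its own device.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <numeric>
#include <thread>
#include <vector>

// State shared by every device-managing thread (hypothetical layout).
struct SharedState {
    std::vector<float> input;  // read-only while the workers are running
    double total = 0.0;        // written by all workers, guarded by `lock`
    std::mutex lock;
};

// Hypothetical CPU stand-in for the per-device GPU work. A real version
// would copy the slice to the device, launch a kernel, and copy the
// result back.
double process_slice(const float* data, std::size_t count) {
    return std::accumulate(data, data + count, 0.0);
}

// One of these runs per device, each on its own host thread.
void device_worker(int device_id, int device_count, SharedState& shared) {
    // Assumption: a real program would call cudaSetDevice(device_id) here.
    const std::size_t n = shared.input.size();
    const std::size_t begin = n * static_cast<std::size_t>(device_id) / device_count;
    const std::size_t end = n * static_cast<std::size_t>(device_id + 1) / device_count;

    // Each worker touches only its own slice of the shared input,
    // so the read side needs no locking.
    const double partial = process_slice(shared.input.data() + begin, end - begin);

    // The shared result is written by every worker, so take the lock.
    std::lock_guard<std::mutex> guard(shared.lock);
    shared.total += partial;
}

// Starts one worker thread per device and waits for all of them.
double run_on_devices(int device_count, SharedState& shared) {
    std::vector<std::thread> workers;
    for (int id = 0; id < device_count; ++id) {
        workers.emplace_back(device_worker, id, device_count, std::ref(shared));
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return shared.total;
}
```

To try it, fill `input`, then call `run_on_devices` with the number of devices you have. Splitting the input into disjoint slices keeps the lock off the hot path: the threads only contend when they merge their partial results.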