OpenCV matchTemplate CUDA large images & templates

asked 2013-07-02 10:20:30 -0500

ggl

Dear All,

I am interested in using template matching on large (satellite) images (at least 8192 by 8192 pixels), using templates from reference image sets that are typically 256 by 256 or 512 by 512 pixels in size. A normal use case is matching N by N templates against the image (N=5,7,9...).
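For a sense of scale, these sizes determine the size of the correlation surface that matchTemplate produces: (W - w + 1) by (H - h + 1) values for a W by H image and a w by h template. The sketch below works through that arithmetic, assuming a single-channel 32-bit float (CV_32F) result. The names `ResultSize` and `resultSize` are illustrative only and are not part of OpenCV.

```cpp
#include <cstddef>

// Hypothetical helper type (not an OpenCV type): dimensions and memory
// footprint of a template-matching result.
struct ResultSize {
    int cols;
    int rows;
    std::size_t bytes;  // assumes one 32-bit float per result element
};

// Result dimensions follow the usual rule: image size - template size + 1.
ResultSize resultSize(int imgCols, int imgRows, int templCols, int templRows) {
    ResultSize r;
    r.cols = imgCols - templCols + 1;
    r.rows = imgRows - templRows + 1;
    r.bytes = static_cast<std::size_t>(r.cols) *
              static_cast<std::size_t>(r.rows) * sizeof(float);
    return r;
}
```

Under these assumptions, `resultSize(8192, 8192, 256, 256)` describes a 7937 by 7937 surface (roughly 252 MB of floats) and a 512 by 512 template gives 7681 by 7681 (roughly 236 MB), which has to fit on the device alongside the image itself.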

I am using OpenCV 2.4.6 with CUDA 4.2. I managed to get the GPU version of matchTemplate working, but ran into the initialization timing issue, which makes the GPU version slower than the CPU version for a single image/single template match. Careful timing (see the code below) shows that the code spends 98% of its time on initialization. I know this is related to JIT compilation of the CUDA code, but the documentation's pointers to the nvcc compiler documentation and the CUDA_DEVCODE_CACHE environment variable have not led me to a solution: I set the environment variable, but nothing improves.

This should be a compile-once, run-often case, so if someone has got the code caching working correctly, I would appreciate it if that knowledge could be shared.

#include "opencv2/highgui/highgui.hpp"
#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/gpu/gpu.hpp"

#include <iostream>
#include <stdio.h>
#include <stdlib.h>  // atoi
#include <time.h>    // clock, CLOCKS_PER_SEC

using namespace std;
using namespace cv;

/// Global Variables
Mat img; 
Mat templ; 
Mat result;

int match_method;

/** @function main
 *  Stripped-down version, without GUI functionality.
 *  Usage: <program> <image> <template> <match_method>
 */
int main( int argc, char** argv )
{
  if( argc < 4 )
  {
    printf("Usage: %s <image> <template> <match_method>\n", argv[0]);
    return 1;
  }

  /// Load image and template
  img = imread( argv[1], 1 );
  templ = imread( argv[2], 1 );

  /// The match method is the third argument (argv[2] is the template path)
  match_method = atoi(argv[3]);

  int result_cols = img.cols - templ.cols + 1;
  int result_rows = img.rows - templ.rows + 1;

  /// Mat::create takes (rows, cols, type)
  result.create( result_rows, result_cols, CV_32F );

  clock_t t0 = clock();
  try
  {
    /// Assumed stand-in: the call originally made inside this try block is
    /// not shown in the post. Any first call into the gpu module triggers
    /// the initialization being timed here.
    gpu::setDevice(0);
  }
  catch (const std::exception& e)
  {
    // no GPU, or DLL not compiled with GPU support
    printf("Exception thrown: %s\n", e.what());
    return 0;
  }

  clock_t t1 = clock();
  printf("GPU initialize: %f ms\n", (double(t1 - t0)/CLOCKS_PER_SEC*1000.0));

  gpu::GpuMat d_src, d_templ, d_dst;

  /// Upload to the device; the times printed below are cumulative since t1
  d_templ.upload(templ);
  printf("GPU load templ: %f ms\n", (double(clock() - t1)/CLOCKS_PER_SEC*1000.0));
  d_src.upload(img);
  printf("GPU load img: %f ms\n", (double(clock() - t1)/CLOCKS_PER_SEC*1000.0));
  d_dst.upload(result);
  printf("GPU load result: %f ms\n", (double(clock() - t1)/CLOCKS_PER_SEC*1000.0));

  /// Do the Matching
  clock_t t2 = clock();

  printf("GPU memory set-up: %f ms\n", (double(t2 - t1)/CLOCKS_PER_SEC*1000.0));

  gpu::matchTemplate( d_src, d_templ, d_dst, match_method );

  clock_t t3 = clock();
  printf("GPU template match: %f ms\n", (double(t3 - t2)/CLOCKS_PER_SEC*1000.0));

  /// Localizing the best match with minMaxLoc
  double minVal; double maxVal; Point minLoc; Point maxLoc;
  Point matchLoc;

  gpu::minMaxLoc( d_dst, &minVal, &maxVal, &minLoc, &maxLoc );

  clock_t t4 = clock();
  printf("GPU minMaxLoc: %f ms\n", (double(t4 - t3)/CLOCKS_PER_SEC*1000.0));
  /// For SQDIFF and SQDIFF_NORMED, the best matches are lower values. For all the other methods, the ...
1 answer

answered 2013-07-03 02:45:09 -0500

Vladislav Vinogradov

Long initialization time is a known issue. The gpu module is aimed at long-running processing (e.g. video processing), where initialization time is not important. The best approach is to reorganize your application so that it processes several inputs in one run (e.g. several templates). To reduce initialization time, you can also build the OpenCV gpu module only for your card's compute capability: set the CMake variables CUDA_ARCH_BIN="1.3" and CUDA_ARCH_PTX="".
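One way to see how much of the cost is one-off is to time the first call separately from later calls. The sketch below is a generic harness using std::chrono (C++11), not OpenCV code: `WarmupTiming`, `elapsedMs` and `timeWithWarmup` are hypothetical names, and the harness only gives meaningful numbers if the timed call blocks until the work has finished.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical result type: time of the first call versus the mean of later calls.
struct WarmupTiming {
    double firstMs;          // first call, including any one-off initialization
    double steadyMs;         // mean wall-clock time of the later calls
    std::size_t steadyRuns;  // number of later calls that were averaged
};

// Wall-clock time of a single call to fn, in milliseconds.
template <typename Fn>
double elapsedMs(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Calls fn once as a warm-up, then steadyRuns more times, timing both phases.
template <typename Fn>
WarmupTiming timeWithWarmup(Fn&& fn, std::size_t steadyRuns) {
    WarmupTiming t;
    t.firstMs = elapsedMs(fn);
    t.steadyRuns = steadyRuns;
    double total = 0.0;
    for (std::size_t i = 0; i < steadyRuns; ++i) {
        total += elapsedMs(fn);
    }
    t.steadyMs = steadyRuns > 0 ? total / steadyRuns : 0.0;
    return t;
}
```

In the program from the question, the callable could wrap the match itself, for example `timeWithWarmup([&]{ gpu::matchTemplate(d_src, d_templ, d_dst, match_method); }, 8)`. If `firstMs` is far larger than `steadyMs`, the cost is being paid once per process, which is the case where running several templates per process helps.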


Thanks for this clarification. I understand the multiple-run requirement, which will indeed apply to this matching set-up as well (i.e. normally I will run at least 9 to 49 templates against the same image).

Your suggestion does not lead to any significant improvement, however. Initialization time stays around 26 seconds, even if I use smaller input images. I still do not understand whether this is specific to the OpenCV gpu implementation. I am running a pure-CUDA phase correlation based on cuFFT, which gives results similar to matchTemplate, but does not seem to have an excessive initialization penalty. It runs in 1710 ms total (330 ms for the GPU part) with the same images as before.
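To put the 26 second figure against the 9 to 49 templates mentioned above, the one-off cost can be spread over the runs in a single process. This is arithmetic only, under the assumption that initialization is paid once per process; `amortizedInitMs` is a hypothetical helper name.

```cpp
// Hypothetical helper: share of a one-off start-up cost carried by each run.
// Assumes the cost is paid once per process, however many runs follow.
double amortizedInitMs(double initMs, int runs) {
    return runs > 0 ? initMs / runs : initMs;
}
```

With 26000 ms of initialization, `amortizedInitMs(26000.0, 9)` is about 2889 ms per template and `amortizedInitMs(26000.0, 49)` is about 531 ms per template, so even at 49 templates the per-template share of the start-up cost exceeds the 330 ms GPU part reported for the cuFFT version.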

ggl ( 2013-07-03 06:50:19 -0500 )

I just ran the code on an HP Elitebook with a Quadro 3000M running CUDA 5.0. Initialization time is now only 2040 ms. It's probably driver related as well.

ggl ( 2013-07-05 15:45:23 -0500 )


Last updated: Jul 03 '13