
Using HOG features to update a bag-of-words background model is too computationally intensive

I am trying to build a visual bag-of-words (BoW) model using HOG descriptors, as part of a patch-level background model for foreground/background selection in cluttered video. I am following this paper here. The paper describes 32x32 patches but is vague about the rest of the HOG parameters (see below).

I am using the OpenCV 3.2 binaries on Windows with Python 3.5.

I first create a BOW trainer with a dictionary size of 128:

self.BOW=cv2.BOWKMeansTrainer(128)

I'll be using HOG descriptors with the following parameters (but see below).

#HOG descriptor
winSize = (32,32)
blockSize = (16,16)
blockStride = (8,8)
cellSize = (8,8)
nbins = 9
derivAperture = 1
winSigma = 4.
histogramNormType = 0
L2HysThreshold = 2.0000000000000001e-01
gammaCorrection = 0
nlevels = 32
self.calc_HOG = cv2.HOGDescriptor(winSize,blockSize,blockStride,cellSize,nbins,derivAperture,winSigma,
                        histogramNormType,L2HysThreshold,gammaCorrection,nlevels)
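As a sanity check on these parameters, the length of the HOG vector for one 32x32 window follows directly from the block and cell layout. The helper below is a hypothetical illustration (not an OpenCV function); its result should agree with `self.calc_HOG.getDescriptorSize()`.

```python
def hog_descriptor_length(win_size, block_size, block_stride, cell_size, nbins):
    """Length of the HOG vector for ONE detection window (hypothetical helper)."""
    # Number of block positions that fit inside the window along x and y
    blocks_x = (win_size[0] - block_size[0]) // block_stride[0] + 1
    blocks_y = (win_size[1] - block_size[1]) // block_stride[1] + 1
    # Each block is a grid of cells, and each cell contributes nbins values
    cells_per_block = (block_size[0] // cell_size[0]) * (block_size[1] // cell_size[1])
    return blocks_x * blocks_y * cells_per_block * nbins

# Parameters used above: 3x3 block positions, 4 cells per block, 9 bins
print(hog_descriptor_length((32, 32), (16, 16), (8, 8), (8, 8), 9))  # 324
```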

I am using MOG background subtraction to generate my initial background proposal, which I then feed into the BOW model for patch-level features.

Each frame has the following shape:

self.bg_image.shape
(509, 905, 3)
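Calling `compute()` on a whole frame does not give one descriptor; it slides the 32x32 window over the image and concatenates one 324-value vector per window position. The sketch below counts those positions. It assumes that when no `winStride` is passed, OpenCV falls back to the cell size (8x8 here) and uses no padding, which is my reading of the implementation rather than something stated in the question.

```python
def hog_window_count(image_hw, win_size, win_stride):
    """Number of window positions compute() visits on an unpadded image (hypothetical helper)."""
    height, width = image_hw
    windows_x = (width - win_size[0]) // win_stride[0] + 1
    windows_y = (height - win_size[1]) // win_stride[1] + 1
    return windows_x * windows_y

# Assumption: default winStride == cellSize == (8, 8), zero padding
n_windows = hog_window_count((509, 905), (32, 32), (8, 8))
floats_per_frame = n_windows * 324  # 324 values per window for the parameters above
print(n_windows, floats_per_frame)  # 6600 2138400
```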

So for each frame that MOG has classified as background, I add its HOG features to the BOW model:

self.BOW.add(self.calc_HOG.compute(self.bg_image))
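`BOWKMeansTrainer.add()` treats every row of the array it receives as one descriptor. If `compute()` returns all window vectors as a single N x 1 column (as I believe it does in the OpenCV 3.x Python bindings), k-means ends up clustering N one-dimensional points instead of N/324 points of dimension 324. A minimal sketch of reshaping to one row per window before adding, using a hypothetical helper name and synthetic data:

```python
import numpy as np

def to_descriptor_rows(hog_output, descriptor_size):
    """Reshape the flat HOG output into one row per window (hypothetical helper).

    k-means in OpenCV expects float32 input, so the result is cast to float32.
    """
    flat = np.asarray(hog_output, dtype=np.float32).ravel()
    if flat.size % descriptor_size != 0:
        raise ValueError("HOG output length is not a multiple of the descriptor size")
    return flat.reshape(-1, descriptor_size)

# Synthetic stand-in for compute() output: 6600 windows x 324 values in one column
fake_output = np.zeros((6600 * 324, 1), dtype=np.float32)
rows = to_descriptor_rows(fake_output, 324)
print(rows.shape)  # (6600, 324)
```

In the code above this would become something like `self.BOW.add(to_descriptor_rows(self.calc_HOG.compute(self.bg_image), self.calc_HOG.getDescriptorSize()))`. Even then, 30 frames give about 198,000 rows of 324 values, so subsampling before `cluster()` is probably still needed.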

and I'll go along like that until a potential foreground object needs to be checked.

When a frame potentially contains foreground, I cluster the BOW descriptors:

self.background_vocab=self.BOW.cluster()

generate an extractor:

self.extract_bow = cv2.BOWImgDescriptorExtractor(self.calc_HOG, self.matcher)

and set its vocabulary:

self.extract_bow.setVocabulary(self.background_vocab)

Then I compare the BOW histogram of the current crop with that of the corresponding region of the background image, using the HOG-clustered visual words:

  current_BOWhist=self.extract_bow.compute(current)

  print("Extract background HOG feature")
  background_BOWhist=self.extract_bow.compute(background)

  print("Compare Background and Foreground Masks")
  BOW_dist =cv2.compareHist(current_BOWhist, background_BOWhist, method=cv2.HISTCMP_CHISQR)
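`BOWImgDescriptorExtractor` is built around a Feature2D-style descriptor extractor, and its `compute()` takes keypoints; `HOGDescriptor` is a separate class, so I am not certain it plugs in as written. If it does not, the histogram step can be done directly in NumPy: assign each HOG row to its nearest visual word and count. This is a sketch with a hypothetical helper, assuming the crops' HOG output has already been reshaped to one row per window:

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Nearest-word assignment followed by a normalized word-count histogram."""
    desc = np.asarray(descriptors, dtype=np.float32)
    vocab = np.asarray(vocabulary, dtype=np.float32)
    # Squared Euclidean distances via |a|^2 - 2ab + |b|^2, giving an (N, K) matrix
    d2 = (desc ** 2).sum(axis=1)[:, None] - 2.0 * desc @ vocab.T + (vocab ** 2).sum(axis=1)[None, :]
    words = d2.argmin(axis=1)  # index of the closest visual word for each descriptor
    hist = np.bincount(words, minlength=len(vocab)).astype(np.float32)
    total = hist.sum()
    return hist / total if total > 0 else hist

vocab = np.array([[0, 0], [10, 10]], dtype=np.float32)
desc = np.array([[1, 1], [9, 9], [0, 1]], dtype=np.float32)
print(bow_histogram(desc, vocab))  # roughly [0.667 0.333]
```

The expanded distance formula avoids building an N x K x D array, which would be about 1 GB for 6600 windows, 128 words and 324 dimensions. The two float32 histograms should then be usable with `cv2.compareHist(..., cv2.HISTCMP_CHISQR)` as before.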

The memory use and performance of this often-cited strategy make it unusable for me. The BOW.cluster() method, documented here, is extremely memory intensive and basically locks up the computer. Even with just a few descriptors, it takes 10 to 20 seconds per call. This is especially perplexing because the paper above, and many similar papers, state that the background model is continuously updated: every time a frame is classified as background, its HOG features are added, which implies calling cluster() and generating a new vocabulary many times.
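One way to keep repeated re-clustering bounded is to retain descriptors from only the most recent background frames and randomly subsample them before each `cluster()` call. This is not what the paper describes; it is a hypothetical sketch, and `max_frames` and `max_samples` are illustrative parameters:

```python
from collections import deque

import numpy as np

class RollingDescriptorBuffer:
    """Hypothetical helper: keep HOG rows from the last few background frames only."""

    def __init__(self, max_frames, max_samples, seed=0):
        self.frames = deque(maxlen=max_frames)  # the oldest frame is dropped automatically
        self.max_samples = max_samples          # upper bound on rows handed to k-means
        self.rng = np.random.RandomState(seed)

    def add(self, rows):
        self.frames.append(np.asarray(rows, dtype=np.float32))

    def sample(self):
        """Stack the retained frames and randomly subsample at most max_samples rows."""
        data = np.vstack(list(self.frames))
        if len(data) <= self.max_samples:
            return data
        idx = self.rng.choice(len(data), size=self.max_samples, replace=False)
        return data[idx]

buf = RollingDescriptorBuffer(max_frames=2, max_samples=5)
for value in (0.0, 1.0, 2.0):
    buf.add(np.full((4, 3), value))  # three fake frames with 4 rows of 3 values each
sample = buf.sample()
print(sample.shape)  # (5, 3): only the last two frames remain, 8 rows subsampled to 5
```

Before each re-clustering, something like `self.BOW.clear()`, `self.BOW.add(buf.sample())`, then `self.BOW.cluster()` would keep the k-means input at a fixed size.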

In my example file, I go through 30 frames, adding HOG features:

a=self.BOW.getDescriptors()
len(a)
30

but the total number of descriptors is huge:

self.BOW.descriptorsCount()
576979200

I think this is the problem: it seems like far too many points for k-means. Is there something wrong with my HOG descriptor properties? From the paper cited above:
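A rough estimate shows why `cluster()` stalls. Each k-means iteration compares every input row with all 128 centers, and the input has to sit in memory as float32. The hypothetical helper below uses the reported count, assuming each counted row is a single float32 value (the N x 1 column case), and compares it with 30 frames of 6600 rows x 324 values, assuming one row per window:

```python
def kmeans_cost(n_rows, dims, k, bytes_per_value=4):
    """Input size in bytes and row-to-center distance evaluations per k-means iteration."""
    memory_bytes = n_rows * dims * bytes_per_value
    distances_per_iteration = n_rows * k
    return memory_bytes, distances_per_iteration

# Reported descriptorsCount(), assuming 1-D rows
mem, dist = kmeans_cost(576979200, 1, 128)
print(mem, dist)  # about 2.3 GB of input and about 7.4e10 distances per iteration

# Hypothetical alternative: 30 frames x 6600 windows, each a 324-D row
print(kmeans_cost(30 * 6600, 324, 128))
```

Note that k-means is typically run with several attempts, each of many iterations, so the per-iteration figure is multiplied accordingly.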

"The BoW dictionary size is 128. During the verification step, the MSDPs are resized according to their aspect ratios to make sure all patches contain roughly the same amount of pixels. In order to tackle with the various aspect ratios, which is inevitable during segmentation, we let the number of HOG grids along x and y coordinates to be flexible while keeping the total number to be a fixed number of 600. Similarly, an overall 25 blocks are used to encode the HOG features. The HOG dimension is 32"

What am I not understanding about how HOG features are added to a bag-of-words model? At the current scale this doesn't seem tractable.