Ask Your Question
1

why are the opencv models so small /efficient

asked 2018-07-05 08:25:13 -0600

holger gravatar image

Hello,

I noticed the models used by open cv performa pretty well - even without gpu support.

https://raw.githubusercontent.com/ope...' https://raw.githubusercontent.com/pya...

  • Are these model in any kind optimized for open cv (i doubt it)?
  • How did you found these model and how did you made sure they perform "fast"?
  • Does open cv have a "model zoo"?

Thank you + Greetings, Holger

edit retag flag offensive close merge delete

Comments

1

"does it have a "model zoo" ? -- yes, look here

berak gravatar imageberak ( 2018-07-05 08:43:33 -0600 )edit

wow - thats a lot of models - thank you!

holger gravatar imageholger ( 2018-07-05 08:49:16 -0600 )edit

1 answer

Sort by ยป oldest newest most voted
3

answered 2018-07-07 18:33:58 -0600

Eduardo gravatar image

updated 2018-07-07 18:38:13 -0600

Are these model in any kind optimized for open cv (i doubt it)?

You don't optimize a network for OpenCV but you can design a network for speed. See for instance MobileNets (MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications) architecture.

The first one has been trained with (in my opinion) speed performance in mind: how_to_train_face_detector.txt . fp16 should mean that weights are stored in 16-bits floating points (Half-precision floating-point format), so less memory usage / better memory transfer.

This is a brief description of training process which has been used to get res10_300x300_ssd_iter_140000.caffemodel. The model was created with SSD framework using ResNet-10 like architecture as a backbone. Channels count in ResNet-10 convolution layers was significantly dropped (2x- or 4x- fewer channels). The model was trained in Caffe framework on some huge and available online dataset.

The second model openface.nn4.small2.v1.t7 is a reduced network (3733968 vs 6959088 parameters for nn4) to try to achieve good accuracy with less parameters and so better computation time.


More information from my different research.

Convolution, deep learning and performance

Convolution is a very expensive operation, as you can see in the following image (Learning Semantic Image Representations at a Large Scale, Yangqing Jia):

image description

To perform efficient convolution, matrix-matrix multiplication (GEMM or General matrix-matrix multiplication) is done instead (Why GEMM is at the heart of deep learning). Data must be re-ordered in a way suitable to perform convolution as a GEMM operation. This is known as im2col operation.

BLAS (Basic Linear Algebra Subprograms) libraries are crucial to have good performance on CPU. You can see for instance the prerequisite dependencies of the Caffe framework:

Caffe has several dependencies:

    CUDA is required for GPU mode.
        library version 7+ and the latest driver version are recommended, but 6.* is fine too
        5.5, and 5.0 are compatible but considered legacy
    BLAS via ATLAS, MKL, or OpenBLAS.
    Boost >= 1.55
    protobuf, glog, gflags, hdf5

Optional dependencies:

    OpenCV >= 2.4 including 3.0
    IO libraries: lmdb, leveldb (note: leveldb requires snappy)
    cuDNN for GPU acceleration (v6)

To achieve better performance, expensive deep learning operations are accelerated using dedicated libraries:

In OpenCV, convolution layer has been carefully optimized using multithreading, vectorized instruction, cache friendly optimizations. But still, better performance is achieved when Intel Inference Engine is installed.

FP16, INT8 precision

Half or reduced precision is also a mean to achieve better performance with better memory transfer, memory usage and higher operations per second. Higher theoretical FLOPS of FP16 vs FP32 is subject to the hardware used since if there is no dedicated FP16 CPU "unit", the operation should be done in FP32 precision. The following image shows FP16 hardware (Half Precision Benchmarking for HPC):

image description

See also Mixed-Precision Training of Deep Neural Networks:

Mixed-precision training lowers the required resources by using lower-precision arithmetic, which has the following benefits.

    Decrease the required amount of memory. Half-precision floating point format (FP16) uses 16 bits, compared to 32 ...
(more)
edit flag offensive delete link more

Comments

Thank you a lot - i will need to read in detal when i am back. But your answer goes exactly in the direction what i had in mind - some general tips on network / model tuning. The face detector i trained performed bad in terms of accuracity / performance.

holger gravatar imageholger ( 2018-07-08 06:45:46 -0600 )edit

I was always wondering about the gemm / blas discussions when i was searching for performance optimization.

I recognize the work you put into this post. This was really helpful. Thank you!

holger gravatar imageholger ( 2018-07-10 01:43:04 -0600 )edit

Question Tools

1 follower

Stats

Asked: 2018-07-05 08:25:13 -0600

Seen: 923 times

Last updated: Jul 07 '18