Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc., since each imaging library expects its own channel order.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV dispatches an optimized conversion kernel (an "OpenCL kernel", when the OpenCL path is active) with the instructions for the data transformation, and then runs it. This creates brand-new (re-arranged) image data in memory, which is of course a fairly slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM. This means that the raw memory does NOT contain the new channel order; Numpy "fakes" it by creating a "view" that says "when we read this data from RAM, treat it as R=B, G=G, B=R". (Technically, it changes the ".strides" property of the array: instead of "read R, then G, then B" (stride 1, reading the color channels forwards in RAM), it now says "read B, then G, then R" (stride -1, reading the color channels backwards in RAM).)
  • The second way, which is totally fine, is to actually re-arrange the pixel data in memory too, which is a lot slower but is sometimes necessary depending on which library/API the data will be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.
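
Here's a minimal sketch of that round-trip (the array size is arbitrary; only the flags and strides matter):

import numpy as np

img = np.zeros((2160, 3840, 3), np.uint8)              # freshly allocated, contiguous
print(img.flags['C_CONTIGUOUS'], img.strides)          # True (11520, 3, 1)

flipped = img[..., ::-1]                               # view only; RAM is untouched
print(flipped.flags['C_CONTIGUOUS'], flipped.strides)  # False (11520, 3, -1)

back = flipped[..., ::-1]                              # flip the view again
print(back.flags['C_CONTIGUOUS'])                      # True: view matches RAM again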

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give them non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it (materializing your channel flip into properly ordered memory), and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too (see the sketch below).
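
As a concrete illustration of that last point (the external API here is imaginary; img.ctypes.data is real numpy):

import numpy as np

img = np.zeros((2160, 3840, 3), np.uint8)
view = img[..., ::-1]                # non-contiguous channel-flipped "view"

# A raw C API reads RAM directly, not the view, so we MUST hand it
# contiguous data:
safe = np.ascontiguousarray(view)    # copies, fixes the layout in RAM
ptr = safe.ctypes.data               # raw address of the contiguous pixels
# some_external_api(ptr, safe.nbytes)  # hypothetical call into a C library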

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It takes a little longer to do the conversion up-front, but we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues.

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per loop
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.64 msec per loop
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per loop
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per loop
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per loop
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per loop
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per loop

PS: Whenever we want contiguous data from numpy, we're using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API, which does the same copy when the data is non-contiguous (and returns the array untouched when it's already contiguous), but it requires much more typing. ;-)

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".
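
If you'd rather benchmark from inside a script, a rough equivalent using Python's built-in timeit module (same setup string; the loop count here is chosen arbitrarily) looks like this:

import timeit

setup = ("import numpy as np; import cv2; "
         "x = np.zeros([2160,3840,3], np.uint8); "
         "x[:,:,2] = 255; x[:,:,1] = 100")
seconds = timeit.timeit("x = np.flip(x, axis=2).copy()", setup=setup, number=10)
print(seconds / 10 * 1000, "msec per loop")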

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.64 milliseconds, which is 23798 times slower!!". And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses "Mat").
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the data type in the Numpy array is legal or not.
  • Next, it retrieves the "strides" information from the Numpy array, i.e. the small set of numbers (such as "-1") that determines how the array's memory is read (backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, and we can't solve it with a simple cast ("needs cast" is false), it does this horrible thing: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L371-L374
  • As you can see, it calls PyArray_GETCONTIGUOUS() which generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory, and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object. That's an incredibly fast operation, because it is just a pointer which says "use the data owned by Numpy at RAM address XYZ": https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give it to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Proof of Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use "imshow" here, but ANY API call will be doing the same "Python to OpenCV" conversions of the data, so the exact API doesn't matter.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = "benchmark"  # cv2.namedWindow returns None, so keep the window NAME for imshow
cv2.namedWindow(wnd, cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguous (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the time it takes is pretty much the same as when you call img.copy() inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per loop").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!
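
If you can't control where your arrays come from, a tiny guard like this (a hypothetical helper, not an OpenCV API) lets you pay the copy cost once, explicitly, instead of once per OpenCV call:

import numpy as np

def as_opencv_safe(img):
    # If this is a non-contiguous view, materialize it into fresh,
    # properly ordered RAM once, up-front. Contiguous arrays pass
    # through untouched (no copy).
    if not img.flags['C_CONTIGUOUS']:
        img = np.ascontiguousarray(img)
    return img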

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They are SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL (on this 4K image), since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.64ms as seen in the benchmarks earlier), using OpenCV's optimized conversion code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some library to capture an RGB screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1], and you're thinking "Wow, my code is so fast! That operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot in various ways. Well, since you're causing one internal copy-conversion PER CALL, you're now causing 5 * 39.45 = 197.25 ms of conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.
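
For comparison, here's a sketch of the sane version of that scenario (the zeros array stands in for whatever screenshot library you use):

import cv2
import numpy as np

rgb = np.zeros((2160, 3840, 3), np.uint8)   # stand-in for your RGB screenshot

# One explicit, contiguous conversion up-front (~5.64 ms on this 4K image):
bgr = cv2.cvtColor(rgb, cv2.COLOR_RGB2BGR)

# Every OpenCV call below now uses the data as-is, with no hidden copies:
gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)
blurred = cv2.GaussianBlur(bgr, (5, 5), 0)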

Does it still sound "slow" to just do a single, one-time 5.64ms conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is sometimes necessary depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs _all need_ all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer _internally_ internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It takes a little longer to do the conversion up-front, but we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should _pretty pretty much always_ always have contiguous data, otherwise you'll create huge performance issues.

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per loop
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.64 msec per loop
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per loop
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per loop
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per loop
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per loop
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per loop

PS: Whenever we want contiguous data from numpy, we're using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also an np.ascontiguousarray(x) API; on these non-contiguous views it does the exact same thing (a full copy), though unlike .copy() it skips the copy entirely when the data is already contiguous, and it requires much more typing. ;-)
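A quick way to see that difference (small sketch):

import numpy as np

img = np.zeros((2160, 3840, 3), np.uint8)

a = img.copy()                  # ALWAYS copies, even though img is already contiguous
b = np.ascontiguousarray(img)   # already contiguous: returned as-is, no copy
print(b is img)                 # True

view = img[..., ::-1]
c = np.ascontiguousarray(view)  # non-contiguous input: now it copies, like .copy()
print(c.flags['C_CONTIGUOUS'])  # True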

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.64 milliseconds, which is 23798 times slower!!". And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses "Mat").
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the data type in the Numpy array is legal or not.
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, and we can't solve it with a simple cast ("needs cast" is false), it does this horrible thing (sketched in Python after this list): https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L371-L374
  • As you can see, it calls PyArray_GETCONTIGUOUS() which generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory, and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object. That's an incredibly fast operation because it is just a pointer which says Use the data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L415-L416
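If it helps, here is a rough Python-level analogue of that decision logic (my own sketch; the real code is the C++ linked above):

import numpy as np

def pyopencv_to_analogue(arr):
    # 1. The "legal data type" check.
    if arr.dtype not in (np.uint8, np.int8, np.uint16, np.int16,
                         np.int32, np.float32, np.float64):
        raise TypeError("unsupported dtype for cv::Mat")
    # 2. The strides analysis: non-contiguous data is marked "needs copy"...
    if not arr.flags['C_CONTIGUOUS']:
        # 3. ...and PyArray_GETCONTIGUOUS() does (the C equivalent of) this full copy:
        arr = np.ascontiguousarray(arr)
    # 4. The cheap final step: cv::Mat just points at this buffer, no copying.
    return arr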

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give it to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Proof of Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use "imshow" here, but _any_ OpenCV API call will always be doing the same "Python to OpenCV" conversions of the data, so the exact API doesn't matter.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = ""  # cv2.namedWindow returns None, so keep the window name string ourselves
cv2.namedWindow(wnd, cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguous (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the time it takes is pretty much the same as when you call img.copy() inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per loop").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!
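If you receive arrays from arbitrary sources and can't control how they were made, one defensive option is a tiny helper (my own naming, not an OpenCV API) that pays the copy exactly once, up-front:

import numpy as np

def ensure_contiguous(img):
    # Return img untouched if it's already contiguous; otherwise copy it once.
    return img if img.flags['C_CONTIGUOUS'] else np.ascontiguousarray(img)

frame = np.zeros((2160, 3840, 3), np.uint8)[..., ::-1]  # simulate a view from some library
frame = ensure_contiguous(frame)        # one copy, here, and never again
print(frame.flags['C_CONTIGUOUS'])      # True: later OpenCV calls take the fast path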

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They are SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL (for this 4K image), since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.64ms as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.
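In other words, the recommended pattern looks something like this (a sketch; the analysis calls are just arbitrary examples of real OpenCV APIs):

import cv2
import numpy as np

rgb = np.zeros((2160, 3840, 3), np.uint8)   # pretend this came from a capture library

# One-time ~5.64 ms conversion, producing contiguous BGR data:
bgr = cv2.cvtColor(rgb, cv2.COLOR_RGB2BGR)

# Every call below now skips the internal copy entirely:
gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)
small = cv2.resize(bgr, (1920, 1080))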

Let's end this by imagining a scenario where you're using some library to capture an RGB screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1], and you're thinking "Wow, my code is so fast! That operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot in various ways. Well, since you're causing one internal copy-conversion PER CALL, you're now causing 5 * 39.45 = 197.25 ms of conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.64ms conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)


Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.64 milliseconds, which is 23798 times slower!! And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses "Mat").cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the data type in the Numpy array is legal or not.
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L344-L357
  • Next, it if "needs copy" is true, and we can't solve it with a simple cast ("needs cast" is false), it does this horrible thing: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L371-L374
  • As you can see, it calls PyArray_GETCONTIGUOUS() which generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory, and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object. That's an incredibly fast operation because it is just a pointer which says Use the data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give it to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Proof of Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use "imshow" here, but _any_ OpenCV API call will always be doing the same "Python to OpenCV" conversions of the data, so the exact API doesn't matter.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = cv2.namedWindow("", cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguos (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the time it takes is pretty much the same as when you call img.copy() inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They are SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL, since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.64ms as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some library to capture an RGB screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1], and you're thinking "Wow, my code is so fast! That operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot in various ways. Well, since you're causing one internal copy-conversion PER CALL, you're now causing 5 * 39.45 = 197.25 ms of conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.64ms conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is sometimes necessary depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.64 msec per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call

PS: Whenever we want contiguous data from numpy, we're using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact same thing (copies) and requires much more typing. ;-)

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.64 milliseconds, which is 23798 times slower!! And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the data type number-type in the Numpy array is legal or not.
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L344-L357
  • Next, it if "needs copy" is true, and we can't solve it with a simple cast ("needs cast" is false), it does this horrible thing: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L371-L374
  • As you can see, it calls PyArray_GETCONTIGUOUS() which generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory, and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object. That's an incredibly fast operation because it is just a pointer which says Use the data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give it to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Proof of Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use "imshow" here, but _any_ OpenCV API call will always be doing the same "Python to OpenCV" conversions of the data, so the exact API doesn't matter.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = cv2.namedWindow("", cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguos (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the time it takes is pretty much the same as when you call img.copy() inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They are SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL, since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.64ms as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some library to capture an RGB screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1], and you're thinking "Wow, my code is so fast! That operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot in various ways. Well, since you're causing one internal copy-conversion PER CALL, you're now causing 5 * 39.45 = 197.25 ms of conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.64ms conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is sometimes necessary depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.64 msec per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call

PS: Whenever we want contiguous data from numpy, we're using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact same thing (copies) and requires much more typing. ;-)

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.64 milliseconds, which is 23798 times slower!! And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not.
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, and we can't solve it with a simple cast ("needs cast" is false), it does this horrible thing: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L371-L374
  • As you can see, it calls PyArray_GETCONTIGUOUS() which generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory, and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object. object, ie. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says Use the existing RAM data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give it to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Proof of Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use "imshow" here, but _any_ OpenCV API call will always be doing the same "Python to OpenCV" conversions of the data, so the exact API doesn't matter.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = cv2.namedWindow("", cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguos (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the time it takes is pretty much the same as when you call img.copy() inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They are SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL, since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.64ms as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some library to capture an RGB screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1], and you're thinking "Wow, my code is so fast! That operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot in various ways. Well, since you're causing one internal copy-conversion PER CALL, you're now causing 5 * 39.45 = 197.25 ms of conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.64ms conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is sometimes necessary depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.64 msec per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call

PS: Whenever we want contiguous data from numpy, we're using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact same thing (copies) and requires much more typing. ;-)

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.64 milliseconds, which is 23798 times slower!! And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not.
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, and we can't solve it with a simple cast ("needs cast" is false), it does this horrible thing: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L371-L374
  • As you can see, it calls PyArray_GETCONTIGUOUS() which generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory, and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, ie. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says Use the existing RAM data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give it its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Proof of Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use "imshow" here, but _any_ OpenCV API call will always be doing the same "Python to OpenCV" conversions of the data, so the exact API doesn't matter.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = cv2.namedWindow("", cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguos (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the time it takes is pretty much the same as when you call img.copy() inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They are SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL, since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.64ms as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some library to capture an RGB screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1], and you're thinking "Wow, my code is so fast! That operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot in various ways. Well, since you're causing one internal copy-conversion PER CALL, you're now causing 5 * 39.45 = 197.25 ms of conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.64ms conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is sometimes necessary depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.64 msec per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call

PS: Whenever we want contiguous data from numpy, we're using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact same thing (copies) and requires much more typing. ;-)

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.64 milliseconds, which is 23798 times slower!! And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not.
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, and we can't solve it with a simple cast ("needs cast" is false), it does this horrible thing: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L371-L374
  • As you can see, it calls PyArray_GETCONTIGUOUS() which generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory, and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, ie. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says Use the existing RAM data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Proof Demonstration of the Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use "imshow" cv2.imshow here, but _any_ any OpenCV API call will always be doing the same "Python to OpenCV" conversions of the numpy data, so the exact API doesn't matter.matter. They will all have this overhead.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = cv2.namedWindow("", cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguos (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the time it takes is pretty much the same as when you call img.copy() inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They are SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL, since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.64ms as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some library to capture an RGB screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1], and you're thinking "Wow, my code is so fast! That operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot in various ways. Well, since you're causing one internal copy-conversion PER CALL, you're now causing 5 * 39.45 = 197.25 ms of conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.64ms conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is sometimes necessary depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.64 msec per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call

PS: Whenever we want contiguous data from numpy, we're using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact same thing (copies) and requires much more typing. ;-)

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.64 milliseconds, which is 23798 times slower!! And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not.
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, and we can't solve it with a simple cast ("needs cast" is false), it does this horrible thing: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L371-L374
  • As you can see, it calls PyArray_GETCONTIGUOUS() which generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory, and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, ie. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says Use the existing RAM data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Demonstration of the Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use cv2.imshow here, but any OpenCV API call will always be doing the same "Python to OpenCV" conversions of the numpy data, so the exact API doesn't matter. They will all have this overhead.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = cv2.namedWindow("", cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguos (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the extra time it takes added to the OpenCV calls when copy-conversion is needed, is pretty much the same as when you call Numpy's own img.copy() inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They are SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL, since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.64ms as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some Python library to capture an RGB screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1], to "turn the RGB data into BGR (which OpenCV needs)", and you're thinking "Wow, my code is so fast! That RGB-to-BGR operation only took 0.000237 ms!"... And then then you're calling five different OpenCV functions to analyze that screenshot screenshot-image in various ways. Well, since you're causing one internal copy-conversion Numpy copy-conversion-to-contiguous PER CALL, you're now causing 5 * 39.45 = 197.25 ms of total conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.64ms conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is sometimes necessary almost always necessary, depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.64 msec per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call

PS: Whenever we want contiguous data from numpy, we're using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact same thing (copies) and requires much more typing. ;-)

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.64 milliseconds, which is 23798 times slower!! And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not.
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, and we can't solve it with a simple cast ("needs cast" is false), it does this horrible thing: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L371-L374
  • As you can see, it calls PyArray_GETCONTIGUOUS() which generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory, and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, ie. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says Use the existing RAM data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Demonstration of the Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use cv2.imshow here, but any OpenCV API call will always be doing the same "Python to OpenCV" conversions of the numpy data, so the exact API doesn't matter. They will all have this overhead.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = cv2.namedWindow("", cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguos (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the extra time added to the OpenCV calls when copy-conversion is needed, is pretty much the same as when you call Numpy's own img.copy() inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They are SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL, since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.64ms as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some Python library to capture an RGB screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1] to "turn the RGB data into BGR (which OpenCV needs)", and you're thinking "Wow, my code is so fast! That RGB-to-BGR operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot-image in various ways. Well, since you're causing one internal Numpy copy-conversion-to-contiguous PER CALL, you're now causing 5 * 39.45 = 197.25 ms of total conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.64ms conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is almost always necessary, depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.64 msec per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call

PS: Whenever we want contiguous data from numpy, we're using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact same thing (copies) and requires much more typing. ;-)

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.64 milliseconds, which is 23798 times slower!! And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not.
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, and we can't solve it with a simple cast ("needs cast" is false), it does this horrible thing: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L371-L374https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L367-L374
  • As you can see, it calls PyArray_Cast() (if casting was needed too) or PyArray_GETCONTIGUOUS() (if we only need to make sure the data is contiguous). Both of those functions, no matter which one is called, then generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory, and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, ie. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says Use the existing RAM data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Demonstration of the Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use cv2.imshow here, but any OpenCV API call will always be doing the same "Python to OpenCV" conversions of the numpy data, so the exact API doesn't matter. They will all have this overhead.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = "demo"  # cv2.namedWindow() returns None, so keep the window name ourselves.
cv2.namedWindow(wnd, cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguous (img1) data does not need conversion,
# so the contiguous runtime represents the "internal work" of the
# imshow function itself. We only want to measure the conversion
# time for the non-contiguous data, so we subtract the first image's
# (contiguous) runtime from the non-contiguous runtime.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the extra time added to the OpenCV calls when copy-conversion is needed is pretty much the same as when you call Numpy's own img.copy() inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They are SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL (for a 4K image), since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.64ms as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some Python library to capture an RGB screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1] to "turn the RGB data into BGR (which OpenCV needs)", and you're thinking "Wow, my code is so fast! That RGB-to-BGR operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot-image in various ways. Well, since you're causing one internal Numpy copy-conversion-to-contiguous PER CALL, you're now causing 5 * 39.45 = 197.25 ms of total conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.
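
Here is what the sane version of that screenshot scenario looks like (a sketch only: capture_screenshot() is a hypothetical stand-in for whatever RGB-producing capture library you use):

import cv2

rgb = capture_screenshot()  # hypothetical capture call returning an RGB ndarray

# One up-front ~5.64ms conversion; the result is contiguous BGR data.
bgr = cv2.cvtColor(rgb, cv2.COLOR_RGB2BGR)

# Every call below now uses that data as-is, with zero hidden copies.
gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)
blurred = cv2.GaussianBlur(bgr, (5, 5), 0)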

Does it still sound "slow" to just do a single, one-time 5.64ms conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is almost always necessary, depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.64 msec per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call

PS: Whenever we want contiguous data from numpy, we're using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact same thing (copies) and requires much more typing. ;-)

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.64 milliseconds, which is 23798 times slower!! And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not.not. (If illegal type, it marks the data as "needs copy" and "needs cast").
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, it does this horrible thing: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L367-L374
  • As you can see, it calls PyArray_Cast() (if casting was needed too) or PyArray_GETCONTIGUOUS() (if we only need to make sure the data is contiguous). Both of those functions, no matter which one is called, then generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory, and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, ie. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says Use the existing RAM data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Demonstration of the Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use cv2.imshow here, but any OpenCV API call will always be doing the same "Python to OpenCV" conversions of the numpy data, so the exact API doesn't matter. They will all have this overhead.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = cv2.namedWindow("", cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguos (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the extra time added to the OpenCV calls when copy-conversion is needed, is pretty much the same as when you call Numpy's own img.copy() inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They are SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL, since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.64ms as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some Python library to capture an RGB screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1] to "turn the RGB data into BGR (which OpenCV needs)", and you're thinking "Wow, my code is so fast! That RGB-to-BGR operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot-image in various ways. Well, since you're causing one internal Numpy copy-conversion-to-contiguous PER CALL, you're now causing 5 * 39.45 = 197.25 ms of total conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.64ms conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is almost always necessary, depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.64 msec per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call

PS: Whenever we want contiguous data from numpy, we're using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact same thing (copies) and requires much more typing. ;-)

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.64 milliseconds, which is 23798 times slower!! And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not. (If illegal type, it marks the data as "needs copy" and "needs cast").
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, it does this horrible thing: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L367-L374
  • As you can see, it calls PyArray_Cast() (if casting was needed too) or PyArray_GETCONTIGUOUS() (if we only need to make sure the data is contiguous). Both of those functions, no matter which one is called, then generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory, and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, ie. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says Use the existing RAM data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUSPyArray_GETCONTIGUOUS / PyArray_Cast to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Demonstration of the Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use cv2.imshow here, but any OpenCV API call will always be doing the same "Python to OpenCV" conversions of the numpy data, so the exact API doesn't matter. They will all have this overhead.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = cv2.namedWindow("", cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguos (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the extra time added to the OpenCV calls when copy-conversion is needed, is pretty much the same as when you call Numpy's own img.copy() inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They are SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL, since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.64ms as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some Python library to capture an RGB screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1] to "turn the RGB data into BGR (which OpenCV needs)", and you're thinking "Wow, my code is so fast! That RGB-to-BGR operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot-image in various ways. Well, since you're causing one internal Numpy copy-conversion-to-contiguous PER CALL, you're now causing 5 * 39.45 = 197.25 ms of total conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.64ms conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is almost always necessary, depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR)x[...,::-1].copy(). Speed: 5.64 37.5 msec per call
  • Always Contiguous: No. Method: x = x[:, :, [2, 1, 0]]. Speed: 15.2 msec per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy()cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 37.5 5.64 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call
  • Always Contiguous: Yes. Method: r = x[..., 0].copy(); x[..., 0] = x[..., 2]; x[..., 2] = r. Speed: 27.2 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 2]] = x[:, :, [2, 0]]. Speed: 27.3 msec per call
  • Always Contiguous: Yes. Method: x[..., [0, 2]] = x[..., [2, 0]]. Speed: 27.2 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]. Speed: 41.4 msec per call
  • Always Contiguous: Yes. Method: x[:, :] = x[:, :, [2, 1, 0]]. Speed: 59.3 msec per call

PS: Whenever we want contiguous data from numpy, we're mostly using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact exact same thing (copies) and requires much more typing. ;-);-) And in a few of the examples we're using special indexing (such as x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]) which always creates contiguous memory with correct "strides", but is extremely slow.

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.64 milliseconds, which is 23798 times slower!! And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not. (If illegal type, it marks the data as "needs copy" and "needs cast").
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, it does this horrible thing: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L367-L374
  • As you can see, it calls PyArray_Cast() (if casting was needed too) or PyArray_GETCONTIGUOUS() (if we only need to make sure the data is contiguous). Both of those functions, no matter which one is called, then generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory, and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, ie. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says Use the existing RAM data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS / PyArray_Cast to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Demonstration of the Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use cv2.imshow here, but any OpenCV API call will always be doing the same "Python to OpenCV" conversions of the numpy data, so the exact API doesn't matter. They will all have this overhead.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = cv2.namedWindow("", cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguos (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the extra time added to the OpenCV calls when copy-conversion is needed, is pretty much the same as when you call Numpy's own img.copy() inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They are SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL, since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.64ms as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some Python library to capture an RGB screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1] to "turn the RGB data into BGR (which OpenCV needs)", and you're thinking "Wow, my code is so fast! That RGB-to-BGR operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot-image in various ways. Well, since you're causing one internal Numpy copy-conversion-to-contiguous PER CALL, you're now causing 5 * 39.45 = 197.25 ms of total conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.64ms conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is almost always necessary, depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = x[:, :, [2, 1, 0]]. Speed: 15.2 msec per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.64 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call
  • Always Contiguous: Yes. Method: r = x[..., 0].copy(); x[..., 0] = x[..., 2]; x[..., 2] = r. Speed: 27.2 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 2]] = x[:, :, [2, 0]]. Speed: 27.3 msec per call
  • Always Contiguous: Yes. Method: x[..., [0, 2]] = x[..., [2, 0]]. Speed: 27.2 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]. Speed: 41.4 msec per call
  • Always Contiguous: Yes. Method: x[:, :] = x[:, :, [2, 1, 0]]. Speed: 59.3 msec per call

PS: Whenever we want contiguous data from numpy, we're mostly using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also an np.ascontiguousarray(x) API; the difference is that it only copies when the data isn't already contiguous (and it requires much more typing ;-)). And in a few of the examples we're using in-place fancy indexing (such as x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]) to overwrite the memory directly, which always keeps the memory contiguous with correct "strides", but is extremely slow.
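
You can verify that ascontiguousarray only copies when it has to (a quick check using np.shares_memory):

import numpy as np

x = np.zeros((2160, 3840, 3), np.uint8)
print(np.shares_memory(np.ascontiguousarray(x), x))  # True: already contiguous, no copy made

v = x[..., ::-1]                                     # non-contiguous view
print(np.shares_memory(np.ascontiguousarray(v), v))  # False: a real copy was made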

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".
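
If you'd rather benchmark from a script than from the shell, the timeit module gives an equivalent measurement (a sketch; adjust runs to taste):

import timeit

setup = ("import numpy as np; import cv2; "
         "x = np.zeros([2160,3840,3], np.uint8); "
         "x[:,:,2] = 255; x[:,:,1] = 100")

stmt = "x = np.flip(x, axis=2).copy()"  # paste any method from the list here

runs = 10
total = timeit.timeit(stmt, setup=setup, number=runs)
print(f"{total / runs * 1000:.3f} msec per call")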

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.64 milliseconds, which is 23798 times slower!!" And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines whether it's an integer, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not. (If illegal type, it marks the data as "needs copy" and "needs cast").
  • Next, it retrieves the "strides" information from the Numpy array. Strides are the per-axis step sizes, such as the "-1" that tells Numpy to walk an axis backwards (as in our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, it does this horrible thing: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L367-L374
  • As you can see, it calls PyArray_Cast() (if casting was needed too) or PyArray_GETCONTIGUOUS() (if we only need to make sure the data is contiguous). Whichever one is called, it generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does... (A rough Python paraphrase of this copy-or-not decision follows the list.)
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, i.e. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says "use the existing RAM data owned by Numpy at RAM address XYZ": https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L415-L416
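
To make the walkthrough concrete, here's a rough Python paraphrase of the wrapper's copy-or-not decision. This is NOT the real C++ code (the actual checks are more nuanced); it's only a sketch of the logic described above:

import numpy as np

def will_pyopencv_copy(arr: np.ndarray) -> bool:
    # Paraphrase only: unsupported dtypes force a cast (and thus a copy),
    # and non-contiguous strides force PyArray_GETCONTIGUOUS (a full copy).
    supported = (np.uint8, np.int8, np.uint16, np.int16,
                 np.int32, np.float32, np.float64)
    if arr.dtype.type not in supported:
        return True                        # "needs cast" -> copy
    return not arr.flags['C_CONTIGUOUS']   # bad strides  -> copy

img = np.zeros((2160, 3840, 3), np.uint8)
print(will_pyopencv_copy(img))             # False: handed straight to cv::Mat
print(will_pyopencv_copy(img[..., ::-1]))  # True: the wrapper copies first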

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS CAUSES a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), plus a lot of math inside PyArray_GETCONTIGUOUS / PyArray_Cast to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Demonstration of the Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use cv2.imshow here, but every OpenCV API call performs the same "Python to OpenCV" conversion of the numpy data, so the exact API doesn't matter. They all have this overhead.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

WINDOW_NAME = "benchmark"  # namedWindow() returns None, so keep the name string ourselves.
cv2.namedWindow(WINDOW_NAME, cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(WINDOW_NAME, img1)  # contiguous: no hidden copy in the wrapper

def show2():
    cv2.imshow(WINDOW_NAME, img2)  # non-contiguous view: the wrapper copies on every call

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# The contiguous image (img1) needs no conversion, so its per-call time
# is just the internal work of imshow() itself. To isolate the cost of
# converting non-contiguous data, we subtract the contiguous per-call
# time from the non-contiguous per-call time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the extra time added to an OpenCV call when copy-conversion is needed (about 39.45 ms) is pretty much the same as calling Numpy's own img.copy() inside Python itself (as seen in the earlier benchmark: "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They are SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL (for this 4K image), since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45 ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.64 ms, as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call, with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some Python library to capture an RGB screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1] to "turn the RGB data into BGR (which OpenCV needs)", and you're thinking "Wow, my code is so fast! That RGB-to-BGR operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot-image in various ways. Well, since you're causing one internal Numpy copy-conversion-to-contiguous PER CALL, you're now causing 5 * 39.45 = 197.25 ms of total conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.
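
Here's what the fixed version of that scenario looks like (a sketch: the screenshot source is faked with np.zeros, and the five analysis calls are just examples):

import numpy as np
import cv2

# Pretend this RGB frame came from your screenshot library.
rgb_frame = np.zeros((2160, 3840, 3), np.uint8)

# ONE up-front conversion: contiguous BGR data, ready for every call below.
bgr_frame = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2BGR)

# Five analysis calls, zero hidden copies:
gray = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(bgr_frame, (5, 5), 0)
edges = cv2.Canny(gray, 100, 200)
corners = cv2.goodFeaturesToTrack(gray, 100, 0.01, 10)
mean_color = cv2.mean(bgr_frame)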

Does it still sound "slow" to just do a single, one-time 5.64ms conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is almost always necessary, depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = x[:, :, [2, 1, 0]]. Speed: 15.2 msec per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.64 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call
  • Always Contiguous: Yes. Method: r = x[..., 0].copy(); x[..., 0] = x[..., 2]; x[..., 2] = r. Speed: 27.2 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 2]] = x[:, :, [2, 0]]. Speed: 27.3 msec per call
  • Always Contiguous: Yes. Method: x[..., [0, 2]] = x[..., [2, 0]]. Speed: 27.2 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]. Speed: 41.4 msec per call
  • Always Contiguous: Yes. Method: x[:, :] = x[:, :, [2, 1, 0]]. Speed: 59.3 msec per call

PS: Whenever we want contiguous data from numpy, we're mostly using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact same thing (copies) and requires much more typing. ;-) And in a few of the examples we're using special indexing (such as x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]) to overwrite the memory directly, which always creates contiguous memory with correct "strides", but is extremely slow.

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.64 milliseconds, which is 23798 times slower!! And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not. (If illegal type, it marks the data as "needs copy" and "needs cast").
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, it does this horrible thing: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L367-L374
  • As you can see, it calls PyArray_Cast() (if casting was needed too) or PyArray_GETCONTIGUOUS() (if we only need to make sure the data is contiguous). Both of those functions, no matter which one is called, then generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory, and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, ie. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says Use the existing RAM data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS / PyArray_Cast to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Demonstration of the Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use cv2.imshow here, but any OpenCV API call will always be doing the same "Python to OpenCV" conversions of the numpy data, so the exact API doesn't matter. They will all have this overhead.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = cv2.namedWindow("", cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguos (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the extra time added to the OpenCV calls when copy-conversion is needed, is pretty much the same as when you call Numpy's own img.copy() inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They are SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL, since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.64ms as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some Python library to capture an RGB screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1] to "turn the RGB data into BGR (which OpenCV needs)", and you're thinking "Wow, my code is so fast! That RGB-to-BGR operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot-image in various ways. Well, since you're causing one internal Numpy copy-conversion-to-contiguous PER CALL, you're now causing 5 * 39.45 = 197.25 ms of total conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.64ms conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is almost always necessary, depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = x[:, :, [2, 1, 0]]. Speed: 15.2 msec per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.64 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call
  • Always Contiguous: Yes. Method: r = x[..., 0].copy(); x[..., 0] = x[..., 2]; x[..., 2] = r. Speed: 27.2 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 2]] = x[:, :, [2, 0]]. Speed: 27.3 msec per call
  • Always Contiguous: Yes. Method: x[..., [0, 2]] = x[..., [2, 0]]. Speed: 27.2 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]. Speed: 41.4 msec per call
  • Always Contiguous: Yes. Method: x[:, :] = x[:, :, [2, 1, 0]]. Speed: 59.3 msec per call

PS: Whenever we want contiguous data from numpy, we're mostly using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact same thing (copies) and requires much more typing. ;-) And in a few of the examples we're using special indexing (such as x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]) to overwrite the memory directly, which always creates contiguous memory with correct "strides", but is extremely slow.

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.64 milliseconds, which is 23798 times slower!! And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not. (If illegal type, it marks the data as "needs copy" and "needs cast").
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, it does this horrible thing: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L367-L374
  • As you can see, it calls PyArray_Cast() (if casting was needed too) or PyArray_GETCONTIGUOUS() (if we only need to make sure the data is contiguous). Both of those functions, no matter which one is called, then generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory, and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, ie. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says Use the existing RAM data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS / PyArray_Cast to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Demonstration of the Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use cv2.imshow here, but any OpenCV API call will always be doing the same "Python to OpenCV" conversions of the numpy data, so the exact API doesn't matter. They will all have this overhead.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = cv2.namedWindow("", cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguos (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the extra time added to the OpenCV calls when copy-conversion is needed, needed (39.45 ms), is pretty much the same as when you call Numpy's own img.copy() inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They are SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL, since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45ms39.45 ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.64ms5.64 ms as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some Python library to capture an RGB screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1] to "turn the RGB data into BGR (which OpenCV needs)", and you're thinking "Wow, my code is so fast! That RGB-to-BGR operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot-image in various ways. Well, since you're causing one internal Numpy copy-conversion-to-contiguous PER CALL, you're now causing 5 * 39.45 = 197.25 ms of total conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.64ms5.64 ms conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is almost always necessary, depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = x[:, :, [2, 1, 0]]. Speed: 15.2 12.6 msec per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.64 5.48 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call
  • Always Contiguous: Yes. Method: r = x[..., 0].copy(); x[..., 0] = x[..., 2]; x[..., 2] = r. Speed: 27.2 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 2]] = x[:, :, [2, 0]]. Speed: 27.3 21.7 msec per call
  • Always Contiguous: Yes. Method: x[..., [0, 2]] = x[..., [2, 0]]. Speed: 27.2 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]. Speed: 41.4 33.1 msec per call
  • Always Contiguous: Yes. Method: x[:, :] = x[:, :, [2, 1, 0]]. Speed: 59.3 49.3 msec per call

PS: Whenever we want contiguous data from numpy, we're mostly using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact same thing (copies) and requires much more typing. ;-) And in a few of the examples we're using special indexing (such as x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]) to overwrite the memory directly, which always creates contiguous memory with correct "strides", but is extremely slow.

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.645.48 milliseconds, which is 23798 times slower!! And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not. (If illegal type, it marks the data as "needs copy" and "needs cast").
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, it does this horrible thing: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L367-L374
  • As you can see, it calls PyArray_Cast() (if casting was needed too) or PyArray_GETCONTIGUOUS() (if we only need to make sure the data is contiguous). Both of those functions, no matter which one is called, then generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory, and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, ie. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says Use the existing RAM data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/master/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS / PyArray_Cast to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Demonstration of the Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use cv2.imshow here, but any OpenCV API call will always be doing the same "Python to OpenCV" conversions of the numpy data, so the exact API doesn't matter. They will all have this overhead.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = cv2.namedWindow("", cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguos (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the extra time added to each OpenCV call when copy-conversion is needed (39.45 ms) is pretty much the same as calling Numpy's own img.copy() inside Python itself (as seen in the earlier benchmark: "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call"). For comparison, copying an already-contiguous 4K image with x.copy() takes only about 11.8 msec; most of those 37.5 msec are spent untangling the reversed "strides" while copying.

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!
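
You can even confirm the size of that hidden copy yourself, without any OpenCV call at all. Here's a rough sketch, under the assumption that np.ascontiguousarray() approximates the copy that PyArray_GETCONTIGUOUS() performs internally:

import timeit
import numpy as np

x = np.zeros([2160, 3840, 3], np.uint8)
view = x[..., ::-1]  # non-contiguous "flipped" view

# Roughly what the wrapper does internally for non-contiguous input:
per_call = timeit.timeit(lambda: np.ascontiguousarray(view), number=20) / 20
print("copy-to-contiguous cost: %.1f ms per call" % (per_call * 1000))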

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They are SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL, since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45 ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.48 ms, as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call, with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.
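
Here's a minimal sketch of that guarantee (the zeros array is just a stand-in for whatever RGB frame your code produces):

import cv2
import numpy as np

rgb = np.zeros([2160, 3840, 3], np.uint8)  # stand-in for an RGB frame
bgr = cv2.cvtColor(rgb, cv2.COLOR_RGB2BGR)  # one-time, real conversion

# cvtColor allocates a fresh, properly ordered buffer, so every later
# OpenCV call borrows it with zero copying:
print(bgr.flags['C_CONTIGUOUS'])  # True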

Let's end this by imagining a scenario where you're using some Python library to capture an RGB screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1] to "turn the RGB data into BGR (which OpenCV needs)", and you're thinking "Wow, my code is so fast! That RGB-to-BGR operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot-image in various ways. Well, since you're causing one internal Numpy copy-conversion-to-contiguous PER CALL, you're now causing 5 * 39.45 = 197.25 ms of total conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.
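
The fix for that scenario is simple: convert once, up-front. Here's a sketch (the zeros array stands in for your capture library's RGB frame, and the analysis calls are just arbitrary examples of "multiple OpenCV APIs"):

import cv2
import numpy as np

frame_rgb = np.zeros([1080, 1920, 3], np.uint8)  # stand-in for a screenshot
frame_bgr = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2BGR)  # pay the conversion ONCE

# Every call below borrows the same contiguous buffer for free:
blurred = cv2.GaussianBlur(frame_bgr, (5, 5), 0)
small = cv2.resize(frame_bgr, (960, 540))
channel_means = cv2.mean(frame_bgr)
gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)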

Does it still sound "slow" to just do a single, one-time 5.48 ms conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is almost always necessary, depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = x[:, :, [2, 1, 0]]. Speed: 12.6 msec per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.48 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call
  • Always Contiguous: Yes. Method: r = x[..., 0].copy(); x[..., 0] = x[..., 2]; x[..., 2] = r. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 2]] = x[:, :, [2, 0]]. Speed: 21.7 msec per call
  • Always Contiguous: Yes. Method: x[..., [0, 2]] = x[..., [2, 0]]. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]. Speed: 33.1 msec per call
  • Always Contiguous: Yes. Method: x[:, :] = x[:, :, [2, 1, 0]]. Speed: 49.3 msec per call
  • Always Contiguous: Yes. Method: foo = x.copy(). Speed: 11.8 msec per call. (This example doesn't change the RGB/BGR channel order, and is just included here as a reference, to show how slow Numpy is at doing a super simple copy of an already-contiguous chunk of RAM. As you can see, even when the data is already in the proper order, Numpy is very slow... And if "x" had been non-contiguous here, it would be even slower, as shown in the x = x[...,::-1].copy() example near the top of the list, which took 37.5 msec and demonstrates Numpy copying non-contiguous RAM (from numpy "views" marked as "read in reverse order" via "stride = -1") into contiguous RAM.)

PS: Whenever we want contiguous data from numpy, we're mostly using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact same thing (it copies too, "but only when the Numpy data isn't already contiguous") and requires much more typing. ;-) And in a few of the examples we're using special indexing (such as x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]) to overwrite the memory directly, which always creates contiguous memory with correct "strides", and is faster than telling Numpy to do a .copy(), but is still extremely slow compared to cv2.cvtColor().

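For example, the in-place swap mentioned above never breaks contiguity, which you can verify directly:

import numpy as np

x = np.zeros([2160, 3840, 3], np.uint8)
x[:, :, [0, 2]] = x[:, :, [2, 0]]   # swap the B and R channels in place
print(x.flags['C_CONTIGUOUS'])      # True: the RAM itself was rewritten, no view tricks
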
Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.48 milliseconds, which is over 23,000 times slower!!". And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not. (If the type is illegal, it marks the data as "needs copy" and "needs cast".)
  • Next, it retrieves the "strides" information from the Numpy array, i.e. those simple numbers such as "-1" that determine how to read the array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by those so-called "fast" Numpy view manipulations), it marks the data as "needs copy": https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, it does this horrible thing: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L367-L374
  • As you can see, it calls PyArray_Cast() (if casting was needed too) or PyArray_GETCONTIGUOUS() (if we only need to make the data contiguous). Whichever one is called generates a brand-new Python Numpy object, with all data COPIED by Numpy into brand-new memory and re-arranged into proper contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, i.e. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation, because it's just a pointer that says "use the existing RAM data owned by Numpy at RAM address XYZ": https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS / PyArray_Cast to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Demonstration of the Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use cv2.imshow here, but any OpenCV API call will always be doing the same "Python to OpenCV" conversions of the numpy data, so the exact API doesn't matter. They will all have this overhead.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = ""  # window name (cv2.namedWindow returns None, so keep the name string ourselves)
cv2.namedWindow(wnd, cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguous (img1) data does not need conversion,
# so the contiguous runtime is just the internal work of the imshow
# function itself. We only want to measure the conversion time for
# non-contiguous data, so we subtract the contiguous (img1) per-call
# time from the non-contiguous (img2) per-call time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the extra time added to the OpenCV calls when copy-conversion is needed (39.45 ms), is pretty much the same as when you call Numpy's own img.copy() on a "flipped view" inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

Numpy "tricks" will cause subtle Bugs too!

Using those Numpy "tricks" isn't just extremely slow. It will cause very subtle bugs in your code, too.

Look at this code and see if you can figure out the bug yourself before you run this example:

import cv2
import numpy as np

img1 = np.zeros([200,200,3], np.uint8) # Create a 200x200 image. (Is Contiguous)
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data. (A Non-Contiguous View)

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

cv2.rectangle(img2, (80,80), (120,120), (255,255,255), 2)

cv2.imshow("", img2)
cv2.waitKey(0)  # without this, the window never gets a chance to render

What do you think the result will be when running this program? Logically, you'd expect to see a black image with a white rectangle in the middle... But instead, you see nothing except a black image. Why?

Well, it's simple... think about what was explained earlier about how PyOpenCV converts every incoming numpy.ndarray object into an internal C++ cv::Mat object. In this example, we're giving a non-contiguous ndarray as an argument to cv2.rectangle(), which causes PyOpenCV to "fix" the data by making a temporary, internal, contiguous .copy() of the image data, and then it wraps the copy's memory address in a cv::Mat. Next, it passes that cv::Mat object to the internal C++ "draw rectangle" function, which dutifully draws a rectangle onto the memory pointed to by the cv::Mat object... which is... the memory of the temporary internal copy of your input array, since a copy had to be created...

So, OpenCV happily writes a rectangle to the temporary object copy. And then when execution returns to Python, you're of course seeing NO RECTANGLE, since nothing was drawn to your actual ndarray data in RAM (since its memory storage was non-contiguous and therefore not usable as-is by OpenCV).

If you want to see what the code above should be doing, simply add img2 = img2.copy() immediately above the cv2.rectangle call, to cause the img2 ndarray object to become contiguous memory so that OpenCV won't need to make a copy of it (and will be able to use that exact object's memory internally, as intended)... After that tweak, you'll see OpenCV properly drawing the rectangle to the image...
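
In other words, the fixed version of the example might look like this (same drawing call, but on real contiguous memory):

import cv2
import numpy as np

img1 = np.zeros([200,200,3], np.uint8)  # contiguous
img2 = img1[...,::-1].copy()            # materialize the flipped view into contiguous RAM

cv2.rectangle(img2, (80,80), (120,120), (255,255,255), 2)  # now draws onto img2's own RAM

cv2.imshow("", img2)
cv2.waitKey(0)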

This is the kind of subtle bug that is very easy to cause when you're playing around with faked Numpy "views" rather than real contiguous memory.

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They lead to SUBTLE BUGS and they are EXTREMELY SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL (for a 4K image, as benchmarked above), since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45 ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.48 ms as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some Python library to capture an RGB screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1] to "turn the RGB data into BGR (which OpenCV needs)", and you're thinking "Wow, my code is so fast! That RGB-to-BGR operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot-image in various ways. Well, since you're causing one internal Numpy copy-conversion-to-contiguous PER CALL, you're now causing 5 * 39.45 = 197.25 ms of total conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.
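
As a sketch of that scenario (capture_screenshot_rgb is a hypothetical stand-in for whatever capture library you're using):

import cv2
import numpy as np

# img = capture_screenshot_rgb()           # hypothetical RGB capture call
img = np.zeros([2160, 3840, 3], np.uint8)  # stand-in 4K RGB frame

# The "fast" way: ~0.000237 ms now, but ~39 ms of hidden copying inside
# EVERY OpenCV call that later receives this non-contiguous array.
bgr_view = img[..., ::-1]

# The sane way: ~5.48 ms once, then every OpenCV call uses the RAM as-is.
bgr_real = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)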

Does it still sound "slow" to just do a single, one-time 5.48 ms conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is almost always necessary, depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = x[:, :, [2, 1, 0]]. Speed: 12.6 msec per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.48 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call
  • Always Contiguous: Yes. Method: r = x[..., 0].copy(); x[..., 0] = x[..., 2]; x[..., 2] = r. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 2]] = x[:, :, [2, 0]]. Speed: 21.7 msec per call
  • Always Contiguous: Yes. Method: x[..., [0, 2]] = x[..., [2, 0]]. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]. Speed: 33.1 msec per call
  • Always Contiguous: Yes. Method: x[:, :] = x[:, :, [2, 1, 0]]. Speed: 49.3 msec per call
  • Always Contiguous: Yes. Method: foo = x.copy(). Speed: 11.8 msec per call (This example doesn't change the RGB/BGR channel order, and is just included here as a reference, to show how slow Numpy is at doing a super simple copy of an already-contiguous chunk of RAM. As you can see, even when the data is already in the proper order, Numpy is very slow... And if "x" had been non-contiguous here, it would be even slower (as shown in the x = x[...,::-1].copy() example near the top of the list, which took 37.5 msec and demonstrates Numpy copying non-contiguous (negative "stride") RAM (from numpy "views" marked as "read in reverse order" via "stride = -1") into contiguous RAM...).

PS: Whenever we want contiguous data from numpy, we're mostly using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact same thing (it copies too, "but only when the Numpy data isn't already contiguous") and requires much more typing. ;-) And in a few of the examples we're using special indexing (such as x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]) to overwrite the memory directly, which always creates contiguous memory with correct "strides", and is faster than telling Numpy to do a .copy(), but is still extremely slow compared to cv2.cvtColor().

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.48 milliseconds, which is 23798 times slower!! And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not. (If illegal type, it marks the data as "needs copy" and "needs cast").
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, it does this horrible thing: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L367-L374
  • As you can see, it calls PyArray_Cast() (if casting was needed too) or PyArray_GETCONTIGUOUS() (if we only need to make sure the data is contiguous). Both of those functions, no matter which one is called, then generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory, and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, ie. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says Use the existing RAM data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS / PyArray_Cast to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Demonstration of the Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use cv2.imshow here, but any OpenCV API call will always be doing the same "Python to OpenCV" conversions of the numpy data, so the exact API doesn't matter. They will all have this overhead.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = cv2.namedWindow("", cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguos (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the extra time added to the OpenCV calls when copy-conversion is needed (39.45 ms), is pretty much the same as when you call Numpy's own img.copy() on a "flipped view" inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

Numpy "tricks" will cause subtle Bugs too!

Using those Numpy "tricks" isn't just extremely slow. It will cause very subtle bugs in your code, too.

Look at this code and see if you can figure out the bug yourself before you run this example:

import cv2
import numpy as np

img1 = np.zeros([200,200,3], np.uint8) # Create a 200x200 image. (Is Contiguous)
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data. (A Non-Contiguous View)

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

cv2.rectangle(img2, (80,80), (120,120), (255,255,255), 2)

cv2.imshow("", img2)

What do you think the result will be when running this program? Logically, you expect to see black image with a white rectangle in the middle... But instead, you see nothing except a black image. Why?

Well, it's simple... think about what was explained earlier about how PyOpenCV converts every incoming numpy.ndarray object into an internal C++ cv::Mat object. In this example, we're giving a non-contiguous ndarray as an argument to cv2.rectangle(), which causes PyOpenCV to "fix" the data by making a temporary, internal, contiguous .copy() of the image data, and then it wraps the copy's memory address in a cv::Mat. Next, it passes that cv::Mat object to the internal C++ "draw rectangle" function, which dutifully draws a rectangle onto the memory pointed to by the cv::Mat object... which is... the memory of the temporary internal copy of your input array, since a copy had to be created...

So, OpenCV happily writes a rectangle to the temporary object copy. And then when execution returns to Python, you're of course seeing NO RECTANGLE, since nothing was drawn to your actual ndarray data in RAM (since its memory storage was non-contiguous and therefore not usable as-is by OpenCV).

If you want to see what the code above should be doing, simply add img2 = img2.copy() immediately above the cv2.rectangle call, to cause the img2 ndarray object to become contiguous memory so that OpenCV won't need to make a copy of it (and will be able to use that exact object's memory internally, as intended)... After that tweak, you'll see OpenCV properly drawing the rectangle to the image...

This is the kind of subtle bug that is very easy to cause when you're playing around with faked Numpy "views" rather than real contiguous memory.

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They lead to SUBTLE BUGS and they are EXTREMELY SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL, since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45 ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.48 ms as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some Python library to capture an RGB screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1] to "turn the RGB data into BGR (which OpenCV needs)", and you're thinking "Wow, my code is so fast! That RGB-to-BGR operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot-image in various ways. Well, since you're causing one internal Numpy copy-conversion-to-contiguous PER CALL, you're now causing 5 * 39.45 = 197.25 ms of total conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.48 ms conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is almost always necessary, depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = x[:, :, [2, 1, 0]]. Speed: 12.6 msec per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.48 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call
  • Always Contiguous: Yes. Method: r = x[..., 0].copy(); x[..., 0] = x[..., 2]; x[..., 2] = r. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 2]] = x[:, :, [2, 0]]. Speed: 21.7 msec per call
  • Always Contiguous: Yes. Method: x[..., [0, 2]] = x[..., [2, 0]]. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]. Speed: 33.1 msec per call
  • Always Contiguous: Yes. Method: x[:, :] = x[:, :, [2, 1, 0]]. Speed: 49.3 msec per call
  • Always Contiguous: Yes. Method: foo = x.copy(). Speed: 11.8 msec per call (This example doesn't change the RGB/BGR channel order, and is just included here as a reference, to show how slow Numpy is at doing a super simple copy of an already-contiguous chunk of RAM. As you can see, even when the data is already in the proper order, Numpy is very slow... And if "x" had been non-contiguous here, it would be even slower (as slower, as shown in the x = x[...,::-1].copy() (equivalent to saying bar = x[...,::-1]; foo = bar.copy()) example near the top of the list, which took 37.5 msec and demonstrates Numpy copying non-contiguous RAM (from numpy "views" marked as "read in reverse order" via "stride = -1") into contiguous RAM...).RAM...

PS: Whenever we want contiguous data from numpy, we're mostly using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact same thing (it copies too, "but only when the Numpy data isn't already contiguous") and requires much more typing. ;-) And in a few of the examples we're using special indexing (such as x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]) to overwrite the memory directly, which always creates contiguous memory with correct "strides", and is faster than telling Numpy to do a .copy(), but is still extremely slow compared to cv2.cvtColor().

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.48 milliseconds, which is 23798 times slower!! And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not. (If illegal type, it marks the data as "needs copy" and "needs cast").
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, it does this horrible thing: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L367-L374
  • As you can see, it calls PyArray_Cast() (if casting was needed too) or PyArray_GETCONTIGUOUS() (if we only need to make sure the data is contiguous). Both of those functions, no matter which one is called, then generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory, and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, ie. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says Use the existing RAM data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS / PyArray_Cast to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Demonstration of the Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use cv2.imshow here, but any OpenCV API call will always be doing the same "Python to OpenCV" conversions of the numpy data, so the exact API doesn't matter. They will all have this overhead.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = cv2.namedWindow("", cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguos (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the extra time added to the OpenCV calls when copy-conversion is needed (39.45 ms), is pretty much the same as when you call Numpy's own img.copy() on a "flipped view" inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

Numpy "tricks" will cause subtle Bugs too!

Using those Numpy "tricks" isn't just extremely slow. It will cause very subtle bugs in your code, too.

Look at this code and see if you can figure out the bug yourself before you run this example:

import cv2
import numpy as np

img1 = np.zeros([200,200,3], np.uint8) # Create a 200x200 image. (Is Contiguous)
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data. (A Non-Contiguous View)

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

cv2.rectangle(img2, (80,80), (120,120), (255,255,255), 2)

cv2.imshow("", img2)

What do you think the result will be when running this program? Logically, you expect to see black image with a white rectangle in the middle... But instead, you see nothing except a black image. Why?

Well, it's simple... think about what was explained earlier about how PyOpenCV converts every incoming numpy.ndarray object into an internal C++ cv::Mat object. In this example, we're giving a non-contiguous ndarray as an argument to cv2.rectangle(), which causes PyOpenCV to "fix" the data by making a temporary, internal, contiguous .copy() of the image data, and then it wraps the copy's memory address in a cv::Mat. Next, it passes that cv::Mat object to the internal C++ "draw rectangle" function, which dutifully draws a rectangle onto the memory pointed to by the cv::Mat object... which is... the memory of the temporary internal copy of your input array, since a copy had to be created...

So, OpenCV happily writes a rectangle to the temporary object copy. And then when execution returns to Python, you're of course seeing NO RECTANGLE, since nothing was drawn to your actual ndarray data in RAM (since its memory storage was non-contiguous and therefore not usable as-is by OpenCV).

If you want to see what the code above should be doing, simply add img2 = img2.copy() immediately above the cv2.rectangle call, to cause the img2 ndarray object to become contiguous memory so that OpenCV won't need to make a copy of it (and will be able to use that exact object's memory internally, as intended)... After that tweak, you'll see OpenCV properly drawing the rectangle to the image...

This is the kind of subtle bug that is very easy to cause when you're playing around with faked Numpy "views" rather than real contiguous memory.

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They lead to SUBTLE BUGS and they are EXTREMELY SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL, since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45 ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.48 ms as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some Python library to capture an RGB screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1] to "turn the RGB data into BGR (which OpenCV needs)", and you're thinking "Wow, my code is so fast! That RGB-to-BGR operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot-image in various ways. Well, since you're causing one internal Numpy copy-conversion-to-contiguous PER CALL, you're now causing 5 * 39.45 = 197.25 ms of total conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.48 ms conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is almost always necessary, depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give them non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the badly laid-out image you gave it (materializing whatever channel order your "view" described into real memory), and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call", in which case YES, the RAM data MUST be contiguous too (see the sketch right after this list).
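
For instance, here's a hypothetical sketch of what such a raw-buffer hand-off looks like. Anything that reads bytes straight from the array's memory address trusts the layout blindly:

import numpy as np

img = np.zeros((1080, 1920, 3), np.uint8)
addr = img.ctypes.data  # raw RAM address of the pixel buffer
# A C/COM API told to read height*width*3 bytes starting at addr assumes
# contiguous pixel triplets. If img were a strided "view", such an API would
# silently read the bytes in the wrong order -- no error, just wrong colors.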

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = x[:, :, [2, 1, 0]]. Speed: 12.6 msec per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.48 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call
  • Always Contiguous: Yes. Method: r = x[..., 0].copy(); x[..., 0] = x[..., 2]; x[..., 2] = r. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 2]] = x[:, :, [2, 0]]. Speed: 21.7 msec per call
  • Always Contiguous: Yes. Method: x[..., [0, 2]] = x[..., [2, 0]]. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]. Speed: 33.1 msec per call
  • Always Contiguous: Yes. Method: x[:, :] = x[:, :, [2, 1, 0]]. Speed: 49.3 msec per call
  • Always Contiguous: Yes. Method: foo = x.copy(). Speed: 11.8 msec per call (This example doesn't change the RGB/BGR channel order, and is just included here as a reference, to show how slow Numpy is at doing a super simple copy of an already-contiguous chunk of RAM. As you can see, even when the data is already in the proper order, Numpy is very slow... And if "x" had been non-contiguous here, it would be even slower, as shown in the x = x[...,::-1].copy() (equivalent to saying bar = x[...,::-1]; foo = bar.copy()) example near the top of the list, which took 37.5 msec and demonstrates Numpy copying non-contiguous RAM (from numpy "views" marked as "read in reverse order" via "stride = -1") into contiguous RAM.)

PS: Whenever we want contiguous data from numpy, we're mostly using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact same thing (it copies too, "but only when the Numpy data isn't already contiguous") and requires much more typing. ;-) And in a few of the examples we're using special indexing (such as x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]) to overwrite the memory directly, which always creates contiguous memory with correct "strides", and is faster than telling Numpy to do a .copy(), but is still extremely slow compared to cv2.cvtColor().
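
A quick demonstration of that difference (array size arbitrary):

import numpy as np

x = np.zeros((2160, 3840, 3), np.uint8)
y = np.ascontiguousarray(x)  # already contiguous: returns x itself, no copy
print(y is x)                # True

z = np.ascontiguousarray(x[..., ::-1])  # non-contiguous view: forces a real copy
print(z.flags['C_CONTIGUOUS'])          # True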

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.48 milliseconds, which is roughly 23,000 times slower!!" And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not. (If illegal type, it marks the data as "needs copy" and "needs cast").
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, it does this horrible thing: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L367-L374
  • As you can see, it calls PyArray_Cast() (if casting was needed too) or PyArray_GETCONTIGUOUS() (if we only need to make sure the data is contiguous). Both of those functions, no matter which one is called, then generate a brand-new Python Numpy object, with all data COPIED by Numpy into brand-new memory and re-arranged into proper contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does (there's also a condensed Python sketch right after this list)...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, ie. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says Use the existing RAM data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L415-L416
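
If you condense all of that into a rough mental model, the copy decision looks something like the following. (This is illustrative Python only, NOT the real pyopencv_to() source, and it ignores some layouts the real code tolerates, such as row-padded ROIs.)

import numpy as np

def pyopencv_to_mental_model(arr):
    # Simplified sketch of the wrapper's decision logic.
    needs_copy = not arr.flags['C_CONTIGUOUS']  # e.g. negative strides from a "view"
    if needs_copy:
        arr = np.ascontiguousarray(arr)  # like PyArray_GETCONTIGUOUS(): a full copy
    # cv::Mat then simply wraps arr's buffer pointer -- instant, zero-copy.
    return arr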

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS / PyArray_Cast to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Demonstration of the Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use cv2.imshow here, but any OpenCV API call will always be doing the same "Python to OpenCV" conversions of the numpy data, so the exact API doesn't matter. They will all have this overhead.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = "demo"  # cv2.namedWindow() returns None, so keep the window NAME ourselves
cv2.namedWindow(wnd, cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguous (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the extra time added to the OpenCV calls when copy-conversion is needed (39.45 ms), is pretty much the same as when you call Numpy's own img.copy() on a "flipped view" inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!
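
If you want to catch this in your own code, a trivial (hypothetical) guard helper makes the hidden cost visible instead of silent:

def assert_cv_ready(img):
    # Conservative check: fail loudly before OpenCV silently copy-converts the array.
    if not img.flags['C_CONTIGUOUS']:
        raise ValueError("non-contiguous image: fix it ONCE up front with img.copy() "
                         "or cv2.cvtColor(), instead of paying a hidden copy per call")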

Numpy "tricks" will cause subtle Bugs too!

Using those Numpy "tricks" isn't just extremely slow. It will cause very subtle bugs in your code, too.

Look at this code and see if you can figure out the bug yourself before you run this example:

import cv2
import numpy as np

img1 = np.zeros([200,200,3], np.uint8) # Create a 200x200 image. (Is Contiguous)
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data. (A Non-Contiguous View)

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

cv2.rectangle(img2, (80,80), (120,120), (255,255,255), 2)

cv2.imshow("", img2)

What do you think the result will be when running this program? Logically, you expect to see a black image with a white rectangle in the middle... But instead, you see nothing except a black image. Why?

Well, it's simple... think about what was explained earlier about how PyOpenCV converts every incoming numpy.ndarray object into an internal C++ cv::Mat object. In this example, we're giving a non-contiguous ndarray as an argument to cv2.rectangle(), which causes PyOpenCV to "fix" the data by making a temporary, internal, contiguous .copy() of the image data, and then it wraps the copy's memory address in a cv::Mat. Next, it passes that cv::Mat object to the internal C++ "draw rectangle" function, which dutifully draws a rectangle onto the memory pointed to by the cv::Mat object... which is... the memory of the temporary internal copy of your input array, since a copy had to be created...

So, OpenCV happily writes a rectangle to the temporary object copy. And then when execution returns to Python, you're of course seeing NO RECTANGLE, since nothing was drawn to your actual ndarray data in RAM (since its memory storage was non-contiguous and therefore not usable as-is by OpenCV).

If you want to see what the code above should be doing, simply add img2 = img2.copy() immediately above the cv2.rectangle call, to cause the img2 ndarray object to become contiguous memory so that OpenCV won't need to make a copy of it (and will be able to use that exact object's memory internally, as intended)... After that tweak, you'll see OpenCV properly drawing the rectangle to the image...
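
In other words, the corrected tail of the script looks like this:

img2 = img2.copy()  # materialize the flipped view into real, contiguous RAM
cv2.rectangle(img2, (80,80), (120,120), (255,255,255), 2)  # now draws into img2 itself
cv2.imshow("", img2)
cv2.waitKey(0)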

This is the kind of subtle bug that is very easy to cause when you're playing around with faked Numpy "views" rather than real contiguous memory.

Another subtle Bug caused by Numpy "tricks"

If you create a partial view (slice) of a non-contiguous Numpy ndarray, the new Numpy object's flags can get completely messed up and claim that the partial view is contiguous (when it really isn't). That bug has been reported to Numpy here: https://github.com/numpy/numpy/issues/14627

So be very careful and ensure that you never make partial views of non-contiguous Numpy arrays, due to the bug above. Be very careful when you get numpy.ndarray objects created by other libraries that may have given you non-contiguous objects that you then decide to slice. You'll create tons of subtle bugs by doing that!

It's only safe to make partial views (like img[0:100, 0:100]) when img itself is already PROVEN to be FULLY contiguous (with no "Numpy tricks" applied to it). In that case, feel free to pass your contiguous, partial image slices to OpenCV functions. You won't invoke any copy-mechanics in that case.

Bonus: What to do if you get a non-contiguous ndarray from a library?

As an example, the very cool D3DShot library has an optional numpy mode where it retrieves the screenshots as ndarray objects. The problem is that it generates them from RAM data laid out in a different order, so it tweaks the ndarray strides etc to give us an object of the proper "shape" (height, width, 3 color channels in RGB order). Its .flags property shows that Contiguous is FALSE.

So what do you do? If you try to pass that directly to OpenCV, you'll invoke the heavy PyOpenCV copy-mechanics described earlier.

Well, you have two options. The first applies when the colors are also in the wrong channel order; in this example they're in RGB order, and you want BGR for usage in OpenCV. So you should invoke cv2.cvtColor, which internally triggers the Numpy .copy() for you (just like all OpenCV APIs do when given non-contiguous data) and then changes the color order in RAM for you.

The second option is when you have Numpy data that is already in the correct color order (such as BGR), but whose RAM is non-contiguous. In that case, you should directly invoke img = img.copy() to tell Numpy to make a contiguous copy of the array, to fix it. Then you're welcome to use that contiguous copy for everything.

Alright, so let's look at the D3DShot example:

import cv2
import d3dshot
import time

d = d3dshot.create(capture_output="numpy", frame_buffer_size=60)

img1 = d.screenshot()
img2 = d.screenshot()

print(img1.strides, img1.flags)
print(img2.strides, img2.flags)

print("-------------")

start = time.perf_counter()
img1_justcopy = img1.copy() # copy RGB image to new, contiguous RAM
elapsed = (time.perf_counter() - start) * 1000
print(img1_justcopy.strides, img1_justcopy.flags)
print("justcopy milliseconds:", elapsed)

print("-------------")

start = time.perf_counter()
img1 = img1.copy()
img1 = cv2.cvtColor(img1, cv2.COLOR_RGB2BGR) # flip RGB -> BGR
elapsed = (time.perf_counter() - start) * 1000
print(img1.strides, img1.flags)
print("copy+cvtColor milliseconds:", elapsed)

print("-------------")

start = time.perf_counter()
img2 = cv2.cvtColor(img2, cv2.COLOR_RGB2BGR) # flip RGB -> BGR
elapsed = (time.perf_counter() - start) * 1000
print(img2.strides, img2.flags)
print("cvtColor milliseconds:", elapsed)

Output:

(1920, 1, 2073600)   C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

(1920, 1, 2073600)   C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

justcopy milliseconds: 9.122899999999989
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

copy+cvtColor milliseconds: 12.177900000000019
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

cvtColor milliseconds: 11.461500000000013

These examples are all on my 1920x1080 screen, so they're not directly comparable to the 4K resolution times we saw in earlier benchmarks.

Anyway, what we can see here, is first of all that the two captured images (img1 and img2) coming straight from the D3DShot library have very strange strides values, and C_CONTIGUOUS : False. That's because they are raw RAM given to D3DShot by Windows and then just packaged into a ndarray with custom strides to make it read the raw RAM data in the desired order.

Next, we see that just doing img1_justcopy = img1.copy() (which copies the RGB-channeled, non-contiguous RAM into new, contiguous RAM, but does not change the channel order (the image will still be RGB)), takes 9.12 ms, which is indeed how slow Numpy is at copying non-contiguous ndarray data into new, contiguous RAM. Basically, internally, Numpy has to do a ton of looping to read the data byte-by-byte while writing each byte into the correct order in the new, contiguous RAM.

So, the PyArray (Numpy) copying of non-contiguous to contiguous is always the slowest operation. That's why we want to avoid having non-contiguous RAM.

Alright, we also demonstrated how to make a "copy AND fix the colors from RGB to BGR" in two different ways. Doing img1 = img1.copy(); img1 = cv2.cvtColor(img1, cv2.COLOR_RGB2BGR) takes 12.18 ms, and letting cvtColor trigger the Numpy .copy internally via directly calling img2 = cv2.cvtColor(img2, cv2.COLOR_RGB2BGR) takes 11.46 ms. The reason for the slight difference is of course that there's slightly more work involved when we're doing 2 separate function calls than when we let OpenCV do the Numpy copying in its single call.

In both cases, a PyArray (Numpy) copy operation happens internally, to give us a straight, contiguous RAM location. And then we pass that fixed, contiguous ndarray to cvtColor which fixes the color channel order.

That gives you the following guidelines (see the sketch below):

  • If your Numpy data is non-contiguous but is already in the correct channel order (you don't want to convert RGB to/from BGR, etc): Use img = img.copy() to force Numpy to make a contiguous copy of the data, which is then usable in all OpenCV calls without any bugs and without causing any slow internal, temporary copying.
  • If your Numpy data is non-contiguous and you also want to change the channel order: Use img = cv2.cvtColor(img, cv2.COLOR_<your conversion choice>), which will internally do the .copy slightly more efficiently than if you had used two separate Python statements.
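
Put together as a sketch (assuming the D3DShot capture object d from the example above, plus a hypothetical need_bgr flag saying whether your pipeline wants the channels swapped):

import cv2

frame = d.screenshot()  # non-contiguous RGB ndarray straight from the library
if need_bgr:
    frame = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)  # one call: copy + channel reorder
elif not frame.flags['C_CONTIGUOUS']:
    frame = frame.copy()  # channel order already fine: just make the RAM contiguous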

Both techniques will result in giving you fast, contiguous RAM, in the color arrangement of your choice!

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They lead to SUBTLE BUGS and they are EXTREMELY SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL, since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45 ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.48 ms as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some Python library to capture an RGB screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1] to "turn the RGB data into BGR (which OpenCV needs)", and you're thinking "Wow, my code is so fast! That RGB-to-BGR operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot-image in various ways. Well, since you're causing one internal Numpy copy-conversion-to-contiguous PER CALL, you're now causing 5 * 39.45 = 197.25 ms of total conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.48 ms conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)


Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is almost always necessary, depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = x[:, :, [2, 1, 0]]. Speed: 12.6 msec per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.48 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call
  • Always Contiguous: Yes. Method: r = x[..., 0].copy(); x[..., 0] = x[..., 2]; x[..., 2] = r. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 2]] = x[:, :, [2, 0]]. Speed: 21.7 msec per call
  • Always Contiguous: Yes. Method: x[..., [0, 2]] = x[..., [2, 0]]. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]. Speed: 33.1 msec per call
  • Always Contiguous: Yes. Method: x[:, :] = x[:, :, [2, 1, 0]]. Speed: 49.3 msec per call
  • Always Contiguous: Yes. Method: foo = x.copy(). Speed: 11.8 msec per call (This example doesn't change the RGB/BGR channel order, and is just included here as a reference, to show how slow Numpy is at doing a super simple copy of an already-contiguous chunk of RAM. As you can see, even when the data is already in the proper order, Numpy is very slow... And if "x" had been non-contiguous here, it would be even slower, as shown in the x = x[...,::-1].copy() (equivalent to saying bar = x[...,::-1]; foo = bar.copy()) example near the top of the list, which took 37.5 msec and demonstrates Numpy copying non-contiguous RAM (from numpy "views" marked as "read in reverse order" via "stride = -1") into contiguous RAM...)

PS: Whenever we want contiguous data from numpy, we're mostly using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact same thing (it copies too, "but only when the Numpy data isn't already contiguous") and requires much more typing. ;-) And in a few of the examples we're using special indexing (such as x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]) to overwrite the memory directly, which always creates contiguous memory with correct "strides", and is faster than telling Numpy to do a .copy(), but is still extremely slow compared to cv2.cvtColor().
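To see the difference yourself, here's a small sketch (np.shares_memory just proves whether a real copy happened):

import numpy as np

x = np.zeros([2160, 3840, 3], np.uint8)   # already contiguous
print(np.ascontiguousarray(x) is x)       # True - returned as-is, nothing copied

v = x[..., ::-1]                          # non-contiguous flipped view
c = np.ascontiguousarray(v)               # here it DOES copy, exactly like v.copy()
print(c.flags['C_CONTIGUOUS'], np.shares_memory(v, c))   # True False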

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.48 milliseconds, which is over 23000 times slower!!". And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's an int, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not. (If illegal type, it marks the data as "needs copy" and "needs cast").
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, it does this horrible thing: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L367-L374
  • As you can see, it calls PyArray_Cast() (if casting was needed too) or PyArray_GETCONTIGUOUS() (if we only need to make sure the data is contiguous). Whichever one is called, it generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, ie. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says Use the existing RAM data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L415-L416
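You can actually verify that zero-copy wrapping from Python: when the input is contiguous, OpenCV's drawing functions write straight into Numpy's own buffer. A minimal sketch:

import cv2
import numpy as np

img = np.zeros([100, 100, 3], np.uint8)            # contiguous -> wrapped, not copied
cv2.rectangle(img, (10, 10), (90, 90), (0, 255, 0), 2)

# The pixels changed in Numpy's own RAM, proving cv::Mat pointed at img.data:
print(img[10, 10])                                 # [  0 255   0]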

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS / PyArray_Cast to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Demonstration of the Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use cv2.imshow here, but any OpenCV API call will always be doing the same "Python to OpenCV" conversions of the numpy data, so the exact API doesn't matter. They will all have this overhead.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = ""  # cv2.namedWindow() returns None, so keep the window NAME string ourselves
cv2.namedWindow(wnd, cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguous (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the extra time added to the OpenCV calls when copy-conversion is needed (39.45 ms), is pretty much the same as when you call Numpy's own img.copy() on a "flipped view" inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!
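The cure is simple: pay the copy cost ONCE, up front, instead of once per OpenCV call. A minimal sketch (GaussianBlur/cvtColor here are just stand-ins for "any OpenCV calls you make afterwards"):

import cv2
import numpy as np

img = np.zeros([2160, 3840, 3], np.uint8)[..., ::-1]   # a non-contiguous flipped view

img = np.ascontiguousarray(img)   # ONE copy happens here...

# ...so none of the calls below triggers a hidden PyArray copy anymore:
blur = cv2.GaussianBlur(img, (5, 5), 0)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)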

Numpy "tricks" will cause subtle Bugs too!

Using those Numpy "tricks" isn't just extremely slow. It will cause very subtle bugs in your code, too.

Look at this code and see if you can figure out the bug yourself before you run this example:

import cv2
import numpy as np

img1 = np.zeros([200,200,3], np.uint8) # Create a 200x200 image. (Is Contiguous)
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data. (A Non-Contiguous View)

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

cv2.rectangle(img2, (80,80), (120,120), (255,255,255), 2)

cv2.imshow("", img2)

What do you think the result will be when running this program? Logically, you'd expect to see a black image with a white rectangle in the middle... But instead, you see nothing except a black image. Why?

Well, it's simple... think about what was explained earlier about how PyOpenCV converts every incoming numpy.ndarray object into an internal C++ cv::Mat object. In this example, we're giving a non-contiguous ndarray as an argument to cv2.rectangle(), which causes PyOpenCV to "fix" the data by making a temporary, internal, contiguous .copy() of the image data, and then it wraps the copy's memory address in a cv::Mat. Next, it passes that cv::Mat object to the internal C++ "draw rectangle" function, which dutifully draws a rectangle onto the memory pointed to by the cv::Mat object... which is... the memory of the temporary internal copy of your input array, since a copy had to be created...

So, OpenCV happily writes a rectangle to the temporary object copy. And then when execution returns to Python, you're of course seeing NO RECTANGLE, since nothing was drawn to your actual ndarray data in RAM (since its memory storage was non-contiguous and therefore not usable as-is by OpenCV).

If you want to see what the code above should be doing, simply add img2 = img2.copy() immediately above the cv2.rectangle call, to cause the img2 ndarray object to become contiguous memory so that OpenCV won't need to make a copy of it (and will be able to use that exact object's memory internally, as intended)... After that tweak, you'll see OpenCV properly drawing the rectangle to the image...
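Here's the fixed version of the example, for clarity:

import cv2
import numpy as np

img1 = np.zeros([200,200,3], np.uint8)
img2 = img1[...,::-1].copy()   # materialize the flip into REAL contiguous memory

cv2.rectangle(img2, (80,80), (120,120), (255,255,255), 2)   # now draws into img2 itself

cv2.imshow("", img2)
cv2.waitKey(0)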

This is the kind of subtle bug that is very easy to cause when you're playing around with faked Numpy "views" rather than real contiguous memory.

Bonus: A note about Numpy "slices"

Numpy allows you to efficiently "slice" arrays, to extract a "partial view" of the data. This is very useful for images, since you can do something such as extracting a 100x100 pixel square from the middle of an image. The slicing syntax is img_sliced = img[y1:y2,x1:x2]. This generates a full Numpy object which points at the data of the original image (they share each other's memory), but which only points at the sub-range you wanted.

So it basically becomes a fully usable "Numpy array" object which you can use in any context you would pass an image. Such as to an OpenCV function, which would then only operate on the sliced segment of RAM. That's really useful!
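For example, here's a small sketch showing that a slice really is a zero-copy window into the original image's RAM:

import numpy as np

img = np.zeros([200, 200, 3], np.uint8)   # contiguous source image
roi = img[50:150, 50:150]                 # 100x100 partial view - NO copy is made

print(np.shares_memory(img, roi))         # True - the slice reuses img's RAM
roi[:] = 255                              # writing through the view...
print(img[60, 60])                        # [255 255 255] - ...modifies img too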

However, be aware that Numpy slices inherit the strides and contiguous flag of the original object / data they were sliced from! So if you're slicing from a non-contiguous array (for example one handed to you by another library), you'll generate a non-contiguous slice object too, which has all the issues of non-contiguous objects.

It's only safe to make partial views/slices (like img[0:100, 0:100]) when img itself is already PROVEN to be FULLY contiguous (with no "Numpy tricks" applied to it). In that case, feel free to pass your contiguous, partial image slices to OpenCV functions. You won't invoke any copy-mechanics in that case!

Bonus: What to do when you get a non-contiguous ndarray from a library?

As an example, the very cool D3DShot library has an optional numpy mode where it retrieves the screenshots as ndarray objects. The problem is that it generates them from RAM data laid out in a different order, so it tweaks the ndarray strides etc to give us an object of the proper "shape" (height, width, 3 color channels in RGB order). Its .flags property shows that Contiguous is FALSE.

So what do you do? If you try to pass that directly to OpenCV, you'll invoke the heavy PyOpenCV copy-mechanics described earlier.

Well, you have two options. The first is for when the channel order is also wrong: in this example the colors are in RGB order, and you want them to be BGR for usage in OpenCV. So you should invoke cv2.cvtColor, which internally triggers the Numpy .copy() for you (just like all OpenCV APIs do when given non-contiguous data) and then fixes the color order in RAM for you.

The second option is when you have Numpy data that is already in the correct color order (such as BGR), but whose RAM is non-contiguous. In that case, you should directly invoke img = img.copy() to tell Numpy to make a contiguous copy of the array, to fix it. Then you're welcome to use that contiguous copy for everything.

Alright, so let's look at the D3DShot example:

import cv2
import d3dshot
import time

d = d3dshot.create(capture_output="numpy", frame_buffer_size=60)

img1 = d.screenshot()
img2 = d.screenshot()

print(img1.strides, img1.flags)
print(img2.strides, img2.flags)

print("-------------")

start = time.perf_counter()
img1_justcopy = img1.copy() # copy RGB image to new, contiguous RAM
elapsed = (time.perf_counter() - start) * 1000
print(img1_justcopy.strides, img1_justcopy.flags)
print("justcopy milliseconds:", elapsed)

print("-------------")

start = time.perf_counter()
img1 = img1.copy()
img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2RGB) # flip RGB <-> BGR (BGR2RGB and RGB2BGR are the identical swap)
elapsed = (time.perf_counter() - start) * 1000
print(img1.strides, img1.flags)
print("copy+cvtColor milliseconds:", elapsed)

print("-------------")

start = time.perf_counter()
img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2RGB) # flip RGB <-> BGR (same swap, see above)
elapsed = (time.perf_counter() - start) * 1000
print(img2.strides, img2.flags)
print("cvtColor milliseconds:", elapsed)

Output:

(1920, 1, 2073600)   C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

(1920, 1, 2073600)   C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

justcopy milliseconds: 9.122899999999989
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

copy+cvtColor milliseconds: 12.177900000000019
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

cvtColor milliseconds: 11.461500000000013

These examples are all on my 1920x1080 screen, so they're not directly comparable to the 4K resolution times we saw in earlier benchmarks.

Anyway, what we can see here, is first of all that the two captured images (img1 and img2) coming straight from the D3DShot library have very strange strides values, and C_CONTIGUOUS : False. That's because they are raw RAM given to D3DShot by Windows, and then just packaged into an ndarray with custom strides to make it read the raw RAM data in the desired order.

Next, we see that just doing img1_justcopy = img1.copy() (which copies the RGB-channeled, non-contiguous RAM into new, contiguous RAM, but does not change the channel order (the image will still be RGB)), takes 9.12 ms, which is indeed how slow Numpy is at copying non-contiguous ndarray data into new, contiguous RAM. Basically, internally, Numpy has to do a ton of looping to read the data byte-by-byte while writing each byte into the correct order in the new, contiguous RAM.

So, the PyArray (Numpy) copying of non-contiguous to contiguous is always the slowest operation. That's why we want to avoid having non-contiguous RAM.

Alright, we also demonstrated how to make a "copy AND fix the colors from RGB to BGR" in two different ways. Doing img1 = img1.copy(); img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2RGB) takes 12.18 ms, and letting cvtColor trigger the Numpy .copy internally via directly calling img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2RGB) takes 11.46 ms. The reason for the slight difference is of course that there's slightly more work involved when we're doing 2 separate function calls, than when we let OpenCV do the Numpy copying in its single call.

In both cases, a PyArray (Numpy) copy operation happens internally, to give us a straight, contiguous RAM location. And then we pass that fixed, contiguous ndarray to cvtColor which fixes the color channel order.

That gives you the following guidelines:

  • If your Numpy data is non-contiguous but is already in the correct channel order (you don't want to convert RGB to/from BGR, etc): Use img = img.copy() to force Numpy to make a contiguous copy of the data, which is then usable in all OpenCV calls without any bugs and without causing any slow internal, temporary copying.
  • If your Numpy data is non-contiguous and you also want to change the channel order: Use img = cv2.cvtColor(img, cv2.COLOR_<your conversion choice>), which will internally do the .copy slightly more efficiently than if you had used two separate Python statements.

Both techniques will result in giving you fast, contiguous RAM, in the color arrangement of your choice!
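As a minimal sketch, both guidelines can be rolled into one hypothetical helper (the name as_opencv_ready is mine, not an OpenCV API):

import cv2

def as_opencv_ready(img, swap_rb=False):
    """Return an image OpenCV can wrap with zero hidden copies.

    swap_rb=True also swaps R<->B (the identical operation in both directions).
    """
    if swap_rb:
        # cvtColor does the internal copy AND the channel reorder in one call
        return cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
    if not img.flags['C_CONTIGUOUS']:
        return img.copy()   # contiguous copy, channel order unchanged
    return img              # already fine - hand it over as-is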

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They lead to SUBTLE BUGS and they are EXTREMELY SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL, since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45 ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.48 ms as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some Python library to capture an RGB screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1] to "turn the RGB data into BGR (which OpenCV needs)", and you're thinking "Wow, my code is so fast! That RGB-to-BGR operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot-image in various ways. Well, since you're causing one internal Numpy copy-conversion-to-contiguous PER CALL, you're now causing 5 * 39.45 = 197.25 ms of total conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.48 ms conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is almost always necessary, depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = x[:, :, [2, 1, 0]]. Speed: 12.6 msec per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.48 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call
  • Always Contiguous: Yes. Method: r = x[..., 0].copy(); x[..., 0] = x[..., 2]; x[..., 2] = r. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 2]] = x[:, :, [2, 0]]. Speed: 21.7 msec per call
  • Always Contiguous: Yes. Method: x[..., [0, 2]] = x[..., [2, 0]]. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]. Speed: 33.1 msec per call
  • Always Contiguous: Yes. Method: x[:, :] = x[:, :, [2, 1, 0]]. Speed: 49.3 msec per call
  • Always Contiguous: Yes. Method: foo = x.copy(). Speed: 11.8 msec per call (This example doesn't change the RGB/BGR channel order, and is just included here as a reference, to show how slow Numpy is at doing a super simple copy of an already-contiguous chunk of RAM. As you can see, even when the data is already in the proper order, Numpy is very slow... And if "x" had been non-contiguous here, it would be even slower, as shown in the x = x[...,::-1].copy() (equivalent to saying bar = x[...,::-1]; foo = bar.copy()) example near the top of the list, which took 37.5 msec and demonstrates Numpy copying non-contiguous RAM (from numpy "views" marked as "read in reverse order" via "stride = -1") into contiguous RAM...

PS: Whenever we want contiguous data from numpy, we're mostly using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact same thing (it copies too, "but only when the Numpy data isn't already contiguous") and requires much more typing. ;-) And in a few of the examples we're using special indexing (such as x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]) to overwrite the memory directly, which always creates contiguous memory with correct "strides", and is faster than telling Numpy to do a .copy(), but is still extremely slow compared to cv2.cvtColor().

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.48 milliseconds, which is 23798 times slower!! And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not. (If illegal type, it marks the data as "needs copy" and "needs cast").
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, it does this horrible thing: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L367-L374
  • As you can see, it calls PyArray_Cast() (if casting was needed too) or PyArray_GETCONTIGUOUS() (if we only need to make sure the data is contiguous). Both of those functions, no matter which one is called, then generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory, and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, ie. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says Use the existing RAM data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS / PyArray_Cast to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Demonstration of the Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use cv2.imshow here, but any OpenCV API call will always be doing the same "Python to OpenCV" conversions of the numpy data, so the exact API doesn't matter. They will all have this overhead.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = cv2.namedWindow("", cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguos (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the extra time added to the OpenCV calls when copy-conversion is needed (39.45 ms), is pretty much the same as when you call Numpy's own img.copy() on a "flipped view" inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

Numpy "tricks" will cause subtle Bugs too!

Using those Numpy "tricks" isn't just extremely slow. It will cause very subtle bugs in your code, too.

Look at this code and see if you can figure out the bug yourself before you run this example:

import cv2
import numpy as np

img1 = np.zeros([200,200,3], np.uint8) # Create a 200x200 image. (Is Contiguous)
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data. (A Non-Contiguous View)

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

cv2.rectangle(img2, (80,80), (120,120), (255,255,255), 2)

cv2.imshow("", img2)

What do you think the result will be when running this program? Logically, you expect to see black image with a white rectangle in the middle... But instead, you see nothing except a black image. Why?

Well, it's simple... think about what was explained earlier about how PyOpenCV converts every incoming numpy.ndarray object into an internal C++ cv::Mat object. In this example, we're giving a non-contiguous ndarray as an argument to cv2.rectangle(), which causes PyOpenCV to "fix" the data by making a temporary, internal, contiguous .copy() of the image data, and then it wraps the copy's memory address in a cv::Mat. Next, it passes that cv::Mat object to the internal C++ "draw rectangle" function, which dutifully draws a rectangle onto the memory pointed to by the cv::Mat object... which is... the memory of the temporary internal copy of your input array, since a copy had to be created...

So, OpenCV happily writes a rectangle to the temporary object copy. And then when execution returns to Python, you're of course seeing NO RECTANGLE, since nothing was drawn to your actual ndarray data in RAM (since its memory storage was non-contiguous and therefore not usable as-is by OpenCV).

If you want to see what the code above should be doing, simply add img2 = img2.copy() immediately above the cv2.rectangle call, to cause the img2 ndarray object to become contiguous memory so that OpenCV won't need to make a copy of it (and will be able to use that exact object's memory internally, as intended)... After that tweak, you'll see OpenCV properly drawing the rectangle to the image...

This is the kind of subtle bug that is very easy to cause when you're playing around with faked Numpy "views" rather than real contiguous memory.

Bonus: A note about Numpy "slices"

Numpy allows you to efficiently "slice" arrays, to extract a "partial view" of the data. This is very useful for images, since you can do something such as extracting a 100:100 pixel square from the middle of an image. The slicing syntax is img_sliced = img[y1:y2,x1:x2]. This generates a full Numpy object which points at the data of the original image (they share each other's memory), but which only points at the sub-range you wanted.

So it basically becomes a fully usable "Numpy array" object which you can use in any context you would pass an image. Such as to an OpenCV function, which would then only operate on the sliced segment of RAM. That's really useful!

However, be aware that the Numpy slices inherit the strides and contiguous flag of the original object / data they were sliced from! So if you're slicing from a non-contiguous array, you'll generate a non-contiguous slice object too, which is horrible and has all the issues of non-contiguous objects.

It's only safe to make partial views/slices (like img[0:100, 0:100]) when img itself is already PROVEN to be FULLY contiguous (with no "Numpy tricks" applied to it). In that case, feel free to pass your contiguous, partial image slices to OpenCV functions. You won't invoke any copy-mechanics in that case!

Bonus: What to do when you get a non-contiguous ndarray from a library?

As an example, the very cool D3DShot library has an optional numpy mode where it retrieves the screenshots as ndarray objects. The problem is that it generates them from RAM data laid out in a different order, so it tweaks the ndarray strides etc to give us an object of the proper "shape" (height, width, 3 color channels in RGB order). Its .flags property shows that Contiguous is FALSE.

So what do you do? If you try to pass that directly to OpenCV, you'll invoke the heavy PyOpenCV copy-mechanics described earlier.

Well, you have two options. In this example case, the colors are in RGB order, and you want them to be BGR for usage in OpenCV. So you should be invoking cv2.cvtColor which internally will trigger the Numpy .copy() for you (just like all OpenCV APIs do when given non-contiguous data), and then changes the color order in RAM for you.

The second option is when you have Numpy data that is already in the correct color order (such as BGR), but whose RAM is non-contiguous. In that case, you should directly invoke img = img.copy() to tell Numpy to make a contiguous copy of the array, to fix it. Then you're welcome to use that contiguous copy for everything.

Alright, so let's look at the D3DShot example:

import cv2
import d3dshot
import time

d = d3dshot.create(capture_output="numpy", frame_buffer_size=60)

img1 = d.screenshot()
img2 = d.screenshot()

print(img1.strides, img1.flags)
print(img2.strides, img2.flags)

print("-------------")

start = time.perf_counter()
img1_justcopy = img1.copy() # copy RGB image to new, contiguous RAM
elapsed = (time.perf_counter() - start) * 1000
print(img1_justcopy.strides, img1_justcopy.flags)
print("justcopy milliseconds:", elapsed)

print("-------------")

start = time.perf_counter()
img1 = img1.copy()
img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2RGB) # flip RGB -> BGR
elapsed = (time.perf_counter() - start) * 1000
print(img1.strides, img1.flags)
print("copy+cvtColor milliseconds:", elapsed)

print("-------------")

start = time.perf_counter()
img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2RGB) # flip RGB -> BGR
elapsed = (time.perf_counter() - start) * 1000
print(img2.strides, img2.flags)
print("cvtColor milliseconds:", elapsed)

Output:

(1920, 1, 2073600)   C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

(1920, 1, 2073600)   C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

justcopy milliseconds: 9.122899999999989
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

copy+cvtColor milliseconds: 12.177900000000019
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

cvtColor milliseconds: 11.461500000000013

These examples are all on my 1920x1080 screen, so they're not directly comparable to the 4K resolution times we saw in earlier benchmarks.

Anyway, what we can see here, is first of all that the two captured images (img1 and img2) coming straight from the D3DShot library have very strange strides values, and C_CONTIGUOUS : False. That's because they are raw RAM given to D3DShot by Windows and then just packaged into a ndarray with custom strides to make it read the raw RAM data in the desired order.

Next, we see that just doing img1_justcopy = img1.copy() (which copies the RGB-channeled, non-contiguous RAM into new, contiguous RAM, but does not change the channel order (the image will still be RGB)), takes 9.12 ms, which is indeed how slow Numpy is at copying non-contiguous ndarray data into new, contiguous RAM. Basically, internally, Numpy has to do a ton of looping to read the data byte-by-byte while writing each byte into the correct order in the new, contiguous RAM.

So, the PyArray (Numpy) copying of non-contiguous to contiguous is always the slowest operation. That's why we want to avoid having non-contiguous RAM.

Alright, we also demonstrated how to make a "copy AND fix the colors from RGB to BGR" in two different ways. Doing img1 = img1.copy(); img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2RGB) takes 11.83 ms, and letting cvtColor trigger the Numpy .copy internally via directly calling img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2RGB) takes 10.61 ms. The reason for the slight difference is of course that there's slightly more work involved when we're doing 2 separate function calls, than when we let OpenCV do the Numpy copying in its single call.

In both cases, a PyArray (Numpy) copy operation happens internally, to give us a straight, contiguous RAM location. And then we pass that fixed, contiguous ndarray to cvtColor which fixes the color channel order.

That gives you the following guidelines:

  • If your Numpy data is non-contiguous but is already in the correct channel order (you don't want to convert RGB to/from BGR, etc): Use img = img.copy() to force Numpy to make a contiguous copy of the data, which is then usable in all OpenCV calls without any bugs and without causing any slow internal, temporary copying.
  • If your Numpy data is non-contiguous and you also want to change the channel order: Use img = cv2.cvtColor(img, cv2.COLOR_<your conversion choice>), which will internally do the .copy slightly more efficiently than if you had used two separate Python statements.

Both techniques will result in giving you fast, contiguous RAM, in the color arrangement of your choice!

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They lead to SUBTLE BUGS and they are EXTREMELY SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL, since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45 ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.48 ms as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some Python library to capture an RGB screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1] to "turn the RGB data into BGR (which OpenCV needs)", and you're thinking "Wow, my code is so fast! That RGB-to-BGR operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot-image in various ways. Well, since you're causing one internal Numpy copy-conversion-to-contiguous PER CALL, you're now causing 5 * 39.45 = 197.25 ms of total conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.48 ms conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is almost always necessary, depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = x[:, :, [2, 1, 0]]. Speed: 12.6 msec per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.48 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call
  • Always Contiguous: Yes. Method: r = x[..., 0].copy(); x[..., 0] = x[..., 2]; x[..., 2] = r. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 2]] = x[:, :, [2, 0]]. Speed: 21.7 msec per call
  • Always Contiguous: Yes. Method: x[..., [0, 2]] = x[..., [2, 0]]. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]. Speed: 33.1 msec per call
  • Always Contiguous: Yes. Method: x[:, :] = x[:, :, [2, 1, 0]]. Speed: 49.3 msec per call
  • Always Contiguous: Yes. Method: foo = x.copy(). Speed: 11.8 msec per call (This example doesn't change the RGB/BGR channel order, and is just included here as a reference, to show how slow Numpy is at doing a super simple copy of an already-contiguous chunk of RAM. As you can see, even when the data is already in the proper order, Numpy is very slow... And if "x" had been non-contiguous here, it would be even slower, as shown in the x = x[...,::-1].copy() (equivalent to saying bar = x[...,::-1]; foo = bar.copy()) example near the top of the list, which took 37.5 msec and demonstrates Numpy copying non-contiguous RAM (from numpy "views" marked as "read in reverse order" via "stride = -1") into contiguous RAM...

PS: Whenever we want contiguous data from numpy, we're mostly using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact same thing (it copies too, "but only when the Numpy data isn't already contiguous") and requires much more typing. ;-) And in a few of the examples we're using special indexing (such as x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]) to overwrite the memory directly, which always creates contiguous memory with correct "strides", and is faster than telling Numpy to do a .copy(), but is still extremely slow compared to cv2.cvtColor().

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with one of the methods above, such as "x = np.flip(x, axis=2).copy()".
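
For example (a sketch; any row from the list above can be substituted), the cv2.cvtColor measurement would be produced by:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR)"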

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.48 milliseconds, which is roughly 23,000 times slower!!". And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not. (If illegal type, it marks the data as "needs copy" and "needs cast").
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, it does this horrible thing: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L367-L374
  • As you can see, it calls PyArray_Cast() (if casting was needed too) or PyArray_GETCONTIGUOUS() (if we only need to make sure the data is contiguous). Whichever one is called generates a brand-new Python Numpy object, with all data COPIED by Numpy into brand-new memory and re-arranged into proper contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, ie. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says Use the existing RAM data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO AN OPENCV API CAUSES a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), plus a lot of math inside PyArray_GETCONTIGUOUS / PyArray_Cast to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!
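
If you do want to keep views around in pure-Numpy code, one defensive pattern is to normalize the array once, right before it crosses into OpenCV. A minimal sketch (the helper name is ours, not an OpenCV API):

import numpy as np

def as_cv_ready(img):
    # Normalize once, up front: np.ascontiguousarray() returns the array
    # unchanged when it is already contiguous, and makes one contiguous
    # copy otherwise; that way OpenCV's wrapper never has to copy per call.
    return np.ascontiguousarray(img)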

Demonstration of the Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use cv2.imshow here, but the exact API doesn't matter: every OpenCV call performs the same "Python to OpenCV" conversion of the numpy data, so they all carry this overhead.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = "benchmark"
cv2.namedWindow(wnd, cv2.WINDOW_NORMAL) # namedWindow returns None, so keep the name string ourselves

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguous (img1) data does not need conversion,
# so the contiguous runtime is just the internal work of imshow
# itself. We only want to measure the conversion overhead for
# non-contiguous data, so we subtract the contiguous (img1) time
# from the non-contiguous (img2) time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the extra time added to the OpenCV calls when copy-conversion is needed (39.45 ms), is pretty much the same as when you call Numpy's own img.copy() on a "flipped view" inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

Numpy "tricks" will cause subtle Bugs too!

Using those Numpy "tricks" isn't just extremely slow. It will cause very subtle bugs in your code, too.

Look at this code and see if you can figure out the bug yourself before you run this example:

import cv2
import numpy as np

img1 = np.zeros([200,200,3], np.uint8) # Create a 200x200 image. (Is Contiguous)
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data. (A Non-Contiguous View)

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

cv2.rectangle(img2, (80,80), (120,120), (255,255,255), 2)

cv2.imshow("", img2)
cv2.waitKey(0) # needed for the window to actually render

What do you think the result will be when running this program? Logically, you expect to see a black image with a white rectangle in the middle... But instead, you see nothing except a black image. Why?

Well, it's simple... think about what was explained earlier about how PyOpenCV converts every incoming numpy.ndarray object into an internal C++ cv::Mat object. In this example, we're giving a non-contiguous ndarray as an argument to cv2.rectangle(), which causes PyOpenCV to "fix" the data by making a temporary, internal, contiguous .copy() of the image data, and then it wraps the copy's memory address in a cv::Mat. Next, it passes that cv::Mat object to the internal C++ "draw rectangle" function, which dutifully draws a rectangle onto the memory pointed to by the cv::Mat object... which is... the memory of the temporary internal copy of your input array, since a copy had to be created...

So, OpenCV happily writes a rectangle to the temporary object copy. And then when execution returns to Python, you're of course seeing NO RECTANGLE, since nothing was drawn to your actual ndarray data in RAM (since its memory storage was non-contiguous and therefore not usable as-is by OpenCV).

If you want to see what the code above should be doing, simply add img2 = img2.copy() immediately above the cv2.rectangle call, to cause the img2 ndarray object to become contiguous memory so that OpenCV won't need to make a copy of it (and will be able to use that exact object's memory internally, as intended)... After that tweak, you'll see OpenCV properly drawing the rectangle to the image...
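
For clarity, here's what the fixed tail of that example would look like (same drawing call, with the one-line materialization added):

img2 = img2.copy() # materialize the flipped view into real, contiguous RAM
cv2.rectangle(img2, (80,80), (120,120), (255,255,255), 2) # now draws into img2 itself
cv2.imshow("", img2)
cv2.waitKey(0)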

This is the kind of subtle bug that is very easy to cause when you're playing around with faked Numpy "views" rather than real contiguous memory.

Bonus: A note about Numpy "slices"

Numpy allows you to efficiently "slice" arrays, to extract a "partial view" of the data. This is very useful for images, since you can do something such as extracting a 100x100 pixel square from the middle of an image. The slicing syntax is img_sliced = img[y1:y2,x1:x2]. This generates a full Numpy object which points at the data of the original image (they share the same memory), but which only covers the sub-range you wanted.

So it basically becomes a fully usable "Numpy array" object which you can use in any context you would pass an image. Such as to an OpenCV function, which would then only operate on the sliced segment of RAM. That's really useful!

However, be aware that Numpy slices share the memory and the stride layout of the object they were sliced from! So if you slice a non-contiguous array, you get a non-contiguous slice object too, with all the issues of non-contiguous objects described above. (Strictly speaking, even a slice of a contiguous image usually reports C_CONTIGUOUS: False, because its rows are no longer adjacent in RAM; plain row-strided data like that is still usable by OpenCV without copying. It's the negative strides from channel-flipping tricks that force the copy.)

It's only safe to make partial views/slices (like img[0:100, 0:100]) when img itself is already PROVEN to be FULLY contiguous (with no "Numpy tricks" applied to it). In that case, feel free to pass your contiguous, partial image slices to OpenCV functions. You won't invoke any copy-mechanics in that case!
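
Here's a small sketch of the safe case: a slice of a contiguous image passed straight to an OpenCV drawing call. Because the slice shares the original's memory, the drawing shows up in the full image too:

import cv2
import numpy as np

img = np.zeros([400, 400, 3], np.uint8)  # contiguous image
roi = img[100:300, 100:300]              # partial view into the SAME RAM

cv2.rectangle(roi, (10, 10), (180, 180), (0, 255, 0), 2)  # writes into img's memory
cv2.imshow("", img)  # the rectangle is visible in the full image
cv2.waitKey(0)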

Bonus: What to do when you get a non-contiguous ndarray from a library?

As an example, the very cool D3DShot library has an optional numpy mode where it retrieves the screenshots as ndarray objects. The problem is that it generates them from RAM data laid out in a different order, so it tweaks the ndarray strides etc to give us an object of the proper "shape" (height, width, 3 color channels in RGB order). Its .flags property shows that Contiguous is FALSE.

So what do you do? If you try to pass that directly to OpenCV, you'll invoke the heavy PyOpenCV copy-mechanics described earlier.

Well, you have two options. In this example case, the colors are in RGB order and you want them in BGR for use in OpenCV. So you should invoke cv2.cvtColor, which internally triggers the Numpy .copy() for you (just like all OpenCV APIs do when given non-contiguous data) and then fixes the channel order in RAM for you.

The second option is when you have Numpy data that is already in the correct color order (such as BGR), but whose RAM is non-contiguous. In that case, you should directly invoke img = img.copy() to tell Numpy to make a contiguous copy of the array, to fix it. Then you're welcome to use that contiguous copy for everything.

Alright, so let's look at the D3DShot example:

import cv2
import d3dshot
import time

d = d3dshot.create(capture_output="numpy", frame_buffer_size=60)

img1 = d.screenshot()
img2 = d.screenshot()

print(img1.strides, img1.flags)
print(img2.strides, img2.flags)

print("-------------")

start = time.perf_counter()
img1_justcopy = img1.copy() # copy RGB image to new, contiguous RAM
elapsed = (time.perf_counter() - start) * 1000
print(img1_justcopy.strides, img1_justcopy.flags)
print("justcopy milliseconds:", elapsed)

print("-------------")

start = time.perf_counter()
img1 = img1.copy()
img1 = cv2.cvtColor(img1, cv2.COLOR_RGB2BGR) # flip RGB -> BGR
elapsed = (time.perf_counter() - start) * 1000
print(img1.strides, img1.flags)
print("copy+cvtColor milliseconds:", elapsed)

print("-------------")

start = time.perf_counter()
img2 = cv2.cvtColor(img2, cv2.COLOR_RGB2BGR) # flip RGB -> BGR
elapsed = (time.perf_counter() - start) * 1000
print(img2.strides, img2.flags)
print("cvtColor milliseconds:", elapsed)

Output:

(1920, 1, 2073600)   C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

(1920, 1, 2073600)   C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

justcopy milliseconds: 9.122899999999989
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

copy+cvtColor milliseconds: 12.177900000000019
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

cvtColor milliseconds: 11.461500000000013

These examples are all on my 1920x1080 screen, so they're not directly comparable to the 4K resolution times we saw in earlier benchmarks.

Anyway, what we can see here is, first of all, that the two captured images (img1 and img2) coming straight from the D3DShot library have very strange strides values, and C_CONTIGUOUS : False. That's because they are raw RAM handed to D3DShot by Windows and then simply packaged into an ndarray with custom strides that read the raw RAM data in the desired order.

Next, we see that just doing img1_justcopy = img1.copy() (which copies the RGB-channeled, non-contiguous RAM into new, contiguous RAM, but does not change the channel order (the image will still be RGB)), takes 9.12 ms, which is indeed how slow Numpy is at copying non-contiguous ndarray data into new, contiguous RAM. Basically, internally, Numpy has to do a ton of looping to read the data byte-by-byte while writing each byte into the correct order in the new, contiguous RAM.
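
If you want to reproduce that kind of measurement without D3DShot, a rough stand-in (exact numbers will vary with your machine and with the source array's stride pattern) is to time Numpy copying a channel-flipped 1080p view:

import numpy as np
import timeit

x = np.zeros([1080, 1920, 3], np.uint8)
v = x[..., ::-1]  # non-contiguous view, standing in for a D3DShot frame
ms = timeit.timeit(v.copy, number=20) / 20 * 1000
print("non-contiguous -> contiguous copy:", ms, "ms")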

So, the PyArray (Numpy) copying of non-contiguous to contiguous is always the slowest operation. That's why we want to avoid having non-contiguous RAM.

Alright, we also demonstrated how to "copy AND fix the colors from RGB to BGR" in two different ways. Doing img1 = img1.copy(); img1 = cv2.cvtColor(img1, cv2.COLOR_RGB2BGR) takes 12.18 ms, and letting cvtColor trigger the Numpy .copy internally via directly calling img2 = cv2.cvtColor(img2, cv2.COLOR_RGB2BGR) takes 11.46 ms. The reason for the slight difference is of course that two separate function calls involve slightly more work than letting OpenCV do the Numpy copying inside its single call.

In both cases, a PyArray (Numpy) copy operation happens internally, to give us a straight, contiguous RAM location. And then we pass that fixed, contiguous ndarray to cvtColor which fixes the color channel order.

That gives you the following guidelines:

  • If your Numpy data is non-contiguous but is already in the correct channel order (you don't want to convert RGB to/from BGR, etc): Use img = img.copy() to force Numpy to make a contiguous copy of the data, which is then usable in all OpenCV calls without any bugs and without causing any slow internal, temporary copying.
  • If your Numpy data is non-contiguous and you also want to change the channel order: Use img = cv2.cvtColor(img, cv2.COLOR_<your conversion choice>), which will internally do the .copy slightly more efficiently than if you had used two separate Python statements.

Both techniques will result in giving you fast, contiguous RAM, in the color arrangement of your choice!
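
Putting both guidelines in one place, here's a sketch of a frame-fixing helper (the function name and the is_rgb flag are ours, purely illustrative):

import cv2
import numpy as np

def fix_frame(img, is_rgb):
    if is_rgb:
        # One call: cvtColor copies non-contiguous input internally AND
        # swaps the channel order, producing fresh contiguous BGR data.
        return cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
    # Channel order already correct: just force a contiguous copy.
    # (No-op if the data is already contiguous.)
    return np.ascontiguousarray(img)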

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They lead to SUBTLE BUGS and they are EXTREMELY SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL (at 4K image resolution), since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45 ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.48 ms as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some Python library to capture a 4K RGB screenshot as a numpy array, and you need to use that data with OpenCV. So you think you're clever and write img = img[...,::-1] to "turn the RGB data into BGR (which OpenCV needs)", thinking "Wow, my code is so fast! That RGB-to-BGR operation only took 0.000237 ms!"... And then you call five different OpenCV functions to analyze that screenshot in various ways. Well, since you're causing one internal Numpy copy-conversion-to-contiguous PER CALL, you're now paying 5 * 39.45 = 197.25 ms of total conversion overhead, just to get your "stupid" Numpy view into proper contiguous memory.

Does it still sound "slow" to just do a single, one-time conversion via cv2.cvtColor() (5.48 ms at 4K, about 3.06 ms at 1080p)? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is almost always necessary, depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = x[:, :, [2, 1, 0]]. Speed: 12.6 msec per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.48 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call
  • Always Contiguous: Yes. Method: r = x[..., 0].copy(); x[..., 0] = x[..., 2]; x[..., 2] = r. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 2]] = x[:, :, [2, 0]]. Speed: 21.7 msec per call
  • Always Contiguous: Yes. Method: x[..., [0, 2]] = x[..., [2, 0]]. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]. Speed: 33.1 msec per call
  • Always Contiguous: Yes. Method: x[:, :] = x[:, :, [2, 1, 0]]. Speed: 49.3 msec per call
  • Always Contiguous: Yes. Method: foo = x.copy(). Speed: 11.8 msec per call (This example doesn't change the RGB/BGR channel order, and is just included here as a reference, to show how slow Numpy is at doing a super simple copy of an already-contiguous chunk of RAM. As you can see, even when the data is already in the proper order, Numpy is very slow... And if "x" had been non-contiguous here, it would be even slower, as shown in the x = x[...,::-1].copy() (equivalent to saying bar = x[...,::-1]; foo = bar.copy()) example near the top of the list, which took 37.5 msec and demonstrates Numpy copying non-contiguous RAM (from numpy "views" marked as "read in reverse order" via "stride = -1") into contiguous RAM...

PS: Whenever we want contiguous data from numpy, we're mostly using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact same thing (it copies too, "but only when the Numpy data isn't already contiguous") and requires much more typing. ;-) And in a few of the examples we're using special indexing (such as x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]) to overwrite the memory directly, which always creates contiguous memory with correct "strides", and is faster than telling Numpy to do a .copy(), but is still extremely slow compared to cv2.cvtColor().

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.48 milliseconds, which is 23798 times slower!! And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not. (If illegal type, it marks the data as "needs copy" and "needs cast").
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, it does this horrible thing: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L367-L374
  • As you can see, it calls PyArray_Cast() (if casting was needed too) or PyArray_GETCONTIGUOUS() (if we only need to make sure the data is contiguous). Both of those functions, no matter which one is called, then generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory, and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, ie. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says Use the existing RAM data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS / PyArray_Cast to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Demonstration of the Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use cv2.imshow here, but any OpenCV API call will always be doing the same "Python to OpenCV" conversions of the numpy data, so the exact API doesn't matter. They will all have this overhead.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = cv2.namedWindow("", cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguos (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the extra time added to the OpenCV calls when copy-conversion is needed (39.45 ms), is pretty much the same as when you call Numpy's own img.copy() on a "flipped view" inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

Numpy "tricks" will cause subtle Bugs too!

Using those Numpy "tricks" isn't just extremely slow. It will cause very subtle bugs in your code, too.

Look at this code and see if you can figure out the bug yourself before you run this example:

import cv2
import numpy as np

img1 = np.zeros([200,200,3], np.uint8) # Create a 200x200 image. (Is Contiguous)
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data. (A Non-Contiguous View)

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

cv2.rectangle(img2, (80,80), (120,120), (255,255,255), 2)

cv2.imshow("", img2)

What do you think the result will be when running this program? Logically, you expect to see black image with a white rectangle in the middle... But instead, you see nothing except a black image. Why?

Well, it's simple... think about what was explained earlier about how PyOpenCV converts every incoming numpy.ndarray object into an internal C++ cv::Mat object. In this example, we're giving a non-contiguous ndarray as an argument to cv2.rectangle(), which causes PyOpenCV to "fix" the data by making a temporary, internal, contiguous .copy() of the image data, and then it wraps the copy's memory address in a cv::Mat. Next, it passes that cv::Mat object to the internal C++ "draw rectangle" function, which dutifully draws a rectangle onto the memory pointed to by the cv::Mat object... which is... the memory of the temporary internal copy of your input array, since a copy had to be created...

So, OpenCV happily writes a rectangle to the temporary object copy. And then when execution returns to Python, you're of course seeing NO RECTANGLE, since nothing was drawn to your actual ndarray data in RAM (since its memory storage was non-contiguous and therefore not usable as-is by OpenCV).

If you want to see what the code above should be doing, simply add img2 = img2.copy() immediately above the cv2.rectangle call, to cause the img2 ndarray object to become contiguous memory so that OpenCV won't need to make a copy of it (and will be able to use that exact object's memory internally, as intended)... After that tweak, you'll see OpenCV properly drawing the rectangle to the image...

This is the kind of subtle bug that is very easy to cause when you're playing around with faked Numpy "views" rather than real contiguous memory.

Bonus: A note about Numpy "slices"

Numpy allows you to efficiently "slice" arrays, to extract a "partial view" of the data. This is very useful for images, since you can do something such as extracting a 100:100 pixel square from the middle of an image. The slicing syntax is img_sliced = img[y1:y2,x1:x2]. This generates a full Numpy object which points at the data of the original image (they share each other's memory), but which only points at the sub-range you wanted.

So it basically becomes a fully usable "Numpy array" object which you can use in any context you would pass an image. Such as to an OpenCV function, which would then only operate on the sliced segment of RAM. That's really useful!

However, be aware that the Numpy slices inherit the strides and contiguous flag of the original object / data they were sliced from! So if you're slicing from a non-contiguous array, you'll generate a non-contiguous slice object too, which is horrible and has all the issues of non-contiguous objects.

It's only safe to make partial views/slices (like img[0:100, 0:100]) when img itself is already PROVEN to be FULLY contiguous (with no "Numpy tricks" applied to it). In that case, feel free to pass your contiguous, partial image slices to OpenCV functions. You won't invoke any copy-mechanics in that case!

Bonus: What to do when you get a non-contiguous ndarray from a library?

As an example, the very cool D3DShot library has an optional numpy mode where it retrieves the screenshots as ndarray objects. The problem is that it generates them from RAM data laid out in a different order, so it tweaks the ndarray strides etc to give us an object of the proper "shape" (height, width, 3 color channels in RGB order). Its .flags property shows that Contiguous is FALSE.

So what do you do? If you try to pass that directly to OpenCV, you'll invoke the heavy PyOpenCV copy-mechanics described earlier.

Well, you have two options. In this example case, the colors are in RGB order, and you want them to be BGR for usage in OpenCV. So you should be invoking cv2.cvtColor which internally will trigger the Numpy .copy() for you (just like all OpenCV APIs do when given non-contiguous data), and then changes the color order in RAM for you.

The second option is when you have Numpy data that is already in the correct color order (such as BGR), but whose RAM is non-contiguous. In that case, you should directly invoke img = img.copy() to tell Numpy to make a contiguous copy of the array, to fix it. Then you're welcome to use that contiguous copy for everything.

Alright, so let's look at the D3DShot example:

import cv2
import d3dshot
import time

d = d3dshot.create(capture_output="numpy", frame_buffer_size=60)

img1 = d.screenshot()
img2 = d.screenshot()

print(img1.strides, img1.flags)
print(img2.strides, img2.flags)

print("-------------")

start = time.perf_counter()
img1_justcopy = img1.copy() # copy RGB image to new, contiguous RAM
elapsed = (time.perf_counter() - start) * 1000
print(img1_justcopy.strides, img1_justcopy.flags)
print("justcopy milliseconds:", elapsed)

print("-------------")

start = time.perf_counter()
img1 = img1.copy()
img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2RGB) # flip RGB -> BGR
elapsed = (time.perf_counter() - start) * 1000
print(img1.strides, img1.flags)
print("copy+cvtColor milliseconds:", elapsed)

print("-------------")

start = time.perf_counter()
img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2RGB) # flip RGB -> BGR
elapsed = (time.perf_counter() - start) * 1000
print(img2.strides, img2.flags)
print("cvtColor milliseconds:", elapsed)

Output:

(1920, 1, 2073600)   C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

(1920, 1, 2073600)   C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

justcopy milliseconds: 9.122899999999989
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

copy+cvtColor milliseconds: 12.177900000000019
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

cvtColor milliseconds: 11.461500000000013

These examples are all on my 1920x1080 screen, so they're not directly comparable to the 4K resolution times we saw in earlier benchmarks.

Anyway, what we can see here, is first of all that the two captured images (img1 and img2) coming straight from the D3DShot library have very strange strides values, and C_CONTIGUOUS : False. That's because they are raw RAM given to D3DShot by Windows and then just packaged into a ndarray with custom strides to make it read the raw RAM data in the desired order.

Next, we see that just doing img1_justcopy = img1.copy() (which copies the RGB-channeled, non-contiguous RAM into new, contiguous RAM, but does not change the channel order (the image will still be RGB)), takes 9.12 ms, which is indeed how slow Numpy is at copying non-contiguous ndarray data into new, contiguous RAM. Basically, internally, Numpy has to do a ton of looping to read the data byte-by-byte while writing each byte into the correct order in the new, contiguous RAM.

So, the PyArray (Numpy) copying of non-contiguous to contiguous is always the slowest operation. That's why we want to avoid having non-contiguous RAM.

Alright, we also demonstrated how to make a "copy AND fix the colors from RGB to BGR" in two different ways. Doing img1 = img1.copy(); img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2RGB) takes 11.83 ms, and letting cvtColor trigger the Numpy .copy internally via directly calling img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2RGB) takes 10.61 ms. The reason for the slight difference is of course that there's slightly more work involved when we're doing 2 separate function calls, than when we let OpenCV do the Numpy copying in its single call.

In both cases, a PyArray (Numpy) copy operation happens internally, to give us a straight, contiguous RAM location. And then we pass that fixed, contiguous ndarray to cvtColor which fixes the color channel order.

That gives you the following guidelines:

  • If your Numpy data is non-contiguous but is already in the correct channel order (you don't want to convert RGB to/from BGR, etc): Use img = img.copy() to force Numpy to make a contiguous copy of the data, which is then usable in all OpenCV calls without any bugs and without causing any slow internal, temporary copying.
  • If your Numpy data is non-contiguous and you also want to change the channel order: Use img = cv2.cvtColor(img, cv2.COLOR_<your conversion choice>), which will internally do the .copy slightly more efficiently than if you had used two separate Python statements.

Both techniques will result in giving you fast, contiguous RAM, in the color arrangement of your choice!

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They lead to SUBTLE BUGS and they are EXTREMELY SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL (at 4K imgres), resolution), since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45 ms are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.48 msms @ 4K as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some Python library to capture an RGB 4K screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1] to "turn the RGB data into BGR (which OpenCV needs)", and you're thinking "Wow, my code is so fast! That RGB-to-BGR operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot-image in various ways. Well, since you're causing one internal Numpy copy-conversion-to-contiguous PER CALL, you're now causing 5 * 39.45 = 197.25 ms of total conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.48 msms (@ 4K) / 3.06 ms (@ 1080p) conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is almost always necessary, depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = x[:, :, [2, 1, 0]]. Speed: 12.6 msec per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.48 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call
  • Always Contiguous: Yes. Method: r = x[..., 0].copy(); x[..., 0] = x[..., 2]; x[..., 2] = r. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 2]] = x[:, :, [2, 0]]. Speed: 21.7 msec per call
  • Always Contiguous: Yes. Method: x[..., [0, 2]] = x[..., [2, 0]]. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]. Speed: 33.1 msec per call
  • Always Contiguous: Yes. Method: x[:, :] = x[:, :, [2, 1, 0]]. Speed: 49.3 msec per call
  • Always Contiguous: Yes. Method: foo = x.copy(). Speed: 11.8 msec per call (This example doesn't change the RGB/BGR channel order, and is just included here as a reference, to show how slow Numpy is at doing a super simple copy of an already-contiguous chunk of RAM. As you can see, even when the data is already in the proper order, Numpy is very slow... And if "x" had been non-contiguous here, it would be even slower, as shown in the x = x[...,::-1].copy() (equivalent to saying bar = x[...,::-1]; foo = bar.copy()) example near the top of the list, which took 37.5 msec and demonstrates Numpy copying non-contiguous RAM (from numpy "views" marked as "read in reverse order" via "stride = -1") into contiguous RAM...

PS: Whenever we want contiguous data from numpy, we're mostly using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact same thing (it copies too, "but only when the Numpy data isn't already contiguous") and requires much more typing. ;-) And in a few of the examples we're using special indexing (such as x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]) to overwrite the memory directly, which always creates contiguous memory with correct "strides", and is faster than telling Numpy to do a .copy(), but is still extremely slow compared to cv2.cvtColor().

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.48 milliseconds, which is 23798 times slower!! And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not. (If illegal type, it marks the data as "needs copy" and "needs cast").
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, it does this horrible thing: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L367-L374
  • As you can see, it calls PyArray_Cast() (if casting was needed too) or PyArray_GETCONTIGUOUS() (if we only need to make sure the data is contiguous). Whichever one is called generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory, re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does... (a rough sketch of this whole "needs copy" decision follows right after this list)
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, ie. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says Use the existing RAM data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L415-L416
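
To make that decision logic concrete, here's a rough Python-side approximation of it (my own sketch for illustration, NOT OpenCV's actual code; the real checks live in the cv2.cpp lines linked above):

import numpy as np

def wrapper_needs_copy(arr):
    # Rough sketch of the wrapper's "needs copy" decision (an approximation
    # for illustration; the real logic lives in cv2.cpp, linked above).
    supported = (np.uint8, np.int8, np.uint16, np.int16,
                 np.int32, np.float32, np.float64)  # cv::Mat-compatible dtypes
    if arr.dtype.type not in supported:
        return True  # "needs cast" also implies "needs copy"
    s = arr.strides
    # The innermost dimension must be tightly packed (stride == element size):
    if arr.shape[-1] > 1 and s[-1] != arr.itemsize:
        return True  # e.g. our flipped view, whose channel stride is -1
    # Strides must not increase from one dimension to the next
    # (increasing strides would mean a transposed or flipped view):
    return any(arr.shape[i] > 1 and s[i] < s[i + 1]
               for i in range(arr.ndim - 1))

x = np.zeros([2160, 3840, 3], np.uint8)
print(wrapper_needs_copy(x))             # False: contiguous, usable as-is
print(wrapper_needs_copy(x[..., ::-1]))  # True: every OpenCV call copies this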

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS / PyArray_Cast to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Demonstration of the Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use cv2.imshow here, but any OpenCV API call will always be doing the same "Python to OpenCV" conversions of the numpy data, so the exact API doesn't matter. They will all have this overhead.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = "demo" # Window name. (Note: cv2.namedWindow returns None, so don't assign its result.)
cv2.namedWindow(wnd, cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguous (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the extra time added to the OpenCV calls when copy-conversion is needed (39.45 ms), is pretty much the same as when you call Numpy's own img.copy() on a "flipped view" inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

PS: If we repeat the same test above with 1920x1080 test data instead of 4K test data, we get Extra time taken per OpenCV call when given non-contiguous data (in ms): 9.972125 ms, which means that at the world's most popular image resolution (1080p) you're still adding around 10 milliseconds of overhead to all of your OpenCV calls.

Numpy "tricks" will cause subtle Bugs too!

Using those Numpy "tricks" isn't just extremely slow. It will cause very subtle bugs in your code, too.

Look at this code and see if you can figure out the bug yourself before you run this example:

import cv2
import numpy as np

img1 = np.zeros([200,200,3], np.uint8) # Create a 200x200 image. (Is Contiguous)
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data. (A Non-Contiguous View)

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

cv2.rectangle(img2, (80,80), (120,120), (255,255,255), 2)

cv2.imshow("", img2)

What do you think the result will be when running this program? Logically, you expect to see a black image with a white rectangle in the middle... But instead, you see nothing except a black image. Why?

Well, it's simple... think about what was explained earlier about how PyOpenCV converts every incoming numpy.ndarray object into an internal C++ cv::Mat object. In this example, we're giving a non-contiguous ndarray as an argument to cv2.rectangle(), which causes PyOpenCV to "fix" the data by making a temporary, internal, contiguous .copy() of the image data, and then it wraps the copy's memory address in a cv::Mat. Next, it passes that cv::Mat object to the internal C++ "draw rectangle" function, which dutifully draws a rectangle onto the memory pointed to by the cv::Mat object... which is... the memory of the temporary internal copy of your input array, since a copy had to be created...

So, OpenCV happily writes a rectangle to the temporary object copy. And then when execution returns to Python, you're of course seeing NO RECTANGLE, since nothing was drawn to your actual ndarray data in RAM (since its memory storage was non-contiguous and therefore not usable as-is by OpenCV).

If you want to see what the code above should be doing, simply add img2 = img2.copy() immediately above the cv2.rectangle call, to cause the img2 ndarray object to become contiguous memory so that OpenCV won't need to make a copy of it (and will be able to use that exact object's memory internally, as intended)... After that tweak, you'll see OpenCV properly drawing the rectangle to the image...
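
Concretely, the fixed tail of the example looks like this (the same code with that one-line fix applied, plus a cv2.waitKey so the window actually stays on screen):

img2 = img2.copy() # Becomes contiguous: OpenCV can now draw into this exact buffer.
cv2.rectangle(img2, (80,80), (120,120), (255,255,255), 2)
cv2.imshow("", img2)
cv2.waitKey(0)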

This is the kind of subtle bug that is very easy to cause when you're playing around with faked Numpy "views" rather than real contiguous memory.

Bonus: A note about Numpy "slices"

Numpy allows you to efficiently "slice" arrays, to extract a "partial view" of the data. This is very useful for images, since you can do something such as extracting a 100x100 pixel square from the middle of an image. The slicing syntax is img_sliced = img[y1:y2,x1:x2]. This creates a full Numpy array object which points at the original image's data (they share memory), but only at the sub-range you wanted.

So it basically becomes a fully usable "Numpy array" object which you can use in any context you would pass an image. Such as to an OpenCV function, which would then only operate on the sliced segment of RAM. That's really useful!

However, be aware of how slicing interacts with strides and contiguity. A slice keeps the original array's stride pattern, so if you're slicing from a flipped/non-contiguous view, you'll get a slice with the same problems too, including the copy-on-every-OpenCV-call behavior described above. There's also a subtlety: even a plain rectangular slice of a FULLY contiguous image (like img[0:100, 0:100]) reports C_CONTIGUOUS as False, because each row of the slice is separated by the original image's full row stride.

The good news: the PyOpenCV wrapper accepts that particular layout without copying. As long as the strides are "descending" and the channel stride equals the element size (which is exactly what a rectangular slice of a contiguous image gives you), it wraps the data in a cv::Mat with a row "step", and OpenCV functions then operate directly on the original image's RAM. So feel free to pass slices of PROVEN-contiguous images (with no "Numpy tricks" applied to them) to OpenCV functions. You won't invoke any copy-mechanics in that case!
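
Here's a small sketch that verifies this behavior (assuming a plain, fully contiguous source image):

import cv2
import numpy as np

img = np.zeros([200, 200, 3], np.uint8) # Contiguous source image.
roi = img[50:150, 50:150]               # Rectangular slice: shares img's memory.

print(roi.flags['C_CONTIGUOUS'])        # False: rows are spaced by img's full row stride.
print(roi.strides)                      # (600, 3, 1): descending, channel stride == 1.

cv2.rectangle(roi, (10,10), (90,90), (255,255,255), 2) # Draws straight into img's RAM.
cv2.imshow("", img)                     # The rectangle appears inside img itself.
cv2.waitKey(0)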

Bonus: What to do when you get a non-contiguous ndarray from a library?

As an example, the very cool D3DShot library has an optional numpy mode where it retrieves the screenshots as ndarray objects. The problem is that it generates them from RAM data laid out in a different order, so it tweaks the ndarray strides etc to give us an object of the proper "shape" (height, width, 3 color channels in RGB order). Its .flags property shows that Contiguous is FALSE.

So what do you do? If you try to pass that directly to OpenCV, you'll invoke the heavy PyOpenCV copy-mechanics described earlier.

Well, you have two options. The first option is for when you also want to change the channel order. In this example case, the colors are in RGB order, and you want them to be BGR for usage in OpenCV. So you should invoke cv2.cvtColor, which internally triggers the Numpy .copy() for you (just like all OpenCV APIs do when given non-contiguous data) and then changes the color order in RAM for you.

The second option is for when your Numpy data is already in the correct color order (such as BGR), but its RAM is non-contiguous. In that case, directly invoke img = img.copy() to tell Numpy to make a contiguous copy of the array, which fixes it. Then you're welcome to use that contiguous copy for everything.

Alright, so let's look at the D3DShot example:

import cv2
import d3dshot
import time

d = d3dshot.create(capture_output="numpy", frame_buffer_size=60)

img1 = d.screenshot()
img2 = d.screenshot()

print(img1.strides, img1.flags)
print(img2.strides, img2.flags)

print("-------------")

start = time.perf_counter()
img1_justcopy = img1.copy() # copy RGB image to new, contiguous RAM
elapsed = (time.perf_counter() - start) * 1000
print(img1_justcopy.strides, img1_justcopy.flags)
print("justcopy milliseconds:", elapsed)

print("-------------")

start = time.perf_counter()
img1 = img1.copy()
img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2RGB) # swap channel order (for a plain 3-channel image, BGR2RGB and RGB2BGR perform the exact same swap)
elapsed = (time.perf_counter() - start) * 1000
print(img1.strides, img1.flags)
print("copy+cvtColor milliseconds:", elapsed)

print("-------------")

start = time.perf_counter()
img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2RGB) # swap channel order (same 3-channel swap as above)
elapsed = (time.perf_counter() - start) * 1000
print(img2.strides, img2.flags)
print("cvtColor milliseconds:", elapsed)

Output:

(1920, 1, 2073600)   C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

(1920, 1, 2073600)   C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

justcopy milliseconds: 9.122899999999989
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

copy+cvtColor milliseconds: 12.177900000000019
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

cvtColor milliseconds: 11.461500000000013

These examples are all on my 1920x1080 screen, so they're not directly comparable to the 4K resolution times we saw in earlier benchmarks.

Anyway, what we can see here, first of all, is that the two captured images (img1 and img2) coming straight from the D3DShot library have very strange strides values, and C_CONTIGUOUS : False. That's because they are raw RAM given to D3DShot by Windows and then just packaged into an ndarray with custom strides that make Numpy read the raw RAM data in the desired order. (The strides (1920, 1, 2073600) on a (1080, 1920, 3)-shaped array mean that each color channel lives in its own full-size plane, since 2073600 = 1920 * 1080; the data is planar rather than interleaved.)

Next, we see that just doing img1_justcopy = img1.copy() (which copies the RGB-channeled, non-contiguous RAM into new, contiguous RAM, but does not change the channel order (the image will still be RGB)), takes 9.12 ms, which is indeed how slow Numpy is at copying non-contiguous ndarray data into new, contiguous RAM. Basically, internally, Numpy has to do a ton of looping to read the data byte-by-byte while writing each byte into the correct order in the new, contiguous RAM.

So, the PyArray (Numpy) copying of non-contiguous to contiguous is always the slowest operation. That's why we want to avoid having non-contiguous RAM.

Alright, we also demonstrated how to make a "copy AND fix the colors from RGB to BGR" in two different ways. Doing img1 = img1.copy(); img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2RGB) takes 12.18 ms, and letting cvtColor trigger the Numpy .copy internally via directly calling img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2RGB) takes 11.46 ms. The reason for the slight difference is of course that there's slightly more work involved when we're doing 2 separate function calls, than when we let OpenCV do the Numpy copying in its single call.

In both cases, a PyArray (Numpy) copy operation happens internally, to give us a straight, contiguous RAM location. And then we pass that fixed, contiguous ndarray to cvtColor which fixes the color channel order.

That gives you the following guidelines:

  • If your Numpy data is non-contiguous but is already in the correct channel order (you don't want to convert RGB to/from BGR, etc): Use img = img.copy() to force Numpy to make a contiguous copy of the data, which is then usable in all OpenCV calls without any bugs and without causing any slow internal, temporary copying.
  • If your Numpy data is non-contiguous and you also want to change the channel order: Use img = cv2.cvtColor(img, cv2.COLOR_<your conversion choice>), which will internally do the .copy slightly more efficiently than if you had used two separate Python statements.

Both techniques will result in giving you fast, contiguous RAM, in the color arrangement of your choice!
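
If you want those guidelines wrapped up in one place, here's a tiny helper of my own (just a sketch; the prepare_for_opencv name and its flag are made up for illustration):

import cv2
import numpy as np

def prepare_for_opencv(img, swap_rgb_bgr=False):
    # Sketch of the guidelines above: always returns a contiguous image.
    if swap_rgb_bgr:
        # One call does it all: cvtColor triggers the internal copy of
        # non-contiguous input by itself, then rewrites the channel order.
        return cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
    if img.flags['C_CONTIGUOUS']:
        return img       # Already usable as-is: zero overhead.
    return img.copy()    # Contiguous copy, channel order unchanged.

For the D3DShot case above, that would be frame = prepare_for_opencv(d.screenshot(), swap_rgb_bgr=True).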

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They lead to SUBTLE BUGS and they are EXTREMELY SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL (at 4K resolution), or about 10 milliseconds PER CALL (at 1920x1080), since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45 ms (@ 4K) or 9.97 ms (@ 1920x1080) are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a one-time conversion to the proper format (in about 5.48 ms @ 4K as seen in the benchmarks earlier), using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some Python library to capture an RGB 4K screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1] to "turn the RGB data into BGR (which OpenCV needs)", and you're thinking "Wow, my code is so fast! That RGB-to-BGR operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot-image in various ways. Well, since you're causing one internal Numpy copy-conversion-to-contiguous PER CALL, you're now causing 5 * 39.45 = 197.25 ms of total conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.48 ms (@ 4K) or 1.53 ms (@ 1920x1080) conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is almost always necessary, depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = x[:, :, [2, 1, 0]]. Speed: 12.6 msec per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.48 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call
  • Always Contiguous: Yes. Method: r = x[..., 0].copy(); x[..., 0] = x[..., 2]; x[..., 2] = r. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 2]] = x[:, :, [2, 0]]. Speed: 21.7 msec per call
  • Always Contiguous: Yes. Method: x[..., [0, 2]] = x[..., [2, 0]]. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]. Speed: 33.1 msec per call
  • Always Contiguous: Yes. Method: x[:, :] = x[:, :, [2, 1, 0]]. Speed: 49.3 msec per call
  • Always Contiguous: Yes. Method: foo = x.copy(). Speed: 11.8 msec per call (This example doesn't change the RGB/BGR channel order, and is just included here as a reference, to show how slow Numpy is at doing a super simple copy of an already-contiguous chunk of RAM. As you can see, even when the data is already in the proper order, Numpy is very slow... And if "x" had been non-contiguous here, it would be even slower, as shown in the x = x[...,::-1].copy() (equivalent to saying bar = x[...,::-1]; foo = bar.copy()) example near the top of the list, which took 37.5 msec and demonstrates Numpy copying non-contiguous RAM (from numpy "views" marked as "read in reverse order" via "stride = -1") into contiguous RAM...

PS: Whenever we want contiguous data from numpy, we're mostly using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact same thing (it copies too, "but only when the Numpy data isn't already contiguous") and requires much more typing. ;-) And in a few of the examples we're using special indexing (such as x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]) to overwrite the memory directly, which always creates contiguous memory with correct "strides", and is faster than telling Numpy to do a .copy(), but is still extremely slow compared to cv2.cvtColor().

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.48 milliseconds, which is 23798 times slower!! And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not. (If illegal type, it marks the data as "needs copy" and "needs cast").
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, it does this horrible thing: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L367-L374
  • As you can see, it calls PyArray_Cast() (if casting was needed too) or PyArray_GETCONTIGUOUS() (if we only need to make sure the data is contiguous). Both of those functions, no matter which one is called, then generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory, and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, ie. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says Use the existing RAM data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS / PyArray_Cast to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Demonstration of the Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use cv2.imshow here, but any OpenCV API call will always be doing the same "Python to OpenCV" conversions of the numpy data, so the exact API doesn't matter. They will all have this overhead.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = ""  # Window name. (cv2.namedWindow() returns None, so keep the name ourselves.)
cv2.namedWindow(wnd, cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguous (img1) data does not need conversion,
# so the contiguous runtime is purely the internal work of the
# imshow function. We only want to measure the conversion time for
# non-contiguous data, so we subtract the first image's (contiguous)
# runtime from the non-contiguous runtime.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the extra time added to the OpenCV calls when copy-conversion is needed (39.45 ms), is pretty much the same as when you call Numpy's own img.copy() on a "flipped view" inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

PS: If we repeat the same test above with 1920x1080 test data instead of 4K test data, we get Extra time taken per OpenCV call when given non-contiguous data (in ms): 9.972125 ms which means that at the world's most popular image resolution (1080p) you're still adding around 10 milliseconds of overhead to all of your OpenCV calls.
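For reference, the only change needed for that 1080p variant is the allocation line at the top of the benchmark:

img1 = np.zeros([1080,1920,3], np.uint8) # Create a 1080p image instead of 4K.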

Numpy "tricks" will cause subtle Bugs too!

Using those Numpy "tricks" isn't just extremely slow. It will cause very subtle bugs in your code, too.

Look at this code and see if you can figure out the bug yourself before you run this example:

import cv2
import numpy as np

img1 = np.zeros([200,200,3], np.uint8) # Create a 200x200 image. (Is Contiguous)
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data. (A Non-Contiguous View)

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

cv2.rectangle(img2, (80,80), (120,120), (255,255,255), 2)

cv2.imshow("", img2)
cv2.waitKey(0) # Needed so the window is actually rendered and stays visible.

What do you think the result will be when running this program? Logically, you expect to see a black image with a white rectangle in the middle... But instead, you see nothing except a black image. Why?

Well, it's simple... think about what was explained earlier about how PyOpenCV converts every incoming numpy.ndarray object into an internal C++ cv::Mat object. In this example, we're giving a non-contiguous ndarray as an argument to cv2.rectangle(), which causes PyOpenCV to "fix" the data by making a temporary, internal, contiguous .copy() of the image data, and then it wraps the copy's memory address in a cv::Mat. Next, it passes that cv::Mat object to the internal C++ "draw rectangle" function, which dutifully draws a rectangle onto the memory pointed to by the cv::Mat object... which is... the memory of the temporary internal copy of your input array, since a copy had to be created...

So, OpenCV happily writes a rectangle to the temporary object copy. And then when execution returns to Python, you're of course seeing NO RECTANGLE, since nothing was drawn to your actual ndarray data in RAM (since its memory storage was non-contiguous and therefore not usable as-is by OpenCV).

If you want to see what the code above should be doing, simply add img2 = img2.copy() immediately above the cv2.rectangle call, to cause the img2 ndarray object to become contiguous memory so that OpenCV won't need to make a copy of it (and will be able to use that exact object's memory internally, as intended)... After that tweak, you'll see OpenCV properly drawing the rectangle to the image...
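Here's the fixed example in full (with the added .copy(), plus a cv2.waitKey() so the window actually gets rendered and stays on screen):

import cv2
import numpy as np

img1 = np.zeros([200,200,3], np.uint8) # Create a 200x200 image. (Is Contiguous)
img2 = img1[...,::-1].copy() # The .copy() materializes the flipped view into new, contiguous RAM.

cv2.rectangle(img2, (80,80), (120,120), (255,255,255), 2) # Now draws onto img2's own memory.

cv2.imshow("", img2)
cv2.waitKey(0)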

This is the kind of subtle bug that is very easy to cause when you're playing around with faked Numpy "views" rather than real contiguous memory.

Bonus: A note about Numpy "slices"

Numpy allows you to efficiently "slice" arrays, to extract a "partial view" of the data. This is very useful for images, since you can do something such as extracting a 100x100 pixel square from the middle of an image. The slicing syntax is img_sliced = img[y1:y2,x1:x2]. This generates a full Numpy object which points at the data of the original image (they share memory), but which only covers the sub-range you wanted.

So it basically becomes a fully usable "Numpy array" object which you can use in any context you would pass an image. Such as to an OpenCV function, which would then only operate on the sliced segment of RAM. That's really useful!

However, be aware that Numpy slices inherit the strides of the original object / data they were sliced from! So if you're slicing from a non-contiguous (for example channel-flipped) array, you'll generate a slice with the same bad strides, which is horrible and has all the issues of non-contiguous objects.

It's only safe to make partial views/slices (like img[0:100, 0:100]) when img itself is already PROVEN to be FULLY contiguous (with no "Numpy tricks" applied to it). Note that such a crop will itself report C_CONTIGUOUS as False (its rows aren't adjacent in RAM), but its strides are still descending with the element size innermost, which is exactly the layout the PyOpenCV stride check accepts and that cv::Mat supports natively via its row "step". So feel free to pass your contiguous-parent, partial image slices to OpenCV functions. You won't invoke any copy-mechanics in that case!
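Here's a quick demonstration of those stride rules (the printed values assume a 1920x1080 image):

import numpy as np

img = np.zeros((1080, 1920, 3), np.uint8) # A contiguous 1080p image.
crop = img[0:100, 0:100] # A plain crop: a partial view sharing img's memory.
print(crop.strides)               # (5760, 3, 1): descending, element size innermost -> no copy needed.
print(crop.flags['C_CONTIGUOUS']) # False (rows aren't adjacent in RAM), but still fine for cv::Mat.

bad_crop = img[..., ::-1][0:100, 0:100] # A crop of a channel-flipped view.
print(bad_crop.strides)           # (5760, 3, -1): the inherited -1 stride forces the copy-mechanics.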

Bonus: What to do when you get a non-contiguous ndarray from a library?

As an example, the very cool D3DShot library has an optional numpy mode where it retrieves the screenshots as ndarray objects. The problem is that it generates them from RAM data laid out in a different order, so it tweaks the ndarray strides etc to give us an object of the proper "shape" (height, width, 3 color channels in RGB order). Its .flags property shows that Contiguous is FALSE.

So what do you do? If you try to pass that directly to OpenCV, you'll invoke the heavy PyOpenCV copy-mechanics described earlier.

Well, you have two options. In this example case, the colors are in RGB order and you want them to be BGR for usage in OpenCV. So you should invoke cv2.cvtColor, which internally triggers the Numpy .copy() for you (just like all OpenCV APIs do when given non-contiguous data) and then fixes the color order in RAM for you.

The second option is when you have Numpy data that is already in the correct color order (such as BGR), but whose RAM is non-contiguous. In that case, you should directly invoke img = img.copy() to tell Numpy to make a contiguous copy of the array, to fix it. Then you're welcome to use that contiguous copy for everything.

Alright, so let's look at the D3DShot example:

import cv2
import d3dshot
import time

d = d3dshot.create(capture_output="numpy", frame_buffer_size=60)

img1 = d.screenshot()
img2 = d.screenshot()

print(img1.strides, img1.flags)
print(img2.strides, img2.flags)

print("-------------")

start = time.perf_counter()
img1_justcopy = img1.copy() # copy RGB image to new, contiguous RAM
elapsed = (time.perf_counter() - start) * 1000
print(img1_justcopy.strides, img1_justcopy.flags)
print("justcopy milliseconds:", elapsed)

print("-------------")

start = time.perf_counter()
img1 = img1.copy()
img1 = cv2.cvtColor(img1, cv2.COLOR_RGB2BGR) # flip RGB -> BGR
elapsed = (time.perf_counter() - start) * 1000
print(img1.strides, img1.flags)
print("copy+cvtColor milliseconds:", elapsed)

print("-------------")

start = time.perf_counter()
img2 = cv2.cvtColor(img2, cv2.COLOR_RGB2BGR) # flip RGB -> BGR
elapsed = (time.perf_counter() - start) * 1000
print(img2.strides, img2.flags)
print("cvtColor milliseconds:", elapsed)

Output:

(1920, 1, 2073600)   C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

(1920, 1, 2073600)   C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

justcopy milliseconds: 9.122899999999989
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

copy+cvtColor milliseconds: 12.177900000000019
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

cvtColor milliseconds: 11.461500000000013

These examples are all on my 1920x1080 screen, so they're not directly comparable to the 4K resolution times we saw in earlier benchmarks.

Anyway, what we can see here is, first of all, that the two captured images (img1 and img2) coming straight from the D3DShot library have very strange strides values, and C_CONTIGUOUS : False. That's because they are raw RAM given to D3DShot by Windows, just packaged into an ndarray with custom strides that make Numpy read the raw RAM data in the desired order.

Next, we see that just doing img1_justcopy = img1.copy() (which copies the RGB-channeled, non-contiguous RAM into new, contiguous RAM, but does not change the channel order (the image will still be RGB)), takes 9.12 ms, which is indeed how slow Numpy is at copying non-contiguous ndarray data into new, contiguous RAM. Basically, internally, Numpy has to do a ton of looping to read the data byte-by-byte while writing each byte into the correct order in the new, contiguous RAM.

So, the PyArray (Numpy) copying of non-contiguous to contiguous is always the slowest operation. That's why we want to avoid having non-contiguous RAM.

Alright, we also demonstrated how to make a "copy AND fix the colors from RGB to BGR" in two different ways. Doing img1 = img1.copy(); img1 = cv2.cvtColor(img1, cv2.COLOR_RGB2BGR) takes 12.18 ms, and letting cvtColor trigger the Numpy .copy internally via directly calling img2 = cv2.cvtColor(img2, cv2.COLOR_RGB2BGR) takes 11.46 ms. The reason for the slight difference is of course that there's slightly more work involved in doing 2 separate function calls than in letting OpenCV do the Numpy copying inside its single call.

In both cases, a PyArray (Numpy) copy operation happens internally, to give us a straight, contiguous RAM location. And then we pass that fixed, contiguous ndarray to cvtColor which fixes the color channel order.

That gives you the following guidelines:

  • If your Numpy data is non-contiguous but is already in the correct channel order (you don't want to convert RGB to/from BGR, etc): Use img = img.copy() to force Numpy to make a contiguous copy of the data, which is then usable in all OpenCV calls without any bugs and without causing any slow internal, temporary copying.
  • If your Numpy data is non-contiguous and you also want to change the channel order: Use img = cv2.cvtColor(img, cv2.COLOR_<your conversion choice>), which will internally do the .copy slightly more efficiently than if you had used two separate Python statements.

Both techniques will result in giving you fast, contiguous RAM, in the color arrangement of your choice!
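As a compact recap, here are both guidelines as code (a sketch using a synthetic non-contiguous RGB array as a stand-in for a library's output):

import cv2
import numpy as np

rgb = np.zeros((1080, 1920, 3), np.uint8)[..., ::-1] # Stand-in for a non-contiguous RGB frame.

# Guideline 1: the channel order is already correct, the RAM is just non-contiguous:
fixed = rgb.copy() # One contiguous copy (np.ascontiguousarray(rgb) does the same), reusable everywhere.

# Guideline 2: non-contiguous AND the channel order must change (RGB -> BGR):
bgr = cv2.cvtColor(rgb, cv2.COLOR_RGB2BGR) # Internal copy + channel fix in one call.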

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They lead to SUBTLE BUGS and they are EXTREMELY SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL (at 4K resolution) or 10 milliseconds PER CALL (at 1920x1080 resolution), since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45 ms (@ 4K) or 9.97 ms (@ 1920x1080) are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a super fast one-time conversion to the proper format, using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some Python library to capture an RGB 4K screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1] to "turn the RGB data into BGR (which OpenCV needs)", and you're thinking "Wow, my code is so fast! That RGB-to-BGR operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot-image in various ways. Well, since you're causing one internal Numpy copy-conversion-to-contiguous PER CALL, you're now causing 5 * 39.45 = 197.25 ms of total conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.39 ms (@ 4K) or 1.53 ms (@ 1920x1080) conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is almost always necessary, depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = x[:, :, [2, 1, 0]]. Speed: 12.6 msec per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.48 5.39 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call
  • Always Contiguous: Yes. Method: r = x[..., 0].copy(); x[..., 0] = x[..., 2]; x[..., 2] = r. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 2]] = x[:, :, [2, 0]]. Speed: 21.7 msec per call
  • Always Contiguous: Yes. Method: x[..., [0, 2]] = x[..., [2, 0]]. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]. Speed: 33.1 msec per call
  • Always Contiguous: Yes. Method: x[:, :] = x[:, :, [2, 1, 0]]. Speed: 49.3 msec per call
  • Always Contiguous: Yes. Method: foo = x.copy(). Speed: 11.8 msec per call (This example doesn't change the RGB/BGR channel order, and is just included here as a reference, to show how slow Numpy is at doing a super simple copy of an already-contiguous chunk of RAM. As you can see, even when the data is already in the proper order, Numpy is very slow... And if "x" had been non-contiguous here, it would be even slower, as shown in the x = x[...,::-1].copy() (equivalent to saying bar = x[...,::-1]; foo = bar.copy()) example near the top of the list, which took 37.5 msec and demonstrates Numpy copying non-contiguous RAM (from numpy "views" marked as "read in reverse order" via "stride = -1") into contiguous RAM...

PS: Whenever we want contiguous data from numpy, we're mostly using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact same thing (it copies too, "but only when the Numpy data isn't already contiguous") and requires much more typing. ;-) And in a few of the examples we're using special indexing (such as x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]) to overwrite the memory directly, which always creates contiguous memory with correct "strides", and is faster than telling Numpy to do a .copy(), but is still extremely slow compared to cv2.cvtColor().

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.485.39 milliseconds, which is 23798 times slower!! And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not. (If illegal type, it marks the data as "needs copy" and "needs cast").
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, it does this horrible thing: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L367-L374
  • As you can see, it calls PyArray_Cast() (if casting was needed too) or PyArray_GETCONTIGUOUS() (if we only need to make sure the data is contiguous). Both of those functions, no matter which one is called, then generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory, and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, ie. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says Use the existing RAM data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS / PyArray_Cast to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Demonstration of the Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use cv2.imshow here, but any OpenCV API call will always be doing the same "Python to OpenCV" conversions of the numpy data, so the exact API doesn't matter. They will all have this overhead.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = cv2.namedWindow("", cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguos (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the extra time added to the OpenCV calls when copy-conversion is needed (39.45 ms), is pretty much the same as when you call Numpy's own img.copy() on a "flipped view" inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

PS: If we repeat the same test above with 1920x1080 test data instead of 4K test data, we get Extra time taken per OpenCV call when given non-contiguous data (in ms): 9.972125 ms which means that at the world's most popular image resolution (1080p) you're still adding around 10 milliseconds of overhead to all of your OpenCV calls.

Numpy "tricks" will cause subtle Bugs too!

Using those Numpy "tricks" isn't just extremely slow. It will cause very subtle bugs in your code, too.

Look at this code and see if you can figure out the bug yourself before you run this example:

import cv2
import numpy as np

img1 = np.zeros([200,200,3], np.uint8) # Create a 200x200 image. (Is Contiguous)
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data. (A Non-Contiguous View)

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

cv2.rectangle(img2, (80,80), (120,120), (255,255,255), 2)

cv2.imshow("", img2)

What do you think the result will be when running this program? Logically, you expect to see black image with a white rectangle in the middle... But instead, you see nothing except a black image. Why?

Well, it's simple... think about what was explained earlier about how PyOpenCV converts every incoming numpy.ndarray object into an internal C++ cv::Mat object. In this example, we're giving a non-contiguous ndarray as an argument to cv2.rectangle(), which causes PyOpenCV to "fix" the data by making a temporary, internal, contiguous .copy() of the image data, and then it wraps the copy's memory address in a cv::Mat. Next, it passes that cv::Mat object to the internal C++ "draw rectangle" function, which dutifully draws a rectangle onto the memory pointed to by the cv::Mat object... which is... the memory of the temporary internal copy of your input array, since a copy had to be created...

So, OpenCV happily writes a rectangle to the temporary object copy. And then when execution returns to Python, you're of course seeing NO RECTANGLE, since nothing was drawn to your actual ndarray data in RAM (since its memory storage was non-contiguous and therefore not usable as-is by OpenCV).

If you want to see what the code above should be doing, simply add img2 = img2.copy() immediately above the cv2.rectangle call, to cause the img2 ndarray object to become contiguous memory so that OpenCV won't need to make a copy of it (and will be able to use that exact object's memory internally, as intended)... After that tweak, you'll see OpenCV properly drawing the rectangle to the image...

This is the kind of subtle bug that is very easy to cause when you're playing around with faked Numpy "views" rather than real contiguous memory.

Bonus: A note about Numpy "slices"

Numpy allows you to efficiently "slice" arrays, to extract a "partial view" of the data. This is very useful for images, since you can do something such as extracting a 100:100 pixel square from the middle of an image. The slicing syntax is img_sliced = img[y1:y2,x1:x2]. This generates a full Numpy object which points at the data of the original image (they share each other's memory), but which only points at the sub-range you wanted.

So it basically becomes a fully usable "Numpy array" object which you can use in any context you would pass an image. Such as to an OpenCV function, which would then only operate on the sliced segment of RAM. That's really useful!

However, be aware that the Numpy slices inherit the strides and contiguous flag of the original object / data they were sliced from! So if you're slicing from a non-contiguous array, you'll generate a non-contiguous slice object too, which is horrible and has all the issues of non-contiguous objects.

It's only safe to make partial views/slices (like img[0:100, 0:100]) when img itself is already PROVEN to be FULLY contiguous (with no "Numpy tricks" applied to it). In that case, feel free to pass your contiguous, partial image slices to OpenCV functions. You won't invoke any copy-mechanics in that case!

Bonus: What to do when you get a non-contiguous ndarray from a library?

As an example, the very cool D3DShot library has an optional numpy mode where it retrieves the screenshots as ndarray objects. The problem is that it generates them from RAM data laid out in a different order, so it tweaks the ndarray strides etc to give us an object of the proper "shape" (height, width, 3 color channels in RGB order). Its .flags property shows that Contiguous is FALSE.

So what do you do? If you try to pass that directly to OpenCV, you'll invoke the heavy PyOpenCV copy-mechanics described earlier.

Well, you have two options. In this example case, the colors are in RGB order, and you want them to be BGR for usage in OpenCV. So you should be invoking cv2.cvtColor which internally will trigger the Numpy .copy() for you (just like all OpenCV APIs do when given non-contiguous data), and then changes the color order in RAM for you.

The second option is when you have Numpy data that is already in the correct color order (such as BGR), but whose RAM is non-contiguous. In that case, you should directly invoke img = img.copy() to tell Numpy to make a contiguous copy of the array, to fix it. Then you're welcome to use that contiguous copy for everything.

Alright, so let's look at the D3DShot example:

import cv2
import d3dshot
import time

d = d3dshot.create(capture_output="numpy", frame_buffer_size=60)

img1 = d.screenshot()
img2 = d.screenshot()

print(img1.strides, img1.flags)
print(img2.strides, img2.flags)

print("-------------")

start = time.perf_counter()
img1_justcopy = img1.copy() # copy RGB image to new, contiguous RAM
elapsed = (time.perf_counter() - start) * 1000
print(img1_justcopy.strides, img1_justcopy.flags)
print("justcopy milliseconds:", elapsed)

print("-------------")

start = time.perf_counter()
img1 = img1.copy()
img1 = cv2.cvtColor(img1, cv2.COLOR_RGB2BGR) # flip RGB -> BGR
elapsed = (time.perf_counter() - start) * 1000
print(img1.strides, img1.flags)
print("copy+cvtColor milliseconds:", elapsed)

print("-------------")

start = time.perf_counter()
img2 = cv2.cvtColor(img2, cv2.COLOR_RGB2BGR) # flip RGB -> BGR
elapsed = (time.perf_counter() - start) * 1000
print(img2.strides, img2.flags)
print("cvtColor milliseconds:", elapsed)

Output:

(1920, 1, 2073600)   C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

(1920, 1, 2073600)   C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

justcopy milliseconds: 9.122899999999989
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

copy+cvtColor milliseconds: 12.177900000000019
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

cvtColor milliseconds: 11.461500000000013

These examples are all on my 1920x1080 screen, so they're not directly comparable to the 4K resolution times we saw in earlier benchmarks.

Anyway, what we can see here, is first of all that the two captured images (img1 and img2) coming straight from the D3DShot library have very strange strides values, and C_CONTIGUOUS : False. That's because they are raw RAM given to D3DShot by Windows and then just packaged into a ndarray with custom strides to make it read the raw RAM data in the desired order.

Next, we see that just doing img1_justcopy = img1.copy() (which copies the RGB-channeled, non-contiguous RAM into new, contiguous RAM, but does not change the channel order (the image will still be RGB)), takes 9.12 ms, which is indeed how slow Numpy is at copying non-contiguous ndarray data into new, contiguous RAM. Basically, internally, Numpy has to do a ton of looping to read the data byte-by-byte while writing each byte into the correct order in the new, contiguous RAM.

So, the PyArray (Numpy) copying of non-contiguous to contiguous is always the slowest operation. That's why we want to avoid having non-contiguous RAM.

Alright, we also demonstrated how to make a "copy AND fix the colors from RGB to BGR" in two different ways. Doing img1 = img1.copy(); img1 = cv2.cvtColor(img1, cv2.COLOR_RGB2BGR) takes 11.83 ms, and letting cvtColor trigger the Numpy .copy internally via directly calling img2 = cv2.cvtColor(img2, cv2.COLOR_RGB2BGR) takes 10.61 ms. The reason for the slight difference is of course that there's slightly more work involved when we're doing 2 separate function calls, than when we let OpenCV do the Numpy copying in its single call.

In both cases, a PyArray (Numpy) copy operation happens internally, to give us a straight, contiguous RAM location. And then we pass that fixed, contiguous ndarray to cvtColor which fixes the color channel order.

That gives you the following guidelines:

  • If your Numpy data is non-contiguous but is already in the correct channel order (you don't want to convert RGB to/from BGR, etc): Use img = img.copy() to force Numpy to make a contiguous copy of the data, which is then usable in all OpenCV calls without any bugs and without causing any slow internal, temporary copying.
  • If your Numpy data is non-contiguous and you also want to change the channel order: Use img = cv2.cvtColor(img, cv2.COLOR_<your conversion choice>), which will internally do the .copy slightly more efficiently than if you had used two separate Python statements.

Both techniques will result in giving you fast, contiguous RAM, in the color arrangement of your choice!

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They lead to SUBTLE BUGS and they are EXTREMELY SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL (at 4K resolution) or 10 milliseconds PER CALL (at 1920x1080 resolution), since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45 ms (@ 4K) or 9.97 ms (@ 1920x1080) are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a super fast one-time conversion to the proper format, using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some Python library to capture an RGB 4K screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1] to "turn the RGB data into BGR (which OpenCV needs)", and you're thinking "Wow, my code is so fast! That RGB-to-BGR operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot-image in various ways. Well, since you're causing one internal Numpy copy-conversion-to-contiguous PER CALL, you're now causing 5 * 39.45 = 197.25 ms of total conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.48 5.39 ms (@ 4K) or 1.53 ms (@ 1920x1080) conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".


IMPORTANT: This article is very long. Remember to click the (more) at the bottom of this post to read the whole article!


I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is almost always necessary, depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = x[:, :, [2, 1, 0]]. Speed: 12.6 msec per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.39 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call
  • Always Contiguous: Yes. Method: r = x[..., 0].copy(); x[..., 0] = x[..., 2]; x[..., 2] = r. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 2]] = x[:, :, [2, 0]]. Speed: 21.7 msec per call
  • Always Contiguous: Yes. Method: x[..., [0, 2]] = x[..., [2, 0]]. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]. Speed: 33.1 msec per call
  • Always Contiguous: Yes. Method: x[:, :] = x[:, :, [2, 1, 0]]. Speed: 49.3 msec per call
  • Always Contiguous: Yes. Method: foo = x.copy(). Speed: 11.8 msec per call (This example doesn't change the RGB/BGR channel order, and is just included here as a reference, to show how slow Numpy is at doing a super simple copy of an already-contiguous chunk of RAM. As you can see, even when the data is already in the proper order, Numpy is very slow... And if "x" had been non-contiguous here, it would be even slower, as shown in the x = x[...,::-1].copy() (equivalent to saying bar = x[...,::-1]; foo = bar.copy()) example near the top of the list, which took 37.5 msec and demonstrates Numpy copying non-contiguous RAM (from numpy "views" marked as "read in reverse order" via "stride = -1") into contiguous RAM...

PS: Whenever we want contiguous data from numpy, we're mostly using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also a np.ascontiguousarray(x) API but it does the exact same thing (it copies too, "but only when the Numpy data isn't already contiguous") and requires much more typing. ;-) And in a few of the examples we're using special indexing (such as x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]) to overwrite the memory directly, which always creates contiguous memory with correct "strides", and is faster than telling Numpy to do a .copy(), but is still extremely slow compared to cv2.cvtColor().

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.39 milliseconds, which is 23798 times slower!! And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not. (If illegal type, it marks the data as "needs copy" and "needs cast").
  • Next, it retrieves the "strides" information from the Numpy array, which is those simple numbers such as "-1" which determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, it does this horrible thing: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L367-L374
  • As you can see, it calls PyArray_Cast() (if casting was needed too) or PyArray_GETCONTIGUOUS() (if we only need to make sure the data is contiguous). Both of those functions, no matter which one is called, then generates a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory, and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, ie. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says Use the existing RAM data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS / PyArray_Cast to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!

Demonstration of the Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use cv2.imshow here, but any OpenCV API call will always be doing the same "Python to OpenCV" conversions of the numpy data, so the exact API doesn't matter. They will all have this overhead.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = cv2.namedWindow("", cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguos (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the extra time added to the OpenCV calls when copy-conversion is needed (39.45 ms), is pretty much the same as when you call Numpy's own img.copy() on a "flipped view" inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

PS: If we repeat the same test above with 1920x1080 test data instead of 4K test data, we get Extra time taken per OpenCV call when given non-contiguous data (in ms): 9.972125 ms which means that at the world's most popular image resolution (1080p) you're still adding around 10 milliseconds of overhead to all of your OpenCV calls.
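
For reference, the 1080p run uses the exact same benchmark script as above; only the allocation line changes:

import numpy as np
img1 = np.zeros([1080,1920,3], np.uint8) # Create a 1080p image instead of 4K. (The rest of the script is unchanged.)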

Numpy "tricks" will cause subtle Bugs too!

Using those Numpy "tricks" isn't just extremely slow. It will cause very subtle bugs in your code, too.

Look at this code and see if you can figure out the bug yourself before you run this example:

import cv2
import numpy as np

img1 = np.zeros([200,200,3], np.uint8) # Create a 200x200 image. (Is Contiguous)
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data. (A Non-Contiguous View)

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

cv2.rectangle(img2, (80,80), (120,120), (255,255,255), 2)

cv2.imshow("", img2)

What do you think the result will be when running this program? Logically, you expect to see a black image with a white rectangle in the middle... But instead, you see nothing except a black image. Why?

Well, it's simple... think about what was explained earlier about how PyOpenCV converts every incoming numpy.ndarray object into an internal C++ cv::Mat object. In this example, we're giving a non-contiguous ndarray as an argument to cv2.rectangle(), which causes PyOpenCV to "fix" the data by making a temporary, internal, contiguous .copy() of the image data, and then it wraps the copy's memory address in a cv::Mat. Next, it passes that cv::Mat object to the internal C++ "draw rectangle" function, which dutifully draws a rectangle onto the memory pointed to by the cv::Mat object... which is... the memory of the temporary internal copy of your input array, since a copy had to be created...

So, OpenCV happily writes a rectangle to the temporary object copy. And then when execution returns to Python, you're of course seeing NO RECTANGLE, since nothing was drawn to your actual ndarray data in RAM (since its memory storage was non-contiguous and therefore not usable as-is by OpenCV).

If you want to see what the code above should be doing, simply add img2 = img2.copy() immediately above the cv2.rectangle call, to cause the img2 ndarray object to become contiguous memory so that OpenCV won't need to make a copy of it (and will be able to use that exact object's memory internally, as intended)... After that tweak, you'll see OpenCV properly drawing the rectangle to the image...
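
For clarity, here's the fixed version of the example in full (the only change is the added .copy() line, plus a waitKey so the window renders):

import cv2
import numpy as np

img1 = np.zeros([200,200,3], np.uint8) # Create a 200x200 image. (Is Contiguous)
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data. (A Non-Contiguous View)
img2 = img2.copy() # THE FIX: make img2 real, contiguous memory that OpenCV can use directly.

cv2.rectangle(img2, (80,80), (120,120), (255,255,255), 2)

cv2.imshow("", img2)
cv2.waitKey(0)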

This is the kind of subtle bug that is very easy to cause when you're playing around with faked Numpy "views" rather than real contiguous memory.

Bonus: A note about Numpy "slices"

Numpy allows you to efficiently "slice" arrays, to extract a "partial view" of the data. This is very useful for images, since you can do something such as extracting a 100x100 pixel square from the middle of an image. The slicing syntax is img_sliced = img[y1:y2,x1:x2]. This generates a full Numpy object which points at the data of the original image (they share the same memory), but which only points at the sub-range you wanted. Therefore it's super fast (since the slice is just a small object that points at, and says how to interpret, a small range of the original array's data).

So it basically becomes a fully usable "Numpy array" object which you can use in any context where you would pass an image, such as to an OpenCV function, which would then only operate on the sliced segment of RAM. That's really useful!

However, be aware that Numpy slices inherit the strides of the original object/data they were sliced from! So if you're slicing from a non-contiguous array, you'll generate a non-contiguous slice object too, which is horrible and has all the issues of non-contiguous objects.

It's only safe to make partial views/slices (like img[0:100, 0:100]) when img itself is PROVEN to be fully contiguous (with no "Numpy tricks" applied to it). One subtlety: such a 2D crop reports C_CONTIGUOUS as False too (its rows are no longer adjacent in RAM), but its strides are still positive and each pixel row still has the normal layout, which PyOpenCV can map directly onto cv::Mat's per-row "step". So feel free to pass partial image slices of contiguous images to OpenCV functions; you won't invoke any copy-mechanics in that case!
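
Here's a minimal sketch of our own (the coordinates and sizes are made up) showing both cases. Drawing on a slice of a contiguous image lands in the original RAM, while drawing on a slice of a flipped view vanishes into a temporary internal copy:

import cv2
import numpy as np

img = np.zeros((200, 200, 3), np.uint8)   # Contiguous base image.

good_roi = img[50:150, 50:150]            # A slice of CONTIGUOUS data.
cv2.rectangle(good_roi, (10, 10), (40, 40), (255, 255, 255), 2)
print((img == 255).any())                 # True: the drawing hit the original RAM.

flipped = img[..., ::-1]                  # Non-contiguous "channel flipped" view...
bad_roi = flipped[50:150, 50:150]         # ...and its slice inherits the bad strides.
cv2.rectangle(bad_roi, (10, 10), (40, 40), (128, 128, 128), 2)
print((img == 128).any())                 # False: the drawing went into a temporary copy.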

Bonus: What to do when you get a non-contiguous ndarray from a library?

As an example, the very cool D3DShot library has an optional numpy mode where it retrieves the screenshots as ndarray objects. The problem is that it generates them from RAM data laid out in a different order, so it tweaks the ndarray strides etc. to give us an object of the proper "shape" (height, width, 3 color channels in RGB order). Its .flags property shows that C_CONTIGUOUS is FALSE.

So what do you do? If you try to pass that directly to OpenCV, you'll invoke the heavy PyOpenCV copy-mechanics described earlier.

Well, you have two options. The first option is for when the colors are also in the wrong order: in this example case, the colors are in RGB order and you want them to be BGR for usage in OpenCV. So you should invoke cv2.cvtColor, which internally triggers the Numpy .copy() for you (just like all OpenCV APIs do when given non-contiguous data) and then changes the color order in RAM for you.

The second option is for when you have Numpy data that is already in the correct color order (such as BGR), but whose RAM is non-contiguous. In that case, you should directly invoke img = img.copy() to tell Numpy to make a contiguous copy of the array, to fix it. Then you're welcome to use that contiguous copy for everything. Also note that you can use img = np.ascontiguousarray(img) instead, if you're not sure whether your library always returns non-contiguous data; this method automatically returns the same array if it was already contiguous, or does a .copy() if it was non-contiguous.
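
As a quick illustration of the np.ascontiguousarray() behavior (we simulate the non-contiguous frame here, so no capture library is needed):

import numpy as np

# Simulate the kind of non-contiguous frame a capture library might hand you:
frame = np.zeros((1080, 1920, 3), np.uint8)[..., ::-1]
print(frame.flags['C_CONTIGUOUS']) # False

frame = np.ascontiguousarray(frame) # Copies, because it has to.
print(frame.flags['C_CONTIGUOUS']) # True

again = np.ascontiguousarray(frame) # Already contiguous: no copy at all...
print(again is frame) # True (the very same object comes straight back).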

Alright, so let's look at the D3DShot example:

import cv2
import d3dshot
import time

d = d3dshot.create(capture_output="numpy", frame_buffer_size=60)

img1 = d.screenshot()
img2 = d.screenshot()

print(img1.strides, img1.flags)
print(img2.strides, img2.flags)

print("-------------")

start = time.perf_counter()
img1_justcopy = img1.copy() # copy RGB image to new, contiguous RAM
elapsed = (time.perf_counter() - start) * 1000
print(img1_justcopy.strides, img1_justcopy.flags)
print("justcopy milliseconds:", elapsed)

print("-------------")

start = time.perf_counter()
img1 = img1.copy()
img1 = cv2.cvtColor(img1, cv2.COLOR_RGB2BGR) # flip RGB -> BGR
elapsed = (time.perf_counter() - start) * 1000
print(img1.strides, img1.flags)
print("copy+cvtColor milliseconds:", elapsed)

print("-------------")

start = time.perf_counter()
img2 = cv2.cvtColor(img2, cv2.COLOR_RGB2BGR) # flip RGB -> BGR
elapsed = (time.perf_counter() - start) * 1000
print(img2.strides, img2.flags)
print("cvtColor milliseconds:", elapsed)

Output:

(1920, 1, 2073600)   C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

(1920, 1, 2073600)   C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

justcopy milliseconds: 9.122899999999989
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

copy+cvtColor milliseconds: 12.177900000000019
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

cvtColor milliseconds: 11.461500000000013

These examples are all on my 1920x1080 screen, so they're not directly comparable to the 4K resolution times we saw in earlier benchmarks.

Anyway, what we can see here, first of all, is that the two captured images (img1 and img2) coming straight from the D3DShot library have very strange strides values, and C_CONTIGUOUS : False. That's because they are raw RAM given to D3DShot by Windows, just packaged into an ndarray with custom strides that make Numpy read the raw RAM data in the desired order.

Next, we see that just doing img1_justcopy = img1.copy() (which copies the RGB-channeled, non-contiguous RAM into new, contiguous RAM, but does not change the channel order (the image will still be RGB)), takes 9.12 ms, which is indeed how slow Numpy is at copying non-contiguous ndarray data into new, contiguous RAM. Basically, internally, Numpy has to do a ton of looping to read the data byte-by-byte while writing each byte into the correct order in the new, contiguous RAM.

So, the PyArray (Numpy) copying of non-contiguous to contiguous is always the slowest operation. That's why we want to avoid having non-contiguous RAM.

Alright, we also demonstrated how to "copy AND fix the colors from RGB to BGR" in two different ways. Doing img1 = img1.copy(); img1 = cv2.cvtColor(img1, cv2.COLOR_RGB2BGR) takes 12.18 ms, and letting cvtColor trigger the Numpy .copy internally, by directly calling img2 = cv2.cvtColor(img2, cv2.COLOR_RGB2BGR), takes 11.46 ms. The reason for the slight difference is of course that there's slightly more work involved when we're doing 2 separate function calls than when we let OpenCV do the Numpy copying inside its single call.

In both cases, a PyArray (Numpy) copy operation happens to give us a straight, contiguous RAM location (either our explicit .copy(), or the one PyOpenCV performs internally), and then cvtColor fixes the color channel order.

That gives you the following guidelines:

  • If your Numpy data is always non-contiguous but is already in the correct channel order (you don't want to convert RGB to/from BGR, etc): Use img = img.copy() to force Numpy to make a contiguous copy of the data, which is then usable in all OpenCV calls without any bugs and without causing any slow internal, temporary copying.
  • If your Numpy data is SOMETIMES non-contiguous but is already in the correct channel order: Use img = np.ascontiguousarray(img), which automatically copies the array to make it contiguous if necessary, or otherwise returns the exact same array (if it was already contiguous).
  • If your Numpy data is non-contiguous and you also want to change the channel order: Use img = cv2.cvtColor(img, cv2.COLOR_<your conversion choice>), which will internally do the .copy slightly more efficiently than if you had used two separate Python statements.

All three of these techniques will result in fast, contiguous RAM, in the color arrangement of your choice!
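
If you want to fold those guidelines into a single place, a small helper along these lines works (a sketch of our own; the function name and parameter are made up for illustration):

import cv2
import numpy as np

def as_opencv_image(img, convert=None):
    # convert: an optional cv2.COLOR_* code (e.g. cv2.COLOR_RGB2BGR) if the
    # channel order also needs to change; None if the order is already correct.
    if convert is not None:
        # cvtColor handles the internal contiguous copy AND the reordering in one call:
        return cv2.cvtColor(img, convert)
    # Channel order is fine; just guarantee contiguous RAM (copies only if needed):
    return np.ascontiguousarray(img)

# Example usage with a simulated non-contiguous RGB frame:
rgb_view = np.zeros((1080, 1920, 3), np.uint8)[..., ::-1]
bgr = as_opencv_image(rgb_view, cv2.COLOR_RGB2BGR)
print(bgr.flags['C_CONTIGUOUS']) # True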

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They lead to SUBTLE BUGS and they are EXTREMELY SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL (at 4K resolution) or 10 milliseconds PER CALL (at 1920x1080 resolution), since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45 ms (@ 4K) or 9.97 ms (@ 1920x1080) are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a super fast one-time conversion to the proper format, using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some Python library to capture an RGB 4K screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1] to "turn the RGB data into BGR (which OpenCV needs)", and you're thinking "Wow, my code is so fast! That RGB-to-BGR operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot-image in various ways. Well, since you're causing one internal Numpy copy-conversion-to-contiguous PER CALL, you're now causing 5 * 39.45 = 197.25 ms of total conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.39 ms (@ 4K) or 1.53 ms (@ 1920x1080) conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)


It's only safe to make partial views/slices (like img[0:100, 0:100]) when img itself is already PROVEN to be FULLY contiguous (with no "Numpy tricks" applied to it). In that case, feel free to pass your contiguous, partial image slices to OpenCV functions. You won't invoke any copy-mechanics in that case!

Alternatively, if you already have a non-contiguous image array and you want to slice it, it's faster to slice first and then make the slice contiguous, since that means less data copying (for example, a 100x100 slice of a 4K image will need much less copying "to make contiguous" than the whole image would have needed). By slicing first and then making a contiguous copy of the slice, you will ensure that your slice is contiguous/safe to use with OpenCV. As an example, let's say that xyz is a non-contiguous image; in that case, the technique would look as follows: slice = xyz[0:100, 0:100].copy() (create a non-contiguous slice "view" of a non-contiguous image, and then force that to become copied which creates a new contiguous array based on the slice's view). Alternatively, if you don't know if the image that you're slicing from is already contiguous or not, then you can use slice = np.ascontiguousarray(xyz[0:100, 0:100]) (creates a slice "view", and then instantly uses that fast view as-is if already contiguous, else copies the data to a new contiguous array and returns that instead).

Bonus: What to do when you get a non-contiguous ndarray from a library?

As an example, the very cool D3DShot library has an optional numpy mode where it retrieves the screenshots as ndarray objects. The problem is that it generates them from RAM data laid out in a different order, so it tweaks the ndarray strides etc to give us an object of the proper "shape" (height, width, 3 color channels in RGB order). Its .flags property shows that Contiguous is FALSE.

So what do you do? If you try to pass that directly to OpenCV, you'll invoke the heavy PyOpenCV copy-mechanics described earlier.

Well, you have two options. In this example case, the colors are in RGB order, and you want them to be BGR for usage in OpenCV. So you should be invoking cv2.cvtColor which internally will trigger the Numpy .copy() for you (just like all OpenCV APIs do when given non-contiguous data), and then changes the color order in RAM for you.

The second option is when you have Numpy data that is already in the correct color order (such as BGR), but whose RAM is non-contiguous. In that case, you should directly invoke img = img.copy() to tell Numpy to make a contiguous copy of the array, to fix it. Then you're welcome to use that contiguous copy for everything. Also note that you can use img = np.ascontiguousarray(img) instead, if you're not sure if your library always returns non-contiguous data; this method automatically returns the same array if it was already contiguous, or does a .copy if it was non-contiguous.

Alright, so let's look at the D3DShot example:

import cv2
import d3dshot
import time

d = d3dshot.create(capture_output="numpy", frame_buffer_size=60)

img1 = d.screenshot()
img2 = d.screenshot()

print(img1.strides, img1.flags)
print(img2.strides, img2.flags)

print("-------------")

start = time.perf_counter()
img1_justcopy = img1.copy() # copy RGB image to new, contiguous RAM
elapsed = (time.perf_counter() - start) * 1000
print(img1_justcopy.strides, img1_justcopy.flags)
print("justcopy milliseconds:", elapsed)

print("-------------")

start = time.perf_counter()
img1 = img1.copy()
img1 = cv2.cvtColor(img1, cv2.COLOR_RGB2BGR) # flip RGB -> BGR
elapsed = (time.perf_counter() - start) * 1000
print(img1.strides, img1.flags)
print("copy+cvtColor milliseconds:", elapsed)

print("-------------")

start = time.perf_counter()
img2 = cv2.cvtColor(img2, cv2.COLOR_RGB2BGR) # flip RGB -> BGR
elapsed = (time.perf_counter() - start) * 1000
print(img2.strides, img2.flags)
print("cvtColor milliseconds:", elapsed)

Output:

(1920, 1, 2073600)   C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

(1920, 1, 2073600)   C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

justcopy milliseconds: 9.122899999999989
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

copy+cvtColor milliseconds: 12.177900000000019
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

cvtColor milliseconds: 11.461500000000013

These examples are all on my 1920x1080 screen, so they're not directly comparable to the 4K resolution times we saw in earlier benchmarks.

Anyway, what we can see here, is first of all that the two captured images (img1 and img2) coming straight from the D3DShot library have very strange strides values, and C_CONTIGUOUS : False. That's because they are raw RAM given to D3DShot by Windows and then just packaged into a ndarray with custom strides to make it read the raw RAM data in the desired order.

Next, we see that just doing img1_justcopy = img1.copy() (which copies the RGB-channeled, non-contiguous RAM into new, contiguous RAM, but does not change the channel order (the image will still be RGB)), takes 9.12 ms, which is indeed how slow Numpy is at copying non-contiguous ndarray data into new, contiguous RAM. Basically, internally, Numpy has to do a ton of looping to read the data byte-by-byte while writing each byte into the correct order in the new, contiguous RAM.

So, the PyArray (Numpy) copying of non-contiguous to contiguous is always the slowest operation. That's why we want to avoid having non-contiguous RAM.

Alright, we also demonstrated how to make a "copy AND fix the colors from RGB to BGR" in two different ways. Doing img1 = img1.copy(); img1 = cv2.cvtColor(img1, cv2.COLOR_RGB2BGR) takes 11.83 ms, and letting cvtColor trigger the Numpy .copy internally via directly calling img2 = cv2.cvtColor(img2, cv2.COLOR_RGB2BGR) takes 10.61 ms. The reason for the slight difference is of course that there's slightly more work involved when we're doing 2 separate function calls, than when we let OpenCV do the Numpy copying in its single call.

In both cases, a PyArray (Numpy) copy operation happens internally, to give us a straight, contiguous RAM location. And then we pass that fixed, contiguous ndarray to cvtColor which fixes the color channel order.

That gives you the following guidelines:

  • If your Numpy data is always non-contiguous but is already in the correct channel order (you don't want to convert RGB to/from BGR, etc): Use img = img.copy() to force Numpy to make a contiguous copy of the data, which is then usable in all OpenCV calls without any bugs and without causing any slow internal, temporary copying.
  • If your Numpy data is SOMETIMES non-contiguous but is already in the correct channel order: Use img = np.ascontiguousarray(img), which automatically copies the array to make it contiguous if necessary, or otherwise returns the exact same array (if it was already contiguous).
  • If your Numpy data is non-contiguous and you also want to change the channel order: Use img = cv2.cvtColor(img, cv2.COLOR_<your conversion choice>), which will internally do the .copy slightly more efficiently than if you had used two separate Python statements.

Both techniques will result in giving you fast, contiguous RAM, in the color arrangement of your choice!

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They lead to SUBTLE BUGS and they are EXTREMELY SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL (at 4K resolution) or 10 milliseconds PER CALL (at 1920x1080 resolution), since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45 ms (@ 4K) or 9.97 ms (@ 1920x1080) are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a super fast one-time conversion to the proper format, using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some Python library to capture an RGB 4K screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1] to "turn the RGB data into BGR (which OpenCV needs)", and you're thinking "Wow, my code is so fast! That RGB-to-BGR operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot-image in various ways. Well, since you're causing one internal Numpy copy-conversion-to-contiguous PER CALL, you're now causing 5 * 39.45 = 197.25 ms of total conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.39 ms (@ 4K) or 1.53 ms (@ 1920x1080) conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".




I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!

Converting RGB to BGR, and vice versa, is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc. And each imaging library depends on their own special channel orders.

There are many ways to achieve the conversion, and cv2.cvtColor() is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you attempt to convert colors in OpenCV, you actually invoke a huge piece of machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25 https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally, OpenCV creates an "OpenCL Kernel" with the instructions for the data transformation, and then runs it. This creates brand new (re-arranged) image data in memory, which is of course a pretty slow operation, involving new memory allocation and data-copying.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: Using numpy's built-in methods for manipulating the array data.

Note that there are two ways to manipulate data in Numpy:

  • One of the ways, the bad way, just changes the "view" of the Numpy array and is therefore instant (O(1)), but does NOT transform the underlying img.data in RAM/memory. This means that the raw memory does NOT contain the new channel order, and Numpy instead "fakes" it by creating a "view" that simply says "when we read this data from RAM, view it as R=B, G=G, B=R" basically... (Technically speaking, it changes the ".strides" property of the Numpy object, which instead of saying "read R then G then B" (stride "1" aka going forwards in RAM when reading the color channels) changes it to say "read B, then G, then R" (stride "-1" aka going backwards in RAM when reading the color channels)).
  • The second way, which is totally fine, is to always ensure that we arrange the pixel data properly in memory too, which is a lot slower but is almost always necessary, depending on what library/API your data is intended to be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, you can look at the img.flags['C_CONTIGUOUS'] value. If True it means that the data in RAM is in the correct order (that's great!). If False it means that the data in RAM is in the wrong order and that we are "cheating" via a numpy View instead (that's BAD!).

Whenever you use the "View-based" methods to flip channels in an ndarray (such as RGB -> BGR), its C_CONTIGUOUS becomes False. If you then flip the image's channels again (such as BGR -> back to RGB), its C_CONTIGUOUS becomes True again. So, the "view" is able to be transformed multiple times, and the "Contiguous" flag only says True whenever the view happens to match the actual RAM data's layout.
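You can verify that flag behavior yourself with a tiny standalone check:

import numpy as np

img = np.zeros((4, 4, 3), np.uint8)              # contiguous RAM
flipped = img[..., ::-1]                         # "view trick": nothing is copied
print(flipped.flags['C_CONTIGUOUS'])             # False: the view no longer matches RAM
print(flipped[..., ::-1].flags['C_CONTIGUOUS'])  # True: double-flip matches RAM again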

So... in what situations do you need the data to ALWAYS be contiguous? Well, it varies based on API...

  • OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY of the BADLY FORMATTED image you gave it, and then converts the color channel order, and THEN finally passes the COPIED-AND-CONTIGUOUS image to the internal OpenCV C++ function. This is of course very wasteful!
  • Matplotlib APIs do not need contiguous data, because they have stride-handling code. But all of their calls are slowed down if given non-contiguous data.
  • Other libraries: Depends on the library. Some of them do something like "take the img.data memory address and give it to a raw Windows API via a COM call" in which case YES the RAM-data MUST be contiguous too.

What type of data do YOU need?

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always make CONTIGUOUS pixel data. It doesn't take long to do the conversion up-front, since we're still talking about very fast operations!

There are probably situations where non-contiguous data is fine, such as if you are doing all image manipulations purely in Numpy math without any library APIs (in which case there's no real reason to convert the data layout to contiguous in RAM). But as soon as you invoke various library APIs, you should pretty much always have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results).

I'll explain those performance issues further down, but first we'll look at the various "conversion techniques" people use in Python.

Techniques

Without further ado, here are all the ways that people use in Python whenever they want to convert back/forth between RGB and BGR. These benchmarks are on a 4K image (3840x2160):

  • Always Contiguous: No. Method: x = x[...,::-1]. Speed: 237 nsec (aka 0.237 usec aka 0.000237 msec) per call
  • Always Contiguous: Yes. Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call
  • Always Contiguous: No. Method: x = x[:, :, [2, 1, 0]]. Speed: 12.6 msec per call
  • Always Contiguous: Yes. Method: x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR). Speed: 5.39 msec per call
  • Always Contiguous: No. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape). Speed: 1.62 usec (aka 0.00162 msec) per call
  • Always Contiguous: Yes. Method: x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy(). Speed: 37.4 msec per call
  • Always Contiguous: No. Method: x = np.flip(x, axis=2). Speed: 2.74 usec (aka 0.00274 msec) per call
  • Always Contiguous: Yes. Method: x = np.flip(x, axis=2).copy(). Speed: 37.5 msec per call
  • Always Contiguous: Yes. Method: r = x[..., 0].copy(); x[..., 0] = x[..., 2]; x[..., 2] = r. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 2]] = x[:, :, [2, 0]]. Speed: 21.7 msec per call
  • Always Contiguous: Yes. Method: x[..., [0, 2]] = x[..., [2, 0]]. Speed: 21.8 msec per call
  • Always Contiguous: Yes. Method: x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]. Speed: 33.1 msec per call
  • Always Contiguous: Yes. Method: x[:, :] = x[:, :, [2, 1, 0]]. Speed: 49.3 msec per call
  • Always Contiguous: Yes. Method: foo = x.copy(). Speed: 11.8 msec per call (This example doesn't change the RGB/BGR channel order, and is just included here as a reference, to show how slow Numpy is at doing a super simple copy of an already-contiguous chunk of RAM. As you can see, even when the data is already in the proper order, Numpy is very slow... And if "x" had been non-contiguous here, it would be even slower, as shown in the x = x[...,::-1].copy() (equivalent to saying bar = x[...,::-1]; foo = bar.copy()) example near the top of the list, which took 37.5 msec and demonstrates Numpy copying non-contiguous RAM (from numpy "views" marked as "read in reverse order" via "stride = -1") into contiguous RAM...)

PS: Whenever we want contiguous data from numpy, we're mostly using x.copy() to tell Numpy to allocate new RAM and copy all data to it in the correct (contiguous) order. There's also an np.ascontiguousarray(x) API, but it does nearly the same thing (it copies too, just only when the Numpy data isn't already contiguous) and requires much more typing. ;-) And in a few of the examples we're using special indexing (such as x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]) to overwrite the memory directly, which keeps the memory contiguous with correct "strides" and is faster than telling Numpy to do a .copy(), but is still extremely slow compared to cv2.cvtColor().
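For instance, the in-place fancy-indexing swap from the list above rewrites the bytes directly, so the array stays contiguous the whole time:

import numpy as np

x = np.zeros((2160, 3840, 3), np.uint8)    # a contiguous 4K image
x[..., [0, 2]] = x[..., [2, 0]]            # swap channels 0 and 2 in RAM itself
print(x.strides, x.flags['C_CONTIGUOUS'])  # (11520, 3, 1) True: still contiguous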

Docs for the various Numpy functions: copy, ascontiguousarray, fliplr, flip, reshape

Here's the benchmark that was used:

python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"

Replace the "ALGORITHM HERE" part with the algorithm above, such as "x = np.flip(x, axis=2).copy()".

People's Misunderstandings of those Benchmarks

Alright, so we're finally getting to the whole purpose of this article!

When people see the benchmarks above, they usually think "Oh my god, x = x[...,::-1] executes in 0.000237 milliseconds, and x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR) executes in 5.39 milliseconds, which is roughly 22,700 times slower!!" And then they decide to always use "Numpy view manipulation" to do their channel conversions.

That's a huge mistake. And here's why:

  • When you call an OpenCV API from Python, and pass it a numpy.ndarray image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use ndarray internally; it uses cv::Mat).
  • First, your Python object (which is coming from the PyOpenCV module) goes into the appropriate pyopencv_to() function, whose purpose is to convert raw Python objects (such as numbers, strings, ndarray, etc), into something usable by OpenCV internally in C++.
  • Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
  • That code, in turn, looks at the object and determines if it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L292
  • Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to determine "do we need to copy the data or can we use it as-is? do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L300
  • It first does some simple checks to see if the number-type in the Numpy array is legal or not. (If illegal type, it marks the data as "needs copy" and "needs cast").
  • Next, it retrieves the "strides" information from the Numpy array, which are the simple numbers (such as "-1") that determine how to read a numpy array (such as backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L341-L342
  • Then it analyzes the Strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by doing those so-called "fast" Numpy view manipulations), then it marks the data as "needs copy": https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L344-L357
  • Next, if "needs copy" is true, it does this horrible thing: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L367-L374
  • As you can see, it calls PyArray_Cast() (if casting was needed too) or PyArray_GETCONTIGUOUS() (if we only need to make sure the data is contiguous). Both of those functions, no matter which one is called, generate a brand-new Python Numpy Object, with all data COPIED by Numpy into brand-new memory and re-arranged into proper Contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
  • Finally, the code proceeds to create a cv::Mat object whose data pointer points at the internal byte-array (RAM) of the Numpy object, ie. the RAM address that you can easily see in Python by typing img.data. That's an incredibly fast operation because it is just a pointer which says Use the existing RAM data owned by Numpy at RAM address XYZ: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L415-L416
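Paraphrased in Python, the wrapper's decision boils down to roughly this (a simplified sketch of the C++ logic above, not an actual OpenCV API):

import numpy as np

def pyopencv_needs_copy(arr):
    # Simplified mirror of the wrapper's checks: unsupported dtypes force a
    # cast+copy, and a non-contiguous stride layout forces PyArray_GETCONTIGUOUS.
    supported = (np.uint8, np.int8, np.uint16, np.int16, np.int32, np.float32, np.float64)
    if arr.dtype not in supported:
        return True                            # "needs cast" path
    return not arr.flags['C_CONTIGUOUS']       # "needs copy" path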

So, can you spot the problem yet?

When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine! Just give its RAM address to cv::Mat and voila!".

But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, then you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY VERY VERY VERY SLOW.

In other words, if you've used those dumb Numpy "view" manipulation tricks, EVERY CALL TO OPENCV APIS IS CAUSING a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), a lot of math inside PyArray_GETCONTIGUOUS / PyArray_Cast to create that new object while respecting your tweaked "strides", etc.

Your code won't be faster at all. It will be SLOWER!
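If you've already ended up with such a view, the cheapest escape is to pay the copy cost once, explicitly, instead of hiding that copy inside every OpenCV call. A minimal sketch, using a stand-in array:

import numpy as np

frame = np.zeros((2160, 3840, 3), np.uint8)[..., ::-1]  # stand-in for a "view-tricked" image
frame = np.ascontiguousarray(frame)  # pay the copy cost ONCE, explicitly
print(frame.flags['C_CONTIGUOUS'])   # True: every later cv2.* call can wrap this RAM as-is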

Demonstration of the Slowness

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use cv2.imshow here, but any OpenCV API call will always be doing the same "Python to OpenCV" conversions of the numpy data, so the exact API doesn't matter. They will all have this overhead.

Here's the example code:

import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

wnd = "demo" # Window name. (cv2.namedWindow returns None, so we must keep the name ourselves.)
cv2.namedWindow(wnd, cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguous (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.

noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall

print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")

The results:

img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms

As you can see, the extra time added to the OpenCV calls when copy-conversion is needed (39.45 ms), is pretty much the same as when you call Numpy's own img.copy() on a "flipped view" inside Python itself (as seen in the earlier benchmark for "Method: x = x[...,::-1].copy(). Speed: 37.5 msec per call").

So yes, every time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy .copy() to happen internally!

PS: If we repeat the same test above with 1920x1080 test data instead of 4K test data, we get: Extra time taken per OpenCV call when given non-contiguous data (in ms): 9.972125 ms. That means that at the world's most popular image resolution (1080p), you're still adding around 10 milliseconds of overhead to all of your OpenCV calls.

Numpy "tricks" will cause subtle Bugs too!

Using those Numpy "tricks" isn't just extremely slow. It will cause very subtle bugs in your code, too.

Look at this code and see if you can figure out the bug yourself before you run this example:

import cv2
import numpy as np

img1 = np.zeros([200,200,3], np.uint8) # Create a 200x200 image. (Is Contiguous)
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data. (A Non-Contiguous View)

print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

cv2.rectangle(img2, (80,80), (120,120), (255,255,255), 2)

cv2.imshow("", img2)

What do you think the result will be when running this program? Logically, you expect to see a black image with a white rectangle in the middle... But instead, you see nothing except a black image. Why?

Well, it's simple... think about what was explained earlier about how PyOpenCV converts every incoming numpy.ndarray object into an internal C++ cv::Mat object. In this example, we're giving a non-contiguous ndarray as an argument to cv2.rectangle(), which causes PyOpenCV to "fix" the data by making a temporary, internal, contiguous .copy() of the image data, and then it wraps the copy's memory address in a cv::Mat. Next, it passes that cv::Mat object to the internal C++ "draw rectangle" function, which dutifully draws a rectangle onto the memory pointed to by the cv::Mat object... which is... the memory of the temporary internal copy of your input array, since a copy had to be created...

So, OpenCV happily writes a rectangle to the temporary object copy. And then when execution returns to Python, you're of course seeing NO RECTANGLE, since nothing was drawn to your actual ndarray data in RAM (since its memory storage was non-contiguous and therefore not usable as-is by OpenCV).

If you want to see what the code above should be doing, simply add img2 = img2.copy() immediately above the cv2.rectangle call, to cause the img2 ndarray object to become contiguous memory so that OpenCV won't need to make a copy of it (and will be able to use that exact object's memory internally, as intended)... After that tweak, you'll see OpenCV properly drawing the rectangle to the image...
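In other words, the corrected tail of the example looks like this:

img2 = img2.copy()   # materialize the view into contiguous RAM that img2 itself owns
cv2.rectangle(img2, (80,80), (120,120), (255,255,255), 2)
cv2.imshow("", img2) # the white rectangle is visible now
cv2.waitKey(0)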

This is the kind of subtle bug that is very easy to cause when you're playing around with faked Numpy "views" rather than real contiguous memory.

Bonus: A note about Numpy "slices"

Numpy allows you to efficiently "slice" arrays, to extract a "partial view" of the data. This is very useful for images, since you can do something such as extracting a 100x100 pixel square from the middle of an image. The slicing syntax is img_sliced = img[y1:y2,x1:x2]. This generates a full Numpy object which points at the data of the original image (they share each other's memory), but which only points at the sub-range you wanted. Therefore it's super fast (since the slice is just a small object that points at and says how to interpret a small range of the original array's data).

So it basically becomes a fully usable "Numpy array" object which you can use in any context you would pass an image. Such as to an OpenCV function, which would then only operate on the sliced segment of RAM. That's really useful!

However, be aware that the Numpy slices inherit the strides and contiguous flag of the original object / data they were sliced from! So if you're slicing from a non-contiguous array, you'll generate a non-contiguous slice object too, which is horrible and has all the issues of non-contiguous objects.

It's only safe to make partial views/slices (like img[0:100, 0:100]) when img itself is already PROVEN to be FULLY contiguous (with no "Numpy tricks" applied to it). In that case, feel free to pass your contiguous, partial image slices to OpenCV functions. You won't invoke any copy-mechanics in that case!

Alternatively, if you already have a non-contiguous image array and you want to slice it, it's faster to slice first and then make the slice contiguous, since that means less data copying (for example, a 100x100 slice of a 4K image will need much less copying "to make contiguous" than the whole image would have needed). By slicing first and then making a contiguous copy of the slice, you will ensure that your slice is contiguous/safe to use with OpenCV. As an example, let's say that xyz is a non-contiguous image; in that case, the technique would look as follows: slice = xyz[0:100, 0:100].copy() (create a non-contiguous slice "view" of a non-contiguous image, and then force that to become copied which creates a new contiguous array based on the slice's view). Alternatively, if you don't know if the image that you're slicing from is already contiguous or not, then you can use slice = np.ascontiguousarray(xyz[0:100, 0:100]) (creates a slice "view", and then instantly uses that fast view as-is if already contiguous, else copies the data to a new contiguous array and returns that instead).
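Here's that technique in runnable form (using the name roi instead of slice, since slice shadows a Python builtin):

import numpy as np

xyz = np.zeros((2160, 3840, 3), np.uint8)[..., ::-1]  # stand-in for a non-contiguous image
roi = np.ascontiguousarray(xyz[0:100, 0:100])  # slice first, then copy just the 100x100 region
print(roi.flags['C_CONTIGUOUS'])               # True: cheap to create, safe for OpenCV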

Bonus: What to do when you get a non-contiguous ndarray from a library?

As an example, the very cool D3DShot library has an optional numpy mode where it retrieves the screenshots as ndarray objects. The problem is that it generates them from RAM data laid out in a different order, so it tweaks the ndarray strides etc to give us an object of the proper "shape" (height, width, 3 color channels in RGB order). Its .flags property shows that Contiguous is FALSE.

So what do you do? If you try to pass that directly to OpenCV, you'll invoke the heavy PyOpenCV copy-mechanics described earlier.

Well, you have two options. The first option is for when the colors are in the wrong order: in this example case, the colors are in RGB order, and you want them to be BGR for usage in OpenCV. So you should be invoking cv2.cvtColor, which internally will trigger the Numpy .copy() for you (just like all OpenCV APIs do when given non-contiguous data), and then changes the color order in RAM for you.

The second option is when you have Numpy data that is already in the correct color order (such as BGR), but whose RAM is non-contiguous. In that case, you should directly invoke img = img.copy() to tell Numpy to make a contiguous copy of the array, to fix it. Then you're welcome to use that contiguous copy for everything. Also note that you can use img = np.ascontiguousarray(img) instead, if you're not sure if your library always returns non-contiguous data; this method automatically returns the same array if it was already contiguous, or does a .copy if it was non-contiguous.

Alright, so let's look at the D3DShot example:

import cv2
import d3dshot
import time

d = d3dshot.create(capture_output="numpy", frame_buffer_size=60)

img1 = d.screenshot()
img2 = d.screenshot()

print(img1.strides, img1.flags)
print(img2.strides, img2.flags)

print("-------------")

start = time.perf_counter()
img1_justcopy = img1.copy() # copy RGB image to new, contiguous RAM
elapsed = (time.perf_counter() - start) * 1000
print(img1_justcopy.strides, img1_justcopy.flags)
print("justcopy milliseconds:", elapsed)

print("-------------")

start = time.perf_counter()
img1 = img1.copy()
img1 = cv2.cvtColor(img1, cv2.COLOR_RGB2BGR) # flip RGB -> BGR
elapsed = (time.perf_counter() - start) * 1000
print(img1.strides, img1.flags)
print("copy+cvtColor milliseconds:", elapsed)

print("-------------")

start = time.perf_counter()
img2 = cv2.cvtColor(img2, cv2.COLOR_RGB2BGR) # flip RGB -> BGR
elapsed = (time.perf_counter() - start) * 1000
print(img2.strides, img2.flags)
print("cvtColor milliseconds:", elapsed)

Output:

(1920, 1, 2073600)   C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

(1920, 1, 2073600)   C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

justcopy milliseconds: 9.122899999999989
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

copy+cvtColor milliseconds: 12.177900000000019
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

cvtColor milliseconds: 11.461500000000013

These examples are all on my 1920x1080 screen, so they're not directly comparable to the 4K resolution times we saw in earlier benchmarks.

Anyway, what we can see here is, first of all, that the two captured images (img1 and img2) coming straight from the D3DShot library have very strange strides values, and C_CONTIGUOUS : False. That's because they are raw RAM handed to D3DShot by Windows and then just packaged into an ndarray with custom strides that make Numpy read the raw RAM data in the desired order.

Next, we see that just doing img1_justcopy = img1.copy() (which copies the RGB-channeled, non-contiguous RAM into new, contiguous RAM, but does not change the channel order (the image will still be RGB)), takes 9.12 ms, which is indeed how slow Numpy is at copying non-contiguous ndarray data into new, contiguous RAM. Basically, internally, Numpy has to do a ton of looping to read the data byte-by-byte while writing each byte into the correct order in the new, contiguous RAM.

So, the PyArray (Numpy) copying of non-contiguous to contiguous is always the slowest operation. That's why we want to avoid having non-contiguous RAM.

Alright, we also demonstrated how to make a "copy AND fix the colors from RGB to BGR" in two different ways. Doing img1 = img1.copy(); img1 = cv2.cvtColor(img1, cv2.COLOR_RGB2BGR) takes 12.18 ms, and letting cvtColor trigger the Numpy .copy internally via directly calling img2 = cv2.cvtColor(img2, cv2.COLOR_RGB2BGR) takes 11.46 ms. The reason for the slight difference is of course that there's slightly more work involved in doing 2 separate function calls than in letting OpenCV do the Numpy copying inside its single call.

In both cases, a PyArray (Numpy) copy operation happens internally, to give us a straight, contiguous RAM location. And then we pass that fixed, contiguous ndarray to cvtColor which fixes the color channel order.

That gives you the following guidelines for dealing with image data from libraries:

  • If your Numpy data is always non-contiguous but is already in the correct channel order (you don't want to convert RGB to/from BGR, etc): Use img = img.copy() to force Numpy to make a contiguous copy of the data, which is then usable in all OpenCV calls without any bugs and without causing any slow internal, temporary copying.
  • If your Numpy data is SOMETIMES non-contiguous but is already in the correct channel order: Use img = np.ascontiguousarray(img), which automatically copies the array to make it contiguous if necessary, or otherwise returns the exact same array (if it was already contiguous).
  • If your Numpy data has the wrong color channel order (and is either contiguous or non-contiguous; it doesn't matter which): Use img = cv2.cvtColor(img, cv2.COLOR_<your conversion choice>), which will internally do the .copy (only if necessary) slightly more efficiently than if you had used two separate Python statements. And it will do the color conversion very rapidly with OpenCL accelerated code.

All of these techniques will result in giving you fast, contiguous RAM, in the color arrangement of your choice!
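As a compact recap, where frame stands in for whatever ndarray a capture library handed you:

import cv2
import numpy as np

frame = np.zeros((1080, 1920, 3), np.uint8)[..., ::-1]  # hypothetical non-contiguous RGB frame

fixed = frame.copy()                            # case 1: known non-contiguous, channel order already correct
fixed = np.ascontiguousarray(frame)             # case 2: maybe non-contiguous, channel order already correct
fixed = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)  # case 3: wrong channel order (contiguity doesn't matter)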

Conclusions

Stop using Numpy view manipulations and "tricks". They are not "cool". They lead to SUBTLE BUGS and they are EXTREMELY SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL (at 4K resolution) or 10 milliseconds PER CALL (at 1920x1080 resolution), since your "cool" data has to be converted by OpenCV internally to PROPER CONTIGUOUS RAM.

Those 39.45 ms (@ 4K) or 9.97 ms (@ 1920x1080) are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if you're (as people often do) passing the image to multiple OpenCV APIs to analyze it in multiple ways, then you are causing extreme slowdowns in your code.

Use cv2.cvtColor() instead, which does a super fast one-time conversion to the proper format, using accelerated OpenCL code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end this by imagining a scenario where you're using some Python library to capture an RGB 4K screenshot as a numpy array, and you need to use that data with OpenCV. So you're thinking you're clever and you write img = img[...,::-1] to "turn the RGB data into BGR (which OpenCV needs)", and you're thinking "Wow, my code is so fast! That RGB-to-BGR operation only took 0.000237 ms!"... And then you're calling five different OpenCV functions to analyze that screenshot-image in various ways. Well, since you're causing one internal Numpy copy-conversion-to-contiguous PER CALL, you're now causing 5 * 39.45 = 197.25 ms of total conversion overhead, just to get your "stupid" Numpy view into a proper contiguous memory stream.

Does it still sound "slow" to just do a single, one-time 5.39 ms (@ 4K) or 1.53 ms (@ 1920x1080) conversion via cv2.cvtColor()? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)