Hi,

My test code looks like this:

```
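import time

import cv2
import numpy as np

# Assumed setup (not shown here): `im` is a PIL.Image and
# `imcv` is the same image as an H x W x 3 uint8 NumPy array.
n_test = 1000

t1 = time.time()
for i in range(n_test):
    histogram = im.histogram()  # Pillow: all bands in one call
t2 = time.time()
for i in range(n_test):
    for c in range(3):
        # OpenCV: one call per channel, on a sliced channel
        hist2 = cv2.calcHist([imcv[:, :, c]], [0], None, [256], [0, 256]).reshape(-1)
t3 = time.time()
for i in range(n_test):
    for c in range(3):
        # NumPy: np.histogram returns a (counts, bin_edges) tuple
        hist3 = np.histogram(imcv[:, :, c].ravel(), bins=256, range=(0, 256))
t4 = time.time()

print('pil hist time: {}'.format(t2 - t1))
print('cv2 hist time: {}'.format(t3 - t2))
print('np hist time: {}'.format(t4 - t3))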
```

The result is that the cv2 implementation is around 3x slower than Pillow. Did I make a mistake here? How can I use cv2/numpy to make it as fast as Pillow?
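
One NumPy-only variant that might be worth timing alongside the three loops above is `np.bincount`, which counts integer values directly instead of computing bin edges. The sketch below is only an illustration: `channel_histograms` is a hypothetical helper name, the random `demo` array is a stand-in for `imcv`, and it assumes an H x W x C `uint8` array. I have not measured whether it closes the gap to Pillow.

```python
import numpy as np


def channel_histograms(img):
    """Return one 256-bin histogram per channel of an H x W x C uint8 array.

    Hypothetical helper for illustration; not part of NumPy, OpenCV or Pillow.
    """
    # Flatten the spatial dimensions so each column holds one channel.
    flat = img.reshape(-1, img.shape[2])
    # minlength=256 guarantees a full-length histogram even if
    # the highest pixel values never occur in the image.
    return [np.bincount(flat[:, c], minlength=256) for c in range(flat.shape[1])]


# Synthetic stand-in for `imcv` (assumption: 3-channel uint8 image).
rng = np.random.default_rng(0)
demo = rng.integers(0, 256, size=(64, 48, 3), dtype=np.uint8)
hists = channel_histograms(demo)
```

Each entry of `hists` should match the counts from `np.histogram(..., bins=256, range=(0, 256))[0]` for the same channel, so it can be dropped into the timing loop as a fourth case.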