My first try would follow the classic object detection pipeline: slide over the image with a fixed window size, compute features for each window, and classify each window as text / non-text (or, if you actually want the images, image / non-image). For features, HOG will probably work well. If you want to detect the images instead, color-based features will probably work better.
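A rough sketch of that pipeline in plain Python, just to make the steps concrete. The function names, the 8-pixel window, the 4-pixel step and the `classifier` callable are all made up for illustration. The feature is a simplified stand-in for HOG (one orientation histogram per window, without the cell/block normalization of real HOG), and the image is assumed to be a grayscale list of rows.

```python
import math


def sliding_windows(height, width, win=8, step=4):
    """Yield the (top, left) corner of every win x win window that fits."""
    for top in range(0, height - win + 1, step):
        for left in range(0, width - win + 1, step):
            yield top, left


def orientation_histogram(image, top, left, win=8, bins=9):
    """Simplified HOG-like feature (not full HOG): a magnitude-weighted
    histogram of unsigned gradient orientations over one window."""
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    # Skip the window border so the central differences stay inside it.
    for y in range(top + 1, top + win - 1):
        for x in range(left + 1, left + win - 1):
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle / bin_width) % bins] += magnitude
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist


def detect(image, classifier, win=8, step=4):
    """Return (top, left, height, width) for every window the classifier accepts.

    `classifier` is a hypothetical callable: features -> bool. In practice
    it would be a model trained on text / non-text windows.
    """
    height, width = len(image), len(image[0])
    boxes = []
    for top, left in sliding_windows(height, width, win, step):
        features = orientation_histogram(image, top, left, win)
        if classifier(features):
            boxes.append((top, left, win, win))
    return boxes
```

In a real setup you would probably replace `orientation_histogram` with a proper HOG implementation (for example OpenCV's `HOGDescriptor` or scikit-image's `hog`), train something like a linear SVM on labeled windows, and run the detector at several window sizes or image scales.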
You could also try to detect the text using the text detection module of OpenCV (see http://docs.opencv.org/trunk/modules/text/doc/text.html), which basically implements the approach described above.
Good luck!