I thought all cores could be used for loading negatives, but only one core is used for that...

That is not correct. The process of grabbing negatives is determined by the window size and runs completely sequentially, so it will only ever use a single core. The preprocessing of samples, however, has been improved to use multiple cores.

If you want multi-core negative grabbing, you will need to crack open the source code and change it yourself.
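
To illustrate why grabbing negatives in a fixed scan order is hard to spread over cores, here is a minimal Python sketch. It is not the OpenCV implementation; the names `sliding_windows` and `grab_negatives`, the `step` parameter, and the list-of-lists image type are all assumptions made up for this example. The point is only that each grabbed window depends on where the previous one stopped.

```python
from typing import Iterator, List, Tuple

Image = List[List[int]]  # hypothetical grayscale image: rows of pixel values


def sliding_windows(image: Image, win_w: int, win_h: int,
                    step: int) -> Iterator[Tuple[int, int]]:
    """Yield the (x, y) corner of every window, left to right, top to bottom."""
    height = len(image)
    width = len(image[0]) if height else 0
    for y in range(0, height - win_h + 1, step):
        for x in range(0, width - win_w + 1, step):
            yield x, y


def grab_negatives(images: List[Image], win_w: int, win_h: int,
                   step: int, count: int) -> List[Tuple[int, int, int]]:
    """Collect the first `count` windows in scan order across all images.

    Returns (image_index, x, y) triples. The scan position carries over
    from one grabbed window to the next, so the loop is inherently serial.
    """
    grabbed: List[Tuple[int, int, int]] = []
    for index, image in enumerate(images):
        for x, y in sliding_windows(image, win_w, win_h, step):
            grabbed.append((index, x, y))
            if len(grabbed) == count:
                return grabbed
    return grabbed


if __name__ == "__main__":
    # Two dummy 4x3 images, 2x2 window, step 1: six windows per image.
    blank = [[0] * 4 for _ in range(3)]
    print(grab_negatives([blank, blank], win_w=2, win_h=2, step=1, count=8))
```

Making something like this parallel would mean splitting the images or scan positions between workers up front, which changes which negatives end up being selected.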

How can I improve cascade training speed?

Please add your output. There can be so many reasons for slow training...