Block normalization is done, but there are still problems.
1. For one image, each CUDA block handles one detection window and each thread handles a single HOG block (2x2 cells). Without shared memory, each thread performs 4*9*2 = 72 global-memory reads to complete block normalization.
With shared memory, a single CUDA block needs 4*9 (floats in one block's feature vector) * 7*15 (threads per block) * 4 (sizeof(float)) = 15120 bytes. My total amount of shared memory is 16000 bytes, so it's enough for one block, but is it going to be a performance bottleneck?
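As a sanity check on the numbers above (assuming 7x15 = 105 threads per CUDA block, each holding a 36-float descriptor; names are my own), the footprint works out to 15120 bytes, just under the 16000-byte budget:

```c
#include <assert.h>

/* Shared-memory footprint if every thread in a 7x15 thread block
 * keeps its own 4-cells-x-9-bins (36-float) block descriptor.
 * These sizes are the ones quoted in the question, not fixed by HOG. */
enum { CELLS_PER_BLOCK = 4, BINS_PER_CELL = 9,
       THREADS_X = 7, THREADS_Y = 15 };

static int shared_bytes_per_cuda_block(void) {
    int floats_per_descriptor = CELLS_PER_BLOCK * BINS_PER_CELL; /* 36  */
    int threads = THREADS_X * THREADS_Y;                         /* 105 */
    return floats_per_descriptor * threads * (int)sizeof(float);
}
```

Since 15120 of 16000 bytes is used, only one CUDA block can be resident per multiprocessor at a time, which may limit occupancy more than the shared-memory traffic itself.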
2. I am not sure which data structure is better for holding the output feature vectors. Ideally it would be two-dimensional, with each row storing the feature vector for one detection window, but I've been struggling with pointers to pointers in CUDA. Should I just use one long flat array and compute offsets instead?
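The flat-array option is the usual answer: one contiguous buffer indexed as `window * features_per_window + feature`, which avoids pointer-to-pointer allocation entirely and works identically inside a kernel (with `cudaMalloc` on the device side). A minimal host-side sketch, with hypothetical names:

```c
#include <assert.h>
#include <stdlib.h>

/* One contiguous buffer: "row" w holds the feature vector for
 * detection window w, starting at offset w * features_per_window. */
typedef struct {
    float *data;
    int num_windows;
    int features_per_window;
} FeatureMatrix;

static FeatureMatrix fm_alloc(int num_windows, int features_per_window) {
    FeatureMatrix m;
    m.data = calloc((size_t)num_windows * features_per_window, sizeof(float));
    m.num_windows = num_windows;
    m.features_per_window = features_per_window;
    return m;
}

/* Pointer to the start of window w's feature vector. */
static float *fm_row(FeatureMatrix *m, int window) {
    return m->data + (size_t)window * m->features_per_window;
}
```

The same offset arithmetic runs unchanged in device code, and the single `cudaMemcpy` of the whole buffer back to the host is much simpler than copying an array of device pointers.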