Tuesday, April 26, 2011

8 Final Project Submission

Paper Submission: http://dl.dropbox.com/u/3876381/FinalPaperFan.pdf

Video Submission: http://dl.dropbox.com/u/3876381/Final.wmv

Poster Submission: http://dl.dropbox.com/u/3876381/FinalPoster.pdf

Code Submission: http://dl.dropbox.com/u/3876381/FastHoG.rar

 

I should have implemented window merging, but I am running out of time. However, it is a fairly simple technique and is often done on the CPU anyway.
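For reference, here is a rough sketch of how the window merging could be done on the CPU as a greedy non-maximum suppression over the SVM-positive windows. The Detection struct, the MergeWindows name and the overlap threshold are just illustrative choices, not part of the submitted code.

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical detection record: window position, size and SVM score.
struct Detection { int x, y, w, h; float score; };

static bool ByScore(const Detection &a, const Detection &b) { return a.score > b.score; }

// Intersection-over-union of two windows.
static float Overlap(const Detection &a, const Detection &b)
{
    int x1 = std::max(a.x, b.x),             y1 = std::max(a.y, b.y);
    int x2 = std::min(a.x + a.w, b.x + b.w), y2 = std::min(a.y + a.h, b.y + b.h);
    float inter = (float)std::max(0, x2 - x1) * (float)std::max(0, y2 - y1);
    float uni   = (float)a.w * a.h + (float)b.w * b.h - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

// Greedy merge: keep the highest-scoring window, drop windows that overlap it too much.
std::vector<Detection> MergeWindows(std::vector<Detection> dets, float thresh)
{
    std::sort(dets.begin(), dets.end(), ByScore);
    std::vector<Detection> kept;
    for (size_t i = 0; i < dets.size(); i++)
    {
        bool suppressed = false;
        for (size_t k = 0; k < kept.size(); k++)
            if (Overlap(dets[i], kept[k]) > thresh) { suppressed = true; break; }
        if (!suppressed)
            kept.push_back(dets[i]);
    }
    return kept;   // e.g. MergeWindows(detections, 0.5f)
}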


Saturday, April 23, 2011

7

Trained with different SVM kernels. The SVM parameters affect the overall detection performance noticeably.
The SVM also consumes much more processing time, which I have not been able to optimize so far.
The results for the different SVM parameter settings are summarized below.
[Results table: detection performance for the linear and default SVM kernels under different parameter settings (c and e values)]
According to the reference paper, the authors trained for a few days; as a rule, the more training data, the better.
As the deadline is approaching, it seems impossible for me to apply non-maximum suppression. Instead, I will select the best-performing SVM parameter combination, run error checks, and get all the benchmarks ready.
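For completeness, a minimal sketch of how one kernel/parameter combination can be set up through libsvm's C API; the concrete values here are placeholders, not the settings from the table above.

#include <cstddef>
#include "svm.h"   // libsvm

// Train one model with a given kernel type (LINEAR or RBF), cost C and gamma.
// `prob` is assumed to be an svm_problem already filled with the HOG training vectors.
struct svm_model *TrainWithParams(struct svm_problem *prob, int kernel_type, double C, double gamma)
{
    struct svm_parameter param;
    param.svm_type     = C_SVC;
    param.kernel_type  = kernel_type;
    param.degree       = 3;
    param.gamma        = gamma;      // only used by the RBF kernel
    param.coef0        = 0;
    param.nu           = 0.5;
    param.cache_size   = 100;        // MB
    param.C            = C;
    param.eps          = 1e-3;
    param.p            = 0.1;
    param.shrinking    = 1;
    param.probability  = 0;
    param.nr_weight    = 0;
    param.weight_label = NULL;
    param.weight       = NULL;

    if (svm_check_parameter(prob, &param) != NULL)
        return NULL;                 // invalid kernel/parameter combination
    return svm_train(prob, &param);  // e.g. TrainWithParams(&prob, LINEAR, 1.0, 0.0)
}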

Thursday, April 21, 2011

6

Data generation for SVM training is running.
A previously trained SVM model works fine on a few test images, so I decided to generate a larger data set to boost SVM performance.
However, the data set is 1.5 GB and svm-train from libsvm crashed on it; apparently the system cannot support that much memory use by libsvm, so another half hour was wasted.
I will add SVM prediction directly to the source, but probably on the CPU instead of the GPU.
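For what it's worth, a minimal sketch of what that CPU-side prediction could look like for one window's feature vector; only svm_load_model and svm_predict are libsvm calls, the helper and the dense-to-svm_node conversion are my own.

#include <vector>
#include "svm.h"   // libsvm

// Classify one window's HOG feature vector with a loaded libsvm model.
double ClassifyWindow(const struct svm_model *model, const float *fea, int len)
{
    std::vector<struct svm_node> nodes(len + 1);
    for (int i = 0; i < len; i++)
    {
        nodes[i].index = i + 1;   // libsvm feature indices are 1-based
        nodes[i].value = fea[i];
    }
    nodes[len].index = -1;        // terminator entry
    return svm_predict(model, &nodes[0]);   // e.g. +1 = person, -1 = background
}

// Usage: struct svm_model *m = svm_load_model("hog.model");
//        double label = ClassifyWindow(m, window_features, feature_len);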

Tuesday, April 19, 2011

Monday, April 18, 2011

4

Detection Kernel.

__global__ void Compute_win(float *His_Img, float *Fea_vector)
{
    // Note: these constants change with the detection window size
    __shared__ float cache_his[105][36];

    // Index of this thread's HOG block within its detection window
    unsigned int thread_index = threadIdx.x + __umul24(threadIdx.y, Win_Attr.win_width);
    // Index of the detection window handled by this CUDA block
    unsigned int block_index = blockIdx.x + __umul24(blockIdx.y, gridDim.x);

    unsigned int tid_x = threadIdx.x + blockIdx.x;
    unsigned int tid_y = threadIdx.y + blockIdx.y;

    if (tid_x + 1 < Img_Attr.Image_width && tid_y + 1 < Img_Attr.Image_height
        && threadIdx.x < Win_Attr.win_width && threadIdx.y < Win_Attr.win_height)
    {
        // Cell-histogram indices of the 2x2 cells forming this HOG block
        unsigned int index_0 =  tid_x      +  tid_y      * Img_Attr.Image_width;
        unsigned int index_1 = (tid_x + 1) +  tid_y      * Img_Attr.Image_width;
        unsigned int index_2 =  tid_x      + (tid_y + 1) * Img_Attr.Image_width;
        unsigned int index_3 = (tid_x + 1) + (tid_y + 1) * Img_Attr.Image_width;

        float norm_2 = 0.0f;
        unsigned int j = 0;

        // Gather the K-bin histograms of the four cells into shared memory
        for (int Bin_id = 0; Bin_id < K; Bin_id++)
        {
            cache_his[thread_index][j++] = His_Img[index_0 + Bin_id*Img_Attr.Image_size];
            cache_his[thread_index][j++] = His_Img[index_1 + Bin_id*Img_Attr.Image_size];
            cache_his[thread_index][j++] = His_Img[index_2 + Bin_id*Img_Attr.Image_size];
            cache_his[thread_index][j++] = His_Img[index_3 + Bin_id*Img_Attr.Image_size];
        }

        // L2 norm of the block descriptor
        for (int i = 0; i < K*BLOCK_SIZE*BLOCK_SIZE; i++)
            norm_2 += cache_his[thread_index][i]*cache_his[thread_index][i];
        norm_2 = sqrtf(norm_2);

        // Write the normalized block descriptor into this window's feature vector
        unsigned int index = block_index*Win_Attr.win_width*Win_Attr.win_height*K*BLOCK_SIZE*BLOCK_SIZE
                           + thread_index*K*BLOCK_SIZE*BLOCK_SIZE;
        for (int i = 0; i < K*BLOCK_SIZE*BLOCK_SIZE; i++)
        {
            if (norm_2 >= 0.001f)
                Fea_vector[index] = cache_his[thread_index][i]/norm_2;
            else
                Fea_vector[index] = 0.0f;
            index++;
        }
    }
}

SVM integration seems to be harder than I expected. I have to write extra code to convert the generated HOG features into the data format libsvm requires.
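svm-train reads a plain-text file with one sample per line in the form <label> <index>:<value> ..., so the extra code is mostly formatting. A rough sketch of what I mean (names are illustrative):

#include <cstdio>

// Dump one window's feature vector as a libsvm training line: "<label> 1:v1 2:v2 ...".
void WriteLibsvmLine(FILE *f, int label, const float *fea, int len)
{
    fprintf(f, "%d", label);
    for (int i = 0; i < len; i++)
        if (fea[i] != 0.0f)                        // libsvm accepts sparse lines
            fprintf(f, " %d:%g", i + 1, fea[i]);   // indices start at 1
    fprintf(f, "\n");
}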


The HOG features look good; however, I cannot verify them with complete confidence, even though I tested with some artificial examples.



Saturday, April 16, 2011

3

Block normalization is done, but there are still problems.

1. For one image, each CUDA block handles one detection window and each thread handles a single HOG block (2x2 cells). Without shared memory, that means 4*9*2 global-memory reads per thread for a complete block normalization.

With shared memory, that is 4*9 floats (the feature length of a single block) * 7*15 threads per block * 4 bytes (sizeof(float)) = 15,120 bytes per CUDA block. My total shared memory is roughly 16 KB per block, so it fits, but is it going to be a performance bottleneck?

2. I am not sure which data structure is better for holding the output feature vectors. Ideally it would be two-dimensional, with each row storing the feature vector for one detection window, but I have been struggling with pointers-to-pointers in CUDA. Or should I just put everything into one long array and use offsets instead? (See the sketch after this list.)
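For the second question, a minimal sketch of the one-long-array-with-offsets layout I am leaning towards, using the dimensions assumed above (7x15 blocks per window, 36 floats per block):

// One flat device array of window_count * FEATURE_LEN floats.
const int FLOATS_PER_BLOCK = 4 * 9;                             // 2x2 cells * 9 bins
const int BLOCKS_PER_WIN   = 7 * 15;                            // blocks per detection window
const int FEATURE_LEN      = BLOCKS_PER_WIN * FLOATS_PER_BLOCK; // 3780 floats per window

// Host-side view: pointer to the feature vector of one detection window.
inline float *WindowFeature(float *fea_vector, int window_index)
{
    return fea_vector + window_index * FEATURE_LEN;
}

// Inside the kernel the same layout becomes:
//   index = block_index * FEATURE_LEN + thread_index * FLOATS_PER_BLOCK + i;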

Friday, April 15, 2011

2

Got really stuck recently; fortunately, I finally found the bug.

I wrote a small GPU bilinear-interpolation image-scaling kernel but got some weird results: the output image was randomly shifted left or right by a few pixels for no apparent reason.

At the beginning I thought the mistake was in the kernels. Then I suspected some data-type conversion and precision issue, based on some results that "seemed to be correct", but that was the wrong direction.

The problem actually comes from OpenCV's data matrix alignment (the padded row stride, widthStep). Previously everything "looked perfect" on the other kernels because the input and output were both misaligned in the same way. Anyway...
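For the record, a minimal sketch of how the padded OpenCV rows can be uploaded without causing this shift, assuming a single-channel float IplImage and a tightly packed device buffer; cudaMemcpy2D handles the two different row pitches:

#include <cuda_runtime.h>
#include <opencv/cv.h>

// Copy an IplImage whose rows are padded (widthStep) into a tightly packed device buffer.
void UploadImage(const IplImage *img, float *d_img)
{
    cudaMemcpy2D(d_img,
                 img->width * sizeof(float),   // destination pitch: packed rows
                 img->imageData,
                 img->widthStep,               // source pitch: padded rows
                 img->width * sizeof(float),   // bytes actually copied per row
                 img->height,
                 cudaMemcpyHostToDevice);
}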

The performance seems good so far. My scaling kernel (running on my low-end NVS 3100M) took 0.28 milliseconds, while cvResize from OpenCV took 6.7 milliseconds for a single operation; that is about a 20x speedup.

Admittedly, memory-transfer latency is not taken into account, and cvResize is a higher-level processing function with extra overhead (much like how reading a file through the C++ stream API is slower than raw C FILE I/O), so the comparison is not entirely fair. Still, a better GPU would give much stronger performance.
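For reference, a sketch of one way such a kernel timing can be taken with CUDA events; it measures kernel time only (so transfers stay excluded), and grid, block and scale are assumed to be set up as for a normal launch. DownScale is the scaling kernel listed below.

#include <cuda_runtime.h>

// Time one launch of the scaling kernel with CUDA events.
float TimeDownScale(dim3 grid, dim3 block, float *d_dst, float scale)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    DownScale<<<grid, block>>>(d_dst, scale);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed milliseconds between the two events
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}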


Scaling Kernel

__global__ void DownScale(float *O_data, float scale)
{
    unsigned int tid_x = threadIdx.x + __umul24(blockIdx.x, blockDim.x);
    unsigned int tid_y = threadIdx.y + __umul24(blockIdx.y, blockDim.y);

    /***** Map the output pixel back onto the unscaled image *****/
    unsigned int Pixel_x = (unsigned int)(tid_x*scale);
    unsigned int Pixel_y = (unsigned int)(tid_y*scale);
    float a = tid_x*scale - Pixel_x;   // fractional offset in x
    float b = tid_y*scale - Pixel_y;   // fractional offset in y

    /**** A shared-memory (coalesced) variant turned out to be even slower and was dropped ****/

    /***************** Bilinear interpolation from the source texture *****************/
    if (tid_x < Img_Attr.Image_width && tid_y < Img_Attr.Image_height)
        O_data[tid_x + tid_y*Img_Attr.Image_width] =
              (1.0f - a)*(1.0f - b)*tex2D(Image_tex, Pixel_x,     Pixel_y)
            +  a        *(1.0f - b)*tex2D(Image_tex, Pixel_x + 1, Pixel_y)
            + (1.0f - a)* b        *tex2D(Image_tex, Pixel_x,     Pixel_y + 1)
            +  a        * b        *tex2D(Image_tex, Pixel_x + 1, Pixel_y + 1);
}
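For completeness, a rough sketch of how this kernel could be bound to the texture and launched. Image_tex and Img_Attr are assumed to be declared in the same .cu file (Img_Attr filled through cudaMemcpyToSymbol elsewhere), and the 16x16 block size is just my choice.

// texture<float, 2, cudaReadModeElementType> Image_tex;   // declared at file scope

void RunDownScale(float *d_src, size_t src_pitch, int src_w, int src_h,
                  float *d_dst, int dst_w, int dst_h, float scale)
{
    size_t offset = 0;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaBindTexture2D(&offset, Image_tex, d_src, desc, src_w, src_h, src_pitch);

    dim3 block(16, 16);
    dim3 grid((dst_w + block.x - 1) / block.x, (dst_h + block.y - 1) / block.y);
    DownScale<<<grid, block>>>(d_dst, scale);   // e.g. scale = 1.2f per pyramid level

    cudaUnbindTexture(Image_tex);
}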

Integral Kernel



//*********************Compute the per-cell bin sums (cell histograms)********************/

__global__ void Compute_IntegralBin(float *Bin_Img, float *His_Img)
{
    __shared__ float cache[CELL_SIZE][CELL_SIZE];
    __shared__ float cache_t[CELL_SIZE];

    unsigned int tid_x = threadIdx.x + __umul24(blockIdx.x, blockDim.x);
    unsigned int tid_y = threadIdx.y + __umul24(blockIdx.y, blockDim.y);

    unsigned int cacheId_x = threadIdx.x;
    unsigned int cacheId_y = threadIdx.y;

    if (tid_x + tid_y*Img_Attr.Image_width < Img_Attr.Image_size)
    {
        float sum;
        for (int Bin_id = 0; Bin_id < K; Bin_id++)
        {
            // Load one CELL_SIZE x CELL_SIZE tile of the current bin image
            cache[cacheId_x][cacheId_y] = Bin_Img[tid_x + tid_y*Img_Attr.Image_width + Bin_id*Img_Attr.Image_size];
            __syncthreads();

            /* Version one (full prefix sums inside the cell) removed; only the cell sum is needed. */

            /********* Version 2: reduce along x, each column collapses into cache[0][y] *********/
            int i = CELL_SIZE/2;
            while (i != 0)
            {
                if (cacheId_x < i)
                    cache[cacheId_x][cacheId_y] += cache[cacheId_x + i][cacheId_y];
                __syncthreads();
                i /= 2;
            }

            /********* Reduce the remaining row along y into cache_t[0] *********/
            if (cacheId_x == 0)
                cache_t[cacheId_y] = cache[0][cacheId_y];
            __syncthreads();

            i = CELL_SIZE/2;
            while (i != 0)
            {
                if (cacheId_x == 0 && cacheId_y < i)
                    cache_t[cacheId_y] += cache_t[cacheId_y + i];
                __syncthreads();
                i /= 2;
            }

            // Every thread reads the cell sum (all __syncthreads() are hit by the whole block)
            sum = cache_t[0];
            __syncthreads();

            Bin_Img[tid_x + tid_y*Img_Attr.Image_width + Bin_id*Img_Attr.Image_size] = sum/16;

            //////////////TEST////////////
            // One thread per cell writes this bin's value of the cell histogram
            if (cacheId_x == 0 && cacheId_y == 0)
                His_Img[blockIdx.x + blockIdx.y*gridDim.x + Bin_id*__umul24(gridDim.x, gridDim.y)] = sum/16;
            /////////////////////////////
        }
    }
}


Gradient and Bins



/*******************Compute gradients and assign them to the bin images******************/
__global__ void Compute_GradientBin(float *O_data, float *Bin_Img)
{
    unsigned int tid_x = threadIdx.x + __umul24(blockIdx.x, blockDim.x);
    unsigned int tid_y = threadIdx.y + __umul24(blockIdx.y, blockDim.y);

    float angle;
    float Gradx, Grady;

    //////////////////////////// Read the image through texture memory ////////////////////////////
    if (tid_x < Img_Attr.Image_width && tid_y < Img_Attr.Image_height)
    {
        // Centered differences; zero gradient on the image border
        // (a wider 3-row/3-column difference was also tried but left disabled)
        if (tid_x == 0 || tid_x == Img_Attr.Image_width - 1)
            Gradx = 0.0f;
        else
            Gradx = tex2D(Image_tex, tid_x + 1, tid_y) - tex2D(Image_tex, tid_x - 1, tid_y);

        if (tid_y == 0 || tid_y == Img_Attr.Image_height - 1)
            Grady = 0.0f;
        else
            Grady = tex2D(Image_tex, tid_x, tid_y + 1) - tex2D(Image_tex, tid_x, tid_y - 1);

        ///////////////////////// Angle of the gradient, normalized to [0, 1] /////////////////////////
        if (Gradx != 0)
        {
            angle = atan(Grady/Gradx)/3.14159f >= 0.0f ? atan(Grady/Gradx)/3.14159f
                                                       : atan(Grady/Gradx)/3.14159f + 1.0f;
            if (Gradx < 0.0f && angle == 0.0f)
                angle = 1.0f;
        }
        else
        {
            // Gradx == 0 && Grady == 0 can happen, but it does not matter:
            // the magnitude is zero, so it will not affect the feature.
            angle = 0.5f;
        }

        // Plain gradient-magnitude output
        O_data[tid_x + tid_y*Img_Attr.Image_width] = 0.5f*sqrtf(Grady*Grady + Gradx*Gradx);

        ////////////////// Distribute the magnitude into K orientation bins //////////////////
        angle = angle*K;
        for (int Bin_id = 0; Bin_id < K; Bin_id++)
        {
            Bin_Img[tid_x + tid_y*Img_Attr.Image_width + Bin_id*Img_Attr.Image_size] = 0.0f;
            if (angle <= (float)Bin_id + 1.0f && angle >= (float)Bin_id)
                Bin_Img[tid_x + tid_y*Img_Attr.Image_width + Bin_id*Img_Attr.Image_size] = 0.5f*sqrtf(Gradx*Gradx + Grady*Grady);
        }
    }
}