LEVEL 2
Optimization Removed - Avoidance of Idle threads
This optimization was removed by enlarging the block size of the column convolution kernel. Each thread therefore loads fewer pixels, and many of the threads are used only for loading and sit idle during processing.
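The effect described above can be illustrated with a minimal sketch. This is not the project's actual kernel: the names convolveColumnsIdleHalo, d_kernel and maxErrorVsCpu, and the sizes KERNEL_RADIUS, TILE_W and TILE_H, are assumptions chosen for illustration. The block is enlarged so that every thread loads exactly one pixel into shared memory, and the threads covering the rows above and below the output tile have nothing to do once loading is finished.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define KERNEL_RADIUS 8                        // assumed filter radius
#define TILE_W 16                              // output columns per block
#define TILE_H 16                              // output rows per block
#define BLOCK_H (TILE_H + 2 * KERNEL_RADIUS)   // enlarged block: one thread per loaded row

__constant__ float d_kernel[2 * KERNEL_RADIUS + 1];

__global__ void convolveColumnsIdleHalo(float *dst, const float *src, int w, int h)
{
    __shared__ float tile[BLOCK_H][TILE_W];

    const int x = blockIdx.x * TILE_W + threadIdx.x;
    // Row loaded by this thread; halo threads fall above or below the output tile.
    const int y = blockIdx.y * TILE_H + (int)threadIdx.y - KERNEL_RADIUS;

    // Every thread loads exactly one pixel (zero outside the image).
    tile[threadIdx.y][threadIdx.x] =
        (x < w && y >= 0 && y < h) ? src[y * w + x] : 0.0f;
    __syncthreads();

    // Halo threads were needed only for the load; they exit here and stay idle.
    const bool interior = threadIdx.y >= KERNEL_RADIUS &&
                          threadIdx.y < KERNEL_RADIUS + TILE_H;
    if (!interior || x >= w || y >= h) return;

    float sum = 0.0f;
    for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; ++k)
        sum += tile[threadIdx.y + k][threadIdx.x] * d_kernel[KERNEL_RADIUS - k];
    dst[y * w + x] = sum;
}

// Runs the kernel on a synthetic image and returns the largest absolute
// difference from a CPU reference (INFINITY if a CUDA call fails).
float maxErrorVsCpu(int w, int h)
{
    const int n = w * h, taps = 2 * KERNEL_RADIUS + 1;
    std::vector<float> src(n), out(n), kern(taps);
    for (int i = 0; i < n; ++i) src[i] = (float)(i % 17);
    for (int i = 0; i < taps; ++i) kern[i] = (i + 1) / 153.0f;  // weights sum to 1

    float *dSrc = nullptr, *dDst = nullptr;
    cudaMalloc(&dSrc, n * sizeof(float));
    cudaMalloc(&dDst, n * sizeof(float));
    cudaMemcpy(dSrc, src.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(d_kernel, kern.data(), taps * sizeof(float));

    dim3 block(TILE_W, BLOCK_H);
    dim3 grid((w + TILE_W - 1) / TILE_W, (h + TILE_H - 1) / TILE_H);
    convolveColumnsIdleHalo<<<grid, block>>>(dDst, dSrc, w, h);
    bool ok = cudaGetLastError() == cudaSuccess;
    ok = ok && cudaMemcpy(out.data(), dDst, n * sizeof(float),
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dSrc);
    cudaFree(dDst);
    if (!ok) return INFINITY;

    float maxErr = 0.0f;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float sum = 0.0f;
            for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; ++k) {
                const int yy = y + k;
                if (yy >= 0 && yy < h)
                    sum += src[yy * w + x] * kern[KERNEL_RADIUS - k];
            }
            maxErr = fmaxf(maxErr, fabsf(sum - out[y * w + x]));
        }
    return maxErr;
}

int main()
{
    printf("max abs error vs CPU: %g\n", maxErrorVsCpu(50, 70));
    printf("threads idle after load: %d of %d per block\n",
           TILE_W * 2 * KERNEL_RADIUS, TILE_W * BLOCK_H);
    return 0;
}
```

With the sizes assumed here, each block has 32 thread rows of which only 16 compute output, so half of the threads in every block do no work after the shared-memory load.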
The table below shows the processing time (ms) for 10 runs of the executable, each producing the convolved image.
The code can be downloaded from the download section.