This paper presents the implementation of a 3GPP standards compliant configurable turbo decoder on a GPU. The challenge in implementing a turbo decoder on a GPU is in suitably parallelizing the Log-MAP decoding algorithm and doing an architecture aware mapping of it on to the GPU. The approximations in parallelizing the Log-MAP algorithm come at the cost of reduced BER performance. To mitigate this reduction, different guarding mechanisms of varying computational complexity have been presented. The limited shared memory and registers available on GPUs are carefully allocated to obtain a high real-time decoding rate without requiring several independent data streams in parallel. © 2012 IEEE.