Turbo decoders use the recursive BCJR algorithm, which is computationally intensive and hard to parallelise. The branch metric and extrinsic log-likelihood ratio computations parallelise easily, but the recursive forward and backward metric computations cannot readily be parallelised without compromising the bit error rate (BER). This paper proposes a lossless parallelisation technique for turbo decoders on Graphics Processing Units (GPUs). The recursive forward and backward metric computation is formulated as a prefix (scan) matrix multiplication problem, which is computed on the GPU using a parallel prefix sum technique. Overall, this method achieves a throughput of 73 Mbps and a latency as low as 61 μs for a 3GPP LTE compliant turbo decoder without any BER loss. © 2018 IEEE.
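
The reformulation of the forward recursion as a prefix scan can be illustrated with a minimal sketch. It assumes probability-domain metrics, so that the forward metric at step k is the row vector alpha_{k-1} multiplied by a branch-metric matrix Gamma_k; the abstract does not state the exact domain or normalisation used, so this is an illustration of the general idea rather than the authors' implementation. All names (`matmul`, `inclusive_scan`, `scan_alphas`, `sequential_alphas`) are hypothetical, and plain Python stands in for a GPU kernel: each round of the scan consists of independent matrix products that a GPU could execute concurrently.

```python
from typing import Callable, List, Sequence

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Ordinary matrix product; associative, which is all the scan requires."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def sequential_alphas(alpha0: Sequence[float], gammas: Sequence[Matrix]) -> List[List[float]]:
    """Reference recursion: alpha_k = alpha_{k-1} * Gamma_k, one step at a time."""
    alphas = []
    row = [list(alpha0)]  # 1 x S row vector
    for gamma in gammas:
        row = matmul(row, gamma)
        alphas.append(row[0])
    return alphas


def inclusive_scan(items: Sequence[Matrix],
                   op: Callable[[Matrix, Matrix], Matrix]) -> List[Matrix]:
    """Hillis-Steele inclusive scan.

    Runs ceil(log2(n)) rounds. Within a round every output element depends
    only on the previous round, so all products in a round are independent.
    The earlier element is always the left operand, which keeps the result
    correct for a non-commutative operator such as matrix multiplication.
    """
    out = list(items)
    step = 1
    while step < len(out):
        prev = out
        out = [prev[i] if i < step else op(prev[i - step], prev[i])
               for i in range(len(prev))]
        step *= 2
    return out


def scan_alphas(alpha0: Sequence[float], gammas: Sequence[Matrix]) -> List[List[float]]:
    """Forward metrics from prefix products: alpha_k = alpha_0 * (Gamma_1 ... Gamma_k)."""
    prefixes = inclusive_scan(gammas, matmul)
    return [matmul([list(alpha0)], p)[0] for p in prefixes]
```

Because matrix multiplication is associative, the scan yields the same metrics as the step-by-step recursion, which is why no approximation is introduced. Under the same assumptions, the backward metrics would follow from a scan over the branch-metric matrices in reverse order.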