Many modern task-parallel languages allow the programmer to synchronize tasks using high-level constructs like barriers, clocks, and phasers. While these high-level synchronization primitives help the programmer express the program logic in a convenient manner, they also have their associated overheads. In this paper, we identify the sources of some of these overheads for task-parallel languages like X10 that support lock-step synchronization, and propose a mechanism to reduce these overheads. We first propose three desirable properties that an efficient runtime (for task-parallel languages like X10, HJ, Chapel, and so on) should satisfy, to minimize the overheads during lock-step synchronization. We use these properties to derive a scheme to called uClocks to improve the efficiency of X10 clocks; uClocks consists of an extension to X10 clocks and two related runtime optimizations. We prove that uClocks satisfies the proposed desirable properties. We have implemented uClocks for the X10 language+runtime and show that the resulting system leads to a geometric mean speedup of 5.36× on a 16-core Intel system and 11.39× on a 64-core AMD system, for benchmarks with a significant number of synchronization operations. © 2019 John Wiley & Sons, Ltd.