This paper proposes a feature reconstruction based approach using pixel-graph and Generative Adversarial Networks (GAN) for solving the problem of synthesizing future frames from video scenes. Recent methods of frame synthesis often generate blurry outcomes in case of long-range prediction and scenes involving multiple objects moving at different velocities due to their holistic approach. Our proposed method introduces a novel pixel-graph based context aggregation layer (PixGraph) which efficiently captures long range dependencies. PixGraph incorporates a weighting scheme through which the internal features of each pixel (or a group of neighboring pixels) can be modeled independently of the others, thus handling the issue of separate objects moving in different directions and with very dissimilar speed. We also introduce a novel objective function, the Locally Guided Gram Loss (LGGL), which aides the GAN based model to maximize the similarity between the intermediate features of the ground-truth and the network output by constructing Gram matrices from locally extracted patches over several levels of the generator. Our proposed model is end-to-end trainable and exhibits superior performance compared to the state-of-the-art on four real-world benchmark video datasets. © Springer Nature Switzerland AG 2019.