This paper proposes a novel network architecture for video frame prediction based on Graph Convolutional Neural Networks (GCNNs). Most recent methods fail in situations where multiple nearby objects at different scales move in random directions with variable speeds. We overcome this by modeling the scene as a space-time graph whose vertices are intermediate features of pixels (or local regions) and whose edges encode the relationships among them. Our main contribution lies in posing the frame-generation problem on this space-time graph, which enables the network to learn spatial and temporal inter-pixel relationships independently of each other, making the system invariant to velocity differences among the moving objects in the scene. Moreover, we propose a novel directional attention mechanism for the graph-based model that efficiently learns a significance score based on the directional relationship between pixels in the original scene. We also show that the proposed model generalizes better on the more challenging task of predicting the semantic segmentation of future scenes, even without access to any raw RGB frames. We evaluate with several proxy tasks, such as comparing the quality of the semantic segmentation produced on the generated frames and comparing action-recognition accuracy on datasets of human actions. We use the popular Cityscapes traffic-scene segmentation dataset, as well as UCF-101 and Penn Action, which contain human actions, to evaluate the proposed framework quantitatively and qualitatively against the recent state-of-the-art. © 2019 IEEE.
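To make the described mechanism concrete, the following is a minimal sketch of how a directional attention score over a pixel-node graph could look. It is a hypothetical illustration, not the authors' exact formulation: the function name `directional_attention`, the dot-product content similarity, and the fixed random projection standing in for learned directional weights are all assumptions.

```python
import numpy as np

def directional_attention(features, positions):
    """Hypothetical sketch: significance scores between pixel nodes,
    combining feature similarity with the direction between positions.

    features:  (N, d) node feature vectors
    positions: (N, 2) pixel coordinates of each node
    returns:   (N, N) row-normalized significance scores
    """
    N, d = features.shape
    # Pairwise displacement vectors between node positions.
    disp = positions[None, :, :] - positions[:, None, :]       # (N, N, 2)
    norm = np.linalg.norm(disp, axis=-1, keepdims=True) + 1e-8
    direction = disp / norm                                    # unit directions
    # Content similarity between node features (scaled dot product).
    sim = features @ features.T / np.sqrt(d)                   # (N, N)
    # Fold direction into the score via a projection; a fixed random
    # vector stands in here for weights a real model would learn.
    rng = np.random.default_rng(0)
    w_dir = rng.standard_normal(2)
    dir_score = direction @ w_dir                              # (N, N)
    scores = sim + dir_score
    # Softmax over neighbors yields a significance distribution per node.
    scores = scores - scores.max(axis=1, keepdims=True)
    att = np.exp(scores)
    return att / att.sum(axis=1, keepdims=True)
```

Each row of the returned matrix sums to one, so a node's attention can directly weight messages from its spatio-temporal neighbors during graph convolution.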