Supply-chain management problems arise in many industries, and rapidly rising production and consumption levels, together with shortening product life cycles, make it increasingly necessary to account for uncertainty when making decisions. In this work, we address the general stochastic supply-chain management problem by formulating it as a non-contextual multi-armed bandit problem and applying policy gradient descent, a reinforcement learning approach, to find a robust policy. The gradient descent is guided by the cost returned by a simulator that models demand, lead times, and other sources of uncertainty. Our experiments demonstrate that this approach finds better solutions than naive worst-case linear programming formulations of such problems. Copyright 2014 ACM.
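
As an illustration of the general idea only, the following minimal sketch applies a policy gradient update to a non-contextual bandit whose arms are candidate order quantities, with the cost supplied by a toy simulator. Everything specific here is an assumption made for the example and is not taken from the paper: the arm set `ORDER_QUANTITIES`, the Gaussian demand model, the holding and shortage cost coefficients, the softmax parameterization of the policy, and the moving-average baseline. Lead times are not modeled.

```python
import math
import random

# Hypothetical arms: candidate order quantities for a single period.
ORDER_QUANTITIES = [0, 20, 40, 60, 80, 100]


def simulate_cost(order_qty, rng):
    """Toy simulator (assumed): Gaussian demand, linear holding and shortage costs."""
    demand = max(0.0, rng.gauss(50.0, 15.0))
    holding = 1.0 * max(0.0, order_qty - demand)   # cost of unsold stock
    shortage = 4.0 * max(0.0, demand - order_qty)  # cost of unmet demand
    return holding + shortage


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(num_steps=20000, learning_rate=0.002, seed=0):
    """Score-function (REINFORCE-style) gradient descent on the simulated cost."""
    rng = random.Random(seed)
    logits = [0.0] * len(ORDER_QUANTITIES)  # start from a uniform policy
    baseline = 0.0                          # moving average of observed cost
    for _ in range(num_steps):
        probs = softmax(logits)
        arm = rng.choices(range(len(probs)), weights=probs)[0]
        cost = simulate_cost(ORDER_QUANTITIES[arm], rng)
        advantage = cost - baseline
        for k in range(len(logits)):
            # d/d logit_k of log pi(arm) for a softmax policy
            grad_log_prob = (1.0 if k == arm else 0.0) - probs[k]
            # Descend, because the simulator returns a cost rather than a reward.
            logits[k] -= learning_rate * advantage * grad_log_prob
        baseline += 0.05 * (cost - baseline)
    return softmax(logits)


def expected_cost(probs, num_samples=2000, seed=1):
    """Monte Carlo estimate of the expected cost of a stochastic policy."""
    rng = random.Random(seed)
    total = 0.0
    for prob, qty in zip(probs, ORDER_QUANTITIES):
        mean_cost = sum(simulate_cost(qty, rng) for _ in range(num_samples)) / num_samples
        total += prob * mean_cost
    return total


if __name__ == "__main__":
    policy = train_policy()
    for qty, prob in zip(ORDER_QUANTITIES, policy):
        print(f"order {qty:3d}: probability {prob:.3f}")
    print("estimated expected cost:", round(expected_cost(policy), 2))
```

Because the policy is trained against sampled costs rather than a single worst-case scenario, it can trade off overstock against shortage in proportion to how often each occurs in the simulator, which is the kind of behavior a worst-case linear program cannot express.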