Media mining has shifted markedly from conventional data mining because of the ever-increasing complexity of web documents. A further dimension is added when the web documents are of Indian origin, since a variety of languages and dialects go into the development of web pages. Web documents in which words from different languages are used, with or without translation, can be termed hybrid documents. A typical Yahoo news page in different languages is an example. Extracting information or content, and eventually knowledge, becomes more involved when words from other languages are used as they are, without translation, such as 'computer' or 'mobile' being used freely in regional languages. The reader or surfer can follow the content easily even though no translation has been done. Such documents are the focus of this study. A statistical description of the features of words in different languages is used as the basis for a correlation analysis that assesses the content of such web documents. As a benchmark study, six words related to education are taken in four languages: English, Tamizh, Telugu and Hindi. Different ways of normalizing within and outside the group provide the base vectors, and any new datum or group of data is checked by correlation against them to assess the probability of its carrying that content. Because the words are in different scripts, they are converted to three-layer pixel map groups so that translation and text-related issues do not affect the mining procedure. Further, as textual data is well structured irrespective of language, this approach of extracting attributes and using them as bases is general and can include text from any language. © 2012 Springer-Verlag.
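
The pipeline outlined in the abstract (pixel maps, normalized base vectors, correlation against new data) can be illustrated with a minimal Python sketch. The abstract does not specify which statistical features or which normalization are used, so everything here is an assumption made for illustration: each word is taken to be a three-layer pixel map given as nested lists, each layer is summarized by its mean and standard deviation, vectors are scaled by their peak absolute value, and Pearson correlation is the similarity measure. The function names `layer_features`, `normalize`, `pearson` and `best_match` are hypothetical.

```python
import math


def layer_features(pixel_map):
    """Return [mean, std] for each layer of a pixel map, concatenated.

    Assumption: pixel_map is a list of three layers, each a list of rows
    of numeric intensities. The actual features used in the study are
    not stated in the abstract.
    """
    feats = []
    for layer in pixel_map:
        values = [v for row in layer for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        feats.extend([mean, math.sqrt(var)])
    return feats


def normalize(vec):
    """Scale a vector by its largest absolute value (one possible choice)."""
    peak = max(abs(v) for v in vec)
    return [v / peak for v in vec] if peak else list(vec)


def pearson(x, y):
    """Pearson correlation of two equal-length vectors; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def best_match(candidate, bases):
    """Return (label, r) for the base vector most correlated with candidate.

    bases maps a label (for example a word or content class) to its
    normalized base vector.
    """
    scored = ((label, pearson(candidate, base)) for label, base in bases.items())
    return max(scored, key=lambda item: item[1])
```

In use, one would build a base vector per benchmark word with `normalize(layer_features(pixel_map))`, then call `best_match` on the vector of a new word image. Because the comparison operates on pixel statistics rather than characters, no translation or script-specific text processing is involved, which is the property the abstract emphasizes.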