Text studies towards multi-lingual content mining for web communication
Anand Rajaraman
Published in
2010
Pages: 28 - 31
Abstract
Communication over the web is becoming increasingly popular thanks to wireless and cellular networks. As web use spreads across countries, significant complexities arise in the languages and means of communication involved in extracting information from the web. This is particularly true in India, where more than fifteen officially recognized languages, and still more local dialects, exist. An example is Tamilnadu, where Tamizh, the native language with its own variations such as the Chennai, Madurai and Coimbatore dialects, is combined easily with Telugu, Kannada and Malayalam from nearby states, and with English and Hindi from global and national perspectives. A web document here could therefore be in any one of these languages, or in a mixture of words from different languages used to avoid translation; for example, the English word 'computer' has no translation in Tamizh. This varied usage raises several issues for language protagonists and communication engineers. The complexity that these variations introduce into web documents makes conventional data mining approaches difficult to apply. The present study focuses on this problem, progressing from variations in text to words and documents. Typical characters with similar usage, such as 'a' in English and its counterparts in Tamizh and Telugu, are taken and their pixel maps are compared for similarities and contrasts. This is later extended to more complex characters, such as [unknown sign] in Telugu, which is a single character in contrast to its English equivalent 'kO', making representation difficult. At the word level, complexity increases: 'temple' in English may appear as [unknown sign] in Telugu or as 'mandiram' written in English. Similarities in pixel maps are examined and their characteristics are expressed as matrices, so that content mining, when such words or letters are extracted from a web document, can be put in a probabilistic format with predictions based on correlations. Typical histograms highlighting these aspects are presented, and an experiment on a document page dealing with magnetism is then used as a model for predicting content. ©2010 IEEE.
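As an illustration of the pixel-map comparison described in the abstract, the following is a minimal sketch, not the authors' method. It assumes glyphs are available as small binary pixel maps and uses the Pearson correlation as one possible similarity measure; the abstract does not state which measure the paper uses. The two 5x5 glyphs, and the helper names `to_vector`, `pixel_correlation` and `row_histogram`, are hypothetical and chosen only for this example.

```python
from math import sqrt

# Hypothetical 5x5 binary pixel maps (1 = ink, 0 = background).
# These toy glyphs are illustrative only; they are not data from the paper.
GLYPH_A = [
    "01110",
    "10001",
    "11111",
    "10001",
    "10001",
]
GLYPH_B = [
    "01110",
    "10001",
    "10001",
    "10001",
    "01110",
]


def to_vector(rows):
    """Flatten a pixel map (list of '0'/'1' strings) into a list of ints."""
    return [int(ch) for row in rows for ch in row]


def pixel_correlation(map_a, map_b):
    """Pearson correlation between two equally sized binary pixel maps."""
    a, b = to_vector(map_a), to_vector(map_b)
    if len(a) != len(b):
        raise ValueError("pixel maps must have the same dimensions")
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # A blank or fully inked map has no variation to correlate against.
        return 0.0
    return cov / sqrt(var_a * var_b)


def row_histogram(rows):
    """Count inked pixels per row, a simple histogram profile of a glyph."""
    return [row.count("1") for row in rows]


if __name__ == "__main__":
    print("correlation:", round(pixel_correlation(GLYPH_A, GLYPH_B), 3))
    print("row histogram A:", row_histogram(GLYPH_A))
    print("row histogram B:", row_histogram(GLYPH_B))
```

A correlation near 1 would indicate glyphs with largely overlapping ink, while the row histograms give a coarse profile that could be compared across scripts. In practice the pixel maps would have to be rendered from a font at a fixed size, which this sketch leaves out.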
About the journal
Journal: Proceedings of the 2nd International Conference on Trendz in Information Sciences and Computing, TISC-2010
Open Access: No
Concepts (15)
  •  Cellular network
  •  Communication engineers
  •  Complex character
  •  Content mining
  •  Conventional data mining
  •  Extracting information
  •  Native language
  •  Web communications
  •  Web document
  •  Cellular neural networks
  •  Communication
  •  Data mining
  •  Information science
  •  World wide web
  •  Translation (languages)