Statistical interpretation for mining hybrid regional web documents
Anand Rajaraman
Published in
2012
Volume: 292 CCIS
Pages: 503-512
Abstract
Media mining has shifted markedly away from conventional data mining because of the ever-increasing complexity of web documents. A further dimension is added when the web documents are of Indian origin, since a variety of languages and dialects go into the development of the web pages. Web documents in which words from different languages are used with or without translation can be termed hybrid documents. A typical Yahoo news page in different languages is an example of this. Extracting information or content, and eventually knowledge, becomes more involved when words from other languages are used as they are, without translation, such as 'computer' or 'mobile' being used freely in regional languages. The reader/surfer can follow the content easily even though no translation has been done. Such documents are the focus of this study, and a statistical approach for describing the features of the words in different languages is used as the basis for correlation to assess the content of such web documents. As a benchmark study, six words related to education are taken in four different languages (English, Tamizh, Telugu and Hindi), and different ways of normalizing within and outside the group are taken as the base vectors; through a correlation study, any new datum or group of data is checked to assess the probability of identifying the content. The words, being in different scripts, are converted to three-layer pixel map groups so that translation and text-related issues do not affect the mining procedure. Further, as textual data is well structured irrespective of language, this approach of deriving attributes and using them as bases is more general and can include texts from any language. © 2012 Springer-Verlag.
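The abstract does not give the exact features, normalization schemes, or layer definitions, so the following is only a minimal sketch of the general idea: describe each word image by a normalized feature vector, keep those vectors as bases, and score a new word image by its correlation with each base. Everything named here is an assumption made for illustration, not the authors' method: the toy 3x3 single-layer binary pixel maps (rather than three-layer maps of real words), the row/column ink-count features, the labels, and the helper names `features`, `pearson`, and `best_match`.

```python
from math import sqrt

def features(pixel_map):
    """Hypothetical features: row and column ink counts, scaled to sum to 1 per axis group."""
    rows = [sum(r) for r in pixel_map]
    cols = [sum(c) for c in zip(*pixel_map)]
    total = sum(rows) or 1  # avoid division by zero for an empty map
    return [v / total for v in rows + cols]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant vector carries no correlation information
    return cov / (sx * sy)

def best_match(pixel_map, base_vectors):
    """Return the label of the most correlated base vector, plus all scores."""
    f = features(pixel_map)
    scores = {label: pearson(f, v) for label, v in base_vectors.items()}
    return max(scores, key=scores.get), scores

# Toy glyphs standing in for word images (assumed data, not from the paper).
t_shape = [[1, 1, 1],
           [0, 1, 0],
           [0, 1, 0]]
l_shape = [[1, 0, 0],
           [1, 0, 0],
           [1, 1, 1]]

base_vectors = {"T_shape": features(t_shape), "L_shape": features(l_shape)}

# A noisy variant of the T glyph (one extra pixel set).
query = [[1, 1, 1],
         [0, 1, 0],
         [0, 1, 1]]

label, scores = best_match(query, base_vectors)
print(label, scores)
```

Because the comparison works on pixel-derived numbers rather than on characters, the same routine applies to any script, which is the property the abstract relies on; a realistic version would need the paper's actual feature and normalization definitions.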
About the journal
Journal: Communications in Computer and Information Science
ISSN: 1865-0929
Open Access: No
Concepts (18)
  •  BASE VECTORS
  •  BENCHMARK STUDY
  •  CONVENTIONAL DATA MINING
  •  CORRELATION STUDIES
  •  Extracting information
  •  MULTI-LINGUAL
  •  NEW DIMENSIONS
  •  Statistical approach
  •  STATISTICAL INTERPRETATION
  •  Textual data
  •  THREE-LAYER
  •  WEB COMMUNICATIONS
  •  WEB DOCUMENT
  •  Artificial intelligence
  •  Data processing
  •  HTML
  •  World wide web
  •  Translation (languages)