Mining of Bilingual Indian Web Documents
Anand Rajaraman
Published in Elsevier B.V.
2016
Volume: 89
Pages: 514-520
Abstract
Web and mobile communication are growing in popularity both globally and regionally, supporting many forms of information dissemination and producing complex web documents with embedded script, language, and media content. Information extraction from such documents is therefore becoming a real challenge, since one has to handle format and script variations in documented form and media variations in soft-web form. This is especially relevant in the Indian education scenario, where bilingual and multilingual communication and web documents delivered through online courses are considered. When regional native dialects come into the picture, another dimension of complexity is added. The present paper focuses on content extraction from such documents through a generic pixel-based approach, with mining performed through classification. Indian bilingual web documents are considered, and attributes are generated by reducing the pixel matrix. Five different attributes were identified and studied. A clear state-of-the-art comparison between the training dataset and the test dataset is given. The results show reasonable content extraction with good accuracy on the datasets studied. © 2016 Elsevier B.V. All rights reserved.
About the journal
Journal: Procedia Computer Science
Publisher: Elsevier B.V.
ISSN: 1877-0509
Open Access: Yes
Concepts (17)
  •  Classification (of information)
  •  Complex networks
  •  Data mining
  •  Image processing
  •  Information dissemination
  •  Mining
  •  Pixels
  •  Signal processing
  •  Statistical tests
  •  WAREHOUSES
  •  World wide web
  •  Attribute
  •  BILINGUAL
  •  CONTENT EXTRACTION
  •  PIXEL BASED APPROACH
  •  VOXEL
  •  Information retrieval systems