Identifying Informative Web Content Blocks using Web Page Segmentation

Stevina Dias; Jayant Gadge

International Journal of Applied Information Systems
Foundation of Computer Science (FCS), NY, USA
Volume 7 - Number 1
Year of Publication: 2014
Authors: Stevina Dias, Jayant Gadge
10.5120/ijais14-451129

Stevina Dias and Jayant Gadge 2014. Identifying Informative Web Content Blocks using Web Page Segmentation. International Journal of Applied Information Systems. 7, 1 (April 2014), 37-41. DOI=http://dx.doi.org/10.5120/ijais14-451129

@article{10.5120/ijais2017451568,
author = {Stevina Dias and Jayant Gadge},
title = {Identifying Informative Web Content Blocks using Web Page Segmentation},
journal = {International Journal of Applied Information Systems},
issue_date = {April 2014},
volume = {7},
number = {1},
month = {April},
year = {2014},
issn = {},
pages = {37-41},
numpages = {},
url = {/archives/volume7/number1/614-1129},
doi = { 10.5120/ijais14-451129},
publisher = {© 2013 by IJAIS Journal},
address = {}
}

%1 451129
%A Stevina Dias
%A Jayant Gadge
%T Identifying Informative Web Content Blocks using Web Page Segmentation
%J International Journal of Applied Information Systems
%@ 
%V 7
%N 1
%P 37-41
%D 2014
%I © 2013 by IJAIS Journal

Abstract

In the study of content authentication and tamper detection of digital text documents, few techniques are available that rely on digital watermarking. This paper proposes a novel intelligent text zero-watermarking approach based on probabilistic patterns for content authentication and tamper detection of English text documents. In the proposed approach, a letter-based Markov model of order three, abbreviated LNMZW3, is constructed for text analysis and uses the interrelationships between the contents of the given text to generate the watermark. The watermark can later be extracted using extraction and detection algorithms to determine whether the text document is authentic or tampered. The proposed approach was implemented in PHP using NetBeans IDE 7.0. The effectiveness and feasibility of the LNMZW3 approach were evaluated and compared with other recent approaches in experiments using five datasets of varying lengths and different volumes of attacks. Results show that the proposed approach always detects tampering attacks occurring randomly in the text, whether the tampering volume is low, medium, or high. Comparative results show that the LNMZW3 approach provides added value under random insertion and deletion attacks in terms of performance, watermark robustness, and watermark security. However, it yields the least improvement under reorder attacks.
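
The abstract does not give the details of LNMZW3, so the following is only a minimal Python sketch of the general idea it describes: a letter-based Markov model of order three is built from the text, and a watermark is derived from its transition patterns without altering the text itself. The function names, the whitespace and case normalization, the use of SHA-256 to condense the patterns, and the match-rate measure are all assumptions made for illustration; they are not taken from the paper, which reports a PHP implementation.

```python
from collections import Counter
import hashlib

ORDER = 3  # order of the letter-based Markov model, as stated in the abstract


def transition_counts(text, order=ORDER):
    """Count (state, next_letter) transitions, where a state is `order` letters.

    Assumption: text is lowercased and runs of whitespace are collapsed.
    """
    s = " ".join(text.lower().split())
    return Counter((s[i:i + order], s[i + order]) for i in range(len(s) - order))


def generate_watermark(text):
    """Derive a watermark from the transition patterns; the text is unchanged.

    Assumption: the patterns are serialized and hashed with SHA-256.
    """
    counts = transition_counts(text)
    serialized = "|".join(
        f"{state}>{nxt}:{n}" for (state, nxt), n in sorted(counts.items())
    )
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


def is_authentic(watermark, received_text):
    """Regenerate the watermark from the received text and compare."""
    return generate_watermark(received_text) == watermark


def match_rate(original_counts, received_text):
    """Fraction of the original transitions still present in the received text."""
    received = transition_counts(received_text)
    total = sum(original_counts.values())
    if total == 0:
        return 1.0
    matched = sum(min(n, received[key]) for key, n in original_counts.items())
    return matched / total


original = "the quick brown fox jumps over the lazy dog"
tampered = "the quick brown cat jumps over the lazy dog"
watermark = generate_watermark(original)
print(is_authentic(watermark, original))   # untouched text
print(is_authentic(watermark, tampered))   # one word replaced
print(match_rate(transition_counts(original), tampered))
```

In this sketch the hash gives a yes/no authenticity answer, while the match rate gives a rough indication of how much of the original pattern survives. Because only local three-letter transitions are counted, swapping distant words changes few transitions, which is one plausible reason a scheme of this kind could be less sensitive to reorder attacks.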

Keywords

Search engine, information extraction, web content mining, web segmentation, repetition detection, informative blocks, non-informative blocks, noise

Index Terms

Computer Science

Information Sciences
