A Web Page Topic Segmentation Algorithm Based on Visual Criteria and Content Layout
Abstract
This paper presents experiments with a web page topic segmentation algorithm that show a significant improvement in document retrieval. Instead of processing the whole document, the algorithm segments a web page into semantic blocks according to visual criteria (such as horizontal lines and colors) and structural tags (such as headings and paragraphs). Several segmentation solutions have been evaluated, and we show that combining visual and content layout criteria gives the best result for increasing precision. The ranking of a page is calculated as the sum of the scores of the relevant segments produced by the segmentation algorithm.
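The segment-then-sum ranking idea can be illustrated with a minimal sketch. It is not the algorithm evaluated in this paper: the boundary tag set, the term-overlap scoring function, the relevance threshold, and all names (`SegmentParser`, `segment_page`, `segment_score`, `page_score`) are assumptions introduced only for illustration. Color-based visual cues are omitted, and `<hr>` stands in for the visual criteria.

```python
import re
from html.parser import HTMLParser

# Assumed boundary tags: <hr> stands in for a visual criterion,
# headings and <p> for structural tags. The real rule set may differ.
BOUNDARY_TAGS = {"hr", "h1", "h2", "h3", "h4", "h5", "h6", "p"}


class SegmentParser(HTMLParser):
    """Collects text into segments, starting a new one at each boundary tag."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._buf = []

    def _flush(self):
        # Normalize whitespace and keep only non-empty segments.
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.segments.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BOUNDARY_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def segment_page(html):
    """Split an HTML string into a list of text segments."""
    parser = SegmentParser()
    parser.feed(html)
    parser.close()
    return parser.segments


def segment_score(segment, query_terms):
    """Hypothetical score: fraction of the segment's words that are query terms."""
    words = re.findall(r"\w+", segment.lower())
    if not words:
        return 0.0
    return sum(1 for w in words if w in query_terms) / len(words)


def page_score(html, query, threshold=0.0):
    """Sum the scores of segments judged relevant (score above the threshold)."""
    terms = set(re.findall(r"\w+", query.lower()))
    scores = [segment_score(s, terms) for s in segment_page(html)]
    return sum(s for s in scores if s > threshold)
```

In this sketch a segment that matches no query term contributes nothing to the page score, so boilerplate blocks such as footers do not dilute the ranking. Raising `threshold` restricts the sum to the strongest segments.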