Conference Papers, Year: 2007

A Web Page Topic Segmentation Algorithm Based on Visual Criteria and Content Layout

Abstract

This paper presents experiments with a web page topic segmentation algorithm that show a significant improvement in document retrieval. Instead of processing the whole document, the algorithm segments a web page into semantic blocks according to visual criteria (such as horizontal lines and colors) and structural tags (such as headings and paragraphs). Several segmentation solutions were evaluated, and we show that combining visual and content layout criteria gives the best result for increasing precision. The ranking of a page is calculated as the sum of the scores of the relevant segments produced by the segmentation algorithm.
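The ranking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the boundary tags, the term-overlap scoring function, the relevance threshold, and all names (`Segmenter`, `segment`, `segment_score`, `page_score`) are assumptions chosen for illustration, since the abstract does not specify them. Color-based criteria are left out because they would require rendered style information.

```python
from html.parser import HTMLParser

# Tags treated as segment boundaries. These sets are assumptions;
# the abstract only names horizontal lines, headings and paragraphs as examples.
VISUAL_BREAKS = {"hr"}
STRUCTURAL_BREAKS = {"h1", "h2", "h3", "h4", "h5", "h6", "p"}


class Segmenter(HTMLParser):
    """Split a page into text segments at the chosen boundary tags."""

    def __init__(self, break_tags):
        super().__init__()
        self.break_tags = break_tags
        self.segments = []
        self._buffer = []

    def _flush(self):
        # Join buffered text, normalise whitespace, keep non-empty segments.
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.segments.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.break_tags:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def segment(html, break_tags):
    """Return the list of text segments of an HTML string."""
    parser = Segmenter(break_tags)
    parser.feed(html)
    parser.close()
    return parser.segments


def segment_score(text, query_terms):
    """Hypothetical score: fraction of segment words that are query terms."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in query_terms for w in words) / len(words)


def page_score(html, query, break_tags, threshold=0.0):
    """Sum the scores of the segments considered relevant (score > threshold)."""
    terms = set(query.lower().split())
    scores = [segment_score(s, terms) for s in segment(html, break_tags)]
    return sum(s for s in scores if s > threshold)
```

Calling `page_score(html, "jazz", VISUAL_BREAKS | STRUCTURAL_BREAKS)` combines both kinds of criteria, while passing only one of the two sets gives a single-criterion segmentation to compare against.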

Dates and versions

hal-00232588, version 1 (01-02-2008)

Cite

Idir Chibane, Bich-Liên Doan. A Web Page Topic Segmentation Algorithm Based on Visual Criteria and Content Layout. SIGIR'07, Jul 2007, Amsterdam, Netherlands. pp. 817-818. ⟨hal-00232588⟩