A Web Page Topic Segmentation Algorithm Based on Visual Criteria and Content Layout - CentraleSupélec
Conference paper, Year: 2007

A Web Page Topic Segmentation Algorithm Based on Visual Criteria and Content Layout

Abstract

This paper presents experiments with a web page topic segmentation algorithm that show a significant improvement in document retrieval. Instead of processing the whole document, the algorithm segments a web page into semantic blocks according to visual criteria (such as horizontal lines and colors) and structural tags (such as headings and paragraphs). Several segmentation solutions were evaluated, and we show that combining visual and content layout criteria gives the best result for increasing precision: the ranking of a page is calculated as the sum of the scores of the relevant segments produced by the segmentation algorithm.
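The segment-then-sum ranking described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the delimiter set, the segment scoring function, or the relevance test, so the tag sets (`VISUAL_BREAKS`, `STRUCTURAL_BREAKS`), the term-count score, and the "contains every query term" relevance rule below are all assumptions chosen for illustration. Colors, which the abstract lists as a visual criterion, are not handled here.

```python
import re
from html.parser import HTMLParser

# Assumed delimiter sets (hypothetical, not taken from the paper).
VISUAL_BREAKS = {"hr"}                                          # horizontal lines
STRUCTURAL_BREAKS = {"h1", "h2", "h3", "h4", "h5", "h6", "p"}   # headings, paragraphs
COMBINED_BREAKS = VISUAL_BREAKS | STRUCTURAL_BREAKS


class Segmenter(HTMLParser):
    """Collects page text and starts a new segment at each delimiter tag."""

    def __init__(self, breaks):
        super().__init__()
        self.breaks = breaks
        self.segments = []
        self._buffer = []

    def _flush(self):
        # Normalize whitespace and keep only non-empty segments.
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.segments.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.breaks:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # emit whatever text follows the last delimiter


def segment(html, breaks=COMBINED_BREAKS):
    parser = Segmenter(breaks)
    parser.feed(html)
    parser.close()
    return parser.segments


def segment_score(text, query_terms):
    """Assumed score: total occurrences of the query terms in the segment."""
    words = re.findall(r"\w+", text.lower())
    return sum(words.count(term.lower()) for term in query_terms)


def is_relevant(text, query_terms):
    """Assumed relevance rule: the segment contains every query term."""
    words = set(re.findall(r"\w+", text.lower()))
    return all(term.lower() in words for term in query_terms)


def page_score(html, query_terms, breaks=COMBINED_BREAKS):
    """Page rank score as the sum of the scores of its relevant segments."""
    return sum(
        segment_score(seg, query_terms)
        for seg in segment(html, breaks)
        if is_relevant(seg, query_terms)
    )
```

Calling `page_score(html, ["web", "segmentation"])` on a page ranks it by its topically focused blocks only, so a segment that mentions a single query term in passing contributes nothing. Passing `breaks=VISUAL_BREAKS` or `breaks=STRUCTURAL_BREAKS` gives a way to compare the individual criteria against their combination.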
No file deposited

Dates and versions

hal-00232588 , version 1 (01-02-2008)

Identifiers

  • HAL Id : hal-00232588 , version 1

Cite

Idir Chibane, Bich-Liên Doan. A Web Page Topic Segmentation Algorithm Based on Visual Criteria and Content Layout. SIGIR'07, Jul 2007, Amsterdam, Netherlands. pp.817-818. ⟨hal-00232588⟩
