A resource-frugal probabilistic dictionary and applications in bioinformatics

Indexing massive data sets is extremely expensive for large-scale problems. In many fields, huge amounts of data are currently generated; however, extracting meaningful information from voluminous data sets, such as computing similarity between elements, is far from trivial. It nonetheless remains a fundamental need. This work proposes a probabilistic data structure based on a minimal perfect hash function for indexing large sets of keys. Our structure outperforms the hash table in construction time, query time, and memory usage when indexing a static set. To illustrate the impact of algorithm performance, we provide two applications based on similarity computation between collections of sequences, for which this calculation is an expensive but required operation. In particular, we show a practical case in which other bioinformatics tools fail to scale up to the tested data set or provide results with lower recall.
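
To make the idea of a probabilistic dictionary built on a minimal perfect hash function (MPHF) more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: `ToyMPHF`, `ProbabilisticDict`, `fingerprint_bits`, and `gamma` are hypothetical names, the cascade-of-levels construction is a simple stand-in for a real MPHF library, and storing a short per-key fingerprint is one common way to reject most non-indexed keys that is assumed here rather than taken from the abstract.

```python
import hashlib

FINGERPRINT_SEED = 2**63  # hypothetical seed, kept apart from the level seeds


def _hash(key: str, seed: int) -> int:
    """64-bit seeded hash of a string key."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(key.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


class ToyMPHF:
    """Toy MPHF: maps n distinct static keys onto 0..n-1 without collisions.

    Each level keeps only the keys that land alone in their cell; colliding
    keys are retried at the next level with a different seed. A real MPHF
    would use bit arrays with rank support instead of Python dicts.
    """

    def __init__(self, keys, gamma: float = 2.0):
        remaining = list(dict.fromkeys(keys))  # drop duplicates, keep order
        self.n = len(remaining)
        self.levels = []  # one (size, {cell: final index}) pair per level
        offset = 0
        while remaining:
            level = len(self.levels)
            size = max(1, int(gamma * len(remaining)))
            counts = {}
            for key in remaining:
                cell = _hash(key, level) % size
                counts[cell] = counts.get(cell, 0) + 1
            alone = sorted(c for c, n_hits in counts.items() if n_hits == 1)
            rank = {cell: offset + i for i, cell in enumerate(alone)}
            self.levels.append((size, rank))
            offset += len(alone)
            remaining = [k for k in remaining
                         if _hash(k, level) % size not in rank]

    def lookup(self, key: str):
        """Index in 0..n-1 for an indexed key; arbitrary or None otherwise."""
        for level, (size, rank) in enumerate(self.levels):
            cell = _hash(key, level) % size
            if cell in rank:
                return rank[cell]
        return None


class ProbabilisticDict:
    """Static key-to-value index that stores fingerprints instead of keys."""

    def __init__(self, items: dict, fingerprint_bits: int = 12):
        self.mphf = ToyMPHF(items)
        self.mask = (1 << fingerprint_bits) - 1
        self.fingerprints = [0] * self.mphf.n
        self.values = [None] * self.mphf.n
        for key, value in items.items():
            slot = self.mphf.lookup(key)
            self.fingerprints[slot] = self._fingerprint(key)
            self.values[slot] = value

    def _fingerprint(self, key: str) -> int:
        return _hash(key, FINGERPRINT_SEED) & self.mask

    def get(self, key: str, default=None):
        slot = self.mphf.lookup(key)
        # A non-indexed key passes this check only if its fingerprint
        # happens to match, i.e. with probability about 2**-fingerprint_bits.
        if slot is None or self.fingerprints[slot] != self._fingerprint(key):
            return default
        return self.values[slot]


# Usage: index a small static set of k-mers and query it.
kmers = {"ACGT": 3, "CGTA": 1, "GTAC": 7, "TACG": 2}
index = ProbabilisticDict(kmers)
assert all(index.get(kmer) == count for kmer, count in kmers.items())
```

In this sketch, indexed keys are always retrieved correctly, while a key outside the indexed set is occasionally mistaken for an indexed one. Accepting that small false-positive rate is what allows the structure to avoid storing the keys themselves.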

Keywords

Bloomier filter; Minimal Perfect Hash Functions; Genomics; Data structures; Bioinformatics; Sequences comparison

Domains

Bioinformatics [q-bio.QM]; Data Structures and Algorithms [cs.DS]

Main file

short_read_connectors.pdf (323 KB)

Origin: Files produced by the author(s)

Contributor: Pierre Peterlongo

https://inria.hal.science/hal-01322440

Submitted on: Friday, May 27, 2016, 11:00:32

Last modified on: Monday, April 15, 2024, 15:16:20

Long-term archiving on: Sunday, August 28, 2016, 10:35:23

Dates and versions

hal-01322440, version 1 (27-05-2016)

Identifiers

HAL Id: hal-01322440, version 1
arXiv: 1605.08319
DOI: 10.1016/j.dam.2018.03.035

Cite

Camille Marchet, Lolita Lecompte, Antoine Limasset, Lucie Bittner, Pierre Peterlongo. A resource-frugal probabilistic dictionary and applications in bioinformatics. Discrete Applied Mathematics, 2020, 274, pp. 92-102. ⟨10.1016/j.dam.2018.03.035⟩. ⟨hal-01322440⟩

Collections

UNIV-RENNES1 CNRS INRIA INSA-RENNES IRISA CENTRALESUPELEC IRISA-D7 INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC IBPS UNIV-RENNES SORBONNE-UNIVERSITE SU-SCIENCES ANR UR1-MATH-NUM
