Raptor 3.0.0-rc.1
A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences
 
raptor::hierarchical_interleaved_bloom_filter< data_layout_mode_ > Class Template Reference

The HIBF binning directory. A data structure that efficiently answers set-membership queries for multiple bins. More...

#include <raptor/hierarchical_interleaved_bloom_filter.hpp>

+ Inheritance diagram for raptor::hierarchical_interleaved_bloom_filter< data_layout_mode_ >:
+ Collaboration diagram for raptor::hierarchical_interleaved_bloom_filter< data_layout_mode_ >:

Classes

class  counting_agent_type
 Manages counting ranges of values for the raptor::hierarchical_interleaved_bloom_filter. More...
 
class  membership_agent
 Manages membership queries for the raptor::hierarchical_interleaved_bloom_filter. More...
 
class  user_bins
 Bookkeeping for user and technical bins. More...
 

Public Types

using ibf_t = seqan3::interleaved_bloom_filter< data_layout_mode_ >
 The type of an individual Bloom filter.
 

Public Member Functions

template<std::integral value_t = uint16_t>
counting_agent_type< value_t > counting_agent () const
 Returns a counting_agent_type to be used for counting.
 
membership_agent membership_agent () const
 Returns a membership_agent to be used for counting.
 
template<seqan3::cereal_archive archive_t>
void serialize (archive_t &archive)
 Serialisation support function.
 
Constructors, destructor and assignment
 hierarchical_interleaved_bloom_filter ()=default
 Defaulted.
 
 hierarchical_interleaved_bloom_filter (hierarchical_interleaved_bloom_filter const &)=default
 Defaulted.
 
hierarchical_interleaved_bloom_filteroperator= (hierarchical_interleaved_bloom_filter const &)=default
 Defaulted.
 
 hierarchical_interleaved_bloom_filter (hierarchical_interleaved_bloom_filter &&)=default
 Defaulted.
 
hierarchical_interleaved_bloom_filteroperator= (hierarchical_interleaved_bloom_filter &&)=default
 Defaulted.
 
 ~hierarchical_interleaved_bloom_filter ()=default
 Defaulted.
 

Public Attributes

std::vector< ibf_tibf_vector
 The individual interleaved Bloom filters.
 
std::vector< std::vector< int64_t > > next_ibf_id
 Stores for each bin in each IBF of the HIBF the ID of the next IBF.
 
user_bins user_bins
 The underlying user bins.
 

Static Public Attributes

static constexpr seqan3::data_layout data_layout_mode = data_layout_mode_
 Indicates whether the Interleaved Bloom Filter is compressed.
 

Detailed Description

template<seqan3::data_layout data_layout_mode_ = seqan3::data_layout::uncompressed>
class raptor::hierarchical_interleaved_bloom_filter< data_layout_mode_ >

The HIBF binning directory. A data structure that efficiently answers set-membership queries for multiple bins.

Template Parameters
data_layout_mode_Indicates whether the underlying data type is compressed. See seqan3::data_layout.
See also
seqan3::interleaved_bloom_filter

This class improves the seqan3::interleaved_bloom_filter by adding additional bookkeeping that allows to establish a hierarchical structure. This structure can then be used to split or merge user bins and distribute them over a variable number of technical bins. In the seqan3::interleaved_bloom_filter, the number of user bins and technical bins is always the same. This causes performance degradation when there are many user bins or the user bins are unevenly distributed.

Terminology

User Bin

The user may impose a structure on his sequence data in the form of logical groups (e.g. species). When querying the (H)IBF, the user is interested in an answer that differentiates between these groups.

Technical Bin

A Technical Bin represents an actual bin in the binning directory. In the IBF, it stores its kmers in a single Bloom Filter (which is interleaved with all the other BFs). In the HIBF each of these bins could be merged or splitted, thus it differs to the original user bins.

Layout of the HIBF

The relationship between user bins and technical bins in the HIBF, i.e. which are split or merged, how substructures of merged bins (lower-level IBFs) look like is called layout.

Hierarchical Interleaved Bloom Filter (HIBF)

In constrast to the seqan3::interleaved_bloom_filter, the user bins may be split across multiple technical bins, or multiple user bins may be merged into one technical bin. When merging multiple user bins, the HIBF stores another IBF that is built over the user bins constituting the merged bin. This lower-level IBF can then be used to further distinguish between merged bins.

In this example layout, user bin 1 was split into two technical bins. Bins 3, 4, and 5 were merged into a single technical bin, and another IBF was added for the merged bin.

The individual IBFs may have a different number of technical bins and differ in their sizes, allowing an efficient distribution of the user bins.

Querying

To query the Hierarchical Interleaved Bloom Filter for values, call raptor::hierarchical_interleaved_bloom_filter::membership_agent() and use the returned raptor::hierarchical_interleaved_bloom_filter::membership_agent. In contrast to the seqan3::interleaved_bloom_filter, the result will consist of indices of user bins.

To count the occurrences in each user bin of a range of values in the Hierarchical Interleaved Bloom Filter, call raptor::hierarchical_interleaved_bloom_filter::counting_agent() and use the returned raptor::hierarchical_interleaved_bloom_filter::counting_agent_type.

Thread safety

The Interleaved Bloom Filter promises the basic thread-safety by the STL that all calls to const member functions are safe from multiple threads (as long as no thread calls a non-const member function at the same time).

Member Function Documentation

◆ counting_agent()

template<seqan3::data_layout data_layout_mode_ = seqan3::data_layout::uncompressed>
template<std::integral value_t = uint16_t>
counting_agent_type< value_t > raptor::hierarchical_interleaved_bloom_filter< data_layout_mode_ >::counting_agent ( ) const
inline

Returns a counting_agent_type to be used for counting.

Template Parameters
value_tThe type to use for the counters; must model std::integral.

◆ serialize()

template<seqan3::data_layout data_layout_mode_ = seqan3::data_layout::uncompressed>
template<seqan3::cereal_archive archive_t>
void raptor::hierarchical_interleaved_bloom_filter< data_layout_mode_ >::serialize ( archive_t &  archive)
inline

Serialisation support function.

Template Parameters
archive_tType of archive; must satisfy seqan3::cereal_archive.
Parameters
[in]archiveThe archive being serialised from/to.
Attention
These functions are never called directly.
See also
https://docs.seqan.de/seqan/3.2.0/group__io.html#serialisation

Member Data Documentation

◆ next_ibf_id

template<seqan3::data_layout data_layout_mode_ = seqan3::data_layout::uncompressed>
std::vector<std::vector<int64_t> > raptor::hierarchical_interleaved_bloom_filter< data_layout_mode_ >::next_ibf_id

Stores for each bin in each IBF of the HIBF the ID of the next IBF.

Assume we look up a bin b in IBF i, i.e. next_ibf_id[i][b]. If i is returned, there is no lower level IBF, bin b is hence not a merged bin. If j != i is returned, there is a lower level IBF, bin b is a merged bin, and j is the ID of the lower level IBF in ibf_vector.


The documentation for this class was generated from the following file: