Raptor 3.0.0-rc.1
A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences
|
|
The HIBF binning directory. A data structure that efficiently answers set-membership queries for multiple bins. More...
#include <raptor/hierarchical_interleaved_bloom_filter.hpp>
Classes | |
class | counting_agent_type |
Manages counting ranges of values for the raptor::hierarchical_interleaved_bloom_filter. More... | |
class | membership_agent |
Manages membership queries for the raptor::hierarchical_interleaved_bloom_filter. More... | |
class | user_bins |
Bookkeeping for user and technical bins. More... | |
Public Types | |
using | ibf_t = seqan3::interleaved_bloom_filter< data_layout_mode_ > |
The type of an individual Bloom filter. | |
Public Member Functions | |
template<std::integral value_t = uint16_t> | |
counting_agent_type< value_t > | counting_agent () const |
Returns a counting_agent_type to be used for counting. | |
membership_agent | membership_agent () const |
Returns a membership_agent to be used for counting. | |
template<seqan3::cereal_archive archive_t> | |
void | serialize (archive_t &archive) |
Serialisation support function. | |
Constructors, destructor and assignment | |
hierarchical_interleaved_bloom_filter ()=default | |
Defaulted. | |
hierarchical_interleaved_bloom_filter (hierarchical_interleaved_bloom_filter const &)=default | |
Defaulted. | |
hierarchical_interleaved_bloom_filter & | operator= (hierarchical_interleaved_bloom_filter const &)=default |
Defaulted. | |
hierarchical_interleaved_bloom_filter (hierarchical_interleaved_bloom_filter &&)=default | |
Defaulted. | |
hierarchical_interleaved_bloom_filter & | operator= (hierarchical_interleaved_bloom_filter &&)=default |
Defaulted. | |
~hierarchical_interleaved_bloom_filter ()=default | |
Defaulted. | |
Public Attributes | |
std::vector< ibf_t > | ibf_vector |
The individual interleaved Bloom filters. | |
std::vector< std::vector< int64_t > > | next_ibf_id |
Stores for each bin in each IBF of the HIBF the ID of the next IBF. | |
user_bins | user_bins |
The underlying user bins. | |
Static Public Attributes | |
static constexpr seqan3::data_layout | data_layout_mode = data_layout_mode_ |
Indicates whether the Interleaved Bloom Filter is compressed. | |
The HIBF binning directory. A data structure that efficiently answers set-membership queries for multiple bins.
data_layout_mode_ | Indicates whether the underlying data type is compressed. See seqan3::data_layout. |
This class improves the seqan3::interleaved_bloom_filter by adding additional bookkeeping that allows to establish a hierarchical structure. This structure can then be used to split or merge user bins and distribute them over a variable number of technical bins. In the seqan3::interleaved_bloom_filter, the number of user bins and technical bins is always the same. This causes performance degradation when there are many user bins or the user bins are unevenly distributed.
The user may impose a structure on his sequence data in the form of logical groups (e.g. species). When querying the (H)IBF, the user is interested in an answer that differentiates between these groups.
A Technical Bin represents an actual bin in the binning directory. In the IBF, it stores its kmers in a single Bloom Filter (which is interleaved with all the other BFs). In the HIBF each of these bins could be merged or splitted, thus it differs to the original user bins.
The relationship between user bins and technical bins in the HIBF, i.e. which are split or merged, how substructures of merged bins (lower-level IBFs) look like is called layout.
In constrast to the seqan3::interleaved_bloom_filter, the user bins may be split across multiple technical bins, or multiple user bins may be merged into one technical bin. When merging multiple user bins, the HIBF stores another IBF that is built over the user bins constituting the merged bin. This lower-level IBF can then be used to further distinguish between merged bins.
In this example layout, user bin 1 was split into two technical bins. Bins 3, 4, and 5 were merged into a single technical bin, and another IBF was added for the merged bin.
The individual IBFs may have a different number of technical bins and differ in their sizes, allowing an efficient distribution of the user bins.
To query the Hierarchical Interleaved Bloom Filter for values, call raptor::hierarchical_interleaved_bloom_filter::membership_agent() and use the returned raptor::hierarchical_interleaved_bloom_filter::membership_agent. In contrast to the seqan3::interleaved_bloom_filter, the result will consist of indices of user bins.
To count the occurrences in each user bin of a range of values in the Hierarchical Interleaved Bloom Filter, call raptor::hierarchical_interleaved_bloom_filter::counting_agent() and use the returned raptor::hierarchical_interleaved_bloom_filter::counting_agent_type.
The Interleaved Bloom Filter promises the basic thread-safety by the STL that all calls to const
member functions are safe from multiple threads (as long as no thread calls a non-const
member function at the same time).
|
inline |
Returns a counting_agent_type to be used for counting.
value_t | The type to use for the counters; must model std::integral. |
|
inline |
Serialisation support function.
archive_t | Type of archive ; must satisfy seqan3::cereal_archive. |
[in] | archive | The archive being serialised from/to. |
std::vector<std::vector<int64_t> > raptor::hierarchical_interleaved_bloom_filter< data_layout_mode_ >::next_ibf_id |
Stores for each bin in each IBF of the HIBF the ID of the next IBF.
Assume we look up a bin b
in IBF i
, i.e. next_ibf_id[i][b]
. If i
is returned, there is no lower level IBF, bin b
is hence not a merged bin. If j != i
is returned, there is a lower level IBF, bin b
is a merged bin, and j
is the ID of the lower level IBF in ibf_vector.