Reposted from: http://blog.huang-wei.com/2010/11/02/bloom-filter/

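Introduction

A Bloom filter is a simple, space-efficient, randomized data structure that supports set membership queries. In C++ we usually reach for std::set or stdext::hash_set: std::set is implemented as a red-black tree, and stdext::hash_set as a bucketed hash table. Both structures have to store the original data, so memory becomes a problem once the data set grows large. If the application can tolerate a certain probability of false positives and never needs to recover or enumerate the stored elements, a Bloom filter is a good fit.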

Advantages

Queries are very efficient.
Space-efficient.
Easy to parallelize.
Set operations are convenient.
Simple to implement.

Disadvantages

Queries can return false positives.
The elements stored in the set cannot be retrieved.
Deletion is not supported.

Definition

A Bloom filter is a bit array of m bits, all initially 0, together with k independent hash functions.

Figure 1

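Insertion

For each element, the k hash functions produce a hash vector of size k, $\left(H_{1},H_{2},\cdots,H_{k}\right)$, and the bit at each hash value in the vector is set to 1. The time complexity is $k\cdot O(F_{H})$, where $F_{H}$ is the cost of one hash function; for a typical string hash function this is linear in the length of the string.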

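Query

As with insertion, first compute the hash vector. If the bit at every hash value is 1, the element is reported as present; if any bit is 0, the element is definitely not in the set. The time complexity is the same as for insertion.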
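The insertion and query operations can be sketched in a few lines of C++. This is only an illustration, not the author's implementation: the class name `SimpleBloomFilter` is hypothetical, and the k positions are assumed to be derived from two `std::hash` values (h1 + i * h2) rather than from k truly independent hash functions.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Minimal Bloom filter sketch: m bits, k probe positions per element.
// Assumption: positions come from double hashing over std::hash,
// which only approximates k independent hash functions.
class SimpleBloomFilter {
public:
    SimpleBloomFilter(std::size_t m, std::size_t k) : bits_(m, false), k_(k) {
        assert(m > 0 && k > 0);
    }

    // Insertion: set the bit at each of the k positions.
    void add(const std::string& key) {
        for (std::size_t i = 0; i < k_; ++i) bits_[position(key, i)] = true;
    }

    // Query: "possibly present" only if all k bits are set.
    bool possibly_contains(const std::string& key) const {
        for (std::size_t i = 0; i < k_; ++i) {
            if (!bits_[position(key, i)]) return false;  // definitely absent
        }
        return true;  // present, or a false positive
    }

private:
    // i-th probe position for a key, in the range [0, m).
    std::size_t position(const std::string& key, std::size_t i) const {
        const std::size_t h1 = std::hash<std::string>{}(key);
        const std::size_t h2 = std::hash<std::string>{}(key + "#salt") | 1;
        return (h1 + i * h2) % bits_.size();
    }

    std::vector<bool> bits_;
    std::size_t k_;
};
```

An element that was added is always reported as possibly present; an element that was never added is usually, but not always, reported as absent.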

Example

Figure 2 shows a Bloom filter with m=16 and k=2; the two inserted elements have hash values (3, 6) and (10, 3) respectively.

Figure 2

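False Positive

If an element is not in the Bloom filter but every position given by its hash values has already been set to 1, the query wrongly reports it as present. This is a false positive.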

Reusing the example above:

Figure 3

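This is the same issue as collisions in a hash table. A hash table resolves collisions with open hashing (separate chaining) or closed hashing (open addressing), whereas a Bloom filter lets them happen and focuses instead on the probability of a false positive.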
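The false positive can be reproduced with the m=16, k=2 example. The sketch below sets the bits for the two inserted elements, (3, 6) and (10, 3), and then queries a third element that was never inserted. Its hash values (6, 10) are an assumption chosen for illustration; the helper names are hypothetical as well.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>

using Positions = std::array<std::size_t, 2>;  // k = 2 hash values per element

// Insertion: set the bit at each hash value.
void add(std::bitset<16>& bits, const Positions& p) {
    for (std::size_t pos : p) {
        assert(pos < bits.size());
        bits.set(pos);
    }
}

// Query: true only if every hash value points at a set bit.
bool possibly_contains(const std::bitset<16>& bits, const Positions& p) {
    for (std::size_t pos : p) {
        if (!bits.test(pos)) return false;
    }
    return true;
}

// Builds the filter from the example: elements hashing to (3, 6) and (10, 3).
std::bitset<16> example_filter() {
    std::bitset<16> bits;
    add(bits, {3, 6});
    add(bits, {10, 3});
    return bits;
}
```

After the two insertions only bits 3, 6 and 10 are set. A query for the hypothetical element at (6, 10) finds both bits set and is reported as present even though it was never inserted, while a query at (3, 7) is correctly rejected because bit 7 is still 0.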

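Probability

At a high level, we can draw the following conclusions:

Parameter	Variable	When decreased	When increased
Number of hash functions	k	Fewer hash computations; higher false-positive probability	More hash computations; fewer bits remain 0
Bloom filter size	m	Less memory; higher false-positive probability	More memory; lower false-positive probability
Number of elements	n	Lower false-positive probability	Higher false-positive probability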

The false-positive probability is $F=\left(1-e^{-\frac{kn}{m}}\right)^{k}$.
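For readers who want to see where this expression comes from, here is the usual derivation sketch. It assumes that every hash function picks each of the m bits uniformly and independently, and it treats the bits as independent of one another, which is an approximation.

```latex
\begin{aligned}
P(\text{a given bit is not set by one hash write}) &= 1-\frac{1}{m} \\
P(\text{the bit is still 0 after } n \text{ elements with } k \text{ hashes each})
  &= \left(1-\frac{1}{m}\right)^{kn} \approx e^{-\frac{kn}{m}} \\
F = P(\text{all } k \text{ probed bits are 1})
  &\approx \left(1-e^{-\frac{kn}{m}}\right)^{k}
\end{aligned}
```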

Given m and n, the false-positive probability is minimized when $k=\left[\ln 2\cdot\frac{m}{n}\right]$.
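The two formulas are easy to evaluate numerically. The following sketch uses hypothetical function names and assumes that the square brackets in the formula for k mean rounding to the nearest integer, with a minimum of 1.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// F = (1 - e^{-kn/m})^k for m bits, n elements and k hash functions.
double false_positive_rate(double m, double n, double k) {
    assert(m > 0 && n > 0 && k > 0);
    return std::pow(1.0 - std::exp(-k * n / m), k);
}

// k = [ln 2 * m / n], read here as "round to nearest", never below 1.
std::size_t optimal_hash_count(double m, double n) {
    assert(m > 0 && n > 0);
    const double k = std::round(std::log(2.0) * m / n);
    return k < 1.0 ? 1 : static_cast<std::size_t>(k);
}
```

With 10 bits per element (m/n = 10), `optimal_hash_count` returns 7 and `false_positive_rate` gives roughly 0.8%; both fewer hash functions (k = 3) and more (k = 12) yield a higher rate.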

Data

Figure 4

Extensions

Counter Bloom Filter

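A plain Bloom filter cannot support deletion, because it does not know which elements' hash vectors a given bit belongs to. We can fix this by giving the filter counters (the literature usually calls the result a Counting Bloom Filter): insertion increments the counters and deletion decrements them.

Such a filter has to account for the size of the extra counters. If the same element is inserted many times and the counters have few bits, they can overflow. Capping the counters at a maximum value can lead to cache misses, but for some applications, such as web cache sharing, this is not a problem.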
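A counter-based filter along these lines could look as follows. This is a sketch under stated assumptions, not a reference implementation: the class name is hypothetical, the counters are 8 bits wide, positions come from double hashing over `std::hash`, and a counter that reaches its maximum is frozen rather than allowed to overflow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <limits>
#include <string>
#include <vector>

// Bloom filter with one small counter per position instead of one bit.
class CountingBloomFilter {
public:
    CountingBloomFilter(std::size_t m, std::size_t k) : counters_(m, 0), k_(k) {
        assert(m > 0 && k > 0);
    }

    void add(const std::string& key) {
        for (std::size_t i = 0; i < k_; ++i) {
            std::uint8_t& c = counters_[position(key, i)];
            if (c < kMax) ++c;  // saturate at the cap instead of overflowing
        }
    }

    // Should only be called for keys that were added before; removing a
    // key that was never added would corrupt the counters of other keys.
    void remove(const std::string& key) {
        for (std::size_t i = 0; i < k_; ++i) {
            std::uint8_t& c = counters_[position(key, i)];
            if (c > 0 && c < kMax) --c;  // a saturated counter stays frozen
        }
    }

    bool possibly_contains(const std::string& key) const {
        for (std::size_t i = 0; i < k_; ++i) {
            if (counters_[position(key, i)] == 0) return false;
        }
        return true;
    }

private:
    static constexpr std::uint8_t kMax = std::numeric_limits<std::uint8_t>::max();

    std::size_t position(const std::string& key, std::size_t i) const {
        const std::size_t h1 = std::hash<std::string>{}(key);
        const std::size_t h2 = std::hash<std::string>{}(key + "#salt") | 1;
        return (h1 + i * h2) % counters_.size();
    }

    std::vector<std::uint8_t> counters_;
    std::size_t k_;
};
```

Freezing a saturated counter trades a few permanently occupied positions (slightly more false positives) for the guarantee that removing one element never makes another element disappear.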

Compressed Bloom Filter

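To transfer a Bloom filter between servers over the network more quickly, the filter can be compressed once it has been built and its actual parameters are known.

After all elements have been added, we know the real space utilization. Plugging this value into the formula yields a value smaller than m; we then rebuild the Bloom filter at that size, taking the original hash values modulo the new size, so that the memory footprint is a better fit without changing the target false-positive rate.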
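One way to sketch the "take the hash values modulo the new size" step without re-inserting every element is to fold the bit array. This relies on an assumption the text does not state: the new size must divide the old size m, because only then does (h mod m) mod new_m equal h mod new_m. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Shrinks a bit array to new_m bits by OR-ing together all bits whose
// index is congruent modulo new_m. Requires that new_m divides the old size.
std::vector<bool> fold_filter(const std::vector<bool>& bits, std::size_t new_m) {
    assert(new_m > 0 && bits.size() % new_m == 0);
    std::vector<bool> folded(new_m, false);
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (bits[i]) folded[i % new_m] = true;
    }
    return folded;
}
```

Queries against the folded array use the original hash values modulo new_m, so every inserted element is still found. The folded array is denser than the original, so it is worth re-evaluating the false-positive formula with the new size before relying on it.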

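Applications

Speeding up queries

This suits key-value storage systems: when the values live on disk, a lookup is expensive.

Insert every key held in Storage into the Filter. If the Filter reports that a key is absent, there is no need to query Storage at all.

A false positive only costs one unnecessary Storage query.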
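The lookup path described above can be sketched like this. Everything here is hypothetical scaffolding: `FilteredStore` stands in for a key-value system, a `std::map` simulates the slow storage, the filter uses only two `std::hash`-based positions, and `storage_reads` counts how often the simulated storage is actually consulted.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Key-value store that checks a small Bloom filter before touching storage.
class FilteredStore {
public:
    explicit FilteredStore(std::size_t m) : bits_(m, false) { assert(m > 0); }

    void put(const std::string& key, const std::string& value) {
        storage_[key] = value;
        bits_[pos1(key)] = true;
        bits_[pos2(key)] = true;
    }

    // Returns true and fills `value` if the key exists.
    bool get(const std::string& key, std::string& value) {
        if (!bits_[pos1(key)] || !bits_[pos2(key)]) {
            return false;  // filter says "absent": storage is never touched
        }
        ++storage_reads;   // a false positive costs exactly one wasted read
        const auto it = storage_.find(key);
        if (it == storage_.end()) return false;
        value = it->second;
        return true;
    }

    std::size_t storage_reads = 0;

private:
    std::size_t pos1(const std::string& k) const {
        return std::hash<std::string>{}(k) % bits_.size();
    }
    std::size_t pos2(const std::string& k) const {
        return std::hash<std::string>{}(k + "#2") % bits_.size();
    }

    std::vector<bool> bits_;
    std::map<std::string, std::string> storage_;
};
```

A lookup for a stored key always reaches storage once. A lookup for a missing key reaches storage only when the filter produces a false positive, and even then the result is still correct.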

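Figure 5

- Google's BigTable also uses Bloom filters to reduce disk lookups for non-existent rows or columns, which greatly improves the performance of database queries.

- Many proxy caches built on the Internet Cache Protocol store URLs in Bloom filters; besides efficient lookups, this makes it easy to transmit and exchange cache information.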

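Network applications

- Resource lookup in P2P networks: keep a Bloom filter for each network path, and route the request along a path when its filter reports a hit.

- When broadcasting messages, check whether a given IP has already sent a packet.

- Detecting loops in broadcast packets: store a Bloom filter in the packet and have each node add itself to the filter.

- Message queue management: use a Counter Bloom Filter to manage message traffic.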

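Spam address filtering

This example comes from the Google China blog (Google 黑板報) [5].

Public email providers such as NetEase and QQ constantly need to filter out junk mail sent by spammers.

One approach is to record the email addresses that send spam. Because spammers keep registering new addresses, there are at least several billion spam addresses worldwide, and storing them all would take a large number of network servers.

With a hash table, every hundred million email addresses require 1.6 GB of memory. (Concretely, each email address is mapped to an 8-byte fingerprint and the fingerprints are stored in the hash table. Because the storage efficiency of a hash table is typically only 50%, each address takes up 16 bytes, so one hundred million addresses need about 1.6 GB, that is, 1.6 billion bytes.) Storing several billion addresses could therefore require on the order of a hundred GB of memory.

A Bloom filter solves the same problem with only 1/8 to 1/4 of the size of the hash table.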
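The arithmetic behind these figures can be written out explicitly. The constants below come from the text (8-byte fingerprints, 50% hash table efficiency, 1/8 to 1/4 of the hash table size); the function and constant names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kAddresses = 100000000ULL;  // one hundred million
constexpr std::uint64_t kFingerprintBytes = 8;      // fingerprint per address
// 50% storage efficiency means each entry occupies twice its payload.
constexpr std::uint64_t kBytesPerEntry = kFingerprintBytes * 2;

constexpr std::uint64_t hash_table_bytes(std::uint64_t n) {
    return n * kBytesPerEntry;
}
constexpr std::uint64_t bloom_bytes_low(std::uint64_t n) {
    return hash_table_bytes(n) / 8;  // 1/8 of the hash table
}
constexpr std::uint64_t bloom_bytes_high(std::uint64_t n) {
    return hash_table_bytes(n) / 4;  // 1/4 of the hash table
}

// Filter bits available per stored address.
std::uint64_t bits_per_address(std::uint64_t filter_bytes, std::uint64_t n) {
    assert(n > 0);
    return filter_bytes * 8 / n;
}

static_assert(hash_table_bytes(kAddresses) == 1600000000ULL,
              "one hundred million addresses need 1.6 billion bytes");
```

For one hundred million addresses this gives 200 MB to 400 MB for the Bloom filter, which corresponds to 16 to 32 bits per address.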

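A Bloom filter never misses a suspicious address that is on the blacklist. As for false positives, a common remedy is to build a small whitelist of the addresses that might be misclassified.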

References

[1] Bloom filter; http://en.wikipedia.org/wiki/Bloom_filter

[2] Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol; http://pages.cs.wisc.edu/~cao/papers/summary-cache/

[3] Network Applications of Bloom Filters: A Survey; http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.127.9672&rep=rep1&type=pdf

[4] An Examination of Bloom Filters and their Applications; http://cs.unc.edu/~fabian/courses/CS600.624/slides/bloomslides.pdf

[5] 數學之美系列二十一－布隆過濾器 (Beauty of Mathematics, Part 21: Bloom Filter); http://www.google.com.hk/ggblog/googlechinablog/2007/07/bloom-filter_7469.html

/*
 * File:    bloomfilter.h
 * Created: 2010/10/31
 * Author:  Huang.WisKey
 * E-Mail:  sir.huangwei[at]gmail.com
 * Brief:
 *
 * May you do good and not evil.
 * May you find forgiveness for yourself and forgive others.
 * May you share freely, never taking more than you give.
 */

#pragma once

#ifndef __BLOOMFILTER_H__
#define __BLOOMFILTER_H__

#include <math.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

// String hash functions. They are defined in this header, so they are
// declared inline to stay safe when several source files include it.
inline unsigned int string_SDBM_hash(const char* _str);
inline unsigned int string_RS_hash(const char* _str);
inline unsigned int string_JS_hash(const char* _str);
inline unsigned int string_PJW_hash(const char* _str);
inline unsigned int string_ELF_hash(const char* _str);
inline unsigned int string_BKDR_hash(const char* _str);
inline unsigned int string_DJB_hash(const char* _str);
inline unsigned int string_AP_hash(const char* _str);

template <typename T>
class bloomfilter
{
public:
    typedef unsigned int hash_key;
    typedef hash_key (*hash_func_type)(const T&);
    typedef unsigned int cell_type;
    typedef unsigned int size_type;

protected:
    // Number of bits stored in one cell of the bit table.
    static const size_type bits_per_cell = sizeof(cell_type) * 8;

    // Default hash: treats T as a pointer to a NUL-terminated string
    // (for example T = const char*).
    static hash_key default_hash_func(const T& _obj)
    {
        return string_AP_hash(reinterpret_cast<const char*>(_obj));
    }

public:
    bloomfilter(size_type _elem_size,
                double _prob_false_positive,
                unsigned int _rand_seed = static_cast<unsigned int>(time(NULL)),
                hash_func_type _hash_func = default_hash_func)
        : bit_table_(NULL), table_size_(0),
          hash_func_(_hash_func ? _hash_func : default_hash_func),
          elem_bit_size_(0), elem_bit_randoms_(NULL), elem_size_(0),
          randoms_seed_(_rand_seed)
    {
        _optimal_parameters(_elem_size, _prob_false_positive);
        _generate_random();
        bit_table_ = new cell_type[cell_size()];
        memset(bit_table_, 0, size());
    }

    bloomfilter(const bloomfilter& _obj)
        : bit_table_(NULL), table_size_(_obj.table_size_),
          hash_func_(_obj.hash_func_),
          elem_bit_size_(_obj.elem_bit_size_), elem_bit_randoms_(NULL),
          elem_size_(_obj.elem_size_),
          randoms_seed_(_obj.randoms_seed_) // fix: the seed was not copied before
    {
        // bit table
        bit_table_ = new cell_type[cell_size()];
        memcpy(bit_table_, _obj.bit_table_, size());
        // per-hash random multipliers
        elem_bit_randoms_ = new hash_key[elem_bit_size_];
        memcpy(elem_bit_randoms_, _obj.elem_bit_randoms_, elem_bit_size_ * sizeof(hash_key));
    }

    virtual ~bloomfilter()
    {
        delete[] bit_table_;
        delete[] elem_bit_randoms_;
    }

    void insert(const T& _obj)
    {
        hash_key p, b;
        hash_key hash = hash_func_(_obj);
        bool exist = true;
        for (size_type i = 0; i < elem_bit_size_; ++i)
        {
            _get_pos(elem_bit_randoms_[i] * hash, p, b);
            exist = exist && (bit_table_[p] & (cell_type(1) << b));
            bit_table_[p] |= cell_type(1) << b;
        }
        // Count the element only if at least one of its bits was newly set.
        elem_size_ += !exist;
    }

    bool find(const T& _obj) const
    {
        hash_key p, b;
        hash_key hash = hash_func_(_obj);
        bool exist = true;
        for (size_type i = 0; i < elem_bit_size_ && exist; ++i)
        {
            _get_pos(elem_bit_randoms_[i] * hash, p, b);
            exist = exist && (bit_table_[p] & (cell_type(1) << b));
        }
        return exist;
    }

    double effective_fpp() const
    {
        /*
          Note:
          The effective false positive probability is calculated using the
          designated table size and hash function count in conjunction with
          the current number of inserted elements - not the user defined
          predicated/expected number of inserted elements.
        */
        return pow(1.0 - exp(-1.0 * elem_bit_size_ * elem_size_ / table_size_), 1.0 * elem_bit_size_);
    }

    size_type size() const { /* in bytes */ return table_size_ / 8; }
    size_type cell_size() const { return table_size_ / bits_per_cell; }
    size_type count() const { return elem_size_; }

    void clear()
    {
        memset(bit_table_, 0, size());
        elem_size_ = 0;
    }

    const cell_type* table() const { return bit_table_; }

    bloomfilter& operator &= (const bloomfilter& _obj)
    {
        /* intersection */
        if (_compatible(_obj))
        {
            size_type cells = cell_size();
            for (size_type i = 0; i < cells; ++i)
                bit_table_[i] &= _obj.bit_table_[i];
        }
        return *this;
    }

    bloomfilter& operator |= (const bloomfilter& _obj)
    {
        /* union */
        if (_compatible(_obj))
        {
            size_type cells = cell_size();
            for (size_type i = 0; i < cells; ++i)
                bit_table_[i] |= _obj.bit_table_[i];
        }
        return *this;
    }

    bloomfilter& operator ^= (const bloomfilter& _obj)
    {
        /* difference (bitwise XOR of the two bit tables) */
        if (_compatible(_obj))
        {
            size_type cells = cell_size();
            for (size_type i = 0; i < cells; ++i)
                bit_table_[i] ^= _obj.bit_table_[i];
        }
        return *this;
    }

private:
    // Copy assignment is declared but not implemented: the compiler-generated
    // version would copy the raw pointers and lead to a double delete.
    bloomfilter& operator = (const bloomfilter&);

protected:
    // Two filters can only be combined if they share the same layout,
    // seed and hash function.
    bool _compatible(const bloomfilter& _obj) const
    {
        return (elem_bit_size_ == _obj.elem_bit_size_) &&
               (table_size_ == _obj.table_size_) &&
               (randoms_seed_ == _obj.randoms_seed_) &&
               (hash_func_ == _obj.hash_func_);
    }

    void _optimal_parameters(size_type _elem_size_prob, double _prob_false_positive)
    {
        /*
          Note:
          The following will attempt to find the number of hash functions
          and minimum amount of storage bits required to construct a bloom
          _obj consistent with the user defined false positive probability
          and estimated element insertion count.
        */
        double min_m = 1e99;
        double min_k = 0.0;
        double curr_m = 0.0;
        // Start at k = 1: k = 0 would divide by zero in the exponent.
        for (double k = 1.0; k < 1000.0; ++k)
        {
            curr_m = (-k * _elem_size_prob) / log(1.0 - pow(_prob_false_positive, 1.0 / k));
            if (curr_m < min_m)
            {
                min_m = curr_m;
                min_k = k;
            }
        }

        elem_bit_size_ = static_cast<size_type>(min_k);
        table_size_ = static_cast<size_type>(min_m);
        // Use at least one bit per element, then round up to whole cells.
        // table_size_ is a bit count from here on.
        if (table_size_ < _elem_size_prob)
            table_size_ = _elem_size_prob;
        table_size_ = (table_size_ / bits_per_cell + 1) * bits_per_cell;
    }

    void _generate_random()
    {
        elem_bit_randoms_ = new hash_key[elem_bit_size_];
        srand(randoms_seed_);
        for (size_type i = 0; i < elem_bit_size_; ++i)
        {
            // +1 avoids a zero multiplier, which would map every element to bit 0.
            elem_bit_randoms_[i] = static_cast<hash_key>(rand()) + 1;
        }
    }

    // Maps a hash value to a cell index and a bit index inside that cell.
    // fix: the original divided by sizeof(hash_key), so only 4 bits of
    // every cell were ever used.
    void _get_pos(hash_key _hash, hash_key& _cell, hash_key& _bit) const
    {
        _hash %= table_size_;
        _cell = _hash / bits_per_cell;
        _bit = _hash % bits_per_cell;
    }

protected:
    cell_type*     bit_table_;        // the bit array
    size_type      table_size_;       // m: size of the bit array, in bits
    hash_func_type hash_func_;        // base hash function
    size_type      elem_bit_size_;    // k: number of hash values per element
    hash_key*      elem_bit_randoms_; // k random multipliers applied to the base hash
    size_type      elem_size_;        // n: number of inserted elements
    unsigned int   randoms_seed_;     // seed used to generate the multipliers
};

// SDBM Hash Function
inline unsigned int string_SDBM_hash(const char* _str)
{
    unsigned int hash = 0;
    while (*_str)
    {
        // equivalent to: hash = 65599*hash + (*_str++);
        hash = (*_str++) + (hash << 6) + (hash << 16) - hash;
    }
    return (hash & 0x7FFFFFFF);
}

// RS Hash Function
inline unsigned int string_RS_hash(const char* _str)
{
    unsigned int b = 378551;
    unsigned int a = 63689;
    unsigned int hash = 0;
    while (*_str)
    {
        hash = hash * a + (*_str++);
        a *= b;
    }
    return (hash & 0x7FFFFFFF);
}

// JS Hash Function
inline unsigned int string_JS_hash(const char* _str)
{
    unsigned int hash = 1315423911;
    while (*_str)
    {
        hash ^= ((hash << 5) + (*_str++) + (hash >> 2));
    }
    return (hash & 0x7FFFFFFF);
}

// P. J. Weinberger Hash Function
inline unsigned int string_PJW_hash(const char* _str)
{
    unsigned int BitsInUnsignedInt = (unsigned int)(sizeof(unsigned int) * 8);
    unsigned int ThreeQuarters = (unsigned int)((BitsInUnsignedInt * 3) / 4);
    unsigned int OneEighth = (unsigned int)(BitsInUnsignedInt / 8);
    unsigned int HighBits = (unsigned int)(0xFFFFFFFF) << (BitsInUnsignedInt - OneEighth);
    unsigned int hash = 0;
    unsigned int test = 0;
    while (*_str)
    {
        hash = (hash << OneEighth) + (*_str++);
        if ((test = hash & HighBits) != 0)
        {
            hash = ((hash ^ (test >> ThreeQuarters)) & (~HighBits));
        }
    }
    return (hash & 0x7FFFFFFF);
}

// ELF Hash Function
inline unsigned int string_ELF_hash(const char* _str)
{
    unsigned int hash = 0;
    unsigned int x = 0;
    while (*_str)
    {
        hash = (hash << 4) + (*_str++);
        if ((x = hash & 0xF0000000L) != 0)
        {
            hash ^= (x >> 24);
            hash &= ~x;
        }
    }
    return (hash & 0x7FFFFFFF);
}

// BKDR Hash Function
inline unsigned int string_BKDR_hash(const char* _str)
{
    unsigned int seed = 131; // 31 131 1313 13131 131313 etc..
    unsigned int hash = 0;
    while (*_str)
    {
        hash = hash * seed + (*_str++);
    }
    return (hash & 0x7FFFFFFF);
}

// DJB Hash Function
inline unsigned int string_DJB_hash(const char* _str)
{
    unsigned int hash = 5381;
    while (*_str)
    {
        hash += (hash << 5) + (*_str++);
    }
    return (hash & 0x7FFFFFFF);
}

// AP Hash Function
inline unsigned int string_AP_hash(const char* _str)
{
    unsigned int hash = 0;
    for (int i = 0; *_str; i++)
    {
        if ((i & 1) == 0)
            hash ^= ((hash << 7) ^ (*_str++) ^ (hash >> 3));
        else
            hash ^= (~((hash << 11) ^ (*_str++) ^ (hash >> 5)));
    }
    return (hash & 0x7FFFFFFF);
}

#endif // __BLOOMFILTER_H__

