L9 Hashing

Outline:

  • The searching problem
    • The ambition of hashing
  • Hashing
    • Brute force table: direct addressing
    • Basic idea of hashing
  • Collision Handling for Hashing
    • Closed address hashing
    • Open address hashing
  • Amortized Analysis
    • Array doubling

Ref:

  • 算法设计与分析(Algorithm design and analysis) by 黄宇

The searching problem

Cost for Searching

  • Brute force
    • \(O(n)\)
  • Balanced BST
    • \(O(logn)\)
  • Hashing --- almost constant time
    • \(O( 1 + \alpha)\)
  • "Mission impossible"
    • \(O(1)\)

Hashing

  • universe of keys: 所有可能的键
  • actual keys:实际的键

Brute force table: direct addressing

  • 给所有 universe keys 都分配一个位置。 用超多的空间

Basic idea of hashing

  • Index distribution
    • Hash Function \(H(x) = k\)
  • Collision handling

Collision Handling for Hashing

  • 规定m为哈希表大小, n为表中元素个数

Closed address hashing

  • 也称为Chaining

  • Each address is a linked list

  • Insert to the head of the list( 链表不是常数时间寻址 )

  • Assumption - simple uniform hashing

    • For \(j=0,1,2,\dots, m-1\)​​, the average length of the list at \(E[j]\)​​​ is n/m
  • The average cost for an unsuccessful search

    • Any key that is not in the table is equally likely to hash to any of the m address

    • Total cost \(\Theta(1+n/m)\)

      • The average cost to determine that the key is not in the \(E[h(k)]\) is the cost to search to the end of the list, which is n/m

      • 1是算哈希, n/m是链表平均长度

  • For successful search( assuming that \(x_i\) is the \(i^{th}\) element inserted into the table, \(i=1,2,\dots,n\) )

    • For each i, the probability of that \(x_i\)​ is searched is 1/n

    • For a specific \(x_i\)​​, the number of elements examined in a successful search is t+1, where t is the number of elements inserted into the same list as \(x_i\), after \(x_i\)​​​​​​ has been inserted(即链表里在该元素之前的元素的个数). \[ 1 + \frac 1 n \sum\limits_{i=1}^{n}{(1+t)} \]

    • And for any j, the probability of that \(x_j\) is inserted into the same list of \(x_i\) is 1/m. So, the cost is:

      \[ 1 + \frac 1 n \sum\limits_{j=i+1}^{n}{\frac 1 m} \]

  • The average cost of a successful search:

    • Define \(\alpha = n/m\) as load factor, The average cost of a successful search is: \[ \frac 1 n { ( 1 + \sum\limits_{j = i + 1}^n {\frac 1 m} )} = 1 + \frac 1 {nm} {\sum\limits_{i = 1}^n {(n-i)}} = 1 + \frac 1 {nm} \sum_{i=1}^{n-1}i = 1 + \frac {n-1} {2m} = 1 + \frac \alpha 2 - \frac \alpha {2n} = \Theta( 1 + \alpha ) \]

Open address hashing

  • All elements are stored in the hash table
    • No linked list is used
    • The load factor \(\alpha\) can't be larger than 1
  • Collision is settled by "rehashing"
    • A function is used to get a new hashing address for each collided address
      • The hash table slots are probed successively, until a valid location is found.
  • Th probe sequence can be seen as a permutation of \((0,1,2,\dots,m-1)\)

Commonly Used Probing

  • Linear probing:

    • Given an ordinary hash function h', which is called an auxiliary hash function, the hash function is:( clustering may occur ) \[ h(k,i) = (h'(k)+i)\quad mod \quad m \quad ( i = 0,1, \dots, m-1 ) \]
  • Quadratic Probing:

    • Given auxiliary function h' and nonzero auxiliary constant \(c_1\) and \(c_2\)​, the hash function is: (secondary clustering may occur) \[ h(k,i) = (h'(k)+c_1 i + c_2 i^2)\quad mod \quad m \quad ( i = 0,1, \dots, m-1 \]
  • Double hashing:

    • Given auxiliary functions h_1 and h_2, the hash function is: \[ h(k,i) = (h_1(k)+i h_2(k))\quad mod \quad m \quad ( i = 0,1, \dots, m-1 \]

Equally Likely Permutations

  • Assumption
    • Each key is equally likely to have any of the m! permutations of \((1,2,\dots,m)\)​ as​ its probe sequence
  • Note
    • Both linear and quadratic probing have only m distinct probe sequence, as determined by the first probe.(m个格子的合法排列)

Unsuccessful Search

  • The average number of probes in an unsuccessful search is at most \(1/(1-\alpha )\)\((\alpha = n/m < 1)\)

    • Assuming uniform hashing

      The probability of the first probed position being occupied is n/m, and that of the \(j^{th}(j>1)\) position occupied is \(\frac {n-j+1}{m-j+1}\)​​. So the probability of the number of probed no less than i will be:(概率论知识) \[ \frac n m \cdot \frac {n-1} {m-1} \cdot {n-1} {m-2} \cdots \cdot \frac {n-i+2} {m-i+2} \le { (\frac n m )}^{i-1} = \alpha ^{i-1} \]

      The average number of probe is: \(\sum\limits _{i=1} ^ \infty {\alpha ^ {i - 1}} = \sum\limits _{i=0} ^ \infty {\alpha ^ {i }} = \frac {1} {1-\alpha}\)​ (其实上界是m,因为一共m个格子。 这里做了放大,但这个放大得到的上界还是比较 tight的)

Successful Search

  • The average number of probes in an successful search is at most \(\frac 1 \alpha ln \frac 1 {1-\alpha}\) (\(\alpha = n /m <1\)​)​

    • Assuming uniform hashing

      To search for the \((i+1)^{th}\) inserted element in the table, the cost is the same as that for inserting it when there are just i elements in the table. At that time, \(\alpha = \frac i m\). So the cost is( 用 unsuccessful的结论 ) \(\frac 1 {1- \frac i m}\) = \(\frac m {m-i}\)​, \[ \frac 1 n \sum\limits_{i=0}^{n-1} \frac m {m-i} = \frac m n \sum\limits _{i=0}^{n-1} \frac 1 {m-i} = \frac 1 \alpha \sum\limits _{i=m-n+1}^{m} \frac 1 i \le \frac 1 \alpha \int _{m-n} ^m {\frac {dx} x} = \frac 1 \alpha ln \frac m {m-n} = \frac 1 \alpha ln \frac 1 {1-\alpha} \]

Amortized Analysis

Array doubling

  • 顾名思义

Worst-case Analysis

  • For n execution of insertion operations

    • Note that the expansion is required during the \(i^{th}\)​ operation only if \(i = 2^{th}\)​, and the cost of the \(i^{th}\)​ operation $ c_i = i $​( if i - 1 is exactly the power of 2 ) or $c_i = 1 $ ​(otherwise)​

    So the total cost is: \(\sum\limits _{i=1} ^n c_{i} \le n + \sum\limits _{j=0} ^{\lfloor logn \rfloor} 2^j < n + 2n = 3n\)

Amortized Analysis- Why?

  • Unusually expensive operations
  • Relation between expensive and usual operations
    • Each piece of the doubling cost corresponds to some previous insert

Amortized Analysis- How?

  • Amortized equation:

    amortized cost = actual cost + accounting cost

  • Design goals for accounting cost

    • In any legal sequence of operations, the sum of the accounting costs is nonnegative
    • The amortized cost of each operation is fairly regular, in spite of the wide fluctuate possible for the actual cost of individual operations
  • Examples:

    1. Multi-pop Stacks
    Amortized Actual Accounting
    Push 2 1 1
    Multi-pop 0 k -k
    • 相当于每个元素在出生的时候就要记下它死亡的时候的代价
    1. Binary Counter
    Amortized Actual Account
    Set 1 2 1 1
    Set 0 0 1 -1

Amortized Analysis - Array Doubling

Amortized Actual Account
Insert ( normal ) 3 1 2
Insert ( doubling ) 3 k+1 -k + 2
  • k is the number of elements upon doubling
  • -k+2: 清除前k个元素的代价, 新插入的元素要付出2的代价