L6 MergeSort

Oultline

  • MergeSort
    • Worst-case analysis of MergeSort
    • More than sort: the MergeSort D&C
  • Lower Bounds for comparison-based sorting ( nlogn )
    • Worst-case ( Omega(nlogn) )
    • Average-case (nlogn - 1.443n )

Ref:

  • 算法设计与分析(Algorithm design and analysis) by 黄宇

MergeSort

Space Complexity of Merge

  • A algorithm is "in space"
    • If the extra space it has to use is in Omega(1)
  • Merge is not an in place algorithm
    • Since it needs O(n) extra space to store the merged sequence during the merging process

Worst case Complexity of MergeSort

  • Observations
    • Worst case is that comparison is conducted between A[k-1] and B[m-1]
    • After each comparison, one element is inserted into Array C, at least
    • After entering Array C , an element will never be compared again
    • After last comparison, at least two elements( the two just compared ) have not yet been moved to Array C. So at most n - 1 comparisons are done.
  • In worst case, n - 1 comparisons are done, where n = k + m

Optimality of Merge

  • Any algorithm to merge two sorted arrays, each containing k = m = n/2 entries, by comparison of keys, does at least(如果算法笨的话可能有重复比较) n - 1 comparisons in the worst case.

    • Choose keys so that:

      b0 < a0 < b1 < a1 < ... < bi < ai< bi+1,..., < bm-1 < ak-1

    • Then the algorithm must compare ai with bi for every i in [ 0, m - 1 ], and must compare ai with bi+1 for every i in [0, m - 1], so there are n - 1 comparisons

  • 先考虑最好情况,也就是两边数组大小不一样的时候,一个数组全部比完了,那么另一个数组的剩余部分就在这次比较后全部插入辅助数组中, 比较次数小于 n - 1

  • 反之,最坏情况就是"其中一个数组比完了"这个情况不发生,也就是两个数组一直比到最后, 也就是两个数组的最后一个元素相互比较( "comparison is conducted between A[k-1] and B[m-1]" ), 这就要求k = m = n/2

  • 可能有人会疑惑, 如果两数组大小极度不均衡,但是较小数组的最后一个元素远大于较大数组的所有元素,这不也是比到最后吗? 也是 n - 1 吗? 为什么最坏情况的构造里, 还要要求 k = m = n/2 呢?

    • 因为对于这种情况,对于较小的数组,完全可以用折半查找来插入,不需要归并了.

Worst case Analysis of MergeSort

  • The recurrence equation for MergeSort
    • W(n) = W( floor( n/2 ) ) + W( cell( n/2 ) ) + n - 1 (或者O(n))
    • W (1) = 0
  • The Master Theorem applies for the equation, so:
    • W(n) ∈ Omega(nlogn)
  • 精细分析, 得出 W(n) = nlogn - ( alpha - log alpha )n + 1
    • cell( nlogn - n + 1 ) <= W(n) <= cell( nlogn - 0.914n)

The MergeSort D&C

  • Counting the number of inversions

    • Brute force: O(n2) 蛮力做法
    • Can we use divide & conquer
      • In O(nlogn) => combination in O(n)
  • MergeSort as the carrier 用归并排序做

    • Sorted subarrays

      • A[ 0... k-1] and B[ 0 ... m-1 ]
    • Compare the left and the right elements

      • A[i] vs. B[j]

      • if A[i] > B[j]

        • (i,j) is an inversion

        • For all i' > j

          ( i' , j) are inversions ( i' > i )

        • B[j] is selected

      • if A[i]] < B[j]

        • No inversions found
        • A[i] is selected

Lower Bounds for comparison-based sorting

  • Upper bound, e.g., worst-case cost 给定一个算法,输入不同,复杂度有一个上界(比如worst-case)
    • For any possible input, the cost of the specific algorithm A is no more than the upper bound
      • Max{ Cost( i ) | i is an input }
  • Lower bound, e.g., comparison-based sorting 比如,对于所有可能的算法,每个算法都有一个worst-case, 这是个上界,对所有上界取下界,就是worst-case的下界。相应的, 所有可能的算法的所有可能的期望值,也就是average-case复杂度,的下界,就是average-case复杂度的下界
    • For any possible( comparison-based ) sorting algorithm A, the worst-case is no less than the lower bound
      • Min{ Worst-case(a) | a is an algorithm }

Decision Tree for Sorting

  • Decision tree is a 2-tree( assuming no same keys )
  • The action of Sort on a particular input corresponds to following on path in its decision tree from the root to leaf associated to the specific output
  • Decision tree 是刻画所有 comparison-based 的排序的数学工具

Characterizing the Decision Tree

  • For a sequence of n distinct elements, there are n! different permutation

    • 叶节点是所有可能的输出,这是n个输入元素的所有可能的排列,因此是n!

    • So, the decision tree has at least n! leaves, and exactly n! leaves can be reached from the root

    • So, for the purpose of lower bounds evaluation, we use trees with exactly n! leaves.

  • The number of comparison done in the worst case is the height of the tree

  • The average number of comparison done is the average of lengths of all paths from the root to a leaf.

  • 变成了离散数学问题

Lower Bound for Worst Case

  • Theorem: Any algorithm to sort n items by comparisons of keys must do at least cell(log n!). or approximately cell( nlogn - 1.443n), key comparisons in the worst case.
    • Lemma: let L be the number of leaves in a binary tree and h be its height. Then L <= 2^h 可用归纳法证明
      • 即高度为h, h >= log L , L = n!, 所以 h >= log( n! )
    • log( n! ) >= ... >= log( (n/2)^ (n/2) ) = n/2log(n/2) ∈ Omega( nlogn )
  • 因此,worst-case的下界为nlogn

External Path Length( EPL )

  • EPL --- sum of path length to every leaf
    • The EPL t is recursively defined as follows:
    • [Base case] 0 for a single external node
    • [Recursion] t is non-leaf with sub-trees L and R, then the sum of:
      • The external path length of L
      • the number of external node of L ( 每个完整的树作为子树时,节点的深度都要下沉1,所以递归合并时每个叶节点对应的路径长度都要加一 )
      • The external path length of R
      • the number of external node of R

Lower Bound for Average Behavior

  • Since a decision tree with L leaves is a 2-tree, the average path length from the root to leaf is epl / L

  • Recall that epl >= L log( L )

  • 每个输出对应的概率是 1 / L , 而所有代价的总和是epl, 所以平均情况复杂度在所有可能的输入等概率的前提下,是 epl/L, 因此求average case的下界,就是求epl的下界

  • Theorem: The average number of comparisons done by an algorithm to sort n items by comparison of keys is at least log(n!), which is about nlogn - 1.443n 

  • More Balanced 2-tree, Less EPL

    • 设一棵 决策树有两个节点,高度分别为h , k, h - k > 1( 即该树不平衡 ). 高度为h的节点有两个叶节点, 高度k的节点没有叶节点( 由于是2-tree, 不可能只有一个叶子节点 )
    • Assuming that h - k > 1>, when calculating epl , h + h + k is replaced by ( h - 1 ) + 2( k + 1 ). the net change in epl is k - h + 1 < 0, that is , the epl decreases
    • 因此,求epl的下界,就是求最平衡的树的epl, 最平衡的二叉树就是`完美二叉树. 由此可证明: epl >= L log( L )

MergeSort Has Optimal Average Performance

  • The average number of comparisons done by an algorithm to sort n items by comparison of keys is at least about nlogn - 1.443n
  • The worst complexity of MergeSort is in Omega(nlogn)( 之前已证 )
  • So, MergeSort is optimal as for its average performance
    • 首先, 归并排序的worst-case是nlogn, 而average-case必定小于等于worst-case, 即 MergeSort的average=case的上界是nlogn. 作为一个comparison sorting, 由于comparison sorting的average-case的下界是nlogn. 因此MergeSort的average-case的下界是nlogn. 夹逼得出, 所以MergeSort的average-case就是nlog级别