Polynomial Matching

Introduction

We are to going to implement polynomial regexp/nfa matcher using the techniques of dynamic programming and subset-construction on the fly.

Open a file named prog4_poly_match.py.

0 -- Dynamic Programming

Dynamic Programming is a programming technique to improve algorithms running time. In many cases, the gains can be exponential and we are going to see examples of such benefits with the following problems.

Dynamic programming is possible if an algorithm has parts that are run repetively with the same inputs. In this case, it is possible to avoid recomputation by saving the result in a table indexed by the inputs.

The most common example to illustrate the use of dynamic programming is for the implementation of the Fibonacci function. The naive implementation, corresponds to its mathematical definition and is recursive:


def fibo(n):
  if n <= 1:
    return 1
  else:
    return fibo(n-1) + fibo(n-2)

An analysis of the time complexity of fibo() run on an input n shows that the number of recursive calls to fibo(i) will be exponential in n-i. The dynamic programming solution to fibo is to equip the function with a table storing the result of the function for each input. In practice, we implement the table as a global variable holding a Python dictionary.

fibo_tab = dict()

  
The only two operations necessary on the table is to look-up for the presence of a key
key in fibo_tab
and adding a binding key-value to the dictionary
fibo_tab[key] = value
With this setting, it is almost mechanical to transform the naive fibo() function to its dynamic programming version. fibo_dyn() resets the table and call the function fibot() that is similar to fibo() but looks up into the table before making recomputing anything:
  
def fibo_dyn(n):
  global fibo_tab
  fibo_tab = dict()
  return fibot(n)

def fibot(n):
  global fibo_tab
  if n in fibo_tab:
    return fibo_tab[n]
  else:   
    if n <= 1:
      ret = 1
    else:
      ret = fibot(n-1) + fibot(n-2)

    fibo_tab[n] = ret
    return ret

  
In terms of time complexity, you may notice significant differences between fibo_dyn(n) and fibo(n), for input n bigger than 30.

1 -- Regexp Matching with Dynamic Programming

Consider the following pseudo-code for matching a regular expression E with a string w

match(w,E) =
  case E of
  - eps: return w == ""
  - a: return w == a
  - union(E1,E2):
     b1 = match(w,E1)
     b2 = match(w,E2)
     return (b1 || b2)
  - conc(E1,E2):
      for each split w=w1w2
         b1 = match(w1,E1)
         b2 = match(w2,E2)
         if b1 && b2:
           return true
      return false  
  - star(E1):
      if w == "":
        return True
      else:
        for each split w = w1w2 with w1 != "":
          b1 = match(w1,E1)
          b2 = match(w2,star(E1))
          if b1 && b2:
            return True
        return False

The pseudo-code above should be similar to the function you programmed in the Programming Assignment 1, except that it does not require the use of split_all() in the case of star(E1), the function split() is sufficient. The time complexity of match() is exponential in the size of the input w, in particular with a regexp that has a Kleene's star.

Let's see why dynamic programming applies here. Consider the execution of match(w,E) for a fixed input string w and regexp E. Its execution will generate many recursive calls, sometimes on a substring of w, sometimes on a subsexpression of E. This suggests that there is a maximum number different input arguments for the recursive calls. More specifically, since there is at most |w|^2 substrings of w, and there is at most |E| subexpressions for the regexp E, then we conclude that there will be at most |w|^2 |E| different inputs for the recursive calls.

Compared to the Fibonacci function, the dynamic programming extension of match() has a table whose keys are made of a substring and a subexpression. In Python, it will be convenient to index using integers instead of constructed types like strings and regexp. Any substring of w is identified by the positions of the beginning i and the end j in w. To indicate subexpression, we extend regular expressions with unique indexes for each subexpression in a regexp, see the first part of the problem below:

Problem 1

  1. Copy paste the code from the programming assignment 1 corresponding to the class Re, the constructors for regular expressions, and the matching function match(). Modify the class Re to add an index instance variable. Declare a global variable index_counter. This global variable is used by the constructors of regular expressions to set the index in every constructed regexp. The goal is to make sure any two different subexpressions in a regular expression have different indexes. Modify your code from Programming Assignment 1. Below is the modifications for the class Re and the regexp constructor conc():
    
    class Re:
        def __init__(self):
            self.type = None
            self.char = None
            self.left = None
            self.right = None
            self.index = None
    
    index_counter = 0
    
    def conc(e1,e2):
      global index_counter
      r = Re()
      r.type = "conc"
      r.left = e1
      r.right = e2
      r.index = index_counter
      index_counter += 1
      return r  
    
    
  2. Implement a dynamic programming version of match(), called match_re_dyn(w,e) and based on the following skeleton code. This skeleton code already includes most of the dynamic programming extension of the implementation. Notice that a key is of the form Tuple([i,j,e.index]), that is a tuple made of the positions of the beginning and end of the substring and the index of the subexpression:
    
    match_tab = dict()
    
    def match_re_dyn(w,e):
      global match_tab
      match_tab = dict()
      return matcht(w,0,len(w),e)
    
    def matcht(w,i,j,e):
      global match_tab
      if tuple([i,j,e.index]) in match_tab:
        return match_tab[tuple([i,j,e.index])]
      else:   
        if e.type == "eps":
            ...
            ret = ...
        elif e.type == "symb":
            ...
            ret = ...
        elif e.type == "union":
            ...
            ret = ...
        elif e.type == "conc":
            ...
            ret = ...
        elif e.type == "star":
            ...
            ret = ...
    
        match_tab[tuple([i,j,e.index])] = ret
        return ret
    
        
    In parctice you will also have to modify split(w) to return the positions i and j of the substring.

2 -- NFA Matching with Dynamic Programming

You are going to apply the same methodology to improve the worst case complexity of the nfa string matching function from Programming Assignment 2.

Problem 2

Following the same ideas as in the previous problem, provide a dynamic programming implementation of match_nfa(nfa,w,q), called match_nfa_dyn(nfa,w,q). You may copy-paste any necessary code from your Programming Assignment 2.

3 -- Subset construction on-the-fly

The most common algorithm used to match an NFA with a string is using the so-called subset construction on-the-fly. This technique avoids the construction of the DFA using the subset construction, which is exponential in the worst case, and instead computes the subsets of states that are only relevant to the input string to be matched. This is the reason why it is called the subset constrcution on-the-fly. Compared to the match_nfa(nfa,w,q) matching function from the Programming assignment 2 that has only one current state q, the on-the-fly match function match_nfa_fly(nfa,w,P) has a list of states P as last parameter. Its pseudo code is the following:

match_nfa_fly(nfa,w,P) =
  case w of
  - w=eps:
      if there exists p in P such that p is in nfa.F:
        return true
      else:
        return false
  - w = a w1:
      P’ := { q | there exists p in P such that (p,a,q) in nfa.Delta }
      return match_nfa_fly(nfa,w1,P’)

The solution of the matching problem is then obtained by running match(w,{s}) The complexity of this algorithm is O(|w| × |Q| × |∆|) in time and O(|w| × |Q|) in space, thus polynomial in time and space.

Problem 3

Using any useful code from the Programming assignment 2, implement the function match_nfa_fly(nfa,w,P).

def match_nfa_fly(nfa,w,p) =
  ...

4 Test your work

Problem 4

For each Problems 1, 2, and 3, find a regexp/nfa and string that illustrate the worst case running time of the original algorithm and demonstrate how the dynamic-programming/subset-construction-on-the-fly version is significantly more efficient. Demonstrate it by showing 10 seconds matching versus instantaneous matching with the dynamic-programming/subset-construction-on-the-fly technique.

5 Submit your work

Copy your file to the csci341 git folder. Add, commit and push your work.