Programming Regexps

Introduction

Regular expressions are cool! You are going to write the answer to the problems into a Python file named prog1_re.py

1 A datatype for regular expressions

Regular expressions can be of 5 different types:
  1. epsilon
  2. a symbol
  3. the union of two regexps
  4. the concatenation of two regexps
  5. the Kleene's star of a regexp
A regexp can be easily encoded using a binary tree structure. Each node of the tree contains the information related to the type of regexp it encodes. A node may have no child (epsilon, symbol), or two children (union, concatenation) or just one child (star). To encode regexps in Python we choose the following class Re containing the following instance variables: an instance variable type indicating which type of regexp it is, an instance variable symb in case it is a symbol regexp, and two instance variables left and right for possible children. To illustrate how the various instances are used for each case of a regular expression construction, we provide a to_string() method that translates a regular expression into a string.

class Re:
    def __init__(self):
        self.type = None
        self.symb = None
        self.left = None
        self.right = None

    def to_string(self):
        if self.type == "eps":
            return "\u0395"
        elif self.type == "symb":
            return self.symb
        elif self.type == "union":
            return self.left.to_string() + "\u222A" + self.right.to_string()
        elif self.type == "conc":
            return self.left.to_string() + "." + self.right.to_string()
        elif self.type == "star":
            return self.left.to_string() + "*"

The definition of the class Re ends right here for now. We won't define constructors following the OOP programming fashion. Instead, constructors for regular expressions are defined outside its class, simply as functions. We need 5 constructors, one for each type of regexp. We provide the two constructors for epsilon and union:

def eps():
    r = Re()
    r.type = "eps"
    return r

def union(e1,e2):
    r = Re()
    r.type = "union"
    r.left = e1
    r.right = e2
    return r

  

Problem 1

Following the same format and taking into account how the to_string() function works, define the remaining constructors:

def symb(a):
    ...

def conc(e1,e2):
    ...

def star(e1):
    ...

    

Once you have defined all the constructors, it is easy to write regexps. For example, the regexp a.b is encoded as conc(symb("a"),symb("b")). The regexp (a*.b*)* is encoded as star(conc(star(symb("a")),star(symb("b")))). If you don't like to type symb("a") or symb("b") everytime then you can define ra = symb("a") and rb = symb("b"), and use raand rb to define regexps. Considering the two previous expressions, they could be defined as conc(ra,rb) and star(conc(star(ra),star(rb))), which is shorter and more readable.

Problem 1'

Now when you print the string associated with this last regexp you get:

print((star(conc(star(ra),star(rb)))).to_string())
a*.b**

This is a little bit confusing because there is an ambiguity. It is not clear whether the last star is applied to b* or to the entire expression. Modify the to_string() such it includes parantheses in all cases. After that, you may consider that it prints too many parantheses like in ((a)*.(b)*)*, but that's okay for this assignment.

Problem 2

Encode regexps corresponding to the following regular languages (with Σ = {a,b}):

  1. ex1 = is the regexp for the language containing all the strings starting and ending with the same symb.
  2. ex2 = is the regexp for the language containing all the strings with "abba" as substring.
  3. ex3 = is the regexp for the language containing all strings without 3 consecutive a's.
  4. ex4 = is the regexp for the language containing all strings of even length.
  5. ex5 = is the regexp for the language containing all strings such each a is followed by at least three b's.

2 Matching String with Regexp

The goal of this section is to implement the function match(w,e) that decides whether a string w matches the regexp e. To implement this function, you are going to need two helper functions split(w) and split_all(w). These functions are necessary for the iteration in the concatenation and Kleene's star cases. We provide these functions below:

def split(w):
    res = []
    for i in range(len(w)+1):
        res += [ [w[0:i],w[i:]] ]
    return res

def split_all(w):
    if len(w) == 0:
        return [ ]
    elif len(w) == 1:
        return [ [w] ]
    else:
        res = [ ]
        for i in range(len(w)-1):
            w1 = w[0:i+1]
            res += [ [w[0:i+1]]+lst for lst in split_all(w[i+1:]) ]
        res += [[w]]
        return res

The function call split(w) returns the list of pair of strings such that the concatenation of the two string in a pair is equal to w. The function call split_all(w) returns the list containing all the list of strings of the form [w1,w2,...,wn] such that each wi is not the empty string and the concatenation w1+w2+...+wn of all these strings is equal to w. I let you test this function on the string "xyzt" to make sure you understand what it does.

> split("xyzt")
[['', 'xyzt'], ['x', 'yzt'], ['xy', 'zt'], ['xyz', 't'], ['xyzt', '']]
> split_all("xyzt")

[['x', 'y', 'z', 't'], ['x', 'y', 'zt'], ['x', 'yz', 't'], ['x', 'yzt'],
 ['xy', 'z', 't'], ['xy', 'zt'], ['xyz', 't'], ['xyzt']]

Problem 3

Implement the function match(w,e) using the skeleton code below. It returns True when it matches, and it returns False otherwise.

def match(w,e):
    if e.type == "eps":
        return w == ""
    elif e.type == "symb":
        ...
    elif e.type == "union":
        ...
    elif e.type == "conc":
        for p in split(w):
            ...
        return False
    elif e.type == "star":
        ...


You may need one helper function for the Kleene's star case. All the other cases should translate more or less directly from the pseudo-code for match()

4 Test your work

Problem 4

This problem is for testing your previous work. You have to test to match each regexp from Problem 2 with 3 strings that you believe should match and 3 strings that you believe should not match. For example if you had to test the regexp ex0 = conc(union(ra,rb), union(ra,rb)) you would come up with the 6 strings "ab","ba", "bb", "", "a", "bab". To help you in this task, we provide a test_match function that takes a list of tests and run them one after another and check that the output is as expected.

def test_match(lst):
    if len(lst) == 0:
        return
    else:
        the_test = lst[0]
        w = the_test[0]
        e = the_test[1]
        b = the_test[2]
        print("match("+ w + ",  " + e.to_string() + ") =? " + str(b))
        if match(w,e) == b:
            print("Ok")
            test_match(lst[1:])
        else:
            raise Exception("I have to fix that")

Using this function you could test match() on ex0 by typing:

> test_match([ [ "ab",ex0,True ], [ "ba",ex0,True], [ "bb",ex0, True ],
               [ "",  ex0,False], [ "a" ,ex0,False],[ "bab",ex0,False] ])
match(ab,  (a∪b).(a∪b)) =? True
Ok
match(ba,  (a∪b).(a∪b)) =? True
Ok
match(bb,  (a∪b).(a∪b)) =? True
Ok
match(,  (a∪b).(a∪b)) =? False
Ok
match(a,  (a∪b).(a∪b)) =? False
Ok
match(bab,  (a∪b).(a∪b)) =? False
Ok

5 Submit your work

Create a csci341 folder to your gitlab account. Add me bhcr001 to the list of developers for this project. Copy your file to the folder. Add, commit, push.