forked from scala-lms/tutorials
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathautomata.scala
More file actions
206 lines (168 loc) · 6.57 KB
/
automata.scala
File metadata and controls
206 lines (168 loc) · 6.57 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
/**
# Automata-Based Regex Matcher
Outline:
<div id="tableofcontents"></div>
Specializing string matchers and parsers is a popular benchmark in the partial
evaluation and supercompilation literature.
<!--
- http://dblp.uni-trier.de/db/ journals/ipl/ConselD89
- http://dblp.uni-trier.de/db/ journals/toplas/AgerDR06
- http://dblp.uni-trier.de/db/ journals/toplas/SperberT00
- http://dblp.uni-trier.de/db/ journals/toplas/Turchin86,
- http://dblp.uni-trier.de/db/ journals/jfp/SorensenGJ96
-->
We consider ``multi-threaded'' regular expression matchers, that spawn a new
conceptual thread to process alternatives in parallel. Of course these matchers
do not actually spawn OS-level threads, but rather need to be advanced manually
by client code. Thus, they are similar to coroutines.
Regexp Matchers as Nondeterministic Finite Automata (NFA)
---------------------------------------------------------
Here is a simple example for the fixed regular expression `.*AAB`:
def findAAB(): NIO = {
guard(Set('A')) {
guard(Set('A')) {
guard(Set('B'), Found)) {
stop()
}}} ++
guard(None) { findAAB() } // in parallel...
}
We can easily add combinators on top of the core abstractions that take care
of producing matchers from textual regular expressions. However the point
here is to demonstrate how the implementation works.
The given matcher uses an API that models nondeterministic finite automata
(NFA):
type NIO = List[Trans] // state: many possible transitions
case class Trans(c: Set[Char], x: Flag, s: () => NIO)
def guard(cond: Set[Char], flag: Flag)(e: => NIO): NIO =
List(Trans(cond, flag, () => e))
def stop(): NIO = Nil
An NFA state consists of a list of possible transitions. Each transition may
be guarded by a set of characters and it may have a flag to be signaled if
the transition is taken. It also knows how to compute the following state. We
use `Char`s for simplicity, but of course we could use generic types as well.
Note that the API does not mention where input is obtained from (files,
streams, etc).
From NFA to DFA using Staging
-----------------------------
We will translate NFAs to DFAs using staging. This is the unstaged DFA API:
abstract class DfaState {
def hasFlag(x: Flag): Boolean
def next(c: Char): DfaState
}
def dfaFlagged(flag: Flag, link: DfaState) = new DfaState {
def hasFlag(x: Flag) = x == flag || link.hasFlag(x)
def next(c: Char) = link.next(c)
}
def dfaState(f: Char => DfaState) = new DfaState {
def hasFlag(x: Flag) = false
def next(c: Char) = f(c)
}
The staged API is just a thin wrapper:
type DIO = Rep[DfaState]
def dfa_flag(x: Flag)(link: DIO): DIO
def dfa_trans(f: Rep[Char] => DIO): DIO
Translating an NFA to a DFA is accomplished by creating a DFA state for each
encountered NFA configuration (removing duplicate states via `canonicalize`):
def convertNFAtoDFA(states: NIO): DIO = {
val cstates = canonicalize(state)
dfa_trans { c: Rep[Char] =>
exploreNFA(cstates, c)(dfa_flag) { next =>
convertNFAtoDFA(next)
}
}
}
iterate(findAAB())
The LMS framework memoizes functions (see [here (PDF)](http://infoscience.epfl.ch/record/178274/files/lms-cacm2012.pdf)) which
ensures termination if the NFA is indeed finite.
We use a separate function to explore the NFA space (see
below), advancing the automaton by a symbolic character
`cin` to invoke its continuations `k` with a new automaton, i.e. the possible
set of states after consuming `cin`. The given implementation assumes
character sets contain either zero or one characters, the empty set `Set()`
denoting a wildcard match. More elaborate cases such as character ranges are
easy to add. The algorithm tries to remove as many redundant checks and
impossible branches as possible. This only works because the character guards
are staging time values.
def exploreNFA[A](xs: NIO, cin: Rep[Char])(flag: Flag => Rep[A] => Rep[A])
(k: NIO => Rep[A]):Rep[A] = xs match {
case Nil => k(Nil)
case Trans(Set(c), e, s)::rest =>
if (cin == c) {
// found match: drop transitions that look for other chars and
// remove redundant checks
val xs1 = rest collect { case Trans(Set(`c`)|None,e,s) => Trans(Set(),e,s) }
val maybeFlag = e map flag getOrElse (x=>x)
maybeFlag(exploreNFA(xs1, cin)(acc => k(acc ++ s())))
} else {
// no match, drop transitions that look for same char
val xs1 = rest filter { case Trans(Set(`c`),_,_) => false case _ => true }
exploreNFA(xs1, cin)(k)
}
case Trans(Set(), e, s)::rest =>
val maybeFlag = e map flag getOrElse (x=>x)
maybeFlag(exploreNFA(rest, cin)(acc => k(acc ++ s())))
}
The generated code is shown further down below. Each function
corresponds to one DFA state. Note how negative information has been used to
prune the transition space: Given input such as `...AAB` the automaton jumps
back to the initial state, i.e. it recognizes that the last character B
cannot also be A and starts looking for two As after the B.
Generated State Machine Code
----------------------------
To use the generated code, we use the following small driver loop:
var state = stagedFindAAB()
var input = ...
while (input.nonEmpty) {
state = state.next(input.head)
if (state.hasFlag(Found))
println("found AAB. rest: " + input.tail)
input = input.tail
}
If the matcher and input iteration logic is generated together, further
translations can be applied to transform the mutually recursive lambdas into
tight imperative state machines.
def stagedFindAAB(): DfaState = {
val x7 = { x8: (Char) =>
// matched AA
val x9 = x8 == B
val x15 = if (x9) {
x11
} else {
val x12 = x8 == A
val x14 = if (x12) {
x13
} else {
x10
}
x14
}
x15
}
val x13 = dfaState(x7)
val x4 = { x5: (Char) =>
// matched A
val x6 = x5 == A
val x16 = if (x6) {
x13
} else {
x10
}
x16
}
val x17 = dfaState(x4)
val x1 = { x2: (Char) =>
// matched nothing
val x3 = x2 == A
val x18 = if (x3) {
x17
} else {
x10
}
x18
}
val x10 = dfaState(x1)
val x11 = dfaFlagged(Found, x10)
x10
}
Generated matcher code for regular expression `.*AAB`
*/