[9fans] non greedy regular expressions

Post by Rudolf Sykora
regexp(6) seems to know only greedy regular expressions. So does
probably awk, sed, grep, ..., since these are based on that regexp.
My question is what to do if one needs non-greedy regexps.

russ has a great writeup on this.
http://swtch.com/~rsc/regexp/
i think it covers all your questions.

- erik

Rudolf Sykora

2008-10-24 08:12:35 UTC

Post by erik quanstrom
russ has a great writeup on this.
http://swtch.com/~rsc/regexp/
i think it covers all your questions.
- erik

I read trough some of that already yesterday. Anyway, am still
puzzled. In the text of

Regular Expression Matching Can Be Simple And Fast
(but is slow in Java, Perl, PHP, Python, Ruby, ...)

R. Cox writes:
---
While writing the text editor sam [6] in the early 1980s, Rob Pike
wrote a new regular expression implementation, which Dave Presotto
extracted into a library that appeared in the Eighth Edition. Pike's
implementation incorporated submatch tracking into an efficient NFA
simulation but, like the rest of the Eighth Edition source, was not
widely distributed.
...
Pike's regular expression implementation, extended to support Unicode,
was made freely available with sam in late 1992, but the particularly
efficient regular expression search algorithm went unnoticed. The code
is now available in many forms: as part of sam, as Plan 9's regular
expression library, or packaged separately for Unix.
---

But any manual page (regexp(6), that of sam) keeps completely silent
about eg. any submatch tracking.
So what's wrong? Can anybody clarify the situation for me or do I
really have to read the codes?

Ruda

erik quanstrom

2008-10-24 12:43:45 UTC

Post by Rudolf Sykora
But any manual page (regexp(6), that of sam) keeps completely silent
about eg. any submatch tracking.
So what's wrong? Can anybody clarify the situation for me or do I
really have to read the codes?

well reading the code would be a travesty. it's curious
that neither the sam paper nor regexp(6) mentions
submatches. maybe i missed them.

sed -n 's:.*(KRAK[A-Z]+*) +([a-zA-Z]+).*:\2, \1:gp' </lib/volcanoes

sam -d /lib/volcanoes
,x g/KRAK/ s:[^A-Z]+([A-Z]+*) +([a-zA-Z]+).*:\2, \1:
/KRAK/-+

or, if you want to run sam as a stream editor
{ echo ',x g/KRAK/ s:[^A-Z]+([A-Z]+*) +([a-zA-Z]+).*:\2, \1:'
echo '/KRAK/-+'
} | sam -d /lib/volcanoes >[2=]

a couple of examples of submatch tracking from our system
(i'm not sure about the version of iwhois on sources.)

/rc/bin/B
/rc/bin/Kill
/rc/bin/broke
/rc/bin/fax
/rc/bin/fshalt
/rc/bin/iwhois
/rc/bin/juke

- erik

Rudolf Sykora

2008-10-24 16:16:29 UTC

Post by erik quanstrom
well reading the code would be a travesty. it's curious
that neither the sam paper nor regexp(6) mentions
submatches. maybe i missed them.
sed -n 's:.*(KRAK[A-Z]+*) +([a-zA-Z]+).*:\2, \1:gp' </lib/volcanoes
- erik

Ok, so despite the documentation, some submatch tracking is there.
But in all (?) your examples, as well as in the scripts you mentioned,
this tracking is exclusively used with the s command (which is said to
be unnecessary at least in sam/acme). If I try sth. like
/( b(.)b)/a/\1\2/
on
bla blb 56
I get
bla blb\1\2 56
which is not quite what I want... How then? (I'd like to get 'bla blblblb 56')

Further, in R. Cox's text (http://swtch.com/~rsc/regexp/regexp1.html)
he claims that all nice features except for backreferences can be
implemented with Thomson's NFA algorithm. And even the backreferences
can be handled gracefully somehow. That is: ALL: non-greedy operators,
generalized assertions, counted repetitions, character classes CAN be
processed using the fast algorithm. Why then we don't have it? I once
wrote a program in python and was pretty happy to have non-greedy
operators and lookahead assertions on hand. Should I hadn't had those,
I probably wouldn't have been able to write it (nicely).

Ruda

erik quanstrom

2008-10-24 16:58:41 UTC

Post by Rudolf Sykora
Ok, so despite the documentation, some submatch tracking is there.
But in all (?) your examples, as well as in the scripts you mentioned,
this tracking is exclusively used with the s command (which is said to
be unnecessary at least in sam/acme). If I try sth. like
/( b(.)b)/a/\1\2/

this is covered in the sam paper in the section where pike
discusses s.

Post by Rudolf Sykora
Further, in R. Cox's text (http://swtch.com/~rsc/regexp/regexp1.html)
he claims that all nice features except for backreferences can be
implemented with Thomson's NFA algorithm. And even the backreferences
can be handled gracefully somehow. That is: ALL: non-greedy operators,
generalized assertions, counted repetitions, character classes CAN be
processed using the fast algorithm. Why then we don't have it?

let me turn this around. why would these additions (complications)
benefit plan 9?

- erik

John Stalker

2008-10-24 17:05:55 UTC

I think you've understood correctly. Back references mostly aren't
there. Greedy operators aren't there. For back references, this may
be due to philosophical reservations; I have a few myself. For greedy
operators, I suspect it's more because noone has cared enough to do
it. It wouldn't be too hard, as Russ' article says. If someone is
going to to this I would suggest going all the way and implementing
tags. See http://laurikari.net/ville/spire2000-tnfa.ps.

Ok, so despite the documentation, some submatch tracking is there.
But in all (?) your examples, as well as in the scripts you mentioned,
this tracking is exclusively used with the s command (which is said to
be unnecessary at least in sam/acme). If I try sth. like
/( b(.)b)/a/\1\2/
on
bla blb 56
I get
bla blb\1\2 56
which is not quite what I want... How then? (I'd like to get 'bla blblblb 56'
)
Further, in R. Cox's text (http://swtch.com/~rsc/regexp/regexp1.html)
he claims that all nice features except for backreferences can be
implemented with Thomson's NFA algorithm. And even the backreferences
can be handled gracefully somehow. That is: ALL: non-greedy operators,
generalized assertions, counted repetitions, character classes CAN be
processed using the fast algorithm. Why then we don't have it? I once
wrote a program in python and was pretty happy to have non-greedy
operators and lookahead assertions on hand. Should I hadn't had those,
I probably wouldn't have been able to write it (nicely).
Ruda

--
John Stalker
School of Mathematics
Trinity College Dublin
tel +353 1 896 1983
fax +353 1 896 2282

Rob Pike

2008-10-24 17:19:11 UTC

Backreferences within the pattern (such as in /(.*)\1/) make the
matcher non-regular and exponentially hard. It was a deliberate
decision not to implement them in sam and I'd make the same decision
today.

As far as greedy/non-greedy operators go, they have more utility but I
believe they have become popular primarily to deal with the problems
with Perl and PCRE doing greedy implementations badly in order to
avoid the exponential cost of their backtracking matchers. These
matchers can give crazy results.

Sam etc. are O(length of string * length of pattern) at all times and
their leftmost-longest properties work throughout.

Again, Russ has most of this in his analysis.

-rob

Rudolf Sykora

2008-10-24 17:44:41 UTC

Post by John Stalker
I think you've understood correctly. Back references mostly aren't
there. Greedy operators aren't there.

you probably mean NON-greedy ops.
I was not that concerned about backreferences but submatch extraction
(according to the terminology used by R. Cox in his article).
R.

PS.: and submatch extraction is there, somehow, undocumented (to my knowledge)

Russ Cox

2008-10-24 18:04:55 UTC

Regarding greedy vs non-greedy etc.

In Plan 9, a regular expression search always looks
for the "leftmost-longest" match, which means
a match that starts as far to the left as possible
in the target text, and of the matches that start there,
the longest one.

In that model, it is not accurate to describe the * + ?
operators as greedy or not. None of them is working
toward any goal other than the overall longest match
at the leftmost position.

In Perl and its imitators, the match starts at the leftmost
position but is otherwise the first one that is found,
not necessarily the longest. In that context, words like
"greedy" and "non-greedy" start to make sense,
because the behavior of any one operator influences which
match is encountered first.

Either approach -- leftmost-longest or leftmost-first --
can be implemented using finite automata.

Russ

Rudolf Sykora

2008-10-24 19:59:34 UTC

Post by Russ Cox
In that model, it is not accurate to describe the * + ?
operators as greedy or not. None of them is working
toward any goal other than the overall longest match
at the leftmost position.

So then I must be mistaken about the terminology.
I thought greedy=leftmost-longest, while non-greedy=leftmost-first:
Having
bla bla (AB)(CDEF)(GH)
/$.*$/
matches the whole (AB)(CDEF)(GH), 'greedy', while if '*' were
'non-greedy', I'd expect a match with (AB). This is now not in Plan9.
This mentioned example is easy to solve, if I want to go in 'bracket'
steps with
/$[^)]*$/
but there are harder examples, eg. where the_interesting_part is
delimited with a more complicated structure ('ABC' here):
ABCthe_interesting_partABC blabla bla ABCthe_interesting_partABC etc.
Now, how to parse it? With non-greedy '*', no problem. With 'greedy'
and no negative lookahead assertion? Ok, maybe 'y' in sam could help
(don't know). What about
ABCthe_interesting_partCBA blabla bla ABCthe_interesting_partCBA etc.?
All the thinking about this is simply removed with 'non-greedy' ops.

Ruda

Post by Russ Cox
In Perl and its imitators, the match starts at the leftmost
position but is otherwise the first one that is found,
not necessarily the longest. In that context, words like
"greedy" and "non-greedy" start to make sense,
because the behavior of any one operator influences which
match is encountered first.
Either approach -- leftmost-longest or leftmost-first --
can be implemented using finite automata.
Russ

Russ Cox

2008-10-24 21:14:03 UTC

Greedy leftmost-first is different from leftmost-longest.
Search for /a*(ab)?/ in "ab". The leftmost-longest match
is "ab", but the leftmost-first match (because of the
greedy star) is "a". In the leftmost-first case, the greediness
of the star caused an overall short match.

Post by Rudolf Sykora
All the thinking about this is simply removed with 'non-greedy' ops.

But it isn't (or shouldn't be).
Using /$.*$/ to match small parenthesized expressions
is fragile: /$.*$/ in "(a(b))" matches "(a(b)".
In contrast, the solution you rejected /$[^)]*$/ is more robust.

It doesn't make sense to shoehorn non-greedy and greedy
operators into an engine that provides leftmost-longest matching.
If you want a different model, you need to use a different program.
Perl has been ported.

Russ

Rudolf Sykora

2008-10-24 21:43:56 UTC

Post by Russ Cox
Greedy leftmost-first is different from leftmost-longest.
Search for /a*(ab)?/ in "ab". The leftmost-longest match
is "ab", but the leftmost-first match (because of the
greedy star) is "a". In the leftmost-first case, the greediness
of the star caused an overall short match.

Ok, I finally see the point... thanks.
Only one last question: So is there any simple way (using existing
regexps) to find 'the interesting parts' in my last example?:
ABCthe_interesting_partCBA blabla bla ABCthe_interesting_partCBA ...
...i.e. delimited with ABC from the left and CBA (or e.g. EFG) from the right?
Say I have a file with such a structure and want to only get 'the
interesting parts' form it.

Post by Russ Cox

Post by Rudolf Sykora
All the thinking about this is simply removed with 'non-greedy' ops.

But it isn't (or shouldn't be).

well, I admit it isn't :)

erik quanstrom

2008-10-24 21:49:39 UTC

Post by Rudolf Sykora
Ok, I finally see the point... thanks.
Only one last question: So is there any simple way (using existing
ABCthe_interesting_partCBA blabla bla ABCthe_interesting_partCBA ...
...i.e. delimited with ABC from the left and CBA (or e.g. EFG) from the right?
Say I have a file with such a structure and want to only get 'the
interesting parts' form it.

doesn't s/ABC(the_interesting_part)CBA/x/g work for you?
maybe i don't understand the example. if so, could you explain?

- erik

Rudolf Sykora

2008-10-24 22:07:57 UTC

Post by erik quanstrom
doesn't s/ABC(the_interesting_part)CBA/x/g work for you?
maybe i don't understand the example. if so, could you explain?
- erik

I think not.
I have a file say like this

ABC asassadfasdf asdfasdf asdfasdf CBA hhhhhhhhhhjjjjjjjjjjioioioi
sodifs
sdfsd
ABC
dasdfas aasdfa
njnjn CBA

and I want to get

' asassadfasdf asdfasdf asdfasdf '
'dasdfas aasdfa'
'njnjn'

where I added apostrophes to see the spaces on indivial lines. Simply:
give me everything that is between delimiters (ABC and CBA).

Ruda

Gabriel Diaz Lopez de la Llave

2008-10-24 22:43:19 UTC

hello

using sed and only one reg-exp is mandatory?

cat t.txt| sed 's/(ABC | CBA)/ \n\1\n /g' | awk '/ABC/,/CBA/' | grep -
v 'ABC|CBA'

that's a naive and simple approach, but i can't see why you need to
use just one reg-exp and just one sed. May be i missed something
through the thread :-?

gabi

Post by erik quanstrom
doesn't s/ABC(the_interesting_part)CBA/x/g work for you?
maybe i don't understand the example. if so, could you explain?
- erik

I think not.
I have a file say like this
ABC asassadfasdf asdfasdf asdfasdf CBA hhhhhhhhhhjjjjjjjjjjioioioi
sodifs
sdfsd
ABC
dasdfas aasdfa
njnjn CBA
and I want to get
' asassadfasdf asdfasdf asdfasdf '
'dasdfas aasdfa'
'njnjn'
give me everything that is between delimiters (ABC and CBA).
Ruda

Charles Forsyth

2008-10-24 22:46:48 UTC

although it can be satisfying to do things in one instruction (one command)
i confess i often find it quicker to split up the problem in sam or acme:

1. delete everything not between delimiters
,y/ABC([^C]|C[^B]|CB[^A]|\n)+CBA/d
2. delete the delimeters
,x/ABC|CBA/d
3. look to decide if i missed a boundary case for my input

it would be easier if the delimiters weren't words, or your sections didn't span lines.
i didn't spend any time deciding whether there was a better way
to express the trailing word delimiter. it's too late.

an example actually using the () [submatch] to extract and exchange two fields on each line:
,x s/([^,]*),(.*)/\2,\1/

Charles Forsyth

2008-10-24 22:51:35 UTC

Post by Charles Forsyth
i didn't spend any time deciding whether there was a better way
to express the trailing word delimiter. it's too late.

***@9grid.es meanwhile pointed out one way: change the words to otherwise unused special
characters

Tom Simons

2008-10-24 23:55:51 UTC

Is awk available? This worked for me, but it's not on Plan9. It does copy
the newline after the 2nd "ABC" (I wasn't sure if leading or all blank lines
should be deleted).
$ cat a.data
dflkdl dlkrwo3je4ogjmdmxd
ABC asassadfasdf asdfasdf asdfasdf CBA hhhhhhhhhhjjjjjjjjjjioioioi
sodifs
sdfsd
ABC
dasdfas aasdfa
njnjn CBA
fkpri34ouijglkrlptgf;c
$ awk 'BEGIN {RS = "ABC"; FS = "CBA"}NR == 1 {next}{print $1}' a.data
asassadfasdf asdfasdf asdfasdf

dasdfas aasdfa
njnjn

Post by erik quanstrom
doesn't s/ABC(the_interesting_part)CBA/x/g work for you?
maybe i don't understand the example. if so, could you explain?
- erik

I think not.
I have a file say like this
ABC asassadfasdf asdfasdf asdfasdf CBA hhhhhhhhhhjjjjjjjjjjioioioi
sodifs
sdfsd
ABC
dasdfas aasdfa
njnjn CBA
and I want to get
' asassadfasdf asdfasdf asdfasdf '
'dasdfas aasdfa'
'njnjn'
give me everything that is between delimiters (ABC and CBA).
Ruda

Rudolf Sykora

2008-10-25 22:40:39 UTC

Post by Tom Simons
Is awk available? This worked for me, but it's not on Plan9. It does copy
the newline after the 2nd "ABC" (I wasn't sure if leading or all blank lines
should be deleted).
$ awk 'BEGIN {RS = "ABC"; FS = "CBA"}NR == 1 {next}{print $1}' a.data

To that newline: It should copy the newline you describe, since that
one really is between delimiters. However, this one is also the only
one that should be copied. There shouldn't be any blank line anywhere
in the middle of the resulting output. In this sense your solution
doesn't work.

Your solution ALMOST works in linux. It shows not to work in plan9 at
all, probably due to the fact that in plan9 only the 1st character of
the RS variable is considered as the record delimiter.

But what I really wanted to see is how people using plan9 can solve
the problem without using a specialized minilanguage like awk. See
what Erik S. Raymond says in his Art of Unix programming:

http://www.faqs.org/docs/artu/ch08s02.html#awk

Basically he claims that the way this language was designed was
unfortunate. And that the language is on its decline. Among the
reasons is that languages like Perl, Python, Ruby all form a suitable
superset and that
'Programmers increasingly chose to do awklike things with Perl or
(later) Python, rather than keep two different scripting languages in
their heads'.

I myself may not be that competent to claim this too, but at least
from my own experience I have started to like to use as few tools as
possible. Thus I don't want to use awk any longer. I don't like perl
either (in my opinion it's a bad language). Python is nice for coding,
but somehow not handy for commandline use. Ruby seems to be superior
to all. So in my ideal (not the quickest though) world I'd rather get
rid of perl, awk, and use ruby instead, if anything more complicated
is needed.

Anyway, my main reason for the task was to see if someone can really
come with a nice solution using exclusively sam (and more, possibly
without that 's' command --- btw. noone so far has answered the
question of possible use of submatch tracking with commands other than
's'; remember 's' was designated unnecessary).

I wanted to see the thing be done in sam/acme without use of awk or
sed. That is Charles Forsyth's solution, which really works:
---
1. delete everything not between delimiters
,y/ABC([^C]|C[^B]|CB[^A]|\n)+CBA/d
2. delete the delimeters
,x/ABC|CBA/d
3. look to decide if i missed a boundary case for my input
---
I like it. It does exactly what I wanted. And here comes the point
I've been after all the time from the very beginning. I wanted to
show, that the solution has a very ugly part in itself, namely
([^C]|C[^B]|CB[^A]|\n)+
whose only reason is to ensure there is not CBA somewhere in the
middle. Imagine there would be something more complicated as a
delimiter. Imagine, I'd like the closing delimiter be either CBA or
EFG (any one would do). And I think you soon go mad.

In python ( http://www.amk.ca/python/howto/regex/), this is easily
solved with a non-greedy operator

/ABC(.*?)CBA/
/ABC(.*?)(CBA|EFG)/

It's true that non-greedy operators don't have a good meaning in Plan9
(as R. Cox explained), due to its leftmost-longest paradigm. However,
what I conclude from this example is, that the leftmost-first kind of
thinking with two kinds of ops (greedy/nongreedy) can be sometimes
quite useful.

Now. If the leftmost-longest match is usable for my problem, I am fine
with C + regexp(6). If not I only see the possibility to use
perl/python nowadays (if I don't want to go mad like above). Put it
the other way round. Perl/python can hopefully almost always be used,
they solve almost any problem with regexps to our liking. Then
universality wins and we may end up using perl/python exclusively. And
we will (people do) use them inspite of their wrong (i.e. slow;
perhaps horrible -- as some of you said) designs.

My question then is: wouldn't it be better to switch to the
leftmost-first paradigm, hence open possible use of (non-)greedy
operators, and in a way contribute to an accord with perl/python
syntax? And use a good algorithm for that all? But maybe it's not
worth and the current state is just sufficient...

Ruda

Steve Simon

2008-10-25 23:06:08 UTC

...I have started to like to use as few tools as
possible.

I entirely agree, there is too much to learn and you have to be selective;
however my selection is: sed, awk, C and sam (for impromptu, well, editing).

I cannot really comment directly on gready operators, I have never
knowingly used a system that provided them, and I have never found myself
needing features that Plan9's RE don't have.

Perhaps its a lack of vision on my part.

It is probably possible visualise problems that would be solved
much more simply on Plan9 than on Linux/BSD/etc. The bottom line
is that the designers of Plan9 took a different approach
to REs and as a result the systems are different.

-Steve

John Stalker

2008-10-26 09:02:34 UTC

Post by Rudolf Sykora
Now. If the leftmost-longest match is usable for my problem, I am fine
with C + regexp(6). If not I only see the possibility to use
perl/python nowadays (if I don't want to go mad like above).

There is another option: yacc. I'm not saying it's simpler
than perl or python, but it's not much harder and it's more
a more general tool. The resulting code will be fast, if
you care.

Post by Rudolf Sykora
My question then is: wouldn't it be better to switch to the
leftmost-first paradigm, hence open possible use of (non-)greedy
operators, and in a way contribute to an accord with perl/python
syntax? And use a good algorithm for that all? But maybe it's not
worth and the current state is just sufficient...

Switching semantics in existing tools is rarely a good idea. Too many
things break. If anyone is going to do this, and it won't be me, then
it needs to be done in a way that doesn't break anything, like when
UNIX switched from basic to extended regular expressions and added a
-e option to grep.

--
John Stalker
School of Mathematics
Trinity College Dublin
tel +353 1 896 1983
fax +353 1 896 2282

Eris Discordia

2008-10-26 18:49:10 UTC

I loved this thread. Thanks everyone. Thanks Rudolf Sykora.

Russ Cox

2008-10-27 04:59:57 UTC

As I explained in an earlier post, your suggested

Post by Rudolf Sykora
/ABC(.*?)CBA/

is less robust than Charles's spelled-out version,
since yours doesn't handle nested expressions well.
That's a serious enough limitation that your
scenario stops being a compelling argument for
leftmost-first matching and non-greedy operators.

Post by Rudolf Sykora
Then universality wins and we may end up using perl/python exclusively.

No one said you couldn't. If they let you do your job faster than
acme and sam do, no one here is going to tell you not to use them.

Leftmost-first matching is difficult to explain.
When POSIX had to define the rules for regexps,
they chose to define them as leftmost-longest
even though all the implementations were leftmost-first,
because describing leftmost-first precisely was too
complex.

Leftmost-first matching is not widely understood.
At the beginning of this conversation, you believed that in the
absence of non-greedy operators, leftmost-first matching
produced leftmost-longest matches. I would guess that
more people believe this than know otherwise.

Leftmost-first matching only lets you write a regexp that
is half right (see above) in the use case that you are
focusing on.

I don't see the benefit of switching from leftmost-longest
to leftmost-first. Certainly an "accord with perl/python"
is not good justification. Playing catch-up with Perl and
Python regexps can only lead to madness. Both systems
are complex enough that essentially no one completely
understands them.

Russ

Rudolf Sykora

2008-10-27 08:33:25 UTC

Post by Russ Cox
Leftmost-first matching is difficult to explain.
When POSIX had to define the rules for regexps,
they chose to define them as leftmost-longest
even though all the implementations were leftmost-first,
because describing leftmost-first precisely was too
complex.
Leftmost-first matching is not widely understood.
At the beginning of this conversation, you believed that in the
absence of non-greedy operators, leftmost-first matching
produced leftmost-longest matches. I would guess that
more people believe this than know otherwise.

Yes. I weren't quite acquainted with the problematics. Probably
because I have never bumped into a problem with the leftmost-first
implemenatation. Perhaps my problems were just not complex enough to
notice a regexp finds sth. different from my expectation. Seems to be
a case of majority of people.
(E.g. Vim seems to use leftmost-first, too.)

Ok. I am just somehow tired now with this. I thank everybody who
positively contributed to this thread. I'll wait some time and will
try to use what is present and I will see.

Ruda

Charles Forsyth

2008-10-27 10:12:13 UTC

Both systems are complex enough that essentially no one completely understands them.

this touches on an important point. the first introduction of regular expressions
to editors was great, because it took some formal language theory and made it useful
in an `every day' way. conversely, the theory provided significant underpinning (in a structural
sense) to that practical application. now there are big books on `regular expressions'
mainly because they are no longer regular but a big collection of ad-hoc additions.
(i think there is a similarity to Perl's relationship to the history of programming languages.)
the result really is hard to reason about ... because it hasn't got that underpinning and the algebra has gone.
it's worse than that: the code is a mess too, i supposed in consequence.

Eris Discordia

2008-10-27 13:29:04 UTC

Post by Charles Forsyth
practical application. now there are big books on `regular expressions'
mainly because they are no longer regular but a big collection of ad-hoc

I thought they were "regular" because they "regularly" occurred in the
target text. Turns out other interpretations are possible. Though, mine has
the advantage of being correct :-D

The set of "big books on regular expressions" includes Jeffrey Friedl's
"Mastering Regular Expressions" that happens to contain a chapter by the

Post by Charles Forsyth
DFA Speed with NFA Capabilities: Regex Nirvana?
I've said several times that a DFA can't provide capturing parentheses or
backreferences. This is quite true, but it certainly doesn't preclude
hybrid approaches that mix technologies in an attempt to reach regex
nirvana. The sidebar in Section 4.6.3.1 told how NFAs have diverged from
the theoretical straight and narrow in search of more power, and it's
only natural that the same happens with DFAs. A DFA's construction makes
it more difficult, but that doesn't mean impossible.
GNU grep takes a simple but effective approach. It uses a DFA when
possible, reverting to an NFA when backreferences are used. GNU awk does
something similar—it uses GNU grep's fast shortest-leftmost DFA engine
for simple "does it match" checks, and reverts to a different engine for
checks where the actual extent of the match must be known. Since that
other engine is an NFA, GNU awk can conveniently offer capturing
parentheses, and it does via its special gensub function.
Tcl's regex engine is a true hybrid, custom built by Henry Spencer (whom
you may remember having played an important part in the early development
and popularization of regular expressions see Section 3.1.1.6). The Tcl
engine sometimes appears to be an NFA— it has lookaround, capturing
parentheses, backreferences, and lazy quantifiers. Yet, it has true POSIX
longest-leftmost match (see Section 4.6.1), and doesn't suffer from some
of the NFA problems that we'll see in Chapter 6. It really seems quite
wonderful.
Currently, this engine is available only to Tcl, but Henry tells me that
it's on his to-do list to break it out into a separate package that can
be used by others.

Again, turns out the "big books on regular expressions" can give the
lowlife--that's me--things "hackers" deny them.

--On Monday, October 27, 2008 10:18 AM +0000 Charles Forsyth

Post by Charles Forsyth

Both systems are complex enough that essentially no one completely understands them.

this touches on an important point. the first introduction of regular
expressions to editors was great, because it took some formal language
theory and made it useful in an `every day' way. conversely, the theory
provided significant underpinning (in a structural sense) to that
practical application. now there are big books on `regular expressions'
mainly because they are no longer regular but a big collection of ad-hoc
additions. (i think there is a similarity to Perl's relationship to the
history of programming languages.) the result really is hard to reason
about ... because it hasn't got that underpinning and the algebra has
gone. it's worse than that: the code is a mess too, i supposed in
consequence.

erik quanstrom

2008-10-27 13:44:36 UTC

Post by Charles Forsyth
practical application. now there are big books on `regular expressions'
mainly because they are no longer regular but a big collection of ad-hoc

I thought they were "regular" because they "regularly" occurred in the
target text. Turns out other interpretations are possible. Though, mine has
the advantage of being correct :-D

there's a reason they're not called regularly expressions.

(if this were the definition, an expression's regularlyness would
depend on the target text, would it not?)

- erik

Eris Discordia

2008-10-27 19:53:36 UTC

Post by erik quanstrom
there's a reason they're not called regularly expressions.

As explained in the post by Brian L. Stuart it's a matter of "grammar" :-P

Post by erik quanstrom
(if this were the definition, an expression's regularlyness would
depend on the target text, would it not?)

Yes, and that _would_ be why you wouldn't craft a regex for matching in a
text with a, say, single-digit frequency of some string.

--On Monday, October 27, 2008 9:23 AM -0400 erik quanstrom

Post by erik quanstrom

Post by Charles Forsyth
practical application. now there are big books on `regular expressions'
mainly because they are no longer regular but a big collection of ad-hoc

I thought they were "regular" because they "regularly" occurred in the
target text. Turns out other interpretations are possible. Though, mine
has the advantage of being correct :-D

there's a reason they're not called regularly expressions.
(if this were the definition, an expression's regularlyness would
depend on the target text, would it not?)
- erik

Brian L. Stuart

2008-10-27 16:18:12 UTC

Post by Eris Discordia
The set of "big books on regular expressions" includes Jeffrey Friedl's
"Mastering Regular Expressions" that happens to contain a chapter by the

Post by Charles Forsyth
DFA Speed with NFA Capabilities: Regex Nirvana?

This guy seems to blur the distinctions here. His discussion
makes it sound like he's saying that NFAs have more expressive
power than DFAs. This is incorrect. Both NFAs and DFAs have
exactly the same expressive power as the class of grammars
called regular. For the arbitrary case of nesting (e.g. parens),
these machines are insufficient. However, for any prescribed
maximum nesting level, you can write a regular expression to
account for it, though it becomes clumsy.

To get more expressiveness, you need to go to a machine
with more functionality. Classically, the next step
up the hierarchy is the pushdown automaton. The expressive
power of this machine corresponds directly to the context-
free grammars. Because the full generality of the CFG
require nondeterminism, automated translations from CFG
to code/machine are usually done with restricted classes
of CFGs, such as LR(k) and LALR. You can also increase
the power of a FA by adding a counter or by making the
transitions probablistic. If you truly want to build
expression matching mechanisms that go beyond regular,
building on the FA with counter(s) would be a far more
sound foundation than a lot of the ad hoc stuff that's
been done. But the truth is that whipping up a CFG
and feeding it to yacc is far more expedient than torturing
regular expressions all day.

Post by Eris Discordia
Again, turns out the "big books on regular expressions" can give the
lowlife--that's me--things "hackers" deny them.

And a good book on automata theory can give you far more
than any "big book of...", "... in 21 days" or "... for
dummies" book can. Besides, why would you think that
anyone is denying you knowledge or the opportunity
to write code that demonstrates your ideas?

BLS

John Stalker

2008-10-24 18:06:04 UTC

Post by Rudolf Sykora
you probably mean NON-greedy ops.

Yes, my mistake. I'll risk making a very minor correction to

Post by Rudolf Sykora
Backreferences within the pattern (such as in /(.*)\1/) make the
matcher non-regular and exponentially hard.

They do change the class of the grammar and nobody knows how to
implement them in subexponential time, but it hasn't been proved
to be impossible.

--
John Stalker
School of Mathematics
Trinity College Dublin
tel +353 1 896 1983
fax +353 1 896 2282

Uriel

2008-10-24 17:12:50 UTC

Just because something is possible doesn't mean that it is a good
idea. Implementing every possible feature is the GNU/way, if you want
to go down that path, you know where to find it.

Post by Rudolf Sykora
I once
wrote a program in python and was pretty happy to have non-greedy
operators and lookahead assertions on hand. Should I hadn't had those,
I probably wouldn't have been able to write it (nicely).

Python uses the absolutely dreadful and unintelligible PCRE, which not
only have unpredictable performance as Russ points out, but are a
total mess so complex that I doubt anybody fully understands them,
they are to Plan 9 regexps what C++ is to C.

Maybe Plan 9 regexps are not as 'powerful', but at least I can understand them.

uriel

Charles Forsyth

2008-10-24 19:49:24 UTC

Post by Rudolf Sykora
If I try sth. like
/( b(.)b)/a/\1\2/
on
bla blb 56
I get
bla blb\1\2 56
which is not quite what I want... How then? (I'd like to get 'bla blblblb 56')

echo bla blb 56 | sed 's/( b(.)b)/&\1\2/'
bla blb blbl 56

similarly use `s' not `a' in sam.

Rudolf Sykora

2008-10-24 19:59:30 UTC

Post by Charles Forsyth

Post by Rudolf Sykora
If I try sth. like
/( b(.)b)/a/\1\2/
on
bla blb 56
I get
bla blb\1\2 56
which is not quite what I want... How then? (I'd like to get 'bla blblblb 56')

echo bla blb 56 | sed 's/( b(.)b)/&\1\2/'
bla blb blbl 56
similarly use `s' not `a' in sam.

Yes. But my question was more subtle. I know it can be done with the
's' command in sam now. But I asked sth. different. The documentation
(paper about sam) says 's' is redundant. But it seems (only seems
since documentation says nothing), that submatches only work with the
's' command. If this is true 's' is not redundant, since if it weren't
there you would not have any chance to use \1, \2, etc.
That was the reason I wanted to have my example rewritten WITHOUT the
's' command...

But anyway thanks!
Ruda

Rob Pike

2008-10-26 21:28:08 UTC

The ability to put \1 in the right hand side of a substitution was
done by jason@
at the Uni of Sydney, but after the Sam papers were published. It was a welcome
feature that added special functionality to the 's' command within
Sam. (Ed(1) had
the feature, within its limited regexps, long before, of course.)

-rob

Post by Charles Forsyth

Post by Rudolf Sykora
If I try sth. like
/( b(.)b)/a/\1\2/
on
bla blb 56
I get
bla blb\1\2 56
which is not quite what I want... How then? (I'd like to get 'bla blblblb 56')

echo bla blb 56 | sed 's/( b(.)b)/&\1\2/'
bla blb blbl 56
similarly use `s' not `a' in sam.

Yard Ape

2008-11-30 08:29:40 UTC

Post by Rudolf Sykora
I have a file say like this
ABC asassadfasdf asdfasdf asdfasdf CBA hhhhhhhhhhjjjjjjjjjjioioioi
sodifs
sdfsd
ABC
dasdfas aasdfa
njnjn CBA
and I want to get
' asassadfasdf asdfasdf asdfasdf '
'dasdfas aasdfa'
'njnjn'
...i.e. delimited with ABC from the left and CBA (or
e.g. EFG) from the right? Say I have a file with such a
structure and want to only get 'the interesting parts'
form it.
...using exclusively sam (and more, possibly without that
's' command

One of the things I love about ed is how easy it
is to loop through *ranges* which can be delimited
by regexes, like so:

,g/pre-interesting/.+,/post-interesting/-\
s/this/that/g

The same method can be used in sam:

,x/pre-interesting/+#0;/post-interesting/-#0{
x/this/c/that/
}

In cases where the delimeters don't alternate
perfectly (as in nesting), you can just start adding
conditionals and/or subloops:

,x/pre-interesting/.+#0;/post-interesting/-#0{
v/pre-interesting/{
x/this/c/that/
}
}

You can operate on the inverse of the interesting
range by employing a marker in the loop to maintain a
"trailing" range throughout:

0k
,x/pre-interesting/.+#0;/post-interesting/-#0{
v/pre-interesting/{
',.-#0{
x/this/c/that/
}
.+#0k
}
}

etc., etc.

Here's a version that works on your example:

0k
,x/ABC/+#0;/CBA|EFG/{
',.-#0d
.+#0??d
.+#0k
}

-Derek

Rudolf Sykora

2008-12-11 16:38:25 UTC

Post by Yard Ape
0k
,x/ABC/+#0;/CBA|EFG/{
',.-#0d
.+#0??d
.+#0k
}
-Derek

Thanks. This is what I wanted to see...
Ruda

Aharon Robbins

2008-10-24 11:32:04 UTC

You are not missing anything.

Subexpression matching means when you have an expression like

q(a+b)(c*d)z

that you can get access to the exact text matched by the two
parenthesized subexpressions.

You asked about non-greedy regular expressions which were first
popularized by perl.

IIRC the Plan 9 regex library does not provide this at all; Bell Labs
code never did non-greedy regexp matching.

Rob and/or Russ can correct me if I'm wrong.

FWIW, tools from the GNU world also do not support non-greedy
matching, nor are such expressions part of POSIX.

Hope this helps,

Arnold

Post by erik quanstrom
russ has a great writeup on this.
http://swtch.com/~rsc/regexp/
i think it covers all your questions.
- erik

I read trough some of that already yesterday. Anyway, am still
puzzled. In the text of
Regular Expression Matching Can Be Simple And Fast
(but is slow in Java, Perl, PHP, Python, Ruby, ...)
---
While writing the text editor sam [6] in the early 1980s, Rob Pike
wrote a new regular expression implementation, which Dave Presotto
extracted into a library that appeared in the Eighth Edition. Pike's
implementation incorporated submatch tracking into an efficient NFA
simulation but, like the rest of the Eighth Edition source, was not
widely distributed.
...
Pike's regular expression implementation, extended to support Unicode,
was made freely available with sam in late 1992, but the particularly
efficient regular expression search algorithm went unnoticed. The code
is now available in many forms: as part of sam, as Plan 9's regular
expression library, or packaged separately for Unix.
---
But any manual page (regexp(6), that of sam) keeps completely silent
about eg. any submatch tracking.
So what's wrong? Can anybody clarify the situation for me or do I
really have to read the codes?
Ruda

Aharon Robbins

2008-10-27 19:28:48 UTC

Post by Charles Forsyth
GNU grep takes a simple but effective approach. It uses a DFA when
possible, reverting to an NFA when backreferences are used. GNU awk does
something similar---it uses GNU grep's fast shortest-leftmost DFA engine
for simple "does it match" checks, and reverts to a different engine for
checks where the actual extent of the match must be known. Since that
other engine is an NFA, GNU awk can conveniently offer capturing
parentheses, and it does via its special gensub function.

I'll be the first to admit that gawk's behavior is something of a hack.
I'd much rather be able to use a single matcher. (At least I was able to
get Friedl to describe gawk's implementation correctly. :-)

Post by Charles Forsyth
Tcl's regex engine is a true hybrid, custom built by Henry Spencer ....
Currently, this engine is available only to Tcl, but Henry tells me that
it's on his to-do list to break it out into a separate package that can
be used by others.

It's been on that TODO list for well over a decade, IIRC. More vaporware
as far as a broader community is concerned.

Post by Eris Discordia
Again, turns out the "big books on regular expressions" can give the
lowlife--that's me--things "hackers" deny them.

Not in the least. The Plan 9 regexp library in fact gives you close to
the same nirvana; an automata that has DFA speed characteristics with
the NFA's ability to capture sub texts.

As other mails have pointed out, anything that isn't leftmost longest
has weird semantics. Non-greedy operators are mostly syntactic sugar.
I'll agree that in practice they're likely to be useful, but as I'm pretty
much an awk guy (<grin>) I've never felt the pain of their lack, either.

Gawk is stuck with its current two-matcher approach since it provides
the option for different syntaxes, but that approach also has lots of
problems; more than once I've come across cases where they interpret the
same syntax differently, or don't handle multibyte characters the same.
(Not to mention that they sntax bits API in the GNU matchers is a
a real nightmare. Another thing that's too late to change.)

If I could start over again, I'd start with either the plan 9 lib or
ripping out Henry's library from tcl, but the former is probably easier.

Friedl's book is good for what it aims to be: an introduction to
regular expressions. But scientifically rigid (as in, on the same
order as the dragon book) it's definitely not.

Russ's paper describes the state of the world pretty well.

Arnold

Eris Discordia

2008-10-27 20:28:18 UTC

Post by Aharon Robbins
As other mails have pointed out, anything that isn't leftmost longest
has weird semantics. Non-greedy operators are mostly syntactic sugar.

Is (leftmost-longest + all-greedy operators) syntactic salt then?

Post by Aharon Robbins
Not in the least. The Plan 9 regexp library in fact gives you close to
the same nirvana; an automata that has DFA speed characteristics with
the NFA's ability to capture sub texts.

Does regexp(n) also give the lowlife any hint of why it should behave
differently from Perl? Friedl's book doesn't, but it has good reason.

Post by Aharon Robbins
Friedl's book is good for what it aims to be: an introduction to
regular expressions. But scientifically rigid (as in, on the same
order as the dragon book) it's definitely not.

That's good. No problem with "big books on regular expressions," then.
Introductory ones, or scientific ones.

--On Monday, October 27, 2008 9:23 PM +0200 Aharon Robbins

Post by Aharon Robbins

Post by Charles Forsyth
GNU grep takes a simple but effective approach. It uses a DFA when
possible, reverting to an NFA when backreferences are used. GNU awk
does something similar---it uses GNU grep's fast shortest-leftmost DFA
engine for simple "does it match" checks, and reverts to a different
engine for checks where the actual extent of the match must be known.
Since that other engine is an NFA, GNU awk can conveniently offer
capturing parentheses, and it does via its special gensub function.

Post by Charles Forsyth
Tcl's regex engine is a true hybrid, custom built by Henry Spencer ....
Currently, this engine is available only to Tcl, but Henry tells me
that it's on his to-do list to break it out into a separate package
that can be used by others.

It's been on that TODO list for well over a decade, IIRC. More vaporware
as far as a broader community is concerned.

Post by Eris Discordia
Again, turns out the "big books on regular expressions" can give the
lowlife--that's me--things "hackers" deny them.

Not in the least. The Plan 9 regexp library in fact gives you close to
the same nirvana; an automata that has DFA speed characteristics with
the NFA's ability to capture sub texts.
As other mails have pointed out, anything that isn't leftmost longest
has weird semantics. Non-greedy operators are mostly syntactic sugar.
I'll agree that in practice they're likely to be useful, but as I'm pretty
much an awk guy (<grin>) I've never felt the pain of their lack, either.
Gawk is stuck with its current two-matcher approach since it provides
the option for different syntaxes, but that approach also has lots of
problems; more than once I've come across cases where they interpret the
same syntax differently, or don't handle multibyte characters the same.
(Not to mention that they sntax bits API in the GNU matchers is a
a real nightmare. Another thing that's too late to change.)
If I could start over again, I'd start with either the plan 9 lib or
ripping out Henry's library from tcl, but the former is probably easier.
Friedl's book is good for what it aims to be: an introduction to
regular expressions. But scientifically rigid (as in, on the same
order as the dragon book) it's definitely not.
Russ's paper describes the state of the world pretty well.
Arnold

Eris Discordia

2008-10-27 20:00:02 UTC

First of all, thanks for the explanation. It's above my head, but thanks
anyway.

Post by Brian L. Stuart
This guy seems to blur the distinctions here. His discussion

He doesn't. If one reads the whole section part of which was quoted one
will see that he clearly states DFA and NFA are theoretically equivalent,
but then goes on to explain that DFA and NFA _implementations_ are not
identical.

Post by Brian L. Stuart
The true mathematical and computational meaning of "NFA" is different
from what is commonly called an "NFA regex engine." In theory, NFA and
DFA engines should match exactly the same text and have exactly the same
features. In practice, the desire for richer, more expressive regular
expressions has caused their semantics to diverge. An example is the
support for backreferences.
The design of a DFA engine precludes backreferences, but it's a
relatively small task to add backreference support to a true
(mathematically speaking) NFA engine. In doing so, you create a more
powerful tool, but you also make it decidedly nonregular (mathematically
speaking). What does this mean? At most, that you should probably stop
calling it an NFA, and start using the phrase "nonregular expressions,"
since that describes (mathematically speaking) the new situation. No one
has actually done this, so the name "NFA" has lingered, even though the
implementation is no longer (mathematically speaking) an NFA.

-- from the same book

Post by Brian L. Stuart
And a good book on automata theory can give you far more
than any "big book of...", "... in 21 days" or "... for
dummies" book can. Besides, why would you think that
anyone is denying you knowledge or the opportunity
to write code that demonstrates your ideas?

Because lowlifes don't write code. No need for somebody to stop them from
doing so. They learn slowly, hardly, painfully--they aren't smart. If
possible they'll learn less rather than learn more. What the "hacker"
denies the lowlife is the opportunity to exist free of "GNU-is-wrong" or
"X-is-wrong" blemish. GNU may be wrong, or right, but GNU is learnt fast
and easy. And it does the job.

By the way, Friedl's book has the advantage of giving a lowlife a glimpse
of a subject otherwise arcane from that same lowlife's point of view. It's
good--for the lowlife, of course--to know the wonders they see didn't
spring into existence out of the blue. I benefitted from that learning.

--On Monday, October 27, 2008 4:13 PM +0000 "Brian L. Stuart"

Post by Brian L. Stuart

Post by Eris Discordia
The set of "big books on regular expressions" includes Jeffrey Friedl's
"Mastering Regular Expressions" that happens to contain a chapter by the

Post by Charles Forsyth
DFA Speed with NFA Capabilities: Regex Nirvana?

This guy seems to blur the distinctions here. His discussion
makes it sound like he's saying that NFAs have more expressive
power than DFAs. This is incorrect. Both NFAs and DFAs have
exactly the same expressive power as the class of grammars
called regular. For the arbitrary case of nesting (e.g. parens),
these machines are insufficient. However, for any prescribed
maximum nesting level, you can write a regular expression to
account for it, though it becomes clumsy.
To get more expressiveness, you need to go to a machine
with more functionality. Classically, the next step
up the hierarchy is the pushdown automaton. The expressive
power of this machine corresponds directly to the context-
free grammars. Because the full generality of the CFG
require nondeterminism, automated translations from CFG
to code/machine are usually done with restricted classes
of CFGs, such as LR(k) and LALR. You can also increase
the power of a FA by adding a counter or by making the
transitions probablistic. If you truly want to build
expression matching mechanisms that go beyond regular,
building on the FA with counter(s) would be a far more
sound foundation than a lot of the ad hoc stuff that's
been done. But the truth is that whipping up a CFG
and feeding it to yacc is far more expedient than torturing
regular expressions all day.

Post by Eris Discordia
Again, turns out the "big books on regular expressions" can give the
lowlife--that's me--things "hackers" deny them.

Brian L. Stuart

2008-10-28 14:55:19 UTC

Post by Brian L. Stuart
This guy seems to blur the distinctions here. His discussion

Actually, that's why I said "seems to." Basically, what he
should be saying is that some people have found it easier
to add elements to the NFA model that augment the grammar.
But then he goes on to use the initials NFA when he really
means the extensions thereof. This creates in the reader's
mind the misimpression that there's something magically
different between an NFA and a DFA, theoretically or
implementationally. They have exactly the same expressive
power in both realms. Besides, nondeterminism can only
be simulated when running on a deterministic computer.
That's actually the insight behind the NFA to DFA construction.
So talking about an "NFA implementation" is rather artificial.

Post by Eris Discordia
They learn slowly, hardly, painfully--they aren't smart. If
possible they'll learn less rather than learn more. What the "hacker"
denies the lowlife is the opportunity to exist free of "GNU-is-wrong" or
"X-is-wrong" blemish.

Look at it this way. The people here aren't trying to
create blemishes on anything. Rather we are trying to
help your learning process. We're trying to head off
the tendency for people to jump from "it was easy to
learn how to do X in this tool" to "how this tool does
things is better." Instead, before asserting something
is better or asking for others to write software according
to a personal preference, it is important to broaden
your understanding so that you know what the options
really are and what the design decisions really imply.
Implementing a laundry list of features without that
perspective simply produces bad software.

Post by Eris Discordia
It's
good--for the lowlife, of course--to know the wonders they see didn't
spring into existence out of the blue.

That's why we teach the theory that everyone seems to
want to complain about. Building on Newton, if we want
to see farther than those before us, we need to know
those before us well enough to climb onto their shoulders.
No software is based on Hogwarts' technology, and no
good software comes from the "random walk method of
programming." It is the product of intellectual reasoning.
Criticism of the result won't be taken too seriously
if the critic shows they have not understood the reasoning
behind it.

BLS

Eris Discordia

2008-10-28 15:21:34 UTC

Thanks for the explanations. The lowlife learns a bit or two :-)

--On Tuesday, October 28, 2008 2:51 PM +0000 "Brian L. Stuart"

Post by Brian L. Stuart

Post by Brian L. Stuart
This guy seems to blur the distinctions here. His discussion

He doesn't. If one reads the whole section part of which was quoted one
will see that he clearly states DFA and NFA are theoretically
equivalent, but then goes on to explain that DFA and NFA
_implementations_ are not identical.

Post by Eris Discordia
It's
good--for the lowlife, of course--to know the wonders they see didn't
spring into existence out of the blue.

Aharon Robbins

2008-10-27 21:12:17 UTC

Post by Aharon Robbins
As other mails have pointed out, anything that isn't leftmost longest
has weird semantics. Non-greedy operators are mostly syntactic sugar.

Is (leftmost-longest + all-greedy operators) syntactic salt then?

It is merely the traditional POSIX flavor. Some people like that
flavor, some don't.

Does regexp(n) also give the lowlife any hint of why it should behave
differently from Perl? Friedl's book doesn't, but it has good reason.

It is more that Perl simply was never part of the picture for the people
who develop(ed) and use(d) Plan 9. It's like asking why the paper on
the Plan 9 C compiler doesn't state that C++ classes are not available.

The long-time users of Plan 9 were using Unix before Perl even came along;
for reasons having to do with both taste and the theoretical soundness,
they saw no reason to try to support the Perl features. After all,
if you really want Perl, you know where to get it.

Me, I'm pretty happy with the traditional shell + sed + grep + awk
combinations, but then again, I'm biased, particularly towards awk.
:-)

Arnold

Eris Discordia

2008-10-28 15:05:29 UTC

Post by Aharon Robbins
It is merely the traditional POSIX flavor. Some people like that
flavor, some don't.

Understandable.

Post by Aharon Robbins
It is more that Perl simply was never part of the picture for the people
who develop(ed) and use(d) Plan 9. It's like asking why the paper on
the Plan 9 C compiler doesn't state that C++ classes are not available.

Very reasonable.

Post by Aharon Robbins
if you really want Perl, you know where to get it.

We all do. We all did [get it].

Post by Aharon Robbins
Me, I'm pretty happy with the traditional shell + sed + grep + awk
combinations, but then again, I'm biased, particularly towards awk.
:-)

Yeah, I've seen your (g)awk activities on other lists. Good luck, and
thanks.

--On Monday, October 27, 2008 11:08 PM +0200 Aharon Robbins

Post by Aharon Robbins

Post by Aharon Robbins
As other mails have pointed out, anything that isn't leftmost longest
has weird semantics. Non-greedy operators are mostly syntactic sugar.

Is (leftmost-longest + all-greedy operators) syntactic salt then?

It is merely the traditional POSIX flavor. Some people like that
flavor, some don't.