Post by erik quanstrom
Post by Dan Cross
Post by erik quanstrom
rc already has non-linear pipelines. but they're not very convenient.
And somewhat limited. There's no real concept of 'fanout' of output,
for instance (though that's a fairly trivial command, so probably
doesn't count), or multiplexing input from various sources that would
be needed to implement something like a shell-level data flow network.
Muxing input from multiple sources is hard when the data isn't somehow
self-delimited.
[...]
There may be other ways to achieve the same thing; I remember that the
boundaries of individual writes used to be preserved on read, but I
think that behavior changed somewhere along the way; maybe with the
move away from streams? Or perhaps I'm misremembering?
pipes still preserve write boundaries, as does il. (even the 0-byte write) but tcp of course by
definition does not. but either way, the protocol would need to be
self-framed to be transported on tcp. and even then, there are protocols
that are essentially serial, like tls.
Right. I think this is the reason for Bakul's question about
s-expressions or JSON or a similar format; those formats are
inherently self-delimiting. The problem with that is that, for
passing those things around to work without some kind of reverse
'tee'-like intermediary, the system has to understand that the things
that are being transferred are s-expressions or JSON records or
whatever, not just streams of uninterpreted bytes. We've steadfastly
rejected such system-imposed structure on files in Unix-y type
environments since 1969.
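To make 'inherently self-delimiting' concrete: a reader for such a
format can pull one complete record at a time off an undifferentiated
byte stream with no help from the kernel. A little Clojure sketch,
using EDN as the stand-in format (the particular records are made up,
obviously):

(require '[clojure.edn :as edn])
(import '[java.io PushbackReader StringReader])

;; two records back to back on one "stream"; the reader finds the
;; boundaries itself, because the format carries them
(let [in (PushbackReader. (StringReader.
                            "{:cmd :grep :args [\"-n\"]} {:cmd :wc}"))]
  [(edn/read in) (edn/read in)])
;; => [{:cmd :grep, :args ["-n"]} {:cmd :wc}]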
But conceptually, these IPC mechanisms are sort of similar to channels
in CSP-style languages. A natural question then becomes, how do
CSP-style languages handle the issue? Channels work around the muxing
thing by being typed; elements placed onto a channel are indivisible
objects of that type, so one doesn't need to worry about interference
from other objects simultaneously placed onto the same channel in
other threads of execution. Could we do something similar with pipes?
I don't know that anyone wants typed file descriptors; that would
open a whole new can of worms.
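To illustrate what typed channels buy you (using Clojure's core.async
channels as the CSP stand-in here, purely because it's handy, not
because I'm proposing typed descriptors):

(require '[clojure.core.async :as a])

(def ch (a/chan 10))

;; two concurrent producers, each putting complete values on the channel
(a/thread (doseq [n (range 5)] (a/>!! ch {:from :producer-1 :n n})))
(a/thread (doseq [n (range 5)] (a/>!! ch {:from :producer-2 :n n})))

;; the consumer always sees whole maps, never interleaved fragments,
;; which is exactly what a shared pipe carrying raw bytes can't promise
(dotimes [_ 10] (println (a/<!! ch)))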
Maybe the building blocks are all there; one could imagine some kind
of 'splitter' program that could take input and rebroadcast it across
multiple output descriptors. Coupled with some kind of 'merge'
program that could take multiple input streams and mux them onto a
single output, one could build nearly arbitrarily complicated networks
of computations connected by pipes. Maybe for simplicity constrain
these to be DAGs. With a notation to describe these computation
graphs, one could just do a topological sort of the graph, create
pipes in all the appropriate places and go from there. Is the shell
an appropriate place for such a thing?
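In channel terms the 'splitter' and 'merge' pieces already exist as
library functions; here's a rough sketch of such a graph using
core.async's mult (fan-out) and merge (fan-in), again only to show the
shape, with made-up stage functions:

(require '[clojure.core.async :as a]
         '[clojure.string :as str])

(def source (a/chan))
(def split  (a/mult source))        ; 'splitter': every tap sees every value

(def upper-in (a/chan 10))
(def rev-in   (a/chan 10))
(a/tap split upper-in)
(a/tap split rev-in)

;; a little "process": read one channel, transform, write another
(defn stage [f in]
  (let [out (a/chan 10)]
    (a/thread
      (loop []
        (if-some [v (a/<!! in)]
          (do (a/>!! out (f v)) (recur))
          (a/close! out))))
    out))

;; 'merge': fan the two branches back into one stream
(def merged (a/merge [(stage str/upper-case upper-in)
                      (stage str/reverse    rev-in)]))

;; drive the graph
(a/thread
  (doseq [w ["plumbing" "graphs"]] (a/>!! source w))
  (a/close! source))

(dotimes [_ 4] (println (a/<!! merged)))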
Forsyth's link looks interesting; I haven't read through the paper in
detail yet, but it sort of reminded me of LabVIEW in a way (where
non-programmers wire together data flows using boxes and arrows and
stuff).
Post by erik quanstrom
Post by Dan Cross
Post by erik quanstrom
i suppose i'm stepping close to sawzall now.
Actually, I think you're stepping closer to the reducers stuff Rich
Hickey has done recently in Clojure, though there's admittedly a lot
of overlap with the sawzall way of looking at things.
my knowledge of both is weak. :-)
The Clojure reducers stuff is kind of slick.
Consider a simple reduction in Lisp; say, summing up a list of numbers
or something like that. In Common Lisp, we may write this as:
(reduce #'+ '(1 2 3 4 5))
In Clojure, the same thing would be written as:
(reduce + [1 2 3 4 5])
The problem is how the computation is performed. To illustrate,
here's a simple definition of 'reduce' written in Scheme (R5RS doesn't
have a standard 'reduce' function, but it is most commonly written to
take an initial element, so I do that here).
(define (reduce binop a bs)
(if (null? bs)
a
(reduce binop (binop a (car bs)) (cdr bs))))
Notice how the chain of recursive calls is linear in the length of the
list, with each step depending on the one before it. But, if one
thinks about what I'm doing here (just
addition of simple numbers) there's no reason this can't be done in
parallel. In particular, if I can split the list into evenly sized
parts and recurse, I can limit the recursive depth of the computation
to O(lg n). Something more like:
(define (reduce binop a bs)
  (cond ((null? bs) a)
        ((null? (cdr bs)) (binop a (car bs)))  ; single element: nothing to split
        (else
          (let ((halves (split-into-halves bs)))
            (binop (reduce binop a (car halves))
                   (reduce binop a (cadr halves)))))))
If I can exploit parallelism to execute functions in the recursion
tree simultaneously, I can really cut down on execution time. The
requirement is that binop, together with 'a', forms a monoid over the
set from which 'a' and the elements of 'bs' are drawn; that is, binop
is associative and 'a' is an identity element for it.
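Concretely, associativity plus the identity is what makes the
regrouping safe:

;; the same monoid (+ with identity 0) regrouped two ways
(= (reduce + 0 [1 2 3 4 5])
   (+ (reduce + 0 [1 2]) (reduce + 0 [3 4 5])))
;; => true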
This sounds wonderful, of course, but in Lisp and Scheme, lists are
built from cons cells, and even if I have some magic
'split-into-halves' function that satisfies the requirements of
reduce, splitting a cons-cell list is still a linear-time operation,
so I don't gain much.
Besides, having to pass around the identity all the time is a bummer.
But in Clojure, the Lisp concept of a list (composed of cons cells) is
generalized into the concept of a 'seq'. A seq is just a sequence of
things; it could be a list, a vector, some other container (say, a
sequence of key/value pairs derived from some kind of associative
structure), or a stream of data being read from a file or network
connection.
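A few concrete cases, all presenting the same interface (the file
path is just whatever you have lying around):

(require '[clojure.java.io :as io])

(seq '(1 2 3))        ; a list    => (1 2 3)
(seq [1 2 3])         ; a vector  => (1 2 3)
(seq {:a 1 :b 2})     ; a map     => ([:a 1] [:b 2])
(line-seq (io/reader "/etc/hosts"))   ; a lazy seq of lines from a file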
What's the *real* problem here? The issue is that reduce "knows" too
much about the things it is reducing over. Doing things sequentially
is easy, but slow; doing things in parallel requires that reduce know
a lot about the type of thing it's reducing over (e.g., this magic
'split-into-halves' function). Further, that might not be appropriate
for *all* sequence types, e.g., files or lists made from cons cells.
The insight of the reducers framework is that one can just ask the
container to reduce itself. Basically, pass it a function and say,
"here, reduce yourself with this function however you see fit." Then,
random-access containers can do things in parallel; lists and files
and things can do things sequentially; associative containers can do
whatever they want, etc. The implementation is kind of interesting;
more information is here:
http://clojure.com/blog/2012/05/08/reducers-a-library-and-model-for-collection-processing.html
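For concreteness, usage looks roughly like this (the library ships
with Clojure as clojure.core.reducers; r/fold is the entry point that
lets the collection decide how to reduce itself, falling back to a
plain sequential reduce when the collection can't be split):

(require '[clojure.core.reducers :as r])

;; ordinary reduce: always sequential
(reduce + (vec (range 1 1000001)))

;; r/fold: a vector may split itself and combine sub-results in
;; parallel; a lazy seq just reduces sequentially
(r/fold + (vec (range 1 1000001)))

;; separate combine and reduce functions, e.g. counting even numbers;
;; (+) with no arguments supplies the identity for each partition
(r/fold + (fn [acc x] (if (even? x) (inc acc) acc))
        (vec (range 1 1000001)))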
Your example of running multiple 'grep's in parallel sort of reminded
me of this, though it occurs to me that this can probably be done with
a command: a sort of 'parallel apply' thing that can run a command
multiple times concurrently, each invocation on a range of the
arguments. But making it simple and elegant is likely to be tricky.
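Just to sketch the shape of it (in Clojure rather than as a standalone
command, and with a made-up name, par-apply): chunk the argument list,
run one invocation per chunk concurrently, and collect each
invocation's output separately so they can't interleave:

(require '[clojure.java.shell :refer [sh]])

;; run cmd once per chunk of args, concurrently; each sh call returns
;; a map with :out, :err and :exit
(defn par-apply [cmd fixed-args all-args chunk-size]
  (->> (partition-all chunk-size all-args)
       (pmap (fn [chunk] (apply sh cmd (concat fixed-args chunk))))
       (map :out)
       doall))

;; e.g. several greps, each over its own slice of the file list
;; (par-apply "grep" ["-n" "pattern"] ["a.c" "b.c" "c.c" "d.c"] 2)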
- Dan C.