This doesn't answer your fundamental questions, but there are purely
functional arrays and other data structures that allow copying (with a
changed element) with much better than O(N) performance, usually
something like O(log N). They are built with tree structures to allow
structural sharing. Clojure's fundamental data structures work like
this. They exist natively and as libraries for many programming
languages.
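As a minimal Haskell sketch of the idea (assuming the containers package; the name sharedUpdate is just illustrative): Data.Sequence is one such tree-backed structure, where an "update" returns a new value in O(log N) while the old one stays valid, with most of the tree shared between the two.

    import qualified Data.Sequence as Seq

    -- Persistent update: v1 differs from v0 at index 3, but the two
    -- sequences share most of their underlying tree nodes.
    sharedUpdate :: (Seq.Seq Int, Seq.Seq Int)
    sharedUpdate =
      let v0 = Seq.fromList [0 .. 9]
          v1 = Seq.update 3 42 v0   -- O(log n), not a full copy
      in (v0, v1)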
To answer your question, I think programming languages generally
should encapsulate efficient (mutating) algorithms behind functional
interfaces, so you pass in your immutable array to a function (e.g. sort)
and it does a single copy to a mutable array, does a really fast sort
and then creates an immutable array to pass back to you. Since those
copying actions only have to happen once, they don't generally affect
the asymptotic complexity of the overall operation. As a user of those
functions, you continue to have the benefit of immutable data
structures, but get to have some efficient algorithms at the same
time.
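In Haskell, that wrapper pattern might look roughly like this sketch (assuming the vector package; the insertion sort and the name sortVector are just illustrative stand-ins for whatever fast in-place algorithm a real library would use):

    import Control.Monad (forM_, when)
    import qualified Data.Vector as V
    import qualified Data.Vector.Mutable as MV

    -- Pure interface, mutating implementation: V.modify hands us a
    -- private mutable copy of the input (reusing it in place only when
    -- that's safe), lets us sort it inside ST, and freezes the result
    -- back into an immutable vector.
    sortVector :: Ord a => V.Vector a -> V.Vector a
    sortVector = V.modify $ \mv ->
      forM_ [1 .. MV.length mv - 1] $ \i ->
        let sift j = when (j > 0) $ do
              prev <- MV.read mv (j - 1)
              cur  <- MV.read mv j
              when (cur < prev) $ MV.swap mv (j - 1) j >> sift (j - 1)
        in sift i

The caller never sees the mutable vector, so from the outside this is just another pure function.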
@Stephen – implementing an array as a tree has other costs though,
even if you disregard the better cache utilization of a flat array.
What's the cost of a read to a random location of an array implemented
as a tree?
As a side note, I've found that in many cases (on my hardware, where
N < ~50) implementing a frequently read map as a wrapper around
std::vector in C++ is faster than implementing it directly as a
std::map. That is to say, even when you have increased algorithmic
complexity (O(N) vs O(log N)) you can still get a win if your accesses
are fast enough. With an array implemented as a tree, I suspect you are
trading off both algorithmic complexity (in other places such as reads,
as well as in the copy, since O(log N) is slower than O(1)) and memory
access speed.
I do like your idea of keeping interfaces functional, but
implementations mutable. Clean interfaces are always a good idea, and in
this case you can get the benefits of both worlds.
Yes, purely functional algorithms can't be faster than the imperative
ones [1]. Arrays and graphs are good examples, and in these cases an
imperative implementation is necessary for achieving the best
performance (though we have purely functional array and graph
libraries). But in many other cases there do exist functional data
structures that are asymptotically as efficient as their imperative
analogues (for more information, see Okasaki's "Purely Functional Data
Structures" and [2]).
Purity is not a silver bullet that solves all parallelism problems,
but it can make your life easier (it's obviously safe to run two
different pure functions in parallel). The point is not to get rid of
mutable state altogether (Haskell has great support for mutable state),
but to separate pure code from the impure and make the compiler enforce
that separation via the type system. Haskell's type system is even
advanced enough to allow pure code that uses mutable state internally
(and statically guarantee that the mutable state doesn't "escape")
[3].
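A tiny sketch of that last point, for concreteness (standard base libraries; sumST is just an illustrative name): runST's type guarantees the mutable reference below cannot leak, so the function is pure from the outside.

    import Control.Monad.ST (runST)
    import Data.STRef (newSTRef, modifySTRef', readSTRef)

    -- Pure from the outside, mutable on the inside: the STRef lives
    -- and dies entirely within runST.
    sumST :: Num a => [a] -> a
    sumST xs = runST $ do
      acc <- newSTRef 0
      mapM_ (\x -> modifySTRef' acc (+ x)) xs
      readSTRef acc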
Writing pure code is not the only approach to parallel programming
that Haskell offers. Software Transactional Memory[4], which is
fundamentally an abstraction built on top of mutable state, also
deserves a mention.
[1] http://stackoverflow.com/questions/1990464/efficiency-of-purely-functional-programming
[2] http://cstheory.stackexchange.com/questions/1539/whats-new-in-purely-functional-data-structures-since-okasaki
[3] http://www.haskell.org/ghc/docs/latest/html/libraries/base/Control-Monad-ST.html
[4] http://www.haskell.org/haskellwiki/Software_transactional_memory
Please note that the unsafeFoo functions are *not* the expected way to
use the array when you want mutable updates. They're there as escape
hatches to let people shoot themselves in the foot if they really insist
on it, but, as you say, that gives up safety.
Instead, you should use typed interfaces that allow the type checker
to verify that the use of mutation is only local and cannot be observed
from the outside (disallowing aliasing in particular); this is done with
STArray, for example, which uses the ST monad for precisely this
purpose.
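A small sketch of that pattern (using the standard array package; here the unboxed STUArray variant, and prefixSums is just an illustrative name): all the in-place writes happen inside ST, and runSTUArray freezes the result into an ordinary immutable array, with the type system preventing the mutable version from escaping.

    import Control.Monad (forM_)
    import Data.Array.ST (newListArray, readArray, writeArray, runSTUArray)
    import Data.Array.Unboxed (UArray)

    -- Running sums computed with in-place writes inside ST; callers
    -- only ever see the final immutable UArray.
    prefixSums :: [Int] -> UArray Int Int
    prefixSums xs = runSTUArray $ do
      let n = length xs
      arr <- newListArray (0, n - 1) xs
      forM_ [1 .. n - 1] $ \i -> do
        prev <- readArray arr (i - 1)
        cur  <- readArray arr i
        writeArray arr i (prev + cur)
      return arr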
Of course, not all cases are covered, and if you look hard enough
you'll find situations that are not (rejected as ill-typed by ST). In
those cases you're on your own, like in any other language.
(Your remark that it is possible to write code that is too hard to
analyze due to extreme chaining of dependencies but still qualifies as
"pure" is spot on. This is how the use of Haskell monads for effects
works. Similarly, if you use ST to apply side effects locally, you are
likely to use an imperative style locally, with the well-known
downsides.)
For the theoretical discussion on asymptotic bounds of functional
programs, see this very interesting StackOverflow discussion: http://stackoverflow.com/questions/1990464/efficiency-of-purely-functional-programming
I should note that I was speaking from the Haskell perspective. Most
other functional languages (e.g. OCaml, Scheme, Clojure) don't have this
strict compiler-enforced separation between side-effecting and pure
code.
I was a heavy user of OCaml for a while, and I teach Haskell in my
programming languages course. I have come to believe that "purity" in
programming languages is a great learning tool, but not a very good tool
for building software. I am amazed at how far one can go with purely
functional programming, and I think there is value in exploring those
frontiers, but one should recognize that pursuit as largely academic. I
believe the same applies to languages that rhapsodize about their OOP
purity (I happen to think OOP has been taken too far even in mainstream
programming). Some of your writings on Forth reflect a similar sentiment
about that corner of the design space.
One of the things I love about the CTM book by Van-Roy and Haridi is
that it has a nicely balanced view on this. They explore declarative
programming (a cousin to FP) in detail, then talk about instances where
the model becomes cumbersome enough that it is worth adding a bit of
state just to simplify the code. They also define their concept of
purity in a way that allows state to lurk behind an abstraction; if the
interface meets all the purity requirements, they are willing to cheat
behind the scenes and sneak some state in when no one is looking. They
don't consider the result tainted (think of the Haskell IO monad that
acts almost as a badge of shame for impure code).
In short, the extremes are a fertile ground for learning and
discovering, but being able to pick and choose the best of each model
for each job is better for practical use. I'm one of those who believes
my imperative programming is better because of my FP experience, but not
everything needs to be coerced into a side-effect-free model.
Yossi, you're neither utterly wrong nor trivially right.
Indeed there's no magical solution for efficiently mutating arrays
within purely functional programming. But there's an interesting way of
working around it, by limiting (localizing?) the usage of side-effects.
Take a look at the ST monad.
http://www.haskell.org/haskellwiki/Monad/ST
For the more general question you might want to look into the
excellent "Purely Functional Data Structures" by Chris Okasaki.
(oops, when I wrote my comment, Mikhail's and gasche's weren't there,
it's mostly redundant now)
But direct memory update is already an O(log(N)) operation and no more
efficient than a functional array! It's only O(1) because you can only
use a finite amount of memory on the von Neumann machine under your
desk.
Inefficiency – this is precisely why functional languages have not
gained much adoption in commercial code so far. But this may change in
the future as computational power increases. You may do it
inefficiently, but if the user can hardly notice any difference in
time – then who cares?
There will of course always be realms where efficiency is very
important: games, graphics, scientific computation, etc. However, the
mainstream desktop and web application front ends where performance
rarely matters can switch to functional programming even today, gaining
speed of development and safety at the cost of performance.
First, the big O notation is made to count things.
You can count memory cells (which gives you a space complexity), or you
can count operations (which gives you a time complexity), or something
else.
When you count micro-instructions, in a programming language with
side effects, indeed, indexing an array is O(1), in the number of
micro-instructions.
But if your computer doesn't have Random Access Memory (RAM), but for
example, Delay Line Memory (DLM http://en.wikipedia.org/wiki/Delay_line_memory ),
then while the number of instructions needed to index an array will
still be O(1), the clock time needed will depend on the position of the
data in the line. And this doesn't only concern old technology; the
more recent magnetic bubble memories had linear access times too.
So the question is what the complexity of setting a slot in a
functional array is. If you implement a sorting algorithm in a
functional programming language, you can count the number of slots read
from functional arrays and the number of slots written to functional
arrays. Each kind of access, counting for 1, has a complexity of O(1).
Of course, if you try to implement your functional programming
language on a von Neumann computer, and if you do so by copying the old
elements, suddenly writing a slot will be O(n) in micro-instructions of
the von Neumann machine.
But nobody forces you to use a von Neumann machine. You could use a
quantum computer, and fork a new universe each time you write an array
slot, so you don't have to do the copying. Then setting a functional
array slot will be O(1) in universe forks.
Who knows if forking a universe takes aeons in paradise "time"?
Perhaps an irrelevant tangent, but I am currently building my first
iPhone app and I notice that Objective-C, although not a functional
language, has mutable and immutable versions of all of its data
structures, and the culture – documentation, example code etc. –
encourages use of the immutable versions unless mutability is really
needed.
A very common idiom, for example, is for the data provider to work on
a mutable data structure internally and then return an immutable
version, either by implicit casting (the mutable version is normally a
subclass of the immutable one) or by copy-once-when-finished.
This strikes me as a good example of what Russ is talking about:
you're not forced into pure functional contortions, but you are
encouraged to localise and separate your state-fiddling.
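For what it's worth, the same copy-once-when-finished idea exists in Haskell's vector package; here's a hedged sketch (the function squares is just illustrative): Data.Vector.create runs a builder that fills a mutable vector and then hands back an immutable one.

    import Control.Monad (forM_)
    import qualified Data.Vector as V
    import qualified Data.Vector.Mutable as MV

    -- Build mutably, publish immutably: roughly the NSMutableArray ->
    -- NSArray idiom, with the type system enforcing the hand-off.
    squares :: Int -> V.Vector Int
    squares n = V.create $ do
      mv <- MV.new n
      forM_ [0 .. n - 1] $ \i -> MV.write mv i (i * i)
      return mv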
Thanks for all the comments. The bit about "hidden side effects"
(pure interface, impure implementation) seems really interesting; I
wonder if this works for internal state variables maintained across
calls (as opposed to only supporting clobbering temporaries).
Regarding von Neumann machines being just one possible model –
firstly, well, all of the standard complexity analysis assumes this
model and it's awfully widespread, but secondly, I think it scales
rather nicely. In particular, why would access through a pointer require
O(log(N)) operations – because you need log(N) pointer bits? 128 pointer
bits, or even a smaller constant, AFAIK allow you to address every atom
in the universe or something.
Reading this post felt a bit odd. Then I skipped to the comments
section. I think Pascal Bourguignon's comment (#11) best captures my
reaction to this post.
Upon re-reading your post, I found this beginning paragraph to be
the culprit. Quote: «... very basic complexity analysis that we all use
assumes a von Neumann machine with ...»
The heart of complexity analysis is counting the steps of an
algorithm. The unit of a step is usually a mathematical
abstraction.
For example, Euclid's algorithm for GCD (the earliest, or one of the
earliest, algorithms), the prime sieve, or various sorting algorithms.
In the literature, we don't see mention of CPU/memory architecture,
because it doesn't matter.
So I'd consider a functional-programming operation that creates a new
array to be O(1), just like an imperative language updating one element.
The actual unit considered depends on the context.
My 2 cents. Thank you for your blog; I enjoy reading it.
Well, it's true that we count "steps" and we could count anything as
"a single step"; the question is when such counting is useful. If a new
array with N elements and one element modified is said to be created in
a single step, how is the resulting counting useful? The naive
implementation – copying – will take a varying amount of (physical) time
proportionate to the number of elements, N, so calling it O(1) is not
useful for predicting observed performance. Which less naive
implementation would actually be O(1), observably – take an amount of
physical time limited by a reasonably small constant (not something
larger than the amount of time the universe could possibly exist) that
is not proportionate to N?
Hi Yossi (#15), I think what unit to use needs to be case-specific and
goal-specific.
For example, let's say we implement factorial by recursion, in Scheme
and in Python. In Python, the storage will blow up quickly. How can the
same algorithm now have two different complexities? Of course it depends
on the level of detail, and on our purpose. It is in this sense that I
think analyzing FP's complexity in terms of CPU-level detail is
incongruous in general.
Suppose we write a basic prime sieve to get the first one thousand
primes, in OCaml and also in Python. Depending on the goal of our
study, we could say they both have the same complexity, because after
all it's the same algorithm. Or we could compare the CPU instructions
generated for particular hardware. Then it also drags in the specific
compiler used, and how the code is written in each language (e.g. does
it use an "array" or a "list", does it use a list comprehension or
lambda/map). So the same algorithm can actually result in a different
order of O() for the two implementations, and I think that for SOME,
the FP version's resulting O() could be an order less than Python's. So
yeah, I think it all depends.
Another example: what's the algorithmic complexity of i = i+1 vs ++i
in C? If we compare CPU instructions, don't they differ by some amount
with a decade-old compiler?
In the i++ example, both versions are O(1). I was talking about
asymptotic complexity.
Of course you could say that a functional specification of sorting is
always O(n*log(n)) given a sufficiently smart compiler; in that sense,
bubble sort is O(n*log(n)) given a sufficiently smart compiler. But that
sort of misses the point of complexity analysis of algorithms.
I think the question of the impact of not having random access for
writing on asymptotic complexity is a valid one – I'm not saying it is
trivial or that I fully answered it but I don't think "it depends"
answers it any more than "O(n*log(n))" is a valid answer to the question
of "what's the complexity of bubble sort". I think that to show that
losing random access doesn't affect complexity you'd need to show a
compilation strategy working around the problem, not merely point out
that it could exist in theory. I think there's no such strategy to date,
which is why functional languages offer cons but no "new array with one
modified element" operation.
Hi Yossi, I guess what you said makes sense, but I don't see the
utility of this conclusion, given its assumptions.
As an analogy, say we have a list of 1 to 10, and we want to add one
to each. We can do a for loop, or we can map a lambda over it. With
certain assumptions, we can decidedly say that the map approach will
ALWAYS have double the storage requirement of the imperative approach.
But what's the utility of this overly general conclusion?
On a different note, I don't think functional languages in general use
cons; I think it's mostly just Lisp (i.e. its built-in "list" is made of
cons cells, which form a "proper list" when structured in a certain way,
with the last linked pair ending in "nil" – that is, a linked list).
Mathematica's "List" is C's array; Mathematica has nothing like a linked
list (nor do Perl, PHP, or Python, as far as I know). Of course, in any
high-level language, a linked list à la Lisp can be built by nesting
pairs. This is an oft-used technique in Mathematica, e.g. nesting as a
way to append, then calling Flatten on the result to get rid of the
nesting.
I know almost no Mathematica; is it purely functional, and how do you
build new collections out of old ones?
In Haskell and ML as well as Lisps, there is a cons – I don't think
they name it so and perhaps they rightly don't since it's a bit of a
weird mnemonic, but they have it. That is, you take a head and a tail
and make a new list; this can be made to work predictably efficiently, I
think.
How people usually build new arrays out of old ones in purely
functional languages or language subsets I don't know; what does
Mathematica do?
As to mapping a lambda vs a for – I think a relatively simple
compiler can convert the map to a for; also it's not an asymptotic
performance difference. What I wonder about is whether purity costs
asymptotic performance, because that kind of thing is not something
stronger hardware can really buy back – not for a sufficiently large
N.
(I imagine one possible solution is to add a bunch of pure functional
primitives such as histogram() that use side effects in their
implementations but have a pure interface; if so, I wonder if you pay a
price in generality – or is there actually an approach letting you
encapsulate side effects in functional primitives so that every
algorithm can be made as efficient as with a writable RAM? If such an
approach exists, it'd answer my question spectacularly and prove my gut
feeling wrong; if a less general approach exists that can be shown to
handle an important sub-space of problems, it's rather interesting as
well.)
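(For what it's worth, standard Haskell's array package already ships a primitive in roughly this spirit: accumArray has a pure interface, and GHC implements it internally with a mutable array. A minimal sketch of a histogram built on it – the name and the bucket convention are just illustrative:)

    import Data.Array (Array, accumArray)

    -- Count how many samples fall into each bucket, assuming every
    -- x is within [0, buckets - 1].
    histogram :: Int -> [Int] -> Array Int Int
    histogram buckets xs = accumArray (+) 0 (0, buckets - 1) [(x, 1) | x <- xs]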
Hi Yossi!
I haven't researched this enough, but it seems to me that pure
functions with the added feature of destroying their input might be a
solution. Imagine a (pure) function that creates a new array from an old
one by replacing one element, with the peculiar feature that using the
original array again is an error. It's obvious how using it in code
translates to O(1) machine code, but the code is still functional.
Here http://www.cse.unsw.edu.au/~benl/papers/thesis/lippmeier-impure-world.pdf
is a (surprisingly pleasant to read) PhD thesis, dealing with exactly
this problem. I've read only the first two chapters, so I'm not sure
what's exactly the solution he suggests, but the author dedicates the
work to his beloved data structure, an array with O(1) updates.
Here http://code.google.com/p/decac/ is a programming
language based on the ideas of the above thesis, which aims to give
"bare-metal performance" while using modern programming languages
ideas.
Neither of these languages got a lot of success or development. However,
I do hope that something similar might work one day. I think that
functional programming that can be translated to efficient machine code
in an obvious way can be a really nice thing.
> The bit about "hidden side effects" (pure interface, impure implementation) seems really interesting; I wonder if this works for internal state variables maintained across calls (as opposed to only supporting clobbering temporaries).
You can do that in Haskell (a good example is the DiffArray data
structure [1][2]), but the type system won't be able to guarantee that
your program does not go wrong in that case, so you'll have to use an
escape hatch (such as unsafePerformIO).
[1] http://www.lri.fr/~filliatr/ftp/publis/puf-wml07.ps
[2] http://hackage.haskell.org/package/diffarray
Mathematica (M) isn't purely functional. It's somewhat like Common
Lisp in that regard. It's big, with a few thousand functions built in.
All paradigms can be used, but functional programming and
pattern-matching term rewriting are naturally primary. It's my favorite
language. ( http://en.wikipedia.org/wiki/Rewriting )
In M, there are Append and Prepend http://reference.wolfram.com/mathematica/ref/Prepend.html
but they're costly: they basically make a copy of the array. For
seasoned M programmers, if we want to grow a list, a common idiom is to
start with a huge number of elements (e.g. of 0), then chop off the
inert tail when done. Or, nest in each step, e.g. mylist = {mylist,
newItem}, then call Flatten[mylist] at the end.
I'm aware that OCaml and Haskell have linked lists, but I've only
dabbled in OCaml, and my attempt to study Haskell went nowhere. On the
other hand, I've been coding Emacs Lisp for a few years. I hated Lisp
cons! I think the linked lists in Haskell and OCaml are qualitatively
different from Lisp's lists, because they don't expose the "cons"
primitive as a list constructor; the fact that it's a linked list is
only at the implementation level.
I don't have experience with low-level languages such as C, C++, or
assembly, nor with systems programming, compilers, or CPU/memory
architecture... So I can only wave my hands at your deeper technical
inquiries in this discussion. ☺
@Noam: that's really interesting – thanks!
A function or expression updating its input instead of copying it on
the grounds of there being no references to the input could take you
some distance I guess; a simple recursive histogram function could work
that way, for instance. It's not intuitive for me what exactly this
distance is.
One important problem FP solves is aliasing. Aliasing makes
optimising memory/time trade-offs and reasoning about code hard. While a
convenience, mutable state is what pulls the rug out from under you
while you are not looking. Dependent types try to address both the
"preventing copying bits around" and "constant-time access for a
constant amount of memory" problems by giving a little too much power to
the type system. While the optimisation opportunities sound great in
theory, they lead to rather baroque syntax which helps them stay obscure
to the rest of us.
Access to arbitrary amounts of memory is not actually a constant-time
operation; average/worst-case time is assumed for analysis.
Arbitrary-length operations don't run as fast as machine instructions
either.
Another thing worth looking at to understand where functional
programming stands relative to imperative is reasoning about
continuations. To get a logical view of the world (the world stays the
same upon continuation re-entry), you either need to copy whole call
frames (and related state) and restore them upon re-entry, or pretend
that nothing you may care about will change. Pretending nothing will
change is way safer for a functional language, although it is heavy on
memory consumption. An imperative language needs to take aliasing into
account while copying the modified call frames. Just ditching the
logical-view guarantee and letting the programmer make sure nothing bad
will happen is easier (e.g. Prolog with cuts). Once you try to
incorporate call/cc and friends, complexity analysis becomes impossible
for mere mortals with mutable state.
> A function or expression updating its input instead of copying it on the grounds of there being no references to the input could take you some distance I guess;
This is basically what uniqueness types give you – the type system
guarantees that values with unique types are used in a single threaded
manner. Uniqueness types were developed as a solution to the problem of
combining lazy evaluation with side effects, just like Haskell's ST and
IO monads. Ben Lippmeier discusses these matters in much more detail in
his wonderful dissertation (linked above) and also presents another
solution to the same problem.
Regarding uniqueness types and Lippmeier's thesis – I haven't read
it, but it sounds a lot like linear types (http://en.wikipedia.org/wiki/Linear_type_system),
which are the flagship feature of the Clean language (a cousin of
Haskell), and have since been adopted elsewhere (see the Wikipedia
article).
AFAICT, there is a slight difference between uniqueness types (at
least as implemented by Clean) and Linear types (as presented in Baker /
Wadler):
in Clean, unique <= shared, so a unique type can be coerced to its
shared type.
In contrast, linear type systems require every binding to be used
exactly once.
On a different note, there is actually a wealth of type systems
designed to control aliasing with varying degrees of precision and
sophistication.
See:
Francois Pottier: Wandering through linear types, capabilities, and
regions.
The perceived O(1) update for random access arrays is, unfortunately,
misleading – a consequence of many simplifications for an abstract model
that cannot physically be implemented. In practice, memory is allocated
in physical space, and there are (minimally) speed of light delays.
Further, unique addresses for memory necessarily have average size
lg(N). If memory is allocated in 2D space, the best we can feasibly do
is O(N^(1/2) * lg(N)). If memory is allocated in 3D space, we can
improve this to O(N^(1/3) * lg(N)).
In practice, the discrepancy between O(1) and actual access times
tends to show up in terms of cache invalidation, virtual memory pages,
even disk reads or network lookups. This is interesting, in its own way,
because our machines are essentially modeling "motion" of memory – the
stuff we access is 'near' (in cache instead of memory, or in memory instead of on disk)
and thus more cheaply accessible.
In practice, it is not difficult to achieve identical asymptotic
performance in a pure language. For example, if we have a Map of
Integers to Values, we can typically update an integer in that map in
O(lg(N)) time (ignoring physical layout of the map). If you used a map
for your histogram, you'd ultimately have the same asymptotic
performance as you do for an array... you'd simply be a lot more
explicit and obvious about it.
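A hedged sketch of that map-based histogram in Haskell (assuming the containers package; histogramMap is just an illustrative name):

    import Data.List (foldl')
    import qualified Data.IntMap.Strict as IM

    -- Purely functional histogram: each insertWith costs O(min(n, W))
    -- for IntMap – roughly the O(lg N) bound discussed above – and
    -- every intermediate map remains a valid, shareable value.
    histogramMap :: [Int] -> IM.IntMap Int
    histogramMap = foldl' (\m x -> IM.insertWith (+) x 1 m) IM.empty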
If you're somewhat more clever, you could actually model full
cache-like motions, and thus enable O(1) amortized manipulations to
recently accessed values and their neighbors (cf. Gérard Huet's
Zippers). So, if you're sequentially updating every value in a
histogram, you can get equivalent physical-world asymptotic performance
as modifying an array.
Of course, coefficients cannot reasonably be ignored. A map update is
likely to involve, on average, many more memory operations and much more
indirection than a stateful array manipulation. Zippers, if used,
similarly require extra computations.
In practice your histogram is typically small enough to fit in an L1
cache. So being "much more explicit about asymptotic performance" means
that you're bringing slowdowns which are unavoidable on huge inputs to
problems where they're perfectly avoidable.
On the other hand I agree that realizing that random access is in
practice much slower than sequential access in many cases is valuable.
I'm just saying that fast RAM – small fast RAM as necessitated by
physics but fast nonetheless – is darn useful and castrating it seems
weird. The existence of the memory hierarchy on most machines, which you
use as evidence that O(1) random access is impossible, is actually also
evidence that getting O(1) with as small constants as possible for as
large a data set as possible is useful.
"you're bringing slowdowns which are unavoidable on huge inputs to
problems where they're perfectly avoidable" – True enough. I don't deny
that coefficients matter. They only don't matter to asymptotic
complexity, which was a significant focus of your article. :)
If we're aiming for coefficient performance in purely functional
programs, there are at least three paths to pursue.
First, it is quite feasible to develop cache-friendly, purely
functional structures – e.g. by using wider tree nodes (not two branches
per level, instead fifty or so – e.g. B-trees, or finger-trees with many
nodes, or ropes built on these). By going wider, we better fit paging
models and cache-lines. Of course, binary trees tend to be simpler to
implement, so there is a tradeoff on complexity.
Second, if we can recognize that there is no aliasing for a data
structure – or prevent it using the type system (i.e. linear or affine
types) – then we can update it in place without losing our ability to
reason about it in a purely functional manner. There are entire
languages and concepts built around this – e.g. Clean, Mercury, DDC
Haskell, ParaSail, the 'Wormholes' work for FRP. It's worth paying
attention to. I'm also developing a language that leverages these
features.
Third, the Zipper data structure – initially developed by Gérard
Huet in 1997, but still unknown and underutilized by most functional
programmers. It provides an excellent foundation for O(1)
navigation steps and local updates for any tree-like structure. That
includes kd-trees (which can model 2D grids and 3D volumes). It also
includes text cursors, navigation and user-focus in a user interface.
(Cleverly, ray-tracing can be understood as a form of navigation unto
collision.)
The zipper is another of the bridges between functional and
imperative. Like linearity, zippers do have a weakness with regards to
aliasing: you can't have multiple zippers in the same data structure.
The closest you can get is higher-order zippers, which model volumes
instead of focus/location as a single point.
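A minimal list-zipper sketch (plain Haskell, no libraries; the names are illustrative) of the O(1) navigation and focus update being described:

    -- A zipper over a list: elements to the left of the focus (stored
    -- reversed), the focus itself, and the elements to the right.
    data Zipper a = Zipper [a] a [a]

    left, right :: Zipper a -> Maybe (Zipper a)
    left  (Zipper (l:ls) x rs) = Just (Zipper ls l (x:rs))
    left  _                    = Nothing
    right (Zipper ls x (r:rs)) = Just (Zipper (x:ls) r rs)
    right _                    = Nothing

    -- Replacing the focused element is O(1) and purely functional.
    setFocus :: a -> Zipper a -> Zipper a
    setFocus x (Zipper ls _ rs) = Zipper ls x rs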
Related to the zipper and the tree, functional programming languages
have also developed a "finger-tree" concept, which is basically a tree
turned inside out so we can access and update the endpoints in O(1)
time. These make excellent purely functional deques. :)
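In Haskell, Data.Sequence (from the containers package) is the usual finger-tree deque; a quick sketch of those O(1) end operations (function names are illustrative):

    import Data.Sequence (Seq, ViewL(..), viewl, (<|), (|>))

    -- Push onto both ends in O(1) (amortized).
    pushBoth :: Int -> Int -> Seq Int -> Seq Int
    pushBoth front back s = front <| (s |> back)

    -- Pop from the front in O(1) (amortized).
    popFront :: Seq Int -> Maybe (Int, Seq Int)
    popFront s = case viewl s of
      EmptyL  -> Nothing
      x :< xs -> Just (x, xs)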
Anyhow, I think there is a lot of confusion about what is feasible in
purely functional programming. An expert imperative programmer doesn't
immediately know the idioms for functional performance, and might prefer
to remain an expert at imperative programming rather than become a
beginner at functional.
It's easy to learn "use more trees", but it can take years to learn a
full suite of idioms.
Anyhow, I'm not inclined to give up my "O(1) with as small constants
as possible". As I mentioned above, and before, I use zippers in that
role. Even better with linear in-place updates, and if we can keep it
cache friendly with wider structures. :)