
Clojure for the Really Brave: From Reductions to Transducers

I am currently refreshing my Clojure skills by going through the excellent (and funny) Clojure for the Brave and True. In Chapter 4, the author builds a simple CSV reader in order to identify potential vampires (sic!). The idea is to read the text file, parse its contents and then keep only those records which match certain criteria. The main part of the code is a reduction, which transforms the split CSV data into vectors and then into maps.

The code presented in the book is nice, idiomatic Clojure. But it is also quite complicated to read and understand, even for someone who knows this language. (For the curious, you can check the function mapify in Chapter 4.) So this is a good opportunity to see what a difference transducers can actually make. Reading, parsing and transforming a data set is a typical task for transducers. They allow us to express these operations in a much more readable way than the ordinary reduce, and additionally make it easy to compose multiple data transformations into data pipelines.
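To give a taste of where we are heading, here is a rough sketch of how the book's task might look with transducers. The sample data, the column layout (name, glitter-index) and the threshold of 3 are assumptions loosely inspired by the book's example, not its actual code:

(require '[clojure.string :as string])

;; Hypothetical sample data, loosely modeled on the book's suspects list:
(def csv "Edward,10\nBella,0\nJacob,3")

(defn parse-row [row]
  (let [[name glitter] (string/split row #",")]
    {:name name :glitter-index (Integer/parseInt glitter)}))

;; Parse each line into a map, keep only the glittery suspects,
;; and collect the survivors into a vector:
(transduce (comp (map parse-row)
                 (filter #(>= (:glitter-index %) 3)))
           conj
           (string/split csv #"\n"))
;; => [{:name "Edward", :glitter-index 10} {:name "Jacob", :glitter-index 3}]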

In this post, I want to explain how we can get from simple reductions to transductions. The aim is to get an understanding of the abstraction process which leads to transducers, and of how they can turn out to be so powerful. I will not cover all technical details; see the final section for further reading. The first two sections will introduce the important concepts, while the last section actually builds a transducer by refactoring a reduction.

Abstracting Reduction #

Transducers are an abstraction on the well-known and everyday workhorse reduce; they can be seen as the result of a continuous refactoring of the reduction function (see here). So let’s begin by looking more closely at what a reduction is, conceptually.

At its core, every reduce uses two types of operations: A transformation which does something with each element of the collection fed to the reduce, and a reducing function which collects the modified elements again, e.g. by accumulation. In the remainder of this article, we will continue to use this terminology.

For an illustration, look at the following example:

(reduce (fn [result k]
          (assoc result (keyword k) (count k)))
        {}
        ["hello" "there"])
;; => {:hello 5, :there 5}

The reduction here takes a collection of strings as its input and returns a map indicating how long each word is. Not a particularly good example, to be honest, but it serves our purpose. We can see our two basic types of operations at work: A string is transformed into a key-value pair of the form :string (count string), and each of these pairs is collected into a map. Incidentally, the example also illustrates that not all reductions actually reduce stuff, which is why the operation is also called, more abstractly, folding.

Despite its weaknesses, the example illustrates very well that these two basic operations are deeply nested within the reduction they define. If you look at the code, it is hard to see where the actual transformation and the collection take place. The mapping is hidden deep inside the reducing function, and the collecting function is only implicitly determined by the use of assoc and a hash map as an initial value.

Against this background, transducers can be fruitfully seen as a way to make these internal processes of the reduction function accessible from the outside. They put handles on them, so to speak. Instead of hiding the two defining operations in the reduction, they are directly exposed as parameters of the transduce function. Thus, our example can be reformulated in the following way using transducers:

(transduce 
  (map #(hash-map (keyword %) (count %)))
  merge 
  ["hello" "there"])
;; => {:hello 5, :there 5}

Compared with the previous code, this abstraction has at least three notable advantages. It is more readable (in the sense of: stating what it does); it separates the transformation from the collecting function; and it finally makes it possible to build effective and simple pipelines by just chaining transformations. Let’s follow these points through.

Expression #

The first point is that transducers allow us to express much more succinctly what the reduction is actually doing. We have seen that in each reduction, there are two operations at work which determine its specific output, a transformation and a collection of data. In the code example above, the transformation applied is in plain sight: It is a mapping (via map), more specifically, a mapping of some data (the argument to the anonymous function, %) to a key-value pair in a hash map.

The collecting function, too, is made explicit: it is passed as the second argument to transduce. In the original reduction, we had to know that assoc returns a hash map in order to understand the collection process. To see what this reduction returns, we are forced to actually go through the reduction process: We have to see that assoc returns a hash map, which is in turn passed as the first argument to the next step of the reduction, and is finally returned as the result. Thus the whole process of building the result is buried within the specifics of the function. In contrast, the transduction simply states that the transformed elements are merged into one final hash map.
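To see the benefit in action, we can keep the transformation and simply swap the collecting function. Collecting with conj instead of merge, for instance, gives us a vector of the individual maps:

;; Same transformation, different collecting function:
(transduce
  (map #(hash-map (keyword %) (count %)))
  conj
  ["hello" "there"])
;; => [{:hello 5} {:there 5}]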

In sum, the first gain is that the parameters actually determining the specific result of the reduction (its operative functions) are made explicit. We do not have to infer how the transformation and the collection are each actually done, but instead pass both explicitly as arguments.

Decoupling #

Turning transformations and collections into explicit (and separate) arguments gives rise to a further gain: We decouple the transformation from the actual collection it operates on. This is the focus set by Rich Hickey in a very good talk on transducers. The transforming functions can be understood – and actually operate – completely independently of the specific collecting method applied. This does not only ease programming, insofar as we can now address these two processes separately. It also brings an impressive gain in speed: The transformations are applied to the items, not to intermediate collections. The collection is the final, independent step.

To give an example: (map str) is a transforming function (a transducer) which can be understood independently of the final result of the transformation. It is thus not a transformation of list items, or of a collection, even though it looks like a simple partial function doing exactly that. Rather, and more simply, (map str) should be understood as a mapping from a given input to a given output (e.g., turning an integer into a string). This applies regardless of the final result the transformed data is collected in.

Have a look at the following snippet for an illustration. Here (map str) is invoked two times. In both cases, it does exactly the same thing. Yet the results are different because the mappings are passed to a different collector, conj or str:

(transduce (map str) conj [1 2 3])  ;; => ["1" "2" "3"]
(transduce (map str) str [1 2 3])    ;; => "123"

In the context of a transduction, it is helpful to read the map function as a pure mathematical operation which is independent from the actual collection to which it contributes. It literally maps something to something using a specific function, which is passed as its argument. In our case (map str), the actual mapping is done by the string function, str.
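This independence also means that the very same transducer can be used outside of transduce, for instance with into or sequence, which supply different collecting contexts:

;; One transducer, three collecting contexts:
(into [] (map str) [1 2 3])   ;; => ["1" "2" "3"]
(into #{} (map str) [1 2 3])  ;; => #{"1" "2" "3"} (in some order)
(sequence (map str) [1 2 3])  ;; => ("1" "2" "3")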

Thus understood, it can seem as if the expression map str is actually redundant. For what is str, being a function, other than a mapping of one value to another? But as a transducer, map is not a simple function, but a qualification of what is being done. It qualifies what is done to the input, namely, a mapping. Only the function passed as an argument to map specifies what this mapping actually looks like.

This is made clearer by contrasting map with filter. Compared to map, filter specifies a different kind of change: instead of mapping, it rather sieves out elements. The predicate passed to filter specifies the filter criterion in the same way the function passed to map specifies the mapping. Further, being a transducer and not a reduction, filter displays the same independence as its sibling. It guards the gates through which the data must pass regardless of the actual collection in which the data is finally gathered. We can thus adapt the example above:

(transduce (filter odd?) conj [1 2 3]) ;; => [1 3]
(transduce (filter odd?) str [1 2 3]) ;; => "13"

So in using transductions, the reducing function’s determining parameters are turned into explicit arguments (functions) which, on the operational level, are decoupled from each other. Instead of being forced to reason about the reducing function as a whole, with all its complexity, we can thus immediately identify the transforming function and the collecting function. Furthermore, we can use them independently of each other. We thereby capture transformation functions, such as mapping or filtering, in their essence. A mapping is not a mapping of one collection to another, but essentially a mapping of one element to another. This kind of abstraction is a good example of simple made easy.

Chaining #

Finally, abstracting the reduction yields a third – and decisive – advantage: The transformations described by the transducers can be chained, much like in a threading macro (->>). Transducers can be freely combined, effectively building a data pipeline through which the input flows. Isolated from the collecting functions, the transformations become independent building blocks for a potentially enormously complex transformation process. Have a look at this gist from Rich Hickey to get an impression.

How is this power of combination possible? To approach this question, let’s expand our previous example and add a second transformation, filtering, to the pipeline. We want to add the restriction that only those items which match a certain criterion get converted into strings. We thus combine map with filter, as can be seen in the code:

(transduce (comp (filter odd?) (map str)) conj [1 2 3]) ;; => ["1" "3"]

At a first glance, this combination (using the composition function comp) looks like some kind of threading. It seems as if we are successively applying transformations to a collection, which we are in turn passing around to partial functions (for a discussion of this deceptive similarity, see this article). Yet there is only a superficial similarity between (map str), the transducer, and (map str), the partial expression which is extended by the threading macro.
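For comparison, here is the same pipeline once as a thread and once as a transduction. The results look alike, but the threading version passes (lazy) collections from step to step, while the transducer version first composes a function and only then runs over the data:

(->> [1 2 3] (filter odd?) (map str))                    ;; => ("1" "3")
(transduce (comp (filter odd?) (map str)) conj [1 2 3])  ;; => ["1" "3"]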

The most important difference: (map str) and friends are higher-order functions, transforming a function, not data. Thus the combination of transformations, the data pipeline, is not a thread, but a function composition. The pipeline consists of successive modifications of the original reduction function. As a consequence, (map str) can only be used in the context of a reduction: it expects a reducing function as its argument and won’t accept a collection.
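We can see this directly in the REPL: applied to conj, (map str) returns a new reducing function, which can then be handed straight to reduce:

;; (map str) modifies conj into a new reducing function:
(reduce ((map str) conj) [] [1 2 3]) ;; => ["1" "2" "3"]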

Here again, we are confronted with the division of labor introduced by the abstraction of transducers. The transforming function, now defined outside of the reduction, does not know anything about how the data is actually collected (reduced). The ‘manual work’ of passing the data around is delegated to the reducing function. Specific transducers like map or filter are thus modifiers of reducing functions. They take a reducing function, do their thing (e.g. mapping or filtering), and pass the result to the reducing function.

Building a transducer #

Let’s build a simple mapping transducer in order to see how this works. We will begin with a normal reduction, which we will then turn into a transducer. Expressed as a reduction, our str mapping looks like this:

(reduce (fn [res data]
          (conj res (str data)))
        []
        [1 2 3])
;; => ["1" "2" "3"]

We see how in this reduction, the two functions actually determining the outcome, conj and str, are well hidden. So let’s extract them in order to make them more accessible. To achieve this, we turn these functions into parameters of another function. And while we are at it, we will also extract the collection by turning it into a third, separate argument. Here’s the next iteration of our code:

(defn my-transduce [mapping-fn coll-fn coll]
  (reduce (fn [res data]
            (coll-fn res (mapping-fn data)))
          []
          coll))

(my-transduce str conj [1 2 3]) ;; => ["1" "2" "3"]

This already looks close to the original transduce! Of course, there are details to take care of. For example, the initial value to reduce is currently still passed as a constant. This prevents a full decoupling of the transformation functions from the collection, because we can currently only use collecting functions which build on a vector. But the general pattern is encouraging; we seem to be approaching real transducers here.
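As a quick fix, we could turn the initial value into an explicit argument as well. The following variant is a hypothetical intermediate step of my own, not part of the final design, but it shows that the collecting function is then no longer tied to vectors:

;; Initial value as an explicit argument (hypothetical intermediate step):
(defn my-transduce-2 [mapping-fn coll-fn init coll]
  (reduce (fn [res data]
            (coll-fn res (mapping-fn data)))
          init
          coll))

(my-transduce-2 str conj [] [1 2 3]) ;; => ["1" "2" "3"]
(my-transduce-2 str str "" [1 2 3])  ;; => "123"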

We have to do more, however, if we want to have composable transducers. Remember that transducers are composable because they modify functions, not data? Our present function, however, still modifies data. It is nothing but a wrapper around a simple reduce. So in our next step, we have to factor out the reduction itself – by turning our function into a function which modifies a reducing function. We thereby extract the reducer, turning it into an argument. This means, in consequence, that our new function cannot do the reduction alone. In order to use it, we will always have to pass it a further function, the reducing function.

Let’s rewrite our transducer in order to see what that means:

(defn my-transducer [mapping-fn]
  (fn [rf]
    (fn [res elt]
      (rf res (mapping-fn elt)))))

(reduce ((my-transducer str) conj) [] [1 2 3]) ;; => ["1" "2" "3"]

If we squint our eyes (and skip over the parentheses), we can still recognize the resemblance to the original transduce. But it has gotten complicated. The new transducer is quite tricky. Like our first version, it still needs two functions, the mapping function and the collecting function. But they are no longer passed together: Only the mapping function is an explicit argument to my-transducer. It is the only top-level argument.

The collecting function reappears one level down: The function returned by calling my-transducer requires a reducing function, rf, as its argument. This reducing function is then called within the new transducer, namely at the place where the actual transformation (the mapping) occurs. If you look back at our initial reduction, this is exactly where we had put the collecting function conj.

Here’s the code again with some explanations:

;; top-level argument: the transforming function
(defn my-transducer [mapping-fn]
  ;; rf is the argument to the function returned by `(my-transducer f)`
  (fn [rf]
    ;; this is the actual reduction, in our case using mapping-fn
    ;; for -- you name it! -- mapping:
    (fn [res elt]
      (rf res (mapping-fn elt)))))

;; As first argument to `reduce`, 
;; we call (my-transducer str) as a function
;; with `conj` as its argument:
(reduce ((my-transducer str) conj) [] [1 2 3])

But where has our collecting function gone? Initially, we had stated that a reduction consists of two defining functions, a transformation and a collection. But here, the collecting function is just another variant of the reducing function.

This result might be surprising, since we had treated “collecting” and “reducing” as distinct steps. But if we look at it, a collecting function is, in a formal sense, just another reducing function: It takes an accumulated result and an element (fn [res elt]), and it returns a new, potentially modified result. This is why in the first section, I did not actually distinguish between a collecting function and a reducing function.
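A quick look at conj and str confirms this: both already have the shape of a reducing function, taking an accumulated result and one new element:

(conj ["1"] "2") ;; => ["1" "2"]
(str "1" "2")    ;; => "12"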

So we have gained yet another abstractive insight: A reduction is defined by a transformation which passes its result on to another reducing function. This kind of recursive definition is what allows for the chaining of the transducers. The reducing function returned by our transducer passes the (potentially modified) elements to the (unknown) reducing function, and then returns this reduced result.

The collecting function, in this view, is just the last reducer in our chain. It actually breaks the chain because it returns a collection, and not yet another modified reducing function. Once we collect things, they are not in the pipeline anymore. But until then, all transducers pass their specific transformations to the reducing function with which they have been called. In short, calling a transducer returns a function which accepts a reducing function and expands it by adding its own kind of processing to the flow.
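We can watch this expansion happen by giving the wrapped reducing function a name of our own (str-conj is just an illustrative label) and calling it with a single intermediate result and a single element:

(def str-conj ((my-transducer str) conj))

;; str-conj behaves like conj, except that it first maps the element with str:
(str-conj ["0"] 1) ;; => ["0" "1"]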

As long as it is used for reduction (and not for the final step, the collection), a transducer is therefore essentially ‘unsaturated’. It does not know what to do with the data, because that task is delegated to the reducing function. It needs this function in order to do its job; it cannot do it on its own. This higher-order behavior – a function returning a function which still requires a function as its argument – explains the convoluted syntax of our custom transducer. Look at the parentheses!

(reduce ((my-transducer str) conj) [] [1 2 3])

In this snippet, reduce receives a function with two arguments, an accumulator and an element. This is the required form: (fn [acc elt]). This function, however, is not passed literally. It is built in two steps using my-transducer, which is why we have two layers of parentheses. The inner parentheses (my-transducer str) return a function which, in turn, is called with the argument conj. So the two arguments are now passed in two distinct steps.

In order to build a data pipeline, all we have to do now is to replace the second argument – conj – with yet another reducing function. Remember that collecting functions are just reducing functions, distinct only in that they actually build up collections (rather than transform elements). This second function, of course, again needs another function in order to continue the chain. In our case, however, we decide to finish the data pipeline by passing it conj as the final reducing function.

An example will help to illustrate the point. We will adapt our little program and rename our original transducer, since we have now two of them:

(defn my-mapping [mapping-fn]
  (fn [rf]
    (fn [res elt]
      (rf res (mapping-fn elt)))))

(defn my-filter [filter-fn]
  (fn [rf]
    (fn [res elt]
      (if (filter-fn elt)
        (rf res elt)
        res))))

(reduce ((comp (my-filter odd?) (my-mapping str)) conj) [] [1 2 3])
;; => ["1" "3"]

I hope it is clear what is happening here. Our filtering transducer takes its input, checks if it matches the predicate, and only then passes it to the mapping function. If there is no match, the result (the accumulator of the reduction) is returned unchanged, bypassing the mapping function. The data pipeline is cut.

We can also see how the pipeline is built by inserting a reduction function in front of our conj collector. Let’s compare our first, simple transducer with only one transformation, with the second one, which already establishes a pipeline:

;; simple transformation:
(reduce ((comp (my-mapping str)) conj) [] [1 2 3])
;; transformation pipeline:
(reduce ((comp (my-filter odd?) (my-mapping str)) conj) [] [1 2 3])

One last point is that we have two ways to look at this function composition, which can lead to confusion. If we just look at the expression, our mini data pipeline looks very much like some data being passed first to my-filter and then to my-mapping. But this is wrong, since comp is just a function composition. In a function composition, the “last” function is always called first.

;; Here we construct a function composition and call it with x:
((comp f g) x)
;; Here we do the same without comp. You see that g is called first:
(f (g x))

So how is it possible that the function called first by the composition (my-mapping) ends up being the final stage of our data pipeline? The point to keep in mind here is that a transducer, when called, does not return a collection, but a modified function. More precisely, reduce is not passing its data to the composed functions themselves, but to the function returned by the composition (with the collecting function added as its final argument). That function is then called with the elements of the collection to be reduced, and here the data flow aligns with our intuitive understanding.
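Spelling out the composition makes this visible. Applying the composed transducer to conj is the same as nesting the calls by hand, and the resulting reducing function behaves exactly like the comp version above:

;; (comp (my-filter odd?) (my-mapping str)) applied to conj, written out by hand:
(reduce ((my-filter odd?) ((my-mapping str) conj)) [] [1 2 3])
;; => ["1" "3"]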

In our example, the data is initially passed to the filter step (the outermost layer of the composed reducing function). This reducing function checks whether a particular data element passes the test, and only then calls the reducing function it has received as its argument. In other words, if the data element does not pass the test, the currently accumulated result is simply returned unchanged and the rest of the pipeline is skipped. In this way, the particular data element just being processed is ignored. If the data element matches the predicate, however, it is added to the accumulated result by passing it to the reducing function.

In sum, the transforming function composition yields a chained function which, as a whole, receives a reducing function and returns a new function. This new function cannot yet be used by reduce, because it remains “unsaturated”, as it were: It still needs a reducing function as its argument. In order to actually use the transformation, we have to add a collecting function as its final completing block. Once we pass a collecting function to a transducer, the transducer becomes “saturated”. It cannot be further chained, or positively put: it does not require any more arguments. As a complete reducing function, it can be passed directly to reduce as its reducing function.

Implementation Details #

This is what we got so far: a transducer is a higher-order function which can be chained because it always returns a function which needs another function as its argument. Using comp, we can chain these transducers, each one wrapping the reducing function produced by the next. In order to put an end to the chain of transformations thereby expressed, we have to add one last, final reducing function, the collector.

We can now turn back to the implementation of transduce, where this overall conception is clearly discernible. First of all, in its minimal form, transduce takes exactly three mandatory arguments which express these basic concepts:

(transduce xform f coll)

We can easily map our concepts to this argument list. xform is the (composed) transformation: a function which still requires a reducing function as its argument – an unsaturated function, in the terms used above. f, the collecting function, is passed to xform as its finalizing argument, thereby closing the pipeline at one end. Finally, coll refers to the collection on which the reduction is applied. It is the start of the pipeline.

We can thus define transduce using the following code:

(defn my-transduce [xform f coll]
  (reduce (xform f) coll))

In reality, of course, things are a bit more complicated. We have to deal with the fact that a reduction needs an initial value, e.g. an empty list or an empty vector. This is being taken care of by introducing the convention that if called with no argument, the reducing function returns its initial (empty) value. That is, (conj) returns the empty vector, while (merge) returns nil.
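We can check this convention directly in the REPL:

;; Called with no argument, the reducing function yields its initial value:
(conj)  ;; => []
(str)   ;; => ""
(merge) ;; => nil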

We also need to take into account that some final reducing functions might be stateful. Say we are writing to a file, for example. In this case, it makes sense to close the file stream after processing. For this, the full Clojure implementation of transduce adds a completion step, which calls f (the “completing function”) one last time with the final result as its only argument.

This is why the full implementation of transduce in Clojure looks more like this:

(defn my-transduce [xform f coll]
  ;; (f) supplies the initial value; the outer call to f completes the result
  (f (reduce (xform f) (f) coll)))

Transducers, on the other hand, also have to follow some conventions in order to be used. In particular, they have to return a function which can be called with three different arities: Without any argument, it should return the initial value (typically by calling the reducing function it wraps with no arguments). With one argument, it should return the finalized result. And the actual transformation is done in the third variant of the function, which accepts an intermediate result and an input, and is asked to do something with them.

;; Conventional form of a transducer:
(defn typical-transducer [& args]
  (fn [rf]
    (fn
      ([] (rf))          ;; no argument: return the initial value
      ([res] (rf res))   ;; one argument: finalize the result
      ([res elt]
       ;; two arguments: do something with elt, then hand it on to rf
       (rf res elt)))))
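To close the circle, here is a sketch of my-mapping upgraded to this convention. It simply delegates the extra arities to the wrapped reducing function and can then be used with the real transduce (clojure.core/map does essentially the same, plus a few extras); the name my-mapping* is just ours:

(defn my-mapping* [mapping-fn]
  (fn [rf]
    (fn
      ([] (rf))
      ([res] (rf res))
      ([res elt] (rf res (mapping-fn elt))))))

(transduce (my-mapping* str) conj [1 2 3]) ;; => ["1" "2" "3"]
(into [] (my-mapping* str) [1 2 3])        ;; => ["1" "2" "3"]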

Conclusion #

Transducers are an interesting abstraction which highlights some of the core benefits of Clojure. In particular, this abstraction allows us to better understand the essence of what we are actually doing when we are using reduce. The terminology can be confusing, since we are dealing with functions that return functions which in turn require functions as their argument. But once understood, transducers become a powerful and useful tool. They allow us to express our data transformations more succinctly, they separate the different tasks involved (e.g. collection and transformation), and they open the way to a more modular and composable style of programming.

Further Readings #
