Iteratee, working with data stream

Last updated at 2024-12-09. Posted at 2024-12-06.

This article is for EDOCODE Advent Calendar 2024, published on Fri, Dec 6.
The previous article was written by Takamasa Tamura, CEO of Edocode: Thoughts on Starting a New Business (in Japanese; the original title is "新規事業に挑戦する"ことについて考える).
Also, please check out the Wano Advent Calendar by our parent group company!

In functional programming, the iteratee is a powerful abstraction that provides a way to process data streams in a memory-efficient, composable manner. As data processing becomes increasingly prevalent in modern applications, Iteratees offer a structured approach to handling input/output operations, especially when working with large datasets or streams that might otherwise overwhelm memory resources.

What is an iteratee

An iteratee is a process that consumes (instead of produces) data incrementally. It is usually implemented as a function of type A -> Impure (), with optional finalisers or additional features.

A basic implementation might look like this:

type Iteratee<A> = {
  feed: (a: A) => void;
  complete: () => void;
};
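
As a minimal sketch of how this type might be used, the following builds an iteratee that sums numbers and feeds it by hand. The names `summing` and `result` are assumptions made for illustration; they do not come from any library.

```typescript
type Iteratee<A> = {
  feed: (a: A) => void;
  complete: () => void;
};

// Hypothetical helper: an iteratee that keeps a running total.
// `result` is an assumed extra accessor for reading the total afterwards.
function summing(): Iteratee<number> & { result: () => number } {
  let total = 0;
  let done = false;
  return {
    feed: (n) => {
      if (done) throw new Error("feed called after complete");
      total += n;
    },
    complete: () => {
      done = true;
    },
    result: () => total,
  };
}

const sum = summing();
sum.feed(1);
sum.feed(2);
sum.feed(3);
sum.complete();
console.log(sum.result()); // 6
```

The caller decides when each value is fed, and the iteratee never sees more than one value at a time.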

Incremental processing

Incremental data processing is one of the core strengths of Iteratees. Instead of requiring the entire dataset to be loaded into memory, Iteratees consume data in chunks, making them highly efficient for processing large or unbounded streams. This approach allows developers to handle real-time data, optimise memory usage, and improve performance in scenarios like file processing, network streams, and database queries.
It is critical for:

memory efficiency: Large datasets (e.g., log files, streamed video data) can be processed in small chunks without overwhelming system memory.
real-time data handling: Iteratees can process incoming data as it arrives, rather than waiting for all of it to be collected, making them suitable for live systems such as event feeds.
efficient error handling: Errors can be detected and handled as they occur during the data stream, avoiding wasted computation.

Example: counting lines in a file

import { createReadStream } from "fs";

const textStream = createReadStream("largeTextFile.txt", { encoding: "utf8" });
let lines = 1;
textStream.on("data", (chunk: string) => {
  for (const ch of chunk) {
    if (ch === "\n") {
      lines += 1;
    }
  }
});

textStream.on("end", () => {
  console.log(`Total lines: ${lines}`);
});

Note that although this example does not use the term iteratee explicitly, it demonstrates the concept: the event handlers passed to the stream play the role of an iteratee.
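
To make the connection explicit, the same counting logic could be packaged as an `Iteratee<string>`. This is only a sketch: `lineCounter` and its `onDone` callback are hypothetical names, and the chunks are fed by hand instead of coming from a real file.

```typescript
type Iteratee<A> = {
  feed: (a: A) => void;
  complete: () => void;
};

// Hypothetical factory: counts lines the same way as the stream example
// (start at 1, add one per "\n") and reports the total through `onDone`.
function lineCounter(onDone: (lines: number) => void): Iteratee<string> {
  let lines = 1;
  return {
    feed: (chunk) => {
      for (const ch of chunk) {
        if (ch === "\n") lines += 1;
      }
    },
    complete: () => onDone(lines),
  };
}

// Chunk boundaries do not matter: a line may be split across two chunks.
const counter = lineCounter((n) => console.log(`Total lines: ${n}`));
counter.feed("first\nsec");
counter.feed("ond\nthird");
counter.complete(); // prints "Total lines: 3"
```

With a real stream, `counter.feed` would be the "data" handler and `counter.complete` the "end" handler.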

Hold on, but why not iterator?

You might wonder, why not just use an iterator? After all, it shares many characteristics and benefits with an iteratee. In fact, an iteratee is essentially the dual of an iterator. Imagine reversing the flow of the next method, and it gives you an iteratee.
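
One way to picture the duality is in the types: with an iterator the consumer calls `next` and receives a value, while with an iteratee the producer calls `feed` and hands a value over. The sketch below uses a simplified `Pull` type instead of the built-in `Iterator`, and `drive` and `fromArray` are hypothetical helpers.

```typescript
// Pull: the consumer calls the producer and gets a value back.
// `undefined` marks the end, so A itself must not include undefined.
type Pull<A> = { next: () => A | undefined };

// Push: the producer calls the consumer and passes a value in.
type Iteratee<A> = {
  feed: (a: A) => void;
  complete: () => void;
};

// Hypothetical adapter: drains a pull-based source into a push-based consumer.
function drive<A>(source: Pull<A>, k: Iteratee<A>): void {
  for (let a = source.next(); a !== undefined; a = source.next()) {
    k.feed(a);
  }
  k.complete();
}

// A pull source over an array.
function fromArray<A>(items: A[]): Pull<A> {
  let i = 0;
  return { next: () => (i < items.length ? items[i++] : undefined) };
}

// An iteratee that records everything it is fed.
const seen: string[] = [];
drive(fromArray(["a", "b", "c"]), {
  feed: (s) => {
    seen.push(s);
  },
  complete: () => {
    seen.push("<end>");
  },
});
console.log(seen.join(",")); // a,b,c,<end>
```

The value flows out of `next` as a return value and into `feed` as an argument, which is the reversal described above.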

Asynchronous processing

Iterators operate in a pull-based model. The consumer (e.g. a loop) explicitly requests the next piece of data using a method like next.
The consumer controls the flow of data and decides when to stop consuming.

Iteratees, by contrast, work in a push-based model, where the data producer actively pushes chunks of data to the consumer. The producer controls the flow, making iteratees well-suited for stream processing where data may arrive asynchronously or continuously (e.g., network sockets, file streams).

Example: HTTP stream handling

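import { get } from "http";

// Any Iteratee<Buffer> works here; this illustrative one counts the bytes received.
let bytes = 0;
const someIteratee: Iteratee<Buffer> = {
  feed: (chunk) => {
    bytes += chunk.length;
  },
  complete: () => console.log(`Received ${bytes} bytes`),
};

get("http://example.com", (r) => {
  r.on("data", (chunk) => {
    // do something with the data
    // or use another iteratee
    someIteratee.feed(chunk);
  });

  r.on("end", () => {
    someIteratee.complete();
  });
});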

Error handling and finalisation

Compare the following approaches:

Iterator pattern

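// The source hands back an iterable together with a way to release its resources.
function dataSource(): Iterable<Item> & { close(): void } {
  // return an iterator
}

const iterator = dataSource();

try {
  for (const item of iterator) {
    // do something
  }
} finally {
  // the consumer has to remember to release the source
  iterator.close();
}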

Iteratee pattern

function dataSourceWithIteratee(k: Iteratee<Item>): void {
  try {
    for (...) {
      k.feed(item);
    }
    k.complete();
  } finally {
    // finalise
  }
}

dataSourceWithIteratee({
  feed: (item) => {
    // do something with the item
  },
  complete: () => {
    // finalisation logic
  },
});

In the iteratee pattern, the producer ensures proper finalisation: its try/finally block runs whether all items are processed or the consumer throws part-way through. This shifts responsibility for cleanup from the consumer to the producer, reducing the risk of resource leaks.
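
A runnable sketch of this behaviour follows. The `Resource` type, `openFake`, and `produce` are hypothetical stand-ins for a real file handle or connection; the point is only that the producer closes the resource even when the consumer fails.

```typescript
type Iteratee<A> = {
  feed: (a: A) => void;
  complete: () => void;
};

// Hypothetical resource standing in for a file handle or a connection.
type Resource = {
  read: () => string[];
  close: () => void;
  isClosed: () => boolean;
};

function openFake(items: string[]): Resource {
  let closed = false;
  return {
    read: () => items,
    close: () => {
      closed = true;
    },
    isClosed: () => closed,
  };
}

// The producer owns the resource: it is closed in `finally`,
// even when the iteratee throws part-way through.
function produce(resource: Resource, k: Iteratee<string>): void {
  try {
    for (const item of resource.read()) {
      k.feed(item);
    }
    k.complete();
  } finally {
    resource.close();
  }
}

const resource = openFake(["a", "b", "c"]);
try {
  produce(resource, {
    feed: (item) => {
      if (item === "b") throw new Error("cannot handle b");
    },
    complete: () => console.log("done"),
  });
} catch (e) {
  console.log("consumer failed");
}
console.log(resource.isClosed()); // true
```

The consumer never touches `close`, so it cannot forget to call it.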

The challenges

An iteratee is just a combination of the idea of incremental processing and delimited continuation-passing. While iteratees offer powerful abstractions, they are not free, particularly in performance-intensive applications or when programming in low-level languages like Rust or C.

Complexity in low-level languages

The line-counting example uses closures, a feature that is straightforward in high-level languages but more challenging in low-level ones. Unfortunately, nearly all useful delimited continuations need closures. In Rust, for instance:

compiler-generated closures are distinct types that cannot be easily expressed;
nested closures can become cumbersome and introduce additional complexity.

Indirect jumps

Naively implementing iteratees with function pointers can lead to indirect jumps, which may degrade performance in performance-critical situations. Modern CPUs often struggle with branch prediction for such jumps, causing pipeline stalls.

Garbage collection overheads

Closures capture variables from their parent scope, potentially increasing memory usage. In garbage-collected environments, improper management of closures can lead to memory leaks or delays in cleanup.

So, should we avoid using iteratees or even closures?

Well, it depends. If your goal is to write maintainable, high-level code, iteratees and closures are incredibly useful. They simplify complex workflows, enable modularity, and provide memory-efficient solutions for handling data streams.
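
As an illustration of that modularity, small iteratees can be wrapped to build larger ones. The combinators below, `mapInput` and `filterInput`, are hypothetical names for this sketch and not part of any particular library.

```typescript
type Iteratee<A> = {
  feed: (a: A) => void;
  complete: () => void;
};

// Hypothetical combinator: adapt an Iteratee<B> so that it accepts A values.
function mapInput<A, B>(f: (a: A) => B, k: Iteratee<B>): Iteratee<A> {
  return {
    feed: (a) => k.feed(f(a)),
    complete: () => k.complete(),
  };
}

// Hypothetical combinator: forward only the values that satisfy `keep`.
function filterInput<A>(keep: (a: A) => boolean, k: Iteratee<A>): Iteratee<A> {
  return {
    feed: (a) => {
      if (keep(a)) k.feed(a);
    },
    complete: () => k.complete(),
  };
}

// The innermost iteratee just collects the numbers it receives.
const lengths: number[] = [];
const collect: Iteratee<number> = {
  feed: (n) => {
    lengths.push(n);
  },
  complete: () => {},
};

// Skip blank lines, then pass on the length of each remaining line.
const pipeline = filterInput(
  (line: string) => line.trim() !== "",
  mapInput((line: string) => line.length, collect),
);

["alpha", "", "be", "   ", "gamma"].forEach((line) => pipeline.feed(line));
pipeline.complete();
console.log(lengths.join(",")); // 5,2,5
```

Each stage is itself an iteratee, so stages can be reordered or reused without changing the producer.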

However, in performance-critical systems:

switch to simpler constructs, such as manually managing functions in C.
consider static polymorphism (e.g., in Rust) to eliminate runtime overhead.
profile and identify hot paths before deciding whether to use iteratees.

For most general-purpose systems, iteratees and closures are acceptable and provide significant development advantages. But when squeezing out the last bit of performance, their trade-offs must be carefully evaluated.

Conclusion

The Iteratee pattern is a robust and elegant solution for processing streams of data incrementally. By abstracting the producer-consumer relationship, iteratees enable composability, error handling, and memory efficiency in data-intensive applications. Whether implemented in JavaScript, Scala, or other languages, mastering iteratees equips programmers with the tools to handle large-scale, real-time data effectively. However, like any abstraction, they come with trade-offs. Balancing these trade-offs is key to making the best decision for your specific application.

Tomorrow, Dec 7, the article will be What I learnt from building the Product Quiz, written by Frank Lu, software engineer.

And also, Wano Group is hiring! Check out our open positions at JOBS | Wano Group if you are interested.
