
Components of a Storm topology
A Storm topology consists of two basic components: a spout and one or more bolts. These building blocks are tied together using streams; it is over these streams that unbounded sequences of tuples flow.
Let's discuss the topology with a simple analogy, as depicted in the diagram and explained thereafter:

In our example topology, we have a big processing unit for roasted chips where the input, raw potatoes, is consumed by the spout, and there are various bolts such as a peeler bolt, a slicer bolt, and a roasting bolt that perform the tasks their names suggest. There are various assembly lines or workers that move the chips from the peeler unit to the slicer unit and beyond; in our case, we have streams to link and wire the spout and bolts to each other. Now, the basic unit of exchange between the peeler and slicer units is a peeled potato, and between the slicer and roasting units is a sliced potato. This is analogous to a tuple, the datum of information exchange between spouts and bolts.
Let's take a closer look at the building blocks of the Storm topology.
Spouts
A spout is the collection funnel of a topology; it feeds events or tuples into the topology. It can be considered as the input source to the Storm processing unit—the topology.
The spout reads messages from external sources such as a queue, a file, or a port, and emits them into a stream, which in turn passes them to the bolts. It is the task of the Storm spout to track each event or tuple throughout its processing across the Directed Acyclic Graph (DAG) formed by the topology. The Storm framework then generates either an acknowledgement or a failure notification based on the outcome of the execution of the tuple in the topology. This mechanism gives Storm its guaranteed-processing feature. Based on the required functionality, spouts can be programmed or configured to be reliable or unreliable; a reliable spout plays failed events back into the topology.
The following diagram depicts the same flow, graphically:

All Storm spouts are implemented to be able to emit tuples to one or more bolts over streams. As in the preceding diagram, a spout can emit tuples to both bolt A and bolt C.
Each spout should implement the IRichSpout interface. The following are the important methods to know in the context of a spout:
- nextTuple(): This is the method that keeps polling the external source for new events; for instance, the queue in the preceding example. On every poll, if the method finds an event, it is emitted into the topology through a stream, and if there is no new event, the method simply returns.
- ack(): This method is called when a tuple emitted by the spout has been successfully processed by the topology.
- fail(): This method is called when a tuple emitted by the spout is not successfully processed within the specified timeout. In this case, a reliable spout tracks each tuple using a messageId, and the failed tuples are re-emitted into the topology to be reprocessed. For instance, in the preceding figure, the failed tuple is emitted again.
For unreliable spouts, tuples are not tracked using messageIds, and methods such as ack() and fail() hold no value because the spout does not track tuples for successful processing. Such topologies are identified as unreliable.
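The following is a minimal sketch of a reliable spout, assuming the org.apache.storm packages used by Storm 1.x and later (older releases use backtype.storm). The class name, the in-memory event queue, and the pending map are hypothetical stand-ins; a real spout would typically poll an external source such as a message queue.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class LogEventSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    // Hypothetical stand-ins for an external source and for in-flight tuples.
    private ConcurrentLinkedQueue<String> eventQueue;
    private Map<String, String> pending;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.eventQueue = new ConcurrentLinkedQueue<>();
        this.pending = new ConcurrentHashMap<>();
    }

    @Override
    public void nextTuple() {
        // Poll the source; emit one event per call, or simply return if there is none.
        String event = eventQueue.poll();
        if (event == null) {
            return;
        }
        String messageId = UUID.randomUUID().toString();
        pending.put(messageId, event);
        // Emitting with a messageId makes the tuple trackable (a reliable spout).
        collector.emit(new Values(event), messageId);
    }

    @Override
    public void ack(Object messageId) {
        // The tuple was fully processed by the topology; forget the event.
        pending.remove(messageId);
    }

    @Override
    public void fail(Object messageId) {
        // Processing failed or timed out; play the event back so it is re-emitted.
        String event = pending.remove(messageId);
        if (event != null) {
            eventQueue.offer(event);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("event"));
    }
}
```

An unreliable variant of the same spout would simply emit without a messageId, in which case ack() and fail() are never invoked for its tuples.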
Bolts
Bolts are the processing units of a topology. They are the components of the topology that perform one or more of the following tasks:
- Parsing
- Transformation
- Aggregation
- Joins
- Database interaction
The entire process being performed by the topology is generally divided into smaller tasks and subtasks, each preferably performed by a different bolt to exploit the power of the parallel distributed processing of Storm.
Let's look at the following figure that captures a real-time use case where the location coordinates from various airplanes are tracked and processed to ascertain whether they are moving on the correct trajectory:

Here, the flight location coordinates are sent by sensors in the plane, which are collated at log servers and fed into a Storm topology. The Storm topology is broken into the following bolts that can act on the tuples emitted by the spout:
- The parse event bolt: This bolt filters and transforms the event emitted by the spout. It converts the information into a decipherable format.
- The location bolt: This is the bolt that extracts the location coordinates from the tuples it receives from the parse bolt and then sends them across to the next bolt.
- The verify bolt: This is the bolt that verifies the location coordinates sent by the location bolt against the predefined trajectory of the plane, and if it detects deviation, it sends a tuple to the alert bolt.
- The alert bolt: This bolt is the actor that informs the external systems, such as the air controller in our case, about the anomaly or deviation detected in the flight path.
Owing to the nature of real-time use cases such as the one depicted in the preceding figure, speed and accuracy of computation are of utmost importance, and that is what makes Storm a strong technological choice for implementing such solutions.
The total processing logic gets broken down into smaller tasks that are executed in bolts; configuring tasks and parallelism in bolts lets the engineers attain the right kind of performance for the solution.
One bolt can listen to multiple streams, and it can emit to multiple other bolts on different streams. As depicted in the figure in the Spouts section (and wired up in the code sketch after the following list):
- Bolt-A emits to Bolt-B and Bolt-C
- Bolt-D subscribes to streams from Bolt-C and Bolt-B
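Wiring of this kind is declared when the topology is built. The following is a minimal sketch using Storm's TopologyBuilder, assuming the org.apache.storm packages; the spout and bolt classes (LogEventSpout, BoltA to BoltD), the component names, and the parallelism hints are hypothetical placeholders.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;

public class WiringTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Hypothetical components; the class and component names are illustrative only.
        builder.setSpout("spout", new LogEventSpout(), 1);

        // Bolt-A listens to the spout's stream.
        builder.setBolt("bolt-a", new BoltA(), 2).shuffleGrouping("spout");

        // Bolt-B and Bolt-C both subscribe to Bolt-A, so Bolt-A emits to both of them.
        builder.setBolt("bolt-b", new BoltB(), 2).shuffleGrouping("bolt-a");
        builder.setBolt("bolt-c", new BoltC(), 2).shuffleGrouping("bolt-a");

        // Bolt-D subscribes to the streams of both Bolt-B and Bolt-C.
        builder.setBolt("bolt-d", new BoltD(), 2)
               .shuffleGrouping("bolt-b")
               .shuffleGrouping("bolt-c");

        // Run locally for testing; a production topology would use StormSubmitter instead.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("wiring-demo", new Config(), builder.createTopology());
    }
}
```

The numeric arguments to setSpout() and setBolt() are parallelism hints; tuning them (along with setNumTasks()) is how the task-level parallelism mentioned earlier is configured.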
The common interfaces provided by Storm to be implemented by user-defined bolts are as follows:
- IRichBolt
- IBasicBolt
The difference between these two interfaces lies mainly in how reliable messaging is handled: with IRichBolt, the bolt itself is responsible for acknowledging (or failing) every tuple it processes, which gives finer control for aggregations and joins, whereas IBasicBolt takes care of acknowledgement automatically and suits simple filtering and transformation operations.
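For instance, a simple filtering bolt can be written against IBasicBolt by extending Storm's BaseBasicBolt helper class, in which case the framework acknowledges each input tuple automatically once execute() returns. The following is a hedged sketch; the field name "event" is an assumption, not part of any fixed schema.

```java
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// A simple pass-through filter; BaseBasicBolt acks each input tuple automatically
// after execute() returns, so no explicit ack()/fail() calls are needed here.
public class FilterBolt extends BaseBasicBolt {

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String event = input.getStringByField("event");   // field name is illustrative
        if (event != null && !event.isEmpty()) {
            collector.emit(new Values(event));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("event"));
    }
}
```

With IRichBolt (typically via BaseRichBolt, as in the next example), the bolt instead receives an OutputCollector in prepare() and must call collector.ack() or collector.fail() on every tuple itself.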
The main methods used by the bolts are as follows:
- prepare(): This is the method that is called when the bolt is initialized. Fundamentally, a Storm topology runs forever, and a bolt, once initialized, will not terminate until the topology is killed. This method is generally used to initialize connections and read other static information that is required during the entire life cycle of the bolt.
- execute(): This is the method that performs the processing logic defined on the bolt. It is executed for every tuple.
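To make prepare() and execute() concrete, here is a minimal sketch of the verify bolt from the flight-tracking example, written against IRichBolt via the BaseRichBolt helper class and assuming the org.apache.storm packages. The TrajectoryStore lookup, the alert logic, and the field names are hypothetical simplifications.

```java
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class VerifyBolt extends BaseRichBolt {
    private OutputCollector collector;
    private TrajectoryStore trajectories;   // hypothetical lookup of planned flight paths

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        // Called once when the bolt is initialized; connections and static data
        // needed for the bolt's whole life cycle are set up here.
        this.collector = collector;
        this.trajectories = TrajectoryStore.connect();   // hypothetical helper
    }

    @Override
    public void execute(Tuple input) {
        // Called for every tuple received from the location bolt.
        String flightId = input.getStringByField("flightId");
        double latitude = input.getDoubleByField("latitude");
        double longitude = input.getDoubleByField("longitude");

        if (trajectories.isDeviation(flightId, latitude, longitude)) {
            // Forward a tuple for the alert bolt to act upon.
            collector.emit(input, new Values(flightId, latitude, longitude));
        }
        collector.ack(input);   // with IRichBolt, acking is the bolt's responsibility
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("flightId", "latitude", "longitude"));
    }
}
```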
Streams
A stream can be defined as an unbounded sequence of tuples or events. Streams are generally created and processed in a parallel and distributed manner across the topology. Streams can be called the wiring or information flow channels between the spout and the bolts; they carry unprocessed, semi-processed, and processed information to and from the various task-performing components, such as bolts and spouts. A stream is configured while defining the topology, using a schema that gives names to the fields in the stream's tuples.
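This schema is declared in a component's declareOutputFields() method; a component can also declare additional named streams so that downstream bolts can subscribe selectively. The following fragment is a hedged sketch with hypothetical field and stream names.

```java
// Inside a spout or bolt implementation: name the fields carried on the default
// stream, and declare a second, named stream that carries only alert tuples.
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("flightId", "latitude", "longitude"));
    declarer.declareStream("alerts", new Fields("flightId", "reason"));
}
```

A downstream bolt would then subscribe to that particular stream when the topology is wired, for example with shuffleGrouping("verify", "alerts").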
Tuples – the data model in Storm
A tuple is the basic, constituent data structure in Storm. It is a named list of values that starts its journey from the spout. It is then emitted over streams to bolts, and from bolts to other bolts, where various stages of processing are executed. On successful completion of all the intended processing, as defined by the topology, the tuples are acked back to the spout.
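Because a tuple is a named list of values, the producing component emits positional values that match the declared schema, while the consuming bolt reads them back by field name. A minimal fragment, reusing the hypothetical flight fields from the earlier sketches:

```java
// Producer side (in a spout or bolt): the order of values matches the declared Fields.
collector.emit(new Values("AI-101", 28.55, 77.10));

// Consumer side (in the next bolt's execute() method): values are read back by name.
String flightId  = input.getStringByField("flightId");
double latitude  = input.getDoubleByField("latitude");
double longitude = input.getDoubleByField("longitude");
```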