I recently started a personal project with Kafka. Instead of running it on my local machine, I wanted something more realistic and useful: a full experimental data pipeline on AWS. I could have used Amazon Managed Streaming for Apache Kafka (MSK), but it is not included in the AWS free tier, and it defeats the purpose of learning Kafka since it handles the cluster setup automatically. I therefore decided to install Kafka on an AWS EC2 instance. This tutorial summarises the steps required to set up a simple Kafka broker on a single EC2 node. Throughout, we assume an EC2 instance is already running (in my case, a t2.micro with the Ubuntu AMI).
Installing Java, Scala and Kafka
Apache Kafka runs on the JVM, so we start by installing Java. We will opt for Java 11, as support for Java 8 is set to be dropped in Kafka 4.0 according to a recent proposal (KIP-750).
sudo apt-get update
sudo apt-get install openjdk-11-jdk
java --version
Once the Java version command works, we can proceed with installing Scala. Kafka is mainly written in Scala, so it is important to install the right version (currently 2.13):
wget  https://www.scala-lang.org/files/archive/scala-2.13.3.tgz
tar -xvzf scala-2.13.3.tgz
# Rename folder to scala for easier navigation
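mv scala-2.13.3 scala
Now that we have Java 11 and Scala 2.13 installed, we can install Kafka by downloading the latest release (Kafka 3.5.0 at the time of writing) from the official website and extracting the files from the downloaded tar file. If the link no longer resolves, older releases are kept at https://archive.apache.org/dist/kafka/.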
wget https://downloads.apache.org/kafka/3.5.0/kafka_2.13-3.5.0.tgz
tar -xvzf kafka_2.13-3.5.0.tgz
# Rename folder to kafka for easier navigation
mv kafka_2.13-3.5.0 kafka
Configuring Kafka and Zookeeper servers
As we will be running Kafka on the server, we need to store the logs in an accessible folder. Let's create a logs directory inside the kafka folder with two subfolders: logs/kafka, which will hold the logs of the Kafka broker, and logs/zookeeper, which will hold the logs of Zookeeper (the coordination and controller service for Apache Kafka).
cd kafka
mkdir logs
mkdir logs/kafka && mkdir logs/zookeeper
We now need to edit the Zookeeper properties and the Kafka server properties in their configuration files. We will change only the logs-related configuration. For Zookeeper, open the configuration file using nano or any other text editor:
nano config/zookeeper.properties
Then locate the line containing dataDir=/tmp/zookeeper and change it to the path of the Zookeeper log folder (i.e. /usr/local/bin/kafka/logs/zookeeper, assuming the kafka folder sits in /usr/local/bin, as in the rest of this tutorial). Next, we need to do the same in config/server.properties, which configures the actual Kafka server. Open it using a text editor, locate the line starting with log.dirs and set it to /usr/local/bin/kafka/logs/kafka, the path to the Kafka logs directory.
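The same two edits can also be scripted instead of made by hand. The sketch below is only an illustration: set_property is a hypothetical helper (not part of Kafka), it assumes GNU sed as shipped with Ubuntu, and it assumes the Kafka property is named log.dirs and that the folders live under /usr/local/bin/kafka/logs. It runs against a scratch file so that it can be tried safely.

```shell
#!/bin/sh
# Hypothetical helper: set "key=value" in a Java-style .properties file.
# It rewrites the existing line for the key, or appends one if none exists.
set_property() {
    file=$1
    key=$2
    value=$3
    if grep -q "^${key}=" "$file"; then
        # "|" is used as the sed delimiter so "/" in paths needs no escaping
        sed -i "s|^${key}=.*|${key}=${value}|" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Demonstration on a scratch file, not on the real configuration
demo=$(mktemp)
printf 'dataDir=/tmp/zookeeper\nclientPort=2181\n' > "$demo"
set_property "$demo" dataDir /usr/local/bin/kafka/logs/zookeeper
cat "$demo"
```

On the real files, the calls would take config/zookeeper.properties with the dataDir key and config/server.properties with the log.dirs key, each followed by the matching folder path.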
Now that our Kafka installation is configured, we need to add the Kafka binaries to the $PATH environment variable, so that we can run Kafka CLI commands without having to navigate to /usr/local/bin/kafka/bin each time (the Scala bin folder can be added the same way). To achieve this, we can edit either the profile file or the .bashrc file. The following commands append the export line to .bashrc and source the file:
echo 'export PATH="/usr/local/bin/kafka/bin:$PATH"' >> ~/.bashrc
# Reload the file so the new PATH applies to the current shell
source ~/.bashrc
Running Kafka
Now we can start both the Zookeeper server and the Kafka broker from anywhere on our machine:
zookeeper-server-start.sh -daemon /usr/local/bin/kafka/config/zookeeper.properties
kafka-server-start.sh -daemon /usr/local/bin/kafka/config/server.properties
The -daemon flag was introduced in August 2015 and runs the Zookeeper and Kafka processes as daemons (i.e. background processes).
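One practical caveat on a t2.micro, which has 1 GiB of RAM: when KAFKA_HEAP_OPTS is unset, the stock start scripts default to a 1G heap for Kafka and 512M for Zookeeper, which may not fit on such a small instance. A minimal sketch of a workaround is to export smaller values before running the two start commands above. The sizes below are assumptions for experimentation, not recommendations.

```shell
# Assumed heap sizes for a 1 GiB instance; tune them for your workload.
# Both start scripts use KAFKA_HEAP_OPTS when it is already set.
export KAFKA_HEAP_OPTS="-Xmx256M -Xms128M"
echo "KAFKA_HEAP_OPTS=$KAFKA_HEAP_OPTS"
```

The variable only affects processes started from the same shell afterwards, so it has to be exported before the start commands, or added to .bashrc alongside the PATH line.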
Our instance now has Kafka running in the background and ready to be used to create new topics, producers and consumers.
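To check that the broker actually answers, a quick smoke test with the CLI tools bundled with Kafka could look like the sketch below. The topic name smoke-test is a made-up example, and localhost:9092 assumes the default listener from the unmodified server.properties. The guard simply skips the test when the Kafka scripts are not on the PATH.

```shell
#!/bin/sh
# Assumed values: the default plaintext listener and a throwaway topic name.
BOOTSTRAP="localhost:9092"
TOPIC="smoke-test"

if command -v kafka-topics.sh >/dev/null 2>&1; then
    # Create a single-partition topic (replication factor 1: single broker)
    kafka-topics.sh --create --if-not-exists --topic "$TOPIC" \
        --partitions 1 --replication-factor 1 \
        --bootstrap-server "$BOOTSTRAP"

    # Produce one message read from stdin
    echo "hello kafka" | kafka-console-producer.sh \
        --topic "$TOPIC" --bootstrap-server "$BOOTSTRAP"

    # Read it back; stop after one message or after 10 seconds of silence
    kafka-console-consumer.sh --topic "$TOPIC" --from-beginning \
        --max-messages 1 --timeout-ms 10000 \
        --bootstrap-server "$BOOTSTRAP"
else
    echo "kafka-topics.sh not found on PATH; skipping smoke test"
fi
```

If the consumer prints the message back, Zookeeper and the broker are both up and the topic was created successfully.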
Going further
You can go further and set up Zookeeper and Kafka as Linux services, but for the sake of simplicity, we will not cover this aspect. Manh Phan has an excellent tutorial on how to set it up.
If you want to set up a Kafka cluster with multiple instances on EC2, similar steps are required on each instance, along with some extra Kafka broker configuration and a shell script to launch the cluster. This blog post has a nice explanation on this matter.
As a final note, this tutorial is the beginning of a personal data engineering project I will document on this blog. However, when it comes to using Kafka for production systems, I would highly recommend the MSK service in AWS: while installing and maintaining a production-ready Kafka cluster on EC2 might seem tempting initially, it can lead to a multitude of headaches in the long run.