Amazon Managed Streaming for Apache Kafka (MSK) is a popular distributed event store and stream-processing platform many companies use to process data.
To ensure fault tolerance, replication, and mirroring are used, but these technologies don’t protect against data loss caused by processing errors or application failures. These cases may severely impact a company’s operation and result in unrecoverable data.
What makes streaming data difficult is the ability to persist data along the processing path (e.g., checkpoints) and the ability to rewind the processing of past data. You hope these checkpoints are never needed, but a company won’t know if they are needed until a failure happens.
A few examples where data loss may occur:
Elastio and its data protection capabilities can help with solving all the pain points mentioned above to provide Kafka business continuity and protect from data loss and downtime.
How we use Elastio to protect our customers from kafka outages and application failures
At Elastio, the events stored in Amazon MSK are crucial for our platform. Kafka directly influences the reliability and data consistency of our customer Tenants. Because of this, we need to protect these events to isolate our customers from Kafka outages and downtime.
The Elastio Tenant is built on the AWS cloud and uses Elastic Kubernetes Services (EKS), RDS, ElastiCache, and MSK.
The Tenant is built on a microservice architecture, and each of our services is bounded by the domain context. Our services communicate synchronously, with internal API calls and asynchronously, passing messages over the AWS Managed Streaming for Apache Kafka (MSK). The Tenant also synchronously communicates with every customer’s Cloud Connector by calling a Cloud Connector Lambda function and asynchronously polling and putting SQS messages from/into a Cloud Connector.
Here is a diagram of how the Tenant works:
As a result, we generate a lot of external and internal asynchronous communication that relies on Kafka. All event messages polled from the customers’ Cloud Connectors are sent to a specific Kafka topic. Event messages from Cloud Connectors can include more detailed information, such as backup metadata, and security report details. The Tenant processes, stores, and visualizes that data for our customers. Our team investigated the market to find a solution to backing up Kafka and was surprised that no product or service is available for Amazon MSK.
Introducing Elastio. The Elastio CLI offers advanced backup options. The stream backup capability became the key to solving the Kafka backup issue. We built a script around the Elastio CLI that creates a Kafka consumer with a unique consumer group id and streams a Kafka topic to the Elastio vault. The data is encrypted, deduped, and cataloged as a recovery point for future use.
The script works as follows:
Next we wanted to embed Elastio into the CloudOps workflow to protect Kafka. We include the Elastio service and cloud connector in our Tenant infrastructure. Then we wrapped the script into a Docker image and served the image to an ECS cluster to schedule regular Kafka topics’ backups.