Best practices
Message Compression
We strongly recommend using compression for your Kafka topics. Compression can result in a significant saving in data transfer costs with virtually no performance hit. To learn more about message compression in Kafka, we recommend starting with this guide.
Limitations
DEFAULTis not supported.- Individual messages are limited to 16MB (uncompressed) by default when running with the smallest (XS) replica size, and 32MB (uncompressed) with larger replicas. Messages that exceed this limit will be rejected with an error. If you have a need for larger messages, please contact support.
Delivery semantics
ClickPipes for Kafka guarantees at-least-once delivery by default, tracking ingestion progress through Kafka consumer group offsets. It also optionally supports exactly-once semantics, where each Kafka record is inserted into ClickHouse exactly once even across pod restarts, consumer rebalances, and insert failures.
To deliver exactly-once, ClickPipes records each partition's progress in its internal state store using two values:
- High-water mark — the offset up to which every record on the partition is confirmed inserted into ClickHouse. On restart, ClickPipes drops any record at or below this mark, so data that already landed is never sent again.
- Pending ranges — the offset ranges of insert blocks sent to ClickHouse but not yet confirmed. After a failure, ClickPipes replays exactly these ranges.
Each insert block covers a contiguous range of offsets and carries a deterministic deduplication token of the form topic:partition:firstOffset-lastOffset. On replay, ClickPipes reproduces the same offset range and therefore the same token, so ClickHouse rejects the duplicate. Because the token depends only on the offset range, a replay is deduplicated even when the rebuilt block isn't byte-for-byte identical.
Token deduplication is bounded by the target table's replicated_deduplication_window (the most recent 10,000 insert blocks by default) and replicated_deduplication_window_seconds (one hour by default). High-throughput pipes can churn through the block-count window quickly, so we recommend confirming and possibly increasing both of these settings on the target table to cover your worst-case replay delay. Data replayed after its token has left the window can be inserted again, so exactly-once isn't guaranteed in that case.
The main tradeoff is part size. Larger insert blocks produce fewer, larger parts in ClickHouse, which keeps merge overhead low. ClickPipes holds a partition's rows in memory while it builds a block, so the part size it can reach depends on the memory available to the pipe — when memory is tight, it builds smaller blocks and the table accumulates more parts. Giving the pipe more memory lets it build larger blocks, producing fewer parts.
A pipe works best when the number of partitions is close to the number of internal insert "workers", since each worker then handles roughly one partition and has the memory headroom to build large blocks. Both the worker count and the available memory scale with replica size and count, which you configure under Settings -> Advanced Settings -> Scaling.
Authentication
For Apache Kafka protocol data sources, ClickPipes supports SASL/PLAIN authentication with TLS encryption, as well as SASL/SCRAM-SHA-256 and SASL/SCRAM-SHA-512. Depending on the streaming source (Redpanda, MSK, etc) will enable all or a subset of these auth mechanisms based on compatibility. If you auth needs differ please give us feedback.
Warpstream Fetch Size
ClickPipes rely on the Kafka setting max.fetch_bytes to limit the size of data processed in a single ClickPipes node at any one time. In some circumstances
Warpstream doesn't respect this setting, which can cause unexpected pipe failures. We strongly recommend that the Warpstream specific setting kafkaMaxFetchPartitionBytesUncompressedOverride
to 8MB (or lower) when configuring your WarpStream agent to prevent ClickPipes failures.
IAM
ClickPipes supports the following AWS MSK authentication
- SASL/SCRAM-SHA-512 authentication
- IAM Credentials or Role-based access authentication
When using IAM authentication to connect to an MSK broker, the IAM role must have the necessary permissions. Below is an example of the required IAM policy for Apache Kafka APIs for MSK:
Configuring a trusted relationship
If you're authenticating to MSK with a IAM role ARN, you will need to add a trusted relationship between your ClickHouse Cloud instance so the role can be assumed.
Role-based access only works for ClickHouse Cloud instances deployed to AWS.
Custom Certificates
ClickPipes for Kafka supports the upload of custom certificates for Kafka brokers which use non-public server certificates. Upload of client certificates and keys is also supported for mutual TLS (mTLS) based authentication.
Performance
Batching
ClickPipes inserts data into ClickHouse in batches. This is to avoid creating too many parts in the database which can lead to performance issues in the cluster.
Batches are inserted when one of the following criteria has been met:
- The batch size has reached the maximum size (100,000 rows or 28MB per 1GB of pod memory)
- The batch has been open for a maximum amount of time (5 seconds)
Latency
Latency (defined as the time between the Kafka message being produced and the message being available in ClickHouse) will be dependent on a number of factors (i.e. broker latency, network latency, message size/format). The batching described in the section above will also impact latency. We always recommend testing your specific use case with typical loads to determine the expected latency.
ClickPipes doesn't provide any guarantees concerning latency. If you have specific low-latency requirements, please contact us.
Scaling
ClickPipes for Kafka is designed to scale horizontally and vertically. By default, we create a consumer group with one consumer. This can be configured during ClickPipe creation, or at any other point under Settings -> Advanced Settings -> Scaling.
ClickPipes provides a high-availability with an availability zone distributed architecture. This requires scaling to at least two consumers.
Regardless number of running consumers, fault tolerance is available by design. If a consumer or its underlying infrastructure fails, the ClickPipe will automatically restart the consumer and continue processing messages.
Benchmarks
Below are some informal benchmarks for ClickPipes for Kafka that can be used to get a general idea of the baseline performance. It's important to know that many factors can impact performance, including message size, data types, and data format. Your mileage may vary, and what we show here isn't a guarantee of actual performance.
Benchmark details:
- We used production ClickHouse Cloud services with enough resources to ensure that throughput wasn't bottlenecked by the insert processing on the ClickHouse side.
- The ClickHouse Cloud service, the Kafka cluster (Confluent Cloud), and the ClickPipe were all running in the same region (
us-east-2). - The ClickPipe was configured with a single L-sized replica (4 GiB of RAM and 1 vCPU).
- The sample data included nested data with a mix of
UUID,String, andIntdatatypes. Other datatypes, such asFloat,Decimal, andDateTime, may be less performant. - There was no appreciable difference in performance using compressed and uncompressed data.
| Replica Size | Message Size | Data Format | Throughput |
|---|---|---|---|
| Large (L) | 1.6kb | JSON | 63mb/s |
| Large (L) | 1.6kb | Avro | 99mb/s |