Simple Postgres to ClickHouse replication featuring MinIO

https://blog.peerdb.io/simple-postgres-to-clickhouse-replication-featuring-minio

At PeerDB, we provide a fast and cost-effective way to replicate data from Postgres to Data Warehouses such as Snowflake, BigQuery, ClickHouse, and queues like Kafka, Red Panda and Google PubSub, among others.

A few months ago, we added a ClickHouse connector for Postgres Change Data Capture (CDC). Surprisingly, this connector gained substantial traction and adoption within our community. This applies to both our fully managed service (PeerDB Cloud) and our Open Source offerings. Here is a customer story from one of our customers who uses the ClickHouse connector.

The Problem

However, there was one common piece of feedback from many of our Open Source users. The ClickHouse connector required an S3 bucket as a prerequisite, which added additional overhead for users. Non-AWS users and those without immediate access to S3 could not use the ClickHouse connector. This wasn't a problem in our fully managed offering (PeerDB Cloud), as we abstracted away the S3 bucket creation from our customers.

This blog describes how we solved this problem and made it extremely easy for our users replicating data from Postgres to ClickHouse. We used MinIO, the open source S3 alternative, to stage the intermediary Avro files as part of the Change Data Capture (CDC) from Postgres to ClickHouse.

Why does the ClickHouse connector need S3?

Under the hood, PeerDB uses the Avro format for data in transit while replicating data from Postgres to Data Warehouses. Loading Avro files through Go wasn't trivial as the clickhouse-go driver didn't support Avro ingestion. Additionally, ClickHouse has native integration for loading data from S3 and is very efficient at it, as it attempts to parallelize as much work as possible, processing files in a streaming fashion. Therefore, we chose to use S3 as an intermediary storage for Avro files before importing them into ClickHouse.

This method has proven effective, allowing users to efficiently replicate data from Postgres to ClickHouse with latencies under 30 seconds and high throughput rates.

MinIO helps make the PeerDB's ClickHouse Connector Seamless

By integrating MinIO container services into our Docker Compose files for our Open Source offering, we've enabled an in-house S3-compatible storage solution that launches seamlessly with PeerDB. PeerDB uses environment variables to manage S3 bucket credentials, allowing for easy integration. Users can set these variables to match the MinIO bucket parameters, or they can plug in their own S3 bucket details. These parameters default to the packaged MinIO bucket parameters, as a result, users no longer need to provide a separate bucket for PeerDB’s ClickHouse integration, simplifying the setup process significantly.

A huge shoutout to MinIO for building a solid product that serves as an open source alternative to S3. Integrating MinIO's Docker container within PeerDB's Docker file was a one-week project. MinIO's APIs, being fully compatible with S3, allowed for seamless integration with PeerDB and ClickHouse.

Result: Even simpler Postgres to ClickHouse replication with PeerDB.

Simplifying ClickHouse Peer Creation with Optional S3 Configuration

Integrating the MinIO Docker Container in our Open Source offering eliminates the need for users to specify S3 buckets to use our ClickHouse connector. While creating the ClickHouse Peer, adding S3 information is optional, as shown in the screenshot below.

Set Up a Postgres to ClickHouse Mirror in Under a Minute

Once the Postgres and ClickHouse Peers are created, users can create MIRRORs to replicate data from Postgres to ClickHouse within a minute. See below video:

Use the MinIO Console for complete visibility into internal staging

MinIO also comes with a sleek UI that helps you manage the internal Avro files PeerDB creates as part of the replication process.

We hope you enjoyed reading the blog. If you're a ClickHouse user and wish to replicate data from Postgres to ClickHouse using PeerDB, please check out the links below or reach out to us directly!

  1. Docs on Postgres to ClickHouse Replication.

  2. Try PeerDB Cloud for free.

  3. Visit PeerDB's GitHub repository to Get Started.

{
"by": "craigkerstiens",
"descendants": 0,
"id": 40240414,
"kids": [
40240771
],
"score": 12,
"time": 1714677845,
"title": "Simple Postgres to ClickHouse replication featuring MinIO",
"type": "story",
"url": "https://blog.peerdb.io/simple-postgres-to-clickhouse-replication-featuring-minio"
}
{
"author": "Sai Srirampur",
"date": "2024-05-02T17:55:58.808Z",
"description": "At PeerDB, we provide a fast and cost-effective way to replicate data from Postgres to Data Warehouses such as Snowflake, BigQuery, ClickHouse, and queues like Kafka, Red Panda and Google PubSub, among others.\nA few months ago, we added a ClickHouse…",
"image": "https://hashnode.com/utility/r?url=https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1714627263834%2F5b1077b4-24e0-44b1-bfd8-5f9faf731f8c.jpeg%3Fw%3D1200%26h%3D630%26fit%3Dcrop%26crop%3Dentropy%26auto%3Dcompress%2Cformat%26format%3Dwebp%26fm%3Dpng",
"logo": null,
"publisher": "PeerDB Blog",
"title": "Simple Postgres to ClickHouse replication featuring MinIO",
"url": "https://blog.peerdb.io/simple-postgres-to-clickhouse-replication-featuring-minio"
}
{
"url": "https://blog.peerdb.io/simple-postgres-to-clickhouse-replication-featuring-minio",
"title": "Simple Postgres to ClickHouse replication featuring MinIO",
"description": "At PeerDB, we provide a fast and cost-effective way to replicate data from Postgres to Data Warehouses such as Snowflake, BigQuery, ClickHouse, and queues like Kafka, Red Panda and Google PubSub, among others.\nA few months ago, we added a ClickHouse ...",
"links": [
"https://blog.peerdb.io/simple-postgres-to-clickhouse-replication-featuring-minio"
],
"image": "https://hashnode.com/utility/r?url=https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1714627263834%2F5b1077b4-24e0-44b1-bfd8-5f9faf731f8c.jpeg%3Fw%3D1200%26h%3D630%26fit%3Dcrop%26crop%3Dentropy%26auto%3Dcompress%2Cformat%26format%3Dwebp%26fm%3Dpng",
"content": "<div><p>At <a target=\"_blank\" href=\"https://www.peerdb.io/\">PeerDB</a>, we provide a fast and cost-effective way to replicate data from <a target=\"_blank\" href=\"https://www.postgresql.org/docs/\">Postgres</a> to Data Warehouses such as <a target=\"_blank\" href=\"https://www.snowflake.com/en/\">Snowflake</a>, <a target=\"_blank\" href=\"https://cloud.google.com/bigquery?utm_source=google&amp;utm_medium=cpc&amp;utm_campaign=na-US-all-en-dr-bkws-all-all-trial-e-dr-1707554&amp;utm_content=text-ad-none-any-DEV_c-CRE_665665924750-ADGP_Hybrid+%7C+BKWS+-+MIX+%7C+Txt-Data+Analytics-BigQuery-KWID_43700077225652815-kwd-47616965283&amp;utm_term=KW_bigquery-ST_bigquery&amp;gad_source=1&amp;gclid=CjwKCAjw88yxBhBWEiwA7cm6pVY6no_rLOIpdwp02v4Oa3S5UbPpKAHIFybbrJH-X3nX5PHS23brUBoCU0oQAvD_BwE&amp;gclsrc=aw.ds&amp;hl=en\">BigQuery</a>, <a target=\"_blank\" href=\"https://clickhouse.com/\">ClickHouse</a>, and queues like Kafka, Red Panda and Google PubSub, among others.</p>\n<p>A few months ago, we added a <a target=\"_blank\" href=\"https://blog.peerdb.io/postgres-to-clickhouse-real-time-replication-using-peerdb\">ClickHouse connector</a> for Postgres Change Data Capture (CDC). Surprisingly, this connector gained substantial traction and adoption within our community. This applies to both our fully managed service (PeerDB Cloud) and our Open Source offerings. Here is a <a target=\"_blank\" href=\"https://www.peerdb.io/customers/peerdb-fiber-ai-customer-story\">customer story</a> from one of our customers who uses the ClickHouse connector.</p>\n<h2 id=\"heading-the-problem\">The Problem</h2>\n<p>However, there was one common piece of feedback from many of our Open Source users. The ClickHouse connector required an S3 bucket as a prerequisite, which added additional overhead for users. Non-AWS users and those without immediate access to S3 could not use the ClickHouse connector. This wasn't a problem in our fully managed offering (<a target=\"_blank\" href=\"https://app.peerdb.cloud/\">PeerDB Cloud</a>), as we abstracted away the S3 bucket creation from our customers.</p>\n<p>This blog describes how we solved this problem and made it extremely easy for our users replicating data from Postgres to ClickHouse. We used <a target=\"_blank\" href=\"https://min.io/\">MinIO</a>, the open source S3 alternative, to stage the intermediary Avro files as part of the Change Data Capture (CDC) from Postgres to ClickHouse.</p>\n<h2 id=\"heading-why-does-the-clickhouse-connector-need-s3\">Why does the ClickHouse connector need S3?</h2>\n<p>Under the hood, PeerDB uses the <a target=\"_blank\" href=\"https://blog.peerdb.io/moving-a-billion-postgres-rows-on-a-100-budget#heading-data-in-transit\">Avro format</a> for data in transit while replicating data from Postgres to Data Warehouses. Loading Avro files through Go wasn't trivial as the clickhouse-go driver didn't support Avro ingestion. Additionally, ClickHouse has native integration for loading data from S3 and is very efficient at it, as it attempts to parallelize as much work as possible, processing files in a streaming fashion. Therefore, we chose to use S3 as an intermediary storage for Avro files before importing them into ClickHouse.</p>\n<p>This method has proven effective, allowing <a target=\"_blank\" href=\"https://www.peerdb.io/customers/peerdb-fiber-ai-customer-story\">users</a> to efficiently replicate data from Postgres to ClickHouse with latencies under 30 seconds and high throughput rates.</p>\n<h2 id=\"heading-minio-helps-make-the-peerdbs-clickhouse-connector-seamless\">MinIO helps make the PeerDB's ClickHouse Connector Seamless</h2>\n<p>By integrating MinIO container services into our <a target=\"_blank\" href=\"https://github.com/PeerDB-io/peerdb/blob/main/docker-compose.yml#L189\">Docker Compose files</a> for our Open Source offering, we've enabled an in-house S3-compatible storage solution that launches seamlessly with PeerDB. PeerDB uses <a target=\"_blank\" href=\"https://github.com/PeerDB-io/peerdb/blob/main/docker-compose.yml#L4C3-L4C54\">environment variables</a> to manage S3 bucket credentials, allowing for easy integration. Users can set these variables to match the MinIO bucket parameters, or they can plug in their own S3 bucket details. These parameters <a target=\"_blank\" href=\"https://github.com/PeerDB-io/peerdb/blob/main/docker-compose.yml#L4C3-L4C54\">default</a> to the packaged MinIO bucket parameters, as a result, users no longer need to provide a separate bucket for PeerDB’s ClickHouse integration, simplifying the setup process significantly.</p>\n<p>A huge shoutout to MinIO for building a solid product that serves as an open source alternative to S3. Integrating MinIO's Docker container within PeerDB's Docker file was a one-week project. MinIO's APIs, being fully compatible with S3, allowed for seamless integration with PeerDB and ClickHouse.</p>\n<h2 id=\"heading-result-even-simpler-postgres-to-clickhouse-replication-with-peerdb\">Result: Even simpler Postgres to ClickHouse replication with PeerDB.</h2>\n<h3 id=\"heading-simplifying-clickhouse-peer-creation-with-optional-s3-configuration\">Simplifying ClickHouse Peer Creation with Optional S3 Configuration</h3>\n<p>Integrating the MinIO Docker Container in our Open Source offering eliminates the need for users to specify S3 buckets to use our ClickHouse connector. While creating the <a target=\"_blank\" href=\"https://docs.peerdb.io/connect/postgres/rds_postgres\">ClickHouse Peer</a>, adding S3 information is optional, as shown in the screenshot below.</p>\n<p><img src=\"https://cdn.hashnode.com/res/hashnode/image/upload/v1714519914016/13d54fe7-12bc-4e9c-a1b7-f2274343f781.png?auto=compress,format&amp;format=webp\" /></p>\n<h3 id=\"heading-set-up-a-postgres-to-clickhouse-mirror-in-under-a-minute\">Set Up a Postgres to ClickHouse Mirror in Under a Minute</h3>\n<p>Once the <a target=\"_blank\" href=\"https://docs.peerdb.io/connect/postgres/rds_postgres\">Postgres</a> and <a target=\"_blank\" href=\"https://docs.peerdb.io/connect/clickhouse\">ClickHouse Peers</a> are created, users can create MIRRORs to replicate data from Postgres to ClickHouse within a minute. See below video:</p>\n<h3 id=\"heading-use-the-minio-console-for-complete-visibility-into-internal-staging\">Use the MinIO Console for complete visibility into internal staging</h3>\n<p>MinIO also comes with a sleek UI that helps you manage the internal Avro files PeerDB creates as part of the replication process.</p>\n<p>We hope you enjoyed reading the blog. If you're a ClickHouse user and wish to replicate data from Postgres to ClickHouse using PeerDB, please check out the links below or reach out to us directly!</p>\n<ol>\n<li><p><a target=\"_blank\" href=\"https://docs.peerdb.io/mirror/cdc-pg-clickhouse\"><strong>Docs on Postgres to ClickHouse Replication.</strong></a></p>\n</li>\n<li><p><a target=\"_blank\" href=\"https://app.peerdb.cloud/\"><strong>Try PeerDB Cloud for free.</strong></a></p>\n</li>\n<li><p><a target=\"_blank\" href=\"https://app.peerdb.cloud/\"><strong>Visit PeerDB's Gi</strong></a><a target=\"_blank\" href=\"https://github.com/PeerDB-io/peerdb\"><strong>tHub r</strong></a>epo<a target=\"_blank\" href=\"https://app.peerdb.cloud/\"><strong>sitory to Get Started</strong></a><a target=\"_blank\" href=\"https://github.com/PeerDB-io/peerdb\"><strong>.</strong></a></p>\n</li>\n</ol>\n</div>",
"author": "https://hashnode.com/@saisrirampur",
"favicon": "https://cdn.hashnode.com/res/hashnode/image/upload/v1689020843558/5lJP85yyq.png?auto=compress,format&format=webp&fm=png",
"source": "blog.peerdb.io",
"published": "2024-05-02t17:55:58.808z",
"ttr": 130,
"type": "article"
}