Connect Apify to ClickHouse
Apify is a web scraping and automation platform. You build, run, and scale serverless cloud programs called Actors. Actors scrape websites, crawl the web, process data, or automate workflows. Every Actor run produces structured output stored in Datasets (collections of JSON objects).
This guide shows how to load scraped or processed data into ClickHouse for analytics, monitoring, or enrichment pipelines.
Key concepts
| Apify concept | What it is |
|---|---|
| Actor | A serverless cloud program that runs on the Apify platform. Thousands of ready-made Actors are available in the Apify Store. |
| Dataset | The output of an Actor run. A table-like collection of JSON objects, retrievable as JSON, CSV, XML, or other formats via the Apify API. |
| Webhook | An event-driven HTTP call triggered when an Actor run succeeds, fails, or reaches other lifecycle events. Use webhooks to automate the Apify-to-ClickHouse pipeline. |
Setup guide
Gather your ClickHouse connection details
To connect to ClickHouse with HTTP(S) you need this information:
| Parameter(s) | Description |
|---|---|
| HOST and PORT | Typically, the port is 8443 when using TLS or 8123 when not using TLS. |
| DATABASE NAME | Out of the box, there is a database named default. Use the name of the database that you want to connect to. |
| USERNAME and PASSWORD | Out of the box, the username is default. Use the username appropriate for your use case. |
The details for your ClickHouse Cloud service are available in the ClickHouse Cloud console. Select a service and click Connect:
Choose HTTPS. Connection details are displayed in an example curl command.
If you're using self-managed ClickHouse, the connection details are set by your ClickHouse administrator.
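As a quick sanity check, you can test the connection details with a curl command similar to the one shown in the console. The host and password below are placeholders; substitute your own values:

```shell
# Replace <host> and <password> with your own connection details.
curl --user 'default:<password>' \
     --data-binary 'SELECT 1' \
     'https://<host>:8443'
```

A successful connection returns `1`.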
Apify prerequisites
You'll also need:
- An Apify account (free tier available).
- An Apify API token, found in Settings > Integrations in the Apify Console.
- Node.js 18+ installed locally (for the JavaScript examples).
Install dependencies
Install the Apify JavaScript client and the ClickHouse JavaScript client:
Apify also provides a Python client. If you prefer Python, install apify-client via pip and use clickhouse-connect for ClickHouse.
Fetch Apify dataset and load into ClickHouse
The following script fetches the results of an Apify Actor run and inserts them into ClickHouse:
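A sketch of such a script is shown below. The environment variable names and the `scraped_items` target table are assumptions for illustration; adjust them to your own setup:

```javascript
// Sketch: fetch items from an Apify dataset and insert them into ClickHouse.
// APIFY_TOKEN, CLICKHOUSE_URL, CLICKHOUSE_USER, CLICKHOUSE_PASSWORD and the
// `scraped_items` table are assumptions -- adapt them to your environment.
const { ApifyClient } = require('apify-client');
const { createClient } = require('@clickhouse/client');

const apify = new ApifyClient({ token: process.env.APIFY_TOKEN });
const clickhouse = createClient({
  url: process.env.CLICKHOUSE_URL, // e.g. https://<host>:8443
  username: process.env.CLICKHOUSE_USER ?? 'default',
  password: process.env.CLICKHOUSE_PASSWORD,
});

async function loadDataset(datasetId) {
  const pageSize = 1000;
  for (let offset = 0; ; offset += pageSize) {
    // clean: true returns only non-empty, deduplicated items
    const { items } = await apify.dataset(datasetId).listItems({
      clean: true,
      offset,
      limit: pageSize,
    });
    if (items.length === 0) break;
    // JSONEachRow maps each JSON object directly to one table row
    await clickhouse.insert({
      table: 'scraped_items',
      values: items,
      format: 'JSONEachRow',
    });
  }
}

loadDataset(process.argv[2]).then(() => clickhouse.close());
```

Run it with the dataset ID of a finished Actor run, for example `node load.js <datasetId>`.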
For large datasets, paginate through results using the limit and offset parameters of the List dataset items endpoint. You can also pass clean=true to retrieve only non-empty, deduplicated items.
Automate with webhooks
Instead of running the script manually, automate the pipeline so data loads into ClickHouse every time an Actor finishes:
- In the Apify Console, go to your Actor and open the Integrations tab.
- Add a new webhook with:
  - Event type: ACTOR.RUN.SUCCEEDED
  - Action: an HTTP POST to your loader endpoint, or trigger another Actor that handles the ClickHouse insert.
- The webhook payload includes the defaultDatasetId, which you can use to fetch the run's results.
See Apify webhook documentation for payload details and configuration options.
An alternative approach is to use Apify Schedules to run Actors on a cron-like schedule, combined with webhooks for the loading step.
Best practices
Fetching data from Apify
Use the Apify client library (apify-client for JavaScript or Python) instead of raw HTTP calls. It handles pagination, retries, and authentication for you.
Loading into ClickHouse
Use JSONEachRow format when inserting into ClickHouse. It maps directly to Apify's JSON output with no transformation needed.
Match your ClickHouse table schema to the Actor's output fields. Check the Actor's output schema on its Apify Store page or in the Dataset tab after a run.
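For example, a hypothetical product-scraping Actor that emits `url`, `title`, `price`, and `scrapedAt` fields might map to a table like this (the table name and schema are illustrative, not prescribed):

```sql
-- Hypothetical schema for an Actor whose items look like
-- {"url": "...", "title": "...", "price": 19.99, "scrapedAt": "..."}
CREATE TABLE scraped_items
(
    url       String,
    title     String,
    price     Float64,
    scrapedAt DateTime64(3)
)
ENGINE = MergeTree
ORDER BY (url, scrapedAt);
```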
Performance
For high-throughput inserts from the JavaScript client, follow the Tips for performance optimizations. Batch rows into larger inserts rather than inserting one row at a time, and consider async inserts when client-side batching isn't practical.
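Client-side batching can be as simple as grouping rows before each insert call. The `batchRows` helper below is a hypothetical name, not part of any client library:

```javascript
// Group rows into batches so each ClickHouse insert carries many rows
// instead of one. batchRows is a hypothetical helper for illustration.
function batchRows(rows, batchSize) {
  const batches = [];
  for (let i = 0; i < rows.length; i += batchSize) {
    batches.push(rows.slice(i, i + batchSize));
  }
  return batches;
}

// Each batch would then be sent as a single insert, e.g.:
// for (const batch of batchRows(items, 10000)) {
//   await clickhouse.insert({ table: 'scraped_items', values: batch, format: 'JSONEachRow' });
// }
```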
Security
The examples on this page use the default user and database to keep things simple. In production, create a dedicated user with the minimum privileges required to insert into your target table, and store credentials securely (for example, in environment variables or a secrets manager rather than committing them to source code). See Cloud access management for guidance.
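A minimal-privilege user for the loader might be created like this (the user and table names are placeholders for your own):

```sql
-- Hypothetical dedicated user that can only insert into the target table.
CREATE USER apify_loader IDENTIFIED BY 'use-a-strong-password';
GRANT INSERT ON default.scraped_items TO apify_loader;
```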