Capacity Planning
Instance specs, resource usage, indexing lag, disk sizing, and failure recovery behavior for production deployments
dbtrail runs a lightweight Go agent on AWS Graviton EC2 instances. The agent orchestrates binlog streaming, indexing, and archival by delegating heavy work to the bintrail CLI — the agent itself is a thin HTTP server that manages subprocesses. This page covers what you need to size and plan for in production.
Instance specifications
Each plan tier runs on a specific EC2 instance type. Free tier shares an instance with other tenants; paid tiers get a dedicated instance.
| | Free | Pro | Premium | Enterprise |
|---|---|---|---|---|
| Instance type | r7g.medium (shared) | t4g.medium | t4g.large | t4g.large |
| vCPU | 1 | 2 | 2 | 2 |
| RAM | 8 GB | 4 GB | 8 GB | 8 GB |
| Disk | 50 GB gp3 | 50 GB gp3 | 50 GB gp3 | 50 GB gp3 |
| InnoDB buffer pool | 2 GB | 2 GB | 4 GB | 4 GB |
| Max index DB connections | 100 | 100 | 200 | 200 |
Free tier resource sharing
Free tier runs as a Docker container on a shared EC2 instance (up to 6 containers per host). CPU and memory are not reserved — performance may vary under load. Paid tiers get a dedicated instance with guaranteed resources.
Agent resource usage
The Go agent itself is minimal. CPU and memory are dominated by the bintrail CLI subprocesses and the MySQL index database, not the agent HTTP server.
Real-world measurements
These numbers come from a production demo running on a t4g.medium (2 vCPU, 4 GB RAM) indexing a WordPress database doing ~2,500 writes/day with binlog_row_image=FULL:
| Component | Memory (RSS) |
|---|---|
| bintrail-agent (HTTP server) | 14 MB |
| bintrail stream (binlog parser) | 27 MB |
| bintrail rotate (archival daemon) | 30 MB |
| Total agent footprint | ~71 MB |
The remaining ~2 GB of used memory is the InnoDB buffer pool for the local index database. The agent processes themselves are lightweight.
CPU: 7–9% average across all hours of the day, with occasional spikes to ~50% during archive rotation (Parquet export + S3 upload). These spikes last seconds, not minutes. System load average holds steady at 0.1–0.2.
Connections to your source MySQL
The agent opens a small, fixed number of connections to your monitored database:
| Connection | Purpose | Count |
|---|---|---|
| Replication | Binlog streaming via COM_BINLOG_DUMP_GTID | 1 |
| Connection cache poller | Reads performance_schema.threads every 500ms | 2 |
| Total | | 3 |
Audit plugin optimization
If the agent detects an active audit log plugin on your source MySQL (e.g., Percona Audit Log, MySQL Enterprise Audit), the connection cache poller is automatically disabled. This reduces source connections from 3 to 1, since the audit log provides superior historical connection data.
Plan limits
| | Free | Pro | Premium | Enterprise |
|---|---|---|---|---|
| Servers | 1 | 5 | 20 | Unlimited |
| Concurrent streams | 1 | 5 | 20 | 100 |
| History retention | 7 days | 30 days | 90 days | 365 days |
| API rate limit (tenant) | 120 RPM | 600 RPM | 2,000 RPM | 10,000 RPM |
| API rate limit (per user) | 60 RPM | 200 RPM | 600 RPM | 2,000 RPM |
Rate limits apply to API calls (query, recover, status, forensics). They do not throttle the binlog stream itself, which runs continuously.
Index size and disk usage
The index database stores one row per binlog change event, including structured before/after column values, timestamps, binlog positions, and schema metadata.
How much disk does the index use?
From our production demo (WordPress, ~2,500 writes/day, binlog_row_image=FULL):
| Metric | Value |
|---|---|
| Source binlog throughput | ~40 MB/day |
| Index payload per day | 8–32 MB (varies with write volume) |
| Average event size in index | ~5.5 KB |
| Total events indexed (19 days) | 57,130 |
| Live index size (with 1-day retention) | 107 MB |
| Total disk used (index + MySQL + OS + agent) | 5.5 GB of 50 GB (12%) |
The index-to-binlog ratio depends on your row width and binlog_row_image setting. With FULL row images (which store complete before/after rows), the index is roughly 50–80% the size of the raw binlog. With MINIMAL row images, the index can be significantly smaller since only changed columns are recorded.
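The arithmetic above can be sketched as a quick estimator. This is a rough back-of-the-envelope helper, not a dbtrail tool; the default ratio of 0.65 is simply the midpoint of the 50–80% range measured for binlog_row_image=FULL, and all names here are illustrative.

```python
def estimate_index_growth(binlog_mb_per_day: float,
                          ratio: float = 0.65,
                          retention_days: int = 30) -> dict:
    """Rough index sizing from binlog throughput.

    ratio: index-to-binlog size ratio. ~0.5-0.8 for
    binlog_row_image=FULL per the measurements above;
    noticeably lower for MINIMAL row images.
    """
    daily_mb = binlog_mb_per_day * ratio
    return {
        "index_mb_per_day": daily_mb,
        "steady_state_mb": daily_mb * retention_days,
    }

# Demo server from this page: ~40 MB/day binlog, 30-day retention.
print(estimate_index_growth(40))
# {'index_mb_per_day': 26.0, 'steady_state_mb': 780.0}
```

Even at the high end of the ratio, a 40 MB/day server stays around 1 GB of steady-state index at 30-day retention, well within the 50 GB disk.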
Disk monitoring
Two independent mechanisms protect against disk exhaustion. A disk watcher polls every 3 minutes and cancels backup/dump operations if free space drops below 3 GB or 10% (whichever is stricter). Separately, the /health endpoint checks disk on every request and reports degraded when usage exceeds 95%. Monitor disk_usage_percent and disk_free_gb from your alerting system.
Sizing recommendation: start a stream, let it run for 24–48 hours, then check disk usage to extrapolate. Schema and table filtering can significantly reduce index size if you only need to track specific tables.
Retention and archiving
Retention by plan
| Plan | Live index retention | S3 archive |
|---|---|---|
| Free | 7 days (auto-enforced) | Not available |
| Pro | 30 days | Included |
| Premium | 90 days | Included |
| Enterprise | 365 days | Included |
How retention works
The rotate daemon runs on a configurable interval (default: every hour). It:
- Exports events older than the retention window to Parquet files, partitioned by date and hour
- Uploads Parquet files to S3 with checksum verification (size + SHA-256)
- Deletes local Parquet files only after S3 verification succeeds
- Purges expired rows from the live index
Archive compression
From 19 days of production data:
| Metric | Value |
|---|---|
| Parquet files produced | 450 |
| Total S3 archive size | 12.8 MB |
| Average per day | ~670 KB |
| Compression vs raw binlog | ~60:1 |
Parquet's columnar format with built-in compression makes long-term storage extremely efficient. Even with 365 days of retention on a moderately active server, S3 storage costs are typically under $1/month.
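The "under $1/month" claim is easy to verify from the numbers above. The S3 list price below is an assumption (S3 Standard, roughly $0.023/GB-month at time of writing; check your region):

```python
kb_per_day = 670          # average daily archive size measured above
days = 365                # Enterprise retention window
gb = kb_per_day * days / (1024 ** 2)             # KB -> GB
usd_per_gb_month = 0.023  # assumed S3 Standard list price
monthly_cost = gb * usd_per_gb_month
print(f"{gb:.2f} GB archived/year -> ${monthly_cost:.4f}/month")
```

A full year of archives for this workload is roughly a quarter of a gigabyte, which rounds to effectively zero on an S3 bill.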
S3 lifecycle policies
Retention applies to the local index database. Archived Parquet files in S3 follow your S3 lifecycle policies and are not subject to the rotate daemon's retention window. You can keep archives indefinitely in S3 or move them to Glacier for long-term compliance storage.
Indexing lag
Indexing lag is the delay between when a change occurs on your MySQL server and when it appears in the dbtrail index. It's the sum of binlog replication delay, event parsing time, and index write time.
How lag is exposed
The stream status endpoint returns two lag metrics:
- lag_seconds — time difference between the most recent indexed event and the current time
- lag_events — number of binlog events received but not yet written to the index
Both are visible in the dashboard's stream status panel and via the /api/v1/status API endpoint.
Checkpoint mechanism
The stream writes a checkpoint (binlog file + position, or GTID set) to the index database at a configurable interval. On restart, the stream resumes from the last checkpoint — no events are reprocessed or lost, as long as the source MySQL hasn't purged the binlog files covering the gap.
What affects lag
- Binlog event volume — high-throughput workloads (bulk INSERTs, batch UPDATEs) produce more events per second, increasing parsing time
- Row width — wide rows with large TEXT or BLOB columns take longer to parse and write to the index
- Network latency — relevant if the agent connects to the source MySQL over a network (not localhost) or through an SSH tunnel
- Index DB write performance — gp3 EBS volumes with baseline 3,000 IOPS handle typical workloads easily; extremely write-heavy servers may benefit from provisioned IOPS
Under typical OLTP workloads (hundreds to low thousands of writes per second), expect sub-second lag. High-throughput batch operations may temporarily increase lag, which recovers once the burst subsides.
Failure and recovery
What happens when the agent restarts?
The agent persists stream state to disk (a JSON state file in Docker mode, or systemd journal in systemd mode). On restart:
- The agent reads the persisted state and re-launches all previously running streams
- Each stream reads its last checkpoint from the index database
- Streaming resumes from the checkpoint position — no events are duplicated or lost
If the source MySQL has purged the binlog files that cover the gap between the last checkpoint and the current position, a fresh snapshot (full dump) is needed to re-establish a baseline. This is the same recovery model as MySQL replication.
Auto-restart behavior
If a stream process crashes, the agent restarts it automatically:
| Parameter | Value |
|---|---|
| Max restart attempts | 5 (consecutive) |
| Initial backoff | 5 seconds |
| Maximum backoff | 80 seconds (doubles each attempt) |
| Stable uptime threshold | 2 minutes |
If the stream runs for 2+ minutes after a restart, the retry counter resets — it's treated as a transient crash. After 5 consecutive failures within the stable threshold, the agent stops retrying and reports the stream as failed.
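The schedule and reset rule from the table can be expressed directly. A sketch under the documented parameters (the function names are illustrative, not the agent's internals):

```python
MAX_ATTEMPTS = 5          # consecutive failures before giving up
INITIAL_BACKOFF = 5       # seconds
MAX_BACKOFF = 80          # seconds, cap on doubling
STABLE_UPTIME = 120       # 2 minutes resets the counter

def backoff_schedule(attempts: int) -> list:
    """Delay before each restart attempt: doubles from 5s, capped at 80s."""
    return [min(INITIAL_BACKOFF * 2 ** i, MAX_BACKOFF)
            for i in range(attempts)]

def next_failure_count(consecutive_failures: int, uptime_s: float) -> int:
    """A run of 2+ minutes is treated as stable: the counter resets
    and the next crash counts as failure #1 again."""
    return 1 if uptime_s >= STABLE_UPTIME else consecutive_failures + 1

print(backoff_schedule(MAX_ATTEMPTS))  # [5, 10, 20, 40, 80]
```

So a stream that crash-loops immediately is abandoned after roughly 155 seconds of cumulative backoff, while one that crashes once a day retries indefinitely.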
Graceful shutdown
On SIGTERM, the agent stops all streams gracefully, writes final state to disk, and exits cleanly. Systemd and Docker both send SIGTERM by default on stop/restart operations.
Monitoring
Health endpoint
GET /health — lightweight, unauthenticated. Suitable for load balancer health checks or external monitoring.
```json
{
  "status": "healthy",
  "agent_version": "0.4.1",
  "mysql_index": "connected",
  "disk_usage_percent": 12,
  "disk_total_gb": 47.3,
  "disk_free_gb": 41.8,
  "uptime_seconds": 5422
}
```

Status is healthy when the index database is reachable and disk usage is below 95%; otherwise it reports degraded.
Stream status endpoint
GET /api/v1/status — returns detailed stream metrics including lag, checkpoint position, and schema coverage. Requires authentication via service token.
Recommended monitoring setup
Poll /health every 30–60 seconds from your monitoring system (Datadog, Prometheus, Nagios, etc.). Alert on:
- status != healthy
- disk_usage_percent > 80% (warning) or > 90% (critical)
- disk_free_gb < 5 GB
For stream-level monitoring, poll /api/v1/status and alert on lag_seconds exceeding your tolerance (e.g., > 60 seconds sustained).
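The thresholds above translate into a small polling check. A minimal sketch using only the standard library; the port in the URL is an assumption (use whatever address your agent binds), and the helper names are illustrative:

```python
import json
import urllib.request

def classify(health: dict) -> list:
    """Apply the alert thresholds recommended above to a /health payload."""
    alerts = []
    if health.get("status") != "healthy":
        alerts.append("agent unhealthy")
    pct = health.get("disk_usage_percent", 0)
    if pct > 90:
        alerts.append("disk critical (>90%)")
    elif pct > 80:
        alerts.append("disk warning (>80%)")
    if health.get("disk_free_gb", float("inf")) < 5:
        alerts.append("low free disk (<5 GB)")
    return alerts

def poll(url: str = "http://localhost:8080/health") -> list:
    # Host/port are assumptions; /health is unauthenticated per the docs.
    with urllib.request.urlopen(url, timeout=5) as resp:
        return classify(json.load(resp))
```

Feed the returned list into your pager or metrics pipeline; an empty list means all checks passed.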
Sizing recommendations
| Workload | Recommended plan | Notes |
|---|---|---|
| Single server, < 1 GB binlog/day | Free | Shared instance, 7-day retention |
| 1–5 servers, < 10 GB binlog/day | Pro | Dedicated t4g.medium, 30-day retention |
| 5–20 servers, < 50 GB binlog/day | Premium | Dedicated t4g.large (8 GB RAM), 90-day retention |
| 20+ servers or compliance requirements | Enterprise | Dedicated t4g.large, 365-day retention, custom limits |
These are starting points. Actual requirements depend on row width, change frequency, and how aggressively you filter schemas and tables. Start with a trial and monitor disk usage for 48 hours before committing to a plan.
Next steps
- Stream configuration — configure binlog streaming, filtering, and checkpoints
- Backup strategy — periodic full dumps complement binlog streaming
- Troubleshooting — common issues and resolution steps
- Status API — endpoint details for monitoring integration