Emergent Architecture from Constraints

With the privacy constraints defined, the next step was working out how to actually build the pipeline. What surprised me was how little design was left to do. Each constraint ruled out a category of approaches, and the constraints overlapped enough that only one viable architecture remained.

The full constraint set

The privacy constraints from the previous post eliminate most analytics approaches on their own. But they were not the only boundaries I was working within. The hosting architecture adds four more:

Static export only. No server-side rendering, no edge functions beyond URL rewriting. The site has no runtime.
No publicly invokable compute. No API Gateway, no Lambda function URLs, no endpoints that accept external requests. This eliminates real-time data collection endpoints.
AWS-native only. The infrastructure runs entirely on AWS services managed through Terraform. No third-party SaaS dependencies for analytics.
Cost-bounded. The total cost of the analytics pipeline must remain under a defined monthly budget. This eliminates services with high baseline costs.

Ruling things out

I started by working through what each constraint eliminated. The privacy constraints had already ruled out client-side tracking and cookies. That removed every approach that relies on JavaScript beacons, tracking pixels, or browser-side event collection, including Google Analytics, Plausible, Fathom, Umami, and self-hosted tools that depend on a client-side script. The hosting constraints narrowed things further.

No publicly invokable compute ruled out server-side event collection. There is no endpoint to receive data from anywhere. This left me with one data source: what the infrastructure already produces. CloudFront writes standard access logs to S3 as part of its normal operation. Those logs were the only input the pipeline could use.
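To make that input concrete, here is a minimal sketch of reading one standard log file. It assumes the file has already been decompressed into text, and it reads column names from the `#Fields:` header line rather than hard-coding positions. The function name `parse_standard_log` and the sample data are hypothetical, and the sample is trimmed to five fields where real logs carry many more.

```python
def parse_standard_log(text: str) -> list[dict[str, str]]:
    """Parse the text of one CloudFront standard log file into dicts."""
    fields: list[str] = []
    records: list[dict[str, str]] = []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            # Field names in the header are separated by spaces.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#"):
            # Skip blank lines and other header lines such as "#Version: 1.0".
            continue
        # Data rows are tab-delimited, in the order given by the header.
        records.append(dict(zip(fields, line.split("\t"))))
    return records


# Invented sample, trimmed to a few of the fields mentioned in this post.
SAMPLE = (
    "#Version: 1.0\n"
    "#Fields: date time cs-uri-stem sc-status cs(Referer)\n"
    "2024-01-15\t09:30:00\t/posts/hello/\t200\t-\n"
    "2024-01-15\t09:31:12\t/missing\t404\thttps://example.com/\n"
)

records = parse_standard_log(SAMPLE)
```

In the pipeline itself Athena does this parsing, so the sketch is only meant to show the shape of the data.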

Static export only meant the dashboard could not be server-rendered with live data. Baking the numbers into the HTML at build time was not an option either: the build runs on deploy rather than on a schedule, so the figures would go stale until the next deploy. The analytics page therefore had to be a static page that fetches pre-computed JSON client-side at page load.

AWS-native constrained the query engine. The access logs are in S3, so I needed something that reads S3 directly. Athena was the obvious fit: it queries data in place using standard SQL, charges per query based on data scanned, and requires no running infrastructure.
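As a rough illustration of the kind of query involved, the sketch below builds a page-view count over a thirty-day window. The table name `cloudfront_logs` and the column names `uri` and `date` are assumptions for illustration, not the actual schema, and `build_page_views_query` is a hypothetical helper.

```python
def build_page_views_query(table: str = "cloudfront_logs", days: int = 30) -> str:
    """Build a SQL string counting requests per path over the last `days` days."""
    # Only allow simple identifiers, since the name is interpolated into SQL.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    if days < 1:
        raise ValueError("days must be at least 1")
    lines = [
        "SELECT uri, COUNT(*) AS views",
        f"FROM {table}",
        # "date" is quoted because it is also a SQL keyword.
        f"WHERE \"date\" >= current_date - interval '{days}' day",
        "GROUP BY uri",
        "ORDER BY views DESC",
    ]
    return "\n".join(lines)


query = build_page_views_query()
```

Because Athena charges by data scanned, bounding the window in the `WHERE` clause only saves money if the table is partitioned by date. Without partitioning the filter limits the result, not the scan.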

Cost-bounded eliminated always-on services. No RDS instance, no Elasticsearch cluster, no persistent compute. Everything had to be event-driven or scheduled, with costs that scale to zero when idle.

By the time I had worked through all the constraints, there was very little left to decide.

What I ended up with

The pipeline has four components:

CloudFront delivers the site and writes standard access logs to a dedicated S3 bucket. This happens automatically with no additional code. The logs contain request metadata: timestamps, paths, status codes, referrers.
Athena queries those logs using SQL. The log files are registered as an external table in the Glue Data Catalog. Queries run against the raw tab-delimited logs, filtered to the last thirty days.
A scheduled Lambda function runs the queries daily. It is triggered by EventBridge, not by any external event. It has no public endpoint. It reads query results, applies privacy thresholds, and writes pre-aggregated JSON files to the site's S3 bucket.
CloudFront serves the JSON alongside the rest of the site. The dashboard page fetches these files client-side and renders charts. The JSON files use a fifteen-minute edge cache TTL, matching the analytics data caching tier.
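The write step at the end of the Lambda can be sketched as follows. This assumes the fifteen-minute TTL is expressed through a `Cache-Control` header on the object. It could equally be set in a CloudFront cache policy, in which case the header would be unnecessary. The bucket name, key, and helper name are hypothetical.

```python
import json


def build_put_object_args(
    bucket: str, key: str, payload: dict, max_age_seconds: int = 900
) -> dict:
    """Build keyword arguments for an S3 PutObject call for one JSON file."""
    return {
        "Bucket": bucket,
        "Key": key,
        # Compact separators keep the served file small.
        "Body": json.dumps(payload, separators=(",", ":")).encode("utf-8"),
        "ContentType": "application/json",
        # 900 seconds = fifteen minutes (assumed to be set via this header).
        "CacheControl": f"public, max-age={max_age_seconds}",
    }


args = build_put_object_args(
    bucket="example-site-bucket",      # hypothetical bucket name
    key="analytics/page-views.json",   # hypothetical object key
    payload={"/posts/hello/": 42},
)
```

Inside the Lambda, these arguments would be passed to boto3 as `boto3.client("s3").put_object(**args)`. Keeping the argument construction in a pure function makes it easy to check without touching AWS.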

No component in this pipeline accepts external input. No component runs continuously. The entire data flow is internal: from CloudFront's own logs, through a query engine, into static files, back out through CloudFront.

Batch processing as a feature

The pipeline runs once per day. This means the dashboard shows data that can be up to about a day old, and slightly more once log delivery delay and the fifteen-minute edge cache are counted. For anything requiring operational awareness or time-sensitive decisions, that lag would be a genuine problem. For a personal blog, it is acceptable, and in some respects a feature.

Daily aggregation means the Lambda runs once, the Athena queries scan a bounded dataset, and the resulting JSON files are small and cacheable. There is no streaming pipeline to monitor, no queue to back up, and no hot path to optimise.

The latency also reinforces the privacy model. The EventBridge rule fires once daily. The Lambda aggregates the full day's logs in a single pass, applies the minimum count threshold, and writes the result. By the time any metric appears on the dashboard, it represents an entire day's traffic with low-count entries already removed. There is no window in which an individual visit is observable on the dashboard.
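The thresholding step is simple enough to sketch in a few lines. The threshold value of 5 is a placeholder rather than the real setting, and `aggregate_page_views` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable

MIN_COUNT = 5  # hypothetical threshold; the real value is not stated here


def aggregate_page_views(
    paths: Iterable[str], min_count: int = MIN_COUNT
) -> dict[str, int]:
    """Count requests per path and drop any path seen fewer than min_count times."""
    counts = Counter(paths)
    # most_common() orders by count, highest first.
    return {path: n for path, n in counts.most_common() if n >= min_count}


# Invented traffic: /b falls below the threshold and is removed entirely.
day = ["/a"] * 6 + ["/b"] * 2 + ["/c"] * 5
views = aggregate_page_views(day)
```

Suppressed paths are dropped rather than reported as zero, so the output does not reveal that a low-traffic page was visited at all.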

No public API

The Lambda function has no function URL, no API Gateway, and no other mechanism for external invocation. It is triggered exclusively by an EventBridge schedule rule. This was a deliberate choice.

Adding a function URL or API Gateway would mean adding rate limiting, authentication, input validation, and monitoring for an endpoint whose only job is serving a handful of small JSON files. Instead, the Lambda writes pre-aggregated JSON directly to S3 and CloudFront serves it with the same caching and DDoS protection (AWS Shield Standard) that already covers the rest of the site. The analytics data gets the same delivery path as every other static asset, with nothing extra to secure.

The next post will cover the concrete AWS implementation and cost model.