Part 3 of 3 in series
Measuring Without Surveillance

Self-Hosted Analytics on AWS

A privacy-preserving analytics pipeline on AWS, built to cost almost nothing.


When I started wiring the analytics pipeline together, the first thing I checked was the cost. The previous two posts in this series defined the constraints and narrowed the architecture to a specific set of AWS services. This post covers the concrete implementation: what each service does, how they connect, and what the pipeline actually costs to run.

The stack

The pipeline uses six AWS services, all managed through Terraform:

  • CloudFront standard access logging (already serving the site; logging is an additional configuration)
  • S3 for log storage and as the origin for pre-aggregated JSON
  • Glue Data Catalogue to define the log table schema
  • Athena to query the logs using SQL
  • Lambda to orchestrate queries and write results
  • EventBridge to trigger the Lambda on a daily schedule

Every service here is either pay-per-use or effectively free at this scale. Nothing runs continuously, and nothing needs to.

Figure: Sequence diagram showing the daily batch process from CloudFront logs to static JSON results.

Supporting services for cost control and alerting:

  • AWS Budgets with threshold alerts
  • CloudWatch alarms for error rate spikes
  • SNS for email notifications

The previous post narrowed the design to batch processing over server logs rather than client-side collection. From there, the implementation mostly became an exercise in choosing the simplest AWS components that could preserve that constraint without introducing persistent infrastructure.

CloudFront logging

CloudFront standard access logs are tab-delimited text files written to S3. Each line represents one request and includes the timestamp, HTTP method, URI path, status code, referrer, and other metadata. The logs also contain the client IP address and full user agent string, which is why the raw logs must be treated as sensitive and short-lived.

Enabling logging requires a dedicated S3 bucket with the log-delivery-write ACL and BucketOwnerPreferred ownership controls. The bucket has all public access blocked and a lifecycle policy that deletes log files after thirty days. A separate lifecycle rule deletes Athena query results after seven days.

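# Block inside the site's aws_cloudfront_distribution resource.
logging_config {
  include_cookies = false         # cookie values are not written to the logs
  bucket          = aws_s3_bucket.analytics_logs.bucket_regional_domain_name
  prefix          = "cloudfront/" # key prefix for delivered log files
}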

Cookies are excluded from the logs because the site does not set any.

The Glue table

Athena queries data through the Glue Data Catalogue. The CloudFront log format is registered as an external table using the LazySimpleSerDe with a tab field delimiter. The table includes the full standard CloudFront log schema, though the queries only reference a small subset: log_date, method, uri, status, and referrer.

The skip.header.line.count table property is set to 2 to skip the version and field-name comment lines that CloudFront includes at the top of each log file.
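To make the table definition concrete, here is a minimal TypeScript sketch of what the header skip and tab split amount to. It is an illustration only: the pipeline itself leaves this work to Athena and the SerDe. The field positions follow the CloudFront standard log format, while the `parseLogFile` and `NarrowRow` names and the truncated sample line are hypothetical.

```typescript
// 0-based positions of the columns the queries use in a standard log line.
const FIELD = { date: 0, method: 5, uri: 7, status: 8, referrer: 9 } as const;

interface NarrowRow {
  logDate: string;
  method: string;
  uri: string;
  status: number;
  referrer: string;
}

// Mirrors skip.header.line.count = 2 followed by a tab-delimited split.
// Only the narrow set of columns is copied out of each line.
function parseLogFile(contents: string, headerLines = 2): NarrowRow[] {
  return contents
    .split("\n")
    .slice(headerLines) // drop the "#Version" and "#Fields" comment lines
    .filter((line) => line.length > 0)
    .map((line) => {
      const cols = line.split("\t");
      return {
        logDate: cols[FIELD.date] ?? "",
        method: cols[FIELD.method] ?? "",
        uri: cols[FIELD.uri] ?? "",
        status: Number(cols[FIELD.status]),
        referrer: cols[FIELD.referrer] ?? "",
      };
    });
}

// Made-up sample, truncated after the user agent (real lines have more fields).
const sample = [
  "#Version: 1.0",
  "#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent)",
  ["2024-01-15", "03:12:45", "LHR62-C1", "1024", "192.0.2.10", "GET",
   "example.cloudfront.net", "/about/", "200", "-", "Mozilla/5.0"].join("\t"),
].join("\n");

const rows = parseLogFile(sample);
```

The client IP (position 4) and user agent (position 10) are present in every line but are never copied into a `NarrowRow`, which is the same "broad schema, narrow queries" idea in miniature.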

I kept the schema broad and the queries narrow. The raw logs may contain sensitive columns, but the pipeline never needs to read them.

Athena queries

Five queries run daily. Each one enforces the privacy constraints at the SQL level:

Pageviews per day counts successful GET requests for content paths, excluding static assets (/_next/*), analytics paths (/stats/*), and the favicon. This produces the daily trend line.

Top pages aggregates by URI path after stripping query strings with regexp_replace. Build artifacts are excluded at the query level: Next.js RSC payloads, static assets like images and scripts, and internal framework files never appear in the results. Trailing slashes are normalised so that /about and /about/ resolve to a single entry.

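A HAVING clause enforces a minimum count threshold, filtering low-traffic pages from the output. This is not noise reduction; it is a privacy constraint enforced at the query layer, because a row with a count of one or two describes individual visits rather than aggregate traffic.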
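A query of this shape might look like the sketch below, written as a TypeScript builder so the threshold is an explicit parameter. This is not the site's actual SQL. The table name, the `status = 200` filter, the exclusion list (borrowed from the pageviews description), and the assumption that log_date is a DATE column are all mine. The `normalisePath` helper applies the same two regular expressions in TypeScript so the normalisation can be checked locally.

```typescript
// Builds a top-pages query with the privacy threshold in a HAVING clause.
function topPagesQuery(table: string, minCount: number, windowDays: number): string {
  if (!/^[A-Za-z_][A-Za-z0-9_.]*$/.test(table)) {
    throw new Error(`unsafe table name: ${table}`);
  }
  if (!Number.isInteger(minCount) || minCount < 1) {
    throw new Error("minCount must be a positive integer");
  }
  if (!Number.isInteger(windowDays) || windowDays < 1) {
    throw new Error("windowDays must be a positive integer");
  }
  // Inner regexp_replace strips the query string; the outer one drops
  // trailing slashes while leaving the root path "/" untouched.
  return `
SELECT page, COUNT(*) AS views
FROM (
  SELECT regexp_replace(regexp_replace(uri, '\\?.*$', ''), '(.)/+$', '$1') AS page
  FROM ${table}
  WHERE method = 'GET'
    AND status = 200
    AND log_date >= date_add('day', -${windowDays}, current_date)
    AND uri NOT LIKE '/_next/%'
    AND uri NOT LIKE '/stats/%'
    AND uri <> '/favicon.ico'
) AS normalised
GROUP BY page
HAVING COUNT(*) >= ${minCount}
ORDER BY views DESC`;
}

// The same normalisation in TypeScript, for local checking.
function normalisePath(uri: string): string {
  return uri.replace(/\?.*$/, "").replace(/(.)\/+$/, "$1");
}
```

Because the threshold lives in the HAVING clause, rows below it are discarded inside Athena and never appear in the result set that the Lambda reads.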

Top referrers applies the same query-string stripping and minimum-count threshold to the referrer field, excluding self-referrals. This shows where traffic originates without exposing individual referral chains.

Status code distribution counts all requests grouped by HTTP status code. This is an operational health signal rather than a content metric.

Status codes by day breaks the same distribution down by date, producing a daily time series of status codes. This makes it possible to spot transient error spikes rather than only seeing the aggregate.

No query selects the client_ip or user_agent columns. This is enforced by the query definitions themselves, not by a post-processing filter. Athena still scans those fields, because the logs are row-oriented text, but they never appear in a result set, so they are never read into the Lambda's memory.

The Athena workgroup

Athena is configured with a dedicated workgroup that enforces the output location, publishes CloudWatch metrics, and sets a per-query scan limit of one hundred megabytes. The scan limit is a safety net: if a query attempts to read more data than expected (due to a bug or unexpected log volume), Athena cancels it before the cost becomes significant. The workgroup also ensures that query results are written to a known prefix in the logs bucket, where the lifecycle policy will clean them up.

At personal-blog scale, each daily query run scans a small amount of data. Athena charges five dollars per terabyte scanned. A month of CloudFront logs for a low-traffic blog is measured in megabytes, making the query cost negligible.
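The arithmetic behind "negligible" can be sketched in a few lines of TypeScript. It assumes the five-dollar-per-terabyte price quoted above, Athena's 10 MB minimum charge per query, binary units, and five queries a day for thirty days. The function name is hypothetical and the figures are estimates, not billing data.

```typescript
const USD_PER_TB = 5; // price quoted in the text
const BYTES_PER_MB = 1024 * 1024;
const BYTES_PER_TB = 1024 * 1024 * 1024 * 1024;
const MIN_BILLED_MB = 10; // Athena bills at least 10 MB per query

// Estimated monthly Athena charge for a fixed number of daily queries.
function monthlyAthenaCostUsd(
  queriesPerDay: number,
  days: number,
  scannedMbPerQuery: number,
): number {
  const billedMb = Math.max(scannedMbPerQuery, MIN_BILLED_MB);
  const totalBytes = queriesPerDay * days * billedMb * BYTES_PER_MB;
  return (totalBytes / BYTES_PER_TB) * USD_PER_TB;
}

// Small scans are billed at the 10 MB minimum.
const typical = monthlyAthenaCostUsd(5, 30, 5);

// Worst case under the workgroup limit: every query scans the full 100 MB.
const worstCase = monthlyAthenaCostUsd(5, 30, 100);
```

Under these assumptions the typical month stays below one cent, and even a month in which every query ran up to the scan limit would stay below ten cents.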

The Lambda function

The aggregation Lambda is a Node.js function bundled with esbuild. It runs on a daily EventBridge schedule at 03:00 UTC, has no public endpoint, and completes within its 120-second timeout.

The function runs through five steps:

  1. Run all five Athena queries in parallel.
  2. Poll each query for completion.
  3. Re-check the minimum-count thresholds on top-pages and top-referrers as defence in depth, in case a query is later modified without preserving the threshold.
  4. Wrap each result set in a JSON envelope with a timestamp and window size.
  5. Write the five JSON files to the site bucket under /stats/data/.
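Steps 1 to 3 can be sketched as follows. To keep the example self-contained, the Athena calls sit behind a hypothetical `QueryClient` interface rather than the AWS SDK. The poll interval, the retry cap, and all function names are assumptions, not the site's actual code.

```typescript
// Hypothetical minimal client; a real Lambda would wrap the AWS SDK here.
type QueryState = "QUEUED" | "RUNNING" | "SUCCEEDED" | "FAILED" | "CANCELLED";

interface QueryClient {
  start(sql: string): Promise<string>; // resolves to an execution id
  state(executionId: string): Promise<QueryState>;
}

interface NamedQuery { name: string; sql: string; }
interface CountRow { key: string; count: number; }

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => { setTimeout(resolve, ms); });

// Step 2: poll one query until it reaches a terminal state.
async function waitForQuery(
  client: QueryClient,
  executionId: string,
  pollMs: number,
  maxPolls: number,
): Promise<void> {
  for (let attempt = 0; attempt < maxPolls; attempt++) {
    const state = await client.state(executionId);
    if (state === "SUCCEEDED") return;
    if (state === "FAILED" || state === "CANCELLED") {
      throw new Error(`query ${executionId} ended in state ${state}`);
    }
    await sleep(pollMs);
  }
  throw new Error(`query ${executionId} still running after ${maxPolls} polls`);
}

// Steps 1 and 2: start every query at once, then wait for all of them.
async function runAll(
  client: QueryClient,
  queries: NamedQuery[],
  pollMs = 1000,
  maxPolls = 100,
): Promise<{ name: string; executionId: string }[]> {
  return Promise.all(
    queries.map(async ({ name, sql }) => {
      const executionId = await client.start(sql);
      await waitForQuery(client, executionId, pollMs, maxPolls);
      return { name, executionId };
    }),
  );
}

// Step 3: defence in depth, re-applying the threshold in code.
function enforceMinCount(rows: CountRow[], minCount: number): CountRow[] {
  return rows.filter((row) => row.count >= minCount);
}
```

`Promise.all` rejects as soon as any query fails, so a single failed query aborts the run before any JSON is written. Whether that or a partial update is preferable is a design choice this sketch does not settle.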

The JSON files are small, bounded, and cacheable. Each file includes a generated_at timestamp so the dashboard can display when the data was last updated.
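A possible shape for that envelope is sketched below. Only generated_at is named in the text; the `window_days` and `rows` field names and the `wrapEnvelope` helper are assumptions for illustration.

```typescript
interface Envelope<T> {
  generated_at: string; // ISO 8601 timestamp of the Lambda run
  window_days: number;  // size of the reporting window
  rows: T[];            // the aggregated result set
}

// Wraps a result set so the dashboard can show when it was produced.
function wrapEnvelope<T>(rows: T[], windowDays: number, now: Date = new Date()): Envelope<T> {
  return { generated_at: now.toISOString(), window_days: windowDays, rows };
}

const envelope = wrapEnvelope(
  [{ page: "/about", views: 12 }],
  30,
  new Date("2024-01-15T03:00:00Z"),
);
```

Passing the clock in as a parameter keeps the function deterministic, which makes it easy to test without mocking time.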

The Lambda's IAM role follows least privilege: it can start Athena queries in the analytics workgroup, read from the logs bucket, write to the Athena results prefix, and write to the stats/data/* prefix in the site bucket. It cannot read or modify any other part of the site.

Cost model

At personal-blog scale, the monthly cost breaks down as follows:

| Service             | Cost driver                                | Estimated monthly cost |
|---------------------|--------------------------------------------|------------------------|
| S3 log storage      | CloudFront access logs in S3 Standard      | < $0.10                |
| Athena              | Data scanned per query                     | < $0.01                |
| Lambda              | 30 invocations, < 10s each                 | Free tier              |
| EventBridge         | 30 scheduled events                        | Free tier              |
| Glue Data Catalogue | Table storage                              | Free tier              |
| CloudWatch alarms   | Per-alarm charge                           | ~$0.10                 |
| S3 lifecycle        | Automatic deletion                         | Free                   |

The total comes to well under one dollar per month at personal-blog scale. There is no WAF line item because the analytics data is static JSON served through the same CloudFront distribution as the rest of the site.

The interesting part is not that it is cheap. It is that every cost either scales with usage or is capped by automatic deletion. Apart from a few cents for the alarms, there is no fixed monthly baseline that needs to be justified.

Operational guardrails

Three mechanisms prevent unexpected costs or failures:

AWS Budgets is configured with a monthly threshold and sends alerts when actual spend reaches fifty, eighty, and one hundred percent of it. This catches any unexpected cost growth before it becomes a problem.
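As a worked example of how the three thresholds behave, here is a small TypeScript model. It is only an illustration of the percentages, not of how AWS Budgets evaluates notifications internally. The five-dollar budget in the usage line is a made-up figure.

```typescript
const THRESHOLDS_PERCENT = [50, 80, 100];

// Returns the alert thresholds that a given actual spend has reached.
function crossedThresholds(actualUsd: number, budgetUsd: number): number[] {
  if (budgetUsd <= 0) throw new Error("budget must be positive");
  const percent = (actualUsd / budgetUsd) * 100;
  return THRESHOLDS_PERCENT.filter((threshold) => percent >= threshold);
}

// With a hypothetical $5 budget, $4.20 of spend is 84% of the limit.
const fired = crossedThresholds(4.2, 5);
```

In this model the fifty and eighty percent alerts have fired at $4.20, leaving the one hundred percent alert as the final warning.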

CloudWatch alarms monitor the CloudFront 5xx error rate, where a spike indicates delivery problems. Alerts are sent through SNS email notifications.

S3 lifecycle policies ensure that raw logs and Athena query results are automatically deleted. There is no manual cleanup required and no risk of unbounded storage growth.

Together, these mean the pipeline does not need active monitoring. If I stopped looking at it entirely, the costs would remain bounded and the lifecycle policies would continue deleting data on schedule.

The static dashboard

The /stats page is a client-side React page that fetches the five JSON files on load. Charts are rendered with Recharts. Tables display the raw data with accessible markup. If the JSON files do not exist yet (before the first Lambda run), the page shows a graceful fallback message rather than an error.
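The fallback behaviour can be sketched as a small loader that never throws. The fetch function is injected through a hypothetical `Fetcher` type so the example stays self-contained. The names and the result shape are assumptions, not the page's actual code.

```typescript
// Minimal slice of the fetch API that the loader relies on.
interface MinimalResponse {
  ok: boolean;
  json(): Promise<unknown>;
}
type Fetcher = (url: string) => Promise<MinimalResponse>;

type LoadResult<T> = { kind: "ready"; data: T } | { kind: "unavailable" };

// Resolves to "unavailable" instead of throwing, so the page can render
// a fallback message when a JSON file is missing or unreadable.
async function loadStats<T>(fetcher: Fetcher, url: string): Promise<LoadResult<T>> {
  try {
    const response = await fetcher(url);
    if (!response.ok) return { kind: "unavailable" }; // e.g. before the first Lambda run
    return { kind: "ready", data: (await response.json()) as T };
  } catch {
    return { kind: "unavailable" }; // network failure or invalid JSON
  }
}
```

In a browser, the global `fetch` could be passed in as the `fetcher` argument. A component can then switch on `kind` and render either the charts or the fallback message.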

The page respects the site's existing patterns: theme-aware colours using CSS variables, reduced-motion support for chart animations, and the same typographic hierarchy used by all other pages. It is a static page in the Next.js build output, served through CloudFront like everything else.

Open questions

The pipeline is deliberately minimal, and several things remain unresolved:

  • Parquet conversion. The raw tab-delimited logs could be converted to Parquet via a Glue ETL job, reducing Athena scan costs by an order of magnitude. At current traffic levels this is unnecessary, but it is worth revisiting if log volume grows.
  • Rolling aggregates. The Lambda currently scans the full thirty-day window on every run. Maintaining a rolling aggregate and appending only the previous day's data would reduce query cost and execution time. I have not yet needed this, but the design would change if I did.
  • Geographic distribution. CloudFront can log a c-country field that indicates the viewer's country without requiring IP address processing. The field is not part of the legacy standard log format, so using it would mean moving to standard logging (v2) or real-time logs. It fits the existing privacy model, but whether regional traffic patterns are useful enough to justify the added complexity is an open question.
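The rolling-aggregate idea from the list above could be sketched like this for the pageviews-per-day series. It is one possible design rather than something the pipeline does today. The `DailyCounts` shape and both function names are hypothetical.

```typescript
type DailyCounts = Record<string, number>; // "YYYY-MM-DD" -> pageviews

// Earliest date still inside a window of `windowDays` ending on `day`.
function windowStart(day: string, windowDays: number): string {
  const date = new Date(`${day}T00:00:00Z`);
  date.setUTCDate(date.getUTCDate() - (windowDays - 1));
  return date.toISOString().slice(0, 10);
}

// Adds one day's count and drops days that have left the window,
// so only the newest day ever needs to be queried from the logs.
function rollForward(
  previous: DailyCounts,
  day: string,
  count: number,
  windowDays = 30,
): DailyCounts {
  const start = windowStart(day, windowDays);
  const next: DailyCounts = {};
  for (const key of Object.keys(previous)) {
    // ISO dates compare correctly as plain strings.
    if (key >= start && key < day) next[key] = previous[key] ?? 0;
  }
  next[day] = count;
  return next;
}

const rolled = rollForward(
  { "2024-01-01": 10, "2024-01-02": 12, "2024-01-03": 9 },
  "2024-01-04",
  7,
  3,
);
```

With a three-day window, the entry for 2024-01-01 drops out and the new day is appended. Thresholded outputs such as top pages would need more care, because a page's count over the window cannot be rebuilt from daily results that were already filtered.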

The goal is not to build a comprehensive analytics platform. It is to measure what matters, respect the people visiting, and keep the system simple enough to operate without dedicated attention.