The Path Not Found Is the Point

A static site with no dynamic surface makes an unusually clean observation platform. Every request for a path that never existed here is an unambiguous signal: no legitimate visitor was trying to reach it. The monitoring I built to read that signal consists of ten URI patterns applied to CloudFront access logs, with the results surfaced through the site's security dashboard.

Why static sites make clean observation platforms

Most honeypots work by pretending to be vulnerable. They might expose a fake SSH service, a decoy login page, or an application that looks exploitable. An attacker interacts with it, and the system records what they do.

A static site works differently. It has no server-side code, no forms, and no database. In this case, it is just HTML, CSS, and JavaScript served from S3 through CloudFront. There is no application backend for an attacker to probe.

That makes it much thinner than a traditional honeypot. The useful signal comes from requests that should not be happening at all.

CloudFront access logs already capture each request, including the path, status code, timestamp, edge location, and related metadata. The analysis happens in the logging pipeline, not in the site itself.
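To make the raw material concrete, here is a minimal sketch of reading a CloudFront standard log in Python. It relies on the tab-separated layout with a `#Fields:` header line that standard logs use. The sample is truncated to nine fields for readability (real logs carry many more), and the host name, function name, and IP addresses are placeholders rather than values from this site.

```python
def parse_cloudfront_log(text):
    """Yield one dict per request, keyed by the names in the #Fields header."""
    fields = []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            # The header names the columns in order, separated by spaces.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#"):
            continue  # skip blank lines and other header lines such as #Version
        yield dict(zip(fields, line.split("\t")))


# Illustrative sample only: truncated field list, placeholder host and IPs.
SAMPLE_LOG = "\n".join([
    "#Version: 1.0",
    "#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status",
    "\t".join(["2024-01-01", "00:00:01", "LHR50-C1", "512", "203.0.113.7",
               "GET", "example.cloudfront.net", "/.env", "403"]),
    "\t".join(["2024-01-01", "00:00:02", "LHR50-C1", "2048", "192.0.2.1",
               "GET", "example.cloudfront.net", "/index.html", "200"]),
])

records = list(parse_cloudfront_log(SAMPLE_LOG))
for record in records:
    print(record["cs-uri-stem"], record["sc-status"], record["x-edge-location"])
```

Reading the column names from the header, rather than hard-coding positions, keeps the sketch independent of exactly which fields a given log version includes.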

The probe filter

I kept the filter short. Ten patterns in total:

Exact paths: /.env, /admin/, /wp-admin/, /wp-login.php, /api/, /config/
Wildcard patterns: /wp-%, %xmlrpc%, %/admin.php%, %/sc.php%

I deliberately aligned the six exact paths with this site's robots.txt Disallow rules. For legitimate crawlers, the robots.txt entries are a standard exclusion directive. For scanners, they look more like a list of targets. One deception pattern is to seed robots.txt with paths that only a scanner would follow. If a request arrives for a disallowed path that has never existed, the path was either read from robots.txt by a bot that does not respect exclusion directives, or taken from a common scanning wordlist. Either way, the request did not come from a legitimate visitor.

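The four wildcard patterns, where % matches any run of characters, cover common scanning targets: any path beginning with /wp-, and any path containing xmlrpc, /admin.php, or /sc.php. A request matching any of these ten patterns is classified as a "known probe."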
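The ten patterns are easy to express outside the log query as well. The sketch below is one way to do it in Python, assuming the % in the wildcard patterns behaves like a SQL LIKE wildcard and that matching is case-sensitive. The function names are illustrative, not taken from the actual pipeline, and the LIKE single-character wildcard `_` is ignored because none of the patterns use it.

```python
import re

EXACT_PATHS = {"/.env", "/admin/", "/wp-admin/", "/wp-login.php", "/api/", "/config/"}
WILDCARD_PATTERNS = ["/wp-%", "%xmlrpc%", "%/admin.php%", "%/sc.php%"]


def like_to_regex(pattern):
    """Translate a LIKE-style pattern (only % supported) into a compiled regex."""
    literal_parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile(".*".join(literal_parts), re.DOTALL)


WILDCARD_REGEXES = [like_to_regex(p) for p in WILDCARD_PATTERNS]


def is_known_probe(path):
    """Return True if the path matches any of the ten curated patterns."""
    if path in EXACT_PATHS:
        return True
    # fullmatch mirrors LIKE semantics: the pattern must cover the whole path.
    return any(rx.fullmatch(path) for rx in WILDCARD_REGEXES)


for candidate in ["/.env", "/wp-content/plugins/x.php", "/blog/xmlrpc.php", "/index.html"]:
    print(candidate, is_known_probe(candidate))
```

Note that the exact paths are matched literally, so /admin (without the trailing slash) is not a known probe under this sketch even though /admin/ is.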

Two tiers of observation

Novel probe detection adds a second tier to the model. Novel probes are requests that receive a 4xx response on paths outside the curated filter. These could be scanning campaigns the filter has not caught yet, or something else entirely.

For novel probes, the pipeline collects only aggregate data: path, hit count, and unique IP count. It does not select client_ip or user_agent for individual requests. The detection is broader, but the observability is shallower.
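The aggregate-only idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the pipeline's actual query: requests are taken to be (path, status, client IP) tuples, the known-probe check is passed in as a predicate, and the function name and sample data are hypothetical. Client IPs are used only to count distinct addresses per path and never appear in the output.

```python
from collections import defaultdict


def summarise_novel_probes(requests, is_known_probe):
    """Return (path, hit_count, unique_ip_count) rows for 4xx requests
    on paths that the curated filter does not recognise."""
    hits = defaultdict(int)
    ips = defaultdict(set)
    for path, status, client_ip in requests:
        if 400 <= status < 500 and not is_known_probe(path):
            hits[path] += 1
            ips[path].add(client_ip)  # held only to count distinct addresses
    rows = [(path, hits[path], len(ips[path])) for path in hits]
    # Most-requested paths first; ties broken alphabetically for stable output.
    return sorted(rows, key=lambda row: (-row[1], row[0]))


# Placeholder data using documentation IP ranges.
sample_requests = [
    ("/phpinfo.php", 403, "203.0.113.7"),
    ("/phpinfo.php", 403, "203.0.113.7"),
    ("/phpinfo.php", 404, "198.51.100.2"),
    ("/.git/config", 403, "203.0.113.7"),
    ("/.env", 403, "203.0.113.9"),      # known probe, excluded from this tier
    ("/index.html", 200, "192.0.2.1"),  # successful request, excluded
]

summary = summarise_novel_probes(sample_requests, lambda path: path == "/.env")
for row in summary:
    print(row)
```

The output rows carry no per-request detail, which is the point of the second tier: enough to spot a new scanning pattern, not enough to profile an individual client.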

This creates a deliberate two-tier structure:

	Known probes	Novel probes
Detection	Curated filter match	4xx on unrecognised path
Confidence	High (honeypot paths)	Lower (wider net)
Observability	IP, user agent, timing, edge location	Aggregate counts only

The tighter the confidence that a request is automated scanning, the more individual-level data the pipeline collects. The wider the net, the less it collects.

The first post in this series showed that a significant fraction of probe traffic falls outside the curated filter. This is expected. The novel tier exists to surface what the filter misses, at a lower level of observability. Together, the two tiers give a view of automated scanning that excludes legitimate visitor traffic.

What this shows

The observation system is a simple two-tier filter built on the CloudFront access logs the site was already producing. That is it.

The useful part is not the filter itself. It is the shape of the environment around it. Because the site has no dynamic application surface, many requests that would be ambiguous elsewhere become clear signals here.

A single snapshot only shows what the filter caught on one pass. Over time, the same setup can show which probes persist, which disappear, and where commodity scanning moves next.