Knock Knock, Who's There?

There are no login pages here, no forms, no databases, no server-side code, and no commercial value.

None of that matters to the traffic that arrives uninvited.

When I built a security dashboard to surface probe data from my CloudFront access logs, I expected some background noise. I did not expect the volume, the variety, or the specificity of the requests. Over a thirty-day window, the site receives more than twenty thousand requests for paths that have never existed here and never will.

The numbers

The site's analytics pipeline runs daily, querying CloudFront access logs through Athena. Security queries are scoped to two categories: known probes (requests matching a curated list of honeypot paths and common scanning targets) and novel probes (requests returning 4xx errors on paths that were never part of this site, and that fall outside the curated list).
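The two categories can be illustrated with a short Python sketch. This is not the site's actual Athena query; the names CURATED_PROBE_PATHS, SITE_PATHS, and classify_request are hypothetical, and the sample entries are placeholders chosen from paths mentioned in this post.

```python
from typing import Optional

# Hypothetical curated list: a few of the scanning targets named in this post.
CURATED_PROBE_PATHS = {"/xmlrpc.php", "/wp-login.php", "/.env"}

# Hypothetical stand-in for the set of paths this site has actually served.
SITE_PATHS = {"/", "/index.html", "/about/"}


def classify_request(path: str, status: int) -> Optional[str]:
    """Return "known", "novel", or None for a single logged request."""
    if path in CURATED_PROBE_PATHS:
        return "known"
    # Novel: a 4xx response on a path that was never part of the site
    # and that is not on the curated list.
    if 400 <= status <= 499 and path not in SITE_PATHS:
        return "novel"
    return None
```

Under these assumptions, a request for /alfa.php that returns 404 is classified as novel, while a 404 on a path the site once served is left out of both categories.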

Known probes account for roughly sixty percent of the total. Novel probes account for the rest. That means nearly forty percent of probe traffic is hitting paths outside the detection list I began with.

What the known probes look for

The curated probe list covers common scanning targets. The top paths, ranked by volume over a recent thirty-day window, are not the ones I would have guessed:

/wp-content/plugins/hellopress/wp_filemanager.php (a file manager web shell planted inside a WordPress plugin directory)
/admin.php and variants dropped into /wp-content/themes/, /wp-content/uploads/, and /wp-admin/css/ (admin shells disguised inside WordPress directory structures)
/xmlrpc.php (WordPress remote procedure call interface)
/wp-login.php (WordPress login page)
/wp-content/uploads/ and /wp-includes/ (WordPress directory enumeration)
/.env (environment variable files, typically containing database credentials or API keys)

The expected probes are present: xmlrpc.php, wp-login.php, .env. But the highest-volume paths are after something more specific than a vulnerable installation: web shells planted inside WordPress plugin and theme directories by previous attackers, the leftovers of someone else's break-in. The scanners are asking whether this server has already been backdoored and might have a usable shell waiting.

This site has never run WordPress or PHP. It is a static export served from S3. The scanners do not know this and do not care.

What the novel probes reveal

The more interesting traffic is in the novel probes: requests that return 4xx errors for paths that were never part of this site and that fall outside the curated list. A mistyped URL or a stale inbound link lands in the same bucket, so this tier carries less certainty than the curated one. The sample below suggests very little of it is lost readers.

A sample from a recent thirty-day window:

/alfa.php (a widely distributed PHP web shell known as Alfa Shell)
/file.php, /222.php, /1.php, /66.php (generic filenames commonly used when planting backdoors on compromised servers)
/classwithtostring.php, /gifclass.php (filenames associated with specific PHP exploitation payloads)
/about.php, /bolt.php, /chosen.php (innocuous names frequently used to disguise web shells in a site's directory listing)

None of these target a particular technology stack or a known vulnerability. They test whether someone else has already broken in and left a shell behind. Several of the filenames belong to common web shell toolkits, and the scanners cycle through them methodically.

Who is sending this

The probe sources table in the security dashboard shows the IP addresses and user agent strings of the top scanners. These requests are treated as security events, so their source IP and user agent are collected for analysis. The filter that selects them is scoped to probe paths, so an ordinary reader is never picked up this way.
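The scoping described above can be sketched in Python as a filter that decides, per request, whether anything is retained at all. The function names, the record shape, and the matching rule in is_probe_path are all assumptions made for illustration; the real detection list is not reproduced here.

```python
from typing import Callable, Dict, Optional


def is_probe_path(path: str) -> bool:
    """Hypothetical probe filter: this site serves no PHP and no .env file."""
    return path.endswith(".php") or path == "/.env"


def collect(
    record: Dict[str, str],
    is_probe: Callable[[str], bool] = is_probe_path,
) -> Optional[Dict[str, str]]:
    """Keep source IP and user agent only for requests that match the filter."""
    if not is_probe(record["path"]):
        return None  # ordinary reader: nothing is collected
    return {
        "path": record["path"],
        "ip": record["ip"],
        "user_agent": record["user_agent"],
    }
```

Because the same predicate decides both what counts as a probe and what is stored, widening the detection list in a design like this also widens what is collected.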

Several patterns appear consistently:

Empty user agent strings dominate. The single largest fingerprint category is an empty user agent, accounting for tens of thousands of hits from a relatively small number of IP addresses.

A single fabricated Chrome string appears across most of the top probe sources. The same user agent claims to be Chrome 120 on Windows 10 and appears verbatim across dozens of different IPs. Chrome 120 was current in late 2023. On a site whose legitimate traffic is largely Safari and current-version Chrome, a two-year-old browser string arriving repeatedly from cloud hosting IPs is a strong indicator of automation rather than ordinary readership.

Honest automation tooling is a minority. curl and python-requests appear, but the bulk of scanning traffic uses either no user agent or a fabricated browser string.

The unclassified middle is large. A significant share of traffic matches neither a known bot signature nor a known scanner pattern.
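The patterns above can be expressed as a rough bucketing function. This is a minimal sketch, not the dashboard's actual logic: the bucket names, the STALE_CHROME_MAJOR threshold, and the list of tooling prefixes are assumptions chosen to mirror the categories described here.

```python
import re

# Assumed threshold: Chrome major versions at or below this count as stale.
STALE_CHROME_MAJOR = 120

# Tooling that identifies itself honestly; the post mentions these two.
HONEST_TOOL_PREFIXES = ("curl/", "python-requests/")

CHROME_VERSION = re.compile(r"Chrome/(\d+)\.")


def fingerprint(user_agent: str) -> str:
    """Bucket a user agent string into one of the patterns described above."""
    ua = user_agent.strip()
    # CloudFront standard logs record a hyphen when a field has no value.
    if ua in ("", "-"):
        return "empty"
    if ua.startswith(HONEST_TOOL_PREFIXES):
        return "honest-tooling"
    match = CHROME_VERSION.search(ua)
    if match and int(match.group(1)) <= STALE_CHROME_MAJOR:
        return "stale-chrome"
    return "unclassified"
```

A stale version number alone does not prove a string is fabricated. The signal in the data comes from the same string arriving verbatim from dozens of cloud hosting IPs, so a bucket like "stale-chrome" is only a starting point for that comparison.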

Edge locations

The security dashboard shows the geographic distribution of probe traffic by CloudFront edge location. Probe traffic arrives through edges on multiple continents, with a long tail of lower-volume locations. But the distribution is not even. One edge location dominates the chart by a wide margin.

The geography is suggestive but imprecise. The CloudFront edge location records where the request entered the CDN; the scanner itself could be anywhere. A scanner running on a cloud instance in Virginia might have its traffic routed through an edge in Frankfurt depending on network conditions and CDN routing. The distribution says "this is not one machine in one place," but why one edge handles so much more traffic than the rest is a question the chart alone cannot answer.
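For readers who want to reproduce this kind of chart from their own logs, the per-edge share can be computed with a few lines of Python. The sketch assumes the input is a list of values from the x-edge-location field of CloudFront standard logs, whose leading three letters usually correspond to a nearby airport code; the function name edge_shares is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def edge_shares(edge_locations: Iterable[str]) -> List[Tuple[str, float]]:
    """Return (edge prefix, share of requests) pairs, largest share first."""
    # Assumption: values look like CloudFront's x-edge-location field and
    # the first three characters identify the edge's location.
    counts = Counter(loc[:3] for loc in edge_locations)
    total = sum(counts.values())
    return [(edge, n / total) for edge, n in counts.most_common()]
```

The result describes where requests entered the CDN, so a dominant prefix is a prompt for further investigation and not a statement about where the scanners are.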

What this does not show

The data shows that a public hostname attracts unsolicited scanning traffic regardless of what it hosts. The traffic is predominantly commodity scanning: automated, broad, and only loosely matched to what this site actually runs. The same probes that hit this static site hit millions of other hostnames.

It does not demonstrate targeting. Nothing in this data suggests that anyone has identified this site as valuable or interesting. The probes are generic. The paths are well-known scanning targets and common web shell filenames. The volume is consistent with background internet noise.

It does not, by itself, identify who is scanning. An IP address places a request on a network; physical location and organisational affiliation take further work. Cloud hosting IPs reveal that someone provisioned a virtual machine, but not much more without further investigation.

And it does not generalise beyond this site. The patterns here are consistent with what others have reported from internet-facing infrastructure, but one low-traffic blog is not a representative sample.

What comes next

The data raises two questions. The first is visible in the edge location chart: one CloudFront edge dominates the probe traffic by a wide margin. The next post follows that thread.

The second is structural: how was this data collected, and what choices shaped what is visible? The detection list is a filter. It determines what counts as a "known probe" and what falls into the "novel" category. That filter is also a privacy boundary: it controls which requests have their IP address and user agent recorded. A later post in this series examines how a static site can function as a passive observation platform, and why the design of the detection mechanism is also the design of the collection boundary.