Content as Data, Not Markup

Markdown is often viewed as a way to write for the web without touching HTML. That is true, but I'd like to expand on the idea and on how it is used in this blog and in many other static sites. It might seem like I'm overthinking it, but in this blog content is not just text; it is a structured data source that can be queried, validated, and transformed.

By treating content as data, we move away from "magic" folders and ad-hoc logic, replacing them with a formal contract between the author and the build system.

The anatomy of a post

Each entry in this blog is an MDX file. MDX allows the use of React components within Markdown, but the most important feature for this architecture is the frontmatter: the YAML block at the top of the file.

In a typical setup, frontmatter is used for titles and dates. Here, it is treated as an API schema. It includes:

Strict typing: Every field has a defined type and validation rule.
Relationships: Series and series order are explicitly defined.
Metadata: Tags and summaries are first-class fields, not optional extras.
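
As a rough illustration of what this contract could look like once the YAML has been parsed, the sketch below declares a frontmatter type and one sample value. The field names follow the fields described in this post; the `PostFrontmatter` name, the date format, and all sample values are assumptions made up for the example.

```typescript
// Hypothetical shape of a post's frontmatter after parsing the YAML block.
interface PostFrontmatter {
  title: string;
  date: string; // assumed to look like "2024-01-15"
  summary: string;
  tags: string[];
  series?: string; // optional: only posts that belong to a series set it
  series_order?: number; // position within the series
}

// Illustrative values only, not taken from a real post.
const example: PostFrontmatter = {
  title: "An example post",
  date: "2024-01-15",
  summary: "A one-sentence description used in listings and search results.",
  tags: ["architecture", "mdx"],
  series: "example-series",
  series_order: 2,
};
```

A type like this documents the contract for the author, but it only exists at compile time. Checking the actual files still needs a runtime schema.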

Validation as a hard gate

A potential problem with many static blogs is "silent rot." You rename a tag in one post but forget the others, or you accidentally use an inconsistent date format. These issues often do not break the build; they just degrade the site's quality or searchability.

To prevent this, I use Zod to define a schema for blog posts. Before the site is built, a validation script parses every MDX file and ensures it matches the contract.

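import { z } from "zod";

// Contract that every post's frontmatter must satisfy.
const postSchema = z.object({
  title: z.string().min(1),
  // Only checked to be a string here; enforcing a single date format would need a stricter rule.
  date: z.string(),
  summary: z.string().min(1).max(300),
  tags: z.array(z.string()).min(1), // at least one tag per post
  series: z.string().optional(),
  series_order: z.number().optional(),
}).refine(data => {
  // A post that belongs to a series must also declare its position in it.
  if (data.series && data.series_order === undefined) return false;
  return true;
}, {
  message: "series_order is required when series is present"
});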

If a post fails validation, perhaps because a summary is too long or a series order is missing, the CI/CD pipeline fails. The system refuses to publish a broken state.
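
The gate itself can be sketched without any particular library. The example below is a minimal sketch, not the actual build script: `ParsedPost`, `Validator`, `collectFailures`, `exitCodeFor`, and the hand-written `checkSummary` rule are all hypothetical names, with `checkSummary` standing in for the real schema check.

```typescript
// A parsed post: where it came from and its raw frontmatter.
type ParsedPost = { path: string; frontmatter: Record<string, unknown> };

// A validator returns a list of problems; an empty list means the post is valid.
type Validator = (frontmatter: Record<string, unknown>) => string[];

type Failure = { path: string; problems: string[] };

// Check every post and collect all failures instead of stopping at the first one.
function collectFailures(posts: ParsedPost[], validate: Validator): Failure[] {
  const failures: Failure[] = [];
  for (const post of posts) {
    const problems = validate(post.frontmatter);
    if (problems.length > 0) failures.push({ path: post.path, problems });
  }
  return failures;
}

// A non-zero exit code is what makes a CI job fail.
function exitCodeFor(failures: Failure[]): number {
  return failures.length === 0 ? 0 : 1;
}

// Stand-in rule for the example; a real setup would delegate to the schema.
const checkSummary: Validator = (frontmatter) => {
  const summary = frontmatter["summary"];
  if (typeof summary !== "string" || summary.length === 0) return ["summary is required"];
  if (summary.length > 300) return ["summary is longer than 300 characters"];
  return [];
};
```

A build script could print each failure and set its exit code from `exitCodeFor(...)`. Collecting every failure first means one CI run reports all broken posts rather than only the first.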

Enabling features through structure

When content is structured data, features like search, archives, and RSS feeds reduce to small transformations over the validated posts.

Search: The build process generates a JSON search index by extracting the title, summary, and tags from the validated data.
Series logic: Because the relationship between posts is part of the data schema, the system can automatically generate "Part X of Y" navigation without manual linking.
Archives: Grouping posts by month or year is a simple data transformation, not a complex filesystem traversal.
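
The three features above might be sketched as plain functions over an array of validated posts. This is an illustration under stated assumptions, not the site's actual code: the `Post` type, the `slug` field, the function names, and the ISO-style date format are all hypothetical.

```typescript
// Hypothetical post record: validated frontmatter plus a URL slug.
interface Post {
  slug: string;
  title: string;
  date: string; // assumed to be "YYYY-MM-DD"
  summary: string;
  tags: string[];
  series?: string;
  series_order?: number;
}

// Search: keep only the fields a client-side search index needs.
function buildSearchIndex(posts: Post[]) {
  return posts.map((p) => ({ slug: p.slug, title: p.title, summary: p.summary, tags: p.tags }));
}

// Series: derive "Part X of Y" from the posts that share a series name.
// X is the position after sorting by series_order, so gaps in the numbering are tolerated.
function seriesLabel(post: Post, posts: Post[]): string | undefined {
  if (post.series === undefined || post.series_order === undefined) return undefined;
  const parts = posts
    .filter((p) => p.series === post.series && p.series_order !== undefined)
    .sort((a, b) => (a.series_order as number) - (b.series_order as number));
  const position = parts.map((p) => p.slug).indexOf(post.slug);
  if (position === -1) return undefined;
  return `Part ${position + 1} of ${parts.length}`;
}

// Archives: group posts by the "YYYY-MM" prefix of their date.
function groupByMonth(posts: Post[]): Record<string, Post[]> {
  const groups: Record<string, Post[]> = {};
  for (const post of posts) {
    const month = post.date.slice(0, 7);
    const bucket = groups[month] || [];
    bucket.push(post);
    groups[month] = bucket;
  }
  return groups;
}
```

None of these functions touches the filesystem. They only see data that has already passed validation, which is why each one can stay this short.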

Decoupling from the UI

Treating content as data reinforces the separation of concerns discussed in the previous post. The content layer does not know about the UI components used to render it. It only knows that it must provide a certain set of fields.

This decoupling means that if I want to redesign the site or change how tags are displayed, I only touch the build layer. The data layer, the writing, remains stable and portable.
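
To make the separation concrete, here is a small sketch in which two presentations consume the same data. The `PostData` type and both render functions are hypothetical; the point is only that switching from one to the other requires no change to any post.

```typescript
// The content layer's only promise: these fields exist. (Trimmed, hypothetical type.)
interface PostData {
  title: string;
  tags: string[];
}

// Presentation 1: tags as an inline, hash-prefixed string.
function renderTagsInline(post: PostData): string {
  return post.tags.map((tag) => `#${tag}`).join(" ");
}

// Presentation 2: tags as an HTML list. Same input, different output.
function renderTagsAsList(post: PostData): string {
  return "<ul>" + post.tags.map((tag) => `<li>${tag}</li>`).join("") + "</ul>";
}
```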

By ensuring the data layer is robust and validated, I can focus on writing with the confidence that the system will handle the discovery and delivery of that content reliably.

In the next post, we will look at how this data-driven approach is integrated into a CI/CD pipeline, turning automation into a trust mechanism for a "simple" site.