Skip to main content

Architecture

Three loosely-coupled pieces, all sharing types and helpers through a single workspace package, @gsd/shared:

  1. Terraform provisions every AWS resource.
  2. The management app (Nest.js API + React dashboard) is a local control plane. It reads terraform.tfstate directly to discover what the infra looks like and drives AWS via SDK v3.
  3. Four Lambdas run the always-on control flow: two for Discord, one for DNS, one for the idle watchdog.

There is no persistent ECS service. Game servers only exist while a RunTask is in flight — Start triggers ecs.runTask, Stop triggers ecs.stopTask, and the Watchdog Lambda stops tasks that look idle.

Component diagrams

The system splits cleanly into three slices. Each is shown on its own rather than jammed into one overview — the cross-cluster arrows that arise when you draw all three together (Discord Lambdas talking to ECS, EventBridge talking to update-dns, the dashboard talking to everything) route through neighbouring subgraphs and produce unreadable overlap.

Game plane and operator control

The Nest.js API is the local control plane. It reads terraform.tfstate directly to discover infrastructure IDs, then drives ECS / DynamoDB / Secrets Manager / CloudWatch via SDK v3. Players reach the game either direct to the task's public IP (UDP / TCP games) or through the ALB (HTTPS games).

Game plane and operator

Serverless Discord bot

Two Lambdas and a single DynamoDB table handle every slash command. interactions is the synchronous entry point behind a Function URL — it verifies the Ed25519 signature, replies with a deferred ack within Discord's 3-second budget, then fires the async followup Lambda for anything that touches ECS.

Serverless Discord bot

Control loops (DNS + watchdog)

EventBridge drives the two "always on" Lambdas that keep DNS and idle shutdown in sync with actual task state. update-dns fires on every ECS task state change and reconciles Route 53 / ALB targets plus the pending-interaction row in DynamoDB. watchdog fires on a schedule and stops tasks whose NetworkPacketsIn has stayed below the threshold for IDLE_CHECKS consecutive intervals.

Control loops

The /server-start critical path

When a user types /server-start palworld in Discord, five AWS services and three Lambdas cooperate to return a usable palworld.yourdomain.com without ever letting the interaction time out.

/server-start sequence

After the session: either the user types /server-stop palworld (same flow but stopTask + DELETE A record), or the Watchdog Lambda notices NetworkPacketsIn < min_packets for four consecutive 15-minute windows and stops the task itself.

Invariants

These are easy to break by accident. They are spelled out in CLAUDE.md, the maintainer guide, and inline in a few Terraform files. If you change one, write the PR description as if you're explaining the new design.

  1. game_servers in terraform.tfvars is the single source of truth. Task definitions, EFS access points, log groups, security-group rules, and the GAME_NAMES env var on three Lambdas are all produced by for_each over this map. Adding or removing a game means editing exactly one place.

  2. DNS is Lambda-managed, not Terraform-managed. The Route 53 zone is a data source; individual A records are created and deleted by the update-dns Lambda in response to ECS task state changes. Adding an aws_route53_record resource would fight the Lambda.

  3. Lambdas use AWS_REGION_ (trailing underscore). The standard AWS_REGION name is reserved by the Lambda runtime and cannot be overridden. Every Lambda reads process.env.AWS_REGION_ instead.

  4. Secrets never leave AWS. The bot token and the Discord public key live in Secrets Manager. The management app can write them and getEffectiveToken() once (to register guild commands), but they are never sent to the browser — the API only returns botTokenSet / publicKeySet booleans.

  5. Per-guild command registration only. DiscordCommandRegistrar.registerForGuild PUTs to applications/{client_id}/guilds/{guild_id}/commands. Do not register global commands — they would leak to every guild the bot is invited to.

  6. Permission resolution lives in canRun() in @gsd/shared. The server and both Discord Lambdas import the same function. Do not duplicate the logic; do not reorder the checks (guild allowlist → admin → per-game).

  7. Watchdog state lives in ECS task tags. There is no DynamoDB/SSM for the idle counter — it is an idle_checks tag on each running task. Counter resets when a task stops, which is free.

See the maintainer guide for what tends to break these and what the failure modes look like.