Architecture
Three loosely-coupled pieces, all sharing types and helpers through a single
workspace package, @gsd/shared:
- Terraform provisions every AWS resource.
- The management app (Nest.js API + React dashboard) is a local control
plane. It reads
terraform.tfstatedirectly to discover what the infra looks like and drives AWS via SDK v3. - Four Lambdas run the always-on control flow: two for Discord, one for DNS, one for the idle watchdog.
There is no persistent ECS service. Game servers only exist while a
RunTask is in flight — Start triggers ecs.runTask, Stop triggers
ecs.stopTask, and the Watchdog Lambda stops tasks that look idle.
Component diagrams
The system splits cleanly into three slices. Each is shown on its own rather than jammed into one overview — the cross-cluster arrows that arise when you draw all three together (Discord Lambdas talking to ECS, EventBridge talking to update-dns, the dashboard talking to everything) route through neighbouring subgraphs and produce unreadable overlap.
Game plane and operator control
The Nest.js API is the local control plane. It reads
terraform.tfstate directly to discover infrastructure IDs, then drives
ECS / DynamoDB / Secrets Manager / CloudWatch via SDK v3. Players reach
the game either direct to the task's public IP (UDP / TCP games) or
through the ALB (HTTPS games).
Serverless Discord bot
Two Lambdas and a single DynamoDB table handle every slash command.
interactions is the synchronous entry point behind a Function URL —
it verifies the Ed25519 signature, replies with a deferred ack within
Discord's 3-second budget, then fires the async followup Lambda for
anything that touches ECS.
Control loops (DNS + watchdog)
EventBridge drives the two "always on" Lambdas that keep DNS and idle
shutdown in sync with actual task state. update-dns fires on every
ECS task state change and reconciles Route 53 / ALB targets plus the
pending-interaction row in DynamoDB. watchdog fires on a schedule and
stops tasks whose NetworkPacketsIn has stayed below the threshold for
IDLE_CHECKS consecutive intervals.
The /server-start critical path
When a user types /server-start palworld in Discord, five AWS services and
three Lambdas cooperate to return a usable palworld.yourdomain.com without
ever letting the interaction time out.
After the session: either the user types /server-stop palworld (same flow
but stopTask + DELETE A record), or the Watchdog Lambda notices
NetworkPacketsIn < min_packets for four consecutive 15-minute windows and
stops the task itself.
Invariants
These are easy to break by accident. They are spelled out in CLAUDE.md, the
maintainer guide, and inline in a few Terraform files. If you change one,
write the PR description as if you're explaining the new design.
-
game_serversinterraform.tfvarsis the single source of truth. Task definitions, EFS access points, log groups, security-group rules, and theGAME_NAMESenv var on three Lambdas are all produced byfor_eachover this map. Adding or removing a game means editing exactly one place. -
DNS is Lambda-managed, not Terraform-managed. The Route 53 zone is a data source; individual A records are created and deleted by the update-dns Lambda in response to ECS task state changes. Adding an
aws_route53_recordresource would fight the Lambda. -
Lambdas use
AWS_REGION_(trailing underscore). The standardAWS_REGIONname is reserved by the Lambda runtime and cannot be overridden. Every Lambda readsprocess.env.AWS_REGION_instead. -
Secrets never leave AWS. The bot token and the Discord public key live in Secrets Manager. The management app can write them and
getEffectiveToken()once (to register guild commands), but they are never sent to the browser — the API only returnsbotTokenSet/publicKeySetbooleans. -
Per-guild command registration only.
DiscordCommandRegistrar.registerForGuildPUTs toapplications/{client_id}/guilds/{guild_id}/commands. Do not register global commands — they would leak to every guild the bot is invited to. -
Permission resolution lives in
canRun()in@gsd/shared. The server and both Discord Lambdas import the same function. Do not duplicate the logic; do not reorder the checks (guild allowlist → admin → per-game). -
Watchdog state lives in ECS task tags. There is no DynamoDB/SSM for the idle counter — it is an
idle_checkstag on each running task. Counter resets when a task stops, which is free.
See the maintainer guide for what tends to break these and what the failure modes look like.