Glasck is an esports data platform for League of Legends — for now. The project started because a friend wanted to build it. He began on his own, and a few months later he told me about it. I liked the idea, and the database already existed. I joined along with another friend, we started working on the design together, and things took off from there.

There are four of us on the project. Three developers — myself included — and a fourth person handling project management, marketing, and legal. On the dev side, I handle the infrastructure. I also work on design with the team and manage communication on LinkedIn. Big-picture decisions about the project's direction, we make those together.

My Approach

From the start, I set a few principles for myself.

User first. Every technical decision should serve the user, not the developer's ego. ROI and tradeoffs: our time is limited, and every choice has a cost. If a solution that's 80% as good takes a third of the effort, we go with it.

Everything I build has to be understandable by the rest of the team. No black boxes that only I can maintain. I refuse to be a SPOF — a Single Point of Failure, the one person whose absence blocks everything. If I'm unavailable, the project shouldn't stop.

The Stack — and Why These Choices

Docker Compose in production on a single VPS. No cluster, no Kubernetes. We're three devs with limited time — complexity has to be proportional to the team. With Taskfile — a command-line task automation tool — it's deployable and upgradable in a single command. Everyone on the team uses it.
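To make the "single command" concrete, here is a minimal Taskfile sketch — the task names and commands are illustrative assumptions, not the project's actual file:

```yaml
# Taskfile.yml sketch (hypothetical) — one command to deploy or upgrade
version: "3"

tasks:
  deploy:
    desc: Pull the latest images and restart the stack
    cmds:
      - docker compose pull          # fetch updated images
      - docker compose up -d --remove-orphans  # recreate changed services
```

With this in place, `task deploy` is all any teammate needs to know.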

Traefik as a reverse proxy instead of Nginx. When a new Docker service starts, Traefik discovers it automatically through labels in the Compose file. HTTPS certificates are generated automatically via Let's Encrypt, and the dashboard is built in. In practice, exposing a new service is three lines of config, not an Nginx file to write and reload.

# Docker-compose excerpt — Traefik labels on frontend
labels:
  - "traefik.enable=true"
  - "traefik.http.routers.frontend.rule=Host(`glasck.gg`)"
  - "traefik.http.routers.frontend.tls.certresolver=letsencrypt"
  - "traefik.http.services.frontend.loadbalancer.server.port=3000"

Rate limiting is set at 50 req/s on the frontend and 100 req/s on the backend, with sticky sessions for user session management.
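Both features are configured the same way as routing, via Docker labels. A sketch of what that looks like for the frontend — the middleware and service names are illustrative:

```yaml
# Compose labels sketch (hypothetical names) — rate limit + sticky sessions
labels:
  # Allow 50 req/s on average, with a short burst allowance
  - "traefik.http.middlewares.front-ratelimit.ratelimit.average=50"
  - "traefik.http.middlewares.front-ratelimit.ratelimit.burst=100"
  - "traefik.http.routers.frontend.middlewares=front-ratelimit"
  # Sticky sessions: a cookie pins each client to the same instance
  - "traefik.http.services.frontend.loadbalancer.sticky.cookie=true"
```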

PostgreSQL with PgBouncer for connection pooling — two separate databases, one for data, one for authentication. Redis for caching with an LRU policy and destructive commands disabled in production. This was the project's initial stack, already in place before I joined. I adapted and integrated it into the Docker infrastructure. I don't manage the databases yet — that's an area I'm working on.
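For the Redis side, the hardening described above can be sketched in a few `redis.conf` directives — the memory limit here is a placeholder, not our actual value:

```conf
# redis.conf sketch — LRU eviction + destructive commands disabled
maxmemory 512mb                  # placeholder limit; tune to the VPS
maxmemory-policy allkeys-lru     # evict least-recently-used keys when full
rename-command FLUSHALL ""       # disable wiping all databases
rename-command FLUSHDB ""        # disable wiping the current database
rename-command CONFIG ""         # block runtime reconfiguration
```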

The Docker network is segmented into three isolated subnets:

glasck-public-facing    → exposed services (Traefik, front)
glasck-internal-backend → backend + Redis, isolated from the internet
glasck-monitoring       → separate monitoring stack

A compromised service on the public network can't reach the backend or monitoring. It's basic segmentation, but it limits the attack surface.
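In Compose terms, this segmentation is just network declarations plus per-service attachments. A simplified sketch — which service joins which network is illustrative here, not an exact copy of our file:

```yaml
# docker-compose.yml sketch — three isolated networks
networks:
  glasck-public-facing:
  glasck-internal-backend:
    internal: true    # no route to the internet
  glasck-monitoring:
    internal: true

services:
  traefik:
    networks: [glasck-public-facing]
  backend:
    networks: [glasck-internal-backend]
  redis:
    networks: [glasck-internal-backend]
```

A container only sees the networks it is explicitly attached to, which is what keeps a compromised public service away from the backend.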

Docker Swarm runs in staging, but the VPS is limited — faithfully replicating the production stack for reliable staging is still a work in progress. Kubernetes, maybe later, if traffic and team size justify it. For now, no automatic rolling updates — there's downtime during deployments, but it's short and acceptable for our stage.

20 Minutes After Going Live, the Site Is Down

Twenty minutes after the site went live, the server crashed. Bots crawling the site in droves, no rate limiting in place, no visibility into what was happening. The VPS buckled under the load. I restricted access through Traefik configuration — that solved the problem. But without metrics, we were reacting instead of anticipating.

The first OOM came a bit later. I didn't even see it in the logs. I found out by opening the site — blank page. The container had crashed silently. I had to dig around manually to figure out what had happened, with no view of memory usage and no history. How long was the site down before I noticed? No idea.

Before monitoring, when something broke, I'd go straight to the VPS. SSH, docker logs, container by container. No dashboard, no alerts. You discover problems instead of anticipating them.

It was after these episodes that I built the monitoring stack. I should have started there.

Monitoring — Eyes on Production

The starting point is the four Golden Signals: latency, traffic, errors, saturation. If you watch these four metrics, you catch the majority of problems before they become critical. It's what you look at first, every day.

But we didn't stop there. We monitor every component — Redis metrics (hit rate, memory, connections), Traefik metrics (requests per second, HTTP codes, latency per route), PostgreSQL metrics (active connections, slow queries), state of every Docker container. The Golden Signals give you the big picture. The rest is for digging deeper when something's off.

The architecture:

Services → Exporters → Prometheus (scrape 15s, retention 15d)
                            ↓
                        Grafana (9 dashboards)

Containers → Loki + Alloy (log aggregation)

Alertmanager → Discord webhook + email

UptimeRobot → external monitoring (is the site responding from outside?)

Seven exporters: node-exporter for the machine, cAdvisor for containers, redis-exporter, two postgres-exporters (one per database), uptimerobot-exporter, and Traefik's built-in metrics.
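A Prometheus configuration tying the exporters above together looks roughly like this — job names are illustrative, and the ports are the exporters' defaults, which may differ from our actual setup:

```yaml
# prometheus.yml sketch — 15s scrape interval, as in the diagram above
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: node          # host-level metrics
    static_configs:
      - targets: ["node-exporter:9100"]
  - job_name: cadvisor      # per-container metrics
    static_configs:
      - targets: ["cadvisor:8080"]
  - job_name: redis
    static_configs:
      - targets: ["redis-exporter:9121"]
```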

Nine Grafana dashboards — overview with Golden Signals, Docker, node, Redis, Traefik, PostgreSQL, Fastify API, Next.js frontend, and centralized logs via Loki.

Eight alert rules, including ServiceDown (critical, with a 2-minute threshold to avoid false positives), HighCPU, HighMemory, DiskSpaceLow, and ExternalDown. Alerts go to Discord — that's where the team is. Emails are for those who don't check Discord often enough.
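As an example of how the 2-minute threshold works, here is a sketch of what a ServiceDown rule looks like in Prometheus — the group name and annotation text are illustrative:

```yaml
# alert rules sketch — "for: 2m" suppresses one-scrape blips
groups:
  - name: availability
    rules:
      - alert: ServiceDown
        expr: up == 0        # target failed its scrape
        for: 2m              # must stay down 2 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} has been unreachable for 2 minutes"
```

Alertmanager then routes the firing alert to the Discord webhook and email receivers.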

Traces will be the next step. For now, metrics and logs cover our needs.

Today, I see everything. A container restarting, a slow request, a suspicious traffic spike — I know in real time.

The Story That Sums It All Up

After an update, the site slows down. Response times spike on the Grafana dashboard. A request on the application side gets queued and causes every other request to time out — the domino effect.

Monitoring does its job. I see latency metrics climbing, error logs surface in Loki, I know exactly which service is affected. But I can't fix it — it's application code, not infrastructure.

I restart the container to buy some uptime before the faulty request hits again. Meanwhile, I send the logs and metrics to my developer friend. He identifies the issue and fixes it. The service recovers.

That moment stuck with me. Seeing the problem, diagnosing it, but not being able to fix it because it's not your domain — that's frustrating. It's one of the reasons I want to become a T-shaped engineer: strong on infrastructure, but able to understand and step into the code when needed.

Useful Mistakes

Monitoring from day one. Not after the first crash, not after the first OOM. Before. It's not a luxury, it's a foundation.

Security thought through earlier. We haven't had an issue, but that's not an excuse. Hardening your config from the start is always cheaper than doing it after an incident.

CI/CD is planned. For now, manual deployment via Taskfile does the job, but as the project grows, it won't hold.

And Now

Glasck is my first real project in production. Not an exercise, not a school project. The site is there — glasck.gg. It's running, with real users, real sessions, real traffic. It's tangible. When something breaks, real people are affected.

We're still working on it. No pressure, everyone moves at their own pace — but we have clear goals for the site. It changes the way you think. Every technical decision becomes a tradeoff between what you want to do, the time available, and above all what the user expects. Does it work? Is it reliable? Is the experience right? That's what matters, not the most impressive stack.

Monitoring isn't a bonus. Working in a team means making your work readable by others. Managing the infrastructure of a real project showed me how much I still have to learn. That's why I keep going.