Startup Journey 0001 | It's There, But Also Not There

(This post was automatically translated by AI.)

In early 2026, I left my senior software engineer position at a large tech company to join a startup as a tech lead. Still figuring out both the technical and managerial sides, I’m hoping this series of journal entries will help me look back years later and see how I’ve grown.

My first task after joining was to review the current system. It had long struggled with auto-scaling, which caused delays under sudden or sustained high traffic.

So on day one I started working with the existing developer to get the environment set up. The first problem surfaced immediately: since this system had been developed almost entirely by a single engineer, reproducing it locally — or on the cloud K8s cluster — turned out to be impossible. Environment variable mismatches between setups were a major culprit.
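A minimal sketch of how such mismatches could be surfaced, assuming each environment's variables can be loaded into a plain dictionary. The function name and the variable names in the usage note are hypothetical and not taken from the real system.

```python
def missing_env_keys(envs: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    """For each environment, list the variable names it lacks.

    A name counts as "missing" when at least one other environment
    defines it but this one does not.
    """
    # Union of every variable name seen in any environment.
    all_keys: set[str] = set()
    for variables in envs.values():
        all_keys.update(variables)

    # Per environment: names defined elsewhere but absent here.
    return {name: all_keys - set(variables) for name, variables in envs.items()}
```

For example, given `{"dev": {"DB_URL": "...", "CACHE_URL": "..."}, "staging": {"DB_URL": "..."}}`, the result reports `CACHE_URL` as missing from staging and nothing missing from dev.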

As I dug deeper into the infrastructure, a more serious issue emerged that needed immediate attention: each environment’s deployment is maintained separately. In other words, although the system nominally has separate “dev/test,” “staging,” and “production” environments, they are not consistent with each other. The Jenkins CI/CD setup consists of three separate pipelines, one per environment.

What does that mean in practice? It means that any fix or test we do on dev or staging cannot be guaranteed to hold in production, because the three pipelines are independent processes.

The environments seem to exist, but also kind of don’t.

I knew that if we wanted meaningful performance testing on staging, we’d first need to unify the deployment process across all environments. So while I was drafting the next release plan in the first week, I was simultaneously studying the system architecture and working to unify the deployment process without touching the existing codebase.

The three environments currently differ in their cluster setups: dev runs on our own self-hosted machines, while staging and production run in the cloud. The biggest difference is that the cloud clusters use Istio for ingress routing, while the self-hosted dev cluster uses a plain ingress controller. Additionally, the staging infrastructure code hadn’t been updated in a long time. The previous workflow was basically: develop and test on dev, then deploy directly to production. Staging was only occasionally used for performance testing.

Since staging was already neglected, I decided not to try to fix it. My plan: take the infrastructure currently used in production and migrate it to staging, then adjust the environment variables for staging-specific settings.
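The "production as the base, staging-specific overrides on top" idea can be sketched as below. This is an illustration under assumptions, not the actual configuration code: the function name is hypothetical, and rejecting override keys that production does not define is a design choice added here to keep staging from drifting again.

```python
def derive_staging_config(
    production: dict[str, str], overrides: dict[str, str]
) -> dict[str, str]:
    """Build the staging config from production plus explicit overrides."""
    # Assumption: staging may only change values, never introduce new keys.
    unknown = set(overrides) - set(production)
    if unknown:
        raise KeyError(f"overrides not present in production: {sorted(unknown)}")

    # Later keys win, so staging values replace the production ones.
    return {**production, **overrides}
```

With this shape, everything staging does not explicitly override stays identical to production, which is the property performance testing on staging depends on.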

What about dev? Normally you’d test on dev before promoting to staging, then production. But the dev environment had engineers actively building the next release. It was more efficient to work in parallel — I’d handle staging, and we’d sync dev later once there was a stable checkpoint.

So the migration plan became:

1. Complete the staging infra reproduction
2. Unify the Jenkins pipeline so all environments use the same deployment process
3. Deploy to staging with the unified process to confirm no regressions
4. Deploy to production with the unified process to confirm no regressions
5. (Optional) Migrate the unified Jenkins flow to GitLab CI/CD
6. Deploy to dev with the unified process
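To make "the same deployment process for all environments" concrete, here is a rough sketch in Python rather than Jenkins Groovy. The step names, cluster identifiers, and namespaces are all hypothetical. The point is only that one step list is shared and the environment is a parameter, instead of three independently maintained pipelines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    name: str       # "dev", "staging", or "production"
    cluster: str    # hypothetical cluster identifier
    namespace: str  # hypothetical Kubernetes namespace


def plan_deployment(env: Environment, image_tag: str) -> list[str]:
    """Return the ordered steps for one deployment.

    The sequence is identical for every environment; only the values
    taken from `env` differ.
    """
    return [
        f"checkout {image_tag}",
        f"build image app:{image_tag}",
        f"select cluster {env.cluster}",
        f"apply manifests to namespace {env.namespace}",
        f"wait for rollout in namespace {env.namespace}",
    ]


# Hypothetical targets mirroring the dev / staging / production split.
ENVIRONMENTS = [
    Environment("dev", "self-hosted", "dev"),
    Environment("staging", "cloud", "staging"),
    Environment("production", "cloud", "production"),
]
```

Because every environment goes through `plan_deployment`, a change to the process is exercised on staging before it reaches production.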

Step 5 is partly personal preference — Jenkins Groovy syntax is less readable than GitLab CI/CD YAML, and since the Jenkins config doesn’t live in the codebase it’s outside version control. I’d also rather not manage multiple toolchains; keeping CI/CD co-located with the code repo reduces operational overhead.

After about a week, we finally had a unified deployment across all environments — a first step in the right direction. Now it’s time to tackle the actual traffic bottleneck.
