Ask HN: Advice for leading a software migration?
Hey HN,

I'm about to take the lead of a decent-sized software migration at work (from v1 of an in-house subsystem to v2, both built in house; we want to deprecate and eventually remove v1 entirely), affecting 8 of our clients and totalling about 16 million customers.

I don't have too many details to share, as I don't know what's relevant, but does anyone have advice or recommended reading on the subject?

One book that is really inspiring me is "How Big Things Get Done" by Bent Flyvbjerg and Dan Gardner. It has some key bits of advice, such as:

* Think slow, act fast, and mitigate long-tailed risks.

* Compartmentalize and stick to repeated processes. "Build with LEGOs"

* Look around at other projects of similar nature.

The last point is why I'm here: I know some of you have been in the game longer than I have, so feel free to share any experiences you think are relevant.

I’ve done this. My quick thoughts:

- migrations always run longer than expected. In my case, leadership estimates were off by a factor of 10. What the eng manager originally said would take 3 months ended up taking a couple years.

- try to deliver quick wins and incremental value. This is often hard though. But it’s worth a try.

- Try to avoid this becoming the project everybody attaches their pet projects to. It’s too easy for people to make this the project where they use that new framework, test well, set up a design system, and make lots of little changes.

- that being said: migrations are easiest if you keep the design (visually and engineering) exactly the same. There will be lots of pressure to “just redo it while you’re already having to rewrite it”, but the uncertainty of a redesign really slows things down. Having a reference implementation means you don’t have to invent tons of acceptance criteria from first principles.

- as soon as things start getting delayed, which they will, try offering to cut corners or cancel the project. You want somebody else in corporate to stick their neck out to extend the project.

- Try seeding the team with more veteran ICs internally. You’ll need their help as you uncover dragons or need to get other teams to help run or integrate your new code.

- Among projects I’ve seen like this, the person running them gets fired or quits partway through at least half of the time. This is often because some middle manager made a promise they couldn’t keep to executives and needs a scapegoat to save their own job. (It’s often that kind of middle manager who switches jobs every two years and keeps quietly failing upward; the project delay happens halfway through their stay at the company, and they’re just trying to reach the two-year mark and quit before anybody internally realizes what is going on.)

sjf · 4 weeks ago
I support everything in this comment.

After more than a decade at large software companies, I can count on one hand the number of migrations where the legacy system was ever actually turned down. I’ve seen migrations drag on for years, to the point where most of the team has turned over. I’ve seen them become a three-way migration because the second version was deemed insufficient, so a third solution was introduced.

Absolutely put your most senior devs on this; maintain as much support from management as possible; budget for much, much more time than you think; you need full commitment or you are going to be maintaining both systems indefinitely.

Do senior devs actually want to work on such a thankless project?
It favors people who just want a clear thing to work on for a year or two.
> After more than a decade at large sw companies, I can count on one hand the number of migrations where the legacy system was ever able to be turned down.

If part of the plan wasn't to run a v1 shim on top of v2 to handle legacy users that won't migrate, then v2 almost certainly doesn't meet the needs of v1 customers, and it's not a question of "migration"; it's a question of ending a product and releasing a similar one.

Sometimes that's what's wanted and needed, but often it's not, and then it's a surprise that the v1 users want their needs met and it's hard to say no to paying customers, but nobody signed up to run two products forever.

sjf · 3 weeks ago
I’ve seen this happen in situations where the migration is totally invisible to users. My last team is five years into an opaque database migration that seems to only expand in scope. It’s usually a symptom of the migration being more difficult than originally expected, combined with losing momentum or leadership support. Obviously no one originally intends to keep maintaining both systems indefinitely.
I've done this too. Not at the "millions of clients" scale, but large enough to drive learnings. Everything above is true.

Migrations are painful, thankless, and always run over budget and time. Unless I'd been at the company long enough and had enough confidence and rapport with my reporting head and skip level, I'd rather not do it.

I'm never taking on any big (more than 2-3 month) migrations. Only small, predictable subsystems that I can roll back, or where I can run v1 and v2 in parallel. The first third of the time is for discovery: making changes, seeing where things break, and ideally coming up with fast tests (manual or automated). The last third is for actual testing, trying out small pieces in production, and fixing unexpected issues. So take your dev estimate and multiply by 3.

Even then, you have to shoot down any demands to use new frameworks, new processes and new dependencies. And resist your own temptation. Remember no one gives a shit about migrations.

You will be asked about progress a thousand times by people incapable of fathoming the complexity. They expect a percentage. Have one ready, with a small roadmap of milestones, and publish it as a report or something. Every time someone asks, point to the report. No one ever opens that report.

One delayed follow-up thought here:

Redesigns almost always result in a decrease in metrics/KPIs. The redesign just lacks the learned improvements that were baked into the old product. So, the initial launch almost always seems like a failure - and requires leadership to expect this dip before problems can be patched.

Thanks for the follow up. I'm just making notes from this thread now and found that you only posted this 13 hours ago. :)
> the person running them gets fired or quits partway through at least half of the time

This is a good point. Or the migration appears to have been very successful to management (before it's actually complete from an engineering perspective) and they get promoted / moved onto higher priority work.

Either way: make sure you are keeping the rest of the relevant engineering organization informed about how the new system works and how the migration is going to work.

I don’t think there’s much room for promotion because migrations are fabrication and promotions favor innovation. It’s ability to save money versus ability to make money. See: Smiling curve in economics.
If at all possible, try to find a way to do it incrementally, with options to roll back if things go sideways when something is released.

Management rarely wants to wait years before seeing any payoff from a big dramatic cutover, and big sweeping changes are disruptive to clients.

This will likely create more work. Maybe some layer has to be built to allow v1 and v2 subsystems to both operate with the other parts of the app. But it should ultimately make it less stressful.
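A per-client compatibility flag is one cheap way to get those rollback options. This is only an illustrative sketch; the flag store and handler strings are invented, not from any real system:

```python
# Sketch of a per-client cutover flag: each client can be moved to v2, or
# rolled back to v1, independently and instantly, without a redeploy.

client_version = {}  # client_id -> "v1" or "v2"; anything unlisted stays on v1

def cut_over(client_id):
    client_version[client_id] = "v2"

def roll_back(client_id):
    client_version[client_id] = "v1"

def handle_request(client_id, req):
    if client_version.get(client_id, "v1") == "v2":
        return f"v2 handled {req}"  # new subsystem path
    return f"v1 handled {req}"      # legacy subsystem stays the default
```

Flipping a single dictionary entry back to "v1" is the whole rollback story, which is exactly what makes a halted migration less stressful.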

If you can allow some friendly departments from friendly clients to test and provide feedback before rolling it out to the whole company or the full set of companies, that would probably go a long way to help identify blind spots.

Most importantly, listen to your team and the people who know the systems well. The projects I’ve seen that have really gone sideways are ones where the people who know the true issues are never consulted, or completely ignored when they try to raise an alarm.

> try to find a way to do it incrementally

I would make it a hard requirement.

If you can't do it incrementally, it's going to fail. Corporations rarely have the attention span and staff tenure to make that kind of migration work.

Even if it takes a year of pre-work to get to a point where it can be done incrementally, it will be the only way it gets done.

One thing I've learned from these large migration projects is that v1 always seems like total crap, while v2 appears to be the perfect dream. However, as you begin building v2, you start to realize that v1 was not actually that bad and had many great but unappreciated features. Additionally, you come to understand that many v1 features took a long time to develop, were battle-tested, and would require significant effort to rebuild in v2 with minimal benefits.

So, what I've learned is not to completely discard v1. Instead, it's better to refactor or rebuild only the parts that pose issues, even though it may not be as sexy or exciting as starting v2 from scratch.

In practice, I would begin by cloning v1 and deploying it to a development environment to start tweaking it. I would also ensure to implement numerous automated tests to safeguard against any potential issues caused by refactoring. Of course, if you can keep using the same database that's even better as you can test refactored features with real customer data and even run both builds in parallel to spot any differences.

Follow the strangler fig pattern, and map out every single task that is required in the migration on a whiteboard.

Write tests if you can, and set up a staging environment for V2 that you can set up and tear down easily for battle testing way before going live.

From there, break the tasks up from above into their business domains, and abstract those into new api services that the v1 system can use without any downtime.

For a frontend migration, that’s a whole different story and you would have to provide more details such as “moving from legacy Angular 1 to React 18 while it’s running”.
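As a rough illustration of the strangler fig routing above, with invented domain names and handlers: each business domain flips to its new service independently, while everything not yet migrated keeps hitting v1.

```python
# Minimal strangler-fig router sketch. Domain names and handler bodies are
# hypothetical stand-ins for the per-domain API services described above.

def v1_handle(domain, payload):
    return ("v1", domain, payload)  # legacy monolith path

def v2_handle(domain, payload):
    return ("v2", domain, payload)  # new per-domain service

# Grow this set one domain at a time, as each v2 service survives battle testing
MIGRATED_DOMAINS = {"billing"}

def route(domain, payload):
    handler = v2_handle if domain in MIGRATED_DOMAINS else v1_handle
    return handler(domain, payload)
```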

I second this
> Follow the strangler fig pattern, and map out every single task that is required in the migration on a whiteboard.

> Write tests if you can, and set up a staging environment for V2 that you can set up and tear down easily for battle testing way before going live.

I've successfully helped migrate a critical project and followed exactly this strategy. The older version was developed and run 1:1 in parallel with the newer one, and in the end customers saw only a small downtime, due to the change of IP addresses where the system was running.

This is a good answer and one I've put into practice successfully more than once. Automated tests are very key here.
Listen to the data that you're migrating from one system to another, so to speak. Test v1-to-v2 and v2-to-v1 migrations until you're blue in the face. Feature-flag migrations for individual clients. Ensure that any SLAs are met with v1-only, v1-in-flight-to-v2, v2-only, and/or some mix of static partial migration. Make sure that you have a lossless, reversible mapping of data from one representation to another.
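That v1-to-v2-and-back testing can be captured as a round-trip property. The record shapes and converters below are hypothetical stand-ins, assuming a simple schema change, to show the shape of the check:

```python
# Round-trip test sketch: for every record you intend to migrate,
# v2_to_v1(v1_to_v2(rec)) must return the original record exactly.

def v1_to_v2(rec):
    # Assumed schema change for the example: v2 merges the split name fields
    return {"id": rec["id"], "name": f'{rec["first"]} {rec["last"]}'}

def v2_to_v1(rec):
    first, _, last = rec["name"].partition(" ")
    return {"id": rec["id"], "first": first, "last": last}

def assert_round_trip(records):
    for rec in records:
        assert v2_to_v1(v1_to_v2(rec)) == rec, f"lossy mapping for {rec}"

assert_round_trip([{"id": 1, "first": "Ada", "last": "Lovelace"}])
```

Feeding this with real (anonymized) production records, not just fixtures, is what makes it bite.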
I'd immediately set the expectation that the process will be messy, take longer than expected, and require continued maintenance, iteration, and process improvements. Management usually tries to sell a transition as something great for everyone that will solve all problems, when it usually ends up being awful, painful, and taking incredible effort. Disappointment is always better the sooner it is communicated. Align in principle on why the effort must happen and the realistic benefits to people's daily lives. Don't sell them a fairytale. I've found every transition is most painful when expectations and communication are poorly managed.

I don't blame people. Usually the offenders are in a culture where telling the truth is unpopular. It just depends on if you want to have a successful transition, or make people feel good about a project that takes 6 years to not finish.

I would strongly disagree with that: do not go into a migration with the expectation that you'll impact people. If you do, you'll take shortcuts and start thinking in the wrong ways. Suddenly you'll be telling yourself that the migrated customers should be able to live with X or Y, or that your colleagues have to accept various extra steps because, hey, we are doing a migration after all. Instead, it has to retain the exact same behavior at all times. It should cause zero pain whatsoever; if it causes pain, you failed at your migration task.

Secondly, I agree with the other poster that it has to be incremental; otherwise you might as well accept a monumental number of bugs from the start.

My third point is that you should automate as much as possible and write code to do the migration in a repeatable way, first on test data, then keep expanding the test data until it encompasses all the possible data customers can have. Then you run that migration at the press of a button, and it should work perfectly every single time you do it.
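That repeatable, push-button migration might be sketched like this; the record shape and normalization rule are invented for illustration. The key property is idempotence: running it twice is a no-op, so a failed run can safely be retried.

```python
# Repeatable migration sketch: the same function runs on test fixtures first,
# then on real data, and re-running it must change nothing (idempotence).

def migrate(records):
    out = []
    for rec in records:
        if rec.get("schema") == 2:  # already migrated: pass through untouched
            out.append(rec)
            continue
        out.append({"schema": 2, "id": rec["id"],
                    "email": rec["email"].strip().lower()})
    return out

fixtures = [{"id": 1, "email": " Bob@Example.COM "}]
once = migrate(fixtures)
assert migrate(once) == once  # idempotent: safe to re-run after a failure
```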
I'm at the tail end of two of these, of ~10 in my career. They are always tough, always a bit of chaos, and all different.

Planning is important, and avoid committing to targets or deadlines until you have your arms wrapped around what needs to be done. This can be wide-ranging, and include: product parity, contract management, internal asset development (project plans, test suites, customer training, etc.), customer change management, and team throughput.

You have few clients but large impacts. You likely want to pick the friendliest one and give them generous terms to be the "test case". Expect it will take 2x longer than your estimate.

Do as much work on parity as you can: what are the differences between v1 and v2, and how will you bridge them? If data migration is involved, you will need tooling and team training.

Inevitably you will find that customers move slower than you like and are using v1 in ways you did not expect.

Day #1 of any N-month long migration/rewrite project I've participated in:

PM: "Fill out this spreadsheet with key dates leading up to the project completion."

Me: "First, that's your job, not mine. Second, I literally just got here; I haven't even drunk my coffee yet. Hi, my name is Jiggawatts. I first heard of the software we're migrating ten minutes ago."

PM: "Yes, yes, but the customer asked me for cost estimates and timelines."

Me: "I asked for a Lamborghini packed with supermodels, but I didn't get that either. Tough break, huh?"

PM: "It's not an unreasonable request!"

Me: "Without time machines and/or a magic crystal ball, it is. Do you have a time machine?"

Etc...

We all recognise this, and it's a symptom of an underlying problem.

Really, what ought to occur is incremental progress and demonstrable deliverables. If you go off into a cave for two years and come back with something the customer doesn't like, then you've caused a business catastrophe.

I've found that businesses and customers in general prefer incremental improvement. One trick in .NET land is to use something like YARP[1], which lets you totally rewrite the app... one web page at a time.

Another management trick on top of that is to not demo the last few steps. Complete the last few milestones of the project quietly, without reporting this up until the very end. I guarantee you that everyone in charge of the budget thinks they can "save money" by skipping the "last 10%", even though that results in 2x the ongoing complexity because it means the legacy components must still remain live and deployed to production.

I guarantee that the only way to prevent this is to lie to management. It is biologically impossible to insert these concepts into the brain of a non-technical manager, so don't even try.

[1] https://microsoft.github.io/reverse-proxy/

I led a painful migration a couple of years ago and can share some tips.

It's not clear whether v2 is already in production somewhere else. If it is not, you'd better wait until 1) the v2 data model has really been finalized and is in prod, and 2) key resources can be made available to the migration team. We were forced to begin the migration before the new product was complete, and it was just plain impossible. We had to start all over every quarter.

- Migrations are very difficult to estimate. Any optimistic estimate will bite back. Hold off as much as you can, and ensure appropriate buffers if you really have to.

- ensure that the 8 clients have an identical v1 data model (tables, constraints, etc). If that is not the case, remember you will run n migrations, not 1.

- You need a team with knowledge of both v1 and v2 data models, as well as business domain know-how. There are many decisions that need to be made and you need the right people to be around.

- Not everything has to be migrated. Trying to migrate 100% is a common mistake: engage with the customers to understand the minimum that legally and operationally has to be migrated, especially if the v1 system has been in production for many years.

- Data migration is an iterative process, and the last thing you want is to manually QA every iteration. You need to develop tests that provide reasonable data-integrity assurance.
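One possible shape for those integrity tests, assuming records keyed by `id` (the field names are illustrative): compare key sets and per-row fingerprints rather than eyeballing each iteration.

```python
# Data-integrity check sketch for each migration iteration: report records
# missing from v2, records that appeared from nowhere, and records whose
# content changed in flight.
import hashlib

def fingerprint(row):
    # Stable digest over the fields that must survive migration unchanged
    material = "|".join(str(row[k]) for k in sorted(row))
    return hashlib.sha256(material.encode()).hexdigest()

def integrity_report(v1_rows, v2_rows, key="id"):
    v1 = {r[key]: fingerprint(r) for r in v1_rows}
    v2 = {r[key]: fingerprint(r) for r in v2_rows}
    return {
        "missing_in_v2": sorted(v1.keys() - v2.keys()),
        "extra_in_v2": sorted(v2.keys() - v1.keys()),
        "mismatched": sorted(k for k in v1.keys() & v2.keys()
                             if v1[k] != v2[k]),
    }
```

An empty report after each iteration is also exactly the number the status dashboard mentioned below can display.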

- Dashboards showing data migrated, failing/ok tests, remaining tables, etc. help communicate status and track progress.

- Customers will need to be involved during the whole project. You need them to commit to making people available who can quickly answer questions to unblock your dev teams. Ideally, you want to create a single team. Make sure that decisions are traced and versioned.

- Performance matters. Discuss the performance requirements upfront. Our process was very, very slow, and we found out a bit too late that the customer would not tolerate that much downtime. Also, discuss when it's OK to migrate, how to roll back in case of failure, etc.

I don’t know if this fits your particular situation, but I recommend building tutorials into your migration process. I built a tool for migrating apps from Heroku to AWS ECS. The app developer runs the tool in their repository and it opens a migration guide in their web browser. The actual migration was mostly automated but we split it up into steps and embedded them into the guide. This way we could teach app devs the basics of how to use ECS and other AWS services as they went. We could also link out to additional docs and provide company specific details. There was a CLI mode for developers that had to migrate a bunch of apps. The tool was a big success and a couple hundred apps were migrated with it. The migration guide ended up being a good reference for people building brand new apps in AWS too. I built the guide using VuePress, but Docusaurus is also a good option if you are familiar with React.
buro9 · 4 weeks ago
Is it a web service? Can you put a proxy in front of the old, to allow you to observe, and potentially duplicate (to the new, whilst testing) all requests that go to the old system?
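A sketch of that observe-and-duplicate proxy, with stand-in backends: callers only ever see the old system's response, while divergences from the new system are logged for later inspection.

```python
# Shadow-proxy sketch: every request hits the old system (its answer is what
# the caller gets) and is duplicated to the new one for comparison. The
# backend functions here are invented placeholders.

def old_backend(req):
    return {"status": 200, "body": req.upper()}

def new_backend(req):
    return {"status": 200, "body": req.upper()}

divergences = []  # (request, old response, new response or exception)

def shadow_proxy(req):
    primary = old_backend(req)
    try:
        shadow = new_backend(req)
        if shadow != primary:
            divergences.append((req, primary, shadow))
    except Exception as exc:  # the new system must never break callers
        divergences.append((req, primary, exc))
    return primary  # callers only ever see the old system's answer
```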

If you're migrating data, can you take counts of things so you can verify quickly? E.g., we have 2.32M records before, and we have a way to prove we have 2.32M records after.

Mostly though, all migrations take longer than you think.

Some good resources from Will Larson: https://lethain.com/migrations/ or, if you prefer it in talk format: https://lethain.com/qcon-sf-migrations-video/
This is the best piece of writing I've ever seen on the topic of migrations, could not recommend this more highly.
avan1 · 4 weeks ago
One year ago we successfully migrated to a new version (totally big bang) with less than 4 hours of downtime. For us it was version 3: v1 and v2 (plus a side service) were both running side by side (v2 was a failed migration, so we ended up with a Frankenstein system where some requests go to v2 and others to v1, and they put data in each other's databases; yes, no single source of truth for the data). Here are my three cents:

1. Don't. I don't know the size of your software, but for us it was a lot of work, especially on weekends and holidays. After release, almost half of the developers quit and the other half were exhausted. Totally not worth it.

2. Don't. If it's possible to fix and refactor the current version, please do that; you will thank yourself later. We had 15 months of development, and in the middle of the project we needed some features and fundamental fixes for the current version, so we ended up doing another minor migration that we called v2.9.

3. Don't. Only do it if you have to, and do it incrementally, as others suggested. Start by building a microservice for the most-used domain of your application, with API backward compatibility (if possible), and even use the same database you are already using.

If you can't refactor the current version (though I can't understand why not) and you insist on a big-bang migration, know the current system well and know every column in the database(s), since you will need to migrate millions of records at the end, which is a big project by itself.

So much of the answer, for better or worse, depends on company culture, personalities, leadership, etc. But...

To the extent possible, prioritize complete vertical slices. While it may not be feasible for foundational or generic technology layers that must be fully developed before building on them, aim to avoid focusing on large horizontal layers that delay end-to-end functionality or demonstrable results within the new system.

Again, depending on many factors, you might consider the strangler fig pattern, in which you implement some component of the new system and operate alongside the original one, route some usage from the old to the new (or even run in shadow), test and validate, repeat incrementally, and so on.

In my mind, what it comes down to is just being really vigilant in ensuring you minimize the risk inherent in a massive migration...which basically means try to achieve the ultimate migration with a series of much smaller, iterative migrations that carry significantly less risk.

mrj · 4 weeks ago
One thing: try to find a path towards delivering solid improvements as early as possible, phase out the big stuff and work on a drum beat of consistent improvements.

Large projects have lots of vulnerabilities, but I've seen many get sucked into "v2 is going to fix all the problems and mistakes of v1." Without a solid technical plan, goals, and deliverables, it's easy for that effort to devolve into years-long, architecture-astronaut-style arguments about nanoseconds saved by something over something else. Halfway through, somebody will suggest that all problems with this approach will be solved by $newLanguage. If it doesn't serve the goals and deliver meaningful value, avoid getting stuck in those traps. Know what you're trying to solve.

There will probably be a v3 and somebody will complain about your version someday, too. It's the way of progress. As long as it's an improvement over the old and lays the right groundwork, continue moving in the right direction.

I co-founded a startup, Quesma, one of whose use cases is assisting with database migrations, big (e.g. moving from tech X to Y) and small (a schema change to add one column).

I did a fair amount of customer research and noticed that almost all successful migrations had these things in common:

- they tested end-to-end migration, even on production data, early on

- very familiar with blue/green deployments or shadow testing

- gradual and have a good in-between story

- pragmatic (e.g. they moved 98%+ of data from SQL database to Cassandra but left some)

- very good revert story

- either simplified current system or split monolith into more manageable pieces

Many painful ones shared:

- the next-gen system was written by a team that does not understand, or has never worked on, the current system

- big release day, after X quarters

- try to change too many things at once (e.g. database technology, schema, protocols)

- the old system was declared legacy too early, and everybody moved off its development

- they ended up discovering flaws very late in the process

Thanks for this. I'm focusing on the top half :)
Around a year ago I had one of those huge migration tasks where you have no idea where to start. I hit my head on the wall a few times and had to erase the first month of work completely. In the end, what worked was:

1. Spend a day or two creating a fuzzy view of the whole problem. Pay attention to the rabbit holes; do not fall into them. Be superficial.

2. Spend a day or two creating a detailed view of the next 2 weeks. Go as deep as you can, but be careful not to prepare more than 2 weeks of work, because things WILL change and you will lose a lot of work. Minimize that.

3. Execute.

4. Repeat from 1.

After a couple of iterations your estimation will be much better and you will see the light at the end of the tunnel.

EDIT - almost forgot the most important part: write small, backwards-compatible PRs that are deployed to production constantly. Don’t write a few big PRs; they will come back to bite you.

I find GitHub's Ruby on Rails migration path inspiring.

https://github.blog/2018-09-28-upgrading-github-from-rails-3...

They show in some detail how to migrate at large scale while staying safe.

The best approach I can give you? Don't.

If you have a big bang V2 that is incompatible with V1, you've already lost. There should be a V1.0.1, V1.0.2, etc that incrementally gets you to what you would've already gotten with V2 without losing the ability to do each individual piece in stepwise succession. That's essentially what the "strangler pattern" is.

The strangler pattern is helpful, because it forces you to focus on what you piece-wise need to "strangle" -- which usually isn't as much as it looks like at first blush.

The hardest part of most migrations is data model migrations, and the best approach here is to start writing to the new model before you start reading it from the core business logic. By the time that works as expected, much of the pain is done. This takes a long time because it requires a lot of repairing the ship as you steer it, so for the sake of the business, it is best doing it in small pieces aligned with chunks of business value or new feature iteration velocity.
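That write-new-before-read-new phase might look like this in miniature (the stores and field names are invented): every write lands in both models, while reads stay on v1 until the new model has proven itself.

```python
# Dual-write sketch: writes go to both data models; reads come from v1 until
# the READ_FROM_V2 switch is flipped after the new model has been validated.

class Store(dict):
    """Stand-in for a real database table/collection."""

v1_store, v2_store = Store(), Store()

READ_FROM_V2 = False  # flipped only once the new model agrees with the old

def save_user(user_id, name):
    v1_store[user_id] = {"name": name}
    # New-model write happens alongside, translated to the v2 shape
    v2_store[user_id] = {"display_name": name}

def load_user(user_id):
    if READ_FROM_V2:
        return v2_store[user_id]["display_name"]
    return v1_store[user_id]["name"]
```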

The second part of many migrations is adding sufficient test coverage -- in a lot of cases, this will already be present, but if it's not, you're in for a world of pain. If you don't have enough test coverage of the V1, add that before you try and do anything fancy or you'll end up testing "the long way" (through production outages and late night scrambles to hotfix and inevitable rollbacks).

Lots of good advice here. Some things I will throw in:

Find ways to ship smaller versions of the migration first. If possible: isolate features that can be migrated on their own.

If possible silently run v2 in parallel with v1 for as long as it takes to be comfortable with v2.

Assume that at some point you are going to have to completely halt the migration, go back to v1-only, fix something, and restart the migration.

I'd bet it's going to take 2-3x longer than you think to completely deprecate v1.

simne · 4 weeks ago
This is really what should be called a big thing.

Unfortunately, this is a huge project, and to do it you need a very clear view of three parameters:

1. Where are you now? How large (in LOC) is your starting point? Is your project already loosely coupled, or is it a monolith? This is important, because it is much easier if you can isolate small parts and rebuild them separately while all the other code stays old.

2. How should the end point of the project look? A monolith, microservices, or something else?

3. How large a budget do you have, and how many man-hours per month are possible?

To do all this in predictable time and budget, you need the project in waterfall style; but sure, you could use the waterfall part as the overall strategy and do everything with agile (trying to stay close to the waterfall milestones).

In real life I have seen such a migration, unfortunately unsuccessful from the project manager's view. First, they tried to make the new version on the same platform, but the project just bloated without much success; second, they decided to change to a totally different platform and rewrite all the code from scratch (in another language), and this time they succeeded.

Everyone here has some good advice, but I didn't see this one listed:

I work at a place that just finished the migration for ONE customer (of many), and it took about 2 years. The main issue we ran into was that NOBODY documented the mapping between IDs in the old vs. new system. We had to Frankenstein that shit (look at original filenames of imported data to deduce which new ID matched which old ID), which took MONTHS.

So, if you have any data, make sure you know EXACTLY what the id is in BOTH systems, even if you think they should be exactly the same.
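An explicit ID crosswalk, recorded at import time, is cheap insurance against exactly that archaeology. This sketch uses a hypothetical allocator standing in for the new system's ID generation:

```python
# ID crosswalk sketch: record the old-ID/new-ID pair at the moment each
# record is imported, the only time both IDs are reliably known, instead of
# deducing the mapping from filenames months later.
import itertools

id_map = {}  # old_id -> new_id

def import_record(old_id, allocate_new_id):
    new_id = allocate_new_id()
    id_map[old_id] = new_id  # persist this table alongside the migrated data
    return new_id

def new_id_for(old_id):
    return id_map[old_id]

# Hypothetical allocator for the new system's IDs
_counter = itertools.count(1)
allocate = lambda: next(_counter)
```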

FWIW, our first step was keeping both systems synchronized (via that ID matching up) and migrating the end-user frontend to the new system.

From there, we trained the customer on the new system and administration, and finally, we swapped them over to the new system and disabled the synchronization system.

Now, we kinda know how to do it and we expect to be able to do it faster ... we'll see.

I'm almost in a similar situation, and I have to talk with the old team to discuss how they did it and what issues they faced.

The advice I'd give is to write down all your information so it's accessible by anyone who needs it.

junto · 4 weeks ago
I’m going through the same thing at the moment. I came in two years after the project was started, where the key strategy was to replace an in-house-developed maze of spaghetti that had become unmaintainable with a collection of SaaS-based services interconnected by a bunch of synchronization queues and services moving data around.

The initial plan when I arrived was to write V2 in its entirety, migrate the data from V1 as one big bang rewrite, and job done.

I realized immediately that the risk there was far too high and the goals unrealistic, so I’m pushing for the strangler pattern and the business is pushing back. However, I’m finally getting people in the business to understand the new plan I’ve put together, and they are seeing opportunities.

Still, almost all the old developers have left and the existing system is running on duct tape. There are no tests. The old system is a fat client with a bunch of half-finished messaging services (the improvement project for the original system was cancelled), with lots of workarounds going directly into a central database, and most of the business logic is buried in the UI. Even understanding how it all fits together is impossible, so the only way forward is to go back to basics and work with the business departments to design and document the process they REALLY need, and iteratively build it. You’ll never really feature-match, and business departments develop workaround processes over time that become the norm, to the point that they are inefficient through historical lack of design. Going back to basics in terms of process design is something I highly recommend.

Your biggest challenge is that the business will be pushing engineering to deliver, and they will continuously try to impose deadlines that the engineering department won’t realistically be able to meet. All you can do is keep pushing back, keep on trucking, and try not to let it get to you and your engineering teams.

Remember to take a deep breath once in a while. Let the waves crash over you and try not to take it all personally.

I've done two large software migrations in my life; here are my findings:

1. If you can't do a gradual deployment, try a primary-secondary (master/slave) type of deployment where the new system runs in read-only mode (mirroring the old system's data) for a while.

2. Whatever you budgeted for migrating data, double it. Get a data cleansing specialist working on the data-to-be-migrated ASAP.

3. Document all processes of the current system. Have the painful conversations up front about functionality that will be eliminated; migrating usually means dropping a bunch of features that don't pass the cost/benefit threshold. Your users/stakeholders might not see it that way, so make the cost of those features explicit and get as much buy-in as possible.
There's already some really valuable info here. I'll add that it's important to manage the expectations and emotions of your stakeholders during these types of projects. They need to know what level of involvement you'll need from them.

Also, let them know that whatever go-live date you set, the project isn't over then. There will be issues post-launch that need addressing. Assure them that the engineering teams won't abandon them, and make sure you have time built in to support them during that period. Whatever time you think you need for it, double it, just in case.

All the best to you as you begin your migration effort.

Make sure you can own the work for the whole migration. If you lay out the tasks and more than one team has to be involved, multiply your manager's estimate by 10 for each extra team.

"But that means if I have a 1 month project and I have to involve 10 other teams it would take 10 years or so"

Yes that is another way of saying it will fail.

If you can figure out how to get up front sign off from all teams so you can just do it all within your own team you will make things go a lot faster.

Separately, figure out how to cake-slice things. If you have dev and prod, for instance, and 10 applications, don't migrate all 10 in dev first. Migrate one app in dev, then the same app in prod, then move on to the next app. That way, wherever you stop, at least something will have been delivered to the customer.

I'm glad someone asked this question. One of my often-quipped quotes is: if you squint hard enough, every software project is some kind of migration. And there are some excellent suggestions by others here.

Having led my fair share of migrations in the past and going through one right now, here are my tips.

- Understand your stakeholders and the teams that are impacted. Spend enough time understanding how the system is being used today. Proxying and abstractions are your friends here. Just get as much data as possible.

- Once you have enough data, crunch it and, very importantly, have a means to surface it to the users. More often than not, for systems that have grown organically, you'll be surprised that users themselves don't know how they use an API.

- As much as possible, try to move things seamlessly. You can often do some portion of the migration without users even knowing. This can be as simple as introducing a translation layer, or even making the code changes yourself for the users to review.
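A translation layer can be as thin as an adapter that keeps the old call shape alive while delegating to the new system. A minimal sketch, with all class names and record shapes invented for illustration (the real V1/V2 contracts will differ):

```python
# Hypothetical sketch only: names and record shapes are invented.

class V2Backend:
    """Stand-in for the new system's API (nested records)."""
    def fetch_account(self, account_id):
        return {"id": account_id,
                "profile": {"name": "Ada", "email": "ada@example.com"}}

class V1CompatLayer:
    """Presents the flat V1 shape on top of V2's nested one, so
    existing callers keep working while the backend is swapped."""
    def __init__(self, backend):
        self.backend = backend

    def get_account(self, account_id):
        v2 = self.backend.fetch_account(account_id)
        # V1 returned a flat record; translate V2's nested profile back.
        return {"id": v2["id"],
                "name": v2["profile"]["name"],
                "email": v2["profile"]["email"]}

legacy_api = V1CompatLayer(V2Backend())
account = legacy_api.get_account(42)  # callers still see the V1 shape
```

Callers keep compiling and behaving identically, which is what makes this kind of move invisible to users.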

- If you're working on a timeline here owing to external factors like vendor contracts, cert expiry dates etc., make sure to buffer in at least a quarter (or possibly more). There will be new discoveries along the way.

- There will always be teams/stakeholders who will oppose you asking them to do this work. I can't stress this enough: make sure to get your leadership on board. If you have a Program Management Office, make them your best friends. When anything escalates and gets political, you as an engineer are better served getting the job done, and an aligned leadership that backs you will help you deal with these stragglers.

- Ultimately, love what you're doing. There's a mistaken belief that migration work is not as sexy as greenfield work. I truly believe greenfield work is manifold simpler than migration. A migration is more like changing the wheels and the engine of a car as it's running. There will be a lot of tradeoffs to make, and this is where engineering skill comes into the picture!

All the best! :)

I've done a ton of migrations, and most of the advice I'd give has already been said in the other comments, except for one thing:

If people are pushing for changes to the app to better match how the business works today, leap into that conversation, but don't get talked into changing the app. Instead aim at reworking their business processes to first make their process as simple as possible, and then simplify the app to match the new process. Your migration is simpler when their process is simpler, and everyone wins.

If people aren't willing to refactor business processes as part of the effort, then refuse to change requirements. Hold steady to "We both improve, or we stick to the status quo."

Be transparent in how you pad estimates. This builds trust with stakeholders, so that when things go awry you can remind them of the buffer you built in.

Require anyone who reports to you during the planning process to do the same; provide the most accurate estimate possible, then be transparent about their padding.

If there's a 'known unknown', call it out. Mitigate risk with high-level executive check-ins. Be candid with your status lights, and tell them what you're doing to mitigate risk on a regular basis.

Migration is about managing up to the whole org, not just to a boss; the more candid you are, the more you deflate the rage that comes with unexpected downtime, rollbacks, etc.

One thing I've seen delay migrations is that people keep adding features to v1 during the migration because customers need improvements now. Then you have to add those changes to v2, which delays the migration, then more stuff gets added to v1 because the migration was delayed, and so on. Ensuring that doesn't happen by setting the right expectations with stakeholders is important.
Migrate the most complex use case first. There's nothing worse than discovering mid-migration that you have to pause for rework. Better instead to slow down with your first adopter and accelerate for subsequent.
Design your new UI first, then your new data model, then write your migrations/mapping functions to move V1 data to V2. I just did this for a decent-sized app moving from Postgres to Mongo with few hiccups.
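For that mapping-function step, the core is usually a pure function from V1 rows to a V2 document, which makes it easy to unit-test before touching real data. A hedged sketch with an invented schema (nothing here reflects the commenter's actual app):

```python
# Illustrative only: folds normalized V1 rows (e.g. from Postgres)
# into a single V2 document (e.g. for Mongo). Schema is invented.

def map_v1_to_v2(user_row, address_rows):
    """Fold a user row plus its child address rows into one document."""
    return {
        "_id": user_row["id"],
        "name": user_row["name"],
        "addresses": [
            {"city": a["city"], "zip": a["zip"]} for a in address_rows
        ],
    }

doc = map_v1_to_v2(
    {"id": 7, "name": "Ada"},
    [{"city": "Paris", "zip": "75001"}],
)
```

Keeping the mapper free of I/O means the same function can drive a bulk backfill, a dual-write path, and a test suite.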
While good software design sets clear boundaries between layers, this is often not the case in real life. Upper layers incorporate knowledge of the underlying subsystem and may program around deficiencies or exploit undocumented features. This can make it much harder to swap V1 out for V2 seamlessly. Perhaps focus first on identifying all the places where this kind of thing crosses those boundaries and figure out how to deal with them.
Well, it's hard to give advice when you don't provide much about your background or existing experience.

Volumes 1 and 2 of Limoncelli's TPOSNA have a lot of good wisdom for the operational side in general; though some aspects have become dated, most of it has stood the test of time.

A client of mine wants to rewrite a product I designed and maintain, to get rid of their hard dependency on me and move it to their default tech stack. I was very relaxed hearing that, since my expectation was it would never actually happen; there would always be something else more urgent. So far my predictions have been absolutely correct. Half a year in, we have a bunch of meeting notes.
Put in more effort up front to make things easier later.

Try to automate what you can, efficiently. Code conversion, tests, etc.

Keep an eye out for opportunities to simplify things.

Make sure to have buffer built into your time/effort estimations.

Ask lots of questions.

Find other folks who interact with different parts of the system and ask if some of their time can be allocated to the conversion.

If you can onboard other folks, find out if there's anything you can do to automate any of their work.

As a SDET, I suggest lots of testing.

Assuming V1 and V2 offer users the same functionality, there’s a bunch of testing you can do. The best approach IMO is oracle testing, where you do the same thing on V1 and V2 and check that they behave identically. Preferably roll out to a subset of users, such as via a canary deployment, and make sure you have a rollback plan.
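As a rough illustration of the oracle idea: run the same inputs through both implementations and flag any disagreement. The two "systems" below are toy stand-ins, not a real test harness:

```python
# Toy oracle test: V1 and V2 are stand-in implementations of the
# same contract (order totals); the check surfaces any divergence.

def v1_total(order):
    # Old implementation: explicit loop.
    total = 0
    for item in order:
        total += item["price"] * item["qty"]
    return total

def v2_total(order):
    # New implementation: same contract, different code path.
    return sum(item["price"] * item["qty"] for item in order)

def oracle_check(inputs):
    """Return the inputs on which V1 and V2 disagree, if any."""
    return [x for x in inputs if v1_total(x) != v2_total(x)]

orders = [
    [{"price": 10, "qty": 2}, {"price": 5, "qty": 1}],
    [],                          # edge case: empty order
    [{"price": 0, "qty": 100}],  # edge case: zero price
]
mismatches = oracle_check(orders)  # empty list means the systems agree
```

In practice the inputs would come from recorded production traffic rather than a hand-built list, but the shape of the check is the same.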

This might help you: https://bilbof.com/migrations/table

I mined HN last year for all migrations. You can filter by various fields like technologies etc. The table will probably be most useful since it links off to the blog posts etc

Do not give anyone the wrong idea that "it will work exactly like before - only better!" Migrations are always bumpy roads and there are always people who hate new things. You don't want to give people who complain about stuff not working exactly as it did before any ammunition.
What is motivating the move technically?

What is motivating the move politically?

What is motivating the move psychologically?

Be clear regarding each.

They are all there in the decision.

Don't pretend they aren't.

Only one of them is technical.

And it is not what most determines success.

Good luck.

This. Even though you/we are [mostly] focused on the tech aspect of the world, make no mistake: the “business side” (or the political one) can kill your migration project more suddenly and decisively than you can spell ‘strangler pattern’.

So, to add to the comment above:

- Does your migration affect the clients and the way they work in any way? No matter how small the change, if the answer is “yes” then you need full buy-in from the clients. Even if your migration goes flawlessly from a technical perspective: if a large enough client didn’t realise that V2 comes with some change they don’t like, and when that change hits them after the migration they raise it as a problem and escalate it high enough up the food chain with the message “this is not working for us”, then you are going to be rolling back, regardless of the technical merits. So realise that the clients are big stakeholders, and they need to be managed from the beginning of the project until some time after V2 goes live. In my experience the best results come from bringing them close to the project early and getting buy-in by, e.g., having them do some end-to-end testing of V2 and accept it before go-live. Preferably in an email, for when things get ugly at some point (it happens, it sucks).

- Also, as the comment above says, don’t ignore the political. You should know what every important stakeholder gets out of this. Don’t forget personal ambition, ego, promotions, etc. as possible motivators. Which stakeholders support your project now, and which don’t? Just as important: what might make a stakeholder switch camps from supporter to opponent? Maybe a stakeholder is a mid-level manager measured on some KPI that V2 will improve, so he’s a supporter. But then his company gets a new CEO and the KPIs change. Now V2 no longer gives him anything he wants, and he’s actually against your project, because he has to commit resources to it without getting anything back; if your project is killed, he frees up resources and doesn’t lose anything.

From one developer to another: the tech part is the easier part, I’m sorry to say.

I guess I need to clarify what I mean by the psychological component.

Technical and political components are external. Career aspirations and mitigating boredom/stagnation by pursuing complicated work create motivations to invent interesting projects.

And there’s the ability to claim integration from v1 to v2 as progress. Rather than only change.

To put it another way, there is always some degree of change for the sake of change motivating our desire for change, particularly when a big chunk of our time must be accounted for. Typically, playing video games, sleeping, or walking a dog through the woods are not viable alternatives in contexts where data migrations are being considered.

If migration was something the OP didn’t want to do, the question would be about finding a new job.

For me, the first assessment is what I can throw away: what processes can be replaced or aren't used? Does the new system offer any new ways of doing things? Discuss this with the business areas.

Then see what's left.

1. Remember Murphy's Law.

2. Have a rollback option.

3. Keep the go-live to-do list to a minimum. You'd be surprised how many of the risks can be mitigated before the date of the change.
Bring the risk forward: if there is an easy-to-identify "most risky/gnarly" part, aim to get a proof of concept of that done first.
Since this is an in-house project, in some ways your organization has already failed. They shouldn't drop someone new into leading "the big migration"; it should have been built into the process of developing V2 from the beginning, delivered in increments rather than as a big bang.

Now it's too late for that, and you just have to make the best of the situation. Lots of good advice has already been given in the thread; here are a few more points.

- Remove as much data (and as many features) from V1 as you possibly can before you start migrating (obviously after backing everything up). There will always be edge cases of strangely formatted data that got inserted before data validator X was implemented 8 years ago. These outliers will throw grit into your migration machinery. Do you really need those audit logs from a decade ago? No. Do you really need that obscure feature that was only used for compatibility with Windows 98? No. Just remove it.

Working with smaller data sets not only removes outliers but makes the data size more manageable. A snapshot of the DB will fit on a developer workstation. You can run the migration with short downtime, and roll it back with short downtime. Compare a database that fits in memory with a distributed monster: it can mean the difference between seconds and hours, and once you cross that threshold, a rollback becomes scary.

- Which brings us to the next topic: avoid an atomic switchover of data. You will always find some errors when you swap, and that puts you in a sweaty position deciding whether to quick-fix or roll back. Rolling back data is a lot more difficult than rolling back applications, which can pressure you into quick fixes that may not even be possible due to some oversight when developing V2.

If you can, stop writing into V1 and use it only for reading. Gradually move more and more write paths into V2 until there are none left. Then gradually remove read paths from V1 as you move the old data over into V2 (or delete it).
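One way to sketch that gradual write-path move is a per-domain routing flag, flipped one write path at a time so there is never a single atomic switchover. Flag names and the in-memory "stores" below are invented for illustration:

```python
# Sketch of per-domain write routing during a strangler migration.
# Each write path is flipped to V2 independently via a flag.

WRITE_FLAGS = {"orders": True, "invoices": False}  # True => V2 owns writes

class WriteRouter:
    def __init__(self, v1, v2, flags):
        self.v1, self.v2, self.flags = v1, v2, flags

    def write(self, domain, record):
        # Route by domain: migrated domains go to V2, the rest stay on V1.
        target = self.v2 if self.flags.get(domain) else self.v1
        target.append((domain, record))

v1_store, v2_store = [], []
router = WriteRouter(v1_store, v2_store, WRITE_FLAGS)
router.write("orders", {"id": 1})    # "orders" has been cut over to V2
router.write("invoices", {"id": 2})  # "invoices" still writes to V1
```

Flipping one flag at a time keeps each cutover small and individually revertible.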

- When transitioning the application gradually like this, you may end up with some corner of the system that never gets completely migrated. This is fine. Maybe you realize V1 was better than V2 after all in this particular situation. Maybe the cost outweighs the benefit. Don't see this as a failure. Always re-evaluate what the best solution is now, and avoid the sunk-cost fallacy.

- Finally, make sure the migration gets priority from the whole org. If V1 never gets phased out, others will start delivering new features there, forcing you to take two steps forward and one step back, constantly cleaning up new features to migrate into V2. Making sure there is only one system in the end should be everyone's priority.

Second-hand knowledge here, but one senior dev told me that their company made a conscious effort to do this type of transition between software versions really, really early. He said that once you have a large customer base with expectations of the software, you're screwed.
I would say that if you can swap in parts of V2 incrementally, that's the best way.

Integration tests for behavior verification.

But the incremental migration is key.

CYA: Keep comprehensive records of all decisions made and identify who made them throughout the process.
Is V2 already written? Or are you taking lead on designing and building it?
It's already written and in use with about half of our clients, but I've got to migrate the "early half", who have things caked onto V1.
I can give you some timeline advice. Every merger and acquisition I’ve ever seen takes at least 7 years before the old system is no longer referenced. The timeline depends a lot on the complexity of the system’s interactions with other systems. Getting rid of old systems is not easy.

You often can’t replace old systems without also changing the org chart.

Every task that involves someone outside your team will need its t-shirt size doubled for every group outside yours, possibly tripled if it’s a vendor (some are better than others). This can easily turn a one-line fix into a month-long project if you need to coordinate the change.

It is not uncommon to get 90% finished with a migration only to find out some job still needs to happen in the old system. That old system will survive another 4 years. If you’re lucky, you’ll find a way to rope it off so it can quietly do that one job without affecting anything else.

It is not uncommon for some team to start using the old system midway through the migration, usually because it’s there. This will seem fine, because “they can easily change to the new system”. It will inevitably add 6 months to the project, because changing is never easy. If the organization is big enough, these additions will happen with enough frequency that the project is guaranteed to last forever.

Politics is not your friend. Any large system change is going to require other teams to do work. Managers will attempt to position themselves so that their team is not the one “responsible” for calendar slips that were likely unrealistic to begin with.

I’m sure I could think of other things, but these were the thoughts that came quickly, having led large migrations away from legacy systems.

Good luck.

A very short list:

* The trick is to set everyone's expectations low, especially your own. E.g., even for a relatively small project, make sure management understands that there is no way to guarantee a schedule.

* If you think a step will take x days, schedule 4x days. EXPECT 4x. Generally, EXPECT bizarre failures to fry your so-called "schedule."

* Simple, obvious steps that work initially will suddenly stop working, and likely near the project's end.

* At least one thing not obviously connected to your project will stall everything (e.g., an old switch, a DB update, something somebody band-aided with COBOL 40 years ago**, ad nauseam.)

* You will almost certainly have at least one hair-pulling interaction with the security team. Hopefully, you'll have someone higher up to help you.

**yeah, that happened.

I just read Kill It with Fire [0], which describes a methodology for legacy modernization projects (and should work fine for any migration). Highly recommended! It would have served me well as a guide before I went into the large migrations of my career.

[0]: https://nostarch.com/kill-it-fire

16 million users on the system means V1 is fine. Iterate on it; make migration a process, not a task with a deadline. Never do two things at the same time, no matter how attractive they may seem from a distance.

Sorry to say it, but this smells a bit of "we're migrating because microservices, or Kafka, or whatever". Don't. Grow organically into it. Do this kind of thing because you have to, not because you can.

If you said you were struggling, that something doesn't work and you can't go on anymore, it would be easier to advise, and it wouldn't feel like a step in the wrong direction.

I took over a team that was struggling with a very large architectural migration that had been going on for a couple of years. Two years later we have largely gotten things back to a healthy state. Though we have achieved only maybe 20% of the original technical ambitions, the team is an order of magnitude stronger than when it started, which in many ways is more important than the exact state of the system. The migration introduced two major new technologies being incubated by outside infra teams, and a new data model meant to coalesce 500+ fields drawn from a dozen or more databases, serving hundreds of clients across dozens of teams and exposing data on hundreds of customer-facing surfaces representing both sides of a C2C marketplace.

The first thing I would say is take all advice you get with a huge grain of salt. Details matter, and the particular details that matter the most vary tremendously from project to project. That said, here's my advice:

- Be clear on the goals up front and along the way. It's already a red flag that you don't lead with the goal and say things like "I don't have many details to share as I don't know what's relevant". In the heady early days of a big project, there will be many rose-tinted ideas of problems that can be solved, and people will keep tacking them on without the burden of knowing the stumbling blocks that will inevitably come. You need to keep the goal in mind at all times so you can ruthlessly make tradeoffs every step of the way. It's even okay if the goal changes, but be explicit about it.

- Make sure you find a way to do it incrementally. If you find code accumulating that is not exercised in the running system for more than a few weeks at a time, that's a huge red flag. Kent Beck's Trough of Despair [1] from a few days ago is relevant here. You need to be very careful that your trough doesn't grow wider than you can handle; it's surprisingly easy for that to happen, given how software system complexity grows. The risk is even greater if you have a lot of resources at your disposal, because more cooks in the kitchen means higher risk of losing cohesion.

- There's no substitute for seniority up and down the chain. One or two weak links can really derail the entire effort. And it's not just about technical strength; communication and social aspects are equally important. Every single front-line engineer will likely run into issues that are relevant outside their scope, but will they recognize that for areas they are not focused on? When a project is too big for any one individual to understand all the details, you need a critical mass of big-picture thinkers, and some lightweight ways for informal conversations to be sparked and escalated (or de-escalated) as their importance comes into focus.

- If you ever ask an engineer why they're doing something and the answer is "because XXX told me to" or "because that's the plan", it's time for a quick sit-down. Engineers who don't know why they're doing something will not make good choices when the unforeseen arises (which it always does).

- Know your clients. Are they internal, external? Are there ghost or second-order clients due to leaking internal details or other encapsulation violations? Will you still support all the features they need? What actions will they need to take to support you? What rate of change can they support? You can have the perfect end-state in mind, and then get tripped up by mundane constraints on your clients that you were not fully aware of.

I've done this sort of thing a few times.

- Build a small core team who know the problem in depth. You really need to understand v1 and v2 data and the mappings, as well as the functionality in each.

- Build a test system that is insulated from customers; you want to be able to use this as if it's the real thing, but to be absolutely, completely, dead certain that it will not affect live systems, and that no output from this will reach people it should not. Make sure there are visual indicators as well as logical traps on data exiting this system. Make this repeatable - you are going to use this system a lot to re-run tests. Despite the firebreaks, you will have some brown pants moments.

- The ideal is to move from V1 to V2 gradually, using a passthrough system. However, IME this is often not possible, and at some point there will be a hard switch between systems.

- Develop migration plans with multiple off-ramps and fallbacks, and monitoring. By the time you press the switch you should know exactly how everything will work, and you should have no issues, despite this, you should have layers of contingency to allow for business as usual when the unplanned happens. This is a mix between technical and business and it should have been properly understood by everyone involved. Monitoring is critically important. Your plans should include aftercare... for example, what happens if you think the migration is successful, and two weeks later you discover an issue with 200,000 transactions. How will you reconcile? How will you communicate with the affected parties?

- Look for classes of things... Can you find 100,000 dead accounts that can be removed? Can you find 500,000 that have only ever had one transaction? Look for classes of errors - and fix them before migration. Keep a record of all of this, and make certain that you have covered all cases and all records. If you are lucky, you will be able to migrate classes from v1 to v2 and have the passthrough transparently manage this.
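Finding those classes can start as a simple triage pass over the records. A sketch with an invented record shape, bucketing accounts so whole classes can be dropped or fast-pathed:

```python
# Illustrative triage before migrating: bucket V1 accounts into
# classes (dead, single-transaction, active). Record shape invented.

accounts = [
    {"id": 1, "tx_count": 0},
    {"id": 2, "tx_count": 1},
    {"id": 3, "tx_count": 42},
]

def classify(acct):
    if acct["tx_count"] == 0:
        return "dead"    # candidate for removal, after backup
    if acct["tx_count"] == 1:
        return "single"  # candidate for a simplified fast path
    return "active"      # needs the full migration path

buckets = {}
for a in accounts:
    buckets.setdefault(classify(a), []).append(a["id"])
```

Keeping the bucket counts in a written record also gives you the "have we covered all cases" checklist the comment recommends.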

- Ideally, have a log of transactions that can be replayed on demand. So that you can run systems in parallel and so that in the event of issues, you can unwind.
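A minimal version of such a replayable log treats every mutation as an event and rebuilds state by folding over the events in order; both systems replaying the same log should converge on the same state. Event shapes here are invented for illustration:

```python
# Sketch of a replayable transaction log: record each mutation as an
# event, so history can be re-applied to V1 or V2, run in parallel,
# or truncated to unwind. Event shapes are invented.

log = [
    {"op": "create", "id": 1, "balance": 100},
    {"op": "credit", "id": 1, "amount": 50},
]

def replay(events):
    """Rebuild account state from scratch by applying events in order."""
    state = {}
    for e in events:
        if e["op"] == "create":
            state[e["id"]] = e["balance"]
        elif e["op"] == "credit":
            state[e["id"]] += e["amount"]
    return state

final_state = replay(log)
```

To unwind after an incident, replay the log only up to the last known-good event instead of restoring from a snapshot.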

- Keep written logs of all the things that you and the team do. You _will_ forget stuff you've done. This is true on an hour to hour basis as well as a month-to-month basis.

- Work on making migrations fast. Can you organise it so that 16m migrations take 10 minutes? This allows you test and retest. You want anyone to be able to run on-demand migrations.

- Look to the end-users. For an upgrade to be successfully deployed, both the business and the end-users must be happy; you will want to run test groups, pilots, group conversations, and make your team available to the end-users. Nothing should be surprising by the end. You will also need to know that v2 is performant - catch these problems before they become general issues of dissatisfaction. Look also for pain points and try to ensure that you remove them in v2. Change is painful, but if you can show that there are benefits you will ameliorate much of the criticism.

- Have defined end-points. You do not want to be doing this in 5 years time.

the old system was fine
Led a couple of these. Advice:

1) Do it incrementally. If you don't, it will fail. You can't block feature releases for a whole organization for years, but if you don't block feature releases you will forever fall behind the head of development.

2) Design the v2 you want to have, but don't get too attached to the design. It will change as you uncover engineering realities and as business direction evolves. Be flexible and adapt as you go along.

3) It helps garner exec support if you can catalog product ideas that they've wanted to do but been prevented from by the current architecture, and address them with the new architecture. Rewrites that address new business goals and strategies have a lot more staying power than rewrites for the hell of it, or ones justified by "the code will be much cleaner".

4) If V1 doesn't already have clear APIs and subsystem boundaries, it's usually worth doing pre-work to put them in place. These take the form of behavior-neutral refactorings whose only purpose is to trim & rationalize dependencies - you're changing how the system looks to outside clients, but not how it works.

5) Make sure you have a comprehensive regression test suite. This is part of why the last point is so important.

6) The new APIs are the most important part. Get them working first, in order of how frequently new code depends on them, even if you have to implement them on top of old code or use hacks to connect them to the old system. Then get the rest of the org using them. This will keep you from falling behind the rest of the org in development, and build momentum for the new system.

7) Separate behavior-neutral changes, which change how the code works internally, from behavior-adding changes, which change what the code does. The former should give exactly the same results as the V1 system, pass the entire existing regression test suite, and have exactly the same functionality, except perhaps a latency penalty for shimming and data conversion. The latter deliver the new functionality that constitutes the business goals of V2, which hopefully you established in #3.

8) The project will be a lot more sustainable if you can deliver some of the business goals from #3 before your conversion is complete.

9) Have a latency budget for how much slower the new system can be than the old one. It will be slower than the old one, and don't try to message otherwise. This is why you try to get it used for new functionality (#6) first; these often have fewer users, so inefficiency causes less of a penalty for overall experience.

10) Also, expect bugs and lots of them. This is the other reason to get it used for new code (#6) before migrating over critical core functionality; it lets you smooth out the kinks before you cause career- or business-ending failures.

11) If v2 has a different data backend from v1, you will need a dual-write layer, because there's going to be a period of time when both backends are live. Consistency-check both backends against each other to catch bugs before trying to switch fully over to v2.
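A dual-write layer plus consistency check can be sketched like this, with plain dicts standing in for the two backends (all names invented; a real layer would also handle partial-write failures):

```python
# Hedged sketch of a dual-write layer: every write goes to both
# backends, and a checker diffs them so divergence is caught before
# the full cutover to V2.

class DualWriter:
    def __init__(self, primary, shadow):
        self.primary, self.shadow = primary, shadow

    def put(self, key, value):
        self.primary[key] = value  # V1 stays the source of truth
        self.shadow[key] = value   # V2 receives the same write

def consistency_diff(primary, shadow):
    """Keys whose values differ (or are missing) between backends."""
    keys = set(primary) | set(shadow)
    return {k for k in keys if primary.get(k) != shadow.get(k)}

v1_db, v2_db = {}, {}
writer = DualWriter(v1_db, v2_db)
writer.put("user:1", {"name": "Ada"})
drift = consistency_diff(v1_db, v2_db)  # empty set while in sync
```

Running the diff continuously (or on samples) is what turns dual-writing from hope into evidence that V2 is safe to promote.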

12) Test your migration scripts, and treat them with the same care you treat production code. A one-character error in the migration script that moved GMail over to BigTable ended up deleting 10% of GMail accounts and necessitating a tape-backup restore that knocked out people's GMail for a week. If Google can screw it up, you can too.

13) When it comes to migrating over the long-tail of functionality, enlist the rest of the org's help, have a bunch of whole-company Fixits, and put lots of people on it. By the time you do this, everything in v2 should be stable, people should know the dragons in the new system, and migration should be pretty straightforward. But this work tends to burn engineers out, because it's boring and has virtually no real benefit other than being able to get rid of V1.

14) Expect this project to suck, to take about 5x more man-hours than you expect, and to face cancellation at numerous intervals. You should not be embarking on this unless business leaders are sure that you really need to, and you have their complete buy-in.

Good luck.