Building from the Inside --> Out. How We Refreshed The Core of OfferUp

By Akhil Kazha, on behalf of the Engineering, Product, PMO and Biz Ops teams at OfferUp

“What got you here won’t get you there.”

— Marshall Goldsmith

In September 2019, OfferUp was a top shopping application on Google and Apple platforms and was looking to scale. The business had grown exponentially, powered by early founder code. However, the popularity of the application had brought sustained periods of high growth, and we had reached the limits of our early technical stack. Our next phase demanded significant rethinking to arrive at a cost-effective solution proportional to our business needs.

Our monolithic infrastructure was seeing classic scale problems. The data store was unable to keep up with volume, and costs were growing. System instability due to the lack of partitioning was leading to outages. Things became even more critical as our community members started to experience a slowdown in their feed loads. The application itself was returning errors due to backend issues, and mean time to resolution on issues was high, which started to lead to poor customer satisfaction.

At this moment, the OfferUp leadership made a bold bet to do what very few companies do - they decided to invest in overhauling the entire system and build for the future. This was a major decision that had to be approved by our board, and our leadership team had the courage to stand behind it. #Thinkbig #UseSoundJudgement

“If a body is at rest or moving at a constant speed in a straight line, it will remain at rest or keep moving in a straight line at constant speed unless it is acted upon by a force.”

— Isaac Newton

The following is an account of this entire undertaking.

For change to happen, it needs to be organized, it needs to be understandable, and it needs to help the business. In classic OfferUp fashion, we began working backwards from the customer, #customerobsessed. We planned to grow the scale and scope of our business exponentially, and this effort facilitated that plan. This project allowed us to think about the outcome we desired and create an iterative plan that could be measured at every stage. We organized it through our teams in a fashion that aligned with our business goals.

We considered three options architecturally:

Option 1 - Horizontal partitioning: Breaking the system into layers and approaching it layer by layer.

Option 2 - Vertical partitioning: Breaking the work into subsystems, each being primarily independent.

Option 3 - Hybrid: Begin with the layered platform approach but ultimately switch over to the vertical business line approach.

After much debate, we decided to go with option three, a hybrid approach. We identified three foundational blocks that became the basis of our forward-looking stack: inventory management, user management, and payment/billing management. We also identified a group of people at OfferUp who had the domain knowledge and the horsepower to push this change and believed in this seemingly impossible project.

“All our dreams can come true, if we have the courage to pursue them.”

— Walt Disney

Next, we began by examining the system architecture and determined that the following principles would guide us:

Build extensibility into the system for future APIs
Build for scale for long term business growth
Create clear partitioning between the control plane, the data plane, and the application to allow extensibility and future refactoring
Create business-level service delineation to allow for organizational agility and delivery
Create cost-efficient systems to enable sublinear growth in relation to the business
Drive observability and monitoring into the system to improve mean time to recovery on outages and enable outage prevention
Use async call patterns out of the gate, unless explicitly approved otherwise, to allow for a clean, non-blocking user experience
Make Java the consistent language of choice to enable maintainability

Planning was essential for a change of this scale. Each of our technical leaders went through a multi-month review of their designs on the data plane to ensure that: schemas were defined, scale was planned for, API call patterns were designed to be async out of the gate and RESTful, technology stacks were standardized for cross-service calls, caches were understood, databases were denormalized, foreign key relationships were reconciled, data flow was understood, service boundaries were defined, role-based security was built in, observability through logging was built in out of the gate, and cross-cloud services were considered as a future possibility.

“You can't grow long-term if you can't eat short-term. Anybody can manage short. Anybody can manage long. Balancing those two things is what management is.”

— Jack Welch

We learned three things during our planning:

Our initial estimates were too low; we needed more time.
Our costs would go up before they went down. We needed to run the old and new systems side by side on the data plane to ensure our customers didn’t suffer any negative impacts.
The more we dug in, the more we realized that there were many more unknowns due to how the codebase had organically evolved. We would always need to manage the risk of the unknown.

To mitigate risk where possible, we decided to break the project into the following three self-contained phases.

1. Data plane refactoring: We decided to go from a SQL-based system to a NoSQL approach for this first phase. Furthermore, we would need to build an event-based pub/sub pattern for multiple listeners. The ingress path would need to include queues to deal with scale and replays, with an async API pattern. This choice required us to invest in materialized views by scenario rather than rely on a traditional SQL-based approach, which offered more flexibility but at a cost. Our patterns had to be simplified and broken into specific access patterns. This led to adopting a data plane with the simplest operations: Create, Upsert, and Delete, implemented as microservices. While our original plan only covered three key areas, we decided to include our Integrity systems as well. We also invested in a dual sync strategy to keep the old and new systems in sync as we transitioned from one backend system to another.
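To make the shape of such a data plane concrete, here is a minimal Java sketch of a store limited to create, upsert, and delete that publishes every change to multiple listeners, one of which maintains a materialized view for a single access pattern. It is an illustration under assumptions, not OfferUp's production code: the names `ItemStore`, `Topic`, `ItemEvent`, and `ItemsBySellerView` are hypothetical, everything runs in memory, and the queues and replays mentioned above are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

final class DataPlaneSketch {
    enum Op { CREATE, UPSERT, DELETE }

    // Hypothetical change event; the fields are illustrative only.
    static final class ItemEvent {
        final Op op; final String itemId; final String sellerId; final String title;
        ItemEvent(Op op, String itemId, String sellerId, String title) {
            this.op = op; this.itemId = itemId; this.sellerId = sellerId; this.title = title;
        }
    }

    // In-memory stand-in for a pub/sub topic that fans out to every listener.
    static final class Topic {
        private final List<Consumer<ItemEvent>> listeners = new ArrayList<>();
        void subscribe(Consumer<ItemEvent> listener) { listeners.add(listener); }
        void publish(ItemEvent event) { listeners.forEach(l -> l.accept(event)); }
    }

    // Store that exposes only create, upsert, and delete, and publishes each change.
    static final class ItemStore {
        private final Map<String, ItemEvent> items = new HashMap<>();
        private final Topic topic;
        ItemStore(Topic topic) { this.topic = topic; }

        boolean create(String itemId, String sellerId, String title) {
            if (items.containsKey(itemId)) return false; // create never overwrites
            applyAndPublish(new ItemEvent(Op.CREATE, itemId, sellerId, title));
            return true;
        }
        void upsert(String itemId, String sellerId, String title) {
            applyAndPublish(new ItemEvent(Op.UPSERT, itemId, sellerId, title));
        }
        boolean delete(String itemId) {
            ItemEvent current = items.get(itemId);
            if (current == null) return false; // nothing to delete, no event
            applyAndPublish(new ItemEvent(Op.DELETE, itemId, current.sellerId, current.title));
            return true;
        }
        private void applyAndPublish(ItemEvent e) {
            if (e.op == Op.DELETE) items.remove(e.itemId); else items.put(e.itemId, e);
            topic.publish(e);
        }
    }

    // Materialized view for one access pattern: a seller's item titles, ordered by item id.
    // Simplification: assumes an item never moves between sellers.
    static final class ItemsBySellerView {
        private final Map<String, Map<String, String>> bySeller = new HashMap<>();
        void onEvent(ItemEvent e) {
            Map<String, String> titles = bySeller.computeIfAbsent(e.sellerId, k -> new TreeMap<>());
            if (e.op == Op.DELETE) titles.remove(e.itemId); else titles.put(e.itemId, e.title);
        }
        List<String> titlesFor(String sellerId) {
            Map<String, String> titles = bySeller.getOrDefault(sellerId, Collections.emptyMap());
            return new ArrayList<>(titles.values());
        }
    }

    public static void main(String[] args) {
        Topic topic = new Topic();
        ItemsBySellerView view = new ItemsBySellerView();
        List<ItemEvent> warehouseFeed = new ArrayList<>(); // second, independent listener
        topic.subscribe(view::onEvent);
        topic.subscribe(warehouseFeed::add);

        ItemStore store = new ItemStore(topic);
        store.create("i1", "s1", "Bike");
        store.create("i2", "s1", "Lamp");
        store.upsert("i1", "s1", "Red bike");
        store.delete("i2");
        // prints: [Red bike] after 4 events
        System.out.println(view.titlesFor("s1") + " after " + warehouseFeed.size() + " events");
    }
}
```

The trade-off described above is visible here: the view answers exactly one question cheaply, and every new read pattern needs its own listener and its own view.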

2. Control plane repivoting: This phase involved building new layers to satisfy business needs (i.e., our B2C business layers), repivoting existing layers where it made sense to reuse the same logic but moving it out of the monolithic architecture and rewriting code in Java to drive consistency. It involved re-establishing the API and data contracts for services outside of the monolith such that the forward-looking patterns could be realized. This was the most complex part of our endeavor. Our goals in this phase included (not an exhaustive list):

Moving from sync patterns to async patterns to enable predictable customer outcomes
Using role-based auth to ensure we could secure our edges based on users
Ensuring that we could keep our old application running at a high quality while we rebuilt the internals to switch the backend, middle-tier, and front end.
Breaking bad patterns where data access was directly going to the database versus using contracts so that we could get to an auditable architecture
Clarifying what went to the data warehouse and in what format to drive consistency and completeness in reporting
Defining all the internal cross service contracts
Building the new aggregation layers in our GQL logic to enable a BFF model with the client

Once we defined this, we had to move our entire ecosystem to the underlying data plane.

After refactoring, this represented over 70 microservices covering: the Autos ingestion pipeline, the Ads pipeline, the Search pipeline, the Admin application, the third-party API, billing systems, shipping systems, integrity services, trust and safety services, data science pipelines, data platforms for aggregation and personalization, the third-party systems we depended upon (over a dozen vendors), chat/notifications pipelines, and engagement pipelines.

3. Application and data warehousing: We decided to build ourselves a future-proofed solution for this last phase. We invested in a middle-tier BFF using GraphQL to insulate our application churn from the back-end churn. This layer also covered authentication, aggregation, materialization, and caching. This work let us pivot our backend faster over time as we evolved the UI. Our investment in event-driven pipelines allowed our data warehouse to subscribe to data across all of our services through well-defined contracts, rather than relying on specialized flows driven by unpredictable jobs. We also invested in a data platform that could aggregate events from various sources using Apache Beam to drive near-real-time decisions in our ML pipelines.
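The aggregation role of a BFF layer, combined with the async-first principle, can be sketched in plain Java. This is a simplified stand-in and not a GraphQL resolver: `ProfileService`, `ListingService`, and `sellerSummary` are hypothetical names, and the two backends are stubbed with `CompletableFuture.supplyAsync`. The point is that the two calls run concurrently and the client receives one combined answer.

```java
import java.util.concurrent.CompletableFuture;

final class BffAggregationSketch {
    // Hypothetical async contracts for two independent backend services.
    interface ProfileService { CompletableFuture<String> displayName(String userId); }
    interface ListingService { CompletableFuture<Integer> activeListingCount(String userId); }

    // Starts both calls without blocking and combines the results when both complete.
    static CompletableFuture<String> sellerSummary(
            String userId, ProfileService profiles, ListingService listings) {
        return profiles.displayName(userId)
                .thenCombine(listings.activeListingCount(userId),
                        (name, count) -> name + " (" + count + " active listings)");
    }

    public static void main(String[] args) {
        // Stub services with made-up data; real services would make network calls here.
        ProfileService profiles = id -> CompletableFuture.supplyAsync(() -> "Sam");
        ListingService listings = id -> CompletableFuture.supplyAsync(() -> 3);

        // join() blocks only at the very edge of this demo program.
        System.out.println(sellerSummary("u1", profiles, listings).join());
    }
}
```

Because the client only sees `sellerSummary`, either backend can be replaced without changing the client contract, which is the decoupling the BFF layer is meant to provide.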

This three-phase approach allowed us to bring people on to the project in step functions and translate lessons learned from one phase to another. This process was essential for reducing risk and enabled us to deploy resources to other projects with well-defined checkpoints at each stage. The core project management team comprised three people, and the leadership team that drove the entire project was under 20 people. By the time we were finished, over 180 people had worked on the project.

“The more you know, the more you know you don’t know.”

— Aristotle

Our first checkpoint also aligned with the business release of our new merchant pipeline. We were five months into the endeavor. This checkpoint was a critical learning moment for all of us.

We realized several things:

1. Our ingest pipelines had ill-defined policies for scale.

2. Our materialized views did not fully replace our previous read patterns.

3. Our internal and external scenarios diverged significantly, and there were several shared caches and data access patterns that were not codified in our contracts.

4. Our side-by-side systems had to be managed carefully since the old system was at the end of its storage capacity (we ran out of indexes or created outages because of too much data), and costs were rising.

The good news was that as we fixed these issues, we could deliver our business goal. However, this caused us to go back to the drawing board on our approach to the next phase. We delayed the start of phase 2 to drive observability and resilience to the above issues. We also implemented a dual write strategy more comprehensively and focused on mocks to help parallelization. Next, we had to expand the definition of phase two to include internal customers and our admin application. Doing so would allow us to understand the patterns and reduce risk as we migrated off the monolith.
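A dual write strategy can be sketched as a thin wrapper that sends every write to both the old and the new store and compares reads between them. The Java below is a minimal illustration under assumptions: `KeyValueStore`, `InMemoryStore`, and `DualWriteStore` are hypothetical names, the legacy store is treated as the source of truth during migration, and failure handling (retries, partial writes) is deliberately omitted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class DualWriteSketch {
    interface KeyValueStore {
        void put(String key, String value);
        String get(String key);
        void delete(String key);
    }

    static final class InMemoryStore implements KeyValueStore {
        private final Map<String, String> data = new HashMap<>();
        public void put(String key, String value) { data.put(key, value); }
        public String get(String key) { return data.get(key); }
        public void delete(String key) { data.remove(key); }
    }

    // Writes go to both stores; reads are served by the legacy store and
    // compared against the new store so that drift is counted, not hidden.
    static final class DualWriteStore implements KeyValueStore {
        private final KeyValueStore legacy;
        private final KeyValueStore replacement;
        private int mismatches;

        DualWriteStore(KeyValueStore legacy, KeyValueStore replacement) {
            this.legacy = legacy;
            this.replacement = replacement;
        }
        public void put(String key, String value) {
            legacy.put(key, value);
            replacement.put(key, value);
        }
        public void delete(String key) {
            legacy.delete(key);
            replacement.delete(key);
        }
        public String get(String key) {
            String expected = legacy.get(key);
            if (!Objects.equals(expected, replacement.get(key))) mismatches++;
            return expected;
        }
        int mismatches() { return mismatches; }
    }

    public static void main(String[] args) {
        InMemoryStore legacy = new InMemoryStore();
        InMemoryStore replacement = new InMemoryStore();
        DualWriteStore store = new DualWriteStore(legacy, replacement);

        store.put("item-1", "Bike");
        legacy.put("item-2", "Lamp"); // simulates a write path that bypasses the wrapper
        store.get("item-1");
        store.get("item-2");
        System.out.println("mismatches: " + store.mismatches()); // prints: mismatches: 1
    }
}
```

The mismatch counter is the useful part: it surfaces write paths that bypass the contract, such as shared caches or direct database access.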

Taking this approach meant our second phase was now held hostage by external dependencies. The best way to proceed and stay within budget was to collapse phases 2 and 3 and adapt our strategy to embrace all customers into the effort.

We ultimately ended up switching from a horizontal to a vertical approach. A challenge of this complexity involved 180 engineers across 120 services, a public SDK, and 4 user interfaces. Dependency management became another critical problem to solve. We leaned on our PMO team, who helped us better manage our time and all of our dependencies. We cut back on large meetings and capped all meetings at 30 minutes. Our PMO team then charted our course, helping to track and manage our success in increments. They worked team by team, dependency by dependency, deliverable by deliverable, and risk by risk for the next eight months.

We also implemented a system of checks and balances on defining completeness so that we didn’t leave any of our customers behind.

We measured this using:

1. API contracts that the applications depended upon and that the SDK exposed

2. Telemetry to track traffic flow ingress and egress in the system at every layer to the stores.

3. Cost modeling to ensure that as items were moved, costs started to reflect the deltas.

The execution pattern here was consistent across teams: each body of code had to move from Python to Java. In the process, each was broken down into independent services. We took the read/write patterns apart to facilitate scale. Dependencies on existing old services were switched to the newly built data plane. Unit testing became the norm to drive quality upstream. One of our critical challenges was that APIs were not ready because we had parallelized development of the application and the control plane. To overcome this, we started to build mocks with test data. The goal was to get the application layer functional and integrate as the services lit up. The obvious risk was integration testing when the real APIs lit up.
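The mock-first pattern can be sketched as follows: the application layer depends only on an interface, and a mock with canned test data stands in until the real service exists. The names here (`ListingApi`, `MockListingApi`, `FeedPresenter`) and the sample data are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class MockApiSketch {
    static final class Listing {
        final String id; final String title; final long priceCents;
        Listing(String id, String title, long priceCents) {
            this.id = id; this.title = title; this.priceCents = priceCents;
        }
    }

    // The contract both the mock and the eventual real service must satisfy.
    interface ListingApi {
        List<Listing> recentListings(int limit);
    }

    // Mock backed by fixed test data, usable before the real API is ready.
    static final class MockListingApi implements ListingApi {
        private final List<Listing> canned = Arrays.asList(
                new Listing("1", "Road bike", 15000),
                new Listing("2", "Desk lamp", 1250),
                new Listing("3", "Sofa", 20000));

        public List<Listing> recentListings(int limit) {
            int n = Math.min(Math.max(limit, 0), canned.size());
            return new ArrayList<>(canned.subList(0, n));
        }
    }

    // Application-layer code written against the contract, not an implementation.
    static final class FeedPresenter {
        private final ListingApi api;
        FeedPresenter(ListingApi api) { this.api = api; }

        List<String> feedLines(int limit) {
            List<String> lines = new ArrayList<>();
            for (Listing l : api.recentListings(limit)) {
                lines.add(String.format(Locale.ROOT, "%s - $%d.%02d",
                        l.title, l.priceCents / 100, l.priceCents % 100));
            }
            return lines;
        }
    }

    public static void main(String[] args) {
        FeedPresenter presenter = new FeedPresenter(new MockListingApi());
        // prints: [Road bike - $150.00, Desk lamp - $12.50]
        System.out.println(presenter.feedLines(2));
    }
}
```

Swapping in the real implementation later changes only the constructor argument. The mock encodes assumptions about the contract, so any mismatch only shows up at integration time.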

Along the way, we also uncovered cases where teams were behind or unaware of the overall change. Load balancing became necessary to assist the progress. People across the company stepped up to the challenge. Teams that were ahead helped teams that were behind; people from business operations stepped up to help migrate our data warehouse queries; and the NOC team increased their vigilance to ensure the quality of the external product stayed high through all this churn. The next four months were intense as teams stepped up to the plate and landed their parts one after the other. At the end of this cycle, we had a tested, stable codebase in production that had yet to take any significant traffic. Once each team had crossed this line, we began the final act. #PursueExcellenceTogether #DisagreeAndCommit

“The Monolith is dead, long live microservices.”

-Source Unknown

With their front-end counterparts, each service team gradually ramped up traffic a percent at a time to 100%. They held that pattern and then passed the baton to the next team. We planned the work so that the front-end teams at the GQL layer changed the routing to call the new API. If we saw any risk in either the technical or business metrics, we would roll back. The overall migration process lasted 3.5 weeks, and we were able to successfully bring in 68 microservices taking in billions of transactions daily with minimal customer impact #PursueExcellenceTogether #customerobsessed. One of the advantages of the approach we took was that when we had a problem, it only lasted a few minutes. We were able to roll back, deploy the fix, and then immediately roll forward.
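A percentage-based ramp with instant rollback can be sketched as a small router. This is an illustration under assumptions rather than the routing logic actually used at the GQL layer: the `Router` class is hypothetical, and bucketing by `String.hashCode()` is one simple way to keep a given key on the same backend as the percentage grows.

```java
final class TrafficRampSketch {
    static final class Router {
        private volatile int percentToNew; // 0..100, share of traffic sent to the new backend

        void setPercent(int percent) {
            if (percent < 0 || percent > 100) {
                throw new IllegalArgumentException("percent must be between 0 and 100");
            }
            percentToNew = percent;
        }

        // Rolling back is a single assignment, so it takes effect on the next request.
        void rollBack() { percentToNew = 0; }

        int percent() { return percentToNew; }

        // Each key maps to a fixed bucket in [0, 100). A key routed to the new
        // backend at some percentage stays there at every higher percentage.
        boolean routeToNew(String requestKey) {
            int bucket = Math.floorMod(requestKey.hashCode(), 100);
            return bucket < percentToNew;
        }
    }

    public static void main(String[] args) {
        Router router = new Router();
        int[] steps = {1, 5, 25, 50, 100}; // illustrative ramp schedule
        for (int step : steps) {
            router.setPercent(step);
            int routed = 0;
            for (int i = 0; i < 10_000; i++) {
                if (router.routeToNew("user-" + i)) routed++;
            }
            System.out.println(step + "% configured, " + routed + " of 10000 sample keys on new backend");
        }
        router.rollBack();
        System.out.println("after rollback: " + router.percent() + "%");
    }
}
```

In practice, a health check on technical and business metrics would gate each call to `setPercent` and trigger `rollBack`. That check is left out here.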

At the end of the 3.5 weeks, we did a dependency-based rollout of services. We had two side-by-side systems, one with minimal traffic and one with the full traffic load. We then spent one additional week looking for any outliers or missed entities. We fixed the few we found over the following two weeks. From there, we started to deprecate the infrastructure that backed our old service stack. This was orchestrated in reverse order: top-level services were taken down first, including their machines, caches, routes, and dual writes. It was repeated layer after layer until nothing stood. After fifteen months of blood, sweat, and tears, we had finally vanquished the beast!

The gains we realized are, and will continue to be, significant. We saw millions of dollars in savings from our horizontally scalable systems and performance gains of 50-80% in the microservices. Better diagnosability allowed us to reduce MTTR to under 9 minutes, and outages themselves fell by 80%. We were also able to reduce the need for tribal knowledge and consolidate our infrastructure. We moved to a common programming language, Java, and exponentially increased our agility. Our internal and external systems are now built on well-defined contracts and clear tiering in the services. This dramatically improved the extensibility and maintainability of our software services. It allowed 14 teams to execute in parallel on a common code base iteratively, with faster time to market. We went from months to days in terms of feature delivery.

Looking back, we learned many lessons related to planning and execution that will live with most of us for a lifetime. This team achieved a monumental milestone while delivering on many fronts to help us dream and execute on a bigger future.

“Impossible is just a big word thrown around by small men who find it easier to live in the world they've been given than to explore the power they have to change it. Impossible is not a fact. It's an opinion. Impossible is not a declaration.”

-Muhammad Ali

Timelines:

December 2019: The idea is pitched and approved by leadership.
January 2020: Work begins
April 2020: Core data plane delivered
September 2020: Core control plane comes together
December 2020: The application plane comes together
March 2021: Production software traffic migration
May 2021: Monolith is officially retired.