How to Build a Flexible Customer Support Platform with Kotlin

January 24, 2023

Kishore Guruswamy

Kishore is an Engineering Manager at DoorDash, leading customer experience platform engineering teams.

Han Huang

Han Huang has been a Software Engineer at DoorDash since June 2021, working on the customer experience platform with a focus on the credits and refunds platform.

Redesigning the customer support application

Our legacy customer support application used a configurator, our name for the web-based tool that lets customer service agents associate a customer issue with a corresponding resolution strategy. The criteria for selecting resolution strategies were defined in the code. With a new customer support platform, we expected to provide fast, accurate resolutions at relatively low cost. We narrowed the next step for the upgrade down to two possible approaches.

Our first approach would involve continuing to leverage the legacy architecture. Because our engineers developed the legacy application and have a thorough knowledge of it, composing new resolution strategies for the legacy system would be straightforward, streamlining the development work. The tradeoff inherent to this approach, however, is that we would have to rely heavily on the engineers to make continuous code changes to support the customer support team as they adjusted resolution strategies and ran optimization experiments.

Alternatively, we could move the code-defined resolution strategies and experimentation capabilities outside the codebase to make them configurable by non-engineers. A configuration-driven, no-code solution would reduce the reliance on engineers and enable our operators to move faster because they could translate resolution strategies into configurations in a “what you see is what you get” manner. To ensure that this solution would scale with our future business needs, we decided to use a decision tree configuration. This would remove ambiguity because each path in a decision tree can represent a unique resolution strategy. When a new strategy is introduced, the configuration can easily be extended by adding a new branch to the tree. The downside of this approach is that it requires more up-front investment in resources than simply extending the code. We would also need to train operators on how to define strategies and experiments using the self-serve configurator.

Ultimately, we decided that the second approach’s benefits outweighed its drawbacks. Consequently, we elected to redesign the basic customer support application into a configurable no-code platform that can support fast changes and experimentation.

Migrating the system from Python to Kotlin

One option was to implement the "componentization" of a credit and refund strategy in the legacy Python codebase at the same time we worked to spin up a new Kotlin service. In other words, we could continue to improve the existing application to meet current business needs (a relatively low-cost action) and simultaneously tackle system migration as a separate effort. A fast-growing business requires the fast delivery of technical solutions. Of course, the tradeoff is that failing to address the underlying technical issue means we would continue to build code that adds to our technology debt, not to mention that we eventually would need to deprecate the code. But tackling a large migration effort in a fast-growing environment creates a risk of significant business disruption.

An alternative option would be to stop building new code in our existing Python application to focus exclusively on spinning up our new Kotlin service. As new business requests arose, we could implement those requests in the Kotlin service. This approach would offer the advantage of not building tech debt while steadily migrating code without significant disruption to the business. A key drawback: We would have a hybrid state with both the legacy system and the new system involved. Both systems would have to be maintained and monitored for a longer period of time.

However, one additional factor tipped the balance in favor of Kotlin. DoorDash Engineering’s decision to use Kotlin as its microservices programming language meant that our new service would be operating fully inside DoorDash’s tech ecosystem and infrastructure.

We chose that second approach and created a new customer support platform in Kotlin while gradually migrating the legacy application over. We believe this path provides the best chance for initiating and completing the systems migration without significantly disrupting our business.

Making the credit and refund strategies configurable

After we created a new Kotlin service, we defined gRPC endpoints to create and read a credit and refund strategy. Our biggest redesign effort revolved around implementing a credit and refund configurator to allow operators to create credit and refund strategy decision trees (see Figure 1). The configurator required building a visual editor to arrange credit and refund decision trees using a drag-and-drop mechanism and creating APIs to store and fetch configuration data for the trees. In addition to the visual editor, we needed a framework to parse the configuration data and execute the actions that the tree specified. Client services needed an API to invoke this execution framework. But building these capabilities from scratch would be time- and cost-prohibitive.

*Figure 1: The credit and refund configurator allows operators to configure credit and refund strategies based on a decision tree*

Fortunately, we already had a homegrown decision tree-based configuration platform to configure business and technical flows without code. We leveraged this existing workflow platform to store and fetch configuration data for credit and refund strategies. To help operators define those strategies, we also added special types of nodes that could only be understood by the credit and refund platform. For example, the is_vertical_id_in_list node in Figure 1 checks the business vertical. It tells the next node if the order is, for example, a restaurant order, a grocery order, an alcohol order, or a pharmacy order. Based on the output of the is_vertical_id_in_list node, there would be different credit and refund strategies.
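To make the idea of a strategy tree concrete, here is a minimal Kotlin sketch of how a condition node such as is_vertical_id_in_list might route an order to a resolution strategy. It is an illustration under stated assumptions, not the workflow platform's actual implementation: the `Order`, `Node`, `Condition`, and `Resolution` types, the vertical ids, and the strategy names are all hypothetical.

```kotlin
// Hypothetical order model; the real platform's order fields are not shown in this article.
data class Order(val verticalId: Int)

// A tree node is either a condition with two branches or a terminal resolution strategy.
sealed interface Node

data class Resolution(val name: String) : Node

data class Condition(
    val name: String,
    val predicate: (Order) -> Boolean,
    val ifTrue: Node,
    val ifFalse: Node,
) : Node

// Builds a condition modeled on the is_vertical_id_in_list node from Figure 1.
fun isVerticalIdInList(verticalIds: Set<Int>, ifTrue: Node, ifFalse: Node): Condition =
    Condition("is_vertical_id_in_list", { it.verticalId in verticalIds }, ifTrue, ifFalse)

// Walks from the root to a leaf; the path taken identifies exactly one strategy.
fun evaluate(node: Node, order: Order): Resolution = when (node) {
    is Resolution -> node
    is Condition -> evaluate(if (node.predicate(order)) node.ifTrue else node.ifFalse, order)
}

// Assumed vertical ids, chosen only for this example.
val RESTAURANT = 1
val GROCERY = 2
val ALCOHOL = 3

// Example tree: restaurant orders get one strategy, grocery and alcohol another.
val exampleTree: Node = isVerticalIdInList(
    verticalIds = setOf(RESTAURANT),
    ifTrue = Resolution("restaurant_strategy"),
    ifFalse = isVerticalIdInList(
        verticalIds = setOf(GROCERY, ALCOHOL),
        ifTrue = Resolution("retail_strategy"),
        ifFalse = Resolution("default_strategy"),
    ),
)
```

Calling `evaluate(exampleTree, Order(verticalId = ALCOHOL))` follows the false branch of the first node and the true branch of the second. Supporting a new vertical in this sketch means adding a branch rather than changing the evaluation code, which is the property that makes a tree configuration easy to extend.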

We had an experimentation platform at DoorDash, but experiments needed to be hard-coded by engineers. To save engineering time, we enhanced the workflow platform to configure an experiment without code. We added a new type of node, select_control_or_x_treatment (x is the number of treatment groups; see Figure 1), that allows operators to name an experiment. If a select_control_or_x_treatment node is configured as part of a decision tree, the workflow platform will leverage the APIs provided by the experimentation platform to execute the experiment and take the treatment or control path based on the results.
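The following Kotlin sketch shows one way a select_control_or_x_treatment node could choose a path. It rests on assumptions: `ExperimentClient` is a hypothetical stand-in for the experimentation platform's API (whose real interface is not described here), `hashingClient` is a fake for local runs, and the experiment and path names are invented for the example.

```kotlin
// Hypothetical stand-in for the experimentation platform's API.
fun interface ExperimentClient {
    // Returns 0 for control, or 1..treatmentCount for a treatment group.
    fun assignBucket(experimentName: String, subjectId: String, treatmentCount: Int): Int
}

// Deterministic fake for local runs: hashes the experiment and subject into a bucket.
val hashingClient = ExperimentClient { experimentName, subjectId, treatmentCount ->
    Math.floorMod("$experimentName:$subjectId".hashCode(), treatmentCount + 1)
}

// Models a select_control_or_x_treatment node: one control path plus x treatment paths.
data class ExperimentNode<T>(
    val experimentName: String,
    val control: T,
    val treatments: List<T>,
) {
    init {
        require(treatments.isNotEmpty()) { "an experiment needs at least one treatment path" }
    }

    // Asks the client for a bucket and returns the matching downstream path.
    fun select(client: ExperimentClient, subjectId: String): T {
        val bucket = client.assignBucket(experimentName, subjectId, treatments.size)
        check(bucket in 0..treatments.size) { "bucket $bucket is out of range" }
        return if (bucket == 0) control else treatments[bucket - 1]
    }
}

// Example node with two treatment groups (x = 2); all names are illustrative.
val refundExperiment = ExperimentNode(
    experimentName = "refund_amount_test",
    control = "full_refund",
    treatments = listOf("partial_refund_plus_credit", "credit_only"),
)
```

Because the node only holds an experiment name and a list of paths, an operator-facing configuration would need to supply nothing more than those values; the bucket assignment itself stays behind the client interface.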

Exploring the technical architecture

At this stage, we were ready to put everything together.

To orchestrate credits and refunds, the Python application routes traffic to the Kotlin application to determine strategies. Control then returns to the Python application. The architecture behind the customer support platform, as shown in Figure 2, highlights how the Python and Kotlin systems work together to issue credits and refunds. This architecture transformed the way we tested and experimented with customer support resolution strategies.

*Figure 2: The architecture behind the customer support platform highlights how the Python and Kotlin systems work together to issue credits and refunds*

Conclusion

After the redesigned system was rolled out, we saw significant improvement in how quickly our operators could respond to customer problems and define, test, experiment with, and roll out credit and refund resolution strategies. As the behavior of the system changed, new challenges cropped up, including a need for transparency about changes made to the system’s configuration. We also needed more system guardrails because we required stricter validation of the configuration data before it rolled into production. As our configuration-based decision-making system evolves, we are discovering new requirements, including a need for automated testing of resolution strategies to prevent production regressions caused by inaccurate strategies.

Migrating systems from one technical stack to another is a complex endeavor. There is a natural temptation to redesign the system as part of the migration to eliminate technology debt and introduce best practices. When we did both at once, we kept an eye on maintaining functional feature parity between the two systems. After the technical migration was completed, we were able to verify that there were no regressions introduced. Subsequently, we cut traffic over to the new system, allowing newer functional requirements to be applied only to the new system.