Supercharging Conversational Design
As a Designer who loves languages and writing Interactive Fiction, I’ve had the pleasure of doing Conversational UX Design (aka CUX or CxD) at Moment, Chime, and Verizon (reach out if you’d like to hear more). I’ve worked on Chatbots, Virtual Assistants, FAQ Docs, and IVR systems – from researching users and their use cases, to delivering conversational flows and interfaces. I’m no stranger to platforms like Slack, Siri, Apple’s Messages for Business, and Facebook Messenger. Classical tools like Dialogflow (ES and CX), RASA, IBM Watson, and Twilio all look familiar to me.
However, in 2024, relying on these classical tools and processes alone is not enough to create amazing Conversational experiences. Whilst it is still important to be able to tackle pain points one user journey at a time, most companies group their Conversational Designers into a single Conversational Design team, which makes them responsible for the quality of thousands of evolving user journeys. Those thousands of potentially unaddressed journeys are enough to stress any designer out. The following approach is based on years of experience in the Conversational Design space and strives to improve the user experience of those thousands of other user journeys.
In general, the following approach uses advanced Information-Gathering methods and Generative AI to drastically improve the speed and output of Conversational Designers. This approach does not necessarily mean Generative AI is used in the final product, as many large companies may have regulatory considerations that prevent that from being a possibility.
i. Response Generators – Picking the right ones
There are different types of response generators that Conversational Designers can design for. The types of response generators that are available depend on the technology the company has resources for, as well as a product’s required level of regulatory approval.
Below is an example of a Public Transit chatbot and the responses different generators might produce for the same question.
A. Live Response Generator
B. Categorical Response Generator
C. Summarized Search Generator
D. Plain Search Generator
Each generator has different implications regarding UX and is compatible with different levels of regulatory approval.
A Conversational AI system should not consist of only one type of response generator. In the above scenario, “A. Live Response” offers the best UX, but this is not always the case. In an emotional emergency scenario (e.g. “What’s the latest on the train that crashed? I’m worried about my Mom.”), it is unlikely that A or B will have up-to-date information; only C and D will. Meanwhile, given that it is an emotional situation where responding with the incorrect tone is dangerous, the Conversational Designer might specify that D should be used in such cases rather than C.
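To make this concrete, here is a minimal sketch of what routing a query to a generator type could look like. The keyword heuristic is a hypothetical stand-in for a real intent/sentiment classifier; the generator labels mirror the A–D types above.

```python
# Sketch: routing a user query to a response generator type.
# The keyword lists are illustrative assumptions, not a real classifier.

def route_query(query: str) -> str:
    """Pick a response generator type for a query (illustrative only)."""
    q = query.lower()
    emotional = any(w in q for w in ("worried", "scared", "crash", "emergency"))
    needs_fresh_info = any(w in q for w in ("latest", "today", "right now"))
    if emotional:
        # Wrong tone is dangerous here: fall back to plain search results.
        return "D. Plain Search"
    if needs_fresh_info:
        return "C. Summarized Search"
    return "A. Live Response"

print(route_query("What's the latest on the train that crashed? I'm worried about my Mom."))
# -> D. Plain Search
```

In production, the routing decision would come from a trained classifier rather than keywords, but the Conversational Designer still owns the policy of which generator handles which situation.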
ii. Gathering All Available Information
Complete Up-to-Date Company Information
Traditional Conversational Design work often involves manually consolidating internal and public information and trying to keep that information up to date. Automating this process would save Conversational Designers a lot of time when researching information, and also allow Conversational Designers to utilize Generative AI to assist them in their design processes (particularly important when designing for “B. Categorical Responses”). A repository of such information is also a prerequisite for most response generators (A, C, & D). Whilst acquiring complete and up-to-date information is a technical endeavor, it has large benefits, and it can start as something as simple as an automated periodic crawler that indexes the company’s public webpages.
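As a sketch of that first building block, the snippet below extracts same-site links from a fetched page using only the standard library. Fetching (e.g. urllib.request) and periodic scheduling (cron or a task queue) are omitted, and the page HTML is a hypothetical example.

```python
# Sketch: extract same-domain links from a crawled page (stdlib only).
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        url = urljoin(self.base_url, href)
        # Stay on the company's own domain.
        if urlparse(url).netloc == urlparse(self.base_url).netloc:
            self.links.add(url)

page = '<a href="/fares">Fares</a> <a href="https://other.example/x">Off-site</a>'
parser = LinkExtractor("https://transit.example/")
parser.feed(page)
print(sorted(parser.links))  # -> ['https://transit.example/fares']
```

From here, each crawled page would be stored with its source label (public vs. internal) and a crawl timestamp, so freshness can be audited later.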
Regulatory Notes:
It is important to label which pieces of information come from a public, regulator-approved source, versus ones that are internal and have not been proofread.
It is important to publish internal information into the public domain when possible, to ensure that the information passes all required checks. This reduces the number of checks needed on downstream products like conversational experiences.
User Queries (Desires & Pain Points)
With Transcripts
In Conversational Design, we often get access to thousands of transcripts which are an amazing source of insight. This is perfect for performing big data analysis that allows us to improve conversational experiences as a whole.
Without Transcripts
Unfortunately, transcripts may not always be available or may take a long time to get approved due to sensible privacy measures. In such cases, desk research (e.g. browsing reddit for complaints about the company), and specific Conversational UX Design exercises can be used to generate a large amount of hypothetical user queries. I’ve personally found card-prompted play-acting workshops to be particularly effective here.
iii. Clustering All User Queries
To keep things simple, the following examples are based on the first query a user has in a conversation. This can be extrapolated to cover the entire user conversation.
With Transcripts
Example of clusters from PrimerAI.
The past decade has introduced wonderful new techniques like transformer-based encoder-decoder (ML) models which can be used to compute the embeddings of each user query. The most famous example of an embedding is “king - man + woman = queen”. This is a word embedding, however sentences (user queries) can also have embeddings which can be visualized in 2D or 3D space. Sentence embeddings can be clustered automatically and further tweaked by Conversational Designers or stakeholders. (Note: If designing for “B. Categorical Responses”, then it would be ideal if that response generator’s clustering (aka categorization) algorithm was also available for visualization.)
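To illustrate the idea without any ML dependencies, the toy sketch below clusters queries by embedding similarity. A real system would use transformer sentence embeddings (e.g. from a sentence-encoder model); here a bag-of-words vector and cosine similarity stand in, and the similarity threshold is an arbitrary assumption.

```python
# Toy illustration: cluster queries by embedding similarity.
# Bag-of-words vectors stand in for real sentence embeddings.
import math
from collections import Counter

def embed(sentence: str) -> Counter:
    return Counter(sentence.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

queries = [
    "when is the next train to downtown",
    "next train to downtown please",
    "how much does a monthly pass cost",
    "monthly pass cost for students",
]

# Greedy clustering: attach each query to the first cluster whose
# seed query is similar enough, otherwise start a new cluster.
clusters: list[list[str]] = []
for q in queries:
    for cluster in clusters:
        if cosine(embed(q), embed(cluster[0])) > 0.4:
            cluster.append(q)
            break
    else:
        clusters.append([q])

print(len(clusters))  # -> 2
```

With real embeddings, the same structure holds: compute vectors, group by similarity, then let Designers inspect and rename the resulting clusters.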
Without Transcripts
If our initial set of all available user queries does not come from transcripts (and is therefore quite small), we can group these manually via Affinity Diagramming. This should only serve as a temporary measure until transcripts can be obtained.
iv. Evaluating Existing Performance
Example of Outliers and Clusters via AnalyzeIt.
LLMs can do a surprisingly good job at predicting the performance of a response. Such an “Automatic Performance Scorer” can be easily created with or without the aid of existing data from Customer Satisfaction Scores (CSAT), Net Promoter Scores (NPS), or Customer Effort Scores (CES).
Clusters, in combination with an Automatic Performance Scorer, allow us to do several things:
Spot outliers (edge cases).
Visualize the volume of a certain cluster.
Visualize the performance of each query at a glance.
Visualize the performance of each cluster at a glance.
Report the median and average performance of each cluster.
Visualize the change in performance between one conversational system or conversational design, and another. ★
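A minimal sketch of the aggregation step: given per-query scores from an Automatic Performance Scorer (the 0–1 values below are hypothetical), we can report each cluster's mean and median and flag outlier queries. The outlier threshold is an assumption for illustration.

```python
# Sketch: aggregate hypothetical per-query performance scores by cluster.
from statistics import mean, median

scored = {
    "refunds":   [0.9, 0.85, 0.8, 0.2],   # one poorly-performing outlier
    "schedules": [0.6, 0.55, 0.5, 0.65],
}

for cluster, scores in scored.items():
    med = median(scores)
    # Flag queries far from the cluster's median score.
    outliers = [s for s in scores if abs(s - med) > 0.3]
    print(f"{cluster}: mean={mean(scores):.2f} median={med:.2f} outliers={outliers}")
```

Comparing these per-cluster numbers before and after a design change is what makes the ★ comparison above possible.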
As a result, Conversational Designers can:
Determine any clusters that are worth targeting for dedicated improvement, and if that improvement should be through better content or by splitting it up.
Test out different conversational designs on a specific cluster, before publishing them. (Especially if designing for “B. Categorical Responses”)
Visualize the predicted impact of new conversational designs on the entire conversational experience. (Especially if designing for “A. Live Responses”, or “C. Summarized Search”)
v. Designing and Iterating Responses
Designing for Categorical Response Generators (B)
By having an up-to-date repository of all company information, we have already made the design of categorical responses (aka traditional Conversational Design work) much easier. Additionally, simply feeding this data into an LLM (either through fine-tuning or RAG) can drastically improve productivity. By utilizing Generative AI to assist information lookup and draft responses, Conversational Designers can tackle far more queries, improving the quality of the overall Conversational Experience of a product.
(Note: Because of the speed increase here, Regulatory and Technical departments may also need to be introduced to similar techniques to keep pace. This is so that everyone can help ensure a great Conversational Experience.)
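As a minimal RAG sketch: retrieve the most relevant snippets from the information repository and place them in the prompt. Word-overlap scoring stands in for real embedding retrieval, the documents are hypothetical, and `call_llm` is a placeholder for whichever model endpoint the company uses.

```python
# Minimal RAG sketch: retrieve relevant snippets, then build the prompt.
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    q_words = set(query.lower().split())
    # Word overlap stands in for real embedding similarity.
    return sorted(docs, key=lambda d: -len(q_words & set(d.lower().split())))[:k]

docs = [
    "Monthly passes cost $90 and are valid on all lines.",
    "Lost property can be reclaimed at Central Station, window 4.",
    "Trains run every 10 minutes during peak hours.",
]

question = "how much is a monthly pass"
context = "\n".join(retrieve(question, docs))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# response = call_llm(prompt)  # hypothetical model call, omitted here
print(context.split("\n")[0])
```

The key design property is that the model only answers from retrieved, labeled repository content, which keeps the regulatory provenance of each response traceable.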
Designing for Generative Response Generators (A & C)
Designing generative responses involves designing the actual generator (aka model) itself. Despite the power and internal complexity of these models, this is actually incredibly simple to do in 2024.
We can create a baseline LLM model from the company information gathered earlier. Creating this baseline model can be done with the help of a Developer, prototyped by a Conversational Designer, or by using an off-the-shelf tool.
Once we have a baseline model, Conversational Designers can work on a refined model. This is done by iterating the prompt and examples provided to the model. The number of examples can be none (zero-shot), one (single-shot), or multiple (multi-shot / fine-tuning). By providing examples of ideal conversations, we improve the response’s copy, tone of voice (so it’s aligned with all stakeholders), and overall helpfulness. This is also where comparing the overall performance of different models becomes important (see above). We may also then improve the model by chaining together multiple models (e.g. LangChain).
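The prompt-iteration loop can be sketched as simple string assembly: approved example conversations are prepended to the user's query. The example pair below is hypothetical; passing an empty examples list is the zero-shot case.

```python
# Sketch: build a few-shot prompt from approved example conversations.
def build_prompt(instructions: str, examples: list[tuple[str, str]], query: str) -> str:
    # An empty `examples` list yields a zero-shot prompt.
    shots = "\n\n".join(f"User: {u}\nAssistant: {a}" for u, a in examples)
    return f"{instructions}\n\n{shots}\n\nUser: {query}\nAssistant:"

examples = [
    ("Is my train late?",
     "Let me check that for you right away! Which line are you on?"),
]
prompt = build_prompt("You are a friendly transit assistant.", examples,
                      "Where is bus 42?")
print(prompt.endswith("User: Where is bus 42?\nAssistant:"))  # -> True
```

Because each iteration only changes this prompt text, Designers can A/B different instruction wordings and example sets against the performance scorer without retraining anything.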
The number of examples and the speed with which we want to iterate determine how the refined model is implemented. The way the refined model, and its underlying baseline model, are implemented will impact:
The Business Costs of running the model.
The Time-to-Respond taken by the model.
How fast we can Iterate the model.
Helpful & Scalable Components
One of the primary benefits of a conversational experience is that it can theoretically make all of a company’s functions and information accessible to a user in one convenient interface. This benefit breaks down if many of those functions and pieces of information are inaccessible. Carefully crafting conversational flows for every function takes time. Instead, we can achieve initial coverage by presenting components that other teams have already published on web or app pages.
Components can be thought of as widgets: Tabular Data, Forms, Charts, Authorized API calls, etc. All of these components are already compatible with visual modalities. Gradually, interpretation layers and component libraries can be designed (in conjunction with external teams) to ensure compatibility with other modalities (like text-only and voice-only modalities). In turn, the Conversational Design team can deliver user feedback reports to external teams to improve their own APIs and Products.
Below are different formats a “Lost Property Form” component could take, given different modalities and technological resources:
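One way to sketch this is a modality-agnostic component definition with one renderer per modality. The form's fields below are hypothetical placeholders.

```python
# Sketch: one component definition, rendered for different modalities.
lost_property_form = {
    "title": "Lost Property Form",
    "fields": ["Item description", "Date lost", "Contact email"],
}

def render_text(component: dict) -> str:
    # Text-only modality: list the required fields in one message.
    lines = [component["title"] + ":"]
    lines += [f"- Please provide: {f}" for f in component["fields"]]
    return "\n".join(lines)

def render_voice(component: dict) -> list[str]:
    # Voice-only modality: one question per conversational turn.
    return [f"What is the {f.lower()}?" for f in component["fields"]]

print(render_voice(lost_property_form)[0])  # -> What is the item description?
```

A visual modality would instead hand the same definition to the existing web/app form widget, so every modality stays in sync with one source of truth.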
vi. Conclusion
Overall, this process combines a complete, automatically updated repository of company information with internal Generative AI. This can drastically improve the quality and efficiency of Conversational Design work. Depending on the regulatory requirements of the product, it can also enable much higher quality response generators.
All of this adds up to better 24/7 Conversational Experiences for users, a reduction in repetitive customer service calls, and streamlined user insights that improve the User Experience of all products.