My esteemed colleague Flo and I recently built a television guide bot using api.ai and Amazon's AWS as the backend service, to experiment with the capabilities of Voice User Interfaces (VUIs). In this blog post I cover the design process we followed to add a voice interface to an existing product and how we used api.ai to build natural dialogues.
Our aim was to explore what it takes to enable voice capabilities in an existing product and find out what we could do with them. There are some great design resources on VUIs, such as Google's Crafting a Conversation, but they don't cover integrating voice with an existing product. A clear information architecture for your project will help you define the user journeys you want your assistant to support, without having to worry about platform-specific navigation patterns.
The main difference between a visual interface and an audio one is that dialogue is a single, linear communication channel. Here is an example comparing the current Netflix app with a potential audio assistant:
It might take a split second to scan a screen and get a good idea of what to do next. But explaining every action verbally can take much longer. Dialogues need to be short and clear so the user will not lose interest and walk away.
On the other hand, voice input can be simpler and more direct than touch. When you need something you just speak your mind. You don't have to worry about finding the feature you need on the screen, navigating around the app, waiting for animations to end and so on. It’s easier and more natural to say "Send a message to Mum: I’m writing a blog post", instead of finding the 'New Message' icon, typing Mum's number, tapping the text input, typing the message and hitting the send button.
In the field of computer science there are many metaphors and synonyms already in use (a computer can have a mouse and keep information in its memory, Material Design talks about the metaphor of paper, and the list goes on). It’s likely that you’re already using several metaphors in your own apps and systems to represent multiple features with a single word. All 4's 'Catchup' feature, for example, lets users see which episodes aired in the past so they can find any updates from the shows they’re following. But how can the user be expected to know this feature exists, let alone know its correct name? How do you teach people detailed vocabulary, or should it only be available to ‘power’ users?
It’s worth creating alternative ways of prompting such features rather than expecting the user to know the branded word (in this case Catchup). Think about how someone would describe that feature conversationally. This can help when thinking about how a user would ask the chatbot for information such as "What happened in the latest episode of The Big Bang Theory?".
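As a rough illustration of this idea, the sketch below maps several conversational phrasings onto one branded feature. It is plain Python rather than anything from api.ai, and the phrase list, the feature key and the function names are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical phrasings that should all lead to the branded "Catchup" feature.
# A real project would collect these from user research, not guess them.
FEATURE_PHRASES = {
    "catchup": [
        "catchup",
        "catch up",
        "what happened in the latest episode",
        "what did i miss",
    ],
}


def normalise(utterance: str) -> str:
    """Lower-case the utterance and collapse punctuation into single spaces."""
    return " ".join(re.sub(r"[^a-z0-9']+", " ", utterance.lower()).split())


def match_feature(utterance: str) -> Optional[str]:
    """Return the feature whose phrasing appears in the utterance, if any."""
    text = normalise(utterance)
    for feature, phrases in FEATURE_PHRASES.items():
        if any(phrase in text for phrase in phrases):
            return feature
    return None


print(match_feature("What happened in the latest episode of The Big Bang Theory?"))
```

Simple substring matching like this is only meant to show the mapping from everyday language to a branded term; a natural language service would handle variations far more robustly.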
Many users may not have used a voice assistant before, so you need to guide them through the process. One way to do that is through audio onboarding, for example an introductory conversation with first-time users. Let the user know what the assistant can do and how to access its features. Teach them what to do if they’re not sure how to move forward, and let them know they can ask for help if they get stuck at any point. Remember, there are no visual hints here. You can’t expect a user to know what they can do without giving them hints.
For experienced or power users, things need to move a bit faster. They already know the ins and outs of your assistant, so they know how to act and what to say. For these users, you might want to support voice commands instead of dialogues and remove hints or guidance on features. For example, a new user would be more likely to speak naturally, e.g. "Could you please tell me more about X", but an experienced user may just ask for "Details".
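One possible way to picture this is the sketch below, which accepts both a terse command and a natural phrase for the same request, and only adds hints for newer users. It is plain Python, not api.ai; the session threshold, the intent name `show_details` and every function name are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed threshold: after this many sessions we treat the user as experienced.
EXPERIENCED_AFTER_SESSIONS = 5

# Terse commands and natural phrases resolve to the same hypothetical intent.
SHORT_COMMANDS = {"details": "show_details"}
NATURAL_PHRASES = {
    "tell me more about": "show_details",
    "more information about": "show_details",
}


@dataclass
class User:
    completed_sessions: int

    @property
    def is_experienced(self) -> bool:
        return self.completed_sessions >= EXPERIENCED_AFTER_SESSIONS


def resolve_intent(utterance: str) -> Optional[str]:
    """Map either a one-word command or a natural phrase to an intent name."""
    text = utterance.strip().rstrip("?.!").lower()
    if text in SHORT_COMMANDS:
        return SHORT_COMMANDS[text]
    for phrase, intent in NATURAL_PHRASES.items():
        if phrase in text:
            return intent
    return None


def listing_reply(user: User, show: str) -> str:
    """Keep replies short for experienced users; add hints for newcomers."""
    reply = f"{show} is on tonight."
    if user.is_experienced:
        return reply
    return reply + " You can say 'tell me more about it', or say 'help' at any time."
```

Here `resolve_intent("Details")` and `resolve_intent("Could you please tell me more about Humans?")` both lead to the same intent, while `listing_reply` decides how much guidance to speak.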
When implementing your scripts in code, you need to consider which part of the conversation each input can appear in. A phrase such as "Tell me more about X" is something a user could say at the beginning or in the middle of a conversation. There may be several variations of that question, such as "I would like to know more information about X", or the user may refer to the show by a shortened name or simply as “it”. Luckily, all those scenarios can be easily handled with api.ai.
In api.ai, input dialogues are defined through intents. An intent captures the questions the user can ask the chatbot, the chatbot’s responses back to the user, and how the two are connected.
What is really cool about api.ai is that you can define contexts around those intents. This means that if you start a conversation about a show, the bot will be able to remember which show you are talking about when you use the word "it" ('Tell me more about it', or 'Remind me to watch it').
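To make the idea of a context concrete, here is a minimal sketch in plain Python. It does not use api.ai's own API, which manages contexts for you; the `Conversation` class and its method are hypothetical names used only to show how a remembered show lets "it" be resolved.

```python
from typing import Optional


class Conversation:
    """Holds the show currently under discussion (the 'context')."""

    def __init__(self) -> None:
        self.current_show: Optional[str] = None

    def resolve_show(self, mentioned: Optional[str]) -> Optional[str]:
        """Return the show an utterance refers to.

        A named show replaces the context; "it" (or no mention at all)
        falls back to whatever show was discussed last.
        """
        if mentioned and mentioned.strip().lower() != "it":
            self.current_show = mentioned.strip()
        return self.current_show


chat = Conversation()
chat.resolve_show("The Big Bang Theory")  # user names the show
print(chat.resolve_show("it"))            # later: "Tell me more about it"
```

If "it" is used before any show has been mentioned, `resolve_show` returns `None`, which is the point where an assistant would need to ask which show the user means.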
Lastly, api.ai makes it really easy to create flexible conversations without having to tie specific parts of the discussion together. By creating two versions of the same intent, one that requires a context of discussion and one that doesn't, you can place the intent at any point in the conversation.
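The two-versions approach can be sketched roughly as follows. This is a plain Python model of the concept, not api.ai's configuration format or API; the intent names, the regular expressions and the `handle` function are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Intent:
    name: str
    pattern: str            # regular expression the utterance must match
    requires_context: bool  # True if a show must already be under discussion


# Two versions of the same "details" intent (hypothetical names and patterns).
INTENTS = [
    # Version 1: the user names the show, so no context is needed.
    Intent("details_with_show", r"^tell me more about (?!it$)(?P<show>.+)$", False),
    # Version 2: the user says "it" or "details", so a context is required.
    Intent("details_from_context", r"^(tell me more about it|details)$", True),
]


def handle(utterance: str, context_show: Optional[str]) -> Optional[Tuple[str, str]]:
    """Return (intent name, show) for the first intent that applies, else None."""
    text = utterance.strip().rstrip("?.!")
    for intent in INTENTS:
        match = re.match(intent.pattern, text, re.IGNORECASE)
        if not match:
            continue
        if intent.requires_context:
            if context_show is None:
                continue  # cannot fire without a show in context
            return intent.name, context_show
        return intent.name, match.group("show")
    return None
```

With this split, `handle("Tell me more about The Big Bang Theory", None)` works at the very start of a conversation, while `handle("Tell me more about it", "Humans")` works mid-conversation; the same request without a context returns `None`, so the assistant can ask a clarifying question.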
Even though voice interfaces are often associated with assistants such as Google Home or Amazon Alexa, this is just the tip of the iceberg in terms of possible applications for this medium.
How can your existing application be enhanced by giving it voice capabilities? Maybe your app could allow users to listen to news on-the-go, or allow the user to navigate through content by verbally telling the app which sections they are interested in?
How about replacing annoying guide screens with a chatbot that tells the user about the ways they can interact with the app instead? You could even combine different technological mediums. A great example of this is Starship Commander, a virtual reality game that lets the player control the narrative of the story by talking to the in-game characters. If you’re interested in tinkering with Arduino or Android Things, then why not build a robot and give it a personality through conversation?
When designing dialogues, take into account the time it takes to say the words out loud and consider how direct the voice input might be. It’s advisable to cater for new and experienced users with input phrases and commands that will be meaningful to both groups.
Working on a VUI project can be fun and interesting from a design and a development perspective. I particularly enjoyed working with api.ai on this as it provides the freedom to create flexible scenarios. I’m looking forward to exploring how to improve the discoverability of features through dialogue structure amongst other techniques.
Listing image by Jason Rosewell