Alexa at amaysim

James Prendergast
engineering @ amaysim
17 min read · Mar 26, 2018


As part of amaysim’s commitment to make it amazingly simple for our customers to manage their household services, amaysim labs have been exploring the different ways our users can interact with us. This post details some of the work we’ve been doing with Conversational Interfaces — specifically Amazon’s Alexa and the approach we’ve taken to designing and developing amaysim’s Alexa Skill.

What’s a Conversational Interface?

A Conversational Interface is a human-machine interface that functions in a conversational manner — a back-and-forth exchange of information in natural language, as opposed to the more computer-centric command style that has been common until now. Common examples of conversational interfaces are chatbots and digital assistants.

A conversation typically has:

  • One or more request-responses
  • Context
  • Natural language
Star Trek IV: The Voyage Home (1986) — Scotty cannae operate an antiquated user interface

Rather than training users how to use our applications, the goal of a Conversational Interface is to allow users to communicate with machines as they do with other people: by using context and natural speech that doesn’t have to follow a strict syntax. Moreover, by asking follow-up questions to qualify the details of a task, and even anticipating related tasks that typically follow one another, Conversational Interfaces lower the barrier to entry for users interacting with our application. No-one expects to have to ‘RTFM’ before they ask a person to do something — they shouldn’t have to before they use your application either.

Alexa: To skill or be skilled

Alexa is Amazon’s cloud-based voice service. It is available on Amazon’s Fire tablets (gen 4+) and in a wide variety of hardware devices, mostly, but not solely, manufactured by Amazon: the Echo, Echo Dot, Echo Plus and the Sonos One, to name a few.

Alexa comes with some built-in functions — “skills” in the Alexa parlance. These include:

  • Calendar-based skills: reminders, timers — “What’s on my calendar today?”, “Set a timer for 12 minutes”
  • Lists: shopping, to-do — “Add clean the kitchen to my to-do list”, “Add Tim Tams to my shopping list”
  • Media: Amazon Prime, Audible — “Play the book Animal Farm”
  • Calling & Messaging — “Call dad”, “Drop in on the kitchen”
  • Shopping: order from Amazon — “Buy more AA batteries”

As well as the built-in functions, Alexa can also invoke third-party skills — like the one developed by amaysim. Third-party skills fall into the following categories:

Smart-home skills

“Alexa, lock the front door”, “Alexa, dim the lounge lights by 50 percent”

Smart-home skills have a dedicated API (and SDKs) and do not require an invocation name.

Flash-briefing skills

“Alexa, what’s the news today?”, “Alexa, what’s the weather like in Darwin right now?”

Somewhat analogous to RSS feeds, content producers can create audio and video media for users that subscribe to their flash-briefing skill.

Video skills

“Alexa, play Manchester by the Sea”, “Alexa, play Monsters Inc.”

Video content providers can use the dedicated Alexa video skill API to expose their content catalogue to users.

List skills

“Alexa, what’s on my Shopping List?”, “Alexa, add milk to my Shopping List”

Alexa exposes a list API that lets third-party skills react to the built-in list events, such as adding or removing items from lists.

Custom skills

“Alexa, ask amaysim for my mobile balance”

  • Gives the most control over a user’s experience
  • Developers define the invocation name, intents and sample utterances, and how to respond to them
  • Voice, visual and touch interfaces

Custom skills offer developers the most flexibility when it comes to customising the behaviour of Alexa. Custom skills are invoked via a unique ‘invocation name’, and the developers define the intents and provide the sample utterances that are used to train Alexa to recognise and analyse what users say. amaysim’s Alexa skill is a custom skill, developed primarily with the official Node.js SDK for custom skills.

Designing your Conversational Interface

Undeniably, the most important part of developing an Alexa Skill or any virtual assistant is to design the interface.

The first step, after deciding what functionality you are going to deliver, is working out how users can ask for what you are offering. The best way to do this is to write out a script of an imagined conversation between your user and your Alexa skill.

The simplest interactions are when the user provides all of the required information in a single statement — so-called one-shot utterances should be considered the happiest of happy paths for your intent. For example, an amaysim customer with multiple mobile phone numbers in their account should be able to get their mobile balance as follows:

Alexa, ask amaysim for my mobile balance for my number ending in 1234

Obviously a customer who hasn’t used the skill before may not know that they can do this and will instead have to get there with a multiple-turn dialogue:

customer: Alexa, launch amaysim.

Alexa: Welcome back. Your go-to place to talk to amaysim. Say the word help and I’ll be there to guide you. Would you like to hear your balance, enable roaming or find your phone?

customer: I’d like to hear my balance

Alexa: You have more than one number in your account. One ending in 1234 and one ending in 9876 — which one do you want the balance for?

customer: the one ending in 1234

Alexa: You have 2.1 gigabytes of data left and 10 days to use it in…

As well as the happy path, you also need to handle occasions when a user says nothing at all, says something incoherent or masked by background noise, or simply forgets what options Alexa just gave them. The more turns there are in a dialogue, the more possible branches there are for you to cover.

The next step we took was to create flowcharts based on the scripts, capturing some of the logic we would need to employ for each desired outcome. This also helped us identify the different application states we would have to cover in our code.

Flowchart for the happy paths for the ‘get balance’ intent

Each of the states illustrated above corresponds to a possible turn in a dialogue between a user and Alexa. What’s not shown is that for each state Alexa should also be able to handle the ‘unhappy paths’: when it doesn’t understand a user utterance, or when the user asks for help, says nothing at all, or wants to cancel the interaction.

General Tips for designing a Conversational Interface

  • Design first — write scripts, define flows.
  • Don’t try to be too personal — while it’s cute at first, colloquialisms and banter will quickly grate on your users’ nerves. They know Alexa isn’t a real person, so don’t pretend she is; at the very least, don’t pretend she’s that annoying person who keeps repeating the same jokes.
  • Keep responses short — people can skim read text, but not voice. To this end you can track how many times a user has used your skill and possibly give them shorter versions of responses once they are more familiar with your skill and what it can do. Just think of your own experience with IVR telephone menus and how frustrating it can be to listen to the same message over and over just to find out what pressing ‘9’ will get you.
  • Don’t leave the user guessing — lead them by the hand if necessary. Make sure that every follow-up question that Alexa asks your users in order to fulfil the overall intent has a matching ‘Help’ intent, to explain further what each option is. Giving your users too many options at once can lead to confusion so wherever possible stick to presenting your users with binary (yes/no) choices.
  • Expect the unexpected — ensure that the way you handle ‘unhandled’ utterances makes sense for the skill’s current state.
  • Be smart — make assumptions based on past behaviour.

Anatomy of a skill

Wake word

Users’ Alexa devices are configured with a wake word — by default this is ‘Alexa’, but users can change it to whatever they like (e.g. ‘computer’ if you want to get trekky). When the device detects someone saying the wake word, it immediately starts to process the ‘utterance’ that follows.

Invocation name

Each custom skill has an invocation name — a unique name or short phrase that identifies your skill. In our case that invocation name is ‘amaysim’. If a user wakes Alexa and starts or ends the phrase with our invocation name, Alexa knows that the phrase should be processed in the context of our skill and the result handled by our fulfilment service.

For example: ‘Alexa, ask amaysim how much data I have left’ or ‘Alexa, get my mobile balance from amaysim’ will wake Alexa and then invoke amaysim’s skill.

Interaction model

The skill also comprises an interaction model that defines which ‘utterances’ Alexa should listen out for and how to group similar utterances into ‘intents’. Intents may also contain ‘slots’: variable parts of a user’s utterance that convey specific information required to fulfil the intent.

Writing out scripts for user interactions will help you identify the components of the interaction model that you will have to define:

  • Intents: the desired action — what the user intends to do
  • Utterances: what the user actually says
  • Slots: variable parts of an utterance that convey specific details the intent needs
An example of an utterance for the ‘find my phone’ intent

The interaction model defines a list of sample utterances for each intent, but we don’t have to provide a complete list of every phrase we want to capture — this is where the NLP comes in: Amazon takes our interaction model and uses the sample utterances it contains to train their own black-box NLP model so that it will recognise and categorise utterances that we haven’t provided as samples, but still convey the same meaning.

Alexa also comes with some built-in intents that you don’t have to define; these recognise utterances that could commonly be used by any skill. Examples include users asking for help (AMAZON.HelpIntent), confirmation/negation (AMAZON.YesIntent/AMAZON.NoIntent) and cancellation (AMAZON.CancelIntent).

A user utterance that includes a ‘slot’ value for the ‘get balance’ intent. Here the slot value provided is ‘1948’

As well as built-in intents, there are also built-in slot-types that help Alexa process the parts of an utterance that correspond to a slot value we want to capture. For example, defining part of a sample utterance as a slot with the built-in slot-type AMAZON.FOUR_DIGIT_NUMBER means that Alexa will parse the phrase ‘one nine four eight’ as the four-digit cardinal number 1948 rather than the string of words the user actually says. This value is then passed to our fulfilment service, keyed with the name we gave the slot in the intent.
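Pulling those pieces together, here is a simplified sketch of how such an intent might be declared in the interaction model JSON. The intent name GetBalanceIntent and slot name lastFourDigits are illustrative, and the exact schema depends on which version of the skill builder you are using:

{
  "interactionModel": {
    "languageModel": {
      "invocationName": "amaysim",
      "intents": [
        {
          "name": "GetBalanceIntent",
          "slots": [
            { "name": "lastFourDigits", "type": "AMAZON.FOUR_DIGIT_NUMBER" }
          ],
          "samples": [
            "what is my mobile balance",
            "how much data do I have left",
            "my balance for my number ending in {lastFourDigits}"
          ]
        }
      ]
    }
  }
}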

Fulfilment service

The fulfilment service is the heart of our application, where we define how Alexa should respond to each intent. The service can be hosted anywhere as long as it talks JSON over HTTPS, but the preferred option is to host it as an AWS Lambda function. AWS Lambda even has a dedicated Alexa Skills Kit event type/trigger for just this purpose, and the Alexa Skills Kit CLI includes commands to deploy functions to AWS Lambda directly.
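As a rough illustration (not our production code), the skeleton of a fulfilment service built with the official alexa-sdk for Node.js and deployed as an AWS Lambda function looks something like this; the intent names and speech are assumptions:

// A minimal sketch of a fulfilment service built with the official alexa-sdk
// for Node.js and deployed as an AWS Lambda function. Intent names and speech
// are illustrative only.
const Alexa = require('alexa-sdk');

const handlers = {
  'LaunchRequest': function () {
    // ':ask' speaks, sets a reprompt and keeps the session (and the mic) open.
    this.emit(':ask',
      'Welcome back. Would you like to hear your balance, enable roaming or find your phone?',
      'You can say: what is my balance.');
  },
  'GetBalanceIntent': function () {
    // ':tell' speaks and ends the session.
    this.emit(':tell', 'You have 2.1 gigabytes of data left and 10 days to use it.');
  },
  'Unhandled': function () {
    this.emit(':ask', "Sorry, I didn't catch that.", 'Say help if you are stuck.');
  }
};

// The AWS Lambda entry point, wired up to the Alexa Skills Kit trigger.
exports.handler = function (event, context, callback) {
  const alexa = Alexa.handler(event, context, callback);
  alexa.registerHandlers(handlers);
  alexa.execute();
};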

amaysim’s Alexa Architecture

The above diagram doesn’t show the OAuth login webpage required for account linking, but this is essentially it:

  1. The hardware device recognises when a user says the wake-word, it then streams the following audio to the Alexa Skills Kit front-end
  2. The audio is analysed for the appropriate invocation name (amaysim in this case) and then processed with the NLP model that we have configured with our interaction model.
  3. The result of this NLP processing is parsed into a native AWS Lambda Alexa event which is handled by our fulfilment service (an AWS Lambda function) as a JSON payload.
  4. The service returns a JSON response that includes SSML that tells Alexa what to say (and how to say it), session attributes and whether it should end the session (close the mic) or keep listening for more input.

A typical request/event that Alexa will dispatch to your fulfilment service looks like:

{
  "version": "1.0",
  "session": {
    "new": true,
    "sessionId": "amzn1.echo-api.session.[REDACTED GUID]",
    "application": {
      "applicationId": "amzn1.ask.skill.[REDACTED GUID]"
    },
    "attributes": {},
    "user": {
      "userId": "amzn1.ask.account.[REDACTED]",
      "accessToken": "[REDACTED]"
    }
  },
  "context": {
    "AudioPlayer": {
      "playerActivity": "IDLE"
    },
    "System": {
      "application": {
        "applicationId": "amzn1.ask.skill.[REDACTED GUID]"
      },
      "user": {
        "userId": "amzn1.ask.account.[REDACTED]",
        "accessToken": "[REDACTED]"
      },
      "device": {
        "deviceId": "amzn1.ask.device.[REDACTED]",
        "supportedInterfaces": {
          "AudioPlayer": {}
        }
      },
      "apiEndpoint": "https://api.fe.amazonalexa.com",
      "apiAccessToken": "[REDACTED]"
    }
  },
  "request": {
    "type": "IntentRequest",
    "requestId": "amzn1.echo-api.request.[REDACTED]",
    "timestamp": "2017-11-20T09:00:36Z",
    "locale": "en-AU",
    "intent": {
      "name": "AMAZON.NoIntent",
      "confirmationStatus": "NONE"
    }
  }
}

This particular Alexa event has been raised by a user utterance that maps to a built-in intent (AMAZON.NoIntent) — the user has said something that Alexa recognises as meaning ‘NO’ (“no thanks”, “nope”, “nah” etc.). It also has no custom session attributes, a session.new:true property indicating this is the first request in a session and a session.user object that includes a unique user identifier and an accessToken.
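For completeness, the response our service sends back has the following shape. The speech and the STATE session attribute below are illustrative, and in practice the Node.js SDK builds this payload for you:

{
  "version": "1.0",
  "sessionAttributes": {
    "STATE": "ASK_FOR_NUMBER"
  },
  "response": {
    "outputSpeech": {
      "type": "SSML",
      "ssml": "<speak>No worries. Is there anything else I can help you with?</speak>"
    },
    "reprompt": {
      "outputSpeech": {
        "type": "SSML",
        "ssml": "<speak>You can ask for your balance, or say stop to finish up.</speak>"
      }
    },
    "shouldEndSession": false
  }
}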

Our architecture also includes an instance of DynamoDB to allow us to persist user data between sessions — the official Node.js SDK abstracts away the necessary CRUD operations so that user session attributes are saved to your DynamoDB table when a session ends and then automatically retrieved as session attributes when the same user starts a new session. Session attributes are stored against the user’s Amazon ID — session.user.userId in the sample above.
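Enabling that persistence with the SDK amounts to naming a table and writing to this.attributes. A minimal sketch, assuming a table called ‘alexa-skill-attributes’ (the table name, intent and speech are illustrative):

// Sketch: persisting attributes between sessions with alexa-sdk and DynamoDB.
// The table name is an assumption; the SDK handles the reads and writes.
const Alexa = require('alexa-sdk');

const handlers = {
  'LaunchRequest': function () {
    // Track how familiar the user is, so we can shorten responses over time.
    this.attributes.timesUsed = (this.attributes.timesUsed || 0) + 1;
    const greeting = this.attributes.timesUsed > 3
      ? 'Welcome back.'
      : 'Welcome back. Your go-to place to talk to amaysim.';
    this.emit(':ask', greeting, 'Say help if you are stuck.');
  }
};

exports.handler = function (event, context, callback) {
  const alexa = Alexa.handler(event, context, callback);
  // Anything written to this.attributes inside a handler is saved to this table
  // when the session ends and re-loaded when the same user returns.
  alexa.dynamoDBTableName = 'alexa-skill-attributes';
  alexa.registerHandlers(handlers);
  alexa.execute();
};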

Account linking: Authentication & Authorisation

In order for a skill to connect the identity of a user to an account in another system, users have to be able to link their Amazon ID, associated with their Alexa device, with their identity in the third party system.

When an unauthenticated user uses our skill, we prompt them to link their account by sending an account-linking card to their companion app. The companion app is how users initially configure their hardware device to connect to their wi-fi; it also allows users to browse the Skills catalogue and change settings of their account/devices. There are versions of the companion app for iOS, Android/FireOS and a web browser version too. For Alexa devices without a built-in screen this is the only visual UI afforded to Alexa users.

Alexa companion app home screen showing an account-linking card for the amaysim skill

When a user clicks the ‘Link Account’ link in their companion app, it opens a new browser window pointing to amaysim’s login page. Once the user provides their amaysim user credentials their Amazon user ID is linked to the amaysim account they just authenticated with.

Amazon only allows third-party systems to authenticate using OAuth 2.0 authentication flows. Amaysim already uses OAuth 2.0 authentication to afford us SSO functionality across our different web portals (mobile, energy, broadband and devices) so handily we didn’t have to do any extra development to allow account-linking for Alexa. We simply had to configure our Skill with the correct credentials to interface with our OAuth service.

After an account is linked each Alexa Event payload that our skill receives will contain a valid OAuth bearer (access) token. If you are using the ‘authorisation code grant’ flow, which also returns a refresh token as well as the access token, the Alexa front-end will automatically manage re-acquiring a valid access token from your authentication service when the previous one expires.
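Inside an intent handler, the linked access token arrives on the event (session.user.accessToken in the payload shown earlier) and can be sent as a bearer token to your own back-end. A rough sketch, with a hypothetical API host and path that stand in for the real endpoints:

// Sketch: using the account-linked OAuth access token inside an intent handler.
// The API host and path are hypothetical placeholders.
const https = require('https');

function fetchBalance(accessToken) {
  const options = {
    hostname: 'api.example.com',      // hypothetical host
    path: '/v1/mobile/balance',       // hypothetical path
    headers: { Authorization: 'Bearer ' + accessToken }
  };
  return new Promise((resolve, reject) => {
    https.get(options, (res) => {
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(JSON.parse(body)));
    }).on('error', reject);
  });
}

// Inside an alexa-sdk intent handler:
// const token = this.event.session.user.accessToken;
// if (!token) {
//   // Not linked yet — send the account-linking card to the companion app.
//   return this.emit(':tellWithLinkAccountCard', 'Please link your amaysim account in the Alexa app.');
// }
// fetchBalance(token).then((balance) => { /* respond with the balance */ });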

Currently, in the ANZ region, a single Alexa device can only be associated with a single Amazon user account, hence all users of a given device will have the same identity when linked to third-party services. Voice Profiles are a feature of Alexa, available only in the US, UK and Germany at the time of writing, which allows Alexa to differentiate users based on their voices — Alexa will then provide the appropriate OAuth token/User ID for each user in a household based on their voice, allowing different users to link to different third-party accounts.

Context and State: The StateHandler

Alexa does the heavy lifting of parsing human speech and determining which intent an utterance fits, then sends a JSON payload to your service describing the intent that is being fulfilled. Your service then tells Alexa what to say in response, again using a structured JSON payload.

Normal human conversations are dominated by context, where previous information is referred to implicitly as the conversation progresses. To this end Alexa, in lieu of cookies in a web application, provides session attributes in its requests to and responses from the fulfilment service, allowing your service to persist information between requests within the same session. It is also possible to persist session attributes between sessions if you use a backing datastore.

The Alexa Skills Kit SDK for Node.js is the official Node.js SDK from the Alexa SDK team for developing custom AWS Lambda Alexa skill fulfilment services. It includes higher-order concepts such as session state, which is implemented as a session attribute with the key ‘STATE’. It’s a handy and natural way of adding basic context to your intent handlers.

Without session state/context your stateless fulfilment service would not be able to differentiate a ‘Yes’ intent that follows the response ‘Would you like to hear the news?’ from one that follows ‘Do you want to call your aunt?’, and your skill would be limited to one-shot intents only.

When developing using the SDK we found the following StateHandler decorator class to be a handy tool — it adds some default intent handling to the SDK’s state handlers and allows you to define intent handlers for each state in a fluent manner.

StateHandler class with a fluent interface for registering Intent handlers for a given state
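The gist below is a simplified sketch of the idea rather than our exact implementation: a thin wrapper that collects intent handlers for a given state, fills in sensible defaults for the unhappy paths, and hands everything to the SDK’s Alexa.CreateStateHandler. The default speech and intent names are illustrative.

// Sketch of a StateHandler decorator over alexa-sdk's CreateStateHandler.
// Default speech and handler names are illustrative.
const Alexa = require('alexa-sdk');

class StateHandler {
  constructor(state) {
    this.state = state;
    // Default handling for the 'unhappy paths' every state should cover.
    this.handlers = {
      'Unhandled': function () {
        this.emit(':ask', "Sorry, I didn't catch that.", 'Say help if you are stuck.');
      },
      'AMAZON.HelpIntent': function () {
        this.emit(':ask', 'You can hear your balance, enable roaming or find your phone.');
      },
      'AMAZON.CancelIntent': function () {
        this.emit(':tell', 'No worries. Goodbye.');
      },
      'AMAZON.StopIntent': function () {
        this.emit(':tell', 'Goodbye.');
      }
    };
  }

  // Fluent registration: returns `this` so calls can be chained.
  onIntent(intentName, handler) {
    this.handlers[intentName] = handler;
    return this;
  }

  // Produce the handler object the SDK expects for this state.
  build() {
    return Alexa.CreateStateHandler(this.state, this.handlers);
  }
}

module.exports = StateHandler;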

Use the StateHandler class to register intent handlers for each state that your skill uses

Using the StateHandler class
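And a sketch of how it is used in the Lambda entry point — the state name ASK_FOR_NUMBER, the intent NumberProvidedIntent and its slot are illustrative:

// Sketch: registering per-state intent handlers built with the StateHandler class.
const Alexa = require('alexa-sdk');
const StateHandler = require('./state-handler'); // the sketch above

const askForNumberHandlers = new StateHandler('ASK_FOR_NUMBER')
  .onIntent('NumberProvidedIntent', function () {
    const lastFour = this.event.request.intent.slots.lastFourDigits.value;
    // ...look up the balance for the service ending in `lastFour`, then:
    this.emit(':tell', 'You have 2.1 gigabytes of data left on the number ending in ' + lastFour + '.');
  })
  .onIntent('AMAZON.NoIntent', function () {
    this.emit(':tell', 'No worries. Goodbye.');
  })
  .build();

exports.handler = function (event, context, callback) {
  const alexa = Alexa.handler(event, context, callback);
  alexa.registerHandlers(askForNumberHandlers /* , ...handlers for the other states */);
  alexa.execute();
};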

Testing

It’s important to recognise the two separate domains that make up a custom skill:

  1. The parsing of speech into text, mapping to intents and raising an Alexa event: Audio → text → NLP model → JSON event
  2. The fulfilment service: JSON event in → JSON response out

Amazon is continuously improving Alexa’s speech-to-text transcription capabilities, and the NLP model associated with your skill should also improve over time as it adjusts according to feedback from users. However, this also means that developers have, at the ‘front’ of their system, a black box that they don’t have a great deal of control over and which, in theory, will give different results over time for the same input.

While the fulfilment service is amenable to conventional automated testing just like any other service you might write — you are asserting that your service produces text output for a given text input, after all — the first part, taking a human utterance and transforming it into an Alexa event via a black-box NLP model, is a little harder to automate using conventional testing frameworks and tooling.

There’s little chance of being able to avoid at least some manual testing when evaluating your interaction model, but tools like Bespoken’s Virtual Alexa and Virtual Device SDK can make it easier to automate regression testing with some degree of confidence.
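For instance, Bespoken’s virtual-alexa package can drive your Lambda handler with plain-text utterances inside an ordinary unit test. A sketch, assuming Jest as the test runner and illustrative file paths:

// Sketch: regression-testing the skill with Bespoken's virtual-alexa.
// Paths to the handler and interaction model are assumptions.
const va = require('virtual-alexa');

const alexa = va.VirtualAlexa.Builder()
  .handler('index.handler')                     // our Lambda entry point
  .interactionModelFile('./models/en-AU.json')  // the skill's interaction model
  .create();

it('asks which number when the account has multiple services', async () => {
  await alexa.launch();
  const response = await alexa.utter('I would like to hear my balance');
  // Assert on the SSML that would be spoken back to the user.
  expect(response.response.outputSpeech.ssml).toContain('which one');
});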

Gotta catch ’em all

The first iterations of our skill resulted in Alexa mapping some control utterances (random phrases) to the wrong intent. The NLP model was being too ‘greedy’ with respect to one intent and was over-matching utterances. In the end we had to create a ‘catch-all’ intent in our interaction model that included a lot of unrelated phrases, song lyrics and random words, which rebalanced the model and cut the number of false-positive matches down to an acceptable level.
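In the interaction model that looked something like the following — the sample phrases here are placeholders rather than our real list, and in the fulfilment service the intent is handled much like an unhandled utterance:

{
  "name": "CatchAllIntent",
  "samples": [
    "the quick brown fox jumps over the lazy dog",
    "never gonna give you up never gonna let you down",
    "purple monkey dishwasher",
    "what is the airspeed velocity of an unladen swallow"
  ]
}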

Deployment

The ASK CLI

As you would expect, Amazon provides a dedicated command-line interface for the Alexa Skills Kit that affords all of the same functionality as the Amazon Developer ASK web console, allowing developers to automate skill deployment from a CI/CD pipeline. When we first started development of the amaysim skill, the ASK CLI wasn’t available in the ANZ region, so we were limited to automating deployment of our AWS Lambda fulfilment service alone via CI.

Now that it is available in our region, we can automate both the front-end (interaction model, skill manifest and account-linking configuration) and the Lambda function deployment. Which leads us to the last part of our skill deployment journey:

Blue-Green deployment with Lambda Versioning and Aliases

When you deploy an AWS Lambda function it actually creates TWO ARNs that can be used to address that function:

  1. The canonical, or unqualified, Lambda ARN:
    arn:aws:lambda:aws-region:acct-id:function:helloworld
  2. And a Qualified ARN — the canonical ARN + a suffix:
    arn:aws:lambda:aws-region:acct-id:function:helloworld:$LATEST

If you address your Lambda function by the unqualified, canonical ARN it will automatically resolve to version: $LATEST — handy when you’re in dev mode and you always want to be testing against the latest deployed version of your function.

The moment you submit a skill for certification, the skill manifest, interaction model and account-linking information become immutable. A new ‘development’ version of your skill is automatically cloned from this ‘production-ready’ version so you can still test changes to your skill. However, as you can no longer make any changes or updates to the front-end of your submitted skill, including the ARN of your AWS Lambda fulfilment function, you could be in danger of breaking your ‘production’ skill with changes required for development. Having to change function names just to get around this overwriting behaviour is distinctly not optimal — for a start, you would have to update all dependent log-shipping/alerting infrastructure with each deployment. This is where Lambda function versioning and aliases come in useful.

Each time you publish a Lambda function version, AWS Lambda copies the $LATEST version to create a new version with the same ARN plus an incremented numerical suffix:

arn:aws:lambda:aws-region:acct-id:function:helloworld:2

With versioning enabled, you can address specific versions of your fulfilment service by using the version qualifier. If you use Lambda aliases, you can also create ‘pointers’ to specific Lambda versions — for example, a DEV alias that resolves to the $LATEST version and a PROD alias that resolves to a particular function version.
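Publishing a version and pointing an alias at it can also be scripted from a pipeline. A sketch using the AWS SDK for JavaScript, with illustrative function and alias names and region (the same steps can equally be done with the AWS CLI):

// Sketch: publish an immutable Lambda version and point a release alias at it.
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda({ region: 'ap-southeast-2' });

async function release(functionName, aliasName) {
  // Snapshot $LATEST as a new, numbered, immutable version.
  const version = await lambda.publishVersion({ FunctionName: functionName }).promise();

  // Create an alias (e.g. 'v1_1', derived from the git tag) pointing at that version.
  const alias = await lambda.createAlias({
    FunctionName: functionName,
    Name: aliasName,
    FunctionVersion: version.Version
  }).promise();

  // This alias ARN is what goes into the skill manifest as the fulfilment endpoint.
  return alias.AliasArn;
}

// release('alexa', 'v1_1').then(console.log).catch(console.error);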

Our build pipeline has a final, manually triggered deployment step — submit for certification — that, after testing and packaging etc., does the following:

  1. Deploys a build of HEAD of master from our git repository to AWS Lambda, which returns a function version: arn:aws:lambda:aws-region:acct-id:function:alexa:N
  2. Gets a version tag from a file in the root directory of our solution at HEAD of master, e.g. v1.1
  3. Creates a new release in our git repository for the same build, based on this tag.
  4. Creates a new Lambda alias, named after the GitHub release tag, that points to our latest deployed Lambda version:
    arn:aws:lambda:aws-region:acct-id:function:alexa:v1_1 → arn:aws:lambda:aws-region:acct-id:function:alexa:N
  5. Updates the front-end development skill manifest using the ASK CLI to use the ARN of the new Lambda Alias for fulfilment.
  6. Submits the skill for certification with Amazon’s Alexa certification team.

This way, we ensure that when the skill is submitted for certification we have:

  1. A release tag in GitHub that we can relate to our change-management workflow and that corresponds to a production release.
  2. An immutable Lambda function version that we can’t overwrite by mistake.
  3. Conversely, we can still roll back or apply hot-fixes to the deployed Lambda function if bugs are found post-certification, by updating the Lambda alias to point to a new Lambda version.

Conclusion

In my household we already use a Conversational Interface to add items to shopping lists, play music, control the TV, set timers when cooking and make reminders for upcoming events in a shared calendar. Convenience is King and this alone means Conversational Interfaces are here to stay. Beyond this though, I personally believe that Conversational Interfaces will really come into their own in wearable devices that can’t afford the physical real-estate and energy to power visual UIs as well as being accessible for visually-impaired users.

While it’s still a space dominated by early adopters and big-name brands, the barrier to entry for developers is very low. Alexa in particular has well-defined APIs and SDKs and a pretty comprehensive suite of documentation, which, combined with the AWS free tier for new accounts, makes experimenting with this new technology easy and, dare I say, fun.

Has any of the above piqued your interest? Does amaysim sound like the sort of place where you think you could make an impact? Do you thrive in organisations where you are empowered to bring change and constant improvement? Why not take a few minutes to learn more about our open roles and opportunities — and if you like what you see, say hi. We’d love to hear from you.

Shout-out to all the lawyers…

The views expressed on this blog post are mine alone and do not necessarily reflect the views of my employer, amaysim Australia Ltd.
