Using Amazon Polly on the AWS CLI

Amazon Polly is a managed AWS service that synthesizes speech from text. This article walks through using Polly from the AWS CLI, covering each of the commands Polly provides, with examples.

Using Polly with AWS CLI

Before proceeding, make sure you have the latest version of the AWS CLI installed and your access keys configured.

Latest AWS CLI version:
Security Updates: Newer versions of the AWS CLI often include security patches. Running an outdated version leaves known vulnerabilities unpatched, which could expose your AWS account or resources to unauthorized access.
New Features and Functionality: Amazon Polly and other AWS services keep adding features. The latest AWS CLI gives you access to the most recent options and commands for interacting with Polly.
Bug Fixes: Bugs in the AWS CLI can cause unexpected behavior or errors when using Polly commands. Newer versions fix known bugs, which leads to a smoother and more reliable experience.
Configured Access Keys:
Authentication: Access keys are your credentials for interacting with AWS services like Polly. They act like a username and password, proving your identity and granting authorization to perform actions. Without valid credentials, AWS CLI commands that call a service, including those for Polly, will fail.
Security Best Practices: It's recommended to use temporary, short-lived credentials for programmatic access (like the AWS CLI) instead of long-term access keys. The IAM permissions attached to those credentials limit which actions the CLI can perform on your account, minimizing potential damage in case of accidental misuse.
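
Both prerequisites can be checked from the shell before running any Polly command. The following is a minimal sketch, not an official procedure: check_aws_cli is a hypothetical helper name, and it relies only on the aws --version and aws configure list commands.

```shell
# Sketch: confirm the AWS CLI is installed and show which credentials it will use.
# check_aws_cli is a hypothetical helper name, not part of the AWS CLI.
check_aws_cli() {
    if ! command -v aws >/dev/null 2>&1; then
        echo "aws CLI not found on PATH" >&2
        return 1
    fi
    aws --version        # prints the installed CLI version
    aws configure list   # shows the profile, credential source, and region in effect
}
```

Calling check_aws_cli prints the version followed by the active configuration. If the access key and secret key rows are empty, run aws configure (or set up another credential source) first.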

Finding Help with AWS Polly in the CLI

Use the help command to list the commands available for Polly in the AWS CLI.

aws polly help

To get help for a specific Polly command:

aws polly COMMAND help

For example,

aws polly synthesize-speech help

Synthesizing speech using AWS CLI commands

To synthesize speech, use the `synthesize-speech` command.

aws polly synthesize-speech \
    --output-format mp3 \
    --voice-id Joanna \
    --text 'Hello, This is a sample text recorded using AWS Polly.' \
    hello.mp3

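This command generates a file named hello.mp3. In addition to the MP3 file, the operation prints a short JSON summary to the console, including the ContentType of the audio and the number of RequestCharacters that were synthesized.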

The --voice-id option selects the voice used in the audio file. Polly offers many voices for each language. You can get the list of voice IDs from the --voice-id section of aws polly synthesize-speech help, or from the describe-voices command.

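Each voice belongs to a language, so choosing a voice usually determines the language of the output. Some voices are bilingual, and for those the --language-code option selects which language to use. This command produces audio in Indian English with the voice ID Aditi, which supports both Indian English (en-IN) and Hindi (hi-IN). You can get the list of language codes with the help command.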

aws polly synthesize-speech \
    --output-format mp3 \
    --voice-id Aditi \
    --text 'Hello, This is a sample text recorded using AWS Polly.' \
    --language-code en-IN \
    hello2.mp3

To find the voice IDs for a specific language, use the describe-voices command. This command prints all the available voices for Indian English.

aws polly describe-voices --language-code en-IN

Polly offers several text-to-speech engines, including standard, neural, and long-form. Use the --engine option to choose the engine used to produce speech. This command uses the neural engine with the Kajal voice to produce speech.

aws polly synthesize-speech \
    --output-format mp3 \
    --voice-id Kajal \
    --engine neural \
    --text 'Hello, This is a sample text recorded using AWS Polly.' \
    --language-code en-IN \
    hello3.mp3

Not all voices support the neural engine. If you request the neural engine with a voice ID that does not support it, the command fails with an error.
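
One way to avoid that error is to ask Polly which voices support an engine before synthesizing. The sketch below wraps describe-voices with its --engine filter and the CLI's global --query and --output options. voices_for_engine is a hypothetical helper name, not part of the AWS CLI.

```shell
# Sketch: list the voice IDs that support a given engine and language.
# voices_for_engine is a hypothetical helper name.
voices_for_engine() {
    local engine="$1"      # for example: neural
    local language="$2"    # for example: en-IN
    aws polly describe-voices \
        --engine "$engine" \
        --language-code "$language" \
        --query 'Voices[].Id' \
        --output text
}
```

For example, voices_for_engine neural en-IN prints the IDs of the Indian English voices that can be combined with --engine neural.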

The synthesize-speech command has many more options covering languages, file formats, voices, engines, SSML, and so on. They are listed in the AWS documentation and in the output of aws polly synthesize-speech help.
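
As an illustration of one of those options, the following sketch passes SSML instead of plain text by adding --text-type ssml. speak_ssml is a hypothetical helper name, and the Joanna voice and MP3 format are simply carried over from the earlier examples.

```shell
# Sketch: synthesize SSML input instead of plain text.
# speak_ssml is a hypothetical helper; voice and format mirror the earlier examples.
speak_ssml() {
    local ssml="$1"      # SSML document wrapped in <speak>...</speak>
    local outfile="$2"   # path of the MP3 file to write
    aws polly synthesize-speech \
        --output-format mp3 \
        --voice-id Joanna \
        --text-type ssml \
        --text "$ssml" \
        "$outfile"
}

# Usage: insert a half-second pause between two sentences.
# speak_ssml '<speak>Hello.<break time="500ms"/>This is Polly.</speak>' hello-ssml.mp3
```

Single quotes around the SSML keep the shell from interpreting the double quotes inside the tags.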

Speech synthesis tasks

A speech synthesis task runs asynchronously, which makes it suitable for long texts that take a while to process. The generated audio files are stored in an S3 bucket. When you start a task, Polly returns a SpeechSynthesisTask object that includes the ID of the task and other details. This object is available for 72 hours after the task starts.

This command starts a speech synthesis task that reads its input from input.txt (which should be in the current directory) and stores the resulting audio file in `my-s3-bucket`. Make sure you have created the bucket first, and pass its name to the --output-s3-bucket-name option.

aws polly start-speech-synthesis-task \
    --output-format mp3 \
    --output-s3-bucket-name my-s3-bucket \
    --text file://input.txt \
    --voice-id Joanna

The output includes the following fields:

TaskId: The ID of the task you just created.
TaskStatus: Current status of the task (scheduled, inProgress, completed, or failed).
OutputUri: Location of the output speech file.
CreationTime: Timestamp for the time the synthesis task was started.
RequestCharacters: Number of billable characters synthesized.
OutputFormat: Format in which the output file will be encoded.
TextType: Specifies whether the input text is plain text or SSML.
VoiceId: Voice ID used for the synthesis.
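
When the task is started from a script, usually only the TaskId is needed for later lookups. A minimal sketch, assuming the response nests these fields under a SynthesisTask key and using the CLI's global --query option to extract one of them. start_task is a hypothetical helper name, and the bucket and file names are placeholders.

```shell
# Sketch: start a task and print only its TaskId.
# start_task is a hypothetical helper; bucket and file names are placeholders.
start_task() {
    local bucket="$1"       # existing S3 bucket for the audio output
    local text_file="$2"    # local file containing the text to synthesize
    aws polly start-speech-synthesis-task \
        --output-format mp3 \
        --output-s3-bucket-name "$bucket" \
        --text "file://$text_file" \
        --voice-id Joanna \
        --query 'SynthesisTask.TaskId' \
        --output text
}

# Usage: keep the ID in a variable for later status checks.
# task_id="$(start_task my-s3-bucket input.txt)"
```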

To list all the speech synthesis tasks, use the `list-speech-synthesis-tasks` command.

aws polly list-speech-synthesis-tasks

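The response contains a SynthesisTasks list with one entry per task, each carrying the fields described above.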

To get a specific speech synthesis task by its TaskId, use the `get-speech-synthesis-task` command. Replace SPEECH_SYNTHESIS_TASK_ID with the ID of your task.

aws polly get-speech-synthesis-task \
    --task-id SPEECH_SYNTHESIS_TASK_ID
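
Because tasks run asynchronously, a script typically has to repeat this lookup until TaskStatus reaches completed or failed. The following is a rough sketch of such a polling loop. wait_for_task is a hypothetical helper name, and the 5 second default interval is an arbitrary assumption.

```shell
# Sketch: poll a speech synthesis task until it finishes.
# wait_for_task is a hypothetical helper; the default interval is an assumption.
wait_for_task() {
    local task_id="$1"
    local interval="${2:-5}"   # seconds to wait between checks
    local state
    while :; do
        # Ask only for the status field; stop if the CLI call itself fails.
        state="$(aws polly get-speech-synthesis-task \
            --task-id "$task_id" \
            --query 'SynthesisTask.TaskStatus' \
            --output text)" || return 1
        case "$state" in
            completed) echo "completed"; return 0 ;;
            failed)    echo "failed" >&2; return 1 ;;
            *)         sleep "$interval" ;;   # scheduled or inProgress
        esac
    done
}
```

The function returns 0 once the task completes and a non-zero status if the task fails, so it can gate a follow-up step such as downloading the audio from S3.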

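The get-speech-synthesis-task response contains a single SynthesisTask object with the same fields that start-speech-synthesis-task returns.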

Managing Lexicons

Pronunciation lexicons allow you to customize the pronunciation of words. For example, you can use a lexicon to pronounce AWS as Amazon Web Services. Lexicons are stored per AWS region, so a lexicon is only available in the region where you created it. You can manage lexicons using the `list-lexicons`, `put-lexicon`, `get-lexicon` and `delete-lexicon` commands.

Create a file named lexicon1.pls and add the following text to it.

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
        http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
    alphabet="ipa"
    xml:lang="en-US">
    <lexeme>
        <grapheme>AWS</grapheme>
        <alias>Amazon Web Services</alias>
    </lexeme>
</lexicon>

The <lexeme> tag describes the mapping between <grapheme> and <alias>. <grapheme> specifies the text whose pronunciation should change, and <alias> defines how it should be pronounced. In this example, AWS will be pronounced as Amazon Web Services in the synthesized speech when this lexicon is used during speech synthesis.
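
For simple alias mappings like this one, the file can also be generated from the shell. The sketch below prints a one-lexeme lexicon for a given grapheme and alias. make_lexicon is a hypothetical helper name, the language is fixed to en-US, and the inputs are assumed not to contain XML special characters such as < or &. The namespace URIs use the http:// form defined by the PLS specification.

```shell
# Sketch: print a one-lexeme PLS lexicon that maps a grapheme to an alias.
# make_lexicon is a hypothetical helper; inputs are not XML-escaped.
make_lexicon() {
    local grapheme="$1"
    local alias_text="$2"
    cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
        http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
    alphabet="ipa"
    xml:lang="en-US">
    <lexeme>
        <grapheme>${grapheme}</grapheme>
        <alias>${alias_text}</alias>
    </lexeme>
</lexicon>
EOF
}

# Usage: write the lexicon from this section to a file.
# make_lexicon AWS 'Amazon Web Services' > lexicon1.pls
```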

To upload this lexicon, use the put-lexicon command. The --name option specifies the name of the lexicon, which you later use to refer to it during speech synthesis.

aws polly put-lexicon \
    --name awslexicon \
    --content file://lexicon1.pls

Now generate speech using the lexicon.

aws polly synthesize-speech \
    --text 'Hello, This is a sample text recorded using AWS Polly.' \
    --voice-id Joanna \
    --output-format mp3 \
    --lexicon-names="awslexicon" \
    speech.mp3

AWS is now synthesized as Amazon Web Services in speech.mp3.

You can also include multiple lexemes in a single lexicon. For example,

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
        http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
    alphabet="ipa"
    xml:lang="en-US">
    <lexeme>
        <grapheme>AWS</grapheme>
        <alias>Amazon Web Services</alias>
    </lexeme>
    <lexeme>
        <grapheme>CLI</grapheme>
        <alias>Command Line Interface</alias>
    </lexeme>
</lexicon>

If two lexemes have the same grapheme, the synthesis engine uses the one that comes first.

You can even use multiple lexicons in a single command.

aws polly synthesize-speech \
    --text 'Hello, This is a sample text recorded using AWS Polly.' \
    --voice-id Joanna \
    --output-format mp3 \
    --lexicon-names '["lexicon1","lexicon2"]' \
    speech.mp3

Here, lexicon1 and lexicon2 are two lexicons. If the same grapheme appears in both of them, the entry from the first lexicon listed, lexicon1, is used.

List all the available lexicons using the list-lexicons command:

aws polly list-lexicons

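The response contains a Lexicons list. Each entry has the lexicon's Name and its Attributes, such as the language code, alphabet, and number of lexemes.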

Get a single lexicon by name using the get-lexicon command:

aws polly get-lexicon --name awslexicon
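
The get-lexicon response wraps the stored PLS document in JSON. To see only the XML, for instance to compare it with the local file, the content field can be extracted with --query. This is a small sketch, and show_lexicon is a hypothetical helper name.

```shell
# Sketch: print only the stored PLS content of a lexicon.
# show_lexicon is a hypothetical helper name.
show_lexicon() {
    local name="$1"
    aws polly get-lexicon \
        --name "$name" \
        --query 'Lexicon.Content' \
        --output text
}

# Usage:
# show_lexicon awslexicon
```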

Delete a lexicon using the `delete-lexicon` command:

aws polly delete-lexicon --name awslexicon
