The value of intellectual property (IP) rights lies in their ability to protect investment, reward creative endeavour and provide remedies and recourse should unauthorised third parties seek to leverage or usurp such rights. In previous artificial intelligence (AI) related articles, OK, Computer: AI, Music and IP Law in Australia and The Bad Blood of Deepfakes, we discussed some of the challenges posed by AI-generated music and explored the use of AI in relation to the creation of deepfakes. In this article, we query whether, in light of the way AI continues to encroach upon the creative industries, the time has come for the family of IP rights to welcome a new member and explicitly provide for a “voice” right and statutory remedy for unauthorised use.
Given the keen public focus on deepfakes and the effect that malicious content could have on society, it is somewhat surprising that scant attention has been paid to the vocation of voice. However, the ability of AI systems to accurately mimic human speech clearly raises concerns about the potential for unauthorised exploitation, marketplace confusion and the extinction of voice artists.
It is worth noting that when it comes to the interaction of voice and AI, there is an existing landscape of which to be aware. Most people will at some point have encountered synthetic “speech” when calling customer service and engaging with an interactive voice response system. Text-to-speech technology, as the name suggests, converts text into synthetic audio that closely imitates a human voice. Such technology forms the basis of voice cloning and can take a concatenative or a parametric approach. In the concatenative approach, audio recordings are used to build a pool of words and sounds from which sentences can be assembled; in the parametric approach, statistical models of speech are constructed and used to generate audio.
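The concatenative approach described above can be illustrated with a minimal sketch: pre-recorded audio units are looked up in a pool and stitched together, with short silences between them, to form a sentence. The word-to-samples mapping and the audio values below are hypothetical placeholder data, not a real recording pool.

```python
# Minimal sketch of concatenative text-to-speech: sentences are
# assembled by stitching together pre-recorded audio units.

SILENCE_GAP = [0.0] * 4  # short pause inserted between units

# Hypothetical unit pool: each "recording" is a list of audio samples.
unit_pool = {
    "hello": [0.1, 0.3, 0.2],
    "world": [0.4, 0.1],
}

def synthesise(text):
    """Concatenate recorded units for each word, with silence between them."""
    audio = []
    for i, word in enumerate(text.lower().split()):
        if word not in unit_pool:
            raise KeyError(f"no recorded unit for {word!r}")
        if i > 0:
            audio.extend(SILENCE_GAP)
        audio.extend(unit_pool[word])
    return audio

print(len(synthesise("hello world")))  # prints 9: two units plus one gap
```

A parametric system, by contrast, would not store recordings at all; it would generate samples from a statistical model of speech, which is what makes modern neural variants of the approach so flexible.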
Enter AI, machine learning and neural network-based text-to-speech models, and the opportunities and potential issues of voice cloning, voice phishing, voice spoofing and misinformation become clear. One need only spend a few minutes experimenting with text-to-speech synthesis systems such as Speechify, Tacotron, WaveNet, Lyrebird and Polly to see how quickly dramatic advances in quality have been made, and how the attendant cost efficiencies have disrupted and devalued voice.
Despite the arguable erosion and devaluing of voice in recent years, it took a recent skirmish involving ScarJo (aka Scarlett Johansson) to shine a light on the issue of voice rights. Johansson’s distinctive voice will be recognisable to fans of Black Widow, The Avengers and the Marvel universe, as well as the 2013 Spike Jonze film Her, where Johansson supplied the memorable voice of Joaquin Phoenix’s virtual assistant “Samantha”.
It is hard not to view with some suspicion OpenAI’s release of a version of ChatGPT with a “Sky” voice that is, if not “eerily reminiscent” of, then certainly evocative of, Johansson’s “Samantha”. While OpenAI denied any connection between “Samantha” and “Sky”, OpenAI CEO Sam Altman is on record as noting that Her is his favourite film, and he had previously sought – and Johansson had twice rebuffed – requests to license her voice for this purpose. Altman’s convenient timing and single-word post “her” on X, coinciding with the release of the new version of ChatGPT and “Sky”, provide sufficient grist for the circumstantial mill. As the idiom goes, any publicity is good publicity.
Although the perceived and oft-ventilated threat relating to AI was previously limited to rogue, nefarious and broadly “criminal” action, OpenAI’s apparent leveraging of a voice evocative of Johansson’s “Samantha” for its own virtual assistant has shattered previously held assumptions that business would act in good faith.
In the 2021 documentary Roadrunner: A Film About Anthony Bourdain, director Morgan Neville included 45 seconds of an AI-generated deepfake voice that sounded like, but was not, the late Bourdain. By design, Neville wanted the AI-generated audio to be indistinguishable from the variety of archival clips drawn from Bourdain’s extensive career. In the context of a work of 1 hour and 59 minutes in duration, 45 seconds may not seem substantive. However, the use went unnoticed upon release and accordingly misled audiences into thinking that Bourdain had said things he did not actually say. While the ethics of such a decision are clearly questionable, especially in the context of a documentary, using AI-generated audio content as a deliberate and disclosed artistic choice may well open the door to new and inspiring art.
To date, Australia has adopted a relatively permissive approach to the adoption and regulation of AI. Given the lack of regulation in this space, representative organisations across the creative industries and arts, such as the Australian Association of Voice Actors (AAVA), APRA AMCOS and SAG-AFTRA, are all grappling with how best to protect their craft and the interests and livelihoods of their members. At the heart of the very real and legitimate anxiety that many creatives, including voice artists, feel as AI continues to advance and disrupt their livelihoods are the three C’s: control (over whether and how voices are used), consent (to the use of voices in training AI systems) and contract (the arrangements governing the relationship between voice artists and the companies developing AI voice technologies).
In the United States case Midler v. Ford Motor Co., Bette Midler successfully sued the Ford Motor Company in relation to a television commercial featuring a “sound alike” vocalist’s cover version of the celebrated chanteuse’s song “Do You Want to Dance”. As the advertisement was designed to evoke Midler’s identity and persona, the Court reasoned that a person’s voice can be as distinctive and valuable as their image and is therefore deserving of legal protection. Similarly, in Waits v. Frito-Lay, Inc., US singer-songwriter Tom Waits, known for his distinctive raspy and gravelly vocals, successfully took action against the snack company’s use of a “sound alike” singer, which was deemed an appropriation of Waits’ voice and a violation of his right of publicity.
But what if a voice artist excels at their job but has not accrued the reputation of an actor, musician or celebrity? And what if there is no recognised “right of publicity”? In Australia, there is no recognised “personality right”, and the right to one’s image and likeness is protected primarily through a combination of common law and statutory provisions, including the common law tort of passing off, the misleading and deceptive conduct provisions of the Australian Consumer Law, privacy law, trade mark registration and copyright. In many jurisdictions, including Australia, there is also no explicit “voice right” that grants individuals exclusive control over the commercial use of their voice. While it is possible to secure protection of non-traditional “sound” trade marks in Australia, such marks are necessarily limited to short distinctive catchphrases, and this does not protect a voice per se.
The potential for AI voice farming and the commoditisation of the work and value that voice artists contribute to a variety of creative and commercial industries mean that the future of voice is at a crossroads. Like the proverbial frog in slowly boiling water, it is easy to wait until it is far too late; it is infinitely more difficult to recognise a threat for what it is and to have the awareness and impetus to engage with the issues at hand, or to pivot. Many may argue that the existing matrix of legal rights is sufficient to provide remedies for unauthorised use of voice – perhaps in some jurisdictions, but arguably not in Australia. AI presents an opportunity to adapt, and only those who adapt will survive.