
TikTok’s $300 billion-valued guardian firm, ByteDance, is likely one of the world’s busiest AI builders. It plans to spend billions of {dollars} on AI chips this yr, whereas its tech provides Sam Altman’s OpenAI a run for its cash.
ByteDance’s Duobao AI chatbot is at present the preferred AI assistant in China, with 78.6 million month-to-month lively customers as of January.
This makes it the world’s second most-used AI app behind OpenAI’s ChatGPT (with 349.4 million MAUs). The just lately launched Doubao-1.5-pro is claimed to match the efficiency of OpenAI’s GPT-4o at a fraction of the fee.
As Counterpoint Analysis notes on this breakdown of Duobao’s positioning and performance, “very like its worldwide rival ChatGPT, the cornerstone of Doubao’s attraction is its multimodality, providing superior textual content, picture, and audio processing capabilities”.
It will probably additionally generate music.
In September, ByteDance added an AI music technology operate to the Duobao app, which apparently “helps greater than ten forms of music kinds and permits you to write lyrics and compose music with one click on”.
This, although, isn’t the tip of ByteDance’s fascination with constructing music AI applied sciences.
On September 18, ByteDance’s Duobao Staff introduced the huge launch of a collection of AI music fashions dubbed Seed-Music.
Seed-Music, they claimed, would “empower individuals to discover extra potentialities in music creation”.
Established in 2023, the ByteDance Doubao (Seed) Staff is “devoted to constructing industry-leading AI basis fashions”.
In response to the official launch announcement for Seed-Music in September, the AI music product “helps score-to-song conversion, controllable technology, music and lyrics modifying, and low-threshold voice cloning”.
It additionally claims that “it cleverly combines the strengths of language fashions and diffusion fashions and integrates them into the music composition workflow, making it appropriate for various music creation eventualities for each newbies and professionals”.
The official Seed-Music web site incorporates quite a lot of audio clips that display what it could actually do.
You may hear a few of that, under:
Extra necessary, although, is how Seed-Music was constructed.
Fortunately, the Duobao Staff has printed a tech report that explains the inside workings of their Seed-Music mission.
MBW has learn it cowl to cowl.
Within the introduction to ByteDance’s analysis paper, which you’ll learn in full right here, the corporate’s researchers state that, “music is deeply embedded in human tradition” and that “all through human historical past, vocal music has accompanied key moments in life and society: from love calls to seasonal harvests”.
“Our aim is to leverage fashionable generative modeling applied sciences, to not substitute human creativity, however to decrease the limitations to music creation.”
ByeDance analysis paper for Seed-Music
The intro continues: “Right this moment, vocal music stays central to world tradition. Nonetheless, creating vocal music is a fancy, multi-stage course of involving pre-production, writing, recording, modifying, mixing, and mastering, making it difficult for most individuals.”
“Our aim is to leverage fashionable generative modeling applied sciences, to not substitute human creativity, however to decrease the limitations to music creation. By providing interactive creation and modifying instruments, we purpose to empower each novices and professionals to have interaction at totally different levels of the music manufacturing course of.”
How Seed-Music works
ByteDance’s researchers clarify that the “unified framework” behind Seed-Music “is constructed upon three elementary representations: audio tokens, symbolic tokens, and vocoder latents”, which every correspond to “a technology pipeline.”
The audio token-based pipeline, as illustrated within the chart under, works like this: “(1) Enter embedders convert multi-modal controlling inputs, reminiscent of music type description, lyrics, reference audio, or music scores, right into a prefix embedding sequence. (2) The auto-regressive LM generates a sequence of audio tokens. (3) The diffusion transformer mannequin generates steady vocoder latents. (4) The acoustic vocoder produces high-quality 44.1kHz stereo audio.”
In distinction to the audio token-based pipeline, the symbolic token-based Generator, which you’ll see within the chart under, is “designed to foretell symbolic tokens for higher interpretability”, which the researchers state is “essential for addressing musicians’ workflows in Seed-Music”.
In response to the analysis paper, “Symbolic representations, reminiscent of MIDI, ABC notation and MusicXML, are discrete and might be simply tokenized right into a format suitable with LMs”.
ByteDance’s researchers add within the paper: “In contrast to audio tokens, symbolic representations are interpretable, permitting creators to learn and modify them straight. Nonetheless, their lack of acoustic particulars means the system has to rely closely on the Renderer’s capacity to generate nuanced acoustic traits for musical efficiency. Coaching such a Renderer requires large-scale datasets of paired audio and symbolic transcriptions, that are particularly scarce for vocal music.”
The apparent query…
By now, you’re most likely asking the place The Beatles and Michael Jackson’s music come into all of this.
We’re almost there. First, we have to discuss MIRs.
In response to the Seed-Music analysis paper, “to extract the symbolic options from audio for coaching the above system,” the staff behind the tech used numerous “in-house Music Info Retrieval (MIR) fashions”.
In response to this very clear rationalization over at Dataloop, MIR “is a subcategory of AI fashions that focuses on extracting significant data from music knowledge, reminiscent of audio alerts, lyrics, and metadata”.
Aka: It’s a metadata scraper. Stick a track into the jaws of a MIR mannequin, and it’ll analyze, predict and current knowledge which may embrace pitch, beats-per-minute (BPM), lyrics, chords, and extra.
Music Info Retrieval analysis first gained recognition over its capacity to assist with the digital classification of genres, moods, tempos, and so forth. – key constructing blocks for suggestion techniques utilized by music streaming companies.
Now, although, main generative AI music platforms are reportedly utilizing MIR analysis to enhance their product output.
Are you able to see the place that is going? Sure, in fact.
ByteDance’s analysis staff has efficiently constructed its personal in-house MIR fashions, which have been utilized by the ByteDance staff to “extract the symbolic options from audio” to construct elements of its Seed-Music system. These MIR fashions embrace:
AI, are you okay? Are you okay, AI?
Taking a deeper dive into the analysis printed by ByteDance for its Structural evaluation-focused MIR mannequin, we discover a analysis paper titled:
‘To catch a refrain, verse, intro, or the rest: Analyzing a track with structural features’.
It was printed in 2022. You can learn it right here.
In response to the paper: “Standard music construction evaluation algorithms purpose to divide a track into segments and to group them with summary labels (e.g., ‘A’, ‘B’, and ‘C’).
“Nonetheless, explicitly figuring out the operate of every section (e.g., ‘verse’ or ‘refrain’) isn’t tried, however has many purposes”.
On this analysis paper, they “introduce a multi-task deep studying framework to mannequin these structural semantic labels straight from audio by estimating ‘verseness,’ ‘chorusness,’ and so forth, as a operate of time”.
To conduct this analysis, the ByteDance staff used 4 “public datasets”, together with one referred to as the ‘Isophonics’ dataset, which, it notes, “incorporates 277 songs from The Beatles, Carole King, Michael Jackson, and Queen.”
The supply of the Isophonics dataset utilized by ByteDance’s researchers seems to be Isophonics.internet, described as the house for software program and knowledge sources from the Centre for Digital Music (C4DM) at Queen Mary, College of London.
The Isophonics web site notes that its “chord, onset, and segmentation annotations have been utilized by many researchers within the MIR group.”
The web site explains that “the annotations printed right here fall into 4 classes: chords, keys, structural segmentations, and beats/bars”.
In 2022, ByteDance’s researchers printed a video presentation of their, To catch a refrain, verse, intro, or the rest: Analyzing a track with structural features paper for the Worldwide Convention on Acoustics, Speech, and Sign Processing (ICASSP).
You may see this presentation under.
The video’s caption describes a “novel system/technique that segments a track into sections reminiscent of refrain, verse, intro, outro, bridge, and so forth”.
It demonstrates its findings associated to songs by the Beatles, Michael Jackson, Avril Lavigne and different artists:
We have to be cautious right here over any suggestion that ByteDance’s AI music-generating expertise could have been “skilled” utilizing songs by in style artists just like the Beatles or Michael Jackson.
But, as you may see, a dataset containing annotations of such songs has clearly been used as part of a ByteDance analysis mission on this area.
Any evaluation or reference to in style songs and their annotations in analysis performed or funded by a multi-billion-dollar expertise firm will certainly increase quite a lot of questions for the music {industry} – particularly these employed to guard its copyrights.
“We firmly consider that AI applied sciences ought to assist, not disrupt, the livelihoods of musicians and artists. AI ought to function a device for creative expression, as true artwork all the time stems from human intention.”
ByteDance’s Seed-Music researchers
There’s a part devoted to Ethics and Security on the backside of ByteDance’s Seed-Music analysis paper.
In response to ByteDance’s researchers, they “firmly consider that AI applied sciences ought to assist, not disrupt, the livelihoods of musicians and artists“.
They add: “AI ought to function a device for creative expression, as true artwork all the time stems from human intention. Our aim is to current this expertise as a chance to advance the music {industry} by reducing limitations to entry, providing smarter, sooner modifying instruments, producing new and thrilling sounds, and opening up new potentialities for creative exploration.”
The ByteDance researchers additionally define moral points particularly: “We acknowledge that AI instruments are inherently susceptible to bias, and our aim is to supply a device that stays impartial and advantages everybody. To realize this, we purpose to supply a variety of management parts that assist reduce preexisting biases.
“By returning creative selections to customers, we consider we are able to promote equality, protect creativity, and improve the worth of their work. With these priorities in thoughts, we hope our breakthroughs in lead sheet tokens spotlight our dedication to empowering musicians and fostering human creativity by way of AI.”
By way of Security / ‘deepfake’ issues, the researchers clarify that, “within the case of vocal music, we acknowledge how the singing voice evokes one of many strongest expressions of particular person id”.
They add: “To safeguard in opposition to the misuse of this expertise in impersonating others, we undertake a course of much like the security measures specified by Seed-TTS. This entails a multistep verification technique for spoken content material and voice to make sure the enrollment of audio tokens incorporates solely the voice of approved customers.
“We additionally implement a multi-level water-marking scheme and duplication checks throughout the generative course of. Trendy techniques for music technology could essentially reshape tradition and the connection between creative creation and consumption.
“We’re assured that, with robust consensus between stakeholders, these applied sciences will and revolutionize music creation workflow and profit music novices, professionals, and listeners alike.”Music Enterprise Worldwide