Abstract. Existing datasets for audio understanding primarily focus on single-turn interactions (i.e., audio captioning and audio question answering) for describing audio in natural language, which limits the understanding of audio through interactive dialogue. To address this gap, we introduce Audio Dialogues, a multi-turn dialogue dataset containing 160k samples for general audio sounds and music. In addition to dialogues, Audio Dialogues also has question/answer pairs to understand and compare multiple input audios together. Audio Dialogues leverages a prompting-based approach and caption annotations from existing datasets to generate multi-turn dialogues using a Large Language Model (LLM). We evaluate existing audio-augmented large language models on our proposed dataset to demonstrate the complexity and applicability of Audio Dialogues.
We show in detail the prompts that are used to generate (1) the AudioSet Dialogues, (2) the Music Dialogues, and (3) the AudioSet comparison subsets of the Audio Dialogues dataset.
sys prompt: Based on the sound events, create a dialogue between you (the assistant) and a person (the user) about the events in the audio. Each dialogue should consist of:
1. A user examines the audio, and sends a reasonable and creative message to the assistant. 2. Once the audio is provided, the assistant thoroughly perceives and comprehends them, responding with helpful answers that provide comprehensive reasoning. Do not include timestamps in the answer provided by the assistant. 3. Considering the past dialogue i.e. the question and the answer in the previous timestep, the user chooses another question or further inquiry. 4. The assistant provides a helpful answer providing comprehensive reasoning regarding the audio content. The audios are 10 seconds long and the sound events are segmented based on where they occur in the audio. All the questions asked by the user should be diverse and complex. Example 1: Sound Events: Sound of Ambulance (siren) (High-pitched and wailing): [0.0s-1.0s]; Sound of Traffic noise, roadway noise (Droning, loud and intrusive): [0.0s-10.0s]; Sound of Accelerating, revving, vroom (Highpitched, short and intense): [2.0s-10.0s]; Sound of Generic impact sounds (Loud and sharp): [6.7s-6.8s]. Dialogue: User: What does the audio sound like? Assistant: An ambulance siren echoes while traffic noise fades, and an engine revs. User: Can you describe in more detail about the engine revs? Assistant: The engine revving is highpitched, short and intense. User: From the highpitched sound, can you tell me what vehicle it was? Assistant: It is hard to tell this from the information provided. User: Are there any other events in the audio that might be relevant? Assistant: There is a sound of impact which is very loud and sharp. 
Example 2: Sound Events: Sound of Outside, rural or natural: [0.000s-10.000s]; Sound of Liquid: [0.474s-0.680s], [4.249s-5.292s]; Sound of Gurgling, bubbling: [0.474s-0.680s], [4.249s-5.292s]; Sound of Splash, splatter: [1.620s-3.116s]; Sound of Laughter: [2.745s-4.029s], [5.299s-10.000s]; Sound of Chuckle, chortle: [2.752s-4.029s]; Sound of Speech: [4.551s-4.935s]; Sound of Male speech, man speaking: [4.557s-4.935s]; Sound of Belly laugh: [5.292s-10.000s] Dialogue: User: What are the main events in the audio? Assistant: Liquid sounds, laughter, speech, and a man speaking can be heard in a rural setting. User: Is he speaking to other people? Assistant: As there is sound of laughter present from other people, the man might be speaking to others. User: Can you explain in more detail about the first event? Assistant: The liquid sounds are more of splashing, thrashing and splattering sounds. User: Are there other people laughing with him? Assistant: Yes, there are other laughter voices along with that of the man. Based on the above examples, design a four-turn dialogue between the user and assistant for the example given below. The question asked by the user should use pronouns such as 'he', 'she', 'it', 'they' and 'them' while referring to entities that have already occurred in the past questions or answers. Format each QA pair in a single line as a JSON dictionary (key "user" for question, and "assistant" for answer, wrapped with { and }). Do not include any other explanation. Example 3:
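The Sound Events line in the examples above follows a regular pattern: an event label, an optional parenthesised description, and one or more time spans, with events separated by semicolons. A minimal sketch of how such a line could be assembled from segment annotations is given here. The function name `format_sound_events`, the tuple layout, and the three-decimal timestamp formatting are assumptions made for illustration, not the authors' pipeline code.

```python
def format_sound_events(events):
    """Render (label, description, segments) tuples in the style of the prompt examples.

    `events` is a hypothetical input layout: a list of tuples holding the event
    label, an optional free-text description (or None), and a list of
    (start_seconds, end_seconds) pairs.
    """
    parts = []
    for label, description, segments in events:
        name = f"Sound of {label}"
        if description:
            name += f" ({description})"
        # Each occurrence of the event becomes one bracketed time span.
        spans = ", ".join(f"[{start:.3f}s-{end:.3f}s]" for start, end in segments)
        parts.append(f"{name}: {spans}")
    return "Sound Events: " + "; ".join(parts)


example = [
    ("Liquid", None, [(0.474, 0.680), (4.249, 5.292)]),
    ("Splash, splatter", None, [(1.620, 3.116)]),
]
print(format_sound_events(example))
```

The resulting string would be appended after the final `Example 3:` marker of the prompt so that the LLM completes the dialogue for the new clip.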
sys prompt: Based on the music, create a dialogue between you (the assistant) and a person (the user)
about the events in the music audio. Each dialogue should consist of: 1. A user examines the music, and sends a reasonable and creative message to the assistant. 2. Once the music audio is provided, the assistant thoroughly perceives and comprehends them, responding with helpful answers that provide comprehensive reasoning. Do not include timestamps in the answer provided by the assistant. 3. Considering the past dialogue i.e. the question and the answer in the previous timestep, the user chooses another question or further inquiry. 4. The assistant provides a helpful answer providing comprehensive reasoning regarding the music content. The more complex and diverse the question, the better. You can use the list of questions below while asking questions. List of questions: Is it a live recording or a studio recording? Is the guitar acoustic or electric? Is it playing chords, a riff or a solo? What music style or genre is it? What key is it in? How about the time signature? What's the tempo? Slow, Medium or Fast? Does it have vocals? Is it low pitched or high pitched? Is the person speaking fast or slow? What is the language and accent? What emotion is conveyed by the voice? What are the vocal characteristics of that emotion? How old is the person likely to be given their vocal characteristics? Example 1: Music description: House music with a thought provoking message in the beginning that eventually builds up and can be used to create the sense of a social revolution. Dialogue: User: What type of music is this? Assistant: It is house music with synth and drums and a man speaking at the beginning. User: What is his tone while speaking? Assistant: He has an intense and energetic tone in his voice. User: Does he speak till the end? Assistant: No, he stops speaking halfway and then it is followed by house music. User: What is the source of this music? Assistant: This is electronic music. 
Example 2: Music description: Energetic bluesy song with a harmonica and horn section in musical dialogue. Dialogue: User: What instruments are playing in this music? Assistant: This is a jazzy tune with harmonica, trumpet and drums playing. User: Which key is the music playing in? Assistant: The key is A Major. User: How will you explain the tempo of this piece? Assistant: The music is bouncy and positive with high beats per minute in the beginning and then changes signature halfway through. User: What is the source of this music? Assistant: This is fusion jazz as it has elements of bass and synthesizer. Based on above examples, design a four-turn dialogue between the user and assistant for the example given below. The question asked by the user should use pronouns such as 'he', 'she', 'it', 'they' and 'them' while referring to entities that have already occurred in the past questions or answers. Format each QA pair in a single line as a JSON dictionary (key "user" for question, and "assistant" for answer, wrapped with { and }). Do not include any other explanation. Example 3:
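Both dialogue prompts ask the LLM to return each QA pair on a single line as a JSON dictionary with the keys "user" and "assistant". The sketch below shows one way such a response could be parsed and checked for the requested number of turns. The function name `parse_dialogue` and the choice to reject malformed responses outright are assumptions; the text here does not describe how the authors handled malformed outputs.

```python
import json


def parse_dialogue(raw, expected_turns=4):
    """Parse one JSON object per line into (user, assistant) pairs.

    Returns None when any line is not valid JSON, when a line lacks exactly
    the keys "user" and "assistant", or when the turn count is unexpected.
    """
    turns = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between QA pairs
        try:
            pair = json.loads(line)
        except json.JSONDecodeError:
            return None
        if not isinstance(pair, dict) or set(pair) != {"user", "assistant"}:
            return None
        turns.append((pair["user"], pair["assistant"]))
    return turns if len(turns) == expected_turns else None


response = (
    '{"user": "What type of music is this?", "assistant": "It is house music."}\n'
    '{"user": "Does he speak till the end?", "assistant": "No, he stops halfway."}'
)
print(parse_dialogue(response, expected_turns=2))
```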
sys prompt: Based on the description of audios, create a dialogue between you (the assistant) and
a person (the user) about the events in the audio. Example 1: Audio 1: Sound of Car (Engine hum and tire noise.) Audio 2: Sound of Car (Engine hum and tire noise.) Dialogue: User: What's the common type of sound in these two audios? Assistant: Both of them have sounds of car and engine humming. Example 2: Audio 1: Sound of Male singing (Deep, resonant, and powerful tones.): [0.000s-1.246s], [1.595s-1.851s], [5.437s-8.394s]; Sound of Music (Sound produced by vibrating instruments.): [0.000s-10.000s]; Sound of Shout (Loud, high-pitched, intense vocal sound.): [2.375s-2.910s] Audio 2: Sound of Female singing (High-pitched and melodious tones.): [0.000s-1.484s], [3.295s-7.558s], [7.779s-10.000s]; Sound of Singing (Melodic vocal sounds with pitch variation.): [0.000s-1.979s], [4.516s-5.011s], [7.537s-10.000s]; Sound of Crowd (Loud, diverse, and overlapping sounds.): [0.000s-2.013s], [7.558s-10.000s]; Sound of Music (Sound produced by vibrating instruments.): [0.000s-10.000s]; Sound of Human voice (Unique, complex, dynamic sound.): [9.842s-10.000s] Dialogue: User: What are the differences between the speech in these two audios? Assistant: The first one is male singing, the second one is a female singing. Example 3: Audio 1: This is a Christmas music piece with heavy influences from Celtic music. There is a female vocalist singing melodically as the lead. The melodic background is provided by the strings and the piano and at the same time Celtic instruments that resemble a harp and a dulcimer. There is a calming and positive aura to this piece. It could be playing in the background at a Christmas party. It could also be used in Christmas-themed shows or social media content. Audio 2: This music is a spirited instrumental. The tempo is fast with an animated Piano harmony, synthesizer arrangements, synthesised violins, rhythmic digital drums, funky bass lines and electronic sounds. The music is superimposed over sounds of tapping and rhythmic beeping. 
The music is happy, perky, upbeat, enthusiastic, lively and spirited. Audio 3: This is a lullaby that features a main melody made with a xylophone that has a bright sound. A sustained piano synth creates a dreamy and sleepy ambience sound. A subtle cello plays single chords that bounce from one beat to the next. This is a song a baby could fall asleep to. Dialogue: User: What are the same and different instruments played in these three music audios? Assistant: Piano appears in all three music audios. The differences are that the first music has harp, the second music has synthesizer and violins, and the third music has cello. Based on the above examples, design a single-turn dialogue between the user and assistant for the example given below. The question asked by the user should focus on either the similarities or the differences between the audios given for comparison. Once the description of audios is provided along with the question, the assistant thoroughly perceives and comprehends them, responding with helpful answers that provide comprehensive reasoning. Do not include timestamps in the answer provided by the assistant. Format each QA pair in a single line as a JSON dictionary (key "user" for question, and "assistant" for answer, wrapped with { and }). Do not include any other explanation. Example 4:
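In the comparison prompt, the descriptions of the audios to be compared are listed as Audio 1, Audio 2, and so on, following the final example marker. A short sketch of how that query portion could be built from a list of captions follows. The function name `format_comparison_query` and the minimum-of-two check are illustrative assumptions rather than part of the described pipeline.

```python
def format_comparison_query(descriptions, example_index=4):
    """Join per-audio descriptions as 'Audio 1: ... Audio 2: ...' after an example marker."""
    if len(descriptions) < 2:
        raise ValueError("a comparison needs at least two audio descriptions")
    body = " ".join(
        f"Audio {i}: {text.strip()}" for i, text in enumerate(descriptions, start=1)
    )
    return f"Example {example_index}: {body}"


query = format_comparison_query([
    "Sound of Car (Engine hum and tire noise.)",
    "Sound of Car (Engine hum and tire noise.)",
])
print(query)
```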
Here we show some examples of generated dialogues via our data generation pipeline.
Audio | Dialogue |
---|---|
User: What are the main sounds in the audio? Assistant: The main sounds in the audio are rhythmic heartbeats and some background noise. User: Is the heartbeat sound consistent throughout the audio? Assistant: Yes, the heartbeat sound is consistent and rhythmic throughout the audio, with brief pauses in between. |
|
User: What are the main sounds in the audio? Assistant: The main sounds in the audio are wind, change ringing or campanology, hubbub or speech noise, tapping, and clapping. User: What is the nature of the speech noise? Assistant: The speech noise is loud, overlapping, and indistinct, which is often referred to as hubbub or speech babble. User: Is there any specific event that the clapping sound might be associated with? Assistant: Given the presence of change ringing and hubbub, the clapping sound could be associated with a celebration or event, possibly in a public place like a town square or church. |
|
User: What are the main events in the audio? Assistant: The main events in the audio include whistling, music, cheering, shouting, and the sound of fireworks. User: Is the whistling part of the music? Assistant: The whistling sound is separate from the music, but it might be part of the overall musical performance or event. User: What about the cheering and shouting, are they related? Assistant: Yes, the cheering and shouting are likely related. They could be reactions from a crowd at a concert or a public event. User: Can you tell me more about the fireworks? Assistant: The fireworks sound is a loud, explosive, echoing boom. It's brief, suggesting it might be the finale of the event or a highlight moment. |
Examples of multi-turn dialogues generated by Qwen Audio, LTU and Audio Flamingo on the Audio Dialogues dataset.
Audio | Dialogue |
---|---|
User: Can you briefly explain what you hear in the audio? Ground-truth: The audio primarily consists of bird sounds, including chirping and bird vocalizations. There are also sounds of other animals. Additionally, there is a woman speaking at certain intervals. Qwen Audio: Birds are chirping. Audio Flamingo: The audio features a variety of sounds including bird vocalizations, mechanical sounds, impact sounds, a man speaking, and a ticking sound. User: Are the bird sounds continuous? Ground-truth: The bird sounds are not continuous. They occur at different intervals throughout the audio, with some overlap with the woman's speech. Qwen Audio: Yes, the bird sounds are continuous. Audio Flamingo: No, the bird sounds are not continuous. They occur at different intervals throughout the audio. User: Do the bird sounds and the person's speech overlap? Ground-truth: Yes, there are instances where the bird sounds and the woman's speech overlap. Qwen Audio: Yes, the bird sounds and the woman's speech overlap. Audio Flamingo: Yes, the bird sounds and the person's speech do overlap at certain points in the audio. |
|
User: Is the male voice rapping or singing? Ground-truth: The male voice is doing both, rapping and singing in the audio. LTU: The male voice is rapping. Audio Flamingo: The male voice is rapping. User: What kind of music is it? Ground-truth: The type of music isn't specified, but it's accompanied by rapping and male singing. LTU: The kind of music is hip-hop. Audio Flamingo: The music is a rap song. User: Is the male voice the only one present? Ground-truth: Yes, the male voice is the only one specified in the audio, both rapping and singing. LTU: Yes, the male voice is the only one present. Audio Flamingo: Yes, the male voice is the only one present in the audio. |
|
User: What genre does this music belong to? Ground-truth: This music belongs to the R&B/Disco genre. LTU: It is not specified in the given audio clip. Audio Flamingo: This music belongs to the R&B genre. User: Can you tell me more about the vocals in this track? Ground-truth: Sure, the vocals in this track are from a male singer who is singing in a higher register. His voice has been modified with effects, making it sound like a voice sample. LTU: The vocals in this track are described as rich and full-bodied, but no further details are given about the specific singer or the lyrics being sung in the audio clip. Audio Flamingo: The vocals in this track are by a male singer. His voice is deep, resonant, and powerful, adding to the overall emotional impact of the song. |