III: Medium: Spatial Sound Scene Description

  • Bello, Juan Pablo (PI)
  • Roginska, Agnieszka A. (CoPI)
  • Cartwright, Mark (CoPI)
  • McFee, Brian (CoPI)

Project: Research project

Project Details


Sound is rich with information about the surrounding environment. If you stand on a city sidewalk with your eyes closed and listen, you will hear the sounds of events happening around you: birds chirping, squirrels scurrying, people talking, doors opening, an ambulance speeding, a truck idling. In addition, you will also likely be able to perceive the location of each sound source, where it's going, and how fast it's moving. This project will build innovative technologies to allow computers to extract this rich information out of sound. By not only identifying which sound sources are present but also estimating the spatial location and movement of each sound source, sound sensing technology will be able to better describe our environments with microphone-enabled everyday devices, e.g. smartphones, headphones, smart speakers, hearing-aids, home camera, and mixed-reality headsets. For hearing impaired individuals, the developed technologies have the potential to alert them to dangerous situations in urban or domestic environments. For city agencies, acoustic sensors will be able to more accurately quantify traffic, construction, and other activities in urban environments. For ecologists, this technology can help them more accurately monitor and study wildlife. In addition, this information complements what computer vision can sense, as sound can include information about events that are not easily visible, such as sources that are small (e.g., insects), far away (e.g., a distant jackhammer), or simply hidden behind another object (e.g., an incoming ambulance around a building's corner). This project also includes outreach activities involving over 100 public school students and teachers, as well as the training and mentoring of postdoctoral, graduate and undergraduate students.

This project will develop computational models for spatial sound scene description: that is, estimating the class, spatial location, direction and speed of movement of living beings and objects in real environments by the sounds they make. The investigators aim for their solutions to be robust across a wide range of sound scenes and sensing conditions: noisy, sparse, natural, urban, indoors, outdoors, with varying compositions of sources, with unknown sources, with moving sources, with moving sensors, etc. While current approaches show promise, they are still far from robust in real-world conditions and thus unable to support any of the above scenarios. These shortcomings stem from important data issues such as a lack of spatially annotated real-world audio data, and an over-reliance on poor quality, unrealistic synthesized data; as well as methodological issues such as excessive dependence on supervised learning and a failure to capture the structure of the solution space. This project plans an approach mixing innovative data collection strategies with cutting-edge machine learning solutions. First, it advances a novel framework for the probabilistic synthesis of soundscape datasets using physical and generative models. The goal is to substantially increase the amount, realism and diversity of strongly-labeled spatial audio data. Second, it collects and annotates new datasets of real sound scenes via a combination of high-quality field recordings, crowdsourcing, novel VR/AR multimodal annotation strategies and large-scale annotation by citizen scientists. Third, it puts forward novel deep self-supervised representation learning strategies trained on vast quantities of unlabeled audio data. Fourth, these representation modules are paired with hierarchical predictive models, where the top/bottom levels of the hierarchy correspond to coarser/finer levels of scene description. Finally, the project includes collaborations with three industrial partners to explore applications enabled by the proposed solutions. The project will result in novel methods and open source software libraries for spatial sound scene generation, annotation, representation learning, and sound event detection/localization/tracking; and new open datasets of spatial audio recordings, spatial sound scene annotations, synthesized isolated sounds, and synthesized spatial soundscapes.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Effective start/end date7/1/206/30/23


  • National Science Foundation: $1,022,555.00


Explore the research topics touched on by this project. These labels are generated based on the underlying awards/grants. Together they form a unique fingerprint.