From CSWiki
Revision as of 21:33, 24 April 2006 by Gewang (talk | contribs) (III. Feature-based Sound Design)



  • The smartest sound editor ever built
  • (Feature-based Sound Design Framework/Workbench/System/null)
  • (Feature-aware TAPESTREA: An Integrated/Comprehensive/Smart/Interactive Approach to Sound Design Workbench)
  • (TAPESTREA: Augmenting Interactive Sound Design with Feature-based Audio Analysis)
  • Interactive Content Retrieval for Intelligent/Template-aware Sound Design
  • Interactive Sound Design by Example
  • FAT-APE-STREAT: Sound Design by Querying
  • Sound Design-by-Querying and by-Example
  • Finding New Examples to Sound Design By
  • Extending Sound Scene Modeling By Example with Examples
  • Integrating Sound Scene Modeling and Query-by-example
  • Sound Scene Modeling by Example with Integrated Audio Retrieval
  • Facilitating Sound Design using Query-by-example
  • Enriching/Extending/Expanding Sound Scene Modeling By Examples using Audio Information Retrieval
  • Enhancing the Palette: Querying in the Service of Interactive Sound Design
  • Expanding the Palette: Audio Information Retrieval for Intelligent Sound Design
  • Expanding the Palette: Audio Information Retrieval for Intelligent Data-driven Sound Design
  • Enhancing the Palette: Audio Information Retrieval for TAPESTREA
  • Expanding the Palette: Audio Information Retrieval for Sound Scene Modeling by Example
  • Enhancing the Palette: Template-based Retrieval for Intelligent Sound Design
  • Enhancing the Palette: Using Audio Information Retrieval to Expand the Transformative Power of TAPESTREA

AUTHORS (order ok?):

Ananya Misra, Matt Hoffman, Perry R. Cook, Ge Wang


(no. down with order.)



We integrate music information retrieval technologies with TAPESTREA techniques to facilitate and enhance sound design, providing a new class of "intelligent" sound design workbench.

I. Introduction + Motivation

Sound designers who work with environmental or natural sounds have, as a starting point, a large selection of existing audio samples, including sound effects, field recordings, and soundtracks from movies and television. The TAPESTREA system [cite] facilitates the reuse of existing recordings by offering a new framework for interactively extracting desired components of sounds, transforming these individually, and flexibly resynthesizing them to create new sounds. However, the corpus of existing audio remains unstructured and largely unlabeled, making it difficult to locate desired sounds without minute knowledge of the available database. This paper explores ways to leverage audio analysis at multiple levels in interactive sound design, via TAPESTREA. It also considers methods for TAPESTREA in turn to aid audio analysis.

The main goals of this work include: (1) aiding sound designers in creating varied and interesting sound scenes by combining elements of existing sounds, and (2) enabling a human operator to quickly identify similar sounds in a large collection or database. Combined with TAPESTREA's analysis-transformation-synthesis techniques and paradigms, this presents an extended "query by example" framework, where feature-based querying can enhance both the analysis and synthesis aspects of interactive sound recomposition. The constructs discussed here can also be useful in forensic audio applications and watermarking.

The rest of this paper is organized as follows. Section 2 addresses related work and provides an overview of the TAPESTREA system. Section 3 discusses the integration of audio information retrieval with the analysis-transformation-synthesis framework of TAPESTREA. Section 4 presents results. We conclude and discuss future work in Section 5.



  • Large corpus of unstructured and largely unlabeled audio

(sound effects, field recordings, soundtracks from movies and TV, etc.)

  • leverage audio analysis in interactive sound design (via TAPS)
  • vice versa


  • To aid sound designers in creating varied and interesting scenes (standard TAPS stuff)
  • To enable a human operator to quickly identify similar sounds in a large collection of sounds.
  • can also be useful for forensic audio applications, watermarking

II. Previous Work

Related Work

  • see references
  • Marsyas, Taps (sine+noise, transient, wavelet), feature-based synthesis
  • related systems generally fall into one of two categories: (1) "intelligent" audio editors, which generally extract musical information, or (2) sonic browsers for search and retrieval.


  • Sound-scene modeling by example / Content-aware tapestrea analysis interface
  • Re-composing natural sounds

III. Feature-based Sound Design

In order to augment TAPESTREA with audio information retrieval capabilities, we addressed two areas. First, we integrated a feature-based similarity query engine as a component of the TAPESTREA system, with well-defined points of interface to the analysis, synthesis, and template library components (Section 3.1). Second, we designed and integrated a new user interface specialized for similarity retrieval of TAPS templates and raw audio files, and for interactively visualizing and browsing regions of interest in the feature space (Section 3.2). Additionally, several retrieval-aware hooks were embedded into the existing user interfaces to allow querying and marking of sound events during analysis (Section 3.3).
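The query engine described above can be sketched as follows. This is a minimal illustration, assuming each template or audio file has already been reduced to a fixed-length feature vector (e.g. means of frame-wise features); the class and method names are invented for this sketch and are not the actual TAPESTREA interface.

```python
# Minimal sketch of a feature-based similarity query engine. Each entry is a
# (name -> feature vector) pair; queries rank entries by weighted Euclidean
# distance to an example vector. Illustrative only, not the real TAPS API.
import math

class QueryEngine:
    def __init__(self):
        self.entries = {}  # name -> feature vector

    def add(self, name, features):
        self.entries[name] = list(features)

    def query(self, example, k=3, weights=None):
        """Return the names of the k entries closest to the example vector."""
        w = weights or [1.0] * len(example)
        def dist(v):
            return math.sqrt(sum(wi * (a - b) ** 2
                                 for wi, a, b in zip(w, example, v)))
        ranked = sorted(self.entries.items(), key=lambda kv: dist(kv[1]))
        return [name for name, _ in ranked[:k]]

# toy usage: three templates with hypothetical 2-dimensional feature vectors
engine = QueryEngine()
engine.add("birds",    [0.9, 0.2])
engine.add("traffic",  [0.1, 0.8])
engine.add("crickets", [0.7, 0.35])
print(engine.query([0.85, 0.25], k=2))
```

A real engine would index many more dimensions and entries, but the interface points to analysis and the template library reduce to exactly these two operations: adding a feature vector, and ranking by distance to an example.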



Interactive template-based similarity search (database)

(figure for interface)

querying/marking recorded sounds for template discovery

IV. Results

V. Conclusion and Future Work

The most obvious next step toward improving the relevance of our query results is the incorporation of more features, particularly features capturing information about the time-domain dynamics of our sounds. At the moment, our only two primarily time-domain-oriented features (low power and feature variance) do not consider any long-scale periodicity or order-dependent qualities present in sounds, which could be quite relevant. A finer-grained set of spectral features might also provide better results.
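The kind of frame-wise extraction and summary involved here can be sketched as below. The spectral centroid stands in for the system's actual feature set (which this sketch does not reproduce), and the variance across frames illustrates why such a summary discards order-dependent structure: any permutation of the frames yields the same mean and variance.

```python
# Sketch: compute a frame-wise spectral centroid, then summarize it by its
# mean and variance across frames. Pure-Python DFT for self-containment;
# a real system would use an FFT library.
import math

def dft_magnitudes(frame):
    n = len(frame)
    mags = []
    for k in range(n // 2):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        mags.append(math.hypot(re, im))
    return mags

def spectral_centroid(frame):
    mags = dft_magnitudes(frame)
    total = sum(mags)
    # magnitude-weighted mean bin index; 0 for a silent frame
    return sum(k * m for k, m in enumerate(mags)) / total if total else 0.0

def summarize(signal, frame_size=64):
    frames = [signal[i:i + frame_size]
              for i in range(0, len(signal) - frame_size + 1, frame_size)]
    cents = [spectral_centroid(f) for f in frames]
    mean = sum(cents) / len(cents)
    var = sum((c - mean) ** 2 for c in cents) / len(cents)
    return mean, var

# a low-frequency tone yields a lower centroid than a high-frequency one
low  = [math.sin(2 * math.pi * 2 * i / 64) for i in range(256)]
high = [math.sin(2 * math.pi * 20 * i / 64) for i in range(256)]
print(summarize(low)[0], summarize(high)[0])
```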

One advantage to using a relatively small number of features, however, is that it remains practical for a user to manually set the weights to give each feature when ranking similarity. When using large numbers of features, many of which may be strongly correlated with each other, choosing the relative importance of each feature becomes both increasingly important (lest one set of features dominate the distance calculation) and increasingly difficult and time-consuming. An approach to dealing with this problem is to use machine learning algorithms such as [schapire rankboost] to try to infer how a user would rank the similarity of a set of sounds to one another from feature data. However, such an approach requires a substantial amount of human-labeled data, and presupposes that a general mapping is possible.
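The dominance problem described above can be shown with a toy example (feature values invented for illustration): with unit weights, a large-scale feature such as raw power swamps the distance calculation, and manually down-weighting it lets a perceptually important small-scale feature decide the ranking.

```python
# Toy illustration of manual feature weighting in a similarity ranking.
import math

def wdist(a, b, w):
    """Weighted Euclidean distance between feature vectors a and b."""
    return math.sqrt(sum(wi * (x - y) ** 2 for wi, x, y in zip(w, a, b)))

query  = [100.0, 0.10]   # hypothetical [power, timbre-like feature]
cand_a = [100.0, 0.90]   # same power, very different timbre
cand_b = [105.0, 0.10]   # slightly different power, same timbre

unit  = [1.0, 1.0]
tuned = [0.001, 1.0]     # user down-weights the large-scale power feature

# with unit weights the power difference dominates: cand_a looks closer
print(wdist(query, cand_a, unit) < wdist(query, cand_b, unit))
# with tuned weights the ranking flips: cand_b (same timbre) looks closer
print(wdist(query, cand_a, tuned) > wdist(query, cand_b, tuned))
```

With two features the flip is trivial to arrange by hand; with dozens of correlated features, finding such weights manually is exactly the difficulty noted above.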

A disadvantage to query-by-example systems is that, by definition, they require the user to have an example on hand of the sort of sound they wish to find. It may be possible to circumvent this problem using the feature-based synthesis techniques we are currently developing, as described in [hoffman, these proceedings fingers crossed]. Using feature-based synthesis, we can synthesize audio matching arbitrary feature values specified by the user in real time. Once the sound generated in this way begins to resemble what the user is looking for, the features used to specify that sound can be passed as a query to the database, which should return a sound resembling what the user had in mind.

Finally, we hope to use machine learning techniques to better predict appropriate source separation parameters for TAPESTREA based on the feature values we extract for each sound. Since we have built, and continue to build, a large library of extracted template files recording good separation parameters for a wide variety of sounds, it may be possible to leverage the features we extract to classify sounds into broad categories for which certain separation parameters are most appropriate. This in turn would allow us to run sinusoidal analyses on batches of sound files, and to extract new features based on statistics about the automatically separated deterministic, residual, and stochastic components of those sounds.
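A speculative sketch of this last idea: classify a new sound into a broad category via its nearest labeled neighbor in feature space, then look up the separation parameters that worked for that category. The categories, feature vectors, and parameter names below are all invented placeholders, not values from our template library.

```python
# Nearest-neighbor lookup of separation-parameter presets by feature vector.
# All labels, features, and presets here are hypothetical.
import math

LIBRARY = [
    ([0.9, 0.1], "tonal"),   # e.g. bird chirps: strong sinusoidal tracks
    ([0.2, 0.8], "noisy"),   # e.g. rain: mostly stochastic
    ([0.5, 0.5], "mixed"),
]
PRESETS = {
    "tonal": {"num_tracks": 20, "noise_floor": 0.1},
    "noisy": {"num_tracks": 2,  "noise_floor": 0.6},
    "mixed": {"num_tracks": 10, "noise_floor": 0.3},
}

def suggest_parameters(features):
    """Return the category of the nearest labeled exemplar and its preset."""
    _, category = min(LIBRARY,
                      key=lambda entry: math.dist(entry[0], features))
    return category, PRESETS[category]

print(suggest_parameters([0.85, 0.15]))  # nearest to the "tonal" exemplar
```

In practice the exemplars would come from the template library itself, and the predicted preset would seed, rather than replace, the user's interactive tuning of the separation.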

VI. References

Bregman, A. Auditory Scene Analysis. MIT Press, Cambridge, 1990.

   * NOT what taps does.

Chafe, C., B. Mont-Reynaud, and L. Rush. (1982). "Towards an intelligent editor of digital audio: Recognition of musical constructs," Computer Music Journal 6(1).

   * 1st paper to deal with transcription without dealing with identifying notes

Dubnov, S., Z. Bar-Joseph, R. El-Yaniv, D. Lischinski, and M. Werman. (2002). "Synthesizing sound textures through wavelet tree learning," IEEE Computer Graphics and Applications 22(4).

Fernstrom, M. and E. Brazil. (2001). "Sonic Browsing: an auditory tool for multimedia asset management," In Proceedings of the International Conference on Auditory Display.

   * deals more with musical structures and notes

Foote, J. (1999). "An overview of audio information retrieval," ACM Multimedia Systems, 7: 2-10.

Jolliffe, I. T. (1986). Principal Component Analysis. Springer-Verlag, New York.

Kang, H. and B. Shneiderman. (2000). "Visualization Methods for Personal Photo Collections: Browsing and Searching in the PhotoFinder," In Proceedings of the International Conference on Multimedia and Expo, New York, IEEE.

Kashino, K. and H. Tanaka. (1993). "A sound source separation system with the ability of automatic tone modeling," International Computer Music Conference.

   * uses of clustering techniques for identifying sound sources

Misra, A., P. Cook, and G. Wang. (2006). "Musical Tapestry: Re-composing Natural Sounds," International Computer Music Conference. Submitted.

Misra, A., P. Cook, and G. Wang. (2006). "TAPESTREA: Sound Scene Modeling By Example," International Conference on Digital Audio Effects. Submitted.

Serra, X. (1989). "A System for Sound Analysis Transformation Synthesis based on a Deterministic plus Stochastic Decomposition," PhD thesis, Stanford University.

Shneiderman, B. (1998). Designing the User Interface: Strategies for Effective Human- Computer Interaction. Addison-Wesley, 3rd edition.

Tzanetakis, G. and P. Cook. (2000). "MARSYAS: A Framework for Audio Analysis," Organised Sound, Cambridge University Press 4(3).

Tzanetakis, G. and P. Cook. (2001). "MARSYAS3D: A prototype audio browser-editor using a large scale immersive visual and audio display," In Proceedings of the International Conference on Auditory Display.