Taps IS Mostly Infinite Recursion.
* The smartest sound editor ever built
* (Feature-based Sound Design Framework/Workbench/System/null)
* (Feature-aware TAPESTREA: An Integrated/Comprehensive/Smart/Interactive Approach to Sound Design Workbench)
* (TAPESTREA: Augmenting Interactive Sound Design with Feature-based Audio Analysis)
* Interactive Content Retrieval for Intelligent/Template-aware Sound Design
* Interactive Sound Design by Example
* FAT-APE-STREAT: Sound Design by Querying
* Sound Design-by-Querying and by-Example
* Finding New Examples to Sound Design By
* Extending Sound Scene Modeling By Example with Examples
* Integrating Sound Scene Modeling and Query-by-example
* '''Sound Scene Modeling by Example with Integrated Audio Retrieval'''
* Facilitating Sound Design using Query-by-example
* Enriching/Extending/Expanding Sound Scene Modeling By Examples using Audio Information Retrieval
* Enhancing the Palette: Querying in the Service of Interactive Sound Design
* '''Expanding the Palette: Audio Information Retrieval for Intelligent Sound Design'''
* Expanding the Palette: Audio Information Retrieval for Intelligent Data-driven Sound Design
* Enhancing the Palette: Audio Information Retrieval for TAPESTREA
* '''Expanding the Palette: Audio Information Retrieval for Sound Scene Modeling by Example'''
* Enhancing the Palette: Template-based Retrieval for Intelligent Sound Design
* Enhancing the Palette: Using Audio Information Retrieval to Expand the Transformative Power of TAPESTREA
= AUTHORS (order ok?) =
Ananya Misra, Matt Hoffman, Perry R. Cook, Ge Wang
(no. down with order.)
TAPESTREA is a unified sound design framework for selectively extracting sound components from existing recordings and flexibly transforming and resynthesizing them to create new sound scenes or recompositions. While these techniques and paradigms enable the production of a wide range of recompositions, navigating the space of available sounds at each stage has so far been left to the user. We now expand the TAPESTREA palette by incorporating a query-based framework to assist the user in locating sound components during analysis and synthesis. We integrate music information retrieval technologies with TAPESTREA techniques to facilitate and enhance sound design, yielding a new class of "intelligent" sound design workbench.
= I. Introduction + Motivation =
Sound designers who work with environmental or natural sounds start from a large selection of existing audio samples, including sound effects, field recordings, and soundtracks from movies and television. The TAPESTREA system [cite] facilitates the reuse of existing recordings by offering a new framework for interactively extracting desired components of sounds, transforming them individually, and flexibly resynthesizing them to create new sounds. However, the corpus of existing audio remains unstructured and largely unlabeled, making it difficult to locate desired sounds without intimate knowledge of the available database. This paper explores ways to leverage audio analysis at multiple levels in interactive sound design via TAPESTREA, and also considers methods by which TAPESTREA can in turn aid audio analysis.
The main goals of this work include: (1) aiding sound designers in creating varied and interesting sound scenes by combining elements of existing sounds, and (2) enabling a human operator to quickly identify similar sounds in a large collection or database. Combined with TAPESTREA's analysis-transformation-synthesis techniques and paradigms, this presents an extended "query by example" framework, where feature-based querying can enhance both the analysis and synthesis aspects of interactive sound recomposition. The constructs discussed here can also be useful in forensic audio applications and watermarking.
The rest of this paper is organized as follows.  Section 2 addresses related work and provides an overview of the TAPESTREA system.  Section 3 discusses the integration of audio information retrieval with the analysis-transformation-synthesis framework of TAPESTREA.  Section 4 presents results.  We conclude and discuss future work in Section 5.
= II. Previous Work =
== Related Work ==
* see references
* Marsyas, Taps (sine+noise, transient, wavelet), feature-based synthesis
* related systems generally fall into one of two categories: (1) "intelligent" audio editors, which typically extract musical information, or (2) sonic browsers for search and retrieval.
TAPESTREA, Techniques And Paradigms for Expressive Synthesis, Transformation and Rendering of Environmental Audio, aims to facilitate the creation of new sound scenes or recompositions from existing sounds. It builds on the notion that most natural or environmental sounds consist of foreground and background components that are best modeled separately. It therefore enables the extraction of the following types of components, or ''templates'', from an existing sound:
(1) Deterministic events: highly sinusoidal foreground events, often perceived as pitchy, such as bird chirps or voices.
(2) Transient events: brief, noisy foreground events with high stochastic energy, such as a door slamming.
(3) Stochastic background: The background noise or "din" that is heard beneath the foreground events, such as ocean waves or street noise.
Each type of template is detected, extracted and resynthesized using techniques suited to its particular characteristics. Deterministic events are found and extracted by sinusoidal modeling based on the spectral modeling synthesis framework [cite serra]. They are then synthesized via sinusoidal resynthesis, enabling massive real-time frequency and time transformations. Transient events are located by examining energy changes in the time-domain envelope of the sound [cite?], and are resynthesized with desired transformations through a phase vocoder [cite?]. The stochastic background is obtained by (a) removing deterministic events during spectral modeling, and (b) removing transient events in the time domain and filling in the "holes" via wavelet tree learning [cite Dubnov] of nearby transient-free segments. A modified wavelet tree learning algorithm is then used to continuously resynthesize more background texture, controllably similar to the extracted background template.
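The separation pipeline above can be illustrated with a toy sketch of the deterministic path: pick spectral peaks from a frame, then resynthesize them additively with independent frequency and time transformations. This is a minimal numpy illustration with hypothetical helper names, not the actual TAPESTREA/SMS implementation (no parabolic peak interpolation, partial tracking, or residual modeling).

```python
import numpy as np

def extract_deterministic(frame, sample_rate, n_peaks=5):
    """Toy stand-in for the sinusoidal-modeling step: pick the strongest
    spectral peaks in one windowed frame (hypothetical helper)."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    bins = np.argsort(spectrum)[::-1][:n_peaks]      # simplistic: strongest bins
    freqs = bins * sample_rate / len(frame)          # bin index -> Hz
    amps = spectrum[bins] / (len(frame) / 2)
    return freqs, amps

def resynthesize(freqs, amps, n_samples, sample_rate,
                 freq_warp=1.0, time_stretch=1.0):
    """Additive resynthesis with independent frequency warping and time
    stretching, mirroring the transformations sinusoidal resynthesis allows."""
    n_out = int(n_samples * time_stretch)            # time stretch: more samples
    t = np.arange(n_out) / sample_rate
    out = np.zeros(n_out)
    for f, a in zip(freqs, amps):
        out += a * np.sin(2 * np.pi * f * freq_warp * t)  # warped frequency
    return out
```

Because frequency and time are represented separately (unlike raw samples), the two transformations are decoupled: halving every frequency does not change the duration, and stretching time does not shift pitch.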
In addition to this basic template set, TAPESTREA provides further template types to facilitate sound analysis and synthesis. These include: (MAYBE CUT SHORT)
(1) "Raw" template: A selected segment in time extracted from a recording, bandpass filtered between specified frequency bounds, thus capturing both foreground and background components of the selected time-frequency region. It can be resynthesized with time and frequency transformations.
(2) Loop: A structure for synthesizing repeating events, varying parameters such as periodicity, density, and range of random transformations.
(3) Timeline: A structure for synthesizing a collection of templates explicitly placed in time relative to each other.
(4) Mixed Bag: A structure for synthesizing multiple events repeating at different likelihoods.
(5) Script: ChucK [cite] scripts for finer control over the synthesis parameters.
TAPESTREA presents interfaces for interactive, parametric control over all aspects of the analysis, transformation and resynthesis, and can be used to create a wide range of sounds from a given set of recordings. Since templates can be saved to file and reloaded at a later sitting, it also paves the way for building a reusable database of extracted templates and raw sound effects, to be loaded in and used at any time. As this database grows, it becomes worthwhile to include audio information retrieval techniques for searching through it, visualizing it, and using it to its full potential. Information retrieval can also be beneficial in finding specific segments of a recording during the analysis phase. Context Aware TAPESTREA addresses some of these topics.
= III. Integrating Audio Information Retrieval =
In order to augment TAPESTREA with audio information retrieval capabilities, two areas are addressed.  First, we integrate a feature-based similarity query engine into the TAPESTREA system, and establish well-defined points of interface to the analysis, synthesis, and template library components.  Second, we provide a new user interface devoted to similarity retrieval of TAPS templates and raw audio files, and to interactively visualizing and browsing the feature space in regions of interest.  These additions allow the user to begin with a TAPESTREA template, transform it, and query for similar sounds (Section 3.2).  Additionally, several retrieval-aware hooks are embedded into the existing user interfaces to allow the querying and marking of sound events during analysis (Section 3.3).
== System Architecture ==
The TAPESTREA system architecture is shown in Figure X.  To add similarity retrieval, the architecture has been augmented with a Retrieval Engine and an Audio Database containing TAPESTREA templates, raw audio files, and their associated feature values and other meta-data.  As sound files are analyzed, the resulting templates are stored in the TAPESTREA working library, and optionally assimilated into the Audio Database.  The latter step is performed by the Retrieval Engine, which extracts features from each template and stores a copy of the template and its features in the persistent storage and index of the database.
A query can be made at any time against the Retrieval Engine, either through the Search user interface or from embedded features in the Analysis user interface.  A retrieval query consists of weights on the available features and a set of target values for all non-zero-weight features.  The target feature values can be provided in two ways: they can be extracted from an existing template or raw sound file, or they can be manually selected and/or modified, in part or in whole.  The query is then presented to the Retrieval Engine, which finds the N sounds in the database whose feature values (as optionally weighted by the user) most closely match those in the query, i.e. those with the smallest Euclidean distance to the query point.
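The weighted nearest-neighbor query above amounts to a few lines. A minimal sketch, assuming the database is an in-memory mapping from template names to equal-length feature vectors (the function and argument names are illustrative, not the Retrieval Engine's actual API):

```python
import numpy as np

def query(database, target, weights, n=5):
    """Return the n entries whose feature vectors are closest to `target`
    under a weighted Euclidean distance.  Zero-weight features drop out of
    the comparison entirely, as in the Retrieval Engine described above."""
    names = list(database)
    feats = np.array([database[k] for k in names], dtype=float)
    w = np.asarray(weights, dtype=float)
    # weighted squared differences, summed per entry, then square-rooted
    dists = np.sqrt((((feats - target) ** 2) * w).sum(axis=1))
    order = np.argsort(dists)[:n]
    return [(names[i], dists[i]) for i in order]
```

Setting a feature's weight to zero projects that dimension away, which is also what makes the flattened 2D and 1D browser views discussed later possible.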
== Interactive template-based similarity search (database) ==
The template-based similarity search allows a user to select a template in the Template Library and find similar templates and raw sound files from the Audio Database. Figure X displays the interface for doing so. Selecting a sound displays its associated feature values on the sliders at the bottom right. Corresponding sliders at the top right determine the weight of each feature for the similarity search. Search results are displayed either in the form of a list, or as a 3D visualization. In the list view, the closest matches are displayed, sorted by distance from the query. The 3D visualization displays a specified number of matches, with the feature vector of each match mapped to 3D space either through principal component analysis or by visualizing three specifically selected features. Any of the matches found can be played, loaded to the Template Library for transformation and synthesis, or used as a query for the next search. By browsing through the matches with these views, the user can find templates similar to the query that can be used to create a richer and more varied sound scene.
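The PCA mapping used by the 3D visualization can be sketched in a few lines of plain numpy: center the matches' feature vectors, take the eigenvectors of their covariance matrix, and project onto the strongest axes. This is a minimal illustration of the technique, not the interface's actual code; the function name is hypothetical.

```python
import numpy as np

def project_matches(feature_vectors, dims=3):
    """Map N-dimensional feature vectors of the search results down to
    `dims` coordinates via principal component analysis."""
    X = np.asarray(feature_vectors, dtype=float)
    X = X - X.mean(axis=0)                          # center the point cloud
    cov = np.cov(X, rowvar=False)                   # feature covariance matrix
    vals, vecs = np.linalg.eigh(cov)                # eigh: ascending eigenvalues
    top = vecs[:, np.argsort(vals)[::-1][:dims]]    # strongest axes first
    return X @ top                                  # coordinates for plotting
```

The alternative view, visualizing three specifically selected features, is simply column selection on the same matrix, so both views share one data path.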
The feature extraction is performed on audio clips synthesized from the template, in a type-sensitive manner that takes existing transformations into account.  For example, a deterministic template with 0.5x frequency warping and 15x time stretching is synthesized as such and in its entirety, whereas potentially longer templates such as loops, backgrounds, and timelines may provide only portions, as directed by the user. The spectral centroid, root-mean-square (RMS) power, spectral flux, and autocorrelation strength are computed for 512-sample frames of audio beginning at every 256th sample in the sound. The mean centroid, the fraction of windows with RMS power below the average (a.k.a. low power), the mean flux, and the mean autocorrelation strength are calculated from these window-level features. Finally, the variances of the centroid, power, flux, and autocorrelation strength are calculated, normalized, and summed to produce a fifth feature, completing our descriptor. This data is then associated with the template and added to the Audio Database, making it available to future queries.
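The five-feature descriptor above can be sketched as follows. This is an illustrative reimplementation of the scheme as described, not the actual TAPESTREA feature extractor; the exact normalization of the variance term is an assumption (here each variance is scaled by the squared mean of its feature before summing).

```python
import numpy as np

FRAME, HOP = 512, 256  # 512-sample frames, one every 256 samples

def descriptor(audio):
    """Five-feature descriptor: mean centroid, low-power fraction,
    mean flux, mean autocorrelation strength, summed normalized variances."""
    frames = [audio[i:i + FRAME] for i in range(0, len(audio) - FRAME + 1, HOP)]
    cents, powers, fluxes, acs = [], [], [], []
    prev = None
    for f in frames:
        mag = np.abs(np.fft.rfft(f * np.hanning(FRAME)))
        bins = np.arange(len(mag))
        cents.append((mag * bins).sum() / (mag.sum() + 1e-12))   # centroid (bins)
        powers.append(np.sqrt((f ** 2).mean()))                  # RMS power
        fluxes.append(0.0 if prev is None else np.sum((mag - prev) ** 2))
        prev = mag
        ac = np.correlate(f, f, mode="full")[FRAME - 1:]         # autocorrelation
        acs.append(ac[1:].max() / (ac[0] + 1e-12))               # best non-zero lag
    cents, powers, fluxes, acs = map(np.array, (cents, powers, fluxes, acs))
    low_power = (powers < powers.mean()).mean()   # fraction of low-power windows
    var_sum = sum(np.var(x) / (np.abs(x).mean() ** 2 + 1e-12)    # assumed
                  for x in (cents, powers, fluxes, acs))         # normalization
    return np.array([cents.mean(), low_power, fluxes.mean(), acs.mean(), var_sum])
```

A strongly periodic sound (e.g. a children's voice template) yields a high autocorrelation-strength value, while broadband noise does not, which is one reason the query in Section 4 ranks voiced templates together.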
== Querying and marking recorded sounds for template discovery ==
During the analysis phase, audio information retrieval can also aid the discovery of desired templates in a given recording. This feature enables the user to find useful audio in files that may be too long to conveniently search by hand. A user can select an interesting time and frequency region in the recording, and send it to the Retrieval Engine as a query for a similarity search. The recording is broken into a series of overlapping regions (with a hop size of 256 samples) of the same length as the query region. The window-level features are calculated only once, and most of the region-level features described above can be derived from their values in the previous overlapping region in a constant number of operations, making the search efficient despite the large number of overlapping regions.
Once feature values have been extracted for each region in the recording, the Euclidean distance of each region's feature vector to the feature vector describing the query region is calculated. All of the regions characterized by distances below a user-specified threshold parameter are deemed similar to the query region, but since there is substantial redundancy between overlapping regions, we return only the regions whose distances are local minima. In essence, the system locates potential templates that are similar to the user's selection and may thus be interesting to the user. The query need not be restricted to a raw time-frequency region, but can also be extended to include extracted deterministic and transient event templates.
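The threshold-plus-local-minima selection above can be sketched directly. This assumes the per-region distances have already been computed (e.g. by the incremental feature scheme described earlier); the function name and signature are illustrative.

```python
import numpy as np

def find_similar_regions(distances, threshold, hop=256):
    """Return start samples of regions whose distance to the query is below
    `threshold` AND is a local minimum of the distance curve, suppressing
    the redundancy between neighboring overlapping regions."""
    d = np.asarray(distances, dtype=float)
    hits = []
    for i in range(len(d)):
        left = d[i - 1] if i > 0 else np.inf       # treat edges as walls
        right = d[i + 1] if i < len(d) - 1 else np.inf
        if d[i] < threshold and d[i] <= left and d[i] <= right:
            hits.append(i * hop)                   # region index -> start sample
    return hits
```

Without the local-minimum condition, a single good match would be reported dozens of times, once for every overlapping region that straddles it.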
= IV. Applications and Example =
Figure X shows the results of a query on the database. The template child2 was chosen as the query, and so it comes back as the most similar sound to itself. The next five query results are also similar-sounding templates of children's voices, followed by several strongly-pitched ringing sounds. The distances in feature space are presented to the right of the results.
In Figure X2 (Mutants United), the sonic browser is shown after resetting the weights of the flux and centroid features. In the first pane, the Audio Database is visualized in the three feature dimensions of spectral centroid, low power, and flux. In the second pane, the weight of flux has been reduced to zero, effectively flattening the space to two dimensions. In the final pane, the centroid dimension's weight has also been reduced to zero, further compressing the space into a single dimension (low power). By querying on only one or two features at a time, the user can get more focused results, while using more features allows for broader definitions of similarity.
= V. Conclusion and Future Work =
The most obvious next step toward improving the relevance of our query results is the incorporation of more features, particularly features capturing information about the time-domain dynamics of our sounds. At the moment, our only two primarily time-domain-oriented features (low power and feature variance) do not consider any long-scale periodicity or order-dependent qualities present in sounds, which could be quite relevant. A finer-grained set of spectral features might also provide better results.
One advantage to using a relatively small number of features, however, is that it remains practical for a user to manually set the weights to give each feature when ranking similarity. When using large numbers of features, many of which may be strongly correlated with each other, choosing the relative importance of each feature becomes both increasingly important (lest one set of features dominate the distance calculation) and increasingly difficult and time-consuming. An approach to dealing with this problem is to use machine learning algorithms such as [schapire rankboost] to try to infer how a user would rank the similarity of a set of sounds to one another from feature data. However, such an approach requires a substantial amount of human-labeled data, and presupposes that a general mapping is possible.
A disadvantage to query-by-example systems is that, by definition, they require the user to have an example on hand of the sort of sound they wish to find. It may be possible to circumvent this problem using the feature-based synthesis techniques we are currently developing, as described in [hoffman, these proceedings fingers crossed]. Using feature-based synthesis, we can synthesize audio matching arbitrary feature values specified by the user in real time. Once the sound generated in this way begins to resemble what the user is looking for, the features used to specify that sound can be passed as a query to the database, which should return a sound resembling what the user had in mind.
Finally, we hope to use machine learning techniques to better predict appropriate source separation parameters for TAPESTREA based on the feature values we extract for each sound. Since we have built, and continue to build, a large library of extracted template files recording good separation parameters for a wide variety of sounds, it may be possible to leverage the features we extract to classify sounds into broad categories for which certain separation parameters are most appropriate. This in turn would allow us to do sinusoidal analyses of batches of sound files, and extract new features based on statistics about the deterministic, transient, and stochastic background components of those sounds as automatically separated. Thus, in the long term, TAPESTREA template separation can itself enhance audio information retrieval.
= VI. References =
Bregman, A. Auditory Scene Analysis. MIT Press, Cambridge, 1990.
    * NOT what taps does.
Chafe, C., B. Mont-Reynaud, and L. Rush. (1982).
"Towards an intelligent editor of digital audio: Recognition of musical constructs,"
Computer Music Journal 6(1).
    * 1st paper to deal with transcription without dealing with identifying notes
Dubnov, S., Z. Bar-Joseph, R. El-Yaniv, D. Lischinski, and M. Werman. (2002).
"Synthesizing sound textures through wavelet tree learning,"
IEEE Computer Graphics and Applications 22(4).
Fernstrom, M. and E. Brazil. (2001)."Sonic Browsing: an auditory tool for multimedia asset
management," In Proceedings of the International Conference on Auditory Display.
    * deals more with musical structures and notes
Foote, J. (1999). "An overview of audio information retrieval," ACM Multimedia Systems 7(1): 2-10.
Jolliffe, I. T. (1986).  Principal Component Analysis. Springer-Verlag, New York.
Kang, H. and B. Shneiderman. (2000).  "Visualization Methods for Personal Photo
Collections: Browsing and Searching in the PhotoFinder," In Proceedings of the
International Conference on Multimedia and Expo, New York, IEEE.
Kashino, K. and H. Tanaka. (1993).
"A sound source separation system with the ability of automatic tone modeling,"
International Computer Music Conference.
    * uses of clustering techniques for identifying sound sources
Misra, A., P. Cook, and G. Wang. (2006). "Musical Tapestry: Re-composing Natural Sounds,"
International Computer Music Conference.  Submitted.
Misra, A., P. Cook, and G. Wang. (2006). "TAPESTREA: Sound Scene Modeling By Example,"
International Conference on Digital Audio Effects.  Submitted.
Serra, X. (1989). "A System for Sound Analysis Transformation
Synthesis based on a Deterministic plus Stochastic Decomposition,"
PhD thesis, Stanford University.
Shneiderman, B.  (1998).  Designing the User Interface: Strategies for Effective Human-
Computer Interaction. Addison-Wesley, 3rd edition.
Tzanetakis, G. and P. Cook. (2000). "MARSYAS: A Framework for Audio Analysis,"
Organized Sound, Cambridge University Press 4(3).
Tzanetakis, G. and P. Cook. (2001). "MARSYAS3D: A prototype audio browser-editor using a large scale immersive visual and audio display,"
In Proceedings of the International Conference on Auditory Display.
Wang, G. and P. Cook. (2003). "ChucK: A Concurrent, On-the-fly Audio Programming Language,"
International Computer Music Conference.

Latest revision as of 16:11, 25 April 2006
