Time and pitch scaling in audio processing
by Olli Parviainen
This is a translation of a good article on changing the key of music. It is not a fully literal translation; I have translated only the parts that seemed essential, and skimmed over the Phase Vocoder algorithm since its efficiency is low.
The original article can be found at the following link.
Source code with an actual Visual Studio implementation of the SOLA algorithm is available at the following link.
________________________________________________________________________
Introduction
Anyone who's used a nowadays obsolete tape recorder or a vinyl disc player is likely familiar with the effects of playing a recording back at a different speed than it was originally recorded at: playing the recording at double speed halves the playtime, while causing the side effect of the pitch jumping up by an octave, which amusingly makes human voices sound like cartoon characters. Similarly, playing a record back at a slower rate lengthens its duration and lowers the pitch by the same ratio.
In the old days of analog audio recording technology, this modified playback speed/time/pitch effect was easy to produce by applying an incorrect playback speed setting. In digital signal processing, the same effect can be achieved by interpolating between sound samples, referred to as resampling.
While resampling affects both the playback time and the pitch in the same ratio, it's occasionally desirable to control the sound duration or pitch independently of the other, to modify the pitch or time scale without the sometimes-undesirable cartoon-hero voice side effect. Luckily, signal processing techniques exist for modifying the time and pitch scales independently of each other; these techniques are known as time/pitch scaling, time/pitch shifting or time stretching.
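As an aside, resampling itself is easy to sketch. Below is a minimal linear-interpolation resampler - an illustration assumed for this translation, not code from the original article: reading the input at a step of 'ratio' samples changes duration and pitch by the same factor.

/* Minimal linear-interpolation resampler (an illustrative sketch, not
 * code from the original article). ratio > 1.0 shortens the sound and
 * raises the pitch, ratio < 1.0 does the opposite. 16-bit mono samples
 * are assumed. */
#include <stddef.h>

size_t resample(const short *input, size_t in_len,
                short *output, size_t out_max, double ratio)
{
    size_t n = 0;
    double pos = 0.0;

    while (n < out_max && (size_t)pos + 1 < in_len) {
        size_t i = (size_t)pos;
        double frac = pos - (double)i;
        /* Interpolate linearly between the two nearest input samples. */
        output[n++] = (short)((1.0 - frac) * input[i] + frac * input[i + 1]);
        pos += ratio;
    }
    return n;   /* number of output samples produced */
}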
________________________________________________________________________
Applications
Time scaling that reduces the music playback speed or tempo eases practicing musical instruments and dancing. Slowing down recorded speech helps with transcribing recorded notes and practicing spoken languages, while blind people with a developed hearing sense may prefer accelerated speech when listening to recorded audio books, to save time. Further, video format conversions between different frame rates require fine-adjusting the sound duration: for example, converting the 24 frames/second cinema format into the 25 frames/second TV format reduces the movie playback time by 4%, implying a need for a corresponding sound track duration adjustment.
Similarly, altering the sound pitch or key to match the singer's voice eases karaoke singing and singing practice. When practicing an instrument along with music, adjusting the key of the background music can be easier than retuning the instrument for each song separately. Pitch correction is nowadays a standard method in record production for correcting imperfections in singing, handily compensating for possibly insufficient musical talent in the current celebrity idol industry.
Finally, people with questionable intentions may wish to alter the pitch of their voice to conceal their true identity.
________________________________________________________________________
접근 방법
음성, 음악과 같은 일반적인 음원에 대한 time/pitch scaling을 하는 방법에는 time-domain 에서 진행하는 것과 frequency-domain 에서 진행하는 두가지 방법이 있습니다.
time-domain에서 진행하는 경우에는 sampling 되어 있는 data에 대해서 이 글에서 소개될 SOLA 알고리즘을 직접적으로 적용하면서 진행됩니다. time-domain-processing의 경우에는 구현이 비교적 직설적이고, 파일의 포맷이 변하지 않는다는 장점이 있습니다. 단점으로는 평균적으로 15% 이상의 시간 수정을 가할 때 원하지 않는 리버브 효과가 생긴다는 점입니다.
다른 방식인 frequency-domain에서 진행하는 방법은 샘플링된 음원을 짧은 진동/진폭 요소로 전환하여 크기를 변환시키는 방법입니다. Phase Vocoder가 이런 방식을 이용합니다.
이 방식의 장점은 소리의 변환을 좀 더 정밀하게 진행함으로써 더 좋은 결과물을 만들어 낼 수 있다는 것입니다. 하지만 이러한 변환법은 코딩을 하고 실행을 함에 있어서 time-domain 방식에 비해 매우 부하가 크다는 단점이 있습니다. (CPU 속도와 가용 메모리의 크기가 중요합니다.)
Approaches
When considering general-purpose time and/or pitch scaling methods that work with any kind of sampled sound, be it speech, music or other sound, two fundamental approaches exist for such effects: doing the processing either in the time domain or in the frequency domain.
Time-domain methods operate directly on the sampled data; the SOLA algorithm introduced in this article is one of them. The advantage of time-domain processing is that the implementation is rather straightforward, as the sound data is manipulated in the same sample format in which it's recorded and played back. The disadvantage of time-domain processing is a tendency to produce reverberating artefacts that become more obvious with larger time modifications, i.e. when the scaled time differs from the original sound by roughly 15% or more.
The other approach is frequency-domain methods, which convert the sampled sound into short-time frequency/amplitude components and do the scaling on that frequency information; the Phase Vocoder featured below is an example of such a method. The advantage of frequency-domain processing is that it enables more sophisticated sound modifications that better resemble the eventual human hearing experience, as human hearing is fundamentally based on frequency/amplitude sensing: we don't have magnetic voice coils with digital sampling circuitry in our ears. Instead, the ear's hearing sense consists of a great number of tiny frequency sensors, each sensitive to a certain frequency alone, with our brains combining these together into a continuous frequency/amplitude hearing sensation - well, at least until our once-perfect ears are ruined by high-volume activities such as rock music, motor sports or fostering junior family members.
However, despite their apparent potency and elegance, frequency-domain methods are also vastly more complicated to program and compute than time-domain methods, potentially making them impractical in applications constrained by computational resources, namely CPU speed and memory.
________________________________________________________________________
SOLA
SOLA, short for Synchronous-OverLap-Add, works by cutting the sound data into shortish sequences of tens to hundreds of milliseconds each, and then joining these sequences back together with a suitable amount of sound data either skipped or partially repeated in between, to achieve a shorter or longer playback time than the original. Algorithms based on the same basic approach include TDHS (Time-Domain Harmonic Scaling), WSOLA and PSOLA, with differences in the details of how the overlapping offset discussed below is sought.
To prevent too-obvious discontinuities in the sound at the locations where two adjacent sequences are joined together, the sequences are partially overlapped so that the sound amplitude gradually slides from one sequence to the other during the overlapping period - hence the "OverLap-Add" part of the name SOLA.
At its simplest, a SOLA implementation could use a common sequence length and pick the sequences at even intervals from the original sound data. Wish to make the sound duration 10% shorter? Use a processing sequence duration of e.g. 100 milliseconds (plus the overlapping duration), pick sequences from the original sound data at intervals of 110 milliseconds, join these sequences together by overlapping, and there you are. In a similar way, to achieve a 10% longer sound duration, choose the 100-millisecond sequences from the original sound at intervals of 90 milliseconds, and so forth.
However, a practical SOLA implementation isn't quite that easy. Choosing the processing sequences without sufficient attention to their contents will result in sequences that don't fit together well, consequently producing inferior sound quality with humming and clicking artefacts due to too-large discontinuities in the sound, despite the gradual overlapping approach.
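In code form, the naive even-interval scheduling described above boils down to a single division: picking sequences of seq_ms milliseconds every seq_ms / duration_ratio milliseconds of input makes the output duration_ratio times the original duration (a sketch; the function name is made up for illustration).

double pick_interval_ms(double seq_ms, double duration_ratio)
{
    /* duration_ratio 0.9 (10% shorter): 100 ms sequences every ~111 ms */
    /* duration_ratio 1.1 (10% longer):  100 ms sequences every ~91 ms  */
    return seq_ms / duration_ratio;
}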
________________________________________________________________________
Implementation considerations
For satisfactory sound quality, a SOLA implementation needs to choose the processing sequences so that the waveforms of adjacent sequences to be joined are as much alike as possible over the overlapping period.
In practice, the sound stream is processed one sequence at a time, always choosing the next processing sequence from a suitable window of offset candidates around the theoretical next sequence offset, so that the two sequences match together as well as possible. A good way of choosing the best-matching sequences is to calculate a cross-correlation function between the end of the previous sequence and the window of possible new sequence candidates; the offset at which the waveforms have the highest similarity also gives the highest cross-correlation value. The sequences are then joined together by overlapping, producing a new continuous sound stream with a different duration than the original sound.
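A cross-correlation seek along these lines might look as follows - a minimal sketch with assumed OVERLAP and SEEK_WINDOW lengths, and without the precalculation speed-up described in the source code section below:

/* Slide the candidate sequence across the seek window and keep the
 * offset with the highest cross-correlation against the tail of the
 * previous sequence. OVERLAP and SEEK_WINDOW are assumed tuning values
 * (~12 ms and ~23 ms at 44100 Hz). */
#include <stddef.h>

#define OVERLAP      512
#define SEEK_WINDOW  1024

size_t seek_best_overlap(const short *prev_tail, const short *candidates)
{
    double best_corr = -1e30;
    size_t best_off = 0;

    for (size_t off = 0; off < SEEK_WINDOW; off++) {
        double corr = 0.0;
        for (size_t i = 0; i < OVERLAP; i++) {
            corr += (double)prev_tail[i] * candidates[off + i];
        }
        if (corr > best_corr) {   /* highest correlation = best waveform match */
            best_corr = corr;
            best_off = off;
        }
    }
    return best_off;
}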
The overall SOLA algorithm is illustrated in the figure at the side. Notice that the time axis scales shown in the figure are arbitrary, the time units not necessarily corresponding to any real time units. During the algorithm's execution, the original sound is cut into processing sequences of suitable duration, each new sequence chosen at a suitable interval after the previous one so that the desired time scaling ratio is achieved.
In Figure 1, the first processing sequence begins at a time offset of 0, the sequence duration being 7 time units, including the two-unit-wide overlapping areas at the beginning and end of the sequence. Each further sequence is chosen at nominal intervals of 9 units, resulting in a time scaling ratio of (7-2)/9 ≈ 0.56, meaning a roughly 44% reduction in duration compared to the original sound.
However, instead of taking the new sequence directly at the nominally calculated interval, the actual beginning of a sequence is selected from a window around the nominal offset (in the case of the "New Sequence" in Figure 1, between offsets 8..10), so that the new and previous sequences match together as well as possible once overlapped. The sequences are then overlap-added together to produce a continuous resulting sound. Notice the resulting time compression between the original and resulting sound waveform shapes shown in Figure 1.
The cross-correlation function works well for evaluating the similarity of the sequence waveforms and is easy to implement. Other similarity measures have also been proposed, though: one suggested approach is to match the sequence edges to the beats of the sound, which can reduce the reverberating artefact, assuming the beats are there and can be detected with good consistency. Another alternative is to evaluate frequency-domain spectral similarity instead of waveform similarity when finding the best-matching overlapping offset.
SOLA requires only basic add/multiply arithmetic operations, so the algorithm can be implemented with integer or fixed-point arithmetic if desired, in cases where floating-point support is not available or is undesirable for performance reasons.
________________________________________________________________________
Pitch scaling with SOLA
SOLA as such produces a time scaling effect; however, combining the SOLA algorithm with resampling, which modifies both the time and pitch scales in the same ratio, also allows producing a pitch shift effect at little extra cost.
The pitch shifting effect implemented by combining SOLA + resampling works as follows:
- Use resampling to increase or decrease the sound pitch by the desired amount. Because resampling modifies both the sound duration and pitch in the same ratio, the sound duration becomes different from the original in the process.
- Then use the SOLA algorithm to adjust the resampled sound duration back to the original duration. The result has the same duration as the original, but with modified pitch.
Any combination of resampling and SOLA time scaling ratios can be applied to achieve arbitrary pitch scaling ratios, or even arbitrary simultaneous time and pitch scaling when desired; a sketch of the two-step combination is given at the end of this section.
For example, to increase the pitch by a ratio of two (i.e. an octave), first resample the sound down to half of its original duration, which at the same time doubles the pitch, then apply SOLA to expand the duration by the same factor of two, resulting in the original duration but an octave higher pitch.
In a similar fashion, to achieve a lower pitch, first resample the sound to a longer duration and correspondingly lower pitch, then apply SOLA with the matching time scaling ratio to reduce the duration back to the original, producing a sound with the original duration but a lower pitch.
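A minimal sketch of this two-step combination follows, reusing the resample() sketch from the introduction and assuming a hypothetical sola_ratio() helper - a variant of the sola() routine shown later that takes the time-scale ratio as a parameter (output duration = input duration / time_scale). The names and buffer sizes are assumptions, not from the original article.

#include <math.h>
#include <stddef.h>

/* Prototypes for the sketches defined elsewhere in this article, plus
 * the hypothetical parameterised variant of sola(). */
size_t resample(const short *in, size_t in_len, short *out,
                size_t out_max, double ratio);
size_t sola_ratio(const short *in, size_t in_len, short *out,
                  double time_scale);

/* 'scratch' must hold up to 2 * in_len samples (enough for shifts down
 * to one octave); 'output' must hold roughly in_len samples. */
size_t pitch_shift(const short *input, size_t in_len,
                   short *output, short *scratch, double semitones)
{
    double ratio = pow(2.0, semitones / 12.0);   /* +12 semitones = 1 octave up */

    /* Step 1: resample. Pitch rises by 'ratio'; duration shrinks by the
     * same factor as a side effect. */
    size_t tmp_len = resample(input, in_len, scratch, 2 * in_len, ratio);

    /* Step 2: SOLA-stretch the duration back by the same factor,
     * keeping the modified pitch. */
    return sola_ratio(scratch, tmp_len, output, 1.0 / ratio);
}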
________________________________________________________________________
Sound artefacts and parameters
A typical sound artefact produced by SOLA is an echoing or reverberating sound, which is insignificant at small time scale changes but becomes more obvious with larger-scale time slowdown.
To reduce the reverberating artefact, the processing sequence duration is ideally chosen to be relatively close to the fundamental period (the inverse of the fundamental frequency) of the processed sound - for example, if there's a certain frequency component or a constant beat in the sound, it can be used as a reference for selecting the processing sequence duration.
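For illustration, assuming the fundamental frequency f0 is known or has been estimated, the sequence length could be derived as a whole number of waveform cycles (a hypothetical helper, not from the article):

#include <stddef.h>

size_t sequence_length(double f0_hz, int cycles, double sample_rate)
{
    /* e.g. f0 = 110 Hz, cycles = 10, 44100 Hz -> ~4009 samples (~91 ms) */
    return (size_t)(cycles * sample_rate / f0_hz + 0.5);
}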
The width of the offset seek window also affects the sound quality. The wider the window, the higher the chance of finding a good match for the overlapping sequences. However, if the window is set too wide, the result can sound unstable, as if it were drifting around.
For the overlapping duration, it's usually sufficient that it's a fraction of the whole processing sequence length. A simple linear amplitude slide over the overlapping period usually works quite as well as more sophisticated window functions.
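Such a linear slide is only a few lines of code. The sketch below corresponds to the overlap() routine described in the source code section, here named overlap_add() with an assumed overlap length:

#define OVERLAP 512   /* assumed overlapping period, ~12 ms at 44100 Hz */

void overlap_add(short *out, const short *prev, const short *next)
{
    int i;
    for (i = 0; i < OVERLAP; i++) {
        /* The amplitude weight slides linearly from 'prev' to 'next'. */
        out[i] = (short)((prev[i] * (OVERLAP - i) + next[i] * i) / OVERLAP);
    }
}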
The SOLA algorithm's processing latency depends on the processing sequence duration, the width of the overlap seek window and the width of the overlapping period, typically summing up to some 100 milliseconds. This processing latency usually isn't an issue when processing sound files offline, but if the application has strict requirements for sound synchronisation, or processes or plays back sound in real time, the latency may require attention.
An example of such a latency-critical application is one that records sound with a microphone, processes the recorded sound in real time with SOLA for a pitch shift effect, and then immediately plays the processed sound back through a loudspeaker. In such a case it's good to account for the additional approximately 100-millisecond delay that SOLA processing introduces on top of the basic latencies of the audio sampling and playback operations.
________________________________________________________________________
Example source code
Listing 1 shows a simple example implementation of a SOLA routine. The function sola() processes the sampled sound data given in the array 'input' and stores the resulting time-scaled sound into the 'output' array, assuming both arrays are sufficiently large for the desired processing. The algorithm assumes mono sound sampled at a 44100 Hz sampling rate with a 16-bit integer sample format.
The time scaling ratio is given by #define TIME_SCALE; the default value of 0.85 corresponds to a (1.0 - 0.85) * 100% = 15% longer duration than the original. The user can modify this value to achieve other time scaling ratios.
The subroutine overlap() overlaps the two given input sequences by linearly sliding the amplitude weight from one sequence to the other over the overlapping period.
The subroutine seek_best_overlap() seeks the optimal overlapping offset by cross-correlating the sound sequences 'input_prev' and 'input_new' over their overlapping periods, testing various offsets for 'input_prev' within the range [0..SEEK_WINDOW]. A precalculation step, multiplying the 'input_prev' overlapping region by the overlap sliding coefficients, is done prior to the actual cross-correlation to speed it up remarkably. Notice that scaling of the cross-correlation result by the vector lengths is omitted, because in this case it's sufficient to find the largest correlation value without caring about its absolute magnitude. The cross-correlation calculation uses floating-point arithmetic for the sake of simplicity; while it could also be done with pure integer or fixed-point arithmetic, additional scaling of the sub-results would then be necessary to ensure that integer overflows can't occur.
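The listing itself is not reproduced in this excerpt, so here is a minimal sketch of how sola() might tie together the overlap_add() and seek_best_overlap() helpers sketched earlier. This is an illustrative reconstruction under assumed parameter values, not the original Listing 1, and it omits the precalculation optimisation described above.

#include <stddef.h>

#define SEQUENCE     4096    /* processing sequence, ~93 ms at 44100 Hz */
#define OVERLAP      512     /* overlapping period, ~12 ms              */
#define SEEK_WINDOW  1024    /* offset seek window, ~23 ms              */
#define TIME_SCALE   0.85    /* < 1.0 lengthens the sound, > 1.0 shortens it */

size_t seek_best_overlap(const short *prev_tail, const short *candidates);
void   overlap_add(short *out, const short *prev, const short *next);

/* Time-scale 'input' (in_len mono 16-bit samples at 44100 Hz) into
 * 'output', which must hold roughly in_len / TIME_SCALE samples.
 * Returns the number of output samples produced. */
size_t sola(const short *input, size_t in_len, short *output)
{
    /* Each iteration emits SEQUENCE - OVERLAP new output samples while
     * consuming 'step' input samples, giving the desired tempo ratio. */
    double step = (SEQUENCE - OVERLAP) * TIME_SCALE;
    double in_pos;
    size_t out_pos, i;

    if (in_len < SEQUENCE + SEEK_WINDOW)
        return 0;                         /* too short to process */

    /* Prime the output with the first sequence verbatim. */
    for (i = 0; i < SEQUENCE; i++)
        output[i] = input[i];
    out_pos = SEQUENCE;
    in_pos = step;

    while ((size_t)in_pos + SEQUENCE + SEEK_WINDOW <= in_len) {
        const short *seq = input + (size_t)in_pos;
        /* Align the new sequence against the current output tail... */
        size_t off = seek_best_overlap(output + out_pos - OVERLAP, seq);

        /* ...cross-fade it in over the overlapping period... */
        overlap_add(output + out_pos - OVERLAP,
                    output + out_pos - OVERLAP, seq + off);

        /* ...and append the remainder of the sequence. */
        for (i = 0; i < SEQUENCE - OVERLAP; i++)
            output[out_pos + i] = seq[off + OVERLAP + i];

        out_pos += SEQUENCE - OVERLAP;
        in_pos += step;
    }
    return out_pos;
}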
The original article continues with a section on the Phase Vocoder, the frequency-domain method mentioned above.
The Phase Vocoder appears to aim at overlapping the most similar waveform shapes by taking a Fourier transform of the waveform.
Since what the Everysing app needs right now is a real-time pitch adjustment algorithm, I will translate that part later.
________________________________________________________________________
Conclusions
The SOLA approach allows a relatively simple and efficient time/pitch scaling implementation that works well for small-scale time/pitch changes and can be applied to all kinds of sound, from human speech to fully produced music. The sound quality of SOLA becomes disadvantageous with larger-scale tempo/pitch modifications, however: scaling ratios exceeding a couple of tens of percent introduce a reverberating artefact into the sound.
Processing CD-quality stereo sound in real time with SOLA requires around a 100 MHz processor with a standard C language implementation, meaning that real-time SOLA processing is well feasible on modern PCs and even on portable devices such as PDAs and smart phones. The cross-correlation algorithm used in SOLA is also well suited to CPU-specific SIMD optimizations, allowing a reduction in CPU usage by a factor of two to three.
SOLA can be implemented using integer or fixed-point arithmetic if necessary, making it feasible also on processors without built-in floating-point support.
Compared to the SOLA algorithm, the greatest advantage of using a Phase Vocoder for time/pitch scaling is that it doesn't produce a similar reverberating artefact even with large-scale time/pitch modifications. However, the Phase Vocoder produces other artefacts, most notably a dull sound caused by loss of phase synchronisation and an uneven frequency response.
The Phase Vocoder is best suited to processing independent sound channel information with a single or a few instruments, human speech or singing, but may not be at its best for post-processing high-quality music that has already been produced and mixed into its final recording format. (Translator's note: this is also why it is a poor fit for Everysing.)
The Phase Vocoder is computationally heavy, involving complex-number calculations and thus practically enforcing the use of floating-point arithmetic. A high-quality Phase Vocoder implementation requires approximately an order of magnitude more computational power than the SOLA algorithm. When applying only the time scaling effect, which allows omitting some of the most time-consuming stages, and using hand-tuned processor-level SIMD optimizations, processing CD-quality stereo sound in real time requires around a 500 MHz Pentium-III-level processor.
Single-Instruction-Multiple-Data, or SIMD, means processor-level instructions that perform the same operation on multiple data items with a single instruction. Modern processors support SIMD instruction extensions as a means of optimizing the performance of computationally intensive routines. Examples of SIMD instruction sets are the MMX and SSE extensions of the Intel x86 processors used in PCs and newer Macs, and AltiVec of the PowerPC used in older Macs.
An example of a SIMD instruction is multiplying four parallel numbers by another four numbers, producing four parallel multiplication results, thus offering theoretically four times the performance of four consecutive multiply instructions. However, due to the overhead typically required for arranging data into suitable batches for SIMD, quite that high a speed-up isn't achievable in practice, yet a speed-up by a factor of two to three is often possible.
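For instance, with the SSE intrinsics mentioned above, such a four-way multiplication looks as follows (an illustrative fragment, not from the original article):

#include <xmmintrin.h>   /* SSE intrinsics */

/* Multiply four float pairs with a single SSE instruction (_mm_mul_ps). */
void mul4(const float *a, const float *b, float *result)
{
    __m128 va = _mm_loadu_ps(a);                 /* load a[0..3] */
    __m128 vb = _mm_loadu_ps(b);                 /* load b[0..3] */
    _mm_storeu_ps(result, _mm_mul_ps(va, vb));   /* result[i] = a[i] * b[i] */
}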
SIMD optimizations are suitable for routines doing multiple calculation operations at the same time, such as vector and matrix calculations. However, finding parallel operations in routines written in a usual programming language isn't trivial, and compilers still can't produce good automatic SIMD code; the programmer has to help the compiler by hand-writing the routines using SIMD-compatible syntax extensions.
More information
For an efficient, open-source implementation of SOLA-based time/pitch scaling routines, see the SoundTouch library at http://www.surina.net/soundtouch.
For further information on time/pitch scaling based on the Phase Vocoder, the paper "Improved Phase Vocoder Time-Scale Modification of Audio" (J. Laroche and M. Dolson, IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 3, May 1999) presents the Phase Vocoder concept with further references and a good discussion of the phase coherence issue.