Wednesday, October 24, 2018

Pursuing clarity through openness, part 7: from structure to sound

Digital sound is a complex subject, with many variations on the theme. Most use pulse-code modulation (PCM) in some fashion. PCM represents a sound wave as a sequence of numbers, each one a measurement of the wave's amplitude, the instantaneous pressure, taken at regular intervals, as when a microphone captures sound from the environment. The frequency of those measurements, the sample rate, is most commonly 44,100 per second. That can only represent frequencies below half the sample rate (the Nyquist limit, here 22,050 Hz), too low to capture the ultrasonic nuances of the squeaks made by mice and bats but more than adequate for human voices.

The way those measurements are encoded varies, with 16-bit signed integers being a common format made popular by its use on CDs. Apple uses that format for its microphones and speakers, but internally Apple's OSes use 32-bit floats to pass data around, waiting until the very last step to convert those to integers for output. So, at least for Apple devices, synthesizing sound means generating a sequence of 32-bit float 'samples' quickly enough to stay at least a little ahead of the output, so that it never runs out.
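For the curious, that final conversion step amounts to something like the following minimal sketch; the clamping and rounding choices here are my own assumptions, not a description of Apple's internal code:

    #include <math.h>
    #include <stdint.h>

    /* Convert one 32-bit float sample, nominally in [-1.0, 1.0],
       to a 16-bit signed integer, clamping to avoid overflow. */
    static int16_t float_to_int16(float sample)
    {
        if (sample > 1.0f)  sample = 1.0f;
        if (sample < -1.0f) sample = -1.0f;
        return (int16_t)lrintf(sample * 32767.0f);
    }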

However, if you're working in an interactive context, where the delay between a user action and the sound it generates needs to be imperceptibly small, you don't want to get too far ahead of the output. If, for example, the length of a note depends on the time between touch-down and touch-up events, it cannot be entirely precomputed. Even if it could be, overlapping multiple notes of the same pitch would still raise the issue of combining them into the stream of sample values with proper phase alignment, to avoid unpredictable interference phenomena.

The most straightforward approach is to generate the stream of samples to be fed to the output on the fly, just in time. Apple's older Core Audio framework provides callbacks for this purpose; you supply a block/closure (or a pointer to a function) to an audio unit, which then calls that code whenever it's ready for more data. This is a low-latency process. The challenge is to craft code that will return in time, so you don't leave the output without data. You stand a better chance of doing this in C than in Swift, but even in C you need to be careful not to try to do too much in a callback; anything that can be precomputed should be.
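As a minimal sketch of the C version (error handling omitted, and outputUnit standing in for an output audio unit you've already created and initialized):

    #include <AudioToolbox/AudioToolbox.h>

    /* The function Core Audio will call whenever the output needs
       more data; a possible body for it appears further below. */
    static OSStatus renderCallback(void *inRefCon,
                                   AudioUnitRenderActionFlags *ioActionFlags,
                                   const AudioTimeStamp *inTimeStamp,
                                   UInt32 inBusNumber,
                                   UInt32 inNumberFrames,
                                   AudioBufferList *ioData);

    static void installRenderCallback(AudioUnit outputUnit)
    {
        AURenderCallbackStruct cb = { .inputProc = renderCallback,
                                      .inputProcRefCon = NULL };
        AudioUnitSetProperty(outputUnit,
                             kAudioUnitProperty_SetRenderCallback,
                             kAudioUnitScope_Input,
                             0,              /* output bus */
                             &cb,
                             sizeof(cb));
    }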

AVAudioNodes provide callbacks, but it's not clear to me whether these are appropriate for an interactive context. AVAudioNodes also wrap AUAudioUnits, which have callbacks of their own. I think it should be possible to make use of these and avoid the need to set up an audio unit in C, but I already had that code so I haven't yet put this theory to the test.

At this point you'll be staring, figuratively if not literally, at an empty callback routine. At least in the case of Core Audio audio units, you will have been passed a pointer to a structure (an AudioBufferList) containing an array of buffer descriptions, each of which holds a pointer to the memory for one buffer. Assuming only a single channel, you'll take the data pointer from the [0] cell of that array and begin writing sample values into the buffer. When done, you return a status code; the buffers themselves are filled in place. Anything requiring continuity from one such call to the next, such as phase alignment, will need to have broader scope than the callback routine (in the case of a named function, static variables might work).
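Here's a minimal sketch of what the body of such a callback might look like, generating a bare 440 Hz sine tone; the static phase variable, the fixed frequency, and the sample rate are placeholder choices of mine, not anything dictated by the API:

    #include <AudioToolbox/AudioToolbox.h>
    #include <math.h>

    static OSStatus renderCallback(void *inRefCon,
                                   AudioUnitRenderActionFlags *ioActionFlags,
                                   const AudioTimeStamp *inTimeStamp,
                                   UInt32 inBusNumber,
                                   UInt32 inNumberFrames,
                                   AudioBufferList *ioData)
    {
        static double phase = 0.0;    /* persists from one call to the next */
        const double freq = 440.0;    /* Hz */
        const double sampleRate = 44100.0;
        const double increment = 2.0 * M_PI * freq / sampleRate;

        Float32 *out = (Float32 *)ioData->mBuffers[0].mData;
        for (UInt32 i = 0; i < inNumberFrames; i++) {
            out[i] = (Float32)(0.5 * sin(phase));  /* half amplitude */
            phase += increment;
            if (phase >= 2.0 * M_PI)
                phase -= 2.0 * M_PI;
        }
        return noErr;   /* a status code; the buffers were filled in place */
    }

Calling sin() on every sample is exactly the kind of per-callback work worth avoiding, which is where the lookup table described next comes in.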

The samples we'll be adding to the buffer mentioned above will be derived from sine values. Because sine values take some effort (CPU cycles) to compute, they should be precomputed, so we'll want a table (an array) of them from which particular values can be extracted using a simple index. The table should represent one complete cycle of a sine wave, from sin(0.0) up to but not including sin(2π), assuming you're using a sine function that takes radians as an argument.
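Building that table is simple enough; a minimal sketch (the names SINE_TABLE_SIZE and sineTable are mine):

    #include <math.h>

    #define SINE_TABLE_SIZE 65536   /* 2^16; a power of 2, for reasons below */

    static double sineTable[SINE_TABLE_SIZE];

    /* One complete cycle: entry i holds sin(2*pi * i / SINE_TABLE_SIZE),
       so the final entry stops just short of sin(2*pi). */
    static void initSineTable(void)
    {
        for (int i = 0; i < SINE_TABLE_SIZE; i++)
            sineTable[i] = sin(2.0 * M_PI * (double)i / SINE_TABLE_SIZE);
    }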

Frequency, perceived as pitch, can be expressed in terms of the rate of traversal across the range of indices into this sine table, measured in indices per sample. Walking off the end of that range and wrapping back to the beginning equates to completing a single cycle of the sine wave. When using this approach, the size of the sine table (the number of elements it contains) becomes an important component of the calculations. The larger the number of elements in the table, the more precise the values provided by table lookup will be, but I consider 65536 (2^16) a practical upper limit. Any table size that is a power of 2 allows moving back from the end to the beginning in an efficient manner (a single bitwise AND).
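Dropped into the callback body sketched earlier, in place of the sin() call, that traversal might look like this, with phase now measured in table indices rather than radians (again a sketch; the conversion from cycles per second to indices per sample is plain dimensional analysis):

    /* indices/sample = cycles/second * indices/cycle / samples/second */
    const double increment = freq * (double)SINE_TABLE_SIZE / sampleRate;

    for (UInt32 i = 0; i < inNumberFrames; i++) {
        /* Truncate the phase to an integer index; the bitwise AND wraps
           it into range, which works because SINE_TABLE_SIZE is a power
           of 2. */
        int index = (int)phase & (SINE_TABLE_SIZE - 1);
        out[i] = (Float32)(0.5 * sineTable[index]);
        phase += increment;
        if (phase >= SINE_TABLE_SIZE)
            phase -= SINE_TABLE_SIZE;
    }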

I've managed the business of tracking the phase alignment of a synthetic sound wave as a progression through a sine table in several different ways. Originally I used radians/second to represent frequency, which meant that the phase alignment for the current sample had to be multiplied by sine_table_size/two_pi and the result truncated to produce an integer index. Then I realized I might just as well use sine-table-indices/second, which only needs to be checked for being out of range and adjusted by sine_table_size when it is. At some point it occurred to me that this approach, if combined with a sine_table_size equal to the sample rate, would eliminate the need for converting from cycles per second to sine-table-indices per second, since the two would be numerically equal, requiring only a type conversion from double to int just before the table lookup.

(Note: the above paragraph and the two that follow are a bit confused, but that's consistent, since so was I while stumbling through this transition in algorithmic approaches. In any case, straightening this out is more than I can do at this moment. Just remember that dimensional analysis is your friend! 14Jan2019)

When I began to experiment with complex tones, I also began to use the current phase alignment of the fundamental to generate the phase alignments of any harmonics to be included: multiplying it by their harmonic numbers, applying modulo division by sine_table_size to reduce each product to the proper range, and using the result of that modulo division as an index into the sine table. At some point in 2017 or early 2018, it occurred to me that this same approach would work with harmonic structures composed of multiple harmonic series, if I were to track the phase of the Highest Common Fundamental (HCF) and multiply that by the harmonic numbers of the members of the structure, as determined by their positions within the harmonic series defined by the HCF (their HCF harmonic numbers).

Then, finally, I realized that, if I were to keep frequency in cycles per second, I could eliminate the modulo divisions, since modulo 1.0 is equivalent to simply dropping the non-fractional part of a floating-point value. The tradeoff in doing this is the need to reintroduce multiplication by sine_table_size, followed by truncation, to produce integer indices for table lookup.
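To make that final approach concrete, here's a minimal sketch as I understand it, reusing sineTable and SINE_TABLE_SIZE from above; the three-member harmonic structure and its amplitudes are invented purely for illustration:

    #include <math.h>

    /* Phase of the highest common fundamental, in cycles,
       kept in the range [0.0, 1.0). */
    static double hcfPhase = 0.0;

    /* A toy harmonic structure: HCF harmonic numbers and amplitudes. */
    static const int    harmonic[]  = { 2, 3, 5 };
    static const double amplitude[] = { 0.5, 0.3, 0.2 };

    static double nextSample(double hcfFreq, double sampleRate)
    {
        double sum = 0.0;
        for (int h = 0; h < 3; h++) {
            /* Multiply the fundamental's phase by the harmonic number,
               then drop the integer part: the floating-point equivalent
               of the modulo division, leaving a value in [0.0, 1.0). */
            double p = hcfPhase * harmonic[h];
            p -= floor(p);
            /* Scale by the table size and truncate to get an index. */
            int index = (int)(p * SINE_TABLE_SIZE);
            sum += amplitude[h] * sineTable[index];
        }
        hcfPhase += hcfFreq / sampleRate;   /* cycles per sample */
        if (hcfPhase >= 1.0)
            hcfPhase -= 1.0;
        return sum;
    }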

By this time I'd lost track of the distinctions between these various approaches and began to combine elements of them inappropriately, leading to confusion. It was vaguely analogous to random mutation in genetics: occasionally it works out, but mostly you get no discernible difference, or monsters.

So, now you at least know that there can be several ways to specify a frequency, and probably also have an inkling that the choice among them affects how the phase is tracked and converted into sine values, which then feed into the generation of sample values.

That's probably enough for one day.

Part 8: integer-ratio intervals against a logarithmic scale
