Karol Łacina’s Website

Making Crazy Sounds With aplay

Exactly two years ago, as my IRC logs tell, a friend of mine shared with me this cool little trick:

find / | aplay

When you run this command on a Linux system with alsa-utils, you will hear a soothing buzz that can be described as a low-pass filtered electric razor. That buzz comes from the output of find / being interpreted as audio. Similarly, if you pipe cat /dev/random into aplay, you will get low-pass filtered white noise coming out of your speakers.

Having just played around with music synthesis in Shadertoy (case 1, 2 and 3), I was eager to try out this new tool and finally do my music synthesis locally on my computer instead of in a web app. And so I ported my Shadertoy shaders to desktop using it, then had a great idea for a new kind of DAW and in the end got bored of playing around with sound. But now I am back to guide you through this fun adventure of making your own bleeps and bloops with aplay and C++, explaining everything along the way as best as I can, assuming you have basic knowledge of C++.

Using aplay

To harness the dead-simple yet sufficient interface of aplay we’ll first have to figure out how exactly it interprets its input. Well, actually we don’t have to because aplay already does that for us by printing this nice little message at the beginning:

Playing raw data 'stdin' : Unsigned 8 bit, Rate 8000 Hz, Mono

This tells us that it’s reading in raw LPCM audio data, which is the standard for representing sound digitally, with a single channel and some default LPCM parameters. LPCM is a method for encoding an analog signal digitally by measuring (sampling) it at regular time intervals according to the sample rate, then mapping these real number samples from -1 to 1 to signed/unsigned n-bit integers and saving them in consecutive order with the given endianness. If the audio signal is stereo, then the LPCM-encoded samples of the left and right channels are interleaved.
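
To make the mapping step a bit more concrete, here is a minimal sketch that quantizes one real-valued sample to the unsigned 8 bit format aplay uses by default. The name to_u8 and the exact rounding are my own assumptions for illustration, other encoders may scale slightly differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: map a sample from [-1, 1] to an unsigned 8-bit value.
// -1 maps to 0, 1 maps to 255 and silence (0) lands in the middle at 128.
std::uint8_t to_u8(double x) {
  assert(!std::isnan(x));
  if(x < -1.0) x = -1.0; // Clip out-of-range input instead of wrapping around.
  if(x > 1.0) x = 1.0;
  return static_cast<std::uint8_t>(std::lround((x + 1.0) / 2.0 * 255.0));
}
```

Writing the returned bytes one after another to stdout would already be valid input for a plain aplay call.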

That raw audio data is passed on to your kernel by aplay, processed by your sound server (if you have one) and sent through the kernel again to your speakers, which use that signal to control the position of their diaphragms, whose movement - the derivative of the signal - creates sound and that’s basically all you need to know. We don’t have to delve into a deep analysis of how the human ear perceives the resulting sound waves because we just have to make it sound good, that’s all.

As a side note, the default sample rate of 8000Hz and the Nyquist-Shannon sampling theorem explain why the two examples at the beginning of this article covered only the lower frequency range of human hearing.
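
If you want to put numbers on that, the sketch below folds a frequency back into the range a given sample rate can represent. The helper aliased_freq is a hypothetical name of mine and it assumes a pure sine sampled without any anti-aliasing filter in front.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: the frequency you end up with when a sine of
// frequency `freq` is sampled at `rate` Hz.
double aliased_freq(double freq, double rate) {
  assert(freq >= 0.0 && rate > 0.0);
  // Sampling can't tell `freq` apart from `freq + rate`.
  double folded = std::fmod(freq, rate);
  // Anything above half the sample rate is mirrored back below it.
  return folded > rate / 2.0 ? rate - folded : folded;
}
```

At aplay’s default 8000Hz nothing above 4000Hz survives, for example aliased_freq(5000.0, 8000.0) gives 3000Hz.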

Let’s get to work and write a program that will take an audio signal - the sample_at function - and write it to stdout encoded with LPCM so that aplay can play it, shall we? We’ll be looking only at the signed 16 bit little endian 44100Hz stereo format, which we can enable in aplay with -f cd, because it’s the format that all CDs by design and the majority of WAVs use.

#include <climits>
#include <cmath>
#include <cstddef>
#include <iostream>

int const sample_rate = 44100;

struct Sample {
  // This is actually two samples - one for each channel.
  double left, right;
};
Sample sample_at(double time) {
  return {sin(420.0 * 2.0 * M_PI * time), 0.0};
}

double clamp(double x, double a, double b) {
  return x < a ? a : (x > b ? b : x);
}

void write_short(unsigned short x, std::ostream &stream) {
  stream << (char) x << (char) (x >> 8);
}

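void write_channel(double x, std::ostream &stream) {
  // Convert to a signed short first: converting a negative double
  // straight to an unsigned type is undefined behavior.
  write_short((short) (SHRT_MAX * clamp(x, -1.0, 1.0)), stream);
}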

void write_sample(Sample sample, std::ostream &stream) {
  write_channel(sample.left, stream);
  write_channel(sample.right, stream);
}

int main() {
  for(size_t sample_idx = 0; ; sample_idx++) {
    double time = sample_idx / (double) sample_rate;
    auto sample = sample_at(time);
    write_sample(sample, std::cout);
  }
}

Simple as that. Compile it with g++ code.cpp -o code, play it with ./code | aplay -f cd and you should hear a 420Hz sine wave in your left speaker. Be sure to use at least double-precision floats so that you don’t have to worry about regular floats causing nasty distortion that increases over time somewhere in your code, which can happen even in a simple sine wave. If you want to, you can try generalizing that code to all sample rates, number of bits, signed/unsigned, little endian/big endian and mono/stereo as an exercise, which isn’t as complicated as it sounds.

Now we can move on to synthesizing sound…

Waveforms

Waveforms are the mass of the sculpture that is music. Without them there would just be no sound. Thus, most synthesis methods start with some waveform. The basic waveforms used in music synthesis are the sine, pulse, sawtooth and triangle waves and white noise.

Sine waves

Sine waves are the purest waveform of them all since they lack any overtones, which is why they sound clear to the human ear and are useful for additive synthesis, but they are computationally expensive and you may have to precompute them at startup if your application gets too complex. They are the solution to the differential equation x’’ = -kx, which models the movement of all freely oscillating things in the universe, so you will find them used in a vast majority of synthesized acoustic instruments.

double sine(double time, double freq) {
  return sin(freq * 2.0 * M_PI * time);
}

Pulse waves

Pulse waves alternate between 1 and -1 instantaneously and regularly and their sound can be described as buzzy and resonant. The pulse wave is the landmark of chiptune music since it’s the easiest waveform to generate with digital components. The fraction of the time in a cycle when a pulse wave is at 1 is called its duty cycle, and square waves are the special case with a duty cycle of 50%. You can modulate the duty cycle with a low frequency sine wave for a more interesting and pleasant timbre.

double pulse(double time, double freq, double duty) {
  double phase = round(sample_rate * time) / round(sample_rate / freq);
  return fmod(phase, 1.0) < duty ? 1.0 : -1.0;
}

double square(double time, double freq) {
  return pulse(time, freq, 0.5);
}

Be careful not to pass negative times to the functions using fmod here because fmod returns negative remainders for negative dividends, so the results may not be what you expect.
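
If you do need negative times, one possible workaround is to wrap the phase with floor instead of fmod. This is only a sketch and wrap_phase is a hypothetical helper that the code in this article does not use.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: the fractional part of x, which lands in [0, 1)
// (up to floating-point rounding for tiny negative x), unlike
// fmod(x, 1.0), which keeps the sign of x.
double wrap_phase(double x) {
  assert(std::isfinite(x));
  return x - std::floor(x);
}
```

For example, fmod(-0.25, 1.0) is -0.25, while wrap_phase(-0.25) is 0.75, which is the phase you would expect a quarter of a cycle before time 0.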

Sawtooth waves

Sawtooth waves look just like the teeth of a saw, that is in one cycle they rise from 0 to 1, drop down to -1 and rise again to 0, and can be described as sounding warm and fuzzy. They contain all the harmonics and for that reason they are the most commonly used waveform for subtractive synthesis.

double sawtooth(double time, double freq) {
  double phase = round(sample_rate * time) / round(sample_rate / freq);
  return 2.0 * fmod(phase + 0.5, 1.0) - 1.0;
}

Low frequency pulse and sawtooth waves will make a popping noise because of the sudden jumps between 1 and -1. These sudden jumps are also the reason why we calculate the phase by dividing time by the period both rounded to a whole number of samples instead of just multiplying time by frequency as we did with the sine wave. If we didn’t do that, the waveform’s frequency content would end up aliased and distorted.
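
The price of that rounding is a slight detuning, which the following sketch makes visible. The helper quantized_freq is a hypothetical name of mine, it just repeats the period calculation from pulse and sawtooth.

```cpp
#include <cassert>
#include <cmath>

int const sample_rate = 44100;

// Hypothetical helper: the frequency actually produced once the period
// is rounded to a whole number of samples, as in pulse() and sawtooth().
double quantized_freq(double freq) {
  assert(freq > 0.0 && freq <= sample_rate);
  double period = std::round(sample_rate / freq); // Period in samples.
  return sample_rate / period;
}
```

A requested 440Hz has a period of about 100.23 samples, which is rounded to 100, so the wave you hear is at 441Hz. The error should grow with frequency because there are fewer samples per period to round to.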

Triangle waves

Triangle waves look just like a row of triangles with their bases at -1 and sound brighter than sine waves. Harmonics-wise they are a softer version of square waves. I don’t know any specific application of them besides being a good-sounding base for bass instruments and a cheap alternative to sine waves in analog circuits when you low-pass filter them.

double triangle(double time, double freq) {
  return fabs(4.0 * fmod(freq * time + 0.75, 1.0) - 2.0) - 1.0;
}

White noise

White noise is a random signal that equally covers the whole range of available frequencies and the only aperiodic waveform shown here. Therefore, by filtering frequencies out we can transform it to any other noise spectrum we want. It is often used for synthesizing percussion and measuring the frequency response of digital signal processing (DSP) algorithms.

#include <cstdlib>

double noise() {
  return 2.0 * rand() / RAND_MAX - 1.0;
}

Another consequence of the Nyquist-Shannon sampling theorem mentioned above is that any periodic waveform containing frequencies higher than 22.05kHz (half of our 44100Hz sample rate) won’t sound as intended due to aliasing.

Filters

Next up are frequency filters. They basically modify the amplitudes and phases of a signal’s frequencies: low-pass ones let frequencies below a cutoff frequency through and attenuate those above it, and high-pass ones do the opposite. Filters are the foundation of subtractive synthesis, which takes in initial waveforms and removes certain harmonics from them. The opposite is additive synthesis, which combines basic waveforms together to create new and more harmonically-rich sounds. Together with FM synthesis, these are the three major approaches to music synthesis, which may have different “classic” sounds associated with them but sometimes are just different means to the same end. But I digress.

Filters can be divided into two categories: causal and non-causal. The distinction between those is that causal filters do not require knowledge of future inputs, which is what we need in our real-time synthesizer, whereas non-causal ones do. Causal filters can in turn be divided into finite impulse response (FIR) and infinite impulse response (IIR) filters. This formal mathematical distinction doesn’t really make a difference in our audio application, but what does make a difference is that FIR filters are rarely seen implemented in electronic circuits because of IIR filters’ greater simplicity, both analog and digital, and as such IIR filters may more closely resemble the sound of analog synths. Thus, we are interested only in causal IIR filters.
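
For contrast, here is what a very basic causal FIR filter can look like: a moving average over the last few samples. It is not used anywhere else in this article, and the MovingAverage name and its ring buffer layout are just assumptions for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A moving average over the last `length` samples: a basic FIR low-pass.
// Its response to a single impulse ends after exactly `length` samples.
struct MovingAverage {
  std::vector<double> history; // Ring buffer of the most recent samples.
  std::size_t next = 0;        // Slot that holds the oldest sample.
  double sum = 0.0;

  explicit MovingAverage(std::size_t length): history(length, 0.0) {
    assert(length > 0);
  }

  double feed(double sample) {
    sum += sample - history[next]; // Swap the oldest sample for the new one.
    history[next] = sample;
    next = (next + 1) % history.size();
    return sum / history.size();
  }
};
```

Feed it a single impulse followed by silence and the output is exactly zero again after length samples, whereas the output of the IIR filters below only decays towards zero.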

There are many different IIR filter designs out there and the theory behind them is too big to fit into this article, so I’m just gonna show you the code for the two simplest filters that just get the job done. Below is something similar to a Butterworth filter that can be implemented using low-pass RC filters connected in series. It is just a digital translation of an already existing analog filter like almost all digital IIR filters out there. This is because of the vast mathematical knowledge on electronic (analog) filters collected before the appearance of accessible microprocessors.

#include <array>

template<size_t order>
struct Lowpass {
  double cutoff;
  std::array<double, order> state{}; // Zero-initialized so that we don't filter garbage.

  Lowpass(double cutoff): cutoff(cutoff) {}

  double feed(double sample) {
    double alpha = 2.0 * sin(M_PI * cutoff / sample_rate);
    for(size_t i = 0; i < order; i++) {
      sample = state[i] = alpha * sample + (1.0 - alpha) * state[i];
    }
    return sample;
  }
};

The order specifies the number of sub-filters (RC circuits) the input signal goes through; an order of 2 is usually enough. Note that this data structure, like all the other ones in this article, needs to be fed samples from your audio signal in a sequential manner.

Low-pass filters in analog synthesizers also have a resonance parameter that adds a feedback loop between the filter stages, which boosts frequencies around the cutoff point. We can implement it in a second-order filter as follows:

struct ResonantLowpass {
  double cutoff, resonance;
  double state1 = 0.0, state2 = 0.0;

  ResonantLowpass(double cutoff, double resonance): cutoff(cutoff), resonance(resonance) {}

  double feed(double sample) {
    double alpha = 2.0 * sin(M_PI * cutoff / sample_rate);
    double beta = resonance + resonance / (1.0 - alpha);
    state1 = alpha * (sample + beta * (state1 - state2)) + (1.0 - alpha) * state1;
    state2 = alpha * state1 + (1.0 - alpha) * state2;
    return state2;
  }
};

The resonance variable here is from 0 to 1 and if you set it at 1, the filter will start to self-oscillate. Designing high-pass filters is harder than designing low-pass ones but a simple trick to turn low- into high-pass filters is to take the low-pass filtered signal and subtract it from the original signal.
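
The subtraction trick could look roughly like this for a single filter stage. Highpass is a hypothetical name and the sketch reuses the alpha formula from Lowpass above, so treat it as an illustration rather than a properly designed high-pass filter.

```cpp
#include <cassert>
#include <cmath>

int const sample_rate = 44100;

// Hypothetical one-stage high-pass: the input minus its low-passed copy.
struct Highpass {
  double cutoff;
  double state = 0.0; // State of the internal one-stage low-pass filter.

  explicit Highpass(double cutoff): cutoff(cutoff) {
    assert(cutoff > 0.0 && cutoff < sample_rate / 2.0);
  }

  double feed(double sample) {
    double const pi = std::acos(-1.0);
    double alpha = 2.0 * std::sin(pi * cutoff / sample_rate);
    state = alpha * sample + (1.0 - alpha) * state; // Same update as in Lowpass.
    return sample - state; // What the low-pass removed is what we keep.
  }
};
```

A constant (0Hz) input should fade out completely, since the low-pass state catches up with it after a while.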

Envelopes

An envelope is a real function defined for a given duration of time that is used to articulate notes played on a synthesizer so that notes don’t just turn on and off making a popping sound. You can either multiply the envelope with the note’s signal, which will “contain” the audio signal’s amplitude in the envelope, or you can use the envelope to modulate the cutoff frequency of a low-pass filter, which will “contain” the signal in the time-frequency domain.

ADSR

This is the most common and possibly the oldest type of envelope, used in almost all analog synthesizers. From time 0 it rises from 0 to 1 in attack seconds (the attack stage), then it drops to the sustain level in decay seconds (decay), then it stays at that level until time duration (sustain) and at the end drops to 0 in release seconds (release). At all other times it stays at 0.

double adsr(double time, double attack, double decay, double sustain, double duration, double release) {
  if(time < 0.0) return 0.0;
  if(time < attack) return time / attack;
  if(time - attack < decay) return 1.0 - (time - attack) / decay * (1.0 - sustain);
  if(time < duration) return sustain;
  if(time - duration < release) return (1.0 - (time - duration) / release) * sustain;
  return 0.0;
}
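
As a usage sketch, this is how the envelope could be multiplied with a sine wave to get a simple note. The adsr function is repeated to keep the example self-contained, while enveloped_sine and its parameter values are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// The ADSR envelope from above, repeated to keep this example self-contained.
double adsr(double time, double attack, double decay, double sustain, double duration, double release) {
  if(time < 0.0) return 0.0;
  if(time < attack) return time / attack;
  if(time - attack < decay) return 1.0 - (time - attack) / decay * (1.0 - sustain);
  if(time < duration) return sustain;
  if(time - duration < release) return (1.0 - (time - duration) / release) * sustain;
  return 0.0;
}

// Hypothetical note: a sine wave whose amplitude follows the envelope.
// 10ms attack, 100ms decay, 50% sustain, held for 1s, 300ms release.
double enveloped_sine(double time, double freq) {
  assert(freq > 0.0);
  double const pi = std::acos(-1.0);
  return std::sin(freq * 2.0 * pi * time) * adsr(time, 0.01, 0.1, 0.5, 1.0, 0.3);
}
```

Note that time here is relative to the start of the note, so for a note that starts later you have to subtract its start time first.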

Exponential ADR

This envelope first rises from 0 to 1 (attack), then decays exponentially (decay) until it is released and drops back to 0 (release), which is similar to how the amplitude of a piano note and damped harmonic oscillators in general behave, at least when it comes to the “decays exponentially” part.

double exp_adr(double time, double attack, double duration, double release) {
  if(time < 0.0) return 0.0;
  if(time < attack) return time / attack;
  if(time < duration) return exp(-(time - attack));
  if(time - duration < release) return (1.0 - (time - duration) / release) * exp(-(duration - attack));
  return 0.0;
}

Effects

Effects are the final ingredient to make our instruments sound more lively and not so boring.

Overdrive/distortion

This effect is the simplest to achieve and you may have already accidentally experienced it earlier by having your output audio leave the -1 to 1 range, which our function for converting the audio to LPCM hard clips to. Similarly, we can clamp any other audio signal to a range smaller than the one it is contained in and get a distortion effect.
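
A minimal sketch of that clamping is shown below. The hard_clip function follows the description above, while soft_clip is a different, smoother technique (tanh waveshaping) that I am adding only for comparison. Both names and the drive parameter are assumptions of mine.

```cpp
#include <cassert>
#include <cmath>

// Hard clipping: clamp the signal to [-threshold, threshold].
// The lower the threshold, the more of the waveform gets flattened.
double hard_clip(double x, double threshold) {
  assert(threshold > 0.0);
  return x < -threshold ? -threshold : (x > threshold ? threshold : x);
}

// Soft clipping with tanh: not the clamp described above, but a smoother
// curve that approaches -1 and 1 without ever reaching them.
double soft_clip(double x, double drive) {
  assert(drive > 0.0);
  return std::tanh(drive * x);
}
```

You would apply either of them per sample, for example hard_clip(sine(time, 220.0), 0.5), possibly scaling the result back up afterwards.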

Limiting

Yet you usually don’t want your music to get distorted and sometimes, such as when playing live, there is no simple way to prevent it but to use dynamic range compression and specifically limiting. Here we have a state variable that smoothly follows the volume of our audio signal but always stays at or above 100% and by which the signal is divided, effectively limiting it to a volume of 100%. Since all the designs in this article are causal, we have to introduce a delay of attack seconds to the audio to allow us to smoothly transition into the loud parts. Likewise, we transition out of them in release time.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <deque>
#include <queue>
#include <utility>

struct Limiter {
  double attack, release;
  std::deque<std::pair<double, size_t>> peaks;
  size_t processedc = 0;
  double state = 1.0;
  std::queue<double> delay;

  Limiter(double attack, double release): attack(attack), release(release) {}

  double feed(double sample) {
    double const hold = 1.0 / 20.0; // The period of 20Hz, which is the lowest audible frequency.
    while(!peaks.empty() && processedc - peaks.front().second > (attack + hold) * sample_rate) {
      peaks.pop_front();
    }

    while(!peaks.empty() && peaks.back().first <= std::abs(sample)) {
      peaks.pop_back();
    }
    peaks.push_back({std::abs(sample), processedc});
    processedc++;

    double target = std::max(peaks.front().first, 1.0);
    double alpha = state < target ? attack : release - hold;
    if(alpha > 0.0) {
      // Be within 5% of the target in `alpha` time.
      alpha = pow(0.05, 1.0 / alpha / sample_rate);
      state = alpha * state + (1.0 - alpha) * target;
    } else {
      state = target;
    }

    delay.push(sample);
    size_t samplec = attack * sample_rate;
    while(samplec < delay.size() - 1) {
      delay.pop();
    }
    double result = samplec < delay.size() ? delay.front() : 0.0;
    return result / state;
  }
};

To determine the volume of the signal, we calculate the maximum amplitude of the last 50 milliseconds (+ the length of the attack stage) of the input signal using a monotonic queue (the “minimum queue” technique applied to maxima), which holds only the amplitude peaks that could still become the maximum at some point.
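
Stripped of all the audio-specific parts, the queue trick can be sketched as a standalone function. The name sliding_max and the sample-count window are assumptions of this example, the Limiter above measures its window in seconds instead.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// For every position i, the maximum of the last `window` values ending at i.
std::vector<double> sliding_max(std::vector<double> const &values, std::size_t window) {
  assert(window > 0);
  // (value, index) pairs with strictly decreasing values from front to back.
  std::deque<std::pair<double, std::size_t>> peaks;
  std::vector<double> result;
  for(std::size_t i = 0; i < values.size(); i++) {
    // Drop the front once it has left the window.
    while(!peaks.empty() && i - peaks.front().second >= window) {
      peaks.pop_front();
    }
    // Drop smaller or equal values at the back: the new value outlives
    // them, so they can never be the maximum again.
    while(!peaks.empty() && peaks.back().first <= values[i]) {
      peaks.pop_back();
    }
    peaks.push_back({values[i], i});
    result.push_back(peaks.front().first);
  }
  return result;
}
```

Every value is pushed and popped at most once, so the whole thing takes constant time per sample on average no matter how long the window is.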

Delay

Delay creates an echo-like effect. Here we remember the past duration seconds of audio, then take the oldest sample, low-pass filter it to make it less sharp, attenuate it and mix it into the current one.

#include <queue>

struct Delay {
  double duration, depth, cutoff;
  std::queue<double> tape;
  Lowpass<2> filter;

  Delay(double duration, double depth, double cutoff):
    duration(duration), depth(depth), cutoff(cutoff), filter(cutoff) {}

  double feed(double sample) {
    size_t samplec = duration * sample_rate;
    while(tape.size() > samplec) {
      tape.pop();
    }
    if(tape.size() == samplec) {
      filter.cutoff = cutoff;
      sample += filter.feed(tape.front()) * depth;
    }
    tape.push(sample);
    return sample;
  }
};

Tremolo

Tremolo modulates the amplitude of the sound creating a trembling effect and is quite simple to achieve. All you have to do is map a sine wave to the range of 1 - depth to 1 and multiply it with your audio.

double tremolo(double time, double freq, double depth) {
  return 1.0 - depth * (sine(time, freq) / 2.0 + 0.5);
}

Wah

This effect, also called “wah-wah” because of its similarity to that onomatopoeia, works by low-pass filtering notes with a high resonance and a cutoff frequency controlled by an envelope. If you look at the spectrograms of a real human “wah” and our digital one, then they will indeed be very much similar.
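
One way this could be wired up is sketched below, reusing the resonant low-pass from the filters section. The Wah struct, its low and high cutoff bounds and the geometric interpolation between them are all assumptions of mine, not a reference design.

```cpp
#include <cassert>
#include <cmath>

int const sample_rate = 44100;

// The second-order resonant low-pass from above, with its state zeroed.
struct ResonantLowpass {
  double cutoff, resonance;
  double state1 = 0.0, state2 = 0.0;

  ResonantLowpass(double cutoff, double resonance): cutoff(cutoff), resonance(resonance) {}

  double feed(double sample) {
    double const pi = std::acos(-1.0);
    double alpha = 2.0 * std::sin(pi * cutoff / sample_rate);
    double beta = resonance + resonance / (1.0 - alpha);
    state1 = alpha * (sample + beta * (state1 - state2)) + (1.0 - alpha) * state1;
    state2 = alpha * state1 + (1.0 - alpha) * state2;
    return state2;
  }
};

// Hypothetical wah: an envelope value in [0, 1] sweeps the cutoff
// between `low` and `high` hertz.
struct Wah {
  double low, high;
  ResonantLowpass filter;

  Wah(double low, double high, double resonance): low(low), high(high), filter(low, resonance) {
    assert(0.0 < low && low <= high);
  }

  // Interpolate geometrically, since we hear pitch roughly logarithmically.
  double cutoff_at(double envelope) const {
    return low * std::pow(high / low, envelope);
  }

  double feed(double sample, double envelope) {
    filter.cutoff = cutoff_at(envelope);
    return filter.feed(sample);
  }
};
```

The envelope argument could come from adsr or exp_adr evaluated at the time since the note started, so that each note opens the filter up and lets it close again.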

Chorus and flanging

Chorus and flanging work the same way by mixing the audio signal with a delayed copy of it, whose delay time is modulated by a sine wave. That requires us to smoothly interpolate between samples and the method we’ll use for that is linear interpolation. Although there exist better formulas for interpolation, they add a few extra samples of delay, which is enough to alter the frequency response in the case of flanging. This no extra delay constraint makes linear interpolation the best one to use here. Additionally, we feed some of the delay’s output back into itself to create some resonance for flanging.

#include <deque>

struct Chorus {
  double delay, depth, freq, mix, resonance;
  std::deque<double> tape;
  size_t processedc = 0;

  Chorus(double delay, double depth, double freq, double mix, double resonance):
    delay(delay), depth(depth), freq(freq), mix(mix), resonance(resonance), tape({0.0}) {}

  double feed(double sample) {
    size_t max_samplec = delay * sample_rate;
    while(max_samplec + 1 < tape.size() - 1) {
      tape.pop_back();
    }

    double mod = sine(processedc / (double) sample_rate, freq) / 2.0 + 0.5;
    double samplec = delay * (1.0 - depth * mod) * sample_rate;

    size_t whole = samplec;
    double frac = samplec - whole;
    double x = whole + 1 < tape.size() ? (1.0 - frac) * tape[whole] + frac * tape[whole + 1] : 0.0;

    tape.push_front(sample + resonance * x);
    sample = (1.0 - mix) * sample + mix * x;

    processedc++;
    return sample;
  }
};

Chorus makes the original audio feel more ambient or even sometimes dreamy and can be achieved with the code above by setting delay to 10ms or more, depth to around 10%, mix to 50% or less and resonance to 0.

Flanging creates a jet engine-like sweeping sound, especially when applied to white noise, because the delay is small enough that it creates many evenly-spaced troughs moving in the signal’s frequency domain. This type of interference is called a comb filter. For this effect, set delay to just a few milliseconds (e.g. 1ms), depth to almost 100% and mix to 50% or less. You can also put in a negative mix parameter to invert the troughs, amplifying them instead of attenuating them.

You can also get a vibrato-like effect with this by setting depth and mix to 100% and resonance to 0, although the resulting signal is also time-shifted, so it’s not a pure vibrato. This is the effect you can usually find in effect pedals.

Phasing

Phasing is similar to flanging in that both are sweeping comb filters. However, phasing is a lot dreamier and it uses multiple all-pass filters in series in place of the delay line, which makes the comb filter’s troughs irregularly spaced and constant in number. The number of all-pass filter stages divided by two is the number of troughs in the frequency domain.

An all-pass filter is a filter that maintains the input signal’s amplitude for all frequencies but unevenly shifts the phase of different frequencies. The Schroeder all-pass filter used here consists of a one sample-long delay with feedback and feedforward loops with equal gains. You can read more about them here and here.
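
To convince yourself that such a filter really leaves amplitudes alone, you can feed a single stage an impulse and add up the energy of what comes out. The sketch below does just that. The Allpass and impulse_energy names are hypothetical, and this is a quick sanity check rather than a proof.

```cpp
#include <cassert>
#include <cmath>

// One Schroeder all-pass stage with a one-sample delay and gain `alpha`.
struct Allpass {
  double alpha;
  double state = 0.0;

  explicit Allpass(double alpha): alpha(alpha) {
    assert(std::fabs(alpha) < 1.0); // Needed for the feedback loop to be stable.
  }

  double feed(double x) {
    double y = state;      // Output of the one-sample delay.
    state = x - alpha * y; // Feedback loop.
    return alpha * x + (1.0 - alpha * alpha) * y; // Feedforward loop.
  }
};

// Total energy of the first `count` samples of the impulse response.
double impulse_energy(Allpass filter, int count) {
  double energy = 0.0;
  for(int i = 0; i < count; i++) {
    double y = filter.feed(i == 0 ? 1.0 : 0.0);
    energy += y * y;
  }
  return energy;
}
```

The impulse goes in with an energy of 1 and, smeared over many samples, comes out with an energy of 1 again, regardless of the gain you pick.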

#include <array>

template<size_t order>
struct Phaser {
  double from_freq, to_freq, rate, mix, feedback;
  std::array<double, order> allpass_filters{}; // Zero-initialized filter states.
  double last_x = 0.0;
  size_t processedc = 0;

  Phaser(double from_freq, double to_freq, double rate, double mix, double feedback):
    from_freq(from_freq), to_freq(to_freq), rate(rate), mix(mix), feedback(feedback) {}

  double feed(double sample) {
    double alpha = sine(processedc / (double) sample_rate, rate) / 2.0 + 0.5;
    alpha = (1.0 - alpha) * from_freq + alpha * to_freq;
    alpha = alpha * 4.0 / sample_rate - 1.0;

    double x = sample + feedback * last_x;
    for(double &state: allpass_filters) {
      double y = state;
      state = x - alpha * y;
      x = alpha * x + (1.0 - alpha * alpha) * y;
    }
    last_x = x;

    sample = (1.0 - mix) * sample + mix * x;
    processedc++;
    return sample;
  }
};

The combs should generally oscillate between the frequencies from_freq and to_freq. Setting from_freq and to_freq to a range a few hundred hertz wide centered around 500Hz, rate to below 1Hz and mix to 50% sounded good when I was testing this effect.

Reverb

This is probably the most complex effect of all shown here since it has many different possible designs and even more variations on those designs. Nevertheless, the two best reverberation techniques out there seem to be convolutions, which, as the name suggests, involve complex mathematical magic spells but can model any space, and feedback delay networks (FDN), which are far easier to understand and something we’d actually want to implement.

In essence, an FDN consists of a multi-channel delay line with a feedback loop that mixes the multiple channels together according to some mixing matrix. That matrix should be orthogonal so that energy in the system doesn’t decay out of our control. You can visualize that as a room with multiple reflective surfaces, in which sound waves repeatedly split and bounce off the many sound mirrors until they are no louder than natural noise.
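
Here is a small sketch of what “orthogonal” buys us, using a normalized Hadamard matrix as the mixing matrix. The hadamard_mix and energy helpers are hypothetical, and the mixing is done with butterflies (the fast Walsh-Hadamard transform) instead of a full matrix product.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Multiply the channels by the normalized Hadamard matrix.
std::vector<double> hadamard_mix(std::vector<double> x) {
  std::size_t n = x.size();
  assert(n > 0 && (n & (n - 1)) == 0); // The butterflies need a power-of-two size.
  for(std::size_t half = 1; half < n; half *= 2) {
    for(std::size_t i = 0; i < n; i += 2 * half) {
      for(std::size_t j = i; j < i + half; j++) {
        double a = x[j], b = x[j + half];
        x[j] = a + b;
        x[j + half] = a - b;
      }
    }
  }
  // Dividing by sqrt(n) is what makes the matrix orthogonal.
  double scale = 1.0 / std::sqrt((double) n);
  for(double &value: x) value *= scale;
  return x;
}

// Sum of squares of all the channels.
double energy(std::vector<double> const &x) {
  double sum = 0.0;
  for(double value: x) sum += value * value;
  return sum;
}
```

Mixing the channels {1, 2, 3, 4} gives {5, -1, -2, 0}, and both have an energy of 30. The mixing itself neither adds nor removes energy, so the only thing that makes the reverb decay is the feedback gain we choose.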

Now, things can only get more complicated from here on since there are surprisingly many elements that you can tweak here, including those that don’t even exist yet. What I meant by that is putting a variety of weird things into that simple FDN layout, which in turn can also be tweaked and tuned. The selection of mixing matrices is also wide, so that’s another thing you can play around with for hours on end.

I have decided to deal with none of that and implement only the bare skeleton of an FDN reverb both to keep the code simple giving you - the reader - room to expand upon it and to save my evaporating patience from completely disappearing now that I’ve worked on this article for probably a month now. The only superficial design choices I’ve made are using the Hadamard mixing matrix since it’s the most popular one and having 8 channels because that was the minimum that sounded good enough. It sounds good, except for instruments with short attack time such as drums, for which you may find a solution in the tutorial mentioned below.

// <bit> and std::countr_zero need C++20 (e.g. g++ -std=c++20).
#include <array>
#include <bit>
#include <cmath>
#include <cstdlib>
#include <queue>

template<size_t k>
constexpr auto hadamard() {
  size_t const n = 1ull << k;
  std::array<double, n * n> result;

  result[0] = 1.0;
  for(size_t i = 1; i < n; i *= 2) {
    for(size_t y = 0; y < i; y++) {
      for(size_t x = 0; x < i; x++) {
        result[n * y + (x + i)] = result[n * (y + i) + x] = result[n * y + x];
        result[n * (y + i) + (x + i)] = -result[n * y + x];
      }
    }
  }

  for(auto &cell: result) {
    cell /= sqrt(n);
  }
  return result;
}

template<size_t tapec = 8>
struct Reverb {
  double duration, wet;
  static_assert((tapec & (tapec - 1)) == 0);
  static constexpr auto matrix = hadamard<std::countr_zero(tapec)>();
  std::array<std::queue<double>, tapec> tapes;
  std::array<double, tapec> delays;
  double avg_delay;

  Reverb(double room_size, double duration, double wet):
    duration(duration), wet(wet)
  {
    avg_delay = 0.0;
    for(size_t i = 0; i < tapec; i++) {
      double x = (i + rand() / (double) RAND_MAX) / tapec;
      delays[i] = (1.0 - x) * room_size + x * 2.0 * room_size;
      avg_delay += delays[i];
    }
    avg_delay /= tapec;
  }

  double feed(double sample) {
    for(auto &tape: tapes) {
      tape.push(sample);
    }
    double feedback = pow(0.05, avg_delay / duration); // Go down to 5% volume in *duration* time.
    for(size_t i = 0; i < tapec; i++) {
      auto &tape = tapes[i];
      size_t samplec = delays[i] * sample_rate;
      while(tape.size() - 1 > samplec) {
        tape.pop();
      }
      if(tape.size() - 1 == samplec) {
        double x = tape.front();
        for(size_t j = 0; j < tapec; j++) {
          tapes[j].back() += feedback * matrix[j * tapec + i] * x;
        }
        sample += wet * x;
      }
    }
    return sample;
  }
};

With the code above, setting room size to 100ms, duration to just a few seconds and wet to 5% sounds good. If you want a better algorithm still, then you can go and read this really well-made tutorial, which shares my goal of keeping the design robust but does it a lot better. If that’s not enough technicalities for you, then there’s also this, have fun with that.

Reading in MIDI

Now that we have mastered (not really) the DSP algorithms essential for music synthesis, it’s time to put them to good use. But before that, let’s figure out how to give our sounds some structure so that we can test our “patches” better and get closer to our goal of making actual music. Here you can either venture into the world of algorithmic composition and translate your melodies and rhythms to code or just play your software synth with an external MIDI controller, which is without a doubt easier.

You may look around for a CLI utility on your system that is similar to aplay and can output MIDI coming from your controller and you will indeed find two such programs called “arecordmidi” and “amidi”. Unfortunately, neither of them supports writing to stdout, so their output can’t be simply piped to our program and we’d have to use other means such as Unix domain sockets, which would be too much of a hassle for a hack. In that case, we have no other choice but to use an external library in our code - the ALSA library.

In case you didn’t know, ALSA is the part of the Linux kernel responsible for sound, and so all programs, including ours, will have to listen to the rules it lays out for everyone. The basis of MIDI messaging in ALSA are input and output ports, which we can create and connect together, and they are the only way to send and receive MIDI events on Linux. Thus, ALSA by design forces your applications to be modular.

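#include <alsa/asoundlib.h>
#include <array>
#include <cassert>
#include <cstdio>
#include <fcntl.h>
#include <iostream>

snd_seq_t *seq_handle;
void init_midi() {
  // Keep the calls outside of assert so that they still run when
  // assertions are compiled out with -DNDEBUG.
  int err = snd_seq_open(&seq_handle, "default", SND_SEQ_OPEN_INPUT, 0);
  assert(err == 0);
  err = snd_seq_set_client_name(seq_handle, "Making Crazy Sounds With aplay");
  assert(err == 0);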
  int port = snd_seq_create_simple_port(
    seq_handle,
    "MIDI Input",
    SND_SEQ_PORT_CAP_WRITE | SND_SEQ_PORT_CAP_SUBS_WRITE,
    SND_SEQ_PORT_TYPE_MIDI_GENERIC | SND_SEQ_PORT_TYPE_SOFTWARE | SND_SEQ_PORT_TYPE_SYNTHESIZER
  );
  assert(port >= 0);
  int client = snd_seq_client_id(seq_handle);
  assert(client >= 0);
  std::cerr << "Listening for MIDI on " << client << ":" << port << std::endl;
  /*
   * Set the size of the pipe to aplay to the minimum
   * (one system page) so that there's no input lag.
   */
  fcntl(fileno(stdout), F_SETPIPE_SZ, 0);
}

struct MidiKey {
  double on_time, off_time, velocity;
};
std::array<MidiKey, 256> midi_keys;
void poll_midi(double time) {
  while(snd_seq_event_input_pending(seq_handle, 1) > 0) {
    snd_seq_event_t *event;
    snd_seq_event_input(seq_handle, &event);
    switch(event->type) {
    case SND_SEQ_EVENT_NOTEON: {
      auto &key = midi_keys[event->data.note.note];
      auto velocity = event->data.note.velocity;
      if(velocity > 0) {
        key.on_time = time;
        key.velocity = velocity / 127.0; // MIDI velocities range from 0 to 127.
      } else {
        // Some keyboards (including mine) do this for some reason.
        key.off_time = time;
      }
    } break;
    case SND_SEQ_EVENT_NOTEOFF:
      midi_keys[event->data.note.note].off_time = time;
      break;
    }
    snd_seq_free_event(event);
  }
}

In the code above, we register ourselves as a MIDI client in ALSA and create an input port that we can then poll for note on and note off events. You can find more information about other event types and the MIDI event struct in the docs. The main function also needs to have calls to those two functions above added to it:

int main() {
  init_midi();
  for(size_t sample_idx = 0; ; sample_idx++) {
    double time = sample_idx / (double) sample_rate;
    poll_midi(time);
    auto sample = sample_at(time);
    write_sample(sample, std::cout);
  }
}

These two pieces of code above only take in MIDI input and make it ready for use but don’t actually play any sound with it on their own, that’s up to you. However, to give you an idea of what you can do with that, I wrote the example code below, which basically simulates a crude handpan based off of this paper, which was unfortunately the only resource on this topic that I could find online.

double key_freq(int note) {
  return 440.0 * pow(2.0, (note - 69) / 12.0);
}

double envelope(double time, double onset, double peak, double t60) {
  // The formula for converting decibels to the ratios they represent
  peak = pow(10.0, peak / 20.0);
  if(time < 0.0) return 0.0;
  if(time < onset) return peak * time / onset;
  if(time - onset < t60) return peak * pow(pow(10.0, -60.0 / 20.0) / peak, (time - onset) / t60);
  return 0.0;
}

Sample sample_at(double time) {
  double result = 0.0;
  for(int i = 0; i < 256; i++) {
    auto key = midi_keys[i];
    double key_time = time - key.on_time;
    if(key.velocity > 0.0 && key_time < 4.1 + 0.372) {
      result += (sine(time,        key_freq(i)) * envelope(key_time, 0.023,  -2.3, 2.7) +
                 sine(time, 1.98 * key_freq(i)) * envelope(key_time, 0.092,  -7.5, 2.0) +
                 sine(time, 2.97 * key_freq(i)) * envelope(key_time, 0.372, -16.7, 4.1) * tremolo(key_time, 3.3, 0.8) +
                 sine(time, 3.89 * key_freq(i)) * envelope(key_time, 0.092, -21.0, 2.5)) * key.velocity / 4.0;
    }
  }
  return {result, result};
}

This code is an example of additive synthesis, which I’ve mentioned briefly before. For more interesting articles about synthesizing different instruments you can visit Sound on Sound and for technical stuff there’s musicdsp.org. There’s also a function key_freq here for converting MIDI note numbers to fundamental frequencies of a piano with A4 equal to 440Hz, which you’ll need for your own synth work too.
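
Going the other way can be handy too, for example when you want to know which key a measured frequency is closest to. The sketch below inverts the key_freq formula. The freq_to_key name is my own and it assumes the same equal temperament tuning with A4 at 440Hz.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical inverse of key_freq: the MIDI note nearest to `freq` hertz,
// again with A4 (note 69) tuned to 440Hz.
int freq_to_key(double freq) {
  assert(freq > 0.0);
  return (int) std::lround(69.0 + 12.0 * std::log2(freq / 440.0));
}
```

For instance, 261.63Hz lands on note 60 (middle C) and 880Hz on note 81, one octave above A4.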

From now on, you have to compile your code with an additional flag that links the ALSA library: -lasound. You may also need to install a development package for the library, e.g. on Debian and its derivatives you have to run:

apt install libasound2-dev

When running the code, remember to add the -B 1 option to aplay. The -B option sets the length of aplay’s internal audio buffer in microseconds, so a value of one requests the smallest buffer possible, which gets rid of the input lag. Every time you run the code you have to connect your MIDI controller to it using:

aconnect <input client:port> <code client:port>

But you can just do that in a while loop as I do. The code automatically tells you its port name. There is a way to automatically connect ports in code, if you’re interested and want to build something more sophisticated than what we’ve done already. To list all of the MIDI ports available on your system and find the one for your controller:

aconnect -l

And that’s all there is to it.

Writing WAVs 🔗

Okay, so now you have learned how to create music and play it live but what if you want to save your creation as an actual sound file that you can then share with other people and play without having to remember all the LPCM parameters? Well, we can use WAV for that purpose since it’s just raw audio with a header at the beginning, which is simple enough that we can implement it ourselves.

To do that, we first have to know what the WAV format looks like. For that we’ll have to look at WAV’s specification (or rather its parent format - RIFF) because the Wikipedia article about it does not even describe the format precisely, even though it rants about the specification being confusing. The header described in the standard, ignoring the possibility of music metadata, presents itself as follows:

Field                                                Bytes
“RIFF”                                               4
File size excluding “RIFF” and this field            4
“WAVE”                                               4
“fmt ”                                               4
16                                                   4
Audio data format - 1 for LPCM                       2
Number of channels - 2                               2
Sample rate                                          4
Size of one second of sound - 2 * 2 * sample rate    4
The above but divided by the sample rate - 2 * 2     2
Bits per sample for one channel - 16                 2
“data”                                               4
Audio data size                                      4

…and our audio data follows after that in the same form as we output it to aplay. I won’t explain the magic values in this table since that’s just some unimportant RIFF boilerplate, which you can read about on Wikipedia yourself. Here is the code that effectively exports your music to WAV based on the above information:

#include <vector>

// Writes a 32-bit integer as two 16-bit halves, lower half first
void write_int(unsigned int x, std::ostream &stream) {
  write_short(x, stream);
  write_short(x >> 16, stream);
}

void write_wav(std::vector<Sample> const& samples, std::ostream &stream) {
  stream << "RIFF";
  write_int(36 + 2 * 2 * samples.size(), stream);
  stream << "WAVE";

  stream << "fmt ";
  write_int(16, stream);
  write_short(1, stream);
  write_short(2, stream);
  write_int(sample_rate, stream);
  write_int(2 * 2 * sample_rate, stream);
  write_short(2 * 2, stream);
  write_short(16, stream);

  stream << "data";
  write_int(2 * 2 * samples.size(), stream);
  for(auto sample: samples) {
    write_sample(sample, stream);
  }
}
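If you’re wondering where the 36 comes from: the whole header is 44 bytes long and the file size field doesn’t count “RIFF” or itself, so 44 - 8 = 36 bytes of header are left in addition to the audio data. Below is a small standalone sketch that builds the same header in memory and reads the fields back so you can check them. The names append_le, read_le and wav_header are made up for this illustration and the sample rate of 44100 Hz in the usage note is just an assumption, use whatever your sample_rate is.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: append the lowest n bytes of value, least significant byte first
void append_le(std::string &bytes, std::uint32_t value, int n) {
  assert(n >= 1 && n <= 4);
  for(int i = 0; i < n; i++) {
    bytes.push_back(static_cast<char>((value >> (8 * i)) & 0xFF));
  }
}

// Hypothetical helper: read n bytes at offset back into an integer, least significant byte first
std::uint32_t read_le(std::string const& bytes, std::size_t offset, int n) {
  assert(n >= 1 && n <= 4 && offset + n <= bytes.size());
  std::uint32_t value = 0;
  for(int i = 0; i < n; i++) {
    value |= static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[offset + i])) << (8 * i);
  }
  return value;
}

// Builds the header from the table for 16-bit stereo LPCM;
// frames is the number of stereo samples, rate is the sample rate in Hz
std::string wav_header(std::uint32_t frames, std::uint32_t rate) {
  std::uint32_t const frame_size = 2 * 2;  // 2 channels * 2 bytes each
  std::uint32_t const data_size = frame_size * frames;
  std::string header;
  header += "RIFF";
  append_le(header, 36 + data_size, 4);     // everything after this field
  header += "WAVE";
  header += "fmt ";
  append_le(header, 16, 4);                 // size of the rest of the "fmt " chunk
  append_le(header, 1, 2);                  // 1 for LPCM
  append_le(header, 2, 2);                  // number of channels
  append_le(header, rate, 4);               // sample rate
  append_le(header, frame_size * rate, 4);  // size of one second of sound
  append_le(header, frame_size, 2);         // size of one stereo sample
  append_le(header, 16, 2);                 // bits per sample for one channel
  header += "data";
  append_le(header, data_size, 4);          // audio data size
  return header;
}
```

With one second of audio at an assumed 44100 Hz, wav_header(44100, 44100) is 44 bytes long, read_le(header, 4, 4) gives 176436 and read_le(header, 40, 4) gives 176400, the size of the audio data alone.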

Our main function would see some new additions to enable the code above:

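#include <csignal>
#include <fstream>

std::vector<Sample> samples;

// The handler only sets a flag; writing the file from inside a signal
// handler is not safe and the main loop would keep running afterwards
volatile std::sig_atomic_t interrupted = 0;
void on_interrupt(int) {
  interrupted = 1;
}

void save() {
  std::ofstream file("code.wav", std::ios::binary);
  write_wav(samples, file);
}

int main() {
  init_midi();
  std::signal(SIGINT, on_interrupt);
  // Ctrl+C stops aplay as well; ignore SIGPIPE so that we live long enough to save
  std::signal(SIGPIPE, SIG_IGN);
  // Stop when interrupted or when nothing reads our output anymore
  for(size_t sample_idx = 0; !interrupted && std::cout; sample_idx++) {
    double time = sample_idx / (double) sample_rate;
    poll_midi(time);
    auto sample = sample_at(time);
    write_sample(sample, std::cout);
    samples.push_back(sample);
  }
  save();
}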

This will save your performance to code.wav when it’s done or you interrupt the program while still allowing it to be played in real-time through aplay.

And that’s it for the article. Congratulations, you have now become a fully-fledged musician and your own music producer! So go out there, go play in venues, impress your friends with your brilliant skills and don’t forget to point them to this article so that they too can become musicians of their own. /s But most importantly, have fun.