Technically,
https://phyphox.org/xmas/ already provides most of the information: the general principle is explained and the .phyphox file is open source (see
https://phyphox.org/wiki/index.php/Phyphox_file_format).
I'll try to break it down a bit. The audio file plays four concurrent notes, one in each of the ranges [749…2249] Hz (x coordinate), [2249…3749] Hz (y coordinate), [3749…5249] Hz (color and repeat), and [5249…5624] Hz (checksum). Each range is divided into segments of width 48000/2048 = 23.4375 Hz, which gives 64 segments in each of the first three ranges and 16 segments in the checksum range. The FFT of the audio is mapped onto these ranges, and in each range the segment containing the maximum is located.
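To make the arithmetic concrete, here is a small Python sketch of the segment layout. It only uses the numbers given above; the names (RANGES, segment_count, segment_index) are made up for illustration, and counting segments from 1 starting at the lower edge of each range is my assumption, not something taken from the .phyphox file.

```python
# Width of one segment in Hz, as stated above: 48000 / 2048
SEGMENT_WIDTH = 48000 / 2048  # 23.4375

# Frequency ranges in Hz (lower edge, upper edge), as stated above
RANGES = {
    "x": (749, 2249),
    "y": (2249, 3749),
    "color_repeat": (3749, 5249),
    "checksum": (5249, 5624),
}


def segment_count(low, high):
    """Number of segments of SEGMENT_WIDTH that fit into [low, high]."""
    return round((high - low) / SEGMENT_WIDTH)


def segment_index(freq, low):
    """1-based index of the segment that contains freq.

    Assumption: segments are counted from 1, starting at the lower
    edge of the range.
    """
    return int((freq - low) // SEGMENT_WIDTH) + 1


for name, (low, high) in RANGES.items():
    print(name, segment_count(low, high))
```

So the first three ranges each hold 64 segments and the checksum range holds 16.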
If the maximum of the first range is in the nx-th segment, then the x coordinate is this nx (1…64).
If the maximum of the second range is in the ny-th segment, then the y coordinate is this ny (1…64).
If the maximum of the third range is in the n-th segment, then the remainder of n divided by 8 gives the color (0…7), and the quotient of n divided by 8 (the integer part of n/8) tells how often the point should be repeated in the x direction (internally we add one, so the possible length of the horizontal block in the graph is 1…8).
If the maximum of the fourth range is in the c-th segment, then the remainder of (nx+ny+n) divided by 16 needs to match c–1 (0…15), or the entire point is discarded.
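The four rules above can be put together in a short Python sketch. This is not the code from the .phyphox file, just a literal transcription of the rules; the function name decode_point and the returned tuple are made up for illustration. Note that the sketch takes n exactly as written above, so whether n starts at 0 or 1 for the third range is an open assumption here.

```python
def decode_point(nx, ny, n, c):
    """Decode one point from the four segment indices.

    nx, ny: segment indices in the x and y ranges (1...64)
    n:      segment index in the color/repeat range
    c:      segment index in the checksum range (1...16)

    Returns (x, y, color, length), or None if the checksum fails.
    Hypothetical helper, transcribed from the rules in this post.
    """
    # Checksum: remainder of (nx + ny + n) / 16 must equal c - 1
    if (nx + ny + n) % 16 != c - 1:
        return None  # the entire point is discarded

    color = n % 8        # remainder gives the color (0...7)
    length = n // 8 + 1  # quotient plus one gives the block length
    return (nx, ny, color, length)


# Example: nx=10, ny=20, n=19 -> sum 49, remainder 1, so c must be 2
print(decode_point(10, 20, 19, 2))
print(decode_point(10, 20, 19, 3))
```

The first call passes the checksum and yields a point at (10, 20) with color 3 and a horizontal length of 3; the second call fails the checksum and is dropped.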
HTH a bit.