This guy hasn't done much research on gunshots. The "two sounds" are the sound of the bullet passing close to the microphone (the sonic boom shock wave, which sounds like a crack) and the sound of the bullet leaving the rifle (the muzzle blast, which sounds like a boom). They are around 0.2 seconds apart because the supersonic bullet reaches the microphone first, and the sound from the rifle, travelling only at the speed of sound, arrives about 0.2 seconds after it.
See this paper, for example, which explains the basics.
He compares it to a .30-06 recording where you hear a "clean shot", followed some seconds later by the sound of the bullet hitting a metal target. I haven't seen the video that recording was taken from, but it was probably recorded at the location of the rifle. In that case you only hear the muzzle blast, not the sonic "crack", which is only audible if the microphone is in front of the rifle, somewhere near the actual path of the bullet.
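To make the timing argument concrete, here is a rough sketch of the arithmetic. It assumes the microphone sits essentially on the bullet's path, that the bullet keeps a constant speed, and that sound travels at about 343 m/s. The 800 m/s bullet speed and both function names are made up for illustration; they are not figures from either recording.

```python
SPEED_OF_SOUND = 343.0  # m/s, air at roughly 20 °C (assumed)
BULLET_SPEED = 800.0    # m/s, hypothetical average speed over the flight


def arrival_gap(distance_m, bullet_speed=BULLET_SPEED, sound_speed=SPEED_OF_SOUND):
    """Seconds between the crack and the boom at a microphone downrange.

    The crack arrives with the bullet; the boom travels at the speed of sound.
    """
    if bullet_speed <= sound_speed:
        raise ValueError("a subsonic bullet produces no sonic crack")
    return distance_m / sound_speed - distance_m / bullet_speed


def distance_from_gap(gap_s, bullet_speed=BULLET_SPEED, sound_speed=SPEED_OF_SOUND):
    """Invert arrival_gap: estimate rifle-to-microphone distance from the gap."""
    if bullet_speed <= sound_speed:
        raise ValueError("a subsonic bullet produces no sonic crack")
    return gap_s / (1.0 / sound_speed - 1.0 / bullet_speed)


if __name__ == "__main__":
    # Microphone downrange: a 0.2 s gap implies a distance of this many metres.
    print(round(distance_from_gap(0.2), 1))
    # Microphone at the rifle: zero distance, so no gap (and no crack to hear).
    print(arrival_gap(0.0))
```

With these assumed numbers, a 0.2 second gap corresponds to a microphone somewhere over a hundred meters downrange. A microphone at the rifle has zero distance and therefore no gap at all, which fits the "clean shot" recording.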