# Dip Statistic for Multimodality

maths
data
Published: August 21, 2020

If you’ve got a distribution you may want a way to tell whether it has multiple components. For example, a sample of heights may have a couple of peaks corresponding to different genders or other attributes. While you could determine this by explicitly modelling the data as a mixture, the results are sensitive to your choice of model.

Another approach is to use a statistical test for multimodality. One common test is Silverman’s Test, which checks the number of modes in a kernel density estimate; the trick is choosing the right kernel width. Another is Hartigan’s Dip Test, which estimates the distance between the Cumulative Distribution Function (CDF) and the nearest unimodal distribution function. This article tries to explain this dip statistic.
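To see why the kernel width matters for the kernel density approach, here is a minimal sketch that counts the local maxima of a Gaussian kernel density estimate at a few widths. This is only the mode-counting ingredient, not Silverman’s Test itself. The sample, the grid, and the helper names `kde` and `count_modes` are all made up for illustration.

```python
import math

def kde(x, data, bandwidth):
    """Gaussian kernel density estimate of `data`, evaluated at `x`."""
    norm = len(data) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data) / norm

def count_modes(data, bandwidth, lo=-2.0, hi=7.0, steps=900):
    """Count local maxima of the estimate on an evenly spaced grid."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [kde(x, data, bandwidth) for x in grid]
    return sum(
        1
        for i in range(1, steps)
        if ys[i] > ys[i - 1] and ys[i] >= ys[i + 1]
    )

# Made-up sample with two clusters, one around 1.1 and one around 4.0.
data = [0.9, 1.0, 1.1, 1.2, 1.3, 3.8, 3.9, 4.0, 4.1, 4.2]

modes_narrow = count_modes(data, bandwidth=0.02)  # very narrow kernel: many small bumps
modes_medium = count_modes(data, bandwidth=0.3)   # 2: one bump per cluster
modes_wide = count_modes(data, bandwidth=3.0)     # 1: the clusters blur together
```

The same ten points look unimodal, bimodal, or very bumpy depending only on the width, which is why the choice is the tricky part.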

The method was first published in The Dip Test of Unimodality by J. A. Hartigan and P. M. Hartigan. The paper is moderately statistically involved, especially in the middle, but the overall idea is quite simple.

A unimodal distribution has a Probability Density Function (PDF) that increases from 0 to some peak value and then decreases back to 0. If there’s a flat region at the peak, the mode may be achieved over a range of points, but that range is a single interval. Its Cumulative Distribution Function at any point is just the area under the PDF to the left of that point. When the PDF switches from increasing to decreasing, the CDF switches from convex to concave. A graph is convex if any section of the curve lies below the straight line joining its endpoints; conversely, it is concave if any section of the curve lies above the straight line joining its endpoints.
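As a quick numerical check of the convex-then-concave shape, the sketch below uses the standard normal distribution as an example of a unimodal distribution (my choice, not something from the paper). It takes second differences of the CDF on an evenly spaced grid: a positive second difference means the curve is locally convex, a negative one means it is locally concave. The helper names are made up for illustration.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def second_differences(f, xs):
    """Discrete analogue of the second derivative of f at the interior grid points."""
    return [f(xs[i - 1]) - 2.0 * f(xs[i]) + f(xs[i + 1]) for i in range(1, len(xs) - 1)]

xs = [-4.0 + 0.5 * i for i in range(17)]  # grid from -4 to 4 in steps of 0.5
d2 = second_differences(normal_cdf, xs)

# Split the interior grid points at the mode (x = 0).
left = [d for x, d in zip(xs[1:-1], d2) if x < 0]
right = [d for x, d in zip(xs[1:-1], d2) if x > 0]

convex_before_mode = all(d > 0 for d in left)    # True: PDF rising, CDF convex
concave_after_mode = all(d < 0 for d in right)   # True: PDF falling, CDF concave
```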

A multimodal distribution’s CDF will change from convex to concave and back again multiple times, because its PDF will change from increasing to decreasing and back again multiple times. The idea of the dip statistic is to measure how much we need to change the CDF to make it unimodal. In particular, it is the maximum distance at any point between the CDF and the closest unimodal CDF. In other words, the distribution can be deformed into a unimodal one by moving the CDF by at most the dip at each point, and the dip is the smallest number for which this is true.
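Finding the closest unimodal CDF is the hard part, and the sketch below does not attempt it. It picks one particular unimodal candidate, the uniform distribution over the sample range, and measures the largest vertical gap between it and the empirical CDF. Because the dip is the smallest such gap over all unimodal CDFs, this number is only an upper bound on the dip, and often a loose one. The samples and the helper names are made up for illustration.

```python
def ecdf_sup_distance(sample, cdf):
    """Largest vertical gap between the empirical CDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        g = cdf(x)
        # The empirical CDF steps up from i/n to (i+1)/n at x; check both sides.
        gap = max(gap, abs((i + 1) / n - g), abs(i / n - g))
    return gap

def uniform_cdf(lo, hi):
    """CDF of the uniform distribution on [lo, hi], one unimodal candidate."""
    return lambda x: min(1.0, max(0.0, (x - lo) / (hi - lo)))

# Two made-up samples: two tight clusters versus evenly spread points.
clustered = [1.0, 1.1, 1.2, 1.3, 1.4, 4.0, 4.1, 4.2, 4.3, 4.4]
spread = [float(i) for i in range(10)]

gap_clustered = ecdf_sup_distance(clustered, uniform_cdf(min(clustered), max(clustered)))
gap_spread = ecdf_sup_distance(spread, uniform_cdf(min(spread), max(spread)))
# gap_clustered is about 0.38; gap_spread is 0.1
```

The clustered sample sits much further from this unimodal candidate than the evenly spread one. The dip statistic makes that comparison precise by searching over every unimodal CDF rather than a single guess.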