Centroid for Cosine Similarity
Cosine similarity is often used as a similarity measure in machine learning. Suppose you have a group of points (like a cluster) and you want to represent the group by a single point: the centroid. Then you can talk about how well formed the group is by the average similarity of its points to the centroid, or compare it to other centroids. Surprisingly, it's not much more complex than finding the geometric centre in Euclidean space, if you pick the right coordinate system.
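The cosine similarity between two n-dimensional vectors is the cosine of the great-circle distance between their projections onto the unit (n-1)-sphere. The cosine similarity between two vectors v and w is given by the rather obtuse formula
$$\cos\theta = \frac{v \cdot w}{\lVert v \rVert \, \lVert w \rVert} = \frac{\sum_{i=1}^{n} v_i w_i}{\sqrt{\sum_{i=1}^{n} v_i^2} \, \sqrt{\sum_{i=1}^{n} w_i^2}}$$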
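As a minimal sketch of that formula, here it is in plain Python using only the standard library. The function name `cosine_similarity` is my own choice for illustration, and raising an error for the zero vector is an assumption about how you'd want to handle the undefined case.

```python
import math


def cosine_similarity(v, w):
    """Cosine of the angle between two equal-length vectors v and w."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_v == 0.0 or norm_w == 0.0:
        # The zero vector has no direction, so the ratio is 0/0.
        raise ValueError("cosine similarity is undefined for the zero vector")
    return dot / (norm_v * norm_w)
```

Scaling either vector by a positive constant leaves the result unchanged, which is why only the projections onto the unit sphere matter.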
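I define the cosine similarity centroid of a set of points as the ray that has maximum average similarity to the points; it's the most similar direction you can have. For simplicity let's project all the points onto the unit sphere, except the origin, which is ignored because it has no direction (its cosine similarity to any point is undefined). Then the cosine similarity is just the dot product. Concretely, if our points on the unit sphere are $p_1, \ldots, p_N$, then we want the unit vector $x$ that maximises the average similarity
$$\frac{1}{N} \sum_{i=1}^{N} x \cdot p_i \quad \text{subject to} \quad x \cdot x = 1.$$
The maximum can be found using the method of Lagrange multipliers. We want to find the extremum of
$$L(x, \lambda) = \frac{1}{N} \sum_{i=1}^{N} x \cdot p_i - \lambda \, (x \cdot x - 1).$$
This occurs where
$$\nabla_x L = \frac{1}{N} \sum_{i=1}^{N} p_i - 2 \lambda x = 0, \qquad \text{so} \qquad x = \frac{1}{2 \lambda N} \sum_{i=1}^{N} p_i = \pm \frac{\sum_{i=1}^{N} p_i}{\lVert \sum_{i=1}^{N} p_i \rVert},$$
where the constraint $x \cdot x = 1$ fixes $\lambda$; the positive sign gives the maximum and the negative sign the minimum.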
That is, the centroid lies along the ray from the origin through the arithmetic mean, in Cartesian coordinates, of the points on the unit sphere. So if you want to calculate the centroid of a group of points with respect to cosine similarity, normalise the vectors, average them, and (if you need a unit vector) normalise the result.
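The recipe can be sketched in a few lines of plain Python. The names `normalise`, `cosine_centroid` and `mean_similarity` are hypothetical helpers of mine, and I've assumed points are given as lists of floats; the last function is one way to measure how well formed the group is.

```python
import math


def normalise(v):
    """Scale v to unit length; the zero vector has no direction."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        raise ValueError("cannot normalise the zero vector")
    return [a / norm for a in v]


def cosine_centroid(points):
    """Unit vector with maximum average cosine similarity to the points."""
    # Project onto the unit sphere, ignoring any point at the origin.
    units = [normalise(p) for p in points if any(p)]
    if not units:
        raise ValueError("need at least one non-zero point")
    # Average coordinate by coordinate, then push back onto the sphere.
    # normalise() raises if the mean is exactly zero (too much symmetry).
    mean = [sum(col) / len(units) for col in zip(*units)]
    return normalise(mean)


def mean_similarity(centroid, points):
    """Average cosine similarity of the non-zero points to a unit centroid."""
    units = [normalise(p) for p in points if any(p)]
    dots = [sum(a * b for a, b in zip(centroid, u)) for u in units]
    return sum(dots) / len(dots)
```

For example, `cosine_centroid([[2.0, 0.0], [0.0, 3.0]])` points along the diagonal between the two axes, regardless of the original lengths of the vectors.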
Note that this method will fail if the points are too symmetric. For example, if they are the vertices of a regular polyhedron centred at the origin then their mean is zero. That's because there's no longer a unique most similar point. For example, for the north pole and the south pole every point on the sphere has the same average similarity (zero), so any of them could serve as the centroid. One easy solution is to break the symmetry by moving any one of the points by a small random amount; then there will be a unique solution again, although its direction is decided by the perturbation rather than by the data.
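Here is a rough sketch of that symmetry-breaking trick in plain Python. Everything in it is an assumption on my part: the function names, the tolerance `tol` used to decide that the mean is "zero", the jitter size `scale`, and the choice of Gaussian noise on the first point.

```python
import math
import random


def mean_of_units(points):
    """Mean of the points after projecting each onto the unit sphere."""
    units = []
    for p in points:
        norm = math.sqrt(sum(a * a for a in p))
        units.append([a / norm for a in p])
    return [sum(col) / len(units) for col in zip(*units)]


def centroid_with_jitter(points, tol=1e-9, scale=1e-6, rng=None):
    """Return (centroid, was_degenerate) for a list of non-zero points.

    If the mean of the unit vectors is (numerically) zero, nudge the
    first point by a small random amount and try again.
    """
    rng = rng or random.Random()
    points = [list(p) for p in points if any(p)]  # ignore the origin
    mean = mean_of_units(points)
    length = math.sqrt(sum(a * a for a in mean))
    degenerate = length < tol
    if degenerate:
        # Break the symmetry: perturb one point, leave the rest alone.
        points[0] = [a + rng.gauss(0.0, scale) for a in points[0]]
        mean = mean_of_units(points)
        length = math.sqrt(sum(a * a for a in mean))
    return [a / length for a in mean], degenerate
```

Returning the `was_degenerate` flag is a deliberate choice: when it is true the direction you get back is essentially arbitrary, so it's probably worth treating such a group differently rather than trusting the centroid.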