### Highlighting regions in a scatter plot by solving the pulley problem

The R code used to generate the graphs is available in the PlotRegionHighlighter package on CRAN.

The challenge of data visualization is presenting more information without overwhelming the viewer. A scatter plot with items represented as points in two dimensions is one way to present a large volume of data. For instance, the following graph shows how stocks within industry groups have correlated daily returns.

In this plot, each dot represents a stock, and all stocks within an industry group share the same color. The plot was produced using multidimensional scaling (see chapter 10 in Baird & Noma’s Fundamentals of Scaling and Psychophysics), in which stocks whose prices move up and down together are plotted near each other. Increased distance between stocks indicates lower correlation in daily returns. The correlations were calculated using daily returns from 2002 to 2012.
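As a rough illustration of this step, the sketch below runs classical multidimensional scaling on simulated returns with base R's `cmdscale`. The data, the stock names, and the choice of `1 - correlation` as the dissimilarity are all assumptions made for the example; the article does not state which transformation or scaling variant was actually used.

```r
# Simulated daily returns for six hypothetical stocks in two groups
# (made-up data, not the stocks shown in the article).
set.seed(1)
n_days <- 250
factor_a <- rnorm(n_days)  # common driver for group A
factor_b <- rnorm(n_days)  # common driver for group B
returns <- cbind(
  A1 = factor_a + rnorm(n_days, sd = 0.5),
  A2 = factor_a + rnorm(n_days, sd = 0.5),
  A3 = factor_a + rnorm(n_days, sd = 0.5),
  B1 = factor_b + rnorm(n_days, sd = 0.5),
  B2 = factor_b + rnorm(n_days, sd = 0.5),
  B3 = factor_b + rnorm(n_days, sd = 0.5)
)

# Convert correlations to dissimilarities: high correlation -> small distance.
# 1 - correlation is one common choice and is an assumption here.
dissimilarity <- as.dist(1 - cor(returns))

# Classical (metric) multidimensional scaling into two dimensions.
coords <- cmdscale(dissimilarity, k = 2)

# Plot the stocks, colored by their (simulated) group.
group <- rep(c("A", "B"), each = 3)
plot(coords, col = ifelse(group == "A", "steelblue", "firebrick"), pch = 19,
     xlab = "Dimension 1", ylab = "Dimension 2")
text(coords, labels = rownames(coords), pos = 3)
```

Stocks driven by the same simulated factor should land close together, while the two groups end up well separated.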

Other information can be added to the plot. For instance, if we have another method for grouping the stocks, we would like to add this information to the graph. Varying the size and texture of the points is one way to show commonalities. Another way is to draw an outline around the points that belong to the same cluster. The following plot shows how the stocks cluster by superimposing these outlines on the plot.

Note that the clustering gives a slightly different picture of the structure: the clusters and the spatial layout are similar, but not identical, in how they interpret the correlations of daily returns for pairs of stocks. This is not surprising, since multidimensional scaling is a spatial model, while cluster analysis is based on a measure of distance that is not constrained to placing points on a two-dimensional page (see Baird & Noma, Fundamentals of Scaling and Psychophysics, chapter 11).
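To make the contrast concrete, here is a minimal sketch of a cluster analysis that works directly on the pairwise dissimilarities rather than on two-dimensional coordinates. It uses base R's `hclust` with average linkage on simulated returns; the data, the linkage method, and the number of clusters are assumptions for illustration, not the settings behind the plots in this article.

```r
# Simulated returns: three hypothetical stocks per driver (made-up data).
set.seed(1)
n_days <- 250
drivers <- matrix(rnorm(n_days * 2), ncol = 2)
returns <- drivers[, c(1, 1, 1, 2, 2, 2)] +
  matrix(rnorm(n_days * 6, sd = 0.5), ncol = 6)
colnames(returns) <- c("A1", "A2", "A3", "B1", "B2", "B3")

# The clustering uses the full set of pairwise dissimilarities,
# not positions on a two-dimensional page.
dissimilarity <- as.dist(1 - cor(returns))
tree <- hclust(dissimilarity, method = "average")

# Cut the tree into two clusters and list each stock's cluster label.
clusters <- cutree(tree, k = 2)
print(clusters)
```

The resulting labels are the kind of second grouping that can then be overlaid on the scatter plot.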

The rest of this article discusses how the envelope surrounding each cluster is created.

The main goals for creating the envelope are to produce an aesthetically pleasing shape containing all points in the cluster and to do so algorithmically, in a way that allows us to compare the sizes of different envelopes relative to their contents and to other envelopes. Other desirable characteristics include a continuous curve without corners and a buffer zone between the outermost points and the surrounding curve.

For these reasons, I place a circle centered at each point and define the envelope as the smallest convex figure that contains all of the circles. The figure consists of straight lines and circular arcs connected at the points where the lines are tangent to the circles. The algorithm is similar to that used to determine the length of a tank’s treads or the length of the belt needed to connect two flywheels. For this reason it is called the pulley problem (http://en.wikipedia.org/wiki/Belt_problem).
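For the two-circle case, the belt length has a closed form: two straight tangent segments plus one arc on each circle. The function below is a small sketch of that textbook formula; the name `beltLength` and its arguments are chosen for this example and are not part of the PlotRegionHighlighter package.

```r
# Length of an open (uncrossed) belt around two pulleys.
# d: distance between centers; r1, r2: pulley radii.
beltLength <- function(d, r1, r2) {
  # Neither circle may lie entirely inside the other.
  stopifnot(d > abs(r1 - r2))
  # Angle by which the tangent lines tilt away from the line of centers.
  theta <- asin((r1 - r2) / d)
  # Two straight segments between the tangent points.
  straight <- 2 * sqrt(d^2 - (r1 - r2)^2)
  # The belt wraps more than half of the larger pulley
  # and less than half of the smaller one.
  arcs <- r1 * (pi + 2 * theta) + r2 * (pi - 2 * theta)
  straight + arcs
}

# Equal pulleys of radius 1, centers 4 apart: 2 * 4 + 2 * pi
beltLength(d = 4, r1 = 1, r2 = 1)
```

With equal radii the tangent lines are parallel, so the result reduces to twice the center distance plus one full circumference.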

Solving for the envelope containing multiple circles is an extension of the simple pulley problem, in which the lines tangent to a pair of circles are determined. In the simple case, the envelope consists of the two tangent line segments and the outer arcs at the ends of the two circles (http://en.wikipedia.org/wiki/Tangent_lines_to_circles), with the transition from line to arc at the tangent points. In the general problem, tangent lines are computed for all pairs of circles, and lines that intersect the interior of any circle, pass between circles, or intersect other line segments on the surface are eliminated. This creates a continuous, smooth, convex envelope. The following figure shows the envelope generated to encompass a set of circles placed at random with random radii.
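A quick way to preview such an envelope, without computing any tangent lines, is to sample points on every circle and take their convex hull with base R's `chull`. This is a substitute technique shown only as a sketch: it yields a many-sided polygon that approximates the envelope, not the exact line-and-arc figure described above, and the function name `approxEnvelope` is made up for this example.

```r
# Approximate the convex envelope of a set of circles by sampling
# n points on each circle and taking the convex hull of all samples.
approxEnvelope <- function(x, y, r, n = 360) {
  angles <- seq(0, 2 * pi, length.out = n + 1)[-1]
  # Columns of the outer products correspond to circles; flatten column-wise
  # so the samples of circle 1 come first, then circle 2, and so on.
  px <- as.vector(outer(cos(angles), r) + rep(x, each = n))
  py <- as.vector(outer(sin(angles), r) + rep(y, each = n))
  hull <- chull(px, py)  # indices of the hull vertices
  cbind(x = px[hull], y = py[hull])
}

# Circles placed at random with random radii (made-up data).
set.seed(42)
cx <- runif(8, 0, 10)
cy <- runif(8, 0, 10)
cr <- runif(8, 0.3, 1.5)

env <- approxEnvelope(cx, cy, cr)
symbols(cx, cy, circles = cr, inches = FALSE, asp = 1,
        xlim = c(-2, 12), ylim = c(-2, 12))
polygon(env, border = "red", lwd = 2)
```

Increasing `n` tightens the approximation at the cost of more hull vertices; the exact tangent-line construction avoids that trade-off.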

The area within the envelope is computed as the area of the polygon defined by the tangency points, where the lines meet the circles, plus the areas of the circular segments between each arc and its chord. Once the radii of the circles around each point are specified, the envelope can be graphed atop the scatter plot.
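The polygon-plus-segments decomposition can be checked by hand in the two-circle case, where the four tangency points form an isosceles trapezoid. The sketch below works under that assumption (exactly two circles, neither inside the other); the function name `twoCircleEnvelopeArea` is hypothetical and not taken from the package.

```r
# Area of the convex envelope around two circles.
# d: distance between centers; r1, r2: radii.
twoCircleEnvelopeArea <- function(d, r1, r2) {
  stopifnot(d > abs(r1 - r2))
  theta <- asin((r1 - r2) / d)
  # The tangency points form an isosceles trapezoid with parallel sides
  # 2 * r1 * cos(theta) and 2 * r2 * cos(theta), separated by
  # d - (r1 - r2) * sin(theta).
  polygonArea <- (r1 + r2) * cos(theta) * (d - (r1 - r2) * sin(theta))
  # Circular segment cut off by a chord with central angle alpha.
  segmentArea <- function(r, alpha) r^2 / 2 * (alpha - sin(alpha))
  polygonArea +
    segmentArea(r1, pi + 2 * theta) +
    segmentArea(r2, pi - 2 * theta)
}

# Equal circles of radius 1, centers 4 apart: a 4 x 2 rectangle
# plus two half-discs, i.e. 8 + pi.
twoCircleEnvelopeArea(d = 4, r1 = 1, r2 = 1)
```

For more than two circles the same idea applies, with a general polygon-area routine replacing the trapezoid formula and one segment term per arc on the envelope.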

The resultant envelope has the minimum perimeter and the smallest area for a convex figure that contains all circles.

This method may also require an adjustment when the aspect ratio of the plot deviates from 1:1 and the ranges of the horizontal and vertical coordinates are materially different. There are several ways to do this adjustment. One method multiplies the vertical coordinates by a constant before calculating the envelope. This adjustment is reversed on the output coordinates for the envelope. The following R code could be used as a template for this procedure:

envelope <- generateEnvelope(cbind(X, Y * aspectAdjustment), r=rep(1,length(X)))$envelopeXY
envelope <- envelope %*% matrix(c(1,0,0,1/aspectAdjustment), ncol=2)
