Monthly Archives: April 2013

Introducing distVis() – calculating distances and clustering numeric data

In version 1.2, we also have a new function for creating a distance visualization of a matrix of numeric data. This figure was developed and contributed by Joseph Paulson and Florin Chelaru from the University of Maryland Center for Bioinformatics and Computational Biology (CBCB).

A distance matrix can be generated from a matrix of numeric values by choosing a distance metric that defines how “far” one observation is from another. Once these distances are calculated, the observations can be clustered by distance. The distVis() function offers six distance metrics (e.g. Euclidean, Manhattan) and seven clustering methods (e.g. average, centroid, Ward), so that users can see how their data reorganize under different distance calculations and clusterings.

In this example, we take the mtcars dataset in R, convert it to a matrix using as.matrix(), and visualize it using distVis():

testData <- as.matrix(mtcars)
distVis(testData)  # assumes the remaining arguments can be left at their defaults

The result looks like this.
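
The same idea can be tried in base R with dist() and hclust(). This is only a sketch of the general technique, not the code behind distVis(); the metric and method names below are those of base R, and the two combinations shown are arbitrary picks.

```r
# Base R sketch of "distance metric + clustering"; not the distVis() internals.
m <- as.matrix(mtcars)

# Pairwise distances between rows (cars) under two different metrics
d_euclid <- dist(m, method = "euclidean")
d_manhat <- dist(m, method = "manhattan")

# Hierarchical clustering of each distance matrix with average linkage
hc_euclid <- hclust(d_euclid, method = "average")
hc_manhat <- hclust(d_manhat, method = "average")

# Compare how the rows are reordered under each metric
head(rownames(m)[hc_euclid$order])
head(rownames(m)[hc_manhat$order])
```

Swapping the method arguments is the static equivalent of changing the metric or clustering in the interactive figure.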

New features in Healthvis v1.2

We’re rolling out version 1.2 of our package, which includes a new visualization (described in an upcoming post) as well as some changes to package functionality described below. You can find instructions for updating your package on the install page.

Passing Data

Previously, calling a healthvis function passed the user’s data to the server, where the visualization was rendered. We did this so that users could save their figure on the server for sharing and embedding. However, some users were concerned about sending private data to the server and would prefer to visualize their data without it leaving their local machine.

We have changed the basic architecture of our package to address these concerns. In v1.2, the visualization is rendered locally by default. User data is only passed to the server if the “Save to Healthvis” button is pressed, at which point options for permalinking and embedding will become available. The visualization still appears in a web browser, but data do not go to our app engine server until the button is pressed.

To accomplish this, we have moved to using web sockets. As with other R utilities that use web sockets to pass data, your R session will be blocked after you call a function until you stop the process.


We have had some issues with embedding figures, since figures with static heights and widths would not resize in a smaller embedded window. Viewers had to scroll around the originally-sized image in the embedded iframe and could not see the entire visualization at once. To address this, we have added a function called healthvis.getDimensions() to the JavaScript side, which checks whether the figure is embedded and resets the height and width of the figure accordingly. Now viewers will see the entire image, resized to fit within the frame on the page where the figure is embedded. We are still working out some kinks with this (e.g. issues with different browsers), so please let us know how this feature is working for you.


We have reorganized our GitHub repository according to this strategy. Basically, we will maintain a master branch from which users will install the stable version of the package. Parallel to this branch, we will have a develop branch and feature branches for in-progress development. These will be merged after adequate testing is completed.

The upshot for developers is that you can now make a pull request on the develop branch, (hopefully) making the development process more streamlined. As always, you can read about development here.

Introducing pairedVis() – interactive scatterplot matrix

John Muschelli (@strictlystat) wins the prize for first adapted d3 visualization! He liked this plot (which we’ll explain further) created by Mike Bostock (the creator of the d3.js library), so he decided to develop a version of it for our package. You can read about his process on his blog post. We touched up a few parts and the visualization is now part of our package!

So what is this?
A scatterplot matrix is useful for getting a high-level view of a data set. R has the function pairs() which creates such a matrix plot. Let’s use the iris data set (built into R) that Mike used above as an example. pairs(iris) produces the following:

pairs() plot of iris dataset

If you clicked the link to the d3 version of this plot above, you’ll notice 1) a more colorful display and 2) the ability to select groups of points in one plot and have them be highlighted in all the plots. The colors correspond to the categorical variable “Species”.
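
For comparison, the coloring (though not the linked selection) can be roughly approximated in base R. This is a static sketch only; the three hex colors are arbitrary placeholders, not the palette used in the d3 version.

```r
# Static approximation: color the pairs() plot by the Species factor.
num_cols <- sapply(iris, is.numeric)            # keep only numeric columns
species_cols <- c("#E41A1C", "#377EB8", "#4DAF4A")  # placeholder colors
point_col <- species_cols[as.integer(iris$Species)]

pairs(iris[, num_cols], col = point_col, pch = 19)
```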

So if you have a data frame with both numeric and categorical columns, the d3 version is a neat alternative way to display this data. We added additional functionality for multiple categorical variables (can recolor the points via the drop-down) and a dynamic legend. Let’s add a fake categorical variable to iris and visualize:

test_data <- iris
test_data$content <- sample(c("High", "Med", "Low", "None"), nrow(test_data), replace=TRUE)
pairedVis(test_data)  # assumes the remaining arguments can be left at their defaults

We get this:

But what does this have to do with health?
Nothing directly, but that’s not an issue! If you read our FAQ, you’ll see that we are more than excited about others developing any type of d3 visualization they want to integrate with R through our package. Anything you think would be useful for many people and different data sets is eligible.

Cool! I want to use it
No problem. If you don’t have the healthvis package installed already, head over to the install page for instructions. If you have the package, you can update it easily (check the “Update” section on the install page).

This development story seems rigged…
Yes, it’s true that John is in the Hopkins Biostat department, and so are we, but we can assure you that he was not part of the initial package development and learned everything independently! The only real way to prove that the development process is not so bad is for someone completely unaffiliated to give it a try…consider the gauntlet thrown down!

More about heatmapVis

The idea behind the sortable heatmap plot is that we often have a set of subjects and numeric observations that we want to visualize as a heatmap. For example, in genomics we might have differential gene expression levels at different sites and for different study subjects.

In addition to the numeric observations, we could have an additional set of data for each subject, such as their age, gender, treatment indicators, medication, etc. We would somehow want to incorporate this information into the heatmap, and so the sortable heatmap is born.

Take a look at the example code below:

# Simulate 40 subjects (rows) and 25 observation sites (columns)
nsubj <- 40
nobs <- 25
data1 <- matrix(rnorm(nsubj * nobs), nsubj, nobs)
rownames(data1) <- paste0("S", 1:nsubj)
colnames(data1) <- paste0("V", 1:nobs)

# Create a set of discrete and continuous covariates to sort by
sort.by1 <- data.frame("Treatment"=rbinom(nsubj, 1, 0.4), "Age"=rpois(nsubj, 30))

# Assumed positional call: numeric matrix first, covariate data frame second
heatmapVis(data1, sort.by1)


In this code, we are simulating 40 study subjects and 25 observation sites (a 40×25 matrix of numeric values). We provide names for the rows (S1…S40 for the subjects) and columns (V1…V25 for the sites). Next, we simulate a set of covariates as a data frame. The first column is a treatment indicator and the second is the subject’s age.

heatmapVis() expects a numeric matrix of dimension n × m and a data frame of covariates of dimension n × p. Here we have passed a 40×25 matrix of numeric observations and a 40×2 data frame of covariates. There is also a color argument (left at its default above) which takes a vector of the form c("lowColor", "medColor", "highColor"). You need to specify three colors signifying low, medium, and high values, and the intermediate colors are filled in accordingly by the function.
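
The way three anchor colors expand into a full scale can be illustrated with colorRampPalette() from base R. Whether heatmapVis() interpolates in exactly this way is an assumption, and the color names here are placeholders.

```r
# Expand three anchor colors (low, medium, high) into a smooth scale.
three_cols <- c("blue", "white", "red")   # placeholder low/med/high colors
ramp <- colorRampPalette(three_cols)      # returns a function of n
shades <- ramp(9)                         # 9 colors from low to high

# Show the resulting scale as a strip of colored cells
image(matrix(seq_along(shades), ncol = 1), col = shades, axes = FALSE)
```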

As you can see in the final product, we start with an unsorted heatmap, but can then sort the rows (subjects) by the covariates we specified. This could be useful for looking at how trends change when the subjects are grouped and ordered.
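
Sorting the subjects by a covariate amounts to reordering the rows of the matrix. A minimal base R sketch of that step, using a smaller simulated data set with the same structure as above (the object names are illustrative and this is not the heatmapVis() code):

```r
set.seed(1)
nsubj <- 6
nobs <- 4
vals <- matrix(rnorm(nsubj * nobs), nsubj, nobs,
               dimnames = list(paste0("S", 1:nsubj), paste0("V", 1:nobs)))
covars <- data.frame(Treatment = rbinom(nsubj, 1, 0.4),
                     Age = rpois(nsubj, 30))

# Order subjects by treatment group, then by age within each group
ord <- order(covars$Treatment, covars$Age)
sorted_vals <- vals[ord, , drop = FALSE]

# Static heatmap of the reordered rows (no clustering or rescaling)
heatmap(sorted_vals, Rowv = NA, Colv = NA, scale = "none")
```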

Limitations: This visualization is still a bit limited as we are working to increase how much data our server can accept in one request. Also, the transition becomes sluggish when there are many rows or columns. We are working on rectifying these issues as we speak. For now, try it out and let us know what you think!

Creating the survival plot

We’ve mentioned that healthvis makes some nice interactive figures using just one line of R code, but we haven’t shown an example yet. Remember the grouped survival comparison from the announcement post? This is what you would do in R to create it:

# Load the survival library
library(survival)
library(healthvis)  # provides survivalVis()
# Convert trt and prior to factors so they are treated as such by the plotting function
veteran$trt <- as.factor(veteran$trt)
veteran$prior <- as.factor(veteran$prior)
# Run a Cox proportional hazards regression
cobj <- coxph(Surv(time, status)~trt+age+celltype+prior, data=veteran)
# Here's the plotting command!
survivalVis(cobj, data=veteran, plot.title="Veteran Survival Data", group="trt", group.names=c("Treatment", "No Treatment"), line.col=c("#E495A5","#39BEB1"))

Above, we fit a proportional hazards model that takes into account a subject’s treatment status, age, disease cell type, and whether or not they have had prior therapy. In the “survivalVis” call, we simply pass the model fit object, the data set (“veteran” data from the “survival” R package), a plot title, the model covariate on which we want to group (here, treatment or “trt”), and group names and colors.
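
For a static point of comparison, predicted survival curves for the two treatment groups can be drawn with survfit() from the survival package. This is a sketch under assumptions, not what survivalVis() does internally: the other covariates are held at arbitrary reference values (median age, squamous cell type, no prior therapy) chosen only for illustration.

```r
library(survival)

vet <- veteran
vet$trt <- as.factor(vet$trt)
vet$prior <- as.factor(vet$prior)
fit <- coxph(Surv(time, status) ~ trt + age + celltype + prior, data = vet)

# One hypothetical subject per treatment level, other covariates held fixed
newdata <- data.frame(
  trt = factor(levels(vet$trt), levels = levels(vet$trt)),
  age = median(vet$age),
  celltype = factor("squamous", levels = levels(vet$celltype)),
  prior = factor("0", levels = levels(vet$prior))
)

# Predicted survival curves from the fitted Cox model, one per row of newdata
curves <- survfit(fit, newdata = newdata)
plot(curves, col = c("#E495A5", "#39BEB1"),
     xlab = "Days", ylab = "Survival probability")
```

Changing the fixed values in newdata plays the role of the covariate controls in the interactive figure.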