Sankey Diagram


A Sankey diagram can be built in R using the networkD3 package.

sankeyNetwork function


sankeyNetwork(
Links, a data frame object with the links between the nodes
Nodes, a data frame containing the node id and properties of the nodes
Source, character string naming the network source variable in the Links data frame
Target, character string naming the network target variable in the Links data frame
Value, character string naming the variable in the Links data frame that holds the size of each flow (drawn as the width of the link)
NodeID, character string naming the variable in the Nodes data frame that holds the node labels. The Source and Target ids in Links refer to rows of Nodes and must be 0-indexed
NodeGroup, character string specifying the node groups in the Nodes, used to color the nodes in the network
LinkGroup, character string specifying the groups in the Links, used to color the links in the network
units, character string describing physical units (if any) for Value
colourScale, character string specifying the categorical color scale for the nodes
fontSize, numeric font size in pixels for the node text labels
fontFamily, font family for the node text labels
nodeWidth, numeric width of each node
nodePadding, numeric vertical padding between nodes; in practice it influences the height of the nodes
margin, an integer or a named list/vector of integers for the plot margins. If using a named list/vector, the positions top, right, bottom, left are valid. If a single integer is provided, then the value will be assigned to the right margin
height, numeric height for the network graph’s frame area in pixels
width, numeric width for the network graph’s frame area in pixels
iterations, numeric. Number of iterations in the diagram layout for computation of the depth (y-position) of each node
sinksRight, boolean. If TRUE, the last nodes are moved to the right border of the plot
... )

To save the resulting object, use the saveNetwork function for an HTML version (saveNetwork(obj, "path/obj.html")). The webshot function from the webshot package can then convert that file to PNG (webshot("path/obj.html", "path/obj.png")).
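As a minimal sketch of the saving step, the code below builds a tiny network from made-up data and writes it to a temporary directory. The file names and the two-link data set are assumptions chosen for illustration; the PNG export is only attempted when webshot and PhantomJS are available.

```r
library(networkD3)

# Tiny illustrative network (hypothetical data, two links)
links <- data.frame(source = c(0, 0), target = c(1, 2), value = c(2, 3))
nodes <- data.frame(name = c("group_A", "group_C", "group_D"))

p <- sankeyNetwork(Links = links, Nodes = nodes,
                   Source = "source", Target = "target",
                   Value = "value", NodeID = "name")

# Save the interactive widget as HTML. With selfcontained = FALSE the
# JavaScript dependencies are written next to the file, so pandoc is not needed.
html_file <- file.path(tempdir(), "sankey.html")
saveNetwork(p, file = html_file, selfcontained = FALSE)

# Optional static export: requires the webshot package and PhantomJS
# (PhantomJS can be installed once with webshot::install_phantomjs()).
if (requireNamespace("webshot", quietly = TRUE) &&
    webshot::is_phantomjs_installed()) {
  webshot::webshot(html_file, file.path(tempdir(), "sankey.png"))
}
```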

Input data


Input data can be stored in 2 different formats: a connection data frame or an incidence matrix.

This post describes how to build a basic Sankey diagram from these 2 types of input.


From connection data frame

A connection data frame lists all the connections one by one in a data frame.

Usually you have a source and a target column. You can add a third column that gives further information for each connection, like the value of the flow.

This is the format that the networkD3 library expects. Let’s build a connection data frame and represent it as a Sankey diagram:

# Libraries
library(networkD3)
library(dplyr)
 
# A connection data frame is a list of flows with intensity for each flow
links <- data.frame(
  source = c("group_A", "group_A", "group_B", "group_C", "group_C", "group_E"), 
  target = c("group_C", "group_D", "group_E", "group_F", "group_G", "group_H"), 
  value = c(2, 3, 2, 3, 1, 3)
  )
 
# From these flows we need to create a node data frame: it lists every entity involved in the flow
nodes <- data.frame(
  name = c(as.character(links$source), as.character(links$target)) %>% unique()
)
 
# networkD3 expects connections as zero-based ids, not the names used in the links data frame, so we convert them.
links$IDsource <- match(links$source, nodes$name)-1 
links$IDtarget <- match(links$target, nodes$name)-1

# Make the Network
p <- sankeyNetwork(Links = links, Nodes = nodes,
                   Source = "IDsource", Target = "IDtarget",
                   Value = "value", NodeID = "name",
                   fontSize = 16, sinksRight=FALSE)
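The node table and the id conversion can be wrapped in a small helper so the same steps are not repeated for every data set. This is only a sketch in base R: prepare_sankey is a hypothetical name, not a networkD3 function, and it assumes the links data frame has columns called source and target.

```r
# Hypothetical helper (not part of networkD3): build the node table and the
# zero-based ids from a links data frame with `source` and `target` columns.
prepare_sankey <- function(links) {
  node_names <- unique(c(as.character(links$source), as.character(links$target)))
  nodes <- data.frame(name = node_names, stringsAsFactors = FALSE)
  # match() returns 1-based positions; subtract 1 to get the 0-based ids
  links$IDsource <- match(links$source, nodes$name) - 1
  links$IDtarget <- match(links$target, nodes$name) - 1
  list(links = links, nodes = nodes)
}

# Made-up flows for illustration
links <- data.frame(
  source = c("group_A", "group_A", "group_B"),
  target = c("group_C", "group_D", "group_C"),
  value  = c(2, 3, 2)
)
prepared <- prepare_sankey(links)
prepared$nodes$name      # "group_A" "group_B" "group_C" "group_D"
prepared$links$IDsource  # 0 0 1
prepared$links$IDtarget  # 2 3 2
```

The two elements of the returned list can then be passed as the Links and Nodes arguments of sankeyNetwork, with Source = "IDsource" and Target = "IDtarget".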


Another example, with additional graphical arguments, in which the size of each node is displayed:

# Libraries
library(networkD3)
library(dplyr)
 
# A connection data frame is a list of flows with intensity for each flow
set.seed(123)
links <- data.frame(
  source = rep(c("group_A", "group_B", "group_C", "group_D"), each = 4),
  target = rep(c("group_A ", "group_B ", "group_C ", "group_D "), 4),
  value = sample(0:4, size = 16, replace = TRUE)
  ) %>% filter(value != 0)
links$source <- as.factor(links$source)
links$target <- as.factor(links$target)
 
# Add the size of each node (its total flow) to its label
links$source_n <- NA
links$target_n <- NA
for(i in seq_len(nrow(links))){
  links$source_n[i] <- paste0(links$source[i],' (n=',sum(links$value[links$source==links$source[i]]),')')
  links$target_n[i] <- paste0(links$target[i],' (n=',sum(links$value[links$target==links$target[i]]),')')
}

# From these flows we need to create a node data frame: it lists every entity involved in the flow
nodes <- data.frame(
  name_n = c(as.character(links$source_n), as.character(links$target_n)) %>% unique()
)
# Strip the " (n=...)" suffix and the trailing space to recover the bare group name
nodes$name <- gsub(" \\(n=[0-9]+\\)", "", nodes$name_n)
nodes$name <- gsub(" $", "", nodes$name)
nodes$name <- as.factor(nodes$name)
# Flag target nodes (1) versus source nodes (0), then sort so that sources come first
nodes$target <- as.integer(nodes$name_n %in% links$target_n)
nodes <- nodes %>% arrange(target, name_n)

# networkD3 expects connections as zero-based ids, not the names used in the links data frame, so we convert them.
links$IDsource <- match(links$source_n, nodes$name_n)-1 
links$IDtarget <- match(links$target_n, nodes$name_n)-1
 
# Make the Network
my_color <- paste0('d3.scaleOrdinal() .domain(["',
                   paste(nodes$name_n, collapse='","'),
                   '"]) .range(["#6699CC","#CC3333","orange","#666699","#6699CC","#CC3333","orange","#666699"])')
p <- sankeyNetwork(Links = links, Nodes = nodes,
                   Source = "IDsource", Target = "IDtarget",
                   Value = "value", NodeID = "name_n",
                   iterations = 0,
                   sinksRight=FALSE, nodeWidth = 20,
                   fontSize = 16, nodePadding = 20,
                   colourScale=my_color, LinkGroup="source_n")
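The node-size labels in this example are built with a row-by-row loop. As a sketch of an alternative, base R's ave() returns the group total for every row in a single vectorized call. The three-row links data frame below is made up for illustration; it keeps the same convention of a trailing space on target names.

```r
# Made-up flows; target names carry a trailing space, as in the example above
links <- data.frame(
  source = c("group_A", "group_A", "group_B"),
  target = c("group_C ", "group_D ", "group_C "),
  value  = c(2, 3, 4),
  stringsAsFactors = FALSE
)

# ave() gives, for each row, the sum of `value` within that row's group
links$source_n <- paste0(links$source, " (n=",
                         ave(links$value, links$source, FUN = sum), ")")
links$target_n <- paste0(links$target, " (n=",
                         ave(links$value, links$target, FUN = sum), ")")

links$source_n  # "group_A (n=5)" "group_A (n=5)" "group_B (n=4)"
```

The labels are the same as those produced by the loop, so the rest of the workflow (node table, ids, sankeyNetwork call) can stay unchanged.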




From incidence matrix

An incidence matrix is square or rectangular.

Row and column names are node names. The item in row x and column y represents the flow from x to y. The Sankey diagram shows every flow greater than 0.

Since the networkD3 library expects a connection data frame, we will first convert the dataset, and then re-use the code from above.

# Libraries
library(networkD3)
library(dplyr)
 
# Create an incidence matrix. Usually the flow goes from the row names to the column names.
# Remember that our connections are directed since we are working with flows.
data <- matrix(c(0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,
                 2,0,0,0,0,0,0,0, 3,0,0,0,0,0,0,0,
                 0,2,0,0,0,0,0,0, 0,0,3,0,0,0,0,0,
                 0,0,1,0,0,0,0,0, 0,0,0,0,3,0,0,0), 8, 8)
colnames(data) = rownames(data) = c("group_A", "group_B", "group_C", "group_D", "group_E", "group_F", "group_G", "group_H")

# Transform it to a connection data frame with tidyr from the tidyverse:
links <- data %>% 
  as.data.frame() %>% 
  tibble::rownames_to_column(var="source") %>% 
  tidyr::gather(key="target", value="value", -1) %>%
  filter(value != 0)

# From these flows we need to create a node data frame: it lists every entity involved in the flow
nodes <- data.frame(
  name = c(as.character(links$source), as.character(links$target)) %>% unique()
)

# networkD3 expects connections as zero-based ids, not the names used in the links data frame, so we convert them.
links$IDsource <- match(links$source, nodes$name)-1 
links$IDtarget <- match(links$target, nodes$name)-1

# Make the Network
p <- sankeyNetwork(Links = links, Nodes = nodes,
                   Source = "IDsource", Target = "IDtarget",
                   Value = "value", NodeID = "name",
                   fontSize = 16, sinksRight=FALSE)
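The conversion from incidence matrix to connection data frame can also be done without tidyr. The sketch below uses only base R: which() with arr.ind = TRUE returns the row and column position of every non-zero cell. The small rectangular matrix is made up for illustration.

```r
# Small illustrative incidence matrix (rows = sources, columns = targets).
# matrix() fills column by column, so the columns are (0,0), (2,0), (3,1).
m <- matrix(c(0, 0,
              2, 0,
              3, 1),
            nrow = 2,
            dimnames = list(c("group_A", "group_B"),
                            c("group_A", "group_C", "group_D")))

# Positions of all non-zero cells, one row per flow
idx <- which(m != 0, arr.ind = TRUE)

links <- data.frame(
  source = rownames(m)[idx[, "row"]],
  target = colnames(m)[idx[, "col"]],
  value  = m[idx],
  stringsAsFactors = FALSE
)
links
# Flows: group_A -> group_C (2), group_A -> group_D (3), group_B -> group_D (1)
```

From here the node table and the zero-based ids are built exactly as for any other connection data frame.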

Example: Migration flow


Here is an example displaying the number of people migrating from one country (left) to another (right).

# Libraries
library(tidyverse)
library(networkD3)

# Load dataset from github
data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/13_AdjacencyDirectedWeighted.csv", header=TRUE)

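# Reshape to long format: one row per source/target pair with a positive flow
data_long <- data %>%
  rownames_to_column() %>%
  gather(key = 'key', value = 'value', -rowname) %>%
  filter(value > 0)
colnames(data_long) <- c("source", "target", "value")
# Append a space to target names so that destination nodes are distinct from origin nodes
data_long$target <- paste(data_long$target, " ", sep="")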

# From these flows we need to create a node data frame: it lists every entity involved in the flow
nodes <- data.frame(name=c(as.character(data_long$source), as.character(data_long$target)) %>% unique())
 
# networkD3 expects connections as zero-based ids, not the names used in the links data frame, so we convert them.
data_long$IDsource <- match(data_long$source, nodes$name)-1 
data_long$IDtarget <- match(data_long$target, nodes$name)-1

# prepare colour scale
ColourScal ='d3.scaleOrdinal() .range(["#FDE725FF","#B4DE2CFF","#6DCD59FF","#35B779FF","#1F9E89FF","#26828EFF","#31688EFF","#3E4A89FF","#482878FF","#440154FF"])'

# Make the Network
p <- sankeyNetwork(Links = data_long, Nodes = nodes,
                     Source = "IDsource", Target = "IDtarget",
                     Value = "value", NodeID = "name", 
                     sinksRight=FALSE, colourScale=ColourScal, nodeWidth=40, fontSize=13, nodePadding=20)
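In the migration example, a space is pasted onto every target name before the node table is built. The effect is that a region appearing both as origin and as destination becomes two separate nodes, one on each side of the diagram. The sketch below shows this with two made-up flows in base R; the region names are placeholders, not values taken from the dataset.

```r
# Two made-up flows in opposite directions between the same pair of regions
flows <- data.frame(
  source = c("Africa", "Europe"),
  target = c("Europe", "Africa"),
  value  = c(3, 1),
  stringsAsFactors = FALSE
)

# Without a suffix, origins and destinations share the same node names
n_shared <- length(unique(c(flows$source, flows$target)))    # 2 nodes

# With a trailing space, each destination gets its own node
flows$target <- paste0(flows$target, " ")
nodes <- data.frame(name = unique(c(flows$source, flows$target)),
                    stringsAsFactors = FALSE)
n_split <- nrow(nodes)                                        # 4 nodes

# Zero-based ids now point to different nodes for source and target
flows$IDsource <- match(flows$source, nodes$name) - 1   # 0 1
flows$IDtarget <- match(flows$target, nodes$name) - 1   # 2 3
```

The trailing space is invisible in the rendered labels, which is why it is a convenient way to keep the two sides apart.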

Sankey Diagram using only plotly


Sankey diagrams can also be created in R with Plotly alone:

# Libraries
library(plotly)
library(rjson)

json_file <- "https://raw.githubusercontent.com/plotly/plotly.js/master/test/image/mocks/sankey_energy.json"
json_data <- fromJSON(paste(readLines(json_file), collapse=""))

fig <- plot_ly(
    type = "sankey",
    domain = list(
      x =  c(0,1),
      y =  c(0,1)
    ),

    orientation = "h",
    valueformat = ".0f",
    valuesuffix = "TWh",

    node = list(
      label = json_data$data[[1]]$node$label,
      color = json_data$data[[1]]$node$color,
      pad = 15,
      thickness = 15,
      line = list(
        color = "black",
        width = 0.5
      )
    ),

    link = list(
      source = json_data$data[[1]]$link$source,
      target = json_data$data[[1]]$link$target,
      value =  json_data$data[[1]]$link$value,
      label =  json_data$data[[1]]$link$label
    )
  ) 

fig <- fig %>% layout(
    title = "Energy forecast for 2050<br>Source: Department of Energy & Climate Change, Tom Counsell via <a href='https://bost.ocks.org/mike/sankey/'>Mike Bostock</a>",
    font = list(
      size = 10
    ),
    xaxis = list(showgrid = FALSE, zeroline = FALSE),
    yaxis = list(showgrid = FALSE, zeroline = FALSE)
)
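The plotly example reads its nodes and links from a remote JSON file. As a smaller sketch that does not need a download, the same trace can be fed with inline vectors. The labels and values below are made up for illustration; as in the JSON data, the source and target entries are zero-based positions in the label vector.

```r
library(plotly)

fig <- plot_ly(
  type = "sankey",
  orientation = "h",
  node = list(
    # Node labels; links refer to them by zero-based position
    label = c("group_A", "group_B", "group_C", "group_D"),
    pad = 15,
    thickness = 15
  ),
  link = list(
    source = c(0, 0, 1),   # group_A, group_A, group_B
    target = c(2, 3, 2),   # group_C, group_D, group_C
    value  = c(2, 3, 2)
  )
)

fig <- fig %>% layout(title = "Minimal Sankey sketch", font = list(size = 10))
fig
```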



Contact

This document is a work of the statistics team in the Biostatistics and Medical Information Department at Saint-Louis Hospital in Paris (SBIM).
Based on The R Graph Gallery by Yan Holtz.