Hello R, meet plotly!

Part 1. Basic Charts

Cover image credit: janjf93 from Pixabay

What is plotly

Plotly’s R graphing library makes interactive, publication-quality graphs. It can be used to make line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple-axes, and 3D (WebGL based) charts.

Plotly.R is free and open source and you can view the source, report issues or contribute on GitHub.

Website | GitHub | Gallery

What is Data Visualization

First of all, let’s start with a definition of “Data Visualization”. Wikipedia defines it as:

Data visualization is the graphic representation of data. It involves producing images that communicate relationships among the represented data to viewers of the images. This communication is achieved through the use of a systematic mapping between graphic marks and data values in the creation of the visualization. This mapping establishes how data values will be represented visually, determining how and to what extent a property of a graphic mark, such as size or color, will change to reflect change in the value of a datum. [1]

Simply put, data visualization is a set of techniques for creating a graphical representation of raw data. I personally believe that a lot of Data Scientists underestimate the power of a simple plot. Plotting the data can help you find hidden patterns and insights before you apply any fancy machine learning.

Consider a simple example. You run marketing campaigns through different channels (let’s say Google Ads, Facebook, Twitter, Email and Offline). You have a set of values that indicate the performance of each campaign for a specific month.

library(tidyverse) # dplyr, tidyr, readr, forcats
library(knitr)     # kable()
library(plotly)

set.seed(27)
temp_df <- data.frame(
  campaign = c("Google Ads", "Facebook", "Twitter", "Email", "Offline"),
  revenue = round(runif(n = 5, min = 1000, max = 3000), 2))
kable(temp_df)

| campaign | revenue |
|:---|---:|
| Google Ads | 2943.50 |
| Facebook | 1167.52 |
| Twitter | 2747.74 |
| Email | 1658.46 |
| Offline | 1444.55 |

Sure, you can sort it and see which campaign performed better or worse, like this:

temp_df %>% 
  arrange(desc(revenue)) %>% 
  kable()

| campaign | revenue |
|:---|---:|
| Google Ads | 2943.50 |
| Twitter | 2747.74 |
| Email | 1658.46 |
| Offline | 1444.55 |
| Facebook | 1167.52 |

However, you cannot deny that the following representation makes the data much easier to “read”:

temp_df %>% 
  ## reorder the factor values of `campaign` according to 
  ## the `revenue` values
  mutate(campaign = fct_reorder(campaign, revenue)) %>% 
  plot_ly(
    ## ~ sign tells plotly to look for a column with such name 
    ## in a dataframe provided
    x = ~revenue, 
    y = ~campaign, 
    type = 'bar', 
    orientation = 'h',
    # setting the color of bars
    marker = list(color = 'rgba(158,202,225, 0.8)')) %>% 
  # adding titles
  layout(title = "<b>Campaigns' Revenue for a Month</b>",
         xaxis = list(title = "<b>Revenue</b>"),
         yaxis = list(title = "<b>Campaign</b>"),
         # set the top margin so the plot helper buttons don't cover the title
         margin = list(t = 70))

This way it’s easier to spot the differences between campaigns. For example, we can see without doing any math that Google Ads brought in almost twice as much revenue as the Email campaign.

You have already seen some plotly magic. Some key points so far:

  • plotly doesn’t sort the bars for you, so you have to set the order yourself (here by reordering the factor levels of campaign with fct_reorder());
  • plotly works with pipes %>%;
  • you can use HTML tags like <b> inside titles and other layout text.
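
The first key point can be illustrated on a toy data frame. This is a minimal sketch, assuming the dplyr, forcats and plotly packages are installed; `toy_df` and its values are made up for illustration. plotly draws bars in the order of the factor levels, so reordering the levels is what changes the plot. As far as I know, setting `categoryorder` on the axis is an alternative that leaves the data untouched.

```r
library(dplyr)
library(forcats)
library(plotly)

# hypothetical data, not the campaign data from above
toy_df <- data.frame(
  channel = c("A", "B", "C"),
  revenue = c(300, 100, 200))

# option 1: reorder the factor levels by revenue (as in the example above)
ordered_df <- toy_df %>% 
  mutate(channel = fct_reorder(channel, revenue))
levels(ordered_df$channel)  # "B" "C" "A"

fig_levels <- ordered_df %>% 
  plot_ly(x = ~revenue, y = ~channel, type = 'bar', orientation = 'h')

# option 2: keep the data as is and let the axis sort the categories
fig_axis <- toy_df %>% 
  plot_ly(x = ~revenue, y = ~channel, type = 'bar', orientation = 'h') %>% 
  layout(yaxis = list(categoryorder = "total ascending"))
```

Both figures should show the bars sorted by revenue; which option to prefer is mostly a matter of whether you want the ordering stored in the data or in the plot.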

To continue the introduction to Data Visualization, we are going to use the data set of Nobel Prize Laureates found at kaggle.com. The dataset includes a record for every individual or organization that was awarded the Nobel Prize since 1901.

nobel_df <- read_csv("archive.csv")
nobel_df %>% 
  select(-Motivation) %>% 
  head() %>% 
  kable()

| Year | Category | Prize | Prize Share | Laureate ID | Laureate Type | Full Name | Birth Date | Birth City | Birth Country | Sex | Organization Name | Organization City | Organization Country | Death Date | Death City | Death Country | X19 |
|---:|:---|:---|:---|---:|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 1901 | Chemistry | The Nobel Prize in Chemistry 1901 | 1/1 | 160 | Individual | Jacobus Henricus van ’t Hoff | 1852-08-30 | Rotterdam | Netherlands | Male | Berlin University | Berlin | Germany | 1911-03-01 | Berlin | Germany | NA |
| 1901 | Literature | The Nobel Prize in Literature 1901 | 1/1 | 569 | Individual | Sully Prudhomme | 1839-03-16 | Paris | France | Male | NA | NA | NA | 1907-09-07 | Châtenay | France | NA |
| 1901 | Medicine | The Nobel Prize in Physiology or Medicine 1901 | 1/1 | 293 | Individual | Emil Adolf von Behring | 1854-03-15 | Hansdorf (Lawice) | Prussia (Poland) | Male | Marburg University | Marburg | Germany | 1917-03-31 | Marburg | Germany | NA |
| 1901 | Peace | The Nobel Peace Prize 1901 | 1/2 | 462 | Individual | Jean Henry Dunant | 1828-05-08 | Geneva | Switzerland | Male | NA | NA | NA | 1910-10-30 | Heiden | Switzerland | NA |
| 1901 | Peace | The Nobel Peace Prize 1901 | 1/2 | 463 | Individual | Frédéric Passy | 1822-05-20 | Paris | France | Male | NA | NA | NA | 1912-06-12 | Paris | France | NA |
| 1901 | Physics | The Nobel Prize in Physics 1901 | 1/1 | 1 | Individual | Wilhelm Conrad Röntgen | 1845-03-27 | Lennep (Remscheid) | Prussia (Germany) | Male | Munich University | Munich | Germany | 1923-02-10 | Munich | Germany | NA |

Chart Types

Line Chart

Let’s start with the most basic chart: the line chart.

A line chart or line plot or line graph or curve chart is a type of chart which displays information as a series of data points called ‘markers’ connected by straight line segments. [2]

A big note here: a line chart requires a continuous x variable (such as a time series). The nobel_df has a Year column that can be used as the x value. Let’s look at how many awards were given each year.

nobel_df %>% 
  group_by(Year) %>% 
  summarise(`Total Awards` = n()) %>% 
  plot_ly(
    x = ~Year, y = ~`Total Awards`, name = "",
    type = 'scatter', mode = 'lines+markers',
    hovertemplate = paste(
      '<i>Year</i>: %{x}',
      '<br><i>Awards</i>: %{y}')) %>% 
  layout(
    title = "<b>Total Awards by Year</b>",
    xaxis = list(title = "<b>Year</b>"),
    yaxis = list(title = "<b>Total Awards</b>"),
    margin = list(t = 70))

Note the use of hovertemplate in the plot_ly() function. This argument lets you set custom text that is shown when hovering over plot points. We set name to an empty string since it is not really needed for now. It becomes more useful when there are multiple objects on the plot and a legend to show.
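
Here is a small sketch of what can go into a hover template, assuming plotly is installed; `toy_df` is made-up data. The `%{x}` and `%{y}` placeholders are filled in per point, a d3-style format such as `:.1f` can follow the variable name, and an empty `<extra></extra>` tag hides the secondary box that otherwise shows the trace name (an alternative to setting name to an empty string).

```r
library(plotly)

# hypothetical data for illustration
toy_df <- data.frame(year = 2001:2005, awards = c(9, 12, 11, 10, 13))

# paste() joins its arguments with a space, paste0() with nothing
hover <- paste0(
  '<i>Year</i>: %{x}',
  '<br><i>Awards</i>: %{y:.1f}',  # d3-format: one decimal place
  '<extra></extra>')              # hides the secondary box with the trace name

fig <- plot_ly(
  toy_df, x = ~year, y = ~awards,
  type = 'scatter', mode = 'lines+markers',
  hovertemplate = hover)
```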

To summarize, a line chart is usually used to show the connection between two variables:

  • x axis - numerical variable (usually time series)
  • y axis - numerical variable

You can add more categorical variables as new lines (for example, you could add a line of total awards for the Peace category, so that the plot would show two series: Total Awards and Awards for the Peace category).
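
A second line can be added with add_trace(). The following is only a sketch on a made-up stand-in for nobel_df (`toy_df` with just the Year and Category columns), assuming dplyr and plotly are installed.

```r
library(dplyr)
library(plotly)

# hypothetical stand-in for nobel_df with only the columns used here
toy_df <- data.frame(
  Year = c(1901, 1901, 1901, 1902, 1902, 1903),
  Category = c("Peace", "Physics", "Peace", "Peace", "Chemistry", "Physics"))

# one row per year: all awards and the awards in the Peace category
yearly_df <- toy_df %>% 
  group_by(Year) %>% 
  summarise(total = n(), peace = sum(Category == "Peace"))

fig <- yearly_df %>% 
  plot_ly(x = ~Year, y = ~total, name = "Total Awards",
          type = 'scatter', mode = 'lines+markers') %>% 
  # the new trace reuses x, type and mode from plot_ly()
  add_trace(y = ~peace, name = "Peace Awards")
```

Here the name argument matters, since it is what the legend uses to tell the two lines apart.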

Area Chart

A slight modification of the line chart would be the area chart.

An area chart or area graph displays graphically quantitative data. It is based on the line chart. The area between axis and line are commonly emphasized with colors, textures and hatchings. Commonly one compares two or more quantities with an area chart. [3]

The difference from the basic line chart is that we color the area under the line. We can set this up by just adding the fill parameter to the previous code.

nobel_df %>% 
  group_by(Year) %>% 
  summarise(`Total Awards` = n()) %>% 
  plot_ly(
    x = ~Year, y = ~`Total Awards`, name = "",
    type = 'scatter', mode = 'lines',
    fill = 'tozeroy', # color the area under the line
    fillcolor = "rgba(158,202,225, 0.6)",
    hovertemplate = paste(
      '<i>Year</i>: %{x}',
      '<br><i>Awards</i>: %{y}')) %>% 
  layout(
    title = "<b>Total Awards by Year</b>",
    xaxis = list(title = "<b>Year</b>"),
    yaxis = list(title = "<b>Total Awards</b>"),
    margin = list(t = 70))

Stacked Area Chart

As you can see, the previous example was not really different from the regular line plot. However, there is a modification of the area chart called the stacked area chart that is way more useful. It is used to show the difference when two or more labels are included in the plot. When multiple attributes are included, the first attribute is plotted as a line with color fill, followed by the second attribute, and so on. For example, let’s take a look at how the number of awards differs between men and women for each year.

However, this requires some manual manipulation of the data frame.

# select count for women by all years
female_df <- nobel_df %>% 
  filter(Sex == "Female") %>% 
  group_by(Year) %>% 
  summarise(female = n())

# select count for men by all years
male_df <- nobel_df %>% 
  filter(Sex == "Male") %>% 
  group_by(Year) %>% 
  summarise(male = n())

## since there might be years when just women/just men got the award
## we need to join previous dfs with *all* years available in the data set
## and replace missing values by 0s.
joint_df <- nobel_df %>% 
  select(Year) %>% 
  distinct(Year) %>% 
  left_join(female_df, by = "Year") %>% 
  left_join(male_df, by = "Year") %>% 
  mutate(female = replace_na(female, 0),
         male = replace_na(male, 0))

kable(head(joint_df))

| Year | female | male |
|---:|---:|---:|
| 1901 | 0 | 6 |
| 1902 | 0 | 7 |
| 1903 | 1 | 6 |
| 1904 | 0 | 5 |
| 1905 | 1 | 4 |
| 1906 | 0 | 6 |
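
The same kind of table can probably be built in a single step by counting inside summarise(). This is a sketch on made-up data (`toy_df`), assuming dplyr is installed and assuming that rows without a Sex value should simply not be counted for either group.

```r
library(dplyr)

# hypothetical stand-in for nobel_df
toy_df <- data.frame(
  Year = c(1901, 1901, 1902, 1903, 1903),
  Sex = c("Male", "Male", "Male", "Female", NA))

# every year gets a row, and a year without women (or men) gets a 0,
# so no join and no replace_na() is needed
joint_alt <- toy_df %>% 
  group_by(Year) %>% 
  summarise(female = sum(Sex == "Female", na.rm = TRUE),
            male = sum(Sex == "Male", na.rm = TRUE))
```

The join-based version above is more explicit about each step, which is why I keep it in the main text.
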
fig1 <- joint_df %>% 
  plot_ly(
    x = ~Year, y = ~male, 
    name = 'Male', stackgroup = 'one', # specify that this is stacked chart
    type = 'scatter', mode = "none",
    legendgroup = 'Male', # legend group name
    fillcolor = 'rgba(158,202,225, 0.7)',
    hovertemplate = paste('<i>Year</i>: %{x}',
                          '<br><i>Awards</i>: %{y}'))

fig1 <- fig1 %>% 
  add_trace(
    y = ~female, name = 'Female',  
    fillcolor = 'rgba(255, 188, 101, 0.8)',
    legendgroup = 'Female', # legend group name
    hovertemplate = paste('<i>Year</i>: %{x}',
                          '<br><i>Awards</i>: %{y}'))

Also, a stacked chart can be used to show normalized data (the ratio) for each label. It can be done by simply adding groupnorm = 'percent' to the plot_ly function.

fig2 <- joint_df %>% 
  plot_ly(
    x = ~Year, y = ~male, 
    name = 'Male', stackgroup = 'one', 
    type = 'scatter', mode = "none", showlegend = FALSE,
    groupnorm = 'percent', legendgroup = 'Male',
    fillcolor = 'rgba(158,202,225, 0.7)',
    hovertemplate = paste('<i>Year</i>: %{x}',
                          '<br><i>Ratio</i>: %{y: .2f}%'))

fig2 <- fig2 %>% 
  add_trace(y = ~female, name = 'Female',  
            legendgroup = 'Female', showlegend = FALSE,
            fillcolor = 'rgba(255, 188, 101, 0.8)',
            hovertemplate = paste('<i>Year</i>: %{x}',
                                  '<br><i>Ratio</i>: %{y: .2f}%'))
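
To see what groupnorm = 'percent' computes, the ratios can be worked out by hand. As I understand it, for every x value each trace is divided by the total of the stack and multiplied by 100. A sketch with made-up counts (`toy_joint`), assuming dplyr is installed:

```r
library(dplyr)

# hypothetical counts in the shape of joint_df
toy_joint <- data.frame(
  Year = c(1901, 1902, 1903),
  female = c(0, 1, 2),
  male = c(6, 3, 2))

# share of each group in the yearly total, in percent
ratio_df <- toy_joint %>% 
  mutate(total = female + male,
         male_pct = 100 * male / total,
         female_pct = 100 * female / total)

ratio_df$male_pct    # 100 75 50
ratio_df$female_pct  # 0 25 50
```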

Now we have two objects: fig1 (stacked area chart) and fig2 (normalized stacked area chart). We can show them on one plot using subplot(). We used showlegend = FALSE in fig2 because the traces of both plots share legend groups through legendgroup, so we don’t need to show the same legend twice.

# `shareX` means that `x` axis will be the same for both plots
subplot(fig1, fig2, nrows = 2, shareX = TRUE) %>% 
  layout(title = "<b>Total Awards by Year by Gender</b>",
         xaxis = list(title = "<b>Year</b>"),
         yaxis = list(title = "<b>Number of Awards</b>"),
         yaxis2 = list(title = "<b>Ratio of Awards (%)</b>"),
         margin = list(t = 70))

Summary:

As with the line chart, the x variable of an area chart should be continuous:

  • x axis - numerical variable (usually time series)
  • y axis - numerical variable

You can add more categorical variables as new area charts.

What if we wanted to see the number of awards by Category (a categorical variable) on the x axis rather than by Year? We couldn’t use a line chart, but we could use a bar chart.

Bar Chart

A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent. The bars can be plotted vertically or horizontally. A vertical bar chart is sometimes called a column chart. [4]

We have already seen an example of a bar chart in the introductory example, so we can jump straight to bar chart modifications.

Grouped Bar Chart

In a grouped bar chart, for each categorical group there are two or more bars. These bars are color-coded to represent a particular grouping. For example, a business owner with two stores might make a grouped bar chart with different colored bars to represent each store: the horizontal axis would show the months of the year and the vertical axis would show the revenue. [4]

This is useful for side-by-side comparison among categories.

Stacked Bar Chart

The stacked bar chart stacks bars that represent different groups on top of each other. The height of the resulting bar shows the combined result of the groups. However, stacked bar charts are not suited to data sets where some groups have negative values. In such cases, grouped bar charts are preferable.

Grouped bar graphs usually present the information in the same order in each grouping. Stacked bar graphs present the information in the same sequence on each bar. [4]

The idea is similar to the stacked area chart. For each x value (or y value if we use a horizontal bar chart) we stack the values of the labels on top of each other.

grouped_bc <- plot_ly()
grouped_bc <- grouped_bc %>% 
  add_trace(
    ## some manipulations with the data:
    ## first plotly object will have the count of 
    ## awards by categories for men only
    data = nobel_df %>%
      filter(Sex == "Male") %>% 
      group_by(Category) %>% 
      summarise(awards = n()),
    x = ~Category, y = ~awards,
    name = "Men", type = "bar", legendgroup = 'Male',
    # change the color of bar
    marker = list(color = 'rgba(158,202,225, 0.9)'),
    hovertemplate = paste('<i>Category</i>: %{x}',
                          '<br><i>Awards</i>: %{y}')) %>% 
  add_trace(
    ## second plotly object will have the count of 
    ## awards by categories for women only
    data = nobel_df %>%
      filter(Sex == "Female") %>% 
      group_by(Category) %>% 
      summarise(awards = n()),
    x = ~Category, y = ~awards,
    name = "Women", type = "bar", legendgroup = 'Female',
    marker = list(color = 'rgba(255, 188, 101, 0.9)'),
    hovertemplate = paste('<i>Category</i>: %{x}',
                          '<br><i>Awards</i>: %{y}')) %>% 
  layout(title = paste0("<b>Total awards by Category and Gender</b><br>",
                        "<i>Grouped Bar Chart</i>"),
         barmode = 'group', # set the type of bar chart
         xaxis = list(title = "<b>Category</b>"),
         yaxis = list(title = "<b>Number of Awards</b>"),
         margin = list(t = 70))

grouped_bc
stacked_bc <- plot_ly()

stacked_bc <- stacked_bc %>% 
  add_trace(
    data = nobel_df %>%
      filter(Sex == "Male") %>% 
      group_by(Category) %>% 
      summarise(awards = n()),
    x = ~Category, y = ~awards,
    name = "Men", type = "bar", legendgroup = 'Male',
    marker = list(color = 'rgba(158,202,225, 0.9)'),
    hovertemplate = paste('<i>Category</i>: %{x}',
                          '<br><i>Awards</i>: %{y}')) %>% 
  add_trace(
    data = nobel_df %>%
      filter(Sex == "Female") %>% 
      group_by(Category) %>% 
      summarise(awards = n()),
    x = ~Category, y = ~awards,
    name = "Women", type = "bar", legendgroup = 'Female',
    marker = list(color = 'rgba(255, 188, 101, 0.9)'),
    hovertemplate = paste('<i>Category</i>: %{x}',
                          '<br><i>Awards</i>: %{y}')) %>% 
  layout(title = paste0("<b>Total awards by Category and Gender</b><br>",
                        "<i>Stacked Bar Chart</i>"),
         barmode = 'stack', # set the type of bar chart
         xaxis = list(title = "<b>Category</b>"),
         yaxis = list(title = "<b>Number of Awards</b>"),
         margin = list(t = 70))

stacked_bc
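
Both charts above repeat the same filter/summarise steps for each group. A more compact variant, shown here only as a sketch on made-up data (`toy_df`), counts all groups at once and maps the Sex column to color, so plotly creates one trace per level. It assumes dplyr and plotly are installed; the two hex colors are the opaque versions of the rgba colors used above.

```r
library(dplyr)
library(plotly)

# hypothetical stand-in for nobel_df
toy_df <- data.frame(
  Category = c("Peace", "Peace", "Physics", "Physics", "Physics", "Peace"),
  Sex = c("Male", "Female", "Male", "Male", "Female", NA))

# one row per Category/Sex pair
counts_df <- toy_df %>% 
  filter(!is.na(Sex)) %>% 
  count(Category, Sex, name = "awards")

fig <- counts_df %>% 
  plot_ly(x = ~Category, y = ~awards, type = "bar",
          color = ~Sex,
          # levels are sorted alphabetically: Female first, then Male
          colors = c("#FFBC65", "#9ECAE1")) %>% 
  layout(barmode = 'group')  # use 'stack' for the stacked version
```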

Note that it would still be OK to plot total awards by year using a bar chart.

Summary:

  • For horizontal bar chart:
    • x axis - numerical variable
    • y axis - categorical variable
  • For vertical bar chart:
    • x axis - categorical variable
    • y axis - numerical variable
  • For side-by-side comparison among different categorical variables you can use grouped bar chart.
  • If you are interested in the proportion of each categorical variable for each x value (for a vertical bar chart) you can use a stacked bar chart (or a normalized stacked bar chart).
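
The normalized stacked bar chart mentioned in the last point is not shown above. A minimal sketch, assuming plotly is installed and using made-up, already aggregated counts (`counts_df`): to my knowledge the layout attribute barnorm plays the same role for bars that groupnorm plays for stacked areas.

```r
library(plotly)

# hypothetical counts, already aggregated
counts_df <- data.frame(
  Category = c("Peace", "Peace", "Physics", "Physics"),
  Sex = c("Female", "Male", "Female", "Male"),
  awards = c(1, 3, 1, 9))

fig <- counts_df %>% 
  plot_ly(x = ~Category, y = ~awards, color = ~Sex, type = "bar") %>% 
  layout(barmode = 'stack',
         barnorm = 'percent',  # every bar is rescaled to sum to 100
         yaxis = list(title = "<b>Share of Awards (%)</b>"))
```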

Scatter Plot

What if you had two numerical variables, but neither of them is in “datetime” format, so a line chart makes no sense? Bar charts are also not useful for this type of problem, since you would have dozens of bars, one for each value of the numerical variable you put on the x axis. In such cases scatter plots might help.

A scatter plot (also called a scatterplot, scatter graph, scatter chart, scattergram, or scatter diagram) is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. If the points are coded (color/shape/size), one additional variable can be displayed. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis. [5]

The Nobel Prize Winners data set doesn’t really have columns to use for a scatter plot, so for this example I am going to use a data set of Education Indicators for 107 countries for 2014 found at kaggle.com.

edindex_df <- read_csv("EducationIndicators2014.csv")
kable(head(edindex_df))
Country NamePPTGDPPRPEOOCPESEEPEUNEMPLEBTDP
Albania2893654132198574590.73709733329119572016.177.835
United Arab Emirates90861394019580000000.22146114110404097763.677.375
Azerbaijan9535079751980109650.16228219492945177085.270.764
Burundi10816860309364722724.256924658330820467946.956.696
Belgium112312135317620000002.49653812101127735688.580.596
Benin10598482970743201610.847022889676321333301.059.516

The column codes are:

  • PPT: Population
  • GDP: Gross domestic product
  • PRPE: Percentage of repeaters in Primary Education
  • OOCP: Out-of-school children of Primary School
  • ESE: Enrolment in Secondary Education
  • EPE: Enrolment in Primary Education
  • UNEMP: Unemployment Rate
  • LEB: Life expectancy at birth
  • TDP: Theoretical Duration of Primary Education

We would like to see the relationship between Gross domestic product and Out-of-school children of Primary School.

edindex_df %>% 
  plot_ly(
    x = ~GDP, y = ~OOCP,
    text = ~`Country Name`, name = "",
    type = "scatter", mode = 'markers',
    marker = list(color = 'rgba(158,202,225, 0.9)'),
    hovertemplate = paste0(
      '<b>Country</b>: %{text}',
      # GDP is mapped to x and OOCP to y
      '<br><b>Gross domestic product</b>: %{x}<br>',
      '<b>Out-of-school children of Primary School</b>: %{y}')) %>% 
  layout(
    title = "<b>GDP vs Out-of-school children of Primary School</b>",
    xaxis = list(title = "<b>Gross domestic product</b>"),
    yaxis = list(title = "<b>Out-of-school children of Primary School</b>"),
    margin = list(t = 70))

Note how easy it is to spot outliers:

  • Germany has the highest value of GDP and lowest value of OOCP;
  • Pakistan has the lowest value of GDP and highest value of OOCP.

Scatter plots are useful to see the shape of relationship between variables (linear, exponential, etc.) and for visual evaluation of correlation [6].
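
The visual impression can be backed up with a number. The sketch below uses made-up values (`toy_df`), not the Education Indicators data, and assumes plotly is installed. It compares the Pearson and Spearman correlation and puts both axes on a log scale, which often helps when values span several orders of magnitude.

```r
library(plotly)

# hypothetical values, not taken from EducationIndicators2014.csv
toy_df <- data.frame(
  gdp = c(1e9, 5e9, 2e10, 1e11, 3e12),
  oocp = c(90000, 40000, 30000, 8000, 1000))

# Pearson measures linear association, Spearman monotonic association
pearson <- cor(toy_df$gdp, toy_df$oocp)
spearman <- cor(toy_df$gdp, toy_df$oocp, method = "spearman")

fig <- toy_df %>% 
  plot_ly(x = ~gdp, y = ~oocp, type = "scatter", mode = 'markers') %>% 
  layout(xaxis = list(type = "log"), yaxis = list(type = "log"))
```

In this toy example the relationship is perfectly monotonic but far from linear, so the Spearman coefficient is -1 while the Pearson coefficient is weaker.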

You can also add a third numerical variable for comparison. This type of plot is usually called a bubble chart.

Bubble Chart

A bubble chart is a type of chart that displays three dimensions of data. Each entity with its triplet (v1, v2, v3) of associated data is plotted as a disk that expresses two of the vi values through the disk’s xy location and the third through its size. Bubble charts can facilitate the understanding of social, economical, medical, and other scientific relationships. [7]

We will slightly modify the previous plot to add the unemployment rate as a third variable:

edindex_df %>% 
  plot_ly(
    x = ~GDP, y = ~OOCP,
    color = ~UNEMP,  colors = "Blues", # adding colors
    text = ~`Country Name`, name = "",
    type = "scatter", mode = 'markers',
    marker = list(size = ~UNEMP), # add `UNEMP` variable as a size of a point
    hovertemplate = paste0(
      '<b>Country</b>: %{text}',
      # GDP is mapped to x and OOCP to y
      '<br><b>Gross domestic product</b>: %{x}<br>',
      '<b>Out-of-school children of Primary School</b>: %{y}<br>')) %>% 
  layout(
    title = "<b>GDP vs Out-of-school children of Primary School</b>",
    xaxis = list(title = "<b>Gross domestic product</b>"),
    yaxis = list(title = "<b>Out-of-school children of Primary School</b>"),
    margin = list(t = 70)) %>% 
  # change name of color scale
  colorbar(title = '<b>Unemployment<br>Rate</b>')
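
In the plot above the raw UNEMP value is used directly as the marker size. If the values are very small or very large, the bubbles can be rescaled. The sketch below uses made-up data (`toy_df`) and the scaling rule suggested in the plotly documentation for sizemode = 'area'; `max_px` is a name I chose for the desired size of the largest bubble.

```r
library(plotly)

# hypothetical data for illustration
toy_df <- data.frame(
  gdp = c(1, 2, 3, 4),
  oocp = c(40, 30, 20, 10),
  unemp = c(2, 8, 18, 32))

max_px <- 40  # desired size of the largest bubble, in pixels
# rule from the plotly docs: 2 * max(size) / (desired max size)^2
sizeref <- 2 * max(toy_df$unemp) / max_px^2

fig <- toy_df %>% 
  plot_ly(x = ~gdp, y = ~oocp, type = "scatter", mode = 'markers',
          marker = list(size = ~unemp,
                        sizemode = 'area',  # value scales the area, not the diameter
                        sizeref = sizeref))
```

Scaling the area rather than the diameter is usually recommended for bubble charts, because readers tend to compare areas.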

We could also add a categorical variable. For example, we could add a column pop_size with three labels: less than 10m, less than 20m (that is, from 10m up to 20m) and greater than 20m.

edindex_df %>% 
  mutate(pop_size = case_when(PPT < 10*10^6 ~ "less than 10m",
                              PPT < 20*10^6 ~ "less than 20m",
                              TRUE ~ "greater than 20m")) %>% 
  plot_ly(
    x = ~GDP, y = ~OOCP,
    color = ~pop_size,  
    text = ~`Country Name`, 
    marker = list(size = ~UNEMP),
    type = "scatter", mode = 'markers',
    hovertemplate = paste0(
      '<b>Country</b>: %{text}',
      # GDP is mapped to x and OOCP to y
      '<br><b>Gross domestic product</b>: %{x}<br>',
      '<b>Out-of-school children of Primary School</b>: %{y}<br>')) %>% 
  layout(
    title = "<b>GDP vs Out-of-school children of Primary School</b>",
    xaxis = list(title = "<b>Gross domestic product</b>"),
    yaxis = list(title = "<b>Out-of-school children of Primary School</b>"),
    margin = list(t = 70))
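
The binning done with case_when() above can also be written with base R’s cut(), which returns a factor whose levels are already in a meaningful order. A sketch with made-up population values (`ppt`); the label wording is my own and differs slightly from the labels used above.

```r
# hypothetical populations
ppt <- c(2.9e6, 9.5e6, 1.1e7, 1.99e7, 2e7, 9.1e7)

pop_size <- cut(
  ppt,
  breaks = c(0, 10e6, 20e6, Inf),
  labels = c("less than 10m", "10m to 20m", "20m or more"),
  right = FALSE)  # intervals are [0, 10m), [10m, 20m), [20m, Inf)

table(pop_size)
```

With right = FALSE a population of exactly 20m falls into the last bin, which matches the behaviour of the case_when() version.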

Summary:

  • x axis - numerical variable
  • y axis - numerical variable
  • A third continuous variable can be added as the size of a point.
  • A third categorical variable can be added as the color/shape of a point.

Pie Chart

The next type of chart is a bit controversial. It has been criticized by specialists because it can be hard to compare different sections of a given chart, or to compare data across different charts.

A pie chart (or a circle chart) is a circular statistical graphic, which is divided into slices to illustrate numerical proportion. In a pie chart, the arc length of each slice (and consequently its central angle and area), is proportional to the quantity it represents. While it is named for its resemblance to a pie which has been sliced, there are variations on the way it can be presented. [8]

Simply put, a pie chart shows the normalized proportion of each label of a categorical variable. Going back to the Nobel Prize Winners data set, let’s take a look at the overall proportion of awards in each category.

nobel_df %>% 
  group_by(Category) %>% 
  summarise(Awards = n()) %>% 
  plot_ly(
    labels = ~Category, values = ~Awards,
    type = 'pie', showlegend = FALSE,
    textposition = 'inside', textinfo = 'label+percent') %>% 
  layout(title = "<b>Overall Proportion of Winning Categories</b>")

It seems readable since we have just 6 categories, but imagine what would happen if we had 15+ labels.
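
One common way to keep a pie chart readable with many labels is to lump the rare ones into a single “Other” slice before plotting. A minimal sketch with made-up labels (`labels`), assuming the forcats package is installed:

```r
library(forcats)

# hypothetical labels: two frequent categories and three rare ones
labels <- c(rep("A", 5), rep("B", 4), "C", "D", "E")

# keep the 2 most frequent levels, lump the rest together
lumped <- fct_lump_n(labels, n = 2, other_level = "Other")

table(lumped)
```

The result has three levels (A, B and Other) and can be passed to the labels argument of a pie trace after counting.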

Summary:

  • pie chart shows the normalized proportion of a categorical variable.

Heatmap

A heat map (or heatmap) is a data visualization technique that shows magnitude of a phenomenon as color in two dimensions. The variation in color may be by hue or intensity, giving obvious visual cues to the reader about how the phenomenon is clustered or varies over space. [9]

For me, the simplest way of thinking about a heat map is to imagine a pivot table whose cells are colored according to their values. Let’s take a look at the number of awards for each year and category. In order to create a pivot table we can use the table() function.

# pivot table
pt <- table(nobel_df$Year, nobel_df$Category)

plot_ly(
  x = colnames(pt), y = rownames(pt), name = "",
  z = pt, type = "heatmap", colors = "Blues",
  hovertemplate = paste0('<b>Year</b>: %{y}',
                          '<br><b>Category</b>: %{x}<br>',
                          '<b>Awards</b>: %{z}<br>')) %>% 
  layout(
    title = "<b>Amount of Awards by Year and Category</b>",
    xaxis = list(title = "<b>Category</b>"),
    yaxis = list(title = "<b>Year</b>"),
    margin = list(t = 70))
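
To make the role of the pivot table clearer, here is what table() returns on a tiny made-up data frame (`toy_df`), assuming plotly is installed. Rows of the matrix correspond to y values and columns to x values.

```r
library(plotly)

# hypothetical stand-in for nobel_df
toy_df <- data.frame(
  Year = c(1901, 1901, 1902, 1902, 1902),
  Category = c("Peace", "Physics", "Peace", "Peace", "Physics"))

# rows are years, columns are categories, cells are counts
pt <- table(toy_df$Year, toy_df$Category)
z <- unclass(pt)  # drop the "table" class to get a plain integer matrix

z["1902", "Peace"]  # 2

fig <- plot_ly(
  x = colnames(z), y = rownames(z), z = z,
  type = "heatmap", colors = "Blues")
```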

Summary:

  • x axis - categorical variable
  • y axis - categorical variable
  • z axis - numerical variable (z is the intersection value between x and y axis)

That’s it for now. In the next part I want to show examples of map charts, custom buttons/sliders and animations.

Rule of Thumb

To sum up, here are some guidelines for creating a good chart no matter what library you are using (plotly, ggplot, etc.):

  1. Think about what question your visualization should answer (do you want to show a trend? a distribution?).
  2. Choose the right chart for your data (what type of variables do you have?).
  3. Make it clear and human readable.
  4. Don’t put too much information on one chart (one research question ~ one chart).
  5. Describe it with titles, labels and annotations.
  6. Don’t go crazy with colors.
Ruslan Klymentiev
Data Scientist

My life credo is “Never stop learning”. When I am not learning, I am travelling or hiking.