day 22 stars

Author
Published

April 22, 2025

Day 22, and the prompt is stars. Here I am looking at the age differences between love interests in Hollywood movies. The data was found on Kaggle and downloaded from https://hollywoodagegap.com.

library(tidyverse)
library(here)
library(janitor)
library(ggrain)
library(ggeasy)
library(patchwork)

age_diff <- read_csv(here("charts", "2025-04-22_stars", "hollywood age.csv")) %>%
  clean_names() %>%
  select(movie_name, release_year, age_difference)


glimpse(age_diff)
Rows: 1,203
Columns: 3
$ movie_name     <chr> "Harold and Maude", "Venus", "The Quiet American", "Sol…
$ release_year   <dbl> 1971, 2006, 2002, 2009, 1998, 2010, 1992, 2016, 2009, 1…
$ age_difference <dbl> 52, 50, 49, 45, 45, 43, 42, 41, 40, 39, 38, 38, 36, 36,…

plot

The number of movies in the data increases across this period, which makes any change in age difference over time difficult to see. Maybe creating a new variable that groups movies by decade will help.

age_diff %>%
  ggplot(aes(x = release_year, y = age_difference)) +
  geom_jitter() 

Here I am making a new decade column using case_when().

age_diff_decade <- age_diff %>%
  mutate(decade = case_when(release_year < 1940 ~ "1930s", 
                            release_year >= 1940 & release_year < 1950 ~ "1940s", 
                            release_year >= 1950 & release_year < 1960 ~ "1950s", 
                            release_year >= 1960 & release_year < 1970 ~ "1960s", 
                            release_year >= 1970 & release_year < 1980 ~ "1970s", 
                            release_year >= 1980 & release_year < 1990 ~ "1980s", 
                            release_year >= 1990 & release_year < 2000 ~ "1990s", 
                            release_year >= 2000 & release_year < 2010 ~ "2000s", 
                            release_year >= 2010 & release_year < 2020 ~ "2010s", 
                            release_year >= 2020 & release_year < 2030 ~ "2020s"
                            ))
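As an aside, the same grouping can be sketched with arithmetic instead of one case_when() branch per decade. This is only an alternative, not what the post uses: sample_rows and sample_decade are names made up for the example, and the values are the first nine rows shown in the glimpse() output above. One difference to be aware of: any year before 1930 would get its own decade label rather than being folded into "1930s".

```r
library(dplyr)

# Stand-in data: the first nine release_year / age_difference values
# from the glimpse() output, not the full data set.
sample_rows <- tibble(
  release_year   = c(1971, 2006, 2002, 2009, 1998, 2010, 1992, 2016, 2009),
  age_difference = c(52, 50, 49, 45, 45, 43, 42, 41, 40)
)

# Round each year down to the start of its decade, then add the "s" suffix.
sample_decade <- sample_rows %>%
  mutate(decade = paste0(floor(release_year / 10) * 10, "s"))

sample_decade
```

The case_when() version is more explicit and makes it easy to lump early years together, so which one reads better is a matter of taste.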

And plotting by decade instead of release year.

age_diff_decade %>%
  ggplot(aes(x = decade, y = age_difference)) +
  geom_jitter(width = 0.1, alpha = 0.5) 
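Because the number of rows differs so much between decades, a quick count and median per decade could be a useful companion to the jitter plot. The sketch below is an assumption about how that might look, not part of the original analysis: it uses the first nine rows from the glimpse() output as stand-in data, and the names sample_rows, decade_summary, n_rows and median_gap are made up for the example.

```r
library(dplyr)

# Stand-in data: first nine rows from the glimpse() output, not the full set.
sample_rows <- tibble(
  release_year   = c(1971, 2006, 2002, 2009, 1998, 2010, 1992, 2016, 2009),
  age_difference = c(52, 50, 49, 45, 45, 43, 42, 41, 40)
)

decade_summary <- sample_rows %>%
  # Arithmetic shortcut for the decade label, used here to keep the sketch short.
  mutate(decade = paste0(floor(release_year / 10) * 10, "s")) %>%
  group_by(decade) %>%
  summarise(
    n_rows     = n(),                     # how many rows fall in the decade
    median_gap = median(age_difference),  # typical age difference
    .groups    = "drop"
  )

decade_summary
```

With the real data, swapping sample_rows for age_diff_decade (and dropping the mutate() step) should give the same kind of table for every decade.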

I haven’t tried a raincloud plot in a while, and this might be a good use case. A raincloud plot combines raw points, a box plot, and a half violin to give a good idea of the distribution of the data.

A quick Google search turned up the ggrain package.

p1 <- age_diff_decade %>%
  filter(release_year < 1980) %>%
  ggplot(aes(x = decade, y = age_difference, fill = decade)) +
  geom_rain(alpha = .5, 
            boxplot.args.pos = list(
              width = .1, position = position_nudge(x = 0.2)),
            violin.args.pos = list(
              side = "r",
              width = 0.7, position = position_nudge(x = 0.3))) +
  theme_minimal() +
  easy_remove_legend() +
  scale_y_continuous(expand = c(0,0), limits = c(-.2, 55)) +
  labs(y = "Age difference", x = "Decade", 
       subtitle = "1930s - 1970s")

p1

p2 <- age_diff_decade %>%
  filter(release_year >= 1980) %>%
  ggplot(aes(x = decade, y = age_difference, fill = decade)) +
  geom_rain(alpha = .5, 
            boxplot.args.pos = list(
              width = .1, position = position_nudge(x = 0.2)),
            violin.args.pos = list(
              side = "r",
              width = 0.7, position = position_nudge(x = 0.3))) +
  theme_minimal() +
  easy_remove_legend() +
  scale_y_continuous(expand = c(0,0), limits = c(-.2, 55)) +
  labs(y = "Age difference", x = "Decade",  subtitle = "1980s - current")

p2
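The two plots above share every setting except the filter and the subtitle, so one option would be to wrap the shared parts in a small function. This is a sketch only: rain_by_decade() is a hypothetical helper that is not in the original post, and the stand-in data is the first nine rows from the glimpse() output with an arithmetic decade label.

```r
library(dplyr)
library(ggplot2)
library(ggrain)
library(ggeasy)

# Hypothetical helper: same geom_rain() settings as p1 and p2,
# with the subtitle passed in as an argument.
rain_by_decade <- function(data, subtitle) {
  ggplot(data, aes(x = decade, y = age_difference, fill = decade)) +
    geom_rain(alpha = .5,
              boxplot.args.pos = list(
                width = .1, position = position_nudge(x = 0.2)),
              violin.args.pos = list(
                side = "r",
                width = 0.7, position = position_nudge(x = 0.3))) +
    theme_minimal() +
    easy_remove_legend() +
    scale_y_continuous(expand = c(0, 0), limits = c(-.2, 55)) +
    labs(y = "Age difference", x = "Decade", subtitle = subtitle)
}

# Stand-in data: first nine rows from the glimpse() output.
sample_decade <- tibble(
  release_year   = c(1971, 2006, 2002, 2009, 1998, 2010, 1992, 2016, 2009),
  age_difference = c(52, 50, 49, 45, 45, 43, 42, 41, 40)
) %>%
  mutate(decade = paste0(floor(release_year / 10) * 10, "s"))

p_early <- sample_decade %>%
  filter(release_year < 1980) %>%
  rain_by_decade("1930s - 1970s")

p_late <- sample_decade %>%
  filter(release_year >= 1980) %>%
  rain_by_decade("1980s - current")
```

With the full data, passing age_diff_decade through the same two filters should reproduce p1 and p2.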

Here I am using the patchwork package to combine the plots.

p1 + 
  labs(title = "The age difference in years between movie love interests") +
p2 +
  labs(caption = "Data from https://hollywoodagegap.com/")
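In the code above, the title is attached to the first plot and the caption to the second. If a title and caption for the whole figure are preferred, patchwork's plot_annotation() is one way to do it. The sketch below uses two simple placeholder scatter plots (left and right are made-up names, built from the first nine rows of the glimpse() output) in place of p1 and p2.

```r
library(ggplot2)
library(patchwork)

# Stand-in data: first nine rows from the glimpse() output.
sample_rows <- data.frame(
  release_year   = c(1971, 2006, 2002, 2009, 1998, 2010, 1992, 2016, 2009),
  age_difference = c(52, 50, 49, 45, 45, 43, 42, 41, 40)
)

# Placeholder plots standing in for p1 and p2.
left <- ggplot(subset(sample_rows, release_year < 2005),
               aes(x = release_year, y = age_difference)) +
  geom_point()

right <- ggplot(subset(sample_rows, release_year >= 2005),
                aes(x = release_year, y = age_difference)) +
  geom_point()

# plot_annotation() sets the title and caption on the combined figure
# rather than on either individual panel.
combined <- left + right +
  plot_annotation(
    title   = "The age difference in years between movie love interests",
    caption = "Data from https://hollywoodagegap.com/"
  )

combined
```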