gutenberg

tidy tuesday week 22

Author

Jen Richmond

Published

June 3, 2025

The Tidy Tuesday data this week comes from the gutenbergr package, which pulls data about the ebooks and authors from Project Gutenberg.

load packages

Code

library(tidyverse)
library(tidytuesdayR)
library(janitor)
library(ggeasy)
library(patchwork)
library(treemapify)
library(ggtext)


# adjust year/week values here
year <- 2025
week <- 22

get the data

Code

tt <- tt_load(year, week)

---- Compiling #TidyTuesday Information for 2025-06-03 ----
--- There are 4 files available ---


── Downloading files ───────────────────────────────────────────────────────────

  1 of 4: "gutenberg_authors.csv"
  2 of 4: "gutenberg_languages.csv"
  3 of 4: "gutenberg_metadata.csv"
  4 of 4: "gutenberg_subjects.csv"

Code

authors <- tt[[1]]

languages <- tt[[2]]

metadata <- tt[[3]]

# remove tidy tuesday object
rm(tt)

Questions

1. How many different languages are available in the Project Gutenberg collection?

Note

Reminder: the dplyr::distinct function is useful for getting rid of duplicate rows. The base::unique function is similar.

janitor::get_dupes will pull the rows that have duplicate entries in a particular variable.

If we want to count the number of unique entries in a variable, we need dplyr::n_distinct.
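As a quick illustration of how these functions differ, here is a minimal sketch on a made-up toy table (the `toy` data frame and its values are hypothetical, not taken from the Gutenberg files):

```r
library(dplyr)
library(tibble)
library(janitor)

# Toy data (hypothetical): five books, three distinct languages
toy <- tibble(id = 1:5, language = c("en", "en", "fr", "fi", "fr"))

unique_rows <- distinct(toy, language)   # one row per unique language
unique_vals <- unique(toy$language)      # same values, as a character vector
dupes       <- get_dupes(toy, language)  # only rows whose language repeats
n_langs     <- n_distinct(toy$language)  # a single number: how many unique values
```

`distinct()` and `unique()` return the unique values themselves, `get_dupes()` returns the repeated rows (with a `dupe_count` column), and `n_distinct()` returns only the count.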

Code

glimpse(languages)

Rows: 76,205
Columns: 3
$ gutenberg_id    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ language        <chr> "en", "en", "en", "en", "en", "en", "en", "en", "en", …
$ total_languages <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

Code

# count unique language codes (named n_languages to avoid shadowing dplyr::distinct)
n_languages <- n_distinct(languages$language)

There are books in 70 languages represented in the Gutenberg data.

2. How many books are available in each language?

The metadata dataframe contains information about each book, including what language it is in. janitor::tabyl() counts how many books there are in each language; here I am displaying the top 10 languages.

Code

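books_per_language <- metadata %>%
  tabyl(language) %>%        # counts (n) and percentages per language
  select(language, n) %>%    # keep only the counts
  arrange(desc(n)) %>%       # most common languages first
  head(10)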

gt::gt(books_per_language)

language	n
en	63362
fr	4091
fi	3388
de	2365
nl	1186
it	1077
es	909
pt	671
hu	616
zh	443
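The same kind of top-n table can also be built with dplyr::count. This is a minimal sketch on hypothetical toy data (`toy_meta` is made up for illustration), shown next to the tabyl() approach for comparison:

```r
library(dplyr)
library(tibble)
library(janitor)

# Toy stand-in (hypothetical) for the metadata table
toy_meta <- tibble(language = c("en", "en", "en", "fr", "fr", "fi"))

# tabyl() approach, as in the post
via_tabyl <- toy_meta %>%
  tabyl(language) %>%
  select(language, n) %>%
  arrange(desc(n)) %>%
  head(2)

# count() approach: sort = TRUE orders by descending n
via_count <- toy_meta %>%
  count(language, sort = TRUE) %>%
  head(2)
```

Both give one row per language with its count; tabyl() additionally computes a percent column, which is dropped here by select().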

I don’t know what many of the language codes in that table mean, but I found a table of codes here. I used datapasta to get the codes into R and then joined them to the books_per_language dataframe.

Code

# codes is the lookup table pasted in with datapasta (columns: language, Name)
book_codes <- left_join(books_per_language, codes, by = "language") %>%
  rename(code = language, language = Name) %>%
  mutate(language = as.factor(language))


gt::gt(book_codes)

code	n	language
en	63362	English
fr	4091	French
fi	3388	Finnish
de	2365	German
nl	1186	Dutch
it	1077	Italian
es	909	Spanish
pt	671	Portuguese
hu	616	Hungarian
zh	443	Chinese
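The `codes` lookup table itself is not shown in the post. A minimal stand-in might look like the sketch below, assuming it has a `language` column holding the two-letter code and a `Name` column holding the full name (both column names are inferred from the join key and the rename; the values come from the table above). The `books_per_language` tibble here is a shortened copy for illustration only.

```r
library(dplyr)
library(tibble)

# Assumed structure of the pasted lookup table: code in `language`, name in `Name`
codes <- tribble(
  ~language, ~Name,
  "en", "English",
  "fr", "French",
  "fi", "Finnish",
  "de", "German",
  "nl", "Dutch",
  "it", "Italian",
  "es", "Spanish",
  "pt", "Portuguese",
  "hu", "Hungarian",
  "zh", "Chinese"
)

# Shortened copy of the counts table, for illustration only
books_per_language <- tribble(
  ~language, ~n,
  "en", 63362,
  "fr", 4091,
  "fi", 3388
)

# Join on the code, then swap the column names so `language` holds the full name
book_codes <- left_join(books_per_language, codes, by = "language") %>%
  rename(code = language, language = Name)
```

Because left_join() keeps every row of the first table, a code missing from the lookup would show up as NA in `language` rather than being dropped.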

plot

There are so many more books in English than in other languages that a treemap plot might work well here. I am referring back to my 30 Day Chart Challenge code, which uses the treemapify package.

Code

palette <- c(
  "#59C7EBFF", "#CCEEF9FF", "#FFB8ACFF", "#FEE2DDFF", "#0AA398FF",
  "#71D1CCFF", "#ECA0B2FF", "#F3BFCBFF", "#B8BCC1FF", "#E1E2E5FF"
)

book_codes %>%
  ggplot(aes(area = n, fill = language, label = paste(language, n, sep = "\n"))) +
  geom_treemap(colour = "white") +
  scale_fill_manual(values = palette) +
  geom_treemap_text(colour = "black",
                    place = "topleft",
                    size = 5,
                    grow = FALSE) + # option from ggfittext to NOT make font fit box
  easy_remove_legend() +
  labs(title = "Number of books in the Gutenberg database by language",
       subtitle = "Top 10 languages",
       caption = "Data source `gutenbergr` package Tidy Tuesday") +
  theme(text = element_text(family = "Karla"),
        plot.background = element_rect(fill = "antiquewhite")) +
  easy_caption_size(8)
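To put a number on how much English dominates the treemap, the counts can be converted to shares. This is a rough sketch using the top 10 counts from the table above; note that the denominator is the top 10 total only, not the whole collection, and the `top10` and `shares` names are just for illustration.

```r
library(dplyr)
library(tibble)

# Counts copied from the top 10 table above
top10 <- tribble(
  ~language, ~n,
  "English", 63362,
  "French", 4091,
  "Finnish", 3388,
  "German", 2365,
  "Dutch", 1186,
  "Italian", 1077,
  "Spanish", 909,
  "Portuguese", 671,
  "Hungarian", 616,
  "Chinese", 443
)

# Share of each language within the top 10 (not the full collection)
shares <- top10 %>%
  mutate(share = n / sum(n))

english_share <- shares$share[shares$language == "English"]
```

English works out to roughly 81% of the books in these ten languages, which is why its tile takes up most of the plot area.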

3. Do any authors appear under more than one gutenberg_author_id?

Code

dups <- authors %>%
  get_dupes(author)

dup_authors <- n_distinct(dups$author)

There are 119 authors in the dataset who are listed under more than one gutenberg_author_id.
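The reason for using n_distinct() rather than counting rows is that get_dupes() returns one row per duplicate entry, so an author with two ids contributes two rows. A small sketch on hypothetical toy data (`toy_authors` and its ids are made up for illustration):

```r
library(dplyr)
library(tibble)
library(janitor)

# Toy authors table (hypothetical ids): one name appears under two ids
toy_authors <- tibble(
  gutenberg_author_id = c(1, 2, 3, 4),
  author = c("Austen, Jane", "Austen, Jane", "Twain, Mark", "Dickens, Charles")
)

toy_dups <- toy_authors %>%
  get_dupes(author)                           # two rows, both for the repeated name

n_dup_rows    <- nrow(toy_dups)               # counts duplicate entries
n_dup_authors <- n_distinct(toy_dups$author)  # counts duplicated authors
```

Here `n_dup_rows` is 2 but `n_dup_authors` is 1, and the second number is the one that answers the question.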

4. When were most of the Gutenberg books written?

Here I am joining the authors dataframe to the metadata to add each author's birthdate and deathdate. I am also adding a new column to distinguish between authors born in the era of the printing press (1500 onwards) and those born earlier.

Code

meta_authors <- left_join(metadata, authors, by = "gutenberg_author_id")

meta_authors <- meta_authors %>%
  select(gutenberg_id, title, author = author.x, gutenberg_author_id, alias,
         birthdate, deathdate, language, wikipedia, gutenberg_bookshelf,
         rights, has_text) %>%
  mutate(timepoint = case_when(birthdate < 1500 ~ "ancient",
                               birthdate >= 1500 ~ "modern"))
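Two details of this step are easy to miss: both tables have an `author` column, so the join produces `author.x` and `author.y`, and case_when() returns NA when no condition matches, which is what happens for authors with a missing birthdate. A minimal sketch on made-up toy tables (all names and values below are illustrative, not real Gutenberg records):

```r
library(dplyr)
library(tibble)

# Toy stand-ins (hypothetical values) for the metadata and authors tables
toy_metadata <- tibble(
  gutenberg_id = c(1, 2, 3),
  author = c("Author A", "Author B", "Author C"),
  gutenberg_author_id = c(10, 20, 30)
)
toy_authors <- tibble(
  gutenberg_author_id = c(10, 20, 30),
  author = c("Author A", "Author B", "Author C"),
  birthdate = c(-400, 1775, NA)   # negative years stand for BCE; NA = unknown
)

# The shared `author` column is not a join key, so it gets .x / .y suffixes
joined <- left_join(toy_metadata, toy_authors, by = "gutenberg_author_id")

toy_meta_authors <- joined %>%
  select(gutenberg_id, author = author.x, birthdate) %>%
  mutate(timepoint = case_when(birthdate < 1500 ~ "ancient",
                               birthdate >= 1500 ~ "modern"))
```

The third row ends up with `timepoint` equal to NA, so books by authors without a recorded birthdate drop out of both histograms once the data is filtered on `timepoint`.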

plot

Code

a <- meta_authors %>%
  filter(timepoint == "ancient") %>%
  ggplot(aes(x = birthdate)) +
  geom_histogram(binwidth = 100) +
  theme_minimal() +
  scale_y_continuous(limits = c(0,150)) +
  labs(subtitle = "Pre-Modern", y = "Number of books", x = "Author birthdate") 



m <- meta_authors %>%
  filter(timepoint == "modern") %>%
  ggplot(aes(x = birthdate)) +
  geom_histogram(binwidth = 20) +
  theme_minimal() +
  scale_y_continuous(limits = c(0,20000)) +
  labs(subtitle = "Modern", y = "Number of books", x = "Author birthdate") 

a + m +
  plot_annotation(title = "Project Gutenberg: Books by historical period", subtitle = "The majority of books in the Gutenberg database were written by 19th-century \nauthors; however, there are also books from Ancient Greek and Roman \nliterature and the Medieval period")