library(readxl)
library(tidyverse)
library(here)
I have mutliple excel files that live in a data folder. I want to read them all into R, adding a filename variable to each. I found the original solution to the problem at this stackflow link
# define a path where the data lives
file_path <- here("data")
# get a list of files at that file_path
file_list <- list.files(path = file_path, pattern='*.xlsx')
# use map to read_excel all of those files, adding a file_name variable to each one
list_raw <- file_list %>%
map(~ read_excel(path = .x, range = "A1:C5") %>%
mutate("file_name" = .x))
#> Error: `path` does not exist: 'fruitdata1.xlsx'
This error had me stumped so I reproduced the problem using some dummy fruit data and made a git repo. Jenny Bryan VERY kindly came up with lots of different options for solving the problem. I know MUCH more about working with file paths as a result. I’ve added my two cents and Q and A to Jenny’s solutions below in case these options are useful to anyone else. Thanks again, Jenny- you are the best!
Here’s a troubleshooting chunk that reveals the working directory and file_list
. The path problem is going to be that list.files()
, by default, only returns file names. But you need a path relative to whatever the working directory will be at render time or, even better, a resilient path determined at render time by here.
getwd()
#> [1] "/Users/jenny/Desktop/map_excel_reprex"
file_list
#> [1] "fruitdata1.xlsx" "fruitdata2.xlsx" "fruitdata3.xlsx" "fruitdata4.xlsx"
file.exists(file_list)
#> [1] FALSE FALSE FALSE FALSE
Be explicit about the path, using path = here("data", .x)
inside read_excel()
file_list %>%
map(~ read_excel(path = here("data", .x), range = "A1:C5") %>%
mutate("file_name" = .x))
#> [[1]]
#> # A tibble: 4 x 4
#> fruit colour votes file_name
#> <chr> <chr> <dbl> <chr>
#> 1 banana yellow 3635 fruitdata1.xlsx
#> 2 orange orange 3567 fruitdata1.xlsx
#> 3 grapes green 3564 fruitdata1.xlsx
#> 4 strawberries red 3367 fruitdata1.xlsx
#>
#> [[2]]
#> # A tibble: 4 x 4
#> fruit colour votes file_name
#> <chr> <chr> <dbl> <chr>
#> 1 banana yellow 7578 fruitdata2.xlsx
#> 2 orange orange 8546 fruitdata2.xlsx
#> 3 grapes green 6478 fruitdata2.xlsx
#> 4 strawberries red 4674 fruitdata2.xlsx
#>
#> [[3]]
#> # A tibble: 4 x 4
#> fruit colour votes file_name
#> <chr> <chr> <dbl> <chr>
#> 1 banana yellow 4345 fruitdata3.xlsx
#> 2 orange orange 2456 fruitdata3.xlsx
#> 3 grapes green 3564 fruitdata3.xlsx
#> 4 strawberries red 5647 fruitdata3.xlsx
#>
#> [[4]]
#> # A tibble: 4 x 4
#> fruit colour votes file_name
#> <chr> <chr> <dbl> <chr>
#> 1 banana yellow 5684 fruitdata4.xlsx
#> 2 orange orange 6754 fruitdata4.xlsx
#> 3 grapes green 3423 fruitdata4.xlsx
#> 4 strawberries red 5674 fruitdata4.xlsx
Use here("data", .x)
on file_list
on the way in. Now you’ll want to use basename()
on the paths before adding to the data frame.
file_list %>%
here("data", .) %>%
map(~ read_excel(path = .x, range = "A1:C5") %>%
mutate("file_name" = basename(.x)))
#> [[1]]
#> # A tibble: 4 x 4
#> fruit colour votes file_name
#> <chr> <chr> <dbl> <chr>
#> 1 banana yellow 3635 fruitdata1.xlsx
#> 2 orange orange 3567 fruitdata1.xlsx
#> 3 grapes green 3564 fruitdata1.xlsx
#> 4 strawberries red 3367 fruitdata1.xlsx
#>
#> [[2]]
#> # A tibble: 4 x 4
#> fruit colour votes file_name
#> <chr> <chr> <dbl> <chr>
#> 1 banana yellow 7578 fruitdata2.xlsx
#> 2 orange orange 8546 fruitdata2.xlsx
#> 3 grapes green 6478 fruitdata2.xlsx
#> 4 strawberries red 4674 fruitdata2.xlsx
#>
#> [[3]]
#> # A tibble: 4 x 4
#> fruit colour votes file_name
#> <chr> <chr> <dbl> <chr>
#> 1 banana yellow 4345 fruitdata3.xlsx
#> 2 orange orange 2456 fruitdata3.xlsx
#> 3 grapes green 3564 fruitdata3.xlsx
#> 4 strawberries red 5647 fruitdata3.xlsx
#>
#> [[4]]
#> # A tibble: 4 x 4
#> fruit colour votes file_name
#> <chr> <chr> <dbl> <chr>
#> 1 banana yellow 5684 fruitdata4.xlsx
#> 2 orange orange 6754 fruitdata4.xlsx
#> 3 grapes green 3423 fruitdata4.xlsx
#> 4 strawberries red 5674 fruitdata4.xlsx
Q: What does basename do?
A: removes all of the path up to and including the last path separator (if any). Without it, the filename column ends up having the whole path listed (/Users/jenny/Desktop/map_excel_reprex/data/fruitdata1.xlsx)
Use list.files(full.names = TRUE)
when capturing the file paths. Again, use basename()
on the paths before adding to the data frame.
Remember- here is how I made the first file_list. It is just a list of file names.
file_list1 <- list.files(path = file_path, pattern='*.xlsx')
file_list1
#> [1] "fruitdata1.xlsx" "fruitdata2.xlsx" "fruitdata3.xlsx" "fruitdata4.xlsx"
Here is how JennyB suggests you make a file list. Specifying full.names = TRUE gets the whole path.
file_list2 <- list.files(path = here("data"), pattern='*.xlsx', full.names = TRUE)
file_list2
#> [1] "/Users/jenny/Desktop/map_excel_reprex/data/fruitdata1.xlsx"
#> [2] "/Users/jenny/Desktop/map_excel_reprex/data/fruitdata2.xlsx"
#> [3] "/Users/jenny/Desktop/map_excel_reprex/data/fruitdata3.xlsx"
#> [4] "/Users/jenny/Desktop/map_excel_reprex/data/fruitdata4.xlsx"
Using the file_list2 (with the whole path) and then basename (to strip the bit of the path you dont need in the filename column).
file_list2 %>%
map(~ read_excel(path = .x, range = "A1:C5") %>%
mutate("file_name" = basename(.x)))
#> [[1]]
#> # A tibble: 4 x 4
#> fruit colour votes file_name
#> <chr> <chr> <dbl> <chr>
#> 1 banana yellow 3635 fruitdata1.xlsx
#> 2 orange orange 3567 fruitdata1.xlsx
#> 3 grapes green 3564 fruitdata1.xlsx
#> 4 strawberries red 3367 fruitdata1.xlsx
#>
#> [[2]]
#> # A tibble: 4 x 4
#> fruit colour votes file_name
#> <chr> <chr> <dbl> <chr>
#> 1 banana yellow 7578 fruitdata2.xlsx
#> 2 orange orange 8546 fruitdata2.xlsx
#> 3 grapes green 6478 fruitdata2.xlsx
#> 4 strawberries red 4674 fruitdata2.xlsx
#>
#> [[3]]
#> # A tibble: 4 x 4
#> fruit colour votes file_name
#> <chr> <chr> <dbl> <chr>
#> 1 banana yellow 4345 fruitdata3.xlsx
#> 2 orange orange 2456 fruitdata3.xlsx
#> 3 grapes green 3564 fruitdata3.xlsx
#> 4 strawberries red 5647 fruitdata3.xlsx
#>
#> [[4]]
#> # A tibble: 4 x 4
#> fruit colour votes file_name
#> <chr> <chr> <dbl> <chr>
#> 1 banana yellow 5684 fruitdata4.xlsx
#> 2 orange orange 6754 fruitdata4.xlsx
#> 3 grapes green 3423 fruitdata4.xlsx
#> 4 strawberries red 5674 fruitdata4.xlsx
The fs
package has useful functions for dealing with path issues.
Q: What does
fs
stand for?
A: File systems; the
fs
package is a cross-platform interface to file system operations
Use fs::dir_ls()
instead of list.files()
, which IMO has better defaults than list.files()
. Now we’ll switch to fs::path_file()
to shorten the paths before adding to the data frame.
library(fs)
file_list_fs <- dir_ls(here("data"), glob = "*.xlsx")
file_list_fs
#> /Users/jenny/Desktop/map_excel_reprex/data/fruitdata1.xlsx
#> /Users/jenny/Desktop/map_excel_reprex/data/fruitdata2.xlsx
#> /Users/jenny/Desktop/map_excel_reprex/data/fruitdata3.xlsx
#> /Users/jenny/Desktop/map_excel_reprex/data/fruitdata4.xlsx
Q: what is “glob”?
A: glob patterns specify sets of filenames with wildcard characters, see [Wikipedia](https://en.wikipedia.org/wiki/Glob_(programming)
file_list_fs %>%
map(~ read_excel(path = .x, range = "A1:C5") %>%
mutate("file_name" = fs::path_file(.x)))
#> $`/Users/jenny/Desktop/map_excel_reprex/data/fruitdata1.xlsx`
#> # A tibble: 4 x 4
#> fruit colour votes file_name
#> <chr> <chr> <dbl> <chr>
#> 1 banana yellow 3635 fruitdata1.xlsx
#> 2 orange orange 3567 fruitdata1.xlsx
#> 3 grapes green 3564 fruitdata1.xlsx
#> 4 strawberries red 3367 fruitdata1.xlsx
#>
#> $`/Users/jenny/Desktop/map_excel_reprex/data/fruitdata2.xlsx`
#> # A tibble: 4 x 4
#> fruit colour votes file_name
#> <chr> <chr> <dbl> <chr>
#> 1 banana yellow 7578 fruitdata2.xlsx
#> 2 orange orange 8546 fruitdata2.xlsx
#> 3 grapes green 6478 fruitdata2.xlsx
#> 4 strawberries red 4674 fruitdata2.xlsx
#>
#> $`/Users/jenny/Desktop/map_excel_reprex/data/fruitdata3.xlsx`
#> # A tibble: 4 x 4
#> fruit colour votes file_name
#> <chr> <chr> <dbl> <chr>
#> 1 banana yellow 4345 fruitdata3.xlsx
#> 2 orange orange 2456 fruitdata3.xlsx
#> 3 grapes green 3564 fruitdata3.xlsx
#> 4 strawberries red 5647 fruitdata3.xlsx
#>
#> $`/Users/jenny/Desktop/map_excel_reprex/data/fruitdata4.xlsx`
#> # A tibble: 4 x 4
#> fruit colour votes file_name
#> <chr> <chr> <dbl> <chr>
#> 1 banana yellow 5684 fruitdata4.xlsx
#> 2 orange orange 6754 fruitdata4.xlsx
#> 3 grapes green 3423 fruitdata4.xlsx
#> 4 strawberries red 5674 fruitdata4.xlsx
If data
held only Excel files you could use fs::dir_map()
. But it won’t work as long as that .Rmd
is in there.
foo <- function(full_path) {
read_excel(path = full_path, range = "A1:C5") %>%
mutate("file_name" = path_file(full_path))
}
dir_map("data", foo)
I suspect you will eventually want to row bind these data frames? In that case, I think the most pleasing work flow is with purrr::map_dfr()
. If your paths are well-named from the start, its id
argument accomplishes what you doing with mutate()
.
# define list of files including path
file_list_fs <- dir_ls(here("data"), glob = "*.xlsx")
# pull just the filenames from that list of paths using path_file()
names(file_list_fs) <- path_file(file_list_fs)
file_list_fs %>%
map_dfr(~ read_excel(path = .x, range = "A1:C5"), .id = "file_name")
#> # A tibble: 16 x 4
#> file_name fruit colour votes
#> <chr> <chr> <chr> <dbl>
#> 1 fruitdata1.xlsx banana yellow 3635
#> 2 fruitdata1.xlsx orange orange 3567
#> 3 fruitdata1.xlsx grapes green 3564
#> 4 fruitdata1.xlsx strawberries red 3367
#> 5 fruitdata2.xlsx banana yellow 7578
#> 6 fruitdata2.xlsx orange orange 8546
#> 7 fruitdata2.xlsx grapes green 6478
#> 8 fruitdata2.xlsx strawberries red 4674
#> 9 fruitdata3.xlsx banana yellow 4345
#> 10 fruitdata3.xlsx orange orange 2456
#> 11 fruitdata3.xlsx grapes green 3564
#> 12 fruitdata3.xlsx strawberries red 5647
#> 13 fruitdata4.xlsx banana yellow 5684
#> 14 fruitdata4.xlsx orange orange 6754
#> 15 fruitdata4.xlsx grapes green 3423
#> 16 fruitdata4.xlsx strawberries red 5674
Q: I know that map_df() returns a dataframe, what does map_dfr() do?
A: Requires dplyr to be installed and creates a df by row binding. There is also a map_dfc() function that does the same binding by columns. More info here