4. Reading and Writing Data
In order to use Julia for research computing projects, you’ll need to know how to read and write data. Julia provides functions to read and write bytes and text, while community-contributed packages provide functions to handle a wide variety of structured data formats.
4.2. Example: Reading a CSV File
The U.S. Bureau of Transportation Statistics publishes and regularly updates the Airline On-Time Performance Data Set, which includes departure and arrival times for all domestic flights since 1987. The data set is distributed as a collection of comma-separated value (CSV) files, with one for each month-year combination.
Important
You can download a zipped subset of the data set here.
Let’s try reading the data set into Julia. There’s no built-in function to read CSV files, but the CSV.jl package provides one. The CSV format is tabular, so let’s read the data as a data frame (a tabular data structure). In Julia, the DataFrames.jl package provides data frames. Install both packages (if you haven’t yet), and then load them:
# Install the packages (only needed once)
using Pkg
Pkg.add("CSV")
Pkg.add("DataFrames")
# Load the packages
using CSV
using DataFrames
You can read a CSV file with the CSV.read function:
path = "data/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2023_1.csv"
air = CSV.read(path, DataFrame)
typeof(air)
DataFrame
The second argument is a sink function, a function that converts raw tabular data to a specific Julia type. The sink function determines the type of the value CSV.read returns. In this case, the returned value is a DataFrame.
Note
The Tables.jl package defines a standard programming interface for working with tabular data in Julia. The CSV.jl and DataFrames.jl packages both use this interface, as do many other Julia packages.
Any type that satisfies the Tables.jl interface is called a source. For example, the CSV.jl package’s CSV.File type is a source, although the CSV.read function hides this detail. Any function that can take a source instance as input and return a table of a specific type is called a sink. The DataFrames.jl package’s DataFrame function is a sink. Coincidentally, the DataFrame type is also a source (this facilitates transformations of data frames).
The CSV.jl package also provides a second, two-step way to read a CSV file: create a CSV.File source, then pass it to a sink yourself:
source = CSV.File(path)
# air = DataFrame(source)
Compared to CSV.read, this is less efficient for reading a CSV as a data frame: the DataFrame function copies the columns of the source by default, so the data is briefly held in memory twice.
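As a small illustration of the two approaches, the sketch below parses the same CSV text both ways and checks that the results match. The inline csv_text data is made up for this example (it is not part of the airline data set), and wrapping it in an IOBuffer is just a convenience so that no file on disk is needed. It assumes CSV.jl and DataFrames.jl are installed.

```julia
using CSV
using DataFrames

# Hypothetical inline data; the empty field in the last row is parsed as `missing`.
csv_text = """
carrier,flight,delay
9E,4628,-3.0
AA,100,12.5
DL,200,
"""

# One step: CSV.read parses the text and hands the columns to the sink.
direct = CSV.read(IOBuffer(csv_text), DataFrame)

# Two steps: build a CSV.File source, then pass it to the DataFrame sink.
source = CSV.File(IOBuffer(csv_text))
two_step = DataFrame(source)

# Compare with isequal rather than ==, because == returns `missing`
# when the tables contain missing values.
same_contents = isequal(direct, two_step)
```

Both calls produce a 3-row, 3-column data frame. The difference is in how the columns get there, not in what you end up with.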
You can use the first function to get just the first few rows of a data frame:
first(air, 5)
Row | Year | Quarter | Month | DayofMonth | DayOfWeek | FlightDate | Reporting_Airline | DOT_ID_Reporting_Airline | IATA_CODE_Reporting_Airline | Tail_Number | Flight_Number_Reporting_Airline | OriginAirportID | OriginAirportSeqID | OriginCityMarketID | Origin | OriginCityName | OriginState | OriginStateFips | OriginStateName | OriginWac | DestAirportID | DestAirportSeqID | DestCityMarketID | Dest | DestCityName | DestState | DestStateFips | DestStateName | DestWac | CRSDepTime | DepTime | DepDelay | DepDelayMinutes | DepDel15 | DepartureDelayGroups | DepTimeBlk | TaxiOut | WheelsOff | WheelsOn | TaxiIn | CRSArrTime | ArrTime | ArrDelay | ArrDelayMinutes | ArrDel15 | ArrivalDelayGroups | ArrTimeBlk | Cancelled | CancellationCode | Diverted | CRSElapsedTime | ActualElapsedTime | AirTime | Flights | Distance | DistanceGroup | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | FirstDepTime | TotalAddGTime | LongestAddGTime | DivAirportLandings | DivReachedDest | DivActualElapsedTime | DivArrDelay | DivDistance | Div1Airport | Div1AirportID | Div1AirportSeqID | Div1WheelsOn | Div1TotalGTime | Div1LongestGTime | Div1WheelsOff | Div1TailNum | Div2Airport | Div2AirportID | Div2AirportSeqID | Div2WheelsOn | Div2TotalGTime | Div2LongestGTime | Div2WheelsOff | Div2TailNum | Div3Airport | Div3AirportID | Div3AirportSeqID | Div3WheelsOn | Div3TotalGTime | Div3LongestGTime | Div3WheelsOff | Div3TailNum | Div4Airport | Div4AirportID | Div4AirportSeqID | Div4WheelsOn | Div4TotalGTime | Div4LongestGTime | Div4WheelsOff | ⋯ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 | Date | String3 | Int64 | String3 | String7 | Int64 | Int64 | Int64 | Int64 | String3 | String | String3 | Int64 | String | Int64 | Int64 | Int64 | Int64 | String3 | String | String3 | Int64 | String | Int64 | Int64 | Int64? | Float64? | Float64? | Float64? | Int64? | String15 | Float64? | Int64? | Int64? | Float64? | Int64 | Int64? | Float64? | Float64? | Float64? | Int64? | String15 | Float64 | String3? | Float64 | Float64? | Float64? | Float64? | Float64 | Float64 | Int64 | Float64? | Float64? | Float64? | Float64? | Float64? | Int64? | Float64? | Float64? | Int64 | Float64? | Float64? | Float64? | Float64? | String3? | Int64? | Int64? | Int64? | Float64? | Float64? | Int64? | String7? | String3? | Int64? | Int64? | Int64? | Float64? | Float64? | Int64? | String7? | Missing | Missing | Missing | Missing | Missing | Missing | Missing | Missing | Missing | Missing | Missing | Missing | Missing | Missing | Missing | ⋯ |
1 | 2023 | 1 | 1 | 2 | 1 | 2023-01-02 | 9E | 20363 | 9E | N605LR | 4628 | 10529 | 1052907 | 30529 | BDL | Hartford, CT | CT | 9 | Connecticut | 11 | 12953 | 1295304 | 31703 | LGA | New York, NY | NY | 36 | New York | 22 | 800 | 757 | -3.0 | 0.0 | 0.0 | -1 | 0800-0859 | 11.0 | 808 | 833 | 20.0 | 905 | 853 | -12.0 | 0.0 | 0.0 | -1 | 0900-0959 | 0.0 | missing | 0.0 | 65.0 | 56.0 | 25.0 | 1.0 | 101.0 | 1 | missing | missing | missing | missing | missing | missing | missing | missing | 0 | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | ⋯ |
2 | 2023 | 1 | 1 | 3 | 2 | 2023-01-03 | 9E | 20363 | 9E | N605LR | 4628 | 10529 | 1052907 | 30529 | BDL | Hartford, CT | CT | 9 | Connecticut | 11 | 12953 | 1295304 | 31703 | LGA | New York, NY | NY | 36 | New York | 22 | 800 | 755 | -5.0 | 0.0 | 0.0 | -1 | 0800-0859 | 19.0 | 814 | 851 | 6.0 | 905 | 857 | -8.0 | 0.0 | 0.0 | -1 | 0900-0959 | 0.0 | missing | 0.0 | 65.0 | 62.0 | 37.0 | 1.0 | 101.0 | 1 | missing | missing | missing | missing | missing | missing | missing | missing | 0 | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | ⋯ |
3 | 2023 | 1 | 1 | 4 | 3 | 2023-01-04 | 9E | 20363 | 9E | N331PQ | 4628 | 10529 | 1052907 | 30529 | BDL | Hartford, CT | CT | 9 | Connecticut | 11 | 12953 | 1295304 | 31703 | LGA | New York, NY | NY | 36 | New York | 22 | 800 | 755 | -5.0 | 0.0 | 0.0 | -1 | 0800-0859 | 14.0 | 809 | 837 | 7.0 | 905 | 844 | -21.0 | 0.0 | 0.0 | -2 | 0900-0959 | 0.0 | missing | 0.0 | 65.0 | 49.0 | 28.0 | 1.0 | 101.0 | 1 | missing | missing | missing | missing | missing | missing | missing | missing | 0 | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | ⋯ |
4 | 2023 | 1 | 1 | 5 | 4 | 2023-01-05 | 9E | 20363 | 9E | N906XJ | 4628 | 10529 | 1052907 | 30529 | BDL | Hartford, CT | CT | 9 | Connecticut | 11 | 12953 | 1295304 | 31703 | LGA | New York, NY | NY | 36 | New York | 22 | 800 | 754 | -6.0 | 0.0 | 0.0 | -1 | 0800-0859 | 13.0 | 807 | 845 | 3.0 | 905 | 848 | -17.0 | 0.0 | 0.0 | -2 | 0900-0959 | 0.0 | missing | 0.0 | 65.0 | 54.0 | 38.0 | 1.0 | 101.0 | 1 | missing | missing | missing | missing | missing | missing | missing | missing | 0 | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | ⋯ |
5 | 2023 | 1 | 1 | 6 | 5 | 2023-01-06 | 9E | 20363 | 9E | N337PQ | 4628 | 10529 | 1052907 | 30529 | BDL | Hartford, CT | CT | 9 | Connecticut | 11 | 12953 | 1295304 | 31703 | LGA | New York, NY | NY | 36 | New York | 22 | 800 | 759 | -1.0 | 0.0 | 0.0 | -1 | 0800-0859 | 17.0 | 816 | 844 | 5.0 | 905 | 849 | -16.0 | 0.0 | 0.0 | -2 | 0900-0959 | 0.0 | missing | 0.0 | 65.0 | 50.0 | 28.0 | 1.0 | 101.0 | 1 | missing | missing | missing | missing | missing | missing | missing | missing | 0 | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | missing | ⋯ |
The second argument, 5, is the number of rows to return.
The describe function is also useful for inspecting data frames. It provides a statistical summary of each column:
describe(air)
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Any | Any | Int64 | Type |
1 | Year | 2023.0 | 2023 | 2023.0 | 2023 | 0 | Int64 |
2 | Quarter | 1.0 | 1 | 1.0 | 1 | 0 | Int64 |
3 | Month | 1.0 | 1 | 1.0 | 1 | 0 | Int64 |
4 | DayofMonth | 16.0954 | 1 | 16.0 | 31 | 0 | Int64 |
5 | DayOfWeek | 3.89027 | 1 | 4.0 | 7 | 0 | Int64 |
6 | FlightDate | | 2023-01-01 | 2023-01-16 | 2023-01-31 | 0 | Date |
7 | Reporting_Airline | | 9E | | YX | 0 | String3 |
8 | DOT_ID_Reporting_Airline | 19944.9 | 19393 | 19930.0 | 20452 | 0 | Int64 |
9 | IATA_CODE_Reporting_Airline | | 9E | | YX | 0 | String3 |
10 | Tail_Number | | | | N999JQ | 0 | String7 |
11 | Flight_Number_Reporting_Airline | 2201.93 | 1 | 1909.0 | 9887 | 0 | Int64 |
12 | OriginAirportID | 12653.2 | 10135 | 12889.0 | 16869 | 0 | Int64 |
13 | OriginAirportSeqID | 1.26532e6 | 1013506 | 1.2889e6 | 1686902 | 0 | Int64 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
99 | Div4LongestGTime | | | | | 538837 | Missing |
100 | Div4WheelsOff | | | | | 538837 | Missing |
101 | Div4TailNum | | | | | 538837 | Missing |
102 | Div5Airport | | | | | 538837 | Missing |
103 | Div5AirportID | | | | | 538837 | Missing |
104 | Div5AirportSeqID | | | | | 538837 | Missing |
105 | Div5WheelsOn | | | | | 538837 | Missing |
106 | Div5TotalGTime | | | | | 538837 | Missing |
107 | Div5LongestGTime | | | | | 538837 | Missing |
108 | Div5WheelsOff | | | | | 538837 | Missing |
109 | Div5TailNum | | | | | 538837 | Missing |
110 | Column110 | | | | | 538837 | Missing |
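To experiment with these inspection functions without loading the full data set, here is a minimal sketch on a small, made-up data frame (the flights table and its values are hypothetical). It also shows that describe accepts a list of statistic names if you only want some of the summary columns, and that last is the counterpart of first.

```julia
using DataFrames

# Hypothetical data: ten flights, one of which has a missing delay.
flights = DataFrame(
    id = 1:10,
    delay = [5.0, -2.0, missing, 0.0, 12.5, 3.0, -1.0, 7.5, 0.0, 20.0],
)

top_rows = first(flights, 3)      # the first 3 rows
bottom_rows = last(flights, 2)    # the last 2 rows

# Request only selected statistics; missing values are skipped
# when the statistics are computed.
stats = describe(flights, :mean, :nmissing, :eltype)
```

The stats data frame has one row per column of flights, just like the full summary above, but only the requested statistics.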
We’ll learn more about working with data frames in Section 5.
4.3. Structured Data
Functions to read and write structured file formats are generally provided by packages rather than built into Julia. Here are a few (find more by searching online):
Format | Extension | Package |
---|---|---|
Delimited File | | |
Fixed-width File | | Planned for CSV.jl |
Geospatial Vector Data | | |
HDF5 | | |
JavaScript Object Notation | | |
Images | | |
Geospatial Raster Data | | |
Microsoft Excel | | |
Extensible Markup Language | | |
Often there’s more than one package with functions to read and write a specific format. We chose the packages in the table based on popularity and signs of active development or maintenance.
4.4. Text and Bytes
Sometimes you might need to read and write text or bytes directly. For example, you might need to work with a file that has an obscure or custom format.
4.4.1. Writing Data
Julia’s built-in open function opens a file. Let’s open a file called hello.txt:
file = open("hello.txt", "w")
IOStream(<file hello.txt>)
The second argument specifies whether to open the file in read mode (the default, or "r") or write mode ("w"). In this case, we opened the file in write mode.
You can use the print, println, or write function to write to a file. Try it out with a few lines:
write(file, "Don't worry, be happy!\n")
23
println(file, "Hello world!")
print(file, "Print?\n")
Important
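The print and println functions write the text representation of an object. Julia strings are UTF-8 encoded, so the text is written as UTF-8 bytes; the open function has no option to select a different encoding.
The write function writes the raw binary representation of an object. For a string, these are the same UTF-8 bytes that print would write, but for other types the output differs: print(file, 1) writes the single character 1, while write(file, 1) writes the bytes of the integer itself.
In general, use print and println to write text to files, and use write to write bytes to files.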
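The difference between print and write is easiest to see with a value that is not a string. The sketch below uses an in-memory IOBuffer in place of a file (an assumption made here so that nothing is written to disk); the same calls work on a file opened in write mode.

```julia
io = IOBuffer()

# print writes the text representation: the characters '4' and '2'.
print(io, 42)
text_bytes = take!(io)    # take! returns the bytes written so far and empties the buffer

# write writes the binary representation: all 8 bytes of the Int64.
write(io, Int64(42))
raw_bytes = take!(io)

# For strings, write returns the number of bytes written, not characters.
n = write(io, "héllo")    # 'é' occupies 2 bytes in UTF-8, so n is 6
```

Here text_bytes holds 2 bytes and raw_bytes holds 8, even though both calls were given the same number.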
Make sure to use the close function to close files when you’re finished using them. When writing to a file, this ensures that all of the writes are completed.
close(file)
It’s easy to forget to close files, but fortunately Julia provides syntactic sugar to close files automatically: the do block.
4.4.2. do Blocks
A do block is syntactic sugar for defining an anonymous function and passing it as the first argument to another function. As an example, consider this call with an anonymous function as the first argument:
map(x -> x^2, [1, 2, 3])
3-element Vector{Int64}:
1
4
9
The equivalent using a do block is:
map([1, 2, 3]) do x
    x^2
end
3-element Vector{Int64}:
1
4
9
Caution
Since a do block defines an anonymous function, variables you define inside of the block are not visible from outside of the block.
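A short sketch of what this means in practice, assuming no variable named y already exists in the surrounding code (the names squares, y, and y_visible are made up for this example):

```julia
# The value of a do block is the return value of the anonymous function,
# so assign the whole call to a variable to get results out.
squares = map([1, 2, 3]) do x
    y = x^2      # y is local to the do block
    y            # the last expression is the block's return value
end

# Outside the block, y does not exist.
y_visible = @isdefined(y)
```

After this runs, squares holds the results, while y_visible is false because y never left the block.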
The open function has a method that takes a function as its first argument. The file is opened, the function is called on the open file, the file is closed, and the result is returned. This approach is safer than manually closing the file with close, because the open function ensures that the file is always closed, even if something goes wrong in the computation.
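To make the open-call-close sequence concrete, here is a simplified sketch of how such a method could be written. The with_file function is a hypothetical stand-in written for this example, not Julia’s actual implementation of open, and the file path is a temporary file created just for the demonstration.

```julia
# A simplified, hypothetical stand-in for open(f, path, mode).
function with_file(f, path, mode)
    io = open(path, mode)
    try
        return f(io)      # run the caller's code on the open file
    finally
        close(io)         # runs even if f throws an error
    end
end

path = joinpath(mktempdir(), "notes.txt")   # temporary file for the demo

# Because the function is the first argument, do-block syntax works.
nbytes = with_file(path, "w") do io
    write(io, "first line\n")
end

# The file is closed even when the block fails partway through.
handle = Ref{Any}(nothing)
try
    with_file(path, "a") do io
        handle[] = io
        println(io, "second line")
        error("something went wrong")
    end
catch
    # ignore the error for the purposes of this demonstration
end
closed_after_error = !isopen(handle[])
contents = read(path, String)
```

In real code you would call open directly with a do block; the point of the sketch is that the try and finally pair is what guarantees the close. The second call opens the file with "a" (append mode), which adds to the end of the file instead of overwriting it.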
4.4.3. Reading Data
Let’s read the hello.txt file to check that the lines were written, and do it using a do block. Try running this code to read the lines from the file:
lines = open("hello.txt") do f
    readlines(f)
end
3-element Vector{String}:
"Don't worry, be happy!"
"Hello world!"
"Print?"
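The readlines function reads lines as elements of an array. If you only want to read one line, use the readline function instead. If you want to read the entire file into a single string, call the read function with String as the second argument, as in read(f, String); without that argument, read returns the file’s contents as an array of bytes.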
Note
In addition to read, there’s also a readchomp function, which reads the entire input as a string and differs from read(f, String) only in that it removes (“chomps”) a single trailing newline.
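The reading functions are easy to compare side by side. This sketch writes a small temporary file (the contents are made up for the example) and reads it back several ways. Each function is called here with a file path, which opens and closes the file for you; they also accept an open file such as f in the do block above. The eachline function, not mentioned above, is included as one more option for reading a file line by line.

```julia
path, io = mktemp()               # temporary file and an open handle to it
print(io, "alpha\nbeta\ngamma\n")
close(io)

first_line = readline(path)       # "alpha" (the newline is stripped)
all_lines  = readlines(path)      # ["alpha", "beta", "gamma"]
whole      = read(path, String)   # entire file, newlines included
chomped    = readchomp(path)      # same, minus the final newline
raw        = read(path)           # Vector{UInt8} of the file's bytes

# eachline iterates lazily, which helps with files too large for memory.
lengths = [length(line) for line in eachline(path)]
```

If you need the line endings, readline, readlines, and eachline all accept a keep=true keyword argument that leaves the newline on each line.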