4. Reading and Writing Data

To use Julia for research computing projects, you’ll need to know how to read and write data. Julia provides built-in functions to read and write bytes and text, while community-contributed packages provide functions to handle a wide variety of structured data formats.

4.2. Example: Reading a CSV File

The U.S. Bureau of Transportation Statistics publishes and regularly updates the Airline On-Time Performance Data Set, which includes departure and arrival times for all domestic flights since 1987. The data set is distributed as a collection of comma-separated value (CSV) files, with one for each month-year combination.

Important

You can download a zipped subset of the data set here.

Let’s try reading the data set into Julia. There’s no built-in function to read CSV files, but the CSV.jl package provides one. The CSV format is tabular, so let’s read the data as a data frame (a tabular data structure). In Julia, the DataFrames.jl package provides data frames. Install both packages (if you haven’t yet), and then load them:

using Pkg

# Install the packages first (only needed once per environment).
Pkg.add("CSV")
Pkg.add("DataFrames")

# Then load them.
using CSV
using DataFrames

You can read a CSV file with the CSV.read function:

path = "data/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2023_1.csv"
air = CSV.read(path, DataFrame)
typeof(air)
DataFrame

The second argument is a sink function, a function that converts raw tabular data to a specific Julia type. The sink function determines the type of the value CSV.read returns. In this case, the returned value is a DataFrame.

Note

The Tables.jl package defines a standard programming interface for working with tabular data in Julia. The CSV.jl and DataFrames.jl packages both use this interface, as do many other Julia packages.

Any type that satisfies the Tables.jl interface is called a source. For example, the CSV.jl package’s CSV.File type is a source, although the CSV.read function hides this detail. Any function that can take a source instance as input and return a table of a specific type is called a sink. The DataFrames.jl package’s DataFrame function is a sink. Coincidentally, the DataFrame type is also a source (this facilitates transformations of data frames).

The CSV.jl package also provides a more explicit way to read a CSV file: create a CSV.File source, and then pass that source to a sink such as DataFrame:

source = CSV.File(path)

# air = DataFrame(source)

Compared to CSV.read, this is less efficient for reading a CSV file as a data frame, because by default the DataFrame constructor makes a copy of the data in the source.
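To make the two routes concrete, here is a minimal sketch that reads the same CSV text both ways. It assumes CSV.jl and DataFrames.jl are installed. The in-memory text, its column names, and its values are hypothetical stand-ins for the airline file.

```julia
using CSV
using DataFrames

# Hypothetical in-memory CSV text, used here instead of a file on disk.
text = """
carrier,flight,delay
9E,4628,-3.0
YX,3500,12.5
"""

# Route 1: CSV.read parses the text and hands it straight to the sink.
direct = CSV.read(IOBuffer(text), DataFrame)

# Route 2: build the CSV.File source first, then call the sink yourself.
source = CSV.File(IOBuffer(text))
indirect = DataFrame(source)

# Both routes produce data frames with the same contents.
direct == indirect
```

For everyday use, CSV.read is the shorter spelling. Building the CSV.File source yourself is mainly useful when you want to pass the same source to a different sink.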

You can use the first function to get just the first few rows of a data frame:

first(air, 5)
5×110 DataFrame
 Row │ Year   Quarter  Month  DayofMonth  DayOfWeek  FlightDate  Reporting_Airline  ⋯
     │ Int64  Int64    Int64  Int64       Int64      Date        String3            ⋯
─────┼────────────────────────────────────────────────────────────────────────────────
   1 │  2023        1      1           2          1  2023-01-02  9E                 ⋯
   2 │  2023        1      1           3          2  2023-01-03  9E                 ⋯
   3 │  2023        1      1           4          3  2023-01-04  9E                 ⋯
   4 │  2023        1      1           5          4  2023-01-05  9E                 ⋯
   5 │  2023        1      1           6          5  2023-01-06  9E                 ⋯
                                                                   103 columns omitted

The second argument, 5, is the number of rows to return.
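A few other functions are handy for a first look at a data frame. The sketch below uses a small, hypothetical data frame (the column names and values are made up) so that it runs without the airline file. It assumes DataFrames.jl is installed.

```julia
using DataFrames

# Hypothetical stand-in for the airline data frame.
flights = DataFrame(
    carrier = ["9E", "9E", "YX", "YX"],
    delay = [-3.0, -5.0, 12.5, 0.0],
)

size(flights)        # (rows, columns) as a tuple
names(flights)       # column names as a vector of strings
first(flights, 2)    # the first two rows
last(flights, 2)     # the last two rows
```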

The describe function is also useful for inspecting data frames. It provides a statistical summary of each column:

describe(air)
110×7 DataFrame
 Row │ variable                     mean     min         median      max         nmissing  eltype
     │ Symbol                       Union…   Any         Any         Any         Int64     Type
─────┼─────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Year                         2023.0   2023        2023.0      2023               0  Int64
   2 │ Quarter                      1.0      1           1.0         1                  0  Int64
   3 │ Month                        1.0      1           1.0         1                  0  Int64
   4 │ DayofMonth                   16.0954  1           16.0        31                 0  Int64
   5 │ DayOfWeek                    3.89027  1           4.0         7                  0  Int64
   6 │ FlightDate                            2023-01-01  2023-01-16  2023-01-31         0  Date
   7 │ Reporting_Airline                     9E                      YX                 0  String3
   8 │ DOT_ID_Reporting_Airline     19944.9  19393       19930.0     20452              0  Int64
   9 │ IATA_CODE_Reporting_Airline           9E                      YX                 0  String3
  ⋮  │              ⋮                  ⋮         ⋮           ⋮           ⋮          ⋮         ⋮
 108 │ Div5WheelsOff                                                               538837  Missing
 109 │ Div5TailNum                                                                 538837  Missing
 110 │ Column110                                                                   538837  Missing
                                                                                   98 rows omitted

We’ll learn more about working with data frames in Section 5.

4.3. Structured Data

Functions to read and write structured file formats are generally provided by packages rather than built into Julia. Here are a few (find more by searching online):

Format                        Extension           Package
Apache Arrow                  .arrow              Arrow.jl
Delimited File                .csv, .tsv, …       CSV.jl
Fixed-width File              .fwf                Planned for CSV.jl
Geospatial Vector Data        .geojson, .shp, …   ArchGDAL.jl
HDF5                          .hdf5               HDF5.jl
JavaScript Object Notation    .json               JSON.jl
Apache Parquet                .parquet            Parquet.jl
Images                        .png, .jpg, …       Images.jl
Geospatial Raster Data        .tiff, …            ArchGDAL.jl
TOML                          .toml               built-in
Microsoft Excel               .xls, .xlsx         XLSX.jl
Extensible Markup Language    .xml                XML.jl
YAML                          .yaml               YAML.jl
MessagePack                                       MsgPack.jl

Often there’s more than one package with functions to read and write a specific format. We chose the packages in the table based on popularity and signs of active development or maintenance.
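As one example from the table, TOML support ships with Julia in the TOML standard library, so no installation is needed. The sketch below parses TOML text and writes it back out. The configuration keys and values are hypothetical.

```julia
using TOML

# Hypothetical configuration text.
text = """
title = "airline analysis"

[input]
year = 2023
months = [1, 2, 3]
"""

config = TOML.parse(text)      # returns a nested dictionary
config["input"]["months"]      # a vector holding 1, 2, 3

# Write the dictionary back out as TOML text.
TOML.print(stdout, config)
```

Most of the packages in the table follow a similar pattern: one function to parse a file or string into Julia values, and another to write Julia values back out. The exact function names differ from package to package, so check each package’s documentation.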

4.4. Text and Bytes

Sometimes you might need to read and write text or bytes directly. For example, you might need to work with a file that has an obscure or custom format.

4.4.1. Writing Data

Julia’s built-in open function opens a file. Let’s open a file called hello.txt:

file = open("hello.txt", "w")
IOStream(<file hello.txt>)

The second argument specifies the mode in which to open the file: read mode ("r", the default) or write mode ("w"). Write mode creates the file if it doesn’t exist and erases its contents if it does. In this case, we opened the file in write mode.
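The sketch below shows how the mode affects existing contents. It also uses append mode ("a"), which is not covered above: append mode keeps what is already in the file and writes at the end. The file is created at a temporary path from tempname, an assumption made so that the example doesn’t touch hello.txt.

```julia
path = tempname()           # a fresh temporary path

file = open(path, "w")      # write mode: create the file, or empty it if it exists
println(file, "first")
close(file)

file = open(path, "a")      # append mode: keep existing contents, write at the end
println(file, "second")
close(file)

after_append = read(path, String)   # "first\nsecond\n"

file = open(path, "w")      # write mode again: the earlier lines are discarded
println(file, "third")
close(file)

after_rewrite = read(path, String)  # "third\n"
```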

You can use the print, println, or write function to write to a file. Try it out with a few lines:

write(file, "Don't worry, be happy!\n")
23
println(file, "Hello world!")
print(file, "Print?\n")

Important

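The print and println functions write the text representation of an object, the same text you would see if you printed the object to the screen. Julia strings are encoded as UTF-8, so the text is written as UTF-8; the open function has no encoding option.

The write function outputs the binary representation of an object. For strings, this means the bytes of the string itself, so write and print produce the same output for a string but different output for most other types, such as numbers.

In general, use print and println to write text to files, and use write to write bytes to files.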
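A small sketch of the difference between text output and byte output, using an in-memory IOBuffer in place of a file (an assumption made so the bytes are easy to inspect):

```julia
io = IOBuffer()          # an in-memory stream, standing in for a file

print(io, Int64(65))     # writes the text "65": two bytes, one per character
text_bytes = take!(io)   # take! returns the bytes written so far and empties the buffer

write(io, Int64(65))     # writes the 8-byte binary representation of the integer
binary_bytes = take!(io)

write(io, "65")          # for a string, write outputs the string's own bytes
string_bytes = take!(io)

length(text_bytes), length(binary_bytes), length(string_bytes)   # (2, 8, 2)
```

A file written with write(file, 65) is therefore not readable as text, which is why print and println are the better choice for text files.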

Make sure to use the close function to close files when you’re finished using them. When writing to a file, this ensures that all of the writes are completed.

close(file)

It’s easy to forget to close files, but fortunately Julia provides syntactic sugar to close files automatically: the do block.

4.4.2. do Blocks

A do block is syntactic sugar for defining an anonymous function and passing it as the first argument to another function. As an example, consider this call with an anonymous function as the first argument:

map(x -> x^2, [1, 2, 3])
3-element Vector{Int64}:
 1
 4
 9

The equivalent using a do block is:

map([1, 2, 3]) do x
    x^2
end
3-element Vector{Int64}:
 1
 4
 9

Caution

Since a do block defines an anonymous function, variables you define inside of the block are not visible from outside of the block.
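A short sketch of this scoping rule, run at the top level of a script or the REPL. The variable names are arbitrary.

```julia
shifted = map([1, 2, 3]) do x
    squared = x^2     # local to the anonymous function
    squared + 1       # the last expression is the return value
end

shifted               # [2, 5, 10]
@isdefined(squared)   # false: squared does not exist outside the block
```

To get a value out of a do block, return it from the block, as the last line of the block does here.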

The open function has a method that takes a function as its first argument. The file is opened, the function is called on the open file, the file is closed, and the result is returned. This approach is safer than manually closing the file with close, because the open function ensures that the file is always closed, even if something goes wrong in the computation.
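Here is a minimal sketch of this method of open, first for an ordinary write and then for a block that fails partway through. It writes to a temporary path from tempname rather than hello.txt, and the Ref named saved is only there so the example can inspect the file handle afterwards. Both are assumptions made for illustration.

```julia
path = tempname()   # temporary path, so hello.txt is left alone

# open calls the block with the open file, closes the file, and
# returns whatever the block returned.
nlines = open(path, "w") do f
    println(f, "Hello world!")
    println(f, "Goodbye!")
    2                          # returned as the result of open
end

# The file is closed even when the block throws an error.
saved = Ref{Any}(nothing)      # holds the file handle so it can be inspected later
try
    open(path) do f
        saved[] = f
        error("something went wrong")
    end
catch err
    println("caught: ", err.msg)
end

isopen(saved[])                # false
```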

4.4.3. Reading Data

Let’s read the hello.txt file to check that the lines were written, and do it using a do block. Try running this code to read the lines from the file:

lines = open("hello.txt") do f
    readlines(f)
end
3-element Vector{String}:
 "Don't worry, be happy!"
 "Hello world!"
 "Print?"

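The readlines function reads all of the lines into an array of strings, with the newline at the end of each line removed. If you only want to read one line, use the readline function instead. If you want to read the whole file into a single string, use the read function with String as the second argument, as in read(f, String); calling read(f) without a type returns the raw bytes as a Vector{UInt8}.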

Note

In addition to read, there’s also a readchomp function. It reads the whole file as a single string, like read with String as the second argument, but removes (“chomps”) a single trailing newline if there is one.
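The sketch below compares the reading functions side by side on a small file. The file lives at a temporary path and its contents are made up for the example. Each function here is called with a file name, which opens and closes the file for you.

```julia
path = tempname()              # temporary path with known contents
write(path, "one\ntwo\n")      # write also accepts a file name

first_line = readline(path)    # "one": just the first line, newline removed
all_lines = readlines(path)    # ["one", "two"]
whole = read(path, String)     # "one\ntwo\n": the whole file as one string
raw = read(path)               # the same content as a Vector{UInt8}
chomped = readchomp(path)      # "one\ntwo": one trailing newline removed

# For large files, eachline iterates over lines one at a time
# instead of building the whole array first.
lengths = [length(line) for line in eachline(path)]   # [3, 3]
```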