DataFrames.jl

DataFrames is a package that provides a set of tools for working with tabular data. Its design and functionality are similar to pandas (in Python) and data.frame, data.table and dplyr (in R) or table (in Matlab). This makes it a great general-purpose data science tool, especially for people coming to Julia from other languages.

The core of the package is the DataFrame structure that represents a data table. The simplest way of constructing a DataFrame is to pass column vectors using keyword arguments or pairs.

using DataFrames
df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"], C = rand(4))

4×3 DataFrame

Row	A	B	C
	Int64	String	Float64
1	1	M	0.719108
2	2	F	0.0469126
3	3	F	0.026386
4	4	M	0.328051

Since each column is stored in a DataFrame as a separate vector, it is possible to combine columns of different element types. Columns can be accessed directly without copying.

julia> df.A4-element Vector{Int64}:
 1
 2
 3
 4

Another way is to use the indexing syntax similar to the one for arrays.

julia> df[!, :A]4-element Vector{Int64}:
 1
 2
 3
 4

We use ! to select all rows. This creates a pointer to the column. If we use :, then we get a copy of a column. Since vectors are mutable structures and accessing a column of DataFrame via ! does not make a copy, it is possible to change elements of the DataFrame.

df.A[1] = 5
df

4×3 DataFrame

Row	A	B	C
	Int64	String	Float64
1	5	M	0.719108
2	2	F	0.0469126
3	3	F	0.026386
4	4	M	0.328051

On the other hand, the : creates a copy, which will not change the original DataFrame.

col = df[:, :A]
col[1] = 4
df

4×3 DataFrame

Row	A	B	C
	Int64	String	Float64
1	5	M	0.719108
2	2	F	0.0469126
3	3	F	0.026386
4	4	M	0.328051

Column names:

DataFrames allow using symbols (like :A) and strings (like "A") for all column indexing operations. Using symbols is slightly faster and should be preferred. One exception is when the column names are generated using string manipulation.

The standard format for storing table data is the csv file format. The CSV package provides an interface for saving and loading csv files.

using CSV
CSV.write("dataframe.csv", df)
table = CSV.read("dataframe.csv", DataFrame; header = true)

4×3 DataFrame

Row	A	B	C
	Int64	String1	Float64
1	5	M	0.719108
2	2	F	0.0469126
3	3	F	0.026386
4	4	M	0.328051

See the package documentation for more information.

Adding columns and rows

It is common for tables to be created column by column or row by row. DataFrames provides an easy way to extend existing tables. To can add new columns to a DataFrame in a direct way.

df.D = [:a, :b, :c, :d]
df

4×4 DataFrame

Row	A	B	C	D
	Int64	String	Float64	Symbol
1	5	M	0.719108	a
2	2	F	0.0469126	b
3	3	F	0.026386	c
4	4	M	0.328051	d

Alternatively, we can use the insertcols! function. This function can insert multiple columns at once and also provides advanced options for column manipulation. For example, we can specify the column index into which the columns are to be inserted.

insertcols!(df, 3, :B => rand(4), :B => 11:14; makeunique = true)

4×6 DataFrame

Row	A	B	B_1	B_2	C	D
	Int64	String	Float64	Int64	Float64	Symbol
1	5	M	0.348518	11	0.719108	a
2	2	F	0.820582	12	0.0469126	b
3	3	F	0.485706	13	0.026386	c
4	4	M	0.562	14	0.328051	d

New rows can be added to an existing DataFrame by the push! function. It is possible to append a new row in the form of a vector or a tuple of the correct length or in the form of a dictionary or DataFrame with the correct keys.

push!(df, [10, "F", 0.1, 15, 0.235, :f])
push!(df, (10, "F", 0.1, 15, 0.235, :f))
push!(df, Dict(:B_1 => 0.1, :B_2 => 15, :A => 10, :D => :f, :B => "F", :C => 0.235))
df

7×6 DataFrame

Row	A	B	B_1	B_2	C	D
	Int64	String	Float64	Int64	Float64	Symbol
1	5	M	0.348518	11	0.719108	a
2	2	F	0.820582	12	0.0469126	b
3	3	F	0.485706	13	0.026386	c
4	4	M	0.562	14	0.328051	d
5	10	F	0.1	15	0.235	f
6	10	F	0.1	15	0.235	f
7	10	F	0.1	15	0.235	f

It is also possible to start with an empty DataFrame and build the table incrementally.

using DataFrames
df_empty = DataFrame()
df_empty.A = 1:3
df_empty.B = [:a, :b, :c]
df_empty

3×2 DataFrame

Row	A	B
	Int64	Symbol
1	1	a
2	2	b
3	3	c

However, this approach will not work if the DataFrame is created row by row. In this case, the DataFrame must be initialized with empty columns of appropriate element types.

df_empty = DataFrame(A = Int[], B = Symbol[])
push!(df_empty, [1, :a])
push!(df_empty, (2, :b))
push!(df_empty, Dict(:A => 3, :B => :c))
df_empty

3×2 DataFrame

Row	A	B
	Int64	Symbol
1	1	a
2	2	b
3	3	c

Renaming

Two functions can be used to rename columns. The names function returns column names as a vector of strings, while the propertynames function returns a vector of symbols.

julia> names(df)6-element Vector{String}:
 "A"
 "B"
 "B_1"
 "B_2"
 "C"
 "D"
julia> propertynames(df)6-element Vector{Symbol}:
 :A
 :B
 :B_1
 :B_2
 :C
 :D

We use the rename! function to change column names. This function can be used to rename all columns at once.

rename!(df, [:a, :b, :c, :d, :e, :f])
df

7×6 DataFrame

Row	a	b	c	d	e	f
	Int64	String	Float64	Int64	Float64	Symbol
1	5	M	0.348518	11	0.719108	a
2	2	F	0.820582	12	0.0469126	b
3	3	F	0.485706	13	0.026386	c
4	4	M	0.562	14	0.328051	d
5	10	F	0.1	15	0.235	f
6	10	F	0.1	15	0.235	f
7	10	F	0.1	15	0.235	f

Another option is to rename only some of the columns specified by their names.

rename!(df, :a => :A, :f => :F)
df

7×6 DataFrame

Row	A	b	c	d	e	F
	Int64	String	Float64	Int64	Float64	Symbol
1	5	M	0.348518	11	0.719108	a
2	2	F	0.820582	12	0.0469126	b
3	3	F	0.485706	13	0.026386	c
4	4	M	0.562	14	0.328051	d
5	10	F	0.1	15	0.235	f
6	10	F	0.1	15	0.235	f
7	10	F	0.1	15	0.235	f

It is also possible to use a function to generate column names.

myname(x) = string("column_", uppercase(x))
rename!(myname, df)
df

7×6 DataFrame

Row	column_A	column_B	column_C	column_D	column_E	column_F
	Int64	String	Float64	Int64	Float64	Symbol
1	5	M	0.348518	11	0.719108	a
2	2	F	0.820582	12	0.0469126	b
3	3	F	0.485706	13	0.026386	c
4	4	M	0.562	14	0.328051	d
5	10	F	0.1	15	0.235	f
6	10	F	0.1	15	0.235	f
7	10	F	0.1	15	0.235	f

Working with `DataFrame`s

In the next part of the lecture, we will use the RDatasets package. The package provides an easy way for Julia users to use many standard datasets available in the core of the R programming language. We will use the Iris dataset.

using RDatasets, DataFrames
iris = dataset("datasets", "iris")
first(iris, 6)

6×5 DataFrame

Row	SepalLength	SepalWidth	PetalLength	PetalWidth	Species
	Float64	Float64	Float64	Float64	Cat…
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3.0	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa
4	4.6	3.1	1.5	0.2	setosa
5	5.0	3.6	1.4	0.2	setosa
6	5.4	3.9	1.7	0.4	setosa

We use the first function to print the first n = 6 rows of a table. Similarly, the last function shows the last n rows. When working with a new dataset, it is helpful to get a basic description. DataFrames provides the describe function that returns descriptive statistics for each column.

describe(iris)

5×7 DataFrame

Row	variable	mean	min	median	max	nmissing	eltype
	Symbol	Union…	Any	Union…	Any	Int64	DataType
1	SepalLength	5.84333	4.3	5.8	7.9	0	Float64
2	SepalWidth	3.05733	2.0	3.0	4.4	0	Float64
3	PetalLength	3.758	1.0	4.35	6.9	0	Float64
4	PetalWidth	1.19933	0.1	1.3	2.5	0	Float64
5	Species		setosa		virginica	0	CategoricalValue{String, UInt8}

We can use the indexing syntax to get a specific subset of a DataFrame.

iris[2:4, [:SepalLength, :Species]]

3×2 DataFrame

Row	SepalLength	Species
	Float64	Cat…
1	4.9	setosa
2	4.7	setosa
3	4.6	setosa

Additionally, DataFrames provides Not, Between, Cols and All selectors for more complex column selection scenarios.

iris[2:4, Not([:SepalLength, :Species])]

3×3 DataFrame

Row	SepalWidth	PetalLength	PetalWidth
	Float64	Float64	Float64
1	3.0	1.4	0.2
2	3.2	1.3	0.2
3	3.1	1.5	0.2

The Query package allows for advanced manipulation with DataFrame. The code below selects only rows with SepalLength >= 6 and SepalWidth >= 3.4. Then we create a new DataFrame, where for each of the selected rows, we add the Species, the sum of sepal length and width, and the sum of petal length and width.

using Query

table = @from row in iris begin
    @where row.SepalLength >= 6 && row.SepalWidth >= 3.4
    @select {
        row.Species,
        SepalSum = row.SepalLength + row.SepalWidth,
        PetalSum = row.PetalLength + row.PetalWidth,
    }
    @collect DataFrame
end

6×3 DataFrame

Row	Species	SepalSum	PetalSum
	Cat…	Float64	Float64
1	versicolor	9.4	6.1
2	virginica	10.8	8.6
3	virginica	11.5	8.9
4	virginica	11.7	8.4
5	virginica	9.7	8.0
6	virginica	9.6	7.7

There are many topics related to DataFrames. However, there is not enough time to cover them all. We refer the reader to the excellent documentation with lots of examples.

Visualizing using StatsPlots

The StatsPlots package provides recipes for plotting histograms, boxplots, and many other plots related to statistics. This package also provides the @df macro, which allows simple plotting of tabular data. As a simple example, we create a scatter plot of SepalLength and SepalWidth grouped by Species. Keyword arguments can be used in the same way as before.

using StatsPlots
@df iris scatter(
    :SepalLength,
    :SepalWidth;
    group = :Species,
    xlabel = "SepalLength",
    ylabel = "SepalWidth",
    marker = ([:d :h :star7], 8),
)

As another example, we mention the marginalkde function for plotting marginal kernel density estimations. In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. The marginalkde function can be used together with @df macro.

@df iris marginalkde(
    :SepalLength,
    :SepalWidth;
    xlabel = "SepalLength",
    ylabel = "SepalWidth",
)

Another example is the corrplot function, which shows the correlation between all variables.

@df iris corrplot(
    cols(1:4);
    grid = false,
    nbins = 15,
    fillcolor = :viridis,
    markercolor = :viridis,
)

Because it is shorter, we use cols(1:4) instead of the column names.

DataFrames.jl

Adding columns and rows

Renaming

Working with DataFrames

Visualizing using StatsPlots

Working with `DataFrame`s