DataFrames.jl

DataFrames is a package that provides a set of tools for working with tabular data. Its design and functionality are similar to pandas (in Python) and data.frame, data.table and dplyr (in R) or table (in Matlab). This makes it a great general-purpose data science tool, especially for people coming to Julia from other languages.

The core of the package is the DataFrame structure that represents a data table. The simplest way of constructing a DataFrame is to pass column vectors using keyword arguments or pairs.

using DataFrames
df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"], C = rand(4))
4×3 DataFrame
RowABC
Int64StringFloat64
11M0.0414862
22F0.703153
33F0.981285
44M0.658003

Since each column is stored in a DataFrame as a separate vector, it is possible to combine columns of different element types. Columns can be accessed directly without copying.

julia> df.A4-element Vector{Int64}:
 1
 2
 3
 4

Another way is to use the indexing syntax similar to the one for arrays.

julia> df[!, :A]4-element Vector{Int64}:
 1
 2
 3
 4

We use ! to select all rows. This creates a pointer to the column. If we use :, then we get a copy of a column. Since vectors are mutable structures and accessing a column of DataFrame via ! does not make a copy, it is possible to change elements of the DataFrame.

df.A[1] = 5
df
4×3 DataFrame
RowABC
Int64StringFloat64
15M0.0414862
22F0.703153
33F0.981285
44M0.658003

On the other hand, the : creates a copy, which will not change the original DataFrame.

col = df[:, :A]
col[1] = 4
df
4×3 DataFrame
RowABC
Int64StringFloat64
15M0.0414862
22F0.703153
33F0.981285
44M0.658003
Column names:

DataFrames allow using symbols (like :A) and strings (like "A") for all column indexing operations. Using symbols is slightly faster and should be preferred. One exception is when the column names are generated using string manipulation.

The standard format for storing table data is the csv file format. The CSV package provides an interface for saving and loading csv files.

using CSV
CSV.write("dataframe.csv", df)
table = CSV.read("dataframe.csv", DataFrame; header = true)
4×3 DataFrame
RowABC
Int64String1Float64
15M0.0414862
22F0.703153
33F0.981285
44M0.658003

See the package documentation for more information.

Adding columns and rows

It is common for tables to be created column by column or row by row. DataFrames provides an easy way to extend existing tables. To can add new columns to a DataFrame in a direct way.

df.D = [:a, :b, :c, :d]
df
4×4 DataFrame
RowABCD
Int64StringFloat64Symbol
15M0.0414862a
22F0.703153b
33F0.981285c
44M0.658003d

Alternatively, we can use the insertcols! function. This function can insert multiple columns at once and also provides advanced options for column manipulation. For example, we can specify the column index into which the columns are to be inserted.

insertcols!(df, 3, :B => rand(4), :B => 11:14; makeunique = true)
4×6 DataFrame
RowABB_1B_2CD
Int64StringFloat64Int64Float64Symbol
15M0.867763110.0414862a
22F0.166064120.703153b
33F0.367744130.981285c
44M0.986799140.658003d

New rows can be added to an existing DataFrame by the push! function. It is possible to append a new row in the form of a vector or a tuple of the correct length or in the form of a dictionary or DataFrame with the correct keys.

push!(df, [10, "F", 0.1, 15, 0.235, :f])
push!(df, (10, "F", 0.1, 15, 0.235, :f))
push!(df, Dict(:B_1 => 0.1, :B_2 => 15, :A => 10, :D => :f, :B => "F", :C => 0.235))
df
7×6 DataFrame
RowABB_1B_2CD
Int64StringFloat64Int64Float64Symbol
15M0.867763110.0414862a
22F0.166064120.703153b
33F0.367744130.981285c
44M0.986799140.658003d
510F0.1150.235f
610F0.1150.235f
710F0.1150.235f

It is also possible to start with an empty DataFrame and build the table incrementally.

using DataFrames
df_empty = DataFrame()
df_empty.A = 1:3
df_empty.B = [:a, :b, :c]
df_empty
3×2 DataFrame
RowAB
Int64Symbol
11a
22b
33c

However, this approach will not work if the DataFrame is created row by row. In this case, the DataFrame must be initialized with empty columns of appropriate element types.

df_empty = DataFrame(A = Int[], B = Symbol[])
push!(df_empty, [1, :a])
push!(df_empty, (2, :b))
push!(df_empty, Dict(:A => 3, :B => :c))
df_empty
3×2 DataFrame
RowAB
Int64Symbol
11a
22b
33c

Renaming

Two functions can be used to rename columns. The names function returns column names as a vector of strings, while the propertynames function returns a vector of symbols.

julia> names(df)6-element Vector{String}:
 "A"
 "B"
 "B_1"
 "B_2"
 "C"
 "D"
julia> propertynames(df)6-element Vector{Symbol}: :A :B :B_1 :B_2 :C :D

We use the rename! function to change column names. This function can be used to rename all columns at once.

rename!(df, [:a, :b, :c, :d, :e, :f])
df
7×6 DataFrame
Rowabcdef
Int64StringFloat64Int64Float64Symbol
15M0.867763110.0414862a
22F0.166064120.703153b
33F0.367744130.981285c
44M0.986799140.658003d
510F0.1150.235f
610F0.1150.235f
710F0.1150.235f

Another option is to rename only some of the columns specified by their names.

rename!(df, :a => :A, :f => :F)
df
7×6 DataFrame
RowAbcdeF
Int64StringFloat64Int64Float64Symbol
15M0.867763110.0414862a
22F0.166064120.703153b
33F0.367744130.981285c
44M0.986799140.658003d
510F0.1150.235f
610F0.1150.235f
710F0.1150.235f

It is also possible to use a function to generate column names.

myname(x) = string("column_", uppercase(x))
rename!(myname, df)
df
7×6 DataFrame
Rowcolumn_Acolumn_Bcolumn_Ccolumn_Dcolumn_Ecolumn_F
Int64StringFloat64Int64Float64Symbol
15M0.867763110.0414862a
22F0.166064120.703153b
33F0.367744130.981285c
44M0.986799140.658003d
510F0.1150.235f
610F0.1150.235f
710F0.1150.235f

Working with DataFrames

In the next part of the lecture, we will use the RDatasets package. The package provides an easy way for Julia users to use many standard datasets available in the core of the R programming language. We will use the Iris dataset.

using RDatasets, DataFrames
iris = dataset("datasets", "iris")
first(iris, 6)
6×5 DataFrame
RowSepalLengthSepalWidthPetalLengthPetalWidthSpecies
Float64Float64Float64Float64Cat…
15.13.51.40.2setosa
24.93.01.40.2setosa
34.73.21.30.2setosa
44.63.11.50.2setosa
55.03.61.40.2setosa
65.43.91.70.4setosa

We use the first function to print the first n = 6 rows of a table. Similarly, the last function shows the last n rows. When working with a new dataset, it is helpful to get a basic description. DataFrames provides the describe function that returns descriptive statistics for each column.

describe(iris)
5×7 DataFrame
Rowvariablemeanminmedianmaxnmissingeltype
SymbolUnion…AnyUnion…AnyInt64DataType
1SepalLength5.843334.35.87.90Float64
2SepalWidth3.057332.03.04.40Float64
3PetalLength3.7581.04.356.90Float64
4PetalWidth1.199330.11.32.50Float64
5Speciessetosavirginica0CategoricalValue{String, UInt8}

We can use the indexing syntax to get a specific subset of a DataFrame.

iris[2:4, [:SepalLength, :Species]]
3×2 DataFrame
RowSepalLengthSpecies
Float64Cat…
14.9setosa
24.7setosa
34.6setosa

Additionally, DataFrames provides Not, Between, Cols and All selectors for more complex column selection scenarios.

iris[2:4, Not([:SepalLength, :Species])]
3×3 DataFrame
RowSepalWidthPetalLengthPetalWidth
Float64Float64Float64
13.01.40.2
23.21.30.2
33.11.50.2

The Query package allows for advanced manipulation with DataFrame. The code below selects only rows with SepalLength >= 6 and SepalWidth >= 3.4. Then we create a new DataFrame, where for each of the selected rows, we add the Species, the sum of sepal length and width, and the sum of petal length and width.

using Query

table = @from row in iris begin
    @where row.SepalLength >= 6 && row.SepalWidth >= 3.4
    @select {
        row.Species,
        SepalSum = row.SepalLength + row.SepalWidth,
        PetalSum = row.PetalLength + row.PetalWidth,
    }
    @collect DataFrame
end
6×3 DataFrame
RowSpeciesSepalSumPetalSum
Cat…Float64Float64
1versicolor9.46.1
2virginica10.88.6
3virginica11.58.9
4virginica11.78.4
5virginica9.78.0
6virginica9.67.7

There are many topics related to DataFrames. However, there is not enough time to cover them all. We refer the reader to the excellent documentation with lots of examples.

Visualizing using StatsPlots

The StatsPlots package provides recipes for plotting histograms, boxplots, and many other plots related to statistics. This package also provides the @df macro, which allows simple plotting of tabular data. As a simple example, we create a scatter plot of SepalLength and SepalWidth grouped by Species. Keyword arguments can be used in the same way as before.

using StatsPlots
@df iris scatter(
    :SepalLength,
    :SepalWidth;
    group = :Species,
    xlabel = "SepalLength",
    ylabel = "SepalWidth",
    marker = ([:d :h :star7], 8),
)

As another example, we mention the marginalkde function for plotting marginal kernel density estimations. In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. The marginalkde function can be used together with @df macro.

@df iris marginalkde(
    :SepalLength,
    :SepalWidth;
    xlabel = "SepalLength",
    ylabel = "SepalWidth",
)

Another example is the corrplot function, which shows the correlation between all variables.

@df iris corrplot(
    cols(1:4);
    grid = false,
    nbins = 15,
    fillcolor = :viridis,
    markercolor = :viridis,
)

Because it is shorter, we use cols(1:4) instead of the column names.