DataFrames.jl
DataFrames is a package that provides a set of tools for working with tabular data. Its design and functionality are similar to pandas (in Python) and data.frame
, data.table
and dplyr (in R) or table
(in Matlab). This makes it a great general-purpose data science tool, especially for people coming to Julia from other languages.
The core of the package is the DataFrame
structure that represents a data table. The simplest way of constructing a DataFrame
is to pass column vectors using keyword arguments or pairs.
using DataFrames
df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"], C = rand(4))
Row | A | B | C |
---|---|---|---|
Int64 | String | Float64 | |
1 | 1 | M | 0.311687 |
2 | 2 | F | 0.031932 |
3 | 3 | F | 0.0796035 |
4 | 4 | M | 0.978723 |
Since each column is stored in a DataFrame
as a separate vector, it is possible to combine columns of different element types. Columns can be accessed directly without copying.
julia> df.A
4-element Vector{Int64}: 1 2 3 4
Another way is to use the indexing syntax similar to the one for arrays.
julia> df[!, :A]
4-element Vector{Int64}: 1 2 3 4
We use !
to select all rows. This creates a pointer to the column. If we use :
, then we get a copy of a column. Since vectors are mutable structures and accessing a column of DataFrame
via !
does not make a copy, it is possible to change elements of the DataFrame
.
df.A[1] = 5
df
Row | A | B | C |
---|---|---|---|
Int64 | String | Float64 | |
1 | 5 | M | 0.311687 |
2 | 2 | F | 0.031932 |
3 | 3 | F | 0.0796035 |
4 | 4 | M | 0.978723 |
On the other hand, the :
creates a copy, which will not change the original DataFrame
.
col = df[:, :A]
col[1] = 4
df
Row | A | B | C |
---|---|---|---|
Int64 | String | Float64 | |
1 | 5 | M | 0.311687 |
2 | 2 | F | 0.031932 |
3 | 3 | F | 0.0796035 |
4 | 4 | M | 0.978723 |
DataFrames allow using symbols (like :A
) and strings (like "A"
) for all column indexing operations. Using symbols is slightly faster and should be preferred. One exception is when the column names are generated using string manipulation.
The standard format for storing table data is the csv
file format. The CSV package provides an interface for saving and loading csv
files.
using CSV
CSV.write("dataframe.csv", df)
table = CSV.read("dataframe.csv", DataFrame; header = true)
Row | A | B | C |
---|---|---|---|
Int64 | String1 | Float64 | |
1 | 5 | M | 0.311687 |
2 | 2 | F | 0.031932 |
3 | 3 | F | 0.0796035 |
4 | 4 | M | 0.978723 |
See the package documentation for more information.
Adding columns and rows
It is common for tables to be created column by column or row by row. DataFrame
s provides an easy way to extend existing tables. To can add new columns to a DataFrame
in a direct way.
df.D = [:a, :b, :c, :d]
df
Row | A | B | C | D |
---|---|---|---|---|
Int64 | String | Float64 | Symbol | |
1 | 5 | M | 0.311687 | a |
2 | 2 | F | 0.031932 | b |
3 | 3 | F | 0.0796035 | c |
4 | 4 | M | 0.978723 | d |
Alternatively, we can use the insertcols!
function. This function can insert multiple columns at once and also provides advanced options for column manipulation. For example, we can specify the column index into which the columns are to be inserted.
insertcols!(df, 3, :B => rand(4), :B => 11:14; makeunique = true)
Row | A | B | B_1 | B_2 | C | D |
---|---|---|---|---|---|---|
Int64 | String | Float64 | Int64 | Float64 | Symbol | |
1 | 5 | M | 0.333859 | 11 | 0.311687 | a |
2 | 2 | F | 0.655728 | 12 | 0.031932 | b |
3 | 3 | F | 0.870544 | 13 | 0.0796035 | c |
4 | 4 | M | 0.78748 | 14 | 0.978723 | d |
New rows can be added to an existing DataFrame
by the push!
function. It is possible to append a new row in the form of a vector or a tuple of the correct length or in the form of a dictionary or DataFrame
with the correct keys.
push!(df, [10, "F", 0.1, 15, 0.235, :f])
push!(df, (10, "F", 0.1, 15, 0.235, :f))
push!(df, Dict(:B_1 => 0.1, :B_2 => 15, :A => 10, :D => :f, :B => "F", :C => 0.235))
df
Row | A | B | B_1 | B_2 | C | D |
---|---|---|---|---|---|---|
Int64 | String | Float64 | Int64 | Float64 | Symbol | |
1 | 5 | M | 0.333859 | 11 | 0.311687 | a |
2 | 2 | F | 0.655728 | 12 | 0.031932 | b |
3 | 3 | F | 0.870544 | 13 | 0.0796035 | c |
4 | 4 | M | 0.78748 | 14 | 0.978723 | d |
5 | 10 | F | 0.1 | 15 | 0.235 | f |
6 | 10 | F | 0.1 | 15 | 0.235 | f |
7 | 10 | F | 0.1 | 15 | 0.235 | f |
It is also possible to start with an empty DataFrame
and build the table incrementally.
using DataFrames
df_empty = DataFrame()
df_empty.A = 1:3
df_empty.B = [:a, :b, :c]
df_empty
Row | A | B |
---|---|---|
Int64 | Symbol | |
1 | 1 | a |
2 | 2 | b |
3 | 3 | c |
However, this approach will not work if the DataFrame
is created row by row. In this case, the DataFrame
must be initialized with empty columns of appropriate element types.
df_empty = DataFrame(A = Int[], B = Symbol[])
push!(df_empty, [1, :a])
push!(df_empty, (2, :b))
push!(df_empty, Dict(:A => 3, :B => :c))
df_empty
Row | A | B |
---|---|---|
Int64 | Symbol | |
1 | 1 | a |
2 | 2 | b |
3 | 3 | c |
Renaming
Two functions can be used to rename columns. The names
function returns column names as a vector of strings, while the propertynames
function returns a vector of symbols.
julia> names(df)
6-element Vector{String}: "A" "B" "B_1" "B_2" "C" "D"
julia> propertynames(df)
6-element Vector{Symbol}: :A :B :B_1 :B_2 :C :D
We use the rename!
function to change column names. This function can be used to rename all columns at once.
rename!(df, [:a, :b, :c, :d, :e, :f])
df
Row | a | b | c | d | e | f |
---|---|---|---|---|---|---|
Int64 | String | Float64 | Int64 | Float64 | Symbol | |
1 | 5 | M | 0.333859 | 11 | 0.311687 | a |
2 | 2 | F | 0.655728 | 12 | 0.031932 | b |
3 | 3 | F | 0.870544 | 13 | 0.0796035 | c |
4 | 4 | M | 0.78748 | 14 | 0.978723 | d |
5 | 10 | F | 0.1 | 15 | 0.235 | f |
6 | 10 | F | 0.1 | 15 | 0.235 | f |
7 | 10 | F | 0.1 | 15 | 0.235 | f |
Another option is to rename only some of the columns specified by their names.
rename!(df, :a => :A, :f => :F)
df
Row | A | b | c | d | e | F |
---|---|---|---|---|---|---|
Int64 | String | Float64 | Int64 | Float64 | Symbol | |
1 | 5 | M | 0.333859 | 11 | 0.311687 | a |
2 | 2 | F | 0.655728 | 12 | 0.031932 | b |
3 | 3 | F | 0.870544 | 13 | 0.0796035 | c |
4 | 4 | M | 0.78748 | 14 | 0.978723 | d |
5 | 10 | F | 0.1 | 15 | 0.235 | f |
6 | 10 | F | 0.1 | 15 | 0.235 | f |
7 | 10 | F | 0.1 | 15 | 0.235 | f |
It is also possible to use a function to generate column names.
myname(x) = string("column_", uppercase(x))
rename!(myname, df)
df
Row | column_A | column_B | column_C | column_D | column_E | column_F |
---|---|---|---|---|---|---|
Int64 | String | Float64 | Int64 | Float64 | Symbol | |
1 | 5 | M | 0.333859 | 11 | 0.311687 | a |
2 | 2 | F | 0.655728 | 12 | 0.031932 | b |
3 | 3 | F | 0.870544 | 13 | 0.0796035 | c |
4 | 4 | M | 0.78748 | 14 | 0.978723 | d |
5 | 10 | F | 0.1 | 15 | 0.235 | f |
6 | 10 | F | 0.1 | 15 | 0.235 | f |
7 | 10 | F | 0.1 | 15 | 0.235 | f |
Working with DataFrame
s
In the next part of the lecture, we will use the RDatasets package. The package provides an easy way for Julia users to use many standard datasets available in the core of the R programming language. We will use the Iris dataset.
using RDatasets, DataFrames
iris = dataset("datasets", "iris")
first(iris, 6)
Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Cat… | |
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
We use the first
function to print the first n = 6
rows of a table. Similarly, the last
function shows the last n
rows. When working with a new dataset, it is helpful to get a basic description. DataFrames provides the describe
function that returns descriptive statistics for each column.
describe(iris)
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
Symbol | Union… | Any | Union… | Any | Int64 | DataType | |
1 | SepalLength | 5.84333 | 4.3 | 5.8 | 7.9 | 0 | Float64 |
2 | SepalWidth | 3.05733 | 2.0 | 3.0 | 4.4 | 0 | Float64 |
3 | PetalLength | 3.758 | 1.0 | 4.35 | 6.9 | 0 | Float64 |
4 | PetalWidth | 1.19933 | 0.1 | 1.3 | 2.5 | 0 | Float64 |
5 | Species | setosa | virginica | 0 | CategoricalValue{String, UInt8} |
We can use the indexing syntax to get a specific subset of a DataFrame
.
iris[2:4, [:SepalLength, :Species]]
Row | SepalLength | Species |
---|---|---|
Float64 | Cat… | |
1 | 4.9 | setosa |
2 | 4.7 | setosa |
3 | 4.6 | setosa |
Additionally, DataFrames provides Not
, Between
, Cols
and All
selectors for more complex column selection scenarios.
iris[2:4, Not([:SepalLength, :Species])]
Row | SepalWidth | PetalLength | PetalWidth |
---|---|---|---|
Float64 | Float64 | Float64 | |
1 | 3.0 | 1.4 | 0.2 |
2 | 3.2 | 1.3 | 0.2 |
3 | 3.1 | 1.5 | 0.2 |
The Query package allows for advanced manipulation with DataFrame
. The code below selects only rows with SepalLength >= 6
and SepalWidth >= 3.4
. Then we create a new DataFrame, where for each of the selected rows, we add the Species, the sum of sepal length and width, and the sum of petal length and width.
using Query
table = @from row in iris begin
@where row.SepalLength >= 6 && row.SepalWidth >= 3.4
@select {
row.Species,
SepalSum = row.SepalLength + row.SepalWidth,
PetalSum = row.PetalLength + row.PetalWidth,
}
@collect DataFrame
end
Row | Species | SepalSum | PetalSum |
---|---|---|---|
Cat… | Float64 | Float64 | |
1 | versicolor | 9.4 | 6.1 |
2 | virginica | 10.8 | 8.6 |
3 | virginica | 11.5 | 8.9 |
4 | virginica | 11.7 | 8.4 |
5 | virginica | 9.7 | 8.0 |
6 | virginica | 9.6 | 7.7 |
There are many topics related to DataFrames. However, there is not enough time to cover them all. We refer the reader to the excellent documentation with lots of examples.
Visualizing using StatsPlots
The StatsPlots package provides recipes for plotting histograms, boxplots, and many other plots related to statistics. This package also provides the @df
macro, which allows simple plotting of tabular data. As a simple example, we create a scatter plot of SepalLength
and SepalWidth
grouped by Species
. Keyword arguments can be used in the same way as before.
using StatsPlots
@df iris scatter(
:SepalLength,
:SepalWidth;
group = :Species,
xlabel = "SepalLength",
ylabel = "SepalWidth",
marker = ([:d :h :star7], 8),
)
As another example, we mention the marginalkde
function for plotting marginal kernel density estimations. In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. The marginalkde
function can be used together with @df
macro.
@df iris marginalkde(
:SepalLength,
:SepalWidth;
xlabel = "SepalLength",
ylabel = "SepalWidth",
)
Another example is the corrplot
function, which shows the correlation between all variables.
@df iris corrplot(
cols(1:4);
grid = false,
nbins = 15,
fillcolor = :viridis,
markercolor = :viridis,
)
Because it is shorter, we use cols(1:4)
instead of the column names.