DataFrames.jl
DataFrames is a package that provides a set of tools for working with tabular data. Its design and functionality are similar to pandas (in Python) and data.frame, data.table and dplyr (in R) or table (in Matlab). This makes it a great general-purpose data science tool, especially for people coming to Julia from other languages.
The core of the package is the DataFrame structure that represents a data table. The simplest way of constructing a DataFrame is to pass column vectors using keyword arguments or pairs.
using DataFrames
df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"], C = rand(4))| Row | A | B | C |
|---|---|---|---|
| Int64 | String | Float64 | |
| 1 | 1 | M | 0.694365 |
| 2 | 2 | F | 0.335163 |
| 3 | 3 | F | 0.812815 |
| 4 | 4 | M | 0.133397 |
Since each column is stored in a DataFrame as a separate vector, it is possible to combine columns of different element types. Columns can be accessed directly, without copying.
julia> df.A4-element Vector{Int64}: 1 2 3 4
Another way is to use the indexing syntax similar to the one for arrays.
julia> df[!, :A]4-element Vector{Int64}: 1 2 3 4
We use ! to select all rows. This creates a pointer to the column. If we use :, then we get a copy of a column. Since vectors are mutable structures and accessing a column of DataFrame via ! does not make a copy, it is possible to change elements of the DataFrame.
df.A[1] = 5
df| Row | A | B | C |
|---|---|---|---|
| Int64 | String | Float64 | |
| 1 | 5 | M | 0.694365 |
| 2 | 2 | F | 0.335163 |
| 3 | 3 | F | 0.812815 |
| 4 | 4 | M | 0.133397 |
On the other hand, the : creates a copy, therefore changes will not affect the original DataFrame.
col = df[:, :A]
col[1] = 4
df| Row | A | B | C |
|---|---|---|---|
| Int64 | String | Float64 | |
| 1 | 5 | M | 0.694365 |
| 2 | 2 | F | 0.335163 |
| 3 | 3 | F | 0.812815 |
| 4 | 4 | M | 0.133397 |
DataFrames allow using symbols (like :A) and strings (like "A") for all column indexing operations. Using symbols is slightly faster and should be preferred. One exception is when the column names are generated using string manipulation.
The standard format for storing table data is the csv file format. The CSV package provides an interface for saving and loading csv files.
using CSV
CSV.write("dataframe.csv", df)
table = CSV.read("dataframe.csv", DataFrame; header = true)| Row | A | B | C |
|---|---|---|---|
| Int64 | String1 | Float64 | |
| 1 | 5 | M | 0.694365 |
| 2 | 2 | F | 0.335163 |
| 3 | 3 | F | 0.812815 |
| 4 | 4 | M | 0.133397 |
See the package documentation for more information.
Adding columns and rows
It is common for tables to be created column by column or row by row. DataFrames provides an easy way to extend existing tables. You can add new columns to a DataFrame in a direct way.
df.D = [:a, :b, :c, :d]
df| Row | A | B | C | D |
|---|---|---|---|---|
| Int64 | String | Float64 | Symbol | |
| 1 | 5 | M | 0.694365 | a |
| 2 | 2 | F | 0.335163 | b |
| 3 | 3 | F | 0.812815 | c |
| 4 | 4 | M | 0.133397 | d |
Alternatively, we can use the insertcols! function. This function can insert multiple columns at once and also provides advanced options for column manipulation. For example, we can specify the column index into which the columns are to be inserted.
insertcols!(df, 3, :B => rand(4), :B => 11:14; makeunique = true)| Row | A | B | B_1 | B_2 | C | D |
|---|---|---|---|---|---|---|
| Int64 | String | Float64 | Int64 | Float64 | Symbol | |
| 1 | 5 | M | 0.301244 | 11 | 0.694365 | a |
| 2 | 2 | F | 0.966312 | 12 | 0.335163 | b |
| 3 | 3 | F | 0.043748 | 13 | 0.812815 | c |
| 4 | 4 | M | 0.276899 | 14 | 0.133397 | d |
New rows can be added to an existing DataFrame by the push! function. It is possible to append a new row in the form of a vector or a tuple of the correct length or in the form of a dictionary or DataFrame with the correct keys.
push!(df, [10, "F", 0.1, 15, 0.235, :f])
push!(df, (10, "F", 0.1, 15, 0.235, :f))
push!(df, Dict(:B_1 => 0.1, :B_2 => 15, :A => 10, :D => :f, :B => "F", :C => 0.235))
df| Row | A | B | B_1 | B_2 | C | D |
|---|---|---|---|---|---|---|
| Int64 | String | Float64 | Int64 | Float64 | Symbol | |
| 1 | 5 | M | 0.301244 | 11 | 0.694365 | a |
| 2 | 2 | F | 0.966312 | 12 | 0.335163 | b |
| 3 | 3 | F | 0.043748 | 13 | 0.812815 | c |
| 4 | 4 | M | 0.276899 | 14 | 0.133397 | d |
| 5 | 10 | F | 0.1 | 15 | 0.235 | f |
| 6 | 10 | F | 0.1 | 15 | 0.235 | f |
| 7 | 10 | F | 0.1 | 15 | 0.235 | f |
It is also possible to start with an empty DataFrame and build the table incrementally.
using DataFrames
df_empty = DataFrame()
df_empty.A = 1:3
df_empty.B = [:a, :b, :c]
df_empty| Row | A | B |
|---|---|---|
| Int64 | Symbol | |
| 1 | 1 | a |
| 2 | 2 | b |
| 3 | 3 | c |
However, this approach will not work if the DataFrame is created row by row. In this case, the DataFrame must be initialized with empty columns of appropriate element types.
df_empty = DataFrame(A = Int[], B = Symbol[])
push!(df_empty, [1, :a])
push!(df_empty, (2, :b))
push!(df_empty, Dict(:A => 3, :B => :c))
df_empty| Row | A | B |
|---|---|---|
| Int64 | Symbol | |
| 1 | 1 | a |
| 2 | 2 | b |
| 3 | 3 | c |
Renaming
There are two functions that can be used to rename columns. The names function returns column names as a vector of strings, while the propertynames function returns a vector of symbols.
julia> names(df)6-element Vector{String}: "A" "B" "B_1" "B_2" "C" "D"julia> propertynames(df)6-element Vector{Symbol}: :A :B :B_1 :B_2 :C :D
We use the rename! function to change column names. This function can be used to rename all columns at once.
rename!(df, [:a, :b, :c, :d, :e, :f])
df| Row | a | b | c | d | e | f |
|---|---|---|---|---|---|---|
| Int64 | String | Float64 | Int64 | Float64 | Symbol | |
| 1 | 5 | M | 0.301244 | 11 | 0.694365 | a |
| 2 | 2 | F | 0.966312 | 12 | 0.335163 | b |
| 3 | 3 | F | 0.043748 | 13 | 0.812815 | c |
| 4 | 4 | M | 0.276899 | 14 | 0.133397 | d |
| 5 | 10 | F | 0.1 | 15 | 0.235 | f |
| 6 | 10 | F | 0.1 | 15 | 0.235 | f |
| 7 | 10 | F | 0.1 | 15 | 0.235 | f |
Another option is to rename only some columns, specified by their names.
rename!(df, :a => :A, :f => :F)
df| Row | A | b | c | d | e | F |
|---|---|---|---|---|---|---|
| Int64 | String | Float64 | Int64 | Float64 | Symbol | |
| 1 | 5 | M | 0.301244 | 11 | 0.694365 | a |
| 2 | 2 | F | 0.966312 | 12 | 0.335163 | b |
| 3 | 3 | F | 0.043748 | 13 | 0.812815 | c |
| 4 | 4 | M | 0.276899 | 14 | 0.133397 | d |
| 5 | 10 | F | 0.1 | 15 | 0.235 | f |
| 6 | 10 | F | 0.1 | 15 | 0.235 | f |
| 7 | 10 | F | 0.1 | 15 | 0.235 | f |
It is also possible to use a function to generate column names.
myname(x) = string("column_", uppercase(x))
rename!(myname, df)
df| Row | column_A | column_B | column_C | column_D | column_E | column_F |
|---|---|---|---|---|---|---|
| Int64 | String | Float64 | Int64 | Float64 | Symbol | |
| 1 | 5 | M | 0.301244 | 11 | 0.694365 | a |
| 2 | 2 | F | 0.966312 | 12 | 0.335163 | b |
| 3 | 3 | F | 0.043748 | 13 | 0.812815 | c |
| 4 | 4 | M | 0.276899 | 14 | 0.133397 | d |
| 5 | 10 | F | 0.1 | 15 | 0.235 | f |
| 6 | 10 | F | 0.1 | 15 | 0.235 | f |
| 7 | 10 | F | 0.1 | 15 | 0.235 | f |
Working with DataFrames
In the next part of the lecture, we will use the RDatasets package. The package provides an easy way for Julia users to use many standard datasets available in the core of the R programming language. We will use the Iris dataset.
using RDatasets, DataFrames
iris = dataset("datasets", "iris")
first(iris, 6)| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
|---|---|---|---|---|---|
| Float64 | Float64 | Float64 | Float64 | Cat… | |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
We use the first function to print the first n = 6 rows of a table. Similarly, the last function shows the last n rows. When working with a new dataset, it is helpful to get a basic description. DataFrames provides the describe function that returns descriptive statistics for each column.
describe(iris)| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| Symbol | Union… | Any | Union… | Any | Int64 | DataType | |
| 1 | SepalLength | 5.84333 | 4.3 | 5.8 | 7.9 | 0 | Float64 |
| 2 | SepalWidth | 3.05733 | 2.0 | 3.0 | 4.4 | 0 | Float64 |
| 3 | PetalLength | 3.758 | 1.0 | 4.35 | 6.9 | 0 | Float64 |
| 4 | PetalWidth | 1.19933 | 0.1 | 1.3 | 2.5 | 0 | Float64 |
| 5 | Species | setosa | virginica | 0 | CategoricalValue{String, UInt8} |
We can use the indexing syntax to get a specific subset of a DataFrame.
iris[2:4, [:SepalLength, :Species]]| Row | SepalLength | Species |
|---|---|---|
| Float64 | Cat… | |
| 1 | 4.9 | setosa |
| 2 | 4.7 | setosa |
| 3 | 4.6 | setosa |
Additionally, DataFrames provides Not, Between, Cols and All selectors for more complex column selection scenarios.
iris[2:4, Not([:SepalLength, :Species])]| Row | SepalWidth | PetalLength | PetalWidth |
|---|---|---|---|
| Float64 | Float64 | Float64 | |
| 1 | 3.0 | 1.4 | 0.2 |
| 2 | 3.2 | 1.3 | 0.2 |
| 3 | 3.1 | 1.5 | 0.2 |
The Query package allows for advanced manipulation with DataFrame. The code below selects only rows with SepalLength >= 6 and SepalWidth >= 3.4. Then we create a new DataFrame, where for each of the selected rows, we add the Species, the sum of sepal length and width, and the sum of petal length and width.
using Query
table = @from row in iris begin
@where row.SepalLength >= 6 && row.SepalWidth >= 3.4
@select {
row.Species,
SepalSum = row.SepalLength + row.SepalWidth,
PetalSum = row.PetalLength + row.PetalWidth,
}
@collect DataFrame
end| Row | Species | SepalSum | PetalSum |
|---|---|---|---|
| Cat… | Float64 | Float64 | |
| 1 | versicolor | 9.4 | 6.1 |
| 2 | virginica | 10.8 | 8.6 |
| 3 | virginica | 11.5 | 8.9 |
| 4 | virginica | 11.7 | 8.4 |
| 5 | virginica | 9.7 | 8.0 |
| 6 | virginica | 9.6 | 7.7 |
There are many topics related to DataFrames. However, there is not enough time to cover them all. We refer the reader to the excellent documentation with lots of examples.
Visualizing using StatsPlots
The StatsPlots package provides recipes for plotting histograms, boxplots, and many other plots related to statistics. This package also provides the @df macro, which allows simple plotting of tabular data. As a simple example, we create a scatter plot of SepalLength and SepalWidth grouped by Species. Keyword arguments can be used in the same way as before.
using StatsPlots
@df iris scatter(
:SepalLength,
:SepalWidth;
group = :Species,
xlabel = "SepalLength",
ylabel = "SepalWidth",
marker = ([:d :h :star7], 8),
)As another example, we mention the marginalkde function for plotting marginal kernel density estimations. In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. The marginalkde function can be used together with @df macro.
@df iris marginalkde(
:SepalLength,
:SepalWidth;
xlabel = "SepalLength",
ylabel = "SepalWidth",
)Another example is the corrplot function, which shows the correlation between all variables.
@df iris corrplot(
cols(1:4);
grid = false,
nbins = 15,
fillcolor = :viridis,
markercolor = :viridis,
)Because it is shorter, we use cols(1:4) instead of the column names.