7. Case Studies#

This chapter describes some of the use cases we’ve found for Julia, as well as benchmarks we’ve tried.

7.1. Gaussian Mixture Modeling#

The following code implements an Expectation-Maximization (EM) algorithm to find the maximum likelihood estimate of the mixing parameter of a two-component Gaussian mixture model (a probabilistic relative of k-means clustering in which each cluster is assumed to have a Gaussian distribution). EM is really a meta-algorithm framework rather than a single specific algorithm; it is a very frequently used approach for finding maximum likelihood estimates, particularly for problems that can be interpreted as involving latent variables or other forms of incomplete data.

7.1.1. Analysis objective#

Let \(Y\) denote the observed univariate outcome. To simplify this coding example, let’s assume that we already know that:

  • cluster one (denoted by \(Z=1\)) has a Gaussian distribution with

    • mean \(\mu_1 = \text{E}[Y|Z=1] = 0\)

    • standard deviation \(\sigma_1 = \sqrt{\text{Var}(Y|Z=1)} = 1\)

  • cluster two (\(Z=2\)) has a Gaussian distribution with

    • mean \(\mu_2 = \text{E}[Y|Z=2] = 2\)

    • standard deviation \(\sigma_2 = \sqrt{\text{Var}(Y|Z=2)} = 1\).

We only need to estimate the mixing probability (the relative frequency of the clusters), \(\pi = P(Z=1)\).
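
Under these assumptions, \(Y\) has the two-component mixture density

\[
p(y) = \pi \, \phi(y; \mu_1, \sigma_1) + (1 - \pi) \, \phi(y; \mu_2, \sigma_2),
\]

where \(\phi(\cdot\,; \mu, \sigma)\) denotes the Gaussian density with mean \(\mu\) and standard deviation \(\sigma\), and \(\pi\) is the only unknown parameter.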

7.1.2. Data-generating simulation#

First, let’s write a function to generate data that our EM algorithm can analyze.

using DataFrames, Distributions, DataFramesMeta

function gen_data(
  ; # no positional arguments
  
  # four keyword arguments:
  n = 500000, 
  mu = [0, 2], 
  sigma = 1, 
  pi = 0.8 
)
    data = DataFrame(
        Obs_ID = 1:n,
        Z = (rand(Uniform(0,1), n) .> pi) .+ 1,
        )

    @transform!(data, :Y = rand(Normal(0, sigma), n) + mu[:Z])    
    @select!(data, :Obs_ID, :Y)
    @transform!(data, :p_Y_given_Z_1 = pdf.(Normal(mu[1], sigma), :Y))
    @transform!(data, :p_Y_given_Z_2 = pdf.(Normal(mu[2], sigma), :Y))

    return data
end
gen_data (generic function with 1 method)

Let’s generate some data using our gen_data() function and examine the results:

using Random
Random.seed!(1234);
data = gen_data(n = 1000, pi = 0.8)
first(data, 10)
10×4 DataFrame
 Row │ Obs_ID  Y           p_Y_given_Z_1  p_Y_given_Z_2
     │ Int64   Float64     Float64        Float64
─────┼──────────────────────────────────────────────────
   1 │      1   1.62181       0.107092      0.371408
   2 │      2  -0.763689      0.298034      0.00875678
   3 │      3   1.46529       0.136357      0.3458
   4 │      4  -0.837116      0.281023      0.00712923
   5 │      5  -0.0998909     0.396957      0.0439937
   6 │      6  -0.726142      0.306487      0.00970742
   7 │      7   1.75943       0.0848614     0.387564
   8 │      8   2.94273       0.00525394    0.255812
   9 │      9   0.445885      0.361192      0.119245
  10 │     10   0.460427      0.35882       0.121958

7.1.3. EM algorithm: top-level function#

The EM algorithm starts with an initial guess for \(\pi\), \(\hat{\pi}_0\). It then alternates between two steps:

  • E step: compute the cluster membership probabilities, \(\hat p(Z|Y)\), for each observed outcome \(Y\), using the current estimate \(\hat{\pi}\)

  • M step: re-estimate \(\hat{\pi}\) from the cluster membership probabilities, \(\hat p(Z|Y)\), computed in the E step

# The `!` at the end of the name `fit_model!` follows the Julia convention of marking functions that modify their arguments; here, `data` and `progress` are passed by reference and modified.

function fit_model!(
    data; # one positional argument
    # keyword arguments:
    pi_hat_0 = 0.5,
    tolerance = 0.00001,
    max_iterations = 1000,
    progress = DataFrame(
        iter = 1:(max_iterations+1),
        pi_hat = Vector{Float64}(undef, max_iterations+1),
        ll = Vector{Float64}(undef, max_iterations+1),
        ll_diff = Vector{Float64}(undef, max_iterations+1)
    )
)

    # initial E step, using the starting guess for pi
    pi_hat = pi_hat_0
    E_step!(data, pi_hat)
    ll = loglik!(data)
    progress[1, :] = (0, pi_hat, ll, NaN)

    last_iter = 0
    for i in 1:max_iterations
        pi_hat = M_step(data) # M step
        E_step!(data, pi_hat) # E step

        ll_old = ll
        ll = loglik!(data)
        ll_diff = ll - ll_old
        progress[i+1, :] = (i, pi_hat, ll, ll_diff)

        if ll_diff < tolerance
            last_iter = i
            break
        end
    end
    return progress[1:last_iter, :]
end
fit_model! (generic function with 1 method)

7.1.4. E step subroutine#

The E step uses Bayes’ Theorem to compute \(\hat p(Z|Y)\) from the current estimate \(\hat\pi\).
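
For cluster one, this posterior probability is

\[
\hat p(Z=1 \mid Y=y) = \frac{\hat\pi \, p(y \mid Z=1)}{\hat\pi \, p(y \mid Z=1) + (1-\hat\pi) \, p(y \mid Z=2)},
\]

which the function below builds up column by column (the densities \(p(y \mid Z=z)\) were precomputed by gen_data()):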

function E_step!(data, pi_hat)
    @transform!(data, :pY_Z1 = :p_Y_given_Z_1 .* pi_hat)
    @transform!(data, :pY_Z2 = :p_Y_given_Z_2 .* (1- pi_hat))
    @transform!(data, :pY = :pY_Z1 + :pY_Z2)
    @transform!(data, :pZ1_given_Y = :pY_Z1 ./ :pY)
end
E_step! (generic function with 1 method)

7.1.5. M step subroutine#

The M step treats the possible cluster memberships as weighted data, with the weights determined by the cluster membership probabilities from the E step.
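
In symbols, the updated estimate is the average of the posterior membership probabilities from the E step:

\[
\hat\pi_{\text{new}} = \frac{1}{n} \sum_{i=1}^{n} \hat p(Z_i = 1 \mid Y_i).
\]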

function M_step(data)
    mean(data[!, :pZ1_given_Y])
end
M_step (generic function with 1 method)

7.1.6. Convergence criterion#

We check for convergence using the change in the observed-data log-likelihood:

function loglik!(data)
    sum(log.(data[!, :pY]))
end
loglik! (generic function with 1 method)
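
In symbols, the criterion tracks

\[
\ell(\hat\pi) = \sum_{i=1}^{n} \log \hat p(Y_i), \qquad \hat p(Y_i) = \hat\pi \, p(Y_i \mid Z=1) + (1-\hat\pi) \, p(Y_i \mid Z=2),
\]

and the algorithm stops once the increase in \(\ell\) between iterations falls below the tolerance.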

7.1.7. Timing our implementation#

Let’s speed-test our implementation:

@time results = fit_model!(data);
  0.618296 seconds (1.10 M allocations: 75.855 MiB, 3.49% gc time, 99.25% compilation time)

Note that most of the computation time was spent compiling our code. A second call to fit_model!() will be faster, since recompilation isn’t needed:

# for a fair comparison, let's reset `data` to its original state, since `fit_model!()` augmented the previous copy.
Random.seed!(1234);
data = gen_data(n = 1000, pi = 0.8)

@time results = fit_model!(data);
  0.003415 seconds (6.01 k allocations: 828.656 KiB)
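
The @time macro measures a single run. For steadier measurements that exclude compilation, the BenchmarkTools.jl package (an assumption here; it is not used elsewhere in this chapter) runs the code repeatedly and reports summary statistics. A minimal sketch:

using BenchmarkTools

# Benchmark on a fresh copy for each evaluation, since fit_model!() modifies its argument.
Random.seed!(1234);
data = gen_data(n = 1000, pi = 0.8)
@btime fit_model!(d) setup = (d = copy($data)) evals = 1;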

7.1.8. Examine the results#

results
12×4 DataFrame
 Row │ iter   pi_hat    ll        ll_diff
     │ Int64  Float64   Float64   Float64
─────┼──────────────────────────────────────────
   1 │     0  0.5       -1738.58  NaN
   2 │     1  0.663909  -1659.42   79.1631
   3 │     2  0.733948  -1642.81   16.6087
   4 │     3  0.764507  -1639.21    3.60109
   5 │     4  0.778503  -1638.39    0.815393
   6 │     5  0.78512   -1638.2     0.189709
   7 │     6  0.788302  -1638.16    0.0447865
   8 │     7  0.789847  -1638.15    0.0106521
   9 │     8  0.7906    -1638.14    0.00254294
  10 │     9  0.790967  -1638.14    0.000608182
  11 │    10  0.791147  -1638.14    0.000145587
  12 │    11  0.791235  -1638.14    3.4866e-5
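
The final estimate, \(\hat\pi \approx 0.791\), is close to the true value \(\pi = 0.8\) used to generate the data.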

7.1.9. Graph the algorithm’s path towards the MLE#

using Gadfly

Gadfly.plot(
  results,
  x = :pi_hat,
  y = :ll,
  Geom.point,
  Geom.line)
[Figure: Gadfly scatter/line plot of ll versus pi_hat, tracing the algorithm's path from the starting point \((0.5, -1738.6)\) to the MLE at approximately \((0.791, -1638.1)\).]

Compare these results with an implementation in R. Despite very similar code, the Julia implementation is much faster!

7.2. Network Construction#

The Armed Conflict Location & Event Data (ACLED) project collects data about political violence and protests around the world. In a recent exploratory project with Ashish Shenoy (UC Davis) and Eoin McGuirk (Tufts), we studied the relationships between events and tried to identify sequences of related events. The full ACLED dataset consists of millions of records; we focused on data about Africa from 2016 to 2020, a subset of 112,926 records. One of our goals was to construct a network of related events, linking events based on proximity in time and space and on common participants.

We first developed a Python script to construct the network (as an edge list). The script was prohibitively slow, so we translated it into Julia. The translation required only minor (mostly syntactic) changes to the code. We also developed an “optimized” version of the Python script that uses a more efficient strategy for distance computations.

For comparison, here’s the code for all three:

Julia:

# Timing estimate: 7.41 seconds
using DataFrames
using Dates
using Distances
using Parquet


function pair_events_on_date(date, events, max_days = Dates.Day(30), max_km = 80)
    # Get all events on the date.
    d₀ = Dates.Day(0)
    dist_days = events[:, :event_date] - date
    events_on_date = events[dist_days .== d₀, :]

    # Get all events in max_days window.
    is_in_window = (d₀ .<= dist_days) .& (dist_days .<= max_days)
    events_in_window = events[is_in_window, :]
    events_in_window[:, :dist_days] .= dist_days[is_in_window]

    # For each event on date, get km distance to events in window.
    xy_date = Matrix{Float64}(events_on_date[:, [:x, :y]])'
    xy_window = Matrix{Float64}(events_in_window[:, [:x, :y]])'
    dist_km = pairwise(Euclidean(), xy_window, xy_date, dims = 2)
    dist_km ./= 1_000

    all_pairs = []
    for (i, event) in enumerate(eachrow(events_on_date))
        is_in_radius = dist_km[:, i] .<= max_km
        pairs = copy(events_in_window[is_in_radius, :])
        pairs[:, :dist_km] = dist_km[is_in_radius, i]
        pairs[:, :event1_id] .= event[:data_id]

        push!(all_pairs, pairs)
    end

    result = vcat(all_pairs...)
    select!(result, Not([:x, :y]), :data_id => :event2_id)
    return result
end


function make_proximal_events_edgelist(events)
    dates = unique(events[:, :event_date])

    paired = [pair_events_on_date(d, events) for d in dates]
    paired = vcat(paired...)

    return paired
end


# Read the data.
path = "../output/acled_africa_2016_2020.parquet"
df = DataFrame(read_parquet(path))
cols = [:event_date, :x, :y, :actor1, :actor2, :data_id]
df = df[:, cols]

# Convert dates to Date type.
dates = unix2datetime.(df[:, :event_date] / 1e6)
dates = convert.(Date, dates)
df[!, :event_date] .= dates;

make_proximal_events_edgelist(df)
Python (original):

# Timing estimate: 2040.30 seconds
import geopandas as gpd
import pandas as pd


def pair_events_on_date(
        date
        , events
        , max_days = pd.Timedelta(30, "days")
        , max_km = 80):
    # Get all events on the date.
    ZERO_DAYS = pd.Timedelta(0, "days")
    dist_days = events["event_date"] - date
    events_on_date = events.loc[dist_days == ZERO_DAYS]

    # Get all events in max_days window.
    is_in_window = (ZERO_DAYS <= dist_days) & (dist_days <= max_days)
    events_in_window = events.loc[is_in_window, :].copy()
    events_in_window.loc[:, "dist_days"] = dist_days[is_in_window]

    # For each event on date, get km distance to events in window.
    all_pairs = []
    for event in events_on_date.itertuples():
        # Compute distance and convert meters to kilometers.
        dist_km = events_in_window.distance(event.geometry)
        dist_km /= 1_000

        # Get all events in max_km radius.
        is_in_radius = (dist_km <= max_km)
        pairs = events_in_window.loc[is_in_radius, :].copy()
        pairs.loc[:, "dist_km"] = dist_km[is_in_radius]
        pairs.loc[:, "event1_id"] = event.data_id

        all_pairs.append(pd.DataFrame(pairs))

    # Clean up colnames and remove match with self.
    all_pairs = pd.concat(all_pairs, axis = 0)
    all_pairs.rename({"data_id": "event2_id"}, axis = 1, inplace = True)
    #all_pairs.query("event1_id != event2_id", inplace = True)
    all_pairs.drop("geometry", axis = 1, inplace = True)
    return all_pairs


def make_proximal_events_edgelist(events):
    dates = events["event_date"].unique()

    paired = [pair_events_on_date(d, events) for d in dates]
    paired = pd.concat(paired, axis = 0)

    paired.reset_index(drop = True, inplace = True)

    return paired


# Read the data.
path = "../output/acled_africa_2016_2020.parquet"
df = gpd.read_parquet(path)

make_proximal_events_edgelist(df)
Python (optimized):

# Timing estimate: 367.60 seconds
import pandas as pd
from scipy.spatial.distance import cdist


def pair_events_on_date(
        date
        , events
        , max_days = pd.Timedelta(30, "days")
        , max_km = 80):
    # Get all events on the date.
    ZERO_DAYS = pd.Timedelta(0, "days")
    dist_days = events["event_date"] - date
    events_on_date = events.loc[dist_days == ZERO_DAYS]

    # Get all events in max_days window.
    is_in_window = (ZERO_DAYS <= dist_days) & (dist_days <= max_days)
    events_in_window = events.loc[is_in_window, :].copy()
    events_in_window.loc[:, "dist_days"] = dist_days[is_in_window]

    xy_date = events_on_date.loc[:, ["x", "y"]].to_numpy()
    xy_window = events_in_window.loc[:, ["x", "y"]].to_numpy()
    dist_km = cdist(xy_window, xy_date, "euclidean") / 1_000

    # For each event on date, get km distance to events in window.
    all_pairs = []
    for i, event in enumerate(events_on_date.itertuples()):
        # Get all events in max_km radius.
        is_in_radius = (dist_km[:, i] <= max_km)
        pairs = events_in_window.loc[is_in_radius, :].copy()
        pairs.loc[:, "dist_km"] = dist_km[is_in_radius, i]
        pairs.loc[:, "event1_id"] = event.data_id

        all_pairs.append(pd.DataFrame(pairs))

    # Clean up colnames and remove match with self.
    all_pairs = pd.concat(all_pairs, axis = 0)
    all_pairs.rename({"data_id": "event2_id"}, axis = 1, inplace = True)
    #all_pairs.query("event1_id != event2_id", inplace = True)
    all_pairs.drop(["x", "y"], axis = 1, inplace = True)
    return all_pairs


def make_proximal_events_edgelist(events):
    dates = events["event_date"].unique()

    paired = [pair_events_on_date(d, events) for d in dates]
    paired = pd.concat(paired, axis = 0)

    paired.reset_index(drop = True, inplace = True)

    return paired


# Read the data.
path = "../output/acled_africa_2016_2020.parquet"
df = pd.read_parquet(path)
cols = ["event_date", "x", "y", "actor1", "actor2", "data_id"]
df = df.loc[:, cols]

make_proximal_events_edgelist(df)

We benchmarked the scripts on a consumer-grade laptop (Intel Core i7-5500U at 2.4GHz; 12GB RAM). In our tests, the Python script took approximately 275 times longer to run than the Julia script (2040.3 seconds versus 7.4 seconds). Even the optimized Python script took approximately 50 times longer to run than the Julia script (367.6 seconds). Note that in the Julia script we only specified types where required; it may be possible to make the script even faster by adding more type information or taking advantage of other optimization strategies for Julia code.
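
As one illustration of the kind of further optimization available, the per-date loop in make_proximal_events_edgelist is embarrassingly parallel, so it could be distributed across threads. The sketch below is a hypothetical variant (not the version we benchmarked) that assumes Julia was started with multiple threads (e.g., julia -t auto):

# Hypothetical multithreaded variant of make_proximal_events_edgelist;
# each date's pairing is independent of the others, so the iterations
# can safely run in parallel.
function make_proximal_events_edgelist_threaded(events)
    dates = unique(events[:, :event_date])
    paired = Vector{DataFrame}(undef, length(dates))
    Threads.@threads for i in eachindex(dates)
        paired[i] = pair_events_on_date(dates[i], events)
    end
    return vcat(paired...)
end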