Dangerous Streets: Using ML to Prioritize Cyclist Safety in Austin

The Vision Zero Challenge

Austin, like many cities, has committed to Vision Zero, the goal of eliminating traffic fatalities. But with limited budgets, city planners face a difficult question: Which streets should we fix first?

Aggregate statistics tell us that high-speed roads are dangerous, but that's not granular enough for planning. I wanted to build a system that could rank specific corridors based on their predicted risk, helping to prioritize infrastructure investments where they are needed most.

I started by pulling data from the Texas Department of Transportation’s Crash Records Information System (CRIS). The dataset included 2,757 crashes involving bikes from 2015 to 2025.

[Dataset Statistics] Total Crashes: 2,757 | Severe Injuries: 12.1% | Timeframe: 2015 to 2025

Designing an Actionable Model

When building a predictive model, it's easy to include every variable you have. But for public policy, "explainability" and "actionability" are key.

I deliberately excluded features like cyclist age or gender. While highly predictive, they aren't actionable—you can't redesign a street to exclude certain demographics. Instead, I forced the model to learn only from infrastructure and environmental factors.
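
As an illustration of that filtering step, here is a minimal sketch in pandas. The column names (`cyclist_age`, `cyclist_gender`, `speed_limit`, `has_bike_lane`) are hypothetical placeholders, not the actual CRIS field names, and the toy rows are made up.

```python
import pandas as pd

# Hypothetical column names; the real CRIS export uses different fields.
DEMOGRAPHIC_COLS = ["cyclist_age", "cyclist_gender"]


def select_actionable_features(crashes: pd.DataFrame) -> pd.DataFrame:
    """Drop demographic columns, keeping infrastructure and environment features."""
    # errors="ignore" lets this run even if a column is already absent.
    return crashes.drop(columns=DEMOGRAPHIC_COLS, errors="ignore")


# Made-up rows, only to show the shape of the data.
crashes = pd.DataFrame({
    "speed_limit": [30, 45],
    "has_bike_lane": [1, 0],
    "cyclist_age": [34, 19],
    "cyclist_gender": ["F", "M"],
})
features = select_actionable_features(crashes)
```

Keeping the exclusion list in one named constant makes the modeling choice easy to audit later.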

[Model Performance, Held-out Test Set] Recall (Severe): 56.8% | ROC-AUC: 0.670 | Strategy: Prioritized Recall over Precision

Using only infrastructure features, the model identifies over half (56.8%) of the severe crashes in the held-out test set. Given the class imbalance (only about 12% of crashes were severe) and a ROC-AUC of 0.670, this suggests that the built environment carries a real, if moderate, signal for safety risk.
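
To make the recall-versus-precision trade-off concrete, here is a small sketch using scikit-learn metrics. The labels and probabilities are toy values, and the thresholds 0.5 and 0.3 are illustrative assumptions, not the ones used in the project.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score


def evaluate_at_threshold(y_true, y_prob, threshold=0.5):
    """Score predicted probabilities after cutting them at a decision threshold."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        # ROC-AUC uses the raw probabilities, so it does not depend on the threshold.
        "roc_auc": roc_auc_score(y_true, y_prob),
    }


# Toy data: 1 = severe crash, 0 = non-severe.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.35, 0.45, 0.6, 0.4, 0.8])

strict = evaluate_at_threshold(y_true, y_prob, threshold=0.5)
lenient = evaluate_at_threshold(y_true, y_prob, threshold=0.3)
```

Lowering the threshold catches more of the severe cases at the cost of more false alarms, which is the trade a recall-first strategy accepts.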

I used a Stacking Ensemble combining XGBoost, LightGBM, and Gradient Boosting, optimized with Bayesian hyperparameter tuning. Here is a snippet of the training pipeline:

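import xgboost as xgb
from lightgbm import LGBMClassifier
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression


def train_stacking_ensemble(X_train, y_train):
    """Fit a stacked ensemble of three gradient-boosted models."""
    # Handle class imbalance: weight the positive (severe) class by the
    # ratio of non-severe to severe crashes in the training data
    pos_weight = (y_train == 0).sum() / (y_train == 1).sum()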
    
    # Base estimators
    estimators = [
        ('xgb', xgb.XGBClassifier(
            scale_pos_weight=pos_weight,
            learning_rate=0.05,
            max_depth=4,
            eval_metric='auc'
        )),
        ('lgb', LGBMClassifier(class_weight='balanced')),
        ('gb', GradientBoostingClassifier(max_depth=4))
    ]
    
    # Meta-learner
    meta_learner = LogisticRegression(class_weight='balanced')
    
    # Stacking classifier
    clf = StackingClassifier(
        estimators=estimators,
        final_estimator=meta_learner,
        cv=5,
        n_jobs=-1
    )
    
    clf.fit(X_train, y_train)
    return clf
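
As a rough check on the class-imbalance weight computed in the snippet, assume the training split has about the same 12.1% severe share as the full dataset (the actual split is not shown here). The ratio of non-severe to severe crashes then works out to:

```latex
\text{pos\_weight} = \frac{N_{\text{non-severe}}}{N_{\text{severe}}}
\approx \frac{1 - 0.121}{0.121} \approx 7.3
```

Under that assumption, each severe crash is weighted roughly seven times as heavily as a non-severe one when XGBoost fits its trees.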

To interpret the "black box" of the ensemble, I used SHAP (SHapley Additive exPlanations), which breaks down the contribution of each infrastructure feature to the final risk score.
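
The sketch below shows one common way to turn a matrix of per-crash SHAP values into a feature ranking. It assumes the SHAP values have already been computed elsewhere (that step is not shown), and the feature names and numbers are toy placeholders rather than results from this project.

```python
import numpy as np


def summarize_shap(shap_values, feature_names):
    """Rank features by mean |SHAP|, keeping the mean signed value for direction."""
    shap_values = np.asarray(shap_values, dtype=float)
    mean_abs = np.abs(shap_values).mean(axis=0)   # overall importance
    mean_signed = shap_values.mean(axis=0)        # average push up or down
    order = np.argsort(mean_abs)[::-1]            # most important first
    return [
        (feature_names[i], float(mean_abs[i]), float(mean_signed[i]))
        for i in order
    ]


# Toy SHAP matrix: rows are crashes, columns are features (made-up values).
toy_shap = np.array([
    [0.30, -0.10],
    [-0.20, -0.05],
    [0.40, -0.15],
])
ranking = summarize_shap(toy_shap, ["speed_limit", "has_bike_lane"])
```

Mean |SHAP| says how much a feature moves predictions overall, while the signed mean hints at whether it tends to raise or lower predicted risk.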

What Actually Matters?

The SHAP analysis revealed a clear hierarchy of risk factors. While we often focus on bike lanes, the data shows that Speed Limit is the single most important actionable predictor of crash severity.

Feature Importance (Mean |SHAP| Value)

The model suggests that:

Speed Limit (0.311): This is nearly 3x the importance of bike lane presence (0.104). Kinetic energy grows with the square of speed, so crashes at higher speeds are far more likely to be severe.
Intersection Complexity (0.260): Complex intersections are major conflict points.
Lack of Traffic Control (0.189): Uncontrolled intersections significantly raise risk.

This doesn't mean bike lanes don't work. They do (SHAP -0.104, indicating a protective effect), but they are part of a larger system. A painted lane on a 45 mph road is less effective than slowing the cars down to 30 mph.
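
To put numbers on the 45 mph versus 30 mph comparison, the standard kinetic energy formula, KE = ½mv², gives the following ratio for the same vehicle. This is basic physics, not a model output:

```latex
\frac{KE_{45}}{KE_{30}}
= \frac{\tfrac{1}{2}\, m \,(45)^2}{\tfrac{1}{2}\, m \,(30)^2}
= \left(\frac{45}{30}\right)^2
= 2.25
```

A car at 45 mph therefore carries 2.25 times the kinetic energy it would at 30 mph.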

The Most Dangerous Corridors

By aggregating crash-level predictions to the street level, I identified the top "High Risk Gaps." I calculated a Composite Danger Score for each corridor using the following weighted criteria:

40% Severe Count
30% Risk Score
20% Speed Limit
10% Infra Gaps

This formula ensures we prioritize streets that are not just historically dangerous (high crash counts), but also structurally dangerous (high predicted risk and speed) and neglected (high infrastructure gaps).
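
Here is a minimal sketch of how such a weighted score could be computed. The weights come from the list above. Everything else is an assumption for illustration, not the project's actual code: the column names, the min-max scaling used to put the four criteria on a common 0 to 1 range, and the corridor data.

```python
import pandas as pd

# Weights from the article; column names are hypothetical.
WEIGHTS = {
    "severe_count": 0.40,
    "risk_score": 0.30,
    "speed_limit": 0.20,
    "infra_gap": 0.10,
}


def composite_danger_score(corridors: pd.DataFrame) -> pd.Series:
    """Weighted sum of min-max scaled criteria, sorted from highest to lowest."""
    scaled = pd.DataFrame(index=corridors.index)
    for col in WEIGHTS:
        lo, hi = corridors[col].min(), corridors[col].max()
        span = hi - lo
        # A constant column carries no ranking information; score it as 0.
        scaled[col] = (corridors[col] - lo) / span if span > 0 else 0.0
    score = sum(WEIGHTS[col] * scaled[col] for col in WEIGHTS)
    return score.sort_values(ascending=False)


# Made-up corridors, only to show the mechanics.
corridors = pd.DataFrame(
    {
        "severe_count": [10, 4, 1],
        "risk_score": [0.6, 0.8, 0.2],
        "speed_limit": [45, 60, 30],
        "infra_gap": [1.0, 0.5, 0.0],
    },
    index=["Corridor A", "Corridor B", "Corridor C"],
)
ranking = composite_danger_score(corridors)
```

In this toy example, Corridor A ranks first even though Corridor B has the higher speed limit and risk score, because severe crash count carries the largest weight.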

These are the top 5 most dangerous streets for cyclists in Austin:

Top 5 Dangerous Corridors (Composite Score)

IH-35 leads the list with 21 severe crashes and a 16% severe crash rate. Notably, 100% of these crashes occurred at locations with no bike infrastructure.

Other major corridors like US-183 and S Congress Ave also appear, sharing a common profile: high speeds, complex intersections, and a lack of protected facilities.

The Equity Dimension

Mapping these dangerous streets reveals a troubling pattern. The risks are not evenly distributed; they are concentrated in East and South Austin.

For example, on S Pleasant Valley Rd, 55% of crashes occurred in dark conditions. This suggests a simple but critical intervention: better street lighting.
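
A per-street share like this can be computed with a simple group-by. In the sketch below, the column names (`street`, `light_condition`) and the label strings are hypothetical stand-ins for the CRIS fields, and the rows are made up.

```python
import pandas as pd


def dark_share_by_street(crashes: pd.DataFrame) -> pd.Series:
    """Fraction of each street's crashes that happened in dark conditions."""
    # Treat any label starting with "dark" (lighted or not) as a dark-condition crash.
    is_dark = crashes["light_condition"].str.lower().str.startswith("dark")
    return is_dark.groupby(crashes["street"]).mean().sort_values(ascending=False)


# Made-up rows, only to show the calculation.
crashes = pd.DataFrame({
    "street": ["Street A"] * 4 + ["Street B"] * 2,
    "light_condition": [
        "Dark, not lighted", "Daylight", "Dark, lighted", "Daylight",
        "Daylight", "Daylight",
    ],
})
share = dark_share_by_street(crashes)
```

Sorting the result puts the streets where lighting improvements might matter most at the top.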

Historically underinvested neighborhoods often have the high-speed arterials with the poorest lighting and the least infrastructure.

This aligns with previous research on "mobility related environmental injustice." The pattern in the data supports treating equity as a core component of safety planning.

From Data to Action

This project demonstrates that we can move beyond simple crash counts to a more proactive, risk-based approach to safety.

Based on the model's findings, the priorities for Austin's Vision Zero plan should include:

Speed Management: Speed limit was the strongest actionable predictor in the model, so lowering limits on arterials is the most effective lever we have.
Protected Intersections: Addressing the complexity of major crossings.
Basic Infrastructure: Filling gaps in lighting and traffic signals, especially in East/South Austin.

Data science can point the way, but it's up to policy makers and citizens to push for the concrete changes that save lives.