Lost in the Woods – Emily Miller

The Data Everyone Wants to Use

The USDA Forest Service’s Forest Inventory and Analysis program has been running permanent forest plots across all 50 states since the early 2000s. Every five years or so, field crews visit each ~1-acre plot and record every tree: species, diameter, crown class, condition, disturbance history, regeneration. The national dataset currently covers around 6,900 permanent sites with measurements spanning 25 years. It’s the most comprehensive longitudinal record of US forest health that exists.

Researchers who work on forest disturbance, thermophilization, fire ecology, carbon accounting, and biodiversity all eventually find their way to FIA data. And then they spend the next week trying to figure out how it’s structured.

Why FIA Is Hard to Work With

The national FIA database (FIADB v9.4) is distributed as either a 70 GB SQLite file or per-state CSV downloads. Either way, before you can analyze anything, you need to understand the structure - and the user guide is 600+ pages.

The complexity is real, not just documentation overhead. A few things that catch new users:

FIA samples all US land, not just forests. About 59% of plot-year rows in the raw data have pct_forested == 0. They’re fields, parking lots, lawns. The first filter for any forest analysis is just: is this actually a forest plot? This isn’t documented anywhere obvious - you discover it when your species diversity numbers look completely wrong.

Code fields are everywhere. STATUSCD, CCLCD, AGENTCD, COND_STATUS_CD, DSTRBCD, TRTCD - each has its own set of integer codes with specific meanings buried in appendix tables. The wrong code filter doesn’t error. It silently returns the wrong answer.

Table relationships aren’t always intuitive. TREE_GRM_COMPONENT, which holds between-measurement mortality data, links back to TREE via TRE_CN - not to PLOT directly. That means you need a TREE join to get INVYR and SPCD even when you’re only analyzing mortality. The first time you encounter this, it’s genuinely confusing.

50+ tables, and some names don’t telegraph their content. Which table has seedling counts? (SEEDLING.) Okay, but how does it relate to SUBPLOT? What about SUBP_COND? None of this is unfixable, but it slows down the first few weeks considerably.

The Dashboard

The FIA Database Schema Navigator is a Streamlit app for exploring FIADB v9.4 structure without loading the database into memory. The core design constraint: the full SQLite file is 70 GB, so all schema navigation runs off SQLite’s PRAGMA table_info(). Instant, even on the full database.

The app runs in metadata-only mode when no database is connected - the relationship diagram, field codebook, table browser, and SQL templates are all hardcoded from the User Guide and work without the SQLite file. You can explore it live here:

Open the FIA Database Schema Navigator

import sqlite3
import streamlit as st

@st.cache_data
def get_table_schema(db_path: str, table_name: str) -> list[dict]:
    """Schema introspection with no row loading."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(f"PRAGMA table_info({table_name})")
        return [
            {
                "name": col[1],
                "type": col[2],
                "not_null": bool(col[3]),
                "primary_key": bool(col[5]),
            }
            for col in cursor.fetchall()
        ]
    finally:
        conn.close()

Row previews are available but opt-in - you request them explicitly. The app doesn’t load anything by default.

Table Browser

The 50+ FIADB tables are organized into categories pulled from the user guide - Location Level, Tree Level, Invasive/Understory Vegetation, Down Woody Material, Population, Reference - rather than alphabetically. Alphabetical is how a database shows you tables. Categorical is how a researcher looks for them.

TABLE_CATEGORIES: dict[str, list[str]] = {
    "Location Level": [
        "SURVEY", "PROJECT", "COUNTY", "PLOT", "COND",
        "SUBPLOT", "SUBP_COND", "BOUNDARY", "SUBP_COND_CHNG_MTRX",
    ],
    "Tree Level": [
        "TREE", "TREE_WOODLAND_STEMS", "TREE_GRM_COMPONENT",
        "TREE_GRM_THRESHOLD", "TREE_GRM_MIDPT", "TREE_GRM_BEGIN",
        "TREE_GRM_ESTN", "BEGINEND", "SEEDLING", "SITETREE",
    ],
    "Reference": [
        "REF_SPECIES", "REF_FOREST_TYPE", "REF_DAMAGE_AGENT",
        "REF_FOREST_TYPE_GROUP", "REF_SPECIES_GROUP",
        # ...
    ],
    # ...
}

Select a table, and the sidebar shows column names, types, and which are primary keys - no rows loaded.

Field Codebook

A hardcoded reference for the code columns that come up constantly:

Code	Column	Meaning
1	STATUSCD	Live tree
2	STATUSCD	Standing dead tree
1	COND_STATUS_CD	Forested - keep for most analyses
5	COND_STATUS_CD	Non-forest land with trees (deforested)
10	DSTRBCD	Insect damage (general)
30	DSTRBCD	Fire (general)
32	DSTRBCD	Crown fire
80	DSTRBCD	Human-induced disturbance
10	TRTCD	Cutting / harvest
2	CCLCD	Dominant
3	CCLCD	Codominant
4	CCLCD	Intermediate
5	CCLCD	Overtopped

Having these in one place saves reopening the user guide every 20 minutes.

Relationship Diagram

The feature I wish had existed when I started. The interactive graph shows foreign key relationships between tables, rendered with PyVis - hover over an edge to see exactly which columns link which tables. If PyVis isn’t available, it falls back to a static Graphviz diagram.

from pyvis.network import Network

def build_relationship_graph(fk_info: dict) -> str:
    net = Network(height="600px", width="100%", directed=True)
    net.set_options('{"physics": {"solver": "forceAtlas2Based"}}')

    for table, foreign_keys in fk_info.items():
        net.add_node(table, label=table, shape="box")
        for fk in foreign_keys:
            net.add_edge(
                table,
                fk["referred_table"],
                title=f"{fk['constrained_columns']} → {fk['referred_columns']}",
            )
    return net.generate_html()

Understanding that TREE_GRM_COMPONENT → TREE via TRE_CN, for example, would have saved me a few confused hours if I’d had this diagram at the start.

SQL Templates

Pre-built query patterns for common analyses - basal area by species over time, mortality by cause code, seedling counts, forest type composition. The TREE_GRM_COMPONENT → TREE join in the mortality template is the kind of thing that’s easy to miss from the user guide alone.

SQL_TEMPLATES = {
    "Basal area by species over time": """
SELECT
    p.STATECD, p.LAT, p.LON,
    t.INVYR, t.SPCD,
    SUM(0.005454 * t.DIA * t.DIA * t.TPA_UNADJ) AS ba_per_acre
FROM TREE t
JOIN PLOT p ON t.PLT_CN = p.CN
JOIN COND c ON c.PLT_CN = t.PLT_CN AND c.CONDID = t.CONDID
WHERE t.STATUSCD = 1          -- live trees only
  AND t.DIA >= 1.0
  AND c.COND_STATUS_CD = 1    -- forested conditions only
GROUP BY p.STATECD, p.LAT, p.LON, t.INVYR, t.SPCD;
""",
    "Mortality by cause code": """
SELECT
    t.INVYR, t.AGENTCD, t.SPCD,
    SUM(grm.MICR_TPAMORT_UNADJ_AL_FOREST) AS tpa_mortality
FROM TREE_GRM_COMPONENT grm
JOIN TREE t ON grm.TRE_CN = t.CN   -- GRM links to TREE, not PLOT
WHERE grm.MICR_COMPONENT_AL_FOREST IN ('MORTALITY1','MORTALITY2')
  AND grm.MICR_TPAMORT_UNADJ_AL_FOREST > 0
GROUP BY t.INVYR, t.AGENTCD, t.SPCD;
""",
}

Things I Learned About FIA Along the Way

Some things I got wrong before I got them right:

pct_forested is the first filter, not an afterthought. FIA samples all US land. If you skip the forested condition filter and jump to species analysis, your numbers will be wrong and the error won’t be obvious.

TPA_UNADJ is already on the TREE record. You don’t need to compute per-acre expansion factors from subplot geometry. sum(TPA_UNADJ * 0.005454 * DIA^2) gives per-acre basal area directly.

CCLCD has a lot of NAs, particularly in older periodic plots and for woodland species. A reasonable fallback: DIA >= 5 inches → overstory, below → understory.

TREE_GRM_COMPONENT does not carry INVYR. It records measurement intervals, not calendar years. To get the inventory year for a mortality event, join back to TREE on TRE_CN and use TREE.INVYR.

The three disturbance slots (DSTRBCD1/2/3) are ranked by importance, not by year. DSTRBCD1 is the most ecologically significant disturbance recorded, not the most recent one.

Taking It Further: Two More Dashboards

I built the schema navigator first, but the project kept growing. Once I had a full FIA extraction workflow, I ended up building two more dashboards that sit downstream of the same pipeline.

Visualizing the Compiled Data

Open the FIA Compiled Data Explorer

The FIA Compiled Data Explorer is for checking and exploring processed outputs, not raw schema. It reads the compiled FIA summaries produced in the forest-data-compilation.Rproj workflow, then surfaces them in map and chart views for QA and exploratory analysis.

The Full Project

Open the Full Forest Data Compilation Dashboard

The full project dashboard pulls the same idea one level up, combining FIA outputs with IDS and climate products in one place.

Where to Find Workflow Details

This post is about building the FIA schema navigator. If you want the full data workflow, setup details, and run instructions for forest-data-compilation.Rproj, they are documented in the project repo:

The schema navigator code lives in the forest-data-compilation repository, under 05_fia/docs/dashboard/fiadb_dashboard.py.

Citation

BibTeX citation:

@online{miller2026,
  author = {Miller, Emily},
  title = {Lost in the {Woods}},
  date = {2026-03-04},
  url = {https://rellimylime.github.io/posts/forest-data-compilation/},
  langid = {en}
}

For attribution, please cite this work as:

Miller, Emily. 2026. “Lost in the Woods.” March 4, 2026. https://rellimylime.github.io/posts/forest-data-compilation/.