Join and Query Relational Data Tables Using R

Download data and R Code for this example

Project Requirement:

An NCEAS scientist requests an estimate of the total length and total number of all stream segments in the Continental U.S. watershed designated as Strahler Stream Order Number 1 or 2.

Input Data / Format:

The United States Environmental Protection Agency (USEPA), in partnership with the USGS and Horizon Systems, has created the NHDPlus hydrologic data set, an enhanced version of the USGS National Hydrography Dataset. NHDPlus is freely available online to the scientific community along with supporting tools and documentation.

The NHDPlus dataset consists of ESRI Shape Files that represent the geospatial attributes (e.g., location and segment connectivity) of the stream network, and indexed relational tables containing descriptive attributes (e.g., Strahler Number) for the network:

Sample NHDPlus Stream Network: Region 1 (New England) ESRI Line Shape File (state outlines in RED):

NHD Stream Network - NHDPlus Region 1


Sample NHDPlus Data Tables: Stream segment descriptive attributes and related Strahler Number (SO field) table (indexed on COMID field):

NHD Flowline and Strahler Stream Order tables 


1) Read the Flowline (stream segment descriptive attributes) and Strahler Stream Number tables into R data structures.

2) If desired, streamline the in-memory attribute tables by removing unneeded columns.

3) Join (merge) the Flowline and Strahler Stream Number tables using the COMID relational key.

4) Extract the desired attributes (segment length and segment count) for each 'qualifying' stream segment (segments assigned Strahler Stream Numbers 1 and 2 within the stream segment network.

5) Compute segment count and total segment length for each desired Strahler Stream Order number; write results to an ASCII file.


To review and run this example:

1) Download the Zip file archive, unpack into a folder.

2) Be sure that your R (version 2.9 or later) environment includes the foreign package.

3) Start R, set working directory to the folder containing the downloaded and unpacked example.

4) At the R command prompt, enter 'source("./SummarizeStreamSegmentsByStrahler.r")'

5) Enter the command 'SummarizeStreamSegmentsByStrahler("NHDPlus01","Region1TestResults")'


Summary Tables (comma-delimited and tab-delimited): Total stream segment distance and segment count by NHD Region and Strahler Stream Order number:

Sample Results Image

R Script:

SummarizeStreamSegmentsByStrahler <- function(NHDRegion,resultsFile)
    require(foreign)  # includes read.dbf() function
# Make the desired NHD data region folder the current working folder.
# getwd() function returns current working directory string.
    rootFolder = getwd()
    NHDFolder = paste(rootFolder,"/",NHDRegion,sep="")
# Read two NHD Hydrological Database tables: 
# 1) NHDFlowline: contains NHD linear features of type: stream/river (among others)
# 2) NHDsosc: contains an *updated* Strahler Stream Order assignment
# in the ('STREAMORDE') field for each linear feature/row in the NHDFlowline table. 
# The two tables are related through the primary key / column: COMID
    VAAFlowlineTable = read.dbf("./Hydrography/NHDFlowline.dbf")
    soscFile = paste(NHDFolder,"/NHDPlus_sosc/sosc.dbf",sep="")
    NewStrahlerOrderTable = read.dbf(soscFile)     
# Rename the column in the 'Strahler' table to match the name in the separate
# NHDFlowlineVAA table, not used here.
    colnames(NewStrahlerOrderTable)[2] = "STREAMORDE"  
# Remove the unneeded column 'SC' from the Strahler order table.  
    index =   grep("SC",colnames(NewStrahlerOrderTable))
    NewStrahlerOrderTable = NewStrahlerOrderTable[,-index]
# Merge the NewStrahlerOrderTable with the VAAFlowline table, using COMID
# column/attribute as the primary key.  Effectively a left outer join, with all 
# records in the LEFT table appearing in the result, but not all records in the
# in the right table. Net result: Only VAAFlowlineTable records with assigned
# Strahler Stream Order appear in the results table. 
     FlowAndStreamOrderTable = merge(NewStrahlerOrderTable,VAAFlowlineTable,by = "COMID")
# Finally, sum the stream segment lengths for all "StreamRiver" features with STREAMORDE = 1 and STREAMORDE = 2.
# Note: for demonstration purposes, we compute lengths only for the first two Strahler stream order values.
# There are typically 9 - 11 different Strahler orders, depending upon the NHD Region.
    strOrderAll = subset(FlowAndStreamOrderTable,(FTYPE == "StreamRiver"),select = c(COMID,FTYPE,STREAMORDE,GNIS_NAME,LENGTHKM))  
    strOrder1 = subset(FlowAndStreamOrderTable,(FTYPE == "StreamRiver" & STREAMORDE == 1),select = c(COMID,FTYPE,STREAMORDE,GNIS_NAME,LENGTHKM))  
    strOrder2 = subset(FlowAndStreamOrderTable,(FTYPE == "StreamRiver" & STREAMORDE == 2),select = c(COMID,FTYPE,STREAMORDE,GNIS_NAME,LENGTHKM))
    SumStreamLengthKm =  sum(strOrderAll$LENGTHKM)
    CountStreamSegments = nrow(strOrderAll)   
    SumStreamOrder1LengthKm =  sum(strOrder1$LENGTHKM)
    CountStreamOrder1Segments = nrow(strOrder1)
    SumStreamOrder2LengthKm =  sum(strOrder2$LENGTHKM)
    CountStreamOrder2Segments = nrow(strOrder2)  
# Write the results for this NHD region out to the Ascii files described in the function header.
# Append this file each time we process a new NHD Region.
# Also write a CSV file compatible with Excel.
    resultsFileCSV = paste(rootFolder,"/",(unlist(strsplit(resultsFile,"\\.")))[1],".csv",sep="")
    resultsFile = paste(rootFolder,"/",resultsFile,".txt",sep="")
    zz = file(resultsFile,"at")
    CSVExists = file.exists(resultsFileCSV)
    zzCSV = file(resultsFileCSV,"at")
    resultsBuffer = sprintf(
    "For NHD Region: %s\n\tAll Streams:    # Segments: %8g, sum segment lengths (km): %12.4f\n\tStream Order 1: #   Segments: %8g, Sum segment lengths (km): %12.4f \n\tStream Order 2: # Segments: %8g, Sum segment lengths (km): %12.4f\n",
# if csv file is new, write a column header
    if (!CSVExists)
       resultsBuffer = sprintf("NHD Region,Seg Count (all),Sum Dist (all),Seg Count (Order 1),Sum Dist (Order 1),Seg Count (Order 2),Sum Dist (Order 2)")  
    resultsBuffer = sprintf("%s,%g,%.3f,%g,%.3f,%g,%.3f",NHDRegion,CountStreamSegments,SumStreamLengthKm  ,CountStreamOrder1Segments,SumStreamOrder1LengthKm,CountStreamOrder2Segments,SumStreamOrder2LengthKm)  
# Restore incoming folder

Learning More:

Information Management Tutorial: Relational Database Basics

Wikipedia Entry: Relational Databases

Horizons Systems / NHDPlus Dataset Home Page