New Biogeochemical Sample Collection Database Model

Emilio Mayorga, Created: 2/26/2003 Updated: 5/15/2003

Problems With The Current Data Model

We have a bloated set of tables for the actual biogeochemical measurements. Keeping track of all available columns (parameters) is kind of difficult. The arrangement into thematic tables (bulk organic, aminos, stable isotopes, etc) helps, but it makes loading data into queries from multiple tables kind of messy. Also, a lot of the entries are empty -- that's a waste of file space. It's very difficult to construct a query that asks: what was measured on this station, on this sample? What data is available for this expedition? You're forced to select every single column and see if it has a valid value! Also, it's somewhat difficult to keep an updated, easy to read single table of all parameters, what grouping each parameter belongs to (inorganic, bulk org, etc), units, etc.
We don't have a way of recording if a measurement is "below detection", and what the detection level was. Currently we have to specify "zero" or some other half-baked solution, and we can't make a fool-proof documentation of this decision for each case. The current structure would not support the addition of "below detection" and "detection limit" fields -- *many* for each parameter, because it'll vary according to the lab and instrumentation.
While we have entered some parameters that are actually calculated (eg, DIC from measured pH, temperature, and Alk), we don't have a way of tagging these values as "calculated" rather than actually measured. This is an important distinction. The same applies for values that we may have entered based on extrapolations/guesses from separate samplings for the same or similar areas (I've done this for the French-Bolivian data!).
- Maybe this set of "constructed values" should be part of a parallel set of tables that can be integrated with the primary one on the fly, via Views??!!
We don't record the date when the sample was analyzed. While this isn't critical, it would be very convenient. The current structure would not support it.
Particle size, dissolved fractions, filtration cutoffs and concentration methods. Right now we are limited to Coarse Particulates, Fine Particulates, total particulates in a couple of cases (eg, TSS), a couple of pre-defined UF size ranges, and implicitly, the dissolved fractions from concentration on resin. The only way to add new, different size fractions, is to create new columns (bad!) or put the data in an existing column but making a note of it somewhere (for example, the C14 data for two UDOM size fractions in Rio Negro were combined/averaged into a single, aggregate fraction). Also, the need to have separate fields for each size fraction, for each parameter (eg, CPOC, FPOC, DOC, UDOC, ...) leads to a proliferation of biogeochemical fields.
By closely tying the GIS derived information about the position of a station relative to others, its basin, etc, we've made it more difficult to add new sites. Good examples include the wetland sample from the Parana, small streams samples from Campinas, and water from saturated soils or groundwater. We also make it nearly impossible to use alternate datasets of river networks, basins, and GIS properties; for example, for mesoscale basins where superior, finer resolution data sets are available (Ji-Parana, Jau, Pachitea). This last issue is a severe limitation.
We do not maintain separate records for similar measurements done on the same sample, by different labs (C13, nutrients, etc). While this can be convenient, it makes it harder to resolve future inconsistencies.
We do not keep track of where each "dataset" was published. It'd be nice to be able to associate pieces of data with a particular publication.
It's very hard to include time-composites (eg, data published only as long-term means), or composite samples from more than one location (eg, '80s C14 measurements from the mainstem).

Major Suggested Changes

Replace all biogeochemical tables with a single "measurement" table linked to the "sample" table. This table will be linked to a "parameter" table, which is linked to a "parameter category" table. See the GW Model document.
Add a "lab" link to the "measurement" table, to identify where each measurement was made. Also, each measurement will have an optional "analysis date" (as opposed to sampling date).
Instead of having separate parameter names for each particle size and dissolved fraction, use a "size_fraction" entry in "measurement", linked to a "size_fraction" table, which is linked to a "size_fraction_category" table (eg, particulate, colloidal, dissolved). NOTE: but what to do about resin-based dissolved-fraction concentration?? It's not based on size! Though, it's used *after* pre-filtering using some filter size, so we can enter that filter size. Maybe in addition to size_fraction, we need a concentration method (for dissolved only?), or something like that.
Remove all GIS or basin classification from the primary database. Make all that information a separate database that is "imposed" on the main database. That will give us more freedom to have all sorts of samples in the main database (lakes, soils, streams outside the Amazon, etc).
Remove "RiverSystems" and "RiverSystemsStations" tables from the primary database, and move it to the GIS database, generally for the same reason described in the previous point. We want to keep the primary database more focused on information gathered in the field or collected on the samples themselves (the geochemical analyses). Also, ultimately the RiverSystemsStations table should be derived automatically from GIS using a minimal set of input information (eg, most downstream point, "source" point that defines the mainstem trunk [which doesn't have to be the "real" trunk, the longest river course], and a maximum distance from main trunk defining stations labelled as "IsDirectAffluent"; is anything else needed?)
Change the name of "stations" to "location". (is this a very minor change?)
Add a set of categories for sample/location environment type: river, lake, river bank seds, soils, air. A second category set may be needed too: largely water sample, largely sediment/soil (solid), and gas.
make extensive use of "reliability" or "quality" flags??

Unresolved Issues

Preservation of original significant digits and precision when making unit conversions for geochemical data
How to handle time or spatial composites, averages
Handling samples/measurements that haven't been analyzed yet; have flags for specifying that a parameter *will* be analyzed, and a sample of such-and-such type has been collected? Also, tables and fields that allow us to track samples while they're being analyzed -- where they are, what labs they are going to, etc
Data access locks, granted to specific groups or users for unpublished data (especially on the measurement field)