Recent Changes - Search:

Project Status

Project Documents

Discussion

Bug Tracking

edit SideBar

DataStorageRetrieval

NB: I've done my best to explain the issues at hand. Please feel free to ask my to clarify if needed.

Databases are "easy." Data storage and retrieval is "easy." When you have a clearly defined structure that maps well to the existing Relational Database paradigm, that is. Things get a bit muddier when you delve into the world of object-relational mapping, especially when those object have an arbitrary (and varying) number of attributes. Suddenly, you no longer have a clean model for your RDBMS (Relational Database Management System). And that is where we stand in this project right now.

Here is the scenario. We are retrieving data from various stations (or more correctly, "groups of sensors," since a "station" could be a data logger at CCHRC that has sensors in several parts of the building). Each group of sensors can send back a file or files that contain sensor readings, with each line of the file being the readings on those sensors at a given time. A format that looks like this:

2006-10-27 15:41,4.552,291.14,413,400,etc

As was stated, a file may have four sensors or four dozen. This leads to a problem of storing this data in the database. There are a variety of ways to do this, and you can read a lively discussion about it. (That I started, incidentally)

So, in order to determine the best way to store the data in the database, we must answer this question: What do we want to do with the data once it is stored in the database.

Value Separation
If we want to be able to search on individual values (such as "temperature > 0.5") quickly and efficiently, then we must store in a format in which each value is in it's own field, easily "viewable" by the database when it executes a query.

Value Serialization (see below)
If we do not need to search on individual values, then the line of data in the file (such as the group of 10 sensor readings) could all be put into a single database field. Thus, if we wanted a list of temperature readings for a certain station, we would have to pull the entire record, and then extract just the value we wanted, as opposed to the previous model where we could select just the temperature values.

However, if we split out the values, and want a series of several readings, that would require one query for each value, vs. one query to retrieve all values in the second option.

Value Separation would be fast in searching for (and selecting) individual values, but somewhat slow if one wants to select many of the values for a station.

Value Serialization would be rather slow in searching for individual values, but would be fast for selecting individual values, and almost just as fast for selecting many of the values for a station.

The first option would require something like an Entity-Attribute-Value model (read more).

The second option would use something like object serialization (read more).

Neither of these options are terribly elegant, but one of them will probably have to be chosen. It really comes down to how the data will be used after it is stored.

There is the third option of having one table in the database for each data file, but that is very ugly, and requires all kinds of code to manage, create, modify, and otherwise make sure the tables are kept in line with the data files. Things get really ugly when the data file adds or deletes columns. Do you create a new table? Modify the existing one and have empty columns where no data is stored? Unless an elegant way to do this can be explained, this is really an non-option.

Further research/thought is warranted. (:notoc:)

Edit - History - Print - Recent Changes - Search
Page last modified on October 28, 2006, at 12:14 AM