Weather Data Analysis and QA/QC

The Photizo Project - People

  • I'm Joshua Kugler
  • My committee is Dr. Knoke, Dr. Nance, and Dr. Roth
  • Stakeholders
    • Michael Lilly: Geo-Watersheds Scientific
    • Gary Whitton: EE Internet
    • Alaska Department of Transportation and Public Facilities/UAF WERC (by extension)
    • UAF Graduate Committee

The Photizo Project - What Is It?

  • Name comes from a Greek word meaning "to enlighten, render evident, to give understanding to."
  • Archiving, QA/QC, Publishing, and Data Analysis for Meteorological Sensor Networks
    • Quality Control: Making sure our data is being collected and is accurate
    • Quality Assurance: Making sure our data meets the customers' needs

Where are we?

  • Timeline
    • Planning: September 1 - September 30
    • Requirements: October 1 - December 3
    • Design & Test Plan: December 4 - January 15
    • Implementation: January 16 - March 12
    • Testing: March 13 - April 16
    • Documentation: September 1 - May 1

Where are we going?

  • Scope will most likely stay unchanged; if anything, it will be reduced
  • Design will go forward
  • Design issues have been researched and considered

Design Issues

  • Storage of data with wildly heterogeneous formats
  • Some files:
    • "2006-09-17 11:00:00",0,3153,13.09,3.825,13.08,82.9,5.46
  • Others:
    • "2006-11-09 10:00:00", 0, 1, 66.15, 4.537, 3.55, 7.453, 4.747, <Several dozen values here>, 18.47,18.09
  • And everywhere in between.
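
A minimal sketch of reading such rows without assuming a fixed column count. The helper name parse_row is hypothetical, and the second sample is only the leading part of the longer row above, since the middle values are elided on the slide.

```python
import csv
from datetime import datetime


def parse_row(line):
    """Split one logger row into (timestamp, list of float values)."""
    # skipinitialspace tolerates the ", " separators some files use.
    fields = next(csv.reader([line], skipinitialspace=True))
    stamp = datetime.strptime(fields[0], "%Y-%m-%d %H:%M:%S")
    values = [float(v) for v in fields[1:]]
    return stamp, values


short_row = '"2006-09-17 11:00:00",0,3153,13.09,3.825,13.08,82.9,5.46'
# Leading values only; the elided middle of the slide's row is left out.
long_row_prefix = '"2006-11-09 10:00:00", 0, 1, 66.15, 4.537, 3.55, 7.453, 4.747'

for line in (short_row, long_row_prefix):
    stamp, values = parse_row(line)
    print(stamp.isoformat(), len(values), "values")
```

Nothing here knows what each column means, which is the real design problem: the parser copes with any width, but the storage layer still has to record which value is which.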

One storage solution: Serialization

  • + Would involve putting nearly all values in a blob
  • + "Quick and dirty"
  • + Easy to program, since Python has fast, solid serialization tools
  • - Makes data very opaque
  • - Very slow to search for data in a range
    • E.g. temperature between X and Y
  • - Would tie the data to Python
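
A rough sketch of the trade-off, assuming the blobs are pickled dictionaries stored in SQLite. The table layout, column names, and sample readings are all hypothetical. Note that a range search has to fetch and unpickle every row.

```python
import pickle
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (stamp TEXT PRIMARY KEY, payload BLOB)")

# Hypothetical readings; the attribute names are illustrative only.
samples = {
    "2006-09-17 11:00:00": {"air_temp": 13.09, "humidity": 82.9},
    "2006-09-17 12:00:00": {"air_temp": 15.40, "humidity": 78.1},
    "2006-09-17 13:00:00": {"air_temp": 9.75, "humidity": 90.2},
}
for stamp, values in samples.items():
    # Writing is trivial: one pickled blob per row.
    conn.execute("INSERT INTO readings VALUES (?, ?)", (stamp, pickle.dumps(values)))


def stamps_with_temp_between(low, high):
    """Range search: the database cannot see inside the blob,
    so every row is loaded and deserialized in Python."""
    hits = []
    query = "SELECT stamp, payload FROM readings ORDER BY stamp"
    for stamp, payload in conn.execute(query):
        values = pickle.loads(payload)
        if low <= values["air_temp"] <= high:
            hits.append(stamp)
    return hits


print(stamps_with_temp_between(10.0, 16.0))
```

The pickle format is also what ties the data to Python: a non-Python tool would see only opaque bytes in the payload column.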

Another option: EAV

  • Entity-Attribute-Value
    • One row for each Entity/Attribute/Value tuple
    • Used in cases where applicability or number of attributes vary
  • + All the values are still visible and searchable by non-Python tools
  • + Values can be indexed for faster searching
  • + Searches on ranges would be very fast
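
A minimal sketch of an EAV table, assuming SQLite and using each reading's timestamp as the entity. The table name, index name, and sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE eav (
        entity    TEXT NOT NULL,  -- here: the timestamp of one reading
        attribute TEXT NOT NULL,
        value     REAL NOT NULL,
        PRIMARY KEY (entity, attribute)
    );
    -- The index is what lets range searches avoid a full scan.
    CREATE INDEX eav_attr_value ON eav (attribute, value);
""")

# Hypothetical readings; note that entities need not share attributes.
rows = [
    ("2006-09-17 11:00:00", "air_temp", 13.09),
    ("2006-09-17 11:00:00", "humidity", 82.9),
    ("2006-09-17 12:00:00", "air_temp", 15.40),
    ("2006-09-17 13:00:00", "air_temp", 9.75),
    ("2006-09-17 13:00:00", "wind_speed", 5.46),
]
conn.executemany("INSERT INTO eav VALUES (?, ?, ?)", rows)


def entities_in_range(attribute, low, high):
    """Return the entities whose given attribute lies in [low, high]."""
    cur = conn.execute(
        "SELECT entity FROM eav "
        "WHERE attribute = ? AND value BETWEEN ? AND ? ORDER BY entity",
        (attribute, low, high),
    )
    return [row[0] for row in cur]


print(entities_in_range("air_temp", 10.0, 16.0))
```

Because the values sit in an ordinary column, any SQL client can run the same query.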

Another option: EAV - Cont'd

  • - Not terribly elegant, as all data has to go through a row-to-column/column-to-row transformation
    • Must be reimplemented if using another language
  • - Requires a lot of custom code to do the transformations
  • - Queries are less efficient because of the joins that must be done
  • - More data crosses the wire: the entity and attribute IDs are sent along with every value
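
A sketch of the transformation the first drawback refers to, in plain Python. The function names to_eav and from_eav are hypothetical; the point is that this glue code has to exist in every language that touches the data.

```python
from collections import defaultdict


def to_eav(entity, record):
    """Column-to-row: one wide record becomes one triple per attribute."""
    return [(entity, attribute, value) for attribute, value in record.items()]


def from_eav(triples):
    """Row-to-column: regroup triples into one dict per entity."""
    records = defaultdict(dict)
    for entity, attribute, value in triples:
        records[entity][attribute] = value
    return dict(records)


# Hypothetical reading, keyed by its timestamp.
record = {"air_temp": 13.09, "humidity": 82.9}
triples = to_eav("2006-09-17 11:00:00", record)
print(triples)
print(from_eav(triples))
```

Each triple repeats the entity and attribute alongside the value, which is the extra traffic noted in the last bullet.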

The Solution (so far) - HDF5

  • Hierarchical Data Format 5
    • Born at NCSA, now at The HDF Group
    • Used by NASA JPL, Gulfstream Aerospace, and NOAA, among others
  • Matches the hierarchy of network/station/tables
  • Search 1.3 million rows in less than a second
  • Can have indexes on columns
  • PyTables presents the tables and rows as objects
  • HDF5 bindings available for other languages
    • Perl, C/C++, Java, R

The Solution (so far) - Cont'd

  • One table per station/dataset
  • Each table will have metadata including
    • Column to "friendly name" mapping
    • Date ranges of table's validity
  • If station's dataset format changes, a new table will be created
  • Data can be shared via OPeNDAP
    • Use the Python implementation PyDap

General Design So Far

Design Ideas So Far

  • Plugin architecture for data tests
    • Each test defines what data it needs and "how many"
    • For each sensor, one would define which tests to perform
  • Plugin architecture for notification
    • Each notifier defines what information it needs
    • A notifier is simply passed a message and "contact info"
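
One way the two plugin ideas could fit together, as a sketch only. Every name here (DATA_TESTS, data_test, SENSOR_TESTS, run_tests, the two sample tests, and the limits) is hypothetical, and "how many" is assumed to mean the number of most recent samples a test needs.

```python
DATA_TESTS = {}  # test name -> (samples_needed, function)


def data_test(name, samples_needed):
    """Decorator that registers a data test and how many samples it needs."""
    def register(func):
        DATA_TESTS[name] = (samples_needed, func)
        return func
    return register


@data_test("range", samples_needed=1)
def range_check(values, low, high):
    """Pass if the latest value lies within [low, high]."""
    return low <= values[-1] <= high


@data_test("flatline", samples_needed=3)
def flatline_check(values):
    """Pass unless all supplied values are identical."""
    return len(set(values)) > 1


# Per-sensor configuration: which tests to run, and with what parameters.
SENSOR_TESTS = {
    "air_temp": [("range", {"low": -60.0, "high": 40.0}), ("flatline", {})],
}


def run_tests(sensor, values, notify, contact):
    """Run each configured test; pass failures to the notifier plugin."""
    failures = []
    for name, params in SENSOR_TESTS[sensor]:
        samples_needed, func = DATA_TESTS[name]
        if len(values) < samples_needed:
            continue  # not enough data yet for this test
        if not func(values[-samples_needed:], **params):
            failures.append(name)
            notify(f"{sensor} failed the {name} test", contact)
    return failures


# A trivial notifier plugin: it only receives a message and contact info.
outbox = []


def list_notifier(message, contact):
    outbox.append((contact, message))


print(run_tests("air_temp", [5.0, 5.0, 5.0], list_notifier, "ops@example.org"))
```

A real notifier would send e-mail or a page instead of appending to a list; the interface stays the same.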



Page last modified on March 06, 2007, at 09:29 PM