Photizo
Weather Data Analysis and QA/QC
The Photizo Project - People
- I’m Joshua Kugler
- My committee is Dr. Knoke, Dr. Nance, and Dr. Roth
- Stakeholders
- Michael Lilly: Geo-Watersheds Scientific
- Gary Whitton: EE Internet
- Alaska Department of Transportation and Public Facilities/UAF WERC (by extension)
- UAF Graduate Committee
The Photizo Project - What is it?
- Name comes from a Greek word meaning “to enlighten, render evident, to give understanding to.”
- Archiving, QA/QC, Publishing, and Data Analysis for Meteorological Sensor Networks
- Quality Control: Making sure our data is being collected and is accurate
- Quality Assurance: Making sure our data is meeting the customers’ needs.
Where are we?
- Timeline
- Planning: September 1 - September 30
- Requirements: October 1 - December 3
- Design & Test Plan: December 4 - January 15
- Implementation: January 16 - March 12
- Testing: March 13 - April 16
- Documentation: September 1 - May 1
Where are we going?
- Scope will most likely be unchanged, reduced if anything.
- Design will go forward
- Design issues have been researched and considered
Design Issues
- Storage of data with wildly heterogeneous formats
- Some files:
"Sunday, 17 September 2006 11:00:00",0,3153,13.09,3.825,13.08,82.9,5.46
- Others:
"Thursday, 9 November 2006 10:00:00", 0, 1, 66.15, 4.537, 3.55, 7.453, 4.747, <Several dozen values here>, 18.47,18.09
- And everywhere in between.
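To make the format problem concrete, here is a minimal sketch of reading both kinds of line with the same code. The function name `parse_record` is hypothetical, and the second sample is shortened to the values shown on the slide (the elided middle values are left out). It assumes every field after the timestamp is numeric.

```python
import csv
from datetime import datetime

def parse_record(line):
    """Split one logger line into (timestamp, list of float values)."""
    # skipinitialspace handles files that put a space after each comma
    fields = next(csv.reader([line], skipinitialspace=True))
    stamp = datetime.strptime(fields[0], "%A, %d %B %Y %H:%M:%S")
    values = [float(f) for f in fields[1:]]
    return stamp, values

short_line = '"Sunday, 17 September 2006 11:00:00",0,3153,13.09,3.825,13.08,82.9,5.46'
# Shortened: the several dozen elided values are not reproduced here
long_line = '"Thursday, 9 November 2006 10:00:00", 0, 1, 66.15, 4.537, 3.55, 7.453, 4.747, 18.47,18.09'

short_stamp, short_values = parse_record(short_line)
long_stamp, long_values = parse_record(long_line)
```

Parsing is the easy part. The number of values differs from file to file, and that is what makes a single fixed table layout awkward.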
One storage solution: Serialization
- + Would involve putting nearly all values in a blob
- + “Quick and dirty”
- + Easy to program as Python has fast and solid serialization tools
- - Makes data very opaque
- - Very slow to search for data in a range
- E.g. temperature between X and Y
- - Would tie the data to Python
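A rough sketch of why a range search over serialized blobs is slow, using Python's standard `pickle` module. The record layout and the helper `temps_between` are assumptions made for illustration only.

```python
import pickle

# Hypothetical readings; each one is stored as an opaque blob
readings = [{"station": "A", "temp": t} for t in (-5.0, 3.2, 12.7, 18.1)]
blobs = [pickle.dumps(r) for r in readings]

def temps_between(blobs, low, high):
    """Return temperatures in [low, high]; every blob must be unpickled."""
    hits = []
    for blob in blobs:
        record = pickle.loads(blob)  # no way to filter without decoding
        if low <= record["temp"] <= high:
            hits.append(record["temp"])
    return hits

in_range = temps_between(blobs, 0.0, 15.0)
```

The storage layer only sees bytes, so it cannot index the temperature, and any non-Python tool would need to understand the pickle format to read it.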
Another option: EAV
- Entity-Attribute-Value
- One row for each Entity/Attribute/Value tuple
- Used in cases where applicability or number of attributes vary
- + All the values are still visible and searchable by non-Python tools
- + Values can be indexed for faster searching
- + Searches on ranges would be very fast
Another option: EAV - Cont’d
- - Not terribly elegant as all data has to go through the row-column/column-row transformation
- Must be reimplemented if using another language
- - Requires a lot of custom code to do the transformations
- - Queries are not as efficient due to the joins that must be done
- - Less efficient on the wire: the application sends the entity and attribute IDs in addition to the value.
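The EAV trade-offs can be sketched with the standard `sqlite3` module. The table name `reading`, the attribute names, and the `pivot` helper are all hypothetical, and the values are sample numbers only. The sketch illustrates the pattern and is not the project's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per entity/attribute/value tuple
conn.execute("CREATE TABLE reading (entity_id INTEGER, attribute TEXT, value REAL)")
conn.execute("CREATE INDEX idx_attr_value ON reading (attribute, value)")
conn.executemany(
    "INSERT INTO reading VALUES (?, ?, ?)",
    [
        (1, "air_temp", 13.09), (1, "humidity", 82.9),
        (2, "air_temp", 3.55), (2, "humidity", 66.15), (2, "wind_speed", 4.747),
    ],
)

# Range search: the (attribute, value) index can serve this directly
hits = [row[0] for row in conn.execute(
    "SELECT entity_id FROM reading "
    "WHERE attribute = ? AND value BETWEEN ? AND ? ORDER BY entity_id",
    ("air_temp", 0, 10),
)]

def pivot(conn, entity_id):
    """Rebuild one logical row (attribute -> value) from its EAV rows."""
    cur = conn.execute(
        "SELECT attribute, value FROM reading WHERE entity_id = ?", (entity_id,)
    )
    return dict(cur.fetchall())

row_one = pivot(conn, 1)
```

The range query is simple, but every read of a full record goes through something like `pivot`, which is the custom row/column transformation code listed among the drawbacks.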
The Solution (so far) - HDF5
- Hierarchical Data Format 5
- Born at NCSA, now at The HDF Group
- Used by NASA JPL, Gulfstream Aerospace, and NOAA, among others
- Matches the hierarchy of network/station/tables
- Can search 1.3 million rows in less than a second
- Can have indexes on columns
- PyTables presents the tables and rows as objects
- HDF5 Bindings available for other languages
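A small sketch of the network/station/table hierarchy in PyTables, assuming the `tables` package is installed. The group names, the table name, the column names, and the timestamps are made up for illustration and are not the project's layout.

```python
import os
import tempfile
import tables

class Reading(tables.IsDescription):
    timestamp = tables.Time64Col()   # seconds since the epoch
    air_temp = tables.Float64Col()   # hypothetical column

path = os.path.join(tempfile.mkdtemp(), "photizo_demo.h5")
with tables.open_file(path, mode="w") as h5:
    # network -> station -> table mirrors the hierarchy on the slide
    network = h5.create_group("/", "network_a")
    station = h5.create_group(network, "station_1")
    table = h5.create_table(station, "hourly", Reading)

    row = table.row
    for i, temp in enumerate([-5.0, 3.2, 12.7, 18.1]):
        row["timestamp"] = 1158490800 + 3600 * i  # illustrative hourly steps
        row["air_temp"] = temp
        row.append()
    table.flush()

    # Index the column, then run an in-kernel range query
    table.cols.air_temp.create_index()
    in_range = sorted(
        r["air_temp"] for r in table.where("(air_temp >= 0) & (air_temp <= 15)")
    )
```

The condition string is evaluated inside PyTables rather than row by row in Python, which is what keeps range searches fast on large tables.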
The Solution (so far) - Cont’d
- One table per station/dataset
- Each table will have metadata including
- Column to “friendly name” mapping
- Date ranges of table’s validity
- If station’s dataset format changes, a new table will be created
- Data can be shared via OPeNDAP
- Use the Python implementation PyDap
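One way the per-table metadata could be modelled is sketched below using only the standard library. The class `TableMeta`, the helper `table_for`, the column keys, and the change-over date are all assumptions. In HDF5 itself this information would more likely live in table attributes.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional

@dataclass
class TableMeta:
    name: str
    friendly_names: Dict[str, str]        # column -> "friendly name"
    valid_from: datetime
    valid_to: Optional[datetime] = None   # None means still current

    def covers(self, when: datetime) -> bool:
        """True if this table's format was in effect at `when`."""
        return self.valid_from <= when and (
            self.valid_to is None or when < self.valid_to
        )

def table_for(station_tables: List[TableMeta], when: datetime) -> Optional[TableMeta]:
    """Pick the table whose validity range contains `when`."""
    for meta in station_tables:
        if meta.covers(when):
            return meta
    return None

# Hypothetical station whose dataset format changed on 1 November 2006
station_tables = [
    TableMeta("hourly_v1", {"c3": "Air temperature"},
              datetime(2006, 9, 1), datetime(2006, 11, 1)),
    TableMeta("hourly_v2", {"c3": "Air temperature", "c9": "Wind speed"},
              datetime(2006, 11, 1)),
]

chosen = table_for(station_tables, datetime(2006, 11, 9, 10, 0, 0))
```

Because a format change creates a new table instead of altering the old one, a query only has to pick the table that covers the requested date.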
General Design So Far
Design Ideas So Far
- Plugin architecture for data tests
- Define what it needs and “how many”
- For each sensor, one would define which tests to perform
- Plugin architecture for notify
- Define what information it needs
- Simply passed a message and “contact info”
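The two plugin ideas could look roughly like the registry sketch below. Every name here (`data_test`, `notifier`, `SENSOR_TESTS`, `check`, the thresholds, the contact address) is hypothetical. The sketch only shows the shape: a test declares how many values it needs, and a notifier is handed a message and contact info.

```python
DATA_TESTS = {}   # test name -> (number of values needed, function)
NOTIFIERS = {}    # notifier name -> function(message, contact)

def data_test(name, needs):
    """Register a data test and declare "how many" values it needs."""
    def register(func):
        DATA_TESTS[name] = (needs, func)
        return func
    return register

def notifier(name):
    """Register a notify plugin."""
    def register(func):
        NOTIFIERS[name] = func
        return func
    return register

@data_test("range", needs=1)
def in_range(values, low=-60.0, high=40.0):
    return low <= values[0] <= high

@data_test("no_jump", needs=2)
def no_jump(values, max_step=10.0):
    return abs(values[1] - values[0]) <= max_step

sent = []  # stands in for e-mail, SMS, and so on

@notifier("log")
def log_notifier(message, contact):
    sent.append((contact, message))

# For each sensor, define which tests to perform
SENSOR_TESTS = {"air_temp": ["range", "no_jump"]}

def check(sensor, history, contact, notify_with="log"):
    """Run the sensor's tests on its latest values and notify on failure."""
    failures = []
    for name in SENSOR_TESTS[sensor]:
        needs, func = DATA_TESTS[name]
        if len(history) < needs:
            continue  # not enough data for this test yet
        if not func(history[-needs:]):
            failures.append(name)
            NOTIFIERS[notify_with](f"{sensor} failed {name}", contact)
    return failures

ok = check("air_temp", [12.0, 13.0], "ops@example.org")
bad = check("air_temp", [12.0, 55.0], "ops@example.org")
```

Adding a new test or notification channel then means registering one more function, with no change to `check`.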
The End