Data Science for RDA Climate Change Data Challenge and Meetup

The announcement said:
The 6th Plenary RDA hosted in Paris from 23-25 September 2015, features a special focus on research data for climate change, leveraging on the UN Climate Change Conference (COP21) to be held in Paris in December 2015.
As a part of this special focus Cap Digital & RDA have created a special Data Challenge designed to connect Climate Change related Data Sets with startups, SMEs and larger organizations with practical application for these data.
A wealth of datasets from different global organisations has been made available to enterprises for the creation of novel and innovative solutions in areas covering air quality, energy and urban activity. We are now entering the second phase of the challenge – the Call for Enterprise Engagement.
By organizing the 6th RDA Plenary Assembly (P6) in Paris, Cap Digital seeks to promote RDA and the work undertaken within Working and Interest Groups, to a significant number of European players, especially among startups and major companies concerned by the challenges of “Big Data”.
Another announcement said:
Please join the NITRD FASTER Community of Practice for an informative presentation and discussion with Dr. Francine Berman, Chair, RDA/US and Edward P. Hamilton Distinguished Professor of Computer Science, Rensselaer Polytechnic Institute. Dr. Berman will describe the Research Data Alliance (RDA) and its community, and give a look ahead at future directions for the RDA.
In 2013, the Research Data Alliance (RDA) was formed to build and adopt infrastructure that accelerates data sharing world-wide. Two years later, the organization has attracted nearly 3000 members from over 100 countries and all sectors. The precipitous growth and enthusiasm for the RDA emphasizes the global need for data infrastructure and coordination, and indicates the community's high expectations that RDA has the potential to meet those needs. In this talk, Fran Berman, U.S. Chair of the Research Data Alliance and co-Chair of its leadership Council, describes the organization and its community, and gives a look ahead at future directions for the Research Data Alliance.
So I decided to enter the RDA Special Data Challenge and report on the results at a Federal Big Data Working Group Meetup in connection with the NITRD FASTER Community of Practice Meeting at NSF on July 15th. I will ask: two years later, with nearly 3000 members in over 100 countries and almost 6 plenaries, is this the focus of the RDA now? I will also ask: Are there any special instructions for this competition?
The reason is that the “RDA special Data Challenge is designed to connect Climate Change related Data Sets with startups, SMEs and larger organizations with practical application for these data”, which is the reason for the Federal Big Data Working Group Meetup!
I downloaded the Climate Change Dataset Catalog and repurposed the PDF into MindTouch here.
As a data scientist / data journalist, I always have questions about competitions and their data sets as follows:
What distinguishes this competition from many others?
Are these the best data sets?
Is the data set information accurate?
Should one pick a few data sets or try to work with them all?
Why are there 83 links to data.gov?
Etc.
I entered the competition and my answers to the questions were:
Describe the possible application or solution you are developing and how does it constitute a challenge goal
Data Science for RDA Climate Change Data Challenge
How will your solution involve the proposed datasets?
Try to use as many as possible
Which are the datasets integrated in your solution demonstration? Select from the dropdown list.
Try to use as many as possible
What is the expected impact?
Try to integrate all of the data sets
Show whether or not this is a worthwhile activity
In order to answer the above questions and integrate the data sets in a data ecosystem, I need to repurpose the catalog into a dataset that can be filtered.
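A minimal sketch of what "a dataset that can be filtered" might look like, assuming the catalog entries have already been transcribed from the PDF into rows. The column names (number, title, provider, theme, url) and the sample rows are hypothetical placeholders, not the actual catalog contents.

```python
import csv
import io

# Hypothetical catalog rows; the real catalog's fields and values will differ.
CATALOG_CSV = """number,title,provider,theme,url
1,Example dataset A,Provider X,air quality,https://example.org/a
2,Example dataset B,Provider Y,energy,https://example.org/b
3,Example dataset C,Provider X,urban activity,
"""


def load_catalog(text):
    """Parse catalog CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def filter_catalog(rows, **criteria):
    """Return rows whose fields match every criterion (case-insensitive)."""
    def matches(row):
        return all(
            (row.get(field) or "").strip().lower() == str(value).strip().lower()
            for field, value in criteria.items()
        )
    return [row for row in rows if matches(row)]


rows = load_catalog(CATALOG_CSV)
energy = filter_catalog(rows, theme="energy")
# Entries with no link are the ones that need follow-up in a data audit.
missing_url = [row for row in rows if not row["url"].strip()]
```

The same rows could equally be kept in an Excel sheet or loaded into Spotfire; the point is only that each catalog entry becomes one row with consistent columns.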
In essence there are three basic steps in the data science process:
Data Preparation
Data Ecosystem
Data Story
Data scientists are in high demand now and will be in the future. President Obama just hired a chief data scientist at the White House, Dr. DJ Patil. Academia cannot meet the demand for data scientists, so Data Science Meetups and Massive Open Online Courses (MOOCs) are filling that workforce gap. One such Meetup that provides MOOCs is the Federal Big Data Working Group Meetup, which trains data scientists, especially in the use of government big data, using an industry-leading data science tool called Spotfire.
The FBDWG Meetup provides tutorials and presentations on Data Preparation, Data Ecosystems, and Data Stories that answer four essential questions:
How was the data collected?
Where is the data stored?
What are the data results? and
Why should we believe the data results?
The results are documented in three tools: this Wiki (called MindTouch), Excel spreadsheets, and TIBCO Spotfire, so others can study and repurpose/reuse them.
My Goals for 2015-2016 are:
Goal 1: Digital Catalog
Goal 2: Data Audit
Goal 3: Individual Data Sets in Spotfire
Goal 4: Integration/Applications
Goal 5: Meetups/Data Science Publication/MOOCs
The Data Audit Results so far are:
1. I could not readily find the actual data sets for 18 of the 64 data sets.
2. The URL for the very important DOE Buildings Data Book does not work (I think this is being revised or removed permanently).
3. 11 of the remaining USCDINASA 40 data sets come from the National Transportation Atlas Database. Why not use all 36 as a more authoritative and consistent data set?
4. Why is a contractor who was brought in to manage the White House Climate Data Initiative, and who is now a private consultant on climate data (Climate Data Solutions LLC), listed as the contact person?
5. The obvious other data that can be used is the 557 data sets at Data.gov/Climate and the data sets in the President's National Climate Assessment, which many others and our Meetup have already worked with.
6. I could not find the 3 data sets from Cap Digital (numbers 22-24), the sponsoring organization, and their web site cannot be translated into English.
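Part of an audit like the one above can be automated. The sketch below labels each catalog entry as having no link, a broken link, or a working link, and tallies the labels. It is only an illustration: the entry names and URLs are placeholders, the function names are my own, and a HEAD request is a rough test (some servers reject HEAD, and a page that loads may still not contain the actual data).

```python
import urllib.request
from collections import Counter


def url_works(url, timeout=10):
    """Return True if a HEAD request to the URL succeeds."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            status = getattr(response, "status", None)
            # Non-HTTP responses (e.g. FTP) may carry no status code.
            return status is None or status < 400
    except (OSError, ValueError):
        # URLError and HTTPError are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


def audit(entries, checker=url_works):
    """Label each (name, url) entry and tally the labels."""
    labels = {}
    for name, url in entries:
        if not url:
            labels[name] = "no link"
        elif checker(url):
            labels[name] = "ok"
        else:
            labels[name] = "broken link"
    return labels, Counter(labels.values())


# Offline demonstration with a stand-in checker and placeholder URLs.
sample = [
    ("Dataset A", "https://example.org/a"),
    ("Dataset B", ""),
    ("Dataset C", "https://example.org/gone"),
]
labels, tally = audit(sample, checker=lambda url: not url.endswith("/gone"))
```

Passing the checker as a parameter keeps the tally logic testable without network access; calling `audit(entries)` with no checker would use the real HEAD request.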
MORE TO FOLLOW
Knowledge Base Index into Spreadsheet
Spreadsheets Imported into Spotfire
Data Dictionaries for the Data Ecosystem (Looking For Them)
Data Ecosystem Integrated in Spotfire
Data Analytics and Visualizations in Spotfire
My entry could be the following possibilities:
A detailed catalog of all the data sets to see what can be reused and integrated: Done That
All of my data stories about Climate Change data sets: Need to Inventory
My Data Science Data Publication for the 40 Data Sets in the President's Climate Change Initiative
This builds on my previous Data Science for RDA and is a work in progress.
I added the Climate.Data.gov Catalog and Audit results to the spreadsheet: See Tabs: Data.gov Climate, Data.gov Climate CSV and EPA CWA 303 (d) Dictionary.
I was able to download 18 CSV files from the 38 data sets found at:
11 of the 38 are FTP sites with many files to sort through and 8 were DNF (Did Not Find) or DNW (Did Not Work).
Now on to importing those 18 CSV files into Spotfire to see the data and visualizations and do integrations and applications. Interestingly, there are 8 data sets on disease (NNDSS), and that may prove to be the most interesting climate change application. I just finished the ESRI GIS Tutorial for Health from the Health Datapalooza 2015:
My initial conclusions are that the Data.gov Climate CSV and NTRD are the best and easiest to have people work with so they are not frustrated in trying to find data like I was with the PDF Catalog.
One of the U.S. Climate Resilience Toolkit FAQs asks: “Why do climate.data.gov and the U.S. Climate Resilience Toolkit live in different locations?” and answers: “In the long run, our aim is to integrate them into one ‘seamless’ system.”
My answer is: We can do that, and are doing it, in a Data Science Data Publication! This will solve the all too common problem of so many web pages, articles, etc. about data, but not with data.
The Data Science for RDA Climate Change Data Challenge and Meetup will include an additional goal, namely to integrate climate.data.gov and the U.S. Climate Resilience Toolkit into one “seamless” system, which we will call “a Data Science Data Publication”. This will be my challenge submission and experimentation day demo!
I also think we will do a meetup (or a series of meetups like this: see below) to support the NSF Data Science / Big Data Community, and use the RDA Climate Change Data Challenge, climate.data.gov, and U.S. Climate Resilience Toolkit data sets I am preparing to jump-start our meetup members and other data science meetup participants.
NSF Graduate Data Science Workshop & Community Building, Aug. 5-7, Seattle
The NSF-sponsored Graduate Data Science Workshop will bring together 100 graduate students from diverse domain sciences and engineering with Data Scientists from industry and academia to discuss and collaborate on Big Data / Data Science challenges.
I just found what I was looking for in the Climate Resilience Toolkit to help others: There are 63 data sets used in 80 Case Studies. Using Climate Data, Satellite Imagery, and Local Knowledge to Prevent Famine uses 6 data sets (the maximum for any case study), so this would be the best one for integrating multiple data sets.
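Finding the case study that draws on the most data sets is a simple counting exercise once the case-study-to-data-set links are in a table. A small sketch, with a hypothetical mapping standing in for the real Toolkit listing:

```python
# Hypothetical mapping of case study titles to the data sets each one uses;
# the real Toolkit listing would supply these.
case_studies = {
    "Case study A": ["dataset 1", "dataset 2"],
    "Case study B": ["dataset 2", "dataset 3", "dataset 4"],
    "Case study C": ["dataset 1"],
}


def rank_by_dataset_count(studies):
    """Sort case studies from most to fewest distinct data sets used."""
    counts = {title: len(set(datasets)) for title, datasets in studies.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_dataset_count(case_studies)
best_title, best_count = ranking[0]

# Distinct data sets across all case studies (data sets can be shared).
distinct_datasets = set().union(*case_studies.values())
```

The top of the ranking is the best candidate for a multi-data-set integration, and the distinct set gives the overall count of data sets in use.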
I also just found, from my earlier Data Science for Climate Change (US National Climate Assessment), that the original number of data sets (23) has become 2,377 data sets, in addition to the 36 data tables I extracted from the report itself into a spreadsheet (135 MB). I extracted the 2,377 data sets into the RDA Climate Challenge spreadsheet.