The Best Way to Get Big Data By Director and Senior Data Scientist

The new Digital Government Strategy is “treating all content as data.” So big data = all your content:
But just a small sample to start a pilot.
There are many Big Data Technologies to choose from and many early adopters are finding them more expensive than expected:
Use open source software and free trials to pilot.
There are many Big Data Problems to solve that could “boil the ocean”:
Use a data scientist to help build a team and community for a fast, inexpensive, and small semantic data science pilot.
Subcommittee on Networking and Information Technology Research and Development (NITRD Subcommittee)
Data Science Team Example: Chief Data Science Officer
Chief Data Science Officer:
Dr. George Strawn, Director, White House OSTP NITRD/NCO: Semantic Medline could be the “killer” Semantic Web application for the US Federal Government
Data Science Team:
Dr. Brand Niemann, Lead
Dr. Tom Rindflesch, NLM Semantic Medline Creator
Professor Kirk Borne, George Mason University
Federal Big Data Senior Steering WG Workforce Training Initiative
Tim White, Director, YarcData Federal Global Head
Aaron Bossett, YarcData Federal Solution Architect
Dr. Eric Little, Modus Operandi Chief Scientist
Generic Problems
How to get Big Data:
Unstructured (Natural Language Processing to Graph-RDF Triples) and Structured (Relational-RDF Triples)
Where to store Big Data:
Graph-RDF Triples and Relational
What to show about Big Data:
Statistics, Visualizations, and Network Graphs
Note: RDF Triples make Big Data smaller, smarter, and integrated!
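The structured path above (Relational to RDF Triples) can be sketched in a few lines. This is a minimal illustration, not the team's pipeline: the table name, column names, row values, and the `row_to_triples` helper are all hypothetical, and the subject/predicate naming scheme is an assumption made for readability rather than a real RDF vocabulary.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def row_to_triples(table: str, key: str, row: Dict[str, str]) -> List[Triple]:
    """Turn one relational row into triples, one per non-key column."""
    # The primary-key value identifies the subject; each other column
    # becomes a predicate whose object is the cell value.
    subject = f"{table}/{row[key]}"
    return [
        (subject, f"{table}#{column}", value)
        for column, value in row.items()
        if column != key
    ]


# Hypothetical row from a hypothetical "citation" table.
row = {"id": "c1", "title": "Example title", "year": "2013"}
triples = row_to_triples("citation", "id", row)
for triple in triples:
    print(triple)
```

Because every fact is its own triple, rows from different tables can be pooled in one store and linked wherever they share a subject or object, which is one reading of the "integrated" claim in the note above.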
Semantic Medline on the YarcData Graph Appliance is an example of the best content on the best graph data store with the best visualization results so far (in my humble opinion)!
Our Semantic Data Science Team delivered this for the recent White House Big Data Event: See Making the Most of Big Data

Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: Work Flow
Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: Semantic Medline Database Application
Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: Visualization and Linking to Original Text
Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: Bioinformatics Publication
Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: Semantic Medline at NIH-NLM
Current: Web-based research tool.
Transition: Current systems re-engineered to leverage Urika (less than 5 days).
Purpose: Build a platform for users to perform increasingly complex analysis.
Immediate Requirement: Replicate current capability.
Future: Allow for increasingly complex analysis. Ability to capture and share analytics in addition to sharing data. Tailor Urika to less complex queries.
Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: Graphs and Traditional Technologies
Square peg, round hole:
Current technology does not support efficient representation, storage, and interaction with complex graph structures
Traditional relational models only add to an already complex structure
Traditional hardware approaches do not support efficient access to highly interconnected graphs

You don’t know what you don’t know:
Efficient relational schemas require prior knowledge of the relationships between database fields
Updating and modifying schemas frequently introduces delays and errors

Problems in partitioning the problem:
Distributed computing solutions are good, if your problem can be easily partitioned
Graphs are not predictable; accessing graph nodes across large clusters can be unwieldy at best and does not work at scale
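The partitioning problem above can be made concrete with a toy count of edges that end up spanning two machines. This is only a sketch under stated assumptions: the five-node graph is invented, and the round-robin `assign_partitions` helper is a deterministic stand-in for whatever hash partitioning a real cluster would use.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def assign_partitions(nodes: List[str], parts: int) -> Dict[str, int]:
    """Assign nodes to partitions round-robin in sorted order."""
    return {node: i % parts for i, node in enumerate(sorted(nodes))}


def cross_partition_edges(edges: List[Edge], owner: Dict[str, int]) -> int:
    """Count edges whose two endpoints live on different partitions."""
    return sum(1 for a, b in edges if owner[a] != owner[b])


# Invented toy graph: one hub linked to every other node, plus one extra edge.
nodes = ["a", "b", "c", "d", "hub"]
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("a", "b")]

owner = assign_partitions(nodes, 2)
cut = cross_partition_edges(edges, owner)
print(f"{cut} of {len(edges)} edges cross partitions")
```

Every crossing edge is a network hop during traversal. In this toy case most edges cross even with only two partitions, which illustrates why highly interconnected graphs are hard to split cleanly.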
Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: The YarcData Approach
Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: New Use Cases
Current therapies target dopamine receptors
Not entirely effective
Side effects
Basic research is exploring glutamate and its NMDA receptor
Goal: can we use Semantic MEDLINE to discover that research trend in the scientific literature?
With some exceptions, therapy is not effective
Has not progressed significantly in 60 years
Scientific basis
Traditionally – cancer cells
More recently – non-cancer cells (immune system)
Immune system and cancer
Connection noted in 1863 (Virchow)
But not exploited until recently
Goal: look for trends in cancer immunotherapy
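One simple way to look for a research trend of the kind described in the two goals above is to count, per publication year, how many extracted statements mention a concept. The sketch below assumes each statement is available as a (year, subject, predicate, object) record. The records, the predicate labels, and the "disorder X" placeholder are made up for illustration and are not Semantic MEDLINE output.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# (publication year, subject, predicate, object)
Predication = Tuple[int, str, str, str]


def yearly_counts(
    predications: Iterable[Predication], concept: str
) -> List[Tuple[int, int]]:
    """Count statements per year that mention `concept` as subject or object."""
    counts: Counter = Counter()
    for year, subj, _pred, obj in predications:
        if concept in (subj, obj):
            counts[year] += 1
    return sorted(counts.items())


# Made-up toy records, for illustration only.
toy = [
    (2001, "dopamine receptor", "ASSOCIATED_WITH", "disorder X"),
    (2005, "NMDA receptor", "ASSOCIATED_WITH", "disorder X"),
    (2005, "glutamate", "INTERACTS_WITH", "NMDA receptor"),
    (2009, "NMDA receptor", "ASSOCIATED_WITH", "disorder X"),
]

nmda_trend = yearly_counts(toy, "NMDA receptor")
print(nmda_trend)
```

On real data the counts would need normalizing by the total number of statements per year, since the literature itself grows over time.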
Modus Operandi: Mantra, Performance, and Vision
Speeding the Discovery, Integration, and Fusion of Information
SBIR Phase Three Successes: Wave Exploitation Framework (EF)
Wave EF: Government-off-the-shelf (GOTS) technology for intelligence applications that tackles the difficult problem of processing unstructured and semi-structured data
C4ISR Government Customers: U.S. Air Force, U.S. Army, U.S. Marine Corps, U.S. Navy, DARPA, DTRA, Missile Defense Agency, and Intelligence Agencies
Wave All-Source Semantic Fusion Engine: In development to support individual medical researchers/intelligence analysts to work with big data
Semedy (former Ontoprise founders): Reasoner and Triple Store
Modus Operandi: Finding the Right Needle in the Right Haystack
Dyson said. “So a lot of what we’re doing is enabling that by making the data sources accessible and searchable.”
“Our specialization is what we call ‘semantic technology,’ which is just a way of making the data smarter. We enrich the data with various tags to make it easier to find.”
The software also provides what McNeight called data “provenance” which has to do with the traceability back to the source of the data – the really important aspect for intelligence personnel.
“We don’t make decisions,” McNeight explained. “We just help (the analyst) to make decisions and to find the right data. He may only be interested in a certain person in a certain location at a certain time. We can bring that back to him across multiple databases.”
Data Science Team Example: President of Modus Operandi
President of Modus Operandi:
Richard McNeight, President, Master's Degree in Artificial Intelligence & Computer Science, Board of Regents, Florida Institute of Technology University, Recognized for Entrepreneurial Leadership, and Recipient of Florida County Economic Development Grant for Big Medical Data
Data Science Team:
Lee Watkins, Director of Bioinformatics & IT JHMI, and Dr. Brand Niemann, Semantic Community, Co-Leads
Dr. Eric Little, Modus Operandi Chief Scientist, Ontology and Wave All-Source Semantic Fusion Engine Development
Bryan Thompson and Michael Personick, SYSTAP Principals, Bigdata® Platform
Tim Barr, YarcData Medical Informatics, and Aaron Bossett, YarcData Federal Solution Architect
Others to be added as needed
Dr. Tom Rindflesch, NIH/NLM Semantic Medline Creator
Dr. Richard Ford and Dr. Marco Carvalho, Florida Institute of Technology
How Wave Drives the BLADE Semantic Wiki and Other Kinds of Analytic Visualizations
Possible Scenario
For medicine – the Blade 2.0 Semantic Wiki would allow different researchers to view the data collectively from within their areas of expertise, but connect them to other areas effortlessly.
This means – scientist 1 could be looking up information on a given receptor on a cell, while scientist 2 is looking at proteomic information (perhaps not even knowing it is the underlying substance of that cell/receptor).
Scientist 3 could add some new information about a given compound that shows reactions at the receptor site scientist 1 is studying.
Upon entering that information, scientist 1 would see a new linked piece of data about their receptor related to the compound – and the cool part is scientist 2 would also see information about the connection between their protein structure and that compound.
Scientist 3 would see the information about the protein related to their compound as well (since they were only looking at the receptor-compound connection).
All 3 would basically have new linked information available to pursue if they wanted.
Now imagine being able to do those kinds of joins in near-real-time with a simple tool across the entire corpus of the Semantic Medline data set. Kaboom!
Source: Dr. Eric Little, Chief Scientist and Ontologist
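The three-scientist scenario above amounts to adding one link to a shared graph and then asking what each person can newly reach. The sketch below is a plain-Python illustration of that idea, not the Wave or BLADE implementation: the entity names, predicate names, and the `linked` helper are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def linked(triples: List[Triple], start: str) -> Set[str]:
    """Return every entity reachable from `start`, treating triples as undirected links."""
    adjacency: Dict[str, Set[str]] = {}
    for subj, _pred, obj in triples:
        adjacency.setdefault(subj, set()).add(obj)
        adjacency.setdefault(obj, set()).add(subj)
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first traversal
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen


# Shared store: scientist 1 studies receptor_R, scientist 2 studies protein_P.
shared: List[Triple] = [("receptor_R", "made_of", "protein_P")]
before = linked(shared, "protein_P")

# Scientist 3 enters one new fact about a compound.
shared.append(("compound_C", "reacts_at", "receptor_R"))
after = linked(shared, "protein_P")

new_for_scientist_2 = after - before
print(new_for_scientist_2)
```

Scientist 2 never entered anything about the compound, yet it appears in their view because the receptor connects the two. The same traversal from `compound_C` reaches the protein, matching what scientist 3 would see.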
Knowledge Base: Modus Operandi Web Intelligence in MindTouch
Big Data in Memory: Innovation Story
Met Jef Sharp, President, Panève:
Amazing fast access and massive storage – Big Data Supercomputer on My Mobile Device
Johns Hopkins University – Blackbook (CIA Cloud)
I suggested:
Greylock Partners – #2 Data Scientist in the World (DJ Patil, Entrepreneur-in-Residence who built the first formal data science team at LinkedIn)
Works for In-Q-Tel (Robert Ames, Senior VP for Technology, In-Q-Tel)
Works for CIA (Gus Hunt, CTO, CIA)
Who Wants Big Data Supercomputer on Mobile Devices
Future Possibility: Panève’s ZettaLeaf & ZettaTree Products
Scalable single level storage
Panève’s scalable single level storage model collapses the server, network, and storage by removing software and replacing it with memory system primitives. This eliminates all network and network-processing overhead associated with accessing storage and delivers a 10,000X increase in raw performance.

Dr. Brand Niemann
Director and Senior Data Scientist
Semantic Community for Johns Hopkins University School of Medicine and Modus Operandi
December 12, 2013