Last year, the event was called the Data Scientist Summit and focused largely on data scientist "rock star" speakers and panel members. This year's title signaled an intention to focus instead on data science teams. EMC and its Greenplum division sponsored the event in both years.
What is data science? According to Wikipedia, "data science defines a discipline that incorporates applying various degrees of statistics, data visualizations, computer programming, data mining, machine learning and database engineering to solve complex data problems." The very short article goes on to say that a Data Science Journal has been published since April 2002, so data science is at least a decade old.
Why is data science relevant? Think about it this way: When questioned about the value of the first hot air balloons, Benjamin Franklin is said to have asked in response, "What is the value of a newborn baby?" Actually, data science is probably well past the newborn-baby stage, although it still has a long way to go before it achieves full maturity. Data science underlies technologies such as search engines like Google's, which use data outside the page itself; networks of friendship relationships (think Facebook); big-data analysis; and product recommendation systems. In short, data science and data scientists are all about thinking creatively about what information might be useful and putting it in a context from which value can be derived.
Below are descriptions of some of the topics discussed at Data Science Summit 2012:
- Predictive modeling: Predictive modeling has been with us for a long time, but data science goes far beyond traditional regression analysis, pushing the boundaries of what is possible and often drawing on multiple disciplines in addition to statistical learning, such as techniques for mining massive data sets.
- Data visualization: Making use of the power of our eyes to process a lot of information all at once, visualization can provide illumination where insight might not otherwise be easy to obtain.
- Impact of data science: The individual speakers and panels were keenly aware of how collaboration and other social tools impact products developed by teams of data scientists. They were also focused on the data collected by products that are widely deployed on the Web. Such data collection may result in a conflict between convenience and privacy. For example, analyzing an aggregation of medical records from many people may result in obtaining information that can improve the treatment of disease. However, even if individuals allow their information to be pooled anonymously, effectively securing that very private information is difficult, at best.
- Tidbits: With torrents of real-world data captured in a natural way from the Web, data conditioning is often enough, rather than the strict data quality that traditional enterprise systems require, because the outliers may actually contain information of value. As a result, one of the key challenges of data science is separating correlation from causality.
Overall, Data Science Summit 2012 was interesting and useful and should be continued in the future, but a lot of work has to go on in the field to build a superstructure that can focus and promote clear thinking about data science and its potential impacts.
The "horse and carriage" relationship between computation and information has long been expressed by the old term "data processing." Both are needed, but if the center of the IT solar system is becoming more about data, then data science as the next stage in computer science becomes more attractive and important.
However, the data science industry also requires more exposure. Data Science Summit 2012 was useful for sparking thought about the broad issues affecting data science, but its messages need to be carried to a wider audience. Why? So more people can understand and be part of a dialog that is likely to have an impact on their lives in many ways (with not all effects being necessarily beneficial).
The data science community needs to think not just in terms of individuals, teams and projects, but also in terms of how it will act as a functioning industry. The summit was a valuable starting point, but much work needs to be done before the next event. As projects lead to findings and conclusions that expand upon case studies, the results will give deeper direction and substance to the data science movement.
EMC is a client of David Hill and the Mesabi Group.