BigConf 2014

Speakers List and Schedule!

Our speaker lineup is complete, finally! See our list of great speakers below.

Thank you to everyone who submitted a talk, all 60+ of you!



Registration and Continental Breakfast

Great Hall North | Great Hall South | Spring Room

Welcome and Introduction


Keynote: The Rise of Data-Driven Applications

by Eli Collins


Coffee Break


Returning Transactions to Distributed Databases

by Dave Rosenthal

Overview of Building Hadoop Data Applications with the Kite SDK

by Joey Echeverria


Law and Big Data: A Privacy Perspective

by Joseph Jerome

Riding the Elephant: Hadoop 2.0

by Simon Elliston Ball




Big Security for Big Data

by Ari Elias-Bachrach

Building a Lightweight Discovery Interface for Chinese Patents

by Eric Pugh


D3.js Visualizing Data on the Web

by Will Carroll

Deep Learning for Natural Language Processing: Mining Massive Mounds of Text

by Sumeet Vij


Lunch Break


Awful Dashboards for Large Datasets

by John Feminella

Test Driven Relevancy - How to Work with Content Experts to Optimize and Maintain Search Relevancy

by Doug Turnbull

Fundamentals of NoSQL

by Will LaForest


Making a Difference with Data Science

by Armen Kherlopian, Stephanie Rivera, and Paul Yacci

Jump-start with Cassandra

by Matt Overstreet

Data Lessons Learned at Scale

by Charlie Reverte


Snack Break


Open Source Software for Data Scientists

by Charlie Greenbacker

Data Analysis in Python

by Bryan Weber

Cassandra: Consistency and Tolerance (a morality guide for databases)

by Matt Kennedy


Distributed Data Processing with Hydra

by Chris Burroughs

Deriving Insight from Imagery using Giraph and Hadoop

by Dr. Ameet Kini

Big Data and Business Decisions

by Mark Ettrich




Big Data Analytics in Google’s Storage Infrastructure

by Raghav Lagisetty

Taming Text

by Drew Farris


Spark and Spark Streaming

by Ted Malaska

The Data-Paradox

by Jessica Langdorf


Beer Summit at McGinty's Irish Pub

The Rise of Data-Driven Applications by Eli Collins

Enterprise applications have typically focused on implementing business processes. The role of data has been input to or a byproduct of the process, perhaps as input for reporting or analytics. However, there's a widening gap between the types of applications that companies need to build to be competitive and what is achievable with the traditional process-centric approach. Developers are already attacking this gap by making existing applications better through data. For example, we expect applications to make intelligent recommendations, not just provide search functionality. Data is not just making existing applications better; it's driving the creation of new applications. The relative shift in value from processes to data necessitates a more data-centric approach to application development. In this talk we will examine this approach and discuss the implications for application developers and infrastructure.

Eli Collins is Cloudera's Chief Technologist. He spent the previous four years leading the team responsible for Cloudera's Hadoop distribution (CDH), and is an Apache Hadoop committer and PMC member. Prior to joining Cloudera, Eli worked on processor virtualization and Linux at VMware. Eli holds Bachelor's and Master's degrees in Computer Science from New York University and the University of Wisconsin-Madison, respectively. You can find him on Twitter at @elicollins.

Returning Transactions to Distributed Databases by Dave Rosenthal

NoSQL databases have certainly struck a chord with engineers who've sought out high availability and easy scalability for their apps. That, in tandem with analytic tools such as Hadoop, has made managing Big Data much more feasible. But the first generation of NoSQL databases was designed in the shadow of the CAP theorem. Engineers were told to "pick 2 out of 3 -- consistency, availability and partition tolerance." Given that, NoSQL systems abandoned consistency and settled for "eventual consistency."

But was throwing out ACID transactions the best way forward? FoundationDB CEO Dave Rosenthal and NoSQL innovator Google seem to think not. Google's distributed database, Spanner, is both consistent and highly available. Have they both beaten CAP? This talk will examine the impact of CAP on early NoSQL systems, explore the design space of consistent systems, answer the question of what price we pay for transactions in a distributed system, and propose a path towards a better NoSQL and its relation to Big Data moving forward.

Dave Rosenthal is CEO of Northern Virginia-based FoundationDB. He started programming by building games. In high school he formed a team of students that designed and built Fire and Darkness, a 3D strategy game that won the grand prize at the 1st Independent Games Festival. In 2001 he applied his interactive design skills to large-scale data processing as the first employee at Visual Sciences. Visual Sciences' unique data analytics product defined the high end of web analytics. After Visual Sciences was acquired in 2006 he served as CTO of the combined company until a 2008 acquisition by Omniture (now part of Adobe). Mr. Rosenthal has a bachelor of science in computer science from MIT.

The Data-Paradox by Jessica Langdorf

Even the most data-driven brands can’t depend on software alone to get them all the necessary sales and customer improvements. You need a human aspect to do the interpretation in order to be able to recommend specific actions based on the analytics results, as well as generate new ideas and strategies. After all, "Big Data" is just a new term for what has existed for years, which is simply massive repositories of information that need to be organized, extracted, and interpreted in order to identify actionable insights. So, how do machines and humans work together to achieve the right balance?

This session, presented by TouchCommerce's VP of Analytics, Jessica Langdorf, will address the latest techniques, methods and approaches to making sense of massive data sets for enterprise clients in order to inform and drive execution; how to identify appropriate technologies and specialized resources for analyzing customer data and mining millions of engagements to understand patterns and behaviors; and when to rely on automation vs. the human touch (and the right balance of the two).

Jessica Langdorf: As VP of Solutions Planning, Analytics & Optimization at TouchCommerce, Jessica brings 14+ years of operational planning and process improvement experience, with 9 years in web analytics and digital optimization for the telecom, financial services, and retail industries. Jessica is a Certified Usability Analyst (CUA) with an Award of Achievement in Web Analytics. Jessica is responsible for strategic solution design and online engagement best practice definition, execution, and optimization.

Law and Big Data: A Privacy Perspective by Joseph Jerome

The topic of "Big Data" has come to dominate debates about the future of data regulation, presenting new challenges to personal privacy. Fundamental privacy concepts such as notice and consent, context, and data minimization are under strain in a Big Data world. One solution to this new privacy quandary is to offer individuals meaningful rights to access their data in usable, machine-readable formats. Developers should encourage the creation of user-side applications and services based on access to personal data, "featurizing" control over personal information and increasing transparency. This presentation will discuss the privacy concerns of lawyers and policymakers and offer a way forward for enterprising collectors and users of data.

Joseph Jerome serves as Policy Counsel at the Future of Privacy Forum, a Washington, D.C.-based think tank that seeks to advance responsible data practices. At FPF, Joseph explores the intersection of technology and public policy in Big Data and the emerging Internet of Things, where he works on de-identification standards and educational privacy questions. He also assists TeachPrivacy, a company founded by Professor Dan Solove that provides privacy and data security training programs.

Big Security for Big Data by Ari Elias-Bachrach

Big data isn't just being sought after by corporations and governments - it's also being sought after by the hackers that want to sell that data for financial profit. That's why we need to secure our data and the applications we use to access the data. As many of the technologies we're using for big data are new, not a lot is known about how to properly secure them, and some of the security models are still being developed.

Big data systems still need old-school security necessities like encryption and proper access controls, as well as defenses against web application attacks like injection. This presentation will cover some of the most important security features that are needed for any big data application.

Ari Elias-Bachrach is an application security expert and former developer. Having worked for consulting firms, large banks, and the federal government, he is now the CEO of Defensium LLC. Ari spends most of his time working with developers to resolve security issues and bridge the gap between security and development. He is also a regular speaker at security conferences.

Deriving Insight from Imagery using Giraph and Hadoop by Dr. Ameet Kini

Come learn about what Apache Giraph and satellite imagery have in common. This talk presents some novel use cases of analyzing satellite imagery using tools from the Hadoop ecosystem that we have built at DigitalGlobe, a leading provider of satellite imagery and geospatial content with a fleet of five high resolution earth observation satellites. Analyzing imagery from satellites turns out to be a class of big data analytics that has received relatively little attention from the big data community. Hi-res imagery can easily scale up to petabytes, and extracting insight from this imagery involves running analytics ranging from complex yet embarrassingly parallel raster operations to iterative graph algorithms that help find routes and nearest neighbors over this imagery. For example, analysts can run a map/reduce job to build a cross-country mobility model using multiple geospatial layers and then run a suite of routing algorithms using Giraph. This talk focuses on the use cases driving these graph algorithms, describes their implementation in Apache Giraph, and presents some performance numbers. No prior geospatial experience is required of the audience; the goal is simply to learn how Giraph+Hadoop is used to solve a new problem from concept to implementation.

Dr. Ameet Kini is a Principal Engineer at DigitalGlobe where he works on developing novel distributed algorithms for solving geospatial problems. His primary area of expertise is designing scalable data architectures, and he has spent the last decade working within the R&D/product development divisions of companies such as Oracle, IBM, Google, and MITRE. He has a B.S. from UMBC and an M.S./Ph.D. from the University of Wisconsin-Madison, all in Computer Science.

D3.js Visualizing Data on the Web by Will Carroll

In this day and age data is being collected everywhere and the datasets are being stored in databases and large spreadsheets. It is next to impossible for most people to read data line by line to understand the data from a database or a large spreadsheet. Humans are visual beings so we understand shapes and colors a lot more than columns and rows, which is why we need our data to be visualized.

Will Carroll will introduce attendees to D3, or Data-Driven Documents. D3 is a JavaScript library for manipulating documents based on data. D3 helps bring data to life using HTML, SVG and CSS. D3’s emphasis on web standards gives developers the full capabilities of modern browsers, combining powerful visualization components and a data-driven approach to DOM manipulation.

Will Carroll, the principal at develop_for, is both a developer and a designer. He has a background in design and computer science, which allows him to take complex datasets and build comprehensive interactive data visualizations that are dynamic and easy to understand. Will focuses on developing for data, data visualizations and web applications to help people understand the data to make better decisions.

Big Data Analytics in Google’s Storage Infrastructure by Raghav Lagisetty

Google collects and stores massive amounts of data containing detailed observations on production infrastructure. This talk will cover how Google uses Big Data Analytics to extract insights on production infrastructure, for better provisioning and utilization of resources and to help predict and avert catastrophic events. The talk will highlight a few case studies on the Storage Infrastructure side of production, covering the techniques and tools used and the challenges faced during the various phases of the analytics life cycle, from data collection at Google scale to understanding trends and finally creating value.

Raghav Lagisetty is a member of the Storage Analytics team that is responsible for measuring and analyzing various aspects of Google’s storage stack in production.

Prior to Google, Raghav spent more than a decade managing and building engineering teams and ground-up enterprise class storage systems and applications. Raghav has held various positions from Vice President of Engineering to Senior Architect at companies like Smapper Technologies, Brocade Communications and Rhapsody Networks.

Raghav earned his bachelor’s degree in Computer Science from the Indian Institute of Technology, Mumbai and a master’s degree in Computer Science from the University of Arizona, Tucson.

Building a Lightweight Discovery Interface for Chinese Patents by Eric Pugh

The United States Patent and Trademark Office wanted a simple, lightweight, yet modern and rich discovery interface for Chinese patent data. This is the story of the Global Patent Search Network, the next generation multilingual search platform for the USPTO. GPSN was the first public application deployed in the cloud, and allowed a very small development team to build a discovery interface across millions of patents.

This case study will cover:

  • How we leveraged the Amazon Web Services platform for data ingestion, auto scaling, and deployment at a very low price compared to traditional data centers.
  • Some of the innovative methods for converting XML-formatted data to usable information.
  • Parsing through 5 TB of raw TIFF image data and converting it to a modern web-friendly format.
  • Challenges in building a modern Single Page Application that provides a dynamic, rich user experience.
  • How we built “data sharing” features into the application to allow third party systems to build additional functionality on top of GPSN.

Eric Pugh: Fascinated by the “craft” of software development, Eric Pugh has been heavily involved in the open source world as a developer, committer, and user for the past 5 years. He is an emeritus member of the Apache Software Foundation and lately has been mulling over how we move from the read/write web to the data web. In biotech, financial services and defense IT, he has helped European and American companies develop coherent strategies for embracing open source software. Eric became involved in Solr when he submitted the patch SOLR-284 for Parsing Rich Document types such as PDF and MS Office formats that became the single most popular patch as measured by votes! He co-authored Solr Enterprise Search Server.

Deep Learning for Natural Language Processing: Mining Massive Mounds of Text by Sumeet Vij

Deep Learning has been hailed by MIT Technology Review as a breakthrough technology. Advances by pioneers in the field of Deep Neural Network-based Machine Learning have allowed machines to begin comprehending things like humans do, which has helped Google, Microsoft, and Facebook make significant advances in rapidly processing and understanding unstructured text at scale. At Booz Allen Hamilton, we are utilizing the same Deep Learning techniques and tools to mine the massive amount of unstructured text produced within an enterprise by the "systems of engagement" like Yammer, Chat and SharePoint. This talk will explain how Deep Learning can be utilized to rapidly process unstructured text syntactically and semantically to enable Text Analytics and Knowledge Extraction and to create Recommendation Engines.

Sumeet Vij is a highly accomplished Chief Technologist and IT Executive with a proven track record of 15+ years in leading large scale complex IT projects from concept to production in the federal and commercial sector. He specializes in leading high performance software development teams in the arena of Big Data, Hadoop, Machine Learning, Data Analytics & Mining, SOA, Semantic Web, BPM, EAI, Open Source projects and Enterprise Architecture. Sumeet Vij is currently a part of the Strategic Innovation Group (SIG) of Booz Allen Hamilton, focused on Mission Solutions and large scale analysis of data and its use to quickly provide deeper insights, create new capabilities, and drive down costs.

Open Source Software for Data Scientists by Charlie Greenbacker

Harvard Business Review called data scientist "the sexiest job of the 21st century." These days, data scientists are faced with an onslaught of companies pitching products that promise to solve all your problems. Is there such a thing as a "silver bullet" for data science, and is it worth the hefty price tag?

This talk will briefly discuss what data science is, argue why open source software is usually the right choice for data scientists, and examine some of the leading OSS tools for data science available today. Topics will include statistical analysis, data mining, machine learning, natural language processing, and data visualization. Additional materials will be provided on the presentation's companion website.

Charlie Greenbacker is Director of Data Science at Altamira Technologies Corporation, a top open source technology company in the national security space. He specializes in natural language processing (NLP) and advanced analytics on unstructured text, and is also the founder and organizer of the DC NLP meetup group. Charlie has an MS in Computer Science from the University of Delaware, nearly finished a PhD (ABD), and received a Blackfriars Fellowship to attend the University of Oxford as an undergraduate. He has co-authored over a dozen academic publications in peer-reviewed journals and research conference proceedings. Prior to joining Altamira, he worked at Berico Technologies (where he created the open source geoparser CLAVIN), Raytheon BBN Technologies, and the United States Air Force.

Taming Text by Drew Farris

There is so much text in our lives, we are practically drowning in it. Fortunately, there are innovative tools and techniques for managing unstructured information that can throw the smart developer a much-needed lifeline. In this talk, based on the outline of the book of the same name, I'll provide an introduction to a variety of Java-based open source tools that aid in the development of search and NLP applications.

Book Abstract: Taming Text is a practical, example-driven guide to working with text in real applications. This book introduces you to useful techniques like full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. You'll explore real use cases as you systematically absorb the foundations upon which they are built. Written in a clear and concise style, this book avoids jargon, explaining the subject in terms you can understand without a background in statistics or natural language processing. Examples are in Java, but the concepts can be applied in any language.

Drew Farris is a software developer and technology consultant at Booz Allen Hamilton where he focuses on large scale analytics, distributed computing and machine learning. Previously, he worked at TextWise where he implemented a wide variety of text exploration, management and retrieval applications combining natural language processing, classification and visualization techniques. He has contributed to a number of open source projects including Apache Mahout, Lucene and Solr, and holds a master's degree in Information Resource Management from Syracuse University's iSchool and a B.F.A. in Computer Graphics.

Making a Difference with Data Science by Armen Kherlopian, Stephanie Rivera, and Paul Yacci

Data Science pushes the boundaries of what is possible in many facets of society by enabling decisions to be data driven. It's commonly known that Data Science is used for displaying targeted ads, suggesting movies you might like, and optimizing profits via algorithmic trading, but what are some other applications? In this talk we present several case studies where Data Science is used to make a positive impact in public safety and health.

Armen Kherlopian, Stephanie Rivera, and Paul Yacci are Data Scientists at Booz Allen Hamilton and work in a number of markets. Their work includes performing data science in defense, health, finance, and cyber security. All three are coauthors of a short book called The Field Guide to Data Science. The speakers are also involved in a number of Data Science community events and Data Science development efforts.

Spark & Spark Streaming by Ted Malaska

An introduction to Spark and Spark Streaming, running on top of Cloudera's C5 and interacting with Hadoop.

Ted Malaska is a Sr. SA at Cloudera and has worked with over 30 clients and over 100 clusters. He has contributed to Flume, Avro, Pig, MapReduce, YARN, and Cloudera Manager.

Test Driven Relevancy — How to Work with Content Experts to Optimize and Maintain Search Relevancy by Doug Turnbull

Getting good search results is hard; maintaining good relevancy is even harder. Fixing one problem can easily create many others. Without good tools to measure the impact of relevancy changes, there's no way to know if the "fix" that you've developed will cause relevancy problems with other queries. Ideally, much like we have unit tests for code to detect when bugs are introduced, we would like to create ways to measure changes in relevancy. This is exactly what we've done at OpenSource Connections. We've developed a tool, Quepid, that allows us to work with content experts to define metrics for search quality. Once defined, we can instantly measure the impact of modifying our relevancy strategy, allowing us to iterate quickly on very difficult relevancy problems. Get an in-depth look at the tools we use not only to solve a relevancy problem but to make sure it stays solved!

Doug Turnbull is a Search Relevancy Expert at OpenSource Connections. A frequent blogger and speaker, Doug enjoys the intersection of usability and systems programming. That's exactly what he finds in search: low-level code that directly impacts users' lives. In his search work, Doug bridges the gap between content experts and technologists. To help bridge the gap, Doug created Quepid, a search relevancy collaboration canvas used extensively in OpenSource Connections' search work.

Awful Dashboards for Large Datasets by John Feminella

The library of visual tools we have to represent and isolate the important features of complex, multivariate datasets is large. But most dashboards do a poor job of displaying valuable information at a glance, which in turn means they don't deliver value to users. Even the cleverest, most sophisticated data analysis tools can't do their job if the information isn't presented correctly.

In this talk, we'll describe the attributes of a perfect dashboard, point out some common antipatterns that we think you should avoid, and illustrate how to optimize for showing the most relevant facets of a dataset. Don't let your dashboard become another statistic!

John Feminella is an avid technologist, occasional public speaker, and frequent instigator of assorted shenanigans. He recently co-founded UpHex, a start-up providing predictive analytics and automated insights for digital agencies and their clients. John is a guest lecturer at the University of Virginia, mentors budding entrepreneurs, and answers questions on StackOverflow once in a while.

He lives in Charlottesville, VA and likes meta-jokes, milkshakes, and referring to himself in the third person in speaker bios.

Data Analysis in Python by Bryan Weber

This talk will cover Pandas and IPython for beginners. You may have heard of the R programming language for statistical analysis, and you may even have tried it out, but while R is fantastic for statistics, it is not so great for data munging and preparation. Oh, and R requires that you learn yet another programming language.

The combination of Pandas and IPython can provide a familiar (or easy to learn) programming environment that allows you to not only prepare the data, but do the analysis in an interactive manner and access the numerous libraries that exist in Python. Come find out how easy and fun data analysis can be.

Bryan Weber is the founder of Cobenian, a small software and design company in Northern Virginia that focuses on network related software development and automation. He has consulted for numerous clients from small VC funded startups to federal government agencies and Internet backbone organizations. As a consultant to one key Internet infrastructure organization he helped make the transition to per-delegation management of resource records, implement support for DNSSEC and deliver Resource Public Key Infrastructure (RPKI) to help secure BGP route origination. In his free time he enjoys studying programming languages and spending time with his family.

Data Lessons Learned at Scale by Charlie Reverte

AddThis has been in the big data game since 2006 and processed over a trillion events last year on commodity hardware. We'll discuss some lessons we learned along the way, in particular tactics for efficient collection and manipulation of data in a distributed environment. We got to build a lot of our infrastructure before open source alternatives were available and we found out firsthand how consistency and accuracy tradeoffs can make or break you.

Charlie Reverte is VP of Engineering at AddThis (formerly Clearspring). He has helped AddThis scale from scratch in 2006 to reach 1.4 billion unique users across the web. He also co-authored the OExchange spec for open sharing, which was implemented by Twitter and Google, among others. He believes mobile apps are a fad and that the mobile web will win because of addressability and the URL. Charlie studied microprocessors and distributed systems at Carnegie Mellon while sending robots into caves and coal mines. He also developed one of the first augmented reality systems for robotic surgery and ACL reconstruction. After hours, Charlie is hooked on 24-hour endurance car racing and drives for the Clearspring Motor Club.

Riding the Elephant: Hadoop 2.0 by Simon Elliston Ball

Hadoop is about more than MapReduce these days. How can you use new languages like Clojure, F#, Pig and HQL to get the best out of huge amounts of data? How can you use massive clusters of CPUs to make realtime apps with new frameworks like YARN and Tez that make up Hadoop 2.0? By the end of this session you'll know how.

Simon Elliston Ball is head of the Big Data team at Red Gate, focusing on researching and building tools to interact with Big Data platforms. Previously he has worked in the data intensive worlds of hedge funds and financial trading, ERP and e-Commerce, as well as designing and running nationwide networks and websites. These days his head is in Big Data and visualisation.

In the course of those roles, he’s designed and built several organisation-wide data and networking infrastructures, headed up research and development teams, and designed (and implemented) numerous digital products and high-traffic transactional websites.

For a change of technical pace, he writes and produces screencasts on front-end web technologies such as ExtJS, and is an avid NodeJS programmer. In the past he has also edited novels, written screenplays, developed web sites and built a photography business.

Overview of Building Hadoop Data Applications with the Kite SDK by Joey Echeverria

With such a large number of components in the Hadoop ecosystem, writing Hadoop applications can be a challenge for users who are new to the platform. The Kite SDK (formerly CDK) is an open source project with the goal of simplifying Hadoop application development. It codifies best practices for writing Hadoop applications by providing documentation, examples, tools, and APIs for Java developers.

We will discuss the architecture of a common data pipeline from data ingest from an application to report generation. Hadoop concepts and components (including HDFS, Avro, Flume, Crunch, HCatalog, Hive, Impala, Oozie) will be introduced along the way, and they will be explained in the context of solving a concrete problem for the application. We will show how to build a simple end-to-end Hadoop data application that you can take away and adapt to your own use cases.

Joey Echeverria is an Architect at Cloudera where he works directly with customers to deploy production Hadoop clusters and solve a diverse range of business and technical problems. Joey joined Cloudera from the NSA where he worked on data mining, network security, and clustered data processing using Hadoop. Prior to working full time for the NSA, Joey attended Carnegie Mellon University where he earned an M.S. and a B.S. in Electrical and Computer Engineering.

Jump-start with Cassandra by Matt Overstreet

Cassandra is a distributed, massively scalable, fault tolerant, columnar data store, and if you need the ability to make fast writes, the only thing faster than Cassandra is /dev/null! In this fast-paced presentation, we'll briefly describe big data, and the area of big data that Cassandra is designed to fill. We will cover Cassandra's unique, every-node-the-same architecture. We will reveal Cassandra's internal data structure and explain just why Cassandra is so darned fast. Finally, we'll wrap up with a discussion of data modeling using the new standard protocol: CQL (Cassandra Query Language).

By the end, attendees will understand the typical use cases for Cassandra and have the knowledge necessary to start playing with Cassandra on their own.

Matt Overstreet: Usability is Matt Overstreet’s mission. He has worked with Federal, Fortune 500, and small businesses to help collect, mine and interact with data. He solves problems by synthesizing his experiences drawn from a liberal arts and technical background.

Fundamentals of NoSQL by Will LaForest

Of late there has been much innovation in the database space to help tackle the limitations of the venerable RDBMS. Collectively this new breed of non-relational databases is referred to as NoSQL. In order to accommodate modern business pressures around complex data, scale and performance there has been a marked departure from the standard relational approach.

This presentation will cover the fundamental theories behind NoSQL databases such as data modeling, data processing, distributed processing, and indexing.

Will LaForest is the Senior Director, Federal at MongoDB. In his current position, Mr. LaForest evangelizes the benefits of MongoDB, NoSQL and open source software (OSS) in solving Big Data challenges in the Federal government. He has spent 8 years in the NoSQL space focused on the Federal government. His technical career spans diverse areas including data warehousing, machine learning, and building statistical visualization software for SPSS, but it began with code slinging at DARPA. Mr. LaForest holds degrees in Mathematics and Physics from the University of Virginia.

Cassandra: Consistency & Tolerance (a morality guide for databases) by Matt Kennedy

Apache Cassandra is singular in its ability to provide a single logical database dispersed over multi-region datacenters. Cassandra databases can lose entire datacenters due to network outages or natural disasters and the database will remain available for application users. This talk will discuss how Cassandra is able to achieve this feat, and dive deep into the programming realities of distributed systems and consistency models. It turns out that the consistency model is the critical feature of successful distributed systems. When an application and a database can have an honest conversation about the consistency needs of each, a powerful system is born. This honesty is at the core of Cassandra's dominant position in the world of distributed databases.

Matt Kennedy is an architect at DataStax. Matt has been a Cassandra user and occasional contributor since version 0.7 and is a co-organizer of the Cassandra meetup in the Washington, DC area. At the European Cassandra Summit in October of 2013, he was recognized by DataStax as a Cassandra MVP. Matt has been working with distributed systems his entire career and kinda wishes he was one. Because while Cassandra is partition tolerant, Matt is not.

Distributed Data Processing with Hydra by Chris Burroughs

Hydra is a distributed data processing and storage system originally developed at AddThis. It ingests streams of data (think log files) and builds hierarchical summaries. These hierarchical trees can be explored by humans (tiny queries), as part of a machine learning pipeline (big queries), or to support live data analytics on websites (lots of queries). Hydra was released as open source software in the beginning of 2014.

Chris Burroughs is an engineer at AddThis where he most recently led the effort to open source Hydra. The rest of the time he stares at stack traces, fights software interrupts, and chases garbage collection to make applications faster and servers run hotter. Chris is also a co-organizer of the DC area Apache Cassandra meetup.

Big Data and Business Decisions by Mark Ettrich

Big Data and Business Planning and Decision Making. How do today's leaders exploit Big Data to make better business decisions? A framework and guide to performing Big Data Analysis.

Mark Ettrich has spent his career designing and implementing scalable data solutions for capturing, analyzing, and reporting on data created by people and machines. He has also led large-scale integration efforts associated with ad networks such as Quigo and AOL.

He is the founder of Big Data District, an organization that connects big data talent in the greater D.C. area through meetups and hackathons. He is also currently a Principal with Accenture responsible for High Performance Analytics & Agile Methods & Tools for North America. Previously Mark held executive management positions including SVP of Engineering for Jumptap and Senior Director of AOL Data Warehouse, with 10 years of experience contributing to AOL's product and data services.