The process of accelerating data analysis insights by reducing the time between coding and deployment, now called DevOps, has become more relevant with the emerging role of data science teams in large organizations. Effective data science teams must share their findings with each other and with the organization at large, be agile enough to embed new features or address additional goals during development, and move results from data wrangling, exploratory data analysis (EDA) and predictive analytics into automated visualizations, diagnostics and reports intended for wider consumption. In the recent past, data wrangling, EDA and predictive analytics were done with one set of tools, while automated visualizations, recommendations and reports were done with another.
This separation often extended to the very systems where the tools were located (e.g., a development environment versus a production environment). Separating the tools and the environments hinders mid-process feedback and development modifications and, by its very nature, creates time lags between discovering results and sharing them. In addition, reproducing or modifying a project could become a project in itself if the original development environment no longer existed or the data scientist who created it had left the firm.
RCloud is open-source software created at AT&T Labs by Simon Urbanek, Gordon Woodhull and Carlos Scheidegger to address the collaboration, sharing, scalability and reproducibility problems that arise between data analysis development and deployment.