Do you have a lot of data? Like, a lot of data. You likely do.
Are you interested in making sense of that data? No? You should be. Are you now? Okay.
Do you like running your own infrastructure? Probably not. Who does?
Companies have, for quite a few years now, relied on the largest open-source data-processing project to gain insights into their data. That project is Apache Spark. Spark is used by companies like Netflix, Google, eBay and lots of other big names. That does not mean, though, that it is only for massive enterprises. Everybody has data that they need to gain insights from.
Spark claims to be up to 100 times faster than its competitor Hadoop MapReduce, because it computes in memory instead of relying on Hadoop's disk-based read/write operations.
Databricks takes Apache Spark and adds more functionality around it, like the choice of which Spark runtime to use, native version control of notebooks, logging, monitoring and compliance.
In close cooperation with Microsoft, they offer Azure Databricks.
Focus on data – not infrastructure
How long does it usually take for your data engineers / scientists (or whatever job title they have in your company) to receive an environment to run their computations on? Be honest. Longer than 30 minutes?
Well, Azure Databricks takes the whole pain of managing clusters away and makes it a non-issue.
Clusters can be deployed from within Databricks itself. You can tell the service to auto-scale up to a maximum number of workers, and to auto-terminate the cluster after a given period of inactivity so that no unnecessary money is spent. That is a big step towards a 100% focus on the actual business requirement: working with the data.
One thing to point out is the Python version for a cluster. I recommend choosing version 3, not version 2, which is selected by default. Python 2.7 will not be maintained past 01/01/2020. Python libraries are actively being migrated to version 3, and the same should be done with any custom Python code that is being written now. If you do depend on a library that is still on Python 2.7, you can of course deploy a Databricks cluster with version 2.7 installed.
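To make the migration point concrete, here is a small sketch of two well-known behavioural differences between Python 2 and 3 that can make notebooks written for one version silently misbehave on the other:

```python
# In Python 2, dividing two ints truncates: 5 / 2 == 2.
# In Python 3, plain division returns a float:
result = 5 / 2    # 2.5 under Python 3, 2 under Python 2
floored = 5 // 2  # 2 - explicit floor division behaves the same in both

# print is a function in Python 3 (it was a statement in Python 2):
print("result:", result, "floored:", floored)
```

Code that relies on the old truncating division is exactly the kind of thing to catch while libraries and custom code are being migrated.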
It is also great to see that Databricks automatically adds tags to the cluster resources it deploys; of course, one can add one's own tags as well. This makes it easy to understand and assign cost later.
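As a sketch of what such a cluster definition can look like, here is a hypothetical request body for the Databricks Clusters API (`POST /api/2.0/clusters/create`); the cluster name, VM size, worker counts and tag values are made up, and the same options are available in the workspace UI:

```python
import json

# Hypothetical cluster definition; all names and sizes are illustrative.
cluster_config = {
    "cluster_name": "analytics-autoscale",
    "spark_version": "5.2.x-scala2.11",   # a Databricks runtime version
    "node_type_id": "Standard_DS3_v2",    # an Azure VM SKU
    "autoscale": {                        # scale between 2 and 8 workers
        "min_workers": 2,
        "max_workers": 8,
    },
    "autotermination_minutes": 30,        # shut down after 30 idle minutes
    "custom_tags": {                      # extra tags for cost assignment
        "team": "data-science",
        "cost-center": "1234",
    },
}

print(json.dumps(cluster_config, indent=2))
```

In practice you would send this body to your workspace URL with a personal access token; the auto-scale and auto-terminate settings are the ones discussed above.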
Lots of vendors claim to be “cloud ready” or something similar, but Databricks took it a step further with Microsoft Azure and properly integrated their service with the platform.
Role-Based Access Control (RBAC) on Azure Databricks is handled via Azure Active Directory, as can be seen when logging in to the Azure Databricks workspace.
Databricks users can pick and choose from Azure VM SKUs for their clusters to achieve the best, cheapest and fastest results possible. For example, memory-optimized M-series VMs for massively memory-heavy computations, or GPU-enabled N-series VMs for machine learning scenarios.
Azure Databricks can be deployed into customer virtual networks (VNets) in order to connect to data sources available only in VNets, or perhaps even on-premises. The nature of Databricks jobs is that they need to work with data, potentially confidential data. Customers can rest assured that only PCI-compliant services are used under the hood. Databricks uses Azure Storage accounts, databases and other services which are all compliant with various industry standards.
A long list of data sources is available to Azure Databricks, including more traditional databases like Azure SQL and SQL Data Warehouse, but also Azure Event Hubs, Azure Data Lake, Azure Cosmos DB and lots of other sources like flat CSV files.
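As a minimal sketch of connecting to one of these sources, assuming a hypothetical Azure SQL server, database and table (all names below are made up), a notebook cell could build a JDBC read like this; the actual read is shown as a comment because the `spark` session only exists inside a notebook:

```python
# Hypothetical server, database and credentials for illustration only.
server = "myserver.database.windows.net"
database = "sales_db"

jdbc_url = f"jdbc:sqlserver://{server}:1433;database={database};encrypt=true"

connection_properties = {
    "user": "reader",                  # better: fetch from a Databricks secret scope
    "password": "<from-secret-scope>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# In a Databricks notebook, where `spark` is predefined:
# df = spark.read.jdbc(url=jdbc_url, table="dbo.orders",
#                      properties=connection_properties)
# display(df)

print(jdbc_url)
```

Other sources follow the same pattern with their own connectors, e.g. `spark.read.csv(...)` for flat files in a mounted storage account.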
For customers who are used to, for example, Jupyter notebooks, the hosted version in Azure Databricks will feel very familiar. Besides Python, those notebooks also support Scala, R and SQL out of the box. For general queries or visualisations there should be no need to import third-party libraries, and if there is, you can.
Azure Databricks notebooks also enable live collaboration. Team members can comment on notebooks and reuse existing queries in their own notebooks. That is real collaboration of the kind one otherwise only knows from source code or shared documents.
Speaking of source code, Azure Databricks of course takes the collaboration a step further and enables companies to integrate their Git source control with their Azure Databricks workspace.
Imagine having your data scientists create pull requests against their notebooks, and having notebooks deployed as recurring jobs via a pipeline.
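As a sketch of that last step, a scheduled notebook job submitted to the Databricks Jobs API (`POST /api/2.0/jobs/create`) could look like this; the job name, cluster id, notebook path and cron expression are all made up:

```python
import json

# Hypothetical job definition; ids and paths are illustrative.
job_config = {
    "name": "nightly-sales-report",
    "existing_cluster_id": "1234-567890-abcde123",
    "notebook_task": {
        "notebook_path": "/Users/someone@example.com/nightly_sales",
    },
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # every night at 02:00
        "timezone_id": "Europe/Berlin",
    },
}

print(json.dumps(job_config, indent=2))
```

A release pipeline could submit exactly such a definition after a pull request against the notebook is merged.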
Together, Azure and Databricks help companies focus on the data problem. With its intelligent built-in libraries on top of Apache Spark, Azure Databricks is a great fit for almost every company trying to make sense of its data.