What is behind your data?
No matter whether you’re a student taking your first steps into the world of data analysis or a young entrepreneur trying to get insights from customer data, you may be astonished by the pricing of high-volume analysis workflows. Enterprise solutions can be extremely costly due to high computing power and huge storage demands. The deep technical knowledge needed for setup and the complex licensing plans can be really scary for small businesses and students. But does it have to be like this? Can we lower the entry barriers?
With the rise of the big cloud providers the price of infrastructure has dropped significantly. Storage has become dirt cheap and computing power can scale with the actual demand. The pricing models quite often include a free tier or a low-usage plan. Cloud solutions support a wide variety of programming languages and can usually be integrated easily with other well-known systems. Startups running in the cloud for less than $10 per month are not a surprise anymore. This article gives an example of a low-budget analysis workflow using Microsoft Azure services.
The example below is based on one of the biggest Polish discussion boards, with over 6 000 000 page views per month. The goal is to download all the posts and serve custom, periodically refreshed statistics about the content. The whole solution is going to be optimized for low cost while still offering powerful features and extensibility for future needs. Let’s talk about the architecture concept.
Picking the right storage
By doing a simple calculation I estimated that the whole data set will consume around 50 GB in JSON format. What kind of storage would be suitable for this data? Among the many options offered by Azure I picked the one that offers the best value for a relatively low price – Data Lake Storage. It also offers a wide range of integration options and allows me to decouple the storage from the computation part. Seems like a good choice, so let’s put it in the middle of the chart.
The price of 1 GB is €0.0329 per month, plus another €0.043 for every 10 000 write operations. What we need to remember is that the minimum billable file size is 256 KB, so instead of saving every post to a separate file it is better to group them by day or month – Azure Data Lake is optimized for handling large files rather than small ones. In my example the posts grouped by day consumed around 25 MB per file, so I decided to keep it that way.
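To see why the 256 KB minimum matters, here is a quick back-of-the-envelope check in Python, using only the estimates from this article (roughly 20 million posts, as calculated in the next section, and 50 GB of JSON):

```python
# Back-of-the-envelope check: billed volume if every post were its own file,
# given the 256 KB minimum billable file size.
posts = 20_000_000        # estimated number of posts (see next section)
min_billable_kb = 256     # minimum billable file size in Data Lake
actual_data_gb = 50       # estimated real size of the data set in JSON

billed_gb = posts * min_billable_kb / 1024 / 1024   # ≈ 4 883 GB billed
print(f"billed: {billed_gb:,.0f} GB, overbilling: {billed_gb / actual_data_gb:.0f}x")
```

One file per post would be billed as nearly 5 TB – almost a hundred times the actual data volume – which is exactly why grouping posts by day pays off.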
Decoupling the storage and computation in a data analysis solution allows you to scale the components independently. What was important for me was to have a real-time data ingestion pipeline into the Data Lake, so that I can easily attach any computation engine – real-time or scheduled. By always having the latest data I could easily change the frequency of generating reports from daily to even hourly.
Getting the data in
The users create around 2 000 000 posts per year and the time frame is 10 years. That’s at least 20 000 000 requests to download the page content, convert it to JSON format and store it in the Data Lake. With that in mind I took a closer look at Azure Functions, which gives 1 million executions for free and after that charges €0.169 per million executions. Combining that with Queues from a Storage Account, I could implement the producer-consumer pattern.
The producer checks the latest post ID, compares it with the last processed one and inserts a message into the queue for every new post (I’m using the fact that IDs are generated sequentially). This function can be triggered by a timer. On the other side, the consumer function is automatically triggered by a new message in the queue and executes the task. In the Consumption plan an Azure Function can scale up to 200 instances, so it sounds like a reasonable choice for downloading a massive amount of data. Nonetheless, a couple of things need to be addressed (a code sketch of the whole setup follows this list):
- Outbound IP addresses are limited. If the target website throttles requests based on the IP address, you may easily get blocked for making too many requests. A solution may be to increase the number of IP addresses, call through a public rotating proxy or limit the maximum number of function instances.
- Functions scale up to 200 instances. That means up to 200 workers sending requests to the target website, so even with an increased IP pool your application may get blocked. There are many ways to throttle Azure Functions, but in my opinion the easiest one is to create one more queue and one more timer-triggered function whose only purpose is to move messages from one queue to the other in batches. By manipulating the batch size and the time interval you can control how fast the incoming messages are processed.
- Messages in the queue are deleted after 7 days. That’s the maximum time a message can stay in the queue. If the consumer function may not be able to pick a message up within 7 days, you should consider Table storage instead of a Queue, or any other storage with a longer persistence time.
- Limit Application Insights to the minimum. While very helpful at the beginning, it can quickly generate high usage costs – 1 GB of data ingestion costs nearly €3! You can set a daily cap in the settings.
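Here is a minimal sketch of the whole pipeline as Python Azure Functions (v2 programming model). The queue names, forum URLs and the helper functions are hypothetical placeholders – the real project would persist the checkpoint durably and batch the Data Lake writes:

```python
# Minimal sketch of the producer, throttle and consumer functions. Queue names,
# URLs and the helpers below are hypothetical placeholders.
import os

import requests
import azure.functions as func
from azure.storage.queue import QueueClient

app = func.FunctionApp()
CONN = os.environ["AzureWebJobsStorage"]

def load_checkpoint() -> int:
    return 0      # placeholder: read the last processed post ID from durable storage

def save_checkpoint(post_id: int) -> None:
    pass          # placeholder: persist the last processed post ID

def parse_post(html: str) -> str:
    return "{}"   # placeholder: extract the post fields from HTML into JSON

def save_to_data_lake(record: str) -> None:
    pass          # placeholder: append the record to the current day's file

@app.timer_trigger(schedule="0 */5 * * * *", arg_name="timer")
def producer(timer: func.TimerRequest) -> None:
    # Enqueue one message per new post ID; IDs are generated sequentially.
    staging = QueueClient.from_connection_string(CONN, "posts-staging")
    latest = int(requests.get("https://forum.example/latest-post-id").text)  # hypothetical endpoint
    for post_id in range(load_checkpoint() + 1, latest + 1):
        staging.send_message(str(post_id))
    save_checkpoint(latest)

@app.timer_trigger(schedule="*/30 * * * * *", arg_name="timer")
def throttle(timer: func.TimerRequest) -> None:
    # Move a fixed batch from the staging queue to the work queue every 30 s;
    # the batch size and the schedule together cap the overall request rate.
    staging = QueueClient.from_connection_string(CONN, "posts-staging")
    work = QueueClient.from_connection_string(CONN, "posts")
    for message in staging.receive_messages(max_messages=32):
        work.send_message(message.content)
        staging.delete_message(message)

@app.queue_trigger(arg_name="msg", queue_name="posts", connection="AzureWebJobsStorage")
def consumer(msg: func.QueueMessage) -> None:
    # Download the page, convert it to JSON and store it in the Data Lake.
    # Note: the Functions host expects base64-encoded queue messages by default,
    # so the queue clients above would need a matching message encode policy.
    post_id = msg.get_body().decode()
    html = requests.get(f"https://forum.example/posts/{post_id}").text  # hypothetical URL
    save_to_data_lake(parse_post(html))
```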
In this setup it takes around 1–2 minutes from a new post appearing on the website to it being saved in the Data Lake, which is enough for this project to be considered near real-time. So far the architecture diagram looks like this.
Doing the heavy work
Azure Stream Analytics, Azure Data Lake Analytics, Azure Databricks, SQL Data Warehouse and more. To choose the right tool for the job we need to revisit the project requirements. For this example project it’s quite simple – the main factor is low cost, so we can exclude real-time processing engines and data warehouses, as they can generate quite an impressive bill. The focus will be on batch, periodic processing of the data from the Data Lake. Two options come to the fore: ADLA and Databricks. As ADLA has not been migrated to Azure Data Lake Storage Gen2 and is thus no longer actively developed, Databricks gets all the attention. Azure Databricks is a fully managed Apache Spark-based analytics service. It integrates easily with other Azure services and can do the heavy processing. What is also attractive for the current project is that a Databricks cluster can execute scheduled jobs, so the high-powered cluster will be running only for short intervals to generate the reports.
The steps needed to complete this part are pretty well documented: mounting the Data Lake storage as a file system, writing the query in a notebook and scheduling a job to execute the notebook. Let’s add another component to the diagram. But where should we output the query result? How can it be served to the user? These questions will be answered in the next section.
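As an illustration, the notebook for this step could look roughly like the following PySpark sketch. The storage account, container, secret scope and column names are hypothetical, and `dbutils` and `spark` are predefined in Databricks notebooks:

```python
# Hypothetical notebook cells: mount the Data Lake (ADLS Gen2, OAuth via a
# service principal) and compute a simple monthly statistic over the posts.
from pyspark.sql import functions as F

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("scope", "client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("scope", "client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://posts@mystorageaccount.dfs.core.windows.net/",
    mount_point="/mnt/posts",
    extra_configs=configs,
)

# One JSON file per day, e.g. /mnt/posts/2019/06/01.json
posts = spark.read.json("/mnt/posts/*/*/*.json")
monthly = (posts
           .groupBy(F.date_format("created_at", "yyyy-MM").alias("month"))
           .count()
           .orderBy("month"))
```

A scheduled Databricks job can then execute this notebook, so the cluster only spins up when the reports need refreshing.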
Publishing the results
Azure Databricks can output query results to many services. The question should rather be how the visual dashboard will be created and how it can consume the data. One of the possibilities is to use Power BI to create a dashboard and Power BI Service to publish it on a public page. While Power BI is a really powerful tool, at the beginning you may struggle to display the right data. Things get even more complicated when you want to create an interactive dashboard. In my example I wanted to create three versions of each report – weekly, monthly and yearly – and then add a simple drop-down to select the time frame. While it’s relatively easy to update one chart based on the drop-down value, I couldn’t manage to update multiple charts across the whole dashboard. That pushed me to think about a different solution.
Having a bit of experience with front-end frameworks, I knew there are multiple so-called admin dashboard projects. They are based on a popular framework and offer out-of-the-box visual controls with styling for fast dashboard development. One of them is Vue Element Admin, based on Vue.js. This website can be hosted directly in an Azure Storage account using the Static Website feature. If Databricks can output the query results to Blob storage as JSON files, then the dashboard can easily pick them up and feed the visual controls. The whole front-end part will be just a bunch of static files hosted on Blob Storage, so the cost of serving this dashboard even to a large group of users is absolutely minimal. That seems like a good plan, so let’s add the last component to our chart.
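Continuing the sketch from the previous section, the publishing step could be as simple as a hypothetical final notebook cell that uploads the aggregated report as a single JSON file into the $web container served by the Static Website feature (the secret scope and paths are placeholders):

```python
# Hypothetical: publish the report as one JSON file for the static dashboard.
import json
from azure.storage.blob import BlobClient

# 'monthly' is the aggregated DataFrame from the previous sketch.
payload = json.dumps([row.asDict() for row in monthly.collect()])

blob = BlobClient.from_connection_string(
    conn_str=dbutils.secrets.get("scope", "storage-connection-string"),
    container_name="$web",                 # container served by Static Website
    blob_name="data/posts-monthly.json",   # fetched by the Vue dashboard
)
blob.upload_blob(payload, overwrite=True)
```

The Vue dashboard can then fetch `data/posts-monthly.json` with a plain HTTP request and bind it to the chart controls.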
Let the money talk
The total cost of running the project was 85.10 CHF. It took over one week to download the historical posts from the last 10 years. After downloading the huge historical data set the project only ingests new posts, so the service usage and maintenance cost is much, much lower. But let’s dig into the backfilling period.
Data Lake storage was supposed to be extremely cheap, but it actually generated the highest cost among all services. The total amount of downloaded data is 31.8 GB, which is the result of fetching 14 302 462 web pages.
As you can see, the cost of data at rest is very low, so where did the money escape? The answer is right around the corner.
14.6 million write requests seems like a valid number, because every downloaded post was immediately saved to the Data Lake. The cost of 10 000 write transactions is only 0.05 CHF, but can it be lowered? A single Data Lake transaction can carry up to 4 MB, so batching the data before saving would definitely save some money here – 31.8 GB in 4 MB chunks is only around 8 000 transactions. Something to note for the future.
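A batching writer could look roughly like this sketch, which buffers posts in memory and appends them to the current daily file in ~4 MB chunks (the file path, container and connection string are placeholders):

```python
# Hypothetical batching writer for ADLS Gen2: buffer posts and append them in
# ~4 MB chunks so each write transaction is fully utilized.
import os
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient.from_connection_string(os.environ["STORAGE_CONN"])
file_client = service.get_file_system_client("posts").get_file_client("2019/06/01.json")
file_client.create_file()            # start a fresh daily file

buffer: list[bytes] = []
buffered, offset = 0, 0
CHUNK = 4 * 1024 * 1024              # the 4 MB transaction size

def flush() -> None:
    global buffered, offset
    chunk = b"".join(buffer)
    file_client.append_data(chunk, offset=offset, length=len(chunk))
    offset += len(chunk)
    file_client.flush_data(offset)   # commit the appended data
    buffer.clear()
    buffered = 0

def add_post(post_json: str) -> None:
    global buffered
    data = (post_json + "\n").encode("utf-8")
    buffer.append(data)
    buffered += len(data)
    if buffered >= CHUNK:
        flush()
```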
The virtual machine and storage costs from the chart belong to Databricks. The cluster was running once per week for about six hours. The query performance was quite decent on a yearly data set, but very sluggish when calculating statistics over the whole 10-year data set. It could be improved, but that would require a much bigger time investment in analyzing query performance and optimizing the execution plan. The total cost of Databricks was around 36 CHF.
What is also worth mentioning is that Azure Functions consumed only 2.30 CHF! That’s a pretty amazing result considering the fact that the functions were executed over 14 million times.
Running this experiment for less than 100 CHF exceeded my expectations, because I had reserved 150 CHF in my Azure account. Things like batching writes to storage or improving query performance could lower the total cost to nearly 50 CHF, but it’s still a satisfying result. Azure Databricks proved to be a powerful tool for data analysis with multiple supported languages and many available integration options. Currently the monthly cost of running the project is below 10 CHF. On the other hand, the project only updates the reports on a weekly basis. Is it possible to create real-time analytics at a similar scale and for a reasonable cost? Sounds like a good idea for the next experiment.