How to choose the right data tool for your business
Find the right tool in the entire toolkit
I often come across companies that do not fully understand which data tools to use and at what stage of business development. As a result, they either leave their analytical infrastructure underdeveloped or overcomplicate it with tools they do not actually need. I will try to clarify this and suggest tools for each stage of business development.
Let’s trace the evolution of analytics using the example of an online sportswear store.
Stage 1. To start, you buy the first batch of goods, order inexpensive development of the site and invest some money in marketing activities. You probably work alone or with a couple of helpers; there is no large staff to speak of.
Money is limited, but already at this first stage it is important to start collecting data and analyzing it to make decisions. First of all, we need to analyze the efficiency of the traffic channels that bring people to the site, and to analyze user behavior on the site, because the site is the main point of sale. It is important to understand what people like on the site, what they don’t like, and what prevents them from placing an order. There is good news: unlike in offline stores, this is fairly easy to measure.
For this stage, free web analytics tools such as Google Analytics are a good choice. It is important to understand how these tools work, at least at a basic level, and to know how to read the basic reports. It is also very important to make a habit of analyzing data regularly (once a day, three times a week, etc.) and testing different hypotheses for increasing the efficiency of the site and of marketing activities.
For recording deals, an ordinary Google Spreadsheet (or Excel) is sufficient for now.
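To make "efficiency of traffic channels" concrete, here is a minimal sketch of one common measure, the conversion rate (orders divided by sessions) per channel. The channel names and numbers are made up for illustration; in practice they would come from a web analytics report.

```python
def conversion_rates(channel_stats):
    """Return orders / sessions for each channel (0.0 when there are no sessions)."""
    rates = {}
    for channel, (sessions, orders) in channel_stats.items():
        rates[channel] = orders / sessions if sessions else 0.0
    return rates


# Hypothetical weekly numbers: channel -> (sessions, orders).
stats = {
    "organic": (1000, 20),
    "paid_search": (500, 15),
    "email": (200, 10),
}

rates = conversion_rates(stats)
best_channel = max(rates, key=rates.get)

for channel, rate in rates.items():
    print(f"{channel}: {rate:.1%}")
print("Best converting channel:", best_channel)
```

A channel with little traffic can still convert best, which is exactly the kind of observation that is worth turning into a hypothesis and testing.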
Stage 2. We have set up the basic business processes (procurement, sales, finance, marketing) and made a habit of analyzing data regularly. We have a small staff (about 20 people) and a sales department. Sales are starting to stabilize.
Here we can think of implementing a CRM system like Salesforce. We may also want to see the full path of our customers, from visiting the site to the successful deals registered either in Google Spreadsheets or in the CRM system. This will allow us to measure the efficiency of marketing channels in terms of final sales, not just leads.
If you cannot yet allocate a lot of resources to analytics, you can join online data with data on final sales in Google Spreadsheets or in BI tools like Power BI, Google Data Studio (now Looker Studio) or Klipfolio. The best option is to automate this process, but if that is difficult for now, you can join the data manually.
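As an illustration of such a join, here is a minimal sketch in plain Python. It assumes both exports share a visitor identifier, here a hypothetical `client_id` column; the field names and sample rows are invented for the example, not taken from any particular tool.

```python
# Hypothetical export from a web analytics tool: one row per visitor.
sessions = [
    {"client_id": "c1", "channel": "organic"},
    {"client_id": "c2", "channel": "paid_search"},
    {"client_id": "c3", "channel": "paid_search"},
    {"client_id": "c4", "channel": "email"},
]

# Hypothetical deals sheet or CRM export: one row per closed deal.
deals = [
    {"client_id": "c1", "revenue": 120.0},
    {"client_id": "c3", "revenue": 80.0},
    {"client_id": "c3", "revenue": 40.0},
]


def revenue_by_channel(sessions, deals):
    """Join deals to sessions on client_id and sum revenue per channel."""
    channel_of = {row["client_id"]: row["channel"] for row in sessions}
    # Start every known channel at zero so channels without sales stay visible.
    totals = {channel: 0.0 for channel in channel_of.values()}
    for deal in deals:
        channel = channel_of.get(deal["client_id"], "unknown")
        totals[channel] = totals.get(channel, 0.0) + deal["revenue"]
    return totals


totals = revenue_by_channel(sessions, deals)
print(totals)
```

Deals whose `client_id` is missing from the analytics export end up under "unknown", which is a useful signal that tracking or data entry has gaps.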
If you are ready to allocate more resources to analytics, you can start thinking about building a data warehouse. At this stage you can use solutions based on an RDBMS such as MySQL, Microsoft SQL Server or PostgreSQL. But if you expect the volume of data to grow and need a scalable solution, my recommendation is to use cloud warehouses such as Google BigQuery, Snowflake, Amazon Redshift, Azure Synapse or ClickHouse.
For ETL, you can use either a monolithic ETL tool or a mix of tools. For example, a simple and inexpensive solution could be Pentaho Data Integration or Talend. For ELT, you can use data loaders such as Stitch, Renta, OWOX or Matillion Data Loader together with data transformation tools like dbt (Data Build Tool). You can also do the transformations with the built-in features of cloud data warehouses (e.g. Scheduled Queries in Google BigQuery).
For extracting and loading data, I personally prefer my own ETL scripts in Python deployed in a serverless environment (e.g. Google Cloud Functions, Google Cloud Run, AWS Lambda etc.). This approach keeps costs as low as possible. But if you have no coding expertise and find dedicated tools and data loaders easier, use them. The main point is to solve business tasks.
Any BI tool that is convenient for you will do.
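To show the general shape of such a script, here is a minimal sketch with separate extract, transform and load steps behind a single entry point. The `handler(event, context)` signature follows the AWS Lambda convention for Python; everything else (the sample rows, the field names, the in-memory `WAREHOUSE` list standing in for a real warehouse table) is an assumption made for illustration.

```python
from datetime import date

# Stand-in for a warehouse table. A real script would write to BigQuery,
# Redshift, etc. through that service's own client library.
WAREHOUSE = []


def extract():
    """Return raw rows. Stubbed here; normally an API call or a file read."""
    return [
        {"order_id": "1", "amount": "120.50", "channel": " Paid_Search "},
        {"order_id": "2", "amount": "", "channel": "email"},  # missing amount
        {"order_id": "3", "amount": "80", "channel": "Organic"},
    ]


def transform(rows, load_date):
    """Drop rows without an amount; normalise types and channel names."""
    clean = []
    for row in rows:
        if not row["amount"]:
            continue
        clean.append({
            "order_id": int(row["order_id"]),
            "amount": float(row["amount"]),
            "channel": row["channel"].strip().lower(),
            "load_date": load_date,
        })
    return clean


def load(rows):
    """Append rows to the stand-in warehouse and return how many were loaded."""
    WAREHOUSE.extend(rows)
    return len(rows)


def handler(event, context):
    """Entry point in the style of an AWS Lambda handler."""
    rows = transform(extract(), date.today().isoformat())
    return {"loaded": load(rows)}


if __name__ == "__main__":
    print(handler({}, None))
```

Keeping the three steps as separate functions makes it easy to swap the stubbed `extract` and `load` for real API and warehouse calls without touching the transformation logic.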
Stage 3. Having achieved a stable sales stream and established the basic business processes, we now want to develop and expand the business. We can increase the number of communication channels with leads (e.g. via a mobile app), open offline points of sale, etc.
Here we can add mobile analytics services such as Firebase Analytics, AppsFlyer or Adjust to our toolkit.
It also makes sense to go beyond BI reports and the reports in web/app analytics tools and do deeper analysis using SQL, Python or R. Here we can use tools such as Jupyter Notebook or RStudio.
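As a small example of what such deeper analysis can look like, here is a sketch that runs a SQL aggregation from Python using the standard-library `sqlite3` module, the way one might in a Jupyter Notebook. The `orders` table, its columns and the sample rows are hypothetical; with a real warehouse the same query would run through that warehouse's own connector.

```python
import sqlite3


def channel_summary(orders):
    """Return (channel, orders, total_revenue, avg_order_value) per channel."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (channel TEXT, revenue REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
    rows = conn.execute(
        """
        SELECT channel,
               COUNT(*)     AS orders,
               SUM(revenue) AS total_revenue,
               AVG(revenue) AS avg_order_value
        FROM orders
        GROUP BY channel
        ORDER BY total_revenue DESC
        """
    ).fetchall()
    conn.close()
    return rows


# Hypothetical orders from the website, the mobile app and an offline store.
sample_orders = [
    ("web", 50.0),
    ("web", 70.0),
    ("app", 30.0),
    ("offline", 200.0),
]

summary = channel_summary(sample_orders)
for row in summary:
    print(row)
```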
Stage 4. We have increased brand awareness, and we have a stable stream of several million users coming to our website and mobile app every day. The number of data sources and the volume of data have grown. The need for competent analytics is growing even more, and the current infrastructure is no longer enough. We want access to raw data (without cleaning and aggregation) in order to have more flexibility in data analysis.
At this stage we can build a data platform (a mix of data lake and data warehouse) or a lakehouse. For this, we can use the following tools and combinations of tools:
- Amazon S3 plus Amazon Redshift plus Amazon Athena or Amazon Redshift Spectrum
- Delta Lake (Databricks)
- Google Cloud Storage plus Google BigQuery
- Azure Data Lake Storage plus Azure Synapse
- HDFS plus Hive or Impala (Hadoop)
At this stage, it also makes sense to use cloud ETL tools and flexible orchestration tools: Azure Data Factory, AWS Glue, Google Cloud Dataflow, Matillion ETL, Fivetran, Apache Airflow, Luigi, Apache NiFi etc.
For big data processing, good choices are Apache Spark, Databricks, Amazon EMR (Elastic MapReduce) or Google Cloud Dataproc.
It is also important to take a product approach to developing the data infrastructure, i.e. to adopt Agile and DevOps practices such as code versioning (Git), building CI/CD pipelines (e.g. using Azure DevOps, Google Cloud Build, AWS CodePipeline or Jenkins), deploying container clusters (using Docker and Kubernetes) and using Infrastructure as Code (Terraform, AWS CloudFormation etc.).
For getting raw data, we can also use streaming and related technologies and tools: Apache Kafka, Amazon Kinesis, Google Cloud Pub/Sub, Spark Streaming, Azure Event Hubs.
Stage 5. After building a big data architecture (creating a data platform and setting up streaming), we can do advanced analysis, build machine learning and deep learning models and deploy them in production.
That’s it. I hope this article was helpful and that you now have an understanding of which tool to use at which stage of business development.
See you.