Bimotics Blog: Building an Online Data Warehouse Part 1

Google Cloud Platform is a great option for companies looking to build a Data Warehouse on the cloud, with plenty of APIs from which to build. Currently, the number of APIs which you can choose from to connect your applications to the cloud is growing by the day. As of now, you have two options: hire a developer to code and integrate on top of those great Google and Applications APIs, or use tools like marvin. to help you automate some of the processes that are required to build a data warehouse.

marvin. is a tool from Bimotics that allows you to set up an online data warehouse without the need to code. Google has created the capability to host the biggest data warehouse in history, Google BigQuery, and marvin. is the bridge you need to make that process as easy as possible. The steps below will help you get started.

What does it take to set up a Data Warehouse on the cloud? A simplified process for cloud bi would contain the following steps:

1. Gathering the Data

2. Storing the Data

3. Analyzing the Data

4. Visualizing the Data

Gathering the Data

This first step has historically been the hardest, but advances in on-premise technology and Cloud applications like QuickBooks and SalesForce now provide methods to extract the data through APIs. These are often referred to as connectors. Most of these apps allow users to extract data in formats like CSV or JSON. Even e-commerce platforms allow users to extract their customer, product, order, inventory, and lead information through these connectors. Extracting these files to build a Cloud Data Warehouse is where it all begins. Without access to raw data, it is hard to design and build the right Data Warehouse architecture. Remember the old mantra “garbage in, garbage out”, bad data will always yield bad insight or analysis. Important note before going any further: if your data has exceptions, missing items, or other errors, fix them in the application itself before progressing with your warehouse.

Storing the Data

Storage is another big challenge. You will need to answer questions such as:

Where should I store this data and how?
Is there enough space on my servers and hard drives?
How much data needs to be stored?
How often does information need to be added?

After setting marvin. up, pull data from your chosen application (e.g. e-commerce sample). marvin. then allows you to create buckets or folders where these files are to be stored- each relating to specific customers, products, orders, inventory etc.

Organizing data into buckets makes it much easier to refer to later. We suggest you create two types of buckets for each file: one to process into the next step and another for files that have already been processed. For example, we have a file called “Orders_New” and one called “Orders_History.” Not to go into great detail on file names, but naming them as order_YYYYMMDD.csv can help identify when the orders where extracted.

marvin. allows you to store data in any format, but the next step requires that the data be either in CSV or JSON formats. marvin. will take an additional helpful step in compressing the files into GZIP. This minimizes the storage space used as well as the cost associated with storing the data. Storing your data this way allows you to keep a good archive of your data. marvin. also lets you to upload multiple files or folders into a bucket. Finally, you can preview files, as well as download or delete the ones that you do not need.

In the next part of this series, we will go into the next two steps: analyzing and visualizing data using marvin.

Bimotics Blog

Monday, March 7, 2016

Building an Online Data Warehouse Part 1

No comments:

Post a Comment