New to Data Analytics? Here Are 3 Simple Steps for Getting Started
Since I got involved in the big data and analytics world a few years ago, I’ve become the person that friends come to (and refer their friends to) for answers to questions about data.
Recently I’ve been asked a few times by entrepreneurs how they can make sure their company is prepared to do analytics, even before they know precisely what they will be tracking and how they will use that information.
Obviously, there is no formula that works for every situation, but I think there are some rules of thumb that should apply to almost any scenario.
People seem to like guidelines that come in groups of three, so I’m breaking my advice into three groups: planning, acquisition, and “future-proofing.”
Before you know what you will be doing with data, you must have some idea of what you want from the analysis. Analytics should be driven by a particular set of business goals, not the data, so start there.
—Identify goals important to your business, such as increasing sales, reducing employee turnover, and improving your return customer rate.
—Know which specific actions you would take to achieve those goals. This could be providing point-of-sale coupons that are compelling to the consumer or creating incentives that inspire sales staff.
—Understand what knowledge would allow you to decide on the right actions to take, like the demographics of a user, past success of incentives, and even the influence of weather.
Knowing why you want to collect data is the most important step and should not be glossed over.
Once you know what you need the data for, you will have some idea of what data you should be collecting or acquiring. Remember that some data is publicly available or can be purchased from vendors, so you don’t need to collect all of it yourself.
Once you know why you want to keep data, the most important thing is to actually start keeping the data you are already collecting. Disk space is cheap, and systems for storing data are inexpensive, from Hadoop to high-capacity USB thumb drives.
As long as you keep good notes on the meaning of the various fields and data sets you are keeping, you should be able to get some value in the future from what you saved in the past.
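One lightweight way to keep such notes is a small, machine-readable data dictionary stored next to the data. The sketch below is only an illustration: the data set name, field names, source, and allowed values are all invented, and the two helper functions are hypothetical conveniences rather than part of any library.

```python
# Hypothetical data dictionary for an illustrative "customers" data set.
# Every name, date, and allowed value here is made up for the example.
DATA_DICTIONARY = {
    "dataset": "customers",
    "source": "web sign-up form",
    "collected_since": "2014-01-01",
    "fields": {
        "email": {"meaning": "contact e-mail, stored in lower case", "type": "str"},
        "state": {
            "meaning": "US state of the billing address",
            "type": "str",
            "allowed": ["CA", "NY", "TX"],
        },
        "birth_year": {"meaning": "four-digit year of birth", "type": "int"},
    },
}


def undocumented_fields(record, dictionary):
    """Return the names of fields in `record` that the dictionary does not describe."""
    return sorted(set(record) - set(dictionary["fields"]))


def disallowed_values(record, dictionary):
    """Return {field: value} for values outside a field's documented allowed list."""
    bad = {}
    for name, spec in dictionary["fields"].items():
        allowed = spec.get("allowed")
        if allowed is not None and name in record and record[name] not in allowed:
            bad[name] = record[name]
    return bad
```

Checking incoming records against the dictionary, for instance with `undocumented_fields(row, DATA_DICTIONARY)`, is one way to notice when the notes and the data have drifted apart.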
But simply keeping the data isn’t going to allow you to do analytics. You will need to get it into some structured form. Even so-called “unstructured” data is structured—either loosely (e.g. log files) or implicitly (e.g. human language).
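To make the log-file case concrete, here is a minimal sketch of turning a loosely structured log line into named fields. The log layout (date, time, level, message) is an assumption chosen for illustration; real logs will need their own pattern.

```python
import re
from datetime import datetime

# Assumed, illustrative log layout: "<date> <time> <LEVEL> <message>"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)$"
)


def parse_log_line(line):
    """Turn one log line into a dict of fields, or None if it does not fit the layout."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    try:
        # Convert the text timestamp into a real datetime so it can be sorted and filtered.
        record["timestamp"] = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        # Digits in the right places but not a valid date (e.g. month 13).
        return None
    return record
```

Lines that do not fit the pattern come back as `None`, which makes it easy to count and inspect them instead of silently dropping them.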
Structuring your data will largely depend on having “clean” data. Tips for keeping clean data include:
—When possible, ingest data via controlled methods like machine-generated output and via forms with check boxes and drop-down options, rather than text boxes.
—Free-form text entered by humans is hard to use and should be put into a canonical form whenever possible. For example, always use “Ave” rather than “Avenue” in addresses, keep e-mail addresses in lower case, and use only dashes in phone numbers rather than a mix of dashes, dots, and parentheses. When possible, do this clean-up before saving the data.
—Check for bad values and consistency with other data you have stored. Try not to store the same data in multiple places unless you can guarantee it will always be updated together.
—Make sure there are ways to link data from multiple data sets and be particularly careful with the keys that link them. So, if you have collected name/address/e-mail in one batch of data and username/e-mail in another, then you can use e-mail to match the two together—but only if you store them in the same way and always update the e-mail entries in both places.
—When possible, structure your data so that you can pull it apart easily. A simple way is to break things into columns (or fields) when you store it, so that you don’t have to involve humans in decoding the format of each file.
—Finally, always keep a document that describes the format of your data so you can go back and understand what you collected. Things you might consider keeping track of include what you collected, from where, what each piece of info means, when it was collected, allowable values, and what each value means.
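Several of these tips can be sketched in a few lines of code. The example below canonicalizes e-mail addresses, phone numbers, and street types, then uses the cleaned e-mail as the key that links two batches of records. It assumes 10-digit US phone numbers and a tiny, made-up abbreviation table; the function names and record layouts are hypothetical.

```python
import re

# Illustrative abbreviation table; a real one would be much longer.
STREET_ABBREVIATIONS = {
    "avenue": "Ave", "ave": "Ave",
    "street": "St", "st": "St",
    "road": "Rd", "rd": "Rd",
}


def normalize_email(email):
    """Trim surrounding whitespace and lower-case the address."""
    return email.strip().lower()


def normalize_phone(phone):
    """Format a 10-digit US number as 555-123-4567; return None otherwise."""
    digits = re.sub(r"\D", "", phone)  # drop dots, dashes, spaces, parentheses
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # strip a leading country code
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


def normalize_street(address):
    """Replace street-type words with the single abbreviation chosen above."""
    words = address.strip().split()
    return " ".join(STREET_ABBREVIATIONS.get(w.lower().rstrip("."), w) for w in words)


def link_by_email(contacts, accounts):
    """Join two lists of dicts on the normalized e-mail; keep only matches."""
    accounts_by_email = {normalize_email(a["email"]): a for a in accounts}
    linked = []
    for contact in contacts:
        key = normalize_email(contact["email"])
        account = accounts_by_email.get(key)
        if account is not None:
            linked.append({**contact, **account, "email": key})
    return linked
```

Because both sides are normalized the same way before matching, a contact stored as `" Pat@Example.com "` still links to an account stored as `"pat@example.com"`. Applying the same functions at save time, rather than at match time, keeps the stored keys consistent in the first place.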
Once you have some data collected and are using it, you will want to make sure you “future-proof” it so that it retains value to your business. Consider doing some of the following:
—Keep past and present data in a consistent format. If you decide you need to change a format (e.g. replacing an “Age” column with a “Birth Year” column), go back and update the past data, if possible. If that is not possible, create a new field so that it is obvious something changed, rather than changing the meaning of an existing field. Document what was done and when so that past data can be integrated into future analysis more easily.
—Run test queries on a regular basis on the new data so that changes or bad data can be detected as early as possible.
—Back up your data regularly, both on-site and off-site. You want to be prepared for user error, disk failures, theft, and acts of nature. A backup stored near the original protects only against the first two. Test your backups periodically; they are of no use if they aren’t being done correctly.
—Worry about privacy and security from the beginning. Having to tell all your customers that their personal information was compromised is not good for business. Something as simple as a stolen laptop can be a reportable breach. Encryption and not storing unnecessary personal information are straightforward means of protecting your business. Using third-party vendors with advanced security for things like credit card transactions can off-load this worry.
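The regular test queries mentioned above could look something like the following sketch, which uses Python's built-in `sqlite3` module. The `customers` table, its columns, the specific checks, and the 1900 to 2015 birth-year range are all assumptions made up for illustration; the idea is simply that each query counts rows that violate an expectation.

```python
import sqlite3

# Hypothetical checks against an illustrative "customers" table.
# Each query returns the number of rows that break one expectation.
CHECKS = {
    "missing_email": "SELECT COUNT(*) FROM customers WHERE email IS NULL OR email = ''",
    "uppercase_email": "SELECT COUNT(*) FROM customers WHERE email <> LOWER(email)",
    "implausible_birth_year":
        "SELECT COUNT(*) FROM customers WHERE birth_year NOT BETWEEN 1900 AND 2015",
    "duplicate_email":
        "SELECT COUNT(*) FROM (SELECT email FROM customers "
        "GROUP BY email HAVING COUNT(*) > 1)",
}


def run_checks(conn):
    """Run every check and return {check_name: number_of_offending_rows_or_groups}."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}


def build_demo_db():
    """Create an in-memory database with a few deliberately bad rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (email TEXT, birth_year INTEGER)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [
            ("pat@example.com", 1980),
            ("Sam@Example.com", 1975),   # not lower case
            ("pat@example.com", 1880),   # duplicate e-mail, implausible year
            (None, 1990),                # missing e-mail
        ],
    )
    return conn


if __name__ == "__main__":
    for name, count in run_checks(build_demo_db()).items():
        print(f"{name}: {count}")
```

Run on a schedule against newly loaded data, any non-zero count is a prompt to look at the offending rows before they spread into later analysis.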
There are many more things you can (and will) worry about, but if you keep some of these things in mind at the start, you can be prepared to begin using your data much sooner.
Most of what I recommend above can be automated, and with little or no extra effort beyond setting things up correctly, you can be in a great place to start doing analytics as soon as possible.