Data Methods: A Beginner's Guide
When we think of "Data Science", I'd say pretty confidently that most people's first thought is something along the lines of "Machine Learning" or "Artificial Intelligence", which is perfectly reasonable. These are the exciting buzzwords of the time, with job titles even calling them out, such as "Machine Learning Engineer" or "Artificial Intelligence Engineer". Whilst these job roles all have slightly different areas of focus, it's easy to wonder what on earth is encompassed in Data Science if there are job titles calling out AI/ML specifically.
I'd say, let's take a step back...looking at the words themselves, a Data Scientist and Data Science are focused on two things: data and science. And so that means there is a whole lot that can be contained in those words. The way I think about it is: how can we use data when applying the scientific method to solve a problem? Which, even as I type it, sounds like a convoluted way to write "Data Science", but I guess that's the point!
There are, of course, Machine Learning methods that can be encompassed in Data Science, but there are also other aspects that shouldn't be forgotten about. Let's break down some of the methods for analysing and using data, all of which might be employed in Data Science...
Traditional Methods
Traditional methods for the 21st Century, just to clarify there: we're taking a step further than traditional statistics and incorporating some computing. What we're referring to as 'traditional methods' is also referred to as 'predictive analysis', i.e. what insights can we glean from what we already know? There are four main categories of predictive analysis...
Regression
The most common types are Linear and Logistic Regression. Linear Regression, you might have been introduced to at school without it ever being called that...it's a more structured way of drawing a line of "best fit" through your data points, assuming a linear relationship. You might be familiar with the equation "y=mx+c" - that's just what this is!
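To make that concrete, here's a minimal pure-Python sketch (the data points are made up purely for illustration) that fits the slope m and intercept c of "y=mx+c" by least squares:

```python
# A minimal least-squares fit of y = m*x + c (illustrative data only).

def fit_line(xs, ys):
    """Return slope m and intercept c minimising the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    m = cov / var
    c = mean_y - m * mean_x
    return m, c

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # roughly y = 2x, with a little noise
m, c = fit_line(xs, ys)
print(m, c)
```

In practice you'd usually let a library (e.g. scikit-learn's LinearRegression) do this for you, but the closed-form version shows there's no magic involved.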
Logistic Regression helps calculate the probability of something happening, where the outcome is either 0 or 1, i.e. binary. When plotted, a Logistic Regression curve looks like a sort of 'S' shape, with flats at 0 and 1 and a curve joining the two.
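That 'S' shape comes from the logistic (sigmoid) function, which squashes any input into the range 0 to 1. A tiny sketch:

```python
import math

# The logistic (sigmoid) function behind logistic regression's 'S' shape:
# large negative inputs approach 0, large positive inputs approach 1,
# and an input of 0 sits exactly in the middle at 0.5.

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

print(sigmoid(-6), sigmoid(0), sigmoid(6))
```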
There are lots of types of Regression, and if you're interested in implementing and understanding Data Science techniques in more detail, I'd definitely suggest checking them out!
Clustering
This is the fancy way of saying "grouping". You would typically apply clustering methods where there are clear groups within the observation data. If you were plotting a scatter graph of your data, you might see that there are different groups of data points that might sit a little apart.
There are different ways of clustering your data: a common one is "k-means", and there are also "hierarchical", "spectral", "density-based" and many more.
In some cases you might be able to see the optimal number of clusters in your data by eye, but there are also various ways to determine the optimal number of clusters, "k", for a given data set. The "elbow" method is a commonly used one, which refers to the shape of the graph when plotting out the results.
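As a rough illustration, here's a minimal k-means sketch on made-up 2D points with two obvious groups (a real project would more likely reach for a library implementation such as scikit-learn's KMeans):

```python
import random

# A minimal k-means sketch (illustrative data only): alternate between
# assigning points to their nearest centre and moving each centre to the
# mean of its assigned points.

def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centres = random.sample(points, k)  # initialise centres at random points
    for _ in range(iters):
        # Assignment step: give each point to its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda i: (p[0] - centres[i][0]) ** 2
                                    + (p[1] - centres[i][1]) ** 2)
            clusters[idx].append(p)
        # Update step: move each centre to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centres[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centres, clusters

# Two obvious groups: one near (0, 0), one near (10, 10).
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centres, clusters = kmeans(points, k=2)
print(sorted(centres))
```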
Factor Analysis
This method is also based on grouping your data, but instead of grouping your observations, you're grouping your explanatory variables to reduce dimensionality. When you have too many inputs into your model, especially when they are similar, it can add unnecessary complexity and instability. So you might want to group similar inputs together to reduce the number you are using.
This method might be particularly useful if, say, you have survey results where three questions ask about how much people like carrots. You might have a question about whether people think they taste nice cooked, another asking whether people like them raw and a third about whether people like to grow carrots. If what you're interested in is the general feeling towards carrots, you could group these questions together.
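As a toy illustration of that grouping idea (the responses below are invented, and the equal weighting is a crude stand-in for the loadings a real factor analysis, e.g. sklearn.decomposition.FactorAnalysis, would estimate from the data):

```python
# Three correlated carrot questions collapsed into one "carrot sentiment"
# score per respondent. Real factor analysis would learn the weights
# (loadings) rather than averaging the questions equally.

responses = [
    # (tasty cooked, tasty raw, likes growing them), each on a 1-5 scale
    (5, 4, 5),
    (2, 1, 2),
    (4, 5, 4),
]

carrot_factor = [sum(r) / len(r) for r in responses]
print(carrot_factor)
```

The model then takes one input, "carrot sentiment", instead of three near-duplicates.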
Time Series
This is where you plot a variable against the independent variable time. You don't have to "plot" it per se, but it's about how your variable changes over time. Time is always an independent variable as it doesn't rely on anything, so if you're plotting it on a graph...it'll always be on the x-axis.
The most common use of Time Series is for forecasting...can you predict what's going to come in the future based on patterns and trends from the past? Maybe this helps predict sales or the temperature across a month.
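As a minimal sketch of that idea, here's a naive moving-average forecast on made-up monthly sales figures (real forecasting would use richer models that capture trend and seasonality):

```python
# Predict the next value in a series as the mean of the last `window`
# observations. The sales numbers are invented for illustration.

def moving_average_forecast(series, window=3):
    return sum(series[-window:]) / window

sales = [100, 120, 110, 130, 125, 135]
print(moving_average_forecast(sales))
```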
Machine Learning Methods
These methods are employed when traditional methods can't do the job at hand. This might be where data needs to be classified into groups based on various fields, where groupings are not obvious, or where predictions need to be made.
ML models can be grouped into three broad groups, and there are typically four component parts to machine learning methods: Data --> Model --> Objective Function --> Optimisation Algorithm.
Supervised
The models are provided with labelled data, which means that you can assess the accuracy of predictions. You would have a training set of data and a validation/test set separated out. The idea is that you train your model using the labelled training set, using the validation set to check your model's performance and tweak as needed. The test set should be totally unseen until training/validation has been completed, to give a more objective view of the model's performance.
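As a sketch of that separation (the 60/20/20 proportions below are a common convention, not a rule):

```python
import random

# Shuffle the labelled examples, then slice off train / validation / test
# portions. The data here is just the numbers 0-9 for illustration.

def split(data, train=0.6, val=0.2, seed=0):
    data = data[:]  # copy so the caller's list is untouched
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train, n_val = int(n * train), int(n * val)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

examples = list(range(10))
train_set, val_set, test_set = split(examples)
print(len(train_set), len(val_set), len(test_set))
```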
The goal here is to minimise the loss, or objective, function. The higher the loss, the less likely the model is to make the desired predictions correctly.
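As a tiny illustration, mean squared error is one common loss function: the closer the predictions are to the known labels, the lower it is. The numbers below are made up for the example:

```python
# Mean squared error between a model's predictions and the true labels.

def mse(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# A good model (left) scores a much lower loss than a bad one (right).
print(mse([1.1, 1.9, 3.2], [1, 2, 3]), mse([3, 0, 5], [1, 2, 3]))
```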
Examples of supervised methods might include: Support Vector Machines (SVMs) and Neural Networks (often used for classification).
Unsupervised
With unsupervised models, the input data is not labelled...rather the model has to work out the groupings and understand the data on its own. That's the point. There is no label-based loss function here to be optimised, because the "labels" are not known to the model.
The goal of these models is to find the connections or groupings within the data. This can be more complex and time consuming, but also gives us humans an insight into possibilities that we may not have been able to get to ourselves. The computer doesn't "see" the data the way we see it, and therefore can pick up on trends and patterns that we do not.
Examples of unsupervised methods are: Apriori and K-means.
Reinforcement
I like to think of this method similarly to how I would train my puppy: these algorithms learn by getting a reward when they perform better than previous attempts. This allows the models to adapt and change based on this reward, figuring out what they did when the rewards were granted and refining their behaviour to earn more and more rewards each time.
Instead of minimising the objective function, which was the goal with supervised learning because there it is a "loss" function, the goal here is to maximise the objective function, as it is a "reward" function.
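As a toy illustration (all the parameters here are invented for the example), a tiny Q-learning agent on a five-cell corridor learns that moving right, towards the reward, scores better than moving left:

```python
import random

# A toy Q-learning sketch: the agent starts at cell 0 and earns a reward
# of 1 for reaching cell 4. Q-values are updated towards the reward plus
# the discounted best value of the next state.

random.seed(0)
n_states, actions = 5, [-1, +1]          # move left / move right
q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount, exploration

for episode in range(200):
    s = 0
    while s != n_states - 1:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: q[(s, act)])
        s2 = min(max(s + a, 0), n_states - 1)
        reward = 1.0 if s2 == n_states - 1 else 0.0
        q[(s, a)] += alpha * (reward + gamma * max(q[(s2, b)] for b in actions)
                              - q[(s, a)])
        s = s2

# After training, "right" should look better than "left" from the start.
print(q[(0, 1)], q[(0, -1)])
```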
Examples of reinforcement methods are: Q-learning or Deep Q-Networks. I find that this is the least common type of ML that I come across in terms of algorithms, but there are a lot of practical applications that are more familiar: Natural Language Processing (e.g. predictive text or text summarisation) or self-driving cars.
Often people might jump straight to wanting to use ML, but each of these methods has its place, and the key is to use the right tool for the job! Hopefully this gives you an overview of the main data methods, upon which you can base your further research and learning.