Fundamental Algorithms for Data Science (Part I)

Mavis LohUncategorized

By now, you should be aware that Data Science drives almost every product, decision and processes in our world today. With that, interest in picking up data science, programming and machine learning has skyrocketed over the years, and we don’t foresee it slowing down anytime soon. However, we understand that it can get a little tough to break into the programming industry. That’s why we’re here to share about some of the fundamental learning algorithms for beginners.

Linear Regression

Linear Regression is used to predict the outcome of a given sample when the output variable is in the form of real values. For example, a regression model might process input data to predict the amount of rainfall, the height of a person, etc. The output, typically a continuous variable, can be written as a weighted linear combination of features. This is where we bring in the knowledge you’ve learnt back in school – a linear equation or better known as y=mx+c, where m is the gradient and c being the y-intercept.

Logistic Regression

Logistic Regression on the other hand is used to estimate discrete values, usually binary values like 0 or 1 and True or False, from a set of independent variables. It helps data scientists predict the probability of an event by fitting data they have into function.

Clustering

Simply explained, a cluster analysis is the task of grouping large amounts of datasets into homogeneous groups called clusters. Each of these clusters have to also be as distinct as possible from one another. Here we have the simplest form of clustering, also known as the k-means clustering. It is not only the simplest but to a certain extent, a rather inaccurate method as it splits the dataset into the chosen number of clusters (k). This algorithm aims to reduce the standard deviation at the points of each cluster. The basic idea is that at each iteration the center of mass is recalculated for each cluster obtained in the previous step, then the vectors are divided into clusters again according to which of the new centers was closer in the selected metric. The algorithm stops when no further cluster changes at any iteration.

Decision Trees

It is one of the most popular machine learning algorithms in use today. This supervised learning algorithm is used to and works extremely well classifying both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets based on the most significant attributes/independent variables.

We’ll end off here, so stay tuned to Fundamental Algorithms for Data Science (Part II) for more key algorithm every newbie in the Data Science field got to know!