Overview of Machine Learning
To solve a problem on a computer, we need an algorithm. An algorithm is a sequence of instructions that should be carried out to transform the input to output. For example, one can devise an algorithm for sorting. The input is a set of numbers and the output is their ordered list. For the same task, there may be various algorithms and we may be interested in finding the most efficient one, requiring the least number of instructions or memory or both.
For some tasks, however, we do not have an algorithm—for example, to tell spam emails from legitimate emails. We know what the input is: an email document that in the simplest case is a file of characters. We know what the output should be: a yes/no output indicating whether the message is spam or not. We do not know how to transform the input to the output. What can be considered spam changes in time and from individual to individual. We can easily compile thousands of example messages, some of which we know to be spam, and what we want is to “learn” what constitutes spam from them. In other words, we would like the computer (machine) to extract automatically the algorithm for this task.
With advances in computer technology, we currently have the ability to store and process large amounts of data, as well as to access it from physically distant locations over a computer network.
Think, for example, of a supermarket chain that has hundreds of stores all over a country selling thousands of goods to millions of customers. The point of sale terminals record the details of each transaction: date, customer identification code, goods bought and their amount, total money spent, and so forth. This typically amounts to gigabytes of data every day. What the supermarket chain wants is to be able to predict who are the likely customers for a product. Again, the algorithm for this is not evident; it changes in time and by geographic location.
The stored data becomes useful only when it is analyzed and turned into information that we can make use of, for example, to make predictions. We do not know exactly which people are likely to buy this ice cream flavor, or the next book of this author, or see this new movie, or visit this city, or click this link. If we knew, we would not need any analysis of the data; we would just go ahead and write down the code. But because we do not, we can only collect data and hope to extract the answers to these and similar questions from data.
Consider consumer behavior, for example—we know that it is not completely random. People do not go to supermarkets and buy things at random. When they buy beer, they buy chips; they buy ice cream in summer and spices for Glühwein in winter. There are certain patterns in the data. We believe that though identifying the complete process may not be possible, we can still detect certain patterns or regularities. This is the niche of machine learning. Such patterns may help us understand the process, or we can use those patterns to make predictions: assuming that the future, at least the near future, will not be much different from the past when the sample data was collected, the future predictions can also be expected to be right.
The application of machine learning methods to large databases is called data mining. In data mining, a large volume of data is processed to construct a simple model with valuable use. Its application areas are abundant.
In finance, banks analyze their past data to build models for use in credit applications, fraud detection, and the stock market.
In manufacturing, learning models are used for optimization, control, and troubleshooting.
In medicine, learning models are used for medical diagnosis.
But machine learning is not just a database problem; it is also a part of artificial intelligence. To be intelligent, a system that is in a changing environment should have the ability to learn. If the system can learn and adapt to such changes, the system designer need not foresee and provide solutions for all possible situations.
Machine learning also helps us find solutions to many problems in vision, speech recognition, and robotics. Let us take the example of recognizing faces: This is a task we do effortlessly; every day we recognize family members and friends by looking at their faces or from their photographs, despite differences in pose, lighting, hair style, and so forth. But we do it unconsciously and are unable to explain how we do it. Because we are not able to explain our expertise, we cannot write the computer program. At the same time, we know that a face image is not just a random collection of pixels; a face has structure. It is symmetric. There are the eyes, the nose, the mouth, located in certain places on the face. Each person’s face is a pattern composed of a particular combination of these. By analyzing sample face images of a person, a learning program captures the pattern specific to that person and then recognizes by checking for this pattern in a given image. This is one example of pattern recognition.
Machine learning is programming computers to optimize a performance criterion using example data or past experience. We have a model defined up to some parameters, and learning is the execution of a computer program to optimize the parameters of the model using the training data or past experience. The model may be predictive to make predictions in the future, or descriptive to gain knowledge from data, or both.
Machine learning uses the theory of statistics in building mathematical models, because the core task is making inference from a sample. The role of computer science is twofold: first, in training we need efficient algorithms to solve the optimization problem, as well as to store and process the massive amounts of data we generally have. Second, once a model is learned, its representation and algorithmic solution for inference need to be efficient as well. In certain applications, the efficiency of the learning or inference algorithm, namely, its space and time complexity, may be as important as its predictive accuracy.
Machine Learning is undeniably one of the most influential and powerful technologies in today’s world. Machine learning is a tool for turning information into knowledge. In the past 50 years, there has been an explosion of data. This mass of data is useless unless we analyse it and find the patterns hidden within. Machine learning techniques are used to automatically find the valuable underlying patterns within complex data that we would otherwise struggle to discover. The hidden patterns and knowledge about a problem can be used to predict future events and perform all kinds of complex decision making.
Most of us are unaware that we already interact with Machine Learning every single day. Every time we Google something, listen to a song or even take a photo, Machine Learning is becoming part of the engine behind it, constantly learning and improving from every interaction. It’s also behind world-changing advances like detecting cancer, creating new drugs and self-driving cars.
Traditionally, software engineering combined human created rules with data to create answers to a problem. Instead, machine learning uses data and answers to discover the rules behind a problem. (Chollet, 2017).
To learn the rules governing a phenomenon, machines have to go through a learning process, trying different rules and learning from how well they perform. Hence why it is known as Machine Learning.
There are multiple forms of Machine Learning: supervised, unsupervised, semi-supervised, and reinforcement learning. Each form of Machine Learning has differing approaches and applications, but they all follow the same underlying process and theory.
Terminology
Dataset: A set of data examples represented with features important to solving the problem. The dataset is used to train an algorithm with the goal of finding predictable patterns inside the whole dataset.
Features: Each feature, or column, of a dataset represents a measurable piece of data that can be used for analysis: name, age, sex, and so on. Features are also sometimes referred to as “variables” or “attributes.” Depending on what you're trying to analyze, the features you include in your dataset can vary widely.
Model: The representation (internal model) of a phenomenon that a Machine Learning algorithm has learnt. It learns this from the data it is shown during training. The model is the output you get after training an algorithm. For example, a decision tree algorithm would be trained and produce a decision tree model.
Process
The following is the five-step process in machine learning; a minimal code sketch follows the list.
Data Collection: Collect the data that the algorithm will learn from.
Data Preparation: Format and engineer the data into the optimal format, extracting important features and performing dimensionality reduction.
Training: Also known as the fitting stage, this is where the Machine Learning algorithm actually learns by showing it the data that has been collected and prepared.
Evaluation: Test the model to see how well it performs.
Tuning: Fine-tune the model to maximize its performance.
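As a concrete illustration, here is a minimal sketch of the five steps in Python, assuming scikit-learn is installed; the bundled iris dataset and the decision tree model are stand-ins for whatever data and algorithm a real project would use.

```python
# A minimal sketch of the five-step process, assuming scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# 1. Data collection: here we just load a bundled example dataset.
X, y = load_iris(return_X_y=True)

# 2. Data preparation: split off a test set and scale the features.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))

# 3. Training: fit the model on the prepared training data.
model.fit(X_train, y_train)

# 4. Evaluation: measure accuracy on data the model has never seen.
print("test accuracy:", model.score(X_test, y_test))

# 5. Tuning: search over hyperparameters to maximize performance.
grid = GridSearchCV(model, {"decisiontreeclassifier__max_depth": [2, 3, 5, None]})
grid.fit(X_train, y_train)
print("best depth:", grid.best_params_)
```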
Machine Learning Approaches
There are many approaches that can be taken when conducting Machine Learning. They are usually grouped into the areas listed below. Supervised and unsupervised learning are well-established approaches and the most commonly used. Semi-supervised and reinforcement learning are newer and more complex but have shown impressive results.
1. Supervised Learning
Supervised learning is where you have input variables (x) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output.
$Y = f(X)$
The goal is to approximate the mapping function so well that when you have new input data (x) that you can predict the output variables (Y) for that data.
It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. We know the correct answers; the algorithm iteratively makes predictions on the training data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable level of performance.
For example, the inputs could be the weather forecast, and the outputs would be the visitors to the beach. The goal in supervised learning would be to learn the mapping that describes the relationship between temperature and number of beach visitors.
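To make the mapping idea concrete, here is a minimal sketch in Python, assuming scikit-learn; the temperature and visitor numbers are invented for illustration.

```python
# Learning Y = f(X) for the beach example: a minimal sketch.
from sklearn.linear_model import LinearRegression

temps = [[15], [18], [21], [24], [27], [30]]   # input X: forecast temperature (°C)
visitors = [90, 150, 210, 280, 360, 450]       # output Y: observed beach visitors

f = LinearRegression().fit(temps, visitors)    # learn the mapping f
print(f.predict([[25]]))                       # predict visitors for a new forecast
```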
Being able to adapt to new inputs and make predictions is the crucial generalization part of machine learning. In training, we want to maximize generalization, so that the supervised model captures the real, ‘general’ underlying relationship. If the model is over-trained, we cause over-fitting to the examples used, and the model will be unable to adapt to new, previously unseen inputs.
A side effect to be aware of in supervised learning is that the supervision we provide introduces bias to the learning. The model can only imitate exactly what it was shown, so it is very important to show it reliable, unbiased examples. Also, supervised learning usually requires a lot of data before it learns. Obtaining enough reliably labelled data is often the hardest and most expensive part of using supervised learning. (Hence why data has been called the new oil!)
The output from a supervised Machine Learning model could be a category from a finite set, e.g. [low, medium, high] for the number of visitors to the beach. When this is the case, the model is deciding how to classify the input, and so this is known as classification.
Input [temperature=20] -> Model -> Output = [visitors=high]
Alternatively, the output could be a real-world scalar (a number). When this is the case, it is known as regression.
Input [temperature=20] -> Model -> Output = [visitors=300]
Classification
Classification is used to group similar data points into different sections in order to classify them. Machine Learning is used to find the rules that explain how to separate the different data points.
But how are the magical rules created? Well, there are multiple ways to discover the rules. They all focus on using data and answers to discover rules that linearly separate data points.
Linear separability is a key concept in machine learning. All that linear separability means is ‘can the different data points be separated by a line?’. So put simply, classification approaches try to find the best way to separate data points with a line.
The lines drawn between classes are known as the decision boundaries. The entire area that is chosen to define a class is known as the decision surface. The decision surface defines that if a data point falls within its boundaries, it will be assigned a certain class.
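As an illustration of a linear decision boundary, here is a small sketch assuming scikit-learn and NumPy; the two synthetic point clouds stand in for two classes.

```python
# A linear decision boundary learned from two synthetic classes.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
class_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
class_b = rng.normal(loc=[3, 3], scale=0.5, size=(50, 2))
X = np.vstack([class_a, class_b])
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression().fit(X, y)

# The decision boundary is the line w1*x1 + w2*x2 + b = 0.
(w1, w2), b = clf.coef_[0], clf.intercept_[0]
print(f"boundary: {w1:.2f}*x1 + {w2:.2f}*x2 + {b:.2f} = 0")
print(clf.predict([[0, 0], [3, 3]]))  # each side of the line gets its own class
```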
For example, it is important for a bank to be able to predict in advance the risk associated with a loan, which is the probability that the customer will default and not pay the whole amount back. In credit scoring (Hand 1998), the bank calculates the risk given the amount of credit and the information about the customer. This is an example of a classification problem where there are two classes: low-risk and high-risk customers. The information about a customer makes up the input to the classifier, whose task is to assign the input to one of the two classes.
After training with the past data, a classification rule learned may be of the form
IF income > $\theta_1$ AND savings > $\theta_2$ THEN low-risk ELSE high-risk
for suitable values of $\theta_1$ and $\theta_2$. This is an example of a discriminant; it is a function that separates the examples of different classes.
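The rule above can be written directly as a tiny function; the threshold values used here are hypothetical, since in practice $\theta_1$ and $\theta_2$ are learned from the training data.

```python
# The IF-THEN discriminant above, written as a function.
# The thresholds theta1 and theta2 are hypothetical placeholders.
def classify(income, savings, theta1=30_000, theta2=10_000):
    if income > theta1 and savings > theta2:
        return "low-risk"
    return "high-risk"

print(classify(income=45_000, savings=20_000))  # low-risk
print(classify(income=25_000, savings=5_000))   # high-risk
```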
In some cases, instead of making a 0/1 (low-risk/high-risk) type decision, we may want to calculate a probability, namely, $P(Y|X)$, where $X$ are the customer attributes and $Y$ is 0 or 1 respectively for low-risk and high-risk. From this perspective, we can see classification as learning an association from $X$ to $Y$. Then for a given $X = x$, if we have $P(Y = 1|X = x) = 0.8$, we say that the customer has an 80 percent probability of being high-risk, or equivalently a 20 percent probability of being low-risk. We then decide whether to accept or refuse the loan depending on the possible gain and loss.
There are several application areas.
One is optical character recognition, which is recognizing character codes from their images. This is an example where there are multiple classes, as many as there are characters we would like to recognize.
In the case of face recognition, the input is an image, the classes are people to be recognized, and the learning program should learn to associate the face images to identities. This problem is more difficult than optical character recognition because there are more classes, the input image is larger, and a face is three-dimensional, so differences in pose and lighting cause significant changes in the image.
In medical diagnosis, the inputs are the relevant information we have about the patient and the classes are the illnesses.
In speech recognition, the input is acoustic and the classes are words that can be uttered.
Natural language processing is used in sentiment analysis and spam filtering.
Biometric systems use biometric features for acceptance or rejection.
Another use of machine learning is outlier detection, which is finding the instances that do not obey the rule and are exceptions. In this case, after learning the rule, we are not interested in the rule but in the exceptions not covered by it, which may imply anomalies requiring attention, for example, fraud.
Regression
Regression is another form of supervised learning. The difference between classification and regression is that regression outputs a number rather than a class. Therefore, regression is useful when predicting number-based problems like stock market prices, the temperature for a given day, or the probability of an event.
Let us say we want to have a system that can predict the price of a used car. Inputs are the car attributes—brand, year, engine capacity, mileage, and other information—that we believe affect a car’s worth. The output is the price of the car. Such problems where the output is a number are regression problems.
Let X denote the car attributes and Y be the price of the car. Again surveying the past transactions, we can collect training data, and the machine learning program fits a function to this data to learn Y as a function of X. An example fitted function is of the form
$y=wx+w_0$
for suitable values of $w$ and $w_0$.
Both regression and classification are supervised learning problems where there is an input, X, an output, Y, and the task is to learn the mapping from the input to the output. The approach in machine learning is that we assume a model defined up to a set of parameters:
$y = g(x|\theta)$
where $g(\cdot)$ is the model and $\theta$ are its parameters. $Y$ is a number in regression and is a class code (e.g., 0/1) in the case of classification. $g(\cdot)$ is the regression function, or in classification, it is the discriminant function separating the instances of different classes. The machine learning program optimizes the parameters, $\theta$, such that the approximation error is minimized, that is, our estimates are as close as possible to the correct values given in the training set. For example, in the linear model above, $w$ and $w_0$ are the parameters optimized for best fit to the training data. In cases where the linear model is too restrictive, one can use, for example, a quadratic
$y = w_2x^2 + w_1x + w_0$
or a higher-order polynomial, or any other nonlinear function of the input, this time optimizing its parameters for best fit.
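As a sketch of how such parameters are optimized for best fit, the following assumes NumPy and uses invented mileage/price pairs; `np.polyfit` performs the least-squares fit for both the linear and the quadratic model.

```python
# Least-squares fitting of the linear and quadratic models above.
import numpy as np

x = np.array([20, 40, 60, 80, 100, 120])          # e.g. mileage (thousand km)
y = np.array([18.0, 14.5, 12.0, 10.2, 9.0, 8.3])  # e.g. price (thousand $)

w1, w0 = np.polyfit(x, y, deg=1)          # linear model:    y = w1*x + w0
w2q, w1q, w0q = np.polyfit(x, y, deg=2)   # quadratic model: y = w2*x^2 + w1*x + w0

# Compare how closely each model reproduces the training prices.
for deg in (1, 2):
    pred = np.polyval(np.polyfit(x, y, deg), x)
    print(f"degree {deg}: mean squared error = {np.mean((pred - y) ** 2):.3f}")
```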
Examples
Regression is used in financial trading to find the patterns in stocks and other assets to decide when to buy/sell and make a profit. For classification, it is already being used to classify if an email you receive is spam.
Both the classification and regression supervised learning techniques can be extended to much more complex tasks. For example, tasks involving speech and audio. Image classification, object detection and chat bots are some examples.
Some common types of problems built on top of classification and regression include recommendation and time series prediction respectively.
Some popular examples of supervised machine learning algorithms are:
Linear regression for regression problems.
Random forest for classification and regression problems.
Support vector machines for classification problems.
2. Unsupervised Learning
Unsupervised learning is where you only have input data (X) and no corresponding output variables.
The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.
It is called unsupervised learning because, unlike supervised learning above, there are no correct answers and there is no teacher. Algorithms are left to their own devices to discover and present the interesting structure in the data.
An example of unsupervised learning in real life would be sorting different colour coins into separate piles. Nobody taught you how to separate them, but by just looking at their features such as colour, you can see which colour coins are associated and cluster them into their correct groups.
Unsupervised learning can be harder than supervised learning, as the removal of supervision means the problem has become less defined. The algorithm has a less focused idea of what patterns to look for.
Think of it in your own learning. If you learnt to play the guitar by being supervised by a teacher, you would learn quickly by re-using the supervised knowledge of notes, chords and rhythms. But if you only taught yourself, you’d find it so much harder knowing where to start.
However, you would start from a clean slate with less bias and might even find a new, better way to solve the problem. This is why unsupervised learning is also known as knowledge discovery. Unsupervised learning is very useful when conducting exploratory data analysis.
Unsupervised learning problems can be further grouped into clustering and association problems.
Clustering
In supervised learning, the aim is to learn a mapping from the input to an output whose correct values are provided by a supervisor. In unsupervised learning, there is no such supervisor and we only have input data. The aim is to find the regularities in the input. There is a structure to the input space such that certain patterns occur more often than others, and we want to see what generally happens and what does not. In statistics, this is called density estimation.
One method for density estimation is clustering, where the aim is to find clusters or groupings of the input. In the case of a company with data on its past customers, the customer data contains the demographic information as well as the past transactions with the company, and the company may want to see the distribution of the profile of its customers, to see what type of customers frequently occur. In such a case, a clustering model allocates customers similar in their attributes to the same group, providing the company with natural groupings of its customers; this is called customer segmentation. Once such groups are found, the company may decide on strategies, for example, services and products, specific to different groups; this is known as customer relationship management. Such a grouping also allows identifying those who are outliers, namely, those who are different from other customers, which may imply a niche in the market that can be further exploited by the company.
An interesting application of clustering is in image compression. In this case, the input instances are image pixels represented as RGB values. A clustering program groups pixels with similar colors in the same group, and such groups correspond to the colors occurring frequently in the image. If an image contains only shades of a small number of colors, and if we code those belonging to the same group with one color, for example, their average, then the image is quantized. Let us say each pixel is represented with 24 bits, enough for 16 million colors; if there are shades of only 64 main colors, for each pixel we need 6 bits instead of 24. For example, if the scene has various shades of blue in different parts of the image, and if we use the same average blue for all of them, we lose the details in the image but gain space in storage and transmission. Ideally, one would like to identify higher-level regularities by analyzing repeated image patterns, for example, texture, objects, and so forth. This allows a higher-level, simpler, and more useful description of the scene, and, for example, achieves better compression than compressing at the pixel level.
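A minimal sketch of this color quantization, assuming scikit-learn and NumPy; the random array stands in for a real photograph.

```python
# Color quantization by clustering pixels, as described above.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3))   # stand-in for a real photo
pixels = image.reshape(-1, 3).astype(float)      # one RGB row per pixel

kmeans = KMeans(n_clusters=64, n_init=10, random_state=0).fit(pixels)

# Replace every pixel with the average color of its cluster: 64 colors
# means each pixel index now fits in 6 bits instead of 24.
quantized = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)
```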
In document clustering, the aim is to group similar documents. For example, news reports can be subdivided as those related to politics, sports, fashion, arts, and so on.
Machine learning methods are also used in bioinformatics. DNA in our genome is the “blueprint of life” and is a sequence of bases, namely A, G, C, and T. Clustering is used in learning motifs, which are sequences of amino acids that occur repeatedly in proteins.
Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y. For example, if a person watches video A, they will likely watch video B. Association rules are perfect for examples such as this, where you want to find related items.
In the case of retail—for example, a supermarket chain—one application of machine learning is basket analysis, which is finding associations between products bought by customers: If people who buy X typically also buy Y, and if there is a customer who buys X and does not buy Y, he or she is a potential Y customer. Once we find such customers, we can target them for cross-selling.
In finding an association rule, we are interested in learning a conditional probability of the form $P(Y|X)$, where $Y$ is the product we would like to condition on $X$, which is the product or the set of products we know the customer has already purchased.
Let us say, going over our data, we calculate that $P(\text{chips}|\text{beer}) = 0.7$. Then we can define the rule: 70 percent of customers who buy beer also buy chips.
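Such a conditional probability can be estimated by simple counting, as in this small sketch with made-up baskets.

```python
# Estimating P(chips | beer) by counting over a few invented baskets.
baskets = [
    {"beer", "chips"}, {"beer", "chips", "salsa"}, {"beer"},
    {"beer", "chips"}, {"milk", "bread"}, {"beer", "chips", "milk"},
]

with_beer = [b for b in baskets if "beer" in b]
p = sum("chips" in b for b in with_beer) / len(with_beer)
print(f"P(chips | beer) = {p:.2f}")  # 4 of 5 beer baskets contain chips: 0.80
```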
We may want to make a distinction among customers and, toward this, estimate $P(Y|X,D)$ where $D$ is the set of customer attributes, for example, gender, age, marital status, and so on, assuming that we have access to this information. If this is a bookseller instead of a supermarket, products can be books or authors. In the case of a Web portal, items correspond to links to Web pages, and we can estimate the links a user is likely to click and use this information to download such pages in advance for faster access.
Anomaly Detection: The identification of rare or unusual items that differ from the majority of the data. For example, your bank will use this to detect fraudulent activity on your card. Your normal spending habits fall within a normal range of behaviors and values, but when someone tries to steal from you using your card, the behavior will be different from your normal pattern. Anomaly detection uses unsupervised learning to separate and detect these strange occurrences.
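As a sketch of this idea, scikit-learn's IsolationForest can flag unusual points without any labels; the transaction amounts below are invented, with one deliberately unusual payment.

```python
# Unsupervised anomaly detection on invented card transactions.
from sklearn.ensemble import IsolationForest

amounts = [[25], [30], [22], [28], [26], [31], [24], [950]]  # card transactions
detector = IsolationForest(random_state=0).fit(amounts)
print(detector.predict(amounts))  # 1 = normal, -1 = flagged as an anomaly
```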
Examples
In the real world, clustering has been used successfully in marketing, where it is regularly used to cluster customers into similar groups based on their behaviors and characteristics.
Association learning is used for recommending or finding related items. A common example is market basket analysis. In market basket analysis, association rules are found to predict other items a customer is likely to buy based on what they have placed in their basket. Amazon use this. If you place a new laptop in your basket, they recommend items like a laptop case via their association rules.
Anomaly detection is well suited to scenarios such as fraud detection and malware detection.
Some popular examples of unsupervised learning algorithms are:
k-means for clustering problems.
Apriori algorithm for association rule learning problems.
3. Semi-Supervised Machine Learning
Problems where you have a large amount of input data (X) and only some of the data is labeled (Y) are called semi-supervised learning problems.
These problems sit in between both supervised and unsupervised learning.
A good example is a photo archive where only some of the images are labeled (e.g. dog, cat, person) and the majority are unlabeled.
Many real world machine learning problems fall into this area. This is because it can be expensive or time-consuming to label data, as it may require access to domain experts, whereas unlabeled data is cheap and easy to collect and store.
You can use unsupervised learning techniques to discover and learn the structure in the input variables.
You can also use supervised learning techniques to make best guess predictions for the unlabeled data, feed that data back into the supervised learning algorithm as training data and use the model to make predictions on new unseen data.
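A minimal sketch of this pseudo-labeling loop, assuming scikit-learn, whose SelfTrainingClassifier implements exactly this strategy; hiding 90 percent of the digit labels simulates a mostly unlabeled dataset.

```python
# Self-training: learn from few labels, pseudo-label the rest, retrain.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_digits(return_X_y=True)
y_partial = y.copy()
rng = np.random.default_rng(0)
unlabeled = rng.random(len(y)) < 0.9   # pretend 90% of labels are missing
y_partial[unlabeled] = -1              # -1 marks unlabeled points in sklearn

# Train on the few labels, pseudo-label confident unlabeled points, repeat.
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
model.fit(X, y_partial)
print("accuracy on the true labels:", model.score(X, y))
```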
Examples
A perfect example is in medical scans, such as breast cancer scans. A trained expert is needed to label these scans, which is time-consuming and very expensive. Instead, an expert can label just a small set of breast cancer scans, and the semi-supervised algorithm would be able to leverage this small subset and apply it to a larger set of scans.
Generative Adversarial Networks
Generative Adversarial Networks (GANs) have been a recent breakthrough with incredible results. GANs use two neural networks, a generator and discriminator. The generator generates output and the discriminator critiques it. By battling against each other they both become increasingly skilled.
By using one network to generate outputs and another to evaluate them, there is no need for us to provide explicit labels every single time, and so GANs can be classed as semi-supervised.
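A compact sketch of the adversarial training loop described above, assuming PyTorch; the one-dimensional "real" data and the tiny networks are purely illustrative.

```python
# A minimal GAN training loop on 1-D data, assuming PyTorch.
import torch
import torch.nn as nn

# Generator maps random noise to a 1-D sample; Discriminator scores realness.
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) * 0.5 + 3.0   # "real" data: samples near 3.0
    fake = G(torch.randn(64, 8))

    # Discriminator tries to label real as 1 and fake as 0.
    d_loss = (bce(D(real), torch.ones(64, 1))
              + bce(D(fake.detach()), torch.zeros(64, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator tries to make the discriminator label fakes as 1.
    g_loss = bce(D(G(torch.randn(64, 8))), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```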
4. Reinforcement Learning
The final type of machine learning is reinforcement learning. It is less common and much more complex, but it has generated incredible results. It doesn’t use labels as such, and instead uses rewards to learn.
If you’re familiar with psychology, you’ll have heard of reinforcement learning. If not, you’ll already know the concept from how we learn in everyday life. In this approach, occasional positive and negative feedback is used to reinforce behaviors. Think of it like training a dog: good behaviors are rewarded with a treat and become more common; bad behaviors are punished and become less common. This reward-motivated behavior is key in reinforcement learning.
This is very similar to how we as humans also learn. Throughout our lives, we receive positive and negative signals and constantly learn from them. The chemicals in our brain are one of many ways we get these signals. When something good happens, the neurons in our brains provide a hit of positive neurotransmitters such as dopamine which makes us feel good and we become more likely to repeat that specific action. We don’t need constant supervision to learn like in supervised learning. By only giving the occasional reinforcement signals, we still learn very effectively.
One of the most exciting parts of Reinforcement Learning is that it is a first step away from training on static datasets, instead being able to use dynamic, noisy, data-rich environments. This brings Machine Learning closer to the learning style used by humans. The world is simply our noisy, complex, data-rich environment.
In some applications, the output of the system is a sequence of actions. In such a case, a single action is not important; what is important is the policy, that is, the sequence of correct actions to reach the goal. There is no such thing as the best action in any intermediate state; an action is good if it is part of a good policy. In such a case, the machine learning program should be able to assess the goodness of policies and learn from past good action sequences to be able to generate a policy. Such learning methods are called reinforcement learning algorithms.
A good example is game playing, where a single move by itself is not that important; it is the sequence of right moves that is good. A move is good if it is part of a good game playing policy. Game playing is an important research area in both artificial intelligence and machine learning. This is because games are easy to describe and, at the same time, they are quite difficult to play well. A game like chess has a small number of rules, but it is very complex because of the large number of possible moves at each state and the large number of moves that a game contains. Once we have good algorithms that can learn to play games well, we can also apply them to applications with more evident economic utility.
Games are very popular in Reinforcement Learning research. They provide ideal data-rich environments. The scores in games are ideal reward signals to train reward-motivated behaviours. Additionally, time can be sped up in a simulated game environment to reduce overall training time.
A Reinforcement Learning algorithm just aims to maximise its rewards by playing the game over and over again. If you can frame a problem with a frequent ‘score’ as a reward, it is likely to be suited to Reinforcement Learning. Google DeepMind have used reinforcement learning in research to play Go and Atari games at superhuman levels.
A robot navigating in an environment in search of a goal location is another application area of reinforcement learning. At any time, the robot can move in one of a number of directions. After a number of trial runs, it should learn the correct sequence of actions to reach the goal state from an initial state, doing this as quickly as possible and without hitting any of the obstacles.
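A minimal sketch of this kind of trial-and-error learning: tabular Q-learning on a tiny invented grid, where the agent starts in one corner and must discover the action sequence that reaches the goal corner. All of the environment details here are illustrative.

```python
# Tabular Q-learning on a tiny 4x4 grid with the goal in one corner.
import random

SIZE, GOAL = 4, (3, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
Q = {((r, c), a): 0.0 for r in range(SIZE) for c in range(SIZE)
     for a in range(len(ACTIONS))}
alpha, gamma, eps = 0.5, 0.9, 0.2

def step(state, a):
    dr, dc = ACTIONS[a]
    nr = min(max(state[0] + dr, 0), SIZE - 1)   # walls: stay inside the grid
    nc = min(max(state[1] + dc, 0), SIZE - 1)
    nxt = (nr, nc)
    return nxt, (1.0 if nxt == GOAL else -0.01)  # reward only at the goal

for episode in range(500):
    s = (0, 0)
    while s != GOAL:
        a = (random.randrange(4) if random.random() < eps
             else max(range(4), key=lambda b: Q[(s, b)]))
        nxt, r = step(s, a)
        best_next = max(Q[(nxt, b)] for b in range(4))
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # Q-update
        s = nxt

# After training, following the argmax-Q action in each state traces
# a short path from the start corner to the goal corner.
```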
One factor that makes reinforcement learning harder is when the system has unreliable and partial sensory information. For example, a robot equipped with a video camera has incomplete information and thus at any time is in a partially observable state and should decide taking into account this uncertainty; for example, it may not know its exact location in a room but only that there is a wall to its left. A task may also require a concurrent operation of multiple agents that should interact and cooperate to accomplish a common goal. An example is a team of robots playing soccer.
Comparison of Supervised and Unsupervised Learning
Supervised Learning:
a) Trained using labelled data.
b) Input and output are given for training.
c) Predicts the output.
d) Covers regression and classification.
e) Produces an accurate result.
f) Algorithms include Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision Tree, Bayesian Logic, etc.

Unsupervised Learning:
a) Trained using unlabeled data.
b) Only input is given for training.
c) Finds the hidden patterns in data.
d) Covers clustering and association.
e) May give less accurate results compared to supervised learning.
f) Algorithms include k-means clustering and the Apriori algorithm.