Similarity Measures
In data science, a similarity measure is a way of quantifying how related or close two data samples are to each other, while a dissimilarity measure tells how distinct two data objects are. These terms are often used in clustering, where similar data samples are grouped into one cluster and all other data samples are grouped into different ones. They are also used in classification (e.g., KNN), where data objects are labeled based on the similarity of their features, and in anomaly detection, where outliers are identified as samples that are dissimilar to the rest of the data.
In data mining applications, such as clustering, outlier analysis, and nearest-neighbor classification, we need ways to assess how alike or unalike objects are in comparison to one another. For example, a store may want to search for clusters of customer objects, resulting in groups of customers with similar characteristics (e.g., similar income, area of residence, and age). Such information can then be used for marketing. A cluster is a collection of data objects such that the objects within a cluster are similar to one another and dissimilar to the objects in other clusters. Outlier analysis also employs clustering-based techniques to identify potential outliers as objects that are highly dissimilar to others. Knowledge of object similarities can also be used in nearest-neighbor classification schemes where a given object (e.g., a patient) is assigned a class label (relating to, say, a diagnosis) based on its similarity toward other objects in the model.
This section presents similarity and dissimilarity measures, which are referred to as measures of proximity. Similarity and dissimilarity are related. A similarity measure for two objects, $i$ and $j$, will typically return the value 0 if the objects are unalike. The higher the similarity value, the greater the similarity between objects. (Typically, a value of 1 indicates complete similarity, that is, the objects are identical.) A dissimilarity measure works the opposite way. It returns a value of 0 if the objects are the same (and therefore, far from being dissimilar). The higher the dissimilarity value, the more dissimilar the two objects are.
The similarity measure is usually expressed as a numerical value: it gets higher when the data samples are more alike. By convention, it is often expressed as a number between zero and one: zero means low similarity (the data objects are dissimilar), and one means high similarity (the data objects are very similar).
Let’s take an example where each data point contains only one input feature. This can be considered the simplest example to show the dissimilarity between three data points A, B, and C. Each data sample has a single value on one axis (because we only have one input feature); let’s denote that as the x-axis. Consider three points, A(0.5), B(1), and C(30). As you can tell, A and B are close to each other, in contrast to C. Thus, the similarity between A and B is higher than between A and C or between B and C. In other words, A and B are far more alike. Therefore, the smaller the distance is, the larger the similarity will be.
Below, we present two data structures that are commonly used in the above types of applications: the data matrix (used to store the data objects) and the dissimilarity matrix (used to store dissimilarity values for pairs of objects).
Data Matrix versus Dissimilarity Matrix
Suppose that we have $n$ objects (e.g., persons, items, or courses) described by $p$ attributes (also called measurements or features, such as age, height, weight, or gender). The objects are $x_1=(x_{11},x_{12},\ldots,x_{1p})$, $x_2=(x_{21},x_{22},\ldots,x_{2p})$, and so on, where $x_{ij}$ is the value of the $j^{th}$ attribute for object $x_i$. For brevity, we hereafter refer to object $x_i$ as object $i$. The objects may be tuples in a relational database, and are also referred to as data samples or feature vectors. Main memory-based clustering and nearest-neighbor algorithms typically operate on either of the following two data structures:
Data matrix (or object-by-attribute structure): this structure stores the $n$ data objects in the form of a relational table, or $n \times p$ matrix ($n$ objects $\times$ $p$ attributes). Each row corresponds to an object. As part of our notation, we may use $f$ to index through the $p$ attributes.
Dissimilarity matrix (or object-by-object structure): this structure stores a collection of proximities that are available for all pairs of the $n$ objects. It is often represented by an $n \times n$ table whose entry in row $i$ and column $j$ is $d(i, j)$,
where $d(i, j)$ is the measured dissimilarity or “difference” between objects $i$ and $j$. In general, $d(i, j)$ is a non-negative number that is close to 0 when objects $i$ and $j$ are highly similar or “near” each other, and becomes larger the more they differ. Note that $d(i, i)=0$; that is, the difference between an object and itself is 0. Furthermore,$d(i, j)=d( j, i)$.
Measures of similarity can often be expressed as a function of measures of dissimilarity.
For example, for nominal data,
$sim(i, j)=1-d(i, j)$
Metric
A given distance (i.e., a dissimilarity measure) is a metric if and only if it satisfies the following four conditions:
1. Non-negativity: $d(p, q) \geq 0$ for all $p$ and $q$: distance is a non-negative number.
2. Symmetry: $d(p, q) = d(q, p)$ for all $p$ and $q$: distance is a symmetric function.
3. Triangle inequality: $d(p, q) \leq d(p, r) + d(r, q)$ for all $p$, $q$, $r$.
4. Identity: $d(p, q) = 0$ only if $p = q$: the distance of an object to itself is 0.
A measure that satisfies these conditions is known as a metric. Please note that the non-negativity property is implied by the other three properties.
Distance measures are fundamental to classification algorithms such as the $k$-nearest neighbors classifier, which measures the dissimilarity between given data samples. Moreover, the choice of distance metric has a strong influence on the performance of the classifier. Therefore, the way you compute distances between objects plays a crucial role in the algorithm’s performance.
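As a quick illustration of where the metric choice enters a classifier, here is a minimal sketch assuming scikit-learn is installed; the toy data is made up for demonstration only:

```python
# Minimal sketch: the same 1-nearest-neighbor classifier with two different metrics.
from sklearn.neighbors import KNeighborsClassifier

X = [[0.5], [1.0], [30.0], [31.0]]   # one feature per sample
y = [0, 0, 1, 1]                     # two classes

for metric in ("euclidean", "manhattan"):
    knn = KNeighborsClassifier(n_neighbors=1, metric=metric).fit(X, y)
    print(metric, knn.predict([[2.0]]))  # both metrics label 2.0 as class 0 here
```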
Proximity Measures for Nominal Attributes
A nominal attribute can take on two or more states. For example, $color$ is a nominal attribute that may have, say, five states: red, yellow, green, pink, and blue.
Let the number of states of a nominal attribute be $M$. The states can be denoted by letters, symbols, or a set of integers, such as $1, 2, \ldots , M$. Notice that such integers are used just for data handling and do not represent any specific ordering.
“How is dissimilarity computed between objects described by nominal attributes?” The dissimilarity between two objects $i$ and $j$ can be computed based on the ratio of mismatches:
$d(i,j)=\frac{p-m}{p}$
where $m$ is the number of matches (i.e., the number of attributes for which $i$ and $j$ are in the same state), and $p$ is the total number of attributes describing the objects. Weights can be assigned to increase the effect of $m$ or to assign greater weight to the matches in attributes having a larger number of states.
For example, consider 4 objects described by a single nominal attribute:

| Object ID | Attribute (nominal) |
|---|---|
| 1 | code A |
| 2 | code B |
| 3 | code C |
| 4 | code A |

Since there is only one attribute ($p = 1$), $d(i, j) = 0$ when the two objects have the same state and $1$ otherwise, which gives the dissimilarity matrix

$\begin{bmatrix} 0 & & & \\ 1 & 0 & & \\ 1 & 1 & 0 & \\ 0 & 1 & 1 & 0 \end{bmatrix}$
Alternatively, similarity can be computed as
$sim(i, j)= 1-d(i, j)=\frac{m}{p}$
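A minimal Python sketch of this mismatch-ratio computation, reusing the four single-attribute objects from the example above:

```python
# Minimal sketch of d(i, j) = (p - m) / p for nominal attributes.
def nominal_dissimilarity(obj_i, obj_j):
    p = len(obj_i)                                  # total number of attributes
    m = sum(a == b for a, b in zip(obj_i, obj_j))   # number of matching states
    return (p - m) / p

objects = {1: ["code A"], 2: ["code B"], 3: ["code C"], 4: ["code A"]}
print(nominal_dissimilarity(objects[1], objects[2]))  # 1.0 (different states)
print(nominal_dissimilarity(objects[1], objects[4]))  # 0.0 (same state)
```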
Proximity Measures for Binary Attributes
Let’s look at dissimilarity and similarity measures for objects described by either symmetric or asymmetric binary attributes.
Recall that a binary attribute has only one of two states: 0 and 1, where 0 means that the attribute is absent, and 1 means that it is present. Given the attribute smoker describing a patient, for instance, 1 indicates that the patient smokes, while 0 indicates that the patient does not. Treating binary attributes as if they are numeric can be misleading. Therefore, methods specific to binary data are necessary for computing dissimilarity.
“So, how can we compute the dissimilarity between two binary attributes?” One approach involves computing a dissimilarity matrix from the given binary data. If all binary attributes are thought of as having the same weight, we have the following $2 \times 2$ contingency table, where $q$ is the number of attributes that equal 1 for both objects $i$ and $j$, $r$ is the number of attributes that equal 1 for object $i$ but 0 for object $j$, $s$ is the number of attributes that equal 0 for object $i$ but 1 for object $j$, and $t$ is the number of attributes that equal 0 for both objects $i$ and $j$. The total number of attributes is $p$, where $p = q + r + s + t$.

| | object $j$: 1 | object $j$: 0 | sum |
|---|---|---|---|
| object $i$: 1 | $q$ | $r$ | $q+r$ |
| object $i$: 0 | $s$ | $t$ | $s+t$ |
| sum | $q+s$ | $r+t$ | $p$ |
Recall that for symmetric binary attributes, each state is equally valuable. Dissimilarity that is based on symmetric binary attributes is called symmetric binary dissimilarity. If objects $i$ and $j$ are described by symmetric binary attributes, then the dissimilarity between $i$ and $j$ is
$d(i,j)=\frac{r+s}{q+r+s+t}$
For asymmetric binary attributes, the two states are not equally important, such as the positive (1) and negative (0) outcomes of a disease test. Given two asymmetric binary attributes, the agreement of two 1s (a positive match) is then considered more significant than that of two 0s (a negative match). Therefore, such binary attributes are often considered “monary” (having one state). The dissimilarity based on these attributes is called asymmetric binary dissimilarity, where the number of negative matches, $t $, is considered unimportant and is thus ignored in the following computation:
$d(i,j)=\frac{r+s}{q+r+s}$
Complementarily, we can measure the difference between two binary attributes based on the notion of similarity instead of dissimilarity. For example, the asymmetric binary similarity between the objects $i$ and $j$ can be computed as
$sim(i,j)=1-dis(i,j)=\frac{q}{q+r+s}$
The coefficient $sim(i, j)$ of the above equation is called the Jaccard coefficient.
Example:
Dissimilarity between binary attributes. Suppose that a patient record table contains the attributes name, gender, fever, cough, test-1, test-2, test-3, and test-4, where name is an object identifier, gender is a symmetric attribute, and the remaining attributes are asymmetric binary:

| name | gender | fever | cough | test-1 | test-2 | test-3 | test-4 |
|---|---|---|---|---|---|---|---|
| Jack | M | Y | N | P | N | N | N |
| Jim | M | Y | Y | N | N | N | N |
| Mary | F | Y | N | P | N | P | N |

For asymmetric attribute values, let the values Y (yes) and P (positive) be set to 1, and the value N (no or negative) be set to 0. Suppose that the distance between objects (patients) is computed based only on the asymmetric attributes. According to the equation above, the distance between each pair of the three patients (Jack, Mary, and Jim) is

$d(\text{Jack},\text{Jim})=\frac{1+1}{1+1+1}=0.67, \quad d(\text{Jack},\text{Mary})=\frac{0+1}{2+0+1}=0.33, \quad d(\text{Jim},\text{Mary})=\frac{1+2}{1+1+2}=0.75$

These measurements suggest that Jim and Mary are unlikely to have a similar disease because they have the highest dissimilarity value, while Jack and Mary are the most likely to have a similar disease.
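The counts $q$, $r$, $s$, $t$ and the asymmetric dissimilarity can be computed with a short Python sketch; the 0/1 encodings below follow the Y/P → 1, N → 0 convention above:

```python
# Sketch: count q, r, s, t for two 0/1 vectors and apply both dissimilarity formulas.
def binary_dissimilarity(i, j, symmetric=True):
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))  # 1-1 matches
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))  # 1-0 mismatches
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))  # 0-1 mismatches
    t = sum(a == 0 and b == 0 for a, b in zip(i, j))  # 0-0 matches
    return (r + s) / (q + r + s + t) if symmetric else (r + s) / (q + r + s)

# fever, cough, test-1, test-2, test-3, test-4 encoded as 1/0
jack = [1, 0, 1, 0, 0, 0]
jim  = [1, 1, 0, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
print(round(binary_dissimilarity(jack, jim,  symmetric=False), 2))  # 0.67
print(round(binary_dissimilarity(jack, mary, symmetric=False), 2))  # 0.33
print(round(binary_dissimilarity(jim,  mary, symmetric=False), 2))  # 0.75
```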
Dissimilarity of Numeric Data
In this section, we describe distance measures that are commonly used for computing the dissimilarity of objects described by numeric attributes. These measures include the Euclidean, Manhattan, and Minkowski distances. In some cases, the data are normalized before applying distance calculations. This
involves transforming the data to fall within a smaller or common range, such as $[-1, 1]$ or $[0.0, 1.0]$. Consider a height attribute, for example, which could be measured in either meters or inches. In general, expressing an attribute in smaller units will lead to a larger range for that attribute, and thus tend to give such attributes greater effect or “weight.” Normalizing the data attempts to give all attributes an equal weight. It may or may not be useful in a particular application.
- Euclidean Distance
- Manhattan Distance
- Minkowski Distance
- Hamming Distance
- Cosine Distance
1. Euclidean Distance
Euclidean Distance represents the shortest distance between two points.
Many machine learning algorithms, including k-means clustering, use this distance metric to measure the similarity between observations.
Let’s consider two points $A=(p_1,p_2)$ and $B=(q_1,q_2)$. The Euclidean distance between them is
$d=\sqrt{(p_1-q_1)^2+(p_2-q_2)^2}$
In general, for $n$-dimensional points,
$D_e=\left( \sum_{i=1}^n (p_i-q_i)^2 \right)^{1/2}$
where
$n$ = number of dimensions
$p_i, q_i$ = data points
Example:
Consider two points A=(1, 2, 3) and B=(4,5,6).
The Euclidean distance is
$\sqrt{(4-1)^2+(5-2)^2+(6-3)^2}$
$\sqrt{9+9+9}=5.1961$
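A minimal Python sketch of this computation, assuming NumPy is available:

```python
# Sketch: Euclidean distance for A = (1, 2, 3) and B = (4, 5, 6).
import numpy as np

A = np.array([1, 2, 3])
B = np.array([4, 5, 6])
print(np.sqrt(np.sum((A - B) ** 2)))  # 5.196...
print(np.linalg.norm(A - B))          # same result via the built-in norm
```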
2. Manhattan Distance
Manhattan Distance is the sum of absolute differences between points across all the dimensions.
In a 2-dimensional space, the Manhattan distance is calculated as the sum of the absolute differences in the x and y directions:
$d=|p_1-q_1|+|p_2-q_2|$
$D_m=\sum_{i=1}^n |p_i-q_i|$
$n$ = number of dimensions
$p_i, q_i$ = data points
Note that Manhattan Distance is also known as city block distance.
Example:
Consider two points A=(1, 2, 3) and B=(4,5,6).
The Manhattan distance is
$|4-1|+|5-2|+|6-3|$
$3+3+3=9$
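The same points can be checked with a short NumPy sketch:

```python
# Sketch: Manhattan (city block) distance for the same points A and B.
import numpy as np

A, B = np.array([1, 2, 3]), np.array([4, 5, 6])
print(np.sum(np.abs(A - B)))  # 9
```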
3. Minkowski Distance
Minkowski Distance is the generalized form of the Euclidean and Manhattan distances.
The formula for Minkowski Distance is given as:
$D_{mink}=\left( \sum_{i=1}^n |p_i-q_i|^r \right)^{1/r}$
where $r$ is the order of the distance; $r = 1$ gives the Manhattan distance and $r = 2$ gives the Euclidean distance.
Example: A=(1, 2, 3) and B=(4, 5, 6). The Minkowski distance with $r = 3$ is
$(|4-1|^3+|5-2|^3+|6-3|^3)^{1/3}$
$(27+27+27)^{1/3}=4.3267$
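A minimal sketch of the generalized Minkowski distance, assuming NumPy is available; note that $r=1$ reproduces the Manhattan distance, $r=2$ the Euclidean distance, and letting $r$ grow yields the supremum (maximum) distance discussed next:

```python
# Sketch: Minkowski distance of order r.
import numpy as np

def minkowski(p, q, r):
    return np.sum(np.abs(np.asarray(p) - np.asarray(q)) ** r) ** (1 / r)

A, B = [1, 2, 3], [4, 5, 6]
print(minkowski(A, B, 1))  # 9.0   (Manhattan)
print(minkowski(A, B, 2))  # 5.196 (Euclidean)
print(minkowski(A, B, 3))  # 4.327
print(np.max(np.abs(np.array(A) - np.array(B))))  # 3, the r -> infinity (supremum) limit
```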
Example: Supremum distance. The supremum distance (also known as the Chebyshev or $L_{max}$ distance) is the limit of the Minkowski distance as the order $r$ grows to infinity; it equals the maximum absolute difference over all attributes. Let’s use the two objects $x_1 = (1, 2)$ and $x_2 = (3, 5)$. The second attribute gives the greatest difference between values for the objects, which is $5-2 = 3$. This is the supremum distance between both objects.
If each attribute is assigned a weight according to its perceived importance, the weighted Euclidean distance can be computed as
$d(i,j)=\sqrt{w_1(x_{i1}-x_{j1})^2+w_2(x_{i2}-x_{j2})^2+\cdots+w_p(x_{ip}-x_{jp})^2}$
Weighting can also be applied to other distance measures.
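A short Python sketch of the weighted Euclidean distance; the weights below are made up for illustration only:

```python
# Sketch: weighted Euclidean distance with illustrative weights.
import numpy as np

def weighted_euclidean(x, y, w):
    x, y, w = map(np.asarray, (x, y, w))
    return np.sqrt(np.sum(w * (x - y) ** 2))

print(weighted_euclidean([1, 2, 3], [4, 5, 6], [0.5, 0.3, 0.2]))  # 3.0
```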
So far, we have covered distance metrics that are used when we are dealing with continuous or numerical variables. But what if we have categorical variables? How can we measure the similarity between categorical variables? This is where we can make use of another distance metric called the Hamming Distance.
4. Hamming Distance
Hamming Distance applies to two strings of the same length: it is the number of positions at which the corresponding characters are different.
Let’s understand the concept using an example. Let’s say we have two strings:
“euclidean” and “manhattan”
Since the length of these strings is equal, we can calculate the Hamming Distance. We will go character by character and match the strings. The first character of both strings (e and m, respectively) is different. Similarly, the second characters (u and a) are different, and so on.
Look carefully: seven characters are different, whereas two characters (the last two) are the same.
Hence, the Hamming Distance here is 7. Note that the larger the Hamming Distance between two strings, the more dissimilar those strings are (and vice versa).
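A minimal Python sketch of the Hamming Distance between two equal-length strings:

```python
# Sketch: Hamming distance = number of positions where the characters differ.
def hamming(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming("euclidean", "manhattan"))  # 7
```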
5. Cosine Distance
This metric is widely used in text mining, natural language processing, and information retrieval systems. For instance, it can be used to measure the similarity between two given documents. It can also help distinguish spam from ham messages, since it compares the orientation of the term vectors rather than their magnitude (message length).
The cosine similarity between two vectors $x$ and $y$ can be measured as follows:
$sim(x, y)=\frac{x .y}{||x||||y||}$
where $||x||$ is the Euclidean norm of vector $x=(x_1, x_2, \ldots , x_p)$, defined as $\sqrt{x_1^2+x_2^2+\cdots+x_p^2}$
Conceptually, it is the length of the vector. Similarly, $||y||$ is the Euclidean norm of vector $y$. The measure computes the cosine of the angle between vectors $x$ and $y$. A cosine value of 0 means that the two vectors are at 90 degrees to each other (orthogonal) and have no match. The closer the cosine value is to 1, the smaller the angle and the greater the match between the vectors. Note that because the cosine similarity measure does not obey all of the properties that define metric measures, it is referred to as a nonmetric measure.
Cosine similarity between two term-frequency vectors. Suppose that $x$ and $y$ are two term-frequency vectors, $x =(5, 0, 3, 0, 2, 0, 0, 2, 0, 0)$ and $y =(3, 0, 2, 0, 1, 1, 0, 1, 0, 1)$. How similar are $x$ and $y$? Using the above equation to compute the cosine similarity between the two vectors, we get:
$x^t \cdot y = 5\times3+0\times0+3\times2+0\times0+2\times1+0\times1+0\times0+2\times1+0\times0+0\times1 = 25$
$||x||=\sqrt{5^2 +0^2 +3^2 +0^2+2^2 +0^2 +0^2 +2^2 +0^2 +0^2 }= 6.48$
$||y||=\sqrt{3^2 +0^2 +2^2 +0^2 +1^2 +1^2 +0^2 +1^2 +0^2 +1^2} = 4.12$
$sim(x, y)=0.94$
Therefore, if we were using the cosine similarity measure to compare these documents, they would be considered quite similar.
The cosine distance is defined as the complement of the cosine similarity:
$cosine\_distance=1- cosine\_similarity$
$\qquad=1-cos(P,Q)$
$\qquad=1-\frac{P.Q}{||P||.||Q||}$
$\qquad=1-\frac{\sum_{i=1}^n P_i.Q_i}{\sqrt{\sum_{i=1}^n P_i^2}.\sqrt{\sum_{i=1}^n Q_i^2}}$
To illustrate this, let’s take the following two documents:
Document A: “Bitcoin Bitcoin Bitcoin Money”
Document B: “Money Money Bitcoin Bitcoin”
Let’s map the word “Bitcoin” to the x-axis and the word “Money” to the y-axis. Document A can then be represented as the vector A = (3, 1) and document B as B = (2, 2).
Computing the cosine distance would result in the following value:
$=1-\frac{(3\times2)+(1\times2)}{\sqrt{9+1}\cdot\sqrt{4+4}}=1-\frac{8}{\sqrt{10}\cdot\sqrt{8}}=1-0.894=0.106$
A cosine similarity of 0.894 means that documents A and B are very similar. The cosine of the angle being large (close to one) means the angle itself is small (about 26.6°), so the two documents A and B are close to each other.
However, you can’t interpret the value of the cosine similarity as a percentage. For instance, the value 0.894 doesn’t mean that document A is 89.4% similar to B. It means that documents A and B are very similar, but there is no fixed threshold and no percentage interpretation: the closer the value is to 1, the more similar the two vectors are, and the closer it is to 0, the less they have in common.
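A short Python sketch of cosine similarity and cosine distance, assuming NumPy is available, applied to both the term-frequency vectors and the two document vectors above:

```python
# Sketch: cosine similarity and cosine distance.
import numpy as np

def cosine_similarity(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

# Term-frequency vectors from the earlier example
x = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
y = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
print(round(cosine_similarity(x, y), 2))  # 0.94

# Document vectors A ("Bitcoin" count, "Money" count) and B
A, B = [3, 1], [2, 2]
sim = cosine_similarity(A, B)
print(round(sim, 3), round(1 - sim, 3))   # 0.894 0.106 (similarity, distance)
```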
When attributes are binary-valued, the cosine similarity function can be interpreted in terms of shared features or attributes. Suppose an object $x$ possesses the $i$th attribute if $x_i=1$. Then $x^t y$ is the number of attributes possessed (i.e., shared) by both $x$ and $y$, and $||x||\,||y||$ is the geometric mean of the number of attributes possessed by $x$ and the number possessed by $y$. Thus, $sim(x, y)$ is a measure of relative possession of common attributes.
A simple variation of cosine similarity for the preceding scenario is
$sim(x, y)=\frac{x.y}{x.x+y.y-x.y}$
which is the ratio of the number of attributes shared by $x$ and $y$ to the number of attributes possessed by $x$ or $y$. This function, known as the Tanimoto coefficient or Tanimoto distance, is frequently used in information retrieval and biology taxonomy.
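A minimal sketch of the Tanimoto coefficient for two binary attribute vectors; the vectors below are made up for illustration:

```python
# Sketch: Tanimoto coefficient sim(x, y) = x.y / (x.x + y.y - x.y) for binary vectors.
import numpy as np

def tanimoto(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xy = np.dot(x, y)
    return xy / (np.dot(x, x) + np.dot(y, y) - xy)

print(tanimoto([1, 1, 0, 1], [1, 0, 0, 1]))  # 2 / (3 + 2 - 2) = 0.667
```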
Example:
Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(i) Compute the Euclidean distance between the two objects.
(ii) Compute the Manhattan distance between the two objects.
(iii) Compute the Minkowski distance between the two objects, using order $r = 3$.
(i) $D_e=\sqrt{(22-20)^2+(1-0)^2+(42-36)^2+(10-8)^2}$
$D_e=\sqrt{4+1+36+4}=\sqrt{45}=6.70$
(ii) $D_m=|22-20|+|1-0|+|42-36|+|10-8|=2+1+6+2=11$
(iii) $D_{mink}=\left(|22-20|^3+|1-0|^3+|42-36|^3+|10-8|^3\right)^{1/3}=(8+1+216+8)^{1/3}=6.15$
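These three results can be cross-checked with SciPy's distance functions (a sketch, assuming SciPy is installed):

```python
# Cross-check of (i)-(iii) using SciPy's distance helpers.
from scipy.spatial.distance import euclidean, cityblock, minkowski

a, b = [22, 1, 42, 10], [20, 0, 36, 8]
print(round(euclidean(a, b), 1))       # 6.7  (Euclidean)
print(cityblock(a, b))                 # 11   (Manhattan)
print(round(minkowski(a, b, p=3), 2))  # 6.15 (Minkowski, order 3)
```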
Proximity Measures for Ordinal Attributes
The values of an ordinal attribute have a meaningful order or ranking, yet the magnitude between successive values is unknown. An example is the sequence small, medium, large for a size attribute. Ordinal attributes may also be obtained from the discretization of numeric attributes by splitting the value range into a finite number of categories. These categories are organized into ranks. That is, the range of a numeric attribute can be mapped to an ordinal attribute $f$ having $M_f$ states. For example, the range of the interval-scaled attribute temperature (in Celsius) can be organized into the following states: $-30$ to $-10$, $-10$ to $10$, and $10$ to $30$, representing the categories cold temperature, moderate temperature, and warm temperature, respectively. These ordered states define the ranking $1, \ldots , M_f$.
“How are ordinal attributes handled?” The treatment of ordinal attributes is quite similar to that of numeric attributes when computing dissimilarity between objects. Suppose that $f$ is an attribute from a set of ordinal attributes describing $n$ objects. The dissimilarity computation with respect to $f$ involves the following steps:
1. The value of $f$ for the $i$th object is $x_{if}$, and $f$ has $M_f$ ordered states, representing the ranking $1,\ldots, M_f$. Replace each $x_{if}$ by its corresponding rank, $r_{if} \in \{1, \ldots , M_f\}$.
2. Since each ordinal attribute can have a different number of states, it is often necessary to map the range of each attribute onto $[0.0, 1.0]$ so that each attribute has equal weight. We perform such data normalization by replacing the rank $r_{if}$ of the $i$th object in the $f$ th attribute by
$z_{if}=\frac{r_{if}-1}{M_f-1}$
3. Dissimilarity can then be computed using any of the distance measures described above for numeric attributes, using $z_{if}$ to represent the $f$ value for the $i$th object.
Example: Suppose that we have the sample data shown below.

| Object identifier | test-2 (ordinal) |
|---|---|
| 1 | excellent |
| 2 | fair |
| 3 | good |
| 4 | excellent |
There are three states for test-2: fair, good, and excellent, that is, $M_f = 3$. If we replace each value for test-2 by its rank, the four objects are assigned the ranks 3, 1, 2, and 3, respectively. Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0, so the normalized values are 1.0, 0.0, 0.5, and 1.0. For step 3, we can use, say, the Euclidean distance, which results in the following dissimilarity matrix:

$\begin{bmatrix} 0 & & & \\ 1.0 & 0 & & \\ 0.5 & 0.5 & 0 & \\ 0 & 1.0 & 0.5 & 0 \end{bmatrix}$
Therefore, objects 1 and 2 are the most dissimilar, as are objects 2 and 4 (i.e., $d(2,1)=1.0$ and $d(4,2)= 1.0$). This makes intuitive sense since objects 1 and 4 are both excellent, while object 2 is fair, which is at the opposite end of the range of values for test-2. Similarity values for ordinal attributes can be derived from dissimilarity as $sim(i, j)= 1-d(i, j)$.
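A short Python sketch of the rank normalization and distance computation for this example, assuming NumPy is available:

```python
# Sketch: ordinal dissimilarity for the test-2 example above.
import numpy as np

ranks = {"fair": 1, "good": 2, "excellent": 3}       # M_f = 3 ordered states
values = ["excellent", "fair", "good", "excellent"]  # objects 1-4

M_f = len(ranks)
z = np.array([(ranks[v] - 1) / (M_f - 1) for v in values])  # normalize to [0, 1]
print(z)  # normalized values: 1.0, 0.0, 0.5, 1.0

# Pairwise Euclidean distance on one attribute reduces to |z_i - z_j|
d = np.abs(z[:, None] - z[None, :])
print(d)  # d[1][0] = 1.0 and d[3][1] = 1.0 match d(2,1) and d(4,2) above (0-based indices)
```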