CDS_Group13_Lecture18_Similarity_measures_and_Recommendation_Systems

# Category: PGDBA CDS 2016

Lecture Notes scribed by students of “Computing for Data Sciences” course, offered in Fall 2016. This course is a part of the Post Graduate Diploma in Business Analytics (PGDBA) program, jointly offered by ISI Kolkata, IIM Calcutta and IIT Kharagpur.

## Lecture 9 | 1st September 2016 | PCA and SVD

In this lecture we covered principal component analysis followed by singular value decomposition. Detailed lecture notes are available at the link below.

## Lecture 17 | 18 Oct 2016 | Graphs and PageRank Algorithm

We discussed the following topics in the lecture held on 18th Oct 2016:

- Graph Structures
- Shortest path using graphs
- PageRank
  - Algorithm
  - Damping Factor
  - Spider Traps
  - Random Restart
- Power Iteration
- Spectral Decomposition
- Degree, Adjacency and Laplacian Matrix

The detailed lecture notes can be found at the below link:

## Lecture 16 | 05 Oct 2016 | Bias-Variance Tradeoff and Regularized Models

## Lecture 13 | 27 Sept 2016 | Introduction to Unsupervised Learning

## Lecture 11 | 8 Sept 2016 | Decision Trees and Random Forests

## Lecture 14 | 29 Sept 2016 | Clustering of Data Points

## Lecture 7 | 25 Aug 2016 | Stochastic gradient descent and Basics of linear regression

## Lecture 6 | 17 Aug 2016 | Basics of Linear model and Gradient descent

## Lecture 8 | 30 Aug 2016 | Linear Regression and Singular Value Decomposition

**Simple Linear Regression**

Linear regression attempts to model the relationship between input variables and one output variable by fitting a linear equation to the observed data. An important assumption of a linear regression model is that there is an underlying linear model, with certain parameters, that generates the observed data. A second important assumption is that this model also has an error component.

The linear model for a single-variable regression is given by:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

Here $x$ and $y$ are the input and output variables respectively, $\beta_0$ and $\beta_1$ are the parameters of the linear model, and $\varepsilon$ denotes the error component associated with the model.

**A brief note on $\varepsilon$**

Every model has an inherent error associated with it. This error component is denoted by $\varepsilon$. We assume that $\varepsilon$ is normally distributed with zero mean. This type of error is called irreducible error.
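The zero-mean error assumption can be made concrete with a short simulation. This is a minimal sketch in Python (the lecture itself uses R), with illustrative parameter values that are not from the lecture:

```python
import numpy as np

# Assumed underlying linear model: y = beta0 + beta1 * x + eps,
# where eps ~ N(0, sigma^2) is the irreducible error (zero mean).
# The parameter values here are made up for illustration.
rng = np.random.default_rng(0)
beta0, beta1, sigma = 2.0, 0.5, 1.0

x = rng.uniform(0, 10, size=200)          # input variable
eps = rng.normal(0.0, sigma, size=200)    # error component, zero mean
y = beta0 + beta1 * x + eps               # observed output
```

Because the error has zero mean, the observations scatter symmetrically around the true line rather than drifting away from it.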

**A fit for the linear model**

We assume that the data that we have is an unbiased sample from the population. Once we find the right fit for this sample data, it is more likely to be a good fit for the population as well. We have certain tests to figure out how good a fit our model is. We’ll see this later in this lecture.

We try to model the relationship between the dependent variable y and one or more explanatory variables denoted by X. We do this by estimating the values of the parameters $\beta_0$ and $\beta_1$.

These estimated parameters are denoted by $\hat\beta_0$ and $\hat\beta_1$, and the estimated value of $y$ is given by:

$$\hat{y} = \hat\beta_0 + \hat\beta_1 x$$

Every estimated value $\hat{y}_i$ may not be exactly equal to the actual value $y_i$. We call this difference between the observed and estimated values the residual, denoted by:

$$e_i = y_i - \hat{y}_i$$

**Methods of estimation of the parameters**

The first method to estimate the parameters is to use the normal equation to solve for $\hat\beta$, as discussed in an earlier class. The normal equation is given by:

$$\hat\beta = (X^TX)^{-1}X^Ty$$
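The normal equation can be sketched directly with numpy (a Python illustration; the data below are synthetic):

```python
import numpy as np

# Synthetic data from an assumed line y = 2 + 0.5x + noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=100)

# Design matrix with an intercept column of ones.
X = np.column_stack([np.ones_like(x), x])

# Normal equation: solve (X^T X) beta = X^T y rather than inverting explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

Solving the linear system is numerically preferable to forming the explicit inverse, though both implement the same equation.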

Another method is least squares. Here we minimize the residual sum of squares (RSS) given by:

$$RSS = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 = \sum_{i=1}^{n}\left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right)^2$$

Here $x_i$ and $y_i$ denote the $i^{th}$ observation.

RSS can be interpreted as the sum of squares of the differences between each observed and estimated value. Our aim is to find the values of $\hat\beta_0$ and $\hat\beta_1$ that minimize RSS. We can achieve this by taking the partial derivatives of RSS w.r.t. $\hat\beta_0$ and $\hat\beta_1$, setting them to zero and solving for $\hat\beta_0$ and $\hat\beta_1$.
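Carrying out these partial derivatives (a standard derivation, stated here for completeness in the notation above) gives:

$$\frac{\partial RSS}{\partial \hat\beta_0} = -2\sum_{i=1}^{n}\left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right) = 0, \qquad \frac{\partial RSS}{\partial \hat\beta_1} = -2\sum_{i=1}^{n} x_i\left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right) = 0$$

Solving these two equations simultaneously yields the closed-form least-squares estimates

$$\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$$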

RSS can also be interpreted as the sum of the squares of all the residuals, hence the name Residual Sum of Squares (RSS).

Minimizing RSS tends to shrink the individual residuals as a whole; conversely, when all the residuals shrink simultaneously, RSS shrinks as well. This tells us that whatever value $\hat{y}_i$ we estimate, it won't be far off from $y_i$.

**Assessing the Estimated Parameters**

Going back to one of our assumptions: our sample is unbiased. If we think of getting a sample as drawing balls from a huge bag of numbered balls, the probability of drawing balls numbered 1 to 50 in a row is very low, i.e. the probability of getting a biased sample is very low, while the probability of getting an evenly dispersed sample is very high. In other words, an unbiased sample is representative of the whole population. If we were given another independent sample from the same population, the fitted line would look more or less like the one described by the estimated parameters.

In R, linear regression can be accomplished using the *lm* function, applied here to the advertising data set with TV as the input variable.

The *lm* function in R returns details about the fit of our model. It provides the values of $\hat\beta_0$ and $\hat\beta_1$ as coefficients, along with the standard error, t-value and its probability (p-value) for each. It also shows the RSS value and the $R^2$ value. We will look at all these terms one by one.
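The advertising data file itself is not reproduced here, so the quantities that *lm* reports can instead be sketched on synthetic data. A minimal Python version (column names and parameter values are assumptions, not from the lecture):

```python
import numpy as np

# Hypothetical advertising-style data: sales driven by a TV budget.
rng = np.random.default_rng(2)
tv = rng.uniform(0, 300, size=200)
sales = 7.0 + 0.05 * tv + rng.normal(0.0, 3.0, size=200)

# Least-squares fit, as lm() would compute.
X = np.column_stack([np.ones_like(tv), tv])
b0, b1 = np.linalg.lstsq(X, sales, rcond=None)[0]

# Quantities reported in the lm summary.
fitted = b0 + b1 * tv
residuals = sales - fitted
rss = float(np.sum(residuals ** 2))
tss = float(np.sum((sales - sales.mean()) ** 2))
r_squared = 1.0 - rss / tss
```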

**Residuals**

Let us start with the plot of the residuals for each observation.

*plot(lm_fit$residuals)*

The dream goal is to make all the residuals zero, but that never happens. We can see that the residuals are symmetric about the line *y = 0*, i.e. the *x*-axis, with a central tendency towards it. But can we infer that the residuals are normally distributed? A histogram gives us the answer.

*hist(lm_fit$residuals)*

We can see that the residuals are more or less normally distributed around zero. This suggests that the errors $\varepsilon$ are normally distributed about the line $y = 0$.

If the residuals are not normally distributed, we can conclude either that $\varepsilon$ is not normally distributed, or that the sample is bad (i.e. biased), or that the linear model is not the right fit for this sample.

**Standard Error**

All the parameters that we are calculating here are nothing but estimates: $\hat\beta_1$ is an estimate of the slope, not the actual slope. Hence all the parameters come with the properties associated with estimates. One such property is the Standard Error, the deviation that we can expect in the estimate. That is, if we draw several random samples from the population and compute $\hat\beta_1$ for each, the expectation of all the $\hat\beta_1$ values should be very close to $\beta_1$.

The Standard Error of $\hat\beta_1$ is given by:

$$SE(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \sigma^2 = \mathrm{Var}(\varepsilon)$$

This can be interpreted as the variance of the residuals relative to the spread of $x$: the SE is proportional to the variance of the residuals and inversely proportional to the variance of $x$. If the error terms are huge, the scope for fitting lines with different slopes increases too. This is captured in the Standard Error.

Here the variance of $\varepsilon$, i.e. $\sigma^2$, is approximated by the variance of the residuals $e_i$.
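Putting the formula together with the residual-variance estimate, a Python sketch (synthetic data; the divisor $n-2$ is the usual degrees-of-freedom correction):

```python
import numpy as np

# Synthetic sample from y = 1 + 2x + noise.
rng = np.random.default_rng(3)
n = 150
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.5, size=n)

# Closed-form least-squares slope and intercept.
xbar, ybar = x.mean(), y.mean()
sxx = float(np.sum((x - xbar) ** 2))
b1 = float(np.sum((x - xbar) * (y - ybar)) / sxx)
b0 = ybar - b1 * xbar

# Estimate Var(eps) from the residuals, then the standard error of b1.
rss = float(np.sum((y - b0 - b1 * x) ** 2))
sigma2_hat = rss / (n - 2)
se_b1 = float(np.sqrt(sigma2_hat / sxx))
```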

**Total Sum of Squares (TSS)**

Hypothetically, assume that there is no relationship between x and y. In this case the x values don't affect the y values in any way. In other words, there is no linear dependency between x and y.

For such a scenario, what would be the best estimate of $y$? The best estimate would be the mean of $y$, i.e. $\bar{y}$. TSS captures the variation in $y$ w.r.t. $\bar{y}$. It is given by:

$$TSS = \sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$$

However, if there is a linear dependency between x and y, TSS is the worst-case scenario for RSS: the fitted line can do no worse than the mean-only model, so $RSS \le TSS$.

### Residual Sum of Squares (RSS)

RSS is a measure of the deviation of the predicted values (obtained from the estimated model) from the actual data. In other words, RSS shows the residual error in the predicted values from the estimated model. Hence, a small value of RSS indicates a good fit of the model to the observed data.

The main aim of linear regression is to fit a linear model that improves the R-squared value as much as it can. The correctness of the linear fit can be justified using hypothesis testing.

## Hypothesis Testing for Linear Regression

### t-statistic

The t-statistic is the ratio of the deviation of the estimated parameter from its hypothesized value to its standard error.

If the scatter plot of data looks linear, we can fit a linear model to it. Hypothesis testing tells us objectively how good our fit is.

The null hypothesis in this case is that there does not exist a linear relationship between the dependent variable (y) and the independent variable (x). Mathematically, this is equivalent to saying that $\beta_1 = 0$. The alternative hypothesis states that there exists a linear relationship between x and y. The goal of hypothesis testing is to check whether the data provide enough evidence to reject the null hypothesis.

In this case the t-statistic is written as:

$$t = \frac{\hat\beta_1 - 0}{SE(\hat\beta_1)}$$

Suppose there is no relationship between the variables x and y; then the value of $\hat\beta_1$ would be zero, up to some standard error. The error arises because samples drawn from the population vary slightly and can therefore show slight dependence between the variables, but in general the dependence would be very weak unless the data is very skewed.

If there is no relationship between the variables, then through hypothesis testing we find an acceptable range in which $\hat\beta_1$ lies. Precisely, in the above case the distribution of $\hat\beta_1$ is normal around zero, and there is roughly a 95% chance that $\hat\beta_1$ lies in the interval $[-2\,SE(\hat\beta_1),\; 2\,SE(\hat\beta_1)]$. But if there were a relationship between the variables, then $\hat\beta_1$ would follow some other normal distribution around a different (non-zero) mean, and the probability of getting $\hat\beta_1$ near that new mean would be high. Given a sample, one value of $\hat\beta_1$ is known.
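The decision rule described above reduces to checking whether the observed slope estimate falls outside the zero-centred interval. A Python sketch on synthetic data (where a real relationship does exist):

```python
import numpy as np

# Synthetic sample where a real linear relationship exists.
rng = np.random.default_rng(4)
n = 100
x = rng.uniform(0, 10, size=n)
y = 3.0 + 0.8 * x + rng.normal(0.0, 2.0, size=n)

# Slope estimate and its standard error.
xbar, ybar = x.mean(), y.mean()
sxx = float(np.sum((x - xbar) ** 2))
b1 = float(np.sum((x - xbar) * (y - ybar)) / sxx)
b0 = ybar - b1 * xbar
rss = float(np.sum((y - b0 - b1 * x) ** 2))
se_b1 = float(np.sqrt((rss / (n - 2)) / sxx))

# t-statistic for H0: beta1 = 0, and the rough 2*SE rule from the lecture.
t_stat = b1 / se_b1
reject_null = abs(t_stat) > 2.0
```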

Now the main problem is to know whether this $\hat\beta_1$ came from the right-side distribution or the left-side distribution (refer to the figure above). The chance that it came from the right-side distribution can't be computed, because there is no information about that distribution. But it can be said with 95% confidence that $\hat\beta_1$ would lie in the interval $[-2\,SE(\hat\beta_1),\; 2\,SE(\hat\beta_1)]$ if there doesn't exist a linear relationship. Thus the main focus is to check whether $\hat\beta_1$ lies in this confidence interval or not. In simple words, the focus is to obtain the probability of obtaining such a sample from the population given that there is no linear relationship; the t-statistic is computed to determine this.

For the advertising example, the t-statistic for TV comes out to be 17.67, which has an extremely small associated probability; i.e. given that there is no linear dependence between TV advertisement and sales, it is almost impossible to obtain such a value of $\hat\beta_1$. Hence there exists a relationship between the two variables and the null hypothesis can be discarded.

Similarly, Radio advertisement and sales also show linear dependence which could be clearly interpreted from the obtained t-statistic values and probability.

Now, from the graph below it appears that there is no linear relationship between sales and newspaper advertisement, which is also suggested by the comparatively low t-statistic value of 3.3. Thus the null hypothesis can't be rejected here, and there is a good chance that there doesn't exist a linear relation between newspaper advertisement and sales.

### R-Squared

R-squared, also known as the coefficient of determination, indicates the proportion of total variance that is accounted for by the fitted model. In other words, it is a measure of how well the model fits the actual data. In mathematical terms it can be defined as:

$$R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$$
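As a quick numerical check of this definition, a Python sketch comparing the fitted line against the mean-only model (synthetic data):

```python
import numpy as np

# Synthetic data with a clear linear trend.
rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=120)
y = 4.0 + 1.2 * x + rng.normal(0.0, 1.0, size=120)

# Least-squares fit.
xbar, ybar = x.mean(), y.mean()
b1 = float(np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2))
b0 = ybar - b1 * xbar

rss = float(np.sum((y - b0 - b1 * x) ** 2))   # error left after the fit
tss = float(np.sum((y - ybar) ** 2))          # error of the mean-only model
r2 = 1.0 - rss / tss                          # proportion of variance explained
```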

The R-squared value of the advertising examples are as follows:

The actual goal of fitting a model is to account for 100% of the variance, but in reality that is not possible; the model therefore tries to account for as much variance as it can. It is clear from the above table that TV advertisement alone accounts for 61% of the variance, radio for 33.2% and newspaper for only 5.2%. Another important observation is that the sum of the R-squared values of the TV and radio models differs from the R-squared value obtained when TV and radio are fitted together, and the same holds when TV, radio and newspaper are all taken together. This is explained by the fact that the variables TV, radio and newspaper have non-zero correlation between them; in other words, the variables are not orthogonal to each other. So, in order to account for the maximum variance of the response variable, we need to select linear combinations of the attributes which are orthogonal to each other, which can be done using singular value decomposition (SVD), described in the next section.

## Singular Value Decomposition

Any dataset can be considered as a matrix, where each column represents an attribute. Our goal with SVD is to find independent linear combinations of attributes that can explain the variance of the response variable. Applying a matrix to a vector scales and rotates that vector. Scaling, in the geometric sense, is a measure of dispersion, and the maximum variance lies along the direction of maximum expansion. So, given a matrix $A$, our aim is to find the direction and scaling coefficient of maximum expansion. This maximum value is called the first singular value of the matrix.

$$\sigma_1 = \max_{\|x\| = 1} \|Ax\| \qquad \text{(first singular value)}$$

We define the maximizer $v_1$ on the unit circle, so $\|v_1\| = 1$. Thus,

$$\sigma_1 = \|Av_1\|$$

$v_1$ is the first singular vector of matrix $A$. (Note: for a symmetric matrix, the first singular value is the maximum absolute eigenvalue and the corresponding vector is an eigenvector.)
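The maximization that defines the first singular value $\sigma_1$ can be checked numerically. A Python sketch: numpy's `svd` supplies the reference value, and $\|Ax\|$ for random unit vectors never exceeds it:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(4, 3))

# Largest singular value from the library routine.
sigma1 = float(np.linalg.svd(A, compute_uv=False)[0])

# ||Ax|| over many random unit vectors: bounded above by sigma1.
best = 0.0
for _ in range(2000):
    x = rng.normal(size=3)
    x /= np.linalg.norm(x)          # project onto the unit sphere
    best = max(best, float(np.linalg.norm(A @ x)))
```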

Let us define $u_1$ such that

$$Av_1 = \sigma_1 u_1, \qquad \|u_1\| = 1$$

Now we need to find another direction in which the variance is largest and which is independent of $v_1$, i.e. orthogonal to $v_1$. Otherwise there would be some correlation, which would not provide an independent piece of information in terms of variance. So consider a unit vector $v_2$ perpendicular to $v_1$. Thus,

$$\sigma_2 = \max_{\|x\| = 1,\; x \perp v_1} \|Ax\|$$

Accordingly, for $v_i$ it can be written as

$$\sigma_i = \max_{\|x\| = 1,\; x \perp v_1, \ldots, v_{i-1}} \|Ax\|$$

Note that all the singular vectors $v_i$ lie in the domain of $A$ and are orthonormal. Every additional $v_i$ increases the dimension by 1, and the overall dimension can be at most $n$. We define the rank as the number of vectors needed to exhaust the action of the matrix. So if $r$ is the rank of matrix $A$, then

$$Av_i = 0 \quad \text{for } i = r+1, \ldots, n$$

Thus $v_{r+1}, \ldots, v_n$ lie in the null space of $A$ and $v_1, \ldots, v_r$ lie in the row space of $A$: $v_1, \ldots, v_r$ form an orthonormal basis of the row space of $A$, while $v_{r+1}, \ldots, v_n$ form an orthonormal basis of the null space of $A$.

Let $u_1, \ldots, u_r$ be such that

$$Av_1 = \sigma_1 u_1, \quad Av_2 = \sigma_2 u_2, \quad \ldots, \quad Av_r = \sigma_r u_r$$

Source: Gilbert Strang Paper

The vectors $u_1, \ldots, u_r$ lie in the column space of $A$ and form an orthonormal basis of it. The above equations can be written compactly as

$$AV = U\Sigma$$

As $V$ is orthonormal, $VV^T = I$.

On multiplying both sides by $V^T$ on the right, we get

$$A = U\Sigma V^T$$
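The factorisation can be verified numerically; numpy's `svd` returns $U$, the singular values, and $V^T$ directly:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(5, 3))

# Reduced SVD: U is 5x3, s holds the singular values, Vt is 3x3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rebuild A from the three factors.
A_rebuilt = U @ np.diag(s) @ Vt
err = float(np.max(np.abs(A - A_rebuilt)))
```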

Singular value decomposition is thus a method of breaking down any matrix of $m$ rows and $n$ columns into the product of three matrices, $A = U\Sigma V^T$.


### Illustration in R

Using ‘princomp’ function in R:

‘princomp’ performs PCA on the data and outputs the variance explained by the principal components.

Using ‘screeplot’ to plot the variances explained by different principal components.
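What ‘princomp’ computes can be sketched in Python via the SVD of the centred data (synthetic data; the correlation is injected deliberately so one component dominates, as a scree plot would show):

```python
import numpy as np

# Synthetic data with correlated attributes.
rng = np.random.default_rng(8)
X = rng.normal(size=(100, 3))
X[:, 1] += 2.0 * X[:, 0]              # make column 1 depend on column 0

# Centre each attribute, then take the SVD.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Variance explained by each principal component.
var_explained = s ** 2 / np.sum(s ** 2)
```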