# Predicting Google’s Stock Price using Linear Regression

**What is Linear Regression?** Let’s forget the term ‘linear regression’ for a moment. Instead, think back to your high school math class. You probably plotted the graph of a given linear equation during coordinate geometry lessons. Let’s review what we did there.

We were given an equation *y = 2x + 3*, where 2 is the coefficient of x and 3 is a constant (i.e. intercept on y axis). What we used to do was:

- Take a value of *x* (say *x* = 0).
- Find the corresponding value of *y* by putting *x* = 0 in the equation.
- Store the *(x, y)* value pair in a table.
- Repeat the process once or twice, or as many times as we want.
- Plot the points on the graph to obtain the straight line.
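The steps above can be sketched in a few lines of Python. This is only an illustration: the helper name `line_points` and the chosen values of *x* are made up for this example.

```python
def line_points(a, b, xs):
    """Return (x, y) pairs that lie on the line y = a*x + b."""
    return [(x, a * x + b) for x in xs]

# Tabulate y = 2x + 3 for a few values of x
points = line_points(2, 3, [0, 1, 2, 3])
for x, y in points:
    print(x, y)
```

Plotting these pairs gives the straight line from the equation.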

Now, we will just do the **reverse** of the above method.

- We have a set of points *(x<sub>1</sub>, y<sub>1</sub>)*, *(x<sub>2</sub>, y<sub>2</sub>)*, *(x<sub>3</sub>, y<sub>3</sub>)* and so on till *(x<sub>n</sub>, y<sub>n</sub>)*.
- We have to use this set of points to find the coefficient *a* and the constant *b* such that *y = ax + b*.
- Once we have the equation, we can find the approximate value of *y* for any value of *x*.

Basically, we found a relationship or pattern between the values of x and y and generated an equation y=ax+b. You just did linear regression without even knowing.
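As a rough sketch of this reverse step, the coefficient and constant can be estimated with the standard least-squares formulas, which choose the line that minimises the squared vertical distances to the points. The function name `fit_line` and the sample points are assumptions made for illustration.

```python
def fit_line(points):
    """Least-squares estimate of a and b in y = a*x + b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Points taken from the line y = 2x + 3
points = [(0, 3), (1, 5), (2, 7), (3, 9)]
a, b = fit_line(points)
print(a, b)  # 2.0 3.0
```

Because these points lie exactly on a line, the fit recovers *a* = 2 and *b* = 3. With real, noisy data the line only approximates the points.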

Let’s see the official definition of regression (from Wikipedia).

In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.

So, in our implementation, *x* is the independent/explanatory variable, and *y* is the dependent variable, as its value is dependent on *x*. Now, let us implement simple linear regression using Python to understand the real life application of the method.

We will be predicting the future price of Google’s stock using simple linear regression. The data that we will be using is real data obtained from Google Finance and saved to a CSV file, *google.csv*.

| Date | Open |
|------|--------|
| 26 | 708.58 |
| 25 | 700.01 |
| 24 | 688.92 |
| 23 | 701.45 |
| 22 | 707.45 |
| 19 | 695.03 |
| 18 | 710 |
| 17 | 699 |
| 16 | 692.98 |
| 12 | 690.26 |
| 11 | 675 |
| 10 | 686.86 |
| 9 | 672.32 |
| 8 | 667.85 |
| 5 | 703.87 |
| 4 | 722.81 |
| 3 | 770.22 |
| 2 | 784.5 |
| 1 | 750.46 |

In the above dataset, we have the prices at which the Google stock opened from February 1 – February 26, 2016. Using this data, we will try to predict the price at which the stock will open on February 29, 2016. We will be using scikit-learn, csv, numpy and matplotlib packages to implement and visualize simple linear regression.
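If you do not have the file at hand, the table above can be written to *google.csv* with the standard csv module. This is a sketch under assumptions: the header names `Date` and `Open` are taken from the table, and the exact layout of the original file is a guess.

```python
import csv

# (day of February 2016, opening price) rows copied from the table above
rows = [
    (26, 708.58), (25, 700.01), (24, 688.92), (23, 701.45), (22, 707.45),
    (19, 695.03), (18, 710), (17, 699), (16, 692.98), (12, 690.26),
    (11, 675), (10, 686.86), (9, 672.32), (8, 667.85), (5, 703.87),
    (4, 722.81), (3, 770.22), (2, 784.5), (1, 750.46),
]

with open("google.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Date", "Open"])  # header row with the column names
    writer.writerows(rows)
```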

First, let’s import the above modules:

```python
import csv
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
```

The csv module is used to read data from the file “google.csv”. numpy is used for array processing and conversion. sklearn (scikit-learn) is used to implement linear regression, and matplotlib is used to plot the data points on a graph.

Next, let’s define a function to read data from *google.csv*.

```python
dates = []
prices = []

def get_data(filename):
    with open(filename, 'r') as csvfile:
        csvFileReader = csv.reader(csvfile)
        next(csvFileReader)  # skipping column names
        for row in csvFileReader:
            dates.append(int(row[0]))
            prices.append(float(row[1]))
    return
```

Don’t worry if you are not familiar with reading data from CSV files using Python. Just read our previous article, Interacting with CSV files using Python, which has well-explained and easy-to-follow examples to help you.
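As a possible variation, the reader function could return the two lists instead of appending to module-level variables, which makes it easier to reuse and test. The name `read_prices` and the throwaway sample file below are assumptions made for this sketch, not part of the original program.

```python
import csv
import os
import tempfile

def read_prices(filename):
    """Read the date and price columns from a CSV file with a header row."""
    dates, prices = [], []
    with open(filename, "r", newline="") as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # skip the header row
        for row in reader:
            dates.append(int(row[0]))
            prices.append(float(row[1]))
    return dates, prices

# Small demonstration with a temporary file holding two rows from the table
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w", newline="") as f:
        f.write("Date,Open\n26,708.58\n25,700.01\n")
    sample_dates, sample_prices = read_prices(path)

print(sample_dates)   # [26, 25]
print(sample_prices)  # [708.58, 700.01]
```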

Now, let’s define a function to predict the price of Google’s stock on a given date.

```python
def predict_price(dates, prices, x):
    linear_mod = linear_model.LinearRegression()  # defining the linear regression model
    dates = np.reshape(dates, (len(dates), 1))  # converting to matrix of n X 1
    prices = np.reshape(prices, (len(prices), 1))
    linear_mod.fit(dates, prices)  # fitting the data points in the model
    predicted_price = linear_mod.predict(np.reshape(x, (1, 1)))  # predict expects a 2D array
    return predicted_price[0][0], linear_mod.coef_[0][0], linear_mod.intercept_[0]
```

The function predict_price takes three arguments:

- **dates**: the list of dates as integers
- **prices**: the opening price of the stock for the corresponding date
- **x**: the date for which we want to predict the price (i.e. 29)

The fit method fits the dates and prices (x’s and y’s) to generate coefficient and constant for regression. Finally, the predict method finds the price(y) for the given date (x) and returns the predicted price, the coefficient and the constant of the relationship equation.
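To get a feel for what the fit step estimates, the same straight line can be recovered with NumPy’s general least-squares solver. This is a sketch of the idea, not scikit-learn’s internal code; the variable names are chosen for this example, and the data is copied from the table above.

```python
import numpy as np

dates = [26, 25, 24, 23, 22, 19, 18, 17, 16, 12, 11, 10, 9, 8, 5, 4, 3, 2, 1]
prices = [708.58, 700.01, 688.92, 701.45, 707.45, 695.03, 710.0, 699.0,
          692.98, 690.26, 675.0, 686.86, 672.32, 667.85, 703.87, 722.81,
          770.22, 784.5, 750.46]

# Design matrix: one column of dates and one column of ones for the constant
X = np.column_stack([dates, np.ones(len(dates))])
y = np.array(prices)

# Solve the least-squares problem X @ [coef, intercept] ~= y
(coef, intercept), *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef, intercept)
```

The two printed values should come out close to the coefficient and constant that the full program reports further below.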

To understand the concept of regression better, we can use the matplotlib Python module to plot the data points and the relationship formed between them.

Note: The show_plot function draws the graph using matplotlib. Do not worry if you do not understand the code below completely. It is more important to understand the graph that follows it. However, the function is commented to help you follow the code.

```python
def show_plot(dates, prices):
    linear_mod = linear_model.LinearRegression()
    dates = np.reshape(dates, (len(dates), 1))  # converting to matrix of n X 1
    prices = np.reshape(prices, (len(prices), 1))
    linear_mod.fit(dates, prices)  # fitting the data points in the model
    plt.scatter(dates, prices, color='yellow')  # plotting the initial datapoints
    plt.plot(dates, linear_mod.predict(dates), color='blue', linewidth=3)  # plotting the line made by linear regression
    plt.show()
    return
```

The yellow dots in the above plot show the data-points plotted for each date and price (i.e. the initial dataset)

The blue line is the equation formed by the fit method of the linear model (see predict_price method above)

Now, when we input the date February 29 to the regression model, it just uses the equation of the blue straight line in the above plot, and finds the corresponding value on y axis.
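In other words, predicting is just evaluating the line’s equation. A minimal sketch, assuming the coefficient and constant that this article’s program reports in its output (the helper name `price_on` is made up for illustration):

```python
# Coefficient and constant as reported in the program output of this article
coefficient = -1.65535514798
constant = 728.93081959

def price_on(date):
    """Evaluate the fitted line: price = coefficient * date + constant."""
    return coefficient * date + constant

print(round(price_on(29), 2))  # 680.93
```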

See the full program code below.

```python
import csv
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt

dates = []
prices = []

def get_data(filename):
    with open(filename, 'r') as csvfile:
        csvFileReader = csv.reader(csvfile)
        next(csvFileReader)  # skipping column names
        for row in csvFileReader:
            dates.append(int(row[0]))
            prices.append(float(row[1]))
    return

def show_plot(dates, prices):
    linear_mod = linear_model.LinearRegression()
    dates = np.reshape(dates, (len(dates), 1))  # converting to matrix of n X 1
    prices = np.reshape(prices, (len(prices), 1))
    linear_mod.fit(dates, prices)  # fitting the data points in the model
    plt.scatter(dates, prices, color='yellow')  # plotting the initial datapoints
    plt.plot(dates, linear_mod.predict(dates), color='blue', linewidth=3)  # plotting the line made by linear regression
    plt.show()
    return

def predict_price(dates, prices, x):
    linear_mod = linear_model.LinearRegression()  # defining the linear regression model
    dates = np.reshape(dates, (len(dates), 1))  # converting to matrix of n X 1
    prices = np.reshape(prices, (len(prices), 1))
    linear_mod.fit(dates, prices)  # fitting the data points in the model
    predicted_price = linear_mod.predict(np.reshape(x, (1, 1)))  # predict expects a 2D array
    return predicted_price[0][0], linear_mod.coef_[0][0], linear_mod.intercept_[0]

get_data('google.csv')  # calling get_data method by passing the csv file to it
print(dates)
print(prices)
print("\n")

# An image of the plot will be generated. Save it if you want,
# then close it to continue the execution of the code below.
show_plot(dates, prices)

predicted_price, coefficient, constant = predict_price(dates, prices, 29)
print("The stock open price for 29th Feb is: $", str(predicted_price))
print("The regression coefficient is", str(coefficient), ", and the constant is", str(constant))
print("the relationship equation between dates and prices is: price =", str(coefficient), "* date +", str(constant))
```

The above program gives output like the following (the exact number of digits printed may vary with your Python version):

```
[26, 25, 24, 23, 22, 19, 18, 17, 16, 12, 11, 10, 9, 8, 5, 4, 3, 2, 1]
[708.58, 700.01, 688.92, 701.45, 707.45, 695.03, 710.0, 699.0, 692.98, 690.26, 675.0, 686.86, 672.32, 667.85, 703.87, 722.81, 770.22, 784.5, 750.46]


The stock open price for 29th Feb is: $ 680.925520298
The regression coefficient is -1.65535514798 , and the constant is 728.93081959
the relationship equation between dates and prices is: price = -1.65535514798 * date + 728.93081959
```

See the last line of the output. It shows the equation of the blue line formed in the plot. I hope the concept of linear regression is clear now.

Congrats! You just learnt a fundamental yet powerful machine learning technique. The dataset, code and plot are available on GitHub. Your questions are welcome in the comments below.
