Pandas and NumPy¶
Load the data from http://opendata.dc.gov/datasets (a copy is included in this GitHub repository)
into a dataframe. (The file is available at ./data/ccp_current_csv.csv.)
In [1]:
import pandas as pd
import numpy as np
ccpdata = pd.read_csv("./data/ccp_current_csv.csv")
ccpdata.head(10) #first 10 records, to give a preview of the data
Out[1]:
What is its shape and what does that mean?¶
In [2]:
print(ccpdata.shape)
#we could alternatively use the numpy.shape() function
print(np.shape(ccpdata))
print()
print("No. of Rows: ", len(ccpdata)) #prints the number of rows
print("No. of Columns: ", len(ccpdata.columns)) #prints the number of columns
Shape of a dataset means the dimensionality of the dataset. The dimensions of a dataset refers to the number of rows and the number of columns, which gives us an idea of the size of the dataset and how the data are arranged.
The shape of ccpdata given by the statement ccpdata.shape is a tuple (No. of rows, No. of columns). From the above output, we can see that the dataset has 465 rows and 27 columns; that is, there are 465 records in the data and each record has 27 columns.
We can also describe the shape of a dataset using statistical values such as standard deviation, mean, median, and mode.
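A quick sketch, on a toy DataFrame rather than the ccp data, of how the shape tuple and the statistical summary relate:

```python
import pandas as pd

# toy stand-in DataFrame, not the ccp dataset
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10.0, 20.0, 30.0, 40.0]})

rows, cols = df.shape          # shape is a (rows, columns) tuple
print(rows, cols)              # 4 2

# describe() gives the statistical "shape": count, mean, std, quartiles
print(df.describe())
```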
What are the number of rows in each 'QUADRANT' ?¶
In [3]:
#We can use the value_counts() method to calculate the frequency of values in a column
print ("QUAD. No. of Rows")
print (ccpdata['QUADRANT'].value_counts())
print()
#We could also get the counts and store into a dictionary as follows:
rowcount = ccpdata['QUADRANT'].value_counts().to_dict()
print (rowcount)
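On a small toy Series standing in for the QUADRANT column, the same pattern looks like this:

```python
import pandas as pd

# toy Series standing in for the QUADRANT column
quadrants = pd.Series(["NW", "NE", "NW", "SE", "NW", "NE"])

counts = quadrants.value_counts()   # frequencies, sorted descending
print(counts)

as_dict = counts.to_dict()          # {'NW': 3, 'NE': 2, 'SE': 1}
print(as_dict)
```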
In [4]:
import numpy as np
a = np.array([1.,2.,3.,4.])
b = np.array([5.0,6.0,7.0,8.0])
print(type(a))
print(type(b))
#adding a and b using numpy.add() function
print(np.add(a,b))
#adding a and b using the operator +
print(a+b)
#We can also use the astype() function to change the datatype of an array.
#For instance, we can declare a as an integer type array and convert it to a float type array
a = np.array([1,2,3,4])
print("Datatype of 1st element of a is ", type(a[0]))
print(a)
a = a.astype(float)
print("Datatype of 1st element of a is ", type(a[0]))
print(a)
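Besides inspecting the type of an individual element, NumPy arrays carry a dtype attribute for the whole array, and astype() returns a new array rather than modifying the original. A small sketch:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
print(a.dtype)            # dtype of the whole array (an integer dtype)

b = a.astype(float)       # astype returns a NEW array; a is unchanged
print(b.dtype)            # float64
print(a.dtype)            # still an integer dtype
```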
Subtraction a-b¶
In [5]:
import numpy as np
a = np.array([1.,2.,3.,4.])
b = np.array([5.0,6.0,7.0,8.0])
#subtracting elementwise using the numpy.subtract() function
print(np.subtract(a,b))
print(np.subtract(b,a))
#subtracting elementwise using the operator -
print(a - b)
print(b - a)
Multiplication a*b¶
In [6]:
import numpy as np
a = np.array([1.,2.,3.,4.])
b = np.array([5.0,6.0,7.0,8.0])
#multiplying elements of a by elements of b using numpy.multiply() function
print(np.multiply(a,b))
#multiplying elements of a by elements of b using the operator *
print(a * b)
Division a/b¶
In [7]:
import numpy as np
a = np.array([1.,2.,3.,4.])
b = np.array([5.0,6.0,7.0,8.0])
#dividing elements of a by elements of b using the numpy.divide() function
print(np.divide(a,b))
#dividing elements of a by elements of b using the operator /
print(a / b)
Modulo a%b¶
In [8]:
import numpy as np
a = np.array([1.,2.,3.,4.])
b = np.array([5.0,6.0,7.0,8.0])
#computing a mod b using the numpy.mod() function
print(np.mod(a,b))
#computing a mod b using the operator %
print(a % b)
Power a**b¶
In [9]:
import numpy as np
a = np.array([1.,2.,3.,4.])
b = np.array([5.0,6.0,7.0,8.0])
#raising elements of a to the powers in b using the numpy.power() function
print(np.power(a,b))
#raising to a power using the operator **
print(a ** b)
#print(a ^ b) #the caret is not power in Python; it is bitwise XOR, which raises a TypeError on float arrays
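To see the caret's actual behavior: on integer arrays `^` performs bitwise XOR, while on float arrays NumPy refuses it entirely. A small demonstration:

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([5, 6, 7, 8])
print(x ^ y)              # bitwise XOR on integers: [ 4  4  4 12]

try:
    np.array([1.0, 2.0]) ^ np.array([3.0, 4.0])
except TypeError as e:
    print("XOR on floats raises TypeError:", e)
```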
Provide an interesting analysis of the data columns ( frequency or averages )¶
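The cells below use mkl_data, but the cell that loads it is not shown; presumably it was read from a CSV of Markel Corp (MKL) daily stock prices. A runnable stand-in with assumed column names (the Date/Open/High/Low/Close/Adj Close/Volume layout is a guess, made only so the snippets below have something to run against):

```python
import pandas as pd

# hypothetical stand-in for the real Markel Corp dataset;
# the column names beyond those used below are assumptions
mkl_data = pd.DataFrame({
    "Date":      ["2017-01-03", "2017-01-04", "2017-01-05"],
    "Open":      [905.0, 910.5, 908.0],
    "High":      [912.0, 915.0, 911.0],
    "Low":       [901.0, 907.0, 903.5],
    "Close":     [910.0, 909.0, 906.5],
    "Adj Close": [910.0, 909.0, 906.5],
    "Volume":    [32000, 28000, 35000],
})
print(mkl_data.shape)   # (3, 7)
```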
In [11]:
np.shape(mkl_data)
Out[11]:
The imported dataset has 3201 records, each of which has 7 columns. This implies that the stock prices of Markel Corp, USA for 3201 trading days are present in the data.
Let's get the list of the columns.
In [25]:
columns = mkl_data.columns.values #returns a numpy.ndarray object with the column names of the dataset
print (columns)
print(type(columns))
print()
print (mkl_data)
The dataframe lists the opening price, closing price, volume of units sold and highest and lowest prices per day.
In [13]:
print("The mean of opening prices of Markel Corp - ", np.mean(mkl_data['Open']))
print("The mean of closing prices of Markel Corp - ", np.mean(mkl_data['Close']))
print("The mean volume of units sold per day - ", np.mean(mkl_data['Volume']))
print()
mkl_data.mean() #we can simply get the mean of all columns with numerical values
Out[13]:
The mean price of the stock is probably not a good parameter for judging a stock. However, the MEAN volume of units sold in a single day, which here is almost 33000, is a good indicator of how much interest there is in the stock.
In [36]:
print("The highest price which the stock of Markel Corp was bought/sold at is: ", mkl_data['High'].max())
print("The lowest price which the stock of Markel Corp was bought/sold at is: ", mkl_data['Low'].min())
print()
print("The highest number of units sold in a single day :", mkl_data['Volume'].max())
print("The lowest number of units sold in a single day :", mkl_data['Volume'].min())
In [66]:
print(mkl_data['Close'].std()) #Standard Deviation of Closing Price
print(mkl_data['Close'].var()) #Variance of Closing Price
A high STANDARD DEVIATION indicates volatile stock prices. This is great for aggressive investors, while conservative investors would opt for a company with less volatile stock. Variance describes the same spread; it is simply the square of the standard deviation.
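The square relationship is easy to verify on a toy series of closing prices:

```python
import numpy as np
import pandas as pd

close = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0])  # toy closing prices

std = close.std()    # sample standard deviation (ddof=1 by default in pandas)
var = close.var()    # sample variance
print(std, var)

# variance is the square of the standard deviation
print(np.isclose(var, std ** 2))   # True
```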
In [30]:
import matplotlib.pyplot as plt
mkl_data.iloc[2800:3201].plot(x='Date',y='Close')
plt.show()
Plotting the last 400 or so closing prices of Markel Corp shows a generally increasing trend. The price has also dropped considerably at times. Seems like a volatile stock! If you are an aggressive investor, you should go for it!
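When a series is this volatile, a common way to make the underlying trend easier to see is a rolling mean. A minimal sketch on a toy stand-in for the Close column:

```python
import pandas as pd

# toy closing-price series standing in for mkl_data['Close']
close = pd.Series([10, 11, 9, 12, 13, 12, 15, 14, 16, 15], dtype=float)

# a 3-day rolling mean smooths day-to-day swings; the first two values
# are NaN because the window is not yet full
smoothed = close.rolling(window=3).mean()
print(smoothed)
```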
In [39]:
mkl_data.corr()
Out[39]:
This shows the correlation between the columns of the dataframe.
From this we can see a strong correlation among the prices. For example, we can see how the opening price relates to the high, low, and closing prices. We can also see that there is very low correlation between the price of the stock and the daily volume of units traded.
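The same pattern can be reproduced on a toy frame where two columns move together and a third is unrelated:

```python
import pandas as pd

# toy frame: 'open' and 'close' move together, 'volume' is unrelated
df = pd.DataFrame({
    "open":   [10.0, 11.0, 12.0, 13.0, 14.0],
    "close":  [10.5, 11.4, 12.6, 13.3, 14.5],
    "volume": [300, 120, 410, 90, 250],
})

corr = df.corr()
print(corr)   # open/close near 1.0; volume correlations much weaker
```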
In [42]:
mkl_data.describe()
#We can simply summarize the various statistical values using the describe() method
Out[42]: