
Tutorial 4 - Pandas and Numpy Basics

Load the data from http://opendata.dc.gov/datasets, which I have included in this GitHub repo,
into a dataframe. (The file is available in the directory ./data/ccp_current_csv.csv)
In [1]:
import pandas as pd
import numpy as np

ccpdata = pd.read_csv("./data/ccp_current_csv.csv")
ccpdata.head(10) #first 10 records, to give a preview of the data
Out[1]:
STREET QUADRANT FROMINTERSECTION TOINTERSECTION NOOFBLOCKS WORKDESCRIPTION YEARBUDGETED PERCENTCOMPLETED STATUS MILES ... YCOORD LATITUDE LONGITUDE PROJECTNOTE PROJECTNAME PROJECTMGR PROJECTID TEAMREP EMAIL CONTACTPHONE
0 13th St NW Kennedy St Longfellow St 1 Pavement Resurface/restoration 2004 100 Completed 0.07 ... 143205.11981 38.956747 -77.029694 NaN FedAid Paul Stephenson 5834 O. Hill NaN 202-671-4591
1 S St NW North Capitol St 3rd St 3 Regular Cover 1999 100 Completed 0.33 ... 138468.90256 38.914085 -77.012411 NaN NaN NaN 2467 NaN NaN NaN
2 Rhode Island Ave NE Brentwood Rd (W) Brentwood Rd (M) 1 Pavement Resurface/restoration 2004 100 Completed 0.09 ... 139574.10659 38.924041 -76.987221 NaN Fedral Aid Citywide Pavement Restoration Mohamed Abdullahi 5486 Said Cherifi said.cherifi@dc.gov 202-671-4611
3 Mill Rd NW 27th St Cul De Sac 1 Pavement Resurface/restoration 2004 100 Completed 0.07 ... 138132.90198 38.911046 -77.054077 POKA-2003-B-0048-JJ Citywide Pavement Restoration Contract (Local ... Maduabuchi Udeh 5796 Michael Jelen Michael.Jelen@dc.gov 202-671-4542
4 Rhode Island Ave NE Brentwood Rd (M) 14th St Evarts St Montan 2 Pavement Resurface/restoration 2005 100 Completed 0.16 ... 139657.25318 38.924790 -76.985068 NaN NaN NaN 8237 NaN NaN NaN
5 Mellon St SE 5th St Martin Luther King Jr Ave 1 Pavement Resurface/restoration 2005 100 Completed 0.20 ... 130948.25958 38.846337 -76.998463 NaN CIP_2004-2009 FY-05 4175 NaN NaN NaN
6 Nebraska Ave NW Northampton St Oliver St 1 Pavement Resurface/restoration 2004 100 Completed 0.07 ... 144260.95520 38.966246 -77.061675 NaN Citywide Pavement Restoration Paul Stephenson 5296 M. Howard and O. hill NaN 202-671-4591
7 Military Rd NW 30th St Broad Branch Rd 4 Pavement Resurface/restoration 2004 100 Completed 0.35 ... 143680.41824 38.961016 -77.062994 NaN FedAid_6YR_CIP Paul Stephenson 4762 O. Hill NaN 202-671-4591
8 Nebraska Ave NW Oliver St Utah Ave 1 Pavement Resurface/restoration 2005 100 Completed 0.04 ... 144332.34858 38.966889 -77.061126 NaN Citywide Pavement Restoratin paul Stephenson 7867 M. Howard and O. Hill NaN 202-671-4591
9 16th St NW Alaska Ave Primrose Rd 13 Reconstruction 1999 100 Completed 0.97 ... 145993.53492 38.981864 -77.036344 Reconstruction of 16th St. from Alaska to Prim... Recons of 16TH St. from Alaska to Prime Rose ... Eskender Konjit 2349 Porter Curtis Porter.Curtis@Dc.gov 202-671-4581
10 rows × 27 columns

What is its shape and what does that mean?

In [2]:
print(ccpdata.shape)
#we could alternatively use the numpy.shape() function
print(np.shape(ccpdata))

print()
print("No. of Rows:", len(ccpdata)) #prints the number of rows
print("No. of Columns:", len(ccpdata.columns)) #prints the number of columns
(465, 27)
(465, 27)

No. of Rows: 465
No. of Columns: 27
The shape of a dataset is its dimensionality: the number of rows and the number of columns, which gives us an idea of the size of the dataset and how the data are arranged.
The statement ccpdata.shape returns a tuple (number of rows, number of columns). From the output above, we can see that the dataset has 465 rows and 27 columns; that is, there are 465 records in the data and each record has 27 columns.
Note that this is different from the statistical "shape" of a distribution, which is described with values such as the mean, median, mode, and standard deviation.
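Since shape is a plain Python tuple, it can also be unpacked directly; a small sketch:

nrows, ncols = ccpdata.shape
print("Rows:", nrows)     # 465
print("Columns:", ncols)  # 27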

What are the number of rows in each 'QUADRANT' ?

In [3]:
#We can use the value_counts() method to calculate the frequency of values in a column
print ("QUAD. No. of Rows")
print (ccpdata['QUADRANT'].value_counts())


print()
#We could also get the counts and store into a dictionary as follows:
rowcount = ccpdata['QUADRANT'].value_counts().to_dict()
print (rowcount)
QUAD. No. of Rows
NW    195
NE    163
SE     84
BN     12
SW     11
Name: QUADRANT, dtype: int64

{'NW': 195, 'NE': 163, 'SE': 84, 'BN': 12, 'SW': 11}
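If relative shares are more useful than raw counts, value_counts() can also normalize; a small sketch:

# fraction of the rows that falls in each quadrant
print(ccpdata['QUADRANT'].value_counts(normalize=True))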

Array math demonstration

For two arrays
a = [1, 2, 3, 4] type=float
b = [5, 6, 7, 8] type=float
perform the following array operations using numpy.
(Show both the operator form and the function form of numpy, e.g. the addition operator + vs the addition function numpy.add().)

addition a+b

In [4]:
import numpy as np
a = np.array([1.,2.,3.,4.])
b = np.array([5.0,6.0,7.0,8.0])
print(type(a))
print(type(b))


#adding a and b using numpy.add() function
print(np.add(a,b))

#adding a and b using the operator +
print(a+b)

#We can also use the astype() function to change the datatype of an array. 
#For instance, we can declare a as an integer type array and convert it to a float type array
a = np.array([1,2,3,4])
print("Datatype of 1st element of a is ", type(a[0]))
print(a)
a = a.astype(float)
print("Datatype of 1st element of a is ", type(a[0]))
print(a)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
[ 6.  8. 10. 12.]
[ 6.  8. 10. 12.]
Datatype of 1st element of a is  <class 'numpy.int32'>
[1 2 3 4]
Datatype of 1st element of a is  <class 'numpy.float64'>
[1. 2. 3. 4.]
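As an aside, the dtype can also be fixed at construction time instead of converting afterwards; a small sketch:

# declare the element type up front rather than casting with astype()
a = np.array([1, 2, 3, 4], dtype=float)
print(a)        # [1. 2. 3. 4.]
print(a.dtype)  # float64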

subtraction a-b

In [5]:
import numpy as np
a = np.array([1.,2.,3.,4.])
b = np.array([5.0,6.0,7.0,8.0])

#subtracting b from a, and a from b, using the numpy.subtract() function
print(np.subtract(a,b))
print(np.subtract(b,a))

#subtracting using the operator -
print(a - b)
print(b - a)
[-4. -4. -4. -4.]
[4. 4. 4. 4.]
[-4. -4. -4. -4.]
[4. 4. 4. 4.]

multiplication a*b

In [6]:
import numpy as np
a = np.array([1.,2.,3.,4.])
b = np.array([5.0,6.0,7.0,8.0])

#multiplying elements of a by elements of b using numpy.multiply() function
print(np.multiply(a,b))

#multiplying elements of a by elements of b using the operator *
print(a * b)
[ 5. 12. 21. 32.]
[ 5. 12. 21. 32.]

division a/b

In [7]:
import numpy as np
a = np.array([1.,2.,3.,4.])
b = np.array([5.0,6.0,7.0,8.0])

#dividing elements of a by elements of b using the numpy.divide() function
print(np.divide(a,b))

#dividing elements of a by elements of b using the operator /
print(a / b)
[0.2        0.33333333 0.42857143 0.5       ]
[0.2        0.33333333 0.42857143 0.5       ]

modulo a%b

In [8]:
import numpy as np
a = np.array([1.,2.,3.,4.])
b = np.array([5.0,6.0,7.0,8.0])

#taking the elementwise remainder of a modulo b using the numpy.mod() function
print(np.mod(a,b))

#taking the remainder using the operator %
print(a % b) #each element of a is smaller than its counterpart in b, so a % b is simply a
[1. 2. 3. 4.]
[1. 2. 3. 4.]

power a^b

In [9]:
import numpy as np
a = np.array([1.,2.,3.,4.])
b = np.array([5.0,6.0,7.0,8.0])

#raising elements of a to the powers in b using the numpy.power() function
print(np.power(a,b))

#raising to a power using the operator **
print(a ** b)
print(a ^ b) #the caret is not the power operator in python; it is bitwise xor
[1.0000e+00 6.4000e+01 2.1870e+03 6.5536e+04]
[1.0000e+00 6.4000e+01 2.1870e+03 6.5536e+04]

TypeError                                 Traceback (most recent call last)
<ipython-input-9-54f4a23fef87> in <module>()
      8 #raising to a power using the operator **
      9 print(a ** b)
---> 10 print(a ^ b) #the caret is not the power operator in python; it is bitwise xor

TypeError: ufunc 'bitwise_xor' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
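The caret does work on integer arrays, where it performs elementwise bitwise XOR; a minimal sketch:

import numpy as np

# bitwise xor is defined for integer dtypes
ai = np.array([1, 2, 3, 4])
bi = np.array([5, 6, 7, 8])
print(ai ^ bi)                 # [ 4  4  4 12]
print(np.bitwise_xor(ai, bi))  # same result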

Provide an interesting analysis of the data columns (frequency or averages)
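The cells below use mkl_data, a dataframe of Markel Corp (MKL) daily stock prices; the cell that loads it is not shown in this post. A minimal sketch of the assumed load step (the file path here is an assumption, not the actual source):

import pandas as pd

# assumed path; substitute the actual location of the MKL price file
mkl_data = pd.read_csv("./data/mkl_stock.csv")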

In [11]:
np.shape(mkl_data)
Out[11]:
(3201, 7)
The imported dataset has 3201 records, each of which has 7 columns. This means the data contain the stock price of Markel Corp, USA for 3201 trading days.
Let's get the list of the columns.
In [25]:
columns = mkl_data.columns.values #returns a numpy.ndarray object with the column names of the dataset
print (columns)
print(type(columns))
print()
print (mkl_data)
['Date' 'Open' 'High' 'Low' 'Close' 'Volume' 'OpenInt']
<class 'numpy.ndarray'>

            Date     Open     High      Low    Close  Volume  OpenInt
0      2/25/2005   365.70   365.70   359.50   360.10   14800        0
1      2/28/2005   361.00   361.00   356.25   357.50   11200        0
2       3/1/2005   358.00   362.00   358.00   359.26   20600        0
3       3/2/2005   360.25   361.50   358.00   358.00   22600        0
4       3/3/2005   357.00   364.00   356.50   357.51   17300        0
5       3/4/2005   358.50   359.50   358.00   358.35    8700        0
6       3/7/2005   359.35   360.50   358.17   359.00   15000        0
7       3/8/2005   360.00   360.99   358.50   360.00   25500        0
8       3/9/2005   360.00   360.00   357.05   357.05    8800        0
9      3/10/2005   357.20   359.00   357.00   358.99    5400        0
10     3/11/2005   359.50   362.55   357.50   357.75   43300        0
11     3/14/2005   358.75   368.00   358.60   367.91   40900        0
12     3/15/2005   367.91   373.00   367.00   367.50   26500        0
13     3/16/2005   367.50   367.50   365.00   365.75   10500        0
14     3/17/2005   365.00   371.00   364.96   365.10   40200        0
15     3/18/2005   366.10   366.10   360.60   362.00   14600        0
16     3/21/2005   362.00   362.00   357.49   358.25    6500        0
17     3/22/2005   357.25   357.25   350.05   352.00   30100        0
18     3/23/2005   352.00   352.00   346.07   347.00   23800        0
19     3/24/2005   348.00   348.50   341.00   345.00   39900        0
20     3/28/2005   346.00   348.50   340.01   342.65   13500        0
21     3/29/2005   343.65   345.00   342.00   343.50   12100        0
22     3/30/2005   342.50   343.75   339.00   342.71   28700        0
23     3/31/2005   343.71   346.30   340.30   345.21   15400        0
24      4/1/2005   346.00   348.50   344.00   347.90   15200        0
25      4/4/2005   347.00   347.49   345.50   347.49    6600        0
26      4/5/2005   346.50   349.50   346.50   349.50   10500        0
27      4/6/2005   347.75   349.45   347.75   348.01    4400        0
28      4/7/2005   347.10   355.20   346.51   349.65   45200        0
29      4/8/2005   350.00   350.50   343.50   345.80   65200        0
...          ...      ...      ...      ...      ...     ...      ...
3171   10/2/2017  1070.08  1079.74  1060.47  1079.74   13982        0
3172   10/3/2017  1079.74  1085.00  1070.01  1077.45   27309        0
3173   10/4/2017  1080.15  1086.99  1075.75  1078.16   17683        0
3174   10/5/2017  1077.22  1084.39  1070.23  1073.55   27324        0
3175   10/6/2017  1076.90  1084.94  1071.06  1076.53   18707        0
3176   10/9/2017  1074.22  1078.36  1073.68  1077.13   10584        0
3177  10/10/2017  1083.16  1094.99  1081.83  1090.70   34420        0
3178  10/11/2017  1078.00  1105.23  1078.00  1093.89   27916        0
3179  10/12/2017  1093.00  1099.66  1069.71  1069.79   37609        0
3180  10/13/2017  1069.17  1079.98  1068.88  1070.98   18936        0
3181  10/16/2017  1076.22  1077.15  1061.83  1073.25   24738        0
3182  10/17/2017  1073.42  1077.92  1064.41  1068.44   15045        0
3183  10/18/2017  1070.81  1087.20  1061.50  1061.92   26015        0
3184  10/19/2017  1060.00  1071.87  1056.53  1063.35   17835        0
3185  10/20/2017  1065.19  1079.91  1063.63  1078.23   21568        0
3186  10/23/2017  1075.80  1081.61  1071.19  1077.42   19850        0
3187  10/24/2017  1080.00  1084.12  1060.89  1063.57   25418        0
3188  10/25/2017  1065.89  1071.55  1054.20  1067.81   30720        0
3189  10/26/2017  1057.13  1092.75  1054.76  1090.61   39158        0
3190  10/27/2017  1090.11  1096.73  1085.40  1094.08   30301        0
3191  10/30/2017  1093.45  1104.58  1090.41  1097.02   24248        0
3192  10/31/2017  1096.00  1096.00  1082.12  1084.30   40308        0
3193   11/1/2017  1086.77  1090.96  1076.68  1080.07   29658        0
3194   11/2/2017  1084.15  1097.87  1074.99  1090.39   15901        0
3195   11/3/2017  1091.20  1099.22  1072.23  1073.00   46032        0
3196   11/6/2017  1070.44  1088.53  1070.44  1084.32   19723        0
3197   11/7/2017  1076.32  1091.73  1076.07  1082.48   18262        0
3198   11/8/2017  1080.80  1088.96  1072.30  1073.60   24107        0
3199   11/9/2017  1073.00  1082.19  1064.66  1076.52   21603        0
3200  11/10/2017  1075.54  1081.66  1060.15  1063.41   24759        0

[3201 rows x 7 columns]
The dataframe lists the opening price, closing price, volume of shares traded, and the highest and lowest prices for each day.
In [13]:
print "The mean of opening prices of Markel Corp - ", np.mean(mkl_data['Open'])
print "The mean of closing prices of Markel Corp - ", np.mean(mkl_data['Close']) 
print "The mean volume of units sold per day - ", (np.mean(mkl_data['Volume']))
print
mkl_data.mean() #we can simly get the mean of all columns with numerical values
The mean of opening prices of Markel Corp -  529.084975008
The mean of closing prices of Markel Corp -  529.306919713
The mean volume of units sold per day -  32944.4667291

Out[13]:
Open         529.084975
High         533.681481
Low          524.592642
Close        529.306920
Volume     32944.466729
OpenInt        0.000000
dtype: float64
The mean price of the stock is probably not a good parameter for judging a stock. However, the mean volume of shares traded in a single day, here almost 33,000, is a good indicator of how much interest there is in the stock.
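For prices themselves, a moving average is usually more informative than the overall mean; a minimal sketch of a 30-day rolling mean of the closing price (the window length is an arbitrary choice):

# 30-day simple moving average of the closing price
rolling_close = mkl_data['Close'].rolling(window=30).mean()
print(rolling_close.tail())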
In [36]:
print "The highest price which the stock of Markel Corp was bought/sold at is: ", mkl_data['High'].max()
print "The lowest price which the stock of Markel Corp was bought/sold at is: ", mkl_data['Low'].min()
print
print "The highest number of units sold in a single day :", mkl_data['Volume'].max()
print "The lowest number of units sold in a single day :", mkl_data['Volume'].min()
 The highest price which the stock of Markel Corp was bought/sold at is:  1105.23
The lowest price which the stock of Markel Corp was bought/sold at is:  208.77

The highest number of units sold in a single day : 593000
The lowest number of units sold in a single day : 1117
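To find when these extremes occurred, idxmax() and idxmin() return the row labels of the maximum and minimum, which can then be looked up in the Date column; a minimal sketch:

# dates of the busiest and quietest trading days
busiest = mkl_data.loc[mkl_data['Volume'].idxmax(), 'Date']
quietest = mkl_data.loc[mkl_data['Volume'].idxmin(), 'Date']
print("Highest volume on:", busiest)
print("Lowest volume on:", quietest)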
In [66]:
print(mkl_data['Close'].std()) #Standard Deviation of the Closing Price
print(mkl_data['Close'].var()) #Variance of the Closing Price
221.99735196
49282.8242771
A high standard deviation indicates volatile stock prices. That is attractive to aggressive investors, while conservative investors would opt for a company with less volatile stock. Variance conveys essentially the same information, since it is simply the square of the standard deviation.
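Note that the standard deviation of the raw price level is dominated by the long-term trend; volatility is more commonly measured on daily returns instead. A minimal sketch:

# daily percentage returns and their standard deviation (daily volatility)
daily_returns = mkl_data['Close'].pct_change()
print(daily_returns.std())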
In [30]:
import matplotlib.pyplot as plt
mkl_data.iloc[2800:3201].plot(x='Date',y='Close')
plt.show()
Plotting the last ~400 closing prices of Markel Corp (rows 2800 to 3200) shows a generally increasing trend, though the price has also dropped considerably at times. It looks like a volatile stock; if you are an aggressive investor, you might go for it!
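Because Date is read in as plain strings, the x-axis above is spaced by row number rather than by calendar date; parsing the column first gives a proper time axis. A minimal sketch (assuming pandas' default format inference handles the m/d/yyyy dates, which it does here):

import pandas as pd
import matplotlib.pyplot as plt

# parse the Date strings so the x-axis is spaced chronologically
mkl_data['Date'] = pd.to_datetime(mkl_data['Date'])
mkl_data.iloc[2800:].plot(x='Date', y='Close')
plt.show()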
In [39]:
mkl_data.corr()
Out[39]:
             Open      High       Low     Close    Volume  OpenInt
Open     1.000000  0.999731  0.999706  0.999447  0.172553      NaN
High     0.999731  1.000000  0.999551  0.999725  0.178029      NaN
Low      0.999706  0.999551  1.000000  0.999694  0.165892      NaN
Close    0.999447  0.999725  0.999694  1.000000  0.172094      NaN
Volume   0.172553  0.178029  0.165892  0.172094  1.000000      NaN
OpenInt       NaN       NaN       NaN       NaN       NaN      NaN
This shows the pairwise correlations between the columns of the dataframe.
From it we can see a strong correlation among the prices: the opening price moves almost in lockstep with the high, low, and closing prices. We can also see that there is a very low correlation between the price of the stock and the daily trading volume. (OpenInt is constant at zero, so its correlations are undefined, hence the NaNs.)
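One quick way to scan the correlation matrix visually is a heatmap; a minimal sketch using matplotlib's matshow():

import matplotlib.pyplot as plt

corr = mkl_data.corr()
plt.matshow(corr)  # colour-codes each correlation coefficient
plt.xticks(range(len(corr)), corr.columns, rotation=45)
plt.yticks(range(len(corr)), corr.columns)
plt.colorbar()
plt.show()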
In [42]:
mkl_data.describe()
#We can simply summarize the various statistical values using the describe() method
Out[42]:
               Open         High          Low        Close         Volume  OpenInt
count   3201.000000  3201.000000  3201.000000  3201.000000    3201.000000   3201.0
mean     529.084975   533.681481   524.592642   529.306920   32944.466729      0.0
std      221.802516   222.936429   220.894447   221.997352   27408.738754      0.0
min      215.610000   220.050000   208.770000   211.000000    1117.000000      0.0
25%      351.900000   356.000000   347.550000   351.340000   17369.000000      0.0
50%      448.100000   451.850000   443.000000   447.750000   26500.000000      0.0
75%      648.820000   651.970000   645.660000   649.250000   40549.000000      0.0
max     1096.000000  1105.230000  1090.410000  1097.020000  593000.000000      0.0

Popular posts from this blog

Tutorial 6 - Statistics and Probability

Statistics and Probability with Python HW 6 Statistics and probability homework ¶ Complete homework notebook in a homework directory with your name and zip up the homework directory and submit it to our class blackboard/elearn site. Complete all the parts 6.1 to 6.5 for score of 3. Investigate plotting, linearegression, or complex matrix manipulation to get a score of 4 or cover two additional investigations for a score of 5. 6.1 Coin flipping ¶ 6.1.1 ¶ Write a function, flip_sum, which generates $n$ random coin flips from a fair coin and then returns the number of heads. A fair coin is defined to be a coin where $P($heads$)=\frac{1}{2}$ The output type should be a numpy integer, hint: use random.rand() In [4]: import numpy as np import random """def random_flip(): return random.choice(["H", "T"]) def flip_sum(n): heads_count = 0 ...

Tutorial 5 - Matplotlib

Matplotlib Tutorial In [13]: % matplotlib inline import pandas as pd import numpy as np import matplotlib as mpl import matplotlib.pyplot as plt # Setting some Pandas options pd . set_option ( 'display.notebook_repr_html' , False ) pd . set_option ( 'display.max_columns' , 25 ) pd . set_option ( 'display.max_rows' , 25 ) Homework 4 ¶ Couple of reference site: http://matplotlib.org/examples/pylab_examples/ http://docs.scipy.org/doc/numpy/ Homework 4.1 ¶ 4.1.a Create a figure with two subplots in a row. One shows a sine wave of x from with x = 0 ... 2*pi the other shows the tagent of x with the same range. Label the figures. Should look something like: We can follow the following steps to get the required graphs showing sine and tangents of x: Create a numpy array x with values from 0 to 2*pi with 0.001 as step value Set the height and w...

Domain Research - Stock Market Prediction

Hi, as part of my research on a domain of Big Data implementation, I chose Stock Market Prediction. Here I present to you the things that I have learned during my research in the domain. Can stock market be predicted? Early researches on stock market prediction revolved around whether it could be predicted. One of such researches suggested that “short term stock price movements were governed by the  random walk hypothesis  and thus were unpredictable”. Another stated that “the stock price reflected completed market information and the market behaved efficiently so that instantaneous price corrections to equilibrium would make stock prediction useless.” In simple terms, the researches inferred that since the market was affected by a lot of factors which were random predicting the stock market is almost impossible. However, researches carried out later (Brown & Jennings 1998; Abarbanel & Bushee 1998) made use of ...