
Tutorial 4 - Pandas and Numpy Basics

Load the data from http://opendata.dc.gov/datasets, which I have included in this GitHub repo,
into a dataframe. (The file is available in the directory ./data/ccp_current_csv.csv)
In [1]:
import pandas as pd
import numpy as np

ccpdata = pd.read_csv("./data/ccp_current_csv.csv")
ccpdata.head(10) #first 10 records, to give a preview of the data
Out[1]:
STREET QUADRANT FROMINTERSECTION TOINTERSECTION NOOFBLOCKS WORKDESCRIPTION YEARBUDGETED PERCENTCOMPLETED STATUS MILES ... YCOORD LATITUDE LONGITUDE PROJECTNOTE PROJECTNAME PROJECTMGR PROJECTID TEAMREP EMAIL CONTACTPHONE
0 13th St NW Kennedy St Longfellow St 1 Pavement Resurface/restoration 2004 100 Completed 0.07 ... 143205.11981 38.956747 -77.029694 NaN FedAid Paul Stephenson 5834 O. Hill NaN 202-671-4591
1 S St NW North Capitol St 3rd St 3 Regular Cover 1999 100 Completed 0.33 ... 138468.90256 38.914085 -77.012411 NaN NaN NaN 2467 NaN NaN NaN
2 Rhode Island Ave NE Brentwood Rd (W) Brentwood Rd (M) 1 Pavement Resurface/restoration 2004 100 Completed 0.09 ... 139574.10659 38.924041 -76.987221 NaN Fedral Aid Citywide Pavement Restoration Mohamed Abdullahi 5486 Said Cherifi said.cherifi@dc.gov 202-671-4611
3 Mill Rd NW 27th St Cul De Sac 1 Pavement Resurface/restoration 2004 100 Completed 0.07 ... 138132.90198 38.911046 -77.054077 POKA-2003-B-0048-JJ Citywide Pavement Restoration Contract (Local ... Maduabuchi Udeh 5796 Michael Jelen Michael.Jelen@dc.gov 202-671-4542
4 Rhode Island Ave NE Brentwood Rd (M) 14th St Evarts St Montan 2 Pavement Resurface/restoration 2005 100 Completed 0.16 ... 139657.25318 38.924790 -76.985068 NaN NaN NaN 8237 NaN NaN NaN
5 Mellon St SE 5th St Martin Luther King Jr Ave 1 Pavement Resurface/restoration 2005 100 Completed 0.20 ... 130948.25958 38.846337 -76.998463 NaN CIP_2004-2009 FY-05 4175 NaN NaN NaN
6 Nebraska Ave NW Northampton St Oliver St 1 Pavement Resurface/restoration 2004 100 Completed 0.07 ... 144260.95520 38.966246 -77.061675 NaN Citywide Pavement Restoration Paul Stephenson 5296 M. Howard and O. hill NaN 202-671-4591
7 Military Rd NW 30th St Broad Branch Rd 4 Pavement Resurface/restoration 2004 100 Completed 0.35 ... 143680.41824 38.961016 -77.062994 NaN FedAid_6YR_CIP Paul Stephenson 4762 O. Hill NaN 202-671-4591
8 Nebraska Ave NW Oliver St Utah Ave 1 Pavement Resurface/restoration 2005 100 Completed 0.04 ... 144332.34858 38.966889 -77.061126 NaN Citywide Pavement Restoratin paul Stephenson 7867 M. Howard and O. Hill NaN 202-671-4591
9 16th St NW Alaska Ave Primrose Rd 13 Reconstruction 1999 100 Completed 0.97 ... 145993.53492 38.981864 -77.036344 Reconstruction of 16th St. from Alaska to Prim... Recons of 16TH St. from Alaska to Prime Rose ... Eskender Konjit 2349 Porter Curtis Porter.Curtis@Dc.gov 202-671-4581
10 rows × 27 columns

What is its shape and what does that mean?

In [2]:
print(ccpdata.shape)
#we could alternatively use the numpy.shape() function
print(np.shape(ccpdata))

print()
print("No. of Rows:", len(ccpdata)) #prints the number of rows
print("No. of Columns:", len(ccpdata.columns)) #prints the number of columns
(465, 27)
(465, 27)

No. of Rows: 465
No. of Columns: 27
The shape of a dataset is its dimensionality: the number of rows and the number of columns, which gives us an idea of the size of the dataset and how the data are arranged.
The statement ccpdata.shape returns a tuple (number of rows, number of columns). From the output above, we can see that the dataset has 465 rows and 27 columns; that is, there are 465 records in the data and each record has 27 columns.
Note that this is different from the statistical "shape" of a distribution, which is described with values such as the mean, median, mode, and standard deviation.
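Since shape is a plain Python tuple, it can also be unpacked directly; a small sketch:

nrows, ncols = ccpdata.shape
print("Rows:", nrows)     # 465
print("Columns:", ncols)  # 27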

What are the number of rows in each 'QUADRANT' ?

In [3]:
#We can use the value_counts() method to calculate the frequency of values in a column
print ("QUAD. No. of Rows")
print (ccpdata['QUADRANT'].value_counts())


print()
#We could also get the counts and store into a dictionary as follows:
rowcount = ccpdata['QUADRANT'].value_counts().to_dict()
print (rowcount)
QUAD. No. of Rows
NW    195
NE    163
SE     84
BN     12
SW     11
Name: QUADRANT, dtype: int64

{'NW': 195, 'NE': 163, 'SE': 84, 'BN': 12, 'SW': 11}
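If relative shares are more useful than raw counts, value_counts() can also normalize; a small sketch:

# fraction of the rows that falls in each quadrant
print(ccpdata['QUADRANT'].value_counts(normalize=True))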

Array math demonstration

For two arrays
a = [1, 2, 3, 4] type=float
b = [5, 6, 7, 8] type=float
perform the following array operations using numpy.
(Show both the operator form and the function form of numpy, e.g. the addition operator + vs the addition function numpy.add().)

addition a+b

In [4]:
import numpy as np
a = np.array([1.,2.,3.,4.])
b = np.array([5.0,6.0,7.0,8.0])
print(type(a))
print(type(b))


#adding a and b using numpy.add() function
print(np.add(a,b))

#adding a and b using the operator +
print(a+b)

#We can also use the astype() function to change the datatype of an array. 
#For instance, we can declare a as an integer type array and convert it to a float type array
a = np.array([1,2,3,4])
print("Datatype of 1st element of a is ", type(a[0]))
print(a)
a = a.astype(float)
print("Datatype of 1st element of a is ", type(a[0]))
print(a)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
[ 6.  8. 10. 12.]
[ 6.  8. 10. 12.]
Datatype of 1st element of a is  <class 'numpy.int32'>
[1 2 3 4]
Datatype of 1st element of a is  <class 'numpy.float64'>
[1. 2. 3. 4.]
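As an aside, the dtype can also be fixed at construction time instead of converting afterwards; a small sketch:

# declare the element type up front rather than casting with astype()
a = np.array([1, 2, 3, 4], dtype=float)
print(a)        # [1. 2. 3. 4.]
print(a.dtype)  # float64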

subtraction a-b

In [5]:
import numpy as np
a = np.array([1.,2.,3.,4.])
b = np.array([5.0,6.0,7.0,8.0])

#subtracting b from a, and a from b, using the numpy.subtract() function
print(np.subtract(a,b))
print(np.subtract(b,a))

#subtracting using the operator -
print(a - b)
print(b - a)
[-4. -4. -4. -4.]
[4. 4. 4. 4.]
[-4. -4. -4. -4.]
[4. 4. 4. 4.]

multiplication a*b

In [6]:
import numpy as np
a = np.array([1.,2.,3.,4.])
b = np.array([5.0,6.0,7.0,8.0])

#multiplying elements of a by elements of b using numpy.multiply() function
print(np.multiply(a,b))

#multiplying elements of a by elements of b using the operator *
print(a * b)
[ 5. 12. 21. 32.]
[ 5. 12. 21. 32.]

division a/b

In [7]:
import numpy as np
a = np.array([1.,2.,3.,4.])
b = np.array([5.0,6.0,7.0,8.0])

#dividing elements of a by elements of b using the numpy.divide() function
print(np.divide(a,b))

#dividing elements of a by elements of b using the operator /
print(a / b)
[0.2        0.33333333 0.42857143 0.5       ]
[0.2        0.33333333 0.42857143 0.5       ]

modulo a%b

In [8]:
import numpy as np
a = np.array([1.,2.,3.,4.])
b = np.array([5.0,6.0,7.0,8.0])

#taking the elementwise remainder of a modulo b using the numpy.mod() function
print(np.mod(a,b))

#taking the remainder using the operator %
print(a % b) #each element of a is smaller than its counterpart in b, so a % b is simply a
[1. 2. 3. 4.]
[1. 2. 3. 4.]

power a^b

In [9]:
import numpy as np
a = np.array([1.,2.,3.,4.])
b = np.array([5.0,6.0,7.0,8.0])

#raising elements of a to the powers in b using the numpy.power() function
print(np.power(a,b))

#raising to a power using the operator **
print(a ** b)
print(a ^ b) #the caret is not the power operator in python; it is bitwise xor
[1.0000e+00 6.4000e+01 2.1870e+03 6.5536e+04]
[1.0000e+00 6.4000e+01 2.1870e+03 6.5536e+04]

TypeError                                 Traceback (most recent call last)
<ipython-input-9-54f4a23fef87> in <module>()
      8 #raising to a power using the operator **
      9 print(a ** b)
---> 10 print(a ^ b) #the caret is not the power operator in python; it is bitwise xor

TypeError: ufunc 'bitwise_xor' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
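The caret does work on integer arrays, where it performs elementwise bitwise XOR; a minimal sketch:

import numpy as np

# bitwise xor is defined for integer dtypes
ai = np.array([1, 2, 3, 4])
bi = np.array([5, 6, 7, 8])
print(ai ^ bi)                 # [ 4  4  4 12]
print(np.bitwise_xor(ai, bi))  # same result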

Provide an interesting analysis of the data columns (frequency or averages)
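The cells below use mkl_data, a dataframe of Markel Corp (MKL) daily stock prices; the cell that loads it is not shown in this post. A minimal sketch of the assumed load step (the file path here is an assumption, not the actual source):

import pandas as pd

# assumed path; substitute the actual location of the MKL price file
mkl_data = pd.read_csv("./data/mkl_stock.csv")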

In [11]:
np.shape(mkl_data)
Out[11]:
(3201, 7)
The imported dataset has 3201 records, each of which has 7 columns. This means the data contain the stock price of Markel Corp, USA for 3201 trading days.
Let's get the list of the columns.
In [25]:
columns = mkl_data.columns.values #returns a numpy.ndarray object with the column names of the dataset
print (columns)
print(type(columns))
print()
print (mkl_data)
['Date' 'Open' 'High' 'Low' 'Close' 'Volume' 'OpenInt']
<class 'numpy.ndarray'>

            Date     Open     High      Low    Close  Volume  OpenInt
0      2/25/2005   365.70   365.70   359.50   360.10   14800        0
1      2/28/2005   361.00   361.00   356.25   357.50   11200        0
2       3/1/2005   358.00   362.00   358.00   359.26   20600        0
3       3/2/2005   360.25   361.50   358.00   358.00   22600        0
4       3/3/2005   357.00   364.00   356.50   357.51   17300        0
5       3/4/2005   358.50   359.50   358.00   358.35    8700        0
6       3/7/2005   359.35   360.50   358.17   359.00   15000        0
7       3/8/2005   360.00   360.99   358.50   360.00   25500        0
8       3/9/2005   360.00   360.00   357.05   357.05    8800        0
9      3/10/2005   357.20   359.00   357.00   358.99    5400        0
10     3/11/2005   359.50   362.55   357.50   357.75   43300        0
11     3/14/2005   358.75   368.00   358.60   367.91   40900        0
12     3/15/2005   367.91   373.00   367.00   367.50   26500        0
13     3/16/2005   367.50   367.50   365.00   365.75   10500        0
14     3/17/2005   365.00   371.00   364.96   365.10   40200        0
15     3/18/2005   366.10   366.10   360.60   362.00   14600        0
16     3/21/2005   362.00   362.00   357.49   358.25    6500        0
17     3/22/2005   357.25   357.25   350.05   352.00   30100        0
18     3/23/2005   352.00   352.00   346.07   347.00   23800        0
19     3/24/2005   348.00   348.50   341.00   345.00   39900        0
20     3/28/2005   346.00   348.50   340.01   342.65   13500        0
21     3/29/2005   343.65   345.00   342.00   343.50   12100        0
22     3/30/2005   342.50   343.75   339.00   342.71   28700        0
23     3/31/2005   343.71   346.30   340.30   345.21   15400        0
24      4/1/2005   346.00   348.50   344.00   347.90   15200        0
25      4/4/2005   347.00   347.49   345.50   347.49    6600        0
26      4/5/2005   346.50   349.50   346.50   349.50   10500        0
27      4/6/2005   347.75   349.45   347.75   348.01    4400        0
28      4/7/2005   347.10   355.20   346.51   349.65   45200        0
29      4/8/2005   350.00   350.50   343.50   345.80   65200        0
...          ...      ...      ...      ...      ...     ...      ...
3171   10/2/2017  1070.08  1079.74  1060.47  1079.74   13982        0
3172   10/3/2017  1079.74  1085.00  1070.01  1077.45   27309        0
3173   10/4/2017  1080.15  1086.99  1075.75  1078.16   17683        0
3174   10/5/2017  1077.22  1084.39  1070.23  1073.55   27324        0
3175   10/6/2017  1076.90  1084.94  1071.06  1076.53   18707        0
3176   10/9/2017  1074.22  1078.36  1073.68  1077.13   10584        0
3177  10/10/2017  1083.16  1094.99  1081.83  1090.70   34420        0
3178  10/11/2017  1078.00  1105.23  1078.00  1093.89   27916        0
3179  10/12/2017  1093.00  1099.66  1069.71  1069.79   37609        0
3180  10/13/2017  1069.17  1079.98  1068.88  1070.98   18936        0
3181  10/16/2017  1076.22  1077.15  1061.83  1073.25   24738        0
3182  10/17/2017  1073.42  1077.92  1064.41  1068.44   15045        0
3183  10/18/2017  1070.81  1087.20  1061.50  1061.92   26015        0
3184  10/19/2017  1060.00  1071.87  1056.53  1063.35   17835        0
3185  10/20/2017  1065.19  1079.91  1063.63  1078.23   21568        0
3186  10/23/2017  1075.80  1081.61  1071.19  1077.42   19850        0
3187  10/24/2017  1080.00  1084.12  1060.89  1063.57   25418        0
3188  10/25/2017  1065.89  1071.55  1054.20  1067.81   30720        0
3189  10/26/2017  1057.13  1092.75  1054.76  1090.61   39158        0
3190  10/27/2017  1090.11  1096.73  1085.40  1094.08   30301        0
3191  10/30/2017  1093.45  1104.58  1090.41  1097.02   24248        0
3192  10/31/2017  1096.00  1096.00  1082.12  1084.30   40308        0
3193   11/1/2017  1086.77  1090.96  1076.68  1080.07   29658        0
3194   11/2/2017  1084.15  1097.87  1074.99  1090.39   15901        0
3195   11/3/2017  1091.20  1099.22  1072.23  1073.00   46032        0
3196   11/6/2017  1070.44  1088.53  1070.44  1084.32   19723        0
3197   11/7/2017  1076.32  1091.73  1076.07  1082.48   18262        0
3198   11/8/2017  1080.80  1088.96  1072.30  1073.60   24107        0
3199   11/9/2017  1073.00  1082.19  1064.66  1076.52   21603        0
3200  11/10/2017  1075.54  1081.66  1060.15  1063.41   24759        0

[3201 rows x 7 columns]
The dataframe lists the opening price, closing price, volume of shares traded, and the highest and lowest prices for each day.
In [13]:
print "The mean of opening prices of Markel Corp - ", np.mean(mkl_data['Open'])
print "The mean of closing prices of Markel Corp - ", np.mean(mkl_data['Close']) 
print "The mean volume of units sold per day - ", (np.mean(mkl_data['Volume']))
print
mkl_data.mean() #we can simly get the mean of all columns with numerical values
The mean of opening prices of Markel Corp -  529.084975008
The mean of closing prices of Markel Corp -  529.306919713
The mean volume of units sold per day -  32944.4667291

Out[13]:
Open         529.084975
High         533.681481
Low          524.592642
Close        529.306920
Volume     32944.466729
OpenInt        0.000000
dtype: float64
The mean price of the stock is probably not a good parameter for judging a stock. However, the mean volume of shares traded in a single day, here almost 33,000, is a good indicator of how much interest there is in the stock.
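For prices themselves, a moving average is usually more informative than the overall mean; a minimal sketch of a 30-day rolling mean of the closing price (the window length is an arbitrary choice):

# 30-day simple moving average of the closing price
rolling_close = mkl_data['Close'].rolling(window=30).mean()
print(rolling_close.tail())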
In [36]:
print "The highest price which the stock of Markel Corp was bought/sold at is: ", mkl_data['High'].max()
print "The lowest price which the stock of Markel Corp was bought/sold at is: ", mkl_data['Low'].min()
print
print "The highest number of units sold in a single day :", mkl_data['Volume'].max()
print "The lowest number of units sold in a single day :", mkl_data['Volume'].min()
 The highest price which the stock of Markel Corp was bought/sold at is:  1105.23
The lowest price which the stock of Markel Corp was bought/sold at is:  208.77

The highest number of units sold in a single day : 593000
The lowest number of units sold in a single day : 1117
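To find when these extremes occurred, idxmax() and idxmin() return the row labels of the maximum and minimum, which can then be looked up in the Date column; a minimal sketch:

# dates of the busiest and quietest trading days
busiest = mkl_data.loc[mkl_data['Volume'].idxmax(), 'Date']
quietest = mkl_data.loc[mkl_data['Volume'].idxmin(), 'Date']
print("Highest volume on:", busiest)
print("Lowest volume on:", quietest)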
In [66]:
print(mkl_data['Close'].std()) #Standard Deviation of the Closing Price
print(mkl_data['Close'].var()) #Variance of the Closing Price
221.99735196
49282.8242771
A high standard deviation indicates volatile stock prices. That is attractive to aggressive investors, while conservative investors would opt for a company with less volatile stock. Variance conveys essentially the same information, since it is simply the square of the standard deviation.
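Note that the standard deviation of the raw price level is dominated by the long-term trend; volatility is more commonly measured on daily returns instead. A minimal sketch:

# daily percentage returns and their standard deviation (daily volatility)
daily_returns = mkl_data['Close'].pct_change()
print(daily_returns.std())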
In [30]:
import matplotlib.pyplot as plt
mkl_data.iloc[2800:3201].plot(x='Date',y='Close')
plt.show()
Plotting the last ~400 closing prices of Markel Corp (rows 2800 to 3200) shows a generally increasing trend, though the price has also dropped considerably at times. It looks like a volatile stock; if you are an aggressive investor, you might go for it!
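Because Date is read in as plain strings, the x-axis above is spaced by row number rather than by calendar date; parsing the column first gives a proper time axis. A minimal sketch (assuming pandas' default format inference handles the m/d/yyyy dates, which it does here):

import pandas as pd
import matplotlib.pyplot as plt

# parse the Date strings so the x-axis is spaced chronologically
mkl_data['Date'] = pd.to_datetime(mkl_data['Date'])
mkl_data.iloc[2800:].plot(x='Date', y='Close')
plt.show()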
In [39]:
mkl_data.corr()
Out[39]:
             Open      High       Low     Close    Volume  OpenInt
Open     1.000000  0.999731  0.999706  0.999447  0.172553      NaN
High     0.999731  1.000000  0.999551  0.999725  0.178029      NaN
Low      0.999706  0.999551  1.000000  0.999694  0.165892      NaN
Close    0.999447  0.999725  0.999694  1.000000  0.172094      NaN
Volume   0.172553  0.178029  0.165892  0.172094  1.000000      NaN
OpenInt       NaN       NaN       NaN       NaN       NaN      NaN
This shows the pairwise correlations between the columns of the dataframe.
From it we can see a strong correlation among the prices: the opening price moves almost in lockstep with the high, low, and closing prices. We can also see that there is a very low correlation between the price of the stock and the daily trading volume. (OpenInt is constant at zero, so its correlations are undefined, hence the NaNs.)
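One quick way to scan the correlation matrix visually is a heatmap; a minimal sketch using matplotlib's matshow():

import matplotlib.pyplot as plt

corr = mkl_data.corr()
plt.matshow(corr)  # colour-codes each correlation coefficient
plt.xticks(range(len(corr)), corr.columns, rotation=45)
plt.yticks(range(len(corr)), corr.columns)
plt.colorbar()
plt.show()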
In [42]:
mkl_data.describe()
#We can simply summarize the various statistical values using the describe() method
Out[42]:
               Open         High          Low        Close         Volume  OpenInt
count   3201.000000  3201.000000  3201.000000  3201.000000    3201.000000   3201.0
mean     529.084975   533.681481   524.592642   529.306920   32944.466729      0.0
std      221.802516   222.936429   220.894447   221.997352   27408.738754      0.0
min      215.610000   220.050000   208.770000   211.000000    1117.000000      0.0
25%      351.900000   356.000000   347.550000   351.340000   17369.000000      0.0
50%      448.100000   451.850000   443.000000   447.750000   26500.000000      0.0
75%      648.820000   651.970000   645.660000   649.250000   40549.000000      0.0
max     1096.000000  1105.230000  1090.410000  1097.020000  593000.000000      0.0

Popular posts from this blog

Tutorial 6 - Statistics and Probability

Statistics and Probability with Python HW 6 Statistics and probability homework ¶ Complete homework notebook in a homework directory with your name and zip up the homework directory and submit it to our class blackboard/elearn site. Complete all the parts 6.1 to 6.5 for score of 3. Investigate plotting, linearegression, or complex matrix manipulation to get a score of 4 or cover two additional investigations for a score of 5. 6.1 Coin flipping ¶ 6.1.1 ¶ Write a function, flip_sum, which generates $n$ random coin flips from a fair coin and then returns the number of heads. A fair coin is defined to be a coin where $P($heads$)=\frac{1}{2}$ The output type should be a numpy integer, hint: use random.rand() In [4]: import numpy as np import random """def random_flip(): return random.choice(["H", "T"]) def flip_sum(n): heads_count = 0 ...

Tutorial 5 - Matplotlib

Matplotlib Tutorial In [13]: % matplotlib inline import pandas as pd import numpy as np import matplotlib as mpl import matplotlib.pyplot as plt # Setting some Pandas options pd . set_option ( 'display.notebook_repr_html' , False ) pd . set_option ( 'display.max_columns' , 25 ) pd . set_option ( 'display.max_rows' , 25 ) Homework 4 ¶ Couple of reference site: http://matplotlib.org/examples/pylab_examples/ http://docs.scipy.org/doc/numpy/ Homework 4.1 ¶ 4.1.a Create a figure with two subplots in a row. One shows a sine wave of x from with x = 0 ... 2*pi the other shows the tagent of x with the same range. Label the figures. Should look something like: We can follow the following steps to get the required graphs showing sine and tangents of x: Create a numpy array x with values from 0 to 2*pi with 0.001 as step value Set the height and w...

Domain Research - Stock Market Prediction

Hi, as part of my research on a domain of Big Data implementation, I chose Stock Market Prediction. Here I present to you the things that I have learned during my research in the domain. Can stock market be predicted? Early researches on stock market prediction revolved around whether it could be predicted. One of such researches suggested that “short term stock price movements were governed by the  random walk hypothesis  and thus were unpredictable”. Another stated that “the stock price reflected completed market information and the market behaved efficiently so that instantaneous price corrections to equilibrium would make stock prediction useless.” In simple terms, the researches inferred that since the market was affected by a lot of factors which were random predicting the stock market is almost impossible. However, researches carried out later (Brown & Jennings 1998; Abarbanel & Bushee 1998) made use of ...