Demo: NumPy, Pandas
UW Geospatial Data Analysis
CEE467/CEWA567
David Shean
modified by Quinn Brencher
Introduction
This is a quick demo of some key functionality for these core Python packages, emphasizing topics that will help with lab exercises this week and later in the quarter. It is by no means complete!
Please consult the reading assignment and lists of other excellent, more complete online resources.
NumPy
NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
Pandas
pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language. It is already well on its way toward this goal.
Matplotlib
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.
Matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, errorcharts, scatterplots, etc., with just a few lines of code. For examples, see the sample plots and thumbnail gallery.
For simple plotting the pyplot module provides a MATLAB-like interface, particularly when combined with IPython. For the power user, you have full control of line styles, font properties, axes properties, etc, via an object oriented interface or via a set of functions familiar to MATLAB users.
Import necessary modules
Use the conventional short aliases so you don’t have to type out the full module name each time
Note the different structure for Matplotlib: we import the pyplot submodule rather than the top-level package
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
NumPy 1D array
#Create 1D array of random integers
#Arguments are low (inclusive), high (exclusive), and the number of values to draw
a = np.random.randint(0,10,10)
a
array([4, 3, 3, 6, 3, 6, 5, 2, 3, 9])
type(a)
numpy.ndarray
#np.ndarray?
Constructing an array
#np.array?
np.array(0, 1, 2)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[6], line 1
----> 1 np.array(0, 1, 2)
TypeError: array() takes from 1 to 2 positional arguments but 3 were given
#Pass in an array-like object - need brackets around the numbers
np.array([0, 1, 2])
array([0, 1, 2])
mylist = [0, 1, 2]
np.array(mylist)
array([0, 1, 2])
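Beyond passing a list to `np.array`, a few other constructors come up often. The sketch below is illustrative only; the variable names are made up for this example.

```python
import numpy as np

# A nested list becomes a 2D array (2 rows, 3 columns)
nested = np.array([[0, 1, 2], [3, 4, 5]])
print(nested.shape)  # (2, 3)

# Evenly spaced integers, similar to Python's built-in range
steps = np.arange(0, 10, 2)
print(steps)  # [0 2 4 6 8]

# Array pre-filled with zeros; dtype defaults to float64
grid = np.zeros((2, 3))
print(grid.dtype)  # float64

# Mixing ints and floats upcasts every element to float
mixed = np.array([0, 1.5, 2])
print(mixed.dtype)  # float64
```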
Array properties and datatypes
a
array([4, 3, 3, 6, 3, 6, 5, 2, 3, 9])
a.shape
(10,)
a.size
10
a.dtype
dtype('int64')
What is ‘int64’?
Signed integer represented by 64 bits
Each bit can be 0 or 1
0 = 0000000000000000000000000000000000000000000000000000000000000000
1 = 0000000000000000000000000000000000000000000000000000000000000001
2 = 0000000000000000000000000000000000000000000000000000000000000010
…
https://numpy.org/doc/stable/user/basics.types.html
#Possible unique combinations of 64 bits
range = 2**64
range
18446744073709551616
print(f"{range:.2e}")
1.84e+19
mm = 2**64 // 2
mm
9223372036854775808
f'A 64-bit signed integer can store values between -{mm} and +{mm - 1}'
'A 64-bit signed integer can store values between -9223372036854775808 and +9223372036854775807'
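NumPy can report integer limits directly, which avoids off-by-one slips when working them out by hand. A short sketch using `np.iinfo`:

```python
import numpy as np

info64 = np.iinfo(np.int64)
print(info64.min)  # -9223372036854775808, i.e. -2**63
print(info64.max)  # 9223372036854775807, i.e. 2**63 - 1

info8 = np.iinfo(np.uint8)
print(info8.min, info8.max)  # 0 255
```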
# Overkill for our single-digit integer values
a
array([4, 3, 3, 6, 3, 6, 5, 2, 3, 9])
#Number of bytes (8 bits each) for each element in the array
a.itemsize
8
#Total number of bytes for 10 elements
a.nbytes
80
# Recast to 8-bit unsigned integer (valid range: 0-255)
b = a.astype('uint8')
b
array([4, 3, 3, 6, 3, 6, 5, 2, 3, 9], dtype=uint8)
b.dtype
dtype('uint8')
2**8
256
b
array([4, 3, 3, 6, 3, 6, 5, 2, 3, 9], dtype=uint8)
b.nbytes
10
#Assign value within valid range
b[0] = 255
b
array([255, 3, 3, 6, 3, 6, 5, 2, 3, 9], dtype=uint8)
#Assign value outside of valid range - overflow!
# https://en.wikipedia.org/wiki/Integer_overflow
b[0] = 257
b
/tmp/ipykernel_802/1809295710.py:3: DeprecationWarning: NumPy will stop allowing conversion of out-of-bound Python integers to integer arrays. The conversion of 257 to uint8 will fail in the future.
For the old behavior, usually:
np.array(value).astype(dtype)`
will give the desired result (the cast overflows).
b[0] = 257
array([1, 3, 3, 6, 3, 6, 5, 2, 3, 9], dtype=uint8)
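The wraparound follows modular arithmetic: integer values outside 0-255 are reduced modulo 256 when cast to `uint8`. The sketch below shows that behavior with an explicit `astype` cast, plus one possible guard that checks the range before casting. The `values` array is made up for illustration.

```python
import numpy as np

values = np.array([254, 255, 256, 257, -1])  # default integer dtype (int64 on most platforms)

# Explicit casts between integer types wrap around silently
wrapped = values.astype('uint8')
print(wrapped)  # [254 255   0   1 255]

# One way to guard against this: check the valid range before casting
limits = np.iinfo(np.uint8)
fits = (values >= limits.min) & (values <= limits.max)
print(fits)  # [ True  True False False False]
```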
2D arrays
a2 = np.random.random((10,10))
a2
array([[1.41204469e-01, 3.89917325e-01, 7.30803440e-01, 2.28727875e-01,
3.05261875e-01, 5.57365149e-01, 3.19747387e-01, 9.76131931e-01,
6.90435916e-01, 7.29483182e-01],
[3.76782504e-01, 9.54623086e-04, 4.96253865e-01, 8.83186872e-01,
7.97195917e-01, 6.37901937e-01, 5.78254274e-01, 5.03462246e-01,
6.37138050e-01, 1.84502898e-02],
[6.05770836e-01, 7.25509385e-01, 2.16041657e-01, 3.50670357e-03,
6.34606720e-01, 9.77455944e-01, 7.18494507e-01, 6.50530622e-01,
5.85406094e-01, 7.57894513e-01],
[6.58903005e-01, 8.07604015e-01, 9.97131444e-01, 6.90749625e-01,
8.11268416e-02, 1.07874063e-01, 4.17861123e-01, 7.23787367e-01,
9.20449637e-02, 4.16555164e-01],
[3.16168125e-01, 6.55887756e-01, 8.20854844e-02, 8.19284438e-02,
2.25303248e-01, 4.16816605e-01, 5.13325213e-01, 5.34474506e-01,
1.73097254e-01, 7.11062044e-01],
[9.22911408e-01, 1.11008058e-01, 1.39258614e-01, 8.76430130e-01,
4.95702854e-01, 7.16097827e-01, 6.36533645e-01, 7.98180948e-01,
9.58087288e-01, 7.77772061e-02],
[7.50955759e-01, 1.28775956e-01, 3.02774126e-01, 5.05208454e-01,
2.43478124e-01, 9.86661911e-02, 7.22968293e-01, 3.39784935e-01,
7.08604837e-01, 1.37895600e-01],
[2.71098182e-01, 8.23604609e-01, 7.39260002e-01, 5.96863088e-01,
6.42867723e-01, 5.57721585e-01, 2.05087063e-01, 3.45959313e-01,
7.13187596e-01, 1.26699262e-01],
[9.75129602e-01, 6.94600553e-01, 1.46673272e-01, 1.61391117e-01,
6.93398224e-01, 7.58123721e-01, 8.01035619e-01, 7.90701972e-01,
7.34414677e-01, 8.80065416e-01],
[4.58111282e-01, 8.55722122e-01, 7.18808961e-01, 3.28407866e-01,
5.75350931e-01, 4.71047838e-01, 1.48364415e-01, 5.68606354e-01,
7.63957334e-02, 4.03262879e-01]])
a2.shape
(10, 10)
a2.size
100
a2.dtype
dtype('float64')
#Get first element along first axis
#Question: is this the first row or the first column?
a2[0]
array([0.14120447, 0.38991732, 0.73080344, 0.22872787, 0.30526187,
0.55736515, 0.31974739, 0.97613193, 0.69043592, 0.72948318])
#Get first element along second axis
a2[:,0]
array([0.14120447, 0.3767825 , 0.60577084, 0.658903 , 0.31616813,
0.92291141, 0.75095576, 0.27109818, 0.9751296 , 0.45811128])
#Get first element along both axes
a2[0,0]
0.14120446935395448
#Get slice along first axis - first 3 rows
a2[0:3]
array([[1.41204469e-01, 3.89917325e-01, 7.30803440e-01, 2.28727875e-01,
3.05261875e-01, 5.57365149e-01, 3.19747387e-01, 9.76131931e-01,
6.90435916e-01, 7.29483182e-01],
[3.76782504e-01, 9.54623086e-04, 4.96253865e-01, 8.83186872e-01,
7.97195917e-01, 6.37901937e-01, 5.78254274e-01, 5.03462246e-01,
6.37138050e-01, 1.84502898e-02],
[6.05770836e-01, 7.25509385e-01, 2.16041657e-01, 3.50670357e-03,
6.34606720e-01, 9.77455944e-01, 7.18494507e-01, 6.50530622e-01,
5.85406094e-01, 7.57894513e-01]])
#Slice along second axis - first 3 cols
a2[:,0:3]
array([[1.41204469e-01, 3.89917325e-01, 7.30803440e-01],
[3.76782504e-01, 9.54623086e-04, 4.96253865e-01],
[6.05770836e-01, 7.25509385e-01, 2.16041657e-01],
[6.58903005e-01, 8.07604015e-01, 9.97131444e-01],
[3.16168125e-01, 6.55887756e-01, 8.20854844e-02],
[9.22911408e-01, 1.11008058e-01, 1.39258614e-01],
[7.50955759e-01, 1.28775956e-01, 3.02774126e-01],
[2.71098182e-01, 8.23604609e-01, 7.39260002e-01],
[9.75129602e-01, 6.94600553e-01, 1.46673272e-01],
[4.58111282e-01, 8.55722122e-01, 7.18808961e-01]])
a2[0:3,0:3]
array([[0.14120447, 0.38991732, 0.73080344],
[0.3767825 , 0.00095462, 0.49625387],
[0.60577084, 0.72550939, 0.21604166]])
ufunc
Efficiently perform operations element-by-element in a “vectorized” fashion (unrelated to GIS vector datasets)
Do not loop over arrays (unless absolutely necessary)
https://numpy.org/doc/stable/reference/ufuncs.html
a2 * 10
array([[1.41204469e+00, 3.89917325e+00, 7.30803440e+00, 2.28727875e+00,
3.05261875e+00, 5.57365149e+00, 3.19747387e+00, 9.76131931e+00,
6.90435916e+00, 7.29483182e+00],
[3.76782504e+00, 9.54623086e-03, 4.96253865e+00, 8.83186872e+00,
7.97195917e+00, 6.37901937e+00, 5.78254274e+00, 5.03462246e+00,
6.37138050e+00, 1.84502898e-01],
[6.05770836e+00, 7.25509385e+00, 2.16041657e+00, 3.50670357e-02,
6.34606720e+00, 9.77455944e+00, 7.18494507e+00, 6.50530622e+00,
5.85406094e+00, 7.57894513e+00],
[6.58903005e+00, 8.07604015e+00, 9.97131444e+00, 6.90749625e+00,
8.11268416e-01, 1.07874063e+00, 4.17861123e+00, 7.23787367e+00,
9.20449637e-01, 4.16555164e+00],
[3.16168125e+00, 6.55887756e+00, 8.20854844e-01, 8.19284438e-01,
2.25303248e+00, 4.16816605e+00, 5.13325213e+00, 5.34474506e+00,
1.73097254e+00, 7.11062044e+00],
[9.22911408e+00, 1.11008058e+00, 1.39258614e+00, 8.76430130e+00,
4.95702854e+00, 7.16097827e+00, 6.36533645e+00, 7.98180948e+00,
9.58087288e+00, 7.77772061e-01],
[7.50955759e+00, 1.28775956e+00, 3.02774126e+00, 5.05208454e+00,
2.43478124e+00, 9.86661911e-01, 7.22968293e+00, 3.39784935e+00,
7.08604837e+00, 1.37895600e+00],
[2.71098182e+00, 8.23604609e+00, 7.39260002e+00, 5.96863088e+00,
6.42867723e+00, 5.57721585e+00, 2.05087063e+00, 3.45959313e+00,
7.13187596e+00, 1.26699262e+00],
[9.75129602e+00, 6.94600553e+00, 1.46673272e+00, 1.61391117e+00,
6.93398224e+00, 7.58123721e+00, 8.01035619e+00, 7.90701972e+00,
7.34414677e+00, 8.80065416e+00],
[4.58111282e+00, 8.55722122e+00, 7.18808961e+00, 3.28407866e+00,
5.75350931e+00, 4.71047838e+00, 1.48364415e+00, 5.68606354e+00,
7.63957334e-01, 4.03262879e+00]])
# Don't do this! Looping row by row is slow compared with a2 * 10
#for n, i in enumerate(a2):
#    a2[n] = i * 10
#np.power(a2, 2)
a2**2
array([[1.99387022e-02, 1.52035520e-01, 5.34073667e-01, 5.23164406e-02,
9.31848122e-02, 3.10655909e-01, 1.02238392e-01, 9.52833548e-01,
4.76701755e-01, 5.32145713e-01],
[1.41965055e-01, 9.11305236e-07, 2.46267899e-01, 7.80019051e-01,
6.35521330e-01, 4.06918882e-01, 3.34378006e-01, 2.53474233e-01,
4.05944895e-01, 3.40413195e-04],
[3.66958306e-01, 5.26363868e-01, 4.66739974e-02, 1.22969699e-05,
4.02725689e-01, 9.55420122e-01, 5.16234356e-01, 4.23190091e-01,
3.42700295e-01, 5.74404092e-01],
[4.34153170e-01, 6.52224244e-01, 9.94271116e-01, 4.77135045e-01,
6.58156443e-03, 1.16368135e-02, 1.74607918e-01, 5.23868153e-01,
8.47227534e-03, 1.73518205e-01],
[9.99622834e-02, 4.30188749e-01, 6.73802675e-03, 6.71226990e-03,
5.07615536e-02, 1.73736082e-01, 2.63502774e-01, 2.85662998e-01,
2.99626595e-02, 5.05609231e-01],
[8.51765467e-01, 1.23227889e-02, 1.93929617e-02, 7.68129773e-01,
2.45721319e-01, 5.12796098e-01, 4.05175081e-01, 6.37092825e-01,
9.17931251e-01, 6.04929379e-03],
[5.63934553e-01, 1.65832469e-02, 9.16721711e-02, 2.55235582e-01,
5.92815967e-02, 9.73501726e-03, 5.22683153e-01, 1.15453802e-01,
5.02120816e-01, 1.90151964e-02],
[7.34942241e-02, 6.78324551e-01, 5.46505351e-01, 3.56245546e-01,
4.13278909e-01, 3.11053366e-01, 4.20607035e-02, 1.19687846e-01,
5.08636547e-01, 1.60527030e-02],
[9.50877740e-01, 4.82469929e-01, 2.15130488e-02, 2.60470928e-02,
4.80801097e-01, 5.74751576e-01, 6.41658063e-01, 6.25209609e-01,
5.39364917e-01, 7.74515137e-01],
[2.09865947e-01, 7.32260350e-01, 5.16686322e-01, 1.07851727e-01,
3.31028694e-01, 2.21886066e-01, 2.20119995e-02, 3.23313185e-01,
5.83630808e-03, 1.62620949e-01]])
#a2**0.5
np.sqrt(a2)
array([[0.37577183, 0.6244336 , 0.85487042, 0.47825503, 0.55250509,
0.74656892, 0.5654621 , 0.98799389, 0.83092474, 0.85409788],
[0.61382612, 0.03089698, 0.70445288, 0.93978023, 0.89285828,
0.79868763, 0.76043032, 0.70955074, 0.79820928, 0.13583184],
[0.77831281, 0.85176839, 0.46480281, 0.05921743, 0.79662207,
0.98866372, 0.84764055, 0.80655479, 0.76511835, 0.87057137],
[0.81172841, 0.89866791, 0.99856469, 0.83111349, 0.28482774,
0.32844187, 0.64642178, 0.85075694, 0.30338913, 0.64541085],
[0.56228829, 0.80986897, 0.28650564, 0.28623145, 0.47466119,
0.64561336, 0.71646717, 0.73107763, 0.41604958, 0.84324495],
[0.96068278, 0.33317872, 0.37317371, 0.93617847, 0.70406168,
0.84622564, 0.79783059, 0.89340973, 0.97881933, 0.27888565],
[0.86657704, 0.35885367, 0.55024915, 0.71078017, 0.49343502,
0.31411175, 0.85027542, 0.58291074, 0.84178669, 0.37134297],
[0.5206709 , 0.90752664, 0.8598023 , 0.77256915, 0.80179032,
0.7468076 , 0.45286539, 0.58818306, 0.84450435, 0.3559484 ],
[0.98748651, 0.83342699, 0.38297947, 0.40173513, 0.83270536,
0.870703 , 0.89500593, 0.88921424, 0.85697997, 0.93811802],
[0.67683918, 0.9250525 , 0.84782602, 0.57306881, 0.75851891,
0.68632925, 0.38518102, 0.75405991, 0.27639778, 0.63502983]])
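To make the "do not loop" advice concrete, here is a small sketch comparing an explicit Python loop with the equivalent vectorized expression. Both give the same values; the vectorized form runs in compiled code and is typically much faster on large arrays. The seeded array `x` is a stand-in for `a2`.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator so the example is repeatable
x = rng.random((10, 10))

# Loop version: one Python-level operation per row
looped = np.empty_like(x)
for n, row in enumerate(x):
    looped[n] = row * 10

# Vectorized version: operator syntax and the named ufunc are equivalent
vectorized = x * 10
also_vectorized = np.multiply(x, 10)

print(np.allclose(looped, vectorized))              # True
print(np.array_equal(vectorized, also_vectorized))  # True
print(np.allclose(np.sqrt(x), x**0.5))              # True
```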
Built-in functions
Operate over entire array, specified axes, or slice
Very fast/efficient
a2.mean()
0.5051072363137312
a2.std()
0.2780218131440386
a2.min()
0.0009546230860033589
a2
array([[1.41204469e-01, 3.89917325e-01, 7.30803440e-01, 2.28727875e-01,
3.05261875e-01, 5.57365149e-01, 3.19747387e-01, 9.76131931e-01,
6.90435916e-01, 7.29483182e-01],
[3.76782504e-01, 9.54623086e-04, 4.96253865e-01, 8.83186872e-01,
7.97195917e-01, 6.37901937e-01, 5.78254274e-01, 5.03462246e-01,
6.37138050e-01, 1.84502898e-02],
[6.05770836e-01, 7.25509385e-01, 2.16041657e-01, 3.50670357e-03,
6.34606720e-01, 9.77455944e-01, 7.18494507e-01, 6.50530622e-01,
5.85406094e-01, 7.57894513e-01],
[6.58903005e-01, 8.07604015e-01, 9.97131444e-01, 6.90749625e-01,
8.11268416e-02, 1.07874063e-01, 4.17861123e-01, 7.23787367e-01,
9.20449637e-02, 4.16555164e-01],
[3.16168125e-01, 6.55887756e-01, 8.20854844e-02, 8.19284438e-02,
2.25303248e-01, 4.16816605e-01, 5.13325213e-01, 5.34474506e-01,
1.73097254e-01, 7.11062044e-01],
[9.22911408e-01, 1.11008058e-01, 1.39258614e-01, 8.76430130e-01,
4.95702854e-01, 7.16097827e-01, 6.36533645e-01, 7.98180948e-01,
9.58087288e-01, 7.77772061e-02],
[7.50955759e-01, 1.28775956e-01, 3.02774126e-01, 5.05208454e-01,
2.43478124e-01, 9.86661911e-02, 7.22968293e-01, 3.39784935e-01,
7.08604837e-01, 1.37895600e-01],
[2.71098182e-01, 8.23604609e-01, 7.39260002e-01, 5.96863088e-01,
6.42867723e-01, 5.57721585e-01, 2.05087063e-01, 3.45959313e-01,
7.13187596e-01, 1.26699262e-01],
[9.75129602e-01, 6.94600553e-01, 1.46673272e-01, 1.61391117e-01,
6.93398224e-01, 7.58123721e-01, 8.01035619e-01, 7.90701972e-01,
7.34414677e-01, 8.80065416e-01],
[4.58111282e-01, 8.55722122e-01, 7.18808961e-01, 3.28407866e-01,
5.75350931e-01, 4.71047838e-01, 1.48364415e-01, 5.68606354e-01,
7.63957334e-02, 4.03262879e-01]])
Note on axis order
When indexing, the first axis (0) selects rows and the second axis (1) selects columns
When aggregating (e.g., computing the mean along an axis), you are specifying the dimension of the array that will be collapsed, not the dimension that will be returned
So axis=0 will aggregate values across all rows, returning one value for each column of a 2D array
a2[0:3,0:3]
array([[0.14120447, 0.38991732, 0.73080344],
[0.3767825 , 0.00095462, 0.49625387],
[0.60577084, 0.72550939, 0.21604166]])
a2[0:3,0:3].min(axis=0)
array([0.14120447, 0.00095462, 0.21604166])
a2[0:3,0:3].min(axis=1)
array([0.14120447, 0.00095462, 0.21604166])
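For this particular 3x3 slice the two results happen to coincide, which can hide the difference between the axes. A non-square array (values chosen arbitrarily for illustration) makes it clearer, since the output shape shows which axis was collapsed:

```python
import numpy as np

m = np.array([[1, 8, 3],
              [7, 2, 9]])  # shape (2, 3): 2 rows, 3 columns

# axis=0 collapses the rows, leaving one value per column
print(m.min(axis=0))        # [1 2 3]
print(m.min(axis=0).shape)  # (3,)

# axis=1 collapses the columns, leaving one value per row
print(m.min(axis=1))        # [1 2]
print(m.min(axis=1).shape)  # (2,)

# No axis argument collapses everything to a single value
print(m.min())  # 1
```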
nD arrays
# how many dimensions can a numpy array have?
# let's try creating arrays with increasingly more dimensions
shape = []
for i in range(100):
    shape.append(1) # add a new dimension of size 1 to the array shape
    an = np.random.random(shape) # create a random array with the given shape
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[49], line 4
1 # how many dimensions can a numpy array have?
2 # let's try creating arrays with increasingly more dimensions
3 shape = []
----> 4 for i in range(100):
5 shape.append(1) # add a new dimension of size 1 to the array shape
6 an = np.random.random(shape) # create a random array with the given shape
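TypeError: 'int' object is not callable
This error is not about array dimensions: the name range was reassigned to an integer earlier (range = 2**64), which shadows the built-in range function. Running del range restores the built-in.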
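A sketch of the experiment this cell was attempting, written so that it does not depend on the name `range` being intact in the session. It keeps adding size-1 dimensions until NumPy refuses. The limit depends on the NumPy version (32 in NumPy 1.x), so no specific number is assumed here; `max_ndim` is a helper name introduced for this example.

```python
import builtins
import numpy as np

shape = []
max_ndim = None
# builtins.range is the original function even if the name `range` was reassigned
for i in builtins.range(100):
    shape.append(1)  # add a new dimension of size 1 to the array shape
    try:
        an = np.random.random(shape)  # create a random array with the given shape
    except ValueError as err:
        print(f"Failed at {len(shape)} dimensions: {err}")
        break
    max_ndim = an.ndim

print(f"Largest array created had {max_ndim} dimensions")
```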
Basic array plotting and visualization
a
array([4, 3, 3, 6, 3, 6, 5, 2, 3, 9])
plt.plot(a)
[<matplotlib.lines.Line2D at 0x7f309773c850>]
a2
array([[1.41204469e-01, 3.89917325e-01, 7.30803440e-01, 2.28727875e-01,
3.05261875e-01, 5.57365149e-01, 3.19747387e-01, 9.76131931e-01,
6.90435916e-01, 7.29483182e-01],
[3.76782504e-01, 9.54623086e-04, 4.96253865e-01, 8.83186872e-01,
7.97195917e-01, 6.37901937e-01, 5.78254274e-01, 5.03462246e-01,
6.37138050e-01, 1.84502898e-02],
[6.05770836e-01, 7.25509385e-01, 2.16041657e-01, 3.50670357e-03,
6.34606720e-01, 9.77455944e-01, 7.18494507e-01, 6.50530622e-01,
5.85406094e-01, 7.57894513e-01],
[6.58903005e-01, 8.07604015e-01, 9.97131444e-01, 6.90749625e-01,
8.11268416e-02, 1.07874063e-01, 4.17861123e-01, 7.23787367e-01,
9.20449637e-02, 4.16555164e-01],
[3.16168125e-01, 6.55887756e-01, 8.20854844e-02, 8.19284438e-02,
2.25303248e-01, 4.16816605e-01, 5.13325213e-01, 5.34474506e-01,
1.73097254e-01, 7.11062044e-01],
[9.22911408e-01, 1.11008058e-01, 1.39258614e-01, 8.76430130e-01,
4.95702854e-01, 7.16097827e-01, 6.36533645e-01, 7.98180948e-01,
9.58087288e-01, 7.77772061e-02],
[7.50955759e-01, 1.28775956e-01, 3.02774126e-01, 5.05208454e-01,
2.43478124e-01, 9.86661911e-02, 7.22968293e-01, 3.39784935e-01,
7.08604837e-01, 1.37895600e-01],
[2.71098182e-01, 8.23604609e-01, 7.39260002e-01, 5.96863088e-01,
6.42867723e-01, 5.57721585e-01, 2.05087063e-01, 3.45959313e-01,
7.13187596e-01, 1.26699262e-01],
[9.75129602e-01, 6.94600553e-01, 1.46673272e-01, 1.61391117e-01,
6.93398224e-01, 7.58123721e-01, 8.01035619e-01, 7.90701972e-01,
7.34414677e-01, 8.80065416e-01],
[4.58111282e-01, 8.55722122e-01, 7.18808961e-01, 3.28407866e-01,
5.75350931e-01, 4.71047838e-01, 1.48364415e-01, 5.68606354e-01,
7.63957334e-02, 4.03262879e-01]])
plt.plot(a2)
[<matplotlib.lines.Line2D at 0x7f308f4f3d60>,
<matplotlib.lines.Line2D at 0x7f308f4f3dc0>,
<matplotlib.lines.Line2D at 0x7f308f4f3df0>,
<matplotlib.lines.Line2D at 0x7f308f4f3ee0>,
<matplotlib.lines.Line2D at 0x7f308f4f3fd0>,
<matplotlib.lines.Line2D at 0x7f308f481100>,
<matplotlib.lines.Line2D at 0x7f308f4811f0>,
<matplotlib.lines.Line2D at 0x7f308f4812e0>,
<matplotlib.lines.Line2D at 0x7f308f4813d0>,
<matplotlib.lines.Line2D at 0x7f308f4814c0>]
plt.plot(a2[0])
[<matplotlib.lines.Line2D at 0x7f308dc00c70>]
#2D array visualization
plt.imshow(a2, cmap='gray')
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x7f308f44d3a0>
plt.hist(a2.ravel(), bins='auto')
(array([12., 14., 8., 10., 12., 25., 10., 9.]),
array([9.54623086e-04, 1.25476726e-01, 2.49998828e-01, 3.74520931e-01,
4.99043033e-01, 6.23565136e-01, 7.48087239e-01, 8.72609341e-01,
9.97131444e-01]),
<BarContainer object of 8 artists>)
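The tuple printed by `plt.hist` holds the bin counts, the bin edges, and the drawn bar objects. The first two can be computed without plotting using `np.histogram`. A sketch, with a seeded random array standing in for `a2`:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((10, 10))

# ravel() flattens the 2D array to 1D before binning
counts, edges = np.histogram(x.ravel(), bins='auto')

print(counts.sum())                   # 100, every element falls in exactly one bin
print(len(edges) == len(counts) + 1)  # True, n bins have n + 1 edges
print(np.isclose(edges[0], x.min()), np.isclose(edges[-1], x.max()))  # True True
```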
Boolean arrays and fancy indexing
a2
array([[1.41204469e-01, 3.89917325e-01, 7.30803440e-01, 2.28727875e-01,
3.05261875e-01, 5.57365149e-01, 3.19747387e-01, 9.76131931e-01,
6.90435916e-01, 7.29483182e-01],
[3.76782504e-01, 9.54623086e-04, 4.96253865e-01, 8.83186872e-01,
7.97195917e-01, 6.37901937e-01, 5.78254274e-01, 5.03462246e-01,
6.37138050e-01, 1.84502898e-02],
[6.05770836e-01, 7.25509385e-01, 2.16041657e-01, 3.50670357e-03,
6.34606720e-01, 9.77455944e-01, 7.18494507e-01, 6.50530622e-01,
5.85406094e-01, 7.57894513e-01],
[6.58903005e-01, 8.07604015e-01, 9.97131444e-01, 6.90749625e-01,
8.11268416e-02, 1.07874063e-01, 4.17861123e-01, 7.23787367e-01,
9.20449637e-02, 4.16555164e-01],
[3.16168125e-01, 6.55887756e-01, 8.20854844e-02, 8.19284438e-02,
2.25303248e-01, 4.16816605e-01, 5.13325213e-01, 5.34474506e-01,
1.73097254e-01, 7.11062044e-01],
[9.22911408e-01, 1.11008058e-01, 1.39258614e-01, 8.76430130e-01,
4.95702854e-01, 7.16097827e-01, 6.36533645e-01, 7.98180948e-01,
9.58087288e-01, 7.77772061e-02],
[7.50955759e-01, 1.28775956e-01, 3.02774126e-01, 5.05208454e-01,
2.43478124e-01, 9.86661911e-02, 7.22968293e-01, 3.39784935e-01,
7.08604837e-01, 1.37895600e-01],
[2.71098182e-01, 8.23604609e-01, 7.39260002e-01, 5.96863088e-01,
6.42867723e-01, 5.57721585e-01, 2.05087063e-01, 3.45959313e-01,
7.13187596e-01, 1.26699262e-01],
[9.75129602e-01, 6.94600553e-01, 1.46673272e-01, 1.61391117e-01,
6.93398224e-01, 7.58123721e-01, 8.01035619e-01, 7.90701972e-01,
7.34414677e-01, 8.80065416e-01],
[4.58111282e-01, 8.55722122e-01, 7.18808961e-01, 3.28407866e-01,
5.75350931e-01, 4.71047838e-01, 1.48364415e-01, 5.68606354e-01,
7.63957334e-02, 4.03262879e-01]])
a2 > 0.5
array([[False, False, True, False, False, True, False, True, True,
True],
[False, False, False, True, True, True, True, True, True,
False],
[ True, True, False, False, True, True, True, True, True,
True],
[ True, True, True, True, False, False, False, True, False,
False],
[False, True, False, False, False, False, True, True, False,
True],
[ True, False, False, True, False, True, True, True, True,
False],
[ True, False, False, True, False, False, True, False, True,
False],
[False, True, True, True, True, True, False, False, True,
False],
[ True, True, False, False, True, True, True, True, True,
True],
[False, True, True, False, True, False, False, True, False,
False]])
idx = (a2 > 0.5)
idx
array([[False, False, True, False, False, True, False, True, True,
True],
[False, False, False, True, True, True, True, True, True,
False],
[ True, True, False, False, True, True, True, True, True,
True],
[ True, True, True, True, False, False, False, True, False,
False],
[False, True, False, False, False, False, True, True, False,
True],
[ True, False, False, True, False, True, True, True, True,
False],
[ True, False, False, True, False, False, True, False, True,
False],
[False, True, True, True, True, True, False, False, True,
False],
[ True, True, False, False, True, True, True, True, True,
True],
[False, True, True, False, True, False, False, True, False,
False]])
# Quick visualization, True = yellow (1)
plt.imshow(idx)
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x7f308da5a370>
# Return only elements where condition is True
a2[idx]
array([0.73080344, 0.55736515, 0.97613193, 0.69043592, 0.72948318,
0.88318687, 0.79719592, 0.63790194, 0.57825427, 0.50346225,
0.63713805, 0.60577084, 0.72550939, 0.63460672, 0.97745594,
0.71849451, 0.65053062, 0.58540609, 0.75789451, 0.658903 ,
0.80760401, 0.99713144, 0.69074963, 0.72378737, 0.65588776,
0.51332521, 0.53447451, 0.71106204, 0.92291141, 0.87643013,
0.71609783, 0.63653364, 0.79818095, 0.95808729, 0.75095576,
0.50520845, 0.72296829, 0.70860484, 0.82360461, 0.73926 ,
0.59686309, 0.64286772, 0.55772158, 0.7131876 , 0.9751296 ,
0.69460055, 0.69339822, 0.75812372, 0.80103562, 0.79070197,
0.73441468, 0.88006542, 0.85572212, 0.71880896, 0.57535093,
0.56860635])
# Original shape
a2.shape
(10, 10)
# Selected shape
a2[idx].shape
#idx.nonzero()[0].size
(56,)
#Can also be used for assignment
#a2[idx] = 0
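Assignment through a boolean mask modifies the array in place, so it is often worth working on a copy. A minimal sketch (the small `x` array is illustrative), with `np.where` as a non-destructive alternative:

```python
import numpy as np

x = np.array([[0.2, 0.7],
              [0.9, 0.4]])
mask = x > 0.5

# In-place assignment on a copy, leaving x untouched
clipped = x.copy()
clipped[mask] = 0
print(clipped)

# np.where builds a new array: 0 where mask is True, x elsewhere
also_clipped = np.where(mask, 0, x)
print(np.array_equal(clipped, also_clipped))  # True

# The original values are unchanged
print(x[mask])  # [0.7 0.9]
```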
### Bitwise operators, combining boolean arrays
(a2 > 0.5)
array([[False, False, True, False, False, True, False, True, True,
True],
[False, False, False, True, True, True, True, True, True,
False],
[ True, True, False, False, True, True, True, True, True,
True],
[ True, True, True, True, False, False, False, True, False,
False],
[False, True, False, False, False, False, True, True, False,
True],
[ True, False, False, True, False, True, True, True, True,
False],
[ True, False, False, True, False, False, True, False, True,
False],
[False, True, True, True, True, True, False, False, True,
False],
[ True, True, False, False, True, True, True, True, True,
True],
[False, True, True, False, True, False, False, True, False,
False]])
(a2 < 0.7)
array([[ True, True, False, True, True, True, True, False, True,
False],
[ True, True, True, False, False, True, True, True, True,
True],
[ True, False, True, True, True, False, False, True, True,
False],
[ True, False, False, True, True, True, True, False, True,
True],
[ True, True, True, True, True, True, True, True, True,
False],
[False, True, True, False, True, False, True, False, False,
True],
[False, True, True, True, True, True, False, True, False,
True],
[ True, False, False, True, True, True, True, True, False,
True],
[False, True, True, True, True, False, False, False, False,
False],
[ True, False, False, True, True, True, True, True, True,
True]])
#Bitwise and - True if both are True
idx = (a2 > 0.5) & (a2 < 0.7)
#Bitwise or - True if either is True
#idx = (a2 < 0.5) | (a2 > 0.9)
idx
array([[False, False, False, False, False, True, False, False, True,
False],
[False, False, False, False, False, True, True, True, True,
False],
[ True, False, False, False, True, False, False, True, True,
False],
[ True, False, False, True, False, False, False, False, False,
False],
[False, True, False, False, False, False, True, True, False,
False],
[False, False, False, False, False, False, True, False, False,
False],
[False, False, False, True, False, False, False, False, False,
False],
[False, False, False, True, True, True, False, False, False,
False],
[False, True, False, False, True, False, False, False, False,
False],
[False, False, False, False, True, False, False, True, False,
False]])
plt.imshow(idx)
<matplotlib.image.AxesImage at 0x7f308d9f2430>
a2[idx]
array([0.55736515, 0.69043592, 0.63790194, 0.57825427, 0.50346225,
0.63713805, 0.60577084, 0.63460672, 0.65053062, 0.58540609,
0.658903 , 0.69074963, 0.65588776, 0.51332521, 0.53447451,
0.63653364, 0.50520845, 0.59686309, 0.64286772, 0.55772158,
0.69460055, 0.69339822, 0.57535093, 0.56860635])
a2[idx].shape
(24,)
#Invert the boolean array
~idx
array([[ True, True, True, True, True, False, True, True, False,
True],
[ True, True, True, True, True, False, False, False, False,
True],
[False, True, True, True, False, True, True, False, False,
True],
[False, True, True, False, True, True, True, True, True,
True],
[ True, False, True, True, True, True, False, False, True,
True],
[ True, True, True, True, True, True, False, True, True,
True],
[ True, True, True, False, True, True, True, True, True,
True],
[ True, True, True, False, False, False, True, True, True,
True],
[ True, False, True, True, False, True, True, True, True,
True],
[ True, True, True, True, False, True, True, False, True,
True]])
plt.imshow(idx)
<matplotlib.image.AxesImage at 0x7f309773acd0>
plt.imshow(~idx)
<matplotlib.image.AxesImage at 0x7f308d916c10>
a2[~idx].shape
(76,)
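Two details are easy to trip over when combining conditions. The comparisons need parentheses because `&` binds more tightly than `>` and `<`, and Python's `and` keyword does not work element-wise on arrays. A sketch with a small illustrative array:

```python
import numpy as np

x = np.array([0.1, 0.55, 0.65, 0.8])

both = (x > 0.5) & (x < 0.7)
print(both)  # [False  True  True False]

# np.logical_and is the equivalent named ufunc
print(np.array_equal(both, np.logical_and(x > 0.5, x < 0.7)))  # True

# A mask and its inverse together cover every element exactly once
print(both.sum() + (~both).sum() == x.size)  # True

# Using `and` raises an error, because an array has no single truth value
try:
    (x > 0.5) and (x < 0.7)
except ValueError as err:
    print("ValueError:", err)
```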
Pandas!
#pd.DataFrame?
df = pd.DataFrame(a2)
df
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.141204 | 0.389917 | 0.730803 | 0.228728 | 0.305262 | 0.557365 | 0.319747 | 0.976132 | 0.690436 | 0.729483 |
| 1 | 0.376783 | 0.000955 | 0.496254 | 0.883187 | 0.797196 | 0.637902 | 0.578254 | 0.503462 | 0.637138 | 0.018450 |
| 2 | 0.605771 | 0.725509 | 0.216042 | 0.003507 | 0.634607 | 0.977456 | 0.718495 | 0.650531 | 0.585406 | 0.757895 |
| 3 | 0.658903 | 0.807604 | 0.997131 | 0.690750 | 0.081127 | 0.107874 | 0.417861 | 0.723787 | 0.092045 | 0.416555 |
| 4 | 0.316168 | 0.655888 | 0.082085 | 0.081928 | 0.225303 | 0.416817 | 0.513325 | 0.534475 | 0.173097 | 0.711062 |
| 5 | 0.922911 | 0.111008 | 0.139259 | 0.876430 | 0.495703 | 0.716098 | 0.636534 | 0.798181 | 0.958087 | 0.077777 |
| 6 | 0.750956 | 0.128776 | 0.302774 | 0.505208 | 0.243478 | 0.098666 | 0.722968 | 0.339785 | 0.708605 | 0.137896 |
| 7 | 0.271098 | 0.823605 | 0.739260 | 0.596863 | 0.642868 | 0.557722 | 0.205087 | 0.345959 | 0.713188 | 0.126699 |
| 8 | 0.975130 | 0.694601 | 0.146673 | 0.161391 | 0.693398 | 0.758124 | 0.801036 | 0.790702 | 0.734415 | 0.880065 |
| 9 | 0.458111 | 0.855722 | 0.718809 | 0.328408 | 0.575351 | 0.471048 | 0.148364 | 0.568606 | 0.076396 | 0.403263 |
Wow! Now we have row and column labels for our NumPy array, which makes the values much easier to keep track of.
df.index = ['a','b','c','d','e','f','g','h','i','j']
df
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| a | 0.141204 | 0.389917 | 0.730803 | 0.228728 | 0.305262 | 0.557365 | 0.319747 | 0.976132 | 0.690436 | 0.729483 |
| b | 0.376783 | 0.000955 | 0.496254 | 0.883187 | 0.797196 | 0.637902 | 0.578254 | 0.503462 | 0.637138 | 0.018450 |
| c | 0.605771 | 0.725509 | 0.216042 | 0.003507 | 0.634607 | 0.977456 | 0.718495 | 0.650531 | 0.585406 | 0.757895 |
| d | 0.658903 | 0.807604 | 0.997131 | 0.690750 | 0.081127 | 0.107874 | 0.417861 | 0.723787 | 0.092045 | 0.416555 |
| e | 0.316168 | 0.655888 | 0.082085 | 0.081928 | 0.225303 | 0.416817 | 0.513325 | 0.534475 | 0.173097 | 0.711062 |
| f | 0.922911 | 0.111008 | 0.139259 | 0.876430 | 0.495703 | 0.716098 | 0.636534 | 0.798181 | 0.958087 | 0.077777 |
| g | 0.750956 | 0.128776 | 0.302774 | 0.505208 | 0.243478 | 0.098666 | 0.722968 | 0.339785 | 0.708605 | 0.137896 |
| h | 0.271098 | 0.823605 | 0.739260 | 0.596863 | 0.642868 | 0.557722 | 0.205087 | 0.345959 | 0.713188 | 0.126699 |
| i | 0.975130 | 0.694601 | 0.146673 | 0.161391 | 0.693398 | 0.758124 | 0.801036 | 0.790702 | 0.734415 | 0.880065 |
| j | 0.458111 | 0.855722 | 0.718809 | 0.328408 | 0.575351 | 0.471048 | 0.148364 | 0.568606 | 0.076396 | 0.403263 |
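With labels in place, rows and columns can be selected by name with `.loc` or by integer position with `.iloc`. A brief sketch on a small made-up DataFrame (the `small` name and its `x`/`y` columns are illustrative):

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(np.arange(6).reshape(3, 2),
                     index=['a', 'b', 'c'],
                     columns=['x', 'y'])

# Label-based: row 'b', column 'y'
print(small.loc['b', 'y'])  # 3

# Position-based: second row, second column (the same element)
print(small.iloc[1, 1])  # 3

# Label slices include both endpoints, unlike positional slices
print(small.loc['a':'b'].shape)  # (2, 2)
print(small.iloc[0:1].shape)     # (1, 2)
```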
# Still just a NumPy array under the hood
df.values
array([[1.41204469e-01, 3.89917325e-01, 7.30803440e-01, 2.28727875e-01,
3.05261875e-01, 5.57365149e-01, 3.19747387e-01, 9.76131931e-01,
6.90435916e-01, 7.29483182e-01],
[3.76782504e-01, 9.54623086e-04, 4.96253865e-01, 8.83186872e-01,
7.97195917e-01, 6.37901937e-01, 5.78254274e-01, 5.03462246e-01,
6.37138050e-01, 1.84502898e-02],
[6.05770836e-01, 7.25509385e-01, 2.16041657e-01, 3.50670357e-03,
6.34606720e-01, 9.77455944e-01, 7.18494507e-01, 6.50530622e-01,
5.85406094e-01, 7.57894513e-01],
[6.58903005e-01, 8.07604015e-01, 9.97131444e-01, 6.90749625e-01,
8.11268416e-02, 1.07874063e-01, 4.17861123e-01, 7.23787367e-01,
9.20449637e-02, 4.16555164e-01],
[3.16168125e-01, 6.55887756e-01, 8.20854844e-02, 8.19284438e-02,
2.25303248e-01, 4.16816605e-01, 5.13325213e-01, 5.34474506e-01,
1.73097254e-01, 7.11062044e-01],
[9.22911408e-01, 1.11008058e-01, 1.39258614e-01, 8.76430130e-01,
4.95702854e-01, 7.16097827e-01, 6.36533645e-01, 7.98180948e-01,
9.58087288e-01, 7.77772061e-02],
[7.50955759e-01, 1.28775956e-01, 3.02774126e-01, 5.05208454e-01,
2.43478124e-01, 9.86661911e-02, 7.22968293e-01, 3.39784935e-01,
7.08604837e-01, 1.37895600e-01],
[2.71098182e-01, 8.23604609e-01, 7.39260002e-01, 5.96863088e-01,
6.42867723e-01, 5.57721585e-01, 2.05087063e-01, 3.45959313e-01,
7.13187596e-01, 1.26699262e-01],
[9.75129602e-01, 6.94600553e-01, 1.46673272e-01, 1.61391117e-01,
6.93398224e-01, 7.58123721e-01, 8.01035619e-01, 7.90701972e-01,
7.34414677e-01, 8.80065416e-01],
[4.58111282e-01, 8.55722122e-01, 7.18808961e-01, 3.28407866e-01,
5.75350931e-01, 4.71047838e-01, 1.48364415e-01, 5.68606354e-01,
7.63957334e-02, 4.03262879e-01]])
df.index.values
array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype=object)
# Mean of each column
df.mean()
0 0.547704
1 0.519358
2 0.456909
3 0.435640
4 0.469429
5 0.529907
6 0.506167
7 0.623162
8 0.536881
9 0.425915
dtype: float64
# Mean of each row
df.mean(axis=1)
a 0.506908
b 0.492958
c 0.587522
d 0.499364
e 0.371015
f 0.573199
g 0.393911
h 0.502235
i 0.663553
j 0.460408
dtype: float64
Reading files with Pandas
Most of the time, you will read in tabular data and let Pandas do the work
# Path to csv file
csv_fn = './data/GLAH14_tllz_conus_lulcfilt_demfilt.csv'
pd.read_csv(csv_fn)
| | decyear | ordinal | lat | lon | glas_z | dem_z | dem_z_std | lulc |
|---|---|---|---|---|---|---|---|---|
| 0 | 2003.139571 | 731266.943345 | 44.157897 | -105.356562 | 1398.51 | 1400.52 | 0.33 | 31 |
| 1 | 2003.139571 | 731266.943346 | 44.150175 | -105.358116 | 1387.11 | 1384.64 | 0.43 | 31 |
| 2 | 2003.139571 | 731266.943347 | 44.148632 | -105.358427 | 1392.83 | 1383.49 | 0.28 | 31 |
| 3 | 2003.139571 | 731266.943347 | 44.147087 | -105.358738 | 1384.24 | 1382.85 | 0.84 | 31 |
| 4 | 2003.139571 | 731266.943347 | 44.145542 | -105.359048 | 1369.21 | 1380.24 | 1.73 | 31 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 65231 | 2009.775995 | 733691.238340 | 37.896222 | -117.044399 | 1556.16 | 1556.43 | 0.00 | 31 |
| 65232 | 2009.775995 | 733691.238340 | 37.897769 | -117.044675 | 1556.02 | 1556.43 | 0.00 | 31 |
| 65233 | 2009.775995 | 733691.238340 | 37.899319 | -117.044952 | 1556.19 | 1556.44 | 0.00 | 31 |
| 65234 | 2009.775995 | 733691.238340 | 37.900869 | -117.045230 | 1556.18 | 1556.44 | 0.00 | 31 |
| 65235 | 2009.775995 | 733691.238341 | 37.902420 | -117.045508 | 1556.32 | 1556.44 | 0.00 | 31 |
65236 rows × 8 columns
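If you want to experiment with `pd.read_csv` without the course data file, the sketch below reads from an in-memory string instead. It assumes the CSV starts with a header row of column names, as the output above suggests. The three data rows are copied from that output, and the names `csv_text`, `sample_df` and `subset_df` are illustrative.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt; values copied from the table above
csv_text = """decyear,ordinal,lat,lon,glas_z,dem_z,dem_z_std,lulc
2003.139571,731266.943345,44.157897,-105.356562,1398.51,1400.52,0.33,31
2003.139571,731266.943346,44.150175,-105.358116,1387.11,1384.64,0.43,31
2003.139571,731266.943347,44.148632,-105.358427,1392.83,1383.49,0.28,31
"""

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

# usecols limits which columns are parsed; nrows limits how many rows are read
subset_df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['lat', 'lon', 'glas_z'],
    nrows=2,
)

# Column types are inferred: floats for most columns, integers for lulc
print(sample_df.dtypes)
print(subset_df)
```

Options such as `usecols` and `nrows` can help when a file is large and you only need part of it.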
# Store output as a new Pandas DataFrame
glas_df = pd.read_csv(csv_fn)
glas_df
| | decyear | ordinal | lat | lon | glas_z | dem_z | dem_z_std | lulc |
|---|---|---|---|---|---|---|---|---|
| 0 | 2003.139571 | 731266.943345 | 44.157897 | -105.356562 | 1398.51 | 1400.52 | 0.33 | 31 |
| 1 | 2003.139571 | 731266.943346 | 44.150175 | -105.358116 | 1387.11 | 1384.64 | 0.43 | 31 |
| 2 | 2003.139571 | 731266.943347 | 44.148632 | -105.358427 | 1392.83 | 1383.49 | 0.28 | 31 |
| 3 | 2003.139571 | 731266.943347 | 44.147087 | -105.358738 | 1384.24 | 1382.85 | 0.84 | 31 |
| 4 | 2003.139571 | 731266.943347 | 44.145542 | -105.359048 | 1369.21 | 1380.24 | 1.73 | 31 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 65231 | 2009.775995 | 733691.238340 | 37.896222 | -117.044399 | 1556.16 | 1556.43 | 0.00 | 31 |
| 65232 | 2009.775995 | 733691.238340 | 37.897769 | -117.044675 | 1556.02 | 1556.43 | 0.00 | 31 |
| 65233 | 2009.775995 | 733691.238340 | 37.899319 | -117.044952 | 1556.19 | 1556.44 | 0.00 | 31 |
| 65234 | 2009.775995 | 733691.238340 | 37.900869 | -117.045230 | 1556.18 | 1556.44 | 0.00 | 31 |
| 65235 | 2009.775995 | 733691.238341 | 37.902420 | -117.045508 | 1556.32 | 1556.44 | 0.00 | 31 |
65236 rows × 8 columns
type(glas_df)
pandas.core.frame.DataFrame
# For demonstration purposes - multiply index to illustrate difference between loc and iloc
glas_df.set_index(glas_df.index*10+1, inplace=True)
glas_df
| | decyear | ordinal | lat | lon | glas_z | dem_z | dem_z_std | lulc |
|---|---|---|---|---|---|---|---|---|
| 1 | 2003.139571 | 731266.943345 | 44.157897 | -105.356562 | 1398.51 | 1400.52 | 0.33 | 31 |
| 11 | 2003.139571 | 731266.943346 | 44.150175 | -105.358116 | 1387.11 | 1384.64 | 0.43 | 31 |
| 21 | 2003.139571 | 731266.943347 | 44.148632 | -105.358427 | 1392.83 | 1383.49 | 0.28 | 31 |
| 31 | 2003.139571 | 731266.943347 | 44.147087 | -105.358738 | 1384.24 | 1382.85 | 0.84 | 31 |
| 41 | 2003.139571 | 731266.943347 | 44.145542 | -105.359048 | 1369.21 | 1380.24 | 1.73 | 31 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 652311 | 2009.775995 | 733691.238340 | 37.896222 | -117.044399 | 1556.16 | 1556.43 | 0.00 | 31 |
| 652321 | 2009.775995 | 733691.238340 | 37.897769 | -117.044675 | 1556.02 | 1556.43 | 0.00 | 31 |
| 652331 | 2009.775995 | 733691.238340 | 37.899319 | -117.044952 | 1556.19 | 1556.44 | 0.00 | 31 |
| 652341 | 2009.775995 | 733691.238340 | 37.900869 | -117.045230 | 1556.18 | 1556.44 | 0.00 | 31 |
| 652351 | 2009.775995 | 733691.238341 | 37.902420 | -117.045508 | 1556.32 | 1556.44 | 0.00 | 31 |
65236 rows × 8 columns
# Awesome descriptive statistics for each column
glas_df.describe()
| | decyear | ordinal | lat | lon | glas_z | dem_z | dem_z_std | lulc |
|---|---|---|---|---|---|---|---|---|
| count | 65236.000000 | 65236.000000 | 65236.000000 | 65236.000000 | 65236.000000 | 65236.000000 | 65236.000000 | 65236.000000 |
| mean | 2005.945322 | 732291.890372 | 40.946798 | -115.040612 | 1791.494167 | 1792.260964 | 5.504748 | 30.339444 |
| std | 1.729573 | 631.766682 | 3.590476 | 5.465065 | 1037.183482 | 1037.925371 | 7.518558 | 3.480576 |
| min | 2003.139571 | 731266.943345 | 34.999455 | -124.482406 | -115.550000 | -114.570000 | 0.000000 | 12.000000 |
| 25% | 2004.444817 | 731743.803182 | 38.101451 | -119.257599 | 1166.970000 | 1168.240000 | 0.070000 | 31.000000 |
| 50% | 2005.846896 | 732256.116938 | 39.884541 | -115.686241 | 1555.730000 | 1556.380000 | 1.350000 | 31.000000 |
| 75% | 2007.223249 | 732758.486046 | 43.453565 | -109.816475 | 2399.355000 | 2400.072500 | 9.530000 | 31.000000 |
| max | 2009.775995 | 733691.238341 | 48.999727 | -104.052336 | 4340.310000 | 4252.940000 | 49.900000 | 31.000000 |
Indexing and selecting#
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#different-choices-for-indexing
# Integer (position-based) indexing, like NumPy
glas_df.iloc[2]
decyear 2003.139571
ordinal 731266.943347
lat 44.148632
lon -105.358427
glas_z 1392.830000
dem_z 1383.490000
dem_z_std 0.280000
lulc 31.000000
Name: 21, dtype: float64
glas_df.iloc[0:3]
| | decyear | ordinal | lat | lon | glas_z | dem_z | dem_z_std | lulc |
|---|---|---|---|---|---|---|---|---|
| 1 | 2003.139571 | 731266.943345 | 44.157897 | -105.356562 | 1398.51 | 1400.52 | 0.33 | 31 |
| 11 | 2003.139571 | 731266.943346 | 44.150175 | -105.358116 | 1387.11 | 1384.64 | 0.43 | 31 |
| 21 | 2003.139571 | 731266.943347 | 44.148632 | -105.358427 | 1392.83 | 1383.49 | 0.28 | 31 |
# Label-based indexing: select the row whose index label is 21
glas_df.loc[21]
decyear 2003.139571
ordinal 731266.943347
lat 44.148632
lon -105.358427
glas_z 1392.830000
dem_z 1383.490000
dem_z_std 0.280000
lulc 31.000000
Name: 21, dtype: float64
# Get rows with index labels between 0 and 20 (label slices include both endpoints)
glas_df.loc[0:20]
| | decyear | ordinal | lat | lon | glas_z | dem_z | dem_z_std | lulc |
|---|---|---|---|---|---|---|---|---|
| 1 | 2003.139571 | 731266.943345 | 44.157897 | -105.356562 | 1398.51 | 1400.52 | 0.33 | 31 |
| 11 | 2003.139571 | 731266.943346 | 44.150175 | -105.358116 | 1387.11 | 1384.64 | 0.43 | 31 |
# Get rows at integer positions 0 through 19 (the slice end is excluded)
glas_df.iloc[0:20]
| | decyear | ordinal | lat | lon | glas_z | dem_z | dem_z_std | lulc |
|---|---|---|---|---|---|---|---|---|
| 1 | 2003.139571 | 731266.943345 | 44.157897 | -105.356562 | 1398.51 | 1400.52 | 0.33 | 31 |
| 11 | 2003.139571 | 731266.943346 | 44.150175 | -105.358116 | 1387.11 | 1384.64 | 0.43 | 31 |
| 21 | 2003.139571 | 731266.943347 | 44.148632 | -105.358427 | 1392.83 | 1383.49 | 0.28 | 31 |
| 31 | 2003.139571 | 731266.943347 | 44.147087 | -105.358738 | 1384.24 | 1382.85 | 0.84 | 31 |
| 41 | 2003.139571 | 731266.943347 | 44.145542 | -105.359048 | 1369.21 | 1380.24 | 1.73 | 31 |
| 51 | 2003.139571 | 731266.943347 | 44.143996 | -105.359359 | 1366.60 | 1375.23 | 1.60 | 31 |
| 61 | 2003.139571 | 731266.943351 | 44.126969 | -105.362876 | 1355.14 | 1379.38 | 2.17 | 31 |
| 71 | 2003.139571 | 731266.943360 | 44.074358 | -105.373549 | 1369.53 | 1391.71 | 2.88 | 31 |
| 81 | 2003.139571 | 731266.943361 | 44.072806 | -105.373864 | 1380.02 | 1387.79 | 0.45 | 31 |
| 91 | 2003.139571 | 731266.943361 | 44.071256 | -105.374177 | 1391.47 | 1396.90 | 1.56 | 31 |
| 101 | 2003.139571 | 731266.943362 | 44.063515 | -105.375712 | 1388.58 | 1408.54 | 0.24 | 31 |
| 111 | 2003.139571 | 731266.943363 | 44.061967 | -105.376015 | 1372.55 | 1406.21 | 0.17 | 31 |
| 121 | 2003.139571 | 731266.943364 | 44.057328 | -105.376934 | 1402.38 | 1406.23 | 0.33 | 31 |
| 131 | 2003.139571 | 731266.943364 | 44.055780 | -105.377243 | 1401.82 | 1405.75 | 0.35 | 31 |
| 141 | 2003.139571 | 731266.943364 | 44.054231 | -105.377553 | 1399.31 | 1406.05 | 0.68 | 31 |
| 151 | 2003.139571 | 731266.943366 | 44.046487 | -105.379115 | 1394.22 | 1398.14 | 0.27 | 31 |
| 161 | 2003.139571 | 731266.943366 | 44.044941 | -105.379430 | 1394.94 | 1400.58 | 0.17 | 31 |
| 171 | 2003.139571 | 731266.943367 | 44.041850 | -105.380064 | 1386.00 | 1389.69 | 0.57 | 31 |
| 181 | 2003.139571 | 731266.943424 | 43.737000 | -105.441568 | 1496.53 | 1498.16 | 1.52 | 31 |
| 191 | 2003.139571 | 731266.943429 | 43.706060 | -105.447754 | 1459.99 | 1460.90 | 0.08 | 31 |
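The difference between `loc` and `iloc` is easier to see on a tiny example. This sketch uses a made-up frame (`demo_df`, with illustrative values) whose index labels differ from its row positions, mirroring the modified index above.

```python
import pandas as pd

# Hypothetical frame: index labels (1, 11, 21, 31) differ from row positions (0..3)
demo_df = pd.DataFrame({'z': [10.0, 20.0, 30.0, 40.0]}, index=[1, 11, 21, 31])

# iloc is position-based and excludes the slice end, like NumPy: positions 0 and 1
by_position = demo_df.iloc[0:2]

# loc is label-based and includes both slice ends: labels 1, 11 and 21
by_label = demo_df.loc[1:21]

# Label 21 and position 2 refer to the same row here
same_row = demo_df.loc[21].equals(demo_df.iloc[2])

print(by_position)
print(by_label)
print(same_row)
```

Note that `iloc[0:2]` returns two rows while `loc[1:21]` returns three, because only the label-based slice includes its end point.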
Selecting columns#
glas_df.columns
Index(['decyear', 'ordinal', 'lat', 'lon', 'glas_z', 'dem_z', 'dem_z_std',
'lulc'],
dtype='object')
glas_df['glas_z']
1 1398.51
11 1387.11
21 1392.83
31 1384.24
41 1369.21
...
652311 1556.16
652321 1556.02
652331 1556.19
652341 1556.18
652351 1556.32
Name: glas_z, Length: 65236, dtype: float64
glas_df.glas_z
1 1398.51
11 1387.11
21 1392.83
31 1384.24
41 1369.21
...
652311 1556.16
652321 1556.02
652331 1556.19
652341 1556.18
652351 1556.32
Name: glas_z, Length: 65236, dtype: float64
glas_df.iloc[:,4]
1 1398.51
11 1387.11
21 1392.83
31 1384.24
41 1369.21
...
652311 1556.16
652321 1556.02
652331 1556.19
652341 1556.18
652351 1556.32
Name: glas_z, Length: 65236, dtype: float64
glas_df.loc[:,'glas_z']
1 1398.51
11 1387.11
21 1392.83
31 1384.24
41 1369.21
...
652311 1556.16
652321 1556.02
652331 1556.19
652341 1556.18
652351 1556.32
Name: glas_z, Length: 65236, dtype: float64
#Multiple columns
glas_df['glas_z', 'dem_z']
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/sw/miniconda3/envs/gda/lib/python3.8/site-packages/pandas/core/indexes/base.py:3652, in Index.get_loc(self, key)
3651 try:
-> 3652 return self._engine.get_loc(casted_key)
3653 except KeyError as err:
File ~/sw/miniconda3/envs/gda/lib/python3.8/site-packages/pandas/_libs/index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()
File ~/sw/miniconda3/envs/gda/lib/python3.8/site-packages/pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: ('glas_z', 'dem_z')
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[103], line 2
1 #Multiple columns
----> 2 glas_df['glas_z', 'dem_z']
File ~/sw/miniconda3/envs/gda/lib/python3.8/site-packages/pandas/core/frame.py:3761, in DataFrame.__getitem__(self, key)
3759 if self.columns.nlevels > 1:
3760 return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
3762 if is_integer(indexer):
3763 indexer = [indexer]
File ~/sw/miniconda3/envs/gda/lib/python3.8/site-packages/pandas/core/indexes/base.py:3654, in Index.get_loc(self, key)
3652 return self._engine.get_loc(casted_key)
3653 except KeyError as err:
-> 3654 raise KeyError(key) from err
3655 except TypeError:
3656 # If we have a listlike key, _check_indexing_error will raise
3657 # InvalidIndexError. Otherwise we fall through and re-raise
3658 # the TypeError.
3659 self._check_indexing_error(key)
KeyError: ('glas_z', 'dem_z')
# Need to pass in a list of column names
glas_df[['glas_z', 'dem_z']]
| | glas_z | dem_z |
|---|---|---|
| 1 | 1398.51 | 1400.52 |
| 11 | 1387.11 | 1384.64 |
| 21 | 1392.83 | 1383.49 |
| 31 | 1384.24 | 1382.85 |
| 41 | 1369.21 | 1380.24 |
| ... | ... | ... |
| 652311 | 1556.16 | 1556.43 |
| 652321 | 1556.02 | 1556.43 |
| 652331 | 1556.19 | 1556.44 |
| 652341 | 1556.18 | 1556.44 |
| 652351 | 1556.32 | 1556.44 |
65236 rows × 2 columns
glas_df.loc[:,['glas_z', 'dem_z']]
| | glas_z | dem_z |
|---|---|---|
| 1 | 1398.51 | 1400.52 |
| 11 | 1387.11 | 1384.64 |
| 21 | 1392.83 | 1383.49 |
| 31 | 1384.24 | 1382.85 |
| 41 | 1369.21 | 1380.24 |
| ... | ... | ... |
| 652311 | 1556.16 | 1556.43 |
| 652321 | 1556.02 | 1556.43 |
| 652331 | 1556.19 | 1556.44 |
| 652341 | 1556.18 | 1556.44 |
| 652351 | 1556.32 | 1556.44 |
65236 rows × 2 columns
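As a recap of column selection, the sketch below reproduces the `KeyError` and the list-based fix on a small made-up frame. The name `demo_df` is illustrative and its values are copied from the first two rows above. It also shows that a single name returns a Series, while a one-item list returns a one-column DataFrame.

```python
import pandas as pd

# Hypothetical two-row frame with a few of the demo columns
demo_df = pd.DataFrame({
    'glas_z': [1398.51, 1387.11],
    'dem_z': [1400.52, 1384.64],
    'lulc': [31, 31],
})

# df['a', 'b'] passes the tuple ('a', 'b') as one key, which is not a column name
try:
    demo_df['glas_z', 'dem_z']
    tuple_key_failed = False
except KeyError:
    tuple_key_failed = True

# A list of names selects several columns and returns a DataFrame
two_cols = demo_df[['glas_z', 'dem_z']]

# One name returns a Series; a one-item list returns a one-column DataFrame
as_series = demo_df['glas_z']
as_frame = demo_df[['glas_z']]

print(tuple_key_failed)
print(type(as_series), type(as_frame))
```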
Boolean indexing#
glas_df['lulc']
1 31
11 31
21 31
31 31
41 31
..
652311 31
652321 31
652331 31
652341 31
652351 31
Name: lulc, Length: 65236, dtype: int64
glas_df['lulc'].value_counts()
lulc
31 62968
12 2268
Name: count, dtype: int64
glas_df['lulc'] == 12
1 False
11 False
21 False
31 False
41 False
...
652311 False
652321 False
652331 False
652341 False
652351 False
Name: lulc, Length: 65236, dtype: bool
# Boolean Series (index and single column) will be True for records with 'lulc' == 12
idx2 = glas_df['lulc'] == 12
type(idx2)
pandas.core.series.Series
idx2.shape
(65236,)
glas_df.shape
(65236, 8)
# Use the boolean Series to select the matching rows; returns a new DataFrame with all columns
glas_df[idx2]
| | decyear | ordinal | lat | lon | glas_z | dem_z | dem_z_std | lulc |
|---|---|---|---|---|---|---|---|---|
| 231 | 2003.139573 | 731266.944184 | 39.669291 | -106.225142 | 3505.12 | 3508.25 | 5.74 | 12 |
| 301 | 2003.139573 | 731266.944316 | 38.961190 | -106.355153 | 4046.47 | 4047.25 | 7.14 | 12 |
| 4891 | 2003.147846 | 731269.963718 | 48.587233 | -113.484046 | 2135.76 | 2123.37 | 1.18 | 12 |
| 4921 | 2003.147846 | 731269.963811 | 48.091352 | -113.595790 | 1632.52 | 1615.77 | 11.43 | 12 |
| 7561 | 2003.157366 | 731273.438572 | 43.897412 | -114.457131 | 2886.39 | 2889.82 | 20.31 | 12 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 647241 | 2009.764964 | 733687.211708 | 40.689722 | -105.918309 | 3267.33 | 3267.62 | 1.83 | 12 |
| 647251 | 2009.764964 | 733687.211709 | 40.694371 | -105.919164 | 3235.77 | 3238.94 | 3.78 | 12 |
| 649831 | 2009.771998 | 733689.779258 | 47.910365 | -123.628017 | 1671.86 | 1711.73 | 8.44 | 12 |
| 649841 | 2009.771998 | 733689.779258 | 47.908820 | -123.628357 | 1737.70 | 1776.17 | 7.70 | 12 |
| 649851 | 2009.771998 | 733689.779258 | 47.907275 | -123.628697 | 1782.52 | 1828.93 | 4.41 | 12 |
2268 rows × 8 columns
glas_df[idx2].shape
(2268, 8)
glas_df[idx2].mean()
decyear 2006.008627
ordinal 732315.035881
lat 43.065223
lon -112.936499
glas_z 2918.746261
dem_z 2920.785754
dem_z_std 9.719951
lulc 12.000000
dtype: float64
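Boolean masks can also be combined. The following is a minimal sketch on a four-row made-up frame (`demo_df`, with values borrowed from rows shown above); the 3000 m threshold is an arbitrary assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical four-row sample with two lulc classes
demo_df = pd.DataFrame({
    'glas_z': [3505.12, 1398.51, 2135.76, 1556.16],
    'dem_z': [3508.25, 1400.52, 2123.37, 1556.43],
    'lulc': [12, 31, 12, 31],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
high_lulc12 = (demo_df['lulc'] == 12) & (demo_df['glas_z'] > 3000)

# loc accepts a boolean Series for rows plus a list of column names
selected = demo_df.loc[high_lulc12, ['glas_z', 'dem_z']]

# ~ inverts the mask, selecting every row that did not match
others = demo_df[~high_lulc12]

print(selected)
print(others)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `==` and `>`.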
Groupby#
Let’s consider statistics for groups of rows that share the same value in a given column.
glas_df.groupby('lulc')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f308da5a040>
glas_df.groupby('lulc').count()
| lulc | decyear | ordinal | lat | lon | glas_z | dem_z | dem_z_std |
|---|---|---|---|---|---|---|---|
| 12 | 2268 | 2268 | 2268 | 2268 | 2268 | 2268 | 2268 |
| 31 | 62968 | 62968 | 62968 | 62968 | 62968 | 62968 | 62968 |
glas_df.groupby('lulc').mean()
| lulc | decyear | ordinal | lat | lon | glas_z | dem_z | dem_z_std |
|---|---|---|---|---|---|---|---|
| 12 | 2006.008627 | 732315.035881 | 43.065223 | -112.936499 | 2918.746261 | 2920.785754 | 9.719951 |
| 31 | 2005.943042 | 732291.056710 | 40.870496 | -115.116398 | 1750.892469 | 1751.613426 | 5.352924 |
glas_df.groupby('lulc').agg(['mean', 'std'])
| lulc | decyear (mean) | decyear (std) | ordinal (mean) | ordinal (std) | lat (mean) | lat (std) | lon (mean) | lon (std) | glas_z (mean) | glas_z (std) | dem_z (mean) | dem_z (std) | dem_z_std (mean) | dem_z_std (std) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12 | 2006.008627 | 1.498488 | 732315.035881 | 547.316709 | 43.065223 | 3.569772 | -112.936499 | 7.610318 | 2918.746261 | 772.429857 | 2920.785754 | 769.897983 | 9.719951 | 5.805685 |
| 31 | 2005.943042 | 1.737290 | 732291.056710 | 634.586821 | 40.870496 | 3.567855 | -115.116398 | 5.356521 | 1750.892469 | 1022.544938 | 1751.613426 | 1023.340882 | 5.352924 | 7.529161 |
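Aggregating every column with several functions produces a wide table with two-level column names. One possible alternative, sketched here on a small made-up frame, is to select a single column before aggregating, or to use named aggregation to get flat column names. The names `demo_df`, `glas_stats`, `summary` and the output column names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical four-row sample with two lulc classes
demo_df = pd.DataFrame({
    'glas_z': [3505.12, 1398.51, 2135.76, 1556.16],
    'dem_z': [3508.25, 1400.52, 2123.37, 1556.43],
    'lulc': [12, 31, 12, 31],
})

# Select one column before aggregating to limit the statistics to that column
glas_stats = demo_df.groupby('lulc')['glas_z'].agg(['mean', 'count'])

# Named aggregation: new_column_name=(source_column, function)
summary = demo_df.groupby('lulc').agg(
    glas_z_mean=('glas_z', 'mean'),
    dem_z_max=('dem_z', 'max'),
    n=('glas_z', 'count'),
)

print(glas_stats)
print(summary)
```

In both results the group keys (12 and 31) become the index, so a group can be looked up with `summary.loc[12]`.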