Demo: NumPy, Pandas#

UW Geospatial Data Analysis
CEE467/CEWA567
David Shean

modified by Quinn Brencher

Introduction#

This is a quick demo of some key functionality for these core Python packages, emphasizing topics that will help with lab exercises this week and later in the quarter. It is by no means complete!

Please consult the reading assignment and lists of other excellent, more complete online resources.

NumPy#

NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

Pandas#

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language. It is already well on its way toward this goal.

Matplotlib#

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.

Matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, errorcharts, scatterplots, etc., with just a few lines of code. For examples, see the sample plots and thumbnail gallery.

For simple plotting the pyplot module provides a MATLAB-like interface, particularly when combined with IPython. For the power user, you have full control of line styles, font properties, axes properties, etc, via an object oriented interface or via a set of functions familiar to MATLAB users.

Import necessary modules#

  • Use shorthand, so you don’t have to type out full module name each time

  • Note different structure for matplotlib package

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

NumPy 1D array#

#Create 1D array of random integers
#Arguments are low (inclusive), high (exclusive), and size
a = np.random.randint(0,10,10)
a
array([4, 3, 3, 6, 3, 6, 5, 2, 3, 9])
type(a)
numpy.ndarray
#np.ndarray?
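
Random draws change every time the cell runs, so your values will differ from the ones shown here. As a minimal sketch of one way to get repeatable numbers, the example below uses NumPy's newer `default_rng` generator. The seed value 42 and the names `rng`, `a_seeded`, and `a_repeat` are arbitrary choices for illustration, not part of the original demo.

```python
import numpy as np

# Assumption: the seed 42 is arbitrary; any integer works
rng = np.random.default_rng(42)

# 10 integers drawn from [0, 10): low is inclusive, high is exclusive
a_seeded = rng.integers(0, 10, size=10)

# A fresh generator with the same seed reproduces the same draw
a_repeat = np.random.default_rng(42).integers(0, 10, size=10)
print(np.array_equal(a_seeded, a_repeat))  # True
```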

Constructing an array#

#np.array?
np.array(0, 1, 2)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[6], line 1
----> 1 np.array(0, 1, 2)

TypeError: array() takes from 1 to 2 positional arguments but 3 were given
#Pass in an array-like object - need brackets around the numbers
np.array([0, 1, 2])
array([0, 1, 2])
mylist = [0, 1, 2]
np.array(mylist)
array([0, 1, 2])
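
The same pattern extends to more dimensions and to explicit datatypes. This is a small sketch with made-up values (`nested`, `arr2d`, and `arr_f` are illustrative names): a list of equal-length lists becomes a 2D array, and the optional `dtype` argument sets the element type at creation.

```python
import numpy as np

# A nested list (list of equal-length lists) becomes a 2D array
nested = [[0, 1, 2], [3, 4, 5]]
arr2d = np.array(nested)
print(arr2d.shape)  # (2, 3)

# The optional dtype argument sets the element type at creation
arr_f = np.array([0, 1, 2], dtype='float32')
print(arr_f.dtype)  # float32
```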

Array properties and datatypes#

a
array([4, 3, 3, 6, 3, 6, 5, 2, 3, 9])
a.shape
(10,)
a.size
10
a.dtype
dtype('int64')

What is ‘int64’?#

  • Signed integer represented by 64 bits

  • Each bit can be 0 or 1

  • 0 = 0000000000000000000000000000000000000000000000000000000000000000

  • 1 = 0000000000000000000000000000000000000000000000000000000000000001

  • 2 = 0000000000000000000000000000000000000000000000000000000000000010

  • https://numpy.org/doc/stable/user/basics.types.html
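
If you want to check these limits and bit patterns yourself, NumPy can report them directly. The sketch below uses `np.iinfo` for the valid range of an integer type and `np.binary_repr` for the bit string; the 8-bit width is chosen only to keep the output short.

```python
import numpy as np

# Valid range of a signed 64-bit integer
info = np.iinfo(np.int64)
print(info.min)  # -9223372036854775808
print(info.max)  # 9223372036854775807

# binary_repr returns the bit pattern as a string; width pads with leading zeros
print(np.binary_repr(2, width=8))   # 00000010
# Negative values are shown in two's complement form when width is given
print(np.binary_repr(-1, width=8))  # 11111111
```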

#Possible unique combinations of 64 bits
#Caution: the variable name `range` shadows Python's built-in range() for the rest of the session
range = 2**64
range
18446744073709551616
print(f"{range:.2e}")
1.84e+19
mm = 2**64 // 2
mm
9223372036854775808
#One of the non-negative bit patterns is used for zero, so the largest value is mm - 1
f'A 64-bit signed integer can store values between -{mm} and +{mm - 1}'
'A 64-bit signed integer can store values between -9223372036854775808 and +9223372036854775807'
# Overkill for our single-digit integer values
a
array([4, 3, 3, 6, 3, 6, 5, 2, 3, 9])
#Number of bytes (8 bits each) for each element in the array
a.itemsize
8
#Total number of bytes for 10 elements
a.nbytes
80
# Recast to 8-bit unsigned integer (valid range: 0-255)
b = a.astype('uint8')
b
array([4, 3, 3, 6, 3, 6, 5, 2, 3, 9], dtype=uint8)
b.dtype
dtype('uint8')
2**8
256
b
array([4, 3, 3, 6, 3, 6, 5, 2, 3, 9], dtype=uint8)
b.nbytes
10
#Assign value within valid range
b[0] = 255
b
array([255,   3,   3,   6,   3,   6,   5,   2,   3,   9], dtype=uint8)
#Assign value outside of valid range - overflow!
# https://en.wikipedia.org/wiki/Integer_overflow
b[0] = 257
b
/tmp/ipykernel_802/1809295710.py:3: DeprecationWarning: NumPy will stop allowing conversion of out-of-bound Python integers to integer arrays.  The conversion of 257 to uint8 will fail in the future.
For the old behavior, usually:
    np.array(value).astype(dtype)`
will give the desired result (the cast overflows).
  b[0] = 257
array([1, 3, 3, 6, 3, 6, 5, 2, 3, 9], dtype=uint8)
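
The warning above suggests `astype` when wrap-around is what you actually want. Here is a minimal sketch of that behavior, along with a range check you could run before assigning. The sample values and the names `wrapped`, `value`, and `fits` are made up for illustration.

```python
import numpy as np

# Casting with astype wraps out-of-range values modulo 256 for uint8
wrapped = np.array([255, 256, 257, 300]).astype('uint8')
print(wrapped.tolist())  # [255, 0, 1, 44]

# Check whether a value fits in the target type before assigning it
value = 257
limits = np.iinfo(np.uint8)
fits = limits.min <= value <= limits.max
print(fits)  # False
```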

2D arrays#

a2 = np.random.random((10,10))
a2
array([[1.41204469e-01, 3.89917325e-01, 7.30803440e-01, 2.28727875e-01,
        3.05261875e-01, 5.57365149e-01, 3.19747387e-01, 9.76131931e-01,
        6.90435916e-01, 7.29483182e-01],
       [3.76782504e-01, 9.54623086e-04, 4.96253865e-01, 8.83186872e-01,
        7.97195917e-01, 6.37901937e-01, 5.78254274e-01, 5.03462246e-01,
        6.37138050e-01, 1.84502898e-02],
       [6.05770836e-01, 7.25509385e-01, 2.16041657e-01, 3.50670357e-03,
        6.34606720e-01, 9.77455944e-01, 7.18494507e-01, 6.50530622e-01,
        5.85406094e-01, 7.57894513e-01],
       [6.58903005e-01, 8.07604015e-01, 9.97131444e-01, 6.90749625e-01,
        8.11268416e-02, 1.07874063e-01, 4.17861123e-01, 7.23787367e-01,
        9.20449637e-02, 4.16555164e-01],
       [3.16168125e-01, 6.55887756e-01, 8.20854844e-02, 8.19284438e-02,
        2.25303248e-01, 4.16816605e-01, 5.13325213e-01, 5.34474506e-01,
        1.73097254e-01, 7.11062044e-01],
       [9.22911408e-01, 1.11008058e-01, 1.39258614e-01, 8.76430130e-01,
        4.95702854e-01, 7.16097827e-01, 6.36533645e-01, 7.98180948e-01,
        9.58087288e-01, 7.77772061e-02],
       [7.50955759e-01, 1.28775956e-01, 3.02774126e-01, 5.05208454e-01,
        2.43478124e-01, 9.86661911e-02, 7.22968293e-01, 3.39784935e-01,
        7.08604837e-01, 1.37895600e-01],
       [2.71098182e-01, 8.23604609e-01, 7.39260002e-01, 5.96863088e-01,
        6.42867723e-01, 5.57721585e-01, 2.05087063e-01, 3.45959313e-01,
        7.13187596e-01, 1.26699262e-01],
       [9.75129602e-01, 6.94600553e-01, 1.46673272e-01, 1.61391117e-01,
        6.93398224e-01, 7.58123721e-01, 8.01035619e-01, 7.90701972e-01,
        7.34414677e-01, 8.80065416e-01],
       [4.58111282e-01, 8.55722122e-01, 7.18808961e-01, 3.28407866e-01,
        5.75350931e-01, 4.71047838e-01, 1.48364415e-01, 5.68606354e-01,
        7.63957334e-02, 4.03262879e-01]])
a2.shape
(10, 10)
a2.size
100
a2.dtype
dtype('float64')
#Get first element along first axis
#Question: is this the first row or col?
a2[0]
array([0.14120447, 0.38991732, 0.73080344, 0.22872787, 0.30526187,
       0.55736515, 0.31974739, 0.97613193, 0.69043592, 0.72948318])
#Get first element along second axis
a2[:,0]
array([0.14120447, 0.3767825 , 0.60577084, 0.658903  , 0.31616813,
       0.92291141, 0.75095576, 0.27109818, 0.9751296 , 0.45811128])
#Get first element along both axes
a2[0,0]
0.14120446935395448
#Get slice along first axis - first 3 rows
a2[0:3]
array([[1.41204469e-01, 3.89917325e-01, 7.30803440e-01, 2.28727875e-01,
        3.05261875e-01, 5.57365149e-01, 3.19747387e-01, 9.76131931e-01,
        6.90435916e-01, 7.29483182e-01],
       [3.76782504e-01, 9.54623086e-04, 4.96253865e-01, 8.83186872e-01,
        7.97195917e-01, 6.37901937e-01, 5.78254274e-01, 5.03462246e-01,
        6.37138050e-01, 1.84502898e-02],
       [6.05770836e-01, 7.25509385e-01, 2.16041657e-01, 3.50670357e-03,
        6.34606720e-01, 9.77455944e-01, 7.18494507e-01, 6.50530622e-01,
        5.85406094e-01, 7.57894513e-01]])
#Slice along second axis - first 3 cols
a2[:,0:3]
array([[1.41204469e-01, 3.89917325e-01, 7.30803440e-01],
       [3.76782504e-01, 9.54623086e-04, 4.96253865e-01],
       [6.05770836e-01, 7.25509385e-01, 2.16041657e-01],
       [6.58903005e-01, 8.07604015e-01, 9.97131444e-01],
       [3.16168125e-01, 6.55887756e-01, 8.20854844e-02],
       [9.22911408e-01, 1.11008058e-01, 1.39258614e-01],
       [7.50955759e-01, 1.28775956e-01, 3.02774126e-01],
       [2.71098182e-01, 8.23604609e-01, 7.39260002e-01],
       [9.75129602e-01, 6.94600553e-01, 1.46673272e-01],
       [4.58111282e-01, 8.55722122e-01, 7.18808961e-01]])
a2[0:3,0:3]
array([[0.14120447, 0.38991732, 0.73080344],
       [0.3767825 , 0.00095462, 0.49625387],
       [0.60577084, 0.72550939, 0.21604166]])
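
One thing to keep in mind with slices like these: basic slicing returns a view onto the original data, not a copy. The sketch below uses a small made-up array (`grid`) to show that writing to a slice changes the parent array, and that `.copy()` gives you an independent array.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)

# A basic slice is a view: it shares memory with grid
sub = grid[0:2, 0:2]
sub[0, 0] = 99
print(grid[0, 0])  # 99

# .copy() makes an independent array, so grid is left alone
independent = grid[0:2, 0:2].copy()
independent[0, 0] = -1
print(grid[0, 0])  # 99
```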

ufunc#

  • Efficiently perform an operation element-by-element in “vectorized” fashion (not to be confused with a GIS vector dataset)

  • Do not loop over arrays (unless absolutely necessary)

  • https://numpy.org/doc/stable/reference/ufuncs.html
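
To make the comparison concrete, here is a small sketch showing that a vectorized expression and an explicit Python loop give the same result. The array `x` is made up for illustration. On arrays this small the speed difference is negligible, but it grows quickly with array size.

```python
import numpy as np

x = np.arange(5, dtype='float64')  # [0., 1., 2., 3., 4.]

# Vectorized: one expression, the element loop runs in compiled code
vec = np.sqrt(x) * 10

# Explicit Python loop: same result, but slow on large arrays
looped = np.empty_like(x)
for i, value in enumerate(x):
    looped[i] = np.sqrt(value) * 10

print(np.allclose(vec, looped))  # True
```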

a2 * 10
array([[1.41204469e+00, 3.89917325e+00, 7.30803440e+00, 2.28727875e+00,
        3.05261875e+00, 5.57365149e+00, 3.19747387e+00, 9.76131931e+00,
        6.90435916e+00, 7.29483182e+00],
       [3.76782504e+00, 9.54623086e-03, 4.96253865e+00, 8.83186872e+00,
        7.97195917e+00, 6.37901937e+00, 5.78254274e+00, 5.03462246e+00,
        6.37138050e+00, 1.84502898e-01],
       [6.05770836e+00, 7.25509385e+00, 2.16041657e+00, 3.50670357e-02,
        6.34606720e+00, 9.77455944e+00, 7.18494507e+00, 6.50530622e+00,
        5.85406094e+00, 7.57894513e+00],
       [6.58903005e+00, 8.07604015e+00, 9.97131444e+00, 6.90749625e+00,
        8.11268416e-01, 1.07874063e+00, 4.17861123e+00, 7.23787367e+00,
        9.20449637e-01, 4.16555164e+00],
       [3.16168125e+00, 6.55887756e+00, 8.20854844e-01, 8.19284438e-01,
        2.25303248e+00, 4.16816605e+00, 5.13325213e+00, 5.34474506e+00,
        1.73097254e+00, 7.11062044e+00],
       [9.22911408e+00, 1.11008058e+00, 1.39258614e+00, 8.76430130e+00,
        4.95702854e+00, 7.16097827e+00, 6.36533645e+00, 7.98180948e+00,
        9.58087288e+00, 7.77772061e-01],
       [7.50955759e+00, 1.28775956e+00, 3.02774126e+00, 5.05208454e+00,
        2.43478124e+00, 9.86661911e-01, 7.22968293e+00, 3.39784935e+00,
        7.08604837e+00, 1.37895600e+00],
       [2.71098182e+00, 8.23604609e+00, 7.39260002e+00, 5.96863088e+00,
        6.42867723e+00, 5.57721585e+00, 2.05087063e+00, 3.45959313e+00,
        7.13187596e+00, 1.26699262e+00],
       [9.75129602e+00, 6.94600553e+00, 1.46673272e+00, 1.61391117e+00,
        6.93398224e+00, 7.58123721e+00, 8.01035619e+00, 7.90701972e+00,
        7.34414677e+00, 8.80065416e+00],
       [4.58111282e+00, 8.55722122e+00, 7.18808961e+00, 3.28407866e+00,
        5.75350931e+00, 4.71047838e+00, 1.48364415e+00, 5.68606354e+00,
        7.63957334e-01, 4.03262879e+00]])
# Don't do this! A Python loop is much slower than the vectorized a2 * 10 above
#for n, i in enumerate(a2):
#    a2[n] = i * 10
#np.power(a2, 2)
a2**2
array([[1.99387022e-02, 1.52035520e-01, 5.34073667e-01, 5.23164406e-02,
        9.31848122e-02, 3.10655909e-01, 1.02238392e-01, 9.52833548e-01,
        4.76701755e-01, 5.32145713e-01],
       [1.41965055e-01, 9.11305236e-07, 2.46267899e-01, 7.80019051e-01,
        6.35521330e-01, 4.06918882e-01, 3.34378006e-01, 2.53474233e-01,
        4.05944895e-01, 3.40413195e-04],
       [3.66958306e-01, 5.26363868e-01, 4.66739974e-02, 1.22969699e-05,
        4.02725689e-01, 9.55420122e-01, 5.16234356e-01, 4.23190091e-01,
        3.42700295e-01, 5.74404092e-01],
       [4.34153170e-01, 6.52224244e-01, 9.94271116e-01, 4.77135045e-01,
        6.58156443e-03, 1.16368135e-02, 1.74607918e-01, 5.23868153e-01,
        8.47227534e-03, 1.73518205e-01],
       [9.99622834e-02, 4.30188749e-01, 6.73802675e-03, 6.71226990e-03,
        5.07615536e-02, 1.73736082e-01, 2.63502774e-01, 2.85662998e-01,
        2.99626595e-02, 5.05609231e-01],
       [8.51765467e-01, 1.23227889e-02, 1.93929617e-02, 7.68129773e-01,
        2.45721319e-01, 5.12796098e-01, 4.05175081e-01, 6.37092825e-01,
        9.17931251e-01, 6.04929379e-03],
       [5.63934553e-01, 1.65832469e-02, 9.16721711e-02, 2.55235582e-01,
        5.92815967e-02, 9.73501726e-03, 5.22683153e-01, 1.15453802e-01,
        5.02120816e-01, 1.90151964e-02],
       [7.34942241e-02, 6.78324551e-01, 5.46505351e-01, 3.56245546e-01,
        4.13278909e-01, 3.11053366e-01, 4.20607035e-02, 1.19687846e-01,
        5.08636547e-01, 1.60527030e-02],
       [9.50877740e-01, 4.82469929e-01, 2.15130488e-02, 2.60470928e-02,
        4.80801097e-01, 5.74751576e-01, 6.41658063e-01, 6.25209609e-01,
        5.39364917e-01, 7.74515137e-01],
       [2.09865947e-01, 7.32260350e-01, 5.16686322e-01, 1.07851727e-01,
        3.31028694e-01, 2.21886066e-01, 2.20119995e-02, 3.23313185e-01,
        5.83630808e-03, 1.62620949e-01]])
#a2**0.5
np.sqrt(a2)
array([[0.37577183, 0.6244336 , 0.85487042, 0.47825503, 0.55250509,
        0.74656892, 0.5654621 , 0.98799389, 0.83092474, 0.85409788],
       [0.61382612, 0.03089698, 0.70445288, 0.93978023, 0.89285828,
        0.79868763, 0.76043032, 0.70955074, 0.79820928, 0.13583184],
       [0.77831281, 0.85176839, 0.46480281, 0.05921743, 0.79662207,
        0.98866372, 0.84764055, 0.80655479, 0.76511835, 0.87057137],
       [0.81172841, 0.89866791, 0.99856469, 0.83111349, 0.28482774,
        0.32844187, 0.64642178, 0.85075694, 0.30338913, 0.64541085],
       [0.56228829, 0.80986897, 0.28650564, 0.28623145, 0.47466119,
        0.64561336, 0.71646717, 0.73107763, 0.41604958, 0.84324495],
       [0.96068278, 0.33317872, 0.37317371, 0.93617847, 0.70406168,
        0.84622564, 0.79783059, 0.89340973, 0.97881933, 0.27888565],
       [0.86657704, 0.35885367, 0.55024915, 0.71078017, 0.49343502,
        0.31411175, 0.85027542, 0.58291074, 0.84178669, 0.37134297],
       [0.5206709 , 0.90752664, 0.8598023 , 0.77256915, 0.80179032,
        0.7468076 , 0.45286539, 0.58818306, 0.84450435, 0.3559484 ],
       [0.98748651, 0.83342699, 0.38297947, 0.40173513, 0.83270536,
        0.870703  , 0.89500593, 0.88921424, 0.85697997, 0.93811802],
       [0.67683918, 0.9250525 , 0.84782602, 0.57306881, 0.75851891,
        0.68632925, 0.38518102, 0.75405991, 0.27639778, 0.63502983]])

Built-in functions#

  • Operate over entire array, specified axes, or slice

  • Very fast/efficient

a2.mean()
0.5051072363137312
a2.std()
0.2780218131440386
a2.min()
0.0009546230860033589
a2
array([[1.41204469e-01, 3.89917325e-01, 7.30803440e-01, 2.28727875e-01,
        3.05261875e-01, 5.57365149e-01, 3.19747387e-01, 9.76131931e-01,
        6.90435916e-01, 7.29483182e-01],
       [3.76782504e-01, 9.54623086e-04, 4.96253865e-01, 8.83186872e-01,
        7.97195917e-01, 6.37901937e-01, 5.78254274e-01, 5.03462246e-01,
        6.37138050e-01, 1.84502898e-02],
       [6.05770836e-01, 7.25509385e-01, 2.16041657e-01, 3.50670357e-03,
        6.34606720e-01, 9.77455944e-01, 7.18494507e-01, 6.50530622e-01,
        5.85406094e-01, 7.57894513e-01],
       [6.58903005e-01, 8.07604015e-01, 9.97131444e-01, 6.90749625e-01,
        8.11268416e-02, 1.07874063e-01, 4.17861123e-01, 7.23787367e-01,
        9.20449637e-02, 4.16555164e-01],
       [3.16168125e-01, 6.55887756e-01, 8.20854844e-02, 8.19284438e-02,
        2.25303248e-01, 4.16816605e-01, 5.13325213e-01, 5.34474506e-01,
        1.73097254e-01, 7.11062044e-01],
       [9.22911408e-01, 1.11008058e-01, 1.39258614e-01, 8.76430130e-01,
        4.95702854e-01, 7.16097827e-01, 6.36533645e-01, 7.98180948e-01,
        9.58087288e-01, 7.77772061e-02],
       [7.50955759e-01, 1.28775956e-01, 3.02774126e-01, 5.05208454e-01,
        2.43478124e-01, 9.86661911e-02, 7.22968293e-01, 3.39784935e-01,
        7.08604837e-01, 1.37895600e-01],
       [2.71098182e-01, 8.23604609e-01, 7.39260002e-01, 5.96863088e-01,
        6.42867723e-01, 5.57721585e-01, 2.05087063e-01, 3.45959313e-01,
        7.13187596e-01, 1.26699262e-01],
       [9.75129602e-01, 6.94600553e-01, 1.46673272e-01, 1.61391117e-01,
        6.93398224e-01, 7.58123721e-01, 8.01035619e-01, 7.90701972e-01,
        7.34414677e-01, 8.80065416e-01],
       [4.58111282e-01, 8.55722122e-01, 7.18808961e-01, 3.28407866e-01,
        5.75350931e-01, 4.71047838e-01, 1.48364415e-01, 5.68606354e-01,
        7.63957334e-02, 4.03262879e-01]])

Note on axis order#

  • When indexing, first axis (0) will extract rows, second axis (1) will extract cols

  • When aggregating (e.g., computing mean along an axis), you are specifying the dimension of the array that will be collapsed, not the dimension that will be returned

    • So axis=0 will aggregate values across all rows for each column in a 2D array
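
A non-square array makes the axis behavior easier to see, because the output shapes differ. This is a minimal sketch with made-up values; `arr`, `col_sums`, and `row_sums` are illustrative names.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])  # shape (2, 3): 2 rows, 3 columns

col_sums = arr.sum(axis=0)  # collapses the row axis: one value per column
row_sums = arr.sum(axis=1)  # collapses the column axis: one value per row

print(col_sums.tolist(), col_sums.shape)  # [5, 7, 9] (3,)
print(row_sums.tolist(), row_sums.shape)  # [6, 15] (2,)
```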

a2[0:3,0:3]
array([[0.14120447, 0.38991732, 0.73080344],
       [0.3767825 , 0.00095462, 0.49625387],
       [0.60577084, 0.72550939, 0.21604166]])
a2[0:3,0:3].min(axis=0)
array([0.14120447, 0.00095462, 0.21604166])
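#axis=1 collapses the columns, giving one minimum per row
#For this particular subset, the row minima happen to match the column minima
a2[0:3,0:3].min(axis=1)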
array([0.14120447, 0.00095462, 0.21604166])

nD arrays#

# how many dimensions can a numpy array have?
# let's try creating arrays with increasingly more dimensions
# An earlier cell assigned range = 2**64, so range(100) would raise
# "TypeError: 'int' object is not callable"; a while loop sidesteps that
shape = []
while len(shape) < 100:
    shape.append(1) # add a new dimension of size 1 to the array shape
    try:
        an = np.random.random(shape) # create a random array with the given shape
    except ValueError as err:
        print(f"Stopped at {len(shape)} dimensions: {err}") # NumPy reports its dimension limit
        break

Basic array plotting and visualization#

a
array([4, 3, 3, 6, 3, 6, 5, 2, 3, 9])
plt.plot(a)
[<matplotlib.lines.Line2D at 0x7f309773c850>]
../../_images/f121c3a6230e4af0f52d4599f8e690eb78fe96f96713e37c88d8092afee0f855.png
a2
array([[1.41204469e-01, 3.89917325e-01, 7.30803440e-01, 2.28727875e-01,
        3.05261875e-01, 5.57365149e-01, 3.19747387e-01, 9.76131931e-01,
        6.90435916e-01, 7.29483182e-01],
       [3.76782504e-01, 9.54623086e-04, 4.96253865e-01, 8.83186872e-01,
        7.97195917e-01, 6.37901937e-01, 5.78254274e-01, 5.03462246e-01,
        6.37138050e-01, 1.84502898e-02],
       [6.05770836e-01, 7.25509385e-01, 2.16041657e-01, 3.50670357e-03,
        6.34606720e-01, 9.77455944e-01, 7.18494507e-01, 6.50530622e-01,
        5.85406094e-01, 7.57894513e-01],
       [6.58903005e-01, 8.07604015e-01, 9.97131444e-01, 6.90749625e-01,
        8.11268416e-02, 1.07874063e-01, 4.17861123e-01, 7.23787367e-01,
        9.20449637e-02, 4.16555164e-01],
       [3.16168125e-01, 6.55887756e-01, 8.20854844e-02, 8.19284438e-02,
        2.25303248e-01, 4.16816605e-01, 5.13325213e-01, 5.34474506e-01,
        1.73097254e-01, 7.11062044e-01],
       [9.22911408e-01, 1.11008058e-01, 1.39258614e-01, 8.76430130e-01,
        4.95702854e-01, 7.16097827e-01, 6.36533645e-01, 7.98180948e-01,
        9.58087288e-01, 7.77772061e-02],
       [7.50955759e-01, 1.28775956e-01, 3.02774126e-01, 5.05208454e-01,
        2.43478124e-01, 9.86661911e-02, 7.22968293e-01, 3.39784935e-01,
        7.08604837e-01, 1.37895600e-01],
       [2.71098182e-01, 8.23604609e-01, 7.39260002e-01, 5.96863088e-01,
        6.42867723e-01, 5.57721585e-01, 2.05087063e-01, 3.45959313e-01,
        7.13187596e-01, 1.26699262e-01],
       [9.75129602e-01, 6.94600553e-01, 1.46673272e-01, 1.61391117e-01,
        6.93398224e-01, 7.58123721e-01, 8.01035619e-01, 7.90701972e-01,
        7.34414677e-01, 8.80065416e-01],
       [4.58111282e-01, 8.55722122e-01, 7.18808961e-01, 3.28407866e-01,
        5.75350931e-01, 4.71047838e-01, 1.48364415e-01, 5.68606354e-01,
        7.63957334e-02, 4.03262879e-01]])
plt.plot(a2)
[<matplotlib.lines.Line2D at 0x7f308f4f3d60>,
 <matplotlib.lines.Line2D at 0x7f308f4f3dc0>,
 <matplotlib.lines.Line2D at 0x7f308f4f3df0>,
 <matplotlib.lines.Line2D at 0x7f308f4f3ee0>,
 <matplotlib.lines.Line2D at 0x7f308f4f3fd0>,
 <matplotlib.lines.Line2D at 0x7f308f481100>,
 <matplotlib.lines.Line2D at 0x7f308f4811f0>,
 <matplotlib.lines.Line2D at 0x7f308f4812e0>,
 <matplotlib.lines.Line2D at 0x7f308f4813d0>,
 <matplotlib.lines.Line2D at 0x7f308f4814c0>]
../../_images/f7f685a640bc214ed13604314e7ceb248b077bce18cbd2f9deaae357ee0e6e8e.png
plt.plot(a2[0])
[<matplotlib.lines.Line2D at 0x7f308dc00c70>]
../../_images/cbf4eb3811caa3c8a7a585b5e6827ad3fef35c75a37800a9e4938c204a537265.png
#2D array visualization
plt.imshow(a2, cmap='gray')
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x7f308f44d3a0>
../../_images/1f866cb8beb0a843b9f8b6bc14df0972da9e014315e7cec21aecdcb6cd122da7.png
plt.hist(a2.ravel(), bins='auto')
(array([12., 14.,  8., 10., 12., 25., 10.,  9.]),
 array([9.54623086e-04, 1.25476726e-01, 2.49998828e-01, 3.74520931e-01,
        4.99043033e-01, 6.23565136e-01, 7.48087239e-01, 8.72609341e-01,
        9.97131444e-01]),
 <BarContainer object of 8 artists>)
../../_images/8a2c0300944a95f70a7402af76e9ec07eda3504e935e463499b2566d33a2a16b.png

Boolean arrays and fancy indexing#

a2
array([[1.41204469e-01, 3.89917325e-01, 7.30803440e-01, 2.28727875e-01,
        3.05261875e-01, 5.57365149e-01, 3.19747387e-01, 9.76131931e-01,
        6.90435916e-01, 7.29483182e-01],
       [3.76782504e-01, 9.54623086e-04, 4.96253865e-01, 8.83186872e-01,
        7.97195917e-01, 6.37901937e-01, 5.78254274e-01, 5.03462246e-01,
        6.37138050e-01, 1.84502898e-02],
       [6.05770836e-01, 7.25509385e-01, 2.16041657e-01, 3.50670357e-03,
        6.34606720e-01, 9.77455944e-01, 7.18494507e-01, 6.50530622e-01,
        5.85406094e-01, 7.57894513e-01],
       [6.58903005e-01, 8.07604015e-01, 9.97131444e-01, 6.90749625e-01,
        8.11268416e-02, 1.07874063e-01, 4.17861123e-01, 7.23787367e-01,
        9.20449637e-02, 4.16555164e-01],
       [3.16168125e-01, 6.55887756e-01, 8.20854844e-02, 8.19284438e-02,
        2.25303248e-01, 4.16816605e-01, 5.13325213e-01, 5.34474506e-01,
        1.73097254e-01, 7.11062044e-01],
       [9.22911408e-01, 1.11008058e-01, 1.39258614e-01, 8.76430130e-01,
        4.95702854e-01, 7.16097827e-01, 6.36533645e-01, 7.98180948e-01,
        9.58087288e-01, 7.77772061e-02],
       [7.50955759e-01, 1.28775956e-01, 3.02774126e-01, 5.05208454e-01,
        2.43478124e-01, 9.86661911e-02, 7.22968293e-01, 3.39784935e-01,
        7.08604837e-01, 1.37895600e-01],
       [2.71098182e-01, 8.23604609e-01, 7.39260002e-01, 5.96863088e-01,
        6.42867723e-01, 5.57721585e-01, 2.05087063e-01, 3.45959313e-01,
        7.13187596e-01, 1.26699262e-01],
       [9.75129602e-01, 6.94600553e-01, 1.46673272e-01, 1.61391117e-01,
        6.93398224e-01, 7.58123721e-01, 8.01035619e-01, 7.90701972e-01,
        7.34414677e-01, 8.80065416e-01],
       [4.58111282e-01, 8.55722122e-01, 7.18808961e-01, 3.28407866e-01,
        5.75350931e-01, 4.71047838e-01, 1.48364415e-01, 5.68606354e-01,
        7.63957334e-02, 4.03262879e-01]])
a2 > 0.5
array([[False, False,  True, False, False,  True, False,  True,  True,
         True],
       [False, False, False,  True,  True,  True,  True,  True,  True,
        False],
       [ True,  True, False, False,  True,  True,  True,  True,  True,
         True],
       [ True,  True,  True,  True, False, False, False,  True, False,
        False],
       [False,  True, False, False, False, False,  True,  True, False,
         True],
       [ True, False, False,  True, False,  True,  True,  True,  True,
        False],
       [ True, False, False,  True, False, False,  True, False,  True,
        False],
       [False,  True,  True,  True,  True,  True, False, False,  True,
        False],
       [ True,  True, False, False,  True,  True,  True,  True,  True,
         True],
       [False,  True,  True, False,  True, False, False,  True, False,
        False]])
idx = (a2 > 0.5)
idx
array([[False, False,  True, False, False,  True, False,  True,  True,
         True],
       [False, False, False,  True,  True,  True,  True,  True,  True,
        False],
       [ True,  True, False, False,  True,  True,  True,  True,  True,
         True],
       [ True,  True,  True,  True, False, False, False,  True, False,
        False],
       [False,  True, False, False, False, False,  True,  True, False,
         True],
       [ True, False, False,  True, False,  True,  True,  True,  True,
        False],
       [ True, False, False,  True, False, False,  True, False,  True,
        False],
       [False,  True,  True,  True,  True,  True, False, False,  True,
        False],
       [ True,  True, False, False,  True,  True,  True,  True,  True,
         True],
       [False,  True,  True, False,  True, False, False,  True, False,
        False]])
# Quick visualization, True = yellow (1)
plt.imshow(idx)
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x7f308da5a370>
../../_images/2c8b3725e4d2de10dbdaace22f87586a228c66b19bc3bb15672acc29053db2a9.png
# Return only elements where condition is True
a2[idx]
array([0.73080344, 0.55736515, 0.97613193, 0.69043592, 0.72948318,
       0.88318687, 0.79719592, 0.63790194, 0.57825427, 0.50346225,
       0.63713805, 0.60577084, 0.72550939, 0.63460672, 0.97745594,
       0.71849451, 0.65053062, 0.58540609, 0.75789451, 0.658903  ,
       0.80760401, 0.99713144, 0.69074963, 0.72378737, 0.65588776,
       0.51332521, 0.53447451, 0.71106204, 0.92291141, 0.87643013,
       0.71609783, 0.63653364, 0.79818095, 0.95808729, 0.75095576,
       0.50520845, 0.72296829, 0.70860484, 0.82360461, 0.73926   ,
       0.59686309, 0.64286772, 0.55772158, 0.7131876 , 0.9751296 ,
       0.69460055, 0.69339822, 0.75812372, 0.80103562, 0.79070197,
       0.73441468, 0.88006542, 0.85572212, 0.71880896, 0.57535093,
       0.56860635])
# Original shape
a2.shape
(10, 10)
# Selected shape
a2[idx].shape
#idx.nonzero()[0].size
(56,)
#Can also be used for assignment
#a2[idx] = 0
### Bitwise operators, combining boolean arrays
(a2 > 0.5)
array([[False, False,  True, False, False,  True, False,  True,  True,
         True],
       [False, False, False,  True,  True,  True,  True,  True,  True,
        False],
       [ True,  True, False, False,  True,  True,  True,  True,  True,
         True],
       [ True,  True,  True,  True, False, False, False,  True, False,
        False],
       [False,  True, False, False, False, False,  True,  True, False,
         True],
       [ True, False, False,  True, False,  True,  True,  True,  True,
        False],
       [ True, False, False,  True, False, False,  True, False,  True,
        False],
       [False,  True,  True,  True,  True,  True, False, False,  True,
        False],
       [ True,  True, False, False,  True,  True,  True,  True,  True,
         True],
       [False,  True,  True, False,  True, False, False,  True, False,
        False]])
(a2 < 0.7)
array([[ True,  True, False,  True,  True,  True,  True, False,  True,
        False],
       [ True,  True,  True, False, False,  True,  True,  True,  True,
         True],
       [ True, False,  True,  True,  True, False, False,  True,  True,
        False],
       [ True, False, False,  True,  True,  True,  True, False,  True,
         True],
       [ True,  True,  True,  True,  True,  True,  True,  True,  True,
        False],
       [False,  True,  True, False,  True, False,  True, False, False,
         True],
       [False,  True,  True,  True,  True,  True, False,  True, False,
         True],
       [ True, False, False,  True,  True,  True,  True,  True, False,
         True],
       [False,  True,  True,  True,  True, False, False, False, False,
        False],
       [ True, False, False,  True,  True,  True,  True,  True,  True,
         True]])
#Bitwise and - True if both are True
idx = (a2 > 0.5) & (a2 < 0.7)
#Bitwise or - True if either is True
#idx = (a2 < 0.5) | (a2 > 0.9)
idx
array([[False, False, False, False, False,  True, False, False,  True,
        False],
       [False, False, False, False, False,  True,  True,  True,  True,
        False],
       [ True, False, False, False,  True, False, False,  True,  True,
        False],
       [ True, False, False,  True, False, False, False, False, False,
        False],
       [False,  True, False, False, False, False,  True,  True, False,
        False],
       [False, False, False, False, False, False,  True, False, False,
        False],
       [False, False, False,  True, False, False, False, False, False,
        False],
       [False, False, False,  True,  True,  True, False, False, False,
        False],
       [False,  True, False, False,  True, False, False, False, False,
        False],
       [False, False, False, False,  True, False, False,  True, False,
        False]])
plt.imshow(idx)
<matplotlib.image.AxesImage at 0x7f308d9f2430>
../../_images/4ced832050a346b39fc7307c174a585b6b7ce46961d064d5d34de5ed4a5f2748.png
a2[idx]
array([0.55736515, 0.69043592, 0.63790194, 0.57825427, 0.50346225,
       0.63713805, 0.60577084, 0.63460672, 0.65053062, 0.58540609,
       0.658903  , 0.69074963, 0.65588776, 0.51332521, 0.53447451,
       0.63653364, 0.50520845, 0.59686309, 0.64286772, 0.55772158,
       0.69460055, 0.69339822, 0.57535093, 0.56860635])
a2[idx].shape
(24,)
#Invert the boolean array
~idx
array([[ True,  True,  True,  True,  True, False,  True,  True, False,
         True],
       [ True,  True,  True,  True,  True, False, False, False, False,
         True],
       [False,  True,  True,  True, False,  True,  True, False, False,
         True],
       [False,  True,  True, False,  True,  True,  True,  True,  True,
         True],
       [ True, False,  True,  True,  True,  True, False, False,  True,
         True],
       [ True,  True,  True,  True,  True,  True, False,  True,  True,
         True],
       [ True,  True,  True, False,  True,  True,  True,  True,  True,
         True],
       [ True,  True,  True, False, False, False,  True,  True,  True,
         True],
       [ True, False,  True,  True, False,  True,  True,  True,  True,
         True],
       [ True,  True,  True,  True, False,  True,  True, False,  True,
         True]])
plt.imshow(idx)
<matplotlib.image.AxesImage at 0x7f309773acd0>
../../_images/4ced832050a346b39fc7307c174a585b6b7ce46961d064d5d34de5ed4a5f2748.png
plt.imshow(~idx)
<matplotlib.image.AxesImage at 0x7f308d916c10>
../../_images/eb8b3d25a9a0cedb1043f2cf163ea4b5863fe3bbe5c7289e2277d8267b87090f.png
a2[~idx].shape
(76,)
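
Two follow-ups that often come in handy with boolean masks are counting the selected elements and replacing values without modifying the original array. The sketch below uses a short made-up array (`vals`). Note the parentheses around each comparison: `&` binds more tightly than `>` and `<`, so they are required.

```python
import numpy as np

vals = np.array([0.1, 0.6, 0.4, 0.9, 0.55])
mask = (vals > 0.5) & (vals < 0.7)

# True counts as 1, so sum() gives the number of selected elements
print(int(mask.sum()))  # 2

# np.where(condition, x, y) picks x where True and y elsewhere; vals is unchanged
replaced = np.where(mask, 0.0, vals)
print(replaced.tolist())  # [0.1, 0.0, 0.4, 0.9, 0.0]
```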

Pandas!#

#pd.DataFrame?
df = pd.DataFrame(a2)
df
0 1 2 3 4 5 6 7 8 9
0 0.141204 0.389917 0.730803 0.228728 0.305262 0.557365 0.319747 0.976132 0.690436 0.729483
1 0.376783 0.000955 0.496254 0.883187 0.797196 0.637902 0.578254 0.503462 0.637138 0.018450
2 0.605771 0.725509 0.216042 0.003507 0.634607 0.977456 0.718495 0.650531 0.585406 0.757895
3 0.658903 0.807604 0.997131 0.690750 0.081127 0.107874 0.417861 0.723787 0.092045 0.416555
4 0.316168 0.655888 0.082085 0.081928 0.225303 0.416817 0.513325 0.534475 0.173097 0.711062
5 0.922911 0.111008 0.139259 0.876430 0.495703 0.716098 0.636534 0.798181 0.958087 0.077777
6 0.750956 0.128776 0.302774 0.505208 0.243478 0.098666 0.722968 0.339785 0.708605 0.137896
7 0.271098 0.823605 0.739260 0.596863 0.642868 0.557722 0.205087 0.345959 0.713188 0.126699
8 0.975130 0.694601 0.146673 0.161391 0.693398 0.758124 0.801036 0.790702 0.734415 0.880065
9 0.458111 0.855722 0.718809 0.328408 0.575351 0.471048 0.148364 0.568606 0.076396 0.403263

Wow! Now we have row and column labels for our NumPy array. This will make the values much easier to keep track of.

df.index = ['a','b','c','d','e','f','g','h','i','j']
df
0 1 2 3 4 5 6 7 8 9
a 0.141204 0.389917 0.730803 0.228728 0.305262 0.557365 0.319747 0.976132 0.690436 0.729483
b 0.376783 0.000955 0.496254 0.883187 0.797196 0.637902 0.578254 0.503462 0.637138 0.018450
c 0.605771 0.725509 0.216042 0.003507 0.634607 0.977456 0.718495 0.650531 0.585406 0.757895
d 0.658903 0.807604 0.997131 0.690750 0.081127 0.107874 0.417861 0.723787 0.092045 0.416555
e 0.316168 0.655888 0.082085 0.081928 0.225303 0.416817 0.513325 0.534475 0.173097 0.711062
f 0.922911 0.111008 0.139259 0.876430 0.495703 0.716098 0.636534 0.798181 0.958087 0.077777
g 0.750956 0.128776 0.302774 0.505208 0.243478 0.098666 0.722968 0.339785 0.708605 0.137896
h 0.271098 0.823605 0.739260 0.596863 0.642868 0.557722 0.205087 0.345959 0.713188 0.126699
i 0.975130 0.694601 0.146673 0.161391 0.693398 0.758124 0.801036 0.790702 0.734415 0.880065
j 0.458111 0.855722 0.718809 0.328408 0.575351 0.471048 0.148364 0.568606 0.076396 0.403263
# Still just a NumPy array under the hood
df.values
array([[1.41204469e-01, 3.89917325e-01, 7.30803440e-01, 2.28727875e-01,
        3.05261875e-01, 5.57365149e-01, 3.19747387e-01, 9.76131931e-01,
        6.90435916e-01, 7.29483182e-01],
       [3.76782504e-01, 9.54623086e-04, 4.96253865e-01, 8.83186872e-01,
        7.97195917e-01, 6.37901937e-01, 5.78254274e-01, 5.03462246e-01,
        6.37138050e-01, 1.84502898e-02],
       [6.05770836e-01, 7.25509385e-01, 2.16041657e-01, 3.50670357e-03,
        6.34606720e-01, 9.77455944e-01, 7.18494507e-01, 6.50530622e-01,
        5.85406094e-01, 7.57894513e-01],
       [6.58903005e-01, 8.07604015e-01, 9.97131444e-01, 6.90749625e-01,
        8.11268416e-02, 1.07874063e-01, 4.17861123e-01, 7.23787367e-01,
        9.20449637e-02, 4.16555164e-01],
       [3.16168125e-01, 6.55887756e-01, 8.20854844e-02, 8.19284438e-02,
        2.25303248e-01, 4.16816605e-01, 5.13325213e-01, 5.34474506e-01,
        1.73097254e-01, 7.11062044e-01],
       [9.22911408e-01, 1.11008058e-01, 1.39258614e-01, 8.76430130e-01,
        4.95702854e-01, 7.16097827e-01, 6.36533645e-01, 7.98180948e-01,
        9.58087288e-01, 7.77772061e-02],
       [7.50955759e-01, 1.28775956e-01, 3.02774126e-01, 5.05208454e-01,
        2.43478124e-01, 9.86661911e-02, 7.22968293e-01, 3.39784935e-01,
        7.08604837e-01, 1.37895600e-01],
       [2.71098182e-01, 8.23604609e-01, 7.39260002e-01, 5.96863088e-01,
        6.42867723e-01, 5.57721585e-01, 2.05087063e-01, 3.45959313e-01,
        7.13187596e-01, 1.26699262e-01],
       [9.75129602e-01, 6.94600553e-01, 1.46673272e-01, 1.61391117e-01,
        6.93398224e-01, 7.58123721e-01, 8.01035619e-01, 7.90701972e-01,
        7.34414677e-01, 8.80065416e-01],
       [4.58111282e-01, 8.55722122e-01, 7.18808961e-01, 3.28407866e-01,
        5.75350931e-01, 4.71047838e-01, 1.48364415e-01, 5.68606354e-01,
        7.63957334e-02, 4.03262879e-01]])
df.index.values
array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype=object)
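
The labels can also be supplied when the DataFrame is created, rather than assigned afterwards. The sketch below uses a tiny made-up array (`arr`, `demo_df`, and the labels are hypothetical) and `to_numpy()`, which the pandas documentation recommends over `.values` for getting the underlying array.

```python
import numpy as np
import pandas as pd

# Small made-up array so the labels are easy to follow
arr = np.arange(6, dtype=float).reshape(3, 2)

# Row labels go in index, column labels go in columns
demo_df = pd.DataFrame(arr, index=['a', 'b', 'c'], columns=['x', 'y'])

# The labels are stored separately from the values
print(demo_df.index.tolist())
print(demo_df.columns.tolist())

# The values come back as a plain NumPy array
print(np.array_equal(demo_df.to_numpy(), arr))

# Labels let us pull out one value by name instead of by position
print(demo_df.loc['b', 'y'])
```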
# Mean of each column
df.mean()
0    0.547704
1    0.519358
2    0.456909
3    0.435640
4    0.469429
5    0.529907
6    0.506167
7    0.623162
8    0.536881
9    0.425915
dtype: float64
# Mean of each row
df.mean(axis=1)
a    0.506908
b    0.492958
c    0.587522
d    0.499364
e    0.371015
f    0.573199
g    0.393911
h    0.502235
i    0.663553
j    0.460408
dtype: float64
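
The `axis` argument follows the NumPy convention, which is easy to mix up. A small sketch with made-up values (`arr` and `demo_df` are hypothetical) shows that the default collapses the rows to give one value per column, while `axis=1` collapses the columns to give one value per row.

```python
import numpy as np
import pandas as pd

arr = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])
demo_df = pd.DataFrame(arr, index=['a', 'b'], columns=['x', 'y', 'z'])

# Default (axis=0): one mean per column, labeled by column name
col_means = demo_df.mean()

# axis=1: one mean per row, labeled by row index
row_means = demo_df.mean(axis=1)

print(col_means)
print(row_means)

# Same numbers as the equivalent NumPy calls
print(np.allclose(col_means.to_numpy(), arr.mean(axis=0)))
print(np.allclose(row_means.to_numpy(), arr.mean(axis=1)))
```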

Reading files with Pandas#

Most of the time, you will read tabular data from a file and let Pandas do the work.

# Path to csv file
csv_fn = './data/GLAH14_tllz_conus_lulcfilt_demfilt.csv'
pd.read_csv(csv_fn)
decyear ordinal lat lon glas_z dem_z dem_z_std lulc
0 2003.139571 731266.943345 44.157897 -105.356562 1398.51 1400.52 0.33 31
1 2003.139571 731266.943346 44.150175 -105.358116 1387.11 1384.64 0.43 31
2 2003.139571 731266.943347 44.148632 -105.358427 1392.83 1383.49 0.28 31
3 2003.139571 731266.943347 44.147087 -105.358738 1384.24 1382.85 0.84 31
4 2003.139571 731266.943347 44.145542 -105.359048 1369.21 1380.24 1.73 31
... ... ... ... ... ... ... ... ...
65231 2009.775995 733691.238340 37.896222 -117.044399 1556.16 1556.43 0.00 31
65232 2009.775995 733691.238340 37.897769 -117.044675 1556.02 1556.43 0.00 31
65233 2009.775995 733691.238340 37.899319 -117.044952 1556.19 1556.44 0.00 31
65234 2009.775995 733691.238340 37.900869 -117.045230 1556.18 1556.44 0.00 31
65235 2009.775995 733691.238341 37.902420 -117.045508 1556.32 1556.44 0.00 31

65236 rows × 8 columns

# Store output as a new Pandas DataFrame
glas_df = pd.read_csv(csv_fn)
glas_df
decyear ordinal lat lon glas_z dem_z dem_z_std lulc
0 2003.139571 731266.943345 44.157897 -105.356562 1398.51 1400.52 0.33 31
1 2003.139571 731266.943346 44.150175 -105.358116 1387.11 1384.64 0.43 31
2 2003.139571 731266.943347 44.148632 -105.358427 1392.83 1383.49 0.28 31
3 2003.139571 731266.943347 44.147087 -105.358738 1384.24 1382.85 0.84 31
4 2003.139571 731266.943347 44.145542 -105.359048 1369.21 1380.24 1.73 31
... ... ... ... ... ... ... ... ...
65231 2009.775995 733691.238340 37.896222 -117.044399 1556.16 1556.43 0.00 31
65232 2009.775995 733691.238340 37.897769 -117.044675 1556.02 1556.43 0.00 31
65233 2009.775995 733691.238340 37.899319 -117.044952 1556.19 1556.44 0.00 31
65234 2009.775995 733691.238340 37.900869 -117.045230 1556.18 1556.44 0.00 31
65235 2009.775995 733691.238341 37.902420 -117.045508 1556.32 1556.44 0.00 31

65236 rows × 8 columns

type(glas_df)
pandas.core.frame.DataFrame
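
If the CSV file is not at hand, `pd.read_csv` can be tried on an in-memory string instead. This is only a sketch: `csv_text` is a hypothetical three-row sample with a subset of the columns, hand-copied from the output shown in this demo, and `demo_df` is not the `glas_df` used in the rest of the section.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the real file has 65236 rows and 8 columns
csv_text = """decyear,lat,lon,glas_z,lulc
2003.139571,44.157897,-105.356562,1398.51,31
2003.139571,44.150175,-105.358116,1387.11,31
2003.139573,39.669291,-106.225142,3505.12,12
"""

# read_csv accepts a file path or any file-like object
demo_df = pd.read_csv(io.StringIO(csv_text))

# The first line becomes the column names; a default integer index is created
print(demo_df.shape)

# Column types are inferred from the text
print(demo_df.dtypes)
```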
# For demonstration purposes: replace the index with index*10+1 to illustrate the difference between loc and iloc
glas_df.set_index(glas_df.index*10+1, inplace=True)
glas_df
decyear ordinal lat lon glas_z dem_z dem_z_std lulc
1 2003.139571 731266.943345 44.157897 -105.356562 1398.51 1400.52 0.33 31
11 2003.139571 731266.943346 44.150175 -105.358116 1387.11 1384.64 0.43 31
21 2003.139571 731266.943347 44.148632 -105.358427 1392.83 1383.49 0.28 31
31 2003.139571 731266.943347 44.147087 -105.358738 1384.24 1382.85 0.84 31
41 2003.139571 731266.943347 44.145542 -105.359048 1369.21 1380.24 1.73 31
... ... ... ... ... ... ... ... ...
652311 2009.775995 733691.238340 37.896222 -117.044399 1556.16 1556.43 0.00 31
652321 2009.775995 733691.238340 37.897769 -117.044675 1556.02 1556.43 0.00 31
652331 2009.775995 733691.238340 37.899319 -117.044952 1556.19 1556.44 0.00 31
652341 2009.775995 733691.238340 37.900869 -117.045230 1556.18 1556.44 0.00 31
652351 2009.775995 733691.238341 37.902420 -117.045508 1556.32 1556.44 0.00 31

65236 rows × 8 columns

# Awesome descriptive statistics for each column
glas_df.describe()
decyear ordinal lat lon glas_z dem_z dem_z_std lulc
count 65236.000000 65236.000000 65236.000000 65236.000000 65236.000000 65236.000000 65236.000000 65236.000000
mean 2005.945322 732291.890372 40.946798 -115.040612 1791.494167 1792.260964 5.504748 30.339444
std 1.729573 631.766682 3.590476 5.465065 1037.183482 1037.925371 7.518558 3.480576
min 2003.139571 731266.943345 34.999455 -124.482406 -115.550000 -114.570000 0.000000 12.000000
25% 2004.444817 731743.803182 38.101451 -119.257599 1166.970000 1168.240000 0.070000 31.000000
50% 2005.846896 732256.116938 39.884541 -115.686241 1555.730000 1556.380000 1.350000 31.000000
75% 2007.223249 732758.486046 43.453565 -109.816475 2399.355000 2400.072500 9.530000 31.000000
max 2009.775995 733691.238341 48.999727 -104.052336 4340.310000 4252.940000 49.900000 31.000000
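
The output of `describe()` is itself a DataFrame, so individual statistics can be pulled out by label. A minimal sketch, assuming a made-up table (`demo_df` and its values are hypothetical; only the column names are borrowed from above):

```python
import pandas as pd

demo_df = pd.DataFrame({'glas_z': [1.0, 2.0, 3.0, 4.0],
                        'dem_z': [2.0, 4.0, 6.0, 8.0]})

stats = demo_df.describe()

# Rows are the statistic names, columns are the original column names
print(stats.index.tolist())

# Select a single statistic for a single column
print(stats.loc['mean', 'glas_z'])
print(stats.loc['50%', 'dem_z'])
```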

Indexing and selecting#

  • https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#different-choices-for-indexing

# Integer indexing like NumPy
glas_df.iloc[2]
decyear        2003.139571
ordinal      731266.943347
lat              44.148632
lon            -105.358427
glas_z         1392.830000
dem_z          1383.490000
dem_z_std         0.280000
lulc             31.000000
Name: 21, dtype: float64
glas_df.iloc[0:3]
decyear ordinal lat lon glas_z dem_z dem_z_std lulc
1 2003.139571 731266.943345 44.157897 -105.356562 1398.51 1400.52 0.33 31
11 2003.139571 731266.943346 44.150175 -105.358116 1387.11 1384.64 0.43 31
21 2003.139571 731266.943347 44.148632 -105.358427 1392.83 1383.49 0.28 31
glas_df.loc[21]
decyear        2003.139571
ordinal      731266.943347
lat              44.148632
lon            -105.358427
glas_z         1392.830000
dem_z          1383.490000
dem_z_std         0.280000
lulc             31.000000
Name: 21, dtype: float64
# Select rows whose index labels fall between 0 and 20 (both endpoints included)
glas_df.loc[0:20]
decyear ordinal lat lon glas_z dem_z dem_z_std lulc
1 2003.139571 731266.943345 44.157897 -105.356562 1398.51 1400.52 0.33 31
11 2003.139571 731266.943346 44.150175 -105.358116 1387.11 1384.64 0.43 31
# Select rows at integer positions 0 through 19 (the end position, 20, is excluded)
glas_df.iloc[0:20]
decyear ordinal lat lon glas_z dem_z dem_z_std lulc
1 2003.139571 731266.943345 44.157897 -105.356562 1398.51 1400.52 0.33 31
11 2003.139571 731266.943346 44.150175 -105.358116 1387.11 1384.64 0.43 31
21 2003.139571 731266.943347 44.148632 -105.358427 1392.83 1383.49 0.28 31
31 2003.139571 731266.943347 44.147087 -105.358738 1384.24 1382.85 0.84 31
41 2003.139571 731266.943347 44.145542 -105.359048 1369.21 1380.24 1.73 31
51 2003.139571 731266.943347 44.143996 -105.359359 1366.60 1375.23 1.60 31
61 2003.139571 731266.943351 44.126969 -105.362876 1355.14 1379.38 2.17 31
71 2003.139571 731266.943360 44.074358 -105.373549 1369.53 1391.71 2.88 31
81 2003.139571 731266.943361 44.072806 -105.373864 1380.02 1387.79 0.45 31
91 2003.139571 731266.943361 44.071256 -105.374177 1391.47 1396.90 1.56 31
101 2003.139571 731266.943362 44.063515 -105.375712 1388.58 1408.54 0.24 31
111 2003.139571 731266.943363 44.061967 -105.376015 1372.55 1406.21 0.17 31
121 2003.139571 731266.943364 44.057328 -105.376934 1402.38 1406.23 0.33 31
131 2003.139571 731266.943364 44.055780 -105.377243 1401.82 1405.75 0.35 31
141 2003.139571 731266.943364 44.054231 -105.377553 1399.31 1406.05 0.68 31
151 2003.139571 731266.943366 44.046487 -105.379115 1394.22 1398.14 0.27 31
161 2003.139571 731266.943366 44.044941 -105.379430 1394.94 1400.58 0.17 31
171 2003.139571 731266.943367 44.041850 -105.380064 1386.00 1389.69 0.57 31
181 2003.139571 731266.943424 43.737000 -105.441568 1496.53 1498.16 1.52 31
191 2003.139571 731266.943429 43.706060 -105.447754 1459.99 1460.90 0.08 31
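
The difference between the two indexers is easier to see on a table small enough to read in full. The sketch below rebuilds the same index*10+1 trick on a made-up five-row DataFrame (`demo_df` and the column `z` are hypothetical): `iloc` counts positions and excludes the end of a slice, while `loc` matches labels and includes both ends.

```python
import pandas as pd

demo_df = pd.DataFrame({'z': [10.0, 20.0, 30.0, 40.0, 50.0]})

# Same trick as above: labels become 1, 11, 21, 31, 41
demo_df.index = demo_df.index * 10 + 1

# Position 2 and label 21 point at the same row
by_position = demo_df.iloc[2]
by_label = demo_df.loc[21]
print(by_position.equals(by_label))

# Label slice: keeps every label from 0 through 20, so labels 1 and 11
label_slice = demo_df.loc[0:20]

# Label slice ending on an existing label: the end label 21 is included
inclusive_slice = demo_df.loc[1:21]

# Position slice: positions 0 and 1 only, the end position 2 is excluded
position_slice = demo_df.iloc[0:2]

print(label_slice.index.tolist())
print(inclusive_slice.index.tolist())
print(position_slice.index.tolist())
```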

Selecting columns#

glas_df.columns
Index(['decyear', 'ordinal', 'lat', 'lon', 'glas_z', 'dem_z', 'dem_z_std',
       'lulc'],
      dtype='object')
glas_df['glas_z']
1         1398.51
11        1387.11
21        1392.83
31        1384.24
41        1369.21
           ...   
652311    1556.16
652321    1556.02
652331    1556.19
652341    1556.18
652351    1556.32
Name: glas_z, Length: 65236, dtype: float64
glas_df.glas_z
1         1398.51
11        1387.11
21        1392.83
31        1384.24
41        1369.21
           ...   
652311    1556.16
652321    1556.02
652331    1556.19
652341    1556.18
652351    1556.32
Name: glas_z, Length: 65236, dtype: float64
glas_df.iloc[:,4]
1         1398.51
11        1387.11
21        1392.83
31        1384.24
41        1369.21
           ...   
652311    1556.16
652321    1556.02
652331    1556.19
652341    1556.18
652351    1556.32
Name: glas_z, Length: 65236, dtype: float64
glas_df.loc[:,'glas_z']
1         1398.51
11        1387.11
21        1392.83
31        1384.24
41        1369.21
           ...   
652311    1556.16
652321    1556.02
652331    1556.19
652341    1556.18
652351    1556.32
Name: glas_z, Length: 65236, dtype: float64
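
All four forms above return the same Series. One caveat worth knowing is that attribute access only works when the column name is a valid Python name that does not clash with an existing DataFrame attribute. A short sketch, using a made-up table with a deliberately awkward column called `count` (all names and values here are hypothetical):

```python
import pandas as pd

demo_df = pd.DataFrame({'lat': [44.1, 44.2],
                        'glas_z': [1398.51, 1387.11],
                        'count': [1, 2]})

by_key = demo_df['glas_z']
by_attr = demo_df.glas_z
by_position = demo_df.iloc[:, 1]
by_label = demo_df.loc[:, 'glas_z']

# All four are the same Series
print(by_key.equals(by_attr) and by_key.equals(by_position) and by_key.equals(by_label))

# demo_df.count is the DataFrame.count method, not the 'count' column
print(callable(demo_df.count))

# Bracket access always reaches the column
print(demo_df['count'].tolist())
```

For that reason, bracket access is the safer habit in scripts, while attribute access is a convenience for interactive work.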
#Multiple columns
glas_df['glas_z', 'dem_z']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/sw/miniconda3/envs/gda/lib/python3.8/site-packages/pandas/core/indexes/base.py:3652, in Index.get_loc(self, key)
   3651 try:
-> 3652     return self._engine.get_loc(casted_key)
   3653 except KeyError as err:

File ~/sw/miniconda3/envs/gda/lib/python3.8/site-packages/pandas/_libs/index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()

File ~/sw/miniconda3/envs/gda/lib/python3.8/site-packages/pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: ('glas_z', 'dem_z')

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[103], line 2
      1 #Multiple columns
----> 2 glas_df['glas_z', 'dem_z']

File ~/sw/miniconda3/envs/gda/lib/python3.8/site-packages/pandas/core/frame.py:3761, in DataFrame.__getitem__(self, key)
   3759 if self.columns.nlevels > 1:
   3760     return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
   3762 if is_integer(indexer):
   3763     indexer = [indexer]

File ~/sw/miniconda3/envs/gda/lib/python3.8/site-packages/pandas/core/indexes/base.py:3654, in Index.get_loc(self, key)
   3652     return self._engine.get_loc(casted_key)
   3653 except KeyError as err:
-> 3654     raise KeyError(key) from err
   3655 except TypeError:
   3656     # If we have a listlike key, _check_indexing_error will raise
   3657     #  InvalidIndexError. Otherwise we fall through and re-raise
   3658     #  the TypeError.
   3659     self._check_indexing_error(key)

KeyError: ('glas_z', 'dem_z')
# Need to pass in a list of column names
glas_df[['glas_z', 'dem_z']]
glas_z dem_z
1 1398.51 1400.52
11 1387.11 1384.64
21 1392.83 1383.49
31 1384.24 1382.85
41 1369.21 1380.24
... ... ...
652311 1556.16 1556.43
652321 1556.02 1556.43
652331 1556.19 1556.44
652341 1556.18 1556.44
652351 1556.32 1556.44

65236 rows × 2 columns

glas_df.loc[:,['glas_z', 'dem_z']]
glas_z dem_z
1 1398.51 1400.52
11 1387.11 1384.64
21 1392.83 1383.49
31 1384.24 1382.85
41 1369.21 1380.24
... ... ...
652311 1556.16 1556.43
652321 1556.02 1556.43
652331 1556.19 1556.44
652341 1556.18 1556.44
652351 1556.32 1556.44

65236 rows × 2 columns
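
The KeyError above happens because Python packs `'glas_z', 'dem_z'` into a single tuple, which Pandas then looks up as one column name. The sketch below reproduces this on a made-up two-row table (`demo_df` is hypothetical) and shows that a list always returns a DataFrame, even when it holds a single name.

```python
import pandas as pd

demo_df = pd.DataFrame({'glas_z': [1398.51, 1387.11],
                        'dem_z': [1400.52, 1384.64],
                        'lulc': [31, 31]})

# Without the inner brackets this is demo_df[('glas_z', 'dem_z')]
try:
    demo_df['glas_z', 'dem_z']
    raised = False
except KeyError:
    raised = True
print(raised)

# A list of names selects several columns and returns a DataFrame
subset = demo_df[['glas_z', 'dem_z']]
print(type(subset).__name__, subset.shape)

# A one-element list still returns a DataFrame; a bare name returns a Series
single_df = demo_df[['glas_z']]
single_series = demo_df['glas_z']
print(type(single_df).__name__, type(single_series).__name__)
```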

Boolean indexing#

glas_df['lulc']
1         31
11        31
21        31
31        31
41        31
          ..
652311    31
652321    31
652331    31
652341    31
652351    31
Name: lulc, Length: 65236, dtype: int64
glas_df['lulc'].value_counts()
lulc
31    62968
12     2268
Name: count, dtype: int64
glas_df['lulc'] == 12
1         False
11        False
21        False
31        False
41        False
          ...  
652311    False
652321    False
652331    False
652341    False
652351    False
Name: lulc, Length: 65236, dtype: bool
# Boolean Series (index and single column) will be True for records with 'lulc' == 12
idx2 = glas_df['lulc'] == 12
type(idx2)
pandas.core.series.Series
idx2.shape
(65236,)
glas_df.shape
(65236, 8)
# Use the boolean Series to select the matching rows; returns a new DataFrame with all columns
glas_df[idx2]
decyear ordinal lat lon glas_z dem_z dem_z_std lulc
231 2003.139573 731266.944184 39.669291 -106.225142 3505.12 3508.25 5.74 12
301 2003.139573 731266.944316 38.961190 -106.355153 4046.47 4047.25 7.14 12
4891 2003.147846 731269.963718 48.587233 -113.484046 2135.76 2123.37 1.18 12
4921 2003.147846 731269.963811 48.091352 -113.595790 1632.52 1615.77 11.43 12
7561 2003.157366 731273.438572 43.897412 -114.457131 2886.39 2889.82 20.31 12
... ... ... ... ... ... ... ... ...
647241 2009.764964 733687.211708 40.689722 -105.918309 3267.33 3267.62 1.83 12
647251 2009.764964 733687.211709 40.694371 -105.919164 3235.77 3238.94 3.78 12
649831 2009.771998 733689.779258 47.910365 -123.628017 1671.86 1711.73 8.44 12
649841 2009.771998 733689.779258 47.908820 -123.628357 1737.70 1776.17 7.70 12
649851 2009.771998 733689.779258 47.907275 -123.628697 1782.52 1828.93 4.41 12

2268 rows × 8 columns

glas_df[idx2].shape
(2268, 8)
glas_df[idx2].mean()
decyear        2006.008627
ordinal      732315.035881
lat              43.065223
lon            -112.936499
glas_z         2918.746261
dem_z          2920.785754
dem_z_std         9.719951
lulc             12.000000
dtype: float64
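
Boolean Series can be combined, and they can be used inside `loc` to pick rows and columns in one step. A minimal sketch on a made-up four-row table (`demo_df`, its values, and the 4000 threshold are assumptions for illustration):

```python
import pandas as pd

demo_df = pd.DataFrame({'glas_z': [3505.12, 4046.47, 1398.51, 1387.11],
                        'lulc': [12, 12, 31, 31]})

is_12 = demo_df['lulc'] == 12

# True counts as 1, so the sum is the number of matching rows
print(is_12.sum())

# Combine conditions with & (and) or | (or); each condition needs parentheses
high_12 = demo_df[(demo_df['lulc'] == 12) & (demo_df['glas_z'] > 4000)]
print(high_12)

# loc takes a row mask and a column label together
mean_z = demo_df.loc[is_12, 'glas_z'].mean()
print(mean_z)
```

Python's `and` and `or` keywords do not work element-wise on Series, which is why `&` and `|` are used here.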

Groupby#

  • Let’s compute statistics for groups of rows that share the same value in one column (here, 'lulc')

glas_df.groupby('lulc')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f308da5a040>
glas_df.groupby('lulc').count()
decyear ordinal lat lon glas_z dem_z dem_z_std
lulc
12 2268 2268 2268 2268 2268 2268 2268
31 62968 62968 62968 62968 62968 62968 62968
glas_df.groupby('lulc').mean()
decyear ordinal lat lon glas_z dem_z dem_z_std
lulc
12 2006.008627 732315.035881 43.065223 -112.936499 2918.746261 2920.785754 9.719951
31 2005.943042 732291.056710 40.870496 -115.116398 1750.892469 1751.613426 5.352924
glas_df.groupby('lulc').agg(['mean', 'std'])
      decyear                ordinal                    lat                  lon                    glas_z                    dem_z                     dem_z_std
lulc  mean         std       mean           std         mean       std       mean         std       mean         std          mean         std          mean      std
12    2006.008627  1.498488  732315.035881  547.316709  43.065223  3.569772  -112.936499  7.610318  2918.746261  772.429857   2920.785754  769.897983   9.719951  5.805685
31    2005.943042  1.737290  732291.056710  634.586821  40.870496  3.567855  -115.116398  5.356521  1750.892469  1022.544938  1751.613426  1023.340882  5.352924  7.529161
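
A groupby is effectively the boolean-indexing recipe from the previous section repeated for every distinct value. The sketch below checks that on a made-up five-row table (`demo_df` and its values are hypothetical) and shows how to select from the two-level columns that `agg` returns when given a list of functions.

```python
import pandas as pd

demo_df = pd.DataFrame({'glas_z': [3000.0, 4000.0, 1000.0, 2000.0, 3000.0],
                        'dem_z': [3001.0, 4001.0, 1001.0, 2001.0, 3001.0],
                        'lulc': [12, 12, 31, 31, 31]})

# Group mean matches the mean of the rows selected with a boolean mask
group_means = demo_df.groupby('lulc').mean()
mask_mean = demo_df[demo_df['lulc'] == 12]['glas_z'].mean()
print(group_means.loc[12, 'glas_z'] == mask_mean)

# Several statistics for one column give a simple table
summary = demo_df.groupby('lulc')['glas_z'].agg(['mean', 'std', 'count'])
print(summary)

# Several statistics for all columns give two-level column labels,
# which are selected with a (column, statistic) tuple
full = demo_df.groupby('lulc').agg(['mean', 'std'])
print(full[('glas_z', 'mean')])
```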