Nighthawk Pages

Data Frames | Pandas | Intro 1

Data connections, trends, and correlation. Pandas is introduced because it could be valuable for CPT and PBL.

CSP Big Idea 2

Files To Get

  1. Use wget or drag-and-drop to get the _notebooks/CSP/big-ideas/big-idea-2 folder, which holds this and other ipynb files on pandas.

  2. Use wget or drag-and-drop to grab the data files and place them in a subfolder named data in your _notebooks folder.

  • data.csv
  • grade.json

  3. Use wget or drag-and-drop to copy the image files into a subfolder named data_structures in your images folder. Grab the entire folder.

Pandas and DataFrames

In this lesson we will be exploring data analysis using Pandas.

  • College Board talks about ideas like
    • Tools. “the ability to process data depends on users capabilities and their tools”
    • Combining Data. “combine county data sets”
    • Status on Data. “determining the artist with the greatest attendance during a particular month”
    • Data poses challenge. “the need to clean data”, “incomplete data”
  • From Pandas Overview – When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.

  • DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. It is similar to:
    • a spreadsheet
    • an SQL table
    • a JSON object with rows [] that hold nested key-values {}
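To make the JSON comparison concrete, here is a minimal sketch that builds a DataFrame from a list of rows, where each row is a dictionary of key-value pairs. The names and values are made up for illustration and do not come from the lesson's data files.

```python
import pandas as pd

# Hypothetical records: a JSON-style list of rows, each row a dict of key-value pairs
rows = [
    {"name": "Ada", "grade": 11, "gpa": 3.9},
    {"name": "Lin", "grade": 12, "gpa": 3.4},
    {"name": "Sam", "grade": 10, "gpa": 3.1},
]

# Each dict becomes one row; each key becomes one column
students = pd.DataFrame(rows)

print(students)
print(students.dtypes)  # columns can hold different types (text, integers, floats)
```

The printed dtypes show one type per column, which is the "columns of potentially different types" idea from the definition above.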

DataFrame

# uncomment the following line to install the pandas library
#!pip install pandas 

'''Pandas is used to gather data sets through its DataFrames implementation'''
import pandas as pd

Cleaning Data

When looking at a data set, check to see what data needs to be cleaned. Examples include:

  • Missing Data Points
  • Invalid Data
  • Inaccurate Data
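As a rough sketch of how each kind of problem might be handled, the example below uses a small made-up table. The column "Year in School" and the 0.0 to 4.0 GPA range are assumptions for illustration; the real grade.json may look different.

```python
import pandas as pd

# Hypothetical grade records with typical problems
raw = pd.DataFrame({
    "Student ID": [101, 102, 103, 104, 105],
    "Year in School": ["9", "10", "ten", None, "12"],  # "ten" is invalid, None is missing
    "GPA": [3.5, 4.8, 2.9, 3.2, 3.9],                   # 4.8 is outside the assumed 4.0 scale
})

# Missing data: count empty cells per column
print(raw.isna().sum())

# Invalid data: convert text to numbers; values that cannot be parsed become NaN
raw["Year in School"] = pd.to_numeric(raw["Year in School"], errors="coerce")

# Inaccurate data: keep rows inside the assumed valid GPA range, then drop incomplete rows
cleaned = raw[raw["GPA"].between(0.0, 4.0)].dropna()

print(cleaned)
```

Dropping rows is only one option. Depending on the project, it may be better to correct a value or ask for the missing data instead.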

Run the following code to see what needs to be cleaned.

# Read the JSON file and convert it to a Pandas DataFrame 
# pd.read_json:  a method that reads a JSON and converts it to a DataFrame (df)
# df: a variable that holds the DataFrame
df = pd.read_json('data/grade.json')

# Print the DataFrame
print(df)

# Additional print statements to understand the DataFrame:
# print(df.info()) # prints a summary of the DataFrame, similar to a database schema
# print(df.describe()) # prints statistics of the DataFrame
# print(df.head()) # prints the first 5 rows of the DataFrame
# print(df.tail()) # prints the last 5 rows of the DataFrame
# print(df.columns) # prints the columns of the DataFrame
# print(df.index) # prints the index of the DataFrame

# Questions:
# What part of the data set needs to be cleaned?
# From PBL learning, what is a good time to clean data?  
# Could you have a Garbage in, Garbage out problem if you don't clean the data?
---------------------------------------------------------------------------

FileNotFoundError                         Traceback (most recent call last)

Cell In[5], line 4
      1 # Read the JSON file and convert it to a Pandas DataFrame 
      2 # pd.read_json:  a method that reads a JSON and converts it to a DataFrame (df)
      3 # df: a variable that holds the DataFrame
----> 4 df = pd.read_json('data/grade.json')
      6 # Print the DataFrame
      7 print(df)


FileNotFoundError: File data/grade.json does not exist

Extracting Info

Take a look at some features of the Pandas library that extract info from the data set.

DataFrame Extract Column

#print the values in the GPA column with column header
print(df[['GPA']])

print()

#try two columns and remove the index from print statement
print(df[['Student ID','GPA']].to_string(index=False))
---------------------------------------------------------------------------

NameError                                 Traceback (most recent call last)

Cell In[6], line 2
      1 #print the values in the points column with column header
----> 2 print(df[['GPA']])
      4 print()
      6 #try two columns and remove the index from print statement


NameError: name 'df' is not defined
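The cells in this section assume df was loaded from data/grade.json. As a self-contained sketch that runs without that file, the example below uses a made-up stand-in (only the column names "Student ID" and "GPA" come from the lesson) to show the difference between single and double brackets.

```python
import pandas as pd

# Hypothetical stand-in for grade.json; the values are invented
df_demo = pd.DataFrame({
    "Student ID": [101, 102, 103],
    "GPA": [3.5, 2.8, 3.9],
})

one_col = df_demo["GPA"]        # single brackets return a Series
one_col_df = df_demo[["GPA"]]   # double brackets return a DataFrame with one column

print(type(one_col).__name__)
print(type(one_col_df).__name__)

# Two columns, printed without the index
print(df_demo[["Student ID", "GPA"]].to_string(index=False))
```

The double-bracket form keeps the column header when printed, which is why the lesson cell uses it.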

DataFrame Sort

#sort values
print(df.sort_values(by=['GPA']))

print()

#sort the values in reverse order
print(df.sort_values(by=['GPA'], ascending=False))

DataFrame Selection or Filter

#print only rows that meet a specific criterion
print(df[df.GPA > 3.00])

DataFrame Selection Max and Min

print(df[df.GPA == df.GPA.max()])
print()
print(df[df.GPA == df.GPA.min()])
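The sort, filter, and max/min cells can also be tried without the data file. This is a minimal sketch on invented values; it additionally shows how two conditions might be combined with & and how idxmax can pick the row with the largest value.

```python
import pandas as pd

# Hypothetical stand-in for grade.json; the values are invented
df_demo = pd.DataFrame({
    "Student ID": [101, 102, 103, 104],
    "GPA": [3.5, 2.8, 3.9, 3.1],
})

# Sort from highest to lowest GPA
top_first = df_demo.sort_values(by="GPA", ascending=False)

# Combine two conditions; each one needs its own parentheses
mid_range = df_demo[(df_demo["GPA"] > 3.0) & (df_demo["GPA"] < 3.8)]

# idxmax returns the index label of the largest value; loc fetches that row
best = df_demo.loc[df_demo["GPA"].idxmax()]

print(top_first)
print(mid_range)
print(best)
```

Bracket notation such as df_demo["GPA"] is used here because attribute access (df_demo.GPA) does not work for column names that contain spaces, such as "Student ID".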

Create your own DataFrame

Using Pandas allows you to create your own DataFrame in Python.

Python Dictionary to Pandas DataFrame

import pandas as pd

#the data can be stored as a python dictionary
#(named data here so the built-in dict type is not shadowed)
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
print("-------------Dictionary------------------")
print(data)

#stores the data in a data frame
print("-------------Dict_to_DF------------------")
df = pd.DataFrame(data)
print(df)

print("----------Dict_to_DF_labels--------------")
#or with the index argument, you can label rows.
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
-------------Dictionary------------------
{'calories': [420, 380, 390], 'duration': [50, 40, 45]}
-------------Dict_to_DF------------------
   calories  duration
0       420        50
1       380        40
2       390        45
----------Dict_to_DF_labels--------------
      calories  duration
day1       420        50
day2       380        40
day3       390        45

Examine DataFrame Rows

print("-------Examine Selected Rows---------")
#use a list for multiple labels:
print(df.loc[["day1", "day3"]])

#refer to the row index:
print("--------Examine Single Row-----------")
print(df.loc["day1"])
-------Examine Selected Rows---------
      calories  duration
day1       420        50
day3       390        45
--------Examine Single Row-----------
calories    420
duration     50
Name: day1, dtype: int64
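Rows can be selected by label with loc, as above, or by position with iloc. The sketch below rebuilds the same calories/duration table under a different variable name (workouts, chosen here only for illustration) and compares the two.

```python
import pandas as pd

workouts = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

by_label = workouts.loc["day2"]               # row whose label is "day2"
by_position = workouts.iloc[1]                # second row, counting from 0
one_cell = workouts.loc["day3", "calories"]   # single value by row label and column name

print(by_label)
print(by_position)
print(one_cell)
```

Both lookups return the same row here, because "day2" happens to be the second row.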

Pandas DataFrame Information

#print info about the data set
print(df.info())

Example of larger data set

Pandas can read CSV and many other types of files. Run the following code to see more features with a larger data set.

import pandas as pd

#read csv and sort 'Duration' largest to smallest
df = pd.read_csv('data/data.csv').sort_values(by=['Duration'], ascending=False)

print("--Duration Top 10---------")
print(df.head(10))

print("--Duration Bottom 10------")
print(df.tail(10))

---------------------------------------------------------------------------

FileNotFoundError                         Traceback (most recent call last)

Cell In[9], line 4
      1 import pandas as pd
      3 #read csv and sort 'Duration' largest to smallest
----> 4 df = pd.read_csv('data/data.csv').sort_values(by=['Duration'], ascending=False)
      6 print("--Duration Top 10---------")
      7 print(df.head(10))


FileNotFoundError: [Errno 2] No such file or directory: 'data/data.csv'
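A FileNotFoundError like this usually means the data file is not where the relative path expects it. One possible way to guard against that, sketched below, is to check for the file first and fall back to a small inline sample. The sample rows and the "Pulse" column are assumptions for illustration; only the "Duration" column name comes from the lesson.

```python
import io
from pathlib import Path

import pandas as pd

# Hypothetical sample used only when the real file is missing
SAMPLE_CSV = """Duration,Pulse
60,110
45,117
30,103
"""

csv_path = Path("data/data.csv")

# read_csv accepts a file path or a file-like object such as StringIO
source = csv_path if csv_path.exists() else io.StringIO(SAMPLE_CSV)
df_csv = pd.read_csv(source).sort_values(by="Duration", ascending=False)

print(df_csv.head(2))
```

Printing Path.cwd() is a quick way to see which folder the notebook treats as the starting point for relative paths.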

APIs are a Source for Pandas Data

Third-party APIs are a great source of data for Pandas DataFrames.

  • Data can be fetched from an endpoint, and the resulting JSON can be placed into a DataFrame.
  • Observe the output; it looks very similar to a database table.

import pandas as pd
import requests

def fetch():
    '''Obtain data from an endpoint'''
    url = "https://devops.nighthawkcodingsociety.com/api/users/"
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early if the server returns an error status
    users = response.json()

    # load the JSON records into a DataFrame
    df = pd.DataFrame(users)

    # Check if 'active_classes' column exists in the DataFrame
    if 'active_classes' in df.columns:
        # Split the 'active_classes' strings into lists of class names and expand the lists into separate rows
        classes_series = df['active_classes'].str.split(',').explode()

        # Count the unique class names and print the counts
        print(classes_series.str.strip().value_counts())
    else:
        print("Column 'active_classes' does not exist in the DataFrame")

fetch()

# Second example: list the students in each class
import pandas as pd
import requests

def fetch():
    '''Obtain data from an endpoint'''
    url = "https://devops.nighthawkcodingsociety.com/api/users/"
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early if the server returns an error status
    users = response.json()

    # load the JSON records into a DataFrame
    df = pd.DataFrame(users)

    # Check if 'active_classes' column exists in the DataFrame
    if 'active_classes' in df.columns:
        # Split the 'active_classes' strings into lists of class names, stripping stray spaces
        df['active_classes'] = df['active_classes'].str.split(',').apply(
            lambda names: [name.strip() for name in names])

        # Get the unique class names by expanding the lists into rows
        unique_classes = df['active_classes'].explode().unique()

        # Iterate over each class name
        for current_class in unique_classes:
            # Filter the DataFrame for students in the current class using a lambda function
            class_df = df[df['active_classes'].apply(lambda classes: current_class in classes)]

            # Select the desired DataFrame columns
            students = class_df[['active_classes','id', 'first_name', 'last_name']]

            # Print the list of students in the current class
            print(students.sort_values(by='last_name').head()) # avoids jupyter notebook truncation, remove .head() to print all students
            print()
    else:
        print("Column 'active_classes' does not exist in the DataFrame")

fetch()
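The split, explode, and value_counts steps can be tried offline as well. The sketch below uses invented user records; the real API response may have a different shape, and only the column name 'active_classes' comes from the code above.

```python
import pandas as pd

# Hypothetical user records; class lists are comma-separated strings
users = pd.DataFrame({
    "first_name": ["Ada", "Lin", "Sam"],
    "active_classes": ["CSA, CSP", "CSP", "CSSE,CSP"],
})

class_counts = (
    users["active_classes"]
    .str.split(",")   # "CSA, CSP" becomes ["CSA", " CSP"]
    .explode()        # one row per class name
    .str.strip()      # remove the stray spaces left by the split
    .value_counts()   # count how often each class name appears
)

print(class_counts)
```

Stripping after the split matters: without it, " CSP" and "CSP" would be counted as two different classes.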

Hacks

Early Seed award. Don’t tell anyone. Show to Teacher.

  • Have all lecture files saved to your files directory before Tech Talk starts.
  • Add this Blog to your own Blogging site. In the Blog, add notes and observations on each code cell.

Over the next 6 weeks, the Teachers want you to improve your understanding of data structures and data science. Your goal is to find ways to differentiate your individual College Board project, particularly if your project looks like all the other projects.

  • Look at this blog and others on data structures for today's date.
  • Create or find your own data set. The suggestion is to use a JSON file; integrating it with your CPT/PBL project would be amazing.
  • Build a frontend-to-backend feature that filters or uses your data set in your CPT/PBL.
  • When choosing a data set, think about the following…
    • Does it have a good sample size?
    • Is there bias in the data?
    • Does the data set need to be cleaned?
    • What is the purpose of the data set?
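A few of these questions can be given a first rough check in Pandas. The sketch below uses an invented survey table (both column names are hypothetical) to look at sample size, how evenly groups are represented, and how much data is missing. It is a starting point, not a full test for bias.

```python
import pandas as pd

# Hypothetical survey data
survey = pd.DataFrame({
    "grade_level": [9, 9, 9, 9, 10, 12],
    "hours_coding": [2.0, 3.5, None, 1.0, 4.0, 6.0],
})

# Sample size: number of rows
sample_size = len(survey)

# Possible bias: share of rows per group; one group dominating is a warning sign
share_by_grade = survey["grade_level"].value_counts(normalize=True)

# Cleaning needed: count of missing values per column
missing_per_column = survey.isna().sum()

print(sample_size)
print(share_by_grade)
print(missing_per_column)
```

In this made-up table, two thirds of the rows come from grade 9, so conclusions about other grade levels would rest on very little data.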