Palmer Penguins Blog Post

Blog Post for the First Assignment in CS0451
Author

Otis Milliken

Published

February 20, 2024

Abstract

In this blog post, I explore which features can be used to predict species in the Palmer Penguins dataset. I chose my columns in two ways: with SelectKBest and by searching over all combinations of features. Although both methods reached the same cross-validated accuracy, they selected slightly different columns. Finally, I visualize my model’s decision regions to show how it would classify future data.

Importing Palmer Penguins Data Set

import pandas as pd
import numpy as np
train_url = "https://raw.githubusercontent.com/PhilChodrow/ml-notes/main/data/palmer-penguins/train.csv"
train = pd.read_csv(train_url)
train.head()
studyName Sample Number Species Region Island Stage Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) Comments
0 PAL0809 31 Chinstrap penguin (Pygoscelis antarctica) Anvers Dream Adult, 1 Egg Stage N63A1 Yes 11/24/08 40.9 16.6 187.0 3200.0 FEMALE 9.08458 -24.54903 NaN
1 PAL0809 41 Chinstrap penguin (Pygoscelis antarctica) Anvers Dream Adult, 1 Egg Stage N74A1 Yes 11/24/08 49.0 19.5 210.0 3950.0 MALE 9.53262 -24.66867 NaN
2 PAL0708 4 Gentoo penguin (Pygoscelis papua) Anvers Biscoe Adult, 1 Egg Stage N32A2 Yes 11/27/07 50.0 15.2 218.0 5700.0 MALE 8.25540 -25.40075 NaN
3 PAL0708 15 Gentoo penguin (Pygoscelis papua) Anvers Biscoe Adult, 1 Egg Stage N38A1 Yes 12/3/07 45.8 14.6 210.0 4200.0 FEMALE 7.79958 -25.62618 NaN
4 PAL0809 34 Chinstrap penguin (Pygoscelis antarctica) Anvers Dream Adult, 1 Egg Stage N65A2 Yes 11/24/08 51.0 18.8 203.0 4100.0 MALE 9.23196 -24.17282 NaN

Here I prepare the data by dropping columns that don’t make sense to train on or that are constant across the data set. I also convert qualitative columns like Island into boolean columns using pandas’ get_dummies.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(train["Species"])

def prepare_data(df):
  df = df.drop(["studyName", "Sample Number", "Individual ID", "Date Egg", "Comments", "Region"], axis = 1)
  df = df[df["Sex"] != "."]
  df = df.dropna()
  y = le.transform(df["Species"])
  df = df.drop(["Species"], axis = 1)
  df = df.drop(["Stage"], axis = 1)
  df = pd.get_dummies(df)
  return df, y

X_train, y_train = prepare_data(train)

Here we can see what the new data looks like

X_train.head()
Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Delta 15 N (o/oo) Delta 13 C (o/oo) Island_Biscoe Island_Dream Island_Torgersen Clutch Completion_No Clutch Completion_Yes Sex_FEMALE Sex_MALE
0 40.9 16.6 187.0 3200.0 9.08458 -24.54903 False True False False True True False
1 49.0 19.5 210.0 3950.0 9.53262 -24.66867 False True False False True False True
2 50.0 15.2 218.0 5700.0 8.25540 -25.40075 True False False False True False True
3 45.8 14.6 210.0 4200.0 7.79958 -25.62618 True False False False True True False
4 51.0 18.8 203.0 4100.0 9.23196 -24.17282 False True False False True False True
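
To make the two encodings concrete, here is a minimal sketch on a made-up four-row table. The `toy` frame and its shortened species names are assumptions for illustration only and are not part of the real dataset. It shows that LabelEncoder numbers the species alphabetically, and that get_dummies turns one text column into one boolean column per value.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, loosely modeled on the penguin columns
toy = pd.DataFrame({
    "Species": ["Gentoo", "Adelie", "Chinstrap", "Adelie"],
    "Island": ["Biscoe", "Dream", "Dream", "Torgersen"],
    "Body Mass (g)": [5700.0, 3200.0, 3950.0, 3700.0],
})

# LabelEncoder sorts the class names, so Adelie -> 0, Chinstrap -> 1, Gentoo -> 2
toy_le = LabelEncoder().fit(toy["Species"])
codes = toy_le.transform(toy["Species"])

# get_dummies leaves numeric columns alone and expands text columns
encoded = pd.get_dummies(toy.drop(columns="Species"))

print(list(toy_le.classes_))
print(codes)
print(encoded)
```

The integer codes matter later on: any plot that colors points by `y` is coloring by these numbers, in this alphabetical order.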

Table

It’s always important to check how many examples we have of each class that we’re trying to predict. In this dataset, we have significantly more Adelie and Gentoo penguins than Chinstrap penguins. In fact, we have about twice as many Adelie penguins as Chinstrap ones (120 versus 57). This isn’t ideal to train on because our model may choose to prioritize features that classify Adelie penguins. For example, a naive classifier that classified only Adelie penguins correctly would have about 44% accuracy, while one that classified only Chinstrap penguins correctly would have about 21% accuracy. If these were the only two options, the model would pick the 44% one even though it is heavily tilted toward a single species.

train.groupby("Species").aggregate("count")
studyName Sample Number Region Island Stage Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) Comments
Species
Adelie Penguin (Pygoscelis adeliae) 120 120 120 120 120 120 120 120 119 119 119 119 114 110 110 23
Chinstrap penguin (Pygoscelis antarctica) 57 57 57 57 57 57 57 57 57 57 57 57 57 56 57 0
Gentoo penguin (Pygoscelis papua) 98 98 98 98 98 98 98 98 97 97 97 97 94 96 96 0
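
The baseline accuracies can be checked by hand from the row counts in the table. This small sketch hard-codes the three counts from the studyName column (so it reflects the raw training file, before prepare_data drops any rows) and computes each species’ share of the data.

```python
# Row counts per species, copied from the table above
counts = {"Adelie": 120, "Chinstrap": 57, "Gentoo": 98}

total = sum(counts.values())
shares = {species: n / total for species, n in counts.items()}

# A classifier that always guesses one species is right exactly
# as often as that species appears in the data.
for species, share in shares.items():
    print(f"always guess {species}: {share:.1%} accuracy")

majority = max(counts, key=counts.get)
print("majority class:", majority)
```

Any model worth keeping should clearly beat the majority-class share.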

Plots

This plot counts penguins by the qualitative column “Island” to see whether island is a useful predictor of species. From the plot, we see that Chinstrap penguins are found exclusively on Dream, while Gentoo penguins are found exclusively on Biscoe.

from matplotlib import pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(1, figsize = (8, 3.5))

plot = sns.countplot(train, x = "Island", hue = "Species")

My second graph plots Culmen Length against Flipper Length to see how the two relate to each other and to species. We see that Adelie penguins tend to have short culmens and short flippers. Gentoo penguins, on the other hand, tend to have medium to long culmens and long flippers. Lastly, Chinstrap penguins tend to have long culmens and short to medium flippers.

fig, ax = plt.subplots(1, 2, figsize = (8, 3.5))

p1 = sns.scatterplot(train, x = "Culmen Length (mm)", y = "Flipper Length (mm)", ax = ax[0], color = "darkgrey")
p2 = sns.scatterplot(train, x = "Culmen Length (mm)", y = "Flipper Length (mm)", hue = "Species", ax = ax[1])

Selecting Features

To select my features, I try two different methods: SelectKBest and trying all possible combinations. Below is my SelectKBest implementation. I had to separate the quantitative and qualitative features so that SelectKBest would pick two quantitative features and one qualitative feature rather than, say, three quantitative ones.

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif, chi2

def select_K_best(X, y, score_func, k):
    selector = SelectKBest(score_func, k=k)
    selector.fit(X, y)
    return selector.get_feature_names_out()


X_quant_selected = select_K_best(X_train[["Culmen Length (mm)", "Culmen Depth (mm)", "Flipper Length (mm)", "Body Mass (g)", "Delta 15 N (o/oo)", "Delta 13 C (o/oo)"]], y_train, f_classif, 2)

X_qual_selected = select_K_best(X_train[["Island_Biscoe", "Island_Dream", "Island_Torgersen", "Clutch Completion_No", "Clutch Completion_Yes", "Sex_FEMALE", "Sex_MALE"]], y_train, chi2, 1)
X_quant_selected
array(['Culmen Length (mm)', 'Flipper Length (mm)'], dtype=object)
X_qual_selected = [col for col in X_train.columns if X_qual_selected[0][0:4] in col]
X_qual_selected
['Island_Biscoe', 'Island_Dream', 'Island_Torgersen']

Here we see all the columns selected through SelectKBest:

selectK_cols = X_quant_selected.tolist() + X_qual_selected
selectK_cols
['Culmen Length (mm)',
 'Flipper Length (mm)',
 'Island_Biscoe',
 'Island_Dream',
 'Island_Torgersen']

I also implemented a search that runs through all combinations and scores each one by cross-validated accuracy with a random forest. This method is significantly more time-consuming than SelectKBest, but it should return the best-scoring columns among the combinations it tries.

from itertools import combinations
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

clf = RandomForestClassifier(n_estimators=10, random_state=20)

# all qualitative and quantitative columns left after prepare_data;
# each qualitative name matches its one-hot columns by substring
all_qual_cols = ["Island", "Clutch Completion", "Sex"]
all_quant_cols = ["Culmen Length (mm)", "Culmen Depth (mm)", "Flipper Length (mm)", "Body Mass (g)", "Delta 15 N (o/oo)", "Delta 13 C (o/oo)"]

mean_score = 0
comb_cols = []
for qual in all_qual_cols: 
  qual_cols = [col for col in X_train.columns if qual in col ]
  for pair in combinations(all_quant_cols, 2):
    cols = qual_cols + list(pair) 
    clf.fit(X_train[cols], y_train)
    score = cross_val_score(clf, X_train[cols], y_train, cv=5)
    if score.mean() > mean_score:
      mean_score = score.mean()
      comb_cols = cols
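# put the two quantitative columns first, since plot_regions expects them there
comb_cols = [c for c in comb_cols if c in all_quant_cols] + [c for c in comb_cols if c not in all_quant_cols]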
comb_cols
['Culmen Length (mm)',
 'Culmen Depth (mm)',
 'Island_Biscoe',
 'Island_Dream',
 'Island_Torgersen']
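
For a sense of how much work the exhaustive search does, this sketch counts the candidates: one qualitative group combined with every pair of quantitative columns, each scored with 5-fold cross-validation as in the loop above. The column names are copied from that loop; no model is actually trained here.

```python
from itertools import combinations
from math import comb

qual_groups = ["Island", "Clutch Completion", "Sex"]
quant_cols = [
    "Culmen Length (mm)", "Culmen Depth (mm)", "Flipper Length (mm)",
    "Body Mass (g)", "Delta 15 N (o/oo)", "Delta 13 C (o/oo)",
]

# Every (qualitative group, pair of quantitative columns) candidate
candidates = [(q, pair) for q in qual_groups for pair in combinations(quant_cols, 2)]
n_candidates = len(candidates)

# Each candidate is cross-validated with cv=5, so five forests per candidate
n_cv_fits = n_candidates * 5

print(n_candidates, "feature sets,", n_cv_fits, "cross-validation fits")
```

That is cheap for six quantitative columns, but the number of pairs grows quadratically with the number of columns, and larger subsets grow much faster.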

Which Features to use

To decide which set of features to use, I test both possibilities with a random forest classifier and compare their cross-validation scores.

select_clf = RandomForestClassifier(n_estimators=40, random_state=40, max_depth=5)
select_clf.fit(X_train[selectK_cols], y_train)
select_score_clf = cross_val_score(select_clf, X_train[selectK_cols], y_train, cv=5)
select_score_clf.mean()
0.9804675716440423
comb_clf = RandomForestClassifier(n_estimators=40, random_state=40, max_depth=5)
comb_clf.fit(X_train[comb_cols], y_train)
comb_score_clf = cross_val_score(comb_clf, X_train[comb_cols], y_train, cv=5)
comb_score_clf.mean()
0.9804675716440423
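
One detail worth knowing about the cells above: cross_val_score fits clones of the estimator it is given, so the separate call to fit is what leaves the classifier ready for scoring on new data. A minimal sketch on synthetic data (the `make_classification` dataset and the `demo_` names are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data, not the penguin measurements
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=0
)

demo_clf = RandomForestClassifier(n_estimators=10, random_state=0)

# cross_val_score trains internal copies; demo_clf itself stays unfitted
cv_scores = cross_val_score(demo_clf, X_demo, y_demo, cv=5)
fitted_after_cv = hasattr(demo_clf, "estimators_")

# An explicit fit is what produces a usable model
demo_clf.fit(X_demo, y_demo)
fitted_after_fit = hasattr(demo_clf, "estimators_")

print(cv_scores.mean(), cv_scores.std())
print(fitted_after_cv, fitted_after_fit)
```

Looking at the standard deviation of the fold scores alongside the mean is also a quick way to judge whether two feature sets really differ.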

Testing My Data

The scores were identical, so I decided to go with comb_clf. Now I’ll test it on the test data set.

test_url = "https://raw.githubusercontent.com/PhilChodrow/ml-notes/main/data/palmer-penguins/test.csv"
test = pd.read_csv(test_url)

X_test, y_test = prepare_data(test)
comb_clf.score(X_test[comb_cols], y_test)
1.0

Yay! As we can see we achieved a score of 100%.

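Visualizing Results

This next block of code, given to us by Phil, shows how our model classifies data. One interesting thing to note is that the Island Dream panel has several sharp blue ‘inlets’ that let the model capture a few individual data points. Because the jet colormap draws class 0 (Adelie) in blue and Gentoo penguins do not appear on Dream, these points are most likely Adelie rather than Gentoo. These narrow regions aren’t ideal and suggest possible overfitting.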

from matplotlib import pyplot as plt
import numpy as np
from matplotlib.patches import Patch

def plot_regions(model, X, y):
    
    x0 = X[X.columns[0]]
    x1 = X[X.columns[1]]
    qual_features = X.columns[2:]
    
    fig, axarr = plt.subplots(1, len(qual_features), figsize = (7, 3))

    # create a grid
    grid_x = np.linspace(x0.min(),x0.max(),501)
    grid_y = np.linspace(x1.min(),x1.max(),501)
    xx, yy = np.meshgrid(grid_x, grid_y)
    
    XX = xx.ravel()
    YY = yy.ravel()

    for i in range(len(qual_features)):
      XY = pd.DataFrame({
          X.columns[0] : XX,
          X.columns[1] : YY
      })

      for j in qual_features:
        XY[j] = 0

      XY[qual_features[i]] = 1

      p = model.predict(XY)
      p = p.reshape(xx.shape)
      
      
      # use contour plot to visualize the predictions
      axarr[i].contourf(xx, yy, p, cmap = "jet", alpha = 0.2, vmin = 0, vmax = 2)
      
      ix = X[qual_features[i]] == 1
      # plot the data
      axarr[i].scatter(x0[ix], x1[ix], c = y[ix], cmap = "jet", vmin = 0, vmax = 2)
      
      axarr[i].set(xlabel = X.columns[0], 
            ylabel  = X.columns[1], 
            title = qual_features[i])
      
      patches = []
      # jet maps class 0 (Adelie) to blue and class 2 (Gentoo) to red
      for color, spec in zip(["blue", "green", "red"], ["Adelie", "Chinstrap", "Gentoo"]):
        patches.append(Patch(color = color, label = spec))

      plt.legend(title = "Species", handles = patches, loc = "best")
      
      plt.tight_layout()
plot_regions(comb_clf, X_train[comb_cols], y_train)

plot_regions(comb_clf, X_test[comb_cols], y_test)

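Since the model has 100% accuracy on the test set, the confusion matrix has no off-diagonal entries, but it does show that the test set doesn’t contain an even number of each penguin species: there are 31 Adelie, 11 Chinstrap, and 26 Gentoo penguins. With only 11 Chinstrap penguins, the perfect score tells us less about that species, so my model could still perform worse on under-represented species.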

from sklearn.metrics import confusion_matrix

y_test_pred = comb_clf.predict(X_test[comb_cols])
C = confusion_matrix(y_test, y_test_pred)
C
array([[31,  0,  0],
       [ 0, 11,  0],
       [ 0,  0, 26]])
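
The confusion matrix can also be read per species. This sketch copies the matrix printed above into `C_demo` (a name chosen here for illustration) and computes how many test penguins each species has and what fraction of them were classified correctly.

```python
import numpy as np

# Confusion matrix from the output above (rows = true, columns = predicted),
# in LabelEncoder order: Adelie, Chinstrap, Gentoo
C_demo = np.array([[31, 0, 0],
                   [0, 11, 0],
                   [0, 0, 26]])

support = C_demo.sum(axis=1)           # true examples per species
recall = np.diag(C_demo) / support     # fraction of each species found
accuracy = np.trace(C_demo) / C_demo.sum()

for name, n, r in zip(["Adelie", "Chinstrap", "Gentoo"], support, recall):
    print(f"{name}: {n} test penguins, recall {r:.2f}")
print("overall accuracy:", accuracy)
```

With so few Chinstrap penguins in the test set, a single mistake would lower that species’ recall by about nine percentage points, so per-class numbers on a set this small should be read with care.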

Discussion

I managed to achieve 100% accuracy on the test data! I found it interesting that the combination method and SelectKBest selected different features, and I can think of a reason why. SelectKBest scores each feature on its own, so it doesn’t take into account how features might interact with each other. For example, one feature may separate one part of the data really well while another feature separates a different part slightly less well. The best model would train on both, but SelectKBest seems likely to choose two features that both help with the same part of the data, because each of them correlates better with the label overall. I wonder how you can eliminate this weakness on datasets where you can’t go through every combination.

Even with the features from the combination search, the visualizations show that the model isn’t perfect. The plot has a couple of slim regions that allow individual data points to be labeled correctly, which is a sign of overfitting. With a larger test dataset, I doubt these regions would hold for all data points in that area. In the future, I would love to try different models to see whether and how their accuracy differs.