Fun With Melbourne Housing Data

Making some models and figures using Melbourne housing data from the dataset anthonypino/melbourne-housing-market. Notes and exposition from the first three assignments of Dan Becker's course.

Author
Published

August 27, 2024

Modified

August 27, 2024

Keywords

australia, seaborn, numpy, sklearn, scikit-learn, datascience, data, science

These notes/assignments were done along with Dan Becker's beginner course:

Introduction

Using kaggle Outside of the Browser

Since this is the first assignment, and since I would much rather automate things, it is worth knowing that the Kaggle API has a Python client available on PyPI. It may be installed using pip install kaggle or, in my case, poetry add kaggle.
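For completeness, here is a rough sketch of downloading this dataset with the kaggle client. This assumes API credentials exist in ~/.kaggle/kaggle.json, and the method name is the one available at the time of writing; it may differ between versions.

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()
# Download and unzip the dataset into ./data (a hypothetical target directory).
api.dataset_download_files(
  "anthonypino/melbourne-housing-market", path="data", unzip=True
)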

It turns out that the kaggle library is not the only client available for using Kaggle from Python. There is also kagglehub, which can be installed with poetry add kagglehub.

The dataset for this assignment can be viewed and downloaded in the browser. It may be obtained in Python as follows:

from typing import Iterable, Type
import kagglehub
import pathlib
import io
import contextlib

from matplotlib.axes import Axes
import matplotlib.pyplot as plt
import seaborn as sb
import numpy as np
import pandas as pd

from IPython.display import display
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


DIR = pathlib.Path(".").resolve()

# NOTE: The ``path`` argument does not specify the path downloaded to, but 
#       instead a subpath of the data.
DATA_DOWNLOAD_IO = io.StringIO()
DATA_ID = "anthonypino/melbourne-housing-market"
with contextlib.redirect_stdout(DATA_DOWNLOAD_IO):
  DATA_DIR = pathlib.Path(kagglehub.dataset_download(DATA_ID))

DATA_PATH_LESS = DATA_DIR / "MELBOURNE_HOUSE_PRICES_LESS.csv"
DATA_PATH = DATA_DIR / "Melbourne_housing_FULL.csv"
Warning: Looks like you're using an outdated `kagglehub` version, please consider updating (latest version: 0.3.1)
Downloading from https://www.kaggle.com/api/v1/datasets/download/anthonypino/melbourne-housing-market?dataset_version_number=27...
100%|██████████| 2.28M/2.28M [00:00<00:00, 111MB/s]
Extracting model files...

Loading and Describing Data

Note that it is necessary to capture stdout if you want your notebook to look nice. DATA_PATH should be a path to the full data, and (obviously) DATA_PATH_LESS should be a path to the partial data. It will look something like

/root/.cache/kagglehub/datasets/anthonypino/melbourne-housing-market/versions/27/Melbourne_housing_FULL.csv

Data is loaded and described using the following:

DATA_LESS = pd.read_csv(DATA_PATH_LESS)
DATA_LESS.describe()
Rooms Price Postcode Propertycount Distance
count 63023.000000 4.843300e+04 63023.000000 63023.000000 63023.000000
mean 3.110595 9.978982e+05 3125.673897 7617.728131 12.684829
std 0.957551 5.934989e+05 125.626877 4424.423167 7.592015
min 1.000000 8.500000e+04 3000.000000 39.000000 0.000000
25% 3.000000 6.200000e+05 3056.000000 4380.000000 7.000000
50% 3.000000 8.300000e+05 3107.000000 6795.000000 11.400000
75% 4.000000 1.220000e+06 3163.000000 10412.000000 16.700000
max 31.000000 1.120000e+07 3980.000000 21650.000000 64.100000

Description of the dataset in MELBOURNE_HOUSE_PRICES_LESS.csv.

This is roughly what was done in the first assignment, but with a different dataset (this one came from the example preceding the homework assignment). The assignment also asked for some interpretation of the dataset description.

Pandas Refresher

I will go ahead and write a little more about pandas, both as notes on the next tutorial and for my own review.

The columns of the DataFrame can be viewed using the columns attribute (iterating over a DataFrame also yields its column names, as below):

DATA = pd.read_csv(DATA_PATH)
print(
  "Columns:", 
  *list(map(lambda item: f"- `{item}`", DATA)),
  sep="\n"
)
Columns:
- `Suburb`
- `Address`
- `Rooms`
- `Type`
- `Price`
- `Method`
- `SellerG`
- `Date`
- `Distance`
- `Postcode`
- `Bedroom2`
- `Bathroom`
- `Car`
- `Landsize`
- `BuildingArea`
- `YearBuilt`
- `CouncilArea`
- `Lattitude`
- `Longtitude`
- `Regionname`
- `Propertycount`

pd.core.series.Series is very similar to pd.DataFrame and shares many attributes. For instance, we can describe an individual column:

DATA["Distance"].describe()
count    34856.000000
mean        11.184929
std          6.788892
min          0.000000
25%          6.400000
50%         10.300000
75%         14.000000
max         48.100000
Name: Distance, dtype: float64

Description of the Distance column.

The following block of code confirms the type of DATA["Distance"] and shows the useful attributes that pd.core.series.Series shares with pd.DataFrame, filtering out methods and attributes that start with an underscore since they are usually dunder or private:

def describe_attrs(col):
  print("Type:", type(col))
  print(
    "Common Attributes:", 
    *list(
      map(
        lambda attr: f"- {attr}", 
        filter(
          lambda attr: not attr.startswith("_"), 
          set(dir(col)) & set(dir(DATA))
        )
      )
    ), 
    sep="\n",
  )

describe_attrs(DATA["Distance"])
Type: <class 'pandas.core.series.Series'>
Common Attributes:
- drop
- median
- mask
- fillna
- mul
- swaplevel
- abs
- max
- subtract
- items
- resample
- pipe
- flags
- expanding
- index
- divide
- iat
- mod
- cumsum
- duplicated
- to_excel
- reset_index
- get
- size
- to_json
- clip
- align
- any
- std
- rtruediv
- interpolate
- ffill
- eq
- ewm
- ndim
- keys
- explode
- memory_usage
- rdiv
- round
- to_csv
- shape
- add
- add_prefix
- plot
- squeeze
- cumprod
- filter
- notnull
- to_latex
- ge
- at_time
- value_counts
- tz_convert
- compare
- iloc
- transform
- isnull
- cov
- first
- aggregate
- pop
- asfreq
- head
- min
- var
- cummax
- nsmallest
- count
- T
- to_markdown
- shift
- radd
- set_flags
- combine_first
- pad
- axes
- reindex
- infer_objects
- product
- copy
- truediv
- gt
- drop_duplicates
- rmul
- rfloordiv
- to_timestamp
- dropna
- droplevel
- bool
- kurtosis
- update
- asof
- attrs
- rank
- lt
- all
- to_period
- last
- rename_axis
- multiply
- first_valid_index
- quantile
- dot
- nlargest
- isna
- bfill
- take
- values
- to_pickle
- convert_dtypes
- apply
- add_suffix
- reindex_like
- notna
- rsub
- at
- sem
- truncate
- set_axis
- between_time
- last_valid_index
- le
- replace
- astype
- mode
- cummin
- rename
- isin
- to_string
- map
- div
- to_clipboard
- agg
- swapaxes
- prod
- empty
- to_xarray
- mean
- tz_localize
- pow
- kurt
- skew
- unstack
- reorder_levels
- diff
- corr
- sample
- nunique
- sum
- tail
- rpow
- groupby
- equals
- to_hdf
- dtypes
- sub
- floordiv
- idxmax
- sort_index
- to_dict
- backfill
- info
- rolling
- hist
- idxmin
- ne
- describe
- pct_change
- combine
- transpose
- where
- rmod
- loc
- xs
- to_numpy
- sort_values
- to_sql

Rows with null values can be removed from the DataFrame using the dropna method. This does not modify the DataFrame in place; rather, it returns a new DataFrame (unless the inplace keyword argument is used; a small sketch of this follows the output below):

def clean_data(data: pd.DataFrame):
  """Clean data and transform category columns into categories."""

  data_clean = data.dropna(axis='index')

  # NOTE: Categories are required for swarm plots.
  data_clean["Rooms"] = (rooms := data_clean["Rooms"]).astype(
    pd.CategoricalDtype(
      categories=list(range(rooms.min() - 1, rooms.max() + 1)),
      ordered=True,
    )
  )
  data_clean["Bathroom"] = (bathroom := data_clean["Bathroom"]).astype(
    pd.CategoricalDtype(
      categories=sorted(set(bathroom.dropna())),
      ordered=True,
    )
  )
  return data_clean

DATA_CLEAN = clean_data(DATA)
DATA_CLEAN_DESCRIPTION = DATA_CLEAN.describe() # We'll need this later.
display(DATA_CLEAN_DESCRIPTION)
/tmp/ipykernel_231/2604006848.py:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_clean["Rooms"] = (rooms := data_clean["Rooms"]).astype(
/tmp/ipykernel_231/2604006848.py:13: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_clean["Bathroom"] = (bathroom := data_clean["Bathroom"]).astype(
Price Distance Postcode Bedroom2 Car Landsize BuildingArea YearBuilt Lattitude Longtitude Propertycount
count 8.887000e+03 8887.000000 8887.000000 8887.000000 8887.000000 8887.000000 8887.000000 8887.000000 8887.000000 8887.000000 8887.000000
mean 1.092902e+06 11.199887 3111.662653 3.078204 1.692247 523.480365 149.309477 1965.753348 -37.804501 144.991393 7475.940137
std 6.793819e+05 6.813402 112.614268 0.966269 0.975464 1061.324228 87.925580 37.040876 0.090549 0.118919 4375.024364
min 1.310000e+05 0.000000 3000.000000 0.000000 0.000000 0.000000 0.000000 1196.000000 -38.174360 144.423790 249.000000
25% 6.410000e+05 6.400000 3044.000000 2.000000 1.000000 212.000000 100.000000 1945.000000 -37.858560 144.920000 4382.500000
50% 9.000000e+05 10.200000 3084.000000 3.000000 2.000000 478.000000 132.000000 1970.000000 -37.798700 144.998500 6567.000000
75% 1.345000e+06 13.900000 3150.000000 4.000000 2.000000 652.000000 180.000000 2000.000000 -37.748945 145.064560 10331.000000
max 9.000000e+06 47.400000 3977.000000 12.000000 10.000000 42800.000000 3112.000000 2019.000000 -37.407200 145.526350 21650.000000

Description of the data minus null rows.
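Here is that small sketch of dropna and the inplace keyword. The frame toy is purely illustrative and not part of the dataset:

import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

toy.dropna(axis="index")    # New frame with the NaN row removed; `toy` is unchanged.
toy.dropna(axis="columns")  # New frame with column "a" removed; `toy` is unchanged.
toy.dropna(inplace=True)    # Drops the NaN row from `toy` itself and returns None.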

The axis keyword argument of DataFrame.dropna determines whether rows (axis='index' or 0) or columns (axis='columns' or 1) containing null values are dropped. From the cleaned data, a subset of columns can be selected using a list as an index:

DATA_FEATURES_COLUMNS = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

# DATA_FEATURES: pd.DataFrame = DATA_CLEAN[DATA_FEATURES_COLUMNS] # type: ignore
# DATA_FEATURES.head(10)
#
# DATA_TARGET = DATA_CLEAN["Price"]
DATA_FEATURES: pd.DataFrame
TEST_FEATURES: pd.DataFrame
DATA_TARGET: pd.Series
TEST_TARGET: pd.Series
DATA_FEATURES, TEST_FEATURES, DATA_TARGET, TEST_TARGET = train_test_split( # type: ignore
  DATA_CLEAN[DATA_FEATURES_COLUMNS], 
  DATA_CLEAN["Price"],
  random_state=1,
)

DATA_FEATURES.head(10)
Rooms Bathroom Landsize Lattitude Longtitude
13207 5 3.0 601.0 -37.76370 144.88430
31715 4 3.0 581.0 -37.98081 145.26047
19587 2 1.0 80.0 -37.89563 145.06992
13599 2 2.0 0.0 -37.81510 145.00030
7106 2 1.0 229.0 -37.83680 144.87680
5808 2 1.0 1111.0 -37.78960 144.93210
4175 2 1.0 1658.0 -37.77810 145.01570
14335 3 1.0 610.0 -37.96322 145.20586
6602 3 3.0 173.0 -37.77300 144.89470
24049 4 2.0 640.0 -37.63931 145.06568

First ten rows of DATA_FEATURES.

Note that sklearn.model_selection.train_test_split is used to split the data so that the model's prediction error can be measured on data that was not used to train it; this held-out portion is referred to as ‘Out Sample’ data.

‘In Sample’ data is used in the initial error analysis of the model used in this notebook. In the section after that, ‘Out Sample’ data is used to assess the accuracy of the model. Finally, predictions are made for entries that did not have a price.
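By default, train_test_split holds out a quarter of the rows for testing. A quick sanity check of the split sizes created above:

# The clean data has 8887 rows, so this should print 6665 and 2222 (a 75/25 split).
print(len(DATA_FEATURES), len(TEST_FEATURES))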

It is also useful to look at the price distribution of both the training and testing datasets. This is easy to do with pd.Series.hist:

def plot_price_dist():
  bins = np.linspace(
    bin_start := DATA_TARGET.quantile(0), 
    bin_stop := DATA_TARGET.quantile(.99), 
    50
  )
  subplot: Axes = DATA_TARGET.hist(bins=bins, label="Training Dataset")
  subplot = TEST_TARGET.hist(bins=bins, label="Testing Dataset")
  subplot.set_title("Distribution of Prices")
  subplot.set_xlabel("Price")
  subplot.set_xlim(bin_start, bin_stop)
  subplot.set_ylabel("Count")
  subplot.legend()
  return subplot


subplot = plot_price_dist()
subplot.figure.savefig("./plot-price-dist.png")
plt.close()

Distribution of prices in test and training datasets.

Predicting Prices with a Tree Model

About scikit-learn

It is easy to install scikit-learn using poetry or pip:

poetry add scikit-learn

Model Implementation

The following cell fits a model to the houses for which the price is known; it will later be used to predict prices:

def create_model(features: pd.DataFrame, target, /, cls: Type = DecisionTreeRegressor, **kwargs):
  tree = cls(random_state=1, **kwargs)
  tree.fit(features, target)
  return tree


TREE = create_model(DATA_FEATURES, DATA_TARGET)

Model In Sample Error Analysis

Now we should measure the accuracy of the model against some in sample data. This is done to contrast against our out sample analysis in the next section of this notebook. The following function creates a dataframe for comparison:

def create_price_compare(
  tree,
  data: pd.DataFrame,
  *,
  price = None,
):
  """Create a dataframe with price, actual price, error, error_percent and 
  feature columns."""

  data_features = data[DATA_FEATURES_COLUMNS]
  price_actual = price if price is not None else data["Price"] 
  price_predictions = tree.predict(data_features)
  error = np.array(
    list(
      actual - predicted
      for predicted, actual in zip(price_predictions, price_actual)
    )
  )
  df = pd.DataFrame(
    {
      "predicted": price_predictions,
      "actual": price_actual, 
      "error": error,
      "error_percent": 100 * abs(error / price_actual)
    }
  )
  df = df.sort_values(by="error_percent")
  df = df.join(data_features)
  return df


PRICE_COMPARE = create_price_compare(TREE, DATA_CLEAN)
PRICE_COMPARE["error_percent"].describe()
count    8887.000000
mean        5.749109
std        20.797978
min         0.000000
25%         0.000000
50%         0.000000
75%         0.773343
max      1284.615385
Name: error_percent, dtype: float64

Description of PRICE_COMPARE["error_percent"].

The description indicates that the mean error is reasonably low. Let’s now plot the distribution of prediction errors within the in sample data:

def create_model_errdist(price_compare: pd.DataFrame, *, ax=None, **kwargs):
  percents = np.linspace(0, 100, 50)
  counts = list(
    price_compare[
      (percent <= price_compare["error_percent"])
      & (price_compare["error_percent"] < percent + 2)
    ]["error_percent"].count()  # type: ignore
    for percent in percents
  )
  data = pd.DataFrame({"percents": percents, "counts": counts})
  return sb.lineplot(data, x="percents", y="counts", ax=ax, **kwargs)


err_dist = create_model_errdist(PRICE_COMPARE)
err_dist.set_title("Model In Sample Error Distribution")
err_dist.figure.savefig("./err-dist-in-sample.png") # type: ignore
plt.close()

In Sample Error Distribution

This is good (as most of the error is distributed between \(0\) and \(5\) percent). However, as will be shown in the next section, this cannot be expected for any out sample data.

Model Out Sample Error Analysis

Conveniently, the functions above can be used for our out sample data. This is as easy as

TEST_PRICE_COMPARE = create_price_compare(TREE, TEST_FEATURES, price=TEST_TARGET)
TEST_PRICE_COMPARE["error_percent"].describe()
count    2222.000000
mean       22.838637
std        36.606299
min         0.000000
25%         6.757544
50%        15.991803
75%        29.227409
max      1284.615385
Name: error_percent, dtype: float64

Prediction analysis for out sample data.

Next, we can create a plot of the count per percent error:

def mnae(price_compare: pd.DataFrame):
  cpc = abs(price_compare["error_percent"])
  return sum(cpc) / (100 * len(cpc)) # type: ignore


plt.axvline(mnae(TEST_PRICE_COMPARE) * 100)
err_dist = create_model_errdist(TEST_PRICE_COMPARE, color="red")
err_dist.set_title("Model Out Sample Error Distribution")
err_dist.figure.savefig(DIR / "err-dist-out-sample.png") # type: ignore
plt.close()

Out sample error distribution.

This plot does not look at all like the in sample error, which decays immediately with its spike contained under \(5\) percent. It indicates that error is generally high on the out sample data, implying that there is some room for improvement. The vertical line shows the \(mnae\), the mean normalized absolute error (as a percentage).
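Written out, the quantity computed by mnae above is

\[
\mathrm{mnae} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|,
\]

where \(y_i\) is the actual price and \(\hat{y}_i\) the predicted price; the vertical line is drawn at \(100 \cdot \mathrm{mnae}\) to put it on the percent scale of the plot.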

Improving the Model

After running the model against some out sample data, it is clear that the model does not perform well right out of the box. If we were to only look at in sample data, this would not be apparent.

It is possible to change model parameters to attempt to tune the model. To make comparisons, we should combine the steps above into a function that produces the analysis dataframe for each tree.

def create_tree_analysis(
  features: pd.DataFrame, 
  features_test: pd.DataFrame,
  target, 
  target_test,
  /,
  **kwargs
):
  tree = create_model(features, target, **kwargs)
  tree_price_compare = create_price_compare(tree, features_test, price=target_test)
  return tree, tree_price_compare

This function then can be mapped over many different parameter sets.

def create_model_errdist_many(err_dists: list[pd.DataFrame], labels: Iterable[str]):
  fig, axs = plt.subplots(ncols=1, nrows=1)

  plot = None
  for dist, label in zip(err_dists, labels):
    plot = create_model_errdist(dist, ax=axs, label=label)

  assert plot is not None
  axs.legend()
  return plot


MAX_LEAF_NODES = list(map(lambda k: 5 * 10 ** k,  range(5)))
TREES, PRICE_COMPARES = zip(
  *(
    create_tree_analysis(
      DATA_FEATURES, TEST_FEATURES, DATA_TARGET, TEST_TARGET,
      max_leaf_nodes=max_leaf_nodes,
    )
    for max_leaf_nodes in MAX_LEAF_NODES
  )
)


err_dist = create_model_errdist_many(PRICE_COMPARES, map(str, MAX_LEAF_NODES))
err_dist.set_title("Model Out Sample Error for Various Models")
err_dist.figure.savefig("./err-dist-out-sample-many.png") # type: ignore
plt.close()

Out sample error distributions for models with various max_leaf_node values.

The best curves should have a strong peak towards the front (implying that error tends to be lower for more entries) and decay rapidly. The initial model would appear to be a reasonable fit because it matches the best curves (where max_leaf_nodes is \(5000\) and \(50000\)).

It would appear that there is not much room for improvement along the max_leaf_nodes parameter. An objective choice of the number of leaf nodes can be made by minimizing the mean normalized absolute error:

def min_mae(items, price_compares):
  maes = map(mnae, price_compares)
  best, candidate_mae = min(zip(items, maes), key=lambda pair: pair[1])
  return best, candidate_mae


BEST, BEST_MAE = min_mae(MAX_LEAF_NODES, PRICE_COMPARES)
print(f"The minimized mnae (`{BEST_MAE}`) has `max_leaf_nodes = {BEST}`.")
The minimized mnae (`0.21206521792307836`) has `max_leaf_nodes = 500`.

From this we will take the corresponding tree as the best model:

TREE = TREES[MAX_LEAF_NODES.index(BEST)]

Making Predictions with the Model

The goal here is to predict prices for the rows of DATA that did not have a price and to make some pretty plots comparing them. Rows with a null Price can be found as follows:

def create_price_null(data: pd.DataFrame) -> pd.DataFrame:
  price_null = data["Price"].isnull()
  return data[price_null][DATA_FEATURES_COLUMNS] # type: ignore

# DATA_FEATURES_PRICE_NULL: pd.DataFrame = DATA[DATA["Price"].isnull()][DATA_FEATURES_COLUMNS] # type: ignore
DATA_FEATURES_PRICE_NULL = create_price_null(DATA)
DATA_FEATURES_PRICE_NULL.describe()
Rooms Bathroom Landsize Lattitude Longtitude
count 7610.000000 5831.000000 5065.000000 5888.000000 5888.000000
mean 3.169645 1.742926 593.989733 -37.823724 145.020178
std 1.010263 0.790775 1564.289341 0.084040 0.116506
min 1.000000 0.000000 0.000000 -38.184630 144.431620
25% 3.000000 1.000000 245.000000 -37.868730 144.968300
50% 3.000000 2.000000 538.000000 -37.826900 145.022990
75% 4.000000 2.000000 696.000000 -37.774300 145.083840
max 12.000000 12.000000 80000.000000 -37.390200 145.489850

Description of the dataset rows with no price specified.

This works because DATA["Price"].isnull() produces a boolean Series aligned with the DataFrame's index, making it a suitable mask for selecting rows. From the description it is clear that this worked: only the feature columns appear, with no Price statistics. Now it is time to attempt to fill in these values:

def create_price_predictions(tree: DecisionTreeRegressor, features: pd.DataFrame):
  predictions = tree.predict(features)
  completed = features.copy()
  completed["Price"] = predictions
  return completed


DATA_PRICE_NULL_PREDICTIONS = create_price_predictions(TREE, DATA_FEATURES_PRICE_NULL)
DATA_PRICE_NULL_PREDICTIONS.describe()
Rooms Bathroom Landsize Lattitude Longtitude Price
count 7610.000000 5831.000000 5065.000000 5888.000000 5888.000000 7.610000e+03
mean 3.169645 1.742926 593.989733 -37.823724 145.020178 1.348987e+06
std 1.010263 0.790775 1564.289341 0.084040 0.116506 7.938628e+05
min 1.000000 0.000000 0.000000 -38.184630 144.431620 3.663929e+05
25% 3.000000 1.000000 245.000000 -37.868730 144.968300 8.003333e+05
50% 3.000000 2.000000 538.000000 -37.826900 145.022990 1.185000e+06
75% 4.000000 2.000000 696.000000 -37.774300 145.083840 1.675000e+06
max 12.000000 12.000000 80000.000000 -37.390200 145.489850 9.000000e+06

Price predictions for the rows missing a price.

Note that TREE will reject the input if it contains all of the columns rather than just the feature columns, which is why only DATA_FEATURES_COLUMNS are selected in create_price_null. The description of this dataframe should be reasonably comparable to the description of DATA_CLEAN.
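One quick way to confirm which columns the fitted model expects (a sketch assuming scikit-learn 1.0 or newer, which records feature_names_in_ when an estimator is fit on a DataFrame):

print(TREE.n_features_in_)     # 5
print(TREE.feature_names_in_)  # The five names in DATA_FEATURES_COLUMNS.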

def create_price_predictions_compare(data_clean: pd.DataFrame, data_interpolated: pd.DataFrame):
  interpolated = data_interpolated["Price"].describe()
  actual = data_clean["Price"].describe()
  # error = interpolated - actual

  return pd.DataFrame(
    {
      "predicted": interpolated,
      "actual": actual,
      # "error": error,
      # "error_percent": 100 * (error / actual),
    }
  )


# NOTE: The last object in a code cell is displayed by default, which is why
#       this dataframe is created but not assigned.
create_price_predictions_compare(DATA_CLEAN, DATA_PRICE_NULL_PREDICTIONS)
predicted actual
count 7.610000e+03 8.887000e+03
mean 1.348987e+06 1.092902e+06
std 7.938628e+05 6.793819e+05
min 3.663929e+05 1.310000e+05
25% 8.003333e+05 6.410000e+05
50% 1.185000e+06 9.000000e+05
75% 1.675000e+06 1.345000e+06
max 9.000000e+06 9.000000e+06

Comparison of the interpolated and actual price descriptions.

Now that we know the predicted prices are of a reasonable magnitude (by comparison with the actual prices above), we can combine the predictions and the clean data, labeling rows as estimated or not in the Estimated column.

def create_data_completed(data_clean: pd.DataFrame, data_interpolated: pd.DataFrame, ) -> pd.DataFrame:

  # NOTE: Create dataframe with features and prices, add that it is not estimated
  data_estimated_not = data_clean[[*DATA_FEATURES_COLUMNS, "Price"]].copy()
  data_estimated_not["Estimated"] = pd.Series(data=(False for _ in range(len(data_clean))))

  # NOTE: Add estimated to the estimated prices dataframe.
  data_interpolated = data_interpolated.copy()
  data_interpolated["Estimated"]= pd.Series(data=(True for _ in range(len(data_interpolated))))

  return pd.concat((data_estimated_not, data_interpolated)) # type: ignore


DATA_COMPLETED = create_data_completed(DATA_CLEAN, DATA_PRICE_NULL_PREDICTIONS)

This will allow us to generate some nice swarm plots in seaborn.

def create_prediction_plots(data_completed):
  price_min: float = data_completed["Price"].min() 
  price_ub = data_completed["Price"].quantile(0.99).max() 

  if not (_ := DIR / "rooms-stripplot.png").exists():
    rooms_plot = sb.swarmplot(
      data_completed[data_completed["Rooms"] <= 5], # noqa: reportArgumentType
      y="Price", 
      x="Rooms",
      hue="Estimated",
      dodge=True,
      size=0.99,
    )
    rooms_plot.set_xlim(0, 6)
    rooms_plot.set_ylim(price_min, price_ub)
    rooms_plot.figure.savefig(_) # type: ignore
    plt.close()

  if not (_ := DIR / "bathrooms-stripplot.png").exists():
    bathrooms_plot = sb.swarmplot(
      data_completed[data_completed["Bathroom"] <= 5], # noqa: reportArgumentType 
      y="Price", 
      x="Bathroom",
      hue="Estimated",
      dodge=True,
      size=0.99,
    )
    bathrooms_plot.set_xlim(0, 5)
    bathrooms_plot.set_ylim(price_min, price_ub)
    bathrooms_plot.figure.savefig(_) # type: ignore
    plt.close()

  # if not (_ := DIR / "geospacial-scatterplot.png").exists() or True:
  #   geospacial_plot = sb.scatterplot(
  #     DATA_COMPLETED,
  #     x="Longtitude",
  #     y="Lattitude",
  #     hue="Price",
  #     alpha=0.5,
  #   )
  #   # geospacial_plot = sb.histplot(
  #   #   DATA_COMPLETED,
  #   #   x=DATA_COMPLETED["Longtitude"],
  #   #   y=DATA_COMPLETED["Lattitude"],
  #   #   bins=25,
  #   #   cmap="mako",
  #   #   hue="Price"
  #   # )
  #
  #   geospacial_plot.figure.savefig(_)
  #   plt.close()

create_prediction_plots(DATA_COMPLETED)
/quarto/.venv/lib/python3.10/site-packages/seaborn/categorical.py:3399: UserWarning: 16.1% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
/quarto/.venv/lib/python3.10/site-packages/seaborn/categorical.py:3399: UserWarning: 13.1% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
/quarto/.venv/lib/python3.10/site-packages/seaborn/categorical.py:3399: UserWarning: 15.2% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
/quarto/.venv/lib/python3.10/site-packages/seaborn/categorical.py:3399: UserWarning: 18.7% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
/quarto/.venv/lib/python3.10/site-packages/seaborn/categorical.py:3399: UserWarning: 14.5% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
/quarto/.venv/lib/python3.10/site-packages/seaborn/categorical.py:3399: UserWarning: 17.2% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
/quarto/.venv/lib/python3.10/site-packages/seaborn/categorical.py:3399: UserWarning: 5.7% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)

Swarm Plot by Rooms

Swarm Plot by Bathrooms.

It is interesting to notice the stacking of identical values on the prediction side. This happens because the decision tree follows a path down to the same leaf node for many inputs, so it can only ever predict as many distinct prices as it has leaves; this is an inherent limitation of a single decision tree. In the next section an attempt to remedy this is made.
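To see this concretely, a quick check (a sketch using the fitted TREE and the predictions above) compares the number of distinct predicted prices against the number of leaves in the tree:

# At most max_leaf_nodes distinct values are possible (500 for the tree selected above).
distinct_prices = DATA_PRICE_NULL_PREDICTIONS["Price"].nunique()
print("Distinct predicted prices:", distinct_prices)
print("Leaves in the tree:", TREE.get_n_leaves())

Every one of the thousands of rows must land on one of those leaf values, hence the visible stacking.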

Making Predictions With an Ensemble of Trees

A forest is simply many trees. To try to improve on the predictions made by a single tree, and to temper over-fitting and under-fitting with some sort of consensus, many trees are trained and their results are averaged. sklearn.ensemble.RandomForestRegressor may be constructed and trained in exactly the same way as DecisionTreeRegressor, e.g.

FOREST = create_model(DATA_FEATURES, DATA_TARGET, cls=RandomForestRegressor)

and now we may compare to some out sample data:

PRICE_COMPARE = create_price_compare(FOREST, TEST_FEATURES, price=TEST_TARGET)
MNAE = mnae(PRICE_COMPARE) * 100

plt.axvline(MNAE)
err_dist = create_model_errdist(PRICE_COMPARE, color="red")
err_dist.set_title("Error Distribution of Random Forest")
err_dist.figure.savefig("./err-dist-forest.png") # type: ignore
plt.close()

Random Forest Error Distribution

This leaves us with a mean normalized absolute error below \(20\)%, which is a clear improvement over the single decision tree.
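For reference, the improvement can be printed directly from the values computed above (BEST_MAE for the tuned tree, MNAE for the forest):

print(f"Best single tree mnae: {100 * BEST_MAE:.1f}%")
print(f"Random forest mnae:    {MNAE:.1f}%")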