Making some models and figures using Melbourne housing data from the dataset anthonypino/melbourne-housing-market. These are notes and exposition from the first three assignments of Dan Becker's course.
Since this is the first assignment, and since I would much rather automate things, it is worth knowing that the Kaggle API has a Python client available on PyPI. It may be installed using `pip install kaggle` or, in my case, `poetry add kaggle`.
It turns out that the kaggle library is not the only client available for using Kaggle from Python. There is also a solution called kagglehub, which can be installed with `poetry add kagglehub`.
```python
from typing import Iterable, Type

import kagglehub
import pathlib
import io
import contextlib

from matplotlib.axes import Axes
import matplotlib.pyplot as plt
import seaborn as sb
import numpy as np
import pandas as pd

from IPython.display import display
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

DIR = pathlib.Path(".").resolve()

# NOTE: The ``path`` argument does not specify the path downloaded to, but
#       instead a subpath of the data.
DATA_DOWNLOAD_IO = io.StringIO()
DATA_ID = "anthonypino/melbourne-housing-market"
with contextlib.redirect_stdout(DATA_DOWNLOAD_IO):
    DATA_DIR = pathlib.Path(kagglehub.dataset_download(DATA_ID))

DATA_PATH_LESS = DATA_DIR / "MELBOURNE_HOUSE_PRICES_LESS.csv"
DATA_PATH = DATA_DIR / "Melbourne_housing_FULL.csv"
```
```
Warning: Looks like you're using an outdated `kagglehub` version, please consider updating (latest version: 0.3.1)
Downloading from https://www.kaggle.com/api/v1/datasets/download/anthonypino/melbourne-housing-market?dataset_version_number=27...
```
Note that it is necessary to capture stdout if you want your notebook to look nice. DATA_PATH should be a path to the full data, and (obviously) DATA_PATH_LESS should be a path to the partial data. It will look something like
Description of the dataset in MELBOURNE_HOUSE_PRICES_LESS.csv.
This is roughly what was done in the first assignment, but with a different data set (this one came from the example preceding the homework assignment). Further, the assignment asked for some interpretation of the data set description.
Pandas Refresher
I will go ahead and write about pandas a little more as notes on the next tutorial and for my own review.
The columns of the DataFrame can be viewed using the columns attribute:
```python
DATA = pd.read_csv(DATA_PATH)
print("Columns:", *list(map(lambda item: f"- `{item}`", DATA)), sep="\n")
```
pd.core.series.Series is very similar to pd.DataFrame and shares many attributes. For instance, we can describe an individual column:
DATA["Distance"].describe()
```
count    34856.000000
mean        11.184929
std          6.788892
min          0.000000
25%          6.400000
50%         10.300000
75%         14.000000
max         48.100000
Name: Distance, dtype: float64
```
Description of the Distance column.
The following block of code confirms the type of DISTANCE and shows the useful attributes of pd.core.series.Series by filtering out the methods and attributes that start with an underscore, since they are usually dunder methods or private:
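A minimal sketch of such a block, assuming DISTANCE refers to the Distance column:

```python
# NOTE: A sketch; ``DISTANCE`` is assumed to be the ``Distance`` column above.
DISTANCE = DATA["Distance"]
print(type(DISTANCE))  # <class 'pandas.core.series.Series'>
print(*(attr for attr in dir(DISTANCE) if not attr.startswith("_")), sep="\n")
```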
Rows or columns containing null values can be removed from the DataFrame using the dropna method. This does not modify the DataFrame in place; rather, it returns a new DataFrame (unless the inplace keyword argument is used):
```python
def clean_data(data: pd.DataFrame):
    """Clean data and transform category columns into categories."""

    data_clean = data.dropna(axis="index")

    # NOTE: Categories are required for swarm plots.
    data_clean["Rooms"] = (rooms := data_clean["Rooms"]).astype(
        pd.CategoricalDtype(
            categories=list(range(rooms.min() - 1, rooms.max() + 1)),
            ordered=True,
        )
    )
    data_clean["Bathroom"] = (bathroom := data_clean["Bathroom"]).astype(
        pd.CategoricalDtype(
            categories=sorted(set(bathroom.dropna())),
            ordered=True,
        )
    )

    return data_clean


DATA_CLEAN = clean_data(DATA)
DATA_CLEAN_DESCRIPTION = DATA_CLEAN.describe()  # We'll need this later.
display(DATA_CLEAN_DESCRIPTION)
```
```
/tmp/ipykernel_231/2604006848.py:7: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_clean["Rooms"] = (rooms := data_clean["Rooms"]).astype(
/tmp/ipykernel_231/2604006848.py:13: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_clean["Bathroom"] = (bathroom := data_clean["Bathroom"]).astype(
```
|       | Price        | Distance    | Postcode    | Bedroom2    | Car         | Landsize     | BuildingArea | YearBuilt   | Lattitude   | Longtitude  | Propertycount |
|-------|--------------|-------------|-------------|-------------|-------------|--------------|--------------|-------------|-------------|-------------|---------------|
| count | 8.887000e+03 | 8887.000000 | 8887.000000 | 8887.000000 | 8887.000000 | 8887.000000  | 8887.000000  | 8887.000000 | 8887.000000 | 8887.000000 | 8887.000000   |
| mean  | 1.092902e+06 | 11.199887   | 3111.662653 | 3.078204    | 1.692247    | 523.480365   | 149.309477   | 1965.753348 | -37.804501  | 144.991393  | 7475.940137   |
| std   | 6.793819e+05 | 6.813402    | 112.614268  | 0.966269    | 0.975464    | 1061.324228  | 87.925580    | 37.040876   | 0.090549    | 0.118919    | 4375.024364   |
| min   | 1.310000e+05 | 0.000000    | 3000.000000 | 0.000000    | 0.000000    | 0.000000     | 0.000000     | 1196.000000 | -38.174360  | 144.423790  | 249.000000    |
| 25%   | 6.410000e+05 | 6.400000    | 3044.000000 | 2.000000    | 1.000000    | 212.000000   | 100.000000   | 1945.000000 | -37.858560  | 144.920000  | 4382.500000   |
| 50%   | 9.000000e+05 | 10.200000   | 3084.000000 | 3.000000    | 2.000000    | 478.000000   | 132.000000   | 1970.000000 | -37.798700  | 144.998500  | 6567.000000   |
| 75%   | 1.345000e+06 | 13.900000   | 3150.000000 | 4.000000    | 2.000000    | 652.000000   | 180.000000   | 2000.000000 | -37.748945  | 145.064560  | 10331.000000  |
| max   | 9.000000e+06 | 47.400000   | 3977.000000 | 12.000000   | 10.000000   | 42800.000000 | 3112.000000  | 2019.000000 | -37.407200  | 145.526350  | 21650.000000  |
Description of the data minus null rows.
The axis keyword argument of DataFrame.dropna determines whether rows (aka index or 0) or columns (columns or 1) containing null values are dropped. From this data, a subset of columns can be selected using a list as an index:
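For instance, a sketch of selecting feature columns (the exact feature list here is an assumption):

```python
# NOTE: A sketch of list indexing; this particular feature list is an assumption.
DATA_FEATURES_COLUMNS = ["Rooms", "Bathroom", "Landsize", "Lattitude", "Longtitude"]
display(DATA_CLEAN[DATA_FEATURES_COLUMNS].head())
```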
Note that sklearn.model_selection.train_test_split is used to chunk up the data so that we can compute the error of the model's predictions on data outside of the set used to train it; this is referred to as 'Out Sample' data.
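A sketch of such a split, assuming DATA_FEATURES and DATA_TARGET (used below) name the training chunk:

```python
# NOTE: A sketch; the variable names and the default test fraction are assumptions.
DATA_FEATURES, DATA_FEATURES_TEST, DATA_TARGET, DATA_TARGET_TEST = train_test_split(
    DATA_CLEAN[DATA_FEATURES_COLUMNS],
    DATA_CLEAN["Price"],
    random_state=1,
)
```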
‘In Sample’ data is used in the initial error analysis of the model used in this notebook. In the section after that, ‘Out Sample’ data is used to assess the accuracy of the model. Finally, predictions are made for entries that did not have a price.
It is also useful to look at the price distribution of both the training and testing datasets. This is easy to do with pd.DataFrame.hist:
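A sketch, reusing the split names assumed above:

```python
# NOTE: A sketch; the bin count and layout are assumptions.
prices = pd.DataFrame(
    {
        "train": DATA_TARGET.reset_index(drop=True),
        "test": DATA_TARGET_TEST.reset_index(drop=True),
    }
)
prices.hist(bins=50, sharey=True)
plt.show()
```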
It is easy to install scikit-learn using poetry or pip like
poetry add scikit-learn
Model Implementation
The following cell builds a model from the houses for which the price is known:
```python
def create_model(
    features: pd.DataFrame,
    target,
    /,
    cls: Type = DecisionTreeRegressor,
    **kwargs,
):
    tree = cls(random_state=1, **kwargs)
    tree.fit(features, target)
    return tree


TREE = create_model(DATA_FEATURES, DATA_TARGET)
```
Model In Sample Error Analysis
Now we should measure the accuracy of the model against some in sample data. This is done to contrast against our out sample analysis in the next section of this notebook. The following function creates a dataframe for comparison:
```python
def create_price_compare(
    tree,
    data: pd.DataFrame,
    *,
    price=None,
):
    """Create a dataframe with price, actual price, error, error_percent and
    feature columns."""

    data_features = data[DATA_FEATURES_COLUMNS]
    price_actual = price if price is not None else data["Price"]
    price_predictions = tree.predict(data_features)
    error = np.array(
        list(
            actual - predicted
            for predicted, actual in zip(price_predictions, price_actual)
        )
    )

    df = pd.DataFrame(
        {
            "predicted": price_predictions,
            "actual": price_actual,
            "error": error,
            "error_percent": 100 * abs(error / price_actual),
        }
    )
    df = df.sort_values(by="error_percent")
    df = df.join(data_features)

    return df


PRICE_COMPARE = create_price_compare(TREE, DATA_CLEAN)
```
```
count    8887.000000
mean        5.749109
std        20.797978
min         0.000000
25%         0.000000
50%         0.000000
75%         0.773343
max      1284.615385
Name: error_percent, dtype: float64
```
Description of PRICE_COMPARE["error_percent"].
The description indicates that the mean error is reasonably low. Let’s now plot the distribution of prediction errors within the in sample data:
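A sketch of such a plot:

```python
# NOTE: A sketch; the bin count is an assumption.
ax = PRICE_COMPARE["error_percent"].hist(bins=100)
ax.set_xlabel("error_percent")
ax.set_ylabel("count")
plt.show()
```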
This is good (as most of the error is distributed between \(0\) and \(5\) percent). However, as will be shown in the next section, this cannot be expected for any out sample data.
Model Out Sample Error Analysis
Conveniently, the functions above can be reused for our out sample data. This is as easy as passing the held out split to create_price_compare (the sketch below uses the split names assumed earlier):
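```python
# NOTE: A sketch; ``DATA_FEATURES_TEST`` and ``DATA_TARGET_TEST`` come from the
#       split sketch above.
data_test = DATA_FEATURES_TEST.join(DATA_TARGET_TEST)
PRICE_COMPARE_TEST = create_price_compare(TREE, data_test)

# Plot the out sample error distribution with a vertical line at the mean
# normalized absolute error (as a percentage).
ax = PRICE_COMPARE_TEST["error_percent"].hist(bins=100)
ax.axvline(PRICE_COMPARE_TEST["error_percent"].mean(), color="red")
ax.set_xlabel("error_percent")
plt.show()
```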
This plot does not look at all like the in sample error, which decays immediately, with its spike contained under \(5\) percent. It indicates that error is generally higher on the out sample data, implying that there is some room for improvement. The vertical line shows the \(\mathrm{mnae}\), the mean normalized absolute error (as a percentage).
Improving the Model
After running the model against some out sample data, it is clear that the model does not perform well right out of the box. If we were to only look at in sample data, this would not be apparent.
It is possible to change model parameters to attempt to tune the model. To make comparisons, we should combine the above steps into a function that produces the analysis dataframe for each tree.
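A sketch of such a sweep (the grid of max_leaf_nodes values and the names MAX_LEAF_NODES, TREES, and COMPARES are assumptions matching how they are used below):

```python
# NOTE: A sketch; the grid of values is an assumption.
MAX_LEAF_NODES = [5, 50, 500, 5000, 50000]
TREES = [
    create_model(DATA_FEATURES, DATA_TARGET, max_leaf_nodes=size)
    for size in MAX_LEAF_NODES
]
COMPARES = [create_price_compare(tree, data_test) for tree in TREES]
```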
The best curves should have a strong peak towards the front (implying that error tends to be lower for more entries) and should decay rapidly. The initial model appears to be a reasonable fit because it matches the best curves (where max_leaf_nodes is \(5000\) and \(50000\)).
It would appear that there is not much room for improvement of the model along the max_leaf_nodes parameter. An objective choice of the number of leaf nodes can be made by minimizing the mean normalized absolute error:
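A sketch of that minimization, assuming BEST holds the winning max_leaf_nodes value (as used below):

```python
# NOTE: A sketch; ``BEST`` is assumed to hold the winning ``max_leaf_nodes``.
MNAES = [compare["error_percent"].mean() / 100 for compare in COMPARES]
BEST = MAX_LEAF_NODES[MNAES.index(min(MNAES))]
print(f"The minimized mnae (`{min(MNAES)}`) has `max_leaf_nodes = {BEST}`.")
```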
The minimized mnae (`0.21206521792307836`) has `max_leaf_nodes = 500`.
From this we will take the corresponding tree as the best model:
```python
TREE = TREES[MAX_LEAF_NODES.index(BEST)]
```
Making Predictions with the Model
The goal here is to predict prices for the rows of DATA that did not have a price and to make some pretty plots comparing them against the known prices. Rows with a null price can be found as follows:
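A sketch, assuming DATA_PRICE_NULL names the rows missing a price:

```python
# NOTE: A sketch; ``DATA_PRICE_NULL`` is an assumed name.
DATA_PRICE_NULL = DATA[DATA["Price"].isna()]
display(DATA_PRICE_NULL.describe())
```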
Description of the dataset rows with no price specified.
This works because DATA["Price"] shares the indices of the respective rows within the dataframe, making it a suitable index. In the description it is clear that this worked because the price stats are either zero or NaN. Now it is time to attempt to fill in these values:
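A sketch, assuming DATA_PRICE_NULL_PREDICTIONS matches its use below and that rows missing any feature are dropped first:

```python
# NOTE: A sketch; rows missing any feature are dropped so that TREE accepts the
#       input, and only the feature columns are kept before predicting.
DATA_PRICE_NULL_PREDICTIONS = DATA_PRICE_NULL.dropna(subset=DATA_FEATURES_COLUMNS)[
    DATA_FEATURES_COLUMNS
].copy()
DATA_PRICE_NULL_PREDICTIONS["Price"] = TREE.predict(DATA_PRICE_NULL_PREDICTIONS)
display(DATA_PRICE_NULL_PREDICTIONS.describe())
```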
Note that TREE will reject the input if it contains all of the columns and not just the feature columns, which is why DATA_PRICE_NULL is indexed. The description of this dataframe should be reasonably comparable to the description of DATA_CLEAN.
```python
def create_price_predictions_compare(
    data_clean: pd.DataFrame, data_interpolated: pd.DataFrame
):
    interpolated = data_interpolated["Price"].describe()
    actual = data_clean["Price"].describe()
    # error = interpolated - actual

    return pd.DataFrame(
        {
            "predicted": interpolated,
            "actual": actual,
            # "error": error,
            # "error_percent": 100 * (error / actual),
        }
    )


# NOTE: The last object in a code cell is displayed by default, thus why this
#       dataframe is created yet not assigned.
create_price_predictions_compare(DATA_CLEAN, DATA_PRICE_NULL_PREDICTIONS)
```
|       | predicted    | actual       |
|-------|--------------|--------------|
| count | 7.610000e+03 | 8.887000e+03 |
| mean  | 1.348987e+06 | 1.092902e+06 |
| std   | 7.938628e+05 | 6.793819e+05 |
| min   | 3.663929e+05 | 1.310000e+05 |
| 25%   | 8.003333e+05 | 6.410000e+05 |
| 50%   | 1.185000e+06 | 9.000000e+05 |
| 75%   | 1.675000e+06 | 1.345000e+06 |
| max   | 9.000000e+06 | 9.000000e+06 |
Comparison of interpolated data and completed data descriptions.
Now that we know the data descriptions are reasonable (the magnitudes agree with the provided data), we can combine the predictions and the clean data and label each row as estimated or not in the Estimated column.
```python
def create_data_completed(
    data_clean: pd.DataFrame,
    data_interpolated: pd.DataFrame,
) -> pd.DataFrame:
    # NOTE: Create dataframe with features and prices, add that it is not estimated.
    data_estimated_not = data_clean[[*DATA_FEATURES_COLUMNS, "Price"]].copy()
    data_estimated_not["Estimated"] = pd.Series(
        data=(False for _ in range(len(data_clean)))
    )

    # NOTE: Add estimated to the estimated prices dataframe.
    data_interpolated = data_interpolated.copy()
    data_interpolated["Estimated"] = pd.Series(
        data=(True for _ in range(len(data_interpolated)))
    )

    return pd.concat((data_estimated_not, data_interpolated))  # type: ignore


DATA_COMPLETED = create_data_completed(DATA_CLEAN, DATA_PRICE_NULL_PREDICTIONS)
```
This will allow us to generate some nice swarm plots in seaborn.
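For instance, a sketch of one such plot (the column choice, marker size, and figure size are assumptions):

```python
# NOTE: A sketch of a single swarm plot over the completed data.
fig, ax = plt.subplots(figsize=(12, 8))
sb.swarmplot(
    data=DATA_COMPLETED,
    x="Rooms",
    y="Price",
    hue="Estimated",
    size=2,
    ax=ax,
)
plt.show()
```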
```
/quarto/.venv/lib/python3.10/site-packages/seaborn/categorical.py:3399: UserWarning: 16.1% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
/quarto/.venv/lib/python3.10/site-packages/seaborn/categorical.py:3399: UserWarning: 13.1% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
/quarto/.venv/lib/python3.10/site-packages/seaborn/categorical.py:3399: UserWarning: 15.2% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
/quarto/.venv/lib/python3.10/site-packages/seaborn/categorical.py:3399: UserWarning: 18.7% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
/quarto/.venv/lib/python3.10/site-packages/seaborn/categorical.py:3399: UserWarning: 14.5% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
/quarto/.venv/lib/python3.10/site-packages/seaborn/categorical.py:3399: UserWarning: 17.2% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
/quarto/.venv/lib/python3.10/site-packages/seaborn/categorical.py:3399: UserWarning: 5.7% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
```
It is interesting to notice the stacking of identical values on the prediction side. This means that the decision tree follows the same path down to the same leaf node for many inputs, an inherent problem with decision trees. In the next section an attempt to remedy this is made.
Making Predictions With an Ensemble of Trees
A forest is simply many trees. To try to improve on the predictions made by a single tree, and to mitigate over-fitting and under-fitting through some sort of consensus, many trees are used and their results are averaged. sklearn.ensemble.RandomForestRegressor may be constructed and trained in exactly the same way as DecisionTreeRegressor, e.g. via the create_model helper (the sketch below assumes default forest parameters and the earlier split names):
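```python
# NOTE: A sketch; ``FOREST`` and ``FOREST_COMPARE`` are illustrative names, and
#       the default RandomForestRegressor parameters are assumed.
FOREST = create_model(DATA_FEATURES, DATA_TARGET, cls=RandomForestRegressor)
FOREST_COMPARE = create_price_compare(FOREST, data_test)
display(FOREST_COMPARE["error_percent"].describe())
```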