Shower Time!

Analyse humidity patterns in the bathroom

Posted by Gregor Tätzner


A couple of weeks ago I installed new Zigbee temperature and humidity sensors as part of my smart home network. You can find these for a good price on AliExpress and they work really well with Home Assistant.

Zigbee temperature sensor

After some time collecting data in various rooms, I had a look at the humidity chart of the bathroom and noticed some interesting patterns, or rather spikes:

Humidity Bath

What are these? The effect of the shower in our bathroom, of course! Each spike is a substantial increase in humidity as water vapour is released into the air. A spike could also be the result of a bath, but I don't remember exactly when we used the shower and when the bath.

Pingu bath

Let's count outliers with Spark

Visually we can see around 7-8 spikes, which is a good estimate, but with a longer history we want an automated solution. First, this creates the humidity line chart for visual inspection:

import matplotlib.pyplot as plt  # needed for plt.legend below
import pyspark.pandas as ps
from pyspark.sql.functions import *
from pyspark.sql.types import *
from datetime import datetime, timedelta

bathHmdDf = spark.sql("""
SELECT *
FROM sensordata
WHERE attributes.friendly_name = 'Feuchtigkeit Bad'
""")

bathHmdPsDf = bathHmdDf.toPandas()
bathHmdPsDf.state = ps.to_numeric(bathHmdPsDf.state)
bathHmdPsDf.last_changed = ps.to_datetime(bathHmdPsDf.last_changed)
bathHmdPsDf.plot.line(x = "last_changed", y = ["state"], xlabel = 'Date', ylabel = '%', title = 'Feuchtigkeit Bad', marker='.', markersize=5, lw=2, grid=True)
plt.legend(["Feuchtigkeit"])

Then we run the script to count the outliers in the history. I tried a couple of approaches, including z-scores, but what worked best was a simple median calculated per day, with an upper humidity threshold of median + 7 percentage points. This means that if the median humidity of day 1 is 50, all values above 57 are marked as a shower event. I also rounded the log timestamp to the nearest hour and grouped by it, since the sensor logs a reading every couple of minutes and we want to consolidate these into one shower event. In the end, the aggregates of all days are collected and counted to get the total number of showers.

# cast timestamp
bathHmdDf = bathHmdDf.select('*', bathHmdDf.last_changed.cast('timestamp').alias('timestamp'))

# find first and last sensor day, truncated to midnight
firstDate = bathHmdDf.select('last_changed').orderBy(bathHmdDf.last_changed.asc()).first()[0]
lastDate = bathHmdDf.select('last_changed').orderBy(bathHmdDf.last_changed.desc()).first()[0]
firstDate = datetime.fromisoformat(firstDate).replace(hour=0, minute=0, second=0, microsecond=0)
lastDate = datetime.fromisoformat(lastDate).replace(hour=0, minute=0, second=0, microsecond=0)
diffDays = (lastDate - firstDate).days
bathHmdOutliners = None

# count outliers for each day (range(diffDays) stops before the last, partial day)
for day in range(diffDays):
  # filter sensordata by day
  dateStart = (firstDate + timedelta(days=day)).isoformat()
  dateEnd = (firstDate + timedelta(days=(day+1))).isoformat()
  bathHmdDfFilteredDay = bathHmdDf.where(bathHmdDf.last_changed >= dateStart).where(bathHmdDf.last_changed < dateEnd)
  if bathHmdDfFilteredDay.count() < 1:
    continue

  # calculate the daily median and the upper limit (median + 7)
  bathAvg = bathHmdDfFilteredDay.select(median('state')).collect()[0][0]
  upper_limit = bathAvg + 7

  # find outliers: readings above the limit, tagged with date and nearest hour
  bathHmdDfFilteredZscore = bathHmdDfFilteredDay.select("*").withColumn('date', to_date('last_changed')).withColumn("hour_rounded", hour((round(unix_timestamp("timestamp")/3600)*3600).cast("timestamp"))).where((bathHmdDfFilteredDay.state > upper_limit))

  bathHmdDfFilteredZscore = bathHmdDfFilteredZscore.groupBy('date', 'hour_rounded').agg(count('state')).sort('date', 'hour_rounded')
  if bathHmdOutliners is None:
    bathHmdOutliners = bathHmdDfFilteredZscore
  else:
    bathHmdOutliners = bathHmdOutliners.union(bathHmdDfFilteredZscore)

bathHmdOutliners.show()
print("Shower times: ", bathHmdOutliners.count())

groupHoursDf = bathHmdOutliners.groupBy('hour_rounded').agg(count('count(state)').alias('times'))
groupPsDf = groupHoursDf.toPandas()
groupPsDf.plot.pie(y='times', labels=groupPsDf["hour_rounded"], autopct='%1.1f%%')
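
If you don't have a Spark session at hand, the same idea (daily median plus a fixed offset, then bucketing by the nearest hour) can be sketched in plain Python. Treat this as an illustration only: the helper names `round_to_nearest_hour` and `find_shower_events` and the synthetic readings are made up for this example and are not part of the script above.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median

THRESHOLD_OFFSET = 7  # percentage points above the daily median, as in the post


def round_to_nearest_hour(ts):
    """Round a datetime to the nearest full hour (half hours round up)."""
    floored = ts.replace(minute=0, second=0, microsecond=0)
    return floored + timedelta(hours=1) if ts.minute >= 30 else floored


def find_shower_events(readings):
    """Return sorted (date, rounded hour) buckets that contain a reading
    above that day's median + THRESHOLD_OFFSET.

    readings: iterable of (datetime, humidity) pairs.
    """
    by_day = defaultdict(list)
    for ts, humidity in readings:
        by_day[ts.date()].append((ts, humidity))

    events = set()
    for day, day_readings in by_day.items():
        upper_limit = median(h for _, h in day_readings) + THRESHOLD_OFFSET
        for ts, humidity in day_readings:
            if humidity > upper_limit:
                # Same bucketing as the Spark script: the reading's date
                # plus the hour of the timestamp rounded to the nearest hour.
                events.add((day, round_to_nearest_hour(ts).hour))
    return sorted(events)


# Synthetic data (assumption): one reading every 10 minutes at 50 %,
# with an artificial spike at 20:10 and 20:20.
base = datetime(2024, 1, 1)
readings = [(base + timedelta(minutes=10 * i), 50.0) for i in range(144)]
readings[121] = (readings[121][0], 70.0)  # 20:10
readings[122] = (readings[122][0], 65.0)  # 20:20

print(find_shower_events(readings))
```

On this synthetic day both spike readings fall into the same rounded hour, so they collapse into a single event. One caveat that probably applies to the hourly grouping in general: a spike that straddles two rounded hours would be counted as two events.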

Are we dirty?

In my case, the script counts 8 humidity spikes. That comes close to the pattern I can see in the chart, so let's call it a success! I also created a pie chart to see how the showers are distributed over the hours of the day. Looks like we are more of the shower-in-the-evening type of people: about two thirds of the showers happen around 20:00 or 21:00.

Shower times pie
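
The shares shown in the pie chart boil down to counting events per hour and dividing by the total. A minimal sketch of that calculation in plain Python follows; the function name `hourly_share` and the list of hours are hypothetical and are not my actual sensor results.

```python
from collections import Counter


def hourly_share(event_hours):
    """Map each hour to its fraction of all events."""
    counts = Counter(event_hours)
    total = sum(counts.values())
    return {hour: counts[hour] / total for hour in sorted(counts)}


# Made-up example hours (assumption), one entry per detected shower event.
example_hours = [7, 20, 20, 21, 12]
shares = hourly_share(example_hours)

# Combined share of the 20:00 and 21:00 buckets.
evening_share = shares.get(20, 0) + shares.get(21, 0)
print(shares, evening_share)
```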

I’m curious how this data will change in the long run, especially across the seasons, for example in summer!