Data science in everyday life
Data science can be very useful in everyday life. Besides handling accounting and other money-related tasks, which I should cover in another post, data science tools turn up in the most unexpected places.
Today I have a large collection of audio files that I need to organize. For a little fun, I'm treating this as a data cleanup and dataset curation task.
Examining the dataset
So the first problem is that we have all these folders, 12 of them in total.
Sony - Feb 22 2023/
Sony - Jan 5 2023/
sony-2020-2-16/
sony-2021-03-06/
sony-2021-05-29/
sony-2021-12-11/
sony-2022-05-01/
sony-download-april-11-2019/
sony-download-jun16-2019/
sony-download-may17-2019/
sony-download-nov-21-2019/
sony-may-27-2023/
Who named these? Hint: I did. Each one at least contains the year in its name, which we can extract easily. For some background, they are called Sony because they were recorded on my Sony voice recorder.
Inside each folder is a batch of MP3 files, each labeled with the date and time it was recorded. There are 682 files in total.
sony-may-27-2023/
230109_1207.mp3
230109_1207_01.mp3
230109_1210.mp3
230112_2021.mp3
So we'll need to parse those filenames, and also handle the case where a _01 suffix is appended when a recording starts within the same minute as the previous one. In the whole dataset, 667 files lack this suffix and 15 have it.
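Before writing the full script, here's a minimal sketch of that filename pattern as a regular expression, with a hypothetical split_filename helper (not part of the final script) showing how the optional suffix can be handled:

import re

# Filenames look like YYMMDD_HHMM.mp3, with an optional _NN counter
# appended when two recordings start in the same minute.
FILENAME_RE = re.compile(r"^(\d{6})_(\d{4})(?:_(\d{2}))?\.mp3$")

def split_filename(name):
    """Split a recorder filename into (date, time, suffix) strings;
    the suffix is None in the common case."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected filename: {name}")
    return match.groups()

print(split_filename("230109_1207.mp3"))     # ('230109', '1207', None)
print(split_filename("230109_1207_01.mp3"))  # ('230109', '1207', '01')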
Preprocessing the metadata
We'll handle the creation of metadata programmatically. Since I have the recorder set to the correct date and time, each file should contain an accurate timestamp. Furthermore, we can examine the files to determine their lengths. We are interested in understanding the general shape of the data.
In general, I do not care about the folder groups; we are really interested in the files themselves. However, we'll keep track of both. As with much in data science, the code we write will ultimately attempt to collect instances of a particular data structure (here, a Python dict with a handful of defined fields).
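Concretely, each per-file record is just a dict like this (the values here are taken from one of the rows we'll see later):

record = {
    "name": "sony-may-27-2023",    # folder the file came from
    "year": "2023",                # extracted from the folder name
    "filename": "230522_1014.mp3", # as written by the recorder
    "length": 4285.5445,           # duration in seconds
}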
I wrote a short script to generate two main output files. The first is a summary of the different groups of recordings. These groups don't have any intrinsic meaning; they simply correspond to the times when the recorder fills up and I offload the files to a secondary storage device (meaning, the hard drive on my laptop). The second output file is much more interesting.
First, here's the script.
import re
from glob import glob

import pandas
from mutagen.mp3 import MP3
from rich.progress import track

# helper function to extract a year
def get_year(s):
    match = re.search(r'\d{4}', s)
    if match:
        return match.group(0)

# collect data from files on disk
file_data = []
data = []
path = "/Volumes/Storage/Projects/Recordings"
for dir in glob(f"{path}/*"):
    # metadata for the dir
    dir_name = dir.split("/")[-1]
    year = get_year(dir_name)
    # analyze the files in the dir
    files = glob(f"{dir}/*mp3")
    lengths = []
    for file in track(files, description=f"{dir_name:32s}"):
        audio = MP3(file)
        length = audio.info.length
        lengths.append(length)
        filename = file.split("/")[-1]
        # record the metadata for the recording
        file_pkg = {
            "name": dir_name,
            "year": year,
            "filename": filename,
            "length": length,
        }
        file_data.append(file_pkg)
    # record the group metadata
    pkg = {
        "n_files": len(files),
        "name": dir_name,
        "year": year,
        "min_length": min(lengths),
        "max_length": max(lengths),
    }
    data.append(pkg)

# collect group metadata
df = pandas.DataFrame(data)
df.to_csv("groups.csv")

# collect metadata on individual recordings
files = pandas.DataFrame(file_data)
files.to_csv("files.csv")
The CSV called "files.csv" contains some really interesting information. First, it gives us the total count of recordings: 682. Second, it contains the length of each recording. Loading it back into Pandas and parsing the timestamps out of the filenames looks like this.
from datetime import datetime

import pandas

def parse_datetime(name):
    """Parse out the year, month, day, hour, and minute from
    a filename like 230112_2021.mp3."""
    year = int("20" + name[:2])
    month = int(name[2:4])
    day = int(name[4:6])
    hour = int(name[7:9])
    minute = int(name[9:11])
    return datetime(year, month, day, hour=hour, minute=minute)

df = pandas.read_csv("./files.csv", index_col=0)
df["time"] = df["filename"].map(parse_datetime)
When displayed, the first few rows look like this. The lengths are in seconds and are shown with far too many decimal places to be useful, but you get the idea.
name               year  filename          length       time
sony-may-27-2023   2023  230520_1058.mp3     85.159208  2023-05-20 10:58:00
sony-may-27-2023   2023  230520_1101.mp3   1408.626958  2023-05-20 11:01:00
sony-may-27-2023   2023  230522_1014.mp3   4285.544500  2023-05-22 10:14:00
sony-may-27-2023   2023  230522_1627.mp3   1256.646542  2023-05-22 16:27:00
sony-may-27-2023   2023  230527_1330.mp3   3469.897167  2023-05-27 13:30:00
A little bit of Pandas magic got us those proper dates and times from the filenames.
I tried looking at histograms and other 1-D distribution plots, faceted various ways, but my favorite way of summarizing the data turned out to be the simplest: just binning by length.
bins = [-0.1, 0.99, 5, 20, 40, 60, 120, 10_000]
labels = ["< 1 minute", "1-5 minutes", "5-20 minutes", "20-40 minutes", "40 minutes-1 hour", "1-2 hours", "longer than 2 hours"]
df["duration:minutes"] = (df["length"] / 60).astype(int)
df["duration:range"] = pandas.cut(df["duration:minutes"], bins, labels=labels)
print(df["duration:range"].value_counts(dropna=False))
This provides the following breakdown of the dataset, sorted by count:
5-20 minutes           207
1-5 minutes            199
20-40 minutes           97
< 1 minute              71
40 minutes-1 hour       45
1-2 hours               34
longer than 2 hours     29
And sorted by duration:
< 1 minute              71
1-5 minutes            199
5-20 minutes           207
20-40 minutes           97
40 minutes-1 hour       45
1-2 hours               34
longer than 2 hours     29
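For the record, the count-sorted listing is what value_counts prints by default. One way to get the duration-sorted view is to sort on the index instead, since pandas.cut produces an ordered categorical:

print(df["duration:range"].value_counts(dropna=False).sort_index())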
The other thing I am interested in is the time of day that the recordings are initiated. But that will have to be another post!