Alex Carlin

Data science in everyday life

Data science can be very useful in everyday life. Besides handling accounting and other money-related tasks, which I should cover in another post, data science tools become useful in the most unexpected places.

Today I have a large collection of audio files that I need to organize. For a little fun, I'm treating this as a data cleanup and dataset curation task.

Examining the dataset

So the first problem is that we have all these folders, 12 of them in total.

Sony - Feb 22 2023/
Sony - Jan 5 2023/
sony-2020-2-16/
sony-2021-03-06/
sony-2021-05-29/
sony-2021-12-11/
sony-2022-05-01/
sony-download-april-11-2019/
sony-download-jun16-2019/
sony-download-may17-2019/
sony-download-nov-21-2019/
sony-may-27-2023/

Who named these? Hint: I did. Each name at least contains the year, which we can extract easily. For some background, they are called Sony because they were recorded on my Sony voice recorder.
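
For example, a four-digit-number regex pulls the year out of either naming style. Here is a quick sketch (the full script below wraps the same idea in a small helper):

import re

# folder names in both styles contain a four-digit year somewhere
for name in ["Sony - Feb 22 2023", "sony-2020-2-16", "sony-download-nov-21-2019"]:
    match = re.search(r"\d{4}", name)
    # prints 2023, 2020, 2019
    print(match.group(0) if match else None)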

Inside each folder is some number of MP3 files, labeled with the date and time recorded. There are 682 files total.

sony-may-27-2023/
    230109_1207.mp3
    230109_1207_01.mp3
    230109_1210.mp3
    230112_2021.mp3

So we'll need to parse those, and also handle the case where _01 is appended when a recording starts within the same minute as the previous one. In the whole dataset, 667 files do not have this suffix and 15 do.
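
One quick way to sanity-check that split, assuming the filenames have been collected into a list (a sketch, not the exact counting code):

import re

# a few example filenames from the listing above
filenames = ["230109_1207.mp3", "230109_1207_01.mp3", "230109_1210.mp3", "230112_2021.mp3"]

# base names look like YYMMDD_HHMM.mp3; same-minute duplicates get a _NN suffix
suffixed = [f for f in filenames if re.fullmatch(r"\d{6}_\d{4}_\d{2}\.mp3", f)]
plain = [f for f in filenames if re.fullmatch(r"\d{6}_\d{4}\.mp3", f)]
print(len(plain), "plain,", len(suffixed), "suffixed")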

Preprocessing the metadata

We'll handle the creation of metadata programmatically. Since I have the recorder set to the correct date and time, each file should contain an accurate timestamp. Furthermore, we can examine the files to determine their lengths. We are interested in understanding the general shape of the data.

In general, I do not care about the folder groups, so we are really just interested in the files themselves. However, we'll keep track of both. As with much in data science, the code we write will ultimately attempt to collect instances of a particular data structure (here we use a Python dict with a handful of defined fields).

I wrote a short script to generate two main output files. The first output file is a summary of the different groups of recordings. These groups don't have any intrinsic meaning since they are simply times when the recorder is full and I offload the files to a secondary storage device (meaning, the hard drive on my laptop). The second output file is much more interesting.

First, here's the script.

import re
from glob import glob 

import pandas 
from mutagen.mp3 import MP3
from rich.progress import track 


# helper function to extract a year 
def get_year(s):
    match = re.search(r'\d{4}', s)
    if match:
        return match.group(0)

# collect data from files on disk 
file_data = []
data = []
path = "/Volumes/Storage/Projects/Recordings"

for dir in glob(f"{path}/*"):

    # metadata for the dir 
    dir_name = dir.split("/")[-1]
    year = get_year(dir_name)

    # analyze the files in the dir 
    files = glob(f"{dir}/*mp3")
    lengths = []
    for file in track(files, description=f"{dir_name:32s}"):
        audio = MP3(file)
        length = audio.info.length
        lengths.append(length)  
        filename = file.split("/")[-1]

        # record the metadata for the recording 
        file_pkg = {
            "name": dir_name, 
            "year": year, 
            "filename": filename, 
            "length": length, 
        }   
        file_data.append(file_pkg)

    # record the group metadata
    pkg = {
        "n_files": len(files), 
        "name": dir_name, 
        "year": year, 
        "min_length": min(lengths), 
        "max_length": max(lengths),
    }
    data.append(pkg) 


# collect group metadata 
df = pandas.DataFrame(data)
df.to_csv("groups.csv")

# collect metadata on individual recordings 
files = pandas.DataFrame(file_data)
files.to_csv("files.csv")

The CSV called "files.csv" contains some really interesting information. First, it gives us the total count of recordings: 682, one row per file.

Second, it contains the length of each recording. Loading the CSV in Pandas, and parsing the date and time out of each filename, looks like this.

from datetime import datetime

import pandas


def parse_datetime(name):
    """Parse out the year, month, day, hour, minute from 
    a file like 230112_2021.mp3"""
    year = int("20" + name[:2])
    month = int(name[2:4])
    day = int(name[4:6])
    hour = int(name[7:9])
    minute = int(name[9:11])

    return datetime(year, month, day, hour=hour, minute=minute)


df = pandas.read_csv("./files.csv", index_col=0)

df["time"] = df["filename"].map(parse_datetime)

When displayed, the first few rows look like this. The lengths are in seconds, shown with far too many decimal places to be useful, but you get the idea.

            name  year         filename       length                time
sony-may-27-2023  2023     230520_1058.mp3    85.159208 2023-05-20 10:58:00
sony-may-27-2023  2023     230520_1101.mp3  1408.626958 2023-05-20 11:01:00
sony-may-27-2023  2023     230522_1014.mp3  4285.544500 2023-05-22 10:14:00
sony-may-27-2023  2023     230522_1627.mp3  1256.646542 2023-05-22 16:27:00
sony-may-27-2023  2023     230527_1330.mp3  3469.897167 2023-05-27 13:30:00

A little bit of Pandas magic allows us to get these proper dates and times from the file name.
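
As an aside, if the long decimals bother you, one way to tidy the display (without changing the underlying data) is to set a float format in Pandas:

# show floats with two decimal places when printing
pandas.set_option("display.float_format", "{:.2f}".format)
print(df.head())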

I tried looking at histograms and other 1-D distribution plots, faceted various ways, but my favorite way of summarizing the data turned out to be the simplest: just binning by length.

# bin edges are in whole minutes
bins = [-0.1, 0.99, 5, 20, 40, 60, 120, 10_000]
labels = ["< 1 minute", "1-5 minutes", "5-20 minutes", "20-40 minutes", "40 minutes-1 hour", "1-2 hours", "longer than 2 hours"]

df["duration:minutes"] = (df["length"] / 60).astype(int)
df["duration:range"] = pandas.cut(df["duration:minutes"], bins, labels=labels)

print(df["duration:range"].value_counts(dropna=False))

This provides the following breakdown of the dataset. Sorted by count:

5-20 minutes           207
1-5 minutes            199
20-40 minutes           97
< 1 minute              71
40 minutes-1 hour       45
1-2 hours               34
longer than 2 hours     29
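
Since pandas.cut returns an ordered Categorical, the same counts can also be listed in bin order rather than by frequency. One way to do that, using the same df as above:

# sort the counts by the category (bin) order instead of by count
print(df["duration:range"].value_counts(dropna=False).sort_index())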

And sorted by duration:

< 1 minute              71
1-5 minutes            199
5-20 minutes           207
20-40 minutes           97
40 minutes-1 hour       45
1-2 hours               34
longer than 2 hours     29

The other thing I am interested in is the time of day that the recordings are initiated. But that will have to be another post!