Analyzes a directory structure containing audio files (typically organized by class in subdirectories) and provides comprehensive statistics including class distribution, file types, duration statistics, file sizes, and outlier detection.

Usage

training_dataset_summary(
  root_dir,
  audio_extensions = c("wav", "mp3", "flac", "m4a", "ogg", "aiff", "aif"),
  outlier_threshold = 3,
  use_tuneR = TRUE,
  ignore_classes = NULL,
  sample_size = NULL,
  parallel = TRUE,
  n_cores = NULL,
  show_progress = TRUE
)

Arguments

root_dir

Character. Path to the root directory containing audio files. Typically organized with subdirectories representing different classes.

audio_extensions

Character vector. File extensions to consider as audio files. The default covers common formats: wav, mp3, flac, m4a, ogg, aiff, and aif.

outlier_threshold

Numeric. Number of standard deviations from the mean to flag as outliers. Default is 3.

use_tuneR

Logical. If TRUE and the tuneR package is available, extract detailed audio metadata (duration, sample rate, etc.). Default is TRUE.

ignore_classes

Character vector. Names of class subdirectories to ignore (e.g. "noise"). Default is NULL.

sample_size

Integer. If not NULL, randomly sample this many files for duration analysis instead of processing all files. Useful for large datasets. Default is NULL (process all files).

parallel

Logical. If TRUE, use parallel processing for audio metadata extraction (requires the parallel package). Default is TRUE.

n_cores

Integer. Number of cores to use for parallel processing. If NULL, uses half of available cores. Default is NULL.

show_progress

Logical. If TRUE, show a progress bar during audio metadata extraction. Default is TRUE.
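
The defaults for n_cores and sample_size described above can be read as the following minimal sketch. The helper names resolve_n_cores() and sample_files() are hypothetical and are not part of the package; the fallback to one core when the core count cannot be detected is an assumption, not documented behavior.

```r
# Hypothetical helper (not the package's code): resolve the worker count
# as the n_cores description reads, i.e. NULL means half of the cores.
resolve_n_cores <- function(n_cores = NULL) {
  if (!is.null(n_cores)) {
    return(max(1L, as.integer(n_cores)))
  }
  available <- parallel::detectCores()
  if (is.na(available)) {
    return(1L)  # assumption: fall back to one core if detection fails
  }
  max(1L, as.integer(available %/% 2L))
}

# Hypothetical helper: choose the files used for duration analysis.
# NULL (or a sample size covering every file) keeps all files.
sample_files <- function(files, sample_size = NULL) {
  if (is.null(sample_size) || sample_size >= length(files)) {
    return(files)
  }
  files[sample.int(length(files), sample_size)]
}
```

Because the sample is random, calling set.seed() beforehand would be needed to make a sampled run repeatable in a sketch like this one.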

Value

A list with the following components:

summary

Data.frame with overall statistics

class_distribution

Data.frame with per-class file counts

file_type_distribution

Data.frame with counts by file extension

duration_stats

Data.frame with duration statistics (if available)

size_stats

Data.frame with file size statistics

outliers

Data.frame with flagged unusual files

imbalance_stats

Data.frame with class imbalance metrics

recommendations

Character vector of transfer learning advice

Details

The function recursively scans the directory structure and:

  • Counts files per subdirectory (classes)

  • Identifies file types by extension

  • Calculates file size distributions

  • Extracts audio duration (if tuneR is available)

  • Flags outliers based on size and duration
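
The scanning steps listed above could be approximated with base R as in the sketch below. This is an illustration, not the function's actual implementation: scan_audio_dir() and wav_duration() are hypothetical names, the class label is assumed to be the name of each file's parent directory, and the duration helper handles WAV files only, via tuneR::readWave() with header = TRUE.

```r
# Hypothetical helper: recursive scan with per-class and per-extension counts.
scan_audio_dir <- function(root_dir,
                           audio_extensions = c("wav", "mp3", "flac"),
                           ignore_classes = NULL) {
  files <- list.files(root_dir, recursive = TRUE, full.names = TRUE)
  ext <- tolower(tools::file_ext(files))
  keep <- ext %in% tolower(audio_extensions)

  # Assumption: the class is the name of the file's parent directory.
  df <- data.frame(
    path = files[keep],
    class = basename(dirname(files[keep])),
    extension = ext[keep],
    size_bytes = file.size(files[keep]),
    stringsAsFactors = FALSE
  )
  df <- df[!df$class %in% ignore_classes, , drop = FALSE]

  list(
    files = df,
    class_distribution = as.data.frame(
      table(class = df$class),
      responseName = "n_files", stringsAsFactors = FALSE
    ),
    file_type_distribution = as.data.frame(
      table(extension = df$extension),
      responseName = "n_files", stringsAsFactors = FALSE
    )
  )
}

# Hypothetical helper: duration in seconds of a WAV file, read from the
# header only. Returns NA when tuneR is not installed.
wav_duration <- function(path) {
  if (!requireNamespace("tuneR", quietly = TRUE)) {
    return(NA_real_)
  }
  hdr <- tuneR::readWave(path, header = TRUE)
  hdr$samples / hdr$sample.rate
}
```

Reading only the header avoids loading the samples into memory, which matters when a dataset holds many long recordings.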

Outliers are identified as files that deviate from the mean by more than outlier_threshold standard deviations in either size or duration.
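
As a rough illustration of this rule, the sketch below flags values whose z-score exceeds the threshold in absolute value. The helper flag_outliers() is hypothetical, and the handling of missing values and zero variance is an assumption rather than documented behavior.

```r
# Hypothetical helper: TRUE where a value lies more than outlier_threshold
# standard deviations from the mean.
flag_outliers <- function(x, outlier_threshold = 3) {
  s <- sd(x, na.rm = TRUE)
  if (is.na(s) || s == 0) {
    return(rep(FALSE, length(x)))  # no spread, so nothing can be flagged
  }
  z <- (x - mean(x, na.rm = TRUE)) / s
  !is.na(z) & abs(z) > outlier_threshold
}

# Twenty files of 100 bytes and one of 10000 bytes.
sizes <- c(rep(100, 20), 10000)
which(flag_outliers(sizes))  # 21

# A file is an outlier if either measure is flagged (durations are
# made-up values for illustration).
durations <- c(rep(2.5, 20), 2.4)
is_outlier <- flag_outliers(sizes) | flag_outliers(durations)
```

With n values, the largest possible z-score is (n - 1) / sqrt(n), so with a threshold of 3 nothing can be flagged in a group of 10 or fewer files.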

Examples

if (FALSE) { # \dontrun{
# Analyze a training dataset
summary <- training_dataset_summary("/path/to/audio/dataset")

# Ignore noise class
summary <- training_dataset_summary(
  "/path/to/dataset",
  ignore_classes = c("noise", "background")
)

# Print summary statistics
print(summary$summary)
print(summary$class_distribution)

# Check for outliers
if (nrow(summary$outliers) > 0) {
  print(summary$outliers)
}

# For large datasets, use sampling for faster analysis
summary <- training_dataset_summary(
  "/path/to/large/dataset",
  sample_size = 1000,  # Only analyze 1000 random files
  parallel = TRUE,     # Use parallel processing
  n_cores = 4          # Use 4 CPU cores
)

# Skip duration extraction for even faster analysis
summary <- training_dataset_summary(
  "/path/to/dataset",
  use_tuneR = FALSE  # Only file sizes, no duration data
)
} # }