Upload config

Files changed (3)

README.md +199 -0
config.json +14 -0
prosody_preprocessor.py +201 -0

README.md ADDED

	@@ -0,0 +1,199 @@

+---
+library_name: transformers
+tags: []
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]

config.json ADDED

	@@ -0,0 +1,14 @@

+{
+  "auto_map": {
+    "AutoConfig": "prosody_preprocessor.ProsodyConfig"
+  },
+  "f0_max": 500.0,
+  "f0_min": 65.0,
+  "frame_length": 20.0,
+  "frame_space": 5.0,
+  "intensity_max": 100.0,
+  "intensity_min": 0.0,
+  "model_type": "prosody_preprocessor",
+  "sampling_rate": 16000,
+  "transformers_version": "4.52.4"
+}
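
As a sanity check on the timing values above, the frame parameters can be converted from milliseconds to samples. This is a minimal sketch using only the standard library; the variable names (`frame_samples`, `hop_samples`, `pad_samples`, `frame_rate_hz`) are illustrative and do not appear in the commit. The padding line assumes the same half-frame arithmetic that `_get_f0` uses.

```python
import json

# Subset of the config.json fields shown above (timing-related only).
config = json.loads("""
{
  "frame_length": 20.0,
  "frame_space": 5.0,
  "sampling_rate": 16000
}
""")

sr = config["sampling_rate"]
# Milliseconds -> samples
frame_samples = round(config["frame_length"] / 1000 * sr)
hop_samples = round(config["frame_space"] / 1000 * sr)
# Half a frame of zero padding on each side, as in _get_f0
pad_samples = frame_samples // 2
# Frames per second implied by the hop size
frame_rate_hz = 1000 / config["frame_space"]

print(frame_samples, hop_samples, pad_samples, frame_rate_hz)
# prints: 320 80 160 200.0
```

A 5 ms hop corresponds to 200 frames per second, which matches the `time_step=1/200.0` used for the intensity track in `extract_features`.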

prosody_preprocessor.py ADDED

	@@ -0,0 +1,201 @@

+import amfm_decompy.basic_tools as basic
+import amfm_decompy.pYAAPT as pYAAPT
+from dataclasses import dataclass
+from typing import Dict, List, Optional
+import numpy as np
+import torch
+import dataclasses
+import parselmouth
+from transformers import PreTrainedModel, PretrainedConfig
+from datasets import Dataset
+@dataclass
+class SpeakerStats:
+    f0_mean: float
+    f0_std: float
+    intensity_mean: float
+    intensity_std: float
+    @classmethod
+    def from_features(cls, f0_values: List[np.ndarray], intensity_values: List[np.ndarray]):
+        """Calculate stats from a list of features"""
+        # Convert lists to numpy arrays
+        f0_arrays = [np.array(f0) for f0 in f0_values]
+        intensity_arrays = [np.array(i) for i in intensity_values]
+        # Pool frames across utterances; F0 == 0 marks unvoiced frames, which are excluded
+        f0_concat = np.concatenate([f0[f0 != 0] for f0 in f0_arrays])
+        intensity_concat = np.concatenate(intensity_arrays)
+        print(f"F0 shape: {f0_concat.shape}")
+        print(f"Intensity shape: {intensity_concat.shape}")
+        return cls(
+            f0_mean=float(np.mean(f0_concat)),
+            f0_std=float(np.std(f0_concat)),
+            intensity_mean=float(np.mean(intensity_concat)),
+            intensity_std=float(np.std(intensity_concat))
+        )
+class ProsodyConfig(PretrainedConfig):
+    """Configuration class for prosody preprocessing"""
+    model_type = "prosody_preprocessor"
+    def __init__(
+        self,
+        sampling_rate: int = 16000,
+        frame_length: float = 20.0,  # in ms
+        frame_space: float = 5.0,   # in ms
+        f0_min: float = 65.0,
+        f0_max: float = 500.0,
+        intensity_min: float = 0.0,
+        intensity_max: float = 100.0,
+        **kwargs
+    ):
+        super().__init__(**kwargs)
+        self.sampling_rate = sampling_rate
+        self.frame_length = frame_length
+        self.frame_space = frame_space
+        self.f0_min = f0_min
+        self.f0_max = f0_max
+        self.intensity_min = intensity_min
+        self.intensity_max = intensity_max
+class ProsodyPreprocessor(PreTrainedModel):
+    config_class = ProsodyConfig
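+    def __init__(self, config: Optional[ProsodyConfig] = None):
+        # PreTrainedModel.__init__ must run: it stores the config as self.config
+        # and initialises the underlying torch.nn.Module.
+        super().__init__(config or ProsodyConfig())
+        self.speaker_stats: Dict[str, SpeakerStats] = {}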
+    def extract_features(self, audio):
+        """Extract F0 and intensity features"""
+        # Accept lists, numpy arrays, or tensors; work in float32
+        audio = torch.as_tensor(audio, dtype=torch.float32)
+        if audio.dim() == 1:
+            audio = audio.unsqueeze(0)
+        f0, f0_interp = self._get_f0(audio)
+        f0 = f0[0, 0, :]
+        f0_interpolated = f0_interp[0, 0, :]
+        # Remove first 5 frames as in original
+        f0 = f0[5:]
+        f0_interpolated = f0_interpolated[5:]
+        sound = parselmouth.Sound(audio.numpy(), sampling_frequency=self.config.sampling_rate, start_time=0)
+        print(f"Sound duration: {sound.duration} seconds")
+        # Extract intensity at 200Hz
+        intensity = sound.to_intensity(time_step=1/200.0)
+        print(f"Intensity duration: {intensity.duration} seconds")
+        intensity_values = intensity.values.T.flatten()
+        # Truncate all three tracks to a common length
+        min_len = min(len(f0), len(f0_interpolated), len(intensity_values))
+        f0 = f0[:min_len]
+        f0_interpolated = f0_interpolated[:min_len]
+        intensity_values = intensity_values[:min_len]
+        return {
+            "f0": f0,
+            "f0_interp": f0_interpolated,
+            "intensity": intensity_values,
+        }
+    def collect_stats(self, dataset: Dataset, num_proc: int = 4, batch_size: int = 32) -> tuple[Dataset, Dict[str, SpeakerStats]]:
+        """First pass: extract features with dataset.map and compute per-speaker statistics.
+
+        Returns a (features_dataset, speaker_stats) tuple.
+        """
+        # Step 1: Extract features using map
+        def extract_features_batch(examples):
+            features_list = []
+            for audio in examples['audio']:
+                features = self.extract_features(audio)
+                features_list.append(features)
+            return {
+                'f0': [f['f0'] for f in features_list],
+                'intensity': [f['intensity'] for f in features_list],
+                'speaker_id': examples['speaker_id']
+            }
+        # Extract features for all samples
+        features_dataset = dataset.map(
+            extract_features_batch,
+            batched=True,
+            batch_size=batch_size,
+            num_proc=num_proc,
+            # load_from_cache_file=False
+            remove_columns=dataset.column_names
+        )
+        print(f"features_dataset", features_dataset)
+        # Step 2: Group features by speaker
+        speaker_features = {}
+        for item in features_dataset:
+            speaker_id = item['speaker_id']
+            if speaker_id not in speaker_features:
+                speaker_features[speaker_id] = {'f0': [], 'intensity': []}
+            speaker_features[speaker_id]['f0'].append(item['f0'])
+            speaker_features[speaker_id]['intensity'].append(item['intensity'])
+        # Step 3: Calculate stats per speaker
+        self.speaker_stats = {
+            spk: SpeakerStats.from_features(
+                feats['f0'],
+                feats['intensity']
+            )
+            for spk, feats in speaker_features.items()
+        }
+        return features_dataset, self.speaker_stats
+    def save_stats(self, path: str):
+        """Save speaker stats to file"""
+        stats_dict = {
+            spk: dataclasses.asdict(stats)
+            for spk, stats in self.speaker_stats.items()
+        }
+        torch.save(stats_dict, path)
+    @classmethod
+    def load_stats(cls, path: str) -> Dict[str, SpeakerStats]:
+        """Load speaker stats from file"""
+        stats_dict = torch.load(path)
+        return {
+            spk: SpeakerStats(**stats)
+            for spk, stats in stats_dict.items()
+        }
+    def _get_f0(self, audio: torch.Tensor):
+        """Extract F0 using YAAPT."""
+        to_pad = int(self.config.frame_length / 1000 * self.config.sampling_rate) // 2
+        f0s = []
+        f0s_interp = []
+        for y in audio.numpy().astype(np.float64):
+            y_pad = np.pad(y.squeeze(), (to_pad, to_pad), "constant", constant_values=0)
+            signal = basic.SignalObj(y_pad, self.config.sampling_rate)
+            pitch = pYAAPT.yaapt(
+                signal,
+                frame_length=self.config.frame_length,
+                frame_space=self.config.frame_space,
+                nccf_thresh1=0.25,
+                tda_frame_length=25.0
+            )
+            f0s_interp.append(pitch.samp_interp[None, None, :])
+            f0s.append(pitch.samp_values[None, None, :])
+        f0 = np.vstack(f0s)
+        f0_interp = np.vstack(f0s_interp)
+        # Zero out F0 values above the configured ceiling (treated as unvoiced)
+        f0[f0 > self.config.f0_max] = 0
+        f0_interp[f0_interp > self.config.f0_max] = 0
+        return f0, f0_interp
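
The pooling done in `SpeakerStats.from_features` can be illustrated on toy data without the audio dependencies. The sketch below is not the commit's code: it uses the standard-library `statistics` module in place of NumPy, and the helper name `voiced_f0_stats` and the sample values are made up for illustration. `statistics.pstdev` is used because `np.std` computes the population standard deviation by default.

```python
from statistics import fmean, pstdev
from typing import List, Tuple


def voiced_f0_stats(f0_tracks: List[List[float]]) -> Tuple[float, float]:
    """Mean and population std of F0 over voiced frames, pooled across utterances.

    Hypothetical helper: zero entries are treated as unvoiced and skipped,
    mirroring the `f0[f0 != 0]` filter in SpeakerStats.from_features.
    """
    voiced = [v for track in f0_tracks for v in track if v != 0]
    if not voiced:
        raise ValueError("no voiced frames found")
    return fmean(voiced), pstdev(voiced)


# Two toy utterances from one speaker; 0.0 marks unvoiced frames.
tracks = [[0.0, 100.0, 120.0], [110.0, 0.0, 0.0, 130.0]]
f0_mean, f0_std = voiced_f0_stats(tracks)
print(f0_mean, f0_std)
```

Here the voiced values are 100, 120, 110 and 130 Hz, so the mean is 115 Hz; the three unvoiced frames do not pull it toward zero. Unlike the NumPy version, which would return NaN with a warning, this sketch raises an error for a speaker with no voiced frames.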