diff --git a/.dockerignore b/.dockerignore new file mode 100644 index 0000000000000000000000000000000000000000..14f5114d01be349d8328bcbbfab08ca0c0a9ba98 --- /dev/null +++ b/.dockerignore @@ -0,0 +1,19 @@ +__pycache__ +*.pyc +*.pyo +*.pyd +.Python +env +pip-log.txt +pip-delete-this-directory.txt +.tox +.coverage +.coverage.* +.cache +nosetests.xml +coverage.xml +*,cover +*.log +.git +**/*.nemo +**/*.ckpt diff --git a/.gitattributes b/.gitattributes index c7d9f3332a950355d5a77d85000f05e6f45435ea..5d51b4e71dbb8419c4700c911ee5ef682e339314 100644 --- a/.gitattributes +++ b/.gitattributes @@ -32,3 +32,14 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text *.zip filter=lfs diff=lfs merge=lfs -text *.zst filter=lfs diff=lfs merge=lfs -text *tfevents* filter=lfs diff=lfs merge=lfs -text +docs/source/nlp/dialogue_UML.png filter=lfs diff=lfs merge=lfs -text +docs/source/nlp/nemo_megatron/images/ddp.gif filter=lfs diff=lfs merge=lfs -text +docs/source/nlp/nemo_megatron/images/pnom.gif filter=lfs diff=lfs merge=lfs -text +docs/source/nlp/nemo_megatron/images/pp.gif filter=lfs diff=lfs merge=lfs -text +docs/source/nlp/nemo_megatron/images/tp.gif filter=lfs diff=lfs merge=lfs -text +docs/source/tts/images/fastpitch_model.png filter=lfs diff=lfs merge=lfs -text +docs/source/tts/images/radaligner_model.png filter=lfs diff=lfs merge=lfs -text +docs/source/tts/images/tacotron2_model.png filter=lfs diff=lfs merge=lfs -text +docs/source/tts/images/waveglow_model.png filter=lfs diff=lfs merge=lfs -text +examples/nlp/language_modeling/nemo_2b_bf16_tp1.nemo filter=lfs diff=lfs merge=lfs -text +tools/speech_data_explorer/screenshot.png filter=lfs diff=lfs merge=lfs -text diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md new file mode 100644 index 0000000000000000000000000000000000000000..5aedacf07f1b8b3877a51ef392a87c04dffb97a2 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/bug_report.md @@ -0,0 +1,42 @@ +--- +name: Bug report +about: Create a report to help us improve +title: '' +labels: bug +assignees: '' + +--- + +**Describe the bug** + +A clear and concise description of what the bug is. + +**Steps/Code to reproduce bug** + +Please list *minimal* steps or code snippet for us to be able to reproduce the bug. + +A helpful guide on on how to craft a minimal bug report http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports. + + +**Expected behavior** + +A clear and concise description of what you expected to happen. + +**Environment overview (please complete the following information)** + + - Environment location: [Bare-metal, Docker, Cloud(specify cloud provider - AWS, Azure, GCP, Collab)] + - Method of NeMo install: [pip install or from source]. Please specify exact commands you used to install. + - If method of install is [Docker], provide `docker pull` & `docker run` commands used + +**Environment details** + +If NVIDIA docker image is used you don't need to specify these. +Otherwise, please provide: +- OS version +- PyTorch version +- Python version + +**Additional context** + +Add any other context about the problem here. 
+Example: GPU model diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md new file mode 100644 index 0000000000000000000000000000000000000000..e56d0d05e0c272143d5915f8af3660e9b32b32da --- /dev/null +++ b/.github/ISSUE_TEMPLATE/feature_request.md @@ -0,0 +1,25 @@ +--- +name: Feature request +about: Suggest an idea for this project +title: '' +labels: feature request +assignees: okuchaiev + +--- + +**Is your feature request related to a problem? Please describe.** + +A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] + +**Describe the solution you'd like** + +A clear and concise description of what you want to happen. +Provide a code snippet on how new APIs/changes would be used by others. + +**Describe alternatives you've considered** + +A clear and concise description of any alternative solutions or features you've considered. + +**Additional context** + +Add any other context or screenshots about the feature request here. diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md new file mode 100644 index 0000000000000000000000000000000000000000..6858131a81f84e964a587ada2be4f3818e03f30b --- /dev/null +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -0,0 +1,39 @@ +# What does this PR do ? + +Add a one line overview of what this PR aims to accomplish. + +**Collection**: [Note which collection this PR will affect] + +# Changelog +- Add specific line by line info of high level changes in this PR. + +# Usage +* You can potentially add a usage example below + +```python +# Add a code snippet demonstrating how to use this +``` + +# Before your PR is "Ready for review" +**Pre checks**: +- [ ] Make sure you read and followed [Contributor guidelines](https://github.com/NVIDIA/NeMo/blob/main/CONTRIBUTING.md) +- [ ] Did you write any new necessary tests? +- [ ] Did you add or update any necessary documentation? +- [ ] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc) + - [ ] Reviewer: Does the PR have correct import guards for all optional libraries? + +**PR Type**: +- [ ] New Feature +- [ ] Bugfix +- [ ] Documentation + +If you haven't finished some of the above items you can still open "Draft" PR. + + +## Who can review? + +Anyone in the community is free to review the PR once the checks have passed. +[Contributor guidelines](https://github.com/NVIDIA/NeMo/blob/main/CONTRIBUTING.md) contains specific people who can review PRs to various areas. 
+ +# Additional Information +* Related to # (issue) diff --git a/.github/labeler.yml b/.github/labeler.yml new file mode 100644 index 0000000000000000000000000000000000000000..e0e6691b14c623b2cbde9df1b7c4520e89d7f77d --- /dev/null +++ b/.github/labeler.yml @@ -0,0 +1,33 @@ +ASR: +- nemo/collections/asr/**/* +- examples/asr/**/* +- tutorials/asr/**/* +- docs/source/asr/**/* + +NLP: +- nemo/collections/nlp/**/* +- examples/nlp/**/* +- tutorials/nlp/**/* +- docs/source/nlp/**/* + +Speaker Tasks: +- examples/speaker_tasks/**/* +- tutorials/speaker_tasks/**/* + +TTS: +- nemo/collections/tts/**/* +- examples/tts/**/* +- tutorials/tts/**/* +- docs/source/tts/**/* + +core: +- nemo/core/**/* + +common: +- nemo/collections/common/**/* + +CI: +- .github/**/* +- Jenkinsfile +- Dockerfile +- ci.groovy diff --git a/.github/workflows/blossom-ci.yml b/.github/workflows/blossom-ci.yml new file mode 100644 index 0000000000000000000000000000000000000000..bdfb24c4b1e5d31c7858f5f9c103166750697ede --- /dev/null +++ b/.github/workflows/blossom-ci.yml @@ -0,0 +1,104 @@ +# Copyright (c) 2020-2021, NVIDIA CORPORATION. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# A workflow to trigger ci on hybrid infra (github + self hosted runner) +name: Blossom-CI +on: + issue_comment: + types: [created] + workflow_dispatch: + inputs: + platform: + description: 'runs-on argument' + required: false + args: + description: 'argument' + required: false +jobs: + Authorization: + name: Authorization + runs-on: blossom + outputs: + args: ${{ env.args }} + + # This job only runs for pull request comments + if: | + contains( 'okuchaiev,ericharper,titu1994,MaximumEntropy,nithinraok,redoctopus,yidong72,SeanNaren,yzhang123,ekmb,arendu,', format('{0},', github.actor)) && + github.event.comment.body == '/blossom-ci' + steps: + - name: Check if comment is issued by authorized person + run: blossom-ci + env: + OPERATION: 'AUTH' + REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }} + REPO_KEY_DATA: ${{ secrets.BLOSSOM_KEY }} + + Vulnerability-scan: + name: Vulnerability scan + needs: [Authorization] + runs-on: ubuntu-latest + steps: + - name: Checkout code + uses: actions/checkout@v2 + with: + repository: ${{ fromJson(needs.Authorization.outputs.args).repo }} + ref: ${{ fromJson(needs.Authorization.outputs.args).ref }} + lfs: 'true' + + # repo specific steps + #- name: Setup java + # uses: actions/setup-java@v1 + # with: + # java-version: 1.8 + + # add blackduck properties https://synopsys.atlassian.net/wiki/spaces/INTDOCS/pages/631308372/Methods+for+Configuring+Analysis#Using-a-configuration-file + #- name: Setup blackduck properties + # run: | + # PROJECTS=$(mvn -am dependency:tree | grep maven-dependency-plugin | awk '{ out="com.nvidia:"$(NF-1);print out }' | grep rapids | xargs | sed -e 's/ /,/g') + # echo detect.maven.build.command="-pl=$PROJECTS -am" >> application.properties + # echo detect.maven.included.scopes=compile >> application.properties + + - name: Run blossom action + uses: NVIDIA/blossom-action@main + env: + REPO_TOKEN: ${{ 
secrets.GITHUB_TOKEN }} + REPO_KEY_DATA: ${{ secrets.BLOSSOM_KEY }} + with: + args1: ${{ fromJson(needs.Authorization.outputs.args).args1 }} + args2: ${{ fromJson(needs.Authorization.outputs.args).args2 }} + args3: ${{ fromJson(needs.Authorization.outputs.args).args3 }} + + Job-trigger: + name: Start ci job + needs: [Vulnerability-scan] + runs-on: blossom + steps: + - name: Start ci job + run: blossom-ci + env: + OPERATION: 'START-CI-JOB' + CI_SERVER: ${{ secrets.CI_SERVER }} + REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }} + + Upload-Log: + name: Upload log + runs-on: blossom + if : github.event_name == 'workflow_dispatch' + steps: + - name: Jenkins log for pull request ${{ fromJson(github.event.inputs.args).pr }} (click here) + run: blossom-ci + env: + OPERATION: 'POST-PROCESSING' + CI_SERVER: ${{ secrets.CI_SERVER }} + REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }} diff --git a/.github/workflows/changelog-build.yml b/.github/workflows/changelog-build.yml new file mode 100644 index 0000000000000000000000000000000000000000..7e16c344acb861a1c136444084f863d3f6784049 --- /dev/null +++ b/.github/workflows/changelog-build.yml @@ -0,0 +1,47 @@ +name: 'Changelog Build (Release)' + +on: + push: + tags: + - '*' + +jobs: + changelog: + if: startsWith(github.ref, 'refs/tags/') + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v2 + with: + fetch-depth: 0 # Required due to the way Git works, without it this action won't be able to find any or the correct tags + + - name: Get Previous tag + id: previous_tag + # git for-each-ref --sort=-creatordate --format '%(refname)' refs/tags ==> refs/tags/vX.Y.Z in descending order of date + # awk 'FNR == 2 {print substr($1, 11, length($1))}') ==> Selects the 2nd tag from the list, then strips the /refs/tags/ part of the tag + # set-output name=tag_name:: ==> Takes the clean tag vX.Y.Z and sets it to steps.previous_tag.outputs.tag_name + run: | + echo "::set-output name=tag_name::$(git for-each-ref --sort=-creatordate --format '%(refname)' refs/tags | awk 'FNR == 2 {print substr($1, 11, length($1))}')" + echo ${{ steps.previous_tag.outputs.tag_name }} + + - name: Build Changelog + id: github_tag + uses: mikepenz/release-changelog-builder-action@v3.3.1 + env: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + with: + # Configuration file is setup with filters for domains + # owner:repo must point to current repo + # fromTag: Auto resolved from historical tag order (previous tag compared to current tag) + # toTag: Current tag reference + configuration: ".github/workflows/config/changelog-config.json" + owner: "NVIDIA" + repo: "NeMo" + ignorePreReleases: "false" + failOnError: "false" + fromTag: ${{ steps.previous_tag.outputs.tag_name }} + toTag: ${{ github.ref_name }} + + - name: Print Changelog + run: | + echo "${{steps.github_tag.outputs.changelog}}" + echo "--- DONE ---" diff --git a/.github/workflows/cherry-pick-release-commit.yml b/.github/workflows/cherry-pick-release-commit.yml new file mode 100644 index 0000000000000000000000000000000000000000..3c82269cb9a672dc21e874a8bcd2ee9c737517c3 --- /dev/null +++ b/.github/workflows/cherry-pick-release-commit.yml @@ -0,0 +1,28 @@ +name: Create PR to main with cherry-pick from release + +on: + pull_request_target: + branches: + - 'r*.*.*' + types: ["closed"] + +jobs: + cherry-pick-release-commit: + name: Cherry-pick release commit + runs-on: ubuntu-latest + steps: + - name: Checkout + uses: actions/checkout@v3 + with: + fetch-depth: 0 + - name: github-cherry-pick-action v1.0.3 + uses: 
carloscastrojumo/github-cherry-pick-action@bb0869df47c27be4ae4c7a2d93d22827aa5a0054 + with: + branch: main + labels: | + cherry-pick + reviewers: | + ${{ github.event.pull_request.user.login }} + +env: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} \ No newline at end of file diff --git a/.github/workflows/close-inactive-issue-pr.yml b/.github/workflows/close-inactive-issue-pr.yml new file mode 100644 index 0000000000000000000000000000000000000000..c71997e4b9bf8ec40ff3cbe4220216650acaf726 --- /dev/null +++ b/.github/workflows/close-inactive-issue-pr.yml @@ -0,0 +1,25 @@ +name: Stale-Close-Inactive-Issues-PRs +on: + schedule: + - cron: "30 1 * * *" + +jobs: + close-issues: + runs-on: ubuntu-latest + permissions: + issues: write + pull-requests: write + steps: + - uses: actions/stale@v6 + with: + operations-per-run: 100 + days-before-issue-stale: 30 + days-before-issue-close: 7 + stale-issue-label: "stale" + stale-issue-message: "This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days." + close-issue-message: "This issue was closed because it has been inactive for 7 days since being marked as stale." + days-before-pr-stale: 14 + days-before-pr-close: 7 + stale-pr-message: "This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days." + close-pr-message: "This PR was closed because it has been inactive for 7 days since being marked as stale." + repo-token: ${{ secrets.GITHUB_TOKEN }} diff --git a/.github/workflows/codeql.yml b/.github/workflows/codeql.yml new file mode 100644 index 0000000000000000000000000000000000000000..673687412096eabbe60d57ad39142cff6a5d2719 --- /dev/null +++ b/.github/workflows/codeql.yml @@ -0,0 +1,74 @@ +# For most projects, this workflow file will not need changing; you simply need +# to commit it to your repository. +# +# You may wish to alter this file to override the set of languages analyzed, +# or to provide custom queries or build logic. +# +# ******** NOTE ******** +# We have attempted to detect the languages in your repository. Please check +# the `language` matrix defined below to confirm you have the correct set of +# supported CodeQL languages. +# +name: "CodeQL" + +on: + push: + branches: [ "main", "[rv][0-9]*", "gh-pages-src" ] + pull_request: + # The branches below must be a subset of the branches above + branches: [ "main" ] + schedule: + - cron: '19 1 * * 4' + +jobs: + analyze: + name: Analyze + runs-on: ubuntu-latest + permissions: + actions: read + contents: read + security-events: write + + strategy: + fail-fast: false + matrix: + language: [ 'python' ] + # CodeQL supports [ 'cpp', 'csharp', 'go', 'java', 'javascript', 'python', 'ruby' ] + # Learn more about CodeQL language support at https://aka.ms/codeql-docs/language-support + + steps: + - name: Checkout repository + uses: actions/checkout@v3 + + # Initializes the CodeQL tools for scanning. + - name: Initialize CodeQL + uses: github/codeql-action/init@v2 + with: + languages: ${{ matrix.language }} + # If you wish to specify custom queries, you can do so here or in a config file. + # By default, queries listed here will override any specified in a config file. + # Prefix the list here with "+" to use these queries and those in the config file. 
+ + # Details on CodeQL's query packs refer to : https://docs.github.com/en/code-security/code-scanning/automatically-scanning-your-code-for-vulnerabilities-and-errors/configuring-code-scanning#using-queries-in-ql-packs + queries: security-and-quality # security-extended, + + + # Autobuild attempts to build any compiled languages (C/C++, C#, Go, or Java). + # If this step fails, then you should remove it and run the build manually (see below) + - name: Autobuild + uses: github/codeql-action/autobuild@v2 + + # ℹ️ Command-line programs to run using the OS shell. + # 📚 See https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsrun + + # If the Autobuild fails above, remove it and uncomment the following three lines. + # modify them (or add more) to build your code if your project, please refer to the EXAMPLE below for guidance. + + # - run: | + # echo "Run, Build Application using script" + # ./location_of_script_within_repo/buildscript.sh + + - name: Perform CodeQL Analysis + uses: github/codeql-action/analyze@v2 + with: + category: "/language:${{matrix.language}}" diff --git a/.github/workflows/config/changelog-config.json b/.github/workflows/config/changelog-config.json new file mode 100644 index 0000000000000000000000000000000000000000..fe18f8ac0681a46532957d7f4c19ca87c8125515 --- /dev/null +++ b/.github/workflows/config/changelog-config.json @@ -0,0 +1,134 @@ +{ + "categories": [ + { + "title": "## ASR \n\n
<details><summary>Changelog</summary>\n\n</details>
\n\n", + "labels": ["asr"], + "exclude_labels": ["cherry-pick"] + }, + { + "title": "## TTS \n\n
<details><summary>Changelog</summary>\n\n</details>
\n\n", + "labels": ["tts"], + "exclude_labels": ["cherry-pick"] + }, + { + "title": "## NLP / NMT \n\n
<details><summary>Changelog</summary>\n\n</details>
\n\n", + "labels": ["nlp", "nmt", "megatron"], + "exclude_labels": ["cherry-pick"] + }, + { + "title": "## Text Normalization / Inverse Text Normalization \n\n
<details><summary>Changelog</summary>\n\n</details>
\n\n", + "labels": ["tn", "itn"], + "exclude_labels": ["cherry-pick"] + }, + { + "title": "## NeMo Tools \n\n
<details><summary>Changelog</summary>\n\n</details>
\n\n", + "labels": ["tools"], + "exclude_labels": ["cherry-pick"] + }, + { + "title": "## Export \n\n
<details><summary>Changelog</summary>\n\n</details>
\n\n", + "labels": ["export"], + "exclude_labels": ["cherry-pick"] + }, + { + "title": "## Documentation \n\n
<details><summary>Changelog</summary>\n\n</details>
\n\n", + "labels": ["docs"], + "exclude_labels": ["cherry-pick"] + }, + { + "title": "## Bugfixes \n\n
<details><summary>Changelog</summary>\n\n</details>
\n\n", + "labels": ["bug"], + "exclude_labels": ["cherry-pick"] + }, + { + "title": "## Cherrypick \n\n
<details><summary>Changelog</summary>\n\n</details>
\n\n", + "labels": ["cherry-pick"], + "exclude_labels": ["cherry-pick"] + } + ], + "ignore_labels": [ + "ignore" + ], + "sort": "ASC", + "template": "\n${{CHANGELOG}}\nUncategorized:\n${{UNCATEGORIZED}}\n\n", + "pr_template": "- ${{TITLE}} by @${{AUTHOR}} :: PR: #${{NUMBER}}", + "empty_template": "${{OWNER}}\n${{REPO}}\n${{FROM_TAG}}\n${{TO_TAG}}", + "label_extractor": [ + { + "pattern": "(.*tts.*)|(.*g2p.*)", + "target": "tts", + "flags": "gimu", + "on_property": ["title", "body"] + }, + { + "pattern": "(.*asr.*)|(.*ctc.*)|(.*rnnt.*)|(.*transducer.*)|(.*dali.*)|(.*k2.*)", + "target": "asr", + "flags": "gimu", + "on_property": ["title", "body"] + }, + { + "pattern": "(.*nlp.*)|(.*punctuation.*)|(.*capitalization.*)|(.*entity.*)|(.*glue.*)|(.*entity.*)|(.*retrieval.*)|(.*entity.*)|(.*intent.*)|(.*slot.*)|(.*entity.*)|(.*language.*)|(.*qa.*)|(.*token class.*)|(.*text class.*)", + "target": "nlp", + "flags": "gimu", + "on_property": ["title", "body"] + }, + { + "pattern": "(.*nmt.*)|(.*bignlp.*)|(.*megatron.*)|(.*machine.*)|(.*translation.*)|(.*gpt.*)", + "target": "nmt", + "flags": "gimu", + "on_property": ["title", "body"] + }, + { + "pattern": "(.*tn.*)|(.*itn.*)|(.*text norm.*)", + "target": "tn", + "flags": "gimu", + "on_property": ["title", "body"] + }, + { + "pattern": "(.*sde.*)|(.*ctc segment.*)", + "target": "tools", + "flags": "gimu", + "on_property": ["title", "body"] + }, + { + "pattern": "(.*trt.*)|(.*onnx.*)|(.*export.*)", + "target": "export", + "flags": "gimu", + "on_property": ["title", "body"] + }, + { + "pattern": "(.*\\[x\\] Documentation.*)", + "target": "docs", + "flags": "gmu", + "on_property": ["title", "body"] + }, + { + "pattern": "(.*\\[x\\] Bugfix.*)|(.*patch.*)", + "target": "bug", + "flags": "gmu", + "on_property": ["title", "body"] + }, + { + "pattern": "(.*cherry-pick.*)|(.*cherrypick.*)", + "target": "cherrypick", + "flags": "gimu", + "on_property": ["title", "body"] + } + ], + "duplicate_filter": { + "pattern": ".+", + "on_property": "title", + "method": "match" + }, + "transformers": [ + ], + "max_tags_to_fetch": 100, + "max_pull_requests": 500, + "max_back_track_time_days": 365, + "exclude_merge_branches": [ + ], + "tag_resolver": { + "method": "semver" + } +} + diff --git a/.github/workflows/gh-docs.yml b/.github/workflows/gh-docs.yml new file mode 100644 index 0000000000000000000000000000000000000000..6f8e8ea1e3e1a13f0d307705b3435c9d5a0ee04d --- /dev/null +++ b/.github/workflows/gh-docs.yml @@ -0,0 +1,38 @@ +name: gh-docs-build +on: + push: + pull_request: + paths: + - "**" + +# Set the access for individual scopes +permissions: write-all + +jobs: + deploy: + runs-on: ubuntu-latest + + container: + image: squidfunk/mkdocs-material + + steps: + - uses: actions/checkout@v3 + if: github.event.repository.fork == false + with: + ref: gh-pages-src + + - name: "Correct github config" + if: github.event.repository.fork == false + run: | + git config --global --add safe.directory "$GITHUB_WORKSPACE" + git config --global user.name "${GITHUB_ACTOR}" + git config --global user.email "${GITHUB_ACTOR}@users.noreply.${GITHUB_DOMAIN:-"github.com"}" + remote_repo="https://x-access-token:${GITHUB_TOKEN}@${GITHUB_DOMAIN:-"github.com"}/${GITHUB_REPOSITORY}.git" + echo "${remote_repo}" + git remote rm origin + git remote add origin "${remote_repo}" + + - name: "Deploy Github Page" + continue-on-error: true + run: mkdocs gh-deploy --force + diff --git a/.github/workflows/import-test.yml b/.github/workflows/import-test.yml new file mode 100644 index 
0000000000000000000000000000000000000000..5fc34347710d8601b8e3ba5c86290f8554ff0b4b --- /dev/null +++ b/.github/workflows/import-test.yml @@ -0,0 +1,63 @@ +name: CI-Import-Check + +on: + push: + pull_request: + paths: + - "**" + +jobs: + ci-import-check: + runs-on: ubuntu-latest + + # Check https://hub.docker.com/r/pytorch/pytorch/tags for latest tags + container: + image: pytorch/pytorch:1.11.0-cuda11.3-cudnn8-runtime + + steps: + - uses: actions/checkout@v2 + + - name: Update base dependencies + run: | + apt-get update && apt-get install -y build-essential + apt-get install -y libsndfile1 make + + - name: Install nemo dependencies + id: nemo-wheel + run: | + # install test requirements + pip install -r requirements/requirements_test.txt + # Build nemo as a wheel + pip install build + python -m build --no-isolation --wheel + # Preserve wheel location + DIST_FILE=$(find ./dist -name "*.whl" | head -n 1) + echo "::set-output name=DIST_FILE::${DIST_FILE}" + + - name: Test ASR Domain Imports + run: | + # Install NeMo Domain + pip install "${{ steps.nemo-wheel.outputs.DIST_FILE }}[asr]" + # Run import checks + python tests/core_ptl/check_imports.py --domain "asr" + # Uninstall NeMo + pip uninstall -y nemo_toolkit + + - name: Test TTS Domain Imports + run: | + # Install NeMo Domain + pip install "${{ steps.nemo-wheel.outputs.DIST_FILE }}[tts]" + # Run import checks + python tests/core_ptl/check_imports.py --domain "tts" + # Uninstall NeMo + pip uninstall -y nemo_toolkit + + - name: Test NLP Domain Imports + run: | + # Install NeMo Domain + pip install "${{ steps.nemo-wheel.outputs.DIST_FILE }}[nlp]" + # Run import checks + python tests/core_ptl/check_imports.py --domain "nlp" + # Uninstall NeMo + pip uninstall -y nemo_toolkit + diff --git a/.github/workflows/labeler.yml b/.github/workflows/labeler.yml new file mode 100644 index 0000000000000000000000000000000000000000..680f9d187a3b77819a0e87e5d3c0fca965d74831 --- /dev/null +++ b/.github/workflows/labeler.yml @@ -0,0 +1,14 @@ +name: "Pull Request Labeler" +on: +- pull_request_target + +jobs: + triage: + permissions: + contents: read + pull-requests: write + runs-on: ubuntu-latest + steps: + - uses: actions/labeler@v4 + with: + repo-token: "${{ secrets.GITHUB_TOKEN }}" \ No newline at end of file diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000000000000000000000000000000000000..1ff2a92cac64c41cc2b62fe4c08737989e61c56a --- /dev/null +++ b/.gitignore @@ -0,0 +1,181 @@ +# log and data files +*.model +*.pkl +#*.ipynb +output +result +*.pt +tests/data/asr +.DS_Store +bert.pt.json +work +runs +fastspeech_output +.hydra +.bash_history.local + +# Byte-compiled / optimized / DLL files +__pycache__/ +*.py[cod] +*$py.class +**.pyc + +# C extensions +*.so + +# Distribution / packaging +.idea +.Python +wandb +build/ +develop-eggs/ +dist/ +downloads/ +eggs/ +.eggs/ +lib/ +lib64/ +#parts/ +sdist/ +var/ +wheels/ +pip-wheel-metadata/ +share/python-wheels/ +*.egg-info/ +.installed.cfg +*.egg +MANIFEST + +# PyInstaller +# Usually these files are written by a python script from a template +# before PyInstaller builds the exe, so as to inject date/other infos into it. 
+*.manifest +*.spec + +# Installer logs +pip-log.txt +pip-delete-this-directory.txt + +# Unit test / coverage reports +htmlcov/ +.tox/ +.nox/ +.coverage +.coverage.* +.cache +nosetests.xml +coverage.xml +*.cover +.hypothesis/ +.pytest_cache/ + +# Translations +*.mo +*.pot + +# Django stuff: +*.log +local_settings.py +db.sqlite3 + +# Flask stuff: +instance/ +.webassets-cache + +# Scrapy stuff: +.scrapy + +# Sphinx documentation +docs/build + +# PyBuilder +target/ + +# Jupyter Notebook +.ipynb_checkpoints + +# Override Jupyter in Github Language states for more accurate estimate of repo code. +# Reference: https://github.com/github/linguist/blob/master/docs/overrides.md#generated-code +*.ipynb linguist-generated + +# IPython +profile_default/ +ipython_config.py + +# pyenv +.python-version + +# pipenv +# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. +# However, in case of collaboration, if having platform-specific dependencies or dependencies +# having no cross-platform support, pipenv may install dependencies that don’t work, or not +# install all needed dependencies. +#Pipfile.lock + +# celery beat schedule file +celerybeat-schedule + +# SageMath parsed files +*.sage.py + +# Environments +.env +.venv +env/ +venv/ +ENV/ +env.bak/ +venv.bak/ + +# VSCode project settins +.vscode/ + +# Spyder project settings +.spyderproject +.spyproject + +# Rope project settings +.ropeproject + +# mkdocs documentation +/site +/docs/html +/docs/docs_zh/zh + +# mypy +.mypy_cache/ +.dmypy.json +dmypy.json + +# Pyre type checker +.pyre/ + +# Emacs backup files +*~ + +cifar-10-batches-py +*.tar.gz + +# Test data. +tests/.data +tests/data + +# outputs folder +examples/*/outputs +examples/*/NeMo_experiments +examples/*/nemo_experiments +examples/*/.hydra +examples/*/wandb +examples/*/data +wandb +dump.py + +docs/sources/source/test_build/ + +# Checkpoints, config files and temporary files created in tutorials. +examples/neural_graphs/*.chkpt +examples/neural_graphs/*.yml + +.hydra/ +nemo_experiments/ + diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..fd89d3983cc551081d065456f36f1e5bd5c61e8d --- /dev/null +++ b/.pre-commit-config.yaml @@ -0,0 +1,47 @@ +# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +default_language_version: + python: python3 + +ci: + autofix_prs: true + autoupdate_commit_msg: '[pre-commit.ci] pre-commit suggestions' + autoupdate_schedule: quarterly + +repos: + - repo: https://github.com/pre-commit/pre-commit-hooks + rev: v4.3.0 + hooks: + - id: check-yaml + - id: check-case-conflict + - id: detect-private-key + - id: check-added-large-files + args: ['--maxkb=1000'] + - id: requirements-txt-fixer + + - repo: https://github.com/PyCQA/isort + rev: 5.12.0 + hooks: + - id: isort + name: Format imports + exclude: docs/ + + - repo: https://github.com/psf/black + rev: 19.10b0 + hooks: + - id: black + name: Format code + args: [--skip-string-normalization, --line-length=119] + additional_dependencies: ['click==8.0.2'] diff --git a/.readthedocs.yml b/.readthedocs.yml new file mode 100644 index 0000000000000000000000000000000000000000..226be6a7eab01dbec87fb57678afa0f1b46f658a --- /dev/null +++ b/.readthedocs.yml @@ -0,0 +1,31 @@ +# ============================================================================= +# Copyright (c) 2020 NVIDIA. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================= + +# Read the Docs configuration file +# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details + +# Required field. +version: 2 + +# Build documentation in the docs/ directory with Sphinx. +sphinx: + configuration: docs/source/conf.py + +# Set the version of Python and requirements required to build your docs +python: + version: 3.8 + install: + - requirements: requirements/requirements_docs.txt diff --git a/CITATION.cff b/CITATION.cff new file mode 100644 index 0000000000000000000000000000000000000000..436750dd0af057b8c4d701cfb2a01a4ca0d4365a --- /dev/null +++ b/CITATION.cff @@ -0,0 +1,41 @@ +cff-version: 1.2.0 +message: "If you use this software, please cite it as below." 
+title: "NeMo: a toolkit for Conversational AI and Large Language Models" +url: https://nvidia.github.io/NeMo/ +repository-code: https://github.com/NVIDIA/NeMo +authors: + - family-names: Harper + given-names: Eric + - family-names: Majumdar + given-names: Somshubra + - family-names: Kuchaiev + given-names: Oleksii + - family-names: Jason + given-names: Li + - family-names: Zhang + given-names: Yang + - family-names: Bakhturina + given-names: Evelina + - family-names: Noroozi + given-names: Vahid + - family-names: Subramanian + given-names: Sandeep + - family-names: Nithin + given-names: Koluguri + - family-names: Jocelyn + given-names: Huang + - family-names: Jia + given-names: Fei + - family-names: Balam + given-names: Jagadeesh + - family-names: Yang + given-names: Xuesong + - family-names: Livne + given-names: Micha + - family-names: Dong + given-names: Yi + - family-names: Naren + given-names: Sean + - family-names: Ginsburg + given-names: Boris + diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 0000000000000000000000000000000000000000..621a37a171b7bb06a10491350697cdd671aa7a11 --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,79 @@ +# Contributions are welcome! + +We do all of NeMo's development in the open. Contributions from NeMo community are welcome. + + +# Pull Requests (PR) Guidelines + +**Send your PRs to the `main` branch** + +1) Make sure your PR does one thing. Have a clear answer to "What does this PR do?". +2) Read General Principles and style guide below +3) Make sure you sign your commits. E.g. use ``git commit -s`` when before your commit +4) Make sure all unittests finish successfully before sending PR ``pytest`` or (if yor dev box does not have GPU) ``pytest --cpu`` from NeMo's root folder +5) Send your PR and request a review + +## Unit tests +Quick tests (locally, while developing) +``` +pytest +# If you don't have NVIDIA GPU do: +# pytest --cpu +``` +Full tests, including pre-trained model downloads +``` +pytest --with_downloads +``` + +## Whom should you ask for review: +1. For changes to NeMo's core: @ericharper, @titu1994, @blisc, or @okuchaiev +1. For changes to NeMo's ASR collection: @titu1994, @redoctopus, @jbalam-nv, or @okuchaiev +1. For changes to NeMo's NLP collection: @MaximumEntropy, @ericharper, @ekmb, @yzhang123, @VahidooX, @vladgets, or @okuchaiev +1. For changes to NeMo's TTS collection: @blisc, or @okuchaiev + +Note that some people may self-assign to review your PR - in which case, please wait for them to add a review. + +Your pull requests must pass all checks and peer-review before they can be merged. + +# General principles +1. **User-oriented**: make it easy for end users, even at the cost of writing more code in the background +1. **Robust**: make it hard for users to make mistakes. +1. **Well-tested**: please add simple, fast unittests. Consider adding CI tests for end-to-end functionality. +1. **Reusable**: for every piece of code, think about how it can be reused in the future and make it easy to be reused. +1. **Readable**: code should be easier to read. +1. **Legal**: if you copy even one line of code from the Internet, make sure that the code allows the license that NeMo supports. Give credit and link back to the code. +1. **Sensible**: code should make sense. If you think a piece of code might be confusing, write comments. 
+ +## Class naming conventions +* No “I”, “Interface”, “NM” nor “NeMo” pre/postfixes anywhere +* Core interfaces have simple names: Typing, Cloud, Serialization, FileIO* +* Core classes have the simplest names ever: NeuralModule, Model, Graph, Dataset, Loss, Module* +* Abstract classes in the Model hierarchy have Model postfix +* A config class for MyModel should be called MyModelConfig +* Leaf Neural Module classes have simple names without any postfixes (e.g. AudioPreprocess) +* Leaf Datasets have Dataset postfix (e.g. AudioToSpeechLabelDataset) +* Leaf Losses have Loss postfix (e.g. CTCLoss) +* Leaf Models do not have any postfix, just name (e.g. QuartzNet) + +## Python style +We use ``black`` as our style guide. To check whether your code will pass style check (from the NeMo's repo folder) run: +``python setup.py style`` and if it does not pass run ``python setup.py style --fix``. + +1. Include docstrings for every class and method exposed to the user. +1. Use Python 3 type hints for every class and method exposed to the user. +1. Avoid wild import: ``from X import *`` unless in ``X.py``, ``__all__`` is defined. +1. Minimize the use of ``**kwargs``. +1. ``RaiseError`` is preferred to ``assert``. Write: ```if X: raise Error``` instead of ```assert X```. +1. Classes are preferred to standalone methods. +1. Methods should be atomic. A method shouldn't be longer than 75 lines, e.g. can be fit into the computer screen without scrolling. +1. If a method has arguments that don't fit into one line, each argument should be in its own line for readability. +1. Add ``__init__.py`` for every folder. +1. F-strings are prefered to formatted strings. +1. Loggers are preferred to print. In NeMo, you can use logger from ``from nemo.utils import logging`` +1. Private functions (functions start with ``_``) shouldn't be called outside its host file. +1. If a comment lasts multiple lines, use ``'''`` instead of ``#``. + +# Collections +Collection is a logical grouping of related Neural Modules. It is a grouping of modules that share a domain area or semantics. +When contributing module to a collection, please make sure it belongs to that category. +If you would like to start a new one and contribute back to the platform, you are very welcome to do so. diff --git a/Dockerfile b/Dockerfile new file mode 100644 index 0000000000000000000000000000000000000000..434ecb0abd1df8a93ba123cff8dcd9ca0e464237 --- /dev/null +++ b/Dockerfile @@ -0,0 +1,127 @@ +# syntax=docker/dockerfile:experimental + +# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +ARG BASE_IMAGE=nvcr.io/nvidia/pytorch:23.02-py3 + +# build an image that includes only the nemo dependencies, ensures that dependencies +# are included first for optimal caching, and useful for building a development +# image (by specifying build target as `nemo-deps`) +FROM ${BASE_IMAGE} as nemo-deps + +# dependency flags; should be declared after FROM +# torchaudio: not required by default +ARG REQUIRE_TORCHAUDIO=false +# k2: not required by default +ARG REQUIRE_K2=false +# ais cli: not required by default, install only if required +ARG REQUIRE_AIS_CLI=false + +# Ensure apt-get won't prompt for selecting options +ENV DEBIAN_FRONTEND=noninteractive +# libavdevice-dev rerquired for latest torchaudio +RUN apt-get update && \ + apt-get upgrade -y && \ + apt-get install -y \ + libsndfile1 sox \ + libfreetype6 \ + swig \ + ffmpeg \ + libavdevice-dev && \ + rm -rf /var/lib/apt/lists/* + +WORKDIR /tmp/ + +# TODO: Remove once this Apex commit (2/24/23) is included in PyTorch +# container +RUN git clone https://github.com/NVIDIA/apex.git && \ + cd apex && \ + git checkout 03c9d80ed54c0eaa5b581bf42ceca3162f085327 && \ + pip3 install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" --global-option="--distributed_adam" --global-option="--deprecated_fused_adam" ./ + +# uninstall stuff from base container +RUN pip3 uninstall -y sacrebleu torchtext + +# build torchaudio +WORKDIR /tmp/torchaudio_build +COPY scripts/installers /tmp/torchaudio_build/scripts/installers/ +RUN INSTALL_MSG=$(/bin/bash /tmp/torchaudio_build/scripts/installers/install_torchaudio_latest.sh); INSTALL_CODE=$?; \ + echo ${INSTALL_MSG}; \ + if [ ${INSTALL_CODE} -ne 0 ]; then \ + echo "torchaudio installation failed"; \ + if [ "${REQUIRE_TORCHAUDIO}" = true ]; then \ + exit ${INSTALL_CODE}; \ + else echo "Skipping failed torchaudio installation"; fi \ + else echo "torchaudio installed successfully"; fi + +# install nemo dependencies +WORKDIR /tmp/nemo +COPY requirements . +RUN for f in $(ls requirements*.txt); do pip3 install --disable-pip-version-check --no-cache-dir -r $f; done + +# install k2, skip if installation fails +COPY scripts /tmp/nemo/scripts/ +RUN INSTALL_MSG=$(/bin/bash /tmp/nemo/scripts/speech_recognition/k2/setup.sh); INSTALL_CODE=$?; \ + echo ${INSTALL_MSG}; \ + if [ ${INSTALL_CODE} -ne 0 ]; then \ + echo "k2 installation failed"; \ + if [ "${REQUIRE_K2}" = true ]; then \ + exit ${INSTALL_CODE}; \ + else echo "Skipping failed k2 installation"; fi \ + else echo "k2 installed successfully"; fi + +# copy nemo source into a scratch image +FROM scratch as nemo-src +COPY . . + +# start building the final container +FROM nemo-deps as nemo +ARG NEMO_VERSION=1.17.0 + +# Check that NEMO_VERSION is set. Build will fail without this. 
Expose NEMO and base container +# version information as runtime environment variable for introspection purposes +RUN /usr/bin/test -n "$NEMO_VERSION" && \ + /bin/echo "export NEMO_VERSION=${NEMO_VERSION}" >> /root/.bashrc && \ + /bin/echo "export BASE_IMAGE=${BASE_IMAGE}" >> /root/.bashrc + +# Install NeMo +RUN --mount=from=nemo-src,target=/tmp/nemo cd /tmp/nemo && pip install ".[all]" + +# Check install +RUN python -c "import nemo.collections.nlp as nemo_nlp" && \ + python -c "import nemo.collections.tts as nemo_tts" && \ + python -c "import nemo_text_processing.text_normalization as text_normalization" + + +# copy scripts/examples/tests into container for end user +WORKDIR /workspace/nemo +COPY scripts /workspace/nemo/scripts +COPY examples /workspace/nemo/examples +COPY tests /workspace/nemo/tests +COPY tutorials /workspace/nemo/tutorials +# COPY README.rst LICENSE /workspace/nemo/ + +RUN printf "#!/bin/bash\njupyter lab --no-browser --allow-root --ip=0.0.0.0" >> start-jupyter.sh && \ + chmod +x start-jupyter.sh + +# If required, install AIS CLI +RUN if [ "${REQUIRE_AIS_CLI}" = true ]; then \ + INSTALL_MSG=$(/bin/bash scripts/installers/install_ais_cli_latest.sh); INSTALL_CODE=$?; \ + echo ${INSTALL_MSG}; \ + if [ ${INSTALL_CODE} -ne 0 ]; then \ + echo "AIS CLI installation failed"; \ + exit ${INSTALL_CODE}; \ + else echo "AIS CLI installed successfully"; fi \ + else echo "Skipping AIS CLI installation"; fi diff --git a/Jenkinsfile b/Jenkinsfile new file mode 100644 index 0000000000000000000000000000000000000000..3082cb1aad73c4c1a12f6b7aa2c7ae8ccb2b1d97 --- /dev/null +++ b/Jenkinsfile @@ -0,0 +1,4447 @@ +pipeline { + agent { + docker { + image 'pytorch_23.02:apex_eec72500b073581edf1bc9183f0337338478ba9b_te_f06e2d85619376b9db0ca86847df2f1a5cb71388' + args '--device=/dev/nvidia0 --gpus all --user 0:128 -v /home/TestData:/home/TestData -v $HOME/.cache:/root/.cache --shm-size=8g' + } + } + options { + timeout(time: 2, unit: 'HOURS') + disableConcurrentBuilds(abortPrevious: true) + } + + stages { + + stage('Add git safe directory'){ + steps{ + sh 'git config --global --add safe.directory /var/lib/jenkins/workspace/NeMo_$GIT_BRANCH' + sh 'git config --global --add safe.directory /raid/JenkinsWorkDir/workspace/NeMo_$GIT_BRANCH' + sh 'git config --global --add safe.directory /mnt/D3/JenkinsWorkDir/workspace/NeMo_$GIT_BRANCH' + } + } + + stage('nvidia-smi'){ + steps{ + sh 'nvidia-smi' + } + } + + stage('PyTorch version') { + steps { + sh 'python -c "import torch; print(torch.__version__)"' + sh 'python -c "import torchvision; print(torchvision.__version__)"' + } + } + + stage('Install test requirements') { + steps { + sh 'apt-get update && apt-get install -y bc && pip install -r requirements/requirements_test.txt' + } + } + + stage('Code formatting checks') { + steps { + sh 'python setup.py style' + } + } + + stage('Copyright Headers check') { + steps { + sh 'python tests/check_copyright_header.py --dir .' 
+ } + } + + stage('NeMo Installation') { + steps { + sh './reinstall.sh release' + } + } + + + stage('PyTorch Lightning version') { + steps { + sh 'python -c "import pytorch_lightning; print(pytorch_lightning.__version__)"' + } + } + + stage('PyTorch Lightning DDP Checks') { + steps { + sh 'CUDA_VISIBLE_DEVICES="0,1" python "tests/core_ptl/check_for_ranks.py"' + } + } + + stage('Basic Import Checks') { + steps { + sh 'python -c "import nemo.collections.asr as nemo_asr"' + sh 'python -c "import nemo.collections.nlp as nemo_nlp"' + sh 'python -c "import nemo.collections.tts as nemo_tts"' + } + } + stage('L0: Unit Tests GPU') { + steps { + sh 'NEMO_NUMBA_MINVER=0.53 pytest -m "not pleasefixme" --with_downloads' + } + } + + stage('L0: Unit Tests CPU') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + steps { + sh 'CUDA_VISIBLE_DEVICES="" NEMO_NUMBA_MINVER=0.53 pytest -m "not pleasefixme" --cpu --with_downloads --relax_numba_compat' + } + } + + stage('L2: ASR dev run') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel { + stage('Speech to Text') { + steps { + sh 'python examples/asr/asr_ctc/speech_to_text_ctc.py \ + model.train_ds.manifest_filepath=/home/TestData/an4_dataset/an4_train.json \ + model.validation_ds.manifest_filepath=/home/TestData/an4_dataset/an4_val.json \ + trainer.devices=[0] \ + trainer.accelerator="gpu" \ + +trainer.fast_dev_run=True \ + exp_manager.exp_dir=examples/asr/speech_to_text_results' + sh 'rm -rf examples/asr/speech_to_text_results' + } + } + + stage('L2: Speech to Text WPE - CitriNet') { + steps { + sh 'python examples/asr/asr_ctc/speech_to_text_ctc_bpe.py \ + --config-path="../conf/citrinet/" --config-name="config_bpe" \ + model.train_ds.manifest_filepath=/home/TestData/an4_dataset/an4_train.json \ + model.validation_ds.manifest_filepath=/home/TestData/an4_dataset/an4_val.json \ + model.tokenizer.dir="/home/TestData/asr_tokenizers/an4_wpe_128/" \ + model.tokenizer.type="wpe" \ + trainer.devices=[1] \ + trainer.accelerator="gpu" \ + +trainer.fast_dev_run=True \ + exp_manager.exp_dir=examples/asr/speech_to_text_wpe_results' + sh 'rm -rf examples/asr/speech_to_text_wpe_results' + } + } + + stage('L2: Speech Pre-training - CitriNet') { + steps { + sh 'python examples/asr/speech_pretraining/speech_pre_training.py \ + --config-path="../conf/ssl/citrinet/" --config-name="citrinet_ssl_ci" \ + model.train_ds.manifest_filepath=/home/TestData/an4_dataset/an4_train.json \ + model.validation_ds.manifest_filepath=/home/TestData/an4_dataset/an4_val.json \ + trainer.devices=[1] \ + trainer.accelerator="gpu" \ + +trainer.fast_dev_run=True \ + exp_manager.exp_dir=examples/asr/speech_pre_training_results' + sh 'rm -rf examples/asr/speech_pre_training_results' + } + } + + stage('L2: Speech Pre-training - Wav2Vec') { + steps { + sh 'python examples/asr/speech_pretraining/speech_pre_training.py \ + --config-path="../conf/ssl/wav2vec/" --config-name="wav2vec_ci" \ + model.train_ds.manifest_filepath=/home/TestData/an4_dataset/an4_train.json \ + model.validation_ds.manifest_filepath=/home/TestData/an4_dataset/an4_val.json \ + trainer.devices=[1] \ + trainer.accelerator="gpu" \ + +trainer.fast_dev_run=True \ + exp_manager.exp_dir=examples/asr/speech_pre_training_results' + sh 'rm -rf examples/asr/speech_pre_training_results' + } + } + + stage('L2: Speech to Text WPE - Conformer') { + steps { + sh 'python examples/asr/asr_ctc/speech_to_text_ctc_bpe.py \ + --config-path="../conf/conformer" 
--config-name="conformer_ctc_bpe" \ + model.train_ds.manifest_filepath=/home/TestData/an4_dataset/an4_train.json \ + model.validation_ds.manifest_filepath=/home/TestData/an4_dataset/an4_val.json \ + model.tokenizer.dir="/home/TestData/asr_tokenizers/an4_wpe_128/" \ + model.tokenizer.type="wpe" \ + model.train_ds.batch_size=4 \ + model.validation_ds.batch_size=4 \ + trainer.devices=[1] \ + trainer.accelerator="gpu" \ + +trainer.fast_dev_run=True \ + exp_manager.exp_dir=examples/asr/speech_to_text_wpe_conformer_results' + sh 'rm -rf examples/asr/speech_to_text_wpe_conformer_results' + } + } + } + } + + stage('L2: ASR dev run - part two') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel { + stage('L2: Speech to Text WPE - Squeezeformer') { + steps { + sh 'python examples/asr/asr_ctc/speech_to_text_ctc_bpe.py \ + --config-path="../conf/squeezeformer" --config-name="squeezeformer_ctc_bpe" \ + model.train_ds.manifest_filepath=/home/TestData/an4_dataset/an4_train.json \ + model.validation_ds.manifest_filepath=/home/TestData/an4_dataset/an4_val.json \ + model.tokenizer.dir="/home/TestData/asr_tokenizers/an4_wpe_128/" \ + model.tokenizer.type="wpe" \ + model.encoder.d_model=144 \ + model.train_ds.batch_size=4 \ + model.validation_ds.batch_size=4 \ + trainer.devices=[0] \ + trainer.accelerator="gpu" \ + +trainer.fast_dev_run=True \ + exp_manager.exp_dir=examples/asr/speech_to_text_wpe_squeezeformer_results' + sh 'rm -rf examples/asr/speech_to_text_wpe_squeezeformer_results' + } + } + } + } + + stage('L2: Speech to Text EMA') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + steps { + sh 'python examples/asr/asr_ctc/speech_to_text_ctc.py \ + model.train_ds.manifest_filepath=/home/TestData/an4_dataset/an4_train.json \ + model.validation_ds.manifest_filepath=/home/TestData/an4_dataset/an4_val.json \ + trainer.devices=2 \ + trainer.accelerator="gpu" \ + +trainer.fast_dev_run=True \ + +exp_manager.ema.enable=True \ + exp_manager.exp_dir=examples/asr/speech_to_text_results' + sh 'rm -rf examples/asr/speech_to_text_results' + } + + } + + stage('L2: Speaker dev run') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel { + stage('Speaker Recognition') { + steps { + sh 'python examples/speaker_tasks/recognition/speaker_reco.py \ + model.train_ds.batch_size=10 \ + model.validation_ds.batch_size=2 \ + model.train_ds.manifest_filepath=/home/TestData/an4_speaker/train.json \ + model.validation_ds.manifest_filepath=/home/TestData/an4_speaker/dev.json \ + model.decoder.num_classes=2 \ + trainer.max_epochs=10 \ + trainer.devices=[1] \ + trainer.accelerator="gpu" \ + +trainer.fast_dev_run=True \ + exp_manager.exp_dir=examples/speaker_tasks/recognition/speaker_recognition_results' + sh 'rm -rf examples/speaker_tasks/recognition/speaker_recognition_results' + } + } + + stage('Speaker Diarization') { + steps { + sh 'python examples/speaker_tasks/diarization/neural_diarizer/multiscale_diar_decoder.py \ + model.diarizer.speaker_embeddings.model_path=titanet_large \ + model.train_ds.batch_size=5 \ + model.validation_ds.batch_size=5 \ + model.train_ds.emb_dir=examples/speaker_tasks/diarization/speaker_diarization_results \ + model.validation_ds.emb_dir=examples/speaker_tasks/diarization/speaker_diarization_results \ + model.train_ds.manifest_filepath=/home/TestData/an4_diarizer/simulated_train/msdd_data.50step.json \ + 
model.validation_ds.manifest_filepath=/home/TestData/an4_diarizer/simulated_valid/msdd_data.50step.json \ + trainer.devices=[1] \ + trainer.accelerator="gpu" \ + +trainer.fast_dev_run=True \ + exp_manager.exp_dir=examples/speaker_tasks/diarization/speaker_diarization_results' + sh 'rm -rf examples/speaker_tasks/diarization/speaker_diarization_results' + } + } + + stage('Speech to Label') { + steps { + sh 'python examples/asr/speech_classification/speech_to_label.py \ + model.train_ds.manifest_filepath=/home/TestData/speech_commands/train_manifest.json \ + model.validation_ds.manifest_filepath=/home/TestData/speech_commands/test_manifest.json \ + model.test_ds.manifest_filepath=/home/TestData/speech_commands/test_manifest.json \ + trainer.devices=[1] \ + trainer.accelerator="gpu" \ + +trainer.fast_dev_run=True \ + model.preprocessor._target_=nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor \ + ~model.preprocessor.window_size \ + ~model.preprocessor.window_stride \ + ~model.preprocessor.window \ + ~model.preprocessor.n_mels \ + ~model.preprocessor.n_mfcc \ + ~model.preprocessor.n_fft \ + exp_manager.exp_dir=examples/asr/speech_to_label_results' + sh 'rm -rf examples/asr/speech_to_label_results' + } + } + + stage('Speaker Diarization with ASR Inference') { + steps { + sh 'python examples/speaker_tasks/diarization/clustering_diarizer/offline_diar_with_asr_infer.py \ + diarizer.manifest_filepath=/home/TestData/an4_diarizer/an4_manifest.json \ + diarizer.speaker_embeddings.model_path=/home/TestData/an4_diarizer/spkr.nemo \ + diarizer.speaker_embeddings.parameters.save_embeddings=True \ + diarizer.speaker_embeddings.parameters.window_length_in_sec=[1.5] \ + diarizer.speaker_embeddings.parameters.shift_length_in_sec=[0.75] \ + diarizer.speaker_embeddings.parameters.multiscale_weights=[1.0] \ + diarizer.asr.model_path=QuartzNet15x5Base-En \ + diarizer.asr.parameters.asr_based_vad=True \ + diarizer.out_dir=examples/speaker_tasks/diarization/speaker_diarization_asr_results' + sh 'rm -rf examples/speaker_tasks/diarization/speaker_diarization_asr_results' + } + } + + stage('Clustering Diarizer Inference') { + steps { + sh 'python examples/speaker_tasks/diarization/clustering_diarizer/offline_diar_infer.py \ + diarizer.manifest_filepath=/home/TestData/an4_diarizer/an4_manifest.json \ + diarizer.speaker_embeddings.model_path=/home/TestData/an4_diarizer/spkr.nemo \ + diarizer.speaker_embeddings.parameters.save_embeddings=True \ + diarizer.speaker_embeddings.parameters.window_length_in_sec=1.5 \ + diarizer.speaker_embeddings.parameters.shift_length_in_sec=0.75 \ + diarizer.speaker_embeddings.parameters.multiscale_weights=null \ + diarizer.vad.model_path=/home/TestData/an4_diarizer/MatchboxNet_VAD_3x2.nemo \ + diarizer.out_dir=examples/speaker_tasks/diarization/clustering_diarizer_results' + sh 'rm -rf examples/speaker_tasks/diarization/clustering_diarizer_results' + } + } + + stage('Neural Diarizer Inference') { + steps { + sh 'python examples/speaker_tasks/diarization/neural_diarizer/multiscale_diar_decoder_infer.py \ + diarizer.manifest_filepath=/home/TestData/an4_diarizer/an4_manifest.json \ + diarizer.msdd_model.model_path=/home/TestData/an4_diarizer/diar_msdd_telephonic.nemo \ + diarizer.speaker_embeddings.parameters.save_embeddings=True \ + diarizer.vad.model_path=/home/TestData/an4_diarizer/MatchboxNet_VAD_3x2.nemo \ + diarizer.out_dir=examples/speaker_tasks/diarization/neural_diarizer_results' + sh 'rm -rf examples/speaker_tasks/diarization/neural_diarizer_results' + } + } + + 
stage('Multispeaker ASR Data Simulation') { + steps { + sh 'python tools/speech_data_simulator/multispeaker_simulator.py \ + --config-path=conf --config-name=data_simulator.yaml \ + data_simulator.random_seed=42 \ + data_simulator.manifest_filepath=/home/TestData/LibriSpeechShort/dev-clean-align-short.json \ + data_simulator.outputs.output_dir=./test_simulator \ + data_simulator.session_config.num_sessions=2 \ + data_simulator.session_config.session_length=60' + sh 'rm -rf ./test_simulator' + } + } + } + } + // TODO: Enable test after 21.08 container is used. + // stage('L2: ASR DALI dev run') { + // when { + // anyOf { + // branch 'r1.17.0' + // changeRequest target: 'r1.17.0' + // } + // } + // failFast true + // parallel { + // stage('Speech to Text - DALI AudioToMelSpectrogramPreprocessor') { + // steps { + // sh 'python examples/asr/asr_ctc/speech_to_text_ctc.py \ + // model.train_ds.manifest_filepath=/home/TestData/an4_dataset/an4_train.json \ + // +model.train_ds.use_dali=True \ + // model.validation_ds.manifest_filepath=/home/TestData/an4_dataset/an4_val.json \ + // +model.validation_ds.use_dali=True \ + // trainer.devices=[0] \ + // trainer.accelerator="gpu" \ + // +trainer.fast_dev_run=True \ + // exp_manager.exp_dir=examples/asr/speech_to_text_results' + // sh 'rm -rf examples/asr/speech_to_text_results' + // } + // } + // stage('Speech to Text BPE - DALI AudioToMelSpectrogramPreprocessor') { + // steps { + // sh 'python examples/asr/asr_ctc/speech_to_text_bpe.py \ + // --config-path="../conf/citrinet/" --config-name="config_bpe" \ + // model.tokenizer.dir="/home/TestData/asr_tokenizers/an4_wpe_128/" \ + // model.tokenizer.type="wpe" \ + // model.train_ds.manifest_filepath=/home/TestData/an4_dataset/an4_train.json \ + // +model.train_ds.use_dali=True \ + // model.validation_ds.manifest_filepath=/home/TestData/an4_dataset/an4_val.json \ + // +model.validation_ds.use_dali=True \ + // trainer.devices=[0] \ + // trainer.accelerator="gpu" \ + // +trainer.fast_dev_run=True \ + // exp_manager.exp_dir=examples/asr/speech_to_text_wpe_results' + // sh 'rm -rf examples/asr/speech_to_text_wpe_results' + // } + // } + // // TODO: This would fail due to an unnecessary torchaudio import. 
+ // // To be enabled once torchaudio is available in the container used for CI + // // stage('Speech to Text - DALI AudioToMFCCPreprocessor') { + // // steps { + // // sh 'python examples/asr/asr_ctc/speech_to_text_ctc.py \ + // // model.train_ds.manifest_filepath=/home/TestData/an4_dataset/an4_train.json \ + // // +model.train_ds.use_dali=True \ + // // model.validation_ds.manifest_filepath=/home/TestData/an4_dataset/an4_val.json \ + // // +model.validation_ds.use_dali=True \ + // // model.preprocessor._target_=nemo.collections.asr.modules.AudioToMFCCPreprocessor \ + // // ~model.preprocessor.normalize \ + // // ~model.preprocessor.features \ + // // ~model.preprocessor.frame_splicing \ + // // ~model.preprocessor.dither \ + // // ~model.preprocessor.stft_conv \ + // // +model.n_mels=64 \ + // // +model.n_mfcc=64 \ + // // trainer.devices=[1] \ + // // trainer.accelerator="gpu" \ + // // +trainer.fast_dev_run=True \ + // // exp_manager.exp_dir=examples/asr/speech_to_text_results' + // // sh 'rm -rf examples/asr/speech_to_text_results' + // // } + // // } + // } + // } + + // TODO: Add back once CI is updated + // stage('L2: ASR RNNT dev run') { + // when { + // anyOf { + // branch 'r1.17.0' + // changeRequest target: 'r1.17.0' + // } + // } + // failFast true + // parallel { + // stage('Speech to Text - RNNT') { + // steps { + // sh 'STRICT_NUMBA_COMPAT_CHECK=false python examples/asr/asr_transducer/speech_to_text_rnnt.py \ + // --config-path="../conf/contextnet_rnnt/" --config-name="config_rnnt.yaml" \ + // model.train_ds.manifest_filepath=/home/TestData/an4_dataset/an4_train.json \ + // model.validation_ds.manifest_filepath=/home/TestData/an4_dataset/an4_val.json \ + // model.train_ds.batch_size=2 \ + // model.validation_ds.batch_size=2 \ + // trainer.devices=[0] \ + // trainer.accelerator="gpu" \ + // +trainer.fast_dev_run=True \ + // exp_manager.exp_dir=examples/asr/speech_to_text_rnnt_results' + // sh 'rm -rf examples/asr/speech_to_text_rnnt_results' + // } + // } + // stage('L2: Speech to Text RNNT WPE') { + // steps { + // sh 'STRICT_NUMBA_COMPAT_CHECK=false python examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py \ + // --config-path="../conf/contextnet_rnnt/" --config-name="config_rnnt_bpe.yaml" \ + // model.train_ds.manifest_filepath=/home/TestData/an4_dataset/an4_train.json \ + // model.validation_ds.manifest_filepath=/home/TestData/an4_dataset/an4_val.json \ + // model.train_ds.batch_size=2 \ + // model.validation_ds.batch_size=2 \ + // model.tokenizer.dir="/home/TestData/asr_tokenizers/an4_wpe_128/" \ + // model.tokenizer.type="wpe" \ + // trainer.devices=[0] \ + // trainer.accelerator="gpu" \ + // +trainer.fast_dev_run=True \ + // exp_manager.exp_dir=examples/asr/speech_to_text_rnnt_wpe_results' + // sh 'rm -rf examples/asr/speech_to_text_rnnt_wpe_results' + // } + // } + // stage('L3: Speech to Text Hybrid Transducer-CTC WPE') { + // steps { + // sh 'STRICT_NUMBA_COMPAT_CHECK=false python examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe.py \ + // --config-path="../conf/conformer/hybrid_transducer_ctc/conformer_hybrid_transducer_ctc/" --config-name="conformer_hybrid_transducer_ctc_bpe.yaml" \ + // model.train_ds.manifest_filepath=/home/TestData/an4_dataset/an4_train.json \ + // model.validation_ds.manifest_filepath=/home/TestData/an4_dataset/an4_val.json \ + // model.encoder.n_layers= 2 \ + // model.train_ds.batch_size=2 \ + // model.validation_ds.batch_size=2 \ + // model.tokenizer.dir="/home/TestData/asr_tokenizers/an4_wpe_128/" \ + // 
model.tokenizer.type="wpe" \ + // trainer.devices=[0] \ + // trainer.accelerator="gpu" \ + // +trainer.fast_dev_run=True \ + // exp_manager.exp_dir=examples/asr/speech_to_text_hybrid_transducer_ctc_wpe_results' + // sh 'rm -rf examples/asr/speech_to_text_hybrid_transducer_ctc_wpe_results' + // } + // } + // } + // } + + // stage('L2: Hybrid ASR RNNT-CTC dev run') { + // when { + // anyOf { + // branch 'r1.17.0' + // changeRequest target: 'r1.17.0' + // } + // } + // failFast true + // parallel { + // stage('Speech to Text Hybrid Transducer-CTC WPE') { + // steps { + // sh 'STRICT_NUMBA_COMPAT_CHECK=false python examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe.py \ + // --config-path="../conf/conformer/hybrid_transducer_ctc/conformer_hybrid_transducer_ctc/" --config-name="conformer_hybrid_transducer_ctc_bpe.yaml" \ + // model.train_ds.manifest_filepath=/home/TestData/an4_dataset/an4_train.json \ + // model.validation_ds.manifest_filepath=/home/TestData/an4_dataset/an4_val.json \ + // model.encoder.n_layers= 2 \ + // model.train_ds.batch_size=2 \ + // model.validation_ds.batch_size=2 \ + // model.tokenizer.dir="/home/TestData/asr_tokenizers/an4_wpe_128/" \ + // model.tokenizer.type="wpe" \ + // trainer.devices=[0] \ + // trainer.accelerator="gpu" \ + // +trainer.fast_dev_run=True \ + // exp_manager.exp_dir=examples/asr/speech_to_text_hybrid_transducer_ctc_wpe_results' + // sh 'rm -rf examples/asr/speech_to_text_hybrid_transducer_ctc_wpe_results' + // } + // } + // } + // } + + stage('L2: ASR Multi-dataloader dev run') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel { + stage('Speech to Text multi-dataloader') { + steps { + sh 'python examples/asr/asr_ctc/speech_to_text_ctc.py \ + model.train_ds.manifest_filepath=/home/TestData/an4_dataset/an4_train.json \ + model.validation_ds.manifest_filepath=[/home/TestData/an4_dataset/an4_val.json,/home/TestData/an4_dataset/an4_val.json] \ + trainer.devices=[0] \ + trainer.accelerator="gpu" \ + trainer.max_epochs=1 \ + trainer.max_steps=1 \ + +trainer.num_sanity_val_steps=1 \ + exp_manager.exp_dir=examples/asr/speech_to_text_results' + sh 'rm -rf examples/asr/speech_to_text_results' + } + } + + stage('Speech to Label multi-dataloader') { + steps { + sh 'python examples/asr/speech_classification/speech_to_label.py \ + model.train_ds.manifest_filepath=/home/TestData/speech_commands/train_manifest.json \ + model.validation_ds.manifest_filepath=[/home/TestData/speech_commands/test_manifest.json,/home/TestData/speech_commands/test_manifest.json] \ + trainer.devices=[1] \ + trainer.accelerator="gpu" \ + trainer.max_epochs=1 \ + trainer.max_steps=1 \ + +trainer.num_sanity_val_steps=1 \ + model.preprocessor._target_=nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor \ + ~model.preprocessor.window_size \ + ~model.preprocessor.window_stride \ + ~model.preprocessor.window \ + ~model.preprocessor.n_mels \ + ~model.preprocessor.n_mfcc \ + ~model.preprocessor.n_fft \ + exp_manager.exp_dir=examples/asr/speech_to_label_results' + sh 'rm -rf examples/asr/speech_to_label_results' + } + } + } + } + + stage('L2: ASR Adapters') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel { + stage('Linear Adapters') { + steps { + sh 'python examples/asr/asr_adapters/train_asr_adapter.py \ + model.pretrained_model="stt_en_conformer_ctc_small" \ + model.adapter.adapter_name="an4" \ + model.adapter.linear.in_features=176 \ + 
model.train_ds.manifest_filepath=/home/TestData/an4_dataset/an4_train.json \ + model.validation_ds.manifest_filepath=/home/TestData/an4_dataset/an4_val.json \ + trainer.max_steps=5 \ + trainer.devices=[0] \ + trainer.accelerator="gpu" \ + +trainer.fast_dev_run=True \ + exp_manager.exp_dir=examples/asr/speech_to_text_adapters_results' + sh 'rm -rf examples/asr/speech_to_text_adapters_results' + } + } + stage('RelPos MHA Adapters') { + steps { + sh 'python examples/asr/asr_adapters/train_asr_adapter.py \ + model.pretrained_model="stt_en_conformer_ctc_small" \ + model.adapter.adapter_name="encoder:an4" \ + model.adapter.adapter_type="tiny_attn" \ + model.adapter.tiny_attn.n_feat=176 \ + model.train_ds.manifest_filepath=/home/TestData/an4_dataset/an4_train.json \ + model.validation_ds.manifest_filepath=/home/TestData/an4_dataset/an4_val.json \ + trainer.max_steps=5 \ + trainer.devices=[0] \ + trainer.accelerator="gpu" \ + +trainer.fast_dev_run=True \ + exp_manager.exp_dir=examples/asr/speech_to_text_adapters_mha_results' + sh 'rm -rf examples/asr/speech_to_text_adapters_mha_results' + } + } + + } + } + stage('L2: Megatron T5 Adapter PP=2') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel{ + stage('T5 Adapter tuning & inference TP=1 PP=2') { + steps { + sh "python examples/nlp/language_modeling/tuning/megatron_t5_adapter_tuning.py \ + --config-name=megatron_t5_adapter_tuning_config \ + name='test_tp1_pp2' \ + exp_manager.exp_dir='examples/adapter_tuning' \ + trainer.devices=2 \ + trainer.max_steps=1 \ + trainer.val_check_interval=1 \ + trainer.max_epochs=null \ + model.data.num_workers=1 \ + model.tensor_model_parallel_size=1 \ + model.pipeline_model_parallel_size=2 \ + model.language_model_path='/home/TestData/nlp/megatron_t5/8m/megatron_t5_8m_tp1_pp2.nemo' \ + model.existing_tasks=[] \ + model.new_tasks=['rte'] \ + model.data.train_ds=['/home/TestData/nlp/prompt_learning/rte_CI_test.jsonl'] \ + model.data.validation_ds=['/home/TestData/nlp/prompt_learning/rte_CI_test.jsonl'] \ + model.global_batch_size=4" + sh "python examples/nlp/language_modeling/tuning/megatron_t5_adapter_eval.py \ + --config-name=megatron_t5_adapter_inference \ + adapter_model_file='examples/adapter_tuning/test_tp1_pp2.nemo' \ + language_model_path='/home/TestData/nlp/megatron_t5/8m/megatron_t5_8m_tp1_pp2.nemo' \ + trainer.devices=2 \ + data.num_workers=1 \ + tensor_model_parallel_size=1 \ + pipeline_model_parallel_size=2 \ + data.global_batch_size=2 \ + data.micro_batch_size=2 \ + data.test_ds=['/home/TestData/nlp/prompt_learning/rte_CI_test.jsonl'] \ + pred_file_path='examples/adapter_tuning/test_tp1_pp2/preds.txt'" + sh "rm -rf examples/adapter_tuning/test_tp1_pp2.nemo" + sh "rm -rf examples/adapter_tuning/test_tp1_pp2" + } + } + } + } + stage('L2: Megatron T5 Adapter TP=2') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel{ + stage('T5 Adapter tuning & inference TP=2 PP=1') { + steps { + sh "python examples/nlp/language_modeling/tuning/megatron_t5_adapter_tuning.py \ + --config-name=megatron_t5_adapter_tuning_config \ + name='test_tp2_pp1' \ + exp_manager.exp_dir='examples/adapter_tuning' \ + trainer.devices=2 \ + trainer.max_steps=1 \ + trainer.val_check_interval=1 \ + trainer.max_epochs=null \ + model.data.num_workers=1 \ + model.tensor_model_parallel_size=2 \ + model.language_model_path='/home/TestData/nlp/megatron_t5/8m/megatron_t5_8m_tp2.nemo' \ + model.existing_tasks=[] \ + 
model.new_tasks=['rte'] \ + model.data.train_ds=['/home/TestData/nlp/prompt_learning/rte_CI_test.jsonl'] \ + model.data.validation_ds=['/home/TestData/nlp/prompt_learning/rte_CI_test.jsonl'] \ + model.global_batch_size=4" + sh "python examples/nlp/language_modeling/tuning/megatron_t5_adapter_eval.py \ + --config-name=megatron_t5_adapter_inference \ + adapter_model_file='examples/adapter_tuning/test_tp2_pp1.nemo' \ + language_model_path='/home/TestData/nlp/megatron_t5/8m/megatron_t5_8m_tp2.nemo' \ + trainer.devices=2 \ + tensor_model_parallel_size=2 \ + data.global_batch_size=2 \ + data.micro_batch_size=2 \ + data.num_workers=1 \ + data.test_ds=['/home/TestData/nlp/prompt_learning/rte_CI_test.jsonl'] \ + pred_file_path='examples/adapter_tuning/test_tp2_pp1/preds.txt'" + sh "rm -rf examples/adapter_tuning/test_tp2_pp1.nemo" + sh "rm -rf examples/adapter_tuning/test_tp2_pp1" + } + } + } + } + stage('L2: Megatron T5 IA3 PP=2') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel{ + stage('T5 IA3 tuning & inference TP=1 PP=2') { + steps { + sh "python examples/nlp/language_modeling/tuning/megatron_t5_ia3_tuning.py \ + --config-name=megatron_t5_ia3_tuning_config \ + name='test_tp1_pp2' \ + exp_manager.exp_dir='examples/ia3_tuning' \ + trainer.devices=2 \ + trainer.max_steps=1 \ + trainer.val_check_interval=1 \ + trainer.max_epochs=null \ + model.data.num_workers=1 \ + model.tensor_model_parallel_size=1 \ + model.pipeline_model_parallel_size=2 \ + model.language_model_path='/home/TestData/nlp/megatron_t5/8m/megatron_t5_8m_tp1_pp2.nemo' \ + model.existing_tasks=[] \ + model.new_tasks=['rte'] \ + model.data.train_ds=['/home/TestData/nlp/prompt_learning/rte_CI_test.jsonl'] \ + model.data.validation_ds=['/home/TestData/nlp/prompt_learning/rte_CI_test.jsonl'] \ + model.global_batch_size=4" + sh "python examples/nlp/language_modeling/tuning/megatron_t5_ia3_eval.py \ + --config-name=megatron_t5_ia3_inference \ + adapter_model_file='examples/ia3_tuning/test_tp1_pp2.nemo' \ + language_model_path='/home/TestData/nlp/megatron_t5/8m/megatron_t5_8m_tp1_pp2.nemo' \ + trainer.devices=2 \ + data.num_workers=1 \ + tensor_model_parallel_size=1 \ + pipeline_model_parallel_size=2 \ + data.global_batch_size=2 \ + data.micro_batch_size=2 \ + data.test_ds=['/home/TestData/nlp/prompt_learning/rte_CI_test.jsonl'] \ + pred_file_path='examples/ia3_tuning/test_tp1_pp2/preds.txt'" + sh "rm -rf examples/ia3_tuning/test_tp1_pp2.nemo" + sh "rm -rf examples/ia3_tuning/test_tp1_pp2" + } + } + } + } + stage('L2: Megatron T5 IA3 TP=2') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel{ + stage('T5 IA3 tuning & inference TP=2 PP=1') { + steps { + sh "python examples/nlp/language_modeling/tuning/megatron_t5_ia3_tuning.py \ + --config-name=megatron_t5_ia3_tuning_config \ + name='test_tp2_pp1' \ + exp_manager.exp_dir='examples/ia3_tuning' \ + trainer.devices=2 \ + trainer.max_steps=1 \ + trainer.val_check_interval=1 \ + trainer.max_epochs=null \ + model.data.num_workers=1 \ + model.tensor_model_parallel_size=2 \ + model.language_model_path='/home/TestData/nlp/megatron_t5/8m/megatron_t5_8m_tp2.nemo' \ + model.existing_tasks=[] \ + model.new_tasks=['rte'] \ + model.data.train_ds=['/home/TestData/nlp/prompt_learning/rte_CI_test.jsonl'] \ + model.data.validation_ds=['/home/TestData/nlp/prompt_learning/rte_CI_test.jsonl'] \ + model.global_batch_size=4" + sh "python examples/nlp/language_modeling/tuning/megatron_t5_ia3_eval.py \ 
+ --config-name=megatron_t5_ia3_inference \ + adapter_model_file='examples/ia3_tuning/test_tp2_pp1.nemo' \ + language_model_path='/home/TestData/nlp/megatron_t5/8m/megatron_t5_8m_tp2.nemo' \ + trainer.devices=2 \ + data.num_workers=1 \ + tensor_model_parallel_size=2 \ + data.global_batch_size=2 \ + data.micro_batch_size=2 \ + data.test_ds=['/home/TestData/nlp/prompt_learning/rte_CI_test.jsonl'] \ + pred_file_path='examples/ia3_tuning/test_tp2_pp1/preds.txt'" + sh "rm -rf examples/ia3_tuning/test_tp2_pp1.nemo" + sh "rm -rf examples/ia3_tuning/test_tp2_pp1" + } + } + } + } + stage('L2: Megatron GPT Adapter TP=2') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel{ + stage('GPT Adapter tuning & inference TP=2 PP=1') { + steps { + sh "python examples/nlp/language_modeling/tuning/megatron_gpt_adapter_tuning.py \ + --config-name=megatron_gpt_adapter_tuning_config \ + name='test_tp2_pp1' \ + exp_manager.exp_dir='examples/adapter_tuning' \ + trainer.devices=2 \ + trainer.max_steps=1 \ + trainer.val_check_interval=1 \ + trainer.max_epochs=null \ + model.data.num_workers=1 \ + model.tensor_model_parallel_size=2 \ + model.language_model_path='/home/TestData/nlp/megatron_gpt/tiny/megatron_14m_gpt_tp2_pp1.nemo' \ + model.existing_tasks=[] \ + model.new_tasks=['rte'] \ + model.data.train_ds=['/home/TestData/nlp/prompt_learning/rte_CI_test.jsonl'] \ + model.data.validation_ds=['/home/TestData/nlp/prompt_learning/rte_CI_test.jsonl'] \ + model.global_batch_size=4" + sh "python examples/nlp/language_modeling/tuning/megatron_gpt_adapter_eval.py \ + --config-name=megatron_gpt_adapter_inference \ + adapter_model_file='examples/adapter_tuning/test_tp2_pp1.nemo' \ + gpt_model_file='/home/TestData/nlp/megatron_gpt/tiny/megatron_14m_gpt_tp2_pp1.nemo' \ + inference.greedy=True \ + num_workers=1 \ + inference.add_BOS=False \ + trainer.devices=2 \ + tensor_model_parallel_size=2 \ + data_paths=['/home/TestData/nlp/prompt_learning/rte_CI_test.jsonl']" + sh "rm -rf examples/adapter_tuning/test_tp2_pp1.nemo" + sh "rm -rf examples/adapter_tuning/test_tp2_pp1" + } + } + } + } + stage('L2: Megatron GPT Adapter PP=2') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel{ + stage('GPT Adapter tuning & inference TP=1 PP=2') { + steps { + sh "python examples/nlp/language_modeling/tuning/megatron_gpt_adapter_tuning.py \ + --config-name=megatron_gpt_adapter_tuning_config \ + name='test_tp1_pp2' \ + exp_manager.exp_dir='examples/adapter_tuning' \ + trainer.devices=2 \ + trainer.max_steps=1 \ + trainer.val_check_interval=1 \ + trainer.max_epochs=null \ + model.data.num_workers=1 \ + model.tensor_model_parallel_size=1 \ + model.pipeline_model_parallel_size=2 \ + model.language_model_path='/home/TestData/nlp/megatron_gpt/tiny/megatron_14m_gpt_tp1_pp2.nemo' \ + model.existing_tasks=[] \ + model.new_tasks=['rte'] \ + model.data.train_ds=['/home/TestData/nlp/prompt_learning/rte_CI_test.jsonl'] \ + model.data.validation_ds=['/home/TestData/nlp/prompt_learning/rte_CI_test.jsonl'] \ + model.global_batch_size=4" + sh "python examples/nlp/language_modeling/tuning/megatron_gpt_adapter_eval.py \ + --config-name=megatron_gpt_adapter_inference \ + adapter_model_file='examples/adapter_tuning/test_tp1_pp2.nemo' \ + gpt_model_file='/home/TestData/nlp/megatron_gpt/tiny/megatron_14m_gpt_tp1_pp2.nemo' \ + inference.greedy=True \ + inference.add_BOS=False \ + trainer.devices=2 \ + num_workers=1 \ + tensor_model_parallel_size=2 \ + 
data_paths=['/home/TestData/nlp/prompt_learning/rte_CI_test.jsonl']" + sh "rm -rf examples/adapter_tuning/test_tp1_pp2.nemo" + sh "rm -rf examples/adapter_tuning/test_tp1_pp2" + } + } + } + } + stage('L2: Speech Transcription') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel { + stage('Speech to Text Transcribe') { + steps { + sh 'python examples/asr/transcribe_speech.py \ + pretrained_name="QuartzNet15x5Base-En" \ + audio_dir="/home/TestData/an4_transcribe/test_subset/" \ + output_filename="stt_test_res.json" \ + amp=true' + sh 'rm -rf stt_test_res.json' + } + } + } + } + stage('L2: Transducer alignment') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel { + stage('Running pytest') { + steps { + sh 'pytest tests/collections/asr/decoding/rnnt_alignments_check.py --durations=-1' + } + } + } + } + + stage('L2: Segmentation Tool') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + stages { + stage('Install ctc_segmentation requirements') { + steps { + sh 'cd tools/ctc_segmentation && \ + pip install -r requirements.txt && \ + apt-get update && apt-get install libsox-fmt-all -y' + } + } + + stage('Parallel ctc_segmentation test') { + failFast true + parallel { + stage('L2: Eng CitriNet with .wav') { + steps { + sh 'cd tools/ctc_segmentation && \ + TIME=`date +"%Y-%m-%d-%T"` && \ + /bin/bash run_segmentation.sh \ + --MODEL_NAME_OR_PATH="stt_en_citrinet_512_gamma_0_25" \ + --DATA_DIR=/home/TestData/ctc_segmentation/eng \ + --OUTPUT_DIR=/home/TestData/ctc_segmentation/eng/output${TIME} \ + --LANGUAGE=en \ + --USE_NEMO_NORMALIZATION="TRUE" && \ + python /home/TestData/ctc_segmentation/verify_alignment.py \ + -r /home/TestData/ctc_segmentation/eng/eng_valid_segments_1.7.txt \ + -g /home/TestData/ctc_segmentation/eng/output${TIME}/verified_segments/nv_test_segments.txt && \ + rm -rf /home/TestData/ctc_segmentation/eng/output${TIME}' + } + } + stage('L2: Ru QN with mp3') { + steps { + sh 'cd tools/ctc_segmentation && \ + TIME=`date +"%Y-%m-%d-%T"` && \ + /bin/bash run_segmentation.sh \ + --MODEL_NAME_OR_PATH=/home/TestData/ctc_segmentation/QuartzNet15x5-Ru-e512-wer14.45.nemo \ + --DATA_DIR=/home/TestData/ctc_segmentation/ru \ + --OUTPUT_DIR=/home/TestData/ctc_segmentation/ru/output${TIME} \ + --LANGUAGE=ru \ + --ADDITIONAL_SPLIT_SYMBOLS=";" && \ + python /home/TestData/ctc_segmentation/verify_alignment.py \ + -r /home/TestData/ctc_segmentation/ru/valid_ru_segments_1.7.txt \ + -g /home/TestData/ctc_segmentation/ru/output${TIME}/verified_segments/ru_segments.txt && \ + rm -rf /home/TestData/ctc_segmentation/ru/output${TIME}' + } + } + } + } + } + } + + stage('L2: G2P Models') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel { + stage('G2P Conformer training, evaluation and inference') { + steps { + sh 'cd examples/tts/g2p && \ + TIME=`date +"%Y-%m-%d-%T"` && OUTPUT_DIR_CONFORMER=output_ctc_${TIME} && \ + python g2p_train_and_evaluate.py \ + train_manifest=/home/TestData/g2p/g2p.json \ + validation_manifest=/home/TestData/g2p/g2p.json \ + model.test_ds.manifest_filepath=/home/TestData/g2p/g2p.json \ + model.tokenizer.dir=/home/TestData/g2p/tokenizer_spe_unigram_v512 \ + trainer.max_epochs=1 \ + model.max_source_len=64 \ + trainer.devices=[0] \ + do_training=True \ + do_testing=True \ + exp_manager.exp_dir=${OUTPUT_DIR_CONFORMER} \ + +exp_manager.use_datetime_version=False\ + 
+exp_manager.version=test \ + --config-name=g2p_conformer_ctc && \ + python g2p_inference.py \ + pretrained_model=${OUTPUT_DIR_CONFORMER}/G2P-Conformer-CTC/test/checkpoints/G2P-Conformer-CTC.nemo \ + manifest_filepath=/home/TestData/g2p/g2p.json \ + phoneme_field=text' + } + } + stage('ByT5G2P training, evaluation and inference') { + steps { + sh 'TRANSFORMERS_OFFLINE=0 && cd examples/tts/g2p && \ + TIME=`date +"%Y-%m-%d-%T"` && OUTPUT_DIR_T5=output_byt5_${TIME} && \ + python g2p_train_and_evaluate.py \ + train_manifest=/home/TestData/g2p/g2p.json \ + validation_manifest=/home/TestData/g2p/g2p.json \ + model.test_ds.manifest_filepath=/home/TestData/g2p/g2p.json \ + trainer.max_epochs=1 \ + model.max_source_len=64 \ + trainer.devices=[1] \ + do_training=True \ + do_testing=True \ + exp_manager.exp_dir=${OUTPUT_DIR_T5} \ + +exp_manager.use_datetime_version=False\ + +exp_manager.version=test && \ + python g2p_inference.py \ + pretrained_model=${OUTPUT_DIR_T5}/T5G2P/test/checkpoints/T5G2P.nemo \ + manifest_filepath=/home/TestData/g2p/g2p.json \ + phoneme_field=text && TRANSFORMERS_OFFLINE=1' + } + } + stage('HeteronymClassificationModel training, evaluation and inference') { + steps { + sh 'cd examples/tts/g2p && \ + TIME=`date +"%Y-%m-%d-%T"` && OUTPUT_DIR=output_${TIME} && \ + python g2p_heteronym_classification_train_and_evaluate.py \ + train_manifest=/home/TestData/g2p/manifest.json \ + validation_manifest=/home/TestData/g2p/manifest.json \ + test_manifest=/home/TestData/g2p/manifest.json \ + model.wordids=/home/TestData/g2p/wordids.tsv \ + trainer.max_epochs=1 \ + model.max_seq_length=64 \ + do_training=True \ + do_testing=True \ + exp_manager.exp_dir=${OUTPUT_DIR} \ + +exp_manager.use_datetime_version=False\ + +exp_manager.version=test && \ + python g2p_heteronym_classification_inference.py \ + manifest=/home/TestData/g2p/manifest.json \ + pretrained_model=${OUTPUT_DIR}/HeteronymClassification/test/checkpoints/HeteronymClassification.nemo \ + output_manifest=preds.json' + } + } + } + } + + // TODO: add test once megatron-bert is supported again + // stage('L2: Multi-GPU Megatron finetuning') { + // when { + // anyOf { + // branch 'r1.17.0' + // changeRequest target: 'r1.17.0' + // } + // } + // failFast true + // parallel { + // stage('L2: Cased Megatron finetuning on MRPC') { + // steps { + // sh 'cd examples/nlp/glue_benchmark && \ + // python glue_benchmark.py \ + // model.dataset.data_dir=/home/TestData/nlp/glue_fake/MRPC \ + // trainer.devices=[0,1] \ + // trainer.accelerator="gpu" \ + // +trainer.fast_dev_run=true \ + // model.dataset.use_cache=false \ + // model.language_model.pretrained_model_name=megatron-bert-345m-cased \ + // trainer.accelerator=gpu \ + // trainer.strategy=ddp \ + // exp_manager=null' + // } + // } + // } + // } + + stage('L2: STS-b') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel { + stage('GLUE STS-b with AlBERT') { + steps { + sh 'python examples/nlp/glue_benchmark/glue_benchmark.py \ + model.dataset.use_cache=false \ + model.task_name=sts-b \ + model.dataset.data_dir=/home/TestData/nlp/glue_fake/STS-B \ + trainer.devices=[1] \ + trainer.accelerator="gpu" \ + +trainer.fast_dev_run=True \ + model.language_model.pretrained_model_name=albert-base-v1 \ + exp_manager=null' + } + } + stage('Test Restore Punctuation & Capitalization with AlBERT') { + steps { + sh 'data_dir="$(mktemp -d -p "$(pwd)")" && \ + cp /home/TestData/nlp/token_classification_punctuation/*.txt "${data_dir}"/ && \ + python 
examples/nlp/token_classification/punctuation_capitalization_train_evaluate.py \ + +do_training=false \ + +do_testing=true \ + pretrained_model=/home/TestData/nlp/pretrained_models/Punctuation_and_Capitalization_albert.nemo \ + +model.test_ds.use_cache=false \ + ~model.train_ds \ + ~model.validation_ds \ + model.test_ds.ds_item="${data_dir}" \ + trainer.devices=[1] \ + trainer.accelerator="gpu" \ + exp_manager=null && \ + rm -rf "${data_dir}"' + } + } +// stage('Test Restore Punctuation & Capitalization with RoBERTa') { +// steps { +// sh 'data_dir="$(mktemp -d -p "$(pwd)")" && \ +// cp /home/TestData/nlp/token_classification_punctuation/*.txt "${data_dir}"/ && \ +// python examples/nlp/token_classification/punctuation_capitalization_train_evaluate.py \ +// +do_training=false \ +// +do_testing=true \ +// pretrained_model=/home/TestData/nlp/pretrained_models/Punctuation_and_Capitalization_roberta.nemo \ +// +model.test_ds.use_cache=false \ +// ~model.train_ds \ +// ~model.validation_ds \ +// model.test_ds.ds_item="${data_dir}" \ +// trainer.devices=[1] \ +// trainer.accelerator="gpu" \ +// exp_manager=null && \ +// rm -rf "${data_dir}"' +// } +// } + } + } + stage('L2: Dialogue Classification') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel { + stage('Dialogue: Intent and slot classification using GPT') { + steps { + sh 'TRANSFORMERS_OFFLINE=0 && cd examples/nlp/dialogue && \ + python dialogue.py \ + model.dataset.data_dir=/home/TestData/nlp/sgd_small \ + model.language_model.lm_checkpoint=/home/TestData/nlp/gpt2/pytorch_model.bin\ + model.tokenizer.vocab_file=/home/TestData/nlp/gpt2/vocab.json\ + model.dataset.dialogues_example_dir=sgd_gen_outputs \ + model.dataset.task_name=debug_sample \ + trainer.max_steps=1 \ + trainer.max_epochs=1 \ + model.train_ds.batch_size=2 \ + model.validation_ds.batch_size=2 \ + model.test_ds.batch_size=2 \ + model.nemo_path=null \ + trainer.val_check_interval=0.0 \ + trainer.devices=[0] \ + model.dataset.use_cache=false \ + model.tokenizer.special_tokens={pad_token:"endoftext"} \ + model.tokenizer.tokenizer_name=gpt2 \ + model.tokenizer.vocab_file=/home/TestData/nlp/gpt2/vocab.json\ + model.language_model.pretrained_model_name=/home/TestData/nlp/gpt2 \ + trainer.accelerator=gpu \ + exp_manager=null && \ + rm -rf sgd_gen_outputs' + } + } + stage('Intent and slot classification using SGDQA') { + steps { + sh 'TRANSFORMERS_OFFLINE=0 && cd examples/nlp/dialogue && \ + python dialogue.py \ + model.dataset.data_dir=/home/TestData/nlp/sgd_small \ + model.dataset.dialogues_example_dir=sgd_gen_bert_outputs \ + model.dataset.task_name=debug_sample \ + trainer.max_steps=1 \ + trainer.max_epochs=1 \ + model.train_ds.batch_size=2 \ + model.validation_ds.batch_size=2 \ + model.test_ds.batch_size=2 \ + model.dataset.num_tasks=6 \ + model.nemo_path=null \ + trainer.val_check_interval=0.0 \ + trainer.devices=[0] \ + model.dataset.use_cache=false \ + model.language_model.pretrained_model_name=bert-base-cased \ + trainer.accelerator=gpu \ + exp_manager=null && \ + rm -rf sgd_gen_bert_outputs' + } + } + stage('Intent and slot classification using IntentSlotClassificationModel') { + steps { + sh 'TRANSFORMERS_OFFLINE=0 && cd examples/nlp/dialogue && \ + python dialogue.py \ + model.dataset.data_dir=/home/TestData/nlp/processed_assistant \ + model.dataset.dialogues_example_dir=sgd_gen_bert_intent_classification_outputs \ + model.dataset.task=assistant \ + trainer.max_steps=1 \ + trainer.max_epochs=1 \ + 
model.train_ds.batch_size=2 \ + model.validation_ds.batch_size=2 \ + model.test_ds.batch_size=2 \ + model.nemo_path=null \ + trainer.val_check_interval=0.0 \ + trainer.devices=[0] \ + model.dataset.use_cache=false \ + model.language_model.pretrained_model_name=bert-base-uncased \ + trainer.accelerator=gpu \ + exp_manager=null && \ + rm -rf sgd_gen_bert_intent_classification_outputs && TRANSFORMERS_OFFLINE=1' + } + } + stage('Intent classification using ZeroShotIntentModel') { + steps { + sh 'TRANSFORMERS_OFFLINE=0 && cd examples/nlp/dialogue && \ + python dialogue.py \ + do_training=False \ + model.dataset.data_dir=/home/TestData/nlp/drive_thru_revised \ + model.original_nemo_checkpoint=/home/TestData/nlp/drive_thru_revised/zeroshotintent_en_bert_base_uncased.nemo \ + model.dataset.dialogues_example_dir=sgd_gen_zero_shot_intent_classification_outputs \ + model.dataset.task=zero_shot \ + model.dataset.prompt_template="This example is" \ + trainer.max_steps=1 \ + trainer.max_epochs=1 \ + model.train_ds.batch_size=2 \ + model.validation_ds.batch_size=2 \ + model.test_ds.batch_size=2 \ + model.nemo_path=null \ + trainer.val_check_interval=0.0 \ + trainer.devices=[1] \ + model.dataset.use_cache=false \ + model.language_model.pretrained_model_name=bert-base-uncased \ + trainer.accelerator=gpu \ + exp_manager=null && \ + rm -rf sgd_gen_zero_shot_intent_classification_outputs && TRANSFORMERS_OFFLINE=1' + } + } + stage('Design Intent classification using ZeroShotIntentModel') { + steps { + sh 'TRANSFORMERS_OFFLINE=0 && cd examples/nlp/dialogue && \ + python dialogue.py \ + do_training=False \ + model.dataset.data_dir=/home/TestData/nlp/design_dataset \ + model.original_nemo_checkpoint=/home/TestData/nlp/drive_thru_revised/zeroshotintent_en_bert_base_uncased.nemo \ + model.dataset.dialogues_example_dir=design_zero_shot_intent_classification_outputs \ + model.dataset.task=design \ + model.dataset.prompt_template="This example is related to" \ + model.library=megatron \ + trainer.max_steps=1 \ + trainer.max_epochs=1 \ + model.train_ds.batch_size=2 \ + model.validation_ds.batch_size=2 \ + model.test_ds.batch_size=2 \ + model.nemo_path=null \ + trainer.val_check_interval=0.0 \ + trainer.devices=[1] \ + model.dataset.use_cache=false \ + model.language_model.pretrained_model_name=bert-base-uncased \ + trainer.accelerator=gpu \ + exp_manager=null && \ + rm -rf design_zero_shot_intent_classification_outputs && TRANSFORMERS_OFFLINE=1' + } + } + stage('Design Intent classification using ZeroShotIntentModel BART Classifier') { + steps { + sh 'TRANSFORMERS_OFFLINE=0 && cd examples/nlp/dialogue && \ + python dialogue.py \ + do_training=False \ + model.dataset.data_dir=/home/TestData/nlp/design_dataset \ + model.original_nemo_checkpoint=/home/TestData/nlp/drive_thru_revised/zeroshotintent_en_bert_base_uncased.nemo \ + model.dataset.dialogues_example_dir=design_zero_shot_intent_classification_bart_outputs \ + model.dataset.task=design \ + model.dataset.prompt_template="This example is related to" \ + model.library=huggingface \ + trainer.devices=[1] \ + model.dataset.use_cache=false \ + model.language_model.pretrained_model_name=bert-base-uncased \ + trainer.accelerator=gpu \ + exp_manager=null && \ + rm -rf design_zero_shot_intent_classification_bart_outputs && TRANSFORMERS_OFFLINE=1' + } + } + stage('Design Intent classification using DialogueNearestNeighbourModel') { + steps { + sh 'TRANSFORMERS_OFFLINE=0 && cd examples/nlp/dialogue && \ + python dialogue.py \ + do_training=False \ + 
model.dataset.data_dir=/home/TestData/nlp/design_dataset \ + model.dataset.dialogues_example_dir=design_dialogue_nearest_neighbour_classification_outputs \ + model.dataset.task=design \ + model.dataset.prompt_template="" \ + model.library=huggingface \ + trainer.devices=[0] \ + model.dataset.use_cache=false \ + model.language_model.pretrained_model_name=sentence-transformers/all-MiniLM-L6-v2 \ + trainer.accelerator=gpu \ + exp_manager=null && \ + rm -rf design_dialogue_nearest_neighbour_classification_outputs && TRANSFORMERS_OFFLINE=1' + } + } + } + } + stage('L2: Dialogue Generation') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel { + stage('Dialogue: Answer Extender using DialogueS2SGenerationModel') { + steps { + sh 'TRANSFORMERS_OFFLINE=0 && cd examples/nlp/dialogue && \ + python dialogue.py \ + do_training=False \ + model.dataset.data_dir=/home/TestData/nlp/ms-marco-qa \ + model.dataset.dialogues_example_dir=answer_extender_s2s \ + model.dataset.task=ms_marco \ + model.library=huggingface \ + model.dataset.debug_mode=True \ + trainer.max_steps=1 \ + trainer.max_epochs=1 \ + model.train_ds.batch_size=2 \ + model.validation_ds.batch_size=2 \ + model.test_ds.batch_size=2 \ + model.nemo_path=null \ + trainer.val_check_interval=0.0 \ + trainer.devices=[1] \ + model.dataset.use_cache=false \ + model.language_model.pretrained_model_name=facebook/bart-large \ + trainer.accelerator=gpu \ + exp_manager=null && \ + rm -rf answer_extender_s2s' + } + } + stage('Dialogue: SGD Based Answer Extender using DialogueS2SGenerationModel') { + steps { + sh 'TRANSFORMERS_OFFLINE=0 && cd examples/nlp/dialogue && \ + python dialogue.py \ + do_training=False \ + model.dataset.data_dir=/home/TestData/nlp/sgd_small \ + model.dataset.dialogues_example_dir=sgd_answer_extender_s2s \ + model.dataset.task_name=debug_sample \ + model.dataset.task=sgd_generation \ + model.dataset.input_field=utterance+system_actions \ + model.dataset.output_field=system_utterance \ + model.dataset.use_cache=false \ + model.dataset.system_utterance=next_turn \ + model.dataset.debug_mode=True \ + model.dataset.prompt_template=slots_values \ + model.library=huggingface \ + trainer.max_steps=1 \ + trainer.max_epochs=1 \ + model.train_ds.batch_size=2 \ + model.validation_ds.batch_size=2 \ + model.test_ds.batch_size=2 \ + model.nemo_path=null \ + trainer.val_check_interval=0.0 \ + trainer.devices=[0] \ + model.language_model.pretrained_model_name=facebook/bart-large \ + trainer.accelerator=gpu \ + exp_manager=null && \ + rm -rf sgd_answer_extender_s2s' + } + } + } + } +// stage('L2: Dialogue Generation Part 2') { +// when { +// anyOf { +// branch 'r1.17.0' +// changeRequest target: 'r1.17.0' +// } +// } +// failFast true +// parallel { +// stage('Dialogue: Answer Extender using DialogueGPTGenerationModel') { +// steps { +// sh 'TRANSFORMERS_OFFLINE=0 && cd examples/nlp/dialogue && \ +// python dialogue.py \ +// do_training=False \ +// model.dataset.data_dir=/home/TestData/nlp/ms-marco-qa \ +// model.dataset.dialogues_example_dir=answer_extender \ +// model.library=huggingface \ +// model.dataset.task=ms_marco \ +// model.dataset.debug_mode=True \ +// trainer.val_check_interval=0.0 \ +// trainer.devices=[0] \ +// model.dataset.use_cache=false \ +// model.language_model.pretrained_model_name=gpt2 \ +// trainer.accelerator=gpu \ +// exp_manager=null && \ +// rm -rf answer_extender' +// } +// } +// } +// } + stage('L2: COPY') { + when { + anyOf { + branch 'r1.17.0' + changeRequest 
target: 'r1.17.0' + } + } + failFast true + parallel { + stage('Dialogue: Answer Extender using DialogueGPTGenerationModel') { + steps { + sh 'TRANSFORMERS_OFFLINE=0 && cd examples/nlp/dialogue && \ + python dialogue.py \ + do_training=False \ + model.dataset.data_dir=/home/TestData/nlp/ms-marco-qa \ + model.dataset.dialogues_example_dir=answer_extender \ + model.library=huggingface \ + model.dataset.task=ms_marco \ + model.dataset.debug_mode=True \ + trainer.val_check_interval=0.0 \ + trainer.devices=[0] \ + model.dataset.use_cache=false \ + model.language_model.pretrained_model_name=gpt2 \ + trainer.accelerator=gpu \ + exp_manager=null && \ + rm -rf answer_extender' + } + } + } + } + stage('L2: Duplex Text Normalization') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel { + stage('Duplex Text Normalization with Tarred dataset') { + steps { + sh 'cd examples/nlp/duplex_text_normalization && \ + python duplex_text_normalization_train.py \ + data.validation_ds.data_path=/home/TestData/nlp/duplex_text_norm/small_test.tsv \ + mode=tn \ + lang=en \ + tagger_model.do_training=false \ + decoder_model.transformer=t5-small \ + data.validation_ds.batch_size=2 \ + data.train_ds.use_cache=false \ + data.validation_ds.use_cache=false \ + data.test_ds.batch_size=2 \ + data.train_ds.decoder_data_augmentation=false \ + data.train_ds.num_workers=2 \ + decoder_trainer.devices=[0,1] \ + decoder_trainer.accelerator="gpu" \ + data.train_ds.use_tarred_dataset=true \ + +decoder_trainer.fast_dev_run=true \ + decoder_exp_manager.create_checkpoint_callback=false \ + data.train_ds.tar_metadata_file=/home/TestData/nlp/duplex_text_norm/tarred_small/metadata.json \ + data.test_ds.use_cache=false \ + data.test_ds.data_path=/home/TestData/nlp/duplex_text_norm/small_test.tsv' + } + } + } + } + // Runs out of memory on the 12G TITAN V (GPU 0 on main CI) + // TODO: add when megatron bert is supported again in NeMo + // stage('L2: MegaBERT Token Classification') { + // when { + // anyOf { + // branch 'r1.17.0' + // changeRequest target: 'r1.17.0' + // } + // } + // failFast true + // steps { + // sh 'cd examples/nlp/token_classification && \ + // python token_classification_train.py \ + // model.dataset.data_dir=/home/TestData/nlp/token_classification_punctuation/ \ + // model.language_model.pretrained_model_name=megatron-bert-345m-uncased \ + // model.train_ds.batch_size=10 \ + // model.dataset.max_seq_length=50 \ + // model.dataset.use_cache=false \ + // trainer.accelerator=gpu \ + // trainer.strategy=ddp \ + // trainer.precision=16 \ + // trainer.devices=[1] \ + // trainer.accelerator="gpu" \ + // +trainer.fast_dev_run=true \ + // exp_manager=null' + // } + // } + + stage('L2: BERT Text Classification') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel { + stage ('Text Classification with BERT Test') { + steps { + sh 'cd examples/nlp/text_classification && \ + python text_classification_with_bert.py \ + model.dataset.num_classes=6 \ + model.train_ds.file_path=/home/TestData/nlp/retail_text_classification/train.tsv \ + model.validation_ds.file_path=/home/TestData/nlp/retail_text_classification/dev.tsv \ + model.language_model.pretrained_model_name=distilbert-base-uncased \ + model.train_ds.batch_size=10 \ + model.dataset.max_seq_length=50 \ + model.dataset.use_cache=false \ + trainer.devices=[0] \ + trainer.accelerator="gpu" \ + +trainer.fast_dev_run=true \ + exp_manager=null' + } + } + } + } + + 
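+    // Note on the SQuAD question-answering stages below (BERT, BART, GPT2): they share one pattern. Each is gated on
+    // the r1.17.0 branch / PR target, temporarily sets TRANSFORMERS_OFFLINE=0 so the Hugging Face checkpoints
+    // (bert-base-uncased, facebook/bart-base, gpt2) can be fetched inside the container and restores it afterwards,
+    // and runs question_answering.py on the mini SQuAD splits under /home/TestData/nlp/squad_mini. Because SQuAD
+    // evaluation needs the whole dev file, fast_dev_run is not used; trainer.max_steps=1, trainer.max_epochs=1 and
+    // exp_manager=null keep each smoke test short instead.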
stage('L2: Parallel BERT Question-Answering SQUAD v1.1 & v2.0') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel { + stage('BERT SQUAD 1.1') { + // Cannot do fast_dev_run because squad needs whole dev dataset + steps { + sh 'TRANSFORMERS_OFFLINE=0 && cd examples/nlp/question_answering && \ + python question_answering.py \ + model.train_ds.file=/home/TestData/nlp/squad_mini/v1.1/train-v1.1.json \ + model.dataset.use_cache=false \ + model.validation_ds.file=/home/TestData/nlp/squad_mini/v1.1/dev-v1.1.json \ + model.test_ds.file=/home/TestData/nlp/squad_mini/v1.1/dev-v1.1.json \ + model.train_ds.batch_size=2 \ + model.train_ds.num_samples=2 \ + model.validation_ds.batch_size=2 \ + model.validation_ds.num_samples=2 \ + model.test_ds.num_samples=2 \ + model.test_ds.batch_size=2 \ + trainer.max_epochs=1 \ + trainer.max_steps=1 \ + model.language_model.pretrained_model_name=bert-base-uncased \ + model.dataset.version_2_with_negative=false \ + trainer.precision=16 \ + trainer.devices=[0] \ + trainer.accelerator="gpu" \ + exp_manager=null && TRANSFORMERS_OFFLINE=1' + } + } + stage('BERT SQUAD 2.0') { + // Cannot do fast_dev_run because squad needs whole dev dataset + steps { + sh 'TRANSFORMERS_OFFLINE=0 && cd examples/nlp/question_answering && \ + python question_answering.py \ + model.train_ds.file=/home/TestData/nlp/squad_mini/v2.0/train-v2.0.json \ + model.dataset.use_cache=false \ + model.train_ds.batch_size=2 \ + model.train_ds.num_samples=2 \ + model.validation_ds.batch_size=2 \ + model.validation_ds.num_samples=2 \ + trainer.max_epochs=1 \ + trainer.max_steps=1 \ + model.validation_ds.file=/home/TestData/nlp/squad_mini/v2.0/dev-v2.0.json \ + model.language_model.pretrained_model_name=bert-base-uncased \ + model.dataset.version_2_with_negative=true \ + trainer.precision=16 \ + trainer.devices=[1] \ + trainer.accelerator="gpu" \ + exp_manager=null && TRANSFORMERS_OFFLINE=1' + } + } + } + } + + stage('L2: Parallel BART Question-Answering SQUAD v1.1 & v2.0') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel { + stage('BART SQUAD 1.1') { + // Cannot do fast_dev_run because squad needs whole dev dataset + steps { + sh 'TRANSFORMERS_OFFLINE=0 && cd examples/nlp/question_answering && \ + python question_answering.py \ + model.train_ds.file=/home/TestData/nlp/squad_mini/v1.1/train-v1.1.json \ + model.dataset.use_cache=false \ + model.dataset.check_if_answer_in_context=false \ + model.validation_ds.file=/home/TestData/nlp/squad_mini/v1.1/dev-v1.1.json \ + model.test_ds.file=/home/TestData/nlp/squad_mini/v1.1/dev-v1.1.json \ + model.train_ds.batch_size=2 \ + model.train_ds.num_samples=2 \ + model.validation_ds.batch_size=2 \ + model.validation_ds.num_samples=2 \ + model.test_ds.num_samples=2 \ + model.test_ds.batch_size=2 \ + trainer.max_epochs=1 \ + trainer.max_steps=1 \ + model.language_model.pretrained_model_name=facebook/bart-base \ + model.dataset.version_2_with_negative=false \ + trainer.precision=16 \ + trainer.devices=[0] \ + trainer.accelerator="gpu" \ + exp_manager=null && TRANSFORMERS_OFFLINE=1' + } + } + stage('BART SQUAD 2.0') { + // Cannot do fast_dev_run because squad needs whole dev dataset + steps { + sh 'TRANSFORMERS_OFFLINE=0 && cd examples/nlp/question_answering && \ + python question_answering.py \ + model.train_ds.file=/home/TestData/nlp/squad_mini/v2.0/train-v2.0.json \ + model.dataset.use_cache=false \ + model.dataset.check_if_answer_in_context=false \ + 
model.train_ds.batch_size=2 \ + model.train_ds.num_samples=2 \ + model.validation_ds.batch_size=2 \ + model.validation_ds.num_samples=2 \ + trainer.max_epochs=1 \ + trainer.max_steps=1 \ + model.validation_ds.file=/home/TestData/nlp/squad_mini/v2.0/dev-v2.0.json \ + model.language_model.pretrained_model_name=facebook/bart-base \ + model.dataset.version_2_with_negative=true \ + trainer.precision=16 \ + trainer.devices=[1] \ + trainer.accelerator="gpu" \ + exp_manager=null && TRANSFORMERS_OFFLINE=1' + } + } + } + } + + stage('L2: Parallel GPT2 Question-Answering SQUAD v1.1 & v2.0') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel { + stage('GPT2 SQUAD 1.1') { + // Cannot do fast_dev_run because squad needs whole dev dataset + steps { + sh 'TRANSFORMERS_OFFLINE=0 && cd examples/nlp/question_answering && \ + python question_answering.py \ + model.train_ds.file=/home/TestData/nlp/squad_mini/v1.1/train-v1.1.json \ + model.dataset.use_cache=false \ + model.dataset.check_if_answer_in_context=false \ + model.validation_ds.file=/home/TestData/nlp/squad_mini/v1.1/dev-v1.1.json \ + model.test_ds.file=/home/TestData/nlp/squad_mini/v1.1/dev-v1.1.json \ + model.train_ds.batch_size=2 \ + model.train_ds.num_samples=2 \ + model.validation_ds.batch_size=2 \ + model.validation_ds.num_samples=2 \ + model.test_ds.num_samples=2 \ + model.test_ds.batch_size=2 \ + trainer.max_epochs=1 \ + trainer.max_steps=1 \ + model.language_model.pretrained_model_name=gpt2 \ + model.dataset.version_2_with_negative=false \ + trainer.precision=16 \ + trainer.devices=[0] \ + trainer.accelerator="gpu" \ + exp_manager=null && TRANSFORMERS_OFFLINE=1' + } + } + stage('GPT2 SQUAD 2.0') { + // Cannot do fast_dev_run because squad needs whole dev dataset + steps { + sh 'TRANSFORMERS_OFFLINE=0 && cd examples/nlp/question_answering && \ + python question_answering.py \ + model.train_ds.file=/home/TestData/nlp/squad_mini/v2.0/train-v2.0.json \ + model.dataset.use_cache=false \ + model.dataset.check_if_answer_in_context=false \ + model.train_ds.batch_size=2 \ + model.train_ds.num_samples=2 \ + model.validation_ds.batch_size=2 \ + model.validation_ds.num_samples=2 \ + trainer.max_epochs=1 \ + trainer.max_steps=1 \ + model.validation_ds.file=/home/TestData/nlp/squad_mini/v2.0/dev-v2.0.json \ + model.language_model.pretrained_model_name=gpt2 \ + model.dataset.version_2_with_negative=true \ + trainer.precision=16 \ + trainer.devices=[1] \ + trainer.accelerator="gpu" \ + exp_manager=null && TRANSFORMERS_OFFLINE=1' + } + } + } + } + + stage('L2: Intent and Slot Classification Tasks') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel { + stage('L2: Intent and Slot Classification') { + steps { + sh 'cd examples/nlp/intent_slot_classification && \ + python intent_slot_classification.py \ + model.data_dir=/home/TestData/nlp/retail \ + model.validation_ds.prefix=dev \ + model.test_ds.prefix=dev \ + trainer.devices=[0] \ + trainer.accelerator="gpu" \ + +trainer.fast_dev_run=true \ + exp_manager.exp_dir=checkpoints' + sh 'rm -rf checkpoints' + } + } + stage('L2: Multi-Label Intent and Slot Classification') { + steps { + sh 'cd examples/nlp/intent_slot_classification && \ + python multi_label_intent_slot_classification.py \ + model.data_dir=/home/TestData/nlp/new_multiatis \ + model.validation_ds.prefix=dev \ + model.test_ds.prefix=dev \ + trainer.devices=[0] \ + +trainer.fast_dev_run=true \ + exp_manager.exp_dir=checkpoints2' + sh 'rm 
-rf checkpoints2' + } + } + } + } + + // TODO: add when megatron-bert is supported again + // stage('L2: Model Parallel Size 2 Megatron Text Classification') { + // when { + // anyOf{ + // branch 'r1.17.0' + // changeRequest target: 'r1.17.0' + // } + // } + // failFast true + // steps{ + // sh 'cd examples/nlp/text_classification && \ + // python text_classification_with_bert.py \ + // trainer.devices=[0,1] \ + // trainer.accelerator="gpu" \ + // trainer.num_nodes=1 \ + // trainer.precision=16 \ + // trainer.gradient_clip_val=1.0 \ + // +trainer.fast_dev_run=true \ + // model.dataset.num_classes=6 \ + // model.train_ds.file_path=/home/TestData/nlp/retail_text_classification/train.tsv \ + // model.train_ds.batch_size=4 \ + // model.language_model.pretrained_model_name=megatron-bert-uncased \ + // model.language_model.config_file=/home/TestData/nlp/mp_2_bert_toy/config.json \ + // model.language_model.lm_checkpoint=/home/TestData/nlp/mp_2_bert_toy/iter_2000000 \ + // model.nemo_path=null \ + // ~model.infer_samples \ + // exp_manager=null' + // } + // } + + // stage('L2: Model Parallel Size 2 Megatron Autoresume') { + // when { + // anyOf{ + // branch 'r1.17.0' + // changeRequest target: 'r1.17.0' + // } + // } + // failFast true + // steps{ + // sh 'cd examples/nlp/text_classification && \ + // python text_classification_with_bert.py \ + // trainer.devices=[0,1] \ + // trainer.accelerator="gpu" \ + // trainer.num_nodes=1 \ + // trainer.precision=16 \ + // trainer.gradient_clip_val=1.0 \ + // trainer.max_epochs=1 \ + // +trainer.fast_dev_run=true \ + // model.dataset.num_classes=6 \ + // model.train_ds.file_path=/home/TestData/nlp/retail_text_classification/train.tsv \ + // model.train_ds.batch_size=4 \ + // model.language_model.pretrained_model_name=megatron-bert-uncased \ + // model.language_model.config_file=/home/TestData/nlp/mp_2_bert_toy/config.json \ + // model.language_model.lm_checkpoint=/home/TestData/nlp/mp_2_bert_toy/iter_2000000 \ + // model.nemo_path=null \ + // ~model.infer_samples \ + // +exp_manager.explicit_log_dir=/home/TestData/nlp/mp_autoresume \ + // +exp_manager.resume_if_exists=true' + // } + // } + + // stage('L2: Model Parallel Size 2 Megatron Evaluation from .nemo') { + // when { + // anyOf{ + // branch 'r1.17.0' + // changeRequest target: 'r1.17.0' + // } + // } + // failFast true + // steps{ + // sh 'cd examples/nlp/text_classification && \ + // python model_parallel_text_classification_evaluation.py \ + // trainer.devices=[0,1] \ + // trainer.accelerator="gpu" \ + // trainer.num_nodes=1 \ + // model.dataset.num_classes=6 \ + // model.test_ds.file_path=/home/TestData/nlp/retail_text_classification/dev.tsv \ + // model.nemo_path=/home/TestData/nlp/mp_2_nemo/retail_text_class_350M.nemo \ + // exp_manager=null' + // } + // } + + // stage('L2: Model Parallel Size 2 Megatron Train from .nemo') { + // when { + // anyOf{ + // branch 'r1.17.0' + // changeRequest target: 'r1.17.0' + // } + // } + // failFast true + // steps{ + // sh 'cd examples/nlp/token_classification && \ + // python token_classification_train.py \ + // pretrained_model=/home/TestData/nlp/mp_2_nemo/ner_350M.nemo \ + // model.dataset.data_dir=/home/TestData/nlp/ner/ \ + // model.train_ds.batch_size=2 \ + // model.dataset.use_cache=false \ + // trainer.devices=[0,1] \ + // trainer.accelerator="gpu" \ + // +trainer.fast_dev_run=true \ + // model.dataset.class_balancing="weighted_loss" \ + // exp_manager=null' + // } + // } + + stage('L2: Parallel NLP Examples 2') { + when { + anyOf { + branch 'r1.17.0' + 
changeRequest target: 'r1.17.0' + } + } + failFast true + parallel { + stage ('NER finetuning from pretrained Test') { + steps { + sh 'cd examples/nlp/token_classification && \ + python token_classification_train.py \ + pretrained_model=ner_en_bert \ + model.dataset.data_dir=/home/TestData/nlp/ner/ \ + model.train_ds.batch_size=2 \ + model.dataset.use_cache=false \ + trainer.devices=[0] \ + trainer.accelerator="gpu" \ + +trainer.fast_dev_run=true \ + model.dataset.class_balancing="weighted_loss" \ + exp_manager.exp_dir=null' + } + } + stage ('Punctuation and capitalization finetuning from pretrained test') { + steps { + sh 'cd examples/nlp/token_classification && \ + data_dir="$(mktemp -d -p "$(pwd)")" && \ + cp /home/TestData/nlp/token_classification_punctuation/*.txt "${data_dir}"/ && \ + python punctuation_capitalization_train_evaluate.py \ + pretrained_model=punctuation_en_bert \ + model.train_ds.ds_item="${data_dir}" \ + model.validation_ds.ds_item="${data_dir}" \ + model.test_ds.ds_item="${data_dir}" \ + +model.train_ds.use_cache=false \ + +model.validation_ds.use_cache=false \ + +model.test_ds.use_cache=false \ + trainer.devices=[1] \ + trainer.accelerator="gpu" \ + +trainer.fast_dev_run=true \ + exp_manager.exp_dir=null && \ + rm -rf "${data_dir}"' + } + } + stage ('NER with TurkuNLP/bert-base-finnish-cased-v1') { + steps { + sh 'cd examples/nlp/token_classification && \ + python token_classification_train.py \ + model.dataset.data_dir=/home/TestData/nlp/token_classification_punctuation/ \ + trainer.devices=[0] \ + trainer.accelerator="gpu" \ + +trainer.fast_dev_run=true \ + model.dataset.use_cache=false \ + model.language_model.pretrained_model_name="TurkuNLP/bert-base-finnish-cased-v1" \ + exp_manager.exp_dir=null' + } + } + stage('Evaluation script for Token Classification') { + steps { + sh 'python examples/nlp/token_classification/token_classification_evaluate.py \ + model.dataset.data_dir=/home/TestData/nlp/ner/ \ + model.dataset.use_cache=false \ + pretrained_model=/home/TestData/nlp/pretrained_models/NER_Model_with_BERT_base_uncased.nemo' + } + } + stage('Evaluation script for Punctuation') { + steps { + sh 'data_dir="$(mktemp -d -p "$(pwd)")" && \ + cp /home/TestData/nlp/token_classification_punctuation/*.txt "${data_dir}"/ && \ + python examples/nlp/token_classification/punctuation_capitalization_train_evaluate.py \ + +do_training=false \ + +do_testing=true \ + model.test_ds.ds_item="${data_dir}" \ + ~model.train_ds \ + ~model.validation_ds \ + +model.test_ds.use_cache=false \ + pretrained_model=/home/TestData/nlp/pretrained_models/Punctuation_Capitalization_with_DistilBERT_base_uncased.nemo && \ + rm -rf "${data_dir}"' + } + } + stage('L2: Punctuation & Capitalization, 2GPUs with DistilBERT, Fine-tuning on different data') { + steps { + sh 'cd examples/nlp/token_classification && \ + output_dir="$(mktemp -d -p "$(pwd)")" && \ + tmp_data_dir="$(mktemp -d -p "$(pwd)")" && \ + cp /home/TestData/nlp/token_classification_punctuation/*.txt "${tmp_data_dir}"/ && \ + python punctuation_capitalization_train_evaluate.py \ + model.train_ds.use_tarred_dataset=false \ + model.train_ds.ds_item="${tmp_data_dir}" \ + model.validation_ds.ds_item="${tmp_data_dir}" \ + model.test_ds.ds_item="${tmp_data_dir}" \ + model.language_model.pretrained_model_name=distilbert-base-uncased \ + +model.train_ds.use_cache=false \ + +model.validation_ds.use_cache=false \ + +model.test_ds.use_cache=false \ + trainer.devices=[0,1] \ + trainer.accelerator="gpu" \ + trainer.strategy=ddp \ + 
trainer.max_epochs=1 \ + +exp_manager.explicit_log_dir="${output_dir}" \ + +do_testing=true && \ + tmp_data_dir_2="$(mktemp -d -p "$(pwd)")" && \ + mv "${tmp_data_dir}"/* "${tmp_data_dir_2}" && \ + rm -rf "${tmp_data_dir}" && \ + python punctuation_capitalization_train_evaluate.py \ + model.train_ds.use_tarred_dataset=false \ + model.train_ds.ds_item="${tmp_data_dir_2}" \ + model.validation_ds.ds_item="${tmp_data_dir_2}" \ + model.test_ds.ds_item="${tmp_data_dir_2}" \ + pretrained_model="${output_dir}/checkpoints/Punctuation_and_Capitalization.nemo" \ + +model.train_ds.use_cache=false \ + +model.validation_ds.use_cache=false \ + +model.test_ds.use_cache=false \ + trainer.devices=[0,1] \ + trainer.accelerator="gpu" \ + trainer.strategy=ddp \ + trainer.max_epochs=1 \ + exp_manager=null && \ + rm -rf /workspace/NeMo/examples/nlp/token_classification/nemo_experiments \ + "${tmp_data_dir_2}" \ + "${output_dir}"' + } + } + } + } + stage('Punctuation & Capitalization tarred dataset') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + stages { + stage('create and use tarred dataset') { + steps { + sh 'data_dir="$(mktemp -d -p "$(pwd)")" && \ + cp -r /home/TestData/nlp/token_classification_punctuation/*.txt \ + /home/TestData/nlp/token_classification_punctuation/wmt_wiki_10000 \ + "${data_dir}"/ && \ + usual_data=${data_dir}/wmt_wiki_10000 && \ + output_dir="$(mktemp -d -p "$(pwd)")" && \ + tarred_data=${output_dir}/train_tarred && \ + tokens_in_batch=2000 && \ + max_seq_length=512 && \ + lm_model=distilbert-base-uncased && \ + python examples/nlp/token_classification/data/create_punctuation_capitalization_tarred_dataset.py \ + --text ${usual_data}/input.txt \ + --labels ${usual_data}/labels.txt \ + --output_dir ${tarred_data} \ + --tokens_in_batch ${tokens_in_batch} \ + --max_seq_length 512 \ + --lines_per_dataset_fragment 2000 \ + --num_batches_per_tarfile 5 \ + --tar_file_prefix punctuation_capitalization \ + --tokenizer_name ${lm_model} \ + --use_fast_tokenizer \ + --pad_label O \ + --n_jobs 3 && \ + echo "Number of tarred files in dataset:" && \ + ls ${tarred_data}/*.tar | wc -l && \ + echo "Label id files in dataset:" && \ + ls ${tarred_data}/*.csv && \ + metadata_file=${tarred_data}/metadata.punctuation_capitalization.tokens${tokens_in_batch}.max_seq_length${max_seq_length}.${lm_model}.json && \ + python examples/nlp/token_classification/punctuation_capitalization_train_evaluate.py \ + model.validation_ds.ds_item="${data_dir}" \ + model.test_ds.ds_item="${data_dir}" \ + model.train_ds.ds_item=${tarred_data} \ + model.language_model.pretrained_model_name=${lm_model} \ + model.train_ds.use_tarred_dataset=true \ + model.train_ds.tar_metadata_file=${metadata_file} \ + +model.train_ds.use_cache=false \ + +model.validation_ds.use_cache=false \ + +model.test_ds.use_cache=false \ + trainer.devices=[0,1] \ + trainer.accelerator="gpu" \ + trainer.strategy=ddp \ + trainer.max_epochs=1 \ + +exp_manager.explicit_log_dir=${output_dir}/output && \ + rm -rf "${output_dir}" "${data_dir}"' + } + } + } + } + stage('Punctuation & Capitalization, Different ways of passing labels to model') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + stages { + stage('Punctuation & Capitalization, Using model.common_datasest_parameters.label_vocab_dir') { + steps { + sh 'cd examples/nlp/token_classification && \ + work_dir="$(mktemp -d -p "$(pwd)")" && \ + label_vocab_dir="${work_dir}/labels" && \ + mkdir -p ${label_vocab_dir} 
&& \ + data_dir="${work_dir}/data" && \ + mkdir -p "${data_dir}" && \ + cp /home/TestData/nlp/token_classification_punctuation/*.txt "${data_dir}" && \ + output_dir="${work_dir}/output" && \ + mkdir -p "${output_dir}" && \ + punct_label_vocab="${label_vocab_dir}/punct_label_vocab.csv" && \ + capit_label_vocab="${label_vocab_dir}/capit_label_vocab.csv" && \ + printf "O\n,\n.\n?\n" > "${punct_label_vocab}" && \ + printf "O\nU\n" > "${capit_label_vocab}" && \ + python punctuation_capitalization_train_evaluate.py \ + model.train_ds.use_tarred_dataset=false \ + model.train_ds.ds_item="${data_dir}" \ + model.validation_ds.ds_item="${data_dir}" \ + model.test_ds.ds_item="${data_dir}" \ + model.language_model.pretrained_model_name=distilbert-base-uncased \ + model.common_dataset_parameters.label_vocab_dir="${label_vocab_dir}" \ + model.class_labels.punct_labels_file="$(basename "${punct_label_vocab}")" \ + model.class_labels.capit_labels_file="$(basename "${capit_label_vocab}")" \ + +model.train_ds.use_cache=false \ + +model.validation_ds.use_cache=false \ + +model.test_ds.use_cache=false \ + trainer.devices=[0,1] \ + trainer.strategy=ddp \ + trainer.max_epochs=1 \ + +exp_manager.explicit_log_dir="${output_dir}" \ + +do_testing=false && \ + python punctuation_capitalization_train_evaluate.py \ + +do_training=false \ + +do_testing=true \ + ~model.train_ds \ + ~model.validation_ds \ + model.test_ds.ds_item="${data_dir}" \ + pretrained_model="${output_dir}/checkpoints/Punctuation_and_Capitalization.nemo" \ + +model.train_ds.use_cache=false \ + +model.validation_ds.use_cache=false \ + +model.test_ds.use_cache=false \ + trainer.devices=[0,1] \ + trainer.strategy=ddp \ + trainer.max_epochs=1 \ + exp_manager=null && \ + rm -rf "${work_dir}"' + } + } + stage('Punctuation & Capitalization, Using model.common_datasest_parameters.{punct,capit}_label_ids') { + steps { + sh 'cd examples/nlp/token_classification && \ + work_dir="$(mktemp -d -p "$(pwd)")" && \ + output_dir="${work_dir}/output" && \ + mkdir -p "${output_dir}" && \ + data_dir="${work_dir}/data" && \ + mkdir -p "${data_dir}" && \ + cp /home/TestData/nlp/token_classification_punctuation/*.txt "${data_dir}" && \ + conf_name=punctuation_capitalization_config_with_ids && \ + cp conf/punctuation_capitalization_config.yaml "${work_dir}/${conf_name}.yaml" && \ + sed -i $\'s/punct_label_ids: null/punct_label_ids: {O: 0, \\\',\\\': 1, .: 2, \\\'?\\\': 3}/\' \ + "${work_dir}/${conf_name}.yaml" && \ + sed -i $\'s/capit_label_ids: null/capit_label_ids: {O: 0, U: 1}/\' \ + "${work_dir}/${conf_name}.yaml" && \ + python punctuation_capitalization_train_evaluate.py \ + --config-path "${work_dir}" \ + --config-name "${conf_name}" \ + model.train_ds.use_tarred_dataset=false \ + model.train_ds.ds_item="${data_dir}" \ + model.validation_ds.ds_item="${data_dir}" \ + model.test_ds.ds_item="${data_dir}" \ + model.language_model.pretrained_model_name=distilbert-base-uncased \ + +model.train_ds.use_cache=false \ + +model.validation_ds.use_cache=false \ + +model.test_ds.use_cache=false \ + trainer.devices=[0,1] \ + trainer.strategy=ddp \ + trainer.max_epochs=1 \ + +exp_manager.explicit_log_dir="${output_dir}" \ + +do_testing=false && \ + python punctuation_capitalization_train_evaluate.py \ + +do_training=false \ + +do_testing=true \ + ~model.train_ds \ + ~model.validation_ds \ + model.test_ds.ds_item="${data_dir}" \ + pretrained_model="${output_dir}/checkpoints/Punctuation_and_Capitalization.nemo" \ + +model.train_ds.use_cache=false \ + +model.validation_ds.use_cache=false 
\ + +model.test_ds.use_cache=false \ + trainer.devices=[0,1] \ + trainer.strategy=ddp \ + trainer.max_epochs=1 \ + exp_manager=null && \ + rm -rf "${work_dir}"' + } + } + } + } + stage('Punctuation & Capitalization inference') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + stages { + stage('Restore punctuation and capitalization in long text') { + steps { + sh 'output_dir="$(mktemp -d -p "$(pwd)")" && \ + python examples/nlp/token_classification/punctuate_capitalize_infer.py \ + --input_manifest /home/TestData/nlp/token_classification_punctuation/iwslt_tst2019.manifest \ + --output_text "${output_dir}/iwslt_inference_result.txt" \ + --max_seq_length 92 \ + --step 8 \ + --margin 16 \ + --pretrained_name punctuation_en_bert \ + --batch_size 32 && \ + rm -rf "${output_dir}"' + } + } + } + } + + stage('L2: Parallel Pretraining BERT pretraining from Text/Preprocessed') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel { + stage('L2: Pretraining BERT pretraining from Text') { + steps { + sh 'cd examples/nlp/language_modeling && \ + python bert_pretraining.py \ + --config-name=bert_pretraining_from_text_config.yaml \ + trainer.devices=[0] \ + trainer.accelerator="gpu" \ + trainer.precision=16 \ + +trainer.fast_dev_run=true \ + model.train_ds.data_file=/home/TestData/nlp/wikitext-2/train.txt \ + model.train_ds.batch_size=32 \ + model.validation_ds.data_file=/home/TestData/nlp/wikitext-2/valid.txt \ + model.validation_ds.batch_size=32 \ + model.language_model.config_file=/home/TestData/nlp/bert_configs/bert_3200.json \ + model.optim.lr=0.01 \ + model.optim.sched.warmup_ratio=0.1 \ + model.tokenizer.tokenizer_name=sentencepiece \ + model.tokenizer.tokenizer_model=/home/TestData/nlp/wikitext-2/tokenizer_bpe_v3193/tokenizer.model \ + model.mask_prob=0.15 \ + model.short_seq_prob=0.1 \ + exp_manager.exp_dir=PretrainingBERTFromText \ + ' + sh 'rm -f /home/TestData/nlp/wikitext-2/*.pkl' + sh 'rm -rf examples/nlp/language_modeling/PretrainingBERTFromText' + sh 'ls -lha examples/nlp/language_modeling' + } + } + stage('L2: Pretraining BERT from Preprocessed') { + steps { + sh 'cd examples/nlp/language_modeling && \ + python bert_pretraining.py \ + --config-name=bert_pretraining_from_preprocessed_config.yaml \ + trainer.devices=[1] \ + trainer.accelerator="gpu" \ + trainer.precision=16 \ + +trainer.fast_dev_run=true \ + model.train_ds.data_file=/home/TestData/nlp/wiki_book_mini/training \ + model.train_ds.batch_size=8 \ + model.language_model.lm_checkpoint=/home/TestData/nlp/bert_ckpts/nemo1.0/bert_base_uncased_mlm_final_1074591_nemo1.0.pt \ + model.language_model.config_file=/home/TestData/nlp/bert_configs/uncased_L-12_H-768_A-12.json \ + model.optim.lr=0.875e-4 \ + model.optim.weight_decay=0.01 \ + model.optim.sched.warmup_ratio=0.01 \ + exp_manager.exp_dir=PretrainingBERTFromPreprocessed \ + exp_manager.create_checkpoint_callback=False \ + ' + sh 'rm -rf examples/nlp/language_modeling/PretrainingBERTFromPreprocessed' + sh 'ls -lha examples/nlp/language_modeling' + } + } + } + } + + stage('L2: Entity Linking') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel { + stage ('Self Alignment Pretraining BERT') { + steps { + sh 'cd examples/nlp/entity_linking && \ + python self_alignment_pretraining.py \ + project_dir=. 
\ + trainer.val_check_interval=3 \ + model.raw_data=None \ + model.train_ds.data_file=/home/TestData/nlp/entity_linking/tiny_example_train_pairs.tsv \ + model.validation_ds.data_file=/home/TestData/nlp/entity_linking/tiny_example_validation_pairs.tsv \ + model.train_ds.batch_size=8 \ + model.validation_ds.batch_size=8 \ + exp_manager.exp_dir=null' + } + } + } + } + + // TODO: remove +model.optim.capturable=True when Pytorch fix: https://github.com/pytorch/pytorch/pull/81858 + // is in the release container + stage('L2: NMT Attention is All You Need Training') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel { + stage('L2: NMT Training Post-LN') { + steps { + sh 'python examples/nlp/machine_translation/enc_dec_nmt.py \ + --config-path=conf \ + --config-name=aayn_base \ + do_testing=false \ + model.train_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.train_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.ref \ + model.validation_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.validation_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.test_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.test_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.encoder_tokenizer.tokenizer_model=/home/TestData/nlp/nmt/toy_data/tt_tokenizer.BPE.4096.model \ + model.decoder_tokenizer.tokenizer_model=/home/TestData/nlp/nmt/toy_data/tt_tokenizer.BPE.4096.model \ + model.encoder.num_layers=1 \ + model.encoder.hidden_size=64 \ + model.encoder.inner_size=256 \ + model.decoder.num_layers=1 \ + model.decoder.hidden_size=64 \ + model.decoder.inner_size=256 \ + +model.optim.capturable=True \ + trainer.devices=[0] \ + trainer.accelerator="gpu" \ + +trainer.val_check_interval=2 \ + +trainer.limit_val_batches=1 \ + +trainer.max_steps=2 \ + trainer.precision=16 \ + +exp_manager.explicit_log_dir=examples/nlp/machine_translation/nmt_results \ + +exp_manager.create_checkpoint_callback=true \ + ' + sh 'python examples/nlp/machine_translation/enc_dec_nmt.py \ + --config-path=conf \ + --config-name=aayn_base \ + do_testing=true \ + model.train_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.train_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.ref \ + model.validation_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.validation_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.test_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.test_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.encoder_tokenizer.tokenizer_model=/home/TestData/nlp/nmt/toy_data/tt_tokenizer.BPE.4096.model \ + model.decoder_tokenizer.tokenizer_model=/home/TestData/nlp/nmt/toy_data/tt_tokenizer.BPE.4096.model \ + model.encoder.num_layers=1 \ + model.encoder.hidden_size=64 \ + model.encoder.inner_size=256 \ + model.decoder.num_layers=1 \ + model.decoder.hidden_size=64 \ + model.decoder.inner_size=256 \ + +model.optim.capturable=True \ + trainer.devices=[0] \ + trainer.accelerator="gpu" \ + +trainer.val_check_interval=10 \ + +trainer.limit_val_batches=1 \ + +trainer.limit_test_batches=1 \ + +trainer.max_steps=10 \ + +exp_manager.explicit_log_dir=examples/nlp/machine_translation/nmt_results \ + +exp_manager.create_checkpoint_callback=true \ + +exp_manager.resume_if_exists=True \ + ' + sh 'rm -rf 
examples/nlp/machine_translation/nmt_results' + } + } + + stage('L2: NMT Training Pre-LN') { + steps { + sh 'cd examples/nlp/machine_translation && \ + python enc_dec_nmt.py \ + --config-path=conf \ + --config-name=aayn_base \ + do_testing=true \ + model.train_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.train_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.ref \ + model.validation_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.validation_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.test_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.test_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.encoder_tokenizer.tokenizer_model=/home/TestData/nlp/nmt/toy_data/tt_tokenizer.BPE.4096.model \ + model.decoder_tokenizer.tokenizer_model=/home/TestData/nlp/nmt/toy_data/tt_tokenizer.BPE.4096.model \ + model.encoder.pre_ln=true \ + model.decoder.pre_ln=true \ + trainer.devices=[1] \ + trainer.accelerator="gpu" \ + +trainer.fast_dev_run=true \ + +trainer.limit_test_batches=2 \ + exp_manager=null \ + ' + } + } + stage('L2: NMT Multi-Validation') { + steps { + sh 'cd examples/nlp/machine_translation && \ + python enc_dec_nmt.py \ + --config-path=conf \ + --config-name=aayn_base \ + do_testing=true \ + model.train_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-en-de.src \ + model.train_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-en-de.ref \ + model.validation_ds.src_file_name=[/home/TestData/nlp/nmt/toy_data/wmt13-en-de.src,/home/TestData/nlp/nmt/toy_data/wmt14-en-de.src] \ + model.validation_ds.tgt_file_name=[/home/TestData/nlp/nmt/toy_data/wmt13-en-de.ref,/home/TestData/nlp/nmt/toy_data/wmt14-en-de.ref] \ + model.test_ds.src_file_name=[/home/TestData/nlp/nmt/toy_data/wmt13-en-de.src,/home/TestData/nlp/nmt/toy_data/wmt14-en-de.src] \ + model.test_ds.tgt_file_name=[/home/TestData/nlp/nmt/toy_data/wmt13-en-de.ref,/home/TestData/nlp/nmt/toy_data/wmt14-en-de.ref] \ + model.encoder_tokenizer.tokenizer_model=/home/TestData/nlp/nmt/toy_data/tt_tokenizer.BPE.4096.model \ + model.decoder_tokenizer.tokenizer_model=/home/TestData/nlp/nmt/toy_data/tt_tokenizer.BPE.4096.model \ + trainer.devices=[0] \ + trainer.accelerator="gpu" \ + +trainer.fast_dev_run=true \ + +trainer.limit_test_batches=2 \ + exp_manager=null \ + ' + } + } + } + } + + stage('L2: NMT Attention is All You Need Inference') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel { + stage('L2: NMT Inference - PostLN') { + steps { + sh 'cd examples/nlp/machine_translation && \ + python nmt_transformer_infer.py \ + --model=/home/TestData/nlp/nmt/toy_data/TransformerLargeDe-En.nemo \ + --srctext=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.test.src \ + --tgtout=/home/TestData/nlp/nmt/toy_data/out.txt \ + --target_lang en \ + --source_lang de \ + ' + } + } + stage('L2: NMT Inference - Pre-LN') { + steps { + sh 'cd examples/nlp/machine_translation && \ + python nmt_transformer_infer.py \ + --model=/home/TestData/nlp/nmt/toy_data/en_de_24x6_preln.nemo \ + --srctext=/home/TestData/nlp/nmt/toy_data/wmt14-en-de.test.src \ + --tgtout=/home/TestData/nlp/nmt/toy_data/out.txt \ + --target_lang de \ + --source_lang en \ + ' + } + } + } + } + + stage('L2: NMT Attention is All You Need Finetuning') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + steps { + sh "cd examples/nlp/machine_translation && \ 
+ python enc_dec_nmt_finetune.py \ + model_path=/home/TestData/nlp/nmt/toy_data/en_de_24x6_preln.nemo \ + trainer.devices=[0] \ + ~trainer.max_epochs \ + model.train_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.train_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.ref \ + model.validation_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.validation_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.test_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.test_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + +trainer.val_check_interval=10 \ + +trainer.limit_val_batches=1 \ + +trainer.limit_test_batches=1 \ + +trainer.max_steps=10 \ + +exp_manager.exp_dir=examples/nlp/machine_translation/nmt_finetune \ + +exp_manager.create_checkpoint_callback=True \ + +exp_manager.checkpoint_callback_params.monitor=val_sacreBLEU \ + +exp_manager.checkpoint_callback_params.mode=max \ + +exp_manager.checkpoint_callback_params.save_best_model=true \ + " + sh "rm -rf examples/nlp/machine_translation/nmt_finetune" + } + } + + + stage('L2: NMT Tarred Dataset Creation') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel { + stage('L2: NMT Auto Tarred Dataset Creation') { + steps { + sh 'cd examples/nlp/machine_translation && \ + python enc_dec_nmt.py \ + --config-path=conf \ + --config-name=aayn_base \ + do_training=false \ + model.preproc_out_dir=$PWD/preproc_out_dir \ + model.train_ds.use_tarred_dataset=true \ + model.train_ds.n_preproc_jobs=2 \ + model.train_ds.lines_per_dataset_fragment=500 \ + model.train_ds.num_batches_per_tarfile=10 \ + model.train_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.train_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.ref \ + model.validation_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.validation_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.encoder_tokenizer.vocab_size=2000 \ + model.decoder_tokenizer.vocab_size=2000 \ + ~model.test_ds \ + trainer.devices=[0] \ + trainer.accelerator="gpu" \ + +trainer.fast_dev_run=true \ + exp_manager=null \ + ' + } + } + + stage('L2: NMT Script Tarred Dataset Creation') { + steps { + sh 'cd examples/nlp/machine_translation && \ + python create_tarred_parallel_dataset.py \ + --src_fname /home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + --tgt_fname /home/TestData/nlp/nmt/toy_data/wmt14-de-en.ref \ + --out_dir $PWD/out_dir \ + --encoder_tokenizer_vocab_size=2000 \ + --decoder_tokenizer_vocab_size=2000 \ + --tokens_in_batch=1000 \ + --lines_per_dataset_fragment=500 \ + --num_batches_per_tarfile=10 \ + --n_preproc_jobs=2 \ + ' + } + } + } + } + stage('L2: Megatron NMT Training TP=2') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + steps { + sh "python examples/nlp/machine_translation/megatron_nmt_training.py \ + trainer.devices=2 \ + trainer.accelerator=gpu \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=10 \ + +trainer.limit_val_batches=2 \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=10 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + exp_manager.exp_dir=examples/nlp/machine_translation/megatron_nmt_results \ + model.tensor_model_parallel_size=2 \ + model.seq_length=128 \ + model.encoder.num_layers=4 \ + model.encoder.hidden_size=64 \ + 
model.encoder.num_attention_heads=8 \ + model.encoder.activation='swiglu' \ + model.encoder.masked_softmax_fusion=False \ + model.encoder.bias_activation_fusion=False \ + model.encoder.activations_checkpoint_method='block' \ + model.encoder.activations_checkpoint_num_layers=1 \ + model.decoder.num_layers=2 \ + model.decoder.hidden_size=64 \ + model.decoder.num_attention_heads=8 \ + model.decoder.activation='swiglu' \ + model.decoder.masked_softmax_fusion=False \ + model.decoder.bias_activation_fusion=False \ + model.decoder.activations_checkpoint_method='block' \ + model.decoder.activations_checkpoint_num_layers=1 \ + model.micro_batch_size=2 \ + model.global_batch_size=4 \ + model.train_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.train_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.ref \ + model.validation_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.validation_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.ref \ + ~model.test_ds \ + model.train_ds.dataset_type=text_memmap \ + model.encoder_tokenizer.library=sentencepiece \ + model.encoder_tokenizer.model=/home/TestData/nlp/nmt/toy_data/spm_64k_all_langs_plus_en.model \ + model.decoder_tokenizer.library=sentencepiece \ + model.decoder_tokenizer.model=/home/TestData/nlp/nmt/toy_data/spm_64k_all_langs_plus_en.model" + sh "python examples/nlp/machine_translation/megatron_nmt_training.py \ + trainer.devices=2 \ + trainer.accelerator=gpu \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=10 \ + +trainer.limit_val_batches=2 \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=10 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + exp_manager.exp_dir=examples/nlp/machine_translation/megatron_nmt_results \ + model.tensor_model_parallel_size=2 \ + model.seq_length=128 \ + model.encoder.num_layers=4 \ + model.encoder.hidden_size=64 \ + model.encoder.num_attention_heads=8 \ + model.encoder.activation='swiglu' \ + model.encoder.masked_softmax_fusion=False \ + model.encoder.bias_activation_fusion=False \ + model.encoder.activations_checkpoint_method='block' \ + model.encoder.activations_checkpoint_num_layers=1 \ + model.decoder.num_layers=2 \ + model.decoder.hidden_size=64 \ + model.decoder.num_attention_heads=8 \ + model.decoder.activation='swiglu' \ + model.decoder.masked_softmax_fusion=False \ + model.decoder.bias_activation_fusion=False \ + model.decoder.activations_checkpoint_method='block' \ + model.decoder.activations_checkpoint_num_layers=1 \ + model.micro_batch_size=2 \ + model.global_batch_size=4 \ + model.train_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.train_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.ref \ + model.validation_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.validation_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.ref \ + ~model.test_ds \ + model.train_ds.dataset_type=text_memmap \ + model.encoder_tokenizer.library=sentencepiece \ + model.encoder_tokenizer.model=/home/TestData/nlp/nmt/toy_data/spm_64k_all_langs_plus_en.model \ + model.decoder_tokenizer.library=sentencepiece \ + model.decoder_tokenizer.model=/home/TestData/nlp/nmt/toy_data/spm_64k_all_langs_plus_en.model" + sh "rm -rf examples/nlp/machine_translation/megatron_nmt_results" + } + } + + // stage('L2: NMT Bottleneck Fallback') { + // when { + // anyOf { + // branch 'r1.17.0' + // changeRequest target: 'r1.17.0' + // } + // } + // failFast true + // parallel 
{ + // stage('L2: seq2seq (no bottleneck)') { + // steps { + // sh 'cd examples/nlp/machine_translation && \ + // enc_dec_nmt-bottleneck.py \ + // --config-path=conf \ + // --config-name=aayn_bottleneck \ + // do_testing=true \ + // model.model_type=nll \ + // model.encoder.arch=seq2seq \ + // model.encoder.hidden_steps=1 \ + // model.encoder.hidden_blocks=1 \ + // model.encoder.hidden_init_method=params \ + // model.encoder.hidden_size=64 \ + // model.encoder.inner_size=128 \ + // model.encoder.num_attention_heads=2 \ + // model.encoder.num_layers=2 \ + // model.decoder.hidden_size=64 \ + // model.decoder.inner_size=128 \ + // model.decoder.num_attention_heads=2 \ + // model.decoder.num_layers=2 \ + // model.train_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-en-de.src \ + // model.train_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-en-de.ref \ + // model.validation_ds.src_file_name=[/home/TestData/nlp/nmt/toy_data/wmt13-en-de.src,/home/TestData/nlp/nmt/toy_data/wmt14-en-de.src] \ + // model.validation_ds.tgt_file_name=[/home/TestData/nlp/nmt/toy_data/wmt13-en-de.ref,/home/TestData/nlp/nmt/toy_data/wmt14-en-de.ref] \ + // model.test_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt13-en-de.src \ + // model.test_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt13-en-de.ref \ + // model.encoder_tokenizer.tokenizer_model=/home/TestData/nlp/nmt/toy_data/tt_tokenizer.BPE.4096.model \ + // model.decoder_tokenizer.tokenizer_model=/home/TestData/nlp/nmt/toy_data/tt_tokenizer.BPE.4096.model \ + // trainer.devices=[1] \ + // trainer.accelerator="gpu" \ + // +trainer.fast_dev_run=true \ + // +trainer.limit_test_batches=2 \ + // exp_manager=null \ + // ' + // } + // } + // } + // } + // stage('L2: NMT Bottleneck Architecture') { + // when { + // anyOf { + // branch 'r1.17.0' + // changeRequest target: 'r1.17.0' + // } + // } + // failFast true + // parallel { + // stage('Bridge Encoder (identity)') { + // steps { + // sh 'cd examples/nlp/machine_translation && \ + // enc_dec_nmt-bottleneck.py \ + // --config-path=conf \ + // --config-name=aayn_bottleneck \ + // do_testing=true \ + // model.model_type=nll \ + // model.encoder.arch=bridge \ + // model.encoder.hidden_steps=1 \ + // model.encoder.hidden_blocks=1 \ + // model.encoder.hidden_init_method=identity \ + // model.encoder.hidden_size=64 \ + // model.encoder.inner_size=128 \ + // model.encoder.num_attention_heads=2 \ + // model.encoder.num_layers=2 \ + // model.decoder.hidden_size=64 \ + // model.decoder.inner_size=128 \ + // model.decoder.num_attention_heads=2 \ + // model.decoder.num_layers=2 \ + // model.train_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + // model.train_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.ref \ + // model.validation_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + // model.validation_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + // model.test_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + // model.test_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + // model.encoder_tokenizer.tokenizer_model=/home/TestData/nlp/nmt/toy_data/tt_tokenizer.BPE.4096.model \ + // model.decoder_tokenizer.tokenizer_model=/home/TestData/nlp/nmt/toy_data/tt_tokenizer.BPE.4096.model \ + // trainer.devices=[0] \ + // trainer.accelerator="gpu" \ + // +trainer.fast_dev_run=true \ + // +trainer.limit_test_batches=2 \ + // exp_manager=null \ + // ' + // } + // } + // stage('Perceiver Encoder (params)') { 
+ // steps { + // sh 'cd examples/nlp/machine_translation && \ + // enc_dec_nmt-bottleneck.py \ + // --config-path=conf \ + // --config-name=aayn_bottleneck \ + // do_testing=true \ + // model.model_type=nll \ + // model.encoder.arch=perceiver \ + // model.encoder.hidden_steps=1 \ + // model.encoder.hidden_blocks=1 \ + // model.encoder.hidden_init_method=params \ + // model.encoder.hidden_size=64 \ + // model.encoder.inner_size=128 \ + // model.encoder.num_attention_heads=2 \ + // model.encoder.num_layers=2 \ + // model.decoder.hidden_size=64 \ + // model.decoder.inner_size=128 \ + // model.decoder.num_attention_heads=2 \ + // model.decoder.num_layers=2 \ + // model.train_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + // model.train_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.ref \ + // model.validation_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + // model.validation_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + // model.test_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + // model.test_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + // model.encoder_tokenizer.tokenizer_model=/home/TestData/nlp/nmt/toy_data/tt_tokenizer.BPE.4096.model \ + // model.decoder_tokenizer.tokenizer_model=/home/TestData/nlp/nmt/toy_data/tt_tokenizer.BPE.4096.model \ + // trainer.devices=[1] \ + // trainer.accelerator="gpu" \ + // +trainer.fast_dev_run=true \ + // +trainer.limit_test_batches=2 \ + // exp_manager=null \ + // ' + // } + // } + // } + // } + // stage('L2: NMT Bottleneck LVM') { + // when { + // anyOf { + // branch 'r1.17.0' + // changeRequest target: 'r1.17.0' + // } + // } + // failFast true + // parallel { + // stage('VAE') { + // steps { + // sh 'cd examples/nlp/machine_translation && \ + // enc_dec_nmt-bottleneck.py \ + // --config-path=conf \ + // --config-name=aayn_bottleneck \ + // do_testing=true \ + // model.model_type=vae \ + // model.encoder.arch=perceiver \ + // model.encoder.hidden_steps=1 \ + // model.encoder.hidden_blocks=1 \ + // model.encoder.hidden_init_method=params \ + // model.encoder.hidden_size=64 \ + // model.encoder.inner_size=128 \ + // model.encoder.num_attention_heads=2 \ + // model.encoder.num_layers=2 \ + // model.decoder.hidden_size=64 \ + // model.decoder.inner_size=128 \ + // model.decoder.num_attention_heads=2 \ + // model.decoder.num_layers=2 \ + // model.train_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + // model.train_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.ref \ + // model.validation_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + // model.validation_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + // model.test_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + // model.test_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + // model.encoder_tokenizer.tokenizer_model=/home/TestData/nlp/nmt/toy_data/tt_tokenizer.BPE.4096.model \ + // model.decoder_tokenizer.tokenizer_model=/home/TestData/nlp/nmt/toy_data/tt_tokenizer.BPE.4096.model \ + // trainer.devices=[0] \ + // trainer.accelerator="gpu" \ + // +trainer.fast_dev_run=true \ + // +trainer.limit_test_batches=2 \ + // exp_manager=null \ + // ' + // } + // } + // stage('MIM') { + // steps { + // sh 'cd examples/nlp/machine_translation && \ + // enc_dec_nmt-bottleneck.py \ + // --config-path=conf \ + // --config-name=aayn_bottleneck \ + // do_testing=true \ + // 
model.model_type=mim \ + // model.encoder.arch=perceiver \ + // model.encoder.hidden_steps=1 \ + // model.encoder.hidden_blocks=1 \ + // model.encoder.hidden_init_method=params \ + // model.encoder.hidden_size=64 \ + // model.encoder.inner_size=128 \ + // model.encoder.num_attention_heads=2 \ + // model.encoder.num_layers=2 \ + // model.decoder.hidden_size=64 \ + // model.decoder.inner_size=128 \ + // model.decoder.num_attention_heads=2 \ + // model.decoder.num_layers=2 \ + // model.train_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + // model.train_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.ref \ + // model.validation_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + // model.validation_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + // model.test_ds.src_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + // model.test_ds.tgt_file_name=/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + // model.encoder_tokenizer.tokenizer_model=/home/TestData/nlp/nmt/toy_data/tt_tokenizer.BPE.4096.model \ + // model.decoder_tokenizer.tokenizer_model=/home/TestData/nlp/nmt/toy_data/tt_tokenizer.BPE.4096.model \ + // trainer.devices=[1] \ + // trainer.accelerator="gpu" \ + // +trainer.fast_dev_run=true \ + // +trainer.limit_test_batches=2 \ + // exp_manager=null \ + // ' + // } + // } + // } + // } + stage('L2: Megatron Bert Pretraining and Resume Training with Pipeline Parallelism') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + steps { + sh "python examples/nlp/language_modeling/megatron_bert_pretraining.py \ + trainer.devices=2 \ + trainer.accelerator=gpu \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=10 \ + trainer.limit_val_batches=2 \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=10 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + exp_manager.exp_dir=examples/nlp/language_modeling/bert_pretrain_results \ + model.pipeline_model_parallel_size=2 \ + model.optim.name=fused_adam \ + model.optim.lr=2e-4 \ + model.optim.sched.warmup_steps=2 \ + model.optim.sched.constant_steps=2 \ + model.optim.sched.min_lr=8e-5 \ + model.max_position_embeddings=128 \ + model.encoder_seq_length=128 \ + model.data.seq_length=128 \ + model.tokenizer.vocab_file=/home/TestData/nlp/megatron_bert/data/bert/vocab.txt \ + model.num_layers=8 \ + model.hidden_size=256 \ + model.num_attention_heads=8 \ + model.activations_checkpoint_method='block' \ + model.activations_checkpoint_num_layers=1 \ + model.data.data_prefix=[.5,/home/TestData/nlp/megatron_bert/data/bert/simple_wiki_bert_preproc_text_sentence,.5,/home/TestData/nlp/megatron_bert/data/bert/simple_wiki_bert_preproc_text_sentence] \ + model.data.index_mapping_dir=examples/nlp/language_modeling/bert_index_mappings" + sh "python examples/nlp/language_modeling/megatron_bert_pretraining.py \ + trainer.devices=2 \ + trainer.accelerator=gpu \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=10 \ + trainer.limit_val_batches=2 \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=20 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + exp_manager.exp_dir=examples/nlp/language_modeling/bert_pretrain_results \ + exp_manager.resume_if_exists=True \ + model.pipeline_model_parallel_size=2 \ + model.optim.name=fused_adam \ + model.optim.lr=2e-4 \ + model.optim.sched.warmup_steps=2 \ + model.optim.sched.constant_steps=2 \ + model.optim.sched.min_lr=8e-5 \ + 
model.max_position_embeddings=128 \ + model.encoder_seq_length=128 \ + model.data.seq_length=128 \ + model.tokenizer.vocab_file=/home/TestData/nlp/megatron_bert/data/bert/vocab.txt \ + model.num_layers=8 \ + model.hidden_size=256 \ + model.num_attention_heads=8 \ + model.activations_checkpoint_method='block' \ + model.activations_checkpoint_num_layers=1 \ + model.data.data_prefix=[.5,/home/TestData/nlp/megatron_bert/data/bert/simple_wiki_bert_preproc_text_sentence,.5,/home/TestData/nlp/megatron_bert/data/bert/simple_wiki_bert_preproc_text_sentence] \ + model.data.index_mapping_dir=examples/nlp/language_modeling/bert_index_mappings" + sh "rm -rf examples/nlp/language_modeling/bert_pretrain_results" + sh "rm -rf examples/nlp/language_modeling/bert_index_mappings" + } + } + stage('L2: Megatron Bert Pretraining and Resume Training') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + steps { + sh "python examples/nlp/language_modeling/megatron_bert_pretraining.py \ + trainer.devices=2 \ + trainer.accelerator=gpu \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=10 \ + trainer.limit_val_batches=2 \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=10 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + exp_manager.exp_dir=examples/nlp/language_modeling/bert_pretrain_results \ + model.tensor_model_parallel_size=2 \ + model.optim.name=fused_adam \ + model.optim.lr=2e-4 \ + model.sequence_parallel=True \ + model.optim.sched.warmup_steps=2 \ + model.optim.sched.constant_steps=2 \ + model.optim.sched.min_lr=8e-5 \ + model.max_position_embeddings=128 \ + model.encoder_seq_length=128 \ + model.data.seq_length=128 \ + model.tokenizer.vocab_file=/home/TestData/nlp/megatron_bert/data/bert/vocab.txt \ + model.num_layers=8 \ + model.hidden_size=256 \ + model.num_attention_heads=8 \ + model.activations_checkpoint_method='block' \ + model.activations_checkpoint_num_layers=1 \ + model.data.data_prefix=[.5,/home/TestData/nlp/megatron_bert/data/bert/simple_wiki_bert_preproc_text_sentence,.5,/home/TestData/nlp/megatron_bert/data/bert/simple_wiki_bert_preproc_text_sentence] \ + model.data.index_mapping_dir=examples/nlp/language_modeling/bert_index_mappings" + sh "python examples/nlp/language_modeling/megatron_bert_pretraining.py \ + trainer.devices=2 \ + trainer.accelerator=gpu \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=10 \ + trainer.limit_val_batches=2 \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=20 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + exp_manager.exp_dir=examples/nlp/language_modeling/bert_pretrain_results \ + exp_manager.resume_if_exists=True \ + model.tensor_model_parallel_size=2 \ + model.optim.name=fused_adam \ + model.optim.lr=2e-4 \ + model.optim.sched.warmup_steps=2 \ + model.optim.sched.constant_steps=2 \ + model.optim.sched.min_lr=8e-5 \ + model.max_position_embeddings=128 \ + model.encoder_seq_length=128 \ + model.data.seq_length=128 \ + model.tokenizer.vocab_file=/home/TestData/nlp/megatron_bert/data/bert/vocab.txt \ + model.num_layers=8 \ + model.hidden_size=256 \ + model.num_attention_heads=8 \ + model.activations_checkpoint_method='block' \ + model.activations_checkpoint_num_layers=1 \ + model.data.data_prefix=[.5,/home/TestData/nlp/megatron_bert/data/bert/simple_wiki_bert_preproc_text_sentence,.5,/home/TestData/nlp/megatron_bert/data/bert/simple_wiki_bert_preproc_text_sentence] \ + 
model.data.index_mapping_dir=examples/nlp/language_modeling/bert_index_mappings" + sh "rm -rf examples/nlp/language_modeling/bert_pretrain_results" + sh "rm -rf examples/nlp/language_modeling/bert_index_mappings" + } + } + stage('L2: Megatron RETRO Pretraining and Resume Training') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + steps { + sh "python examples/nlp/language_modeling/megatron_retro_pretraining.py \ + trainer.devices=2 \ + trainer.num_nodes=1 \ + trainer.accelerator=gpu \ + trainer.accumulate_grad_batches=1 \ + trainer.limit_val_batches=2 \ + exp_manager.resume_if_exists=True \ + trainer.max_steps=10 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + trainer.val_check_interval=10 \ + exp_manager.exp_dir=examples/nlp/language_modeling/retro_results \ + model.data.data_prefix='' \ + model.data.knn_index='' \ + model.data.retrieval_prefix='' \ + model.tensor_model_parallel_size=2 \ + model.micro_batch_size=4 \ + model.optim.name=fused_adam \ + model.optim.lr=2e-4 \ + model.optim.sched.warmup_steps=2 \ + model.optim.sched.constant_steps=2 \ + model.optim.sched.min_lr=8e-5 \ + model.max_position_embeddings=128 \ + model.encoder_seq_length=128 \ + model.chunk_size=32 \ + model.enc_num_layers=2 \ + model.dec_num_layers=2 \ + model.enc_cross_attention=[1] \ + model.dec_cross_attention=[1] \ + +model.data.mock=True" + sh "python examples/nlp/language_modeling/megatron_retro_pretraining.py \ + trainer.devices=2 \ + trainer.num_nodes=1 \ + trainer.accelerator=gpu \ + trainer.accumulate_grad_batches=1 \ + trainer.limit_val_batches=2 \ + exp_manager.resume_if_exists=True \ + trainer.max_steps=20 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + trainer.val_check_interval=10 \ + exp_manager.exp_dir=examples/nlp/language_modeling/retro_results \ + model.data.data_prefix='' \ + model.data.knn_index='' \ + model.data.retrieval_prefix='' \ + model.tensor_model_parallel_size=2 \ + model.micro_batch_size=4 \ + model.optim.name=fused_adam \ + model.optim.lr=2e-4 \ + model.optim.sched.warmup_steps=2 \ + model.optim.sched.constant_steps=2 \ + model.optim.sched.min_lr=8e-5 \ + model.max_position_embeddings=128 \ + model.encoder_seq_length=128 \ + model.chunk_size=32 \ + model.enc_num_layers=2 \ + model.dec_num_layers=2 \ + model.enc_cross_attention=[1] \ + model.dec_cross_attention=[1] \ + +model.data.mock=True" + sh "rm -rf examples/nlp/language_modeling/retro_results" + } + } + stage('L2: Megatron RETRO muTransfer Pretraining Performance') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + steps { + sh "python examples/nlp/language_modeling/megatron_retro_mutransfer_pretrain.py \ + trainer.devices=2 \ + trainer.num_nodes=1 \ + trainer.accelerator=gpu \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=100 \ + trainer.log_every_n_steps=1 \ + trainer.precision=16 \ + trainer.val_check_interval=100 \ + trainer.limit_val_batches=0 \ + trainer.gradient_clip_val=1.0 \ + +trainer.num_sanity_val_steps=0 \ + exp_manager.exp_dir=examples/nlp/language_modeling/retro_results/ \ + +exp_manager.version=smalltest \ + model.data.neighbors=2 \ + model.megatron_amp_O2=False \ + model.apply_query_key_layer_scaling=False \ + model.tensor_model_parallel_size=1 \ + model.optim.name=muadamw \ + model.optim.weight_decay=0.1 \ + model.optim.betas=[0.9,0.95] \ + model.optim.lr=6e-4 \ + model.optim.sched.warmup_steps=1000 \ + model.optim.sched.constant_steps=0 \ + model.optim.sched.min_lr=6e-5 
\ + model.add_position_embedding=False \ + model.enc_num_layers=2 \ + model.dec_num_layers=6 \ + model.enc_cross_attention=[0] \ + model.dec_cross_attention=[3,5] \ + model.hidden_size=96 \ + model.ffn_hidden_size=384 \ + model.init_method_std=0.023 \ + model.num_attention_heads=12 \ + model.max_position_embeddings=1024 \ + model.encoder_seq_length=1024 \ + model.tokenizer.library=megatron \ + model.tokenizer.type=GPT2BPETokenizer \ + model.tokenizer.merge_file=/home/TestData/nlp/megatron_retro/gpt2-merges.txt \ + model.tokenizer.vocab_file=/home/TestData/nlp/megatron_retro/gpt2-vocab.json \ + model.data.data_prefix=[/home/TestData/nlp/megatron_retro/retro_wiki_test_text_document] \ + model.data.knn_index=[/home/TestData/nlp/megatron_retro/knn2_map_wiki_test.idx] \ + model.data.retrieval_prefix=/home/TestData/nlp/megatron_retro/retro_wiki_test_text_document \ + model.data.index_mapping_dir=/home/TestData/nlp/megatron_retro \ + model.data.num_workers=8 \ + model.micro_batch_size=8 \ + model.normalization=rmsnorm \ + model.transformer_block_type=pre_ln \ + model.bias_activation_fusion=True \ + model.bias_dropout_add_fusion=False \ + model.masked_softmax_fusion=True \ + model.hidden_dropout=0 \ + model.attention_dropout=0 \ + model.fp32_residual_connection=True \ + model.shape_file=/home/TestData/nlp/megatron_retro/o1_rel_shape_info_tiny.yaml" + sh '''python -c "import pandas as pd +import pathlib +from pandas.testing import assert_frame_equal +from tensorboard.backend.event_processing.event_accumulator import EventAccumulator +import torch +if not (torch.cuda.is_available() and 'A100' in torch.cuda.get_device_name()): + import sys + sys.exit(0) +event_file = list(pathlib.Path('examples/nlp/language_modeling/retro_results/megatron_retro/smalltest').glob('events.out.tfevents*'))[0] +ea = EventAccumulator(str(event_file)).Reload() +vals = [] +for i in ea.Scalars('reduced_train_loss'): + vals.append(i.value) +training_curve = pd.DataFrame({'loss': vals}) +gt_curve = pd.read_csv('/home/TestData/nlp/megatron_retro/expected_learning_curve.csv') +assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"''' + sh "rm -rf examples/nlp/language_modeling/retro_results" + } + } + stage('L2: BioMegatron Bert NER Task') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + steps { + sh "python examples/nlp/token_classification/token_classification_train.py \ + exp_manager.exp_dir=examples/nlp/language_modeling/token_classification_results \ + trainer.max_epochs=1 \ + model.dataset.data_dir=/home/TestData/nlp/ner \ + model.language_model.pretrained_model_name=biomegatron345m_biovocab_30k_cased \ + model.tokenizer.tokenizer_name=null" + sh "rm -rf examples/nlp/language_modeling/token_classification_results" + } + } + stage('L2: Megatron GPT Pretraining and Resume Training TP=2') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + steps { + sh "python examples/nlp/language_modeling/megatron_gpt_pretraining.py \ + trainer.devices=2 \ + trainer.accelerator=gpu \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=2 \ + trainer.limit_val_batches=2 \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=3 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + exp_manager.exp_dir=examples/nlp/language_modeling/gpt_pretrain_results \ + model.tensor_model_parallel_size=2 \ + model.optim.name=fused_adam \ + model.optim.lr=2e-4 \ + model.optim.sched.warmup_steps=1 \ + 
model.optim.sched.constant_steps=1 \ + model.optim.sched.min_lr=8e-5 \ + model.max_position_embeddings=128 \ + model.encoder_seq_length=128 \ + model.data.seq_length=128 \ + model.position_embedding_type=rope \ + model.rotary_percentage=0.5 \ + model.normalization=rmsnorm \ + model.bias=False \ + model.bias_activation_fusion=False \ + model.bias_dropout_add_fusion=False \ + model.tokenizer.vocab_file=/home/TestData/nlp/megatron_gpt/data/gpt/vocab.json \ + model.tokenizer.merge_file=/home/TestData/nlp/megatron_gpt/data/gpt/merges.txt \ + model.num_layers=8 \ + model.hidden_size=256 \ + model.num_attention_heads=8 \ + model.activations_checkpoint_method='block' \ + model.activations_checkpoint_num_layers=1 \ + model.data.data_prefix=[.5,/home/TestData/nlp/megatron_gpt/data/gpt/simple_wiki_gpt_preproc_text_document,.5,/home/TestData/nlp/megatron_gpt/data/gpt/simple_wiki_gpt_preproc_text_document] \ + model.data.index_mapping_dir=examples/nlp/language_modeling/gpt_index_mappings" + sh "python examples/nlp/language_modeling/megatron_gpt_pretraining.py \ + trainer.devices=2 \ + trainer.accelerator=gpu \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=2 \ + trainer.limit_val_batches=1 \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=6 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + exp_manager.exp_dir=examples/nlp/language_modeling/gpt_pretrain_results \ + exp_manager.resume_if_exists=True \ + model.tensor_model_parallel_size=2 \ + model.optim.name=fused_adam \ + model.optim.lr=2e-4 \ + model.optim.sched.warmup_steps=2 \ + model.optim.sched.constant_steps=2 \ + model.optim.sched.min_lr=8e-5 \ + model.max_position_embeddings=128 \ + model.encoder_seq_length=128 \ + model.data.seq_length=128 \ + model.position_embedding_type=rope \ + model.rotary_percentage=0.5 \ + model.normalization=rmsnorm \ + model.bias=False \ + model.bias_activation_fusion=False \ + model.bias_dropout_add_fusion=False \ + model.tokenizer.vocab_file=/home/TestData/nlp/megatron_gpt/data/gpt/vocab.json \ + model.tokenizer.merge_file=/home/TestData/nlp/megatron_gpt/data/gpt/merges.txt \ + model.num_layers=8 \ + model.hidden_size=256 \ + model.num_attention_heads=8 \ + model.activations_checkpoint_method='block' \ + model.activations_checkpoint_num_layers=1 \ + model.data.data_prefix=[.5,/home/TestData/nlp/megatron_gpt/data/gpt/simple_wiki_gpt_preproc_text_document,.5,/home/TestData/nlp/megatron_gpt/data/gpt/simple_wiki_gpt_preproc_text_document] \ + model.data.index_mapping_dir=examples/nlp/language_modeling/gpt_index_mappings" + sh "rm -rf examples/nlp/language_modeling/gpt_pretrain_results" + sh "rm -rf examples/nlp/language_modeling/gpt_index_mappings" + } + } + stage('L2: Megatron GPT Pretraining and Resume Training PP=2') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + steps { + sh "python examples/nlp/language_modeling/megatron_gpt_pretraining.py \ + trainer.devices=2 \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=2 \ + trainer.limit_val_batches=2 \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=3 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + exp_manager.exp_dir=examples/nlp/language_modeling/gpt_pretrain_results \ + model.pipeline_model_parallel_size=2 \ + model.tensor_model_parallel_size=1 \ + model.optim.name=fused_adam \ + model.optim.lr=2e-4 \ + model.optim.sched.warmup_steps=1 \ + model.optim.sched.constant_steps=1 \ + model.optim.sched.min_lr=8e-5 \ + model.max_position_embeddings=128 
\ + model.encoder_seq_length=128 \ + model.activation=fast-swiglu \ + model.bias_activation_fusion=False \ + model.hidden_dropout=0.0 \ + model.attention_dropout=0.0 \ + model.transformer_block_type=normformer \ + model.headscale=True \ + model.data.seq_length=128 \ + model.tokenizer.vocab_file=/home/TestData/nlp/megatron_gpt/data/gpt/vocab.json \ + model.tokenizer.merge_file=/home/TestData/nlp/megatron_gpt/data/gpt/merges.txt \ + model.num_layers=8 \ + model.hidden_size=256 \ + model.num_attention_heads=8 \ + model.activations_checkpoint_method='block' \ + model.activations_checkpoint_num_layers=1 \ + model.data.data_prefix=[.5,/home/TestData/nlp/megatron_gpt/data/gpt/simple_wiki_gpt_preproc_text_document,.5,/home/TestData/nlp/megatron_gpt/data/gpt/simple_wiki_gpt_preproc_text_document] \ + model.data.index_mapping_dir=examples/nlp/language_modeling/gpt_index_mappings" + sh "python examples/nlp/language_modeling/megatron_gpt_pretraining.py \ + trainer.devices=2 \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=2 \ + trainer.limit_val_batches=2 \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=6 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + exp_manager.exp_dir=examples/nlp/language_modeling/gpt_pretrain_results \ + exp_manager.resume_if_exists=True \ + model.pipeline_model_parallel_size=2 \ + model.tensor_model_parallel_size=1 \ + model.optim.name=fused_adam \ + model.optim.lr=2e-4 \ + model.optim.sched.warmup_steps=2 \ + model.optim.sched.constant_steps=2 \ + model.optim.sched.min_lr=8e-5 \ + model.max_position_embeddings=128 \ + model.encoder_seq_length=128 \ + model.activation=fast-swiglu \ + model.bias_activation_fusion=False \ + model.hidden_dropout=0.0 \ + model.attention_dropout=0.0 \ + model.transformer_block_type=normformer \ + model.headscale=True \ + model.data.seq_length=128 \ + model.tokenizer.vocab_file=/home/TestData/nlp/megatron_gpt/data/gpt/vocab.json \ + model.tokenizer.merge_file=/home/TestData/nlp/megatron_gpt/data/gpt/merges.txt \ + model.num_layers=8 \ + model.hidden_size=256 \ + model.num_attention_heads=8 \ + model.activations_checkpoint_method='block' \ + model.activations_checkpoint_num_layers=1 \ + model.data.data_prefix=[.5,/home/TestData/nlp/megatron_gpt/data/gpt/simple_wiki_gpt_preproc_text_document,.5,/home/TestData/nlp/megatron_gpt/data/gpt/simple_wiki_gpt_preproc_text_document] \ + model.data.index_mapping_dir=examples/nlp/language_modeling/gpt_index_mappings" + sh "rm -rf examples/nlp/language_modeling/gpt_pretrain_results" + sh "rm -rf examples/nlp/language_modeling/gpt_index_mappings" + } + } + stage('L2: Megatron GPT Eval') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + steps{ + sh "python examples/nlp/language_modeling/megatron_gpt_eval.py \ + gpt_model_file=/home/TestData/nlp/megatron_gpt/125M/megatron_gpt.nemo \ + prompts=['How to fix GPU memory? 
A:'] \ + tensor_model_parallel_size=1 \ + inference.tokens_to_generate=32 \ + trainer.precision=16" + } + } + stage('L2: Megatron GPT Eval PP2') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + steps { + sh "python examples/nlp/language_modeling/megatron_gpt_eval.py \ + gpt_model_file=/home/TestData/nlp/megatron_gpt/PP2/gpt_pp2_tp1.nemo \ + server=False \ + tensor_model_parallel_size=1 \ + pipeline_model_parallel_size=2 \ + trainer.devices=2 \ + trainer.num_nodes=1" + } + } + + stage('L2: Megatron GPT Prompt Tuning TP1 PP1') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel{ + stage('GPT Prompt Learning TP=1 PP=1') { + steps { + sh "python examples/nlp/language_modeling/megatron_gpt_prompt_learning.py \ + --config-name=megatron_gpt_prompt_learning_config \ + name='/home/TestData/nlp/prompt_learning/prompt_tuning_test' \ + trainer.devices=1 \ + trainer.max_steps=1 \ + trainer.val_check_interval=1 \ + trainer.max_epochs=null \ + model.data.num_workers=1 \ + model.tensor_model_parallel_size=1 \ + model.virtual_prompt_style='p-tuning' \ + model.p_tuning.encoder_type='embedding' \ + model.language_model_path='/home/TestData/nlp/megatron_gpt/tiny/megatron_14m_gpt_tp1_pp1.nemo' \ + model.existing_tasks=[] \ + model.new_tasks=['rte'] \ + model.data.train_ds=['/home/TestData/nlp/prompt_learning/rte_CI_test.jsonl'] \ + model.data.validation_ds=['/home/TestData/nlp/prompt_learning/rte_CI_test.jsonl'] \ + model.global_batch_size=4" + sh "rm -rf /home/TestData/nlp/prompt_learning/prompt_tuning_test" + sh "rm -rf /home/TestData/nlp/prompt_learning/prompt_tuning_test.nemo" + } + } + } + } + + stage('L2: Megatron GPT Prompt Tuning TP2 PP1') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel{ + stage('GPT Prompt Learning TP=2 PP=1') { + steps { + sh "python examples/nlp/language_modeling/megatron_gpt_prompt_learning.py \ + --config-name=megatron_gpt_prompt_learning_config \ + name='/home/TestData/nlp/prompt_learning/p_tuning_test_tp' \ + trainer.devices=2 \ + trainer.max_steps=1 \ + trainer.val_check_interval=1 \ + trainer.max_epochs=null \ + model.data.num_workers=1 \ + model.tensor_model_parallel_size=2 \ + model.language_model_path='/home/TestData/nlp/megatron_gpt/tiny/megatron_14m_gpt_tp2_pp1.nemo' \ + model.existing_tasks=[] \ + model.new_tasks=['rte'] \ + model.data.train_ds=['/home/TestData/nlp/prompt_learning/rte_CI_test.jsonl'] \ + model.data.validation_ds=['/home/TestData/nlp/prompt_learning/rte_CI_test.jsonl'] \ + model.global_batch_size=4" + sh "rm -rf /home/TestData/nlp/prompt_learning/p_tuning_test_tp" + sh "python examples/nlp/language_modeling/megatron_gpt_prompt_learning_eval.py \ + virtual_prompt_model_file='/home/TestData/nlp/prompt_learning/p_tuning_test_tp.nemo' \ + gpt_model_file='/home/TestData/nlp/megatron_gpt/tiny/megatron_14m_gpt_tp2_pp1.nemo' \ + inference.greedy=True \ + inference.add_BOS=False \ + trainer.devices=2 \ + tensor_model_parallel_size=2 \ + pred_file_path=/home/TestData/nlp/prompt_learning/p_tuning_test_tp_preds.txt \ + data_paths=['/home/TestData/nlp/prompt_learning/rte_CI_test.jsonl']" + sh "rm -rf /home/TestData/nlp/prompt_learning/p_tuning_test_tp.nemo" + sh "rm -rf /home/TestData/nlp/prompt_learning/p_tuning_test_tp_preds.txt" + } + } + } + } + + // TODO: add when https://github.com/NVIDIA/apex/pull/1596 is merged + // stage('L2: Megatron GPT Prompt Tuning TP1 PP2') { + // when { + // anyOf { + // 
branch 'r1.17.0' + // changeRequest target: 'r1.17.0' + // } + // } + // failFast true + // parallel{ + // stage('GPT Prompt Learning TP=1 PP=2') { + // steps { + // sh "python examples/nlp/language_modeling/megatron_gpt_prompt_learning.py \ + // --config-name=megatron_gpt_prompt_learning_config \ + // name='/home/TestData/nlp/prompt_learning/p_tuning_test_pp' \ + // trainer.devices=2 \ + // trainer.max_steps=1 \ + // trainer.val_check_interval=1 \ + // trainer.max_epochs=null \ + // model.optim.name=fused_adam \ + // model.data.num_workers=1 \ + // model.pipeline_model_parallel_size=2 \ + // model.language_model_path='/home/TestData/nlp/megatron_gpt/tiny/megatron_14m_gpt_tp1_pp2.nemo' \ + // model.existing_tasks=[] \ + // model.new_tasks=['boolq'] \ + // model.data.train_ds=['/home/TestData/nlp/prompt_learning/boolq_CI_test.jsonl'] \ + // model.data.validation_ds=['/home/TestData/nlp/prompt_learning/boolq_CI_test.jsonl'] \ + // model.global_batch_size=4" + // sh "rm -rf /home/TestData/nlp/prompt_learning/p_tuning_test_pp" + // sh "python examples/nlp/language_modeling/megatron_gpt_prompt_learning_eval.py \ + // virtual_prompt_model_file='/home/TestData/nlp/prompt_learning/p_tuning_test_pp.nemo' \ + // gpt_model_file='/home/TestData/nlp/megatron_gpt/tiny/megatron_14m_gpt_tp1_pp2.nemo' \ + // inference.greedy=True \ + // inference.add_BOS=False \ + // trainer.devices=2 \ + // pipeline_model_parallel_size=2 \ + // pred_file_path=/home/TestData/nlp/prompt_learning/p_tuning_test_pp_preds.txt \ + // data_paths=['/home/TestData/nlp/prompt_learning/boolq_CI_test.jsonl']" + // sh "rm -rf /home/TestData/nlp/prompt_learning/p_tuning_test_pp.nemo" + // sh "rm -rf /home/TestData/nlp/prompt_learning/p_tuning_test_pp_preds.txt" + // } + // } + // } + // } + + // TODO: Add this test back. 
Test was failing on CI machines due to HW error + // stage('L2: Megatron GPT Convert from Megatron-LM checkpoint and Eval') { + // when { + // anyOf { + // branch 'r1.17.0' + // changeRequest target: 'r1.17.0' + // } + // } + // failFast true + // steps { + // sh "python -m torch.distributed.launch --nproc_per_node=2 \ + // examples/nlp/language_modeling/megatron_lm_ckpt_to_nemo.py \ + // --checkpoint_folder=/home/TestData/nlp/megatron_gpt/data/gpt/iter_0008700 \ + // --checkpoint_name=model_optim_rng.pt \ + // --hparams_file=/home/TestData/nlp/megatron_gpt/data/gpt/iter_0008700/hparams.yaml \ + // --nemo_file_path=examples/nlp/language_modeling/small_gpt.nemo \ + // --model_type=gpt \ + // --pipeline_model_parallel_size=1 \ + // --gpus_per_node=2 \ + // --tensor_model_parallel_size=2" + // sh "python examples/nlp/language_modeling/megatron_gpt_eval.py \ + // --gpt_model_file=examples/nlp/language_modeling/small_gpt.nemo \ + // --tokens_to_generate=32 \ + // --tensor_model_parallel_size=2 \ + // --prompt='This is a test.'" + // sh "rm examples/nlp/language_modeling/small_gpt.nemo" + // } + // } + stage('L2: Megatron Change Partitions') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel{ + stage('Reduce Num Partitions (2 to 1)'){ + steps{ + sh "python examples/nlp/language_modeling/megatron_change_num_partitions.py \ + --model_file \ + /home/TestData/nlp/megatron_gpt/TP2/megatron_gpt_tp2.nemo \ + --target_file \ + /home/TestData/nlp/megatron_gpt/TP2/test-reduce.nemo \ + --tensor_model_parallel_size \ + 2 \ + --target_tensor_model_parallel_size \ + 1" + sh "rm /home/TestData/nlp/megatron_gpt/TP2/test-reduce.nemo" + } + } + stage('Increase Num Partitions (2 to 4)'){ + steps{ + sh "python examples/nlp/language_modeling/megatron_change_num_partitions.py \ + --model_file \ + /home/TestData/nlp/megatron_gpt/TP2/megatron_gpt_tp2.nemo \ + --target_file \ + /home/TestData/nlp/megatron_gpt/TP2/test-increase.nemo \ + --tensor_model_parallel_size \ + 2 \ + --target_tensor_model_parallel_size \ + 4" + sh "rm /home/TestData/nlp/megatron_gpt/TP2/test-increase.nemo" + } + } + } + } + stage('L2: Megatron T5 Pretraining and Resume Training TP=2') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + steps { + sh "python examples/nlp/language_modeling/megatron_t5_pretraining.py \ + trainer.devices=2 \ + trainer.accelerator=gpu \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=10 \ + trainer.limit_val_batches=2 \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=10 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + exp_manager.exp_dir=examples/nlp/language_modeling/t5_pretrain_results \ + model.tensor_model_parallel_size=2 \ + model.seq_length=128 \ + model.encoder.num_layers=4 \ + model.encoder.hidden_size=64 \ + model.encoder.num_attention_heads=8 \ + model.encoder.activation='swiglu' \ + model.encoder.masked_softmax_fusion=False \ + model.encoder.bias_activation_fusion=False \ + model.encoder.activations_checkpoint_method='block' \ + model.encoder.activations_checkpoint_num_layers=1 \ + model.encoder.position_embedding_type=relative \ + model.decoder.num_layers=2 \ + model.decoder.hidden_size=64 \ + model.decoder.num_attention_heads=8 \ + model.decoder.activation='fast-swiglu' \ + model.decoder.masked_softmax_fusion=False \ + model.decoder.bias_activation_fusion=False \ + model.decoder.activations_checkpoint_method='block' \ + 
model.decoder.activations_checkpoint_num_layers=1 \ + model.encoder.transformer_block_type='pre_ln' \ + model.decoder.transformer_block_type='pre_ln' \ + model.data.data_prefix=[.5,/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src,.5,/home/TestData/nlp/nmt/toy_data/wmt14-de-en.ref] \ + model.data.index_mapping_dir=examples/nlp/language_modeling/t5_index_mappings \ + model.data.data_impl=text_mmap \ + +model.data.data_impl_kwargs.newline_int=10 \ + +model.data.data_impl_kwargs.header_lines=0 \ + +model.data.data_impl_kwargs.workers=null \ + +model.data.data_impl_kwargs.sort_dataset_paths=False \ + model.share_token_embeddings=False \ + model.share_decoder_tokens_head_embeddings=False" + sh "python examples/nlp/language_modeling/megatron_t5_pretraining.py \ + trainer.devices=2 \ + trainer.accelerator=gpu \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=10 \ + trainer.limit_val_batches=2 \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=10 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + exp_manager.exp_dir=examples/nlp/language_modeling/t5_pretrain_results \ + exp_manager.resume_if_exists=True \ + model.tensor_model_parallel_size=2 \ + model.seq_length=128 \ + model.encoder.num_layers=4 \ + model.encoder.hidden_size=64 \ + model.encoder.num_attention_heads=8 \ + model.encoder.activation='swiglu' \ + model.encoder.masked_softmax_fusion=False \ + model.encoder.bias_activation_fusion=False \ + model.encoder.activations_checkpoint_method='block' \ + model.encoder.activations_checkpoint_num_layers=1 \ + model.encoder.position_embedding_type=relative \ + model.decoder.num_layers=2 \ + model.decoder.hidden_size=64 \ + model.decoder.num_attention_heads=8 \ + model.decoder.activation='fast-swiglu' \ + model.decoder.masked_softmax_fusion=False \ + model.decoder.bias_activation_fusion=False \ + model.decoder.activations_checkpoint_method='block' \ + model.decoder.activations_checkpoint_num_layers=1 \ + model.encoder.transformer_block_type='pre_ln' \ + model.decoder.transformer_block_type='pre_ln' \ + model.data.data_prefix=[.5,/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src,.5,/home/TestData/nlp/nmt/toy_data/wmt14-de-en.ref] \ + model.data.index_mapping_dir=examples/nlp/language_modeling/t5_index_mappings \ + model.data.data_impl=text_mmap \ + +model.data.data_impl_kwargs.newline_int=10 \ + +model.data.data_impl_kwargs.header_lines=0 \ + +model.data.data_impl_kwargs.workers=null \ + +model.data.data_impl_kwargs.sort_dataset_paths=False \ + model.share_token_embeddings=False \ + model.share_decoder_tokens_head_embeddings=False" + sh "rm -rf examples/nlp/language_modeling/t5_pretrain_results" + sh "rm -rf examples/nlp/language_modeling/t5_index_mappings" + } + } + stage('L2: Megatron T5 with ALiBi Pretraining and Resume Training TP=2') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + steps { + sh "python examples/nlp/language_modeling/megatron_t5_pretraining.py \ + trainer.devices=2 \ + trainer.accelerator=gpu \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=10 \ + trainer.limit_val_batches=2 \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=10 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + exp_manager.exp_dir=examples/nlp/language_modeling/t5_pretrain_results \ + model.tensor_model_parallel_size=2 \ + model.seq_length=128 \ + model.encoder.num_layers=4 \ + model.encoder.hidden_size=64 \ + model.encoder.num_attention_heads=8 \ + model.encoder.activation='swiglu' \ + 
model.encoder.masked_softmax_fusion=False \ + model.encoder.bias_activation_fusion=False \ + model.encoder.activations_checkpoint_method='block' \ + model.encoder.activations_checkpoint_num_layers=1 \ + model.encoder.position_embedding_type=alibi \ + model.decoder.num_layers=2 \ + model.decoder.hidden_size=64 \ + model.decoder.num_attention_heads=8 \ + model.decoder.activation='swiglu' \ + model.decoder.masked_softmax_fusion=False \ + model.decoder.bias_activation_fusion=False \ + model.decoder.activations_checkpoint_method='block' \ + model.decoder.activations_checkpoint_num_layers=1 \ + model.encoder.transformer_block_type='pre_ln' \ + model.decoder.transformer_block_type='pre_ln' \ + model.data.data_prefix=[.5,/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src,.5,/home/TestData/nlp/nmt/toy_data/wmt14-de-en.ref] \ + model.data.index_mapping_dir=examples/nlp/language_modeling/t5_index_mappings \ + model.data.data_impl=text_mmap \ + +model.data.data_impl_kwargs.newline_int=10 \ + +model.data.data_impl_kwargs.header_lines=0 \ + +model.data.data_impl_kwargs.workers=null \ + +model.data.data_impl_kwargs.sort_dataset_paths=False \ + model.share_token_embeddings=False \ + model.share_decoder_tokens_head_embeddings=False" + sh "python examples/nlp/language_modeling/megatron_t5_pretraining.py \ + trainer.devices=2 \ + trainer.accelerator=gpu \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=10 \ + trainer.limit_val_batches=2 \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=10 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + exp_manager.exp_dir=examples/nlp/language_modeling/t5_pretrain_results \ + exp_manager.resume_if_exists=True \ + model.tensor_model_parallel_size=2 \ + model.seq_length=128 \ + model.encoder.num_layers=4 \ + model.encoder.hidden_size=64 \ + model.encoder.num_attention_heads=8 \ + model.encoder.activation='swiglu' \ + model.encoder.masked_softmax_fusion=False \ + model.encoder.bias_activation_fusion=False \ + model.encoder.activations_checkpoint_method='block' \ + model.encoder.activations_checkpoint_num_layers=1 \ + model.encoder.position_embedding_type=alibi \ + model.decoder.num_layers=2 \ + model.decoder.hidden_size=64 \ + model.decoder.num_attention_heads=8 \ + model.decoder.activation='swiglu' \ + model.decoder.masked_softmax_fusion=False \ + model.decoder.bias_activation_fusion=False \ + model.decoder.activations_checkpoint_method='block' \ + model.decoder.activations_checkpoint_num_layers=1 \ + model.encoder.transformer_block_type='pre_ln' \ + model.decoder.transformer_block_type='pre_ln' \ + model.data.data_prefix=[.5,/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src,.5,/home/TestData/nlp/nmt/toy_data/wmt14-de-en.ref] \ + model.data.index_mapping_dir=examples/nlp/language_modeling/t5_index_mappings \ + model.data.data_impl=text_mmap \ + +model.data.data_impl_kwargs.newline_int=10 \ + +model.data.data_impl_kwargs.header_lines=0 \ + +model.data.data_impl_kwargs.workers=null \ + +model.data.data_impl_kwargs.sort_dataset_paths=False \ + model.share_token_embeddings=False \ + model.share_decoder_tokens_head_embeddings=False" + sh "rm -rf examples/nlp/language_modeling/t5_pretrain_results" + sh "rm -rf examples/nlp/language_modeling/t5_index_mappings" + } + } + stage('L2: Megatron T5 Pretraining and Resume Training PP=2') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + steps { + sh "python examples/nlp/language_modeling/megatron_t5_pretraining.py \ + trainer.devices=2 \ + 
trainer.accelerator=gpu \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=10 \ + trainer.limit_val_batches=2 \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=10 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + exp_manager.exp_dir=examples/nlp/language_modeling/t5_pretrain_results \ + model.pipeline_model_parallel_size=2 \ + model.pipeline_model_parallel_split_rank=1 \ + model.seq_length=256 \ + model.encoder.num_layers=4 \ + model.decoder.num_layers=1 \ + model.encoder.hidden_size=64 \ + model.decoder.hidden_size=64 \ + model.encoder.num_attention_heads=8 \ + model.decoder.num_attention_heads=8 \ + model.decoder.ffn_hidden_size=2048 \ + model.encoder.activation='gelu' \ + model.encoder.activations_checkpoint_method='block' \ + model.encoder.activations_checkpoint_num_layers=1 \ + model.encoder.transformer_block_type='pre_ln' \ + model.decoder.transformer_block_type='post_ln' \ + model.data.data_prefix=[.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document,.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document] \ + model.data.index_mapping_dir=examples/nlp/language_modeling/t5_index_mappings" + sh "python examples/nlp/language_modeling/megatron_t5_pretraining.py \ + trainer.devices=2 \ + trainer.accelerator=gpu \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=10 \ + trainer.limit_val_batches=2 \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=10 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + exp_manager.exp_dir=examples/nlp/language_modeling/t5_pretrain_results \ + exp_manager.resume_if_exists=True \ + model.pipeline_model_parallel_size=2 \ + model.pipeline_model_parallel_split_rank=1 \ + model.seq_length=256 \ + model.encoder.num_layers=4 \ + model.decoder.num_layers=1 \ + model.encoder.hidden_size=64 \ + model.decoder.hidden_size=64 \ + model.encoder.num_attention_heads=8 \ + model.decoder.num_attention_heads=8 \ + model.decoder.ffn_hidden_size=2048 \ + model.encoder.activation='gelu' \ + model.encoder.activations_checkpoint_method='block' \ + model.encoder.activations_checkpoint_num_layers=1 \ + model.encoder.transformer_block_type='pre_ln' \ + model.decoder.transformer_block_type='post_ln' \ + model.data.data_prefix=[.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document,.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document] \ + model.data.index_mapping_dir=examples/nlp/language_modeling/t5_index_mappings" + sh "rm -rf examples/nlp/language_modeling/t5_pretrain_results" + sh "rm -rf examples/nlp/language_modeling/t5_index_mappings" + } + } + stage('L2: Megatron T5 w/ Mixture of Expert Pretraining') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + steps { + sh "python examples/nlp/language_modeling/megatron_t5_pretraining.py \ + trainer.devices=2 \ + trainer.accelerator=gpu \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=10 \ + trainer.limit_val_batches=2 \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=10 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + exp_manager.exp_dir=examples/nlp/language_modeling/t5_pretrain_results \ + model.pipeline_model_parallel_split_rank=1 \ + model.seq_length=256 \ + model.encoder.num_layers=4 \ + model.decoder.num_layers=1 \ + model.encoder.num_moe_experts=4 \ + model.decoder.num_moe_experts=4 \ + model.encoder.moe_frequency=3 \ + model.decoder.moe_frequency=1 \ + 
model.encoder.hidden_size=64 \ + model.decoder.hidden_size=64 \ + model.encoder.num_attention_heads=8 \ + model.decoder.num_attention_heads=8 \ + model.decoder.ffn_hidden_size=2048 \ + model.encoder.activation='gelu' \ + model.encoder.activations_checkpoint_method='block' \ + model.encoder.activations_checkpoint_num_layers=1 \ + model.encoder.transformer_block_type='pre_ln' \ + model.decoder.transformer_block_type='post_ln' \ + model.data.data_prefix=[.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document,.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document] \ + model.data.index_mapping_dir=examples/nlp/language_modeling/t5_index_mappings" + sh "rm -rf examples/nlp/language_modeling/t5_pretrain_results" + sh "rm -rf examples/nlp/language_modeling/t5_index_mappings" + } + } + + stage('L2: Megatron T5 Prompt Learning TP1 PP1') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel{ + stage('T5 Prompt Learning TP=1 PP=1') { + steps { + sh "python examples/nlp/language_modeling/megatron_t5_prompt_learning.py \ + --config-name=megatron_t5_prompt_learning \ + name='/home/TestData/nlp/prompt_learning/t5_p_tuning_test' \ + trainer.devices=1 \ + trainer.max_steps=1 \ + trainer.val_check_interval=1 \ + trainer.max_epochs=null \ + model.data.num_workers=1 \ + model.language_model_path='/home/TestData/nlp/megatron_t5/8m/megatron_t5_8m-refactor.nemo' \ + model.existing_tasks=[] \ + model.new_tasks=['squad'] \ + model.data.train_ds=['/home/TestData/nlp/prompt_learning/squad_CI_test.jsonl'] \ + model.data.validation_ds=['/home/TestData/nlp/prompt_learning/squad_CI_test.jsonl'] \ + model.global_batch_size=4 \ + model.micro_batch_size=4" + sh "rm -rf /home/TestData/nlp/prompt_learning/t5_p_tuning_test" + sh "python examples/nlp/language_modeling/megatron_t5_prompt_learning_eval.py \ + virtual_prompt_model_file='/home/TestData/nlp/prompt_learning/t5_p_tuning_test.nemo' \ + language_model_path='/home/TestData/nlp/megatron_t5/8m/megatron_t5_8m-refactor.nemo' \ + data.test_ds=['/home/TestData/nlp/prompt_learning/squad_CI_test.jsonl'] \ + pred_file_path='/home/TestData/nlp/prompt_learning/t5_p_tuning_test_preds.txt' \ + data.global_batch_size=4 \ + data.micro_batch_size=4" + sh "rm -rf /home/TestData/nlp/prompt_learning/t5_p_tuning_test.nemo" + sh "rm -rf /home/TestData/nlp/prompt_learning/t5_p_tuning_test_preds.txt" + } + } + } + } + + stage('L2: Megatron T5 Prompt Learning TP2 PP1') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel{ + stage('T5 Prompt Learning TP=2 PP=1') { + steps { + sh "python examples/nlp/language_modeling/megatron_t5_prompt_learning.py \ + --config-name=megatron_t5_prompt_learning \ + name='/home/TestData/nlp/prompt_learning/t5_p_tuning_test_tp2' \ + trainer.devices=2 \ + trainer.max_steps=1 \ + trainer.val_check_interval=1 \ + trainer.max_epochs=null \ + model.data.num_workers=1 \ + model.tensor_model_parallel_size=2 \ + model.language_model_path='/home/TestData/nlp/megatron_t5/8m/megatron_t5_8m_tp2.nemo' \ + model.existing_tasks=[] \ + model.new_tasks=['squad'] \ + model.data.train_ds=['/home/TestData/nlp/prompt_learning/squad_CI_test.jsonl'] \ + model.data.validation_ds=['/home/TestData/nlp/prompt_learning/squad_CI_test.jsonl'] \ + model.global_batch_size=8 \ + model.micro_batch_size=8" + sh "rm -rf /home/TestData/nlp/prompt_learning/t5_p_tuning_test_tp2" + sh "python 
examples/nlp/language_modeling/megatron_t5_prompt_learning_eval.py \ + virtual_prompt_model_file='/home/TestData/nlp/prompt_learning/t5_p_tuning_test_tp2.nemo' \ + language_model_path='/home/TestData/nlp/megatron_t5/8m/megatron_t5_8m_tp2.nemo' \ + data.test_ds=['/home/TestData/nlp/prompt_learning/squad_CI_test.jsonl'] \ + pred_file_path='/home/TestData/nlp/prompt_learning/t5_p_tuning_test_tp2_preds.txt' \ + tensor_model_parallel_size=2 \ + trainer.devices=2 \ + data.global_batch_size=8 \ + data.micro_batch_size=8" + sh "rm -rf /home/TestData/nlp/prompt_learning/t5_p_tuning_test_tp2.nemo" + sh "rm -rf /home/TestData/nlp/prompt_learning/t5_p_tuning_test_tp2_preds.txt" + } + } + } + } + + // TODO: add when https://github.com/NVIDIA/apex/pull/1596 is merged + // stage('L2: Megatron T5 Prompt Learning TP1 PP2') { + // when { + // anyOf { + // branch 'r1.17.0' + // changeRequest target: 'r1.17.0' + // } + // } + // failFast true + // parallel{ + // stage('T5 Prompt Learning TP=1 PP=2') { + // steps { + // sh "python examples/nlp/language_modeling/megatron_t5_prompt_learning.py \ + // --config-name=megatron_t5_prompt_learning \ + // name='/home/TestData/nlp/prompt_learning/t5_p_tuning_test_pp2' \ + // trainer.devices=2 \ + // trainer.max_steps=1 \ + // trainer.val_check_interval=1 \ + // trainer.max_epochs=null \ + // model.data.num_workers=1 \ + // model.pipeline_model_parallel_size=2 \ + // model.language_model_path='/home/TestData/nlp/megatron_t5/8m/megatron_t5_8m_tp1_pp2.nemo' \ + // model.existing_tasks=[] \ + // model.new_tasks=['squad'] \ + // model.data.train_ds=['/home/TestData/nlp/prompt_learning/squad_CI_test.jsonl'] \ + // model.data.validation_ds=['/home/TestData/nlp/prompt_learning/squad_CI_test.jsonl'] \ + // model.global_batch_size=8 \ + // model.micro_batch_size=8" + // sh "rm -rf /home/TestData/nlp/prompt_learning/t5_p_tuning_test_pp2" + // sh "python examples/nlp/language_modeling/megatron_t5_prompt_learning_eval.py \ + // virtual_prompt_model_file='/home/TestData/nlp/prompt_learning/t5_p_tuning_test_pp2.nemo' \ + // language_model_path='/home/TestData/nlp/megatron_t5/8m/megatron_t5_8m_tp1_pp2.nemo' \ + // data.test_ds=['/home/TestData/nlp/prompt_learning/squad_CI_test.jsonl'] \ + // pred_file_path='/home/TestData/nlp/prompt_learning/t5_p_tuning_test_pp2_preds.txt' \ + // tensor_model_parallel_size=2 \ + // trainer.devices=2 \ + // data.global_batch_size=8 \ + // data.micro_batch_size=8" + // sh "rm -rf /home/TestData/nlp/prompt_learning/t5_p_tuning_test_pp2.nemo" + // sh "rm -rf /home/TestData/nlp/prompt_learning/t5_p_tuning_test_pp2_preds.txt" + // } + // } + // } + // } + + stage('L2: Megatron UL2 Pretraining and Resume Training TP=2') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + steps { + sh "python examples/nlp/language_modeling/megatron_t5_pretraining.py -cn megatron_ul2_config \ + trainer.devices=2 \ + trainer.accelerator=gpu \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=10 \ + trainer.limit_val_batches=2 \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=10 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + exp_manager.exp_dir=examples/nlp/language_modeling/t5_pretrain_results \ + model.tensor_model_parallel_size=2 \ + model.seq_length=128 \ + model.encoder.num_layers=4 \ + model.encoder.hidden_size=64 \ + model.encoder.num_attention_heads=8 \ + model.encoder.activation='swiglu' \ + model.encoder.bias_activation_fusion=False \ + 
model.encoder.activations_checkpoint_method='block' \ + model.encoder.activations_checkpoint_num_layers=1 \ + model.encoder.transformer_block_type='normformer' \ + model.encoder.headscale=True \ + model.decoder.num_layers=4 \ + model.decoder.hidden_size=64 \ + model.decoder.num_attention_heads=8 \ + model.decoder.activation='geglu' \ + model.decoder.bias_activation_fusion=False \ + model.decoder.activations_checkpoint_method='block' \ + model.decoder.activations_checkpoint_num_layers=1 \ + model.decoder.transformer_block_type='normformer' \ + model.decoder.headscale=False \ + model.data.data_prefix=[.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document,.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document] \ + model.data.index_mapping_dir=examples/nlp/language_modeling/t5_index_mappings" + sh "python examples/nlp/language_modeling/megatron_t5_pretraining.py \ + trainer.devices=2 \ + trainer.accelerator=gpu \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=10 \ + trainer.limit_val_batches=2 \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=10 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + exp_manager.exp_dir=examples/nlp/language_modeling/t5_pretrain_results \ + exp_manager.resume_if_exists=True \ + model.tensor_model_parallel_size=2 \ + model.seq_length=128 \ + model.encoder.num_layers=4 \ + model.encoder.hidden_size=64 \ + model.encoder.num_attention_heads=8 \ + model.encoder.activation='swiglu' \ + model.encoder.bias_activation_fusion=False \ + model.encoder.activations_checkpoint_method='block' \ + model.encoder.activations_checkpoint_num_layers=1 \ + model.encoder.transformer_block_type='normformer' \ + model.encoder.headscale=True \ + model.decoder.num_layers=4 \ + model.decoder.hidden_size=64 \ + model.decoder.num_attention_heads=8 \ + model.decoder.activation='geglu' \ + model.decoder.bias_activation_fusion=False \ + model.decoder.activations_checkpoint_method='block' \ + model.decoder.activations_checkpoint_num_layers=1 \ + model.decoder.transformer_block_type='normformer' \ + model.decoder.headscale=False \ + model.data.data_prefix=[.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document,.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document] \ + model.data.index_mapping_dir=examples/nlp/language_modeling/t5_index_mappings" + sh "rm -rf examples/nlp/language_modeling/t5_pretrain_results" + sh "rm -rf examples/nlp/language_modeling/t5_index_mappings" + } + } + stage('L2: Megatron T5 Eval') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + steps{ + sh "python examples/nlp/language_modeling/megatron_t5_eval.py \ + --model_file \ + /home/TestData/nlp/megatron_t5/8m/megatron_t5_8m-refactor.nemo \ + --prompt \ + 'How do I fix my GPU memory issue? I am seeing out of memory.' 
\ + --tensor_model_parallel_size 1" + } + } + stage('L2: Megatron BART Pretraining and Resume Training, TP=2') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + steps { + sh "python examples/nlp/language_modeling/megatron_bart_pretraining.py \ + trainer.devices=2 \ + trainer.accelerator=gpu \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=2 \ + trainer.limit_val_batches=2 \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=3 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + exp_manager.exp_dir=examples/nlp/language_modeling/bart_pretrain_results \ + model.tensor_model_parallel_size=2 \ + model.seq_length=128 \ + model.encoder.num_layers=4 \ + model.encoder.hidden_size=64 \ + model.encoder.num_attention_heads=8 \ + model.encoder.activation='reglu' \ + model.encoder.bias_activation_fusion=False \ + model.encoder.activations_checkpoint_method='block' \ + model.encoder.activations_checkpoint_num_layers=1 \ + model.decoder.num_layers=4 \ + model.decoder.hidden_size=64 \ + model.decoder.num_attention_heads=8 \ + model.decoder.activation='reglu' \ + model.decoder.bias_activation_fusion=False \ + model.decoder.activations_checkpoint_method='block' \ + model.decoder.activations_checkpoint_num_layers=1 \ + model.data.data_prefix='{train:[1.0,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document],test:[/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document], validation:[/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document]}'" + sh "python examples/nlp/language_modeling/megatron_bart_pretraining.py \ + trainer.devices=2 \ + trainer.accelerator=gpu \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=2 \ + trainer.limit_val_batches=1 \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=6 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + exp_manager.exp_dir=examples/nlp/language_modeling/bart_pretrain_results \ + exp_manager.resume_if_exists=True \ + model.tensor_model_parallel_size=2 \ + model.seq_length=128 \ + model.encoder.num_layers=4 \ + model.encoder.hidden_size=64 \ + model.encoder.num_attention_heads=8 \ + model.encoder.activation='reglu' \ + model.encoder.bias_activation_fusion=False \ + model.encoder.activations_checkpoint_method='block' \ + model.encoder.activations_checkpoint_num_layers=1 \ + model.decoder.num_layers=4 \ + model.decoder.hidden_size=64 \ + model.decoder.num_attention_heads=8 \ + model.decoder.activation='reglu' \ + model.decoder.bias_activation_fusion=False \ + model.decoder.activations_checkpoint_method='block' \ + model.decoder.activations_checkpoint_num_layers=1 \ + model.data.data_prefix='{train:[1.0,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document],test:[/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document], validation:[/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document]}'" + sh "rm -rf examples/nlp/language_modeling/bart_pretrain_results" + } + } + stage('L2: Megatron BART Pretraining and Resume Training, PP=2') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + steps { + sh "python examples/nlp/language_modeling/megatron_bart_pretraining.py \ + trainer.devices=2 \ + trainer.accelerator=gpu \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=10 \ + trainer.limit_val_batches=2 \ + trainer.accumulate_grad_batches=1 \ + 
trainer.max_steps=10 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + exp_manager.exp_dir=examples/nlp/language_modeling/bart_pretrain_results \ + model.pipeline_model_parallel_size=2 \ + model.pipeline_model_parallel_split_rank=1 \ + model.seq_length=256 \ + model.encoder.num_layers=4 \ + model.encoder.hidden_size=64 \ + model.encoder.num_attention_heads=8 \ + model.encoder.activation='geglu' \ + model.encoder.bias_activation_fusion=False \ + model.encoder.activations_checkpoint_method='block' \ + model.encoder.activations_checkpoint_num_layers=1 \ + model.decoder.num_layers=4 \ + model.decoder.hidden_size=64 \ + model.decoder.num_attention_heads=8 \ + model.decoder.activation='geglu' \ + model.decoder.bias_activation_fusion=False \ + model.decoder.activations_checkpoint_method='block' \ + model.decoder.activations_checkpoint_num_layers=1 \ + model.data.respect_document_boundaries=False \ + model.data.data_prefix=[.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document,.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document]" + sh "python examples/nlp/language_modeling/megatron_bart_pretraining.py \ + trainer.devices=2 \ + trainer.accelerator=gpu \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=10 \ + trainer.limit_val_batches=2 \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=10 \ + trainer.precision=16 \ + trainer.gradient_clip_val=1.0 \ + exp_manager.exp_dir=examples/nlp/language_modeling/bart_pretrain_results \ + exp_manager.resume_if_exists=True \ + model.pipeline_model_parallel_size=2 \ + model.pipeline_model_parallel_split_rank=1 \ + model.seq_length=256 \ + model.encoder.num_layers=4 \ + model.encoder.hidden_size=64 \ + model.encoder.num_attention_heads=8 \ + model.encoder.activation='geglu' \ + model.encoder.bias_activation_fusion=False \ + model.encoder.activations_checkpoint_method='block' \ + model.encoder.activations_checkpoint_num_layers=1 \ + model.decoder.num_layers=4 \ + model.decoder.hidden_size=64 \ + model.decoder.num_attention_heads=8 \ + model.decoder.activation='geglu' \ + model.decoder.bias_activation_fusion=False \ + model.decoder.activations_checkpoint_method='block' \ + model.decoder.activations_checkpoint_num_layers=1 \ + model.data.respect_document_boundaries=False \ + model.data.data_prefix=[.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document,.5,/home/TestData/nlp/megatron_t5/data/pile_val_small_bert_tokenizer_text_document]" + sh "rm -rf examples/nlp/language_modeling/bart_pretrain_results" + } + } + stage('L2: Megatron T5 GLUE/XNLI Finetuning') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + parallel { + // TODO(Oktai15): update it in 1.8.0 version + stage('T5 GLUE RTE') { + steps { + sh "python examples/nlp/language_modeling/megatron_t5_seq2seq_finetune.py \ + trainer.devices=1 \ + trainer.accelerator=gpu \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=1 \ + +trainer.limit_val_batches=2 \ + +trainer.limit_test_batches=2 \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=2 \ + trainer.precision=16 \ + exp_manager.exp_dir=examples/nlp/language_modeling/t5_glue_results \ + model.restore_from_path=/home/TestData/nlp/megatron_t5/8m/megatron_t5_8m-refactor.nemo \ + model.pipeline_model_parallel_size=1 \ + model.pipeline_model_parallel_split_rank=0 \ + model.data.train_ds.task_name=rte \ + model.data.train_ds.global_batch_size=4 \ + 
model.data.train_ds.micro_batch_size=2 \ + model.data.validation_ds.global_batch_size=2 \ + model.data.validation_ds.micro_batch_size=2 \ + model.data.train_ds.file_path=/home/TestData/nlp/megatron_t5/data/train_ci.tsv \ + model.data.validation_ds.task_name=rte \ + model.data.validation_ds.file_path=/home/TestData/nlp/megatron_t5/data/dev_ci.tsv \ + " + sh "rm -rf examples/nlp/language_modeling/t5_glue_results" + } + } + stage('T5 GLUE XNLI') { + steps { + sh "python examples/nlp/language_modeling/megatron_t5_seq2seq_finetune.py \ + -cn megatron_t5_config_finetune_glue_xnli \ + trainer.devices=1 \ + trainer.accelerator=gpu \ + trainer.log_every_n_steps=1 \ + trainer.val_check_interval=1 \ + +trainer.limit_val_batches=2 \ + +trainer.limit_test_batches=2 \ + trainer.accumulate_grad_batches=1 \ + trainer.max_steps=2 \ + trainer.precision=16 \ + exp_manager.exp_dir=examples/nlp/language_modeling/t5_xnli_results \ + model.restore_from_path=/home/TestData/nlp/megatron_t5/8m/megatron_t5_8m-refactor.nemo \ + model.pipeline_model_parallel_size=1 \ + model.pipeline_model_parallel_split_rank=0 \ + model.data.train_ds.global_batch_size=4 \ + model.data.train_ds.micro_batch_size=2 \ + model.data.validation_ds.global_batch_size=2 \ + model.data.validation_ds.micro_batch_size=2 \ + model.data.test_ds.global_batch_size=2 \ + model.data.test_ds.micro_batch_size=2 \ + model.data.train_ds.task_name=rte \ + model.data.train_ds.file_path=/home/TestData/nlp/megatron_t5/data/train_ci.tsv \ + model.data.validation_ds.task_name=xnli \ + model.data.validation_ds.file_path=/home/TestData/nlp/megatron_t5/data/xnli_dev_ci.tsv \ + model.data.test_ds.task_name=xnli \ + model.data.test_ds.file_path=/home/TestData/nlp/megatron_t5/data/xnli_dev_ci.tsv \ + " + sh "rm -rf examples/nlp/language_modeling/t5_xnli_results" + } + } + } + } + stage('L2: TTS Fast dev runs 1') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + parallel { + stage('Tacotron 2') { + steps { + sh 'python examples/tts/tacotron2.py \ + train_dataset=/home/TestData/an4_dataset/an4_train.json \ + validation_datasets=/home/TestData/an4_dataset/an4_val.json \ + trainer.devices=[0] \ + trainer.accelerator="gpu" \ + +trainer.limit_train_batches=1 +trainer.limit_val_batches=1 trainer.max_epochs=1 \ + trainer.strategy=null \ + model.decoder.decoder_rnn_dim=256 \ + model.decoder.attention_rnn_dim=1024 \ + model.decoder.prenet_dim=128 \ + model.postnet.postnet_n_convolutions=3 \ + model.train_ds.dataloader_params.batch_size=4 \ + model.train_ds.dataloader_params.num_workers=0 \ + model.validation_ds.dataloader_params.batch_size=4 \ + model.validation_ds.dataloader_params.num_workers=0 \ + ~model.text_normalizer \ + ~model.text_normalizer_call_kwargs \ + ~trainer.check_val_every_n_epoch \ + ' + } + } + stage('WaveGlow') { + steps { + sh 'python examples/tts/waveglow.py \ + train_dataset=/home/TestData/an4_dataset/an4_train.json \ + validation_datasets=/home/TestData/an4_dataset/an4_val.json \ + trainer.devices="[0]" \ + +trainer.limit_train_batches=1 +trainer.limit_val_batches=1 trainer.max_epochs=1 \ + trainer.strategy=null \ + model.train_ds.dataloader_params.batch_size=4 \ + model.train_ds.dataloader_params.num_workers=0 \ + model.validation_ds.dataloader_params.batch_size=4 \ + model.validation_ds.dataloader_params.num_workers=0 \ + model.waveglow.n_flows=4 \ + model.waveglow.n_wn_layers=2 \ + model.waveglow.n_wn_channels=32 \ + ~trainer.check_val_every_n_epoch' + } + } + stage('FastPitch') { + steps { + sh 'python 
examples/tts/fastpitch.py \ + --config-name fastpitch_align_v1.05 \ + train_dataset=/home/TestData/an4_dataset/an4_train.json \ + validation_datasets=/home/TestData/an4_dataset/an4_val.json \ + sup_data_path=/home/TestData/an4_dataset/beta_priors \ + trainer.devices="[0]" \ + +trainer.limit_train_batches=1 \ + +trainer.limit_val_batches=1 \ + trainer.max_epochs=1 \ + trainer.strategy=null \ + model.pitch_mean=212.35873413085938 \ + model.pitch_std=68.52806091308594 \ + model.train_ds.dataloader_params.batch_size=4 \ + model.train_ds.dataloader_params.num_workers=0 \ + model.validation_ds.dataloader_params.batch_size=4 \ + model.validation_ds.dataloader_params.num_workers=0 \ + model.symbols_embedding_dim=64 \ + model.input_fft.d_inner=384 \ + model.input_fft.n_layer=2 \ + model.output_fft.d_inner=384 \ + model.output_fft.n_layer=2 \ + ~trainer.check_val_every_n_epoch \ + ~model.text_normalizer \ + ~model.text_normalizer_call_kwargs' + } + } + stage('RADTTS') { + steps { + sh 'python examples/tts/radtts.py \ + train_dataset=/home/TestData/an4_dataset/an4_train.json \ + validation_datasets=/home/TestData/an4_dataset/an4_val.json \ + sup_data_path=/home/TestData/an4_dataset/radtts_beta_priors \ + trainer.devices="[0]" \ + +trainer.limit_train_batches=1 \ + +trainer.limit_val_batches=1 \ + trainer.max_epochs=1 \ + trainer.strategy=null \ + model.pitch_mean=212.35873413085938 \ + model.pitch_std=68.52806091308594 \ + model.train_ds.dataloader_params.batch_size=4 \ + model.train_ds.dataloader_params.num_workers=0 \ + model.validation_ds.dataloader_params.batch_size=4 \ + model.validation_ds.dataloader_params.num_workers=0 \ + export_dir=/home/TestData/radtts_test \ + model.optim.lr=0.0001 \ + model.modelConfig.decoder_use_partial_padding=True \ + ~trainer.check_val_every_n_epoch \ + ~model.text_normalizer \ + ~model.text_normalizer_call_kwargs' + } + } + stage('Mixer-TTS') { + steps { + sh 'python examples/tts/mixer_tts.py \ + train_dataset=/home/TestData/an4_dataset/an4_train.json \ + validation_datasets=/home/TestData/an4_dataset/an4_val.json \ + sup_data_path=/home/TestData/an4_dataset/sup_data \ + trainer.devices="[0]" \ + +trainer.limit_train_batches=1 \ + +trainer.limit_val_batches=1 \ + trainer.max_epochs=1 \ + trainer.strategy=null \ + model.pitch_mean=212.35873413085938 \ + model.pitch_std=68.52806091308594 \ + model.train_ds.dataloader_params.batch_size=4 \ + model.train_ds.dataloader_params.num_workers=0 \ + model.validation_ds.dataloader_params.batch_size=4 \ + model.validation_ds.dataloader_params.num_workers=0 \ + ~trainer.check_val_every_n_epoch \ + ~model.text_normalizer \ + ~model.text_normalizer_call_kwargs' + } + } + stage('Hifigan') { + steps { + sh 'python examples/tts/hifigan.py \ + train_dataset=/home/TestData/an4_dataset/an4_train.json \ + validation_datasets=/home/TestData/an4_dataset/an4_val.json \ + trainer.devices="[0]" \ + +trainer.limit_train_batches=1 \ + +trainer.limit_val_batches=1 \ + +trainer.max_epochs=1 \ + trainer.strategy=null \ + model.train_ds.dataloader_params.batch_size=4 \ + model.train_ds.dataloader_params.num_workers=0 \ + model.validation_ds.dataloader_params.batch_size=4 \ + model.validation_ds.dataloader_params.num_workers=0 \ + model.generator.upsample_initial_channel=64 \ + +model.debug=true \ + ~trainer.check_val_every_n_epoch' + } + } + } + } + + stage('L??: Speech Checkpoints tests') { + when { + anyOf { + branch 'r1.17.0' + changeRequest target: 'r1.17.0' + } + } + failFast true + steps { + sh 'CUDA_VISIBLE_DEVICES=0 python 
examples/asr/speech_to_text_eval.py \ + pretrained_name=QuartzNet15x5Base-En \ + dataset_manifest=/home/TestData/librispeech/librivox-dev-other.json \ + batch_size=64 \ + tolerance=0.1012' + sh 'rm -f examples/asr/evaluation_transcripts.json' + } + } + } + + post { + always { + sh 'chmod -R 777 .' + cleanWs() + } + } +} diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000000000000000000000000000000000000..f49a4e16e68b128803cc2dcea614603632b04eac --- /dev/null +++ b/LICENSE @@ -0,0 +1,201 @@ + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. 
For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. 
The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. 
+ + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright [yyyy] [name of copyright owner] + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. \ No newline at end of file diff --git a/PUBLICATIONS.md b/PUBLICATIONS.md new file mode 100644 index 0000000000000000000000000000000000000000..365ed2773ed3e3b446ba31f441562b2b4f3a2676 --- /dev/null +++ b/PUBLICATIONS.md @@ -0,0 +1,213 @@ +# Publications + +Here, we list a collection of research articles that utilize the NeMo Toolkit. If you would like to include your paper in this collection, please submit a PR updating this document. + +------- + +# Automatic Speech Recognition (ASR) + +
+ 2023 + + * [Fast Entropy-Based Methods of Word-Level Confidence Estimation for End-to-End Automatic Speech Recognition](https://ieeexplore.ieee.org/abstract/document/10022960) + * [Damage Control During Domain Adaptation for Transducer Based Automatic Speech Recognition](https://ieeexplore.ieee.org/abstract/document/10023219) + +
+ +
+ 2022 + + * [Multi-blank Transducers for Speech Recognition](https://arxiv.org/abs/2211.03541) + +
+ +
+ 2021 + + * [Citrinet: Closing the Gap between Non-Autoregressive and Autoregressive End-to-End Models for Automatic Speech Recognition](https://arxiv.org/abs/2104.01721) + * [SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition](https://www.isca-speech.org/archive/interspeech_2021/oneill21_interspeech.html) + * [CarneliNet: Neural Mixture Model for Automatic Speech Recognition](https://arxiv.org/abs/2107.10708) + * [CTC Variations Through New WFST Topologies](https://arxiv.org/abs/2110.03098) + * [A Toolbox for Construction and Analysis of Speech Datasets](https://openreview.net/pdf?id=oJ0oHQtAld) + +
+ + +
+ 2020 + + * [Cross-Language Transfer Learning, Continuous Learning, and Domain Adaptation for End-to-End Automatic Speech Recognition](https://ieeexplore.ieee.org/document/9428334) + * [Correction of Automatic Speech Recognition with Transformer Sequence-To-Sequence Model](https://ieeexplore.ieee.org/abstract/document/9053051) + * [Improving Noise Robustness of an End-to-End Neural Model for Automatic Speech Recognition](https://arxiv.org/abs/2010.12715) + +
+ + +
+ 2019 + + * [Jasper: An End-to-End Convolutional Neural Acoustic Model](https://arxiv.org/abs/1904.03288) + * [QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions](https://arxiv.org/abs/1910.10261) + + +
+ + +-------- + + +## Speaker Recognition (SpkR) + +
+ 2022 + + * [TitaNet: Neural Model for Speaker Representation with 1D Depth-Wise Separable Convolutions and Global Context](https://ieeexplore.ieee.org/abstract/document/9746806) + +
+ + +
+ 2020 + + * [SpeakerNet: 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification]( https://arxiv.org/pdf/2010.12653.pdf) + +
+ +-------- + +## Speech Classification + +
+ 2022 + + * [AmberNet: A Compact End-to-End Model for Spoken Language Identification](https://arxiv.org/abs/2210.15781) + * [Accidental Learners: Spoken Language Identification in Multilingual Self-Supervised Models](https://arxiv.org/abs/2211.05103) + + +
+ +
+ 2021 + + * [MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection](https://ieeexplore.ieee.org/abstract/document/9414470/) + +
+ + +
+ 2020 + + * [MatchboxNet - 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition](http://www.interspeech2020.org/index.php?m=content&c=index&a=show&catid=337&id=993) + +
+ + +-------- + +## Speech Translation + +
+ 2022 + + * [NVIDIA NeMo Offline Speech Translation Systems for IWSLT 2022](https://aclanthology.org/2022.iwslt-1.18/) + +
+ + +-------- + +# Natural Language Processing (NLP) + +## Language Modeling + +
+ 2022 + + * [Evaluating Parameter Efficient Learning for Generation](https://arxiv.org/abs/2210.13673) + * [Text Mining Drug/Chemical-Protein Interactions using an Ensemble of BERT and T5 Based Models](https://arxiv.org/abs/2111.15617) + +
+ +
+ 2021 + + * [BioMegatron: Larger Biomedical Domain Language Model ](https://aclanthology.org/2020.emnlp-main.379/) + +
+ +## Neural Machine Translation + +
+ 2022 + + * [Finding the Right Recipe for Low Resource Domain Adaptation in Neural Machine Translation](https://arxiv.org/abs/2206.01137) + +
+ +
+ 2021 + + * [NVIDIA NeMo Neural Machine Translation Systems for English-German and English-Russian News and Biomedical Tasks at WMT21](https://arxiv.org/pdf/2111.08634.pdf) + +
+ +-------- + +## Dialogue State Tracking + +
+ 2021 + + * [SGD-QA: Fast Schema-Guided Dialogue State Tracking for Unseen Services](https://arxiv.org/abs/2105.08049) + +
+ +
+ 2020 + + * [A Fast and Robust BERT-based Dialogue State Tracker for Schema-Guided Dialogue Dataset](https://arxiv.org/abs/2008.12335) + +
+-------- + + +# Text To Speech (TTS) + +
+ 2022 + + * [Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers](https://arxiv.org/abs/2211.00585) + +
+ +
+ 2021 + + * [TalkNet: Fully-Convolutional Non-Autoregressive Speech Synthesis Model](https://www.isca-speech.org/archive/interspeech_2021/beliaev21_interspeech.html) + * [TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction](https://arxiv.org/abs/2104.08189) + * [Hi-Fi Multi-Speaker English TTS Dataset](https://www.isca-speech.org/archive/pdfs/interspeech_2021/bakhturina21_interspeech.pdf) + * [Mixer-TTS: non-autoregressive, fast and compact text-to-speech model conditioned on language model embeddings](https://arxiv.org/abs/2110.03584) + +
+ + +-------- + +# (Inverse) Text Normalization +
+ 2022 + + * [Shallow Fusion of Weighted Finite-State Transducer and Language Model for Text Normalization](https://arxiv.org/abs/2203.15917) + * [Thutmose Tagger: Single-pass neural model for Inverse Text Normalization](https://arxiv.org/abs/2208.00064) + +
+ +
+ 2021 + + * [NeMo Inverse Text Normalization: From Development to Production](https://www.isca-speech.org/archive/pdfs/interspeech_2021/zhang21ga_interspeech.pdf) + * [A Unified Transformer-based Framework for Duplex Text Normalization](https://arxiv.org/pdf/2108.09889.pdf ) + +
+ +-------- \ No newline at end of file diff --git a/README.rst b/README.rst new file mode 100644 index 0000000000000000000000000000000000000000..3d94bf2d3848d40cac7370db0187537e18e563e4 --- /dev/null +++ b/README.rst @@ -0,0 +1,319 @@ + +|status| |documentation| |codeql| |license| |pypi| |pyversion| |downloads| |black| + +.. |status| image:: http://www.repostatus.org/badges/latest/active.svg + :target: http://www.repostatus.org/#active + :alt: Project Status: Active – The project has reached a stable, usable state and is being actively developed. + +.. |documentation| image:: https://readthedocs.com/projects/nvidia-nemo/badge/?version=main + :alt: Documentation + :target: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/ + +.. |license| image:: https://img.shields.io/badge/License-Apache%202.0-brightgreen.svg + :target: https://github.com/NVIDIA/NeMo/blob/master/LICENSE + :alt: NeMo core license and license for collections in this repo + +.. |pypi| image:: https://badge.fury.io/py/nemo-toolkit.svg + :target: https://badge.fury.io/py/nemo-toolkit + :alt: Release version + +.. |pyversion| image:: https://img.shields.io/pypi/pyversions/nemo-toolkit.svg + :target: https://badge.fury.io/py/nemo-toolkit + :alt: Python version + +.. |downloads| image:: https://static.pepy.tech/personalized-badge/nemo-toolkit?period=total&units=international_system&left_color=grey&right_color=brightgreen&left_text=downloads + :target: https://pepy.tech/project/nemo-toolkit + :alt: PyPi total downloads + +.. |codeql| image:: https://github.com/nvidia/nemo/actions/workflows/codeql.yml/badge.svg?branch=main&event=push + :target: https://github.com/nvidia/nemo/actions/workflows/codeql.yml + :alt: CodeQL + +.. |black| image:: https://img.shields.io/badge/code%20style-black-000000.svg + :target: https://github.com/psf/black + :alt: Code style: black + +.. _main-readme: + +**NVIDIA NeMo** +=============== + +Introduction +------------ + +NVIDIA NeMo is a conversational AI toolkit built for researchers working on automatic speech recognition (ASR), +text-to-speech synthesis (TTS), large language models (LLMs), and +natural language processing (NLP). +The primary objective of NeMo is to help researchers from industry and academia to reuse prior work (code and pretrained models) +and make it easier to create new `conversational AI models `_. + +All NeMo models are trained with `Lightning `_ and +training is automatically scalable to 1000s of GPUs. +Additionally, NeMo Megatron LLM models can be trained up to 1 trillion parameters using tensor and pipeline model parallelism. +NeMo models can be optimized for inference and deployed for production use-cases with `NVIDIA Riva `_. + +Getting started with NeMo is simple. +State of the Art pretrained NeMo models are freely available on `HuggingFace Hub `_ and +`NVIDIA NGC `_. +These models can be used to transcribe audio, synthesize speech, or translate text in just a few lines of code. + +We have extensive `tutorials `_ that +can all be run on `Google Colab `_. + +For advanced users that want to train NeMo models from scratch or finetune existing NeMo models +we have a full suite of `example scripts `_ that support multi-GPU/multi-node training. + +For scaling NeMo LLM training on Slurm clusters or public clouds, please see the `NVIDIA NeMo Megatron Launcher `_. 
+The NM launcher has extensive recipes, scripts, utilities, and documentation for training NeMo LLMs and also has an `Autoconfigurator `_ +which can be used to find the optimal model parallel configuration for training on a specific cluster. + +Also see our `introductory video `_ for a high level overview of NeMo. + +Key Features +------------ + +* Speech processing + * `HuggingFace Space for Audio Transcription (File, Microphone and YouTube) `_ + * `Automatic Speech Recognition (ASR) `_ + * Supported models: Jasper, QuartzNet, CitriNet, Conformer-CTC, Conformer-Transducer, Squeezeformer-CTC, Squeezeformer-Transducer, ContextNet, LSTM-Transducer (RNNT), LSTM-CTC, FastConformer-CTC, FastConformer-Transducer... + * Supports CTC and Transducer/RNNT losses/decoders + * NeMo Original `Multi-blank Transducers `_ + * Beam Search decoding + * `Language Modelling for ASR `_: N-gram LM in fusion with Beam Search decoding, Neural Rescoring with Transformer + * Streaming and Buffered ASR (CTC/Transducer) - `Chunked Inference Examples `_ + * `Support of long audios for Conformer with memory efficient local attention `_ + * `Speech Classification, Speech Command Recognition and Language Identification `_: MatchboxNet (Command Recognition), AmberNet (LangID) + * `Voice activity Detection (VAD) `_: MarbleNet + * ASR with VAD Inference - `Example `_ + * `Speaker Recognition `_: TitaNet, ECAPA_TDNN, SpeakerNet + * `Speaker Diarization `_ + * Clustering Diarizer: TitaNet, ECAPA_TDNN, SpeakerNet + * Neural Diarizer: MSDD (Multi-scale Diarization Decoder) + * `Speech Intent Detection and Slot Filling `_: Conformer-Transformer + * `Pretrained models on different languages. `_: English, Spanish, German, Russian, Chinese, French, Italian, Polish, ... + * `NGC collection of pre-trained speech processing models. `_ +* Natural Language Processing + * `NeMo Megatron pre-training of Large Language Models `_ + * `Neural Machine Translation (NMT) `_ + * `Punctuation and Capitalization `_ + * `Token classification (named entity recognition) `_ + * `Text classification `_ + * `Joint Intent and Slot Classification `_ + * `Question answering `_ + * `GLUE benchmark `_ + * `Information retrieval `_ + * `Entity Linking `_ + * `Dialogue State Tracking `_ + * `Prompt Learning `_ + * `NGC collection of pre-trained NLP models. `_ + * `Synthetic Tabular Data Generation `_ +* `Speech synthesis (TTS) `_ + * Spectrogram generation: Tacotron2, GlowTTS, TalkNet, FastPitch, FastSpeech2, Mixer-TTS, Mixer-TTS-X + * Vocoders: WaveGlow, SqueezeWave, UniGlow, MelGAN, HiFiGAN, UnivNet + * End-to-end speech generation: FastPitch_HifiGan_E2E, FastSpeech2_HifiGan_E2E, VITS + * `NGC collection of pre-trained TTS models. `_ +* `Tools `_ + * `Text Processing (text normalization and inverse text normalization) `_ + * `CTC-Segmentation tool `_ + * `Speech Data Explorer `_: a dash-based tool for interactive exploration of ASR/TTS datasets + + +Built for speed, NeMo can utilize NVIDIA's Tensor Cores and scale out training to multiple GPUs and multiple nodes. + +Requirements +------------ + +1) Python 3.8 or above +2) Pytorch 1.10.0 or above +3) NVIDIA GPU for training + +Documentation +------------- + +.. |main| image:: https://readthedocs.com/projects/nvidia-nemo/badge/?version=main + :alt: Documentation Status + :scale: 100% + :target: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/ + +.. 
|stable| image:: https://readthedocs.com/projects/nvidia-nemo/badge/?version=stable + :alt: Documentation Status + :scale: 100% + :target: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/ + ++---------+-------------+------------------------------------------------------------------------------------------------------------------------------------------+ +| Version | Status | Description | ++=========+=============+==========================================================================================================================================+ +| Latest | |main| | `Documentation of the latest (i.e. main) branch. `_ | ++---------+-------------+------------------------------------------------------------------------------------------------------------------------------------------+ +| Stable | |stable| | `Documentation of the stable (i.e. most recent release) branch. `_ | ++---------+-------------+------------------------------------------------------------------------------------------------------------------------------------------+ + +Tutorials +--------- +A great way to start with NeMo is by checking `one of our tutorials `_. + +Getting help with NeMo +---------------------- +FAQ can be found on NeMo's `Discussions board `_. You are welcome to ask questions or start discussions there. + + +Installation +------------ + +Conda +~~~~~ + +We recommend installing NeMo in a fresh Conda environment. + +.. code-block:: bash + + conda create --name nemo python==3.8.10 + conda activate nemo + +Install PyTorch using their `configurator `_. + +.. code-block:: bash + + conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia + +The command used to install PyTorch may depend on your system. Please use the configurator linked above to find the right command for your system. + +Pip +~~~ +Use this installation mode if you want the latest released version. + +.. code-block:: bash + + apt-get update && apt-get install -y libsndfile1 ffmpeg + pip install Cython + pip install nemo_toolkit['all'] + +Depending on the shell used, you may need to use ``"nemo_toolkit[all]"`` instead in the above command. + +Pip from source +~~~~~~~~~~~~~~~ +Use this installation mode if you want the version from a particular GitHub branch (e.g main). + +.. code-block:: bash + + apt-get update && apt-get install -y libsndfile1 ffmpeg + pip install Cython + python -m pip install git+https://github.com/NVIDIA/NeMo.git@{BRANCH}#egg=nemo_toolkit[all] + + +From source +~~~~~~~~~~~ +Use this installation mode if you are contributing to NeMo. + +.. code-block:: bash + + apt-get update && apt-get install -y libsndfile1 ffmpeg + git clone https://github.com/NVIDIA/NeMo + cd NeMo + ./reinstall.sh + +If you only want the toolkit without additional conda-based dependencies, you may replace ``reinstall.sh`` +with ``pip install -e .`` when your PWD is the root of the NeMo repository. + +RNNT +~~~~ +Note that RNNT requires numba to be installed from conda. + +.. code-block:: bash + + conda remove numba + pip uninstall numba + conda install -c conda-forge numba + +NeMo Megatron +~~~~~~~~~~~~~ +NeMo Megatron training requires NVIDIA Apex to be installed. +Install it manually if not using the NVIDIA PyTorch container. + +.. 
code-block:: bash + + git clone https://github.com/NVIDIA/apex.git + cd apex + git checkout 03c9d80ed54c0eaa5b581bf42ceca3162f085327 + pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" --global-option="--distributed_adam" --global-option="--deprecated_fused_adam" ./ + +It is highly recommended to use the NVIDIA PyTorch or NeMo container if you are having issues installing Apex or any other dependencies. + +While installing Apex, it may raise an error if the CUDA version on your system does not match the CUDA version torch was compiled with. +This error can be avoided by commenting out the version check here: https://github.com/NVIDIA/apex/blob/master/setup.py#L32 + +cuda-nvprof is needed to install Apex. The version should match the CUDA version that you are using: + +.. code-block:: bash + + conda install -c nvidia cuda-nvprof=11.8 + +packaging is also needed: + +.. code-block:: bash + + pip install packaging + + +Transformer Engine +~~~~~~~~~~~~~~~~~~ +NeMo Megatron GPT has been integrated with `NVIDIA Transformer Engine `_. +Transformer Engine enables FP8 training on NVIDIA Hopper GPUs. +`Install `_ it manually if not using the NVIDIA PyTorch container. + +.. code-block:: bash + + pip install --upgrade git+https://github.com/NVIDIA/TransformerEngine.git@stable + +It is highly recommended to use the NVIDIA PyTorch or NeMo container if you are having issues installing Transformer Engine or any other dependencies. + +Transformer Engine requires PyTorch to be built with CUDA 11.8. + +NeMo Text Processing +~~~~~~~~~~~~~~~~~~~~ +NeMo Text Processing, specifically (Inverse) Text Normalization, is now a separate repository: `https://github.com/NVIDIA/NeMo-text-processing `_. + +Docker containers: +~~~~~~~~~~~~~~~~~~ +We release NeMo containers alongside NeMo releases. For example, NeMo ``r1.16.0`` comes with container ``nemo:23.01``; you may find more details about released containers on the `releases page `_. + +To use a pre-built container, please run + +.. code-block:: bash + + docker pull nvcr.io/nvidia/nemo:23.01 + +To build a NeMo container with the Dockerfile from a branch, please run + +.. code-block:: bash + + DOCKER_BUILDKIT=1 docker build -f Dockerfile -t nemo:latest . + + +If you choose to work with the main branch, we recommend using NVIDIA's PyTorch container version 23.02-py3 and then installing from GitHub. + +.. code-block:: bash + + docker run --gpus all -it --rm -v :/NeMo --shm-size=8g \ + -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit \ + stack=67108864 --device=/dev/snd nvcr.io/nvidia/pytorch:23.02-py3 + +Examples +-------- + +Many examples can be found under the `"Examples" `_ folder. + + +Contributing +------------ + +We welcome community contributions! Please refer to `CONTRIBUTING.md `_ for the process. + +Publications +------------ + +We provide an ever-growing list of publications that utilize the NeMo framework. Please refer to `PUBLICATIONS.md `_. We welcome the addition of your own articles to this list! + +License +------- + +NeMo is under `Apache 2.0 license `_. 
diff --git a/ci.groovy b/ci.groovy new file mode 100644 index 0000000000000000000000000000000000000000..27ad659b99a1746685045a6fcdcd2d5874ead849 --- /dev/null +++ b/ci.groovy @@ -0,0 +1,119 @@ +@Library('blossom-github-lib@master') +import ipp.blossom.* + +podTemplate(cloud:'sc-ipp-blossom-prod', yaml : """ +apiVersion: v1 +kind: Pod +metadata: + labels: + some-label: some-label-value +spec: + volumes: + - name: scratch + nfs: + server: ipp1-cdot01-col01 + path: /vol/scratch1/scratch.okuchaiev_blossom + containers: + - name: latestdlfw + image: nvcr.io/nvidia/pytorch:23.02-py3 + command: + - cat + volumeMounts: + - name: scratch + mountPath: /testdata + resources: + limits: + nvidia.com/gpu: 2 + restartPolicy: Never + backoffLimit: 4 + tty: true + shm-size: 32g + nodeSelector: + kubernetes.io/os: linux + nvidia.com/gpu_type: "Tesla_T4x4" + nvidia.com/node_type: gpu_tester + nvidia.com/driver_version: "510.20" +""" +) { + node(POD_LABEL) { + def githubHelper + stage('Get Token') { + withCredentials([usernamePassword(credentialsId: 'GHAtoken', passwordVariable: 'GIT_PASSWORD', usernameVariable: 'GIT_USERNAME')]) { + // create new instance of helper object + githubHelper = GithubHelper.getInstance("${GIT_PASSWORD}", githubData) + } + + } + def stageName = '' + try { + currentBuild.description = githubHelper.getBuildDescription() + container('latestdlfw') { + stage('Code checkout') { + // update status on github + githubHelper.updateCommitStatus("$BUILD_URL", "$stageName Running", GitHubCommitState.PENDING) + checkout changelog: true, poll: true, scm: [$class: 'GitSCM', branches: [[name: "pr/"+githubHelper.getPRNumber()]], + doGenerateSubmoduleConfigurations: false, + submoduleCfg: [], + userRemoteConfigs: [[credentialsId: 'github-token', url: githubHelper.getCloneUrl(), refspec: '+refs/pull/*/head:refs/remotes/origin/pr/*']]] + } + + stage('Code Style') { + sh "apt-get update && \ + apt-get install -y bc && \ + nvidia-smi && \ + pip install -r requirements/requirements_test.txt && \ + python setup.py style && ls -l /testdata/TestData && ln -s /testdata/TestData /home/TestData && \ + ls -l /home && ls -l /home/TestData" + } + + stage('Installation') { + sh "git config --global --add safe.directory '*' && nvidia-smi && ./reinstall.sh release" + } + + stage('L0: GPU unit tests') { + sh "NEMO_NUMBA_MINVER=0.53 pytest -m 'not pleasefixme'" + } + + parallel( //USE CUDA_VISIBLE_DEVICES to execute 2 single GPU tests in parallel here + [ + "L1: NMT Training Pre-LN": { sh 'CUDA_VISIBLE_DEVICES=0 python examples/nlp/machine_translation/enc_dec_nmt.py \ + --config-path=conf \ + --config-name=aayn_base \ + do_testing=true \ + model.train_ds.src_file_name=/testdata/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.train_ds.tgt_file_name=/testdata/TestData/nlp/nmt/toy_data/wmt14-de-en.ref \ + model.validation_ds.src_file_name=/testdata/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.validation_ds.tgt_file_name=/testdata/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.test_ds.src_file_name=/testdata/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.test_ds.tgt_file_name=/testdata/TestData/nlp/nmt/toy_data/wmt14-de-en.src \ + model.encoder_tokenizer.tokenizer_model=/testdata/TestData/nlp/nmt/toy_data/tt_tokenizer.BPE.4096.model \ + model.decoder_tokenizer.tokenizer_model=/testdata/TestData/nlp/nmt/toy_data/tt_tokenizer.BPE.4096.model \ + model.encoder.pre_ln=true \ + model.decoder.pre_ln=true \ + trainer.devices=[0] \ + trainer.accelerator="gpu" \ + +trainer.fast_dev_run=true \ + 
+trainer.limit_test_batches=2 \ + exp_manager=null \ + '}, + "L1: Speech to text": { sh 'CUDA_VISIBLE_DEVICES=1 python examples/asr/asr_ctc/speech_to_text_ctc.py \ + model.train_ds.manifest_filepath=/testdata/TestData/an4_dataset/an4_train.json \ + model.validation_ds.manifest_filepath=/testdata/TestData/an4_dataset/an4_val.json \ + trainer.devices=[0] \ + trainer.accelerator="gpu" \ + +trainer.fast_dev_run=True \ + exp_manager=null \ + '} + ] + )//end of parallel + } + githubHelper.updateCommitStatus("$BUILD_URL", "Complete", GitHubCommitState.SUCCESS) + } + catch (Exception ex){ + currentBuild.result = 'FAILURE' + println ex + githubHelper.updateCommitStatus("$BUILD_URL", "$stageName Failed", GitHubCommitState.FAILURE) + } + + } + } \ No newline at end of file diff --git a/docs/.nojekyll b/docs/.nojekyll new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/docs/Makefile b/docs/Makefile new file mode 100644 index 0000000000000000000000000000000000000000..417fe2a0b149f603832abdd4dc070f30a331a502 --- /dev/null +++ b/docs/Makefile @@ -0,0 +1,216 @@ +# Makefile for Sphinx documentation +# + +# You can set these variables from the command line. +SPHINXOPTS = +SPHINXBUILD = sphinx-build +PAPER = +BUILDDIR = build + +# User-friendly check for sphinx-build +ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1) +$(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/) +endif + +# Internal variables. 
+PAPEROPT_a4 = -D latex_paper_size=a4 +PAPEROPT_letter = -D latex_paper_size=letter +ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source +# the i18n builder cannot share the environment and doctrees with the others +I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source + +.PHONY: help +help: + @echo "Please use \`make <target>' where <target> is one of" + @echo " html to make standalone HTML files" + @echo " dirhtml to make HTML files named index.html in directories" + @echo " singlehtml to make a single large HTML file" + @echo " pickle to make pickle files" + @echo " json to make JSON files" + @echo " htmlhelp to make HTML files and a HTML help project" + @echo " qthelp to make HTML files and a qthelp project" + @echo " applehelp to make an Apple Help Book" + @echo " devhelp to make HTML files and a Devhelp project" + @echo " epub to make an epub" + @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" + @echo " latexpdf to make LaTeX files and run them through pdflatex" + @echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx" + @echo " text to make text files" + @echo " man to make manual pages" + @echo " texinfo to make Texinfo files" + @echo " info to make Texinfo files and run them through makeinfo" + @echo " gettext to make PO message catalogs" + @echo " changes to make an overview of all changed/added/deprecated items" + @echo " xml to make Docutils-native XML files" + @echo " pseudoxml to make pseudoxml-XML files for display purposes" + @echo " linkcheck to check all external links for integrity" + @echo " doctest to run all doctests embedded in the documentation (if enabled)" + @echo " coverage to run coverage check of the documentation (if enabled)" + +.PHONY: clean +clean: + rm -rf $(BUILDDIR)/* + +.PHONY: html +html: + $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html + @echo + @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." + +.PHONY: dirhtml +dirhtml: + $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml + @echo + @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml." + +.PHONY: singlehtml +singlehtml: + $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml + @echo + @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." + +.PHONY: pickle +pickle: + $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle + @echo + @echo "Build finished; now you can process the pickle files." + +.PHONY: json +json: + $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json + @echo + @echo "Build finished; now you can process the JSON files." + +.PHONY: htmlhelp +htmlhelp: + $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp + @echo + @echo "Build finished; now you can run HTML Help Workshop with the" \ + ".hhp project file in $(BUILDDIR)/htmlhelp." + +.PHONY: qthelp +qthelp: + $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp + @echo + @echo "Build finished; now you can run "qcollectiongenerator" with the" \ + ".qhcp project file in $(BUILDDIR)/qthelp, like this:" + @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/OpenSeq2Seq.qhcp" + @echo "To view the help file:" + @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/OpenSeq2Seq.qhc" + +.PHONY: applehelp +applehelp: + $(SPHINXBUILD) -b applehelp $(ALLSPHINXOPTS) $(BUILDDIR)/applehelp + @echo + @echo "Build finished. The help book is in $(BUILDDIR)/applehelp." + @echo "N.B.
You won't be able to view it unless you put it in" \ + "~/Library/Documentation/Help or install it in your application" \ + "bundle." + +.PHONY: devhelp +devhelp: + $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp + @echo + @echo "Build finished." + @echo "To view the help file:" + @echo "# mkdir -p $$HOME/.local/share/devhelp/OpenSeq2Seq" + @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/OpenSeq2Seq" + @echo "# devhelp" + +.PHONY: epub +epub: + $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub + @echo + @echo "Build finished. The epub file is in $(BUILDDIR)/epub." + +.PHONY: latex +latex: + $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex + @echo + @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." + @echo "Run \`make' in that directory to run these through (pdf)latex" \ + "(use \`make latexpdf' here to do that automatically)." + +.PHONY: latexpdf +latexpdf: + $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex + @echo "Running LaTeX files through pdflatex..." + $(MAKE) -C $(BUILDDIR)/latex all-pdf + @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." + +.PHONY: latexpdfja +latexpdfja: + $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex + @echo "Running LaTeX files through platex and dvipdfmx..." + $(MAKE) -C $(BUILDDIR)/latex all-pdf-ja + @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." + +.PHONY: text +text: + $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text + @echo + @echo "Build finished. The text files are in $(BUILDDIR)/text." + +.PHONY: man +man: + $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man + @echo + @echo "Build finished. The manual pages are in $(BUILDDIR)/man." + +.PHONY: texinfo +texinfo: + $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo + @echo + @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." + @echo "Run \`make' in that directory to run these through makeinfo" \ + "(use \`make info' here to do that automatically)." + +.PHONY: info +info: + $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo + @echo "Running Texinfo files through makeinfo..." + make -C $(BUILDDIR)/texinfo info + @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." + +.PHONY: gettext +gettext: + $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale + @echo + @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." + +.PHONY: changes +changes: + $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes + @echo + @echo "The overview file is in $(BUILDDIR)/changes." + +.PHONY: linkcheck +linkcheck: + $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck + @echo + @echo "Link check complete; look for any errors in the above output " \ + "or in $(BUILDDIR)/linkcheck/output.txt." + +.PHONY: doctest +doctest: + $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest + @echo "Testing of doctests in the sources finished, look at the " \ + "results in $(BUILDDIR)/doctest/output.txt." + +.PHONY: coverage +coverage: + $(SPHINXBUILD) -b coverage $(ALLSPHINXOPTS) $(BUILDDIR)/coverage + @echo "Testing of coverage in the sources finished, look at the " \ + "results in $(BUILDDIR)/coverage/python.txt." + +.PHONY: xml +xml: + $(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml + @echo + @echo "Build finished. The XML files are in $(BUILDDIR)/xml." + +.PHONY: pseudoxml +pseudoxml: + $(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml + @echo + @echo "Build finished. 
The pseudo-XML files are in $(BUILDDIR)/pseudoxml." diff --git a/docs/source/_static/css/custom.css b/docs/source/_static/css/custom.css new file mode 100644 index 0000000000000000000000000000000000000000..e5dbe2515c0640fc5a8b5d8b7ba50265ae930b26 --- /dev/null +++ b/docs/source/_static/css/custom.css @@ -0,0 +1,298 @@ +/* Import the Roboto Thin Font */ +@import url('https://fonts.googleapis.com/css2?family=Roboto:wght@400&display=swap'); + +body { + font-size: 100%; + font-family: 'Roboto', sans-serif; +} + + +/* Width of template */ + +.wy-nav-content { + max-width: 1200px !important; +} + + + +/* Standard Text Formatting */ + +h1 { + color: #76b900; + text-align: center; + /* background-color: #ffffff; */ +} + +h2 { + color: #ffffff; + /* background-color: #ffffff; */ + /* #76b900 */ + Padding: 5px; +} + +h3 { + padding-top: 0px; + border-top: solid 3px #000000; + /* #76b900 */ + border-bottom: solid 3px #000000; + /* #76b900 */ +} + +p { + margin-bottom: 24px; +} + +/* Link Colors */ +a { + color: #76b900; +} + +a:visited { + color: #218219; +} + +.container-xl { + margin-right: unset; + margin-left: unset; +} + +section { + overflow-x: auto; +} + +/* ----------------------------------------------TABLES--------------------------------------- */ +section table { + overflow-x: auto; + display: block; +} + +table { + font-size: small; +} + +/* Table head Color */ +thead td { + background-color: #333333 !important; +} + +.row-odd p { + /*padding-bottom: 0px;*/ + /*margin-bottom: 0px;*/ +} + +/* even rows*/ + +.row-even tr { + background-color: #e5f1e6 !important; +} + +/* odd rows*/ + + +.wy-table-responsive table tr { + background-color: #ffffff !important; +} + + + +.wy-table-responsive table td { + white-space: normal; +} + + +/* Removes bottom margin in tables*/ + +.rst-content .line-block { + margin-bottom: 0px; +} + +.wy-table-responsive { + overflow: visible !important; +} + +/* reduces the size of text in multiline table columns. 
*/ + +.rst-content table.docutils td { + font-size: 80%; +} + +.rst-content dl:not(.docutils) dt { + + background-color: inherit; + color: #000000; + border-top: solid 0px #000000; + +} + +.rst-content dl:not(.docutils) dt:before { + color: #333333; +} + +.rst-content .line-block { + margin-bottom: 0px; +} + +.wy-side-nav-search, +.wy-nav-top { + background-color: #000000; + padding: 0; +} + +.wy-side-nav-search img { + padding: 0px; + padding: 0px 0px; + margin-bottom: 0; +} + +.wy-side-nav-search input[type=text] { + border-radius: 0px; +} + + +.wy-menu-vertical p.caption { + color: #76b900; +} + + +.wy-side-nav-search>a img.logo, +.wy-side-nav-search .wy-dropdown>a img.logo { + margin: 0px 0px 0px 0px; +} + +.wy-nav-content { + margin: 0; + min-height: 100%; + height: 100%; + background: #ffffff; +} + +/* List (numbered, bulleted) padding Fix */ + + +.wy-plain-list-decimal li { + margin-top: -6px; + margin-bottom: -6px; +} + +.rst-content .section ol.loweralpha { + margin-top: -6px; + margin-bottom: 12px; +} + +.wy-plain-list-disc, +.rst-content .toctree-wrapper ul, +article ul { + margin-top: 0px !important; + margin-bottom: 12px; +} + +/* Alert Boxes */ +/* Background color of Alert Box Title */ + +.rst-content .section ul { + margin-top: -12px; + margin-bottom: 16px; +} + +.wy-alert.wy-alert-info .wy-alert-title, +.rst-content .note .wy-alert-title, +.rst-content .wy-alert-info.attention .wy-alert-title, +.rst-content .wy-alert-info.caution .wy-alert-title, +.rst-content .wy-alert-info.danger .wy-alert-title, +.rst-content .wy-alert-info.error .wy-alert-title, +.rst-content .wy-alert-info.hint .wy-alert-title, +.rst-content .wy-alert-info.important .wy-alert-title, +.rst-content .wy-alert-info.tip .wy-alert-title, +.rst-content .wy-alert-info.warning .wy-alert-title, +.rst-content .seealso .wy-alert-title, +.rst-content .wy-alert-info.admonition-todo .wy-alert-title, +.rst-content .wy-alert-info.admonition .wy-alert-title, +.wy-alert.wy-alert-info .rst-content .admonition-title, +.rst-content .wy-alert.wy-alert-info .admonition-title, +.rst-content .note .admonition-title, +.rst-content .wy-alert-info.attention .admonition-title, +.rst-content .wy-alert-info.caution .admonition-title, +.rst-content .wy-alert-info.danger .admonition-title, +.rst-content .wy-alert-info.error .admonition-title, +.rst-content .wy-alert-info.hint .admonition-title, +.rst-content .wy-alert-info.important .admonition-title, +.rst-content .wy-alert-info.tip .admonition-title, +.rst-content .wy-alert-info.warning .admonition-title, +.rst-content .seealso .admonition-title, +.rst-content .wy-alert-info.admonition-todo .admonition-title, +.rst-content .wy-alert-info.admonition .admonition-title { + background: #76b900; +} + +/* Background and Font Color of Alert Box Main Body*/ +.wy-alert.wy-alert-info, +.rst-content .note, +.rst-content .wy-alert-info.attention, +.rst-content .wy-alert-info.caution, +.rst-content .wy-alert-info.danger, +.rst-content .wy-alert-info.error, +.rst-content .wy-alert-info.hint, +.rst-content .wy-alert-info.important, +.rst-content .wy-alert-info.tip, +.rst-content .wy-alert-info.warning, +.rst-content .seealso, +.rst-content .wy-alert-info.admonition-todo, +.rst-content .wy-alert-info.admonition { + background: #333333; + color: #999999; +} + +.section { + margin-top: 50px; +} + +/* Logo */ +.navbar-brand-box { + background-color: #ffffff; +} + +/* ---------------------------------------------- Media Queries --------------------------------------- */ +@media (min-width: 1200px) { + 
.container-xl { + max-width: 100%; + } +} + +@media (min-width: 1400px) { + body { + font-size: 18px; + } + + #site-navigation nav ul.nav { + font-size: 18px; + } + + #site-navigation nav.bd-links p { + font-size: 18px; + } + + #site-navigation { + width: 350px; + } + + .toc-h2 { + font-size: 18px; + } + + .toc-h3 { + font-size: 1rem; + } + + .toc-h4 { + font-size: 0.85rem; + } + + .header-article .bd-toc { + font-size: 18px; + } + + #main-content>div { + margin-left: 10%; + margin-right: 10%; + } +} \ No newline at end of file diff --git a/docs/source/_static/js/pk_scripts.js b/docs/source/_static/js/pk_scripts.js new file mode 100644 index 0000000000000000000000000000000000000000..23c74982d3aad11ab94c31e623b60e0ebe66972a --- /dev/null +++ b/docs/source/_static/js/pk_scripts.js @@ -0,0 +1,19 @@ +document.addEventListener("DOMContentLoaded", function () { + var params = window.location.search.substring(1).split("&").reduce(function (params, param) { + if (!param) { + return params; + } + + var values = param.split("="); + var name = values[0]; + var value = values[1]; + params[name] = value; + return params; + }, {}); + + var form = document.getElementById("feedback-form"); + for (var name in params) { + var input = form.querySelector("[name=" + name + "]"); + input.value = params[name]; + } +}); \ No newline at end of file diff --git a/docs/source/_templates/layout.html b/docs/source/_templates/layout.html new file mode 100644 index 0000000000000000000000000000000000000000..c8651c293491cbc0eadfc36138eaadcb9552d359 --- /dev/null +++ b/docs/source/_templates/layout.html @@ -0,0 +1,14 @@ +{% extends "!layout.html" %} + +{% block extrahead %} + + + +{% endblock %} + +{% block footer %} + + + +{% endblock %} \ No newline at end of file diff --git a/docs/source/asr/api.rst b/docs/source/asr/api.rst new file mode 100644 index 0000000000000000000000000000000000000000..5735990dc82ad40fb888227eea97a7e5baf99886 --- /dev/null +++ b/docs/source/asr/api.rst @@ -0,0 +1,299 @@ +NeMo ASR collection API +======================= + + +Model Classes +------------- + +.. autoclass:: nemo.collections.asr.models.EncDecCTCModel + :show-inheritance: + :members: transcribe, change_vocabulary, setup_training_data, setup_optimization, setup_validation_data, setup_test_data, register_artifact + + +.. autoclass:: nemo.collections.asr.models.EncDecCTCModelBPE + :show-inheritance: + :members: transcribe, change_vocabulary, setup_training_data, setup_optimization, setup_validation_data, setup_test_data, register_artifact + + +.. autoclass:: nemo.collections.asr.models.EncDecRNNTModel + :show-inheritance: + :members: transcribe, change_vocabulary, setup_training_data, setup_optimization, setup_validation_data, setup_test_data, register_artifact + + +.. autoclass:: nemo.collections.asr.models.EncDecRNNTBPEModel + :show-inheritance: + :members: transcribe, change_vocabulary, setup_training_data, setup_optimization, setup_validation_data, setup_test_data, register_artifact + + +.. autoclass:: nemo.collections.asr.models.EncDecClassificationModel + :show-inheritance: + :members: setup_training_data, setup_optimization, setup_validation_data, setup_test_data, register_artifact + + +.. autoclass:: nemo.collections.asr.models.EncDecSpeakerLabelModel + :show-inheritance: + :members: setup_training_data, setup_optimization, setup_validation_data, setup_test_data, register_artifact + + +Modules +------- + +.. autoclass:: nemo.collections.asr.modules.ConvASREncoder + :show-inheritance: + :members: + +.. 
autoclass:: nemo.collections.asr.modules.ConvASRDecoder + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.modules.ConvASRDecoderClassification + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.modules.SpeakerDecoder + :show-inheritance: + :members: + +.. _conformer-encoder-api: + +.. autoclass:: nemo.collections.asr.modules.ConformerEncoder + :show-inheritance: + :members: + +.. _squeezeformer-encoder-api: + +.. autoclass:: nemo.collections.asr.modules.SqueezeformerEncoder + :show-inheritance: + :members: + +.. _rnn-encoder-api: + +.. autoclass:: nemo.collections.asr.modules.RNNEncoder + :show-inheritance: + :members: + +.. _rnnt-decoder-api: + +.. autoclass:: nemo.collections.asr.modules.RNNTDecoder + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.modules.StatelessTransducerDecoder + :show-inheritance: + :members: + +.. _rnnt-joint-api: + +.. autoclass:: nemo.collections.asr.modules.RNNTJoint + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.modules.SampledRNNTJoint + :show-inheritance: + :members: + + + +Parts +----- + +.. autoclass:: nemo.collections.asr.parts.submodules.jasper.JasperBlock + :show-inheritance: + :members: + + +Mixins +------ + +.. autoclass:: nemo.collections.asr.parts.mixins.mixins.ASRBPEMixin + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.parts.mixins.mixins.ASRModuleMixin + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.parts.mixins.interctc_mixin.InterCTCMixin + :show-inheritance: + :members: + +Datasets +-------- + +Character Encoding Datasets +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. autoclass:: nemo.collections.asr.data.audio_to_text.AudioToCharDataset + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.data.audio_to_text.TarredAudioToCharDataset + :show-inheritance: + :members: + +Subword Encoding Datasets +~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. autoclass:: nemo.collections.asr.data.audio_to_text.AudioToBPEDataset + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.data.audio_to_text.TarredAudioToBPEDataset + :show-inheritance: + :members: + +Audio Preprocessors +------------------- + +.. autoclass:: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.modules.AudioToMFCCPreprocessor + :show-inheritance: + :members: + +Audio Augmentors +---------------- + +.. autoclass:: nemo.collections.asr.modules.SpectrogramAugmentation + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.modules.CropOrPadSpectrogramAugmentation + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.parts.preprocessing.perturb.SpeedPerturbation + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.parts.preprocessing.perturb.TimeStretchPerturbation + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.parts.preprocessing.perturb.GainPerturbation + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.parts.preprocessing.perturb.ImpulsePerturbation + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.parts.preprocessing.perturb.ShiftPerturbation + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.parts.preprocessing.perturb.NoisePerturbation + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.parts.preprocessing.perturb.WhiteNoisePerturbation + :show-inheritance: + :members: + +.. 
autoclass:: nemo.collections.asr.parts.preprocessing.perturb.RirAndNoisePerturbation + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.parts.preprocessing.perturb.TranscodePerturbation + :show-inheritance: + :members: + +Miscellaneous Classes +--------------------- + +CTC Decoding +~~~~~~~~~~~~ + +.. autoclass:: nemo.collections.asr.metrics.wer.CTCDecoding + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.metrics.wer_bpe.CTCBPEDecoding + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.parts.submodules.ctc_greedy_decoding.GreedyCTCInfer + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.parts.submodules.ctc_beam_decoding.BeamCTCInfer + :show-inheritance: + :members: + +RNNT Decoding +~~~~~~~~~~~~~ + +.. autoclass:: nemo.collections.asr.metrics.rnnt_wer.RNNTDecoding + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.metrics.rnnt_wer_bpe.RNNTBPEDecoding + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.parts.submodules.rnnt_greedy_decoding.GreedyRNNTInfer + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.parts.submodules.rnnt_greedy_decoding.GreedyBatchedRNNTInfer + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.parts.submodules.rnnt_beam_decoding.BeamRNNTInfer + :show-inheritance: + :members: + +Hypotheses +~~~~~~~~~~ + +.. autoclass:: nemo.collections.asr.parts.utils.rnnt_utils.Hypothesis + :show-inheritance: + :no-members: + +.. autoclass:: nemo.collections.asr.parts.utils.rnnt_utils.NBestHypotheses + :show-inheritance: + :no-members: + +Adapter Networks +~~~~~~~~~~~~~~~~ + +.. autoclass:: nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.MultiHeadAttentionAdapter + :show-inheritance: + :members: + :member-order: bysource + +----- + +.. autoclass:: nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.RelPositionMultiHeadAttentionAdapter + :show-inheritance: + :members: + :member-order: bysource + +----- + +.. autoclass:: nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.PositionalEncodingAdapter + :show-inheritance: + :members: + :member-order: bysource + +----- + +.. autoclass:: nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.RelPositionalEncodingAdapter + :show-inheritance: + :members: + :member-order: bysource + + +Adapter Strategies +~~~~~~~~~~~~~~~~~~ + +.. autoclass:: nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.MHAResidualAddAdapterStrategy + :show-inheritance: + :members: + :member-order: bysource + :undoc-members: adapter_module_names + +----- + diff --git a/docs/source/asr/asr_all.bib b/docs/source/asr/asr_all.bib new file mode 100644 index 0000000000000000000000000000000000000000..01c765f68f371557b5b37d38598c899711c7abf7 --- /dev/null +++ b/docs/source/asr/asr_all.bib @@ -0,0 +1,1043 @@ +@article{matchboxnet, + title={{MatchboxNet}: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition}, + author={Majumdar, Somshubra and Ginsburg, Boris}, + journal={Proc. 
Interspeech 2020}, + year={2020} +} + +@article{marblenet, + title={MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection}, + author={Jia, Fei and Majumdar, Somshubra and Ginsburg, Boris}, + journal={arXiv preprint arXiv:2010.13886}, + year={2020} +} + +@inproceedings{panayotov2015librispeech, + title={Librispeech: an ASR corpus based on public domain audio books}, + author={Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev}, + booktitle={Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on}, + pages={5206--5210}, + year={2015}, + organization={IEEE} +} + +@article{luong17, + author = {Minh{-}Thang Luong and Eugene Brevdo and Rui Zhao}, + title = {Neural Machine Translation (seq2seq) Tutorial}, + journal = {https://github.com/tensorflow/nmt}, + year = {2017}, +} + +@INPROCEEDINGS{LaurentSeqWiseBN, +author={C. {Laurent} and G. {Pereyra} and P. {Brakel} and Y. {Zhang} and Y. {Bengio}}, +booktitle={2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, +title={Batch normalized recurrent neural networks}, +year={2016}, +volume={}, +number={}, +pages={2657-2661}, +keywords={feedforward neural nets;learning (artificial intelligence);recurrent neural nets;speech recognition;batch normalized recurrent neural networks;RNN;sequential data;long-term dependency learning;convergence rate improvement;intermediate representation normalization;feedforward neural networks;speech recognition task;language modeling;training criterion;Training;Recurrent neural networks;Convergence;Speech recognition;Computer architecture;Speech;batch normalization;RNN;LSTM;optimization}, +doi={10.1109/ICASSP.2016.7472159}, +ISSN={2379-190X}, +month={March},} + +@article{graves2005, + author = {Alex Graves and Jurgen Schmidhuber}, + title = {Framewise phoneme classification with bidirectional LSTM and other neural network architectures}, + journal = {Neural Networks, vol. 18}, + pages={602–-610}, + year = {2005}, +} + +@inproceedings{graves2006, + title={Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks}, + author={Graves, Alex and Fern{\'a}ndez, Santiago and Gomez, Faustino and Schmidhuber, J{\"u}rgen}, + booktitle={Proceedings of the 23rd international conference on Machine learning}, + pages={369--376}, + year={2006}, + organization={ACM} +} + +@article{li2019jasper, + title={Jasper: An End-to-End Convolutional Neural Acoustic Model}, + author={Li, Jason and Lavrukhin, Vitaly and Ginsburg, Boris and Leary, Ryan and Kuchaiev, Oleksii and Cohen, Jonathan M and Nguyen, Huyen and Gadde, Ravi Teja}, + journal={arXiv preprint arXiv:1904.03288}, + year={2019} +} + +@misc{ardila2019common, + title={Common Voice: A Massively-Multilingual Speech Corpus}, + author={Rosana Ardila and Megan Branson and Kelly Davis and Michael Henretty and Michael Kohler and Josh Meyer and Reuben Morais and Lindsay Saunders and Francis M. 
Tyers and Gregor Weber}, + year={2019}, + eprint={1912.06670}, + archivePrefix={arXiv}, + primaryClass={cs.CL} +} + +@article{graves2012, + title={Sequence Transduction with Recurrent Neural Networks}, + author={Graves, Alex}, + journal={arXiv preprint arXiv:1211.3711}, + year={2012} +} + + +@article{graves2013, + title={Generating sequences with recurrent neural networks}, + author={Graves, Alex}, + journal={arXiv preprint arXiv:1308.0850}, + year={2013} +} + +@article{sergeev2018horovod, + title={Horovod: fast and easy distributed deep learning in TensorFlow}, + author={Sergeev, Alexander and Del Balso, Mike}, + journal={arXiv preprint arXiv:1802.05799}, + year={2018} +} + +@misc{NVVolta, + title = {NVIDIA TESLA V100 GPU ARCHITECTURE}, + howpublished = {\url{http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf}}, + note = {Accessed: 2018-10-09} +} + +@article{NVTuring, + title = {NVIDIA TURING GPU ARCHITECTURE}, + howpublished = {\url{https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf}}, + author = {NVIDIA}, + year = {2018}, + note = {Accessed: 2018-10-09} +} + +@misc{Rygaard2015, + title = {Using Synthesized Speech to Improve Speech Recognition for Low-Resource Languages}, + author = {Luise Valentin Rygaard}, + howpublished = {\url{https://parasol.tamu.edu/dreu2015/Rygaard/report.pdf}}, + year = {2015}, +} + +@misc{OpenSeq2Seq, + title = {OpenSeq2Seq: extensible toolkit for distributed and mixed precision training of sequence-to-sequence models}, + author = {Kuchaiev, Oleksii and Ginsburg, Boris and Gitman, Igor and Lavrukhin,Vitaly and Case, Carl and Micikevicius, Paulius}, + howpublished = {\url{https://arxiv.org/abs/1805.10387}}, + year = {2018}, +} + +@misc{MPGuide, + title = {Training with Mixed Precision}, + howpublished = {\url{http://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/}}, + note = {Accessed: 2018-04-06}, +} + +@misc{Mozilla, + title = {Mozilla: A Journey to less than 10\% Word Error Rate}, + howpublished = {\url{https://hacks.mozilla.org/2017/11/a-journey-to-10-word-error-rate/}}, + note = {Accessed: 2018-04-06}, +} + +@article{Waibel1989, + title={A time-delay neural network architecture for isolated word recognition}, + author={Waibel, Alexander, and Hanazawa, Toshiyki and Hinton,Geoffrey and Shirano, Kiyohiro and Lang, Kevin }, + journal={IEEE Trans. on Acoustics, Speech and Signal Processing}, + year={1989} +} + +@article{Lang1990, + title={A time-delay neural network architecture for isolated word recognition}, + author={Lang, Kevin and Waibel, Alexander, and Hinton,Geoffrey }, + journal={Neural Networks}, + year={1990} +} + +@book{Bengio1996, + Author = {Bengio, Y.}, + Publisher = {International Thomson Computer Press}, + Title = {Neural Networks for Speech and Sequence Recognition}, + Year = {1996} +} + +@article{Bengio1992, + title={Global optimization of a neural network-hidden Markov model hybrid}, + author={Bengio, Y., and De Mori, R., and Flammia, G., and Kompe, R. }, + journal={IEEE Transactions on Neural Networks, 3(2), 252–259}, + year={1992} +} + +@article{Bourlard1994, + title={Connectionist speech recognition: a hybrid approach}, + author={Bourlard, H. A. 
and Morgan, N.}, + journal={volume 247 Springer }, + year={1994} +} + +@article{srivastava14a, + author = {Nitish Srivastava, and Geoffrey Hinton, and Alex Krizhevsky, and Ilya Sutskever, and Ruslan Salakhutdinov}, + title = {Dropout: A Simple Way to Prevent Neural Networks from Overfitting}, + journal = {Journal of Machine Learning Research}, + year = {2014}, + volume = {15}, + pages = {1929-1958}, + url = {http://jmlr.org/papers/v15/srivastava14a.html} +} + + +@article{Hinton2012, + title={Deep Neural Networks for Acoustic Modeling in Speech Recognition}, + author={ Hinton,Geoffrey and Deng, Li and Yu, Dong and Dahl,George and Mohamed,Abdel-rahman and Jaitly, Navdeep and Senior, Andrew and Vanhoucke, Vincent and Nguyen, Patrick and Kingsbury, Brian and Sainath, Tara}, + journal={IEEE Signal Processing Magazine}, + year={2012} +} + +@article{Graves2014, + title={Towards End-to-End Speech Recognition with Recurrent Neural Networks}, + author={Graves, Alex and Jaitly, Navdeep}, + journal={International Conference on Machine Learning}, + year={2014} +} + +@article{Chorowski2014, + title={End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results}, + author={ Chorowski, Jan, and Bahdanau, Dzmitry , and Cho, Kyunghyun , and Bengio, Yoshua }, + journal={Neural Information Processing Systems: Workshop Deep Learning and Representation Learning Workshop }, + year={2014} +} + +@article{Sak2014, + title={Long short-term memory recurrent neural network architectures for large scale acoustic modeling}, + author={Sak, Hasim and Senior, Andrew and Beaufays, Francoise }, + journal={Interspeech 2014}, + year={2014} +} + +@article{Ko2015, + title={Audio Augmentation for Speech Recognition}, + author={Tom, Ko and Vijayaditya, Peddinti and Daniel, Povey + and Sanjeev, Khudanpur }, + journal={Interspeech 2015}, + year={2015} +} + +@article{Tjandra2017, + title={Listening while Speaking: Speech Chain by Deep Learning}, + author={Andros, Tjandra and Sakriani, Sakti and Satoshi, Nakamura }, + journal={ASRU 2017}, + year={2017} +} + +@article{Tjandra2018, + title={Machine Speech Chain with One-shot Speaker Adaptation}, + author={Andros, Tjandra and Sakriani, Sakti and Satoshi, Nakamura }, + journal={Interspeech 2018}, + year={2018} +} + +@article{bahdanau2014neural, + title={Neural machine translation by jointly learning to align and translate}, + author={Bahdanau, Dzmitry and Cho, Kyunghyun and Bengio, Yoshua}, + journal={arXiv preprint arXiv:1409.0473}, + year={2014} +} + +@article{cho2014learning, + title={Learning phrase representations using RNN encoder-decoder for statistical machine translation}, + author={Cho, Kyunghyun and Van Merri{\"e}nboer, Bart and Gulcehre, Caglar and Bahdanau, Dzmitry and Bougares, Fethi and Schwenk, Holger and Bengio, Yoshua}, + journal={arXiv preprint arXiv:1406.1078}, + year={2014} +} + +@article{rush2015neural, + title={A neural attention model for abstractive sentence summarization}, + author={Rush, Alexander M and Chopra, Sumit and Weston, Jason}, + journal={arXiv preprint arXiv:1509.00685}, + year={2015} +} + +@article{micikevicius2017mixed, + title={Mixed precision training}, + author={Micikevicius, Paulius and Narang, Sharan and Alben, Jonah and Diamos, Gregory and Elsen, Erich and Garcia, David and Ginsburg, Boris and Houston, Michael and Kuchaev, Oleksii and Venkatesh, Ganesh and others}, + journal={arXiv preprint arXiv:1710.03740}, + year={2017} +} + +@ARTICLE{Britz:2017, + author = {{Britz}, Denny and {Goldie}, Anna and {Luong}, Thang 
and {Le}, Quoc}, + title = {Massive Exploration of Neural Machine Translation Architectures}, + journal = {ArXiv e-prints arXiv:1703.03906}, + archivePrefix = "arXiv", + eprinttype = {arxiv}, + eprint = {1703.03906}, + primaryClass = "cs.CL", + keywords = {Computer Science - Computation and Language}, + year = 2017, + month = mar +} + +@inproceedings{abadi2016tensorflow, + title={TensorFlow: A System for Large-Scale Machine Learning.}, + author={Abadi, Mart{\'\i}n and Barham, Paul and Chen, Jianmin and Chen, Zhifeng and Davis, Andy and Dean, Jeffrey and Devin, Matthieu and Ghemawat, Sanjay and Irving, Geoffrey and Isard, Michael and others}, + booktitle={OSDI}, + volume={16}, + pages={265--283}, + year={2016} +} + +@article{tensor2tensor, + author = {Ashish Vaswani and Samy Bengio and Eugene Brevdo and Francois Chollet and Aidan N. Gomez and Stephan Gouws and Llion Jones and \L{}ukasz Kaiser and Nal Kalchbrenner and Niki Parmar and Ryan Sepassi and + Noam Shazeer and Jakob Uszkoreit}, + title = {Tensor2Tensor for Neural Machine Translation}, + journal = {CoRR}, + volume = {abs/1803.07416}, + year = {2018}, + url = {http://arxiv.org/abs/1803.07416}, +} + +@article{gehring2017convs2s, + author = {Gehring, Jonas, and Auli, Michael and Grangier, David and Yarats, Denis and Dauphin, Yann N}, + title = "{Convolutional Sequence to Sequence Learning}", + journal = {ArXiv e-prints arXiv:1705.03122}, + archivePrefix = "arXiv", + eprinttype = {arxiv}, + eprint = {1705.03122}, + primaryClass = "cs.CL", + keywords = {Computer Science - Computation and Language}, + year = 2017, + month = May, +} + +@inproceedings{chan2015, + title={Listen, attend and spell}, + author={Chan, William and Jaitly, Navdeep and Le, Quoc V and Vinyals, Oriol}, + booktitle={Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on}, + pages={5206--5210}, + year={2016}, + organization={IEEE} +} + +@inproceedings{xu2015show, + title={Show, attend and tell: Neural image caption generation with visual attention}, + author={Xu, Kelvin and Ba, Jimmy and Kiros, Ryan and Cho, Kyunghyun and Courville, Aaron and Salakhudinov, Ruslan and Zemel, Rich and Bengio, Yoshua}, + booktitle={International Conference on Machine Learning}, + pages={2048--2057}, + year={2015} +} + +@incollection{Sutskever2014, + title = {Sequence to Sequence Learning with Neural Networks}, + author = {Sutskever, Ilya and Vinyals, Oriol and Le, Quoc V}, + booktitle = {Advances in Neural Information Processing Systems 27}, + editor = {Z. Ghahramani and M. Welling and C. Cortes and N. D. Lawrence and K. Q. Weinberger}, + pages = {3104--3112}, + year = {2014}, + publisher = {Curran Associates, Inc.}, + url = {http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf} +} + +@article{DeepSpeech2014, + title = {Deep Speech: Scaling up end-to-end speech recognition}, + author = {Awni Y. Hannun and Carl Case and Jared Casper and Bryan Catanzaro and Greg Diamos and Erich Elsen and Ryan Prenger and Sanjeev Satheesh and Shubho Sengupta and Adam Coates and Andrew Y. 
Ng}, + journal = {CoRR}, + volume = {abs/1412.5567}, + year = {2014}, + url = {http://arxiv.org/abs/1412.5567}, + archivePrefix = {arXiv}, + eprint = {1412.5567}, + timestamp = {Mon, 13 Aug 2018 16:48:07 +0200}, + biburl = {https://dblp.org/rec/bib/journals/corr/HannunCCCDEPSSCN14}, + bibsource = {dblp computer science bibliography, https://dblp.org} +} + +@inproceedings{DeepSpeech2, + author = {Amodei, Dario and Ananthanarayanan, Sundaram and Anubhai, Rishita and Bai, Jingliang and Battenberg, Eric and Case, Carl and Casper, Jared and Catanzaro, Bryan and Cheng, Qiang and Chen, Guoliang and Chen, Jie and Chen, Jingdong and Chen, Zhijie and Chrzanowski, Mike and Coates, Adam and Diamos, Greg and Ding, Ke and Du, Niandong and Elsen, Erich and Engel, Jesse and Fang, Weiwei and Fan, Linxi and Fougner, Christopher and Gao, Liang and Gong, Caixia and Hannun, Awni and Han, Tony and Johannes, Lappi Vaino and Jiang, Bing and Ju, Cai and Jun, Billy and LeGresley, Patrick and Lin, Libby and Liu, Junjie and Liu, Yang and Li, Weigao and Li, Xiangang and Ma, Dongpeng and Narang, Sharan and Ng, Andrew and Ozair, Sherjil and Peng, Yiping and Prenger, Ryan and Qian, Sheng and Quan, Zongfeng and Raiman, Jonathan and Rao, Vinay and Satheesh, Sanjeev and Seetapun, David and Sengupta, Shubho and Srinet, Kavya and Sriram, Anuroop and Tang, Haiyuan and Tang, Liliang and Wang, Chong and Wang, Jidong and Wang, Kaifu and Wang, Yi and Wang, Zhijian and Wang, Zhiqian and Wu, Shuang and Wei, Likai and Xiao, Bo and Xie, Wen and Xie, Yan and Yogatama, Dani and Yuan, Bin and Zhan, Jun and Zhu, Zhenyao}, + title = {Deep Speech 2: End-to-end Speech Recognition in English and Mandarin}, + booktitle = {Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48}, + series = {ICML'16}, + year = {2016}, + location = {New York, NY, USA}, + pages = {173--182}, + numpages = {10}, + url = {http://dl.acm.org/citation.cfm?id=3045390.3045410}, + acmid = {3045410}, + publisher = {JMLR.org}, +} + +@inproceedings{prabhavalkar2017comparison, + title={A comparison of sequence-to-sequence models for speech recognition}, + author={Prabhavalkar, Rohit and Rao, Kanishka and Sainath, Tara N and Li, Bo and Johnson, Leif and Jaitly, Navdeep}, + booktitle={Proc. 
Interspeech}, + pages={939--943}, + year={2017} +} + +@article{chiu2017state, + title={State-of-the-art speech recognition with sequence-to-sequence models}, + author={Chiu, Chung-Cheng and Sainath, Tara N and Wu, Yonghui and Prabhavalkar, Rohit and Nguyen, Patrick and Chen, Zhifeng and Kannan, Anjuli and Weiss, Ron J and Rao, Kanishka and Gonina, Katya and others}, + journal={arXiv preprint arXiv:1712.01769}, + year={2017} +} + +@misc{NVMixed, + title = {{NVIDA's Mixed-Precision Training - TensorFlow example}}, + howpublished = {\url{https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/#example_tensorflow}}, + author={NVIDIA}, + note = {Accessed: 2018-10-09}, + year={2018} +} + +@article{gehring2017, + title={Convolutional sequence to sequence learning}, + author={Gehring, Jonas and Auli, Michael and Grangier, David and Yarats, Denis and Dauphin, Yann N}, + journal={arXiv preprint arXiv:1705.03122}, + year={2017} +} + +@article{collobert2016, + title={Wav2letter: an end-to-end convnet-based speech recognition system}, + author={Collobert, Ronan and Puhrsch, Christian and Synnaeve, Gabriel}, + journal={arXiv preprint arXiv:1609.03193}, + year={2016} +} + +@inproceedings{Zhang2016, +author={Ying Zhang and Mohammad Pezeshki and Philémon Brakel and Saizheng Zhang and César Laurent and Yoshua Bengio and Aaron Courville}, +title={Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks}, +year=2016, +booktitle={Interspeech 2016}, +doi={10.21437/Interspeech.2016-1446}, +url={http://dx.doi.org/10.21437/Interspeech.2016-1446}, +pages={410--414} +} + +@inproceedings{Zhang2017, + title={Very deep convolutional networks for end-to-end speech recognition}, + author={Zhang, Yu, and Chan, William, and Jaitly, Navdeep}, + booktitle={Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on}, + year={2017}, + organization={IEEE} +} + + +@article{Wang2017, + title={Tacotron: Towards End-to-End Speech Synthesis}, + author={ Wang, Yuxuan, and Skerry-Ryan, RJ, and Stanton, Daisy and Wu, Yonghui and Weiss, Ron, and Jaitly, Navdeep and Yang, Zongheng and Xiao, Ying and Chen,Zhifeng and Bengio, Samy and Le, Quoc and Agiomyrgiannakis, Yannis and Clark,Rob and Saurous, Rif A.}, + journal={arXiv preprint arXiv:1703.10135}, + year={2017} +} + +@article{griffin1984signal, + title={Signal estimation from modified short-time Fourier transform}, + author={Griffin, Daniel and Lim, Jae}, + journal={IEEE Transactions on Acoustics, Speech, and Signal Processing}, + volume={32}, + number={2}, + pages={236--243}, + year={1984}, + publisher={IEEE} +} + +@misc{ito2017lj, + title={The LJ speech dataset}, + author={Ito, Keith and others}, + year={2017} +} + +@misc{mailabs, + title = {{The M-AILABS Speech Dataset}}, + howpublished = {\url{http://www.m-ailabs.bayern/en/the-mailabs-speech-dataset/}}, + author={M-AILABS}, + note = {Accessed: 2018-10-09}, + year={2018} +} + +@article{merity2016pointer, + title={Pointer sentinel mixture models}, + author={Merity, Stephen and Xiong, Caiming and Bradbury, James and Socher, Richard}, + journal={arXiv preprint arXiv:1609.07843}, + year={2016} +} + +@inproceedings{socher2013recursive, + title={Recursive deep models for semantic compositionality over a sentiment treebank}, + author={Socher, Richard and Perelygin, Alex and Wu, Jean and Chuang, Jason and Manning, Christopher D and Ng, Andrew and Potts, Christopher}, + booktitle={Proceedings of the 2013 conference on empirical methods in natural language processing}, + 
pages={1631--1642}, + year={2013} +} + +@InProceedings{maas-EtAl:2011:ACL-HLT2011, + author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher}, + title = {Learning Word Vectors for Sentiment Analysis}, + booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies}, + month = {June}, + year = {2011}, + address = {Portland, Oregon, USA}, + publisher = {Association for Computational Linguistics}, + pages = {142--150}, + url = {http://www.aclweb.org/anthology/P11-1015} +} + +@inproceedings{Povey2018SemiOrthogonalLM, + title={Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks}, + author={Daniel Povey and Gaofeng Cheng and Yiming Wang and Ke Li and Hainan Xu and Mahsa Yarmohammadi and Sanjeev Khudanpur}, + booktitle={Interspeech}, + year={2018} +} + +@article{CAPIO2017, + author = {Kyu J. Han and Akshay Chandrashekaran and Jungsuk Kim and Ian R. Lane}, + title = {The {CAPIO} 2017 Conversational Speech Recognition System}, + journal = {CoRR}, + volume = {abs/1801.00059}, + year = {2018}, + url = {http://arxiv.org/abs/1801.00059}, + archivePrefix = {arXiv}, + eprint = {1801.00059}, + timestamp = {Mon, 13 Aug 2018 16:49:10 +0200}, + biburl = {https://dblp.org/rec/bib/journals/corr/abs-1801-00059}, + bibsource = {dblp computer science bibliography, https://dblp.org} +} + +@article{WaveNet, + author = {A{\"{a}}ron van den Oord and Sander Dieleman and Heiga Zen and Karen Simonyan and Oriol Vinyals and Alex Graves and Nal Kalchbrenner and Andrew W. Senior and Koray Kavukcuoglu}, + title = {WaveNet: {A} Generative Model for Raw Audio}, + journal = {CoRR}, + volume = {abs/1609.03499}, + year = {2016}, + url = {http://arxiv.org/abs/1609.03499}, + archivePrefix = {arXiv}, + eprint = {1609.03499}, + timestamp = {Mon, 13 Aug 2018 16:49:15 +0200}, + biburl = {https://dblp.org/rec/bib/journals/corr/OordDZSVGKSK16}, + bibsource = {dblp computer science bibliography, https://dblp.org} +} + +@article{FacebookGERENGBackTranslation, + author = {Rico Sennrich and Barry Haddow and Alexandra Birch}, + title = {Improving Neural Machine Translation Models with Monolingual Data}, + journal = {CoRR}, + volume = {abs/1511.06709}, + year = {2015}, + url = {http://arxiv.org/abs/1511.06709}, + archivePrefix = {arXiv}, + eprint = {1511.06709}, + timestamp = {Mon, 13 Aug 2018 16:47:05 +0200}, + biburl = {https://dblp.org/rec/bib/journals/corr/SennrichHB15a}, + bibsource = {dblp computer science bibliography, https://dblp.org} +} + +@article{GlobalStyleTokens, + author = {Yuxuan Wang and Daisy Stanton and Yu Zhang and R. J. Skerry{-}Ryan and Eric Battenberg and Joel Shor and Ying Xiao and Fei Ren and Ye Jia and Rif A. 
Saurous}, + title = {Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis}, + journal = {CoRR}, + volume = {abs/1803.09017}, + year = {2018}, + url = {http://arxiv.org/abs/1803.09017}, + archivePrefix = {arXiv}, + eprint = {1803.09017}, + timestamp = {Mon, 13 Aug 2018 16:46:53 +0200}, + biburl = {https://dblp.org/rec/bib/journals/corr/abs-1803-09017}, + bibsource = {dblp computer science bibliography, https://dblp.org} +} + +@article{IoffeS15BatchNorm, + author = {Sergey Ioffe and Christian Szegedy}, + title = {Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift}, + journal = {CoRR}, + volume = {abs/1502.03167}, + year = {2015}, + url = {http://arxiv.org/abs/1502.03167}, + archivePrefix = {arXiv}, + eprint = {1502.03167}, + timestamp = {Mon, 13 Aug 2018 16:47:06 +0200}, + biburl = {https://dblp.org/rec/bib/journals/corr/IoffeS15}, + bibsource = {dblp computer science bibliography, https://dblp.org} +} + +@article{kingma, + author = {Diederik P. Kingma and + Jimmy Ba}, + title = {Adam: {A} Method for Stochastic Optimization}, + journal = {CoRR}, + volume = {abs/1412.6980}, + year = {2014}, + url = {http://arxiv.org/abs/1412.6980}, + archivePrefix = {arXiv}, + eprint = {1412.6980}, + timestamp = {Mon, 13 Aug 2018 01:00:00 +0200}, + biburl = {https://dblp.org/rec/bib/journals/corr/KingmaB14}, + bibsource = {dblp computer science bibliography, https://dblp.org} +} + +@incollection{Salimans2016WeightNorm, + title = {Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks}, + author = {Salimans, Tim and Kingma, Durk P}, + booktitle = {Advances in Neural Information Processing Systems 29}, + editor = {D. D. Lee and M. Sugiyama and U. V. Luxburg and I. Guyon and R. Garnett}, + pages = {901--909}, + year = {2016}, + publisher = {Curran Associates, Inc.}, + url = {http://papers.nips.cc/paper/6114-weight-normalization-a-simple-reparameterization-to-accelerate-training-of-deep-neural-networks.pdf} +} + +@article{wu2016google, + title={Google's neural machine translation system: Bridging the gap between human and machine translation}, + author={Wu, Yonghui and Schuster, Mike and Chen, Zhifeng and Le, Quoc V and Norouzi, Mohammad and Macherey, Zolfgang and Krikun, Maxim and Cao, Yuan and Gao, Qin and Macherey, Klaus and others}, + journal={arXiv preprint arXiv:1609.08144}, + year={2016} +} + +@inproceedings{opennmt, + author = {Guillaume Klein and Yoon Kim and Yuntian Deng and Jean Senellart and Alexander M. Rush}, + title = {OpenNMT: Open-Source Toolkit for Neural Machine Translation}, + booktitle = {Proc. ACL}, + year = {2017}, + url = {https://doi.org/10.18653/v1/P17-4012}, + doi = {10.18653/v1/P17-4012} +} + +@article{paszke2017automatic, + title={Automatic differentiation in PyTorch}, + author={Paszke, Adam and Gross, Sam and Chintala, Soumith and Chanan, Gregory and Yang, Edward and DeVito, Zachary and Lin, Zeming and Desmaison, Alban and Antiga, Luca and Lerer, Adam}, + year={2017} +} + +@article{yu2014introduction, + title={An introduction to computational networks and the computational network toolkit}, + author={Yu, Dong and Eversole, Adam and Seltzer, Mike and Yao, Kaisheng and Huang, Zhiheng and Guenter, Brian and Kuchaiev, Oleksii and Zhang, Yu and Seide, Frank and Wang, Huaming and others}, + journal={Microsoft Technical Report MSR-TR-2014--112}, + year={2014} +} + +@article{nvidia2017v100, + title={V100 GPU architecture. The world’s most advanced data center GPU. 
Version WP-08608-001\_v1. 1}, + author={NVIDIA, Tesla}, + journal={NVIDIA. Aug}, + pages={108}, + year={2017} +} + +@article{Ba2016LayerNorm, + author = {Jimmy Lei Ba and Jamie Ryan Kiros and Geoffrey E Hinton}, + title = {Layer normalization}, + journal = {CoRR}, + volume = {abs/1607.06450}, + year = {2016}, + url = {http://arxiv.org/abs/1607.06450}, + archivePrefix = {arXiv}, +} + +@inproceedings{Dauphin2017GLU, + author = {Dauphin, Yann N. and Fan, Angela and Auli, Michael and Grangier, David}, + title = {Language Modeling with Gated Convolutional Networks}, + booktitle = {Proceedings of the 34th International Conference on Machine Learning - Volume 70}, + series = {ICML'17}, + year = {2017}, + location = {Sydney, NSW, Australia}, + pages = {933--941}, + numpages = {9}, + url = {http://dl.acm.org/citation.cfm?id=3305381.3305478}, + acmid = {3305478}, + publisher = {JMLR.org}, +} + +@incollection{Oord2016PixelCNN, +title = {Conditional Image Generation with PixelCNN Decoders}, +author = {van den Oord, Aaron and Kalchbrenner, Nal and Espeholt, Lasse and kavukcuoglu, koray and Vinyals, Oriol and Graves, Alex}, +booktitle = {Advances in Neural Information Processing Systems 29}, +editor = {D. D. Lee and M. Sugiyama and U. V. Luxburg and I. Guyon and R. Garnett}, +pages = {4790--4798}, +year = {2016}, +publisher = {Curran Associates, Inc.}, +url = {http://papers.nips.cc/paper/6527-conditional-image-generation-with-pixelcnn-decoders.pdf} +} + +@article{he2015, + title={Deep residual learning for image recognition}, + author={K. He, and X. Zhang, and S. Ren, and J. Sun}, + journal={arXiv preprint arXiv:1512.03385}, + year={2015} +} + +@article{huang2016, + title={Densely Connected Convolutional Networks}, + author={Gao Huang, and Zhuang Liu, and Laurens van der Maaten, and Kilian Q. Weinberger}, + journal={arXiv preprint arXiv:1608.06993}, + year={2016} +} + +@inproceedings{heafield2011kenlm, + title={KenLM: Faster and smaller language model queries}, + author={Heafield, Kenneth}, + booktitle={Proceedings of the sixth workshop on statistical machine translation}, + pages={187--197}, + year={2011}, + organization={Association for Computational Linguistics} +} + +@article{dai2018transformer, + title={Transformer-XL: Language Modeling with Longer-Term Dependency}, + author={Dai, Zihang and Yang, Zhilin and Yang, Yiming and Cohen, William W and Carbonell, Jaime and Le, Quoc V and Salakhutdinov, Ruslan}, + year={2018}, + journal = {CoRR}, + volume = {abs/1901.02860}, + url = {http://arxiv.org/abs/1901.02860}, + archivePrefix = {arXiv}, + eprint = {1901.02860}, + timestamp = {Fri, 01 Feb 2019 13:39:59 +0100}, + biburl = {https://dblp.org/rec/bib/journals/corr/abs-1901-02860}, + bibsource = {dblp computer science bibliography, https://dblp.org} +} + +@inproceedings{Saon+2016, +author={George Saon and Tom Sercu and Steven Rennie and Hong-Kwang J. Kuo}, +title={The IBM 2016 English Conversational Telephone Speech Recognition System}, +year=2016, +booktitle={Interspeech 2016}, +doi={10.21437/Interspeech.2016-1460}, +url={http://dx.doi.org/10.21437/Interspeech.2016-1460}, +pages={7--11} +} + +@INPROCEEDINGS{Sercu-2016, +author={T. {Sercu} and C. {Puhrsch} and B. {Kingsbury} and Y. 
{LeCun}}, +booktitle={2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, +title={Very deep multilingual convolutional neural networks for LVCSR}, +year={2016}, +volume={}, +number={}, +pages={4955-4959}, +keywords={natural language processing;neural nets;speech recognition;very deep multilingual convolutional neural networks;LVCSR;CNN;large vocabulary continuous speech recognition systems;word error rate;Training;Context;Hidden Markov models;Neural networks;Computer architecture;Kernel;Training data;Convolutional Networks;Multilingual;Acoustic Modeling;Speech Recognition;Neural Networks}, +doi={10.1109/ICASSP.2016.7472620}, +ISSN={2379-190X}, +month={March},} + + +@inproceedings{Sercu+2016, +author={Tom Sercu and Vaibhava Goel}, +title={Advances in Very Deep Convolutional Neural Networks for LVCSR}, +year=2016, +booktitle={Interspeech 2016}, +doi={10.21437/Interspeech.2016-1033}, +url={http://dx.doi.org/10.21437/Interspeech.2016-1033}, +pages={3429--3433} +} + +@INPROCEEDINGS{Xiong-2018, +author={W. {Xiong} and L. {Wu} and F. {Alleva} and J. {Droppo} and X. {Huang} and A. {Stolcke}}, +booktitle={2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, +title={The Microsoft 2017 Conversational Speech Recognition System}, +year={2018}, +volume={}, +number={}, +pages={5934-5938}, +keywords={convolution;feedforward neural nets;natural language processing;speaker recognition;speech processing;language model rescoring step;senone level;switchboard domains;character-based LSTM language models;NIST 2000 switchboard test set;frame level;word-level voting;acoustic model posteriors;dialog session aware LSTM language models;CNN-BLSTM acoustic model;Microsoft 2017 conversational speech recognition system;Acoustics;Error analysis;Training;Speech recognition;Switches;Computational modeling;Context modeling;Conversational speech recognition;CNN;LACE;BLSTM;LSTM-LM;system combination;human parity}, +doi={10.1109/ICASSP.2018.8461870}, +ISSN={2379-190X}, +month={April},} + +@inproceedings{zeyer2018improved, + author={Albert Zeyer and Kazuki Irie and Ralf Schlüter and Hermann Ney}, + title={Improved Training of End-to-end Attention Models for Speech Recognition}, + year=2018, + booktitle={Proc. 
Interspeech 2018}, + pages={7--11}, + doi={10.21437/Interspeech.2018-1616}, + url={http://dx.doi.org/10.21437/Interspeech.2018-1616} +} + +@article{Wav2LetterV2, + author = {Vitaliy Liptchinsky and + Gabriel Synnaeve and + Ronan Collobert}, + title = {Letter-Based Speech Recognition with Gated ConvNets}, + journal = {CoRR}, + volume = {abs/1712.09444}, + year = {2017}, + url = {http://arxiv.org/abs/1712.09444}, + archivePrefix = {arXiv}, + eprint = {1712.09444}, + timestamp = {Mon, 13 Aug 2018 16:46:33 +0200}, + biburl = {https://dblp.org/rec/bib/journals/corr/abs-1712-09444}, + bibsource = {dblp computer science bibliography, https://dblp.org} +} + +@article{zeghidour2018, + author = {Neil Zeghidour and + Qiantong Xu and + Vitaliy Liptchinsky and + Nicolas Usunier and + Gabriel Synnaeve and + Ronan Collobert}, + title = {Fully Convolutional Speech Recognition}, + journal = {CoRR}, + volume = {abs/1812.06864}, + year = {2018}, + url = {http://arxiv.org/abs/1812.06864}, + archivePrefix = {arXiv}, + eprint = {1812.06864}, + timestamp = {Tue, 01 Jan 2019 15:01:25 +0100}, + biburl = {https://dblp.org/rec/bib/journals/corr/abs-1812-06864}, + bibsource = {dblp computer science bibliography, https://dblp.org} +} + +@inproceedings{Hadian2018, + author={Hossein Hadian and Hossein Sameti and Daniel Povey and Sanjeev Khudanpur}, + title={End-to-end Speech Recognition Using Lattice-free MMI}, + year=2018, + booktitle={Proc. Interspeech 2018}, + pages={12--16}, + doi={10.21437/Interspeech.2018-1423}, + url={http://dx.doi.org/10.21437/Interspeech.2018-1423} +} + +@inproceedings{Tang2018, + author={Jian Tang and Yan Song and Lirong Dai and Ian McLoughlin}, + title={Acoustic Modeling with Densely Connected Residual Network for Multichannel Speech Recognition}, + year=2018, + booktitle={Proc. Interspeech 2018}, + pages={1783--1787}, + doi={10.21437/Interspeech.2018-1089}, + url={http://dx.doi.org/10.21437/Interspeech.2018-1089} +} + +@article{Kurata2017LanguageMW, + title={Language modeling with highway LSTM}, + author={Gakuto Kurata and Bhuvana Ramabhadran and George Saon and Abhinav Sethy}, + journal={2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)}, + year={2017}, + pages={244-251} +} + +@inproceedings{Saon2017, + author={George Saon and Gakuto Kurata and Tom Sercu and Kartik Audhkhasi and Samuel Thomas and Dimitrios Dimitriadis and Xiaodong Cui and Bhuvana Ramabhadran and Michael Picheny and Lynn-Li Lim and Bergul Roomi and Phil Hall}, + title={English Conversational Telephone Speech Recognition by Humans and Machines}, + year=2017, + booktitle={Proc. 
Interspeech 2017}, + pages={132--136}, + doi={10.21437/Interspeech.2017-405}, + url={http://dx.doi.org/10.21437/Interspeech.2017-405} +} + +@inproceedings{Povey+2016, +author={Daniel Povey and Vijayaditya Peddinti and Daniel Galvez and Pegah Ghahremani and Vimal Manohar and Xingyu Na and Yiming Wang and Sanjeev Khudanpur}, +title={Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI}, +year=2016, +booktitle={Interspeech 2016}, +doi={10.21437/Interspeech.2016-595}, +url={http://dx.doi.org/10.21437/Interspeech.2016-595}, +pages={2751--2755} +} + +@article{Yang2018, + author = {Xuerui Yang and + Jiwei Li and + Xi Zhou}, + title = {A novel pyramidal-FSMN architecture with lattice-free {MMI} for speech + recognition}, + journal = {CoRR}, + volume = {abs/1810.11352}, + year = {2018}, + url = {http://arxiv.org/abs/1810.11352}, + archivePrefix = {arXiv}, + eprint = {1810.11352}, + timestamp = {Wed, 31 Oct 2018 14:24:29 +0100}, + biburl = {https://dblp.org/rec/bib/journals/corr/abs-1810-11352}, + bibsource = {dblp computer science bibliography, https://dblp.org} +} + +@article{liptchinsky2017based, + title={Letter-Based Speech Recognition with Gated ConvNets}, + author={Liptchinsky, Vitaliy and Synnaeve, Gabriel and Collobert, Ronan}, + journal={arXiv preprint arXiv:1712.09444}, + year={2017} +} + +@inproceedings{Weng2018, + author={Chao Weng and Jia Cui and Guangsen Wang and Jun Wang and Chengzhu Yu and Dan Su and Dong Yu}, + title={Improving Attention Based Sequence-to-Sequence Models for End-to-End English Conversational Speech Recognition}, + year=2018, + booktitle={Proc. Interspeech 2018}, + pages={761--765}, + doi={10.21437/Interspeech.2018-1030}, + url={http://dx.doi.org/10.21437/Interspeech.2018-1030} +} + +@INPROCEEDINGS{Battenberg2017, +author={E. {Battenberg} and J. {Chen} and R. {Child} and A. {Coates} and Y. G. Y. {Li} and H. {Liu} and S. {Satheesh} and A. {Sriram} and Z. {Zhu}}, +booktitle={2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)}, +title={Exploring neural transducers for end-to-end speech recognition}, +year={2017}, +volume={}, +number={}, +pages={206-213}, +keywords={recurrent neural nets;speech recognition;Hub500 benchmark;CTC models;speech recognition pipeline;RNN-Transducer models;language model;Seq2Seq models;end-to-end speech recognition;neural transducers;Decoding;Hidden Markov models;Transducers;Task analysis;Speech;Mathematical model;Neural networks}, +doi={10.1109/ASRU.2017.8268937}, +ISSN={}, +month={Dec}, +} + +@inproceedings{ +loshchilov2018, +title={Decoupled Weight Decay Regularization}, +author={Ilya Loshchilov and Frank Hutter}, +booktitle={International Conference on Learning Representations}, +year={2019}, +url={https://openreview.net/forum?id=Bkg6RiCqY7}, +} + +@article{zhang2017ndadam, + author = {Zijun Zhang and Lin Ma and Zongpeng Li and Chuan Wu}, + title = {Normalized Direction-preserving Adam}, + journal = {arXiv e-prints arXiv:1709.04546}, + year = {2017}, +} + +@article{park2019, + author = {{Park}, Daniel S. and {Chan}, William and {Zhang}, Yu and + {Chiu}, Chung-Cheng and {Zoph}, Barret and {Cubuk}, Ekin D. 
and + {Le}, Quoc V.}, + title = "{SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition}", + journal = {arXiv e-prints}, + year = "2019", + eid = {arXiv:1904.08779}, + eprint = {1904.08779}, +} + +@article{novograd2019, + author = {{Ginsburg}, Boris and {Castonguay}, Patrice and {Hrinchuk}, Oleksii and + {Kuchaiev}, Oleksii and {Lavrukhin}, Vitaly and {Leary}, Ryan and + {Li}, Jason and {Nguyen}, Huyen and {Cohen}, Jonathan M.}, + title = "{Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks}", + journal = {arXiv e-prints}, + year = "2019", + eid = {arXiv:1905.11286}, + eprint = {1905.11286}, +} + +@article{kriman2019quartznet, + title={Quartznet: {Deep} automatic speech recognition with 1d time-channel separable convolutions}, + author={Kriman, Samuel and Beliaev, Stanislav and Ginsburg, Boris and Huang, Jocelyn and Kuchaiev, Oleksii and Lavrukhin, Vitaly and Leary, Ryan and Li, Jason and Zhang, Yang}, + journal={arXiv preprint arXiv:1910.10261}, + year={2019} +} + +@misc{itu1988g711, + title={{ITU-T} {G.711} - {Pulse} code modulation ({PCM}) of voice frequencies}, + author={ITU-T Geneva Switzerland}, + year={1988}, +} + +@article{han2020contextnet, + title={ContextNet: Improving convolutional neural networks for automatic speech recognition with global context}, + author={Han, Wei and Zhang, Zhengdong and Zhang, Yu and Yu, Jiahui and Chiu, Chung-Cheng and Qin, James and Gulati, Anmol and Pang, Ruoming and Wu, Yonghui}, + journal={arXiv:2005.03191}, + year={2020} +} + +@inproceedings{hu2018squeeze, + title={Squeeze-and-excitation networks}, + author={Hu, Jie and Shen, Li and Sun, Gang}, + booktitle={ICVPR}, + year={2018} +} + +@article{koluguri2020speakernet, + title={SpeakerNet: 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification}, + author={Koluguri, Nithin Rao and Li, Jason and Lavrukhin, Vitaly and Ginsburg, Boris}, + journal={arXiv preprint arXiv:2010.12653}, + year={2020} +} + +@article{gulati2020conformer, + title={Conformer: Convolution-augmented transformer for speech recognition}, + author={Gulati, Anmol and Qin, James and Chiu, Chung-Cheng and Parmar, Niki and Zhang, Yu and Yu, Jiahui and Han, Wei and Wang, Shibo and Zhang, Zhengdong and Wu, Yonghui and others}, + journal={arXiv preprint arXiv:2005.08100}, + year={2020} +} + +@article{koluguri2021titanet, + title={TitaNet: Neural Model for speaker representation with 1D Depth-wise separable convolutions and global context}, + author={Koluguri, Nithin Rao and Park, Taejin and Ginsburg, Boris}, + journal={arXiv preprint arXiv:2110.04410}, + year={2021} +} + +@article{Dawalatabad_2021, + title={ECAPA-TDNN Embeddings for Speaker Diarization}, + url={http://dx.doi.org/10.21437/Interspeech.2021-941}, + DOI={10.21437/interspeech.2021-941}, + journal={Interspeech 2021}, + publisher={ISCA}, + author={Dawalatabad, Nauman and Ravanelli, Mirco and Grondin, François and Thienpondt, Jenthe and Desplanques, Brecht and Na, Hwidong}, + year={2021}, + month={Aug} +} + +@article{park2022multi, + title = {Multi-scale Speaker Diarization with Dynamic Scale Weighting}, + author = {Park, Tae Jin and Koluguri, Nithin Rao and Balam, Jagadeesh and Ginsburg, Boris}, + journal = {https://arxiv.org/abs/2203.15974}, + year = {2022} +} + + +@inproceedings{he2019streaming, + title={Streaming end-to-end speech recognition for mobile devices}, + author={He, Yanzhang and Sainath, Tara N and Prabhavalkar, Rohit and McGraw, Ian and Alvarez, Raziel 
and Zhao, Ding and Rybach, David and Kannan, Anjuli and Wu, Yonghui and Pang, Ruoming and others}, + booktitle={ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, + pages={6381--6385}, + year={2019}, + organization={IEEE} +} + +@misc{wav2vec2, + doi = {10.48550/ARXIV.2006.11477}, + url = {https://arxiv.org/abs/2006.11477}, + author = {Baevski, Alexei and Zhou, Henry and Mohamed, Abdelrahman and Auli, Michael}, + title = {wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations}, + publisher = {arXiv}, + year = {2020}, + copyright = {arXiv.org perpetual, non-exclusive license} +} + +@misc{w2v_bert, + doi = {10.48550/ARXIV.2108.06209}, + url = {https://arxiv.org/abs/2108.06209}, + author = {Chung, Yu-An and Zhang, Yu and Han, Wei and Chiu, Chung-Cheng and Qin, James and Pang, Ruoming and Wu, Yonghui}, + title = {W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training}, + publisher = {arXiv}, + year = {2021}, + copyright = {arXiv.org perpetual, non-exclusive license} +} + +@misc{ssl_inter, + doi = {10.48550/ARXIV.2112.08778}, + url = {https://arxiv.org/abs/2112.08778}, + author = {Wang, Chengyi and Wu, Yu and Chen, Sanyuan and Liu, Shujie and Li, Jinyu and Qian, Yao and Yang, Zhenglu}, + title = {Self-Supervised Learning for speech recognition with Intermediate layer supervision}, + publisher = {arXiv}, + year = {2021}, + copyright = {arXiv.org perpetual, non-exclusive license} +} + +@misc{kim2022squeezeformer, + doi = {10.48550/ARXIV.2206.00888}, + url = {https://arxiv.org/abs/2206.00888}, + author = {Kim, Sehoon and Gholami, Amir and Shaw, Albert and Lee, Nicholas and Mangalam, Karttikeya and Malik, Jitendra and Mahoney, Michael W. and Keutzer, Kurt}, + keywords = {Audio and Speech Processing (eess.AS), Computation and Language (cs.CL), Sound (cs.SD), FOS: Electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, FOS: Computer and information sciences, FOS: Computer and information sciences}, + title = {Squeezeformer: An Efficient Transformer for Automatic Speech Recognition}, + publisher = {arXiv}, + year = {2022}, + copyright = {arXiv.org perpetual, non-exclusive license} +} + +@misc{park2022multi, + doi = {10.48550/ARXIV.2203.15974}, + url = {https://arxiv.org/abs/2203.15974}, + author = {Park, Tae Jin and Koluguri, Nithin Rao and Balam, Jagadeesh and Ginsburg, Boris}, + keywords = {Audio and Speech Processing (eess.AS), Computation and Language (cs.CL), FOS: Electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, FOS: Computer and information sciences, FOS: Computer and information sciences}, + title = {Multi-scale Speaker Diarization with Dynamic Scale Weighting}, + publisher = {arXiv}, + year = {2022}, + copyright = {Creative Commons Attribution 4.0 International} +} diff --git a/docs/source/asr/asr_language_modeling.rst b/docs/source/asr/asr_language_modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..5692e25e0ace7186293bef13b736820f27405e9a --- /dev/null +++ b/docs/source/asr/asr_language_modeling.rst @@ -0,0 +1,395 @@ +##################### +ASR Language Modeling +##################### + +Language models have shown to help the accuracy of ASR models. 
NeMo supports the following two approaches to incorporate language models into the ASR models:
+
+* :ref:`ngram_modeling`
+* :ref:`neural_rescoring`
+
+It is possible to use both approaches on the same ASR model.
+
+
+.. _ngram_modeling:
+
+************************
+N-gram Language Modeling
+************************
+
+In this approach, an N-gram LM is trained on text data and then used in fusion with beam search decoding to find the
+best candidates. The beam search decoders in NeMo support language models trained with the KenLM library
+(`https://github.com/kpu/kenlm <https://github.com/kpu/kenlm>`__).
+The beam search decoders and the KenLM library are not installed by default in NeMo; you need to install them to be
+able to use beam search decoding with an N-gram LM.
+Please refer to `scripts/asr_language_modeling/ngram_lm/install_beamsearch_decoders.sh` for instructions on how to install them.
+
+NeMo supports both character-based and BPE-based models for N-gram LMs. An N-gram LM can be used with beam search
+decoders on top of the ASR models to produce more accurate candidates. The beam search decoder incorporates
+the scores produced by the N-gram LM into its score calculation as follows:
+
+.. code-block::
+
+    final_score = acoustic_score + beam_alpha*lm_score + beam_beta*seq_length
+
+where `acoustic_score` is the score predicted by the acoustic encoder and `lm_score` is the one estimated by the LM.
+The parameter `beam_alpha` specifies the amount of importance to place on the N-gram language model, and `beam_beta` is a
+penalty term that accounts for the sequence length in the scores. A larger alpha puts more importance on the LM and less
+on the acoustic model. Negative values of beta penalize longer sequences and make the decoder prefer shorter predictions,
+while positive values favor longer candidates.
+
+
+Train N-gram LM
+===============
+
+The script to train an N-gram language model with KenLM can be found at
+`scripts/asr_language_modeling/ngram_lm/train_kenlm.py `__.
+
+This script trains an N-gram language model with the KenLM library, which can then be used with the beam search decoders
+on top of the ASR models. The script supports both character-level and BPE-level encodings and models; the encoding type
+is detected automatically from the type of the model.
+
+
+You may train the N-gram model as follows:
+
+.. code-block::
+
+    python train_kenlm.py --nemo_model_file <path to the .nemo file of the ASR model> \
+                          --train_file <path to the training text or JSON manifest file> \
+                          --kenlm_model_file <path to store the binary KenLM model file> \
+                          --ngram_length <order of the N-gram model> \
+                          --preserve_arpa
+
+The training file specified by `--train_file` can be a plain-text file or a JSON manifest. If the file's extension is anything
+other than `.json`, the data is assumed to be plain text. For the plain-text format, each line should contain one
+sample. For the JSON manifest format, the file should contain one JSON-formatted sample per line, like this:
+
+.. code-block::
+
+    {"audio_filepath": "/data_path/file1.wav", "text": "The transcript of the audio file."}
+
+The script extracts only the `text` field from each line to create the training text file. After the N-gram model is trained,
+it is stored at the path specified by `--kenlm_model_file`.
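+
+For illustration, the snippet below sketches the kind of preprocessing `train_kenlm.py` performs on a JSON manifest:
+it collects the `text` field of every sample into a plain-text corpus that KenLM can consume. This is a minimal,
+self-contained sketch with hypothetical file names, not the NeMo script itself.
+
+.. code-block:: python
+
+    import json
+
+    def manifest_to_corpus(manifest_path: str, corpus_path: str) -> None:
+        """Write the `text` field of every manifest line into a plain-text corpus."""
+        with open(manifest_path, "r", encoding="utf-8") as manifest, \
+                open(corpus_path, "w", encoding="utf-8") as corpus:
+            for line in manifest:
+                line = line.strip()
+                if not line:
+                    continue
+                sample = json.loads(line)
+                corpus.write(sample["text"] + "\n")
+
+    # Hypothetical usage:
+    # manifest_to_corpus("train_manifest.json", "kenlm_training_corpus.txt")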
+ +The following is the list of the arguments for the training script: + ++------------------+----------+-------------+-------------------------------------------------------------------------------------------------+ +| **Argument** | **Type** | **Default** | **Description** | ++------------------+----------+-------------+-------------------------------------------------------------------------------------------------+ +| nemo_model_file | str | Required | The path of the `.nemo` file of the ASR model. It is needed to extract the tokenizer. | ++------------------+----------+-------------+-------------------------------------------------------------------------------------------------+ +| train_file | str | Required | Path to the training file, it can be a text file or JSON manifest. | ++------------------+----------+-------------+-------------------------------------------------------------------------------------------------+ +| kenlm_model_file | str | Required | The path to store the KenLM binary model file. | ++------------------+----------+-------------+-------------------------------------------------------------------------------------------------+ +| kenlm_bin_path | str | Required | The path to the bin folder of KenLM. It is a folder named `bin` under where KenLM is installed. | ++------------------+----------+-------------+-------------------------------------------------------------------------------------------------+ +| ngram_length** | int | Required | Specifies order of N-gram LM. | ++------------------+----------+-------------+-------------------------------------------------------------------------------------------------+ +| do_lower_case | bool | ``False`` | Whether to make the training text all lower case. | ++------------------+----------+-------------+-------------------------------------------------------------------------------------------------+ +| preserve_arpa | bool | ``False`` | Whether to preserve the intermediate ARPA file after construction of the BIN file. | ++------------------+----------+-------------+-------------------------------------------------------------------------------------------------+ + +** Note: Recommend to use 6 as the order of the N-gram model for BPE-based models. Higher orders may need the re-compilation of KenLM to support it. + +Evaluate by Beam Search Decoding and N-gram LM +============================================== + +NeMo's beam search decoders are capable of using the KenLM's N-gram models to find the best candidates. +The script to evaluate an ASR model with beam search decoding and N-gram models can be found at +`scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py `__. + +This script has a large number of possible argument overrides, therefore it is advised to use ``python eval_beamsearch_ngram.py --help`` to see the full list of arguments. + +You may evaluate an ASR model as the following: + +.. code-block:: + + python eval_beamsearch_ngram.py nemo_model_file= \ + input_manifest= \ + beam_width=[] \ + beam_alpha=[] \ + beam_beta=[] \ + preds_output_folder= \ + probs_cache_file=null \ + decoding_mode=beamsearch_ngram \ + decoding_strategy="" + +It can evaluate a model in the three following modes by setting the argument `--decoding_mode`: + +* greedy: Just greedy decoding is done, and no beam search decoding is performed. +* beamsearch: The beam search decoding is done but without using the N-gram language model, final results would be equivalent to setting the weight of LM (beam_beta) to zero. 
+* beamsearch_ngram: The beam search decoding is done with N-gram LM. + +The `beamsearch` mode would evaluate by beam search decoding without any language model. +It would report the performances in terms of Word Error Rate (WER) and Character Error Rate (CER). Moreover, +the WER/CER of the model when the best candidate is selected among the candidates is also reported as the best WER/CER. +It can be an indicator of how good the predicted candidates are. + +The script would initially load the ASR model and predict the outputs of the model's encoder as log probabilities. +This part would be computed in batches on a device selected by `--device`, which can be CPU (`--device=cpu`) or a +single GPU (`--device=cuda:0`). The batch size of this part can get specified by `--acoustic_batch_size`. You may use +the largest batch size feasible to speed up the step of calculating the log probabilities. You may also use `--use_amp` +to speed up the calculation of log probabilities and make it possible to use larger sizes for `--acoustic_batch_size`. +Currently multi-GPU is not supported for calculating the log probabilities, but using `--probs_cache_file` can help. +It stores the log probabilities produced from the model's encoder into a pickle file so that next time the first step +can get skipped. + +The following is the list of the important arguments for the evaluation script: + ++---------------------+----------+------------------+-------------------------------------------------------------------------+ +| **Argument** | **Type** | **Default** | **Description** | ++---------------------+----------+------------------+-------------------------------------------------------------------------+ +| nemo_model_file | str | Required | The path of the `.nemo` file of the ASR model to extract the tokenizer. | ++---------------------+----------+------------------+-------------------------------------------------------------------------+ +| input_manifest | str | Required | Path to the training file, it can be a text file or JSON manifest. | ++---------------------+----------+------------------+-------------------------------------------------------------------------+ +| kenlm_model_file | str | Required | The path to store the KenLM binary model file. | ++---------------------+----------+------------------+-------------------------------------------------------------------------+ +| preds_output_folder | str | None | The path to an optional folder to store the predictions. | ++---------------------+----------+------------------+-------------------------------------------------------------------------+ +| probs_cache_file | str | None | The cache file for storing the outputs of the model. | ++---------------------+----------+------------------+-------------------------------------------------------------------------+ +| acoustic_batch_size | int | 16 | The batch size to calculate log probabilities. | ++---------------------+----------+------------------+-------------------------------------------------------------------------+ +| use_amp | bool | False | Whether to use AMP if available to calculate log probabilities. | ++---------------------+----------+------------------+-------------------------------------------------------------------------+ +| device | str | cuda | The device to load the model onto to calculate log probabilities. | +| | | | It can `cpu`, `cuda`, `cuda:0`, `cuda:1`, ... 
| ++---------------------+----------+------------------+-------------------------------------------------------------------------+ +| decoding_mode | str | beamsearch_ngram | The decoding scheme to be used for evaluation. | ++---------------------+----------+------------------+-------------------------------------------------------------------------+ +| beam_width | float | Required | List of the width or list of the widths of the beam search decoding. | ++---------------------+----------+------------------+-------------------------------------------------------------------------+ +| beam_alpha | float | Required | List of the alpha parameter for the beam search decoding. | ++---------------------+----------+------------------+-------------------------------------------------------------------------+ +| beam_beta | float | Required | List of the beta parameter for the beam search decoding. | ++---------------------+----------+------------------+-------------------------------------------------------------------------+ +| beam_batch_size | int | 128 | The batch size to be used for beam search decoding. | +| | | | Larger batch size can be a little faster, but uses larger memory. | ++---------------------+----------+------------------+-------------------------------------------------------------------------+ +| decoding_strategy | str | beam | String argument for type of decoding strategy for the model. | ++---------------------+----------+------------------+-------------------------------------------------------------------------+ +| decoding | Dict | BeamCTC | Subdict of beam search configs. Values found via | +| | Config | InferConfig | python eval_beamsearch_ngram.py --help | ++---------------------+----------+------------------+-------------------------------------------------------------------------+ + +Width of the beam search (`--beam_width`) specifies the number of top candidates/predictions the beam search decoder +would search for. Larger beams result in more accurate but slower predictions. + +.. note:: + + The ``eval_beamsearch_ngram.py`` script contains the entire subconfig used for CTC Beam Decoding. + Therefore it is possible to forward arguments for various beam search libraries such as ``flashlight`` + and ``pyctcdecode`` via the ``decoding`` subconfig. + +There is also a tutorial to learn more about evaluating the ASR models with N-gram LM here: +`Offline ASR Inference with Beam Search and External Language Model Rescoring `_ + +Beam Search Engines +------------------- + +NeMo ASR CTC supports multiple beam search engines for decoding. The default engine is ``beam`` which is the OpenSeq2Seq +decoding library. + +OpenSeq2Seq (``beam``) +~~~~~~~~~~~~~~~~~~~~~~ + +CPU-based beam search engine that is quite efficient and supports char and subword models. It requires a character/subword +KenLM model to be provided. + +The config for this decoding library is described above. + +Flashlight (``flashlight``) +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Flashlight is a C++ library for ASR decoding provided at `https://github.com/flashlight/flashlight `_. It is a CPU and CUDA-based beam search engine that is quite efficient and supports +char and subword models. It an ARPA KenLM file. + +It supports several advanced features such as lexicon based / lexicon free decoding, beam pruning threshold, and more. + +.. 
code-block:: python
+
+    @dataclass
+    class FlashlightConfig:
+        lexicon_path: Optional[str] = None
+        beam_size_token: int = 16
+        beam_threshold: float = 20.0
+        unk_weight: float = -math.inf
+        sil_weight: float = 0.0
+        unit_lm: bool = False
+
+.. code-block::
+
+    # Lexicon-based decoding
+    python eval_beamsearch_ngram.py ... \
+           decoding_strategy="flashlight" \
+           decoding.beam.flashlight_cfg.lexicon_path='/path/to/lexicon.lexicon' \
+           decoding.beam.flashlight_cfg.beam_size_token=32 \
+           decoding.beam.flashlight_cfg.beam_threshold=25.0
+
+    # Lexicon-free decoding
+    python eval_beamsearch_ngram.py ... \
+           decoding_strategy="flashlight" \
+           decoding.beam.flashlight_cfg.beam_size_token=32 \
+           decoding.beam.flashlight_cfg.beam_threshold=25.0
+
+
+PyCTCDecode (``pyctcdecode``)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+PyCTCDecode is a Python library for ASR decoding provided at `https://github.com/kensho-technologies/pyctcdecode <https://github.com/kensho-technologies/pyctcdecode>`_. It is a CPU-based beam search engine that is reasonably efficient for a pure Python library, and it supports char and subword models. It requires a character/subword KenLM model (ARPA or binary) to be provided.
+
+It has advanced features, such as word boosting, which can be useful for transcript customization.
+
+.. code-block:: python
+
+    @dataclass
+    class PyCTCDecodeConfig:
+        beam_prune_logp: float = -10.0
+        token_min_logp: float = -5.0
+        prune_history: bool = False
+        hotwords: Optional[List[str]] = None
+        hotword_weight: float = 10.0
+
+.. code-block::
+
+    # PyCTCDecoding
+    python eval_beamsearch_ngram.py ... \
+           decoding_strategy="pyctcdecode" \
+           decoding.beam.pyctcdecode_cfg.beam_prune_logp=-10. \
+           decoding.beam.pyctcdecode_cfg.token_min_logp=-5. \
+           decoding.beam.pyctcdecode_cfg.hotwords=[] \
+           decoding.beam.pyctcdecode_cfg.hotword_weight=10.0
+
+
+Hyperparameter Grid Search
+--------------------------
+
+Beam search decoding with an N-gram LM has three main hyperparameters: `beam_width`, `beam_alpha`, and `beam_beta`.
+The accuracy of the model is dependent on the values of these parameters, especially `beam_alpha` and `beam_beta`.
+You may specify a single value or a list of values for each of these parameters to perform a grid search: the script
+performs beam search decoding on all combinations of these three hyperparameters.
+For instance, the following set of parameters would result in 2*1*2=4 beam search decodings:
+
+.. code-block::
+
+    python eval_beamsearch_ngram.py ... \
+                        beam_width=[64,128] \
+                        beam_alpha=[1.0] \
+                        beam_beta=[1.0,0.5]
+
+
+.. _neural_rescoring:
+
+****************
+Neural Rescoring
+****************
+
+In this approach, a neural network is used to score candidates. A candidate is a text transcript predicted by the decoder of the ASR model.
+The top K candidates produced by beam search decoding (with a beam width of K) are given to a neural language model, which assigns
+a score to each candidate. This score is usually combined with the scores from the beam search decoding to produce the final scores and rankings.
+
+Train Neural Rescorer
+=====================
+
+An example script to train such a language model with Transformer can be found at `examples/nlp/language_modeling/transformer_lm.py `__.
+It trains a ``TransformerLMModel`` which can be used as a neural rescorer for an ASR system.
Full documentation on language model training is available at:
+
+:doc:`../nlp/language_modeling`
+
+You may also use a pretrained language model from the HuggingFace library, such as Transformer-XL or GPT, instead of training your own model.
+Models like BERT and RoBERTa are not supported by this script because they are trained as masked language models and are neither efficient nor effective for scoring sentences out of the box.
+
+
+Evaluation
+==========
+
+Given a trained ``TransformerLMModel`` `.nemo` file or a pretrained HF model, the script available at
+`scripts/asr_language_modeling/neural_rescorer/eval_neural_rescorer.py `__
+can be used to re-score beams obtained with an ASR model. To use this script, you need a `.tsv` file containing the candidates produced
+by the acoustic model and the beam search decoding. The candidates can be the result of just the beam
+search decoding or the result of fusion with an N-gram LM. You may generate this file by specifying `preds_output_folder` for
+`scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py `__.
+
+The neural rescorer rescores the beams/candidates using two parameters, `rescorer_alpha` and `rescorer_beta`, as follows:
+
+.. code-block::
+
+    final_score = beam_search_score + rescorer_alpha*neural_rescorer_score + rescorer_beta*seq_length
+
+The parameter `rescorer_alpha` specifies the amount of importance to place on the neural rescorer model, and `rescorer_beta` is
+a penalty term that accounts for the sequence length in the scores. They have effects similar to the parameters
+`beam_alpha` and `beam_beta` of the beam search decoder and N-gram LM.
+
+Follow these steps to evaluate a neural LM:
+
+#. Obtain a `.tsv` file with beams and their corresponding scores. The scores can come from a regular beam search decoder or
+   from fusion with N-gram LM scores. For a given beam size `beam_size` and a number of examples
+   for evaluation `num_eval_examples`, it should contain (`num_eval_examples` x `beam_size`) lines of
+   the form `beam_candidate_text \t score`. This file can be generated by `scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram.py `__
+
+#. Rescore the candidates with `scripts/asr_language_modeling/neural_rescorer/eval_neural_rescorer.py `__.
+
+.. code-block::
+
+    python eval_neural_rescorer.py
+        --lm_model=[path to .nemo file of the LM or the name of a HF pretrained model]
+        --beams_file=[path to beams .tsv file]
+        --beam_size=[size of the beams]
+        --eval_manifest=[path to eval manifest .json file]
+        --batch_size=[batch size used for inference on the LM model]
+        --alpha=[the value for the parameter rescorer_alpha]
+        --beta=[the value for the parameter rescorer_beta]
+        --scores_output_file=[the optional path to store the rescored candidates]
+
+The candidates, along with their new scores, are stored in the file specified by `--scores_output_file`.
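+
+To make the rescoring step concrete, the following is a small, self-contained sketch of how candidates could be re-ranked
+with the formula above. The candidate texts, scores, and helper name are made up for illustration; this is not the NeMo
+implementation, which is provided by ``eval_neural_rescorer.py``.
+
+.. code-block:: python
+
+    def rerank(candidates, rescorer_alpha, rescorer_beta):
+        """Sort (text, beam_search_score, neural_rescorer_score) tuples by the combined score."""
+        rescored = []
+        for text, beam_search_score, neural_rescorer_score in candidates:
+            seq_length = len(text.split())
+            final_score = (beam_search_score
+                           + rescorer_alpha * neural_rescorer_score
+                           + rescorer_beta * seq_length)
+            rescored.append((final_score, text))
+        return sorted(rescored, reverse=True)
+
+    # Made-up candidates and scores, purely for illustration:
+    best_score, best_text = rerank(
+        [("the cat sat", -4.1, -12.0), ("the cat sad", -3.9, -20.5)],
+        rescorer_alpha=0.5, rescorer_beta=0.1,
+    )[0]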
+ +The following is the list of the arguments for the evaluation script: + ++---------------------+--------+------------------+-------------------------------------------------------------------------+ +| **Argument** |**Type**| **Default** | **Description** | ++---------------------+--------+------------------+-------------------------------------------------------------------------+ +| lm_model | str | Required | The path of the '.nemo' file of an ASR model, or the name of a | +| | | | HuggingFace pretrained model like 'transfo-xl-wt103' or 'gpt2' | ++---------------------+--------+------------------+-------------------------------------------------------------------------+ +| eval_manifest | str | Required | Path to the evaluation manifest file (.json manifest file) | ++---------------------+--------+------------------+-------------------------------------------------------------------------+ +| beams_file | str | Required | path to beams file (.tsv) containing the candidates and their scores | ++---------------------+--------+------------------+-------------------------------------------------------------------------+ +| beam_size | int | Required | The width of the beams (number of candidates) generated by the decoder | ++---------------------+--------+------------------+-------------------------------------------------------------------------+ +| alpha | float | None | The value for parameter rescorer_alpha | +| | | | Not passing value would enable linear search for rescorer_alpha | ++---------------------+--------+------------------+-------------------------------------------------------------------------+ +| beta | float | None | The value for parameter rescorer_beta | +| | | | Not passing value would enable linear search for rescorer_beta | ++---------------------+--------+------------------+-------------------------------------------------------------------------+ +| batch_size | int | 16 | The batch size used to calculate the scores | ++---------------------+--------+------------------+-------------------------------------------------------------------------+ +| max_seq_length | int | 512 | Maximum sequence length (in tokens) for the input | ++---------------------+--------+------------------+-------------------------------------------------------------------------+ +| scores_output_file | str | None | The optional file to store the rescored beams | ++---------------------+--------+------------------+-------------------------------------------------------------------------+ +| use_amp | bool | ``False`` | Whether to use AMP if available calculate the scores | ++---------------------+--------+------------------+-------------------------------------------------------------------------+ +| device | str | cuda | The device to load LM model onto to calculate the scores | +| | | | It can be 'cpu', 'cuda', 'cuda:0', 'cuda:1', ... | ++---------------------+--------+------------------+-------------------------------------------------------------------------+ + + +Hyperparameter Linear Search +---------------------------- + +This script also supports linear search for parameters `alpha` and `beta`. If any of the two is not +provided, a linear search is performed to find the best value for that parameter. When linear search is used, initially +`beta` is set to zero and the best value for `alpha` is found, then `alpha` is fixed with +that value and another linear search is done to find the best value for `beta`. 
+If any of the of these two parameters is already specified, then search for that one is skipped. After each search for a +parameter, the plot of WER% for different values of the parameter is also shown. + +It is recommended to first use the linear search for both parameters on a validation set by not providing any values for `--alpha` and `--beta`. +Then check the WER curves and decide on the best values for each parameter. Finally, evaluate the best values on the test set. diff --git a/docs/source/asr/configs.rst b/docs/source/asr/configs.rst new file mode 100644 index 0000000000000000000000000000000000000000..80ec488fe0c3a787eb7d2e7782f27cbbece08da5 --- /dev/null +++ b/docs/source/asr/configs.rst @@ -0,0 +1,929 @@ +NeMo ASR Configuration Files +============================ + +This section describes the NeMo configuration file setup that is specific to models in the ASR collection. For general information +about how to set up and run experiments that is common to all NeMo models (e.g. Experiment Manager and PyTorch Lightning trainer +parameters), see the :doc:`../core/core` section. + +The model section of the NeMo ASR configuration files generally requires information about the dataset(s) being used, the preprocessor +for audio files, parameters for any augmentation being performed, as well as the model architecture specification. The sections on +this page cover each of these in more detail. + +Example configuration files for all of the NeMo ASR scripts can be found in the +`config directory of the examples `_. + + +Dataset Configuration +--------------------- + +Training, validation, and test parameters are specified using the ``train_ds``, ``validation_ds``, and +``test_ds`` sections in the configuration file, respectively. Depending on the task, there may be arguments specifying the sample rate +of the audio files, the vocabulary of the dataset (for character prediction), whether or not to shuffle the dataset, and so on. You may +also decide to leave fields such as the ``manifest_filepath`` blank, to be specified via the command-line at runtime. + +Any initialization parameter that is accepted for the Dataset class used in the experiment can be set in the config file. +Refer to the `Datasets <./api.html#Datasets>`__ section of the API for a list of Datasets and their respective parameters. + +An example ASR train and validation configuration should look similar to the following: + +.. code-block:: yaml + + # Specified at the beginning of the config file + labels: &labels [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", + "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"] + + model: + train_ds: + manifest_filepath: ??? + sample_rate: 16000 + labels: *labels # Uses the labels above + batch_size: 32 + trim_silence: True + max_duration: 16.7 + shuffle: True + num_workers: 8 + pin_memory: true + # tarred datasets + is_tarred: false # If set to true, uses the tarred version of the Dataset + tarred_audio_filepaths: null # Not used if is_tarred is false + shuffle_n: 2048 # Not used if is_tarred is false + # bucketing params + bucketing_strategy: "synced_randomized" + bucketing_batch_size: null + bucketing_weights: null + + validation_ds: + manifest_filepath: ??? + sample_rate: 16000 + labels: *labels # Uses the labels above + batch_size: 32 + shuffle: False # No need to shuffle the validation data + num_workers: 8 + pin_memory: true + +By default, dataloaders are set up when the model is instantiated. 
However, dataloader setup can be deferred to +model's `setup()` method by setting ``defer_setup`` in the configuration. + +For example, training data setup can be deferred as follows: + +.. code-block:: yaml + + model: + train_ds: + # Configure training data as usual + ... + # Defer train dataloader setup from `__init__` to `setup` + defer_setup: true + + +Preprocessor Configuration +-------------------------- + +If you are loading audio files for your experiment, you will likely want to use a preprocessor to convert from the +raw audio signal to features (e.g. mel-spectrogram or MFCC). The ``preprocessor`` section of the config specifies the audio +preprocessor to be used via the ``_target_`` field, as well as any initialization parameters for that preprocessor. + +An example of specifying a preprocessor is as follows: + +.. code-block:: yaml + + model: + ... + preprocessor: + # _target_ is the audio preprocessor module you want to use + _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor + normalize: "per_feature" + window_size: 0.02 + ... + # Other parameters for the preprocessor + +Refer to the `Audio Preprocessors <./api.html#Audio Preprocessors>`__ API section for the preprocessor options, expected arguments, +and defaults. + +Augmentation Configurations +--------------------------- + +There are a few on-the-fly spectrogram augmentation options for NeMo ASR, which can be specified by the +configuration file using a ``spec_augment`` section. + +For example, there are options for `Cutout `_ and +`SpecAugment `_ available via the ``SpectrogramAugmentation`` module. + +The following example sets up both ``Cutout`` (via the ``rect_*`` parameters) and ``SpecAugment`` (via the ``freq_*`` +and ``time_*`` parameters). + +.. code-block:: yaml + + model: + ... + spec_augment: + _target_: nemo.collections.asr.modules.SpectrogramAugmentation + # Cutout parameters + rect_masks: 5 # Number of rectangles to cut from any given spectrogram + rect_freq: 50 # Max cut of size 50 along the frequency dimension + rect_time: 120 # Max cut of size 120 along the time dimension + # SpecAugment parameters + freq_masks: 2 # Cut two frequency bands + freq_width: 15 # ... of width 15 at maximum + time_masks: 5 # Cut out 10 time bands + time_width: 25 # ... of width 25 at maximum + +You can use any combination of ``Cutout``, frequency/time ``SpecAugment``, or neither of them. + +With NeMo ASR, you can also add augmentation pipelines that can be used to simulate various kinds of noise +added to audio in the channel. Augmentors in a pipeline are applied on the audio data read in the data layer. Online +augmentors can be specified in the config file using an ``augmentor`` section in ``train_ds``. The following example +adds an augmentation pipeline that first adds white noise to an audio sample with a probability of 0.5 and at a level +randomly picked between -50 dB and -10 dB and then passes the resultant samples through a room impulse response randomly +picked from the manifest file provided for ``impulse`` augmentation in the config file. + +.. code-block:: yaml + + model: + ... + train_ds: + ... + augmentor: + white_noise: + prob: 0.5 + min_level: -50 + max_level: -10 + impulse: + prob: 0.3 + manifest_path: /path/to/impulse_manifest.json + +Refer to the `Audio Augmentors <./api.html#Audio Augmentors>`__ API section for more details. 
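+
+For intuition, the following is a toy, framework-free sketch of what the time and frequency masking configured in the
+``spec_augment`` section above does to a spectrogram. It is only an illustration of the idea; it is not NeMo's
+``SpectrogramAugmentation`` implementation, and the array shapes are made up.
+
+.. code-block:: python
+
+    import numpy as np
+
+    def mask_spectrogram(spec, freq_masks=2, freq_width=15, time_masks=5, time_width=25):
+        """Zero out random frequency bands and time bands of a (freq, time) spectrogram."""
+        rng = np.random.default_rng()
+        out = spec.copy()
+        n_freq, n_time = out.shape
+        for _ in range(freq_masks):
+            width = int(rng.integers(0, freq_width + 1))
+            start = int(rng.integers(0, max(1, n_freq - width)))
+            out[start:start + width, :] = 0.0
+        for _ in range(time_masks):
+            width = int(rng.integers(0, time_width + 1))
+            start = int(rng.integers(0, max(1, n_time - width)))
+            out[:, start:start + width] = 0.0
+        return out
+
+    # 64 mel bands x 400 time frames, filled with random values for the example
+    augmented = mask_spectrogram(np.random.rand(64, 400))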
+ +Tokenizer Configurations +------------------------ + +Some models utilize sub-word encoding via an external tokenizer instead of explicitly defining their vocabulary. + +For such models, a ``tokenizer`` section is added to the model config. ASR models currently support two types of +custom tokenizers: + +- Google Sentencepiece tokenizers (tokenizer type of ``bpe`` in the config) +- HuggingFace WordPiece tokenizers (tokenizer type of ``wpe`` in the config) +- Aggregate tokenizers ((tokenizer type of ``agg`` in the config), see below) + +In order to build custom tokenizers, refer to the ``ASR_with_Subword_Tokenization`` notebook available in the +ASR tutorials directory. + +The following example sets up a ``SentencePiece Tokenizer`` at a path specified by the user: + +.. code-block:: yaml + + model: + ... + tokenizer: + dir: "" + type: "bpe" # can be "bpe" or "wpe" + +The Aggregate (``agg``) tokenizer feature makes it possible to combine tokenizers in order to train multilingual +models. The config file would look like this: + +.. code-block:: yaml + + model: + ... + tokenizer: + type: "agg" # aggregate tokenizer + langs: + en: + dir: "" + type: "bpe" # can be "bpe" or "wpe" + es: + dir: "" + type: "bpe" # can be "bpe" or "wpe" + +In the above config file, each language is associated with its own pre-trained tokenizer, which gets assigned +a token id range in the order the tokenizers are listed. To train a multilingual model, one needs to populate the +``lang`` field in the manifest file, allowing the routing of each sample to the correct tokenizer. At inference time, +the routing is done based on the inferred token id range. + +For models which utilize sub-word tokenization, we share the decoder module (``ConvASRDecoder``) with character tokenization models. +All parameters are shared, but for models which utilize sub-word encoding, there are minor differences when setting up the config. For +such models, the tokenizer is utilized to fill in the missing information when the model is constructed automatically. + +For example, a decoder config corresponding to a sub-word tokenization model should look similar to the following: + +.. code-block:: yaml + + model: + ... + decoder: + _target_: nemo.collections.asr.modules.ConvASRDecoder + feat_in: *enc_final + num_classes: -1 # filled with vocabulary size from tokenizer at runtime + vocabulary: [] # filled with vocabulary from tokenizer at runtime + + +Model Architecture Configurations +--------------------------------- + +Each configuration file should describe the model architecture being used for the experiment. Models in the NeMo ASR collection need +an ``encoder`` section and a ``decoder`` section, with the ``_target_`` field specifying the module to use for each. + +Here is the list of the parameters in the model section which are shared among most of the ASR models: + ++-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+ +| **Parameter** | **Datatype** | **Description** | **Supported Values** | ++=========================+==================+===============================================================================================================+=================================+ +| :code:`log_prediction` | bool | Whether a random sample should be printed in the output at each step, along with its predicted transcript. 
| | ++-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+ +| :code:`ctc_reduction` | string | Specifies the reduction type of CTC loss. Defaults to ``mean_batch`` which would take the average over the | :code:`none`, | +| | | batch after taking the average over the length of each sample. | :code:`mean_batch` | +| | | | :code:`mean`, :code:`sum` | ++-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+ + +The following sections go into more detail about the specific configurations of each model architecture. + +For more information about the ASR models, refer to the :doc:`Models <./models>` section. + +Jasper and QuartzNet +~~~~~~~~~~~~~~~~~~~~ + +The `Jasper <./models.html#Jasper>`__ and `QuartzNet <./models.html#QuartzNet>`__ models are very similar, and as such the components in their +configs are very similar as well. + +Both architectures use the ``ConvASREncoder`` for the ``encoder``, with parameters detailed in the table below. The encoder parameters +include details about the Jasper/QuartzNet ``[BxR]`` encoder architecture, including how many blocks to use (``B``), how many times +to repeat each sub-block (``R``), and the convolution parameters for each block. + +The number of blocks ``B`` is determined by the number of list elements under ``jasper`` minus the one prologue and two epilogue blocks. +The number of sub-blocks ``R`` is determined by setting the ``repeat`` parameter. + +To use QuartzNet (which uses more compact time-channel separable convolutions) instead of Jasper, add :code:`separable: true` to all +but the last block in the architecture. + +Change the parameter name ``jasper``. + ++-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+-------------------------------------+ +| **Parameter** | **Datatype** | **Description** | **Supported Values** | ++=========================+==================+===============================================================================================================+=====================================+ +| :code:`feat_in` | int | The number of input features. Should be equal to :code:`features` in the preprocessor parameters. | | ++-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+-------------------------------------+ +| :code:`activation` | string | Which activation function to use in the encoder. | :code:`hardtanh`, :code:`relu`, | +| | | | :code:`selu`, :code:`swish` | ++-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+-------------------------------------+ +| :code:`conv_mask` | bool | Whether to use masked convolutions in the encoder. Defaults to ``true``. | | ++-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+-------------------------------------+ +| :code:`jasper` | | A list of blocks that specifies your encoder architecture. 
Each entry in this list represents one block in | | +| | | the architecture and contains the parameters for that block, including convolution parameters, dropout, and | | +| | | the number of times the block is repeated. Refer to the `Jasper `_ and | | +| | | `QuartzNet `_ papers for details about specific model configurations. | | ++-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+-------------------------------------+ + +A QuartzNet 15x5 (fifteen blocks, each sub-block repeated five times) encoder configuration should look similar to the following example: + +.. code-block:: yaml + + # Specified at the beginning of the file for convenience + n_mels: &n_mels 64 # Used for both the preprocessor and encoder as number of input features + repeat: &repeat 5 # R=5 + dropout: &dropout 0.0 + separable: &separable true # Set to true for QN. Set to false for Jasper. + + model: + ... + encoder: + _target_: nemo.collections.asr.modules.ConvASREncoder + feat_in: *n_mels # Should match "features" in the preprocessor. + activation: relu + conv_mask: true + + jasper: # This field name should be "jasper" for both types of models. + + # Prologue block + - dilation: [1] + dropout: *dropout + filters: 256 + kernel: [33] + repeat: 1 # Prologue block is not repeated. + residual: false + separable: *separable + stride: [2] + + # Block 1 + - dilation: [1] + dropout: *dropout + filters: 256 + kernel: [33] + repeat: *repeat + residual: true + separable: *separable + stride: [1] + + ... # Entries for blocks 2~14 + + # Block 15 + - dilation: [1] + dropout: *dropout + filters: 512 + kernel: [75] + repeat: *repeat + residual: true + separable: *separable + stride: [1] + + # Two epilogue blocks + - dilation: [2] + dropout: *dropout + filters: 512 + kernel: [87] + repeat: 1 # Epilogue blocks are not repeated + residual: false + separable: *separable + stride: [1] + + - dilation: [1] + dropout: *dropout + filters: &enc_filters 1024 + kernel: [1] + repeat: 1 # Epilogue blocks are not repeated + residual: false + stride: [1] + +Both Jasper and QuartzNet use the ``ConvASRDecoder`` as the decoder. The decoder parameters are detailed in the following table. + ++-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+ +| **Parameter** | **Datatype** | **Description** | **Supported Values** | ++=========================+==================+===============================================================================================================+=================================+ +| :code:`feat_in` | int | The number of input features to the decoder. Should be equal to the number of filters in the last block of | | +| | | the encoder. | | ++-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+ +| :code:`vocabulary` | list | A list of the valid output characters for your model. For example, for an English dataset, this could be a | | +| | | list of all lowercase letters, space, and apostrophe. | | ++-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+ +| :code:`num_classes` | int | Number of output classes, i.e. 
the length of :code:`vocabulary`. | | ++-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+ + +For example, a decoder config corresponding to the encoder above should look similar to the following: + +.. code-block:: yaml + + model: + ... + decoder: + _target_: nemo.collections.asr.modules.ConvASRDecoder + feat_in: *enc_filters + vocabulary: *labels + num_classes: 28 # Length of the vocabulary list + +Citrinet +~~~~~~~~ + +The `Citrinet <./models.html#Citrinet>`__ and `QuartzNet <./models.html#QuartzNet>`__ models are very similar, and as such the +components in their configs are very similar as well. Citrinet utilizes Squeeze and Excitation, as well as sub-word tokenization, in +contrast to QuartzNet. Depending on the dataset, we utilize different tokenizers. For Librispeech, we utilize the HuggingFace WordPiece +tokenizer, and for all other datasets we utilize the Google Sentencepiece tokenizer - usually the ``unigram`` tokenizer type. + +Both architectures use the ``ConvASREncoder`` for the ``encoder``, with parameters detailed above. The encoder parameters include +details about the Citrinet-C encoder architecture, including how many filters are used per channel (``C``). The Citrinet-C +configuration is a shortform notation for Citrinet-21x5xC, such that ``B = 21`` and ``R = 5`` are the default and should generally +not be changed. + +To use Citrinet instead of QuartzNet, refer to the ``citrinet_512.yaml`` configuration found inside the ``examples/asr/conf/citrinet`` +directory. Citrinet is primarily comprised of the same :class:`~nemo.collections.asr.parts.submodules.jasper.JasperBlock` as ``Jasper`` or +``QuartzNet``. + +While the configs for Citrinet and QuartzNet are similar, we note the additional flags used for Citrinet below. Refer to the +``JasperBlock`` documentation for the meaning of these arguments. + ++---------------------------+------------------+-----------------------------------------------------------------------------------------------------------+-----------------------------------+ +| **Parameter** | **Datatype** | **Description** | **Supported Values** | ++===========================+==================+===========================================================================================================+===================================+ +| :code:`se` | bool | Whether to apply squeeze-and-excitation mechanism or not. | :code:`true` or :code:`false` | ++---------------------------+------------------+-----------------------------------------------------------------------------------------------------------+-----------------------------------+ +| :code:`se_context_size` | int | SE context size. -1 means global context. | :code:`-1` or :code:`+ve int` | ++---------------------------+------------------+-----------------------------------------------------------------------------------------------------------+-----------------------------------+ +| :code:`stride_last` | bool | Stride on the final repeated block or all repeated blocks. | :code:`true` or :code:`false` | ++---------------------------+------------------+-----------------------------------------------------------------------------------------------------------+-----------------------------------+ +| :code:`residual_mode` | str | Type of residual branch to construct. 
| :code:`"add"` or | +| | | Can be pointwise residual addition or pointwise strided residual attention | :code:`"stride_add"` | ++---------------------------+------------------+-----------------------------------------------------------------------------------------------------------+-----------------------------------+ + +A Citrinet-512 config should look similar to the following: + +.. code-block:: yaml + + model: + ... + # Specify some defaults across the entire model + model_defaults: + repeat: 5 + dropout: 0.1 + separable: true + se: true + se_context_size: -1 + ... + encoder: + _target_: nemo.collections.asr.modules.ConvASREncoder + feat_in: *n_mels # Should match "features" in the preprocessor. + activation: relu + conv_mask: true + + jasper: # This field name should be "jasper" for the JasperBlock (which constructs Citrinet). + + # Prologue block + - filters: 512 + repeat: 1 + kernel: [5] + stride: [1] + dilation: [1] + dropout: 0.0 + residual: false + separable: ${model.model_defaults.separable} + se: ${model.model_defaults.se} + se_context_size: ${model.model_defaults.se_context_size} + + # Block 1 + - filters: 512 + repeat: ${model.model_defaults.repeat} + kernel: [11] + stride: [2] + dilation: [1] + dropout: ${model.model_defaults.dropout} + residual: true + separable: ${model.model_defaults.separable} + se: ${model.model_defaults.se} + se_context_size: ${model.model_defaults.se_context_size} + stride_last: true + residual_mode: "stride_add" + + ... # Entries for blocks 2~21 + + # Block 22 + - filters: 512 + repeat: ${model.model_defaults.repeat} + kernel: [39] + stride: [1] + dilation: [1] + dropout: ${model.model_defaults.dropout} + residual: true + separable: ${model.model_defaults.separable} + se: ${model.model_defaults.se} + se_context_size: ${model.model_defaults.se_context_size} + + # Epilogue block + + - filters: &enc_final 640 + repeat: 1 + kernel: [41] + stride: [1] + dilation: [1] + dropout: 0.0 + residual: false + separable: ${model.model_defaults.separable} + se: ${model.model_defaults.se} + se_context_size: ${model.model_defaults.se_context_size} + +As mentioned above, Citrinet uses the ``ConvASRDecoder`` as the decoder layer similar to QuartzNet. Only the configuration must be +changed slightly as Citrinet utilizes sub-word tokenization. + +.. note:: + The following information is relevant to any of the above models that implements its encoder as an :class:`~nemo.collections.asr.modules.conv_asr.ConvASREncoder`, and utilizes the ``SqueezeExcite`` mechanism. + +The ``SqueezeExcite`` block within a :class:`~nemo.collections.asr.modules.conv_asr.ConvASREncoder` network can be modified to utilize a different context window after the model has been instantiated (even after the model has been trained) so as to evaluate the model with limited context. This can be achieved using the :meth:`~nemo.collections.asr.parts.mixins.mixins.ASRModuleMixin.change_conv_asr_se_context_window` + +.. code-block:: python + + # Here, model can be any model that has a `ConvASREncoder` as its encoder, and utilized `SqueezeExcite` blocks + # `context_window` : It is an integer representing the number of timeframes (each corresponding to some window stride). + # `update_config` : Bool flag which determines whether the config of the model should be updated to reflect the new context window. 
+
+    # Here, we specify that 128 timeframes of 0.01s stride should be the context window
+    # This is equivalent to 128 * 0.01s context window for `SqueezeExcite`
+    model.change_conv_asr_se_context_window(context_window=128, update_config=True)
+
+Conformer-CTC
+~~~~~~~~~~~~~
+
+The config files for the Conformer-CTC model, for character-based encoding and sub-word encoding, can be found at
+``/examples/asr/conf/conformer/conformer_ctc_char.yaml`` and ``/examples/asr/conf/conformer/conformer_ctc_bpe.yaml``
+respectively. The configs of `Conformer-CTC <./models.html#Conformer-CTC>`__ include the following sections:
+
+* datasets (``train_ds``, ``validation_ds``, and ``test_ds``)
+* optimizer (``optim``)
+* augmentation (``spec_augment``)
+* ``decoder``
+* ``trainer``
+* ``exp_manager``
+
+These sections are configured similarly to other ASR models like `QuartzNet <./models.html#QuartzNet>`__. There should be a tokenizer section where you can
+specify the tokenizer if you want to use sub-word encoding instead of character-based encoding.
+
+The encoder section includes the details about the Conformer-CTC encoder architecture. You may find more information in the
+config files and also :ref:`nemo.collections.asr.modules.ConformerEncoder `.
+
+Squeezeformer-CTC
+~~~~~~~~~~~~~~~~~
+
+The config files for the Squeezeformer-CTC model, for character-based encoding and sub-word encoding, can be found at
+``/examples/asr/conf/squeezeformer/squeezeformer_ctc_char.yaml`` and ``/examples/asr/conf/squeezeformer/squeezeformer_ctc_bpe.yaml``
+respectively. Components of the `Squeezeformer-CTC <./models.html#Squeezeformer-CTC>`__ configs are similar to those of the
+`Conformer-CTC <./configs.html#Conformer-CTC>`__ configs described above.
+
+The encoder section includes the details about the Squeezeformer-CTC encoder architecture. You may find more information in the
+config files and also :ref:`nemo.collections.asr.modules.SqueezeformerEncoder `.
+
+
+ContextNet
+~~~~~~~~~~
+
+Please refer to the model page of `ContextNet <./models.html#ContextNet>`__ for more information on this model.
+
+Conformer-Transducer
+~~~~~~~~~~~~~~~~~~~~
+
+Please refer to the model page of `Conformer-Transducer <./models.html#Conformer-Transducer>`__ for more information on this model.
+
+LSTM-Transducer and LSTM-CTC
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The config files for the LSTM-Transducer and LSTM-CTC models can be found at ``/examples/asr/conf/lstm/lstm_transducer_bpe.yaml`` and ``/examples/asr/conf/lstm/lstm_ctc_bpe.yaml`` respectively.
+Most of the configs are similar to those of other CTC or transducer models. The main difference is the encoder part.
+The encoder section includes the details about the RNN-based encoder architecture. You may find more information in the
+config files and also :ref:`nemo.collections.asr.modules.RNNEncoder `.
+
+
+InterCTC Config
+---------------
+
+All CTC-based models also support `InterCTC loss `_. To use it, you need to specify
+two parameters, as in the example below:
+
+.. code-block:: yaml
+
+    model:
+        # ...
+        interctc:
+            loss_weights: [0.3]
+            apply_at_layers: [8]
+
+This configuration reproduces the default setup from the paper (assuming the total number of layers is 18).
+You can also specify multiple CTC losses from different layers, e.g., to get 2 losses from layers 3 and 8 with
+weights 0.1 and 0.3, specify:
+
+.. code-block:: yaml
+
+    model:
+        # ...
+        interctc:
+            loss_weights: [0.1, 0.3]
+            apply_at_layers: [3, 8]
+
+Note that the final-layer CTC loss weight is automatically computed to normalize
+all weights to 1 (0.6 in the example above).
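+
+As a quick sanity check of the normalization rule above, the snippet below (an illustrative sketch, not a NeMo utility)
+builds the same ``interctc`` sub-config with OmegaConf and derives the implicit final-layer weight:
+
+.. code-block:: python
+
+    from omegaconf import OmegaConf
+
+    # Standalone sketch: recreate the second interctc example from above.
+    cfg = OmegaConf.create(
+        {"model": {"interctc": {"loss_weights": [0.1, 0.3], "apply_at_layers": [3, 8]}}}
+    )
+
+    # The final CTC layer receives whatever weight remains after the intermediate losses.
+    final_layer_weight = 1.0 - sum(cfg.model.interctc.loss_weights)
+    assert final_layer_weight > 0, "Intermediate loss weights must sum to less than 1"
+    print(final_layer_weight)  # 0.6, matching the value quoted above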
+ + +Stochastic Depth Config +----------------------- + +`Stochastic Depth `_ is a useful technique for regularizing ASR model training. +Currently it's only supported for :ref:`nemo.collections.asr.modules.ConformerEncoder `. To +use it, specify the following parameters in the encoder config file to reproduce the default setup from the paper: + +.. code-block:: yaml + + model: + # ... + encoder: + # ... + stochastic_depth_drop_prob: 0.3 + stochastic_depth_mode: linear # linear or uniform + stochastic_depth_start_layer: 1 + +See :ref:`documentation of ConformerEncoder ` for more details. Note that stochastic depth +is supported for both CTC and Transducer model variations (or any other kind of model/loss that's using +conformer as encoder). + + +Transducer Configurations +------------------------- + +All CTC-based ASR model configs can be modified to support Transducer loss training. Below, we discuss the modifications required in the config to enable Transducer training. All modifications are made to the ``model`` config. + +Model Defaults +~~~~~~~~~~~~~~ + +It is a subsection to the model config representing the default values shared across the entire model represented as ``model.model_defaults``. + +There are three values that are primary components of a transducer model. They are : + +* ``enc_hidden``: The hidden dimension of the final layer of the Encoder network. +* ``pred_hidden``: The hidden dimension of the final layer of the Prediction network. +* ``joint_hidden``: The hidden dimension of the intermediate layer of the Joint network. + +One can access these values inside the config by using OmegaConf interpolation as follows : + +.. code-block:: yaml + + model: + ... + model_defaults: + enc_hidden: 256 + pred_hidden: 256 + joint_hidden: 256 + ... + decoder: + ... + prednet: + pred_hidden: ${model.model_defaults.pred_hidden} + +Acoustic Encoder Model +~~~~~~~~~~~~~~~~~~~~~~ + +The transducer model is comprised of three models combined. One of these models is the Acoustic (encoder) model. We should be able to drop in any CTC Acoustic model config into this section of the transducer config. + +The only condition that needs to be met is that **the final layer of the acoustic model must have the hidden dimension defined in ``model_defaults.enc_hidden``**. + +Decoder / Prediction Model +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The Prediction model is generally an autoregressive, causal model that consumes text tokens and returns embeddings that will be used by the Joint model. The base config for an LSTM based Prediction network can be found in the the ``decoder`` section of `ContextNet <./models.html#ContextNet>`__ or other Transducer architectures. For further information refer to the ``Intro to Transducers`` tutorial in the ASR tutorial section. + +**This config can be copy-pasted into any custom transducer model with no modification.** + +Let us discuss some of the important arguments: + +* ``blank_as_pad``: In ordinary transducer models, the embedding matrix does not acknowledge the ``Transducer Blank`` token (similar to CTC Blank). However, this causes the autoregressive loop to be more complicated and less efficient. Instead, this flag which is set by default, will add the ``Transducer Blank`` token to the embedding matrix - and use it as a pad value (zeros tensor). This enables more efficient inference without harming training. For further information refer to the ``Intro to Transducers`` tutorial in the ASR tutorial section. 
+
+* ``prednet.pred_hidden``: The hidden dimension of the LSTM and the output dimension of the Prediction network.
+
+.. code-block:: yaml
+
+    decoder:
+        _target_: nemo.collections.asr.modules.RNNTDecoder
+        normalization_mode: null
+        random_state_sampling: false
+        blank_as_pad: true
+
+        prednet:
+            pred_hidden: ${model.model_defaults.pred_hidden}
+            pred_rnn_layers: 1
+            t_max: null
+            dropout: 0.0
+
+Joint Model
+~~~~~~~~~~~
+
+The Joint model is a simple feed-forward Multi-Layer Perceptron network. This MLP accepts the output of the Acoustic and Prediction models and computes a joint probability distribution over the entire vocabulary space. The base config for the Joint network can be found in the ``joint`` section of `ContextNet <./models.html#ContextNet>`__ or other Transducer architectures. For further information refer to the ``Intro to Transducers`` tutorial in the ASR tutorial section.
+
+**This config can be copy-pasted into any custom transducer model with no modification.**
+
+The Joint model config has several essential components, which we discuss below:
+
+* ``log_softmax``: Due to the cost of computing softmax on such large tensors, the Numba CUDA implementation of RNNT loss will implicitly compute the log softmax when called (so its inputs should be logits). The CPU version of the loss doesn't face such memory issues, so it requires log-probabilities instead. Since the behaviour differs between CPU and GPU, the ``None`` value will automatically switch behaviour depending on whether the input tensor is on a CPU or GPU device.
+
+* ``preserve_memory``: This flag will call ``torch.cuda.empty_cache()`` at certain critical sections when computing the Joint tensor. While this operation might allow us to preserve some memory, the ``empty_cache()`` operation is tremendously slow and will slow down training by an order of magnitude or more. It is available to use but not recommended.
+
+* ``fuse_loss_wer``: This flag performs "batch splitting" and then "fused loss + metric" calculation. It will be discussed in detail in the next tutorial that will train a Transducer model.
+
+* ``fused_batch_size``: When the above flag is set to True, the model will have two distinct "batch sizes". The batch size provided in the three data loader configs (``model.*_ds.batch_size``) will now be the ``Acoustic model`` batch size, whereas the ``fused_batch_size`` will be the batch size of the ``Prediction model``, the ``Joint model``, the ``transducer loss`` module and the ``decoding`` module.
+
+* ``jointnet.joint_hidden``: The hidden intermediate dimension of the joint network.
+
+.. code-block:: yaml
+
+    joint:
+        _target_: nemo.collections.asr.modules.RNNTJoint
+        log_softmax: null  # sets it according to cpu/gpu device
+
+        # fused mode
+        fuse_loss_wer: false
+        fused_batch_size: 16
+
+        jointnet:
+            joint_hidden: ${model.model_defaults.joint_hidden}
+            activation: "relu"
+            dropout: 0.0
+
+Sampled Softmax Joint Model
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+There are some situations where a large vocabulary must be used with a Transducer model - such as for multilingual models
+covering a large number of languages. In this setting, the memory cost of computing the Transducer Joint over the full
+vocabulary can make training infeasible.
+
+For such cases, one can utilize the ``SampledRNNTJoint`` module instead of the usual ``RNNTJoint`` module, in order
+to compute the loss using a sampled subset of the vocabulary rather than the full vocabulary.
+ +It adds only one additional parameter : + +* ``n_samples``: Specifies the minimum number of tokens to sample from the vocabulary space, + excluding the RNNT blank token. If a given value is larger than the entire vocabulary size, + then the full vocabulary will be used. + +The only difference in config required is to replace ``nemo.collections.asr.modules.RNNTJoint`` with ``nemo.collections.asr.modules.SampledRNNTJoint`` + +.. code-block:: yaml + + joint: + _target_: nemo.collections.asr.modules.SampledRNNTJoint + n_samples: 500 + ... # All other arguments from RNNTJoint can be used after this. + + +Effect of Batch Splitting / Fused Batch step +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The following information below explain why memory is an issue when training Transducer models and how NeMo tackles the issue with its Fused Batch step. The material can be read for a thorough understanding, otherwise, it can be skipped. You can also follow these steps in the "ASR_with_Transducers" tutorial. + +**Diving deeper into the memory costs of Transducer Joint** + +One of the significant limitations of Transducers is the exorbitant memory cost of computing the Joint module. The Joint module is comprised of two steps. + +1) Projecting the Acoustic and Transcription feature dimensions to some standard hidden dimension (specified by model.model_defaults.joint_hidden) + +2) Projecting this intermediate hidden dimension to the final vocabulary space to obtain the transcription. + +Take the following example. + +BS=32 ; T (after 2x stride) = 800, U (with character encoding) = 400-450 tokens, Vocabulary size V = 28 (26 alphabet chars, space and apostrophe). Let the hidden dimension of the Joint model be 640 (Most Google Transducer papers use hidden dimension of 640). + +* :math:`Memory \, (Hidden, \, gb) = 32 \times 800 \times 450 \times 640 \times 4 = 29.49` gigabytes (4 bytes per float). + +* :math:`Memory \, (Joint, \, gb) = 32 \times 800 \times 450 \times 28 \times 4 = 1.290` gigabytes (4 bytes per float) + +**NOTE**: This is just for the forward pass! We need to double this memory to store gradients! This much memory is also just for the Joint model **alone**. Far more memory is required for the Prediction model as well as the large Acoustic model itself and its gradients! + +Even with mixed precision, that's $\sim 30$ GB of GPU RAM for just 1 part of the network + its gradients. + +Effect of Fused Batch Step +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The fundamental problem is that the joint tensor grows in size when ``[T x U]`` grows in size. This growth in memory cost is due to many reasons - either by model construction (downsampling) or the choice of dataset preprocessing (character tokenization vs. sub-word tokenization). + +Another dimension that NeMo can control is **batch**. Due to how we batch our samples, small and large samples all get clumped together into a single batch. So even though the individual samples are not all as long as the maximum length of T and U in that batch, when a batch of such samples is constructed, it will consume a significant amount of memory for the sake of compute efficiency. + +So as is always the case - **trade-off compute speed for memory savings**. + +The fused operation goes as follows : + +1) Forward the entire acoustic model in a single pass. (Use global batch size here for acoustic model - found in ``model.*_ds.batch_size``) + +2) Split the Acoustic Model's logits by ``fused_batch_size`` and loop over these sub-batches. 
+ +3) Construct a sub-batch of same ``fused_batch_size`` for the Prediction model. Now the target sequence length is :math:`U_{sub-batch} < U`. + +4) Feed this :math:`U_{sub-batch}` into the Joint model, along with a sub-batch from the Acoustic model (with :math:`T_{sub-batch} < T)`. Remember, we only have to slice off a part of the acoustic model here since we have the full batch of samples :math:`(B, T, D)` from the acoustic model. + +5) Performing steps (3) and (4) yields :math:`T_{sub-batch}` and :math:`U_{sub-batch}`. Perform sub-batch joint step - costing an intermediate :math:`(B, T_{sub-batch}, U_{sub-batch}, V)` in memory. + +6) Compute loss on sub-batch and preserve in a list to be later concatenated. + +7) Compute sub-batch metrics (such as Character / Word Error Rate) using the above Joint tensor and sub-batch of ground truth labels. Preserve the scores to be averaged across the entire batch later. + +8) Delete the sub-batch joint matrix :math:`(B, T_{sub-batch}, U_{sub-batch}, V)`. Only gradients from .backward() are preserved now in the computation graph. + +9) Repeat steps (3) - (8) until all sub-batches are consumed. + +10) Cleanup step. Compute full batch WER and log. Concatenate loss list and pass to PTL to compute the equivalent of the original (full batch) Joint step. Delete ancillary objects necessary for sub-batching. + +Transducer Decoding +~~~~~~~~~~~~~~~~~~~ + +Models which have been trained with CTC can transcribe text simply by performing a regular argmax over the output of their decoder. For transducer-based models, the three networks must operate in a synchronized manner in order to transcribe the acoustic features. The base config for the Transducer decoding step can be found in the the ``decoding`` section of `ContextNet <./models.html#ContextNet>`__ or other Transducer architectures. For further information refer to the ``Intro to Transducers`` tutorial in the ASR tutorial section. + +**This config can be copy-pasted into any custom transducer model with no modification.** + +The most important component at the top level is the ``strategy``. It can take one of many values: + +* ``greedy``: This is sample-level greedy decoding. It is generally exceptionally slow as each sample in the batch will be decoded independently. For publications, this should be used alongside batch size of 1 for exact results. + +* ``greedy_batch``: This is the general default and should nearly match the ``greedy`` decoding scores (if the acoustic features are not affected by feature mixing in batch mode). Even for small batch sizes, this strategy is significantly faster than ``greedy``. + +* ``beam``: Runs beam search with the implicit language model of the Prediction model. It will generally be quite slow, and might need some tuning of the beam size to get better transcriptions. + +* ``tsd``: Time synchronous decoding. Please refer to the paper: `Alignment-Length Synchronous Decoding for RNN Transducer `_ for details on the algorithm implemented. Time synchronous decoding (TSD) execution time grows by the factor T * max_symmetric_expansions. For longer sequences, T is greater and can therefore take a long time for beams to obtain good results. TSD also requires more memory to execute. + +* ``alsd``: Alignment-length synchronous decoding. Please refer to the paper: `Alignment-Length Synchronous Decoding for RNN Transducer `_ for details on the algorithm implemented. 
Alignment-length synchronous decoding (ALSD) execution time is faster than TSD, with a growth factor of T + U_max, where U_max is the maximum target length expected during execution. Generally, T + U_max < T * max_symmetric_expansions. However, ALSD beams are non-unique, so larger beam sizes are required to achieve the same (or close to the same) decoding accuracy as TSD. For a given decoding accuracy, it is possible to attain faster decoding via ALSD than TSD.
+
+* ``maes``: Modified Adaptive Expansion Search decoding. Please refer to the paper `Accelerating RNN Transducer Inference via Adaptive Expansion Search `_. Modified Adaptive Expansion Search (mAES) execution time is adaptive w.r.t. the number of expansions (for tokens) required per timestep. The number of expansions can usually be constrained to 1 or 2, and in most cases 2 is sufficient. This beam search technique can possibly obtain superior WER while sacrificing some evaluation time.
+
+.. code-block:: yaml
+
+    decoding:
+        strategy: "greedy_batch"
+
+        # preserve decoding alignments
+        preserve_alignments: false
+
+        # Overrides the fused batch size after training.
+        # Setting it to -1 will process the whole batch at once when combined with the `greedy_batch` decoding strategy.
+        fused_batch_size: -1
+
+        # greedy strategy config
+        greedy:
+            max_symbols: 10
+
+        # beam strategy config
+        beam:
+            beam_size: 2
+            score_norm: true
+            softmax_temperature: 1.0  # scale the logits by some temperature prior to softmax
+            tsd_max_sym_exp: 10  # for Time Synchronous Decoding, int > 0
+            alsd_max_target_len: 5.0  # for Alignment-Length Synchronous Decoding, float > 1.0
+            maes_num_steps: 2  # for modified Adaptive Expansion Search, int > 0
+            maes_prefix_alpha: 1  # for modified Adaptive Expansion Search, int > 0
+            maes_expansion_beta: 2  # for modified Adaptive Expansion Search, int >= 0
+            maes_expansion_gamma: 2.3  # for modified Adaptive Expansion Search, float >= 0
+
+Transducer Loss
+~~~~~~~~~~~~~~~
+
+This section configures the type of Transducer loss itself, along with possible sub-sections. By default, an optimized implementation of Transducer loss will be used which depends on Numba for CUDA acceleration. The base config for the Transducer loss section can be found in the ``loss`` section of `ContextNet <./models.html#ContextNet>`__ or other Transducer architectures. For further information refer to the ``Intro to Transducers`` tutorial in the ASR tutorial section.
+
+**This config can be copy-pasted into any custom transducer model with no modification.**
+
+The loss config is based on a resolver pattern and can be used as follows:
+
+1) ``loss_name``: ``default`` is generally a good option. This will select one of the available resolved losses and match the kwargs from the corresponding ``{loss_name}_kwargs`` sub-config.
+
+2) ``{loss_name}_kwargs``: This sub-config is passed to the resolved loss above and can be used to configure the resolved loss.
+
+.. code-block:: yaml
+
+    loss:
+        loss_name: "default"
+        warprnnt_numba_kwargs:
+            fastemit_lambda: 0.0
+
+FastEmit Regularization
+^^^^^^^^^^^^^^^^^^^^^^^
+
+FastEmit Regularization is supported for the default Numba-based WarpRNNT loss. This recently proposed regularization approach - `FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization `_ - allows near-direct control over the latency of transducer models.
+
+Refer to the above paper for results and recommendations for ``fastemit_lambda``.
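+
+To close out the Transducer configuration discussion, the short sketch below shows how the ``decoding`` section described
+above can be swapped on an already trained model at evaluation time. This is an illustrative sketch only; it assumes a NeMo
+Transducer model that exposes its decoding config at ``model.cfg.decoding`` and supports ``change_decoding_strategy()``,
+and the checkpoint name is used purely as an example:
+
+.. code-block:: python
+
+    from copy import deepcopy
+
+    import nemo.collections.asr as nemo_asr
+
+    # Any Transducer checkpoint can be substituted here.
+    model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained("stt_en_conformer_transducer_large")
+
+    # Start from the model's current decoding config and override only the fields of interest.
+    decoding_cfg = deepcopy(model.cfg.decoding)
+    decoding_cfg.strategy = "beam"
+    decoding_cfg.beam.beam_size = 4
+
+    # Rebuild the decoding module with the new strategy before evaluation.
+    model.change_decoding_strategy(decoding_cfg)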
+ + +Fine-tuning Configurations +-------------------------- + +All ASR scripts support easy fine-tuning by partially/fully loading the pretrained weights from a checkpoint into the **currently instantiated model**. Note that the currently instantiated model should have parameters that match the pre-trained checkpoint (such that weights may load properly). In order to directly fine-tune a pre-existing checkpoint, please follow the tutorial `ASR Language Fine-tuning. `_ + +Pre-trained weights can be provided in multiple ways - + +1) Providing a path to a NeMo model (via ``init_from_nemo_model``) +2) Providing a name of a pretrained NeMo model (which will be downloaded via the cloud) (via ``init_from_pretrained_model``) +3) Providing a path to a Pytorch Lightning checkpoint file (via ``init_from_ptl_ckpt``) + +There are multiple ASR subtasks inside the ``examples/asr/`` directory, you can substitute the ```` tag below. + +Fine-tuning via a NeMo model +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: sh + + python examples/asr//script_to_.py \ + --config-path= \ + --config-name=) \ + model.train_ds.manifest_filepath="" \ + model.validation_ds.manifest_filepath="" \ + trainer.devices=-1 \ + trainer.accelerator='gpu' \ + trainer.max_epochs=50 \ + +init_from_nemo_model="" + + +Fine-tuning via a NeMo pretrained model name +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: sh + + python examples/asr//script_to_.py \ + --config-path= \ + --config-name=) \ + model.train_ds.manifest_filepath="" \ + model.validation_ds.manifest_filepath="" \ + trainer.devices=-1 \ + trainer.accelerator='gpu' \ + trainer.max_epochs=50 \ + +init_from_pretrained_model="" + +Fine-tuning via a Pytorch Lightning checkpoint +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: sh + + python examples/asr//script_to_.py \ + --config-path= \ + --config-name=) \ + model.train_ds.manifest_filepath="" \ + model.validation_ds.manifest_filepath="" \ + trainer.devices=-1 \ + trainer.accelerator='gpu' \ + trainer.max_epochs=50 \ + +init_from_ptl_ckpt="" + +Fine-tuning Execution Flow Diagram +---------------------------------- + +When preparing your own training or fine-tuning scripts, please follow the execution flow diagram order for correct inference. 
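+
+In terms of execution flow, the pretrained weights are loaded only after the model has been instantiated from the config,
+and before ``trainer.fit()`` is called. The sketch below roughly mirrors what the example scripts do internally via the
+``init_from_*`` keys; the config choice, manifest paths, tokenizer directory, and checkpoint name are placeholders for
+illustration only:
+
+.. code-block:: python
+
+    import pytorch_lightning as pl
+    from omegaconf import OmegaConf
+
+    import nemo.collections.asr as nemo_asr
+
+    cfg = OmegaConf.load("examples/asr/conf/citrinet/citrinet_512.yaml")  # placeholder config choice
+    cfg.model.train_ds.manifest_filepath = "train_manifest.json"          # placeholder manifests
+    cfg.model.validation_ds.manifest_filepath = "val_manifest.json"
+    cfg.model.tokenizer.dir = "tokenizer_dir/"                            # placeholder tokenizer (BPE models)
+    cfg.init_from_pretrained_model = "stt_en_citrinet_512"                # same effect as the CLI override
+
+    trainer = pl.Trainer(**cfg.trainer)
+    asr_model = nemo_asr.models.EncDecCTCModelBPE(cfg=cfg.model, trainer=trainer)
+
+    # Pretrained weights are pulled into the freshly constructed model here, before training begins.
+    asr_model.maybe_init_from_pretrained_checkpoint(cfg)
+    trainer.fit(asr_model)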
+ +Depending on the type of model, there may be extra steps that must be performed - + +* CTC Models - `Examples directory for CTC Models `_ +* RNN Transducer Models - `Examples directory for Transducer Models `_ diff --git a/docs/source/asr/data/asrlm_results.csv b/docs/source/asr/data/asrlm_results.csv new file mode 100644 index 0000000000000000000000000000000000000000..d9a395cb8b75ffc59300341feeeba2c8be2aa5fa --- /dev/null +++ b/docs/source/asr/data/asrlm_results.csv @@ -0,0 +1,2 @@ +Model Name,Model Base Class,Model Card +asrlm_en_transformer_large_ls,TransformerLMModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:asrlm_en_transformer_large_ls" diff --git a/docs/source/asr/data/benchmark_ca.csv b/docs/source/asr/data/benchmark_ca.csv new file mode 100644 index 0000000000000000000000000000000000000000..bd7e174b922faaf7e099ad843ede10cda6b46b8c --- /dev/null +++ b/docs/source/asr/data/benchmark_ca.csv @@ -0,0 +1,4 @@ +Model,Model Base Class,Model Card +stt_ca_quartznet15x5,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_ca_quartznet15x5" +stt_ca_conformer_ctc_large,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_ca_conformer_ctc_large" +stt_ca_conformer_transducer_large,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_ca_conformer_transducer_large" \ No newline at end of file diff --git a/docs/source/asr/data/benchmark_de.csv b/docs/source/asr/data/benchmark_de.csv new file mode 100644 index 0000000000000000000000000000000000000000..99e221a6b835517eed8023592af27c91afef70b2 --- /dev/null +++ b/docs/source/asr/data/benchmark_de.csv @@ -0,0 +1,6 @@ +Model,Model Base Class,Model Card +stt_de_quartznet15x5,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_de_quartznet15x5" +stt_de_citrinet_1024,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_de_citrinet_1024" +stt_de_contextnet_1024,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_de_contextnet_1024" +stt_de_conformer_ctc_large,EncDecCTCModelBPE,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_de_conformer_ctc_large" +stt_de_conformer_transducer_large,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_de_conformer_transducer_large" diff --git a/docs/source/asr/data/benchmark_en.csv b/docs/source/asr/data/benchmark_en.csv new file mode 100644 index 0000000000000000000000000000000000000000..0f03452d034ddebff39b98dfbe1302d9ce71b64a --- /dev/null +++ b/docs/source/asr/data/benchmark_en.csv @@ -0,0 +1,28 @@ +Model Name,Model Base Class,Model Card +QuartzNet15x5Base-En,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemospeechmodels" +stt_en_jasper10x5dr,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_jasper10x5dr" +stt_en_citrinet_256,EncDecCTCModelBPE,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_citrinet_256" +stt_en_citrinet_512,EncDecCTCModelBPE,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_citrinet_512" +stt_en_citrinet_1024,EncDecCTCModelBPE,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_citrinet_1024" +stt_en_citrinet_256_gamma_0_25,EncDecCTCModelBPE,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_citrinet_256_gamma_0_25" +stt_en_citrinet_512_gamma_0_25,EncDecCTCModelBPE,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_citrinet_512_gamma_0_25" +stt_en_citrinet_1024_gamma_0_25,EncDecCTCModelBPE,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_citrinet_1024_gamma_0_25" 
+stt_en_contextnet_256_mls,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_contextnet_256_mls" +stt_en_contextnet_512_mls,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_contextnet_512_mls" +stt_en_contextnet_1024_mls,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_contextnet_1024_mls" +stt_en_contextnet_256,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_contextnet_256" +stt_en_contextnet_512,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_contextnet_512" +stt_en_contextnet_1024,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_contextnet_1024" +stt_en_conformer_ctc_small,EncDecCTCModelBPE,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_ctc_small" +stt_en_conformer_ctc_medium,EncDecCTCModelBPE,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_ctc_medium" +stt_en_conformer_ctc_large,EncDecCTCModelBPE,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_ctc_large" +stt_en_conformer_ctc_xlarge,EncDecCTCModelBPE,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_ctc_xlarge" +stt_en_conformer_ctc_small_ls,EncDecCTCModelBPE,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_ctc_small_ls" +stt_en_conformer_ctc_medium_ls,EncDecCTCModelBPE,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_ctc_medium_ls" +stt_en_conformer_ctc_large_ls,EncDecCTCModelBPE,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_ctc_large_ls" +stt_en_conformer_transducer_large_ls,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_transducer_large_ls" +stt_en_conformer_transducer_small,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_transducer_small" +stt_en_conformer_transducer_medium,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_transducer_medium" +stt_en_conformer_transducer_large,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_transducer_large" +stt_en_conformer_transducer_xlarge,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_transducer_xlarge" +stt_en_conformer_transducer_xxlarge,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_transducer_xxlarge" \ No newline at end of file diff --git a/docs/source/asr/data/benchmark_es.csv b/docs/source/asr/data/benchmark_es.csv new file mode 100644 index 0000000000000000000000000000000000000000..1e1ade3a739c4d6e9d1c14b493845ae8f29e3aa0 --- /dev/null +++ b/docs/source/asr/data/benchmark_es.csv @@ -0,0 +1,7 @@ +Model,Model Base Class,Model Card +stt_es_quartznet15x5,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_es_quartznet15x5" +stt_es_citrinet_512,EncDecCTCModelBPE,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_es_citrinet_512" +stt_es_citrinet_1024_gamma_0_25,EncDecCTCModelBPE,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_es_citrinet_1024_gamma_0_25" +stt_es_conformer_ctc_large,EncDecCTCModelBPE,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_es_conformer_ctc_large" +stt_es_conformer_transducer_large,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_es_conformer_transducer_large" +stt_es_contextnet_1024,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_es_contextnet_1024" 
\ No newline at end of file diff --git a/docs/source/asr/data/benchmark_fr.csv b/docs/source/asr/data/benchmark_fr.csv new file mode 100644 index 0000000000000000000000000000000000000000..2f27a0ab200959959bf389bb72074dd946f34a7f --- /dev/null +++ b/docs/source/asr/data/benchmark_fr.csv @@ -0,0 +1,8 @@ +Model,Model Base Class,Model Card +stt_fr_quartznet15x5,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_fr_quartznet15x5" +stt_fr_citrinet_1024_gamma_0_25,EncDecCTCModelBPE,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_fr_citrinet_1024_gamma_0_25" +stt_fr_no_hyphen_citrinet_1024_gamma_0_25,EncDecCTCModelBPE,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_fr_citrinet_1024_gamma_0_25" +stt_fr_contextnet_1024,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_fr_contextnet_1024" +stt_fr_conformer_ctc_large,EncDecCTCModelBPE,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_fr_conformer_ctc_large" +stt_fr_no_hyphen_conformer_ctc_large,EncDecCTCModelBPE,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_fr_conformer_ctc_large" +stt_fr_conformer_transducer_large,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_fr_conformer_transducer_large" \ No newline at end of file diff --git a/docs/source/asr/data/benchmark_hi.csv b/docs/source/asr/data/benchmark_hi.csv new file mode 100644 index 0000000000000000000000000000000000000000..4d3df532ed2e253f74d0e7a0c66f5ae0381bb75e --- /dev/null +++ b/docs/source/asr/data/benchmark_hi.csv @@ -0,0 +1,2 @@ +Model Name,Model Base Class,Model Card +stt_hi_conformer_ctc_medium,EncDecCTCModelBPE,"https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_hi_conformer_ctc_medium" diff --git a/docs/source/asr/data/benchmark_hr.csv b/docs/source/asr/data/benchmark_hr.csv new file mode 100644 index 0000000000000000000000000000000000000000..ea506eed34324bee914e70145fdd0a82fda7ca75 --- /dev/null +++ b/docs/source/asr/data/benchmark_hr.csv @@ -0,0 +1,3 @@ +Model,Model Base Class,Model Card +stt_hr_conformer_ctc_large,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_hr_conformer_ctc_large" +stt_hr_conformer_transducer_large,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_hr_conformer_transducer_large" diff --git a/docs/source/asr/data/benchmark_it.csv b/docs/source/asr/data/benchmark_it.csv new file mode 100644 index 0000000000000000000000000000000000000000..d605b68809eb6481c576340a0f5be843b0c504a4 --- /dev/null +++ b/docs/source/asr/data/benchmark_it.csv @@ -0,0 +1,3 @@ +Model,Model Base Class,Model Card +stt_it_quartznet15x5,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_it_quartznet15x5" + diff --git a/docs/source/asr/data/benchmark_kab.csv b/docs/source/asr/data/benchmark_kab.csv new file mode 100644 index 0000000000000000000000000000000000000000..76a54cfe42de4de29f8d702b50262abc546b193a --- /dev/null +++ b/docs/source/asr/data/benchmark_kab.csv @@ -0,0 +1,2 @@ +Model,Model Base Class,Model Card +stt_kab_conformer_transducer_large,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_kab_conformer_transducer_large" diff --git a/docs/source/asr/data/benchmark_mr.csv b/docs/source/asr/data/benchmark_mr.csv new file mode 100644 index 0000000000000000000000000000000000000000..00ae7211bd75729b5305deca275e26c140896e0c --- /dev/null +++ b/docs/source/asr/data/benchmark_mr.csv @@ -0,0 +1,3 @@ +Model Name,Model Base Class,Model Card 
+stt_mr_conformer_ctc_medium,EncDecCTCModelBPE,"https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_mr_conformer_ctc_medium" + diff --git a/docs/source/asr/data/benchmark_pl.csv b/docs/source/asr/data/benchmark_pl.csv new file mode 100644 index 0000000000000000000000000000000000000000..bf646e107306ab01498d09ea842742662ee5cc47 --- /dev/null +++ b/docs/source/asr/data/benchmark_pl.csv @@ -0,0 +1,2 @@ +Model,Model Base Class,Model Card +stt_pl_quartznet15x5,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_pl_quartznet15x5" diff --git a/docs/source/asr/data/benchmark_ru.csv b/docs/source/asr/data/benchmark_ru.csv new file mode 100644 index 0000000000000000000000000000000000000000..b46d4d9ca65cb5521fcd192494f58fe56689973c --- /dev/null +++ b/docs/source/asr/data/benchmark_ru.csv @@ -0,0 +1,3 @@ +Model,Model Base Class,Model Card +stt_ru_quartznet15x5,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_ru_quartznet15x5" + diff --git a/docs/source/asr/data/benchmark_rw.csv b/docs/source/asr/data/benchmark_rw.csv new file mode 100644 index 0000000000000000000000000000000000000000..0264fc8a70cdc40ba96ce38701c7e837ee773f4f --- /dev/null +++ b/docs/source/asr/data/benchmark_rw.csv @@ -0,0 +1,3 @@ +Model,Model Base Class,Model Card +stt_rw_conformer_ctc_large,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_rw_conformer_ctc_large" +stt_rw_conformer_transducer_large,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_rw_conformer_transducer_large" \ No newline at end of file diff --git a/docs/source/asr/data/benchmark_zh.csv b/docs/source/asr/data/benchmark_zh.csv new file mode 100644 index 0000000000000000000000000000000000000000..3d98f2fa4cec36cc1652fe2a18e4032dd0e377eb --- /dev/null +++ b/docs/source/asr/data/benchmark_zh.csv @@ -0,0 +1,4 @@ +Model,Model Base Class,Model Card +stt_zh_citrinet_512,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_zh_citrinet_512" +stt_zh_citrinet_1024_gamma_0_25,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_zh_citrinet_1024_gamma_0_25" +stt_zh_conformer_transducer_large,EncDecRNNTModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_zh_conformer_transducer_large" diff --git a/docs/source/asr/data/scores/be/conformer_be.csv b/docs/source/asr/data/scores/be/conformer_be.csv new file mode 100644 index 0000000000000000000000000000000000000000..12fcfe0e554bced7865e409b5ad57e6557441425 --- /dev/null +++ b/docs/source/asr/data/scores/be/conformer_be.csv @@ -0,0 +1,3 @@ +Model Name,Language,MCV Test-Set v10 (be) +stt_be_conformer_ctc_large,be,4.7 % +stt_be_conformer_transducer_large,be,3.8 % diff --git a/docs/source/asr/data/scores/ca/conformer_ca.csv b/docs/source/asr/data/scores/ca/conformer_ca.csv new file mode 100644 index 0000000000000000000000000000000000000000..bc30b90a25b4f9dcf1d782fe2e4b44484b265b2c --- /dev/null +++ b/docs/source/asr/data/scores/ca/conformer_ca.csv @@ -0,0 +1,3 @@ +Model Name,Language,MCV Dev-Set (v??) (ca),MCV Dev-Set v9.0 (ca),MCV Test-Set v9.0 (ca) +stt_ca_conformer_ctc_large,ca,,4.70,4.27 +stt_ca_conformer_transducer_large,ca,,4.43,3.85 diff --git a/docs/source/asr/data/scores/ca/quartznet15x5_ca.csv b/docs/source/asr/data/scores/ca/quartznet15x5_ca.csv new file mode 100644 index 0000000000000000000000000000000000000000..6b826662e25ecdeeb1d68483f6f2260e0f9f38fd --- /dev/null +++ b/docs/source/asr/data/scores/ca/quartznet15x5_ca.csv @@ -0,0 +1,2 @@ +Model Name,Language,MCV Dev-Set (v??) 
(ca),MCV Dev-Set v9.0 (ca),MCV Test-Set v9.0 (ca) +stt_ca_quartznet15x5,ca,6.0,, diff --git a/docs/source/asr/data/scores/de/citrinet_de.csv b/docs/source/asr/data/scores/de/citrinet_de.csv new file mode 100644 index 0000000000000000000000000000000000000000..1b3e7db093a2ed9b010dd8fbb706a4ca268fdb64 --- /dev/null +++ b/docs/source/asr/data/scores/de/citrinet_de.csv @@ -0,0 +1,2 @@ +Model Name,Language,MCV Dev-Set (v??) (de),MCV Dev-Set v7.0 (de),MCV Test-Set v7.0 (de),MLS Dev (en),MLS Test (en),VoxPopuli Dev (de),VoxPopuli Test (de) +stt_de_citrinet_1024,de,,6.63,7.59,4.06,5.07,12.33,10.02 diff --git a/docs/source/asr/data/scores/de/conformer_de.csv b/docs/source/asr/data/scores/de/conformer_de.csv new file mode 100644 index 0000000000000000000000000000000000000000..3d0a9e18d452eda69c32971a60ceed589d6e3bcf --- /dev/null +++ b/docs/source/asr/data/scores/de/conformer_de.csv @@ -0,0 +1,3 @@ +Model Name,Language,MCV Dev-Set (v??) (de),MCV Dev-Set v7.0 (de),MCV Test-Set v7.0 (de),MLS Dev (en),MLS Test (en),VoxPopuli Dev (de),VoxPopuli Test (de) +stt_de_conformer_ctc_large,de,,5.84,6.68,3.85,4.63,12.56,10.51 +stt_de_conformer_transducer_large,de,,4.75,5.36,3.46,4.19,11.21,9.14 diff --git a/docs/source/asr/data/scores/de/contextnet_de.csv b/docs/source/asr/data/scores/de/contextnet_de.csv new file mode 100644 index 0000000000000000000000000000000000000000..b7d52d649e73bb7aebf9a5a60191fd7f00404acb --- /dev/null +++ b/docs/source/asr/data/scores/de/contextnet_de.csv @@ -0,0 +1,2 @@ +Model Name,Language,MCV Dev-Set (v??) (de),MCV Dev-Set v7.0 (de),MCV Test-Set v7.0 (de),MLS Dev (en),MLS Test (en),VoxPopuli Dev (de),VoxPopuli Test (de) +stt_de_contextnet_1024,de,,4.76,5.5,3.53,4.2,11.32,9.4 diff --git a/docs/source/asr/data/scores/de/quartznet15x5_de.csv b/docs/source/asr/data/scores/de/quartznet15x5_de.csv new file mode 100644 index 0000000000000000000000000000000000000000..17540903f41e6b5f922a5909b935da9a05be5314 --- /dev/null +++ b/docs/source/asr/data/scores/de/quartznet15x5_de.csv @@ -0,0 +1,2 @@ +Model Name,Language,MCV Dev-Set (v??) 
(de),MCV Dev-Set v7.0 (de),MCV Test-Set v7.0 (de),MLS Dev (en),MLS Test (en),VoxPopuli Dev (de),VoxPopuli Test (de) +stt_de_quartznet15x5,de,11.78,,,,,, diff --git a/docs/source/asr/data/scores/en/citrinet_en.csv b/docs/source/asr/data/scores/en/citrinet_en.csv new file mode 100644 index 0000000000000000000000000000000000000000..42d8cff2cb9b199c705adfa56ba917bad10ea5ac --- /dev/null +++ b/docs/source/asr/data/scores/en/citrinet_en.csv @@ -0,0 +1,7 @@ +Model Name,Language,Librispeech Dev-Clean,Librispeech Dev-Other,Librispeech Test-Clean,Librispeech Test-Other,MCV Test-Set v8.0 (en),MLS Dev (en),MLS Test (en),NSC Part1,NSC Part6,Peoples Speech Test v1,SLR 83 Test,WSJ Dev 93,WSJ Eval 92 +stt_en_citrinet_256,en,4.2 % WER,10.7 % WER,4.4 % WER,10.7 % WER,,,,,,,,, +stt_en_citrinet_512,en,3.7 % WER,8.9 % WER,3.7 % WER,8.9 % WER,,,,,,,,, +stt_en_citrinet_1024,en,3.7 % WER,8.3 % WER,3.6 % WER,7.9 % WER,,,,,,,,, +stt_en_citrinet_256_gamma_0_25,en,4.7 %,10.6 %,4.8 %,10.7 %,,,,8.3 %,,,,5.8 %,3.6 % +stt_en_citrinet_512_gamma_0_25,en,4.0 %,9.0 %,3.9 %,9.0 %,,,,6.9 %,,,,4.4 %,3.6 % +stt_en_citrinet_1024_gamma_0_25,en,3.4 %,7.7 %,3.4 %,7.6 %,,,,6.2 %,,,,4.0 %,2.5 % diff --git a/docs/source/asr/data/scores/en/conformer_en.csv b/docs/source/asr/data/scores/en/conformer_en.csv new file mode 100644 index 0000000000000000000000000000000000000000..23ec4438257837987ca1692ccf12954072ce9e17 --- /dev/null +++ b/docs/source/asr/data/scores/en/conformer_en.csv @@ -0,0 +1,14 @@ +Model Name,Language,Librispeech Dev-Clean,Librispeech Dev-Other,Librispeech Test-Clean,Librispeech Test-Other,MCV Test-Set v8.0 (en),MLS Dev (en),MLS Test (en),NSC Part1,NSC Part6,Peoples Speech Test v1,SLR 83 Test,WSJ Dev 93,WSJ Eval 92 +stt_en_conformer_ctc_small,en,3.6,8.1,3.7,8.1,,,,,,,,, +stt_en_conformer_ctc_medium,en,2.5,5.8,2.6,5.9,,,,,,,,, +stt_en_conformer_ctc_large,en,1.9,4.4,2.1,4.5,,,,,,,,, +stt_en_conformer_ctc_xlarge,en,1.77 %,3.79 %,2.00 %,3.74 %,7.88 %,,5.99 %,,6.44 %,22.90 %,5.50 %,2.36 %, +stt_en_conformer_ctc_small_ls,en,3.3,8.8,3.4,8.8,,,,,,,,, +stt_en_conformer_ctc_medium_ls,en,2.7,7.4,3.0,7.3,,,,,,,,, +stt_en_conformer_ctc_large_ls,en,2.4,6.2,2.7,6.0,,,,,,,,, +stt_en_conformer_transducer_small,en,2.8,6.6,2.5,6.6,,,,,,,,, +stt_en_conformer_transducer_medium,en,2.0,4.6,2.1,4.7,,,,,,,,, +stt_en_conformer_transducer_large,en,1.6,3.5,1.7,3.7,,,,,,,,, +stt_en_conformer_transducer_large_ls,en,2.1,5.0,2.3,5.1,,,,,,,,, +stt_en_conformer_transducer_xlarge,en,1.48 %,2.95 %,1.62 %,3.01 %,6.46 %,4.59 %,5.32 %,5.70 %,6.47 %,21.32 %,,2.05 %,1.17 % +stt_en_conformer_transducer_xxlarge,en,1.52 %,3.09 %,1.72 %,3.14 %,,5.29 %,5.85 %,6.64 %,,,,2.42 %,1.49 % diff --git a/docs/source/asr/data/scores/en/contextnet_en.csv b/docs/source/asr/data/scores/en/contextnet_en.csv new file mode 100644 index 0000000000000000000000000000000000000000..4a065dd299f8f7a232fde5e3b3d706e72b8ad64a --- /dev/null +++ b/docs/source/asr/data/scores/en/contextnet_en.csv @@ -0,0 +1,7 @@ +Model Name,Language,Librispeech Dev-Clean,Librispeech Dev-Other,Librispeech Test-Clean,Librispeech Test-Other,MCV Test-Set v8.0 (en),MLS Dev (en),MLS Test (en),NSC Part1,NSC Part6,Peoples Speech Test v1,SLR 83 Test,WSJ Dev 93,WSJ Eval 92 +stt_en_contextnet_256,en,3.3 %,7.9 %,3.3 %,8.0 %,,9.7 %,11.0 %,7.1 %,,,,4.6 %,3.2 % +stt_en_contextnet_512,en,2.0 %,4.8 %,2.2 %,5.0 %,,6.6 %,7.3 %,5.9 %,,,,2.8 %,1.4 % +stt_en_contextnet_1024,en,1.7 %,3.8 %,1.9 %,4.0 %,7.9 %,,5.9 %,5.2 %,6.5 %,21.7 %,4.7 %,2.3 %,1.3 % +stt_en_contextnet_256_mls,en,,9.0 %,,9.2 %,,9.4 %,10.9 %,,,,,, 
+stt_en_contextnet_512_mls,en,,5.2 %,,5.2 %,,5.6 %,6.6 %,,,,,, +stt_en_contextnet_1024_mls,en,,4.1 %,,4.2 %,,4.6 %,5.6 %,,,,,, diff --git a/docs/source/asr/data/scores/en/jasper10x5dr_en.csv b/docs/source/asr/data/scores/en/jasper10x5dr_en.csv new file mode 100644 index 0000000000000000000000000000000000000000..ac9b260c5bb34ce5ece0c4b02dc1359d0371c677 --- /dev/null +++ b/docs/source/asr/data/scores/en/jasper10x5dr_en.csv @@ -0,0 +1,2 @@ +Model Name,Language,Librispeech Dev-Clean,Librispeech Dev-Other,Librispeech Test-Clean,Librispeech Test-Other,MCV Test-Set v8.0 (en),MLS Dev (en),MLS Test (en),NSC Part1,NSC Part6,Peoples Speech Test v1,SLR 83 Test,WSJ Dev 93,WSJ Eval 92 +stt_en_jasper10x5dr,en,3.74,10.21,,,,,,,,,,, diff --git a/docs/source/asr/data/scores/en/quartznet15x5_en.csv b/docs/source/asr/data/scores/en/quartznet15x5_en.csv new file mode 100644 index 0000000000000000000000000000000000000000..04aef4aa49dd63e1a1f45d1b5ba0fd40372a5b74 --- /dev/null +++ b/docs/source/asr/data/scores/en/quartznet15x5_en.csv @@ -0,0 +1,2 @@ +Model Name,Language,Librispeech Dev-Clean,Librispeech Dev-Other,Librispeech Test-Clean,Librispeech Test-Other,MCV Test-Set v8.0 (en),MLS Dev (en),MLS Test (en),NSC Part1,NSC Part6,Peoples Speech Test v1,SLR 83 Test,WSJ Dev 93,WSJ Eval 92 +stt_en_quartznet15x5,en,4.38,11.3,,,,,,,,,,, diff --git a/docs/source/asr/data/scores/en/squeezeformer_en.csv b/docs/source/asr/data/scores/en/squeezeformer_en.csv new file mode 100644 index 0000000000000000000000000000000000000000..fdbd9bd99665c41235c1e60f9a0f8c4940b432a6 --- /dev/null +++ b/docs/source/asr/data/scores/en/squeezeformer_en.csv @@ -0,0 +1,7 @@ +Model Name,Language,Librispeech Dev-Clean,Librispeech Dev-Other,Librispeech Test-Clean,Librispeech Test-Other,MCV Test-Set v8.0 (en),MLS Dev (en),MLS Test (en),NSC Part1,NSC Part6,Peoples Speech Test v1,SLR 83 Test,WSJ Dev 93,WSJ Eval 92 +stt_en_squeezeformer_ctc_xsmall_ls,en,3.6 %,9.7 %,3.8 %,9.4 %,,,,,,,,, +stt_en_squeezeformer_ctc_small_ls,en,2.9 %,7.4 %,3.1 %,7.4 %,,,,,,,,, +stt_en_squeezeformer_ctc_small_medium_ls,en,2.7 %,7.0 %,2.8 %,7.1 %,,,,,,,,, +stt_en_squeezeformer_ctc_medium_ls,en,2.4 %,6.2 %,2.6 %,6.3 %,,,,,,,,, +stt_en_squeezeformer_ctc_medium_large_ls,en,2.3 %,6.0 %,2.5 %,5.9 %,,,,,,,,, +stt_en_squeezeformer_ctc_large_ls,en,2.3 %,5.7 %,2.4 %,5.7 %,,,,,,,,, diff --git a/docs/source/asr/data/scores/enes/conformer_enes.csv b/docs/source/asr/data/scores/enes/conformer_enes.csv new file mode 100644 index 0000000000000000000000000000000000000000..983e664d4de1b60053cb43a039b0c4ceafe10a00 --- /dev/null +++ b/docs/source/asr/data/scores/enes/conformer_enes.csv @@ -0,0 +1,5 @@ +Model Name,Language,Fisher-Dev-En,Fisher-Dev-Es,Fisher-Test-En,Fisher-Test-Es,Librispeech Dev-Clean,Librispeech Dev-Other,Librispeech Test-Clean,Librispeech Test-Other,MCV Dev-Set v7.0 (en),MCV Dev-Set v7.0 (es),MCV Test-Set v7.0 (en),MCV Test-Set v7.0 (es),MLS Dev (en),MLS Dev (es),MLS Test (en),MLS Test (es),VoxPopuli Dev (en),VoxPopuli Dev (es),VoxPopuli Test (en),VoxPopuli Test (es) +stt_enes_conformer_ctc_large,enes,,16.7 %,,,2.2 %,5.5 %,2.6 %,5.5 %,5.8 %,,,,,3.5 %,,,,5.7 %,, +stt_enes_conformer_ctc_large_codesw,enes,,16.51 %,,16.31 %,2.22 %,5.36 %,2.55 %,5.38 %,,5.00 %,,5.51 %,,3.46 %,,3.73 %,,5.58 %,,6.63 % +stt_enes_conformer_transducer_large,enes,,16.2 %,,,2.0 %,4.6 %,2.2 %,4.6 %,5.0 %,,,,,3.3 %,,,,5.3 %,, +stt_enes_conformer_transducer_large_codesw,enes,15.70 %,,15.66 %,,1.97 %,4.54 %,2.17 %,4.53 %,4.51 %,,5.06 %,,3.27 %,,3.67 %,,5.28 %,,6.54 %, diff --git 
a/docs/source/asr/data/scores/enes/contextnet_enes.csv b/docs/source/asr/data/scores/enes/contextnet_enes.csv new file mode 100644 index 0000000000000000000000000000000000000000..72a895303bbbd8b43fd94d2f66a93753b28af9e5 --- /dev/null +++ b/docs/source/asr/data/scores/enes/contextnet_enes.csv @@ -0,0 +1,2 @@ +Model Name,Language,Fisher-Dev-En,Fisher-Dev-Es,Fisher-Test-En,Fisher-Test-Es,Librispeech Dev-Clean,Librispeech Dev-Other,Librispeech Test-Clean,Librispeech Test-Other,MCV Dev-Set v7.0 (en),MCV Dev-Set v7.0 (es),MCV Test-Set v7.0 (en),MCV Test-Set v7.0 (es),MLS Dev (en),MLS Dev (es),MLS Test (en),MLS Test (es),VoxPopuli Dev (en),VoxPopuli Dev (es),VoxPopuli Test (en),VoxPopuli Test (es) +stt_enes_contextnet_large,enes,,14.8 %,,,2.2 %,5.6 %,2.3 %,5.5 %,4.7 %,,,,,3.0 %,,,,5.0 %,, diff --git a/docs/source/asr/data/scores/eo/conformer_eo.csv b/docs/source/asr/data/scores/eo/conformer_eo.csv new file mode 100644 index 0000000000000000000000000000000000000000..f77d4c0eadfcc5d7b954a06df2036cfbbcf50f3d --- /dev/null +++ b/docs/source/asr/data/scores/eo/conformer_eo.csv @@ -0,0 +1,3 @@ +Model Name,Language,MCV Dev-Set v11.0 (eo),MCV Test-Set v11.0 (eo) +stt_eo_conformer_ctc_large,eo,2.9 %,4.8 % +stt_eo_conformer_transducer_large,eo,2.4 %,4.0 % diff --git a/docs/source/asr/data/scores/es/citrinet_es.csv b/docs/source/asr/data/scores/es/citrinet_es.csv new file mode 100644 index 0000000000000000000000000000000000000000..9311fb2b04fd18382ad0bd93e7f2ea5eeb34684b --- /dev/null +++ b/docs/source/asr/data/scores/es/citrinet_es.csv @@ -0,0 +1,3 @@ +Model Name,Language,Call Home Dev Test (es),Call Home Eval Test (es),Call Home Train (es),Fisher Dev Set (es),Fisher Test Set (es),MCV Dev-Set (v??) (es),MCV Dev-Set v7.0 (es),MCV Test-Set (v??) (es),MCV Test-Set v7.0 (es),MLS Dev (en),MLS Test (en),VoxPopuli Dev (es),VoxPopuli Test (es) +stt_es_citrinet_512,es,,,,,,9.1 % WER,,10.3 % WER,,4.9 % WER,5.2 % WER,, +stt_es_citrinet_1024_gamma_0_25,es,19.9 %,21.3 %,19.1 %,15.8 %,15.9 %,,6.1 %,,6.8 %,3.5 %,4.1 %,5.6 %,7.0 % diff --git a/docs/source/asr/data/scores/es/conformer_es.csv b/docs/source/asr/data/scores/es/conformer_es.csv new file mode 100644 index 0000000000000000000000000000000000000000..10b28dc49f4efddb94c9be7b1e4d83031878b627 --- /dev/null +++ b/docs/source/asr/data/scores/es/conformer_es.csv @@ -0,0 +1,3 @@ +Model Name,Language,Call Home Dev Test (es),Call Home Eval Test (es),Call Home Train (es),Fisher Dev Set (es),Fisher Test Set (es),MCV Dev-Set (v??) (es),MCV Dev-Set v7.0 (es),MCV Test-Set (v??) (es),MCV Test-Set v7.0 (es),MLS Dev (en),MLS Test (en),VoxPopuli Dev (es),VoxPopuli Test (es) +stt_es_conformer_ctc_large,es,23.7 %,25.3 %,22.4 %,18.3 %,18.5 %,,6.3 %,,6.9 %,4.3 %,4.2 %,6.1 %,7.5 % +stt_es_conformer_transducer_large,es,18.0 %,19.4 %,17.2 %,14.7 %,14.8 %,,4.6 %,,5.2 %,2.7 %,3.2 %,4.7 %,6.0 % diff --git a/docs/source/asr/data/scores/es/contextnet_es.csv b/docs/source/asr/data/scores/es/contextnet_es.csv new file mode 100644 index 0000000000000000000000000000000000000000..ec20b5708d934b1377ffbda0537934f60cf0766e --- /dev/null +++ b/docs/source/asr/data/scores/es/contextnet_es.csv @@ -0,0 +1,2 @@ +Model Name,Language,Call Home Dev Test (es),Call Home Eval Test (es),Call Home Train (es),Fisher Dev Set (es),Fisher Test Set (es),MCV Dev-Set (v??) (es),MCV Dev-Set v7.0 (es),MCV Test-Set (v??) 
(es),MCV Test-Set v7.0 (es),MLS Dev (en),MLS Test (en),VoxPopuli Dev (es),VoxPopuli Test (es) +stt_es_contextnet_1024,es,19.1 %,20.7 %,18.2 %,15.3 %,15.1 %,,4.8 %,,5.2 %,3.1 %,3.5 %,5.1 %,6.2 % diff --git a/docs/source/asr/data/scores/es/quartznet15x5_es.csv b/docs/source/asr/data/scores/es/quartznet15x5_es.csv new file mode 100644 index 0000000000000000000000000000000000000000..79de5ce952d825f3c6580337ffaacf358267dd73 --- /dev/null +++ b/docs/source/asr/data/scores/es/quartznet15x5_es.csv @@ -0,0 +1,2 @@ +Model Name,Language,Call Home Dev Test (es),Call Home Eval Test (es),Call Home Train (es),Fisher Dev Set (es),Fisher Test Set (es),MCV Dev-Set (v??) (es),MCV Dev-Set v7.0 (es),MCV Test-Set (v??) (es),MCV Test-Set v7.0 (es),MLS Dev (en),MLS Test (en),VoxPopuli Dev (es),VoxPopuli Test (es) +stt_es_quartznet15x5,es,,,,,,12.97,,,,,,, diff --git a/docs/source/asr/data/scores/fr/citrinet_fr.csv b/docs/source/asr/data/scores/fr/citrinet_fr.csv new file mode 100644 index 0000000000000000000000000000000000000000..651dcb8494400ac5053dff0280ed10ab88350abf --- /dev/null +++ b/docs/source/asr/data/scores/fr/citrinet_fr.csv @@ -0,0 +1,2 @@ +Model Name,Language,MCV Dev-Set (v??) (fr),MCV Dev-Set v7.0 (fr),MCV Dev-Set v7.0 (fr) (No Hyphen),MCV Test-Set v7.0 (fr),MCV Test-Set v7.0 (fr) (No Hyphen),MLS Dev (en),MLS Dev (en) (No Hyphen),MLS Test (en),MLS Test (en) (No Hyphen) +stt_fr_citrinet_1024_gamma_0_25,fr,,10.76,9.90,12.20,11.11,6.66,6.19,5.53,5.12 diff --git a/docs/source/asr/data/scores/fr/conformer_fr.csv b/docs/source/asr/data/scores/fr/conformer_fr.csv new file mode 100644 index 0000000000000000000000000000000000000000..8f74dfe8cae047091e8c7f6081dcc24aca37df10 --- /dev/null +++ b/docs/source/asr/data/scores/fr/conformer_fr.csv @@ -0,0 +1,3 @@ +Model Name,Language,MCV Dev-Set (v??) (fr),MCV Dev-Set v7.0 (fr),MCV Dev-Set v7.0 (fr) (No Hyphen),MCV Test-Set v7.0 (fr),MCV Test-Set v7.0 (fr) (No Hyphen),MLS Dev (en),MLS Dev (en) (No Hyphen),MLS Test (en),MLS Test (en) (No Hyphen) +stt_fr_conformer_ctc_large,fr,,8.35,7.88,9.63,9.01,5.88,5.90,4.91,4.63 +stt_fr_conformer_transducer_large,fr,,6.85,,7.95,,5.05,,4.10, diff --git a/docs/source/asr/data/scores/fr/contextnet_fr.csv b/docs/source/asr/data/scores/fr/contextnet_fr.csv new file mode 100644 index 0000000000000000000000000000000000000000..71f601871d1522745af18714f8691675d1c4d468 --- /dev/null +++ b/docs/source/asr/data/scores/fr/contextnet_fr.csv @@ -0,0 +1,2 @@ +Model Name,Language,MCV Dev-Set (v??) (fr),MCV Dev-Set v7.0 (fr),MCV Dev-Set v7.0 (fr) (No Hyphen),MCV Test-Set v7.0 (fr),MCV Test-Set v7.0 (fr) (No Hyphen),MLS Dev (en),MLS Dev (en) (No Hyphen),MLS Test (en),MLS Test (en) (No Hyphen) +stt_fr_contextnet_1024,fr,,8.32,,9.42,,6.02,,5.01, diff --git a/docs/source/asr/data/scores/fr/quartznet15x5_fr.csv b/docs/source/asr/data/scores/fr/quartznet15x5_fr.csv new file mode 100644 index 0000000000000000000000000000000000000000..a30f447f42818eeb4e8e1e871e0d15dabe1955e1 --- /dev/null +++ b/docs/source/asr/data/scores/fr/quartznet15x5_fr.csv @@ -0,0 +1,2 @@ +Model Name,Language,MCV Dev-Set (v??) 
(fr),MCV Dev-Set v7.0 (fr),MCV Dev-Set v7.0 (fr) (No Hyphen),MCV Test-Set v7.0 (fr),MCV Test-Set v7.0 (fr) (No Hyphen),MLS Dev (en),MLS Dev (en) (No Hyphen),MLS Test (en),MLS Test (en) (No Hyphen) +stt_fr_quartznet15x5,fr,14.01,,,,,,,, diff --git a/docs/source/asr/data/scores/hr/conformer_hr.csv b/docs/source/asr/data/scores/hr/conformer_hr.csv new file mode 100644 index 0000000000000000000000000000000000000000..04383a14e88885d656067462657f392fcd7b67c9 --- /dev/null +++ b/docs/source/asr/data/scores/hr/conformer_hr.csv @@ -0,0 +1,3 @@ +Model Name,Language,ParlaSpeech Dev-Set v1.0 (hr),ParlaSpeech Test-Set v1.0 (hr) +stt_hr_conformer_ctc_large,hr,4.43,4.70 +stt_hr_conformer_transducer_large,hr,4.56,4.69 diff --git a/docs/source/asr/data/scores/it/conformer_it.csv b/docs/source/asr/data/scores/it/conformer_it.csv new file mode 100644 index 0000000000000000000000000000000000000000..3e3854eb862ae540734071811662c9c4ef712e11 --- /dev/null +++ b/docs/source/asr/data/scores/it/conformer_it.csv @@ -0,0 +1,3 @@ +Model Name,Language,MCV Dev-Set (v??) (it),MCV Dev-Set v11.0 (it),MCV Test-Set v11.0 (it),MLS Dev (en),MLS Test (en),VoxPopuli Dev (it),VoxPopuli Test (it) +stt_it_conformer_ctc_large,it,,5.38,5.92,13.16,10.62,13.43,16.75 +stt_it_conformer_transducer_large,it,,4.80,5.24,14.62,12.18,12.00,15.15 diff --git a/docs/source/asr/data/scores/it/quartznet15x5_it.csv b/docs/source/asr/data/scores/it/quartznet15x5_it.csv new file mode 100644 index 0000000000000000000000000000000000000000..475058e38bc019c1aac22e641179357541563e76 --- /dev/null +++ b/docs/source/asr/data/scores/it/quartznet15x5_it.csv @@ -0,0 +1,2 @@ +Model Name,Language,MCV Dev-Set (v??) (it),MCV Dev-Set v11.0 (it),MCV Test-Set v11.0 (it),MLS Dev (en),MLS Test (en),VoxPopuli Dev (it),VoxPopuli Test (it) +stt_it_quartznet15x5,it,15.22,,,,,, diff --git a/docs/source/asr/data/scores/kab/conformer_kab.csv b/docs/source/asr/data/scores/kab/conformer_kab.csv new file mode 100644 index 0000000000000000000000000000000000000000..9db989dc23778d02a380c391f9c78c7b0b0694a8 --- /dev/null +++ b/docs/source/asr/data/scores/kab/conformer_kab.csv @@ -0,0 +1,2 @@ +Model Name,Language,MCV Test-Set v10.0 (kab) +stt_kab_conformer_transducer_large,kab,18.86 diff --git a/docs/source/asr/data/scores/pl/quartznet15x5_pl.csv b/docs/source/asr/data/scores/pl/quartznet15x5_pl.csv new file mode 100644 index 0000000000000000000000000000000000000000..5692e36037ac7ec8e9284debb300256f1cbdf642 --- /dev/null +++ b/docs/source/asr/data/scores/pl/quartznet15x5_pl.csv @@ -0,0 +1,2 @@ +Model Name,Language,MCV Dev-Set (v??) (pl) +stt_pl_quartznet15x5,pl,14 diff --git a/docs/source/asr/data/scores/ru/conformer_ru.csv b/docs/source/asr/data/scores/ru/conformer_ru.csv new file mode 100644 index 0000000000000000000000000000000000000000..a4f2c20a2726a8cd55a79578edf4497dee88838b --- /dev/null +++ b/docs/source/asr/data/scores/ru/conformer_ru.csv @@ -0,0 +1,3 @@ +Model Name,Language,GOLOS Crowd Test-Set (v??) (ru),GOLOS Farfield Test-Set (v??) (ru),Librispeech Test,MCV Dev-Set (v??) 
(ru),MCV Dev-Set v10.0 (ru),MCV Test-Set v10.0 (ru) +stt_ru_conformer_ctc_large,ru,2.8 %,7.1 %,13.5 %,,3.9 %,4.3 % +stt_ru_conformer_transducer_large,ru,2.7%,7.6%,12.0%,,3.5%,4.0% diff --git a/docs/source/asr/data/scores/ru/quartznet15x5_ru.csv b/docs/source/asr/data/scores/ru/quartznet15x5_ru.csv new file mode 100644 index 0000000000000000000000000000000000000000..db86ab2e8b6bf179d9c94659ba24f2231149b4fc --- /dev/null +++ b/docs/source/asr/data/scores/ru/quartznet15x5_ru.csv @@ -0,0 +1,2 @@ +Model Name,Language,GOLOS Crowd Test-Set (v??) (ru),GOLOS Farfield Test-Set (v??) (ru),Librispeech Test,MCV Dev-Set (v??) (ru),MCV Dev-Set v10.0 (ru),MCV Test-Set v10.0 (ru) +stt_ru_quartznet15x5,ru,,,,16.23,, diff --git a/docs/source/asr/data/scores/rw/conformer_rw.csv b/docs/source/asr/data/scores/rw/conformer_rw.csv new file mode 100644 index 0000000000000000000000000000000000000000..e5544a8067d55fc7d9753391e7dfefa90cca5336 --- /dev/null +++ b/docs/source/asr/data/scores/rw/conformer_rw.csv @@ -0,0 +1,3 @@ +Model Name,Language,MCV Test-Set v9.0 (rw) +stt_rw_conformer_ctc_large,rw,18.2 % +stt_rw_conformer_transducer_large,rw,16.2 % diff --git a/docs/source/asr/data/scores/zh/citrinet_zh.csv b/docs/source/asr/data/scores/zh/citrinet_zh.csv new file mode 100644 index 0000000000000000000000000000000000000000..2ad05e0233e1a103dbef05728d49535d70dcdbc5 --- /dev/null +++ b/docs/source/asr/data/scores/zh/citrinet_zh.csv @@ -0,0 +1,3 @@ +Model Name,Language,AIShell Dev-Android v2,AIShell Dev-Ios v1,AIShell Dev-Ios v2,AIShell Dev-Mic v2,AIShell Test-Android v2,AIShell Test-Ios v1,AIShell Test-Ios v2,AIShell Test-Mic v2 +stt_zh_citrinet_512,zh,,6.25%,,,,6.44%,, +stt_zh_citrinet_1024_gamma_0_25,zh,5.2 %,,4.8 %,5.2 %,5.5 %,,5.1 %,5.5 % diff --git a/docs/source/asr/data/scores/zh/conformer_zh.csv b/docs/source/asr/data/scores/zh/conformer_zh.csv new file mode 100644 index 0000000000000000000000000000000000000000..8d0ef96dc8d9d7766eb13fc7c6d08717c129132a --- /dev/null +++ b/docs/source/asr/data/scores/zh/conformer_zh.csv @@ -0,0 +1,2 @@ +Model Name,Language,AIShell Dev-Android v2,AIShell Dev-Ios v1,AIShell Dev-Ios v2,AIShell Dev-Mic v2,AIShell Test-Android v2,AIShell Test-Ios v1,AIShell Test-Ios v2,AIShell Test-Mic v2 +stt_zh_conformer_transducer_large,zh,3.4,,3.2,3.4,3.4,,3.2,3.4 diff --git a/docs/source/asr/datasets.rst b/docs/source/asr/datasets.rst new file mode 100644 index 0000000000000000000000000000000000000000..b55e49ad1c8ff9b3bef31f6a5a97004d8aacee0b --- /dev/null +++ b/docs/source/asr/datasets.rst @@ -0,0 +1,484 @@ +Datasets +======== + +NeMo has scripts to convert several common ASR datasets into the format expected by the ``nemo_asr`` collection. You can get started +with those datasets by following the instructions to run those scripts in the section appropriate to each dataset below. + +If the user has their own data and want to preprocess it to use with NeMo ASR models, refer to the `Preparing Custom ASR Data`_ section. + +If the user already has a dataset that you want to convert to a tarred format, refer to the `Tarred Datasets`_ section. + +.. _LibriSpeech_dataset: + +LibriSpeech +----------- + +Run the following scripts to download the LibriSpeech data and convert it into the format expected by `nemo_asr`. At least 250GB free +space is required. + +.. 
code-block:: bash + + # install sox + sudo apt-get install sox + mkdir data + python get_librispeech_data.py --data_root=data --data_set=ALL + +After this, the ``data`` folder should contain wav files and ``.json`` manifests for NeMo ASR datalayer. + +Each line is a training example. ``audio_filepath`` contains the path to the wav file, ``duration`` is the duration in seconds, and ``text`` is the transcript: + +.. code-block:: json + + {"audio_filepath": "/1355-39947-0000.wav", "duration": 11.3, "text": "psychotherapy and the community both the physician and the patient find their place in the community the life interests of which are superior to the interests of the individual"} + {"audio_filepath": "/1355-39947-0001.wav", "duration": 15.905, "text": "it is an unavoidable question how far from the higher point of view of the social mind the psychotherapeutic efforts should be encouraged or suppressed are there any conditions which suggest suspicion of or direct opposition to such curative work"} + +Fisher English Training Speech +------------------------------ + +Run these scripts to convert the Fisher English Training Speech data into a format expected by the ``nemo_asr`` collection. + +In brief, the following scripts convert the ``.sph`` files to ``.wav``, slices those files into smaller audio samples, matches the +smaller slices with their corresponding transcripts, and splits the resulting audio segments into train, validation, and test sets +(with one manifest each). + +.. note:: + - 106 GB of space is required to run the ``.wav`` conversion + - additional 105 GB is required for the slicing and matching + - ``sph2pipe`` is required in order to run the ``.wav`` conversion + +**Instructions** + +The following scripts assume that you already have the Fisher dataset from the Linguistic Data Consortium, with a directory structure +that looks similar to the following: + +.. code-block:: bash + + FisherEnglishTrainingSpeech/ + ├── LDC2004S13-Part1 + │   ├── fe_03_p1_transcripts + │   ├── fisher_eng_tr_sp_d1 + │   ├── fisher_eng_tr_sp_d2 + │   ├── fisher_eng_tr_sp_d3 + │   └── ... + └── LDC2005S13-Part2 + ├── fe_03_p2_transcripts + ├── fe_03_p2_sph1 + ├── fe_03_p2_sph2 + ├── fe_03_p2_sph3 + └── ... + +The transcripts that will be used are located in the ``fe_03_p<1,2>_transcripts/data/trans`` directory. The audio files (``.sph``) +are located in the remaining directories in an ``audio`` subdirectory. + +#. Convert the audio files from ``.sph`` to ``.wav`` by running: + + .. code-block:: bash + + cd /scripts/dataset_processing + python fisher_audio_to_wav.py \ + --data_root= --dest_root= + + This will place the unsliced ``.wav`` files in ``/LDC200[4,5]S13-Part[1,2]/audio-wav/``. It will take several + minutes to run. + +#. Process the transcripts and slice the audio data. + + .. code-block:: bash + + python process_fisher_data.py \ + --audio_root= --transcript_root= \ + --dest_root= \ + --remove_noises + + This script splits the full dataset into train, validation, test sets, and places the audio slices in the corresponding folders + in the destination directory. One manifest is written out per set, which includes each slice's transcript, duration, and path. + + This will likely take around 20 minutes to run. Once finished, delete the 10 minute long ``.wav`` files. + +2000 HUB5 English Evaluation Speech +----------------------------------- + +Run the following script to convert the HUB5 data into a format expected by the ``nemo_asr`` collection. 
+ +Similarly, to the Fisher dataset processing scripts, this script converts the ``.sph`` files to ``.wav``, slices the audio files and +transcripts into utterances, and combines them into segments of some minimum length (default is 10 seconds). The resulting segments +are all written out to an audio directory and the corresponding transcripts are written to a manifest JSON file. + +.. note:: + - 5 GB of free space is required to run this script + - ``sph2pipe`` is also required to be installed + +This script assumes you already have the 2000 HUB5 dataset from the Linguistic Data Consortium. + +Run the following command to process the 2000 HUB5 English Evaluation Speech samples: + +.. code-block:: bash + + python process_hub5_data.py \ + --data_root= \ + --dest_root= + +You can optionally include ``--min_slice_duration=`` if you would like to change the minimum audio segment duration. + +AN4 Dataset +----------- + +This is a small dataset recorded and distributed by Carnegie Mellon University. It consists of recordings of people spelling out +addresses, names, etc. Information about this dataset can be found on the `official CMU site `_. + +#. `Download and extract the dataset `_ (which is labeled "NIST's Sphere audio (.sph) format (64M)". + +#. Convert the ``.sph`` files to ``.wav`` using sox, and build one training and one test manifest. + + .. code-block:: bash + + python process_an4_data.py --data_root= + +After the script finishes, the ``train_manifest.json`` and ``test_manifest.json`` can be found in the ``/an4/`` directory. + +Aishell-1 +--------- + +To download the Aishell-1 data and convert it into a format expected by ``nemo_asr``, run: + +.. code-block:: bash + + # install sox + sudo apt-get install sox + mkdir data + python get_aishell_data.py --data_root=data + +After the script finishes, the ``data`` folder should contain a ``data_aishell`` folder which contains a wav file, a transcript folder, and related ``.json`` and ``vocab.txt`` files. + +Aishell-2 +--------- + +To process the AIShell-2 dataset, in the command below, set the data folder of AIShell-2 using ``--audio_folder`` and where to push +these files using ``--dest_folder``. In order to generate files in the supported format of ``nemo_asr``, run: + +.. code-block:: bash + + python process_aishell2_data.py --audio_folder= --dest_folder= + +After the script finishes, the ``train.json``, ``dev.json``, ``test.json``, and ``vocab.txt`` files can be found in the ``dest_folder`` directory. + +Preparing Custom ASR Data +------------------------- + +The ``nemo_asr`` collection expects each dataset to consist of a set of utterances in individual audio files plus +a manifest that describes the dataset, with information about one utterance per line (``.json``). +The audio files can be of any format supported by `Pydub `_, though we recommend +WAV files as they are the default and have been most thoroughly tested. + +There should be one manifest file per dataset that will be passed in, therefore, if the user wants separate training and validation +datasets, they should also have separate manifests. Otherwise, they will be loading validation data with their training data and vice +versa. + +Each line of the manifest should be in the following format: + +.. code-block:: json + + {"audio_filepath": "/path/to/audio.wav", "text": "the transcription of the utterance", "duration": 23.147} + +The :code:`audio_filepath` field should provide an absolute path to the ``.wav`` file corresponding to the utterance. 
+The :code:`text` field should contain the full transcript for the utterance, and the :code:`duration` field should +reflect the duration of the utterance in seconds. + +Each entry in the manifest (describing one audio file) should be bordered by '{' and '}' and must +be contained on one line. The fields that describe the file should be separated by commas, and have the form :code:`"field_name": value`, +as shown above. There should be no extra lines in the manifest, i.e. there should be exactly as many lines in the manifest as +there are audio files in the dataset. + +Since the manifest specifies the path for each utterance, the audio files do not have to be located +in the same directory as the manifest, or even in any specific directory structure. + +Once there is a manifest that describes each audio file in the dataset, use the dataset by passing +in the manifest file path in the experiment config file, e.g. as ``training_ds.manifest_filepath=``. + +Tarred Datasets +--------------- + +If experiments are run on a cluster with datasets stored on a distributed file system, the user will likely +want to avoid constantly reading multiple small files and would prefer tarring their audio files. +There are tarred versions of some NeMo ASR dataset classes for this case, such as the ``TarredAudioToCharDataset`` +(corresponding to the ``AudioToCharDataset``) and the ``TarredAudioToBPEDataset`` (corresponding to the +``AudioToBPEDataset``). The tarred audio dataset classes in NeMo use `WebDataset `_. + +To use an existing tarred dataset instead of a non-tarred dataset, set ``is_tarred: true`` in +the experiment config file. Then, pass in the paths to all of the audio tarballs in ``tarred_audio_filepaths``, either as a list +of filepaths, e.g. ``['/data/shard1.tar', '/data/shard2.tar']``, or in a single brace-expandable string, e.g. +``'/data/shard_{1..64}.tar'`` or ``'/data/shard__OP_1..64_CL_'`` (recommended, see note below). + +.. note:: + For brace expansion, there may be cases where ``{x..y}`` syntax cannot be used due to shell interference. This occurs most commonly + inside SLURM scripts. Therefore, we provide a few equivalent replacements. Supported opening braces (equivalent to ``{``) are ``(``, + ``[``, ``<`` and the special tag ``_OP_``. Supported closing braces (equivalent to ``}``) are ``)``, ``]``, ``>`` and the special + tag ``_CL_``. For SLURM based tasks, we suggest the use of the special tags for ease of use. + +As with non-tarred datasets, the manifest file should be passed in ``manifest_filepath``. The dataloader assumes that the length +of the manifest after filtering is the correct size of the dataset for reporting training progress. + +The ``tarred_shard_strategy`` field of the config file can be set if you have multiple shards and are running an experiment with +multiple workers. It defaults to ``scatter``, which preallocates a set of shards per worker which do not change during runtime. +Note that this strategy, on specific occasions (when the number of shards is not divisible with ``world_size``), will not sample +the entire dataset. As an alternative the ``replicate`` strategy, will preallocate the entire set of shards to every worker and not +change it during runtime. The benefit of this strategy is that it allows each worker to sample data points from the entire dataset +independently of others. Note, though, that more than one worker may sample the same shard, and even sample the same data points! 
+As such, there is no assured guarantee that all samples in the dataset will be sampled at least once during one epoch. Note that +for these reasons it is not advisable to use tarred datasets as validation and test datasets. + +For more information about the individual tarred datasets and the parameters available, including shuffling options, +see the corresponding class APIs in the `Datasets <./api.html#Datasets>`__ section. + +.. warning:: + If using multiple workers, the number of shards should be divisible by the world size to ensure an even + split among workers. If it is not divisible, a warning will be logged and training will proceed, but it will likely hang at the last epoch. + In addition, if using distributed processing, each shard must have the same number of entries after filtering is + applied such that each worker ends up with the same number of files. We currently do not check for this in any dataloader, but the user's + program may hang if the shards are uneven. + +Conversion to Tarred Datasets +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You can easily convert your existing NeMo-compatible ASR datasets using the +`conversion script here `_. + +.. code:: bash + + python convert_to_tarred_audio_dataset.py \ + --manifest_path= \ + --target_dir= \ + --num_shards= \ + --max_duration= \ + --min_duration= \ + --shuffle --shuffle_seed=0 + +This script shuffles the entries in the given manifest (if ``--shuffle`` is set, which we recommend), filters +audio files according to ``min_duration`` and ``max_duration``, and tars the remaining audio files to the directory +``--target_dir`` in ``n`` shards, along with separate manifest and metadata files. + +The files in the target directory should look similar to the following: + +.. code:: + + target_dir/ + ├── audio_1.tar + ├── audio_2.tar + ├── ... + ├── metadata.yaml + └── tarred_audio_manifest.json + +Note that file structures are flattened such that all audio files are at the top level in each tarball. This ensures that +filenames are unique in the tarred dataset and that the filepaths do not contain "-sub" directories; the forward slashes in each ``audio_filepath`` are +simply converted to underscores. For example, a manifest entry for ``/data/directory1/file.wav`` would be ``_data_directory1_file.wav`` +in the tarred dataset manifest, and ``/data/directory2/file.wav`` would be converted to ``_data_directory2_file.wav``. + +Bucketing Datasets +------------------ + +For training ASR models, audios with different lengths may be grouped into a batch, which makes it necessary to use padding to make them all the same length. +This extra padding is a significant source of wasted computation. Splitting the training samples into buckets with different lengths and sampling from the same bucket for each batch increases computation efficiency. +It may result in a training speedup of more than 2x. To enable and use the bucketing feature, you need to create the bucketing version of the dataset by using the `conversion script here `_. +You may use ``--buckets_num`` to specify the number of buckets (4 to 8 buckets are recommended). The script creates multiple tarred datasets, one per bucket, based on the audio durations. The range of [min_duration, max_duration) is split into equal-sized buckets. + + +To enable the bucketing feature in the dataset section of the config files, you need to pass the multiple tarred datasets as a list of lists. +If you pass just a list of strings, the datasets simply get concatenated, which is different from bucketing.
+Here is an example for 4 buckets and 512 shards: + +.. code:: + + python speech_to_text_bpe.py + ... + model.train_ds.manifest_filepath=[[PATH_TO_TARS/bucket1/tarred_audio_manifest.json], + [PATH_TO_TARS/bucket2/tarred_audio_manifest.json], + [PATH_TO_TARS/bucket3/tarred_audio_manifest.json], + [PATH_TO_TARS/bucket4/tarred_audio_manifest.json]] + model.train_ds.tarred_audio_filepaths=[[PATH_TO_TARS/bucket1/audio__OP_0..511_CL_.tar], + [PATH_TO_TARS/bucket2/audio__OP_0..511_CL_.tar], + [PATH_TO_TARS/bucket3/audio__OP_0..511_CL_.tar], + [PATH_TO_TARS/bucket4/audio__OP_0..511_CL_.tar]] + +When bucketing is enabled, all GPUs first use the first bucket in each epoch, then move on to the second bucket, and so on. This guarantees that all GPUs are using the same bucket at the same time. It reduces the amount of padding in each batch and speeds up training significantly without noticeably hurting the accuracy. + +There are two types of batching: + +* Fixed-size bucketing: all batches have the same number of samples, specified by ``train_ds.batch_size`` +* Adaptive-size bucketing: uses different batch sizes for each bucket. + +Adaptive-size bucketing helps to increase GPU utilization and speed up training: batches sampled from buckets with shorter audio lengths can be larger, which increases GPU utilization. +You may use ``train_ds.bucketing_batch_size`` to enable adaptive batching and specify the batch sizes for the buckets. +When ``bucketing_batch_size`` is not set, ``train_ds.batch_size`` is used for all buckets (fixed-size bucketing). + +``bucketing_batch_size`` can be set as an integer or a list of integers to explicitly specify the batch size for each bucket. +If ``bucketing_batch_size`` is set to an integer, linear scaling is used to scale up the batch sizes for buckets with shorter audio. For example, setting ``train_ds.bucketing_batch_size=8`` for 4 buckets would use the batch sizes [32,24,16,8] for the different buckets. +When ``bucketing_batch_size`` is set, ``train_ds.batch_size`` needs to be set to 1. + +Training an ASR model on audios sorted by length may affect the accuracy of the model, so we introduced some strategies to mitigate this. +We support three bucketing strategies: + +* fixed_order: the same order of buckets is used for all epochs +* synced_randomized (default): the order of the buckets is shuffled every epoch, so each epoch uses a different order. +* fully_randomized: similar to synced_randomized, but each GPU has its own random order, so GPUs are not synced. + +The parameter ``train_ds.bucketing_strategy`` can be set to specify one of these strategies. The recommended strategy is synced_randomized, which gives the highest training speedup. +The fully_randomized strategy has a lower speedup than synced_randomized but may give better accuracy. + +Bucketing may improve the training speed by more than 2x but may slightly affect the final accuracy of the model. Training for more epochs and using the 'synced_randomized' strategy help to close this gap. +Currently, the bucketing feature is only supported for tarred datasets. + +Upsampling Datasets +------------------- + +Buckets may also be 'weighted' to allow multiple runs through a target dataset during each training epoch. This can be beneficial in cases when a dataset is composed of several component sets of unequal sizes and one desires to mitigate bias towards the larger sets through oversampling. + +Weighting is managed with the `bucketing_weights` parameter.
After passing your composite tarred datasets in the format described above for bucketing, pass a list of integers (one per bucket) to indicate how many times a manifest should be read during training. + +For example, by passing `[2,1,1,3]` to the code below: + +.. code:: + + python speech_to_text_bpe.py + ... + model.train_ds.manifest_filepath=[[PATH_TO_TARS/bucket1/tarred_audio_manifest.json], + [PATH_TO_TARS/bucket2/tarred_audio_manifest.json], + [PATH_TO_TARS/bucket3/tarred_audio_manifest.json], + [PATH_TO_TARS/bucket4/tarred_audio_manifest.json]] + model.train_ds.tarred_audio_filepaths=[[PATH_TO_TARS/bucket1/audio__OP_0..511_CL_.tar], + [PATH_TO_TARS/bucket2/audio__OP_0..511_CL_.tar], + [PATH_TO_TARS/bucket3/audio__OP_0..511_CL_.tar], + [PATH_TO_TARS/bucket4/audio__OP_0..511_CL_.tar]] + ... + model.train_ds.bucketing_weights=[2,1,1,3] + +NeMo will configure training so that all data in `bucket1` will be present twice in a training epoch, `bucket4` will be present three times, and that of `bucket2` and `bucket3` will occur only once each. Note that this will increase the effective amount of data present during training and thus affect training time per epoch. + +If using adaptive bucketing, note that the same batch size will be assigned to each instance of the upsampled data. That is, given the following: + +.. code:: + + python speech_to_text_bpe.py + ... + model.train_ds.manifest_filepath=[[PATH_TO_TARS/bucket1/tarred_audio_manifest.json], + [PATH_TO_TARS/bucket2/tarred_audio_manifest.json], + [PATH_TO_TARS/bucket3/tarred_audio_manifest.json], + [PATH_TO_TARS/bucket4/tarred_audio_manifest.json]] + ... + ... + model.train_ds.bucketing_weights=[2,1,1,3] + model.train_ds.bucketing_batch_size=[4,4,4,2] + +All instances of data from `bucket4` will still be trained with a batch size of 2 while all others would have a batch size of 4. As with standard bucketing, this requires `batch_size`` to be set to 1. +If `bucketing_batch_size` is not specified, all datasets will be passed with the same fixed batch size as specified by the `batch_size` parameter. + +It is recommended to set bucketing strategies to `fully_randomized` during multi-GPU training to prevent possible dataset bias during training. + + +Datasets on AIStore +------------------- + +`AIStore `_ is an open-source lightweight object storage system focused on large-scale deep learning. +AIStore is aimed to scale linearly with each added storage node, can be deployed on any Linux machine and can provide a unified namespace across multiple remote backends, such as Amazon S3, Google Cloud, and Microsoft Azure. +More details are provided in the `documentation `_ and the `repository `_ of the AIStore project. + +NeMo currently supports datasets from an AIStore bucket provider under ``ais://`` namespace. + +AIStore Setup +~~~~~~~~~~~~~ + +NeMo is currently relying on the AIStore (AIS) command-line interface (CLI) to handle the supported datasets. +The CLI is available in current NeMo Docker containers. +If necessary, the CLI can be configured using the instructions provided in `AIStore CLI `_ documentation. + +To start using the AIS CLI to access data on an AIS cluster, an endpoint needs to be configured. +The endpoint is configured by setting ``AIS_ENDPOINT`` environment variable before using the CLI + +.. code:: + + export AIS_ENDPOINT=http://hostname:port + ais --help + +In the above, ``hostname:port`` denotes the address of an AIS gateway. 
+For example, the address could be ``localhost:51080`` if testing using a local `minimal production-ready standalone Docker container `_. + +Dataset Setup +~~~~~~~~~~~~~ + +Currently, both tarred and non-tarred datasets are supported. +For any dataset, the corresponding manifest file is cached locally and processed as a regular manifest file. +For non-tarred datasets, the audio data is also cached locally. +For tarred datasets, shards from the AIS cluster are used by piping ``ais get`` to WebDataset. + +Tarred Dataset from AIS +^^^^^^^^^^^^^^^^^^^^^^^ + +A tarred dataset can be easily used as described in the :ref:`Tarred Datasets` section by providing paths to manifests on an AIS cluster. +For example, a tarred dataset from an AIS cluster can be configured as + +.. code:: + + manifest_filepath='ais://bucket/tarred_audio_manifest.json' + tarred_audio_filepaths='ais://bucket/shard_{1..64}.tar' + +:ref:`Bucketing Datasets` are configured in a similar way by providing paths on an AIS cluster. + +Non-tarred Dataset from AIS +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +A non-tarred dataset can be easily used by providing a manifest file path on an AIS cluster + +.. code:: + + manifest_filepath='ais://bucket/dataset_manifest.json' + +Note that it is assumed that the manifest file contains audio file paths relative to the manifest location. +For example, the manifest file may have lines in the following format + +.. code-block:: json + + {"audio_filepath": "path/to/audio.wav", "text": "transcription of the utterance", "duration": 23.147} + +The corresponding audio file would be downloaded from ``ais://bucket/path/to/audio.wav``. + +Cache configuration +^^^^^^^^^^^^^^^^^^^ + +Manifests and audio files from non-tarred datasets will be cached locally. +The location of the cache can be configured by setting two environment variables + +- ``NEMO_DATA_STORE_CACHE_DIR``: path to a location which can be used to cache the data +- ``NEMO_DATA_STORE_CACHE_SHARED``: flag to denote whether the cache location is shared between the compute nodes + +In a multi-node environment, the cache location may or may not be shared between the nodes. +This can be configured by setting ``NEMO_DATA_STORE_CACHE_SHARED`` to ``1`` when the location is shared between the nodes or to ``0`` when each node has a separate cache. + +When a globally shared cache is available, the data should be cached only once from the global rank zero node. +When a node-specific cache is used, the data should be cached only once by each local rank zero node. +To control this behavior using `torch.distributed.barrier`, instantiation of the corresponding dataloader needs to be deferred to ``ModelPT::setup``, to ensure a distributed environment has been initialized. +This can be achieved by setting ``defer_setup`` as + +.. code:: shell + + ++model.train_ds.defer_setup=true + ++model.validation_ds.defer_setup=true + ++model.test_ds.defer_setup=true + + +Complete Example +^^^^^^^^^^^^^^^^ + +An example using an AIS cluster at ``hostname:port`` with a tarred dataset for training, a non-tarred dataset for validation, and node-specific caching is given below + +.. code:: shell + + export AIS_ENDPOINT=http://hostname:port \ + && export NEMO_DATA_STORE_CACHE_DIR=/tmp \ + && export NEMO_DATA_STORE_CACHE_SHARED=0 \ + && python speech_to_text_bpe.py \ + ... 
+ model.train_ds.manifest_filepath=ais://train_bucket/tarred_audio_manifest.json \ + model.train_ds.tarred_audio_filepaths=ais://train_bucket/audio__OP_0..511_CL_.tar \ + ++model.train_ds.defer_setup=true \ + mode.validation_ds.manifest_filepath=ais://validation_bucket/validation_manifest.json \ + ++model.validation_ds.defer_setup=true \ No newline at end of file diff --git a/docs/source/asr/examples/kinyarwanda_asr.rst b/docs/source/asr/examples/kinyarwanda_asr.rst new file mode 100644 index 0000000000000000000000000000000000000000..bd1eac94e31f0a690cb2d6b9b65bc86921ecabce --- /dev/null +++ b/docs/source/asr/examples/kinyarwanda_asr.rst @@ -0,0 +1,631 @@ +######################################################################## +Example: Kinyarwanda ASR using Mozilla Common Voice Dataset +######################################################################## + +In this example, we describe essential steps of training an ASR model for a new language (Kinyarwanda). Namely, + +* Data preprocessing +* Building tokenizers +* Tarred datasets and bucketing +* Training from scratch and finetuning +* Inference and evaluation + + +************************** +Kinyarwanda Speech Dataset +************************** +We use `Mozilla Common Voice `_ dataset for Kinyarwanda which is a large dataset with 2000+ hours of audio data. + +**Note**: You should download this dataset by yourself. + +Mozilla distributes the dataset in tsv+mp3 format. +After downloading and unpacking, the dataset has the following structure + +.. code-block:: bash + + ├── cv-corpus-9.0-2022-04-27 + │ └── rw + │ ├── clips [here are all audio files, e.g. common_voice_rw_26260276.mp3] + │ ├── dev.tsv + │ ├── invalidated.tsv + │ ├── other.tsv + │ ├── reported.tsv + │ ├── test.tsv + │ ├── train.tsv + │ └── validated.tsv + +Mozilla provides **train/dev/test** split of the data, so we can just use it. +Let's look at the format of a .tsv file + +.. code-block:: bash + + head train.tsv + +.. code-block:: bash + + client_id path sentence up_votes down_votes age gender accents locale segment + e2a04c0ecacf81302f4270a3dddaa7a131420f6b7319208473af17d4adf3724ad9a3b6cdee107e2f321495db86f114a50c396e0928464a58dfad472130e7514a common_voice_rw_26273273.mp3 kandi tuguwe neza kugira ngo twakire amagambo y’ukuri, 2 0 twenties male rw + e2a04c0ecacf81302f4270a3dddaa7a131420f6b7319208473af17d4adf3724ad9a3b6cdee107e2f321495db86f114a50c396e0928464a58dfad472130e7514a common_voice_rw_26273478.mp3 Simbi na we akajya kwiga nubwo byari bigoye 2 0 twenties male rw + e2a04c0ecacf81302f4270a3dddaa7a131420f6b7319208473af17d4adf3724ad9a3b6cdee107e2f321495db86f114a50c396e0928464a58dfad472130e7514a common_voice_rw_26273483.mp3 Inshuti yanjye yaje kunsura ku biro byanjye. 2 0 twenties male rw + e2a04c0ecacf81302f4270a3dddaa7a131420f6b7319208473af17d4adf3724ad9a3b6cdee107e2f321495db86f114a50c396e0928464a58dfad472130e7514a common_voice_rw_26273488.mp3 Grand Canyon ni ahantu hazwi cyane ba mukerarugendo. 2 0 twenties male rw + +Each line corresponds to one record (usually one sentence) and contains: + +* name of the audio file +* corresponding transcription +* meta information: client_id, age, gender, etc. + + +Resampling and creating manifests +################################# + +To be able to use a dataset with NeMo Toolkit, we first need to + +* Convert *.tsv* files to *.json* manifests +* Convert *.mp3* files to *.wav* with sample rate of 16000 + +To convert a .tsv file to .json manifest, we used the following script + +.. 
code-block:: bash + + python tsv_to_json.py \ + --tsv=cv-corpus-9.0-2022-04-27/rw/train.tsv \ + --folder=cv-corpus-9.0-2022-04-27/rw/clips \ + --sampling_count=-1 + +**tsv_to_json.py**: + +.. code-block:: python + + import pandas as pd + import json + import tqdm + import argparse + + parser = argparse.ArgumentParser("MCV TSV-to-JSON converter") + parser.add_argument("--tsv", required=True, type=str, help="Input TSV file") + parser.add_argument("--sampling_count", required=True, type=int, help="Number of examples, you want, use -1 for all examples") + parser.add_argument("--folder", required=True, type=str, help="Relative path to folder with audio files") + args = parser.parse_args() + + df = pd.read_csv(args.tsv, sep='\t') + with open(args.tsv.replace('.tsv', '.json'), 'w') as fo: + mod = 1 + if args.sampling_count > 0: + mod = len(df) // args.sampling_count + for idx in tqdm.tqdm(range(len(df))): + if idx % mod != 0: + continue + item = { + 'audio_filepath': args.folder + "/" + df['path'][idx], + 'text': df['sentence'][idx], + 'up_votes': int(df['up_votes'][idx]), 'down_votes': int(df['down_votes'][idx]), + 'age': df['age'][idx], 'gender': df['gender'][idx], 'accents': df['accents'][idx], + 'client_id': df['client_id'][idx] + } + fo.write(json.dumps(item) + "\n") + +This script will create a corresponding **train.json** manifest near the initial **train.tsv**. It will look like this: + +.. code-block:: bash + + {"audio_filepath": "cv-corpus-9.0-2022-04-27/rw/clips/common_voice_rw_26273273.mp3", "text": "kandi tuguwe neza kugira ngo twakire amagambo y\u2019ukuri,", "up_votes": 2, "down_votes": 0, "age": "twenties", "gender": "male", "accents": NaN, "client_id": "e2a04c0ecacf81302f4270a3dddaa7a131420f6b7319208473af17d4adf3724ad9a3b6cdee107e2f321495db86f114a50c396e0928464a58dfad472130e7514a"} + {"audio_filepath": "cv-corpus-9.0-2022-04-27/rw/clips/common_voice_rw_26273478.mp3", "text": "Simbi na we akajya kwiga nubwo byari bigoye", "up_votes": 2, "down_votes": 0, "age": "twenties", "gender": "male", "accents": NaN, "client_id": "e2a04c0ecacf81302f4270a3dddaa7a131420f6b7319208473af17d4adf3724ad9a3b6cdee107e2f321495db86f114a50c396e0928464a58dfad472130e7514a"} + {"audio_filepath": "cv-corpus-9.0-2022-04-27/rw/clips/common_voice_rw_26273483.mp3", "text": "Inshuti yanjye yaje kunsura ku biro byanjye.", "up_votes": 2, "down_votes": 0, "age": "twenties", "gender": "male", "accents": NaN, "client_id": "e2a04c0ecacf81302f4270a3dddaa7a131420f6b7319208473af17d4adf3724ad9a3b6cdee107e2f321495db86f114a50c396e0928464a58dfad472130e7514a"} + {"audio_filepath": "cv-corpus-9.0-2022-04-27/rw/clips/common_voice_rw_26273488.mp3", "text": "Grand Canyon ni ahantu hazwi cyane ba mukerarugendo.", "up_votes": 2, "down_votes": 0, "age": "twenties", "gender": "male", "accents": NaN, "client_id": "e2a04c0ecacf81302f4270a3dddaa7a131420f6b7319208473af17d4adf3724ad9a3b6cdee107e2f321495db86f114a50c396e0928464a58dfad472130e7514a"} + +For resampling we used the following script: + +.. code-block:: bash + + mkdir train + python ../decode_resample.py \ + --manifest=cv-corpus-9.0-2022-04-27/rw/train.json \ + --destination_folder=./train + +**decode_resample.py**: + +.. 
code-block:: python + + import argparse + import os + import json + + import sox + from sox import Transformer + import tqdm + import multiprocessing + from tqdm.contrib.concurrent import process_map + + + parser = argparse.ArgumentParser() + parser.add_argument('--manifest', required=True, type=str, help='path to the original manifest') + parser.add_argument("--num_workers", default=multiprocessing.cpu_count(), type=int, help="Workers to process dataset.") + parser.add_argument("--destination_folder", required=True, type=str, help="Destination folder where audio files will be stored") + args = parser.parse_args() + + + def process(x): + if not isinstance(x['text'], str): + x['text'] = '' + else: + x['text'] = x['text'].lower().strip() + _, file_with_ext = os.path.split(x['audio_filepath']) + name, ext = os.path.splitext(file_with_ext) + output_wav_path = args.destination_folder + "/" + name + '.wav' + if not os.path.exists(output_wav_path): + tfm = Transformer() + tfm.rate(samplerate=16000) + tfm.channels(n_channels=1) + tfm.build(input_filepath=x['audio_filepath'], + output_filepath=output_wav_path) + x['duration'] = sox.file_info.duration(output_wav_path) + x['audio_filepath'] = output_wav_path + return x + + + def load_data(manifest): + data = [] + with open(manifest, 'r') as f: + for line in tqdm.tqdm(f): + item = json.loads(line) + data.append(item) + return data + + + data = load_data(args.manifest) + + data_new = process_map(process, data, max_workers=args.num_workers, chunksize=100) + + with open(args.manifest.replace('.json', '_decoded.json'), 'w') as f: + for item in tqdm.tqdm(data_new): + f.write(json.dumps(item) + '\n') + +It will write the resampled .wav-files to the specified directory and save a new json manifest with corrected audiopaths. + +**Note:** You need to repeat these steps for **test.tsv** and **dev.tsv** as well. + +****************** +Data Preprocessing +****************** + +Before we start training the model on the above manifest files, we need to preprocess the text data. Data pre-processing is done to reduce ambiguity in transcripts. This is an essential step, and often requires moderate expertise in the language. + +We used the following script +**prepare_dataset_kinyarwanda.py**: + +.. 
code-block:: python + + import json + import os + import re + from collections import defaultdict + from nemo.collections.asr.parts.utils.manifest_utils import read_manifest, write_manifest + from tqdm.auto import tqdm + + def write_processed_manifest(data, original_path): + original_manifest_name = os.path.basename(original_path) + new_manifest_name = original_manifest_name.replace(".json", "_processed.json") + + manifest_dir = os.path.split(original_path)[0] + filepath = os.path.join(manifest_dir, new_manifest_name) + write_manifest(filepath, data) + print(f"Finished writing manifest: {filepath}") + return filepath + + + # calculate the character set + def get_charset(manifest_data): + charset = defaultdict(int) + for row in tqdm(manifest_data, desc="Computing character set"): + text = row['text'] + for character in text: + charset[character] += 1 + return charset + + + # Preprocessing steps + def remove_special_characters(data): + chars_to_ignore_regex = "[\.\,\?\:\-!;()«»…\]\[/\*–‽+&_\\½√>€™$•¼}{~—=“\"”″‟„]" + apostrophes_regex = "[’'‘`ʽ']" + data["text"] = re.sub(chars_to_ignore_regex, " ", data["text"]) # replace punctuation by space + data["text"] = re.sub(apostrophes_regex, "'", data["text"]) # replace different apostrophes by one + data["text"] = re.sub(r"'+", "'", data["text"]) # merge multiple apostrophes + + # remove spaces where apostrophe marks a deleted vowel + # this rule is taken from https://huggingface.co/lucio/wav2vec2-large-xlsr-kinyarwanda-apostrophied + data["text"] = re.sub(r"([b-df-hj-np-tv-z])' ([aeiou])", r"\1'\2", data["text"]) + + data["text"] = re.sub(r" '", " ", data["text"]) # delete apostrophes at the beginning of word + data["text"] = re.sub(r"' ", " ", data["text"]) # delete apostrophes at the end of word + data["text"] = re.sub(r" +", " ", data["text"]) # merge multiple spaces + return data + + + def replace_diacritics(data): + data["text"] = re.sub(r"[éèëēê]", "e", data["text"]) + data["text"] = re.sub(r"[ãâāá]", "a", data["text"]) + data["text"] = re.sub(r"[úūü]", "u", data["text"]) + data["text"] = re.sub(r"[ôōó]", "o", data["text"]) + data["text"] = re.sub(r"[ćç]", "c", data["text"]) + data["text"] = re.sub(r"[ïī]", "i", data["text"]) + data["text"] = re.sub(r"[ñ]", "n", data["text"]) + return data + + + def remove_oov_characters(data): + oov_regex = "[^ 'aiuenrbomkygwthszdcjfvplxq]" + data["text"] = re.sub(oov_regex, "", data["text"]) # delete oov characters + data["text"] = data["text"].strip() + return data + + + # Processing pipeline + def apply_preprocessors(manifest, preprocessors): + for processor in preprocessors: + for idx in tqdm(range(len(manifest)), desc=f"Applying {processor.__name__}"): + manifest[idx] = processor(manifest[idx]) + + print("Finished processing manifest !") + return manifest + + + # List of pre-processing functions + PREPROCESSORS = [ + remove_special_characters, + replace_diacritics, + remove_oov_characters, + ] + + train_manifest = "train_decoded.json" + dev_manifest = "dev_decoded.json" + test_manifest = "test_decoded.json" + + train_data = read_manifest(train_manifest) + dev_data = read_manifest(dev_manifest) + test_data = read_manifest(test_manifest) + + # Apply preprocessing + train_data_processed = apply_preprocessors(train_data, PREPROCESSORS) + dev_data_processed = apply_preprocessors(dev_data, PREPROCESSORS) + test_data_processed = apply_preprocessors(test_data, PREPROCESSORS) + + # Write new manifests + train_manifest_cleaned = write_processed_manifest(train_data_processed, train_manifest) + 
dev_manifest_cleaned = write_processed_manifest(dev_data_processed, dev_manifest) + test_manifest_cleaned = write_processed_manifest(test_data_processed, test_manifest) + +It performs the following operations: + +* Remove all punctuation except for apostrophes +* Replace different kinds of apostrophes by one +* Lowercase +* Replace rare characters with diacritics (e.g. [éèëēê] => e) +* Delete all remaining out-of-vocabulary (OOV) characters + +The final Kinyarwanda alphabet in all trancripts consists of Latin letters, space and apostrophe. + +******************* +Building Tokenizers +******************* + +Though it is possible to train character-based ASR model, usually we get some improvement in quality and speed if we predict longer units. The commonly used tokenization algorithm is called `Byte-pair encoding `_. This is a deterministic tokenization algorithm based on corpus statistics. It splits the words to subtokens and the beginning of word is marked by special symbol so it's easy to restore the original words. +NeMo toolkit supports on-the-fly subword tokenization, so you need not modify the transcripts, but need to pass your tokenizer via the model config. NeMo supports both Word Piece Tokenizer (via HuggingFace) and Sentence Piece Tokenizer (via Google SentencePiece library) +For Kinyarwanda experiments we used 128 subtokens for the CTC model and 1024 subtokens for the Transducer model. The tokenizers for these models were built using the text transcripts of the train set with this script. For vocabulary of size 1024 we restrict maximum subtoken length to 4 symbols (2 symbols for size 128) to avoid populating vocabulary with specific frequent words from the dataset. This does not affect the model performance and potentially helps to adapt to other domain without retraining tokenizer. +We used the following script from NeMo toolkit to create `Sentencepiece `_ tokenizers with different vocabulary sizes (128 and 1024 subtokens) + +.. code-block:: bash + + python ${NEMO_ROOT}/scripts/tokenizers/process_asr_text_tokenizer.py \ + --manifest=dev_decoded_processed.json,train_decoded_processed.json \ + --vocab_size=1024 \ + --data_root=tokenizer_bpe_maxlen_4 \ + --tokenizer="spe" \ + --spe_type=bpe \ + --spe_character_coverage=1.0 \ + --spe_max_sentencepiece_length=4 \ + --log + + python ${NEMO_ROOT}/scripts/tokenizers/process_asr_text_tokenizer.py \ + --manifest=dev_decoded_processed.json,train_decoded_processed.json \ + --vocab_size=128 \ + --data_root=tokenizer_bpe_maxlen_2 \ + --tokenizer="spe" \ + --spe_type=bpe \ + --spe_character_coverage=1.0 \ + --spe_max_sentencepiece_length=2 \ + --log + +Most of the arguments are similar to those explained in the `ASR with Subword Tokenization tutorial `_. + +The resulting tokenizer is a folder like that: + +.. code-block:: bash + + ├── tokenizer_spe_bpe_v1024_max_4 + │ ├── tokenizer.model + │ ├── tokenizer.vocab + │ └── vocab.txt + +Remember that you will need to pass the path to tokenizer in the model config. +You can see all the subtokens in the **vocab.txt** file. + +***************************** +Tarred datasets and bucketing +***************************** + +There are two useful techniques for training on large datasets. + +* Tarred dataset allows to store the dataset as large .tar files instead of small separate audio files. It speeds up the training and minimizes the load on the network in the cluster. +* Bucketing groups utterances with similar duration. It reduces padding and speeds up the training. 
+ +The NeMo toolkit provides a script to implement both of these techniques. + +.. code-block:: bash + + ## create tarred dataset with 1 bucket + python ${NEMO_ROOT}/scripts/speech_recognition/convert_to_tarred_audio_dataset.py \ + --manifest_path=train_decoded_processed.json \ + --target_dir=train_tarred_1bk \ + --num_shards=1024 \ + --max_duration=11.0 \ + --min_duration=1.0 \ + --shuffle \ + --shuffle_seed=1 \ + --sort_in_shards \ + --workers=-1 + + + ## create tarred dataset with 4 buckets + python ${NEMO_ROOT}/scripts/speech_recognition/convert_to_tarred_audio_dataset.py \ + --manifest_path=train_decoded_processed.json \ + --target_dir=train_tarred_4bk \ + --num_shards=1024 \ + --max_duration=11.0 \ + --min_duration=1.0 \ + --shuffle \ + --shuffle_seed=1 \ + --sort_in_shards \ + --workers=-1 \ + --buckets_num=4 + +**Note**: we only need to process train data, dev and test are usually much smaller and can be used as is. + +Our final dataset folder looks like this: + +.. code-block:: bash + + ├── dev [15988 .wav files] + ├── dev_decoded_processed.json (dev manifest) + ├── test [16213 .wav files] + ├── test_decoded_processed.json (test manifest) + └── train_tarred_1bk + ├── metadata.yaml + ├── tarred_audio_manifest.json + └── [1024 .tar files] + +In case of 4 buckets it will look like: + +.. code-block:: bash + + └── train_tarred_4bk + ├── bucket1 + ├── metadata.yaml + ├── tarred_audio_manifest.json + └── [1024 .tar files] + ├── bucket2 + ... + ├── bucket3 + └── bucket4 + +************************************ +Training from scratch and finetuning +************************************ + +ASR models +########## + +Our goal was to train two ASR models with different architectures: `Conformer-CTC `_ and `Conformer-Transducer `_, with around 120 million parameters. +The CTC model predicts output tokens for each timestep. The outputs are assumed to be independent of each other. As a result the CTC models work faster but they can produce outputs that are inconsistent with each other. CTC models are often combined with external language models in production. In contrast, the Transducer models contain the decoding part which generates the output tokens one by one and the next token prediction depends on this history. Due to autoregressive nature of decoding the inference speed is several times slower than that of CTC models, but the quality is usually better because it can incorporate language model information within the same model. + +Training scripts and configs +############################ + +To train a Conformer-CTC model, we use `speech_to_text_ctc_bpe.py `_ with the default config `conformer_ctc_bpe.yaml `_. +To train a Conformer-Transducer model, we use `speech_to_text_rnnt_bpe.py `_ with the default config `conformer_transducer_bpe.yaml `_. +Any options of default config can be overwritten from command line. +Usually we should provide the options related to the dataset and tokenizer. + +This is an example of how we can run the training script: + +.. 
code-block:: bash + + TOKENIZER=tokenizers/tokenizer_spe_bpe_v1024_max_4/ + TRAIN_MANIFEST=data/train_tarred_1bk/tarred_audio_manifest.json + TRAIN_FILEPATHS=data/train_tarred_1bk/audio__OP_0..1023_CL_.tar + VAL_MANIFEST=data/dev_decoded_processed.json + TEST_MANIFEST=data/test_decoded_processed.json + + python ${NEMO_ROOT}/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py \ + --config-path=../conf/conformer/ \ + --config-name=conformer_ctc_bpe \ + exp_manager.name="Some name of our experiment" \ + exp_manager.resume_if_exists=true \ + exp_manager.resume_ignore_no_checkpoint=true \ + exp_manager.exp_dir=results/ \ + model.tokenizer.dir=$TOKENIZER \ + model.train_ds.is_tarred=true \ + model.train_ds.tarred_audio_filepaths=$TRAIN_FILEPATHS \ + model.train_ds.manifest_filepath=$TRAIN_MANIFEST \ + model.validation_ds.manifest_filepath=$VAL_MANIFEST \ + model.test_ds.manifest_filepath=$TEST_MANIFEST + +The option *exp_manager.resume_if_exists=true* allows to resume training. Actually you can stop training at any moment and then continue from the last checkpoint. +When the training is finished, the final model will be saved as *.nemo* file inside the folder that we specified in *exp_manager.exp_dir*. + +Training dynamics +################# + +The figure below shows the training dynamics when we train Kinyarwanda models **from scratch**. In these experiments we used the hyperparameters from the default configs, the training was run on 2 nodes with 16 gpus per node, training batch size was 32. We see that Transducer model achieves better quality than CTC. + + .. image:: ../images/kinyarwanda_from_scratch.png + :align: center + :alt: Training dynamics of Kinyarwanda models trained from scratch + :width: 800px + +Finetuning from another model +############################# + +Often it's a good idea to initialize our ASR model with the weights of some other pretrained model, for example, a model for another language. It usually makes our model to converge faster and achieve better quality, especially if the dataset for our target language is small. + +Though Kinyarwanda dataset is rather large, we also tried finetuning Kinyarwanda Conformer-Transducer model from different pretrained checkpoints, namely: + +* English Conformer-Transducer checkpoint +* Self-supervised Learning (SSL) checkpoint trained on English data +* SSL checkpoint trained on multilingual data + +To initialize from **non-SSL checkpoint** we should simply add the option `+init_from_pretrained_model`: + +.. code-block:: bash + + INIT_MODEL='stt_en_conformer_ctc_large' + + python ${NEMO_ROOT}/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py + ...[same options as in the previous example]... + +init_from_pretrained_model=${INIT_MODEL} + +In that case the pretrained model `stt_en_conformer_ctc_large `_ will be automatically downloaded from NVIDIA GPU Cloud(NGC) and used to initialize weights before training. + +To initialize from **SSL checkpoint** we should edit our training script like the following code: + +.. 
code-block:: python + + import nemo.collections.asr as nemo_asr + ssl_model = nemo_asr.models.ssl_models.SpeechEncDecSelfSupervisedModel.from_pretrained(model_name='ssl_en_conformer_large') + + # define fine-tune model + asr_model = nemo_asr.models.EncDecCTCModelBPE(cfg=cfg.model, trainer=trainer) + + # load ssl checkpoint + asr_model.load_state_dict(ssl_model.state_dict(), strict=False) + + del ssl_model + +When using finetuning you probably will need to change the some hyperparameters from the default config, especially the learning rate and learning rate policy. In the experiments below we used *model.optim.sched.name=CosineAnnealing* and *model.optim.lr=1e-3*. + +The figure below compares the training dynamics for three Conformer-Transducer models. They differ only by how they are initialized. We see that finetuning leads to faster convergence and better quality. Initializing from SSL gives lowest WER at earlier stages, but in a longer period it performs worse. + + .. image:: ../images/kinyarwanda_finetuning.png + :align: center + :alt: Training dynamics of Kinyarwanda models trained from scratch and finetuned from different pretrained checkpoints + :width: 800px + +************************ +Inference and evaluation +************************ + +Running the inference +##################### + +To run the inference we need a pretrained model. This can be either a `.nemo` file that we get after the training is finished, or any published model from `NGC `_. +We run the inference using the following script: + +.. code-block:: bash + + python ${NEMO_ROOT}/examples/asr/transcribe_speech.py \ + model_path=.nemo \ + dataset_manifest=./test_decoded_processed.json \ + output_filename=./test_with_predictions.json \ + batch_size=8 \ + cuda=1 \ + amp=True + +To run inference with NVIDIA's Kinyarwanda checkpoints `STT Rw Conformer-CTC Large `_ or `STT Rw Conformer-Transducer Large `_ use: + +.. code-block:: bash + + python ${NEMO_ROOT}/examples/asr/transcribe_speech.py \ + pretrained_name="stt_rw_conformer_ctc_large" \ + dataset_manifest=test_decoded_processed.json \ + output_filename=./pred_ctc.json \ + batch_size=8 \ + cuda=1 \ + amp=True + +**Note:** If you want to transcribe new audios, you can pass a folder with audio files using `audio_dir` parameter instead of `dataset_manifest`. + +After the inference is finished the `output_filename` is a `.json` manifest augmented with a new field `pred_text` containing the resulting transcript. Example: + +.. 
code-block:: + + {"audio_filepath": "test/common_voice_rw_19835615.wav", "text": "kw'ibumoso", "up_votes": 2, "down_votes": 0, "age": NaN, "gender": NaN, "accents": NaN, "client_id": "66675a7003e6baa3e7d4af01bff8324ac3c5f15e7f8918180799dd2928227c791f19e2811f9ec5779a2b06dac1b7a97fa7740dcfe98646ea1b5e106250c260be", "duration": 3.672, "pred_text": "n'ibumoso"} + {"audio_filepath": "test/common_voice_rw_24795878.wav", "text": "ni ryari uheruka kurya urusenda", "up_votes": 2, "down_votes": 0, "age": NaN, "gender": NaN, "accents": NaN, "client_id": "90e0438947a75b6c0cf59a0444aee3b81a76c5f9459c4b22df2e14b4ce257aeacaed8ac6092bfcd75b8e831633d58a84287fd62190c21d70d75efe8d93ed74ed", "duration": 3.312, "pred_text": "ni ryari uheruka kurya urusenda"} + {"audio_filepath": "test/common_voice_rw_24256935.wav", "text": "umunani", "up_votes": 2, "down_votes": 0, "age": NaN, "gender": NaN, "accents": NaN, "client_id": "974d4876e99e7437183c20f9107053acc9e514379d448bcf00aaaabc0927f5380128af86d39650867fa80a82525110dfc40784a5371c989de1a5bdf531f6d943", "duration": 3.24, "pred_text": "umunani"} + +Word Error Rate (WER) and Character Error Rate (CER) +#################################################### + +As soon as we have a manifest file with `text` and `pred_text` we can measure the quality of predictions of our model. + +.. code-block:: bash + + # Calculate WER + python ${NEMO_ROOT}/examples/asr/speech_to_text_eval.py \ + dataset_manifest=test_with_predictions.json \ + use_cer=False \ + only_score_manifest=True + + # Calculate CER + python ${NEMO_ROOT}/examples/asr/speech_to_text_eval.py \ + dataset_manifest=test_with_predictions.json \ + use_cer=True \ + only_score_manifest=True + + +Evaluation of NVIDIA's Kinyarwanda checkpoints +############################################## + +If you run inference and evaluation of NVIDIA's published Kinyarwanda models, you should get metrics like these: + ++----------------------------------+-------+-------+ +| Model | WER % | CER % | ++==================================+=======+=======+ +| stt_rw_conformer_ctc_large | 18.22 | 5.45 | ++----------------------------------+-------+-------+ +| stt_rw_conformer_trasducer_large | 16.19 | 5.7 | ++----------------------------------+-------+-------+ + +Error analysis +############## + +Still, even WER of 16% is not as good as we usually get for other languages trained with NeMo toolkit, so we may want to look at the errors that the model makes to better understand what's the problem. + +We can use `Speech Data Explorer `_ to analyze the errors. + +If we run + +.. code-block:: bash + + python ${NEMO_ROOT}/tools/speech_data_explorer/data_explorer.py + +it will start a local server, and provide a http address to open from the browser. +In the UI we can see the model predictions and their diff with the reference, and also we can listen to the corresponding audio. We also can sort the sentences by descending WER and look through the top of them. + +The error analysis showed several problems concerning the Kinyarwanda dataset: + +* Noisy multi-speaker records (e.g. common_voice_rw_19830859.wav) +* Bad quality of record (e.g. 
common_voice_rw_24452415.wav) +* Orthographic variability related to space/no space/apostrophe + * *kugira ngo / kugirango* + * *nkuko / nk'uko* + * *n iyo / n'iyo* +* Multiple orthographic variants for foreign words + * *telefoni / telephone* + * *film / filime* + * *isiraheli / israel* + * *radio / radiyo* + * *kongo / congo* +* l/r variability + * *abamalayika / abamarayika* + + diff --git a/docs/source/asr/images/citrinet_vertical.png b/docs/source/asr/images/citrinet_vertical.png new file mode 100644 index 0000000000000000000000000000000000000000..a9a7f8c2e8530dd866aa46aa95f5c59fa56b442f Binary files /dev/null and b/docs/source/asr/images/citrinet_vertical.png differ diff --git a/docs/source/asr/images/conformer_ctc.png b/docs/source/asr/images/conformer_ctc.png new file mode 100644 index 0000000000000000000000000000000000000000..e491856502ed39793d631633cfc3d4da5b7d5e53 Binary files /dev/null and b/docs/source/asr/images/conformer_ctc.png differ diff --git a/docs/source/asr/images/ctc_asr.png b/docs/source/asr/images/ctc_asr.png new file mode 100644 index 0000000000000000000000000000000000000000..56ef705602da10b6dca5e5f01afd66bc1f5230d7 Binary files /dev/null and b/docs/source/asr/images/ctc_asr.png differ diff --git a/docs/source/asr/images/jasper_layers.png b/docs/source/asr/images/jasper_layers.png new file mode 100644 index 0000000000000000000000000000000000000000..af88379af98f95cd02141b9dfcb669ead346700c Binary files /dev/null and b/docs/source/asr/images/jasper_layers.png differ diff --git a/docs/source/asr/images/jasper_vertical.png b/docs/source/asr/images/jasper_vertical.png new file mode 100644 index 0000000000000000000000000000000000000000..13ba93c2c73b0b186a26076e1581698ede4f6da5 Binary files /dev/null and b/docs/source/asr/images/jasper_vertical.png differ diff --git a/docs/source/asr/images/kinyarwanda_finetuning.png b/docs/source/asr/images/kinyarwanda_finetuning.png new file mode 100644 index 0000000000000000000000000000000000000000..1cbce4a6293f78dbcc7201fba03dd7f5d0480753 Binary files /dev/null and b/docs/source/asr/images/kinyarwanda_finetuning.png differ diff --git a/docs/source/asr/images/kinyarwanda_from_scratch.png b/docs/source/asr/images/kinyarwanda_from_scratch.png new file mode 100644 index 0000000000000000000000000000000000000000..2aa58ab61c16f12a4d5ee9e24b34e55391f41feb Binary files /dev/null and b/docs/source/asr/images/kinyarwanda_from_scratch.png differ diff --git a/docs/source/asr/images/quartz_vertical.png b/docs/source/asr/images/quartz_vertical.png new file mode 100644 index 0000000000000000000000000000000000000000..4cbede907736f631cb9774b207289a35dd07c757 Binary files /dev/null and b/docs/source/asr/images/quartz_vertical.png differ diff --git a/docs/source/asr/images/squeezeformer.png b/docs/source/asr/images/squeezeformer.png new file mode 100644 index 0000000000000000000000000000000000000000..b6e1b218499cf461bab48e832c66c59c741cc1cb Binary files /dev/null and b/docs/source/asr/images/squeezeformer.png differ diff --git a/docs/source/asr/intro.rst b/docs/source/asr/intro.rst new file mode 100644 index 0000000000000000000000000000000000000000..e655da836a760d70a8957221e59aa889c9134877 --- /dev/null +++ b/docs/source/asr/intro.rst @@ -0,0 +1,58 @@ +Automatic Speech Recognition (ASR) +================================== + +ASR, or Automatic Speech Recognition, refers to the problem of getting a program to automatically transcribe spoken language +(speech-to-text). 
Our goal is usually to have a model that minimizes the Word Error Rate (WER) metric when transcribing speech input. +In other words, given some audio file (e.g. a WAV file) containing speech, how do we transform this into the corresponding text with +as few errors as possible? + +Traditional speech recognition takes a generative approach, modeling the full pipeline of how speech sounds are produced in order to +evaluate a speech sample. We would start from a language model that encapsulates the most likely orderings of words that are generated +(e.g. an n-gram model), to a pronunciation model for each word in that ordering (e.g. a pronunciation table), to an acoustic model that +translates those pronunciations to audio waveforms (e.g. a Gaussian Mixture Model). + +Then, if we receive some spoken input, our goal would be to find the most likely sequence of text that would result in the given audio +according to our generative pipeline of models. Overall, with traditional speech recognition, we try to model ``Pr(audio|transcript)*Pr(transcript)``, +and take the argmax of this over possible transcripts. + +Over time, neural nets advanced to the point where each component of the traditional speech recognition model could be replaced by a +neural model that had better performance and that had a greater potential for generalization. For example, we could replace an n-gram +model with a neural language model, and replace a pronunciation table with a neural pronunciation model, and so on. However, each of +these neural models needs to be trained individually on different tasks, and errors in any model in the pipeline could throw off the +whole prediction. + +Thus, we can see the appeal of end-to-end ASR architectures: discriminative models that simply take an audio input and give a textual +output, and in which all components of the architecture are trained together towards the same goal. The model's encoder would be +akin to an acoustic model for extracting speech features, which can then be directly piped to a decoder which outputs text. If desired, +we could integrate a language model that would improve our predictions, as well. + +And the entire end-to-end ASR model can be trained at once, which is a much easier pipeline to handle! + +A demo below allows evaluation of NeMo ASR models in multiple languages from the browser: + +.. raw:: html + + + + + + +The full documentation tree is as follows: + +.. toctree:: + :maxdepth: 8 + + models + datasets + asr_language_modeling + results + scores + configs + api + resources + examples/kinyarwanda_asr.rst + +.. include:: resources.rst diff --git a/docs/source/asr/models.rst b/docs/source/asr/models.rst new file mode 100644 index 0000000000000000000000000000000000000000..ed9fb63e745381de6b841b9d0c4c09020e0ab84c --- /dev/null +++ b/docs/source/asr/models.rst @@ -0,0 +1,287 @@ +Models +====== + +This section gives a brief overview of the models that NeMo's ASR collection currently supports. + +Each of these models can be used with the example ASR scripts (in the ``/examples/asr`` directory) by +specifying the model architecture in the config file used. Examples of config files for each model can be found in +the ``/examples/asr/conf`` directory. + +For more information about the config files and how they should be structured, refer to the :doc:`./configs` section. + +Pretrained checkpoints for all of these models, as well as instructions on how to load them, can be found in the :doc:`./results` +section. 
You can use the available checkpoints for immediate inference, or fine-tune them on your own datasets. The checkpoints section +also contains benchmark results for the available ASR models. + +.. _Jasper_model: + +Jasper +------ + +Jasper ("Just Another Speech Recognizer") :cite:`asr-models-li2019jasper` is a deep time delay neural network (TDNN) comprising of +blocks of 1D-convolutional layers. The Jasper family of models are denoted as ``Jasper_[BxR]`` where ``B`` is the number of blocks +and ``R`` is the number of convolutional sub-blocks within a block. Each sub-block contains a 1-D convolution, batch normalization, +ReLU, and dropout: + + .. image:: images/jasper_vertical.png + :align: center + :alt: jasper model + :scale: 50% + +Jasper models can be instantiated using the :class:`~nemo.collections.asr.models.EncDecCTCModel` class. + +QuartzNet +--------- + +QuartzNet :cite:`asr-models-kriman2019quartznet` is a version of Jasper :cite:`asr-models-li2019jasper` model with separable +convolutions and larger filters. It can achieve performance similar to Jasper but with an order of magnitude fewer parameters. +Similarly to Jasper, the QuartzNet family of models are denoted as ``QuartzNet_[BxR]`` where ``B`` is the number of blocks and ``R`` +is the number of convolutional sub-blocks within a block. Each sub-block contains a 1-D *separable* convolution, batch normalization, +ReLU, and dropout: + + .. image:: images/quartz_vertical.png + :align: center + :alt: quartznet model + :scale: 40% + +QuartzNet models can be instantiated using the :class:`~nemo.collections.asr.models.EncDecCTCModel` class. + +.. _Citrinet_model: + +Citrinet +-------- + +Citrinet is a version of QuartzNet :cite:`asr-models-kriman2019quartznet` that extends ContextNet :cite:`asr-models-han2020contextnet`, +utilizing subword encoding (via Word Piece tokenization) and Squeeze-and-Excitation mechanism :cite:`asr-models-hu2018squeeze` to +obtain highly accurate audio transcripts while utilizing a non-autoregressive CTC based decoding scheme for efficient inference. + + .. image:: images/citrinet_vertical.png + :align: center + :alt: citrinet model + :scale: 50% + +Citrinet models can be instantiated using the :class:`~nemo.collections.asr.models.EncDecCTCModelBPE` class. + +.. _ContextNet_model: + +ContextNet +---------- + +ContextNet is a model uses Transducer/RNNT loss/decoder and is introduced in :cite:`asr-models-han2020contextnet`. +It uses Squeeze-and-Excitation mechanism :cite:`asr-models-hu2018squeeze` to model larger context. +Unlike Citrinet, it has an autoregressive decoding scheme. + +ContextNet models can be instantiated using the :class:`~nemo.collections.asr.models.EncDecRNNTBPEModel` class for a +model with sub-word encoding and :class:`~nemo.collections.asr.models.EncDecRNNTModel` for char-based encoding. + +You may find the example config files of ContextNet model with character-based encoding at +``/examples/asr/conf/contextnet_rnnt/contextnet_rnnt_char.yaml`` and +with sub-word encoding at ``/examples/asr/conf/contextnet_rnnt/contextnet_rnnt.yaml``. + +.. _Conformer-CTC_model: + +Conformer-CTC +------------- + +Conformer-CTC is a CTC-based variant of the Conformer model introduced in :cite:`asr-models-gulati2020conformer`. Conformer-CTC has a +similar encoder as the original Conformer but uses CTC loss and decoding instead of RNNT/Transducer loss, which makes it a non-autoregressive model. +We also drop the LSTM decoder and instead use a linear decoder on the top of the encoder. 
This model uses a combination of
+self-attention and convolution modules to achieve the best of both approaches: the self-attention layers can learn the global
+interaction, while the convolutions efficiently capture the local correlations. The self-attention modules support both regular
+self-attention with absolute positional encoding and Transformer-XL's self-attention with relative positional encodings.
+
+Here is the overall architecture of the encoder of Conformer-CTC:
+
+    .. image:: images/conformer_ctc.png
+        :align: center
+        :alt: Conformer-CTC Model
+        :scale: 50%
+
+This model supports both the sub-word level and character level encodings. You can find more details on the config files for the
+Conformer-CTC models at `Conformer-CTC <./configs.html#conformer-ctc>`_. The variant with sub-word encoding is a BPE-based model
+which can be instantiated using the :class:`~nemo.collections.asr.models.EncDecCTCModelBPE` class, while the
+character-based variant is based on :class:`~nemo.collections.asr.models.EncDecCTCModel`.
+
+You may find the example config files of the Conformer-CTC model with character-based encoding at
+``/examples/asr/conf/conformer/conformer_ctc_char.yaml`` and
+with sub-word encoding at ``/examples/asr/conf/conformer/conformer_ctc_bpe.yaml``.
+
+.. _Conformer-Transducer_model:
+
+Conformer-Transducer
+--------------------
+
+Conformer-Transducer is the Conformer model introduced in :cite:`asr-models-gulati2020conformer`, used with an RNNT/Transducer loss and decoder.
+It has the same encoder as Conformer-CTC, but the RNNT/Transducer decoder makes it an autoregressive model.
+
+Most of the config for Conformer-Transducer models is similar to Conformer-CTC, except for the sections related to the decoder and loss: decoder, loss, joint, and decoding.
+You may take a look at our `tutorials page <../starthere/tutorials.html>`_ on Transducer models to become familiar with their configs:
+`Introduction to Transducers `_ and
+`ASR with Transducers `_.
+You can find more details on the config files for the Conformer-Transducer models at `Conformer-CTC <./configs.html#conformer-ctc>`_.
+
+This model supports both the sub-word level and character level encodings. The variant with sub-word encoding is a BPE-based model
+which can be instantiated using the :class:`~nemo.collections.asr.models.EncDecRNNTBPEModel` class, while the
+character-based variant is based on :class:`~nemo.collections.asr.models.EncDecRNNTModel`.
+
+You may find the example config files of the Conformer-Transducer model with character-based encoding at
+``/examples/asr/conf/conformer/conformer_transducer_char.yaml`` and
+with sub-word encoding at ``/examples/asr/conf/conformer/conformer_transducer_bpe.yaml``.
+
+Fast-Conformer
+--------------
+
+The Fast Conformer (CTC and RNNT) models use a faster version of the Conformer encoder that differs from it as follows:
+
+* 8x depthwise convolutional subsampling with 256 channels
+* Reduced convolutional kernel size of 9 in the conformer blocks
+
+The Fast Conformer encoder is about 2.4x faster than the regular Conformer encoder without a significant degradation in model quality.
+Using 128 subsampling channels yields a 2.7x speedup over the baseline, but model quality starts to degrade.
+With local attention, inference is possible on audio longer than 1 hour (256 subsampling channels) or 2 hours (128 channels).
+
+Fast Conformer models were trained using CosineAnnealing (instead of Noam) as the scheduler.
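+
+Since local attention makes long-audio inference tractable, a Conformer-family checkpoint can be switched to local attention before transcribing long recordings. The snippet below is only a sketch: the checkpoint name is a placeholder, and ``change_attention_model`` is the same call described in the Checkpoints (results) section of this documentation.
+
+.. code-block:: python
+
+    import nemo.collections.asr as nemo_asr
+
+    # Placeholder name: substitute a Conformer or Fast Conformer checkpoint you have access to.
+    asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="<conformer_or_fastconformer_checkpoint>")
+
+    # Switch full-context self-attention to limited-context (local) attention so that
+    # the attention cost grows linearly with the audio length.
+    asr_model.change_attention_model(
+        self_attention_model="rel_pos_local_attn",
+        att_context_size=[64, 64],
+    )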
+
+You may find the example CTC config at
+``/examples/asr/conf/fastconformer/fast-conformer_ctc_bpe.yaml`` and
+the transducer config at ``/examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml``.
+
+Note that both configs are subword-based (BPE).
+
+Cache-aware Streaming Conformer
+-------------------------------
+
+Buffered streaming uses overlapping chunks so that an offline ASR model can be used for streaming with reasonable accuracy. However, the overlapping chunks introduce a significant amount of duplicated computation.
+There is also an accuracy gap between the offline model and the streaming one, since there is a mismatch between how the model is trained and how inference is performed for streaming.
+The Cache-aware Streaming Conformer models address these disadvantages. These streaming Conformers are trained with limited right context, which makes it possible to match how the model is used in both training and inference.
+They also use caching to store intermediate activations and avoid any duplicated computation.
+The cache-aware approach is supported for both Conformer-CTC and Conformer-Transducer and enables the model to be used very efficiently for streaming.
+
+Three categories of layers in Conformer have access to right tokens: 1) depthwise convolutions, 2) self-attention, and 3) convolutions in the downsampling layers.
+Streaming Conformer models use causal convolutions, or convolutions with a smaller right context, along with self-attention with limited right context, to limit the effective right context of the input.
+A model trained with such limitations can be used in streaming mode and gives exactly the same outputs and accuracy as when the whole audio is given to the model in offline mode.
+These models can use a caching mechanism to store and reuse activations during streaming inference, avoiding duplicated computation as much as possible.
+
+We support the following three types of right context modeling:
+
+* fully causal model with zero look-ahead: tokens do not see any future tokens. Convolution layers are all causal and right tokens are masked for self-attention.
+
+This gives zero latency but limited accuracy.
+To train such a model, you need to set ``encoder.att_context_size=[left_context, 0]`` and ``encoder.conv_context_size=causal`` in the config.
+
+* regular look-ahead: convolutions are able to see a few future frames, and self-attention also sees the same number of future tokens.
+
+In this approach, the activations for the look-ahead part are not cached and are recalculated in the next chunks. The right context in each layer should be a small number, as stacking layers increases the effective context size and therefore the look-ahead size and latency.
+For example, for a model with 17 layers, 4x downsampling and a 10ms window shift, even a right context of 2 in each layer means 17*2*10*4=1360ms of look-ahead. Each step after the downsampling corresponds to 4*10=40ms.
+
+* chunk-aware look-ahead: the input is split into equal chunks. Convolutions are fully causal, while self-attention layers are able to see all the tokens in their corresponding chunk.
+
+For example, in a model with a chunk size of 20 tokens, a token at the first position of a chunk sees the next 19 tokens, while the last token in the chunk sees zero future tokens.
+This approach is more efficient than regular look-ahead in terms of computation, as the activations for most of the look-ahead part are cached and there is close to zero duplication in the calculations.
+In terms of accuracy, this approach gives similar or even better results than regular look-ahead, since each token in each layer has access to more tokens on average. That is why we recommend this approach for streaming.
+
+**Note:** Latencies are based on the assumption that the forward time of the network is zero; they only estimate the time from when a frame becomes available until it has been passed through the model.
+
+Approaches with non-zero look-ahead can give significantly better accuracy by sacrificing latency. The latency can be controlled by the left context size. Increasing the right context helps the accuracy up to a limit, but increases the computation time.
+
+In all modes, the left context can be controlled by the number of tokens visible to the self-attention and by the kernel size of the convolutions.
+For example, if the left context of self-attention in each layer is set to 20 tokens and there are 10 Conformer layers, the effective left context is 20*10=200 tokens.
+The left context of self-attention for regular look-ahead can be set to any number, while for chunk-aware look-ahead it should be set to a multiple of the right context.
+For convolutions, if we use a left context of 30 in such a model, the effective left context is 30*10=300.
+The left context of the convolutions depends on their kernel size, while it can be any number for self-attention layers. A larger left context for self-attention means a larger cache and more self-attention computation.
+A self-attention left context of around 6 seconds gives results close to those with unlimited left context. For a model with 4x downsampling and a 10ms window shift in the preprocessor, each token corresponds to 4*10=40ms.
+
+If the striding approach is used for downsampling, all the convolutions in the downsampling layers are fully causal and do not see future tokens.
+You may use stacking for downsampling in the streaming models, which is significantly faster and uses less memory.
+It also avoids some of the limitations of striding and vggnet, and you may use any downsampling rate.
+
+You may find the example config files of cache-aware streaming Conformer models at
+``/examples/asr/conf/conformer/streaming/conformer_transducer_bpe_streaming.yaml`` for the Transducer variant and
+at ``/examples/asr/conf/conformer/streaming/conformer_ctc_bpe.yaml`` for the CTC variant.
+
+To simulate cache-aware streaming, you may use the script at ``/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py``. It can simulate streaming in single-stream or multi-stream mode (in batches) for an ASR model.
+This script can also be used for models trained offline with full context, but the accuracy will not be great unless the chunk size is large enough, which in turn results in high latency.
+It is recommended to train a model in streaming mode with limited context for use with this script. More info can be found in the script.
+
+.. _LSTM-Transducer_model:
+
+LSTM-Transducer
+---------------
+
+LSTM-Transducer is a model that uses RNNs (e.g. LSTM) in the encoder. The architecture of this model follows the suggestions in :cite:`asr-models-he2019streaming`.
+It uses an RNNT/Transducer loss/decoder.
The encoder consists of RNN layers (LSTM by default) with a lower projection size to increase efficiency.
+Layer norm is added between the layers to stabilize the training.
+It can be trained/used in unidirectional or bidirectional mode. The unidirectional mode is fully causal and can easily be used for simple and efficient streaming. However, the accuracy of this model is generally lower than that of other models like Conformer and Citrinet.
+
+This model supports both the sub-word level and character level encodings. You may find the example config file of the RNNT model with wordpiece encoding at ``/examples/asr/conf/lstm/lstm_transducer_bpe.yaml``.
+You can find more details on the config files for the RNNT models at `LSTM-Transducer <./configs.html#lstm-transducer>`_.
+
+.. _LSTM-CTC_model:
+
+LSTM-CTC
+--------
+
+LSTM-CTC is a CTC variant of the LSTM-Transducer model, which uses CTC loss/decoding instead of Transducer.
+You may find the example config file of the LSTM-CTC model with wordpiece encoding at ``/examples/asr/conf/lstm/lstm_ctc_bpe.yaml``.
+
+.. _Squeezeformer-CTC_model:
+
+Squeezeformer-CTC
+-----------------
+
+Squeezeformer-CTC is a CTC-based variant of the Squeezeformer model introduced in :cite:`asr-models-kim2022squeezeformer`. Squeezeformer-CTC has a
+similar encoder to the original Squeezeformer but uses CTC loss and decoding instead of RNNT/Transducer loss, which makes it a non-autoregressive model. The vast majority of the architecture is similar to the Conformer model, so please refer to `Conformer-CTC <./models.html#conformer-ctc>`_.
+
+The model primarily differs from Conformer in the following ways:
+
+* Temporal U-Net style time reduction, effectively reducing memory consumption and FLOPs for execution.
+* Unified activations throughout the model.
+* Simplification of module structure, removal of redundant layers.
+
+Here is the overall architecture of the encoder of Squeezeformer-CTC:
+
+    .. image:: images/squeezeformer.png
+        :align: center
+        :alt: Squeezeformer-CTC Model
+        :scale: 50%
+
+This model supports both the sub-word level and character level encodings. You can find more details on the config files for the
+Squeezeformer-CTC models at `Squeezeformer-CTC <./configs.html#squeezeformer-ctc>`_. The variant with sub-word encoding is a BPE-based model
+which can be instantiated using the :class:`~nemo.collections.asr.models.EncDecCTCModelBPE` class, while the
+character-based variant is based on :class:`~nemo.collections.asr.models.EncDecCTCModel`.
+
+You may find the example config files of the Squeezeformer-CTC model with character-based encoding at
+``/examples/asr/conf/squeezeformer/squeezeformer_ctc_char.yaml`` and
+with sub-word encoding at ``/examples/asr/conf/squeezeformer/squeezeformer_ctc_bpe.yaml``.
+
+.. _Hybrid-Transducer_CTC_model:
+
+Hybrid-Transducer-CTC
+---------------------
+
+Hybrid RNNT-CTC models are a group of models with both RNNT and CTC decoders. Training a unified model speeds up convergence for the CTC models and enables
+the user to use a single model that works as both a CTC and an RNNT model. This category can be used with any of the ASR models.
+Hybrid models use two decoders, CTC and RNNT, on top of the encoder. The default decoding strategy after training is RNNT.
+Users may call ``asr_model.change_decoding_strategy(decoder_type='ctc' or 'rnnt')`` to change the default decoding.
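+
+For example, a minimal sketch of switching the decoding branch on a hybrid checkpoint (the model name below is only a placeholder):
+
+.. code-block:: python
+
+    import nemo.collections.asr as nemo_asr
+
+    # Load a hybrid RNNT-CTC checkpoint (substitute a local .nemo file or an NGC model name).
+    asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
+        model_name="<hybrid_rnnt_ctc_checkpoint>"
+    )
+
+    # RNNT decoding is the default after training; switch to the CTC decoder instead.
+    asr_model.change_decoding_strategy(decoder_type="ctc")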
+ +The variant with sub-word encoding is a BPE-based model +which can be instantiated using the :class:`~nemo.collections.asr.models.EncDecHybridRNNTCTCBPEModel` class, while the +character-based variant is based on :class:`~nemo.collections.asr.models.EncDecHybridRNNTCTCModel`. + +You may use the example scripts under ``/examples/asr/asr_hybrid_transducer_ctc`` for both the char-based encoding and sub-word encoding. +These examples can be used to train any Hybrid ASR model like Conformer, Citrinet, QuartzNet, etc. + +You may find the example config files of Conformer variant of such hybrid models with character-based encoding at +``/examples/asr/conf/conformer/hybrid_transducer_ctc/conformer_hybrid_transducer_ctc_char.yaml`` and +with sub-word encoding at ``/examples/asr/conf/conformer/hybrid_transducer_ctc/conformer_hybrid_transducer_ctc_bpe.yaml``. + + +References +---------- + +.. bibliography:: asr_all.bib + :style: plain + :labelprefix: ASR-MODELS + :keyprefix: asr-models- diff --git a/docs/source/asr/resources.rst b/docs/source/asr/resources.rst new file mode 100644 index 0000000000000000000000000000000000000000..e192f5fbe83d72f812aafdec557fd8ab739e86c4 --- /dev/null +++ b/docs/source/asr/resources.rst @@ -0,0 +1,17 @@ +Resources and Documentation +--------------------------- + +Hands-on speech recognition tutorial notebooks can be found under `the ASR tutorials folder `_. +If you are a beginner to NeMo, consider trying out the `ASR with NeMo `_ tutorial. +This and most other tutorials can be run on Google Colab by specifying the link to the notebooks' GitHub pages on Colab. + +If you are looking for information about a particular ASR model, or would like to find out more about the model +architectures available in the `nemo_asr` collection, refer to the :doc:`Models <./models>` section. + +NeMo includes preprocessing scripts for several common ASR datasets. The :doc:`Datasets <./datasets>` section contains instructions on +running those scripts. It also includes guidance for creating your own NeMo-compatible dataset, if you have your own data. + +Information about how to load model checkpoints (either local files or pretrained ones from NGC), as well as a list of the checkpoints +available on NGC are located on the :doc:`Checkpoints <./results>` section. + +Documentation regarding the configuration files specific to the ``nemo_asr`` models can be found on the :doc:`Configuration Files <./configs>` section. diff --git a/docs/source/asr/results.rst b/docs/source/asr/results.rst new file mode 100644 index 0000000000000000000000000000000000000000..97b2aeb9550ec067cdf1cc5a2e20f01e0b7c62d6 --- /dev/null +++ b/docs/source/asr/results.rst @@ -0,0 +1,253 @@ +Checkpoints +=========== + +There are two main ways to load pretrained checkpoints in NeMo: + +* Using the :code:`restore_from()` method to load a local checkpoint file (``.nemo``), or +* Using the :code:`from_pretrained()` method to download and set up a checkpoint from NGC. + +Refer to the following sections for instructions and examples for each. + +Note that these instructions are for loading fully trained checkpoints for evaluation or fine-tuning. For resuming an unfinished +training experiment, use the Experiment Manager to do so by setting the ``resume_if_exists`` flag to ``True``. + +Loading Local Checkpoints +------------------------- + +NeMo automatically saves checkpoints of a model that is trained in a ``.nemo`` format. Alternatively, to manually save the model at any +point, issue :code:`model.save_to(.nemo)`. 
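+
+For example, a minimal sketch of saving a model you have in memory (the output file name is arbitrary, and the pretrained name used here is one of the checkpoints listed later on this page):
+
+.. code-block:: python
+
+    import nemo.collections.asr as nemo_asr
+
+    # Any NeMo ASR model can be saved this way.
+    model = nemo_asr.models.ASRModel.from_pretrained(model_name="QuartzNet15x5Base-En")
+
+    # Serialize the model weights and config into a single .nemo archive.
+    model.save_to("quartznet_15x5.nemo")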
+ +If there is a local ``.nemo`` checkpoint that you'd like to load, use the :code:`restore_from()` method: + +.. code-block:: python + + import nemo.collections.asr as nemo_asr + model = nemo_asr.models..restore_from(restore_path="") + +Where the model base class is the ASR model class of the original checkpoint, or the general ``ASRModel`` class. + +NGC Pretrained Checkpoints +-------------------------- + +The ASR collection has checkpoints of several models trained on various datasets for a variety of tasks. These checkpoints are +obtainable via NGC `NeMo Automatic Speech Recognition collection `_. +The model cards on NGC contain more information about each of the checkpoints available. + +The tables below list the ASR models available from NGC. The models can be accessed via the :code:`from_pretrained()` method inside +the ASR Model class. In general, you can load any of these models with code in the following format: + +.. code-block:: python + + import nemo.collections.asr as nemo_asr + model = nemo_asr.models.ASRModel.from_pretrained(model_name="") + +Where the model name is the value under "Model Name" entry in the tables below. + +For example, to load the base English QuartzNet model for speech recognition, run: + +.. code-block:: python + + model = nemo_asr.models.ASRModel.from_pretrained(model_name="QuartzNet15x5Base-En") + +You can also call :code:`from_pretrained()` from the specific model class (such as :code:`EncDecCTCModel` +for QuartzNet) if you need to access a specific model functionality. + +If you would like to programmatically list the models available for a particular base class, you can use the +:code:`list_available_models()` method. + +.. code-block:: python + + nemo_asr.models..list_available_models() + +Transcribing/Inference +^^^^^^^^^^^^^^^^^^^^^^ + +To perform inference and transcribe a sample of speech after loading the model, use the ``transcribe()`` method: + +.. code-block:: python + + model.transcribe(paths2audio_files=[list of audio files], batch_size=BATCH_SIZE, logprobs=False) + +Setting the argument ``logprobs`` to ``True`` returns the log probabilities instead of transcriptions. For more information, see `nemo.collections.asr.modules <./api.html#modules>`__. +The audio files should be 16KHz mono-channel wav files. + +Inference on long audio +^^^^^^^^^^^^^^^^^^^^^^ + +In some cases the audio is too long for standard inference, especially if you're using a model such as Conformer, where the time and memory costs of the attention layers scale quadratically with the duration. + +There are two main ways of performing inference on long audio files in NeMo: + +The first way is to use buffered inference, where the audio is divided into chunks to run on, and the output is merged afterwards. +The relevant scripts for this are contained in `this folder `_. + +The second way, specifically for models with the Conformer encoder, is to convert to local attention, which changes the costs to be linear. +This can be done even for models trained with full attention, though may result in lower WER in some cases. You can switch to local attention when running the +`transcribe `_ or `evaluation `_ +scripts in the following way: + +.. code-block:: python + + python speech_to_text_eval.py \ + (...other parameters...) \ + ++model_change.conformer.self_attention_model="rel_pos_local_attn" \ + ++model_change.conformer.att_context_size=[64, 64] + +Alternatively, you can change the attention model after loading a checkpoint: + +.. 
code-block:: python + + asr_model = ASRModel.from_pretrained('stt_en_conformer_ctc_large') + asr_model.change_attention_model( + self_attention_model="rel_pos_local_attn", + att_context_size=[64, 64] + ) + +Fine-tuning on Different Datasets +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +There are multiple ASR tutorials provided in the :ref:`Tutorials ` section. Most of these tutorials explain how to instantiate a pre-trained model, prepare the model for fine-tuning on some dataset (in the same language) as a demonstration. + +Inference Execution Flow Diagram +-------------------------------- + +When preparing your own inference scripts, please follow the execution flow diagram order for correct inference, found at the `examples directory for ASR collection `_. + +Automatic Speech Recognition Models +----------------------------------- + +Below is a list of all the ASR models that are available in NeMo for specific languages, as well as auxiliary language models for certain languages. + +Language Models for ASR +^^^^^^^^^^^^^^^^^^^^^^^ + +.. csv-table:: + :file: data/asrlm_results.csv + :align: left + :widths: 30, 30, 40 + :header-rows: 1 + +| + +Speech Recognition (Languages) +------------------------------ + +English +^^^^^^^ +.. csv-table:: + :file: data/benchmark_en.csv + :align: left + :widths: 40, 10, 50 + :header-rows: 1 + +----------------------------- + +Mandarin +^^^^^^^^ +.. csv-table:: + :file: data/benchmark_zh.csv + :align: left + :widths: 40, 10, 50 + :header-rows: 1 + +----------------------------- + +German +^^^^^^ +.. csv-table:: + :file: data/benchmark_de.csv + :align: left + :widths: 40, 10, 50 + :header-rows: 1 + +----------------------------- + +French +^^^^^^ +.. csv-table:: + :file: data/benchmark_fr.csv + :align: left + :widths: 40, 10, 50 + :header-rows: 1 + +----------------------------- + +Polish +^^^^^^ +.. csv-table:: + :file: data/benchmark_pl.csv + :align: left + :widths: 40, 10, 50 + :header-rows: 1 + +----------------------------- + +Italian +^^^^^^^ +.. csv-table:: + :file: data/benchmark_it.csv + :align: left + :widths: 40, 10, 50 + :header-rows: 1 + +----------------------------- + +Russian +^^^^^^^ +.. csv-table:: + :file: data/benchmark_ru.csv + :align: left + :widths: 40, 10, 50 + :header-rows: 1 + +----------------------------- + +Spanish +^^^^^^^ +.. csv-table:: + :file: data/benchmark_es.csv + :align: left + :widths: 40, 10, 50 + :header-rows: 1 + + +----------------------------- + +Catalan +^^^^^^^ +.. csv-table:: + :file: data/benchmark_ca.csv + :align: left + :widths: 40, 10, 50 + :header-rows: 1 + +----------------------------- + +Hindi +^^^^^^^ +.. csv-table:: + :file: data/benchmark_hi.csv + :align: left + :widths: 40, 10, 50 + :header-rows: 1 + +----------------------------- + +Marathi +^^^^^^^ +.. csv-table:: + :file: data/benchmark_mr.csv + :align: left + :widths: 40, 10, 50 + :header-rows: 1 + +----------------------------- + +Kinyarwanda +^^^^^^^^^^^ +.. csv-table:: + :file: data/benchmark_rw.csv + :align: left + :widths: 40, 10, 50 + :header-rows: 1 + diff --git a/docs/source/asr/scores.rst b/docs/source/asr/scores.rst new file mode 100644 index 0000000000000000000000000000000000000000..bcb083bd917e4c5d765be42fcb0872fac8262ed4 --- /dev/null +++ b/docs/source/asr/scores.rst @@ -0,0 +1,289 @@ +.. + AUTOGENERATED DOC: DO NOT EDIT MANUALLY ! + +Scores +------ + +EN +^^ + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/en/citrinet_en.csv + +-------------------- + +.. 
csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/en/conformer_en.csv + +-------------------- + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/en/contextnet_en.csv + +-------------------- + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/en/jasper10x5dr_en.csv + +-------------------- + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/en/quartznet15x5_en.csv + +-------------------- + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/en/squeezeformer_en.csv + +-------------------- + +BE +^^ + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/be/conformer_be.csv + +-------------------- + +CA +^^ + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/ca/conformer_ca.csv + +-------------------- + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/ca/quartznet15x5_ca.csv + +-------------------- + +DE +^^ + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/de/citrinet_de.csv + +-------------------- + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/de/conformer_de.csv + +-------------------- + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/de/contextnet_de.csv + +-------------------- + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/de/quartznet15x5_de.csv + +-------------------- + +ENES +^^^^ + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/enes/conformer_enes.csv + +-------------------- + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/enes/contextnet_enes.csv + +-------------------- + +EO +^^ + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/eo/conformer_eo.csv + +-------------------- + +ES +^^ + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/es/citrinet_es.csv + +-------------------- + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/es/conformer_es.csv + +-------------------- + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/es/contextnet_es.csv + +-------------------- + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/es/quartznet15x5_es.csv + +-------------------- + +FR +^^ + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/fr/citrinet_fr.csv + +-------------------- + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/fr/conformer_fr.csv + +-------------------- + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/fr/contextnet_fr.csv + +-------------------- + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/fr/quartznet15x5_fr.csv + +-------------------- + +HR +^^ + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/hr/conformer_hr.csv + +-------------------- + +IT +^^ + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/it/conformer_it.csv + +-------------------- + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/it/quartznet15x5_it.csv + +-------------------- + +KAB +^^^ + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/kab/conformer_kab.csv + +-------------------- + +PL +^^ + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/pl/quartznet15x5_pl.csv + +-------------------- + +RU +^^ + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/ru/conformer_ru.csv + +-------------------- + +.. 
csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/ru/quartznet15x5_ru.csv + +-------------------- + +RW +^^ + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/rw/conformer_rw.csv + +-------------------- + +ZH +^^ + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/zh/citrinet_zh.csv + +-------------------- + +.. csv-table:: + :header-rows: 1 + :align: left + :file: data/scores/zh/conformer_zh.csv + +-------------------- + diff --git a/docs/source/asr/speaker_diarization/api.rst b/docs/source/asr/speaker_diarization/api.rst new file mode 100644 index 0000000000000000000000000000000000000000..37feabaed9f8dc1c082d6b5e6be4f06011c1e75e --- /dev/null +++ b/docs/source/asr/speaker_diarization/api.rst @@ -0,0 +1,20 @@ +NeMo Speaker Diarization API +============================= + + +Model Classes +------------- +.. autoclass:: nemo.collections.asr.models.ClusteringDiarizer + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.models.EncDecDiarLabelModel + :show-inheritance: + :members: add_speaker_model_config, _init_segmentation_info, _init_speaker_model, setup_training_data, setup_validation_data, setup_test_data, get_ms_emb_seq, get_cluster_avg_embs_model, get_ms_mel_feat, forward, forward_infer, training_step, validation_step, compute_accuracies + +Mixins +------ +.. autoclass:: nemo.collections.asr.parts.mixins.mixins.DiarizationMixin + :show-inheritance: + :members: + diff --git a/docs/source/asr/speaker_diarization/configs.rst b/docs/source/asr/speaker_diarization/configs.rst new file mode 100644 index 0000000000000000000000000000000000000000..ebf6e86b4be77075eff915f0831542f91ed35bbd --- /dev/null +++ b/docs/source/asr/speaker_diarization/configs.rst @@ -0,0 +1,253 @@ +NeMo Speaker Diarization Configuration Files +============================================ + +Both training and inference of speaker diarization is configured by ``.yaml`` files. The diarizer section will generally require information about the dataset(s) being used, models used in this pipeline, as well as inference related parameters such as post processing of each models. The sections on this page cover each of these in more detail. + +.. note:: + For model details and deep understanding about configs, training, fine-tuning and evaluations, + please refer to ``/tutorials/speaker_tasks/Speaker_Diarization_Inference.ipynb`` and ``/tutorials/speaker_tasks/Speaker_Diarization_Training.ipynb``; + for other applications such as possible integration with ASR, have a look at ``/tutorials/speaker_tasks/ASR_with_SpeakerDiarization.ipynb``. + + +Hydra Configurations for Diarization Training +============================================= + +Currently, NeMo supports Multi-scale diarization decoder (MSDD) as a neural diarizer model. MSDD is a speaker diarization model based on initializing clustering and multi-scale segmentation input. Example configuration files for MSDD model training can be found in ``/examples/speaker_tasks/diarization/conf/neural_diarizer/``. + +* Model name convention for MSDD: msdd_scl___Povl_xxx +* Example: ``msdd_5scl_15_05_50Povl_256x3x32x2.yaml`` has 5 scales, the longest scale is 1.5 sec, the shortest scale is 0.5 sec, with 50 percent overlap, hidden layer size is 256, 3 LSTM layers, 32 CNN channels, 2 repeated Conv layers + +MSDD model checkpoint (.ckpt) and NeMo file (.nemo) contain speaker embedding model (TitaNet) and the speaker model is loaded along with standalone MSDD module. Note that MSDD models require more than one scale. 
Thus, the parameters in ``diarizer.speaker_embeddings.parameters`` should have more than one scale to function as a MSDD model. + + +General Diarizer Configuration +------------------------------ + +The items (OmegaConfig keys) directly under ``model`` determines segmentation and clustering related parameters. Multi-scale parameters (``window_length_in_sec``, ``shift_length_in_sec`` and ``multiscale_weights``) are specified. ``max_num_of_spks``, ``scale_n``, ``soft_label_thres`` and ``emb_batch_size`` are set here and then assigned to dataset configurations. + +.. code-block:: yaml + + diarizer: + out_dir: null + oracle_vad: True # If True, uses RTTM files provided in manifest file to get speech activity (VAD) timestamps + speaker_embeddings: + model_path: ??? # .nemo local model path or pretrained model name (titanet_large is recommended) + parameters: + window_length_in_sec: [1.5,1.25,1.0,0.75,0.5] # Window length(s) in sec (floating-point number). either a number or a list. ex) 1.5 or [1.5,1.0,0.5] + shift_length_in_sec: [0.75,0.625,0.5,0.375,0.25] # Shift length(s) in sec (floating-point number). either a number or a list. ex) 0.75 or [0.75,0.5,0.25] + multiscale_weights: [1,1,1,1,1] # Weight for each scale. should be null (for single scale) or a list matched with window/shift scale count. ex) [0.33,0.33,0.33] + save_embeddings: True # Save embeddings as pickle file for each audio input. + + + num_workers: ${num_workers} # Number of workers used for data-loading. + max_num_of_spks: 2 # Number of speakers per model. This is currently fixed at 2. + scale_n: 5 # Number of scales for MSDD model and initializing clustering. + soft_label_thres: 0.5 # Threshold for creating discretized speaker label from continuous speaker label in RTTM files. + emb_batch_size: 0 # If this value is bigger than 0, corresponding number of embedding vectors are attached to torch graph and trained. + +Dataset Configuration +--------------------- + +Training, validation, and test parameters are specified using the ``train_ds``, ``validation_ds``, and +``test_ds`` sections in the configuration YAML file, respectively. The items such as ``num_spks``, ``soft_label_thres`` and ``emb_batch_size`` follow the settings in ``model`` key. You may also leave fields such as the ``manifest_filepath`` or ``emb_dir`` blank, and then specify it via command-line interface. Note that ``test_ds`` is not used during training and only used for speaker diarization inference. + +.. code-block:: yaml + + train_ds: + manifest_filepath: ??? + emb_dir: ??? + sample_rate: ${sample_rate} + num_spks: ${model.max_num_of_spks} + soft_label_thres: ${model.soft_label_thres} + labels: null + batch_size: ${batch_size} + emb_batch_size: ${model.emb_batch_size} + shuffle: True + + validation_ds: + manifest_filepath: ??? + emb_dir: ??? + sample_rate: ${sample_rate} + num_spks: ${model.max_num_of_spks} + soft_label_thres: ${model.soft_label_thres} + labels: null + batch_size: 2 + emb_batch_size: ${model.emb_batch_size} + shuffle: False + + test_ds: + manifest_filepath: null + emb_dir: null + sample_rate: 16000 + num_spks: ${model.max_num_of_spks} + soft_label_thres: ${model.soft_label_thres} + labels: null + batch_size: 2 + shuffle: False + seq_eval_mode: False + + +Pre-processor Configuration +--------------------------- + +In the MSDD configuration, pre-processor configuration follows the pre-processor of the embedding extractor model. + +.. 
code-block:: yaml + + preprocessor: + _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor + normalize: "per_feature" + window_size: 0.025 + sample_rate: ${sample_rate} + window_stride: 0.01 + window: "hann" + features: 80 + n_fft: 512 + frame_splicing: 1 + dither: 0.00001 + + +Model Architecture Configurations +--------------------------------- + +The hyper-parameters for MSDD models are under the ``msdd_module`` key. The model architecture can be changed by setting up the ``weighting_scheme`` and ``context_vector_type``. The detailed explanation for architecture can be found in the :doc:`Models <./models>` page. + +.. code-block:: yaml + + msdd_module: + _target_: nemo.collections.asr.modules.msdd_diarizer.MSDD_module + num_spks: ${model.max_num_of_spks} # Number of speakers per model. This is currently fixed at 2. + hidden_size: 256 # Hidden layer size for linear layers in MSDD module + num_lstm_layers: 3 # Number of stacked LSTM layers + dropout_rate: 0.5 # Dropout rate + cnn_output_ch: 32 # Number of filters in a conv-net layer. + conv_repeat: 2 # Determins the number of conv-net layers. Should be greater or equal to 1. + emb_dim: 192 # Dimension of the speaker embedding vectors + scale_n: ${model.scale_n} # Number of scales for multiscale segmentation input + weighting_scheme: 'conv_scale_weight' # Type of weighting algorithm. Options: ('conv_scale_weight', 'attn_scale_weight') + context_vector_type: 'cos_sim' # Type of context vector: options. Options: ('cos_sim', 'elem_prod') + +Loss Configurations +------------------- + +Neural diarizer uses a binary cross entropy (BCE) loss. A set of weights for negative (absence of the speaker's speech) and positive (presence of the speaker's speech) can be provided to the loss function. + +.. code-block:: yaml + + loss: + _target_: nemo.collections.asr.losses.bce_loss.BCELoss + weight: null # Weight for binary cross-entropy loss. Either `null` or list type input. (e.g. [0.5,0.5]) + + +Hydra Configurations for Diarization Inference +============================================== + +Example configuration files for speaker diarization inference can be found in ``/examples/speaker_tasks/diarization/conf/inference/``. Choose a yaml file that fits your targeted domain. For example, if you want to diarize audio recordings of telephonic speech, choose ``diar_infer_telephonic.yaml``. + +The configurations for all the components of diarization inference are included in a single file named ``diar_infer_.yaml``. Each ``.yaml`` file has a few different sections for the following modules: VAD, Speaker Embedding, Clustering and ASR. + +In speaker diarization inference, the datasets provided in manifest format denote the data that you would like to perform speaker diarization on. + +Diarizer Configurations +----------------------- + +An example ``diarizer`` Hydra configuration could look like: + +.. code-block:: yaml + + diarizer: + manifest_filepath: ??? + out_dir: ??? + oracle_vad: False # If True, uses RTTM files provided in manifest file to get speech activity (VAD) timestamps + collar: 0.25 # Collar value for scoring + ignore_overlap: True # Consider or ignore overlap segments while scoring + +Under ``diarizer`` key, there are ``vad``, ``speaker_embeddings``, ``clustering`` and ``asr`` keys containing configurations for the inference of the corresponding modules. + +Configurations for Voice Activity Detector +------------------------------------------ + +Parameters for VAD model are provided as in the following Hydra config example. 
+ +.. code-block:: yaml + + vad: + model_path: null # .nemo local model path or pretrained model name or none + external_vad_manifest: null # This option is provided to use external vad and provide its speech activity labels for speaker embeddings extraction. Only one of model_path or external_vad_manifest should be set + + parameters: # Tuned parameters for CH109 (using the 11 multi-speaker sessions as dev set) + window_length_in_sec: 0.15 # Window length in sec for VAD context input + shift_length_in_sec: 0.01 # Shift length in sec for generate frame level VAD prediction + smoothing: "median" # False or type of smoothing method (eg: median) + overlap: 0.875 # Overlap ratio for overlapped mean/median smoothing filter + onset: 0.4 # Onset threshold for detecting the beginning and end of a speech + offset: 0.7 # Offset threshold for detecting the end of a speech + pad_onset: 0.05 # Adding durations before each speech segment + pad_offset: -0.1 # Adding durations after each speech segment + min_duration_on: 0.2 # Threshold for small non_speech deletion + min_duration_off: 0.2 # Threshold for short speech segment deletion + filter_speech_first: True + +Configurations for Speaker Embedding in Diarization +--------------------------------------------------- + +Parameters for speaker embedding model are provided in the following Hydra config example. Note that multiscale parameters either accept list or single floating point number. + +.. code-block:: yaml + + speaker_embeddings: + model_path: ??? # .nemo local model path or pretrained model name (titanet_large, ecapa_tdnn or speakerverification_speakernet) + parameters: + window_length_in_sec: 1.5 # Window length(s) in sec (floating-point number). Either a number or a list. Ex) 1.5 or [1.5,1.25,1.0,0.75,0.5] + shift_length_in_sec: 0.75 # Shift length(s) in sec (floating-point number). Either a number or a list. Ex) 0.75 or [0.75,0.625,0.5,0.375,0.25] + multiscale_weights: null # Weight for each scale. should be null (for single scale) or a list matched with window/shift scale count. Ex) [1,1,1,1,1] + save_embeddings: False # Save embeddings as pickle file for each audio input. + +Configurations for Clustering in Diarization +-------------------------------------------- + +Parameters for clustering algorithm are provided in the following Hydra config example. + +.. code-block:: yaml + + clustering: + parameters: + oracle_num_speakers: False # If True, use num of speakers value provided in the manifest file. + max_num_speakers: 20 # Max number of speakers for each recording. If oracle_num_speakers is passed, this value is ignored. + enhanced_count_thres: 80 # If the number of segments is lower than this number, enhanced speaker counting is activated. + max_rp_threshold: 0.25 # Determines the range of p-value search: 0 < p <= max_rp_threshold. + sparse_search_volume: 30 # The higher the number, the more values will be examined with more time. + +Configurations for Diarization with ASR +--------------------------------------- + +The following configuration needs to be appended under ``diarizer`` to run ASR with diarization to get a transcription with speaker labels. + +.. code-block:: yaml + + asr: + model_path: ??? # Provide NGC cloud ASR model name. stt_en_conformer_ctc_* models are recommended for diarization purposes. + parameters: + asr_based_vad: False # if True, speech segmentation for diarization is based on word-timestamps from ASR inference. 
+ asr_based_vad_threshold: 50 # threshold (multiple of 10ms) for ignoring the gap between two words when generating VAD timestamps using ASR based VAD. + asr_batch_size: null # Batch size can be dependent on each ASR model. Default batch sizes are applied if set to null. + lenient_overlap_WDER: True # If true, when a word falls into speaker-overlapped regions, consider the word as a correctly diarized word. + decoder_delay_in_sec: null # Native decoder delay. null is recommended to use the default values for each ASR model. + word_ts_anchor_offset: null # Offset to set a reference point from the start of the word. Recommended range of values is [-0.05 0.2]. + word_ts_anchor_pos: "start" # Select which part of the word timestamp we want to use. The options are: 'start', 'end', 'mid'. + fix_word_ts_with_VAD: False # Fix the word timestamp using VAD output. You must provide a VAD model to use this feature. + colored_text: False # If True, use colored text to distinguish speakers in the output transcript. + print_time: True # If True, the start of the end time of each speaker turn is printed in the output transcript. + break_lines: False # If True, the output transcript breaks the line to fix the line width (default is 90 chars) + + ctc_decoder_parameters: # Optional beam search decoder (pyctcdecode) + pretrained_language_model: null # KenLM model file: .arpa model file or .bin binary file. + beam_width: 32 + alpha: 0.5 + beta: 2.5 + + realigning_lm_parameters: # Experimental feature + arpa_language_model: null # Provide a KenLM language model in .arpa format. + min_number_of_words: 3 # Min number of words for the left context. + max_number_of_words: 10 # Max number of words for the right context. + logprob_diff_threshold: 1.2 # The threshold for the difference between two log probability values from two hypotheses. diff --git a/docs/source/asr/speaker_diarization/data/diarization_results.csv b/docs/source/asr/speaker_diarization/data/diarization_results.csv new file mode 100644 index 0000000000000000000000000000000000000000..fc3594520ea4e1b1ebfbeeab8f8cd534c4954682 --- /dev/null +++ b/docs/source/asr/speaker_diarization/data/diarization_results.csv @@ -0,0 +1,7 @@ +Model Name,Model Base Class,Model Card +vad_multilingual_marblenet,EncDecClassificationModel,"https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/vad_multilingual_marblenet" +vad_marblenet,EncDecClassificationModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:vad_marblenet" +vad_telephony_marblenet,EncDecClassificationModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:vad_telephony_marblenet" +titanet_large,EncDecSpeakerLabelModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:titanet_large" +ecapa_tdnn,EncDecSpeakerLabelModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:ecapa_tdnn" +diar_msdd_telephonic,EncDecDiarLabelModel,"https://ngc.nvidia.com/catalog/models/nvidia:diar_msdd_telephonic" diff --git a/docs/source/asr/speaker_diarization/datasets.rst b/docs/source/asr/speaker_diarization/datasets.rst new file mode 100644 index 0000000000000000000000000000000000000000..ff73dad8601a6cf96f2440ae31629f817da74a6e --- /dev/null +++ b/docs/source/asr/speaker_diarization/datasets.rst @@ -0,0 +1,264 @@ +Datasets +======== + +This page is about formatting a dataset for diarization training and inference. 
To train or fine-tune the speaker diarization system, you can either train/fine-tune the speaker embedding extractor model separately, or train/fine-tune the speaker embedding extractor and the neural diarizer at the same time.
+
+* To train or fine-tune a speaker embedding extractor model separately, please check out these pages: :doc:`Speech Classification Datasets <../speech_classification/datasets>` and :doc:`Speaker Recognition Datasets <../speaker_recognition/datasets>` for preparing datasets for training and validating VAD and speaker embedding models, respectively.
+
+* To train or fine-tune the speaker embedding extractor and the neural diarizer together, please follow the dataset preparation process on this page.
+
+Data Preparation for Training
+-----------------------------
+
+.. image:: images/msdd_train_and_infer.png
+   :align: center
+   :width: 800px
+   :alt: MSDD training and inference
+
+As shown in the figure above, a full-fledged speaker diarization pipeline runs through a speaker embedding extractor, a clustering algorithm and a neural diarizer. Note that only the speaker embedding extractor and the neural diarizer are trainable models, and they can be trained/fine-tuned together on diarization datasets. We recommend using a speaker embedding extractor model that has been trained on a large amount of single-speaker data, and then using it to train a neural diarizer model.
+
+Speaker diarization training is also managed by Hydra configurations based on ``.yaml`` files, just as in other NeMo neural models. See :doc:`NeMo Speaker Diarization Configuration Files <./configs>` for setting up the input Hydra configuration file for speaker diarization. Input data should be provided in line-delimited JSON format as below:
+
+* Create a manifest file for speaker diarization
+
+Speaker diarization training and inference both require the same type of manifest files. This manifest file can be created by using the script in ``/scripts/speaker_tasks/pathfiles_to_diarize_manifest.py``. The following example shows how to run ``pathfiles_to_diarize_manifest.py`` by providing path list files.
+
+.. code-block:: shell-session
+
+    python NeMo/scripts/speaker_tasks/pathfiles_to_diarize_manifest.py \
+        --paths2audio_files='/path/to/audio_file_path_list.txt' \
+        --paths2rttm_files='/path/to/rttm_file_list.txt' \
+        --manifest_filepath='/path/to/manifest_filepath/train_manifest.json'
+
+All three arguments are required. Note that we need to maintain consistency on unique filenames for every field (key) by only changing the filename extensions. For example, if there is an audio file named ``abcd01.wav``, the rttm file should be named ``abcd01.rttm`` and the transcription file should be named ``abcd01.txt``.
+
+- Example audio file path list ``audio_file_path_list.txt``
+
+.. code-block:: bash
+
+    /path/to/abcd01.wav
+    /path/to/abcd02.wav
+
+To train a diarization model, one needs to provide Rich Transcription Time Marked (RTTM) files as ground truth label files. Here is one line from an RTTM file as an example:
+
+.. code-block:: bash
+
+    SPEAKER TS3012d.Mix-Headset 1 331.573 0.671 MTD046ID
+
+Make a list of RTTM files for the audio files you have in ``audio_file_path_list.txt``.
+
+- Example RTTM file path list ``rttm_file_path_list.txt``
+
+.. code-block:: bash
+
+    /path/to/abcd01.rttm
+    /path/to/abcd02.rttm
+
+.. note::
+   We expect all the provided files (e.g. audio, rttm, text) to have the same base name, and the base name should be unique (uniq-id).
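+
+If you prefer to assemble the manifest programmatically rather than with the script above, a minimal sketch could look like the following (the paths and the speaker count are placeholders; the JSON fields mirror the example entry shown just below):
+
+.. code-block:: python
+
+    import json
+
+    audio_rttm_pairs = [
+        ("/path/to/abcd01.wav", "/path/to/abcd01.rttm"),
+        ("/path/to/abcd02.wav", "/path/to/abcd02.rttm"),
+    ]
+
+    with open("train_manifest.json", "w") as f:
+        for audio_path, rttm_path in audio_rttm_pairs:
+            entry = {
+                "audio_filepath": audio_path,
+                "offset": 0,
+                "duration": None,   # written as null in the JSON line
+                "label": "infer",
+                "text": "-",
+                "num_speakers": 2,  # set per session, based on the RTTM file
+                "rttm_filepath": rttm_path,
+            }
+            f.write(json.dumps(entry) + "\n")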
+ +As an output file, ``train_manifest.json`` will have the following line for each audio file: + +.. code-block:: bash + + {"audio_filepath": "/path/to/abcd01.wav", "offset": 0, "duration": null, "label": "infer", "text": "-", "num_speakers": 2, "rttm_filepath": "/path/to/rttm/abcd01.rttm"} + + +* Manifest files for MSDD training + +After generating a session-wise manifest file, we need to break down each session-wise manifest file into a split manifest file containing start time and duration of the split samples due to memory capacity. More importantly, since MSDD only uses pairwise (two-speaker) model and data samples, we need to split RTTM files if there are more than two speakers. + +Note that you should specify window length and shift length of the base scale of your MSDD model when you generate the manifest file for training samples. More importantly, ``step_count`` determines how many steps (i.e., base-scale segments) are in a split data sample. If ``step_count`` is too long, you might not be able to load a single sample in a batch. + +.. code-block:: bash + + python NeMo/scripts/speaker_tasks/create_msdd_train_dataset.py \ + --input_manifest_path='path/to/train_manifest.json' \ + --output_manifest_path='path/to/train_manifest.50step.json' \ + --pairwise_rttm_output_folder='path/to/rttm_output_folder' \ + --window=0.5 \ + --shift=0.25 \ + --step_count=50 + +All arguments are required to generate a new manifest file. Specify a session-wise diarization manifest file to ``--input_manifest_path`` and specify an output file name in ``--output_manifest_path``. In the folder that is specified for ``--pairwise_rttm_output_folder``, the script will create multiple two-speaker RTTM files from the given RTTM file and create manifest file that only contains two speakers in the specified RTTM range. + + +For example, if ``abcd01.wav`` has three speakers (``1911,1988,192``), the three RTTM files will be created: ``abcd01.1911_1988.rttm``, ``abcd01.1911_192.rttm`` and ``abcd01.1988_192.rttm``. Subsequently, the segments will be only generated from the newly generated two-speaker RTTM files. + + +Specify ``window`` and ``shift`` of the base-scale in your MSDD model. In this example, we use default setting of ``window=0.5`` and ``shift=0.25`` and ``step_count=50``. Here are example lines in the output file ``/path/to/train_manifest.50step.json``. + +- Example manifest file ``train_manifest.50step.json``. + +.. code-block:: bash + + {"audio_filepath": "/path/to/abcd01.wav", "offset": 0.007, "duration": 14.046, "label": "infer", "text": "-", "num_speakers": 2, "rttm_filepath": "simulated_train/abcd01.1919_1988.rttm"} + {"audio_filepath": "/path/to/abcd01.wav", "offset": 13.553, "duration": 16.429, "label": "infer", "text": "-", "num_speakers": 2, "rttm_filepath": "simulated_train/abcd01.1919_1988.rttm"} + {"audio_filepath": "/path/to/abcd02.wav", "offset": 0.246, "duration": 15.732, "label": "infer", "text": "-", "num_speakers": 2, "rttm_filepath": "path/to/rttm_output_folder/abcd02.777_5694.rttm"} + {"audio_filepath": "/path/to/abcd02.wav", "offset": 15.478, "duration": 14.47, "label": "infer", "text": "-", "num_speakers": 2, "rttm_filepath": "path/to/rttm_output_folder/abcd02.777_5694.rttm"} + + +Prepare the msdd training dataset for both train and validation. After the training dataset is prepared, you can train an MSDD model with the following script: + +.. 
code-block:: bash
+
+    python ./multiscale_diar_decoder.py --config-path='../conf/neural_diarizer' --config-name='msdd_5scl_15_05_50Povl_256x3x32x2.yaml' \
+        trainer.gpus=1 \
+        trainer.max_epochs=20 \
+        model.base.diarizer.speaker_embeddings.model_path="titanet_large" \
+        model.train_ds.manifest_filepath="" \
+        model.validation_ds.manifest_filepath="" \
+        model.train_ds.emb_dir="" \
+        model.validation_ds.emb_dir="" \
+        exp_manager.name='sample_train' \
+        exp_manager.exp_dir='./msdd_exp'
+
+In the above example training session, we use the ``titanet_large`` model as the pretrained speaker embedding model.
+
+Data Preparation for Inference
+------------------------------
+
+As with dataset preparation for diarization training, diarization inference is based on Hydra configurations defined in ``.yaml`` files. See :doc:`NeMo Speaker Diarization Configuration Files <./configs>` for setting up the input Hydra configuration file for speaker diarization inference. Input data should be provided in line-delimited JSON format as below:
+
+.. code-block:: bash
+
+    {"audio_filepath": "/path/to/abcd.wav", "offset": 0, "duration": null, "label": "infer", "text": "-", "num_speakers": null, "rttm_filepath": "/path/to/rttm/abcd.rttm", "uem_filepath": "/path/to/uem/abcd.uem"}
+
+In each line of the input manifest file, the ``audio_filepath`` item is mandatory, while the rest of the items are optional and can be passed for the desired diarization setting. We refer to this file as a manifest file. This manifest file can be created by using the script in ``/scripts/speaker_tasks/pathfiles_to_diarize_manifest.py``. The following example shows how to run ``pathfiles_to_diarize_manifest.py`` by providing path list files.
+
+.. code-block:: bash
+
+    python pathfiles_to_diarize_manifest.py --paths2audio_files /path/to/audio_file_path_list.txt \
+        --paths2txt_files /path/to/transcript_file_path_list.txt \
+        --paths2rttm_files /path/to/rttm_file_path_list.txt \
+        --paths2uem_files /path/to/uem_file_path_list.txt \
+        --paths2ctm_files /path/to/ctm_file_path_list.txt \
+        --manifest_filepath /path/to/manifest_output/input_manifest.json
+
+``--paths2audio_files`` and ``--manifest_filepath`` are required arguments. Note that we need to maintain consistency on unique filenames for every field (key) by only changing the filename extensions. For example, if there is an audio file named ``abcd.wav``, the rttm file should be named ``abcd.rttm`` and the transcription file should be named ``abcd.txt``.
+
+- Example audio file path list ``audio_file_path_list.txt``
+
+.. code-block:: bash
+
+    /path/to/abcd01.wav
+    /path/to/abcd02.wav
+
+- Example RTTM file path list ``rttm_file_path_list.txt``
+
+.. code-block:: bash
+
+    /path/to/abcd01.rttm
+    /path/to/abcd02.rttm
+
+The path list files containing the absolute paths to these WAV, RTTM, TXT, CTM and UEM files should be provided as in the above example. The ``pathfiles_to_diarize_manifest.py`` script will match each file using the unique filename (e.g. ``abcd``). Finally, the absolute path of the created manifest file should be provided through the Hydra configuration as shown below:
+
+.. code-block:: yaml
+
+    diarizer.manifest_filepath="path/to/manifest/input_manifest.json"
+
+The following are descriptions of each field in an input manifest JSON file.
+
+.. note::
+   We expect all the provided files (e.g. audio, rttm, text) to have the same base name, and the base name should be unique (uniq-id).
+
+``audio_filepath`` (Required):
+
+ A string containing the absolute path to the audio file.
+
+``num_speakers`` (Optional):
+
+ If the number of speakers is known, provide the integer number, or assign null if not known.
+
+``rttm_filepath`` (Optional):
+
+ To evaluate a diarization system with known RTTM files, one needs to provide Rich Transcription Time Marked (RTTM) files as ground truth label files. If RTTM files are provided, the diarization evaluation will be initiated. Here is one line from an RTTM file as an example:
+
+.. code-block:: bash
+
+ SPEAKER TS3012d.Mix-Headset 1 331.573 0.671 <NA> <NA> MTD046ID <NA> <NA>
+
+``text`` (Optional):
+
+ Ground truth transcription for diarization with ASR inference. Provide the ground truth transcription of the given audio file in string format:
+
+.. code-block:: bash
+
+ {"text": "this is an example transcript"}
+
+``uem_filepath`` (Optional):
+
+ The UEM file is used for specifying the scoring regions to be evaluated in the given audio file.
+ The UEM file follows the convention ``<uniq-id> <channel ID> <start time> <end time>``, where ``<channel ID>`` is set to 1.
+
+ Example lines of UEM file:
+
+.. code-block:: bash
+
+ TS3012d.Mix-Headset 1 12.31 108.98
+ TS3012d.Mix-Headset 1 214.00 857.09
+
+``ctm_filepath`` (Optional):
+
+ The CTM file is used for the evaluation of word-level diarization results and word-timestamp alignment. The CTM file follows the convention ``<uniq-id> <speaker ID> <word start time> <word duration> <word> <confidence>``. Since confidence is not required for evaluating diarization results, it can have any value. Note that the ``<speaker ID>`` should exactly match the speaker IDs in the RTTM file.
+
+ Example lines of CTM file:
+
+.. code-block:: bash
+
+ TS3012d.Mix-Headset MTD046ID 12.879 0.32 okay 0
+ TS3012d.Mix-Headset MTD046ID 13.203 0.24 yeah 0
+
+
+Evaluation on Benchmark Datasets
+--------------------------------
+
+The following instructions can help one reproduce the expected diarization performance on two benchmark English dialogue datasets. The following results are evaluations based on a 0.25 second collar without evaluating overlapped speech. The evaluation is based on oracle VAD results from RTTM files. Therefore, the diarization error rate (DER) is equal to the confusion error rate since oracle VAD has no miss detection or false alarm.
+
+AMI Meeting Corpus
+~~~~~~~~~~~~~~~~~~
+
+The following are the suggested parameters for reproducing the diarization performance for the `AMI `_ test set. This setting is based on the meeting domain configuration in ``/examples/speaker_tasks/diarization/conf/inference/diar_infer_meeting.yaml``.
+
+.. code-block:: bash
+
+ diarizer.manifest_filepath="/path/to/AMItest_input_manifest.json"
+ diarizer.oracle_num_speakers=null # Performing unknown number of speaker case
+ diarizer.oracle_vad=True # Use oracle VAD extracted from RTTM files.
+ diarizer.collar=0.25
+ diarizer.ignore_overlap=True
+ diarizer.speaker_embeddings.model_path="titanet_large"
+
+We provide a helper script to download the dataset and format it into a NeMo manifest.
+
+.. code-block:: bash
+
+ python scripts/data_processing/speaker_tasks/get_ami_data.py --manifest_filepath AMItest_input_manifest.json
+
+
+CallHome American English Speech (CHAES), LDC97S42
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We use the CH109 set, which is a subset of the CHAES dataset that has only two speakers in one session.
+The following are the suggested parameters for reproducing the diarization performance for the CH109 set. This setting is based on the telephonic domain configuration in ``/examples/speaker_tasks/diarization/conf/inference/diar_infer_telephonic.yaml``.
+
+.. 
code-block:: bash + + diarizer.manifest_filepath="/path/to/ch109_input_manifest.json" + diarizer.oracle_vad=True # Use oracle VAD extracted from RTTM files. + diarizer.collar=0.25 + diarizer.ignore_overlap=True + diarizer.speaker_embeddings.model_path="titanet_large" + + +To evaluate the performance on AMI Meeting Corpus, the following instructions can help. + - Download CHAES Meeting Corpus at LDC website `LDC97S42 `_ (CHAES is not publicly available). + - Download the CH109 filename list (whitelist) from `CH109 whitelist `_. + - Download RTTM files for CH109 set from `CH109 RTTM files `_. + - Generate an input manifest file using ``/scripts/speaker_tasks/pathfiles_to_diarize_manifest.py`` + diff --git a/docs/source/asr/speaker_diarization/images/asr_sd_diagram.png b/docs/source/asr/speaker_diarization/images/asr_sd_diagram.png new file mode 100644 index 0000000000000000000000000000000000000000..ba7613fcc75cfc636ab35018ffd5d8b3642a1939 Binary files /dev/null and b/docs/source/asr/speaker_diarization/images/asr_sd_diagram.png differ diff --git a/docs/source/asr/speaker_diarization/images/data_flow.png b/docs/source/asr/speaker_diarization/images/data_flow.png new file mode 100644 index 0000000000000000000000000000000000000000..5d7a878eb61a06601b53ad061fe48678c2445fe3 Binary files /dev/null and b/docs/source/asr/speaker_diarization/images/data_flow.png differ diff --git a/docs/source/asr/speaker_diarization/images/ms_trade_off.png b/docs/source/asr/speaker_diarization/images/ms_trade_off.png new file mode 100644 index 0000000000000000000000000000000000000000..a9e4e18c00055b3aa1e5f48a29258680bb7edff1 Binary files /dev/null and b/docs/source/asr/speaker_diarization/images/ms_trade_off.png differ diff --git a/docs/source/asr/speaker_diarization/images/msdd_train_and_infer.png b/docs/source/asr/speaker_diarization/images/msdd_train_and_infer.png new file mode 100644 index 0000000000000000000000000000000000000000..6af539c9f5ffd2f9a54a55e75fe82656af744706 Binary files /dev/null and b/docs/source/asr/speaker_diarization/images/msdd_train_and_infer.png differ diff --git a/docs/source/asr/speaker_diarization/images/scale_weight_cnn.png b/docs/source/asr/speaker_diarization/images/scale_weight_cnn.png new file mode 100644 index 0000000000000000000000000000000000000000..4620ca0ad04595c4fa1d90e0a9b9d13d2c11165b Binary files /dev/null and b/docs/source/asr/speaker_diarization/images/scale_weight_cnn.png differ diff --git a/docs/source/asr/speaker_diarization/images/sd_pipeline.png b/docs/source/asr/speaker_diarization/images/sd_pipeline.png new file mode 100644 index 0000000000000000000000000000000000000000..47b16cc8c1b7393b34b1d627cd02ae92f8f23fb0 Binary files /dev/null and b/docs/source/asr/speaker_diarization/images/sd_pipeline.png differ diff --git a/docs/source/asr/speaker_diarization/images/sequence_model.png b/docs/source/asr/speaker_diarization/images/sequence_model.png new file mode 100644 index 0000000000000000000000000000000000000000..9f00218289f352e15af8b2c382b3ab2781dd7bdb Binary files /dev/null and b/docs/source/asr/speaker_diarization/images/sequence_model.png differ diff --git a/docs/source/asr/speaker_diarization/images/weighted_sum.png b/docs/source/asr/speaker_diarization/images/weighted_sum.png new file mode 100644 index 0000000000000000000000000000000000000000..72fcd094d9c0901c65f5eee576e33a1cc6a5be52 Binary files /dev/null and b/docs/source/asr/speaker_diarization/images/weighted_sum.png differ diff --git a/docs/source/asr/speaker_diarization/intro.rst 
b/docs/source/asr/speaker_diarization/intro.rst new file mode 100644 index 0000000000000000000000000000000000000000..bd8dae2936148c8c81945f2db78b9cefdc072bca --- /dev/null +++ b/docs/source/asr/speaker_diarization/intro.rst @@ -0,0 +1,49 @@ +Speaker Diarization +=================== + +Speaker diarization is the process of segmenting audio recordings by speaker labels and aims to answer the question “who spoke when?”. Speaker diarization makes a clear distinction when it is compared with speech recognition. As shown in the figure below, before we perform speaker diarization, we know “what is spoken” yet we do not know “who spoke it”. Therefore, speaker diarization is an essential feature for a speech recognition system to enrich the transcription with speaker labels. + +.. image:: images/asr_sd_diagram.png + :align: center + :width: 800px + :alt: Speaker diarization pipeline- VAD, segmentation, speaker embedding extraction, clustering + + + +To figure out "who spoke when", speaker diarization systems need to capture the characteristics of unseen speakers and tell apart which regions in the audio recording belong to which speaker. To achieve this, speaker diarization systems extract voice characteristics, count the number of speakers, then assign the audio segments to the corresponding speaker index. + +The following figure shows the overall data flow of the NeMo speaker diarization pipeline. + +.. image:: images/sd_pipeline.png + :align: center + :width: 800px + :alt: Speaker diarization pipeline- VAD, segmentation, speaker embedding extraction, clustering + +NeMo speaker diarization system consists of the following modules: + +- **Voice Activity Detector (VAD)**: A trainable model which detects the presence or absence of speech to generate timestamps for speech activity from the given audio recording. + +- **Speaker Embedding Extractor**: A trainable model that extracts speaker embedding vectors containing voice characteristics from raw audio signal. + +- **Clustering Module**: A non-trainable module that groups speaker embedding vectors into a number of clusters. + +- **Neural Diarizer**: A trainable model that estimates speaker labels from the given features. + +Speaker diarization evaluation can be done in two different modes depending on the VAD settings: + +- **oracle VAD**: Speaker diarization based on ground-truth VAD timestamps +- **system VAD**: Speaker diarization based on the results from a VAD model + +The full documentation tree is as follows: + +.. toctree:: + :maxdepth: 8 + + models + datasets + results + configs + api + resources + +.. include:: resources.rst diff --git a/docs/source/asr/speaker_diarization/models.rst b/docs/source/asr/speaker_diarization/models.rst new file mode 100644 index 0000000000000000000000000000000000000000..e06be658d68aba1363ef163174b884045c90664e --- /dev/null +++ b/docs/source/asr/speaker_diarization/models.rst @@ -0,0 +1,110 @@ +Models +====== + +This section gives a brief overview of the supported speaker diarization models in NeMo's ASR collection. + +Currently speaker diarization pipeline in NeMo involves `MarbleNet <../speech_classification/models.html#marblenet-vad>`__ model for Voice Activity Detection (VAD) and `TitaNet <../speaker_recognition/models.html#titanet>`__ models for speaker embedding extraction and `Multi-scale Diarizerion Decoder` for neural diarizer, which will be explained in this page. + +.. _Multi_Scale_Diarization_Decoder: + +Multi-Scale Diarization Decoder +------------------------------- + +.. 
image:: images/sd_pipeline.png + :align: center + :width: 800px + :alt: Speaker diarization pipeline- VAD, segmentation, speaker embedding extraction, clustering + +Speaker diarization system needs to produce very accurate timestamps since speaker turns can be extremely short in conversational settings. Human conversation often involves very short back-channel words such as “yes”, “uh-huh” or “oh” and these words are very challenging for machines to transcribe and tell the speaker. Therefore, while segmenting audio recordings in terms of speaker identity, speaker diarization requires fine-grained decisions on relatively short segments, ranging from a few tenths of a second to several seconds. Making accurate, fine-grained decisions on such short audio segments is challenging because it is less likely to capture reliable speaker traits from the very short audio segments. We will discuss how this problem can be addressed by introducing a new technique called the multi-scale approach and multiscale diarization decoder to handle multi-scale inputs. + +Extracting long audio segments is desirable in terms of the quality of speaker characteristics. However, the length of audio segments also limits the granularity, which leads to a coarse unit length for speaker label decisions. Therefore, speaker diarization systems are challenged by a trade-off between temporal resolution and the fidelity of the speaker representation, as depicted in the curve shown in the figure below. During the speaker feature extraction process in the speaker diarization pipeline, the temporal resolution is inevitably sacrificed by taking a long speech segment to obtain high-quality speaker representation vectors. In plain and simple language, if we try to be very accurate on voice characteristics then we need to look into a longer span of time. However, at the same time, if we look into a longer span of time, we have to make a decision on a fairly long span of time and this leads to coarse decisions (temporal resolution is low). This can be easily understood if we think about the fact that even human listeners cannot accurately tell who is speaking if only half a second of recorded speech is given. + +In traditional diarization systems, an audio segment length ranges from 1.5~3.0 seconds since such numbers make a good compromise between the quality of speaker characteristics and temporal resolution. We refer to this type of segmentation method as a single-scale approach. Even with an overlap technique, the single-scale segmentation limits the temporal resolution to 0.75~1.5 seconds, which leaves room for improvement in terms of temporal accuracy. Having a coarse temporal resolution not only deteriorates the performance of diarization but also decreases speaker counting accuracy since short speech segments are not captured properly. More importantly, such coarse temporal resolution in the speaker timestamps makes the matching between the decoded ASR text and speaker diarization result more error-prone. + +.. image:: images/ms_trade_off.png + :align: center + :width: 800px + :alt: Speaker diarization pipeline- VAD, segmentation, speaker embedding extraction, clustering + +To tackle the problem, the multi-scale approach is proposed to cope with such a trade-off by extracting speaker features from multiple segment lengths and then combining the results from multiple scales. The multi-scale approach is fulfilled by employing multi-scale segmentation and extracting speaker embeddings from each scale. 
The left side of the above figure shows how a multi-scale segmentation approach with four different scales is performed. During the segment affinity calculation process, all the information from the longest scale to the shortest scale is combined, yet a decision is made only for the shortest segment range. When combining the features from each scale, the weight of each scale largely affects the speaker diarization performance.
+
+Since scale weights largely determine the accuracy of the speaker diarization system, the scale weights should be set to maximize the speaker diarization performance. Hence, we came up with a novel multi-scale diarization system called multiscale diarization decoder :cite:`sd-models-park2022multi` that dynamically determines the importance of each scale at each timestep.
+
+The multiscale diarization decoder takes the multiple speaker embedding vectors from multiple scales and then estimates desirable scale weights. Based on the estimated scale weights, speaker labels are generated. Hence, the proposed system puts more weight on the larger scales if the input signals are considered to have more accurate information on those scales.
+
+.. image:: images/data_flow.png
+ :align: center
+ :width: 800px
+ :alt: Speaker diarization pipeline- VAD, segmentation, speaker embedding extraction, clustering
+
+The data flow of the multiscale speaker diarization system is shown in the above figure. Multi-scale segments are extracted from the audio input, and the corresponding speaker embedding vectors for the multi-scale audio input are generated by using the speaker embedding extractor (TitaNet). Subsequently, the extracted multi-scale embeddings are processed by the clustering algorithm to provide an initial clustering result to the MSDD module. The MSDD module uses cluster-average speaker embedding vectors and compares them with the input speaker embedding sequences. The scale weights for each step are estimated to weigh the importance of each scale. Finally, the sequence model is trained to output speaker label probabilities for each speaker.
+
+
+.. image:: images/scale_weight_cnn.png
+ :align: center
+ :width: 800px
+ :alt: A figure explaining the CNN based scale weighting mechanism
+
+
+A neural network model named multi-scale diarization decoder (MSDD) is trained to take advantage of a multi-scale approach by dynamically calculating the weight of each scale. MSDD takes the initial clustering results and compares the extracted speaker embeddings with the cluster-average speaker representation vectors.
+
+Most importantly, the weight of each scale at each time step is determined through a scale weighting mechanism where the scale weights are calculated from 1-D convolutional neural networks (CNNs) applied to the multi-scale speaker embedding inputs and the cluster-average embeddings, as described in the above figure.
+
+.. image:: images/weighted_sum.png
+ :align: center
+ :width: 800px
+ :alt: A figure explaining the weighted sum of cosine similarity values
+
+The estimated scale weights are applied to the cosine similarity values calculated for each speaker and each scale. The above figure shows the process of calculating the context vector by applying the estimated scale weights to the cosine similarity values calculated between the cluster-average speaker embeddings and the input speaker embeddings.
+
+Aside from the CNN-based weighting scheme, the MSDD implementation in the NeMo toolkit allows multiple options for calculating scale weights via ``model.msdd_module.weighting_scheme``:
+
+
+- ``conv_scale_weight``: Default setting. 
Use 1-D CNN filters to calculate scale weights. + +- ``attn_scale_weight``: Calculate the scale weights by applying an attention mechanism between cluster-average embeddings and input embeddings. This can be viewed as attention values for scale at each timestep. + +Finally, each context vector for each step is fed to a multi-layer LSTM model that generates per-speaker speaker existence probability. The figure below shows how speaker label sequences are estimated by LSTM model and context vector input. + +.. image:: images/sequence_model.png + :align: center + :width: 400px + :alt: Speaker diarization pipeline- VAD, segmentation, speaker embedding extraction, clustering + +In NeMo toolkit, MSDD implementation has multiple options for the context vector by specifying ``model.msdd_module.context_vector_type``: + + +- ``cos_sim``: As described in this document, scale weights are applied to cosine similarity values between cluster-average embedding vectors and input embedding vectors. Default is ``cos_sim``. + + +- ``elem_prod``: The scale weights are directly applied to speaker embedding vectors then a weighted speaker embedding vector is calculated for both cluster-average embedding vectors and input embedding vectors. Finally, elementwise product between the cluster-average weighted speaker embedding vector and input multi-scale embedding vector are calculated and fed to LSTMs as a context vector for each step. + + +MSDD is designed with the following aspects in mind: + +* **Flexible number of speakers**: MSDD employs pairwise inference to diarize conversation with arbitrary numbers of speakers. For example, if there are 4 speakers, 6 pairs will be extracted, and inference results from MSDD are averaged to obtain results for each of the 4 speakers. + + +* **Overlap-aware diarization**: MSDD independently estimates the probability of two speaker labels of two speakers at each step. This enables overlap detection where two speakers are speaking at the same time. + + +* **Pretrained speaker embedding model**: MSDD is based on the pretrained embedding extractor (TitaNet) model. By using a pretrained speaker model, we can leverage the neural network weights learned from a relatively large amount of single-speaker speech data. In addition, MSDD is designed to be optimized with a pretrained speaker to fine-tune the entire speaker diarization system on a domain-specific diarization dataset. + + +* **End-to-end training of diarization model**: Since all the arithmetic operations in MSDD support gradient calculation, a speaker embedding model can be attached to the computational graph of an MSDD model and can be jointly trained from the loss calculated from speaker label outputs. + + +* **Superior temporal resolution for uniform segmentation approach**: While single-scale clustering diarizer shows the best performance at 1.5-second segment length where unit decision length is 0.75 second (half-overlap), the multi-scale approach has unit decision length of 0.25 second. The temporal resolution can be even more enhanced by using shorter shift length which requires more steps and resources. Note that merely applying 0.5-second segment length to a single-scale diarizer significantly drops the diarization performance due to the degraded fidelity of speaker features. + + +* **Performance improvement from clustering diarizer**: Diarization Error Rate (DER) is calculated by comparing hypothesis timestamps and ground-truth timestamps. 
MSDD can reduce the diarization error rate up to 60% on two speaker datasets when compared to the single-scale clustering diarizer.  + +References +----------- + +.. bibliography:: ../asr_all.bib + :style: plain + :labelprefix: SD-MODELS + :keyprefix: sd-models- + + diff --git a/docs/source/asr/speaker_diarization/resources.rst b/docs/source/asr/speaker_diarization/resources.rst new file mode 100644 index 0000000000000000000000000000000000000000..ee38217b8d8d6caeab916f31ea8638e28ceb7a71 --- /dev/null +++ b/docs/source/asr/speaker_diarization/resources.rst @@ -0,0 +1,24 @@ + +Resource and Documentation Guide +-------------------------------- + +Hands-on speaker diarization tutorial notebooks can be found under ``/tutorials/speaker_tasks``. + +There are tutorials for performing speaker diarization inference using :ref:`MarbleNet_model`, :ref:`TitaNet_model`, and :ref:`Multi_Scale_Diarization_Decoder`. +We also provide tutorials about getting ASR transcriptions combined with speaker labels along with voice activity timestamps with NeMo ASR collections. + +Most of the tutorials can be run on Google Colab by specifying the link to the notebooks' GitHub pages on Colab. + +If you are looking for information about a particular model used for speaker diarization inference, or would like to find out more about the model +architectures available in the `nemo_asr` collection, check out the :doc:`Models <./models>` page. + +Documentation on dataset preprocessing can be found on the :doc:`Datasets <./datasets>` page. +NeMo includes preprocessing scripts for several common ASR datasets, and this page contains instructions on running +those scripts. +It also includes guidance for creating your own NeMo-compatible dataset, if you have your own data. + +Information about how to load model checkpoints (either local files or pretrained ones from NGC), perform inference, as well as a list +of the checkpoints available on NGC are located on the :doc:`Checkpoints <./results>` page. + +Documentation for configuration files specific to the ``nemo_asr`` models can be found on the +:doc:`Configuration Files <./configs>` page. diff --git a/docs/source/asr/speaker_diarization/results.rst b/docs/source/asr/speaker_diarization/results.rst new file mode 100644 index 0000000000000000000000000000000000000000..87af559d12c58d6040a294268df3c3e415726810 --- /dev/null +++ b/docs/source/asr/speaker_diarization/results.rst @@ -0,0 +1,87 @@ +Checkpoints +=========== + +There are two main ways to load pretrained checkpoints in NeMo as introduced in `loading ASR checkpoints <../results.html#checkpoints>`__. +In speaker diarization, the diarizer loads checkpoints that are passed through the config file. For example: + +Loading Local Checkpoints +--------------------------- + +Load VAD models + +.. code-block:: bash + + pretrained_vad_model='/path/to/vad_multilingual_marblenet.nemo' # local .nemo or pretrained vad model name + ... + # pass with hydra config + config.diarizer.vad.model_path=pretrained_vad_model + + +Load speaker embedding models + +.. code-block:: bash + + pretrained_speaker_model='/path/to/titanet-l.nemo' # local .nemo or pretrained speaker embedding model name + ... + # pass with hydra config + config.diarizer.speaker_embeddings.model_path=pretrained_speaker_model + +Load neural diarizer models + +.. code-block:: bash + + pretrained_neural_diarizer_model='/path/to/diarizer_msdd_telephonic.nemo' # local .nemo or pretrained neural diarizer model name + ... 
+ # pass with hydra config
+ config.diarizer.msdd_model.model_path=pretrained_neural_diarizer_model
+
+
+NeMo will automatically save checkpoints of a model you are training in a `.nemo` format.
+You can also manually save your models at any point using :code:`model.save_to(<checkpoint_path>.nemo)`.
+
+
+Inference
+---------
+
+.. note::
+ For details and a deeper understanding, please refer to ``/tutorials/speaker_tasks/Speaker_Diarization_Inference.ipynb``.
+
+Check out :doc:`Datasets <./datasets>` for preparing audio files and optional label files.
+
+Run and evaluate the speaker diarizer with the command below:
+
+.. code-block:: bash
+
+ # Have a look at the instructions inside the script and pass the arguments you might need.
+ python /examples/speaker_tasks/diarization/offline_diarization.py
+
+
+NGC Pretrained Checkpoints
+--------------------------
+
+The ASR collection has checkpoints of several models trained on various datasets for a variety of tasks.
+These checkpoints are obtainable via the NGC `NeMo Automatic Speech Recognition collection `_.
+The model cards on NGC contain more information about each of the checkpoints available.
+
+In general, you can load models with the model name in the following format:
+
+.. code-block:: python
+
+ pretrained_vad_model='vad_multilingual_marblenet'
+ pretrained_speaker_model='titanet_large'
+ pretrained_neural_diarizer_model='diar_msdd_telephonic'
+ ...
+ config.diarizer.vad.model_path=pretrained_vad_model \ + config.diarizer.speaker_embeddings.model_path=pretrained_speaker_model \ + config.diarizer.msdd_model.model_path=pretrained_neural_diarizer_model
+
+where the model name is the value under the "Model Name" entry in the tables below.
+
+Models for Speaker Diarization Pipeline
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. csv-table::
+ :file: data/diarization_results.csv
+ :align: left
+ :widths: 30, 30, 40
+ :header-rows: 1 diff --git a/docs/source/asr/speaker_recognition/api.rst b/docs/source/asr/speaker_recognition/api.rst new file mode 100644 index 0000000000000000000000000000000000000000..0f95cb281145aab44508a7e02b7decfb6a940510 --- /dev/null +++ b/docs/source/asr/speaker_recognition/api.rst @@ -0,0 +1,11 @@
+NeMo Speaker Recognition API
+=============================
+
+
+Model Classes
+-------------
+.. autoclass:: nemo.collections.asr.models.label_models.EncDecSpeakerLabelModel
+ :show-inheritance:
+ :members: setup_finetune_model, get_embedding, verify_speakers
+
+ diff --git a/docs/source/asr/speaker_recognition/configs.rst b/docs/source/asr/speaker_recognition/configs.rst new file mode 100644 index 0000000000000000000000000000000000000000..a6fcf6f7582d528847c7d0c91c55c3357e648bc9 --- /dev/null +++ b/docs/source/asr/speaker_recognition/configs.rst @@ -0,0 +1,116 @@
+NeMo Speaker Recognition Configuration Files
+============================================
+
+This page covers NeMo configuration file setup that is specific to speaker recognition models.
+For general information about how to set up and run experiments that is common to all NeMo models (e.g.
+experiment manager and PyTorch Lightning trainer parameters), see the :doc:`../../core/core` page.
+
+The model section of NeMo speaker recognition configuration files will generally require information about the dataset(s) being
+used, the preprocessor for audio files, parameters for any augmentation being performed, as well as the
+model architecture specification.
+The sections on this page cover each of these in more detail. 
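+
+As a quick orientation, the top-level layout of such a configuration file can be inspected programmatically. The snippet below is only an illustrative sketch; the config path is an assumption based on the example file referenced later on this page.
+
+.. code-block:: python
+
+    # Illustrative sketch: inspect the top-level sections of a speaker recognition config.
+    # The path is an assumption; adjust it to your local NeMo checkout.
+    from omegaconf import OmegaConf
+
+    cfg = OmegaConf.load("examples/speaker_tasks/recognition/conf/titanet-large.yaml")
+
+    # The model section typically carries the dataset, preprocessor, augmentation,
+    # and architecture (encoder/decoder) settings described on this page.
+    for section in ("train_ds", "validation_ds", "preprocessor", "encoder", "decoder"):
+        print(section, "present:", section in cfg.model)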
+ +Example configuration files for all of the Speaker related scripts can be found in the +config directory of the examples ``{NEMO_ROOT/examples/speaker_tasks/recognition/conf}``. + + +Dataset Configuration +--------------------- + +Training, validation, and test parameters are specified using the ``train_ds``, ``validation_ds``, and +``test_ds`` sections of your configuration file, respectively. +Depending on the task, you may have arguments specifying the sample rate of your audio files, max time length to consider for each audio file , whether or not to shuffle the dataset, and so on. +You may also decide to leave fields such as the ``manifest_filepath`` blank, to be specified via the command line +at run time. + +Any initialization parameters that are accepted for the Dataset class used in your experiment +can be set in the config file. + +An example TitaNet train and validation configuration could look like (``{NEMO_ROOT}examples/speaker_tasks/recognition/conf/titanet-large.yaml``): + +.. code-block:: yaml + + model: + train_ds: + manifest_filepath: ??? + sample_rate: 16000 + labels: None # finds labels based on manifest file + batch_size: 32 + trim_silence: False + shuffle: True + + validation_ds: + manifest_filepath: ??? + sample_rate: 16000 + labels: None # Keep None, to match with labels extracted during training + batch_size: 32 + shuffle: False # No need to shuffle the validation data + + +If you would like to use tarred dataset, have a look at `Datasets Configuration <../configs.html#dataset-configuration>`__. + + +Preprocessor Configuration +-------------------------- +Preprocessor helps to compute MFCC or mel spectrogram features that are given as inputs to model. +For details on how to write this section, refer to `Preprocessor Configuration <../configs.html#preprocessor-configuration>`__ + + +Augmentation Configurations +--------------------------- + +For TitaNet training we use on-the-fly augmentations with MUSAN and RIR impulses using ``noise`` augmentor section + +The following example sets up musan augmentation with audio files taken from manifest path and +minimum and maximum SNR specified with min_snr and max_snr respectively. This section can be added to +``train_ds`` part in model + +.. code-block:: yaml + + model: + ... + train_ds: + ... + augmentor: + noise: + manifest_path: /path/to/musan/manifest_file + prob: 0.2 # probability to augment the incoming batch audio with augmentor data + min_snr_db: 5 + max_snr_db: 15 + + +See the :class:`nemo.collections.asr.parts.preprocessing.perturb.AudioAugmentor` API section for more details. + + +Model Architecture Configurations +--------------------------------- + +Each configuration file should describe the model architecture being used for the experiment. +Models in the NeMo ASR collection need a ``encoder`` section and a ``decoder`` section, with the ``_target_`` field +specifying the module to use for each. + +The following sections go into more detail about the specific configurations of each model architecture. + +For more information about the TitaNet Encoder models, see the :doc:`Models <./models>` page. + +Decoder Configurations +------------------------ + +After features have been computed from TitaNet encoder, we pass these features to the decoder to compute embeddings and then to compute log probabilities +for training models. + +.. code-block:: yaml + + model: + ... 
+ decoder:
+ _target_: nemo.collections.asr.modules.SpeakerDecoder
+ feat_in: *enc_feat_out
+ num_classes: 7205 # Total number of classes in voxceleb1,2 training manifest file
+ pool_mode: attention # xvector, attention
+ emb_sizes: 192 # number of intermediate emb layers; can be comma separated for additional layers like 512,512
+ angular: true # if true, the loss is changed to angular softmax loss and the scale and margin from the loss section are used; otherwise the model is trained with cross-entropy loss
+
+ loss:
+ scale: 30
+ margin: 0.2 diff --git a/docs/source/asr/speaker_recognition/data/speaker_results.csv b/docs/source/asr/speaker_recognition/data/speaker_results.csv new file mode 100644 index 0000000000000000000000000000000000000000..a0e865c9c487748c602cc4d54074f22856762112 --- /dev/null +++ b/docs/source/asr/speaker_recognition/data/speaker_results.csv @@ -0,0 +1,4 @@
+Model Name,Model Base Class,Model Card
+titanet_large,EncDecSpeakerLabelModel,"https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/titanet_large"
+speakerverification_speakernet,EncDecSpeakerLabelModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:speakerverification_speakernet"
+ecapa_tdnn,EncDecSpeakerLabelModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:ecapa_tdnn" \ No newline at end of file diff --git a/docs/source/asr/speaker_recognition/datasets.rst b/docs/source/asr/speaker_recognition/datasets.rst new file mode 100644 index 0000000000000000000000000000000000000000..88c600b3c5237109c8391746e43e10281179fe4c --- /dev/null +++ b/docs/source/asr/speaker_recognition/datasets.rst @@ -0,0 +1,62 @@
+Datasets
+========
+
+.. _HI-MIA:
+
+HI-MIA
+--------
+
+Run the script to download and process the ``hi-mia`` dataset in order to generate files in the supported format of ``nemo_asr``. You should set the data folder of
+hi-mia using ``--data_root``. These scripts are present in ``/scripts``.
+
+.. code-block:: bash
+
+ python get_hi-mia_data.py --data_root=
+
+After download and conversion, your `data` folder should contain directories with the following set of files:
+
+* `data//train.json`
+* `data//dev.json`
+* `data//{set}_all.json`
+* `data//utt2spk`
+
+
+All-other Datasets
+------------------
+
+These methods can be applied to any dataset to get similar training or inference manifest files.
+
+The `filelist_to_manifest.py` script in the `$/scripts/speaker_tasks/` folder generates a manifest file from a text file containing paths to audio files.
+
+Sample `filelist.txt` file contents:
+
+.. code-block:: bash
+
+ /data/datasets/voxceleb/data/dev/aac_wav/id00179/Q3G6nMr1ji0/00086.wav
+ /data/datasets/voxceleb/data/dev/aac_wav/id00806/VjpQLxHQQe4/00302.wav
+ /data/datasets/voxceleb/data/dev/aac_wav/id01510/k2tzXQXvNPU/00132.wav
+
+This list file is used to generate the manifest file. The script has optional arguments to split the whole manifest file into train and dev sets and also to segment audio files into smaller segments for robust training (for testing, we don't need to create segments for each utterance).
+
+Sample usage:
+
+.. code-block:: bash
+
+ python filelist_to_manifest.py --filelist=filelist.txt --id=-3 --out=speaker_manifest.json
+
+This would create a manifest containing file contents as shown below:
+
+.. 
code-block:: json + + {"audio_filepath": "/data/datasets/voxceleb/data/dev/aac_wav/id00179/Q3G6nMr1ji0/00086.wav", "offset": 0, "duration": 4.16, "label": "id00179"} + {"audio_filepath": "/data/datasets/voxceleb/data/dev/aac_wav/id00806/VjpQLxHQQe4/00302.wav", "offset": 0, "duration": 12.288, "label": "id00806"} + {"audio_filepath": "/data/datasets/voxceleb/data/dev/aac_wav/id01510/k2tzXQXvNPU/00132.wav", "offset": 0, "duration": 4.608, "label": "id01510"} + +For other optional arguments like splitting manifest file to train and dev and for creating segements from each utterance refer to the arguments +described in the script. + +Tarred Datasets +--------------- + +Similarly to ASR, you can tar your audio files and use ASR Dataset class ``TarredAudioToSpeechLabelDataset`` (corresponding to the ``AudioToSpeechLabelDataset``) for this case. + +If you want to use tarred dataset, have a look at `ASR Tarred Datasets <../datasets.html#tarred-datasets>`__. diff --git a/docs/source/asr/speaker_recognition/images/ICASPP_SpeakerNet.png b/docs/source/asr/speaker_recognition/images/ICASPP_SpeakerNet.png new file mode 100644 index 0000000000000000000000000000000000000000..ecdccccb69ba94f2af56a627113a11ff0768a43e Binary files /dev/null and b/docs/source/asr/speaker_recognition/images/ICASPP_SpeakerNet.png differ diff --git a/docs/source/asr/speaker_recognition/images/titanet_network.png b/docs/source/asr/speaker_recognition/images/titanet_network.png new file mode 100644 index 0000000000000000000000000000000000000000..08668e5b6d08989b9b7a165e0f5526799ca96ab5 Binary files /dev/null and b/docs/source/asr/speaker_recognition/images/titanet_network.png differ diff --git a/docs/source/asr/speaker_recognition/intro.rst b/docs/source/asr/speaker_recognition/intro.rst new file mode 100644 index 0000000000000000000000000000000000000000..9242fdb1a9b2261870b9aa23f8ab36e35b0eca50 --- /dev/null +++ b/docs/source/asr/speaker_recognition/intro.rst @@ -0,0 +1,23 @@ +Speaker Recognition (SR) +======================== + +Speaker recognition is a broad research area which solves two major tasks: speaker identification (what is the identity of the speaker?) and speaker verification (is the speaker who they claim to be?). We focus on text-independent speaker recognition when the identity of the speaker is based on how the speech is spoken, not necessarily in what is being said. Typically such speaker recognition systems operate on unconstrained speech utterances, which are converted into vectors of fixed length, called speaker embeddings. Speaker embeddings can also be used in automatic speech recognition (ASR) and speech synthesis. + +The goal of most speaker recognition systems is to get good speaker level representations that could help distinguish oneself from other speakers. To achieve this, we first train a neural network model in an end-to-end manner optimizing the encoder using cross-entropy or angular softmax loss. We modify the decoder to get these fixed size embeddings irrespective of the length of the audio input and employ a pooling strategy such as mean and variance based statistics pooling or attention based method to generate these embeddings. + +In speaker identification, we typically train on a larger training set with cross-entropy loss and fine-tune later on preferred set of labels where one would want to classify only known sets of speakers. 
+On the other hand, in speaker verification, we train an embedding extractor with angular softmax loss and compare the embeddings from one audio file coming from a single speaker with embeddings from an unknown speaker. For quantifying the similarity of the embeddings we use scoring techniques such as cosine similarity. + +The full documentation tree: + +.. toctree:: + :maxdepth: 8 + + models + configs + datasets + results + api + resources + +.. include:: resources.rst \ No newline at end of file diff --git a/docs/source/asr/speaker_recognition/models.rst b/docs/source/asr/speaker_recognition/models.rst new file mode 100644 index 0000000000000000000000000000000000000000..e568f8523c919e7d82cb1ba54867a4db167b4a20 --- /dev/null +++ b/docs/source/asr/speaker_recognition/models.rst @@ -0,0 +1,72 @@ +Models +====== + +Examples of config files for all the below models can be found in the ``/examples/speaker_recognition/conf`` directory. + +For more information about the config files and how they should be structured, see the :doc:`./configs` page. + +Pretrained checkpoints for all of these models, as well as instructions on how to load them, can be found on the :doc:`./results` page. +You can use the available checkpoints for immediate inference, or fine-tune them on your own datasets. +The Checkpoints page also contains benchmark results for the available speaker recognition models. + +.. _TitaNet_model: + +TitaNet +----------- + +TitaNet model :cite:`sr-models-koluguri2021titanet` is based on the ContextNet architecture :cite:`sr-models-han2020contextnet` for extracting speaker representations. +We employ 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers with global context followed by channel attention based statistics pooling layer to map +variable-length utterances to a fixed-length embedding (tvector). TitaNet is a scalable architecture and achieves state-of-the-art performance on speaker verification and diarization tasks. + + .. image:: images/titanet_network.png + :align: center + :alt: speakernet model + :scale: 50% + +SpeakerNet models can be instantiated using the :class:`~nemo.collections.asr.models.EncDecSpeakerLabelModel` class. + +SpeakerNet +----------- + +The model is based on the QuartzNet ASR architecture :cite:`sr-models-koluguri2020speakernet` +comprising of an encoder and decoder structure. We use the encoder of the QuartzNet model as a top-level feature extractor, and feed the output to the statistics pooling layer, where +we compute the mean and variance across channel dimensions to capture the time-independent utterance-level speaker features. + +The QuartzNet encoder used for speaker embeddings shown in figure below has the following structure: a QuartzNet BxR +model has B blocks, each with R sub-blocks. Each sub-block applies the following operations: a 1D convolution, batch norm, ReLU, and dropout. All sub-blocks in a block have the same number of output channels. These blocks are connected with residual connections. We use QuartzNet with 3 blocks, 2 sub-blocks, and 512 channels, as the Encoder for Speaker Embeddings. All conv layers have stride 1 and dilation 1. + + + .. image:: images/ICASPP_SpeakerNet.png + :align: center + :alt: speakernet model + :scale: 40% + +Top level acoustic Features, obtained from the output of +encoder are used to compute intermediate features that are +then passed to the decoder for getting utterance level speaker +embeddings. 
The intermediate time-independent features are
+computed using a statistics pooling layer, where we compute the mean and standard deviation of features across
+time-channels, to get a time-independent feature representation S of size Batch_size × 3000.
+The intermediate features, S, are passed through the Decoder consisting of two layers each of output size 512 for a
+linear transformation from S to the final number of classes
+N for the larger (L) model, and a single linear layer of output size 256 to the final number of classes N for the medium
+(M) model. We extract q-vectors after the final linear layer
+of fixed size 512, 256 for SpeakerNet-L and SpeakerNet-M
+models respectively.
+
+SpeakerNet models can be instantiated using the :class:`~nemo.collections.asr.models.EncDecSpeakerLabelModel` class.
+
+ECAPA_TDNN
+----------
+
+The model is based on the paper "ECAPA_TDNN Embeddings for Speaker Diarization" :cite:`sr-models-Dawalatabad_2021`, comprising an encoder of time dilation layers which are based on Emphasized Channel Attention, Propagation, and Aggregation. The ECAPA-TDNN model employs a channel- and context-dependent attention mechanism, Multi-layer Feature Aggregation (MFA), as well as Squeeze-and-Excitation (SE) and residual blocks. For faster training and inference, we replace the residual blocks with group convolution blocks of single dilation. These models have shown good performance over various speaker tasks.
+
+ecapa_tdnn models can be instantiated using the :class:`~nemo.collections.asr.models.EncDecSpeakerLabelModel` class.
+
+References
+-----------
+
+.. bibliography:: ../asr_all.bib
+ :style: plain
+ :labelprefix: SR-MODELS
+ :keyprefix: sr-models- diff --git a/docs/source/asr/speaker_recognition/resources.rst b/docs/source/asr/speaker_recognition/resources.rst new file mode 100644 index 0000000000000000000000000000000000000000..55e83bb598b7fcdc7e3f0a62958335bfc50e0362 --- /dev/null +++ b/docs/source/asr/speaker_recognition/resources.rst @@ -0,0 +1,22 @@
+
+Resource and Documentation Guide
+--------------------------------
+
+Hands-on speaker recognition tutorial notebooks can be found under
+`the speaker recognition tutorials folder `_. This and most other tutorials can be run on Google Colab by specifying the link to the notebooks' GitHub pages on Colab.
+
+If you are looking for information about a particular SpeakerNet model, or would like to find out more about the model
+architectures available in the ``nemo_asr`` collection, check out the :doc:`Models <./models>` page.
+
+Documentation on dataset preprocessing can be found on the :doc:`Datasets <./datasets>` page.
+NeMo includes preprocessing and other scripts for speaker_recognition in the ``/scripts/speaker_tasks`` folder, and this page contains instructions on running
+those scripts. It also includes guidance for creating your own NeMo-compatible dataset, if you have your own data.
+
+Information about how to load model checkpoints (either local files or pretrained ones from NGC), perform inference, as well as a list
+of the checkpoints available on NGC are located on the :doc:`Checkpoints <./results>` page.
+
+Documentation for configuration files specific to the ``nemo_asr`` models can be found on the
+:doc:`Configuration Files <./configs>` page.
+
+
+For a clear step-by-step tutorial we advise you to refer to the tutorials found in `folder `_. 
diff --git a/docs/source/asr/speaker_recognition/results.rst b/docs/source/asr/speaker_recognition/results.rst new file mode 100644 index 0000000000000000000000000000000000000000..a6029595823fd05165f804fb682dd0bbcae7be1e --- /dev/null +++ b/docs/source/asr/speaker_recognition/results.rst @@ -0,0 +1,138 @@ +Checkpoints +=========== + +There are two main ways to load pretrained checkpoints in NeMo: + +* Using the :code:`restore_from()` method to load a local checkpoint file (`.nemo`), or +* Using the :code:`from_pretrained()` method to download and set up a checkpoint from NGC. + +See the following sections for instructions and examples for each. + +Note that these instructions are for loading fully trained checkpoints for evaluation or fine-tuning. +For resuming an unfinished training experiment, please use the experiment manager to do so by setting the +``resume_if_exists`` flag to True. + +Loading Local Checkpoints +------------------------- + +NeMo will automatically save checkpoints of a model you are training in a `.nemo` format. +You can also manually save your models at any point using :code:`model.save_to(.nemo)`. + +If you have a local ``.nemo`` checkpoint that you'd like to load, simply use the :code:`restore_from()` method: + +.. code-block:: python + + import nemo.collections.asr as nemo_asr + model = nemo_asr.models..restore_from(restore_path="") + +Where the model base class is the ASR model class of the original checkpoint, or the general `ASRModel` class. + +Speaker Label Inference +------------------------ + +The goal of speaker label inference is to infer speaker labels using a speaker model with known speaker labels from enrollment set. We provide `speaker_identification_infer.py` script for this purpose under `/examples/speaker_tasks/recognition` folder. +Currently supported backends are cosine_similarity and neural classifier. + +The audio files should be 16KHz mono channel wav files. + +The script takes two manifest files: + +* enrollment_manifest : This manifest contains enrollment data with known speaker labels. +* test_manifest: This manifest contains test data for which we map speaker labels captured from enrollment manifest using one of provided backend + +sample format for each of these manifests is provided in `/examples/speaker_tasks/recognition/conf/speaker_identification_infer.yaml` config file. + +To infer speaker labels using cosine_similarity backend + +.. code-block:: bash + + python speaker_identification_infer.py data.enrollment_manifest= data.test_manifest= backend.backend_model=cosine_similarity + + +Speaker Embedding Extraction +----------------------------- +Speaker Embedding Extraction, is to extract speaker embeddings for any wav file (from known or unknown speakers). We provide two ways to do this: + +* single Python liner for extracting embeddings from a single file +* Python script for extracting embeddings from a bunch of files provided through manifest file + +For extracting embeddings from a single file: + +.. code-block:: python + + speaker_model = EncDecSpeakerLabelModel.from_pretrained(model_name="") + embs = speaker_model.get_embedding('') + +For extracting embeddings from a bunch of files: + +The audio files should be 16KHz mono channel wav files. + +Write audio files to a ``manifest.json`` file with lines as in format: + +.. 
code-block:: json
+
+ {"audio_filepath": "/audio_file.wav", "duration": "duration of file in sec", "label": "speaker_id"}
+
+This Python call will download the best pretrained model from NGC and write an embeddings pickle file to the current working directory:
+
+.. code-block:: bash
+
+ python examples/speaker_tasks/recognition/extract_speaker_embeddings.py --manifest=manifest.json
+
+Alternatively, you can run `batch_inference()` to perform inference on the manifest with a selected batch_size to get embeddings:
+
+.. code-block:: python
+
+ speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(model_name="")
+ embs, logits, gt_labels, trained_labels = speaker_model.batch_inference(manifest, batch_size=32)
+
+Speaker Verification Inference
+------------------------------
+
+Speaker verification is the task of verifying whether two utterances are from the same speaker or not.
+
+We provide a helper function to verify the audio files; it returns True if the two provided audio files are from the same speaker, False otherwise.
+
+The audio files should be 16KHz mono channel wav files.
+
+.. code-block:: python
+
+ speaker_model = EncDecSpeakerLabelModel.from_pretrained(model_name="titanet_large")
+ decision = speaker_model.verify_speakers('path/to/one/audio_file','path/to/other/audio_file')
+
+
+NGC Pretrained Checkpoints
+--------------------------
+
+The SpeakerNet-ASR collection has checkpoints of several models trained on various datasets for a variety of tasks.
+`TitaNet `_, `ECAPA_TDNN `_ and `Speaker_Verification `_ model cards on NGC contain more information about each of the checkpoints available.
+
+The tables below list the speaker embedding extractor models available from NGC, and the models can be accessed via the
+:code:`from_pretrained()` method inside the EncDecSpeakerLabelModel Model class.
+
+In general, you can load any of these models with code in the following format:
+
+.. code-block:: python
+
+ import nemo.collections.asr as nemo_asr
+ model = nemo_asr.models.<MODEL_BASE_CLASS>.from_pretrained(model_name="<MODEL_NAME>")
+
+where the model name is the value under the "Model Name" entry in the tables below.
+
+If you would like to programmatically list the models available for a particular base class, you can use the
+:code:`list_available_models()` method.
+
+.. code-block:: python
+
+ nemo_asr.models.<MODEL_BASE_CLASS>.list_available_models()
+
+
+Speaker Recognition Models
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. csv-table::
+ :file: data/speaker_results.csv
+ :align: left
+ :widths: 30, 30, 40
+ :header-rows: 1
+ diff --git a/docs/source/asr/speech_classification/configs.rst b/docs/source/asr/speech_classification/configs.rst new file mode 100644 index 0000000000000000000000000000000000000000..d67706a46617dddd65659eeb33c96eadad67d059 --- /dev/null +++ b/docs/source/asr/speech_classification/configs.rst @@ -0,0 +1,115 @@
+NeMo Speech Classification Configuration Files
+================================================
+
+This page covers NeMo configuration file setup that is specific to models in the Speech Classification collection.
+For general information about how to set up and run experiments that is common to all NeMo models (e.g.
+experiment manager and PyTorch Lightning trainer parameters), see the :doc:`../../core/core` page.
+
+The model section of NeMo Speech Classification configuration files will generally require information about the dataset(s) being
+used, the preprocessor for audio files, parameters for any augmentation being performed, as well as the
+model architecture specification. 
+The sections on this page cover each of these in more detail. + +Example configuration files for all of the NeMo ASR scripts can be found in the +``/examples/asr/conf>``. + + +Dataset Configuration +--------------------- + +Training, validation, and test parameters are specified using the ``train_ds``, ``validation_ds``, and +``test_ds`` sections of your configuration file, respectively. +Depending on the task, you may have arguments specifying the sample rate of your audio files, labels, whether or not to shuffle the dataset, and so on. +You may also decide to leave fields such as the ``manifest_filepath`` blank, to be specified via the command line +at runtime. + +Any initialization parameters that are accepted for the Dataset class used in your experiment +can be set in the config file. +See the `Datasets <../api.html#Datasets>`__ section of the API for a list of Datasets and their respective parameters. + +An example Speech Classification train and validation configuration could look like: + +.. code-block:: yaml + + model: + sample_rate: 16000 + repeat: 2 # number of convolutional sub-blocks within a block, R in _[BxRxC] + dropout: 0.0 + kernel_size_factor: 1.0 + labels: ['bed', 'bird', 'cat', 'dog', 'down', 'eight', 'five', 'four', 'go', 'happy', 'house', 'left', 'marvin', + 'nine', 'no', 'off', 'on', 'one', 'right', 'seven', 'sheila', 'six', 'stop', 'three', 'tree', 'two', 'up', + 'wow', 'yes', 'zero'] + + train_ds: + manifest_filepath: ??? + sample_rate: ${model.sample_rate} + labels: ${model.labels} # Uses the labels above + batch_size: 128 + shuffle: True + + validation_ds: + manifest_filepath: ??? + sample_rate: ${model.sample_rate} + labels: ${model.labels} # Uses the labels above + batch_size: 128 + shuffle: False # No need to shuffle the validation data + +If you would like to use tarred dataset, have a look at `Datasets Configuration <../configs.html#dataset-configuration>`__. + + +Preprocessor Configuration +-------------------------- +Preprocessor helps to compute MFCC or mel spectrogram features that are given as inputs to model. +For details on how to write this section, refer to `Preprocessor Configuration <../configs.html#preprocessor-configuration>`__ + +Check config yaml files in ``/examples/asr/conf`` to find the processors been used by speech classification models. + + +Augmentation Configurations +--------------------------- + +There are a few on-the-fly spectrogram augmentation options for NeMo ASR, which can be specified by the +configuration file using the ``augmentor`` and ``spec_augment`` section. +For details on how to write this section, refer to `Augmentation Configuration <../configs.html#augmentation-configurations>`__ + +Check config yaml files in ``/tutorials/asr/conf`` to find the processors been used by speech classification models. + + +Model Architecture Configurations +--------------------------------- + +Each configuration file should describe the model architecture being used for the experiment. +Models in the NeMo ASR collection need a ``encoder`` section and a ``decoder`` section, with the ``_target_`` field +specifying the module to use for each. + +The following sections go into more detail about the specific configurations of each model architecture. + +The `MatchboxNet <./models.html#matchboxnet-speech-commands>`__ and `MarbleNet <./models.html#marblenet-vad>`__ models are very similar, and they are based on `QuartzNet <../models.html#quartznet>`__ and as such the components in their +configs are very similar as well. 
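+
+As a rough sketch of how these architecture sections are consumed, a speech classification model is typically built from the ``model`` sub-config together with a PyTorch Lightning trainer, at which point the ``_target_`` entries of the ``encoder`` and ``decoder`` sections are resolved. The config path and manifest filenames below are assumptions for illustration only; argument names may differ between NeMo versions.
+
+.. code-block:: python
+
+    # Hedged sketch: instantiate a classification model from an example config.
+    # The config path and manifest paths are placeholders, not guaranteed file names.
+    import pytorch_lightning as pl
+    from omegaconf import OmegaConf
+
+    import nemo.collections.asr as nemo_asr
+
+    cfg = OmegaConf.load("examples/asr/conf/marblenet/marblenet_3x2x64.yaml")
+    cfg.model.train_ds.manifest_filepath = "balanced_speech_training_manifest.json"        # placeholder
+    cfg.model.validation_ds.manifest_filepath = "balanced_speech_validation_manifest.json"  # placeholder
+
+    trainer = pl.Trainer(gpus=1, max_epochs=1)
+    # The encoder/decoder sections (with their _target_ fields) are resolved here.
+    model = nemo_asr.models.EncDecClassificationModel(cfg=cfg.model, trainer=trainer)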
+ +Decoder Configurations +------------------------ + +After features have been computed from ConvASREncoder, we pass the features to decoder to compute embeddings and then to compute log_probs +for training models. + +.. code-block:: yaml + + model: + ... + decoder: + _target_: nemo.collections.asr.modules.ConvASRDecoderClassification + feat_in: *enc_final_filters + return_logits: true # return logits if true, else return softmax output + pooling_type: 'avg' # AdaptiveAvgPool1d 'avg' or AdaptiveMaxPool1d 'max' + + +Fine-tuning Execution Flow Diagram +---------------------------------- + +When preparing your own training or fine-tuning scripts, please follow the execution flow diagram order for correct inference. + +Depending on the type of model, there may be extra steps that must be performed - + +* Speech Classification models - `Examples directory for Classification Models `_ + diff --git a/docs/source/asr/speech_classification/data/classification_results.csv b/docs/source/asr/speech_classification/data/classification_results.csv new file mode 100644 index 0000000000000000000000000000000000000000..06de98a2c4330135185b980932b2fe06cfab8321 --- /dev/null +++ b/docs/source/asr/speech_classification/data/classification_results.csv @@ -0,0 +1,11 @@ +Model Name,Model Base Class,Model Card +langid_ambernet,EncDecSpeakerLabelModel,"https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/langid_ambernet" +vad_multilingual_marblenet,EncDecClassificationModel,"https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/vad_multilingual_marblenet" +vad_marblenet,EncDecClassificationModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:vad_marblenet" +vad_telephony_marblenet,EncDecClassificationModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:vad_telephony_marblenet" +commandrecognition_en_matchboxnet3x1x64_v1,EncDecClassificationModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:commandrecognition_en_matchboxnet3x1x64_v1" +commandrecognition_en_matchboxnet3x2x64_v1,EncDecClassificationModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:commandrecognition_en_matchboxnet3x2x64_v1" +commandrecognition_en_matchboxnet3x1x64_v2,EncDecClassificationModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:commandrecognition_en_matchboxnet3x1x64_v2" +commandrecognition_en_matchboxnet3x2x64_v2,EncDecClassificationModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:commandrecognition_en_matchboxnet3x2x64_v2" +commandrecognition_en_matchboxnet3x1x64_v2_subset_task,EncDecClassificationModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:commandrecognition_en_matchboxnet3x1x64_v2_subset_task" +commandrecognition_en_matchboxnet3x2x64_v2_subset_task,EncDecClassificationModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:commandrecognition_en_matchboxnet3x2x64_v2_subset_task" \ No newline at end of file diff --git a/docs/source/asr/speech_classification/datasets.rst b/docs/source/asr/speech_classification/datasets.rst new file mode 100644 index 0000000000000000000000000000000000000000..cd5049c2c264128003e5ef7ffd4cf341d48d4166 --- /dev/null +++ b/docs/source/asr/speech_classification/datasets.rst @@ -0,0 +1,149 @@ +Datasets +======== + +NeMo has scripts to convert several common ASR datasets into the format expected by the `nemo_asr` collection. +You can get started with those datasets by following the instructions to run those scripts in the section appropriate to each dataset below. 
+ +If you have your own data and want to preprocess it to use with NeMo ASR models, check out the `Preparing Custom Speech Classification Data`_ section at the bottom of the page. + +.. _Freesound-dataset: + +Freesound +----------- + +`Freesound `_ is a website that aims to create a huge open collaborative database of audio snippets, samples, recordings, and bleeps. +Most audio samples are released under Creative Commons licenses that allow their reuse. +Researchers and developers can access Freesound content using the Freesound API to retrieve meaningful sound information such as metadata, analysis files, and the sounds themselves. + +**Instructions** + +Go to ``/scripts/freesound_download_resample`` and follow the steps below to download and convert Freesound data into a format expected by the `nemo_asr` collection. + +1. We will need some required libraries, including freesound, requests, requests_oauthlib, joblib, librosa and sox. If they are not installed, please run `pip install -r freesound_requirements.txt` +2. Create an API key for freesound.org at https://freesound.org/help/developers/ +3. Create a python file called `freesound_private_apikey.py` and add the lines `api_key = ` and `client_id = ` +4. Authorize by running `python freesound_download.py --authorize`, then visit the website and paste the response code +5. Feel free to change any arguments in `download_resample_freesound.sh` such as max_samples and max_filesize +6. Run `bash download_resample_freesound.sh `. For example: + +.. code-block:: bash + + bash download_resample_freesound.sh 4000 ./freesound ./freesound_resampled_background + +Note that downloading this dataset may take hours. Change the categories in `download_resample_freesound.sh` to include audio files from other (speech) categories. +Then, you should have 16kHz mono wav files in ``. + + +.. _Google-Speech-Commands-Dataset: + +Google Speech Commands Dataset +------------------------------ + +Google released two versions of the dataset, with the first version containing 65k samples over 30 classes and the second containing 110k samples over 35 classes. +We refer to these datasets as `v1` and `v2` respectively. + +Run the script `process_speech_commands_data.py`, which can be found in ``/scripts/dataset_processing/``, to process the Google Speech Commands dataset and generate files in the format supported by `nemo_asr`. +You should set the data folder of Speech Commands using :code:`--data_root` and the version of the dataset using :code:`--data_version` as an int. + +You can further rebalance the train set by randomly oversampling files inside the manifest by passing the `--rebalance` flag. + +.. code-block:: bash + + python process_speech_commands_data.py --data_root= --data_version=<1 or 2> {--rebalance} + + +Then, you should have `train_manifest.json`, `validation_manifest.json` and `test_manifest.json` +in the directory `{data_root}/google_speech_recognition_v{1/2}`. + +.. note:: + You should have at least 4GB or 6GB of disk space available if you use v1 or v2 respectively. + Also, it will take some time to download and process, so go grab a coffee. + +Each line is a training example. + +.. code-block:: bash + + {"audio_filepath": "/two/8aa35b0c_nohash_0.wav", "duration": 1.0, "label": "two"} + {"audio_filepath": "/two/ec5ab5d5_nohash_2.wav", "duration": 1.0, "label": "two"} + + + +Speech Command & Freesound for VAD +------------------------------------ +The Speech Command & Freesound (SCF) dataset is used to train MarbleNet in the `paper `_.
Here we show how to download and process it. +This script assumes that you already have the Freesound dataset; if not, have a look at :ref:`Freesound-dataset`. +We will use the open-source :ref:`Google-Speech-Commands-Dataset` as our speech data (we use V2 of the dataset for the SCF dataset, but only very minor changes are required to support V1). + +The scripts below will download the Google Speech Commands v2 dataset and convert speech and background data to a format suitable for use with nemo_asr. + +.. note:: + You may additionally pass the :code:`--test_size` or :code:`--val_size` flags for splitting train, val and test data. + + You may additionally pass the :code:`--window_length_in_sec` flag for indicating the segment/window length. Default is 0.63s. + + You may additionally pass a :code:`--rebalance_method='fixed|over|under'` at the end of the script to rebalance the class samples in the manifest. + + + +* `'fixed'`: Fixed number of samples for each class. Train 5000, val 1000, and test 1000. (Change the numbers in the script if you want) +* `'over'`: Oversampling rebalance method +* `'under'`: Undersampling rebalance method + + +.. code-block:: bash + + mkdir './google_dataset_v2' + python process_vad_data.py --out_dir='./manifest/' --speech_data_root='./google_dataset_v2' --background_data_root= --log --rebalance_method='fixed' + + +After download and conversion, your `manifest` folder should contain a few json manifest files: + +* `(balanced_)background_testing_manifest.json` +* `(balanced_)background_training_manifest.json` +* `(balanced_)background_validation_manifest.json` +* `(balanced_)speech_testing_manifest.json` +* `(balanced_)speech_training_manifest.json` +* `(balanced_)speech_validation_manifest.json` + +Each line is a training example. `audio_filepath` contains the path to the wav file, `duration` is the duration in seconds, `offset` is the offset in seconds, and `label` is the label (class): + +.. code-block:: bash + + {"audio_filepath": "/two/8aa35b0c_nohash_0.wav", "duration": 0.63, "label": "speech", "offset": 0.0} + {"audio_filepath": "/Emergency_vehicle/id_58368 simambulance.wav", "duration": 0.63, "label": "background", "offset": 4.0} + + +.. _Voxlingua107: + +Voxlingua107 +------------------------------ + +VoxLingua107 consists of short speech segments automatically extracted from YouTube videos. +It contains 107 languages. The total amount of speech in the training set is 6628 hours, with 62 hours per language on average, but it is highly imbalanced. +It also includes a separate evaluation set containing 1609 speech segments from 33 languages, validated by at least two volunteers. + +You can download the dataset from its `official website `_. + +Each line is a training example. + +.. code-block:: bash + + {"audio_filepath": "/ln/lFpWXQYseo4__U__S113---0400.650-0410.420.wav", "offset": 0, "duration": 3.0, "label": "ln"} + {"audio_filepath": "/lt/w0lp3mGUN8s__U__S28---0352.170-0364.770.wav", "offset": 8, "duration": 4.0, "label": "lt"} + + +Preparing Custom Speech Classification Data +-------------------------------------------- + +Preparing Custom Speech Classification Data is almost identical to `Preparing Custom ASR Data <../datasets.html#preparing-custom-asr-data>`__. + +Instead of the :code:`text` entry in the manifest, you need a :code:`label` entry to determine the class of the sample. + + +Tarred Datasets +--------------- + +Similarly to ASR, you can tar your audio files and use the ASR Dataset class ``TarredAudioToClassificationLabelDataset`` (corresponding to the ``AudioToClassificationLabelDataset``) for this case.
+ +If you would like to use a tarred dataset, have a look at `ASR Tarred Datasets <../datasets.html#tarred-datasets>`__. \ No newline at end of file diff --git a/docs/source/asr/speech_classification/images/marblenet_vertical.png b/docs/source/asr/speech_classification/images/marblenet_vertical.png new file mode 100644 index 0000000000000000000000000000000000000000..3174d1ef5ea56ca36159d7525f2f6acf6c2367b0 Binary files /dev/null and b/docs/source/asr/speech_classification/images/marblenet_vertical.png differ diff --git a/docs/source/asr/speech_classification/images/matchboxnet_vertical.png b/docs/source/asr/speech_classification/images/matchboxnet_vertical.png new file mode 100644 index 0000000000000000000000000000000000000000..2d69c94ecdb820b9b0859137873674995138ac14 Binary files /dev/null and b/docs/source/asr/speech_classification/images/matchboxnet_vertical.png differ diff --git a/docs/source/asr/speech_classification/intro.rst b/docs/source/asr/speech_classification/intro.rst new file mode 100644 index 0000000000000000000000000000000000000000..6a7126b8e73389a3d83decba494c47f626b389e3 --- /dev/null +++ b/docs/source/asr/speech_classification/intro.rst @@ -0,0 +1,31 @@ +Speech Classification +================================== +Speech Classification refers to a set of tasks or problems of getting a program to automatically classify an input utterance or audio segment into categories, +such as Speech Command Recognition (multi-class), Voice Activity Detection (binary or multi-class), and Audio Sentiment Classification (typically multi-class), etc. + +**Speech Command Recognition** is the task of classifying an input audio pattern into a discrete set of classes. +It is a subset of Automatic Speech Recognition (ASR), sometimes referred to as Key Word Spotting, in which a model is constantly analyzing speech patterns to detect certain "command" classes. +Upon detection of these commands, a specific action can be taken by the system. +It is often the objective of command recognition models to be small and efficient so that they can be deployed onto low-power sensors and remain active for long durations of time. + + +**Voice Activity Detection (VAD)**, also known as speech activity detection or speech detection, is the task of predicting which parts of input audio contain speech versus background noise. +It is an essential first step for a variety of speech-based applications including Automatic Speech Recognition. +It serves to determine which samples should be sent to the model and when to close the microphone. + +**Spoken Language Identification (Lang ID)**, also known as spoken language recognition, is the task of automatically recognizing the language of a spoken utterance. +It typically serves as a preprocessing step for ASR, determining which ASR model should be activated based on the language. + + +The full documentation tree is as follows: + +.. toctree:: + :maxdepth: 8 + + models + datasets + results + configs + resources.rst + +.. include:: resources.rst diff --git a/docs/source/asr/speech_classification/models.rst b/docs/source/asr/speech_classification/models.rst new file mode 100644 index 0000000000000000000000000000000000000000..50919336fd93d51176ad2abd17aa6f1270e10443 --- /dev/null +++ b/docs/source/asr/speech_classification/models.rst @@ -0,0 +1,84 @@ +Models +====== + +This page gives a brief overview of the models that NeMo's Speech Classification collection currently supports. +For Speech Classification, we support Speech Command (Keyword) Detection and Voice Activity Detection (VAD).
+ +Each of these models can be used with the example ASR scripts (in the ``/examples/asr`` directory) by +specifying the model architecture in the config file used. +Examples of config files for each model can be found in the ``/examples/asr/conf`` directory. + +For more information about the config files and how they should be structured, see the :doc:`./configs` page. + +Pretrained checkpoints for all of these models, as well as instructions on how to load them, can be found on the :doc:`./results` page. +You can use the available checkpoints for immediate inference, or fine-tune them on your own datasets. +The Checkpoints page also contains benchmark results for the available ASR models. + +.. _MatchboxNet_model: + +MatchboxNet (Speech Commands) +------------------------------ + +MatchboxNet :cite:`sc-models-matchboxnet` is an end-to-end neural network for speech command recognition based on `QuartzNet <../models.html#QuartzNet>`__. + +Similarly to QuartzNet, the MatchboxNet family of models are denoted as MatchBoxNet_[BxRxC], where B is the number of blocks, R is the number of convolutional sub-blocks within a block, and C is the number of channels. Each sub-block contains a 1-D *separable* convolution, batch normalization, ReLU, and dropout: + + .. image:: images/matchboxnet_vertical.png + :align: center + :alt: MatchboxNet model + :scale: 50% + +It can reach state-of-the-art accuracy on the Google Speech Commands dataset while having significantly fewer parameters than similar models. +The `_v1` and `_v2` suffixes denote models trained on the `v1` (30-way classification) and `v2` (35-way classification) datasets, +and we use `_subset_task` to represent the (10+2)-way subset (10 specific classes + other remaining classes + silence) classification task. + +MatchboxNet models can be instantiated using the :class:`~nemo.collections.asr.models.EncDecClassificationModel` class. + +.. note:: + For model details and a deeper understanding of Speech Command Detection training, inference, fine-tuning, etc., + please refer to ``/tutorials/asr/Speech_Commands.ipynb`` and ``/tutorials/asr/Online_Offline_Speech_Commands_Demo.ipynb``. + + + +.. _MarbleNet_model: + +MarbleNet (VAD) +------------------ + +MarbleNet :cite:`sc-models-marblenet` is an end-to-end neural network for voice activity detection, based on :ref:`MatchboxNet_model`. + +Similarly to MatchboxNet, the MarbleNet family of models are denoted as MarbleNet_[BxRxC], where B is the number of blocks, R is the number of convolutional sub-blocks within a block, and C is the number of channels. Each sub-block contains a 1-D *separable* convolution, batch normalization, ReLU, and dropout: + + .. image:: images/marblenet_vertical.png + :align: center + :alt: MarbleNet model + :scale: 30% + +It can reach state-of-the-art performance on the difficult `AVA speech dataset `_ while having significantly fewer parameters than similar models, even when trained on simple data. +MarbleNet models can be instantiated using the :class:`~nemo.collections.asr.models.EncDecClassificationModel` class. + +.. note:: + For model details and a deeper understanding of VAD training, inference, postprocessing, threshold tuning, etc., + please refer to ``/tutorials/asr/06_Voice_Activiy_Detection.ipynb`` and ``/tutorials/asr/Online_Offline_Microphone_VAD_Demo.ipynb``. + + + +..
_AmberNet_model: + +AmberNet (Lang ID) +------------------ + +AmberNet is an end-to-end neural network for language identification moden based on `TitaNet <../speaker_recognition/models.html#titanet>`__. + +It can reach state-of-the art performance on the `Voxlingua107 dataset `_ while having significantly fewer parameters than similar models. +AmberNet models can be instantiated using the :class:`~nemo.collections.asr.models.EncDecSpeakerLabelModel` class. + + + +References +---------------- + +.. bibliography:: ../asr_all.bib + :style: plain + :labelprefix: SC-MODELS + :keyprefix: sc-models- diff --git a/docs/source/asr/speech_classification/resources.rst b/docs/source/asr/speech_classification/resources.rst new file mode 100644 index 0000000000000000000000000000000000000000..eea302c1b94b009b9a19c5b2e1baae5c6d5d75f7 --- /dev/null +++ b/docs/source/asr/speech_classification/resources.rst @@ -0,0 +1,20 @@ +Resource and Documentation Guide +-------------------------------- + +Hands-on speech classification tutorial notebooks can be found under ``/tutorials/asr/``. +There are training and offline & online microphone inference tutorials for Speech Command Detection and Voice Activity Detection tasks. +This and most other tutorials can be run on Google Colab by specifying the link to the notebooks' GitHub pages on Colab. + +If you are looking for information about a particular Speech Classification model or would like to find out more about the model +architectures available in the `nemo_asr` collection, check out the :doc:`Models <./models>` page. + +Documentation on dataset preprocessing can be found on the :doc:`Datasets <./datasets>` page. +NeMo includes preprocessing scripts for several common ASR datasets, and this page contains instructions on running +those scripts. +It also includes guidance for creating your own NeMo-compatible dataset, if you have your own data. + +Information about how to load model checkpoints (either local files or pretrained ones from NGC), perform inference, as well as a list +of the checkpoints available on NGC are located on the :doc:`Checkpoints <./results>` page. + +Documentation for configuration files specific to the ``nemo_asr`` models can be found on the +:doc:`Configuration Files <./configs>` page. diff --git a/docs/source/asr/speech_classification/results.rst b/docs/source/asr/speech_classification/results.rst new file mode 100644 index 0000000000000000000000000000000000000000..9eeca4a4036b3103ffc944ace33c66ac00e0d5b6 --- /dev/null +++ b/docs/source/asr/speech_classification/results.rst @@ -0,0 +1,138 @@ +Checkpoints +=========== + +There are two main ways to load pretrained checkpoints in NeMo: + +* Using the :code:`restore_from()` method to load a local checkpoint file (`.nemo`), or +* Using the :code:`from_pretrained()` method to download and set up a checkpoint from NGC. + +See the following sections for instructions and examples for each. + +Note that these instructions are for loading fully trained checkpoints for evaluation or fine-tuning. +For resuming an unfinished training experiment, please use the experiment manager to do so by setting the +``resume_if_exists`` flag to True. + +Loading Local Checkpoints +------------------------- + +NeMo will automatically save checkpoints of a model you are training in a `.nemo` format. +You can also manually save your models at any point using :code:`model.save_to(.nemo)`. + +If you have a local ``.nemo`` checkpoint that you'd like to load, simply use the :code:`restore_from()` method: + +.. 
code-block:: python + + import nemo.collections.asr as nemo_asr + model = nemo_asr.models..restore_from(restore_path="") + +Where the model base class is the ASR model class of the original checkpoint, or the general `ASRModel` class. + + +Transcribing/Inference +----------------------- + +The audio files should be 16KHz monochannel wav files. + +`Transcribe speech command segment:` + +You may perform inference and transcribe a sample of speech after loading the model by using its 'transcribe()' method: + +.. code-block:: python + + mbn_model = nemo_asr.models.EncDecClassificationModel.from_pretrained(model_name="") + mbn_model.transcribe([list of audio files], batch_size=BATCH_SIZE, logprobs=False) + +Setting argument ``logprobs`` to True would return the log probabilities instead of transcriptions. You may find more details in `Modules <../api.html#modules>`__. + +Learn how to fine-tune on your own data or on subset classes in ``/tutorials/asr/Speech_Commands.ipynb`` + + +`Run VAD inference:` + +.. code-block:: bash + + python /examples/asr/speech_classification/vad_infer.py --config-path="../conf/vad" --config-name="vad_inference_postprocessing.yaml" dataset= + + +This script will perform frame-level VAD prediction, and can also help you perform postprocessing and generate speech segments if needed. + +Have a look at the configuration file ``/examples/asr/conf/vad/vad_inference_postprocessing.yaml`` and scripts under ``/scripts/voice_activity_detection`` for details regarding posterior processing, postprocessing and threshold tuning. + +Posterior processing includes generating predictions with overlapping input segments. Then a smoothing filter is applied to decide the label for a frame spanned by multiple segments. + +For VAD postprocessing we introduce: + +Binarization: + - ``onset`` and ``offset`` thresholds for detecting the beginning and end of speech; + - padding durations ``pad_onset`` before and ``pad_offset`` after each speech segment; + +Filtering: + - ``min_duration_on`` threshold for short speech segment deletion, + - ``min_duration_off`` threshold for short silence deletion, + - ``filter_speech_first`` to control whether to perform short speech segment deletion first. + + +`Identify the language of an utterance` + +You may load the model and identify the language of an audio file by using the `get_label()` method: + +.. code-block:: python + + langid_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(model_name="") + lang = langid_model.get_label('') + +or you can run `batch_inference()` to perform inference on a manifest with a selected batch_size and obtain embeddings, logits, ground-truth labels, and the labels the model was trained on: + +.. code-block:: python + + langid_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(model_name="") + lang_embs, logits, gt_labels, trained_labels = langid_model.batch_inference(manifest_filepath, batch_size=32) + + +NGC Pretrained Checkpoints +-------------------------- + +The Speech Classification collection has checkpoints of several models trained on various datasets for a variety of tasks. +These checkpoints are obtainable via the NGC `NeMo Automatic Speech Recognition collection `_. +The model cards on NGC contain more information about each of the checkpoints available. + +The tables below list the Speech Classification models available from NGC, and the models can be accessed via the +:code:`from_pretrained()` method inside the ASR Model class. + +In general, you can load any of these models with code in the following format. + +..
code-block:: python + + import nemo.collections.asr as nemo_asr + model = nemo_asr.models.EncDecClassificationModel.from_pretrained(model_name="") + +Where the model name is the value under "Model Name" entry in the tables below. + +For example, to load the MatchboxNet3x2x64_v1 model for speech command detection, run: + +.. code-block:: python + + model = nemo_asr.models.EncDecClassificationModel.from_pretrained(model_name="commandrecognition_en_matchboxnet3x2x64_v1") + +You can also call :code:`from_pretrained()` from the specific model class (such as :code:`EncDecClassificationModel` +for MatchboxNet and MarbleNet) if you will need to access specific model functionality. + +If you would like to programatically list the models available for a particular base class, you can use the +:code:`list_available_models()` method. + +.. code-block:: python + + nemo_asr.models..list_available_models() + + +Speech Classification Models +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +.. tabularcolumns:: 30 30 40 + +.. csv-table:: + :file: data/classification_results.csv + :header-rows: 1 + :class: longtable + :widths: 1 1 1 + diff --git a/docs/source/asr/speech_intent_slot/api.rst b/docs/source/asr/speech_intent_slot/api.rst new file mode 100644 index 0000000000000000000000000000000000000000..735c583f911549e8e69a76c70c5da5b965958df3 --- /dev/null +++ b/docs/source/asr/speech_intent_slot/api.rst @@ -0,0 +1,22 @@ +NeMo Speech Intent Classification and Slot Filling collection API +================================================================= + + +Model Classes +------------- +.. autoclass:: nemo.collections.asr.models.SLUIntentSlotBPEModel + :show-inheritance: + :members: + + +Mixins +------ + +.. autoclass:: nemo.collections.asr.parts.mixins.ASRModuleMixin + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.asr.parts.mixins.ASRBPEMixin + :show-inheritance: + :members: + diff --git a/docs/source/asr/speech_intent_slot/configs.rst b/docs/source/asr/speech_intent_slot/configs.rst new file mode 100644 index 0000000000000000000000000000000000000000..48f85d0233324dc85fa22e70eaed4e778f4b49d7 --- /dev/null +++ b/docs/source/asr/speech_intent_slot/configs.rst @@ -0,0 +1,170 @@ +NeMo Speech Intent Classification and Slot Filling Configuration Files +======================================================================= + +This page covers NeMo configuration file setup that is specific to models in the Speech Intent Classification and Slot Filling collection. +For general information about how to set up and run experiments that is common to all NeMo models (e.g. +experiment manager and PyTorch Lightning trainer parameters), see the :doc:`../../core/core` page. + +Dataset Configuration +--------------------- + +Dataset configuration for Speech Intent Classification and Slot Filling model is mostly the same as for standard ASR training, +covered `here <../configs.html#Dataset Configuration>`__. One exception is that ``use_start_end_token`` must be set to ``True``. + +An example of train and validation configuration should look similar to the following: + +.. code-block:: yaml + + model: + train_ds: + manifest_filepath: ??? 
+ sample_rate: ${model.sample_rate} + batch_size: 16 # you may increase batch_size if your memory allows + shuffle: true + num_workers: 8 + pin_memory: false + use_start_end_token: true + trim_silence: false + max_duration: 11.0 + min_duration: 0.0 + # tarred datasets + is_tarred: false + tarred_audio_filepaths: null + shuffle_n: 2048 + # bucketing params + bucketing_strategy: "synced_randomized" + bucketing_batch_size: null + + validation_ds: + manifest_filepath: ??? + sample_rate: ${model.sample_rate} + batch_size: 16 # you may increase batch_size if your memory allows + shuffle: false + num_workers: 8 + pin_memory: true + use_start_end_token: true + min_duration: 8.0 + + +Preprocessor Configuration +-------------------------- + +Preprocessor helps to compute MFCC or mel spectrogram features that are given as inputs to model. +For details on how to write this section, refer to `Preprocessor Configuration <../configs.html#preprocessor-configuration>`__ + +Augmentation Configurations +--------------------------- + + +There are a few on-the-fly spectrogram augmentation options for NeMo ASR, which can be specified by the +configuration file using the ``augmentor`` and ``spec_augment`` section. +For details on how to write this section, refer to `Augmentation Configuration <../configs.html#augmentation-configurations>`__ + + +Model Architecture Configurations +--------------------------------- + +The ``encoder`` of the model is a `Conformer-large <./models.html#Conformer-CTC>`__ model without the text decoder, and can be initialized with pretrained checkpoints. The ``decoder`` is a Transforemr model, with additional ``embedding`` and ``classifier`` modules. + +An example config for the model can be: + +.. code-block:: yaml + + pretrained_encoder: + name: stt_en_conformer_ctc_large # which model use to initialize the encoder, set to null if not using any. Only used to initialize training, not used in resuming from checkpoint. + freeze: false # whether to freeze the encoder during training. + + model: + sample_rate: 16000 + encoder: + _target_: nemo.collections.asr.modules.ConformerEncoder + feat_in: ${model.preprocessor.features} + feat_out: -1 # you may set it if you need different output size other than the default d_model + n_layers: 17 # SSL conformer-large have only 17 layers + d_model: 512 + + # Sub-sampling params + subsampling: striding # vggnet or striding, vggnet may give better results but needs more memory + subsampling_factor: 4 # must be power of 2 + subsampling_conv_channels: -1 # -1 sets it to d_model + + # Reduction parameters: Can be used to add another subsampling layer at a given position. + # Having a 2x reduction will speedup the training and inference speech while keeping similar WER. + # Adding it at the end will give the best WER while adding it at the beginning will give the best speedup. 
+ reduction: null # pooling, striding, or null + reduction_position: null # Encoder block index or -1 for subsampling at the end of encoder + reduction_factor: 1 + + # Feed forward module's params + ff_expansion_factor: 4 + + # Multi-headed Attention Module's params + self_attention_model: rel_pos # rel_pos or abs_pos + n_heads: 8 # may need to be lower for smaller d_models + # [left, right] specifies the number of steps to be seen from left and right of each step in self-attention + att_context_size: [-1, -1] # -1 means unlimited context + xscaling: true # scales up the input embeddings by sqrt(d_model) + untie_biases: true # unties the biases of the TransformerXL layers + pos_emb_max_len: 5000 + + # Convolution module's params + conv_kernel_size: 31 + conv_norm_type: 'batch_norm' # batch_norm or layer_norm + + ### regularization + dropout: 0.1 # The dropout used in most of the Conformer Modules + dropout_pre_encoder: 0.1 # The dropout used before the encoder + dropout_emb: 0.0 # The dropout used for embeddings + dropout_att: 0.1 # The dropout for multi-headed attention modules + + embedding: + _target_: nemo.collections.asr.modules.transformer.TransformerEmbedding + vocab_size: -1 + hidden_size: ${model.encoder.d_model} + max_sequence_length: 512 + num_token_types: 1 + embedding_dropout: 0.0 + learn_positional_encodings: false + + decoder: + _target_: nemo.collections.asr.modules.transformer.TransformerDecoder + num_layers: 3 + hidden_size: ${model.encoder.d_model} + inner_size: 2048 + num_attention_heads: 8 + attn_score_dropout: 0.0 + attn_layer_dropout: 0.0 + ffn_dropout: 0.0 + + classifier: + _target_: nemo.collections.common.parts.MultiLayerPerceptron + hidden_size: ${model.encoder.d_model} + num_classes: -1 + num_layers: 1 + activation: 'relu' + log_softmax: true + + +Loss Configurations +--------------------------------- + +The loss function by default is the negative log-likelihood loss, where optional label-smoothing can be applied by using the following config (default is 0.0): + +.. code-block:: yaml + + loss: + label_smoothing: 0.0 + + +Inference Configurations +--------------------------------- +During inference, three types of sequence generation strategies can be applied: ``greedy search``, ``beam search`` and ``top-k search``. + +.. code-block:: yaml + + sequence_generator: + type: greedy # choices=[greedy, topk, beam] + max_sequence_length: ${model.embedding.max_sequence_length} + temperature: 1.0 # for top-k sampling + beam_size: 1 # K for top-k sampling, N for beam search + len_pen: 0 # for beam-search diff --git a/docs/source/asr/speech_intent_slot/data/benchmark_sis.csv b/docs/source/asr/speech_intent_slot/data/benchmark_sis.csv new file mode 100644 index 0000000000000000000000000000000000000000..397e890486b721b2286eca53538b13175b884635 --- /dev/null +++ b/docs/source/asr/speech_intent_slot/data/benchmark_sis.csv @@ -0,0 +1,2 @@ +Model Name,Model Base Class,Model Card +slu_conformer_transformer_large_slurp,SLUIntentSlotBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:slu_conformer_transformer_large_slurp" diff --git a/docs/source/asr/speech_intent_slot/datasets.rst b/docs/source/asr/speech_intent_slot/datasets.rst new file mode 100644 index 0000000000000000000000000000000000000000..3a0168a850a18231614ace194051fd07723dbd77 --- /dev/null +++ b/docs/source/asr/speech_intent_slot/datasets.rst @@ -0,0 +1,10 @@ +Datasets +======== + +Input data should be provided in line delimited JSON format as below: + +.. 
code-block:: bash + + {"audio_filepath": "/path/to/abcd.wav", "offset": 0, "duration": 10.1, "text": "{'scenario': 'Calendar', 'action': 'Create_entry', 'entities': [{'type': 'event_name', 'filler': 'brunch'}, {'type': 'date', 'filler': 'Saturday'}, {'type': 'timeofday', 'filler': 'morning'}, {'type': 'person', 'filler': 'Aronson'}]}"} + +The semantics annotation is a Python dictionary flattened as a string, and indexed by the "text" key in the manifest. For a semantics annotation, there are three mandatory keys: "scenario", "action" and "entities". The values for "scenario" and "action" are strings, while the value for "entities" is a Python list of dictionaries. Each item in "entities" is also a Python dictionary, with two keys "type" (entity slot) and "filler" (slot filler). diff --git a/docs/source/asr/speech_intent_slot/images/example.png b/docs/source/asr/speech_intent_slot/images/example.png new file mode 100644 index 0000000000000000000000000000000000000000..1094247bb96a910d068ddbfa30f76381c7657e19 Binary files /dev/null and b/docs/source/asr/speech_intent_slot/images/example.png differ diff --git a/docs/source/asr/speech_intent_slot/images/framework.png b/docs/source/asr/speech_intent_slot/images/framework.png new file mode 100644 index 0000000000000000000000000000000000000000..3b64ae9b5f6555bda1f3e7e16ce19d109a50c76c Binary files /dev/null and b/docs/source/asr/speech_intent_slot/images/framework.png differ diff --git a/docs/source/asr/speech_intent_slot/intro.rst b/docs/source/asr/speech_intent_slot/intro.rst new file mode 100644 index 0000000000000000000000000000000000000000..785df41e57b6019f7e019814f76c74a9490a5d46 --- /dev/null +++ b/docs/source/asr/speech_intent_slot/intro.rst @@ -0,0 +1,29 @@ +Speech Intent Classification and Slot Filling +============================================== + +**Intent Classification and Slot Filling** aims to not only recognize the user's intention, but also detect entity slots and their corresponding lexical fillers. Below is an example: + + +.. image:: images/example.png + :align: center + :scale: 50% + :alt: slurp_example + + +Different from its counterpart in Natural Language Understanding (NLU) that takes text as input, here the model predicts the semantics directly from audio input. + + + +The full documentation tree is as follows: + +.. toctree:: + :maxdepth: 8 + + models + datasets + results + configs + api + resources + +.. include:: resources.rst diff --git a/docs/source/asr/speech_intent_slot/models.rst b/docs/source/asr/speech_intent_slot/models.rst new file mode 100644 index 0000000000000000000000000000000000000000..bc628dd36ba56cb409e9ed90a8808260c1b432da --- /dev/null +++ b/docs/source/asr/speech_intent_slot/models.rst @@ -0,0 +1,15 @@ +Models +====== + +There are mainly two approaches in Speech Intent Classification and Slot Filling, where we can either use an End-to-End (E2E) model that directly predicts semantics from audio, or use a cascading model composed of an ASR model followed by an NLU model. E2E methods are preferred over cascading models, since they avoid error propagation from ASR to NLU and thus achieve better performance. + +Our E2E model in NeMo is based on an **Encoder-Decoder** framework, where a Conformer-large module is used as the encoder to extract features, and a Transformer Decoder is applied on top of the features to predict the semantics. + +..
image:: images/framework.png + :align: center + :scale: 70% + :alt: sis_framework + +The output is a Python dictionary object flattened as a string representation, so that the problem can be formulated as a sequence-to-sequence (audio-to-text) problem. + +The model is trained by Negative Log-Likelihood (NLL) Loss with teacher forcing. diff --git a/docs/source/asr/speech_intent_slot/resources.rst b/docs/source/asr/speech_intent_slot/resources.rst new file mode 100644 index 0000000000000000000000000000000000000000..017c184e243679f2d3bb4a11b56be7137592648c --- /dev/null +++ b/docs/source/asr/speech_intent_slot/resources.rst @@ -0,0 +1,11 @@ +Resources and Documentation +--------------------------- + +Example of Speech Intent Classification and Slot Filling can be found `here `_. + +Information about how to load model checkpoints (either local files or pretrained ones from NGC), +as well as a list of the checkpoints available on NGC are located on the `Checkpoints <./results.html>`__ +page. + +Documentation regarding the configuration files specific to SLU can be found in the +`Configuration Files <./configs.html>`__ page. diff --git a/docs/source/asr/speech_intent_slot/results.rst b/docs/source/asr/speech_intent_slot/results.rst new file mode 100644 index 0000000000000000000000000000000000000000..b22f99b31cd83734d9530c2a558b6e565e410515 --- /dev/null +++ b/docs/source/asr/speech_intent_slot/results.rst @@ -0,0 +1,56 @@ +Checkpoints +=========== + +There are two main ways to load pretrained checkpoints in NeMo: + +* Using the :code:`restore_from()` method to load a local checkpoint file (`.nemo`), or +* Using the :code:`from_pretrained()` method to download and set up a checkpoint from NGC. + +See the following sections for instructions and examples for each. + +Note that these instructions are for loading fully trained checkpoints for evaluation or fine-tuning. +For resuming an unfinished training experiment, please use the experiment manager to do so by setting the +``resume_if_exists`` flag to True. + +Loading Local Checkpoints +------------------------- + +NeMo will automatically save checkpoints of a model you are training in a `.nemo` format. +You can also manually save your models at any point using :code:`model.save_to(.nemo)`. + +If you have a local ``.nemo`` checkpoint that you'd like to load, simply use the :code:`restore_from()` method: + +.. code-block:: python + + import nemo.collections.asr as nemo_asr + model = nemo_asr.models..restore_from(restore_path="") + +Where the model base class is the ASR model class of the original checkpoint, or the general `ASRModel` class. + + +Inference +----------------------- + +The audio files should be 16KHz monochannel wav files. + +**Transcribe Audios to Semantics:** + +You may perform inference on a sample of speech after loading the model by using its 'transcribe()' method: + +.. code-block:: python + + slu_model = nemo_asr.models.SLUIntentSlotBPEModel.from_pretrained(model_name="") + predictions = slu_model.transcribe([list of audio files], batch_size="") + + +SLU Models +----------------------------------- + +Below is a list of all the Speech Intent Classification and Slot Filling models that are available in NeMo. + + +.. 
csv-table:: + :file: data/benchmark_sis.csv + :align: left + :widths: 40, 10, 50 + :header-rows: 1 diff --git a/docs/source/asr/ssl/api.rst b/docs/source/asr/ssl/api.rst new file mode 100644 index 0000000000000000000000000000000000000000..7103243a4b20c093ecd96f4c7cff64131321ff50 --- /dev/null +++ b/docs/source/asr/ssl/api.rst @@ -0,0 +1,24 @@ +NeMo SSL collection API +============================= + + +Model Classes +------------- +.. autoclass:: nemo.collections.asr.models.SpeechEncDecSelfSupervisedModel + :show-inheritance: + :members: + + +Mixins +------ + +.. autoclass:: nemo.collections.asr.parts.mixins.mixins.ASRModuleMixin + :show-inheritance: + :members: + +.. autoclass:: nemo.core.classes.mixins.access_mixins.AccessMixin + :show-inheritance: + :members: + + + diff --git a/docs/source/asr/ssl/configs.rst b/docs/source/asr/ssl/configs.rst new file mode 100644 index 0000000000000000000000000000000000000000..8883004297b3453fd39333732590bac16fe9da12 --- /dev/null +++ b/docs/source/asr/ssl/configs.rst @@ -0,0 +1,413 @@ +NeMo SSL Configuration Files +============================ + +This page covers NeMo configuration file setup that is specific to models in the Speech Self-Supervised Pre-training collection. +For general information about how to set up and run experiments that is common to all NeMo models (e.g. +experiment manager and PyTorch Lightning trainer parameters), see the :doc:`../../core/core` page. + +Dataset Configuration +--------------------- + +Dataset configuration for self-supervised model is mostly the same as for standard ASR training, +covered `here <../configs.html#Dataset Configuration>`__. The main difference is that in order to perform contrastive loss, +we will need to mask an equivalent amount of patches for all utterances in a batch. This means that we want to avoid +the durations varying too significantly within a single batch. There are several ways you can achieve this in NeMo: + +1) The simplest way is to use the ``min_duration`` parameter in the dataset config, which will simply +discard all utterances below the specified length. This is a viable option if removing these utterances will not +significantly impact the total amount of hours of your dataset. + +2) If your dataset contains many long utterances (longer than ~16 seconds) with varying length, then you may instead +want to use the ``random_segment`` perturbation, which will sample segments of a certain length from the full sample at +runtime (samples below the provided segment length will be padded). You can enable this by adding the following to your +dataset config: + +.. code-block:: yaml + + augmentor: + random_segment: + prob: 1.0 + duration_sec: 16 # specify the duration you want + +3) You can also use bucketing to ensure similar utterance lengths within batches. +See `Bucketing documentation <../datasets.html#bucketing-datasets>`__. + +An example of SSL train and validation configuration should look similar to the following: + +.. code-block:: yaml + + model: + train_ds: + manifest_filepath: ??? + sample_rate: ${model.sample_rate} + batch_size: 16 # you may increase batch_size if your memory allows + shuffle: true + num_workers: 8 + pin_memory: false + use_start_end_token: true + trim_silence: false + max_duration: 16.7 + min_duration: 8.0 + # tarred datasets + is_tarred: false + tarred_audio_filepaths: null + shuffle_n: 2048 + # bucketing params + bucketing_strategy: "synced_randomized" + bucketing_batch_size: null + + validation_ds: + manifest_filepath: ??? 
+ sample_rate: ${model.sample_rate} + batch_size: 16 # you may increase batch_size if your memory allows + shuffle: false + num_workers: 8 + pin_memory: true + use_start_end_token: false + min_duration: 8.0 + + +Preprocessor Configuration +-------------------------- + +Preprocessor helps to compute MFCC or mel spectrogram features that are given as inputs to model. +For details on how to write this section, refer to `Preprocessor Configuration <../configs.html#preprocessor-configuration>`__ + +Augmentation Configurations +--------------------------- + +For self-supervised pre-training, we recommend using the ``MaskedPatchAugmentation`` class for spectrogram masking. +This augmentation divides utterances into fixed size patches, and then masks a fixed amount/fraction of them. You can +also add ``freq_masks`` and ``freq_width`` to apply masking to frequency bands. + +If you are using contrastive loss with negatives sampled from masked steps in same utterance only, +make sure that the total amount of masked steps in each utterance will be big enough for the number of sampled negatives. +For example, if you are using 4x stride and want to sample 100 negatives, then you will need more than 400 masked steps. +If you are using the default ``patch_size`` of 48, then this means you will need to set ``mask_patches`` to at least 9. +When using a fraction of the total amount of patches instead of a fixed amount, you will need to make sure that the +minimum duration of your samples in large enough for the number of negatives to sample. + +.. code-block:: yaml + + spec_augment: + _target_: nemo.collections.asr.modules.MaskedPatchAugmentation + patch_size: 48 # size of a single patch + mask_patches: 0.5 # fraction of patches to mask (can be fixed int amount instead) + freq_masks: 3 # Cut three frequency bands + freq_width: 20 # ... of width 20 at maximum + + +Model Architecture Configurations +--------------------------------- + +Each configuration file should describe the model architecture being used for the experiment. For self-supervised pre-training, +we will typically train the encoder of the model and then re-use it for fine-tuning, so the encoder can be configured in the same way +as you would for an ASR model. Note that any ASR model encoder can be used with any of the available pre-training methods, +though, given the same model sizes, we find the best downstream results when using `Conformer <./models.html#Conformer-Transducer>`__. + +Unlike the encoders, the decoders and corresponding losses will be specific to the self-supervised pre-training, and are small enough that +you can discard them when transferring the model to fine-tuning. + +The most basic method of pre-training we can use is to have the model solve a contrastive task +(this is the approach used in wav2vec 2.0 :cite:`ssl-models-wav2vec2`) +We can define the corresponding decoder and loss configs in the following way for an encoder with stride 4x. + +.. 
code-block:: yaml + + decoder_out: 128 + + decoder: + _target_: nemo.collections.asr.modules.ConvASRDecoderReconstruction + feat_in: ${model.encoder.d_model} + feat_hidden: 128 + feat_out: ${model.decoder_out} + stride_layers: 0 + # if loss.combine_time_steps is less than the encoder stride, then a corresponding amount of stride_layers needs to + # be added to the decoder (here stride and combine_time_steps are both 4) + non_stride_layers: 0 + + loss: + _target_: nemo.collections.asr.losses.ContrastiveLoss + in_dim: ${model.preprocessor.features} + proj_dim: ${model.decoder_out} + combine_time_steps: 4 # how many spectrogram time steps are used for one target/representation for contrastive task + quantized_targets: true # should quantizer or linear layer be used + codebook_size: 300 # size of a single codebook for quantizer + num_groups: 2 # number of codebooks to use for quantizer + num_negatives: 100 # number of sampled negatives for each target + sample_from_same_utterance_only: true # should negatives be sampled only from the same utterance + sample_from_non_masked: false # should negatives be sampled from non-masked steps + +Note that in the above example we combine 4 steps from the input spectrogram into a single "token" for the loss, +which corresponds to the encoder stride 4x. We might want to use different values for "combine_time_steps" and encoder stride. +In that case, we will need to add stride layers to decoders to match the strides. We can use the following example config +for a Citrinet encoder with stride 8x. In order to go from stride 8x to 4x, we use a single ``stride_layer`` in the decoder +with ``stride_transpose`` set to True. + +.. code-block:: yaml + + decoder: + _target_: nemo.collections.asr.modules.ConvASRDecoderReconstruction + feat_in: ${model.model_defaults.enc_final} + feat_hidden: 128 + feat_out: ${model.model_defaults.decoder_out_channels} + stride_layers: 1 + #if loss.combine_time_steps is less than the encoder stride, then a corresponding amount of stride_layers needs to + #be added to the decoder (here stride is 8 and combine_time_steps is 4, so 1 stride layer is added) + non_stride_layers: 0 + stride_tranpose: true # whether to use transposed convolution for stride layers or not + + loss: + _target_: nemo.collections.asr.losses.ContrastiveLoss + in_dim: *n_mels + proj_dim: ${model.model_defaults.decoder_out_channels} + combine_time_steps: 4 #how many spectrogram time steps are used for one target/representation for contrastive task + quantized_targets: false #should quantizer or linear layer be used + sample_from_same_utterance_only: true #should negatives be sampled only from the same utterance + sample_from_non_masked: false #should negatives be sampled from non-masked steps + + +It can be beneficial to combine contrastive loss with other losses, such as a masked language modeling (mlm) loss +(similar approach to W2V-Bert :cite:`ssl-models-w2v_bert`). +In order to do this, instead of specifying a single ``decoder`` and ``loss`` in the config, we can specify a ``loss_list``, +which can contain any amount of corresponding decoders and losses. For each decoder-loss pair, +we can specify a separate named sub-config, which contains the following fields: + +1. ``decoder`` - The decoder config, specifying a ``target`` class and parameters. +2. ``loss`` - The corresponding loss config, specifying a ``target`` class and parameters. +3. ``loss_alpha`` - A multiplier on this loss (1.0 by default). +4. 
``targets_from_loss`` - This parameter specifies which contrastive loss we should extract labels from. It is necessary for any loss which requires labels, if labels aren't present in your manifest. +5. ``transpose_encoded`` - This parameter is used to optionally transpose the encoded features before passing them into this loss. +6. ``start_step`` - The training step at which we should start using this decoder+loss. +7. ``output_from_layer`` - This parameter can be used to specify the name of the layer that we should extract encoded features from to pass into this decoder. If it's not specified or set to null, the final encoder layer is used. + + +The following is an example of a `loss_list` for a combination of contrastive+mlm losses, +where the mlm loss uses targets from the quantization module of the contrastive loss. + + +.. code-block:: yaml + + decoder_out: 128 + + loss_list: + contrastive: + decoder: + _target_: nemo.collections.asr.modules.ConvASRDecoderReconstruction + feat_in: ${model.encoder.d_model} + feat_hidden: 128 + # features in hidden layer of decoder + feat_out: ${model.decoder_out} + stride_layers: 0 + # if loss.combine_time_steps is less than the encoder stride, then a corresponding amount of stride_layers needs to + # be added to the decoder (here stride and combine_time_steps are both 4) + non_stride_layers: 0 + loss: + _target_: nemo.collections.asr.losses.ContrastiveLoss + in_dim: ${model.preprocessor.features} + proj_dim: ${model.decoder_out} + combine_time_steps: 4 # how many spectrogram time steps are used for one target/representation for contrastive task + quantized_targets: true # should quantizer or linear layer be used + # (quantizer is required to extract pseudo-labels for other losses) + codebook_size: 300 + num_groups: 2 + sample_from_same_utterance_only: true # should negatives be sampled only from the same utterance + sample_from_non_masked: false # should negatives be sampled from non-masked steps + + mlm: + decoder: + _target_: nemo.collections.asr.modules.ConvASRDecoder + feat_in: ${model.encoder.d_model} + num_classes: 90000 + # set this to be equal to codebook_size^groups in the contrastive loss + loss: + _target_: nemo.collections.asr.losses.MLMLoss + combine_time_steps: 4 + targets_from_loss: "contrastive" + # since this loss requires targets, we can either get them from a manifest or from a quantized contrastive loss + loss_alpha: 1000. + # multiplier applied to this loss relative to others + transpose_encoded: false + # transposing input may be necessary depending on which layer is used as input to decoder + start_step: 0 + # determines what global step this loss starts being used at; + # this can be set to a higher number if your training is long enough, + # which may increase early training stability + output_from_layer: null + # if we wanted to use outputs from non-final encoder layer as input to this decoder, + # the layer name should be specified here + + +We can also use other losses which require labels instead of mlm, such as ctc or rnnt loss. Since these losses, unlike mlm, +don't require our targets to have a direct alignment with our steps, we may also want to use set the ``reduce_ids`` parameter of the +contrastive loss to true, to convert any sequence of consecutive equivalent ids to a single occurence of that id. + +An example of a ``loss_list`` consisting of contrastive+ctc loss can look like this: + +.. 
code-block:: yaml + + decoder_out: 128 + + loss_list: + contr: + decoder: + _target_: nemo.collections.asr.modules.ConvASRDecoderReconstruction + feat_in: ${model.encoder.d_model} + feat_hidden: 128 + feat_out: ${model.decoder_out} + stride_layers: 0 + non_stride_layers: 0 + loss: + _target_: nemo.collections.asr.losses.ContrastiveLoss + in_dim: ${model.preprocessor.features} + proj_dim: ${model.decoder_out} + combine_time_steps: 4 + quantized_targets: true + codebook_size: 300 + num_groups: 2 + sample_from_same_utterance_only: true + sample_from_non_masked: false + reduce_ids: true + + ctc: + decoder: + _target_: nemo.collections.asr.modules.ConvASRDecoder + feat_in: ${model.encoder.d_model} + num_classes: 90000 + loss: + _target_: nemo.collections.asr.losses.CTCLossForSSL + num_classes: 90000 + targets_from_loss: "contr" + start_step: 3000 + +An example of contrastive+rnnt can look like this: + +.. code-block:: yaml + + decoder_out: 128 + + loss_list: + contr: + decoder: + _target_: nemo.collections.asr.modules.ConvASRDecoderReconstruction + feat_in: ${model.encoder.d_model} + feat_hidden: 128 + feat_out: ${model.decoder_out} + stride_layers: 0 + non_stride_layers: 0 + loss: + _target_: nemo.collections.asr.losses.ContrastiveLoss + in_dim: ${model.preprocessor.features} + proj_dim: ${model.decoder_out} + combine_time_steps: 4 + quantized_targets: true + codebook_size: 24 + sample_from_same_utterance_only: true + sample_from_non_masked: false + reduce_ids: true + + rnnt: + decoder: + _target_: nemo.collections.asr.modules.RNNTDecoderJointSSL + decoder: + _target_: nemo.collections.asr.modules.RNNTDecoder + normalization_mode: null # Currently only null is supported for export. + random_state_sampling: false # Random state sampling: https://arxiv.org/pdf/1910.11455.pdf + blank_as_pad: true # This flag must be set in order to support exporting of RNNT models + efficient inference. + vocab_size: 576 + prednet: + pred_hidden: 640 + pred_rnn_layers: 1 + t_max: null + dropout: 0.1 + joint: + _target_: nemo.collections.asr.modules.RNNTJoint + log_softmax: null # 'null' would set it automatically according to CPU/GPU device + preserve_memory: false # dramatically slows down training, but might preserve some memory + experimental_fuse_loss_wer: false + jointnet: + encoder_hidden: 512 + pred_hidden: 640 + joint_hidden: 640 + activation: "relu" + dropout: 0.1 + num_classes: 576 + loss: + _target_: nemo.collections.asr.losses.RNNTLossForSSL + num_classes: 576 + targets_from_loss: "contr" + start_step: 1000 + + +We can also use multiple losses, which use features from different intermediate layers of the encoder as input :cite:`ssl-models-ssl_inter`. +In the following config example, we use contrastive loss + three different mlm losses, which use encoder outputs +respectively from 6th, 12th and final layer. + +.. code-block:: yaml + + decoder_out: 128 + + loss_list: + contr: + decoder: + _target_: nemo.collections.asr.modules.ConvASRDecoderReconstruction + feat_in: ${model.encoder.d_model} + feat_hidden: 128 + feat_out: ${model.decoder_out} + stride_layers: 0 + non_stride_layers: 0 + loss: + _target_: nemo.collections.asr.losses.ContrastiveLoss + in_dim: ${model.preprocessor.features} + proj_dim: ${model.decoder_out} + combine_time_steps: 4 + quantized_targets: true + codebook_size: 300 + sample_from_same_utterance_only: true + sample_from_non_masked: false + loss_alpha: 5. 
+ + mlm: + decoder: + _target_: nemo.collections.asr.modules.ConvASRDecoder + feat_in: ${model.encoder.d_model} + num_classes: 90000 + loss: + _target_: nemo.collections.asr.losses.MLMLoss + combine_time_steps: 4 + targets_from_loss: "contr" + loss_alpha: 1000. + + mlm_2: + decoder: + _target_: nemo.collections.asr.modules.ConvASRDecoder + feat_in: ${model.encoder.d_model} + num_classes: 90000 + loss: + _target_: nemo.collections.asr.losses.MLMLoss + combine_time_steps: 4 + targets_from_loss: "contr" + loss_alpha: 300. + output_from_layer: "layers.5" + transpose_encoded: true + + mlm_3: + decoder: + _target_: nemo.collections.asr.modules.ConvASRDecoder + feat_in: ${model.encoder.d_model} + num_classes: 90000 + loss: + _target_: nemo.collections.asr.losses.MLMLoss + combine_time_steps: 4 + targets_from_loss: "contr" + loss_alpha: 300. + output_from_layer: "layers.11" + transpose_encoded: true + +References +----------- + +.. bibliography:: ../asr_all.bib + :style: plain + :labelprefix: SSL-MODELS + :keyprefix: ssl-models- \ No newline at end of file diff --git a/docs/source/asr/ssl/data/benchmark_ssl.csv b/docs/source/asr/ssl/data/benchmark_ssl.csv new file mode 100644 index 0000000000000000000000000000000000000000..6085a5f3fd4f51d6bbb8dbfc24830f729e09d06b --- /dev/null +++ b/docs/source/asr/ssl/data/benchmark_ssl.csv @@ -0,0 +1,3 @@ +Model Name,Model Base Class,Model Card +ssl_en_conformer_large,SpeechEncDecSelfSupervisedModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:ssl_en_conformer_large" +ssl_en_conformer_xlarge,SpeechEncDecSelfSupervisedModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:ssl_en_conformer_xlarge" \ No newline at end of file diff --git a/docs/source/asr/ssl/datasets.rst b/docs/source/asr/ssl/datasets.rst new file mode 100644 index 0000000000000000000000000000000000000000..b597dcf4b7b3101b716197633a4aae88bb98de4c --- /dev/null +++ b/docs/source/asr/ssl/datasets.rst @@ -0,0 +1,8 @@ +Datasets +======== + +Any dataset available in NeMo for ASR (`ASR datasets <../datasets.html>`__) can be used for SSL. +To create your own NeMo compatible datasets, refer to +`Preparing Custom ASR Data <../datasets.html#preparing-custom-asr-data>`__ +section. Note that explicit labels (transcriptions) are not utilized in SSL and hence are optional +when creating datasets for SSL. \ No newline at end of file diff --git a/docs/source/asr/ssl/intro.rst b/docs/source/asr/ssl/intro.rst new file mode 100644 index 0000000000000000000000000000000000000000..cea9bdbc39d4af6985909e1f60189c0db7628560 --- /dev/null +++ b/docs/source/asr/ssl/intro.rst @@ -0,0 +1,34 @@ +Self-Supervised Learning +================================= + +Self-Supervised Learning (SSL) refers to the problem of learning without explicit labels. As +any learning process require feedback, without explit labels, SSL derives supervisory signals from +the data itself. The general ideal of SSL is to predict any hidden part (or property) of the input +from observed part of the input (e.g., filling in the blanks in a sentence or predicting whether +an image is upright or inverted). + +SSL for speech/audio understanding broadly falls into either contrastive or reconstruction +based approaches. In contrastive methods, models learn by distinguising between true and distractor +tokens (or latents). Examples of contrastive approaches are Contrastive Predictive Coding (CPC), +Masked Language Modeling (MLM) etc. In reconstruction methods, models learn by directly estimating +the missing (intentionally leftout) portions of the input. 
Masked Reconstruction and Autoregressive +Predictive Coding (APC) are a few examples. + +In the recent past, SSL has contributed significantly to improving Acoustic Modeling (AM), i.e., the +encoder module of neural ASR models. Here too, the majority of SSL effort is focused on improving AM. +While AM is the most common focus of SSL in ASR, SSL can also be utilized to improve other parts of +ASR models (e.g., the predictor module in transducer-based ASR models). + +The full documentation tree is as follows: + +.. toctree:: + :maxdepth: 8 + + models + datasets + results + configs + api + resources + +.. include:: resources.rst diff --git a/docs/source/asr/ssl/models.rst b/docs/source/asr/ssl/models.rst new file mode 100644 index 0000000000000000000000000000000000000000..1713aff6472c3dac448389d0cf70838c9b325ce0 --- /dev/null +++ b/docs/source/asr/ssl/models.rst @@ -0,0 +1,12 @@ +Models +====== + +End-to-End ASR models are typically of encoder-decoder style, where the encoder does acoustic +modeling, i.e., converting the speech waveform into features, and the decoder converts those features into +text. The encoder contains the bulk of the trainable parameters and is usually the focus of SSL in ASR. +Thus, any architecture that can be used as an encoder in ASR models can be pre-trained using SSL. For an +overview of model architectures that are currently supported in NeMo's ASR collection, refer +to `ASR Models <../models.html>`__. Note that SSL also uses an encoder-decoder style of model. During +down-stream fine-tuning, the encoder is retained, whereas the decoder (used during SSL) is replaced +with a down-stream task-specific module. Refer to `checkpoints <./results.html>`__ to see how this is +accomplished in NeMo. diff --git a/docs/source/asr/ssl/resources.rst b/docs/source/asr/ssl/resources.rst new file mode 100644 index 0000000000000000000000000000000000000000..416c37ceb6f09fc0d05360fce7fc361d5bc91af8 --- /dev/null +++ b/docs/source/asr/ssl/resources.rst @@ -0,0 +1,23 @@ +Resources and Documentation +--------------------------- + +Refer to the `SSL-for-ASR notebook `_ +for a hands-on tutorial. If you are new to NeMo, consider trying out the +`ASR with NeMo `_ +tutorial. This and most other tutorials can be run on Google Colab by specifying the link to the +notebooks' GitHub pages on Colab. + +If you are looking for information about a particular ASR model, or would like to find out more +about the model architectures available in the ``nemo_asr`` collection, refer to the +`ASR Models <../models.html>`__ page. + +NeMo includes preprocessing scripts for several common ASR datasets. The `ASR Datasets <../datasets.html>`__ +page contains instructions on running those scripts. It also includes guidance for creating your +own NeMo-compatible dataset, if you have your own data. + +Information about how to load model checkpoints (either local files or pretrained ones from NGC), +as well as a list of the checkpoints available on NGC, is located on the `Checkpoints <./results.html>`__ +page. + +Documentation regarding the configuration files specific to SSL can be found on the +`Configuration Files <./configs.html>`__ page. diff --git a/docs/source/asr/ssl/results.rst b/docs/source/asr/ssl/results.rst new file mode 100644 index 0000000000000000000000000000000000000000..adc14b15285b85cb86e9ff662fc13a248e98d9fc --- /dev/null +++ b/docs/source/asr/ssl/results.rst @@ -0,0 +1,106 @@ +Checkpoints +=========== + +Pre-trained SSL checkpoints available in NeMo need to be further fine-tuned on a down-stream task.
+There are two main ways to load pretrained checkpoints in NeMo: + +* Using the :code:`restore_from()` method to load a local checkpoint file (``.nemo``), or +* Using the :code:`from_pretrained()` method to download and set up a checkpoint from NGC. + +Refer to the following sections for instructions and examples for each. + +Note that these instructions are for fine-tuning. To resume an unfinished training experiment, +use the Experiment Manager to do so by setting the ``resume_if_exists`` flag to ``True``. + +Loading Local Checkpoints +------------------------- + +NeMo automatically saves checkpoints of a model being trained in the ``.nemo`` format. Alternatively, to manually save the model at any +point, issue :code:`model.save_to("<checkpoint_path>.nemo")`. + +If there is a local ``.nemo`` checkpoint that you'd like to load, use the :code:`restore_from()` method: + +.. code-block:: python + + import nemo.collections.asr as nemo_asr + ssl_model = nemo_asr.models.<model_base_class>.restore_from(restore_path="<path/to/checkpoint.nemo>") + +Where the model base class is the ASR model class of the original checkpoint, or the general ``ASRModel`` class. + +Loading NGC Pretrained Checkpoints +---------------------------------- + +The SSL collection has checkpoints of several models trained on various datasets. These checkpoints are +obtainable via the NGC `NeMo Automatic Speech Recognition collection `_. +The model cards on NGC contain more information about each of the checkpoints available. + +The table at the end of this page lists the SSL models available from NGC. The models can be accessed via the :code:`from_pretrained()` method inside +the ASR Model class. In general, you can load any of these models with code in the following format: + +.. code-block:: python + + import nemo.collections.asr as nemo_asr + ssl_model = nemo_asr.models.ASRModel.from_pretrained(model_name="<model_name>") + +Where ``model_name`` is the value under the "Model Name" entry in the table below. + +For example, to load the Conformer Large SSL checkpoint, run: + +.. code-block:: python + + ssl_model = nemo_asr.models.ASRModel.from_pretrained(model_name="ssl_en_conformer_large") + +You can also call :code:`from_pretrained()` from the specific model class (such as :code:`SpeechEncDecSelfSupervisedModel` +for Conformer) if you need to access a specific model functionality. + +If you would like to programmatically list the models available for a particular base class, you can use the +:code:`list_available_models()` method. + +.. code-block:: python + + nemo_asr.models.<model_base_class>.list_available_models() + + +Loading SSL checkpoint into Down-stream Model +--------------------------------------------- +After loading an SSL checkpoint as shown above, its ``state_dict`` needs to be copied to a +down-stream model for fine-tuning. + +For example, to load an SSL checkpoint for an ASR down-stream task using ``EncDecRNNTBPEModel``, run: + +.. code-block:: python + + # define down-stream model + asr_model = nemo_asr.models.EncDecRNNTBPEModel(cfg=cfg.model, trainer=trainer) + + # load ssl checkpoint + asr_model.load_state_dict(ssl_model.state_dict(), strict=False) + + # discard ssl model + del ssl_model + +Refer to `SSL configs <./configs.html>`__ to do this automatically via config files. + +Fine-tuning on Downstream Datasets +----------------------------------- + +After loading the SSL checkpoint into the down-stream model, refer to the ASR tutorials provided in the :ref:`Tutorials ` section. +Most of these tutorials explain how to fine-tune on some dataset as a demonstration.
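+Putting these pieces together, a minimal end-to-end fine-tuning sketch could look like the following.
+The config file path and the OmegaConf/PyTorch Lightning setup shown here are illustrative assumptions
+rather than a prescribed recipe:
+
+.. code-block:: python
+
+    import pytorch_lightning as pl
+    import nemo.collections.asr as nemo_asr
+    from omegaconf import OmegaConf
+
+    # Illustrative config path; in practice the config usually comes from a YAML file via Hydra
+    cfg = OmegaConf.load("/path/to/downstream_rnnt_bpe_config.yaml")
+    trainer = pl.Trainer(**cfg.trainer)
+
+    # Load the pre-trained SSL encoder from NGC
+    ssl_model = nemo_asr.models.ASRModel.from_pretrained(model_name="ssl_en_conformer_large")
+
+    # Define the down-stream model and initialize its encoder with the SSL weights
+    asr_model = nemo_asr.models.EncDecRNNTBPEModel(cfg=cfg.model, trainer=trainer)
+    asr_model.load_state_dict(ssl_model.state_dict(), strict=False)
+    del ssl_model
+
+    # Fine-tune on the labeled down-stream dataset
+    trainer.fit(asr_model)
+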
+ +Inference Execution Flow Diagram +-------------------------------- + +When preparing your own inference scripts after downstream fine-tuning, please follow the execution flow diagram order for correct inference, found at the `examples directory for ASR collection `_. + +SSL Models +----------------------------------- + +Below is a list of all the SSL models that are available in NeMo. + + +.. csv-table:: + :file: data/benchmark_ssl.csv + :align: left + :widths: 40, 10, 50 + :header-rows: 1 diff --git a/docs/source/common/callbacks.rst b/docs/source/common/callbacks.rst new file mode 100644 index 0000000000000000000000000000000000000000..a627e0dd2ca25032cbedece07a2eaf8e4ed3ecc3 --- /dev/null +++ b/docs/source/common/callbacks.rst @@ -0,0 +1,51 @@ +********* +Callbacks +********* + +Exponential Moving Average (EMA) +================================ + +During training, EMA maintains a moving average of the trained parameters. +EMA parameters can produce significantly better results and faster convergence for a variety of different domains and models. + +EMA is a simple calculation. EMA Weights are pre-initialized with the model weights at the start of training. + +Every training update, the EMA weights are updated based on the new model weights. + +.. math:: + ema_w = ema_w * decay + model_w * (1-decay) + +Enabling EMA is straightforward. We can pass the additional argument to the experiment manager at runtime. + +.. code-block:: bash + + python examples/asr/asr_ctc/speech_to_text_ctc.py \ + model.train_ds.manifest_filepath=/path/to/my/train/manifest.json \ + model.validation_ds.manifest_filepath=/path/to/my/validation/manifest.json \ + trainer.devices=2 \ + trainer.accelerator='gpu' \ + trainer.max_epochs=50 \ + exp_manager.ema.enable=True # pass this additional argument to enable EMA + +To change the decay rate, pass the additional argument. + +.. code-block:: bash + + python examples/asr/asr_ctc/speech_to_text_ctc.py \ + ... + exp_manager.ema.enable=True \ + exp_manager.ema.decay=0.999 + +We also offer other helpful arguments. + +.. list-table:: + :header-rows: 1 + + * - Argument + - Description + * - `exp_manager.ema.validate_original_weights=True` + - Validate the original weights instead of EMA weights. + * - `exp_manager.ema.every_n_steps=2` + - Apply EMA every N steps instead of every step. + * - `exp_manager.ema.cpu_offload=True` + - Offload EMA weights to CPU. May introduce significant slow-downs. diff --git a/docs/source/common/intro.rst b/docs/source/common/intro.rst new file mode 100644 index 0000000000000000000000000000000000000000..dbe8d5d17930b6012c1be8820867cdf7199655e0 --- /dev/null +++ b/docs/source/common/intro.rst @@ -0,0 +1,12 @@ +Common Collection +================= + +The common collection contains things that could be used across all collections. + +.. toctree:: + :maxdepth: 8 + + callbacks + losses + metrics + tokenizers diff --git a/docs/source/common/losses.rst b/docs/source/common/losses.rst new file mode 100644 index 0000000000000000000000000000000000000000..006746face297a696099a3b1151c30096debb92d --- /dev/null +++ b/docs/source/common/losses.rst @@ -0,0 +1,16 @@ +Losses +------ +.. autoclass:: nemo.collections.common.losses.AggregatorLoss + :special-members: __init__ + +.. autoclass:: nemo.collections.common.losses.CrossEntropyLoss + :special-members: __init__ + +.. autoclass:: nemo.collections.common.losses.MSELoss + :special-members: __init__ + +.. autoclass:: nemo.collections.common.losses.SmoothedCrossEntropyLoss + :special-members: __init__ + +.. 
autoclass:: nemo.collections.common.losses.SpanningLoss + :special-members: __init__ diff --git a/docs/source/common/metrics.rst b/docs/source/common/metrics.rst new file mode 100644 index 0000000000000000000000000000000000000000..a47bd9f6f09b7e66751bcad60453586157d61134 --- /dev/null +++ b/docs/source/common/metrics.rst @@ -0,0 +1,7 @@ +Metrics +------- + +.. autoclass:: nemo.collections.common.metrics.Perplexity + :show-inheritance: + :members: + :undoc-members: diff --git a/docs/source/common/tokenizers.rst b/docs/source/common/tokenizers.rst new file mode 100644 index 0000000000000000000000000000000000000000..5c7336e8d603bd31202223fb1f8479f3b4f58186 --- /dev/null +++ b/docs/source/common/tokenizers.rst @@ -0,0 +1,8 @@ +Tokenizers +---------- +.. autoclass:: nemo.collections.common.tokenizers.AutoTokenizer + :special-members: __init__ +.. autoclass:: nemo.collections.common.tokenizers.SentencePieceTokenizer + :special-members: __init__ +.. autoclass:: nemo.collections.common.tokenizers.TokenizerSpec + :special-members: __init__ diff --git a/docs/source/conf.py b/docs/source/conf.py new file mode 100644 index 0000000000000000000000000000000000000000..8ce8e146c2fde8826d66458c82c45ceaeb2ceffa --- /dev/null +++ b/docs/source/conf.py @@ -0,0 +1,256 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- + +# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import re +import sys +import glob + +import sphinx_book_theme + +# If extensions (or modules to document with autodoc) are in another directory, +# add these directories to sys.path here. If the directory is relative to the +# documentation root, use os.path.abspath to make it absolute, like shown here. 
+ +sys.path.insert(0, os.path.abspath("../..")) +sys.path.insert(0, os.path.abspath("../../nemo")) +sys.path.insert(0, os.path.abspath("../../nemo_text_processing")) + +from package_info import __version__ + +templates_path = ["_templates"] + +autodoc_mock_imports = [ + 'torch', + 'torch.nn', + 'torch.utils', + 'torch.optim', + 'torch.utils.data', + 'torch.utils.data.sampler', + 'torchtext', + 'torchvision', + 'ruamel.yaml', # ruamel.yaml has ., which is troublesome for this regex + 'hydra', # hydra-core in requirements, hydra during import + 'dateutil', # part of core python + 'transformers.tokenization_bert', # has ., troublesome for this regex + 'megatron', # megatron-lm in requirements, megatron in import + 'sklearn', # scikit_learn in requirements, sklearn in import + 'nemo_text_processing.inverse_text_normalization', # Not installed automatically + 'nemo_text_processing.text_normalization', # Not installed automatically + 'attr', # attrdict in requirements, attr in import + 'torchmetrics', # inherited from PTL + 'lightning_utilities', # inherited from PTL + 'apex', + 'joblib', # inherited from optional code + 'IPython', + 'ipadic', + 'psutil', +] + +_skipped_autodoc_mock_imports = ['wrapt', 'numpy'] + +for req_path in sorted(list(glob.glob("../../requirements/*.txt"))): + if "docs.txt" in req_path: + continue + + req_file = os.path.abspath(os.path.expanduser(req_path)) + with open(req_file, 'r') as f: + for line in f: + line = line.replace("\n", "") + req = re.search(r"([a-zA-Z0-9-_]*)", line) + if req: + req = req.group(1) + req = req.replace("-", "_") + + if req not in autodoc_mock_imports: + if req in _skipped_autodoc_mock_imports: + print(f"Skipping req : `{req}` (lib {line})") + continue + + autodoc_mock_imports.append(req) + print(f"Adding req : `{req}` to autodoc mock requirements (lib {line})") + else: + print(f"`{req}` already added to autodoc mock requirements (lib {line})") + +# +# -- General configuration ------------------------------------------------ + +# If your documentation needs a minimal Sphinx version, state it here. +# +# needs_sphinx = '1.0' + +# Add any Sphinx extension module names here, as strings. They can be +# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom +# ones. + +extensions = [ + "sphinx.ext.autodoc", + "sphinx.ext.todo", + "sphinx.ext.coverage", + "sphinx.ext.mathjax", + "sphinx.ext.ifconfig", + "sphinx.ext.viewcode", + "sphinx.ext.napoleon", + "sphinx.ext.githubpages", + "sphinxcontrib.bibtex", + "sphinx.ext.inheritance_diagram", + "sphinx.ext.intersphinx", + "sphinx.ext.autosectionlabel", + "sphinxcontrib.bibtex", + "sphinx_copybutton", +] + +bibtex_bibfiles = [ + 'asr/asr_all.bib', + 'nlp/nlp_all.bib', + 'nlp/text_normalization/tn_itn_all.bib', + 'tools/tools_all.bib', + 'tts/tts_all.bib', + 'text_processing/text_processing_all.bib', + 'core/adapters/adapter_bib.bib', +] + +intersphinx_mapping = { + 'pytorch': ('https://pytorch.org/docs/stable', None), + 'pytorch-lightning': ('https://pytorch-lightning.readthedocs.io/en/latest/', None), +} + +# Set default flags for all classes. +autodoc_default_options = {'members': None, 'undoc-members': None, 'show-inheritance': True} + +locale_dirs = ['locale/'] # path is example but recommended. +gettext_compact = False # optional. + +# The suffix(es) of source filenames. +# You can specify multiple suffix as a list of string: +# +# source_suffix = ['.rst', '.md'] +source_suffix = ".rst" + +# The master toctree document. +master_doc = "index" + +# General information about the project. 
+project = "NVIDIA NeMo" +copyright = "© 2021-2022 NVIDIA Corporation & Affiliates. All rights reserved." +author = "NVIDIA CORPORATION" + +# The version info for the project you're documenting, acts as replacement for +# |version| and |release|, also used in various other places throughout the +# built documents. + + +# The short X.Y version. +# version = "0.10.0" +version = __version__ +# The full version, including alpha/beta/rc tags. +# release = "0.9.0" +release = __version__ + +# The language for content autogenerated by Sphinx. Refer to documentation +# for a list of supported languages. +# +# This is also used if you do content translation via gettext catalogs. +# Usually you set "language" from the command line for these cases. +language = None + +# List of patterns, relative to source directory, that match files and +# directories to ignore when looking for source files. +# This patterns also effect to html_static_path and html_extra_path +exclude_patterns = [] + +# The name of the Pygments (syntax highlighting) style to use. +pygments_style = "default" + +### Previous NeMo theme +# # NVIDIA theme settings. +# html_theme = 'nvidia_theme' + +# html_theme_path = ["."] + +# html_theme_options = { +# 'display_version': True, +# 'project_version': version, +# 'project_name': project, +# 'logo_path': None, +# 'logo_only': True, +# } +# html_title = 'Introduction' + +# html_logo = html_theme_options["logo_path"] + +# -- Options for HTMLHelp output ------------------------------------------ + +# Output file base name for HTML help builder. +htmlhelp_basename = "nemodoc" + +### from TLT conf.py +# -- Options for HTML output ------------------------------------------------- + +# The theme to use for HTML and HTML Help pages. See the documentation for +# a list of builtin themes. +# + +# html_theme_path = [sphinx_rtd_theme.get_html_theme_path()] + +html_theme = "sphinx_book_theme" +html_logo = os.path.join('nv_logo.png') +html_title = 'NVIDIA NeMo' + +html_theme_options = { + 'logo_only': True, + 'display_version': True, + # 'prev_next_buttons_location': 'bottom', + # 'style_external_links': False, + # 'style_nav_header_background': '#000000', + # Toc options + 'collapse_navigation': False, + # 'sticky_navigation': False, + 'navigation_depth': 10, + # 'includehidden': False, + # 'titles_only': False, + # Sphinx Book theme, + 'repository_url': 'https://github.com/NVIDIA/NeMo', + 'use_repository_button': True, + 'show_navbar_depth': 1, + 'show_toc_level': 10, +} + + +# Add any paths that contain custom static files (such as style sheets) here, +# relative to this directory. They are copied after the builtin static files, +# so a file named "default.css" will overwrite the builtin "default.css". 
+ +html_favicon = 'favicon.ico' + +html_static_path = ['_static'] + +html_last_updated_fmt = '' + + +def setup(app): + app.add_css_file('css/custom.css') + app.add_js_file('js/pk_scripts.js') + + +# html_css_files = [ +# './custom.css', +# ] + +# html_js_files = [ +# './pk_scripts.js', +# ] diff --git a/docs/source/core/adapters/adapter_bib.bib b/docs/source/core/adapters/adapter_bib.bib new file mode 100644 index 0000000000000000000000000000000000000000..9a04d876f162657f83cda03a228672efd08a323e --- /dev/null +++ b/docs/source/core/adapters/adapter_bib.bib @@ -0,0 +1,21 @@ + + +@inproceedings{houlsby2019adapter, + title={Parameter-efficient transfer learning for NLP}, + author={Houlsby, Neil and Giurgiu, Andrei and Jastrzebski, Stanislaw and Morrone, Bruna and De Laroussilhe, Quentin and Gesmundo, Andrea and Attariyan, Mona and Gelly, Sylvain}, + booktitle={International Conference on Machine Learning}, + pages={2790--2799}, + year={2019}, + organization={PMLR} +} + +@misc{Junxian2021unified, + doi = {10.48550/ARXIV.2110.04366}, + url = {https://arxiv.org/abs/2110.04366}, + author = {He, Junxian and Zhou, Chunting and Ma, Xuezhe and Berg-Kirkpatrick, Taylor and Neubig, Graham}, + keywords = {Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences}, + title = {Towards a Unified View of Parameter-Efficient Transfer Learning}, + publisher = {arXiv}, + year = {2021}, + copyright = {arXiv.org perpetual, non-exclusive license} +} diff --git a/docs/source/core/adapters/api.rst b/docs/source/core/adapters/api.rst new file mode 100644 index 0000000000000000000000000000000000000000..b0f2a8e13610f5dc6120fc8e32fbf3402b7981da --- /dev/null +++ b/docs/source/core/adapters/api.rst @@ -0,0 +1,65 @@ +Adapters API +============ + +Core +---- + +.. autoclass:: nemo.core.adapter_mixins.AdapterModuleMixin + :show-inheritance: + :members: + :member-order: bysource + :undoc-members: adapter_module_names + +----- + +.. autoclass:: nemo.core.adapter_mixins.AdapterModelPTMixin + :show-inheritance: + :members: + :member-order: bysource + :undoc-members: adapter_module_names + +----- + +Adapter Networks +---------------- + + +.. autoclass:: nemo.collections.common.parts.adapter_modules.AdapterModuleUtil + :show-inheritance: + :members: + :member-order: bysource + +----- + +.. autoclass:: nemo.collections.common.parts.adapter_modules.LinearAdapter + :show-inheritance: + :members: + :member-order: bysource + +----- + +Adapter Strategies +------------------ + + +.. autoclass:: nemo.core.classes.mixins.adapter_mixin_strategies.AbstractAdapterStrategy + :show-inheritance: + :members: + :member-order: bysource + :undoc-members: adapter_module_names + +----- + +.. autoclass:: nemo.core.classes.mixins.adapter_mixin_strategies.ReturnResultAdapterStrategy + :show-inheritance: + :members: + :member-order: bysource + :undoc-members: adapter_module_names + +----- + +.. 
autoclass:: nemo.core.classes.mixins.adapter_mixin_strategies.ResidualAddAdapterStrategy + :show-inheritance: + :members: + :member-order: bysource + :undoc-members: adapter_module_names diff --git a/docs/source/core/adapters/components.rst b/docs/source/core/adapters/components.rst new file mode 100644 index 0000000000000000000000000000000000000000..cc2ea0b525df206981adc98608983404b7c95c2c --- /dev/null +++ b/docs/source/core/adapters/components.rst @@ -0,0 +1,90 @@ +Adapter Components +================== + +Adapters can be considered as any set of parameters that are added to a pre-existing module/model. In our case, we currently support the standard adapter in literature, more advanced adapter modules are being researched and can potentially be supported by NeMo. + +An adapter module can be any pytorch module, but it must follow certain straightforward requirements - + +1) The model accepts an input of some input dimension, and its output must match this dimension. +2) Ideally, the module is initialized such that the output of the adapter when initialized is such that it does not modify the original input. This allows the model to produce the same output results, even when additional parameters have been added. + +According to Junxian et al :cite:`adapters-Junxian2021unified`, we can consider an adapter being represented as three components - + +1) Functional form - the trainable parameters that will modify the input +2) Insertion form - Where the adapter outputs are integrated with the original input. The input to the adapters can be the last output of the layer, the input to some attention layer, or even the original input to the module itself (before even the modules forward pass). +3) Composition function - How the adapters outputs are integrated with the inputs. It can be as simple as residual addition connection, or concatenation, or point-wise multiplication etc. + +Functional Form - Adapter Networks +================================== + +Adapter modules represent the functional form of the adapter. We discuss an example of a most commonly used adapter module found in literature, titled the ``LinearAdapter`` (or Houlsby Adapter) :cite:`adapters-houlsby2019adapter`. + +.. note:: + + All adapter modules must extend :class:`~nemo.collections.common.parts.adapter_modules.AdapterModuleUtil` and should ideally have an equivalent DataClass config for easy instantiation ! + + +.. autoclass:: nemo.collections.common.parts.adapter_modules.AdapterModuleUtil + :show-inheritance: + :members: + :member-order: bysource + +----- + +.. autoclass:: nemo.collections.common.parts.adapter_modules.LinearAdapter + :show-inheritance: + :members: + :member-order: bysource + + +Insertion Form - Module Adapters +-------------------------------- + +Adapter modules can be integrated into many different locations of a given module. For example, it is possible to have an adapter that affects only the outputs of the final layer in each module. We can also have a ``Parallel Adapter`` :cite:`adapters-Junxian2021unified` that operates at the input of the module itself, in parallel to the forward pass of the module. Yet another insertion location is inside the Multi Head Attention Layers. + +On top of this, while adapters are commonly used only in the layers containing the most parameters (say the Encoder of a network), some models can support adapters in multiple locations (Encoder-Decoder architecture for Language Models, Machine Translation, or even Encoder-Decoder-Joint for ASR with Transducer Loss). 
As such, NeMo utilizes the concept of ``Module Adapters``. + +``Module Adapters`` are very simply defined when adding an adapter - by specifying the module that the adapter should be inserted into. + +.. code-block:: python + + # Get the list of supported modules / locations in an adapter-compatible Model + print(model.adapter_module_names) # assume ['', 'encoder', 'decoder'] + + # When calling add_adapter, specify the module name on the left of the colon symbol, and the adapter name afterwards. + # The adapter is then directed to the decoder module instead of the default / encoder module. + model.add_adapter("decoder:first_adapter", cfg=...) + +You might note that ``model.adapter_module_names`` can sometimes return ``''`` as one of the supported module names - this refers to the "default module". Generally, we try to provide the default as the most commonly used adapter in the literature - for example, Encoder adapters in NLP/NMT/ASR. + +Composition Function - Adapter Strategies +----------------------------------------- + +Finally, we discuss how to compose the input and output of adapter modules. In order to generalize this step, we construct ``Adapter Strategies``. +A strategy is any class (not torch.nn.Module!) that extends :class:`~nemo.core.classes.mixins.adapter_mixin_strategies.AbstractAdapterStrategy`, and provides a ``forward()`` method that accepts a specific signature of the inputs and produces an output tensor which combines the input and output in some specific way. + +We discuss a simple residual addition connection strategy below that accepts the input to the adapter and the adapter's output and simply adds them together. It also supports ``stochastic_depth``, which enables adapters to be dynamically switched off during training, making training more robust. + +.. autoclass:: nemo.core.classes.mixins.adapter_mixin_strategies.AbstractAdapterStrategy + :show-inheritance: + :members: + :member-order: bysource + :undoc-members: adapter_module_names + +----- + +.. autoclass:: nemo.core.classes.mixins.adapter_mixin_strategies.ResidualAddAdapterStrategy + :show-inheritance: + :members: + :member-order: bysource + :undoc-members: adapter_module_names + +----- + + +References +---------- + +.. bibliography:: ./adapter_bib.bib + :style: plain + :keyprefix: adapters- diff --git a/docs/source/core/adapters/intro.rst b/docs/source/core/adapters/intro.rst new file mode 100644 index 0000000000000000000000000000000000000000..fd94c8d2344660cb93942011748bce0165855ee0 --- /dev/null +++ b/docs/source/core/adapters/intro.rst @@ -0,0 +1,147 @@ +Adapters +======== + +In NeMo, we often train models and fine-tune them for a specific task. This is a reasonable approach when the models are just a few million parameters. However, this approach quickly becomes infeasible when approaching hundreds of millions or even billions of parameters. As a potential solution to such a scenario, where fine-tuning a massive model is no longer feasible, we look to `Adapters `_ :cite:`adapters-houlsby2019adapter` to specialize our model on a specific domain or task. Adapters require only a fraction of the parameters of the original model and are much more efficient to fine-tune. + +.. note:: + + For a detailed tutorial on adding ``Adapter`` support to any PyTorch module, please refer to the `Tutorials for NeMo Adapters <../../starthere/tutorials.html>`_. + + +What are Adapters? +------------------ + +Adapters are a straightforward concept.
At their simplest, they are residual feedforward layers that compress the input dimension (:math:`D`) to a small bottleneck dimension (:math:`H`), such that :math:`R^D \rightarrow R^H`, compute an activation (such as ReLU), and finally map :math:`R^H \rightarrow R^D` with another feedforward layer. This output is then added to the input via a simple residual connection.
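+
+A minimal PyTorch sketch of this idea is shown below. This is only an illustration of the concept
+(the class name is made up for this example and this is not NeMo's ``LinearAdapter`` implementation):
+
+.. code-block:: python
+
+    import torch
+    import torch.nn as nn
+
+    class BottleneckAdapter(nn.Module):
+        """Residual bottleneck adapter: D -> H -> activation -> D, added back to the input."""
+
+        def __init__(self, dim: int, bottleneck: int):
+            super().__init__()
+            self.down = nn.Linear(dim, bottleneck)   # R^D -> R^H
+            self.activation = nn.ReLU()
+            self.up = nn.Linear(bottleneck, dim)     # R^H -> R^D
+
+            # Zero-initialize the up-projection so the adapter initially acts as an identity mapping
+            nn.init.zeros_(self.up.weight)
+            nn.init.zeros_(self.up.bias)
+
+        def forward(self, x: torch.Tensor) -> torch.Tensor:
+            # Residual connection: the adapter output is added to the original input
+            return x + self.up(self.activation(self.down(x)))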
+ +Adapter modules such as this are usually initialized such that the initial output of the adapter will always be zeros, so as to prevent degradation of the original model's performance due to the addition of such modules. + +``torch.nn.Module`` with Adapters +--------------------------------- + +In NeMo, Adapters are supported via a ``Mixin`` class that can be attached to any ``torch.nn.Module``. Such a module will +have several additional methods that enable adapter capabilities in that module. + +.. code-block:: python + + # Import the adapter mixin from NeMo + from nemo.core import adapter_mixins + + # NOTE: See the *two* classes being inherited here! + class MyModule(torch.nn.Module, adapter_mixins.AdapterModuleMixin): + pass + + +AdapterModuleMixin +------------------ +Let's look into what :class:`~nemo.core.adapter_mixins.AdapterModuleMixin` adds to the general PyTorch module. Some of the most important methods are listed below: + +1) ``add_adapter``: Used to add an adapter with a unique name to the module. +2) ``get_enabled_adapters``: Returns a list of names of all enabled adapter modules. +3) ``set_enabled_adapters``: Sets whether a single (or all) adapters are enabled or disabled. +4) ``is_adapter_available``: Checks whether any adapter is available and enabled. + +Modules that extend this mixin can usually use these methods directly without overriding them, but we will cover a case below +where you may wish to extend these methods. + +.. autoclass:: nemo.core.adapter_mixins.AdapterModuleMixin + :show-inheritance: + :members: + :member-order: bysource + :undoc-members: adapter_module_names + + +Using the Adapter Module +------------------------ + +Now that ``MyModule`` supports adapters, we can easily add adapters, set their state, check if they are available and +perform their forward pass. Note that if multiple adapters are enabled, they are called in a chain: the output of the previous adapter is passed as input to the next adapter, and so on. + +.. code-block:: python + + # Import the adapter mixin and modules from NeMo + import torch + from nemo.core import adapter_mixins + from nemo.collections.common.parts import adapter_modules + + class MyModule(torch.nn.Module, adapter_mixins.AdapterModuleMixin): + + def forward(self, x: torch.Tensor) -> torch.Tensor: + output = self.layers(x) # assume self.layers is some Sequential() module + + if self.is_adapter_available(): # check if adapters were added or not + output = self.forward_enabled_adapters(output) # perform the forward of all enabled adapters in a chain + + return output + + # Now let us create a module, add an adapter and do a forward pass with some random inputs + module = MyModule(dim) # assume dim is some input and output dimension of the module. + + # Add an adapter (note that ``cfg`` takes the adapter's config, not an instantiated module) + module.add_adapter("first_adapter", cfg=adapter_modules.LinearAdapterConfig(in_features=dim, dim=5)) + + # Check if an adapter is available + module.is_adapter_available() # returns True + + # Check the name(s) of the enabled adapters + module.get_enabled_adapters() # returns ['first_adapter'] + + # Set the state of the adapter (by name) + module.set_enabled_adapters(name="first_adapter", enabled=True) + + # Freeze all the weights of the original module (equivalent to calling module.freeze() for a NeuralModule) + for param in module.parameters(): + param.requires_grad = False + + # Unfreeze only the adapter weights (so that we finetune only the adapters and not the original weights!)
+ module.unfreeze_enabled_adapters() + + # Now you can train this model's adapters ! + input_data = torch.randn(4, dim) # assume dim is the input-output dim of the module + outputs_with_adapter = module(input_data) + + # Compute loss and backward ... + + +Adapter Compatible Models +------------------------- + +If the goal was to support adapters in a single module, then the goal has been accomplished. In the real world however, we +build large composite models out of multiple modules and combine them to build a final model that we then train. We do this using the +:class:`~nemo.core.adapter_mixins.AdapterModelPTMixin`. + +.. note:: + + For an in-depth guide to supporting hierarchical adapter modules, please refer to the `Tutorials for NeMo Adapters <../../starthere/tutorials.html>`_. + +.. autoclass:: nemo.core.adapter_mixins.AdapterModelPTMixin + :show-inheritance: + :members: + :member-order: bysource + :undoc-members: adapter_module_names + +Below, we will discuss some useful functionality of Adapter Compatible Models. + +1) ``Save and restore a Model with adapter capability``: Any NeMo model that implements this class correctly can save and restore NeMo models with adapter capabilities, thereby allowing sharing of adapters. +2) ``save_adapters`` and ``load_adapters``: Adapters are usually a very small number of parameters, there is no need for the entire model to be duplicated for each adapter. This method allows storing just the adapter module(s) separately from the Model, so that you can use the same "base" model, and share just the Adapter modules. + + +.. toctree:: + :maxdepth: 8 + :caption: Adapters + + components + api + + +References +---------- + +.. bibliography:: ./adapter_bib.bib + :style: plain + :keyprefix: adapters- diff --git a/docs/source/core/api.rst b/docs/source/core/api.rst new file mode 100644 index 0000000000000000000000000000000000000000..aca0d5bec9c7590881351c18a29fa6b13f9f56ea --- /dev/null +++ b/docs/source/core/api.rst @@ -0,0 +1,119 @@ + +Core APIs +========= + +Base class for all NeMo models +------------------------------ + +.. autoclass:: nemo.core.ModelPT + :show-inheritance: + :members: + :member-order: bysource + :undoc-members: cfg, num_weights + :exclude-members: set_eff_save, use_eff_save, teardown + +Base Neural Module class +------------------------ + +.. autoclass:: nemo.core.NeuralModule + :show-inheritance: + :members: + :member-order: bysource + +Base Mixin classes +------------------ + +.. autoclass:: nemo.core.Typing + :show-inheritance: + :members: + :member-order: bysource + :private-members: + :exclude-members: _abc_impl + :noindex: + +----- + +.. autoclass:: nemo.core.Serialization + :show-inheritance: + :members: + :member-order: bysource + :noindex: + +----- + +.. autoclass:: nemo.core.FileIO + :show-inheritance: + :members: + :member-order: bysource + :noindex: + + +Base Connector classes +---------------------- + +.. autoclass:: nemo.core.connectors.save_restore_connector.SaveRestoreConnector + :show-inheritance: + :members: + :member-order: bysource + +Neural Type checking +-------------------- + +.. autoclass:: nemo.core.classes.common.typecheck + :show-inheritance: + :members: + :member-order: bysource + + .. automethod:: __call__ + +Neural Type classes +------------------- + +.. autoclass:: nemo.core.neural_types.NeuralType + :show-inheritance: + :members: + :member-order: bysource + +----- + +.. autoclass:: nemo.core.neural_types.axes.AxisType + :show-inheritance: + :members: + :member-order: bysource + +----- + +.. 
autoclass:: nemo.core.neural_types.elements.ElementType + :show-inheritance: + :members: + :member-order: bysource + +----- + +.. autoclass:: nemo.core.neural_types.comparison.NeuralTypeComparisonResult + :show-inheritance: + :members: + :member-order: bysource + +Experiment manager +------------------ + +.. autoclass:: nemo.utils.exp_manager.exp_manager + :show-inheritance: + :members: + :member-order: bysource + +.. autoclass:: nemo.utils.exp_manager.ExpManagerConfig + :show-inheritance: + :members: + :member-order: bysource + + +Exportable +---------- + +.. autoclass:: nemo.core.classes.exportable.Exportable + :show-inheritance: + :members: + :member-order: bysource + diff --git a/docs/source/core/core.rst b/docs/source/core/core.rst new file mode 100644 index 0000000000000000000000000000000000000000..4f5589653172512b8f9735f8579d5154225ea1b0 --- /dev/null +++ b/docs/source/core/core.rst @@ -0,0 +1,785 @@ +NeMo Models +=========== + +Basics +------ + +NeMo models contain everything needed to train and reproduce Conversational AI models: + +- neural network architectures +- datasets/data loaders +- data preprocessing/postprocessing +- data augmentors +- optimizers and schedulers +- tokenizers +- language models + +NeMo uses `Hydra `_ for configuring both NeMo models and the PyTorch Lightning Trainer. + +.. note:: Every NeMo model has an example configuration file and training script that can be found `here `_. + +The end result of using NeMo, `PyTorch Lightning `_, and Hydra is that NeMo models all have the same look and feel and are also fully compatible with the PyTorch ecosystem. + +Pretrained +---------- + +NeMo comes with many pretrained models for each of our collections: ASR, NLP, and TTS. Every pretrained NeMo model can be downloaded +and used with the ``from_pretrained()`` method. + +As an example, we can instantiate QuartzNet with the following: + +.. code-block:: Python + + import nemo.collections.asr as nemo_asr + + model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En") + +To see all available pretrained models for a specific NeMo model, use the ``list_available_models()`` method. + +.. code-block:: Python + + nemo_asr.models.EncDecCTCModel.list_available_models() + +For detailed information on the available pretrained models, refer to the collections documentation: + +- :ref:`Automatic Speech Recognition (ASR)` +- :doc:`Natural Language Processing (NLP) <../nlp/models>` +- :doc:`Text-to-Speech Synthesis (TTS) <../tts/intro>` + +Training +-------- + +NeMo leverages `PyTorch Lightning `_ for model training. PyTorch Lightning lets NeMo decouple the +conversational AI code from the PyTorch training code. This means that NeMo users can focus on their domain (ASR, NLP, TTS) and +build complex AI applications without having to rewrite boilerplate code for PyTorch training. + +When using PyTorch Lightning, NeMo users can automatically train with: + +- multi-GPU/multi-node +- mixed precision +- model checkpointing +- logging +- early stopping +- and more + +The two main aspects of the Lightning API are the `LightningModule `_ +and the `Trainer `_. + +PyTorch Lightning ``LightningModule`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Every NeMo model is a ``LightningModule`` which is an ``nn.Module``. This means that NeMo models are compatible with the PyTorch +ecosystem and can be plugged into existing PyTorch workflows. + +Creating a NeMo model is similar to any other PyTorch workflow.
We start by initializing our model architecture, then define the forward pass: + +.. code-block:: python + + class TextClassificationModel(NLPModel, Exportable): + ... + def __init__(self, cfg: DictConfig, trainer: Trainer = None): + """Initializes the BERTTextClassifier model.""" + ... + super().__init__(cfg=cfg, trainer=trainer) + + # instantiate a BERT based encoder + self.bert_model = get_lm_model( + config_file=cfg.language_model.config_file, + config_dict=cfg.language_model.config, + vocab_file=cfg.tokenizer.vocab_file, + trainer=trainer, + cfg=cfg, + ) + + # instantiate the FFN for classification + self.classifier = SequenceClassifier( + hidden_size=self.bert_model.config.hidden_size, + num_classes=cfg.dataset.num_classes, + num_layers=cfg.classifier_head.num_output_layers, + activation='relu', + log_softmax=False, + dropout=cfg.classifier_head.fc_dropout, + use_transformer_init=True, + idx_conditioned_on=0, + ) + +.. code-block:: python + + def forward(self, input_ids, token_type_ids, attention_mask): + """ + No special modification required for Lightning, define it as you normally would + in the `nn.Module` in vanilla PyTorch. + """ + hidden_states = self.bert_model( + input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask + ) + logits = self.classifier(hidden_states=hidden_states) + return logits + +The ``LightningModule`` organizes PyTorch code so that across all NeMo models we have a similar look and feel. +For example, the training logic can be found in ``training_step``: + +.. code-block:: python + + def training_step(self, batch, batch_idx): + """ + Lightning calls this inside the training loop with the data from the training dataloader + passed in as `batch`. + """ + # forward pass + input_ids, input_type_ids, input_mask, labels = batch + logits = self.forward(input_ids=input_ids, token_type_ids=input_type_ids, attention_mask=input_mask) + + train_loss = self.loss(logits=logits, labels=labels) + + lr = self._optimizer.param_groups[0]['lr'] + + self.log('train_loss', train_loss) + self.log('lr', lr, prog_bar=True) + + return { + 'loss': train_loss, + 'lr': lr, + } + +While validation logic can be found in ``validation_step``: + +.. code-block:: python + + def validation_step(self, batch, batch_idx): + """ + Lightning calls this inside the validation loop with the data from the validation dataloader + passed in as `batch`. + """ + if self.testing: + prefix = 'test' + else: + prefix = 'val' + + input_ids, input_type_ids, input_mask, labels = batch + logits = self.forward(input_ids=input_ids, token_type_ids=input_type_ids, attention_mask=input_mask) + + val_loss = self.loss(logits=logits, labels=labels) + + preds = torch.argmax(logits, axis=-1) + + tp, fn, fp, _ = self.classification_report(preds, labels) + + return {'val_loss': val_loss, 'tp': tp, 'fn': fn, 'fp': fp} + +PyTorch Lightning then handles all of the boiler plate code needed for training. Virtually any aspect of training can be customized +via PyTorch Lightning `hooks `_, +`Plugins `_, +`callbacks `_, or by overriding `methods `_. + +For more domain-specific information, see: + +- :ref:`Automatic Speech Recognition (ASR) <../asr/intro>` +- :ref:`Natural Language Processing (NLP) <../nlp/models>` +- :ref:`Text-to-Speech Synthesis (TTS) <../tts/intro>` + +PyTorch Lightning Trainer +~~~~~~~~~~~~~~~~~~~~~~~~~ + +Since every NeMo model is a ``LightningModule``, we can automatically take advantage of the PyTorch Lightning ``Trainer``. 
Every NeMo +`example `_ training script uses the ``Trainer`` object to fit the model. + +First, instantiate the model and trainer, then call ``.fit``: + +.. code-block:: python + + # We first instantiate the trainer based on the model configuration. + # See the model configuration documentation for details. + trainer = pl.Trainer(**cfg.trainer) + + # Then pass the model configuration and trainer object into the NeMo model + model = TextClassificationModel(cfg.model, trainer=trainer) + + # Now we can train by calling .fit + trainer.fit(model) + + # Or we can run the test loop on test data by calling + trainer.test(model=model) + +All `trainer flags `_ can be set from the +NeMo configuration. + + +Configuration +------------- + +Hydra is an open-source Python framework that simplifies configuration for complex applications that must bring together many different +software libraries. Conversational AI model training is a great example of such an application. To train a conversational AI model, we +must be able to configure: + +- neural network architectures +- training and optimization algorithms +- data pre/post processing +- data augmentation +- experiment logging/visualization +- model checkpointing + +For an introduction to using Hydra, refer to the `Hydra Tutorials `_. + +With Hydra, we can configure everything needed for NeMo with three interfaces: + +- Command Line (CLI) +- Configuration Files (YAML) +- Dataclasses (Python) + +YAML +~~~~ + +NeMo provides YAML configuration files for all of our `example `_ training scripts. +YAML files make it easy to experiment with different model and training configurations. + +Every NeMo example YAML has the same underlying configuration structure: + +- trainer +- exp_manager +- model + +The model configuration always contains ``train_ds``, ``validation_ds``, ``test_ds``, and ``optim``. Model architectures vary across +domains; therefore, refer to the ASR, NLP, and TTS collections documentation for more detailed information on model architecture configuration. + +A NeMo configuration file should look similar to the following: + +.. code-block:: yaml + + # PyTorch Lightning Trainer configuration + # any argument of the Trainer object can be set here + trainer: + devices: 1 # number of gpus per node + accelerator: gpu + num_nodes: 1 # number of nodes + max_epochs: 10 # how many training epochs to run + val_check_interval: 1.0 # run validation after every epoch + + # Experiment logging configuration + exp_manager: + exp_dir: /path/to/my/nemo/experiments + name: name_of_my_experiment + create_tensorboard_logger: True + create_wandb_logger: True + + # Model configuration + # model network architecture, train/val/test datasets, data augmentation, and optimization + model: + train_ds: + manifest_filepath: /path/to/my/train/manifest.json + batch_size: 256 + shuffle: True + validation_ds: + manifest_filepath: /path/to/my/validation/manifest.json + batch_size: 32 + shuffle: False + test_ds: + manifest_filepath: /path/to/my/test/manifest.json + batch_size: 32 + shuffle: False + optim: + name: novograd + lr: .01 + betas: [0.8, 0.5] + weight_decay: 0.001 + # network architecture can vary greatly depending on the domain + encoder: + ... + decoder: + ... + +More specific details about configuration files for each collection can be found on the following pages: + +:ref:`NeMo ASR Configuration Files` + +CLI +~~~ + +With NeMo and Hydra, every aspect of model training can be modified from the command-line.
This is extremely helpful for running lots +of experiments on compute clusters or for quickly testing parameters while developing. + +All NeMo `examples `_ come with instructions on how to +run the training/inference script from the command-line (see `here `_ +for an example). + +With Hydra, arguments are set using the ``=`` operator: + +.. code-block:: bash + + python examples/asr/asr_ctc/speech_to_text_ctc.py \ + model.train_ds.manifest_filepath=/path/to/my/train/manifest.json \ + model.validation_ds.manifest_filepath=/path/to/my/validation/manifest.json \ + trainer.devices=2 \ + trainer.accelerator='gpu' \ + trainer.max_epochs=50 + +We can use the ``+`` operator to add arguments from the CLI: + +.. code-block:: bash + + python examples/asr/asr_ctc/speech_to_text_ctc.py \ + model.train_ds.manifest_filepath=/path/to/my/train/manifest.json \ + model.validation_ds.manifest_filepath=/path/to/my/validation/manifest.json \ + trainer.devices=2 \ + trainer.accelerator='gpu' \ + trainer.max_epochs=50 \ + +trainer.fast_dev_run=true + +We can use the ``~`` operator to remove configurations: + +.. code-block:: bash + + python examples/asr/asr_ctc/speech_to_text_ctc.py \ + model.train_ds.manifest_filepath=/path/to/my/train/manifest.json \ + model.validation_ds.manifest_filepath=/path/to/my/validation/manifest.json \ + ~model.test_ds \ + trainer.devices=2 \ + trainer.accelerator='gpu' \ + trainer.max_epochs=50 \ + +trainer.fast_dev_run=true + +We can specify configuration files using the ``--config-path`` and ``--config-name`` flags: + +.. code-block:: bash + + python examples/asr/asr_ctc/speech_to_text_ctc.py \ + --config-path=conf/quartznet \ + --config-name=quartznet_15x5 \ + model.train_ds.manifest_filepath=/path/to/my/train/manifest.json \ + model.validation_ds.manifest_filepath=/path/to/my/validation/manifest.json \ + ~model.test_ds \ + trainer.devices=2 \ + trainer.accelerator='gpu' \ + trainer.max_epochs=50 \ + +trainer.fast_dev_run=true + +Dataclasses +~~~~~~~~~~~ + +Dataclasses allow NeMo to ship model configurations as part of the NeMo library and also enable pure Python configuration of NeMo models. +With Hydra, dataclasses can be used to create `structured configs `_ for the conversational AI application. + +As an example, refer to the code block below for an *Attention Is All You Need* machine translation model. The model configuration can +be instantiated and modified like any Python `Dataclass `_. + +.. code-block:: Python + + from nemo.collections.nlp.models.machine_translation.mt_enc_dec_config import AAYNBaseConfig + + cfg = AAYNBaseConfig() + + # modify the number of layers in the encoder + cfg.encoder.num_layers = 8 + + # modify the training batch size + cfg.train_ds.tokens_in_batch = 8192 + +.. note:: Configuration with Hydra always has the following precedence: CLI > YAML > Dataclass + +.. _optimization-label: + +Optimization +------------ + +Optimizers and learning rate schedules are configurable across all NeMo models and have their own namespace. Here is a sample YAML +configuration for a Novograd optimizer with a Cosine Annealing learning rate schedule. + +.. code-block:: yaml + + optim: + name: novograd + lr: 0.01 + + # optimizer arguments + betas: [0.8, 0.25] + weight_decay: 0.001 + + # scheduler setup + sched: + name: CosineAnnealing + + # Optional arguments + max_steps: -1 # computed at runtime or explicitly set here + monitor: val_loss + reduce_on_plateau: false + + # scheduler config override + warmup_steps: 1000 + warmup_ratio: null + min_lr: 1e-9 + +.. 
note:: `NeMo Examples `_ has optimizer and scheduler configurations for every NeMo model. + +Optimizers can be configured from the CLI as well: + +.. code-block:: bash + + python examples/asr/asr_ctc/speech_to_text_ctc.py \ + --config-path=conf/quartznet \ + --config-name=quartznet_15x5 \ + ... + # train with the adam optimizer + model.optim=adam \ + # change the learning rate + model.optim.lr=.0004 \ + # modify betas + model.optim.betas=[.8, .5] + +.. _optimizers-label: + +Optimizers +~~~~~~~~~~ + +``name`` corresponds to the lowercase name of the optimizer. To view a list of available optimizers, run: + +.. code-block:: Python + + from nemo.core.optim.optimizers import AVAILABLE_OPTIMIZERS + + for name, opt in AVAILABLE_OPTIMIZERS.items(): + print(f'name: {name}, opt: {opt}') + +.. code-block:: bash + + name: sgd opt: + name: adam opt: + name: adamw opt: + name: adadelta opt: + name: adamax opt: + name: adagrad opt: + name: rmsprop opt: + name: rprop opt: + name: novograd opt: + +Optimizer Params +~~~~~~~~~~~~~~~~ + +Optimizer params can vary between optimizers but the ``lr`` param is required for all optimizers. To see the available params for an +optimizer, we can look at its corresponding dataclass. + +.. code-block:: python + + from nemo.core.config.optimizers import NovogradParams + + print(NovogradParams()) + +.. code-block:: bash + + NovogradParams(lr='???', betas=(0.95, 0.98), eps=1e-08, weight_decay=0, grad_averaging=False, amsgrad=False, luc=False, luc_trust=0.001, luc_eps=1e-08) + +``'???'`` indicates that the lr argument is required. + +Register Optimizer +~~~~~~~~~~~~~~~~~~ + +To register a new optimizer to be used with NeMo, run: + +.. autofunction:: nemo.core.optim.optimizers.register_optimizer + +.. _learning-rate-schedulers-label: + +Learning Rate Schedulers +~~~~~~~~~~~~~~~~~~~~~~~~ + +Learning rate schedulers can be optionally configured under the ``optim.sched`` namespace. + +``name`` corresponds to the name of the learning rate schedule. To view a list of available schedulers, run: + +.. code-block:: Python + + from nemo.core.optim.lr_scheduler import AVAILABLE_SCHEDULERS + + for name, opt in AVAILABLE_SCHEDULERS.items(): + print(f'name: {name}, schedule: {opt}') + +.. code-block:: bash + + name: WarmupPolicy, schedule: + name: WarmupHoldPolicy, schedule: + name: SquareAnnealing, schedule: + name: CosineAnnealing, schedule: + name: NoamAnnealing, schedule: + name: WarmupAnnealing, schedule: + name: InverseSquareRootAnnealing, schedule: + name: SquareRootAnnealing, schedule: + name: PolynomialDecayAnnealing, schedule: + name: PolynomialHoldDecayAnnealing, schedule: + name: StepLR, schedule: + name: ExponentialLR, schedule: + name: ReduceLROnPlateau, schedule: + name: CyclicLR, schedule: + +Scheduler Params +~~~~~~~~~~~~~~~~ + +To see the available params for a scheduler, we can look at its corresponding dataclass: + +.. code-block:: Python + + from nemo.core.config.schedulers import CosineAnnealingParams + + print(CosineAnnealingParams()) + +.. code-block:: bash + + CosineAnnealingParams(last_epoch=-1, warmup_steps=None, warmup_ratio=None, min_lr=0.0) + +Register scheduler +~~~~~~~~~~~~~~~~~~ + +To register a new scheduler to be used with NeMo, run: + +.. autofunction:: nemo.core.optim.lr_scheduler.register_scheduler + +Save and Restore +---------------- + +NeMo models all come with ``.save_to`` and ``.restore_from`` methods. + +Save +~~~~ + +To save a NeMo model, run: + +.. 
code-block:: Python + + model.save_to('/path/to/model.nemo') + +Everything needed to use the trained model is packaged and saved in the ``.nemo`` file. For example, in the NLP domain, ``.nemo`` files +include the necessary tokenizer models and/or vocabulary files, etc. + +.. note:: A ``.nemo`` file is simply an archive like any other ``.tar`` file. + +Restore +~~~~~~~ + +To restore a NeMo model, run: + +.. code-block:: Python + + # Here, you should usually use the class of the model, or simply use ModelPT.restore_from() for simplicity. + model.restore_from('/path/to/model.nemo') + +When using the PyTorch Lightning Trainer, a PyTorch Lightning checkpoint is created. These are mainly used within NeMo to auto-resume +training. Since NeMo models are ``LightningModules``, the PyTorch Lightning method ``load_from_checkpoint`` is available. Note that +``load_from_checkpoint`` won't necessarily work out-of-the-box for all models as some models require more artifacts than just the +checkpoint to be restored. For these models, the user will have to override ``load_from_checkpoint`` if they want to use it. + +It's highly recommended to use ``restore_from`` to load NeMo models. + +Restore with Modified Config +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Sometimes, there may be a need to modify the model (or it's sub-components) prior to restoring a model. A common case is when +the model's internal config must be updated due to various reasons (such as deprecation, newer versioning, support a new feature). +As long as the model has the same parameters as compared to the original config, the parameters can once again be restored safely. + +In NeMo, as part of the .nemo file, the model's internal config will be preserved. This config is used during restoration, and +as shown below we can update this config prior to restoring the model. + +.. code-block:: + + # When restoring a model, you should generally use the class of the model + # Obtain the config (as an OmegaConf object) + config = model_class.restore_from('/path/to/model.nemo', return_config=True) + # OR + config = model_class.from_pretrained('name_of_the_model', return_config=True) + + # Modify the config as needed + config.x.y = z + + # Restore the model from the updated config + model = model_class.restore_from('/path/to/model.nemo', override_config_path=config) + # OR + model = model_class.from_pretrained('name_of_the_model', override_config_path=config) + +Register Artifacts +------------------ + +Conversational AI models can be complicated to restore as more information is needed than just the checkpoint weights in order to use the model. +NeMo models can save additional artifacts in the .nemo file by calling ``.register_artifact``. +When restoring NeMo models using ``.restore_from`` or ``.from_pretrained``, any artifacts that were registered will be available automatically. + +As an example, consider an NLP model that requires a trained tokenizer model. +The tokenizer model file can be automatically added to the .nemo file with the following: + +.. code-block:: python + + self.encoder_tokenizer = get_nmt_tokenizer( + ... + tokenizer_model=self.register_artifact(config_path='encoder_tokenizer.tokenizer_model', + src='/path/to/tokenizer.model', + verify_src_exists=True), + ) + +By default, ``.register_artifact`` will always return a path. If the model is being restored from a .nemo file, +then that path will be to the artifact in the .nemo file. Otherwise, ``.register_artifact`` will return the local path specified by the user. 
+ +``config_path`` is the artifact key. It usually corresponds to a model configuration but does not have to. +The model config that is packaged with the .nemo file will be updated according to the ``config_path`` key. +In the above example, the model config will have + +.. code-block:: YAML + + encoder_tokenizer: + ... + tokenizer_model: nemo:4978b28103264263a03439aaa6560e5e_tokenizer.model + +``src`` is the path to the artifact and the base-name of the path will be used when packaging the artifact in the .nemo file. +Each artifact will have a hash prepended to the basename of ``src`` in the .nemo file. This is to prevent collisions with basenames +base-names that are identical (say when there are two or more tokenizers, both called `tokenizer.model`). +The resulting .nemo file will then have the following file: + +.. code-block:: bash + + 4978b28103264263a03439aaa6560e5e_tokenizer.model + +If ``verify_src_exists`` is set to ``False``, then the artifact is optional. This means that ``.register_artifact`` will return ``None`` +if the ``src`` cannot be found. + +Nested NeMo Models +------------------ + +In some cases, it may be helpful to use NeMo models inside other NeMo models. For example, we can incorporate language models into ASR models to use in a decoding process to improve accuracy or use hybrid ASR-TTS models to generate audio from the text on the fly to train or finetune the ASR model. + +There are 3 ways to instantiate child models inside parent models: + +- use subconfig directly +- use the ``.nemo`` checkpoint path to load the child model +- use a pretrained NeMo model + +To register a child model, use the ``register_nemo_submodule`` method of the parent model. This method will add the child model to a provided model attribute and, in the serialization process, will handle child artifacts correctly and store the child model config in the parent model config in ``config_field``. + +.. code-block:: python + + from nemo.core.classes import ModelPT + + class ChildModel(ModelPT): + ... # implement necessary methods + + class ParentModel(ModelPT): + def __init__(self, cfg, trainer=None): + super().__init__(cfg=cfg, trainer=trainer) + + # optionally annotate type for IDE autocompletion and type checking + self.child_model: Optional[ChildModel] + if cfg.get("child_model") is not None: + # load directly from config + # either if config provided initially, or automatically + # after model restoration + self.register_nemo_submodule( + name="child_model", + config_field="child_model", + model=ChildModel(self.cfg.child_model, trainer=trainer), + ) + elif cfg.get('child_model_path') is not None: + # load from .nemo model checkpoint + # while saving, config will be automatically assigned/updated + # in cfg.child_model + self.register_nemo_submodule( + name="child_model", + config_field="child_model", + model=ChildModel.restore_from(self.cfg.child_model_path, trainer=trainer), + ) + elif cfg.get('child_model_name') is not None: + # load from pretrained model + # while saving, config will be automatically assigned/updated + # in cfg.child_model + self.register_nemo_submodule( + name="child_model", + config_field="child_model", + model=ChildModel.from_pretrained(self.cfg.child_model_name, trainer=trainer), + ) + else: + self.child_model = None + + +Neural Modules +============== + +NeMo is built around Neural Modules, conceptual blocks of neural networks that take typed inputs and produce typed outputs. 
Such +modules typically represent data layers, encoders, decoders, language models, loss functions, or methods of combining activations. +NeMo makes it easy to combine and re-use these building blocks while providing a level of semantic correctness checking via its neural +type system. + +.. note:: *All Neural Modules inherit from ``torch.nn.Module`` and are therefore compatible with the PyTorch ecosystem.* + +There are three types of Neural Modules: + + - Regular modules + - Dataset/IterableDataset + - Losses + +Every Neural Module in NeMo must inherit from the `nemo.core.classes.module.NeuralModule` class. + +.. autoclass:: nemo.core.classes.module.NeuralModule + +Every Neural Module inherits the ``nemo.core.classes.common.Typing`` interface and needs to define neural types for its inputs and outputs. +This is done by defining two properties: ``input_types`` and ``output_types``. Each property should return an ordered dictionary of +"port name"->"port neural type" pairs. Here is an example from the :class:`~nemo.collections.asr.modules.ConvASREncoder` class: + +.. code-block:: python + + @property + def input_types(self): + return OrderedDict( + { + "audio_signal": NeuralType(('B', 'D', 'T'), SpectrogramType()), + "length": NeuralType(tuple('B'), LengthsType()), + } + ) + + @property + def output_types(self): + return OrderedDict( + { + "outputs": NeuralType(('B', 'D', 'T'), AcousticEncodedRepresentation()), + "encoded_lengths": NeuralType(tuple('B'), LengthsType()), + } + ) + + @typecheck() + def forward(self, audio_signal, length=None): + ... + +The code snippet above means that ``nemo.collections.asr.modules.conv_asr.ConvASREncoder`` expects two arguments: + * First one, named ``audio_signal``, of shape ``[batch, dimension, time]``, with elements representing spectrogram values. + * Second one, named ``length``, of shape ``[batch]``, with elements representing the lengths of the corresponding signals. + +It also means that the ``.forward(...)`` and ``__call__(...)`` methods each produce two outputs: + * First one, of shape ``[batch, dimension, time]``, but with elements representing an encoded representation (the ``AcousticEncodedRepresentation`` class). + * Second one, of shape ``[batch]``, corresponding to their lengths. + +.. tip:: It is good practice to define types and add the ``@typecheck()`` decorator to your ``.forward()`` method once your module is ready for use by others. + +.. note:: The outputs of the ``.forward(...)`` method will always be of type ``torch.Tensor`` or a container of tensors and will work with any other PyTorch code. The type information is attached to every output tensor. If tensors without types are passed to your module, it will not fail; however, the types will not be checked. Thus, it is recommended to define input/output types for all your modules, starting with data layers, and to add the ``@typecheck()`` decorator to them. + +.. note:: To temporarily disable typechecking, you can enclose your code in a ``with typecheck.disable_checks():`` statement. + + +Dynamic Layer Freezing +---------------------- + +You can selectively freeze any modules inside a NeMo model by specifying a freezing schedule in the YAML config. Freezing stops any gradient updates +to that module, so that its weights are not changed for that step. This can be useful for combating catastrophic forgetting, for example +when finetuning a large pretrained model on a small dataset.
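Conceptually, freezing a module just means turning off gradient updates for its parameters while the rest of the model keeps training. The rough PyTorch sketch below is illustrative only (the helper name is made up and this is not NeMo's internal implementation); in practice, NeMo drives this automatically from the ``freeze_updates`` schedule shown further below.

.. code-block:: python

    # Illustrative only: what "freezing" a submodule amounts to in plain PyTorch.
    # NeMo applies this for you based on the freeze_updates schedule in the config.
    def set_frozen(module, frozen: bool):
        for param in module.parameters():
            # Frozen parameters receive no gradients, so the optimizer leaves them unchanged.
            param.requires_grad = not frozen

    # e.g. freeze the encoder while the decoder continues to train
    # set_frozen(model.encoder, True)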
+ +The default approach is to freeze a module for the first N training steps, but you can also enable freezing for a specific range of steps, +for example, from step 20 - 100, or even activate freezing from some N until the end of training. You can also freeze a module for the entire training run. +Dynamic freezing is specified in training steps, not epochs. + +To enable freezing, add the following to your config: + +.. code-block:: yaml + + model: + ... + freeze_updates: + enabled: true # set to false if you want to disable freezing + + modules: # list all of the modules you want to have freezing logic for + encoder: 200 # module will be frozen for the first 200 training steps + decoder: [50, -1] # module will be frozen at step 50 and will remain frozen until training ends + joint: [10, 100] # module will be frozen between step 10 and step 100 (step >= 10 and step <= 100) + transcoder: -1 # module will be frozen for the entire training run + diff --git a/docs/source/core/exp_manager.rst b/docs/source/core/exp_manager.rst new file mode 100644 index 0000000000000000000000000000000000000000..23874e5c8c135df0f8b57982539ed49a1331d883 --- /dev/null +++ b/docs/source/core/exp_manager.rst @@ -0,0 +1,350 @@ + +.. _exp-manager-label: + +Experiment Manager +================== + +NeMo's Experiment Manager leverages PyTorch Lightning for model checkpointing, TensorBoard Logging, Weights and Biases, DLLogger and MLFlow logging. The +Experiment Manager is included by default in all NeMo example scripts. + +To use the experiment manager simply call :class:`~nemo.utils.exp_manager.exp_manager` and pass in the PyTorch Lightning ``Trainer``. + +.. code-block:: python + + exp_dir = exp_manager(trainer, cfg.get("exp_manager", None)) + +And is configurable via YAML with Hydra. + +.. code-block:: bash + + exp_manager: + exp_dir: /path/to/my/experiments + name: my_experiment_name + create_tensorboard_logger: True + create_checkpoint_callback: True + +Optionally, launch TensorBoard to view the training results in ``./nemo_experiments`` (by default). + +.. code-block:: bash + + tensorboard --bind_all --logdir nemo_experiments + +.. + +If ``create_checkpoint_callback`` is set to ``True``, then NeMo automatically creates checkpoints during training +using PyTorch Lightning's `ModelCheckpoint `_. +We can configure the ``ModelCheckpoint`` via YAML or CLI. + +.. code-block:: yaml + + exp_manager: + ... + # configure the PyTorch Lightning ModelCheckpoint using checkpoint_call_back_params + # any ModelCheckpoint argument can be set here + + # save the best checkpoints based on this metric + checkpoint_callback_params.monitor=val_loss + + # choose how many total checkpoints to save + checkpoint_callback_params.save_top_k=5 + +Resume Training +--------------- + +We can auto-resume training as well by configuring the ``exp_manager``. Being able to auto-resume is important when doing long training +runs that are premptible or may be shut down before the training procedure has completed. To auto-resume training, set the following +via YAML or CLI: + +.. code-block:: yaml + + exp_manager: + ... + # resume training if checkpoints already exist + resume_if_exists: True + + # to start training with no existing checkpoints + resume_ignore_no_checkpoint: True + + # by default experiments will be versioned by datetime + # we can set our own version with + exp_manager.version: my_experiment_version + + +Experiment Loggers +------------------ + +Alongside Tensorboard, NeMo also supports Weights and Biases, MLFlow and DLLogger. 
To use these loggers, simply set the following +via YAML or :class:`~nemo.utils.exp_manager.ExpManagerConfig`. + + +Weights and Biases (WandB) +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. _exp_manager_weights_biases-label: + +.. code-block:: yaml + + exp_manager: + ... + create_checkpoint_callback: True + create_wandb_logger: True + wandb_logger_kwargs: + name: ${name} + project: ${project} + entity: ${entity} + + + +MLFlow +~~~~~~ + +.. _exp_manager_mlflow-label: + +.. code-block:: yaml + + exp_manager: + ... + create_checkpoint_callback: True + create_mlflow_logger: True + mlflow_logger_kwargs: + experiment_name: ${name} + tags: + + save_dir: './mlruns' + prefix: '' + artifact_location: None + # provide run_id if resuming a previously started run + run_id: Optional[str] = None + +DLLogger +~~~~~~~~ + +.. _exp_manager_dllogger-label: + +.. code-block:: yaml + + exp_manager: + ... + create_checkpoint_callback: True + create_dllogger_logger: True + dllogger_logger_kwargs: + verbose: False + stdout: False + json_file: "./dllogger.json" + +ClearML +~~~~~~~ + +.. _exp_manager_clearml-label: + +.. code-block:: yaml + + exp_manager: + ... + create_checkpoint_callback: True + create_clearml_logger: True + clearml_logger_kwargs: + project: None # name of the project + task: None # optional name of task + connect_pytorch: False + model_name: None # optional name of model + tags: None # Should be a list of str + log_model: False # log model to clearml server + log_cfg: False # log config to clearml server + log_metrics: False # log metrics to clearml server + +Exponential Moving Average +-------------------------- + +.. _exp_manager_ema-label: + +NeMo supports using exponential moving average (EMA) for model parameters. This can be useful for improving model generalization +and stability. To use EMA, simply set the following via YAML or :class:`~nemo.utils.exp_manager.ExpManagerConfig`. + +.. code-block:: yaml + + exp_manager: + ... + # use exponential moving average for model parameters + ema: + enabled: True # False by default + decay: 0.999 # decay rate + cpu_offload: False # If EMA parameters should be offloaded to CPU to save GPU memory + every_n_steps: 1 # How often to update EMA weights + validate_original_weights: False # Whether to use original weights for validation calculation or EMA weights + + +.. _nemo_multirun-label: + +Hydra Multi-Run with NeMo +------------------------- + +When training neural networks, it is common to perform hyper parameter search in order to improve the performance of a model +on some validation data. However, it can be tedious to manually prepare a grid of experiments and management of all checkpoints +and their metrics. In order to simplify such tasks, NeMo integrates with `Hydra Multi-Run support `_ in order to provide a unified way to run a set of experiments all +from the config. + +There are certain limitations to this framework, which we list below: + +* All experiments are assumed to be run on a single GPU, and multi GPU for single run (model parallel models are not supported as of now). +* NeMo Multi-Run supports only grid search over a set of hyper-parameters, but we will eventually add support for advanced hyper parameter search strategies. +* **NeMo Multi-Run only supports running on one or more GPUs** and will not work if no GPU devices are present. + +Config Setup +~~~~~~~~~~~~ + +In order to enable NeMo Multi-Run, we first update our YAML configs with some information to let Hydra know we expect to run multiple experiments from this one config - + +.. 
code-block:: yaml + + # Required for Hydra launch of hyperparameter search via multirun + defaults: + - override hydra/launcher: nemo_launcher + + # Hydra arguments necessary for hyperparameter optimization + hydra: + # Helper arguments to ensure all hyper parameter runs are from the directory that launches the script. + sweep: + dir: "." + subdir: "." + + # Define all the hyper parameters here + sweeper: + params: + # Place all the parameters you wish to search over here (corresponding to the rest of the config) + # NOTE: Make sure that there are no spaces between the commas that separate the config params ! + model.optim.lr: 0.001,0.0001 + model.encoder.dim: 32,64,96,128 + model.decoder.dropout: 0.0,0.1,0.2 + + # Arguments to the process launcher + launcher: + num_gpus: -1 # Number of gpus to use. Each run works on a single GPU. + jobs_per_gpu: 1 # If each GPU has large memory, you can run multiple jobs on the same GPU for faster results (until OOM). + + +Next, we will setup the config for ``Experiment Manager``. When we perform hyper parameter search, each run may take some time to complete. +We want to therefore avoid the case where a run ends (say due to OOM or timeout on the machine) and we need to redo all experiments. +We therefore setup the experiment manager config such that every experiment has a unique "key", whose value corresponds to a single +resumable experiment. + +Let us see how to setup such a unique "key" via the experiment name. Simply attach all the hyper parameter arguments to the experiment +name as shown below - + +.. code-block:: yaml + + exp_manager: + exp_dir: null # Can be set by the user. + + # Add a unique name for all hyper parameter arguments to allow continued training. + # NOTE: It is necessary to add all hyperparameter arguments to the name ! + # This ensures successful restoration of model runs in case HP search crashes. + name: ${name}-lr-${model.optim.lr}-adim-${model.adapter.dim}-sd-${model.adapter.adapter_strategy.stochastic_depth} + + ... + checkpoint_callback_params: + ... + save_top_k: 1 # Dont save too many .ckpt files during HP search + always_save_nemo: True # saves the checkpoints as nemo files for fast checking of results later + ... + + # We highly recommend use of any experiment tracking took to gather all the experiments in one location + create_wandb_logger: True + wandb_logger_kwargs: + project: "" + + # HP Search may crash due to various reasons, best to attempt continuation in order to + # resume from where the last failure case occured. + resume_if_exists: true + resume_ignore_no_checkpoint: true + + +Running a Multi-Run config +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Once the config has been updated, we can now run it just like any normal Hydra script -- with one special flag (``-m``) ! + +.. code-block:: bash + + python script.py --config-path=ABC --config-name=XYZ -m \ + trainer.max_steps=5000 \ # Any additional arg after -m will be passed to all the runs generated from the config ! + ... + +Tips and Tricks +~~~~~~~~~~~~~~~ + +* Preserving disk space for large number of experiments + +Some models may have a large number of parameters, and it may be very expensive to save a large number of checkpoints on +physical storage drives. For example, if you use Adam optimizer, each PyTorch Lightning ".ckpt" file will actually be 3x the +size of just the model parameters - per ckpt file ! This can be exhorbitant if you have multiple runs. 
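As a rough sanity check of that 3x figure (assuming fp32 weights and standard Adam, which keeps two extra fp32 buffers, ``exp_avg`` and ``exp_avg_sq``, per parameter), the sketch below estimates the checkpoint size; real ``.ckpt`` files also carry trainer and scheduler state, so treat the number as a lower-bound estimate.

.. code-block:: python

    def estimate_adam_ckpt_bytes(num_params: int, bytes_per_param: int = 4) -> int:
        # Weights: num_params * 4 bytes in fp32.
        weights = num_params * bytes_per_param
        # Adam state: exp_avg + exp_avg_sq, i.e. two more fp32 tensors per parameter.
        optimizer_state = 2 * num_params * bytes_per_param
        return weights + optimizer_state

    # A ~120M parameter model: roughly 1.4 GB per .ckpt vs ~0.5 GB of weights alone.
    print(estimate_adam_ckpt_bytes(120_000_000) / 1e9)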
+ +In the above config, we explicitly set ``save_top_k: 1`` and ``always_save_nemo: True`` - this limits the number of +ckpt files to just one, and also saves a NeMo file (which contains just the model parameters without optimizer state) that +can be restored immediately for further work. + +We can further reduce the storage space by utilizing some utility functions of NeMo to automatically delete either +ckpt or NeMo files after a training run has finished. This is sufficient in case you are collecting results in some experiment +tracking tool and can simply rerun the best config after the search is finished. + +.. code-block:: python + + # Import `clean_exp_ckpt` along with exp_manager + from nemo.utils.exp_manager import clean_exp_ckpt, exp_manager + + @hydra_runner(...) + def main(cfg): + ... + + # Keep track of the experiment directory + exp_log_dir = exp_manager(trainer, cfg.get("exp_manager", None)) + + ... add any training code here as needed ... + + # Add the following line at the end of the training script + # Remove the PTL ckpt file, and potentially also remove the .nemo file to conserve storage space. + clean_exp_ckpt(exp_log_dir, remove_ckpt=True, remove_nemo=False) + + +* Debugging Multi-Run Scripts + +When running Hydra scripts, you may sometimes face config issues which crash the program. In NeMo Multi-Run, a crash in +any one run will **not** crash the entire program; we simply take note of it and move on to the next job. Once all +jobs are completed, we then raise the errors in the order in which they occurred (the program will crash with the first error's +stack trace). + +To debug Multi-Run, we suggest commenting out the full hyper parameter config set inside ``sweep.params`` +and instead running just a single experiment with the config - which will immediately raise the error. + + +* Experiment name cannot be parsed by Hydra + +Sometimes our hyper parameters include PyTorch Lightning ``trainer`` arguments - such as the number of steps, the number of epochs, +or whether to use gradient accumulation. When we attempt to add these as keys to the experiment manager's ``name``, +Hydra may complain that ``trainer.xyz`` cannot be resolved. + +A simple solution is to finalize the Hydra config before you call ``exp_manager()``, as follows - + +.. code-block:: python + + from omegaconf import OmegaConf + + @hydra_runner(...) + def main(cfg): + # Make any changes as necessary to the config + cfg.xyz.abc = uvw + + # Finalize the config (OmegaConf.resolve works in place) + OmegaConf.resolve(cfg) + + # Carry on as normal by calling trainer and exp_manager + trainer = pl.Trainer(**cfg.trainer) + exp_log_dir = exp_manager(trainer, cfg.get("exp_manager", None)) + ... + + +ExpManagerConfig +---------------- + +.. autoclass:: nemo.utils.exp_manager.ExpManagerConfig + :show-inheritance: + :members: + :member-order: bysource diff --git a/docs/source/core/export.rst b/docs/source/core/export.rst new file mode 100644 index 0000000000000000000000000000000000000000..0e598e215dbfd9b79599e2eabce1f05dc40c9892 --- /dev/null +++ b/docs/source/core/export.rst @@ -0,0 +1,192 @@ +Exporting NeMo Models +===================== + +Exporting Models +---------------- + +Most NeMo models can be exported to ONNX or TorchScript to be deployed for inference in optimized execution environments, such as Riva or Triton Inference Server. +The export interface is provided by the :class:`~nemo.core.classes.exportable.Exportable` mix-in class. If a model extends :class:`~nemo.core.classes.exportable.Exportable`, it can be exported by: + +.. 
code-block:: Python + + from nemo.core.classes import ModelPT, Exportable + # deriving from Exportable + class MyExportableModel(ModelPT, Exportable): + ... + + mymodel = MyExportableModel.from_pretrained(model_name="MyModelName") + model.eval() + model.to('cuda') # or to('cpu') if you don't have GPU + + # exporting pre-trained model to ONNX file for deployment. + mymodel.export('mymodel.onnx', [options]) + + +How to Use Model Export +----------------------- +The following arguments are for :meth:`~nemo.core.classes.exportable.Exportable.export`. In most cases, you should only supply the name of the output file and use all defaults: + +.. code-block:: Python + + def export( + self, + output: str, + input_example=None, + verbose=False, + do_constant_folding=True, + onnx_opset_version=None, + check_trace: Union[bool, List[torch.Tensor]] = False, + dynamic_axes=None, + check_tolerance=0.01, + export_modules_as_functions=False, + keep_initializers_as_inputs=None, + ): + +The ``output``, ``input_example``, ``verbose``, ``do_constant_folding``, ``onnx_opset_version`` options have the same semantics as in Pytorch ``onnx.export()`` and ``jit.trace()`` functions and are passed through. For more information about Pytorch's``onnx.export()``, refer to the `torch.onnx functions documentation +`_. Note that if ``input_example`` is None, ``Exportable.input_example()`` is called. + +The file extension of the ``output`` parameter determines export format: + +* ``.onnx->ONNX`` +* ``.pt`` or ``.ts`` -> ``TorchScript``. + +**TorchScript-specific**: By default, the module will undergo ``jit.trace()``. You may require to explicitly pass some modules under ``jit.script()`` so that they are correctly traced.The ``check_trace`` arg is passed through to ``jit.trace()``. + +**ONNX-specific**: If ``use_dynamic_axes`` is True, ``onnx.export()`` is called with dynamic axes. If ``dynamic_axes`` is ``None``, they are inferred from the model's ``input_types`` definition (batch dimension is dynamic, and so is duration etc). + +If ``check_trace`` is ``True``, the resulting ONNX also runs on ``input_example`` and the results compared to the exported model's output, using the ``check_tolerance`` argument. Note the higher tolerance default. + + +How to Make Model Exportable +---------------------------- + +If you are simply using NeMo models, the previous example is all you need to know. +If you write your own models, this section highlights the things you need to be aware of after extending ``Exportable``. + +Exportable Hooks and Overrides +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You should not normally need to override ``Exportable`` default methods. However, ``Exportable.export()`` relies on the assumptions that certain methods are available in your class. + +.. code-block:: Python + + @property + def input_example(self) # => Tuple(input, [(input, ...], [Dict]) + """ + Generates input examples for tracing etc. + Returns: + A tuple of input examples. + """ + +This function should return a tuple of (normally) Tensors - one per each of model inputs (args to ``forward()``). The last element may be a ``Dict`` to specify non-positional arguments by name, as per Torch ``export()`` convention. For more information, refer to the `Using dictionaries to handle Named Arguments as model inputs +`_. + +.. Note: ``Dict`` currently does not work with Torchscript ``trace()``. + +.. code-block:: Python + + @property + def input_types(self): + @property + def output_types(self): + +Those are needed for inferring in/out names and dynamic axes. 
If your model derives from ``ModulePT``, those are already there. Another common scenario is that your model contains one or more modules that processes input and generates output. Then, you should override ``Exportable`` methods ``input_module()`` and ``output_module()`` to point to them, like in this example: + +.. code-block:: Python + + @property + def input_module(self): + return self.fastpitch + + @property + def output_module(self): + return self.fastpitch + +Your model should also have an export-friendly ``forward()`` method - that can mean different things for ONNX ant TorchScript. For ONNX, you can't have forced named parameters without default, like ``forward(self, *, text)``. For TorchScript, you should avoid ``None`` and use ``Optional`` instead. The criteria are highly volatile and may change with every PyTorch version, so it's a trial-and-error process. There is also the general issue that in many cases, ``forward()`` for inference can be simplified and even use less inputs/outputs. To address this, ``Exportable`` looks for ``forward_for_export()`` method in your model and uses that instead of ``forward()`` to export: + +.. code-block:: Python + + # Uses forced named args, many default parameters. + def forward( + self, + *, + text, + durs=None, + pitch=None, + speaker=0, + pace=1.0, + spec=None, + attn_prior=None, + mel_lens=None, + input_lens=None, + ): + # Passes through all self.fastpitch outputs + return self.fastpitch( + text=text, + durs=durs, + pitch=pitch, + speaker=speaker, + pace=pace, + spec=spec, + attn_prior=attn_prior, + mel_lens=mel_lens, + input_lens=input_lens, + ) + + + # Uses less inputs, no '*', returns less outputs: + def forward_for_export(self, text): + ( + spect, + durs_predicted, + log_durs_predicted, + pitch_predicted, + attn_soft, + attn_logprob, + attn_hard, + attn_hard_dur, + pitch, + ) = self.fastpitch(text=text) + return spect, durs_predicted, log_durs_predicted, pitch_predicted + +To stay consistent with input_types()/output_types(), there are also those hooks in ``Exportable`` that let you exclude particular inputs/outputs from the export process: + +.. code-block:: Python + + @property + def disabled_deployment_input_names(self): + """Implement this method to return a set of input names disabled for export""" + return set(["durs", "pitch", "speaker", "pace", "spec", "attn_prior", "mel_lens", "input_lens"]) + + @property + def disabled_deployment_output_names(self): + + +Another common requirement for models that are being exported is to run certain net modifications for inference efficiency before exporting - like disabling masks in some convolutions or removing batch normalizations. A better style is to make those happen on ``ModelPT.eval()`` (and reversed on ``.train()``), but it's not always feasible so the following hook is provided in ``Exportable`` to run those: + +.. code-block:: Python + + def _prepare_for_export(self, **kwargs): + """ + Override this method to prepare module for export. This is in-place operation. + Base version does common necessary module replacements (Apex etc) + """ + # do graph modifications specific for this model + replace_1D_2D = kwargs.get('replace_1D_2D', False) + replace_for_export(self, replace_1D_2D) + # call base method for common set of modifications + Exportable._prepare_for_export(self, **kwargs) + + +Exportable Model Code +~~~~~~~~~~~~~~~~~~~~~ + +Most importantly, the actual Torch code in your model should be ONNX or TorchScript - compatible (ideally, both). +#. 
Ensure the code is written in Torch - avoid bare `Numpy or Python operands `_. +#. Create your model ``Exportable`` and add an export unit test, to catch any operation/construct not supported in ONNX/TorchScript, immediately. + +For more information, refer to the PyTorch documentation: + - `List of supported operators `_ + - `Tracing vs. scripting `_ + - `AlexNet example `_ + diff --git a/docs/source/core/neural_types.rst b/docs/source/core/neural_types.rst new file mode 100644 index 0000000000000000000000000000000000000000..9003b9ca520328a0257bdade1660162eb1630bac --- /dev/null +++ b/docs/source/core/neural_types.rst @@ -0,0 +1,178 @@ + +Neural Types +============ + +Motivation +---------- + +Neural Types describe the semantics, axis order, and dimensions of a tensor. The purpose of this type system is to catch semantic and +dimensionality errors during model creation and facilitate module re-use. + +.. image:: whyntypes.gif + :width: 900 + :alt: Neural Types Motivation + +``NeuralType`` class +-------------------- + +Neural Types perform semantic checks for modules and models inputs/outputs. They contain information about: + + - Semantics of what is stored in the tensors. For example, logits, logprobs, audiosignal, embeddings, etc. + - Axes layout, semantic and (optionally) dimensionality. For example: ``[Batch, Time, Channel]`` + +Types are implemented in ``nemo.core.neural_types.NeuralType`` class. When you instantiate an instance of this class, you +are expected to include both *axes* information and *element type* information. + +.. autoclass:: nemo.core.neural_types.NeuralType + +Type Comparison Results +----------------------- + +When comparing two neural types, the following comparison results are generated. + +.. autoclass:: nemo.core.neural_types.NeuralTypeComparisonResult + +Examples +-------- + +Long vs short notation +~~~~~~~~~~~~~~~~~~~~~~ + +NeMo's ``NeuralType`` class allows you to express axis semantics information in long and short form. Consider these two equivalent types. Both encoder 3 dimensional tensors and both contain elements of type ``AcousticEncodedRepresentation`` (this type is a typical output of ASR encoders). + +.. code-block:: python + + long_version = NeuralType( + axes=(AxisType(AxisKind.Batch, None), AxisType(AxisKind.Dimension, None), AxisType(AxisKind.Time, None)), + elements_type=AcousticEncodedRepresentation(), + ) + short_version = NeuralType(('B', 'D', 'T'), AcousticEncodedRepresentation()) + assert long_version.compare(short_version) == NeuralTypeComparisonResult.SAME + +Transpose same +~~~~~~~~~~~~~~ + +Often it is useful to know if a simple transposition will solve type incompatibility. This is the case if the comparison result of two types equals ``nemo.core.neural_types.NeuralTypeComparisonResult.TRANSPOSE_SAME``. + +.. code-block:: python + + type1 = NeuralType(axes=('B', 'T', 'C')) + type2 = NeuralType(axes=('T', 'B', 'C')) + assert type1.compare(type2) == NeuralTypeComparisonResult.TRANSPOSE_SAME + assert type2.compare(type1) == NeuralTypeComparisonResult.TRANSPOSE_SAME + +Note that in this example, we dropped ``elements_type`` argument of ``NeuralType`` constructor. If not supplied, the element type is ``VoidType``. + +``VoidType`` for elements +~~~~~~~~~~~~~~~~~~~~~~~~~ + +Sometimes it is useful to express that elements' types don't matter but axes layout do. ``VoidType`` for elements can be used to express this. + +.. note:: ``VoidType`` is compatible with every other elements' type but not the other way around. 
See the following code snippet below for details. + +.. code-block:: python + + btc_spctr = NeuralType(('B', 'T', 'C'), SpectrogramType()) + btc_spct_bad = NeuralType(('B', 'T'), SpectrogramType()) + # Note the VoidType for elements here + btc_void = NeuralType(('B', 'T', 'C'), VoidType()) + + # This is true because VoidType is compatible with every other element type (SpectrogramType in this case) + # And axes layout between btc_void and btc_spctr is the same + assert btc_void.compare(btc_spctr) == NeuralTypeComparisonResult.SAME + # These two types are incompatible because even though VoidType is used for elements on one side, + # the axes layout is different + assert btc_void.compare(btc_spct_bad) == NeuralTypeComparisonResult.INCOMPATIBLE + # Note that even though VoidType is compatible with every other type, other types are not compatible with VoidType! + # It is one-way compatibility + assert btc_spctr.compare(btc_void) == NeuralTypeComparisonResult.INCOMPATIBLE + +Element type inheritance +~~~~~~~~~~~~~~~~~~~~~~~~ + +Neural types in NeMo support Python inheritance between element types. Consider an example where you want to develop a Neural Module which performs data augmentation for all kinds of spectrograms. +In ASR, two types of spectrograms are frequently used: mel and mfcc. To express this, we will create 3 classes to express +element's types: ``SpectrogramType``, ``MelSpectrogramType(SpectrogramType)``, ``MFCCSpectrogramType(SpectrogramType)``. + +.. code-block:: python + + input = NeuralType(('B', 'D', 'T'), SpectrogramType()) + out1 = NeuralType(('B', 'D', 'T'), MelSpectrogramType()) + out2 = NeuralType(('B', 'D', 'T'), MFCCSpectrogramType()) + + # MelSpectrogram and MFCCSpectrogram are not interchangeable. + assert out1.compare(out2) == NeuralTypeComparisonResult.INCOMPATIBLE + assert out2.compare(out1) == NeuralTypeComparisonResult.INCOMPATIBLE + # Type comparison detects that MFCC/MelSpectrogramType is a kind of SpectrogramType and can be accepted. + assert input.compare(out1) == NeuralTypeComparisonResult.GREATER + assert input.compare(out2) == NeuralTypeComparisonResult.GREATER + +Custom element types +~~~~~~~~~~~~~~~~~~~~ + +It is possible to create user-defined element types to express the semantics of elements in your tensors. To do so, the user will need to inherit and implement abstract methods of the ``nemo.core.neural_types.elements.ElementType`` class + +.. autoclass:: nemo.core.neural_types.elements.ElementType + +Note that element types can be parametrized. Consider this example where it distinguishes between audio sampled at 8Khz and 16Khz. + +.. code-block:: python + + audio16K = NeuralType(axes=('B', 'T'), elements_type=AudioSignal(16000)) + audio8K = NeuralType(axes=('B', 'T'), elements_type=AudioSignal(8000)) + + assert audio8K.compare(audio16K) == NeuralTypeComparisonResult.SAME_TYPE_INCOMPATIBLE_PARAMS + assert audio16K.compare(audio8K) == NeuralTypeComparisonResult.SAME_TYPE_INCOMPATIBLE_PARAMS + +Enforcing dimensions +~~~~~~~~~~~~~~~~~~~~ + +In addition to specifying tensor layout and elements' semantics, neural types also allow you to enforce tensor dimensions. +The user will have to use long notations to specify dimensions. Short notations only allows you to specify axes semantics and assumes +arbitrary dimensions. + +.. 
code-block:: python + + type1 = NeuralType( + (AxisType(AxisKind.Batch, 64), AxisType(AxisKind.Time, 10), AxisType(AxisKind.Dimension, 128)), + SpectrogramType(), + ) + type2 = NeuralType(('B', 'T', 'C'), SpectrogramType()) + + # type2 will accept elements of type1 because their axes semantics match and type2 does not care about dimensions + assert type2.compare(type1) == NeuralTypeComparisonResult.SAME + # type1 will not accept elements of type2 because it needs dimensions to match strictly. + assert type1.compare(type2) == NeuralTypeComparisonResult.DIM_INCOMPATIBLE + +Generic Axis kind +~~~~~~~~~~~~~~~~~ + +Sometimes (especially in the case of loss modules) it is useful to be able to specify a "generic" axis kind which will make it +compatible with any other kind of axis. This is easy to express with Neural Types by using ``nemo.core.neural_types.axes.AxisKind.Any`` for axes. + +.. code-block:: python + + type1 = NeuralType(('B', 'Any', 'Any'), SpectrogramType()) + type2 = NeuralType(('B', 'T', 'C'), SpectrogramType()) + type3 = NeuralType(('B', 'C', 'T'), SpectrogramType()) + + # type1 will accept elements of type2 and type3 because it only cares about element kind (SpectrogramType) + # number of axes (3) and that first one corresponds to batch + assert type1.compare(type2) == NeuralTypeComparisonResult.SAME + assert type1.compare(type3) == NeuralTypeComparisonResult.INCOMPATIBLE + +Container types +~~~~~~~~~~~~~~~ + +The NeMo type system understands Python containers (lists). If your module returns a nested list of typed tensors, the way to express it is by +using Python list notation and Neural Types together when defining your input/output types. + +The example below shows how to express that your module returns a single output ("out") which is a list of lists of two-dimensional tensors of shape ``[batch, dimension]`` containing logits. + +.. code-block:: python + + @property + def output_types(self): + return { + "out": [[NeuralType(('B', 'D'), LogitsType())]], + } diff --git a/docs/source/core/whyntypes.gif b/docs/source/core/whyntypes.gif new file mode 100644 index 0000000000000000000000000000000000000000..56b5ab154391c95d393e5b6d48f2170cba260f07 Binary files /dev/null and b/docs/source/core/whyntypes.gif differ diff --git a/docs/source/favicon.ico b/docs/source/favicon.ico new file mode 100644 index 0000000000000000000000000000000000000000..424df87200c706460f9ad1c7722ef0d35f286f2b Binary files /dev/null and b/docs/source/favicon.ico differ diff --git a/docs/source/index.rst b/docs/source/index.rst new file mode 100644 index 0000000000000000000000000000000000000000..ee1d3fba805a789c20f5e9a41971df38acee3086 --- /dev/null +++ b/docs/source/index.rst @@ -0,0 +1,79 @@ +NVIDIA NeMo User Guide +====================== + +.. toctree:: + :maxdepth: 2 + :caption: Getting Started + :name: starthere + + starthere/intro + starthere/tutorials + starthere/best-practices + + +.. toctree:: + :maxdepth: 2 + :caption: NeMo Core + :name: core + + core/core + core/exp_manager + core/neural_types + core/export + core/adapters/intro + core/api + + +.. toctree:: + :maxdepth: 2 + :caption: Speech Processing + :name: Speech Processing + + asr/intro + asr/speech_classification/intro + asr/speaker_recognition/intro + asr/speaker_diarization/intro + asr/ssl/intro + asr/speech_intent_slot/intro + +.. 
toctree:: + :maxdepth: 3 + :caption: Natural Language Processing + :name: Natural Language Processing + + nlp/nemo_megatron/intro + nlp/machine_translation/machine_translation + nlp/text_normalization/intro + nlp/api + nlp/models + + +.. toctree:: + :maxdepth: 1 + :caption: Text To Speech (TTS) + :name: Text To Speech + + tts/intro + +.. toctree:: + :maxdepth: 2 + :caption: Common + :name: Common + + text_processing/intro + +.. toctree:: + :maxdepth: 2 + :caption: Text Processing + :name: Text Processing + + text_processing/g2p/g2p + common/intro + + +.. toctree:: + :maxdepth: 3 + :caption: Tools + :name: Tools + + tools/intro diff --git a/docs/source/nlp/api.rst b/docs/source/nlp/api.rst new file mode 100644 index 0000000000000000000000000000000000000000..46efb0851d4e06cb614698ee300d80dc28f7dbfb --- /dev/null +++ b/docs/source/nlp/api.rst @@ -0,0 +1,99 @@ +NeMo NLP collection API +======================= + +Model Classes +------------- + +.. autoclass:: nemo.collections.nlp.models.TextClassificationModel + :show-inheritance: + :members: setup_training_data, setup_optimization, setup_validation_data, setup_test_data, register_artifact, classifytext + +.. autoclass:: nemo.collections.nlp.models.GLUEModel + :show-inheritance: + :members: setup_training_data, setup_optimization, setup_validation_data, setup_test_data, register_artifact + +.. autoclass:: nemo.collections.nlp.models.PunctuationCapitalizationModel + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.nlp.models.TokenClassificationModel + :show-inheritance: + :members: setup_training_data, setup_optimization, setup_validation_data, setup_test_data, register_artifact + +.. autoclass:: nemo.collections.nlp.models.QAModel + :show-inheritance: + :members: setup_training_data, setup_optimization, setup_validation_data, setup_test_data, inference, validation_epoch_end, test_epoch_end + +.. autoclass:: nemo.collections.nlp.models.DuplexTaggerModel + :show-inheritance: + :members: setup_training_data, setup_optimization, setup_validation_data, setup_test_data, inference, validation_epoch_end, test_epoch_end + +.. autoclass:: nemo.collections.nlp.models.DuplexDecoderModel + :show-inheritance: + :members: setup_training_data, setup_optimization, setup_validation_data, setup_test_data, inference, validation_epoch_end, test_epoch_end + +.. autoclass:: nemo.collections.nlp.models.BERTLMModel + :show-inheritance: + :members: setup_training_data, setup_optimization + +Modules +------- + +.. autoclass:: nemo.collections.nlp.modules.BertModule + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.nlp.modules.AlbertEncoder + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.nlp.modules.BertEncoder + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.nlp.modules.DistilBertEncoder + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.nlp.modules.RobertaEncoder + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.nlp.modules.SequenceClassifier + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.nlp.modules.SequenceRegression + :show-inheritance: + :members: + +.. autoclass:: nemo.collections.nlp.modules.SequenceTokenClassifier + :show-inheritance: + :members: + +.. autofunction:: nemo.collections.nlp.modules.get_lm_model + +.. autofunction:: nemo.collections.nlp.modules.get_pretrained_lm_models_list + +.. autofunction:: nemo.collections.nlp.modules.common.megatron.get_megatron_lm_models_list + +Datasets +-------- + +.. 
autoclass:: nemo.collections.nlp.data.token_classification.punctuation_capitalization_dataset.BertPunctuationCapitalizationDataset + :show-inheritance: + :members: + :special-members: __getitem__ + +.. autofunction:: nemo.collections.nlp.data.token_classification.punctuation_capitalization_tarred_dataset.create_tarred_dataset + +.. autoclass:: nemo.collections.nlp.data.token_classification.punctuation_capitalization_tarred_dataset.BertPunctuationCapitalizationTarredDataset + :show-inheritance: + :members: + :special-members: __iter__ + :exclude-members: reinforce_type + +.. autoclass:: nemo.collections.nlp.data.token_classification.punctuation_capitalization_infer_dataset.BertPunctuationCapitalizationInferDataset + :show-inheritance: + :members: + :special-members: __getitem__ diff --git a/docs/source/nlp/bert_pretraining.rst b/docs/source/nlp/bert_pretraining.rst new file mode 100644 index 0000000000000000000000000000000000000000..8c7fd376268f17780215d007bbec0162847b2fb0 --- /dev/null +++ b/docs/source/nlp/bert_pretraining.rst @@ -0,0 +1,134 @@ +.. _bert_pretraining: + +BERT +==== + +BERT is an autoencoding language model with a final loss composed of: + +- masked language model loss +- next sentence prediction + +The model architecture is published in `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding `__ :cite:`nlp-bert-devlin2018bert`. +The model is originally trained on English Wikipedia and BookCorpus. BERT is often used as a language model encoder for downstream tasks, for example, :ref:`token_classification`, :ref:`text_classification`, :ref:`question_answering`, etc. +Domain-specific BERT models can be advantageous for a wide range of applications. One notable application is the domain-specific BERT in a biomedical setting, +e.g. BioBERT :cite:`nlp-bert-lee2019biobert` or its improved derivative BioMegatron :cite:`nlp-bert-shin2020biomegatron`. For the latter, refer to :ref:`megatron_finetuning`. + +Quick Start Guide +----------------- + +.. code-block:: python + + from nemo.collections.nlp.models import BERTLMModel + + # to get the list of pre-trained models + BERTLMModel.list_available_models() + + # Download and load the pre-trained BERT-based model + model = BERTLMModel.from_pretrained("bertbaseuncased") + +Available Models +^^^^^^^^^^^^^^^^ + +.. list-table:: *Pretrained Models* + :widths: 5 10 + :header-rows: 1 + + * - Model + - Pretrained Checkpoint + * - BERT-base uncased + - https://ngc.nvidia.com/catalog/models/nvidia:nemo:bertbaseuncased + * - BERT-large uncased + - https://ngc.nvidia.com/catalog/models/nvidia:nemo:bertlargeuncased + +.. _dataset_bert_pretraining: + +Data Input for the BERT model +----------------------------- + +Data preprocessing can be either done on-the-fly during training or offline before training. The latter is optimized and recommended +for large text corpora. This was also used in the original paper to train the model on Wikipedia and BookCorpus. For on-the-fly data +processing, provide text files with sentences for training and validation, where words are separated by spaces, i.e.: ``[WORD] [SPACE] [WORD] [SPACE] [WORD]``. +To use this pipeline in training, use the dedicated configuration file `NeMo/examples/nlp/language_modeling/conf/bert_pretraining_from_preprocessed_config.yaml`. + +To process data offline in advance, refer to the `BERT Quick Start Guide `__. 
+To recreate the original Wikipedia and BookCorpus datasets, follow steps 1-5 in the Quick Start Guide and run the script ``./data/create_datasets_from_start.sh`` inside the Docker container. +The ``downloaded`` folder should include two sub folders ``lower_case_[0,1]_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5`` +and ``lower_case_[0,1]_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5``, containing sequences of length 128 with a maximum of 20 masked tokens +and sequences of length 512 with a maximum of 80 masked tokens respectively. To use this pipeline in training, use the dedicated configuration file ``NeMo/examples/nlp/language_modeling/conf/bert_pretraining_from_text_config.yaml`` +and specify the path to the created hd5f files. + + +Training the BERT model +----------------------- + +Example of model configuration for on-the-fly data preprocessing: `NeMo/examples/nlp/language_modeling/conf/bert_pretraining_from_text_config.yaml `__. +Example of model configuration for offline data preprocessing: `NeMo/examples/nlp/language_modeling/conf/bert_pretraining_from_preprocessed_config.yaml `__. + +The specification can be grouped into three categories: + +- Parameters that describe the training process: **trainer** +- Parameters that describe the datasets: **model.train_ds**, **model.validation_ds** +- Parameters that describe the model: **model**, **model.tokenizer**, **model.language_model** + +More details about parameters in the config file can be found below: + ++-------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ +| **Parameter** | **Data Type** | **Description** | ++-------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ +| **model.only_mlm_loss** | bool | Only uses masked language model without next sentence prediction. | ++-------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ +| **train_ds.data_file** | string | Name of the text file or hdf5 data directory. | ++-------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ +| **train_ds.num_samples** | integer | Number of samples to use from the training dataset, ``-1`` - to use all. | ++-------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + +More details about parameters for offline data preprocessing can be found below: + ++-------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ +| **Parameter** | **Data Type** | **Description** | ++-------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ +| **train_ds.max_predictions_per_seq** | integer | Maximum number of masked tokens in a sequence in the preprocessed data. 
| ++-------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + +More details about parameters for online data preprocessing can be found below: + ++-------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ +| **Parameter** | **Data Type** | **Description** | ++-------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ +| **model.max_seq_length** | integer | The maximum total input sequence length after tokenization. | ++-------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ +| **model.mask_prob** | float | Probability of masking a token in the input text during data processing. | ++-------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ +| **model.short_seq_prob** | float | Probability of having a sequence shorter than the maximum sequence length. | ++-------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + +.. note:: + + For offline data preprocessing, **model.tokenizer** is null. For downstream task, use the same tokenizer that was used for + offline preprocessing. For online data preprocessing, **model.tokenizer** needs to be specified. See also :ref:`nlp_model` for + details. + +Example of the command for training the model: + +.. code:: + + python bert_pretraining.py \ + model.train_ds.data_file= \ + trainer.max_epochs= \ + trainer.devices=[] \ + trainer.accelerator='gpu' + + +Fine-tuning on Downstream Tasks +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +To use a trained BERT model checkpoint on a NeMo NLP downstream task, e.g. :ref:`question_answering`, specify +:code:`model.language_model.lm_checkpoint=`. + +References +---------- + +.. bibliography:: nlp_all.bib + :style: plain + :labelprefix: NLP-BERT + :keyprefix: nlp-bert- diff --git a/docs/source/nlp/dialogue.rst b/docs/source/nlp/dialogue.rst new file mode 100644 index 0000000000000000000000000000000000000000..157aaa714b16998a0a5008b7e0648b89d3d0ff11 --- /dev/null +++ b/docs/source/nlp/dialogue.rst @@ -0,0 +1,143 @@ +.. _dialogue: + +Dialogue tasks +====================================== + +This module consists of various tasks that are related to dialogue. + +**Module Design** + +We decided to group dialogue tasks into a common module instead of having a module for each because they share many things in common, meaning that there can be more re-use of code. +This design can also support easier extension of this module, as developers can work on components of their interest while utilizing other components of dialogue pipeline. +In particular, we wanted to decouple the task-dependent, model-independent components of DataProcessor and InputExample from the model-dependent, task-independent components of Model and Dataset. + +.. 
image:: dialogue_UML.png + :alt: Dialogue-UML + :width: 800px + +**Supported Tasks** + +Supported tasks fall into broad categories of intent / domain classification with slot filling, intent classification as well as sequence generation. + +For each category of tasks, there exists several Data Processors to convert raw data from various sources into a common format as well as Dialogue Models that approachs the task in various ways. + +Currently, the supported task categories are: + ++----------------------------------------------------------+----------------------------------+----------------------------------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------+ +| **Task Category** | **Tasks** | **Models** | **Supported Options for model.language_model.pretrained_model_name** | **Supported options for model.library** | ++----------------------------------------------------------+----------------------------------+----------------------------------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------+ +| Domain / Intent Classification | Schema Guided Dialogue | Dialogue GPT Classification Model | gpt2, gpt2-{medium, large, xl}, microsoft/DialoGPT-{small, medium} | Huggingface, Megatron | ++ with slot filling +----------------------------------+----------------------------------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------+ +| | Assistant | SGDQA (BERT-Based Schema Guided Dialogue Question Answering model) | bert-base-cased | Megatron | ++ +----------------------------------+----------------------------------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------+ +| | | Intent Slot Classification Model | bert-base-uncased | Megatron | ++----------------------------------------------------------+----------------------------------+----------------------------------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------+ +| Intent Classification | Zero Shot Food Ordering | Dialogue GPT Classification Model | gpt2, gpt2-{medium, large, xl}, microsoft/DialoGPT-{small, medium} | Huggingface, Megatron | ++ +----------------------------------+----------------------------------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------+ +| | Omniverse Design | Dialogue Nearest Neighbour Model | sentence-transformers/* | Huggingface | ++ +----------------------------------+----------------------------------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------+ +| | | Dialogue Zero Shot Intent Model (Based on MNLI pretraining) | bert-base-uncased | Huggingface, Megatron | 
++----------------------------------------------------------+----------------------------------+----------------------------------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------+ +| Sequence Generation | Schema Guided Dialogue Generation| Dialogue GPT Generation Model | gpt2, gpt2-{medium, large, xl}, microsoft/DialoGPT-{small, medium} | Huggingface, Megatron | ++ +----------------------------------+----------------------------------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------+ +| | MS Marco NLGen | Dialogue S2S Generation Model | facebook/bart-{base, large}, t5-{small, base, large, 3b, 11b} | Huggingface, Megatron | ++----------------------------------------------------------+----------------------------------+----------------------------------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------+ + +**Configuration** + +Example of model configuration file for training the model can be found at: `NeMo/examples/nlp/dialogue/conf/dialogue_config.yaml `__. + +Because the Dialogue module contains a wide variety of models and tasks, there are a large number of configuration parameters to adjust (some of which only applies to some models/some tasks) + +In the configuration file, define the parameters of the training and the model, although most of the default values will work well. +For various task-model combination, only a restricted set of config args will apply. Please read the configuration file for comments on which config args you would need for each model and task. + +The configuration can be roughly grouped into a few categories: + +- Parameters that describe the training process, such as how many gpus to use: **trainer** +- Parameters that describe the model: **model** +- Parameters that describe optimization: **model.optim** +- Parameters that describe the task: **model.dataset** +- Parameters that describe the dataloaders: **model.train_ds**, **model.validation_ds**, **model.test_ds**, +- Parameters that describe the training experiment manager that log training process: **exp_manager** + + +Arguments that very commonly need to be edited for all models and tasks + +- :code:`do_training`: perform training or only testing +- :code:`trainer.devices`: number of GPUs (int) or list of GPUs e.g. [0, 1, 3] +- :code:`model.dataset.task`: Task to work on [sgd, assistant, zero_shot, ms_marco, sgd_generation, design, mellon_qa] +- :code:`model.dataset.data_dir`: the dataset directory +- :code:`model.dataset.dialogues_example_dir`: the directory to store prediction files +- :code:`model.dataset.debug_mode`: whether to run in debug mode with a very small number of samples [True, False] +- :code:`model.language_model.pretrained_model_name`: language model to use, which causes different Dialogue Models to be loaded (see table above for options in each model class) +- :code:`model.library`: library to load language model from [huggingface or megatron] +- :code:`model.language_model.lm_checkpoint`: specifying a trained checkpoint (.bin / .ckpt / .nemo). 
The only exception is for DialogueZeroShotIntentModel, which can be configured at :code:`model.original_nemo_checkpoint` instead. For trained checkpoints, see :code:`list_available_models()` for each model class and then download the file to a local directory.
+
+**Obtaining data**
+
+Task: Schema Guided Dialogue (SGD) / SGD Generation
+
+:code:`git clone https://github.com/google-research-datasets/dstc8-schema-guided-dialogue.git`
+
+Task: MS Marco
+
+Please download the files below and unzip them into a common folder (for model.dataset.data_dir).
+
+https://msmarco.blob.core.windows.net/msmarco/train_v2.1.json.gz
+https://msmarco.blob.core.windows.net/msmarco/dev_v2.1.json.gz
+https://msmarco.blob.core.windows.net/msmarco/eval_v2.1_public.json.gz
+
+Then remove unused samples (optional, but otherwise this requires significantly more CPU RAM, ~25 GB):
+
+:code:`python ../NeMo/examples/nlp/dialogue/remove_ms_marco_samples_without_wellFormedAnswers.py --filename train_v2.1.json`
+:code:`python ../NeMo/examples/nlp/dialogue/remove_ms_marco_samples_without_wellFormedAnswers.py --filename dev_v2.1.json`
+
+Task: Assistant
+
+:code:`git clone https://github.com/xliuhw/NLU-Evaluation-Data`
+
+Then unzip it.
+
+Finally, convert the dataset into the required format:
+
+.. code::
+
+    python examples/nlp/intent_slot_classification/data/import_datasets.py \
+    --source_data_dir=`source_data_dir` \
+    --target_data_dir=`target_data_dir` \
+    --dataset_name='assistant'
+
+- :code:`source_data_dir`: the directory location of your dataset
+- :code:`target_data_dir`: the directory location where the converted dataset should be saved
+
+
+Unfortunately, the other datasets are currently not publicly available.
+
+**Training/Testing a model**
+
+
+Please try the example Dialogue model in a Jupyter notebook (can run on `Google's Colab `__).
+
+
+Connect to an instance with a GPU (**Runtime** -> **Change runtime type** -> select **GPU** for the hardware accelerator).
+
+An example script on how to train the model can be found here: `NeMo/examples/nlp/dialogue/dialogue.py `__.
+
+The following is an example of the command for training the model:
+
+
+Code for training a model with the three public datasets above is available in the Jupyter/Colab notebook (can run on `Google's Colab `__).
+
+
+.. code::
+
+    python examples/nlp/dialogue/dialogue.py \
+    do_training=True \
+    model.dataset.task=sgd \
+    model.dataset.debug_mode=True \
+    model.language_model.pretrained_model_name=gpt2 \
+    model.data_dir= \
+    model.dataset.dialogues_example_dir= \
+    trainer.devices=[0] \
+    trainer.accelerator='gpu'
diff --git a/docs/source/nlp/dialogue_UML.png b/docs/source/nlp/dialogue_UML.png
new file mode 100644
index 0000000000000000000000000000000000000000..0a76aae7c5fa2136950c13bb3781b9b947f28cbc
--- /dev/null
+++ b/docs/source/nlp/dialogue_UML.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a6f89e670db61728ca6c2a7c01da6017d299e18d5f8bc1334ca0694c0266ab5a
+size 1682802
diff --git a/docs/source/nlp/entity_linking.rst b/docs/source/nlp/entity_linking.rst
new file mode 100644
index 0000000000000000000000000000000000000000..2c326fb8560338aa4dcb6f035c5d6fb211b857fe
--- /dev/null
+++ b/docs/source/nlp/entity_linking.rst
@@ -0,0 +1,36 @@
+.. _entity_linking:
+
+Entity Linking
+====================================
+
+Entity linking is the process of matching concepts mentioned in natural language to their unique IDs and canonical forms stored
For example, an entity linking model might match the phrase ``blood thinners`` mentioned in conversation +to the knowledge base concept UID45623 anticoagulant. Entity linking applications range from helping automate ingestion of +large amounts of data to assisting in real time concept normalization. + +Within NeMo we use the entity linking approach described in Liu et. al's NAACL 2021 "`Self-alignment Pre-training for Biomedical Entity Representations `_" :cite:`nlp-entity_linking-liu2021selfalignment`. +The main idea behind this approach is to reshape an initial concept embedding space such that synonyms of the same concept are +pulled closer together and unrelated concepts are pushed further apart. The concept embeddings from this reshaped space can then +be used to build a knowledge base embedding index. + +.. image:: entity_linking_overview.jpg + :alt: Entity-Linking-Overview + :width: 800px + +Our BERT-base + Self Alignment Pretraining implementation allows you to train an entity linking encoder. We also provide example code +on building an index with `Medical UMLS `_ concepts `NeMo/examples/nlp/entity_linking/build_index.py `__. + +Please try the example Entity Linking model in a Jupyter notebook (can run on `Google's Colab `__). + +Connect to an instance with a GPU (**Runtime** -> **Change runtime type** -> select **GPU** for the hardware accelerator). + +An example script on how to train the model can be found here: `NeMo/examples/nlp/entity_linking `__. + + +References +---------- + +.. bibliography:: nlp_all.bib + :style: plain + :labelprefix: nlp-entity_linking + :keyprefix: nlp-entity_linking- diff --git a/docs/source/nlp/entity_linking_overview.jpg b/docs/source/nlp/entity_linking_overview.jpg new file mode 100644 index 0000000000000000000000000000000000000000..e28711a6e34588b3e9ec08844be3c53771bebec1 Binary files /dev/null and b/docs/source/nlp/entity_linking_overview.jpg differ diff --git a/docs/source/nlp/glue_benchmark.rst b/docs/source/nlp/glue_benchmark.rst new file mode 100644 index 0000000000000000000000000000000000000000..849a49ea1a123e853d2c92be636bb8dabbe7b0f1 --- /dev/null +++ b/docs/source/nlp/glue_benchmark.rst @@ -0,0 +1,10 @@ +.. _glue_benchmark: + +GLUE Benchmark +============== + +We recommend you try the GLUE Benchmark model in a Jupyter notebook (can run on `Google's Colab `_): `NeMo/tutorials/nlp/GLUE_Benchmark.ipynb `__. + +Connect to an instance with a GPU (**Runtime** -> **Change runtime type** -> select **GPU** for the hardware accelerator). + +An example script on how to train the model can be found here: `NeMo/examples/nlp/glue_benchmark/glue_benchmark.py `__. diff --git a/docs/source/nlp/information_retrieval.rst b/docs/source/nlp/information_retrieval.rst new file mode 100644 index 0000000000000000000000000000000000000000..3c71ffcfcd129710ce770cc6c1a005bb3bc3cc2b --- /dev/null +++ b/docs/source/nlp/information_retrieval.rst @@ -0,0 +1,10 @@ +.. _information_retrieval: + +Information Retrieval +===================== + +We recommend you try the Information Retrieval model in a Jupyter notebook (can run on `Google's Colab `_): `NeMo/tutorials/nlp/Information_Retrieval_MSMARCO.ipynb `__. + +Connect to an instance with a GPU (**Runtime** -> **Change runtime type** -> select **GPU** for hardware the accelerator), + +An example script on how to train the model can be found here: `NeMo/examples/nlp/information_retrieval `__. 
diff --git a/docs/source/nlp/joint_intent_slot.rst b/docs/source/nlp/joint_intent_slot.rst
new file mode 100644
index 0000000000000000000000000000000000000000..30f0962a241fef1f83ceb9fc5bd098582053ef94
--- /dev/null
+++ b/docs/source/nlp/joint_intent_slot.rst
@@ -0,0 +1,244 @@
+.. _intent_slot:
+
+Joint Intent and Slot Classification
+====================================
+
+Joint Intent and Slot classification is an NLU task for classifying an intent and detecting all
+relevant slots (Entities) for the intent in a query. For example, in the query ``What is the weather in Santa Clara tomorrow morning?``,
+we would like to classify the query as a ``weather intent``, detect ``Santa Clara`` as a ``location slot``,
+and ``tomorrow morning`` as a ``date_time slot``. Intent and Slot names are usually task-specific and
+defined as labels in the training data. This is a fundamental step that is executed in any
+task-driven conversational assistant.
+
+Our BERT-based model implementation allows you to train and detect both of these tasks together.
+
+.. note::
+
+    We recommend you try the Joint Intent and Slot Classification model in a Jupyter notebook (can run on `Google's Colab `_): `NeMo/tutorials/nlp/Joint_Intent_and_Slot_Classification.ipynb `__.
+
+    Connect to an instance with a GPU (**Runtime** -> **Change runtime type** -> select **GPU** for the hardware accelerator).
+
+    An example script on how to train the model can be found here: `NeMo/examples/nlp/intent_slot_classification `__.
+
+
+NeMo Data Format
+----------------
+
+When training the model, the dataset should first be converted to the required data format, which requires the following files:
+
+- :code:`dict.intents.csv` - A list of all intent names in the data. One line per intent name. The index of the intent line
+  (starting from ``0``) is used to identify the appropriate intent in the ``train.tsv`` and ``test.tsv`` files.
+
+.. code::
+
+    weather
+    alarm
+    meeting
+    ...
+
+- :code:`dict.slots.csv` - A list of all slot names in the data. One line per slot name. The index of the slot line
+  (starting from ``0``) is used to identify the appropriate slots in the queries in the ``train_slots.tsv`` and ``test_slots.tsv`` files.
+  In the last line of this dictionary, the ``O`` slot name is used to identify all ``out of scope`` slots, which are usually the majority of the tokens
+  in the queries.
+
+.. code::
+
+    date
+    time
+    city
+    ...
+    O
+
+- :code:`train.tsv/test.tsv` - A list of original queries, one per line, with the intent number
+  separated by a tab (e.g. "what alarms do i have set right now 0"). Intent numbers are
+  set according to the intent line in the intent dictionary file (:code:`dict.intents.csv`),
+  starting from ``0``. The first line in these files should contain the header line ``sentence
+  label``.
+
+- :code:`train_slots.tsv/test_slots.tsv` - A list that contains one line per query, where each word from the original text query
+  is replaced by a token number from the slots dictionary file (``dict.slots.csv``), counted starting from ``0``. All the words
+  that do not belong to a relevant slot are replaced by the ``out of scope`` token number, which is also a part of the slot dictionary file,
+  usually as the last entry there. For example, a line from these files should look similar to: "54 0 0 54 54 12 12" (the numbers are
+  separated by a space). These files do not contain a header line.
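+
+To make the file layout concrete, below is a small, self-contained sketch that writes the four files described above for a toy two-intent dataset. It is purely illustrative (a hypothetical helper, not an official NeMo utility), but it shows how the intent and slot indices line up with the dictionary files:
+
+.. code-block:: python
+
+    # Toy illustration of the data format above; file contents are made-up examples, not a real dataset.
+    from pathlib import Path
+
+    out = Path("toy_dataset")
+    out.mkdir(exist_ok=True)
+
+    intents = ["weather", "alarm"]
+    slots = ["date", "time", "city", "O"]  # the out-of-scope slot O goes last
+    queries = [
+        ("what is the weather in santa clara tomorrow", "weather", "3 3 3 3 3 2 2 0"),
+        ("set an alarm for six am", "alarm", "3 3 3 3 1 1"),
+    ]
+
+    (out / "dict.intents.csv").write_text("\n".join(intents) + "\n")
+    (out / "dict.slots.csv").write_text("\n".join(slots) + "\n")
+    (out / "train.tsv").write_text(
+        "sentence\tlabel\n" + "\n".join(f"{q}\t{intents.index(i)}" for q, i, _ in queries) + "\n"
+    )
+    (out / "train_slots.tsv").write_text("\n".join(s for _, _, s in queries) + "\n")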
+ + +Dataset Conversion +------------------ + +To convert to the format of the model data, use the ``import_datasets`` utility, which implements +the conversion for the Assistant dataset. Download the dataset `here `_ or you can +write your own converter for the format that you are using for data annotation. + +For a dataset that follows your own annotation format, we recommend using one text file for all +samples of the same intent, with the name of the file as the name of the intent. Use one line per +query, with brackets to define slot names. This is very similar to the assistant format, and you can +adapt this converter utility or your own format with small changes: + +:: + + did i set an alarm to [alarm_type : wake up] in the [timeofday : morning] + +Run the ``dataset_converter`` command: + +.. code:: + + python examples/nlp/intent_slot_classification/data/import_datasets.py + --source_data_dir=`source_data_dir` \ + --target_data_dir=`target_data_dir` \ + --dataset_name=['assistant'|'snips'|'atis'] + +- :code:`source_data_dir`: the directory location of the your dataset +- :code:`target_data_dir`: the directory location where the converted dataset should be saved +- :code:`dataset_name`: one of the implemented dataset names + +After conversion, ``target_data_dir`` should contain the following files: + +.. code:: + + . + |--target_data_dir + |-- dict.intents.csv + |-- dict.slots.csv + |-- train.tsv + |-- train_slots.tsv + |-- test.tsv + |-- test_slots.tsv + +Model Training +-------------- + +This is a pretrained BERT based model with 2 linear classifier heads on the top of it, one for classifying an intent of the query and +another for classifying slots for each token of the query. This model is trained with the combined loss function on the Intent and Slot +classification task on the given dataset. The model architecture is based on the paper `BERT for Joint Intent Classification and Slot Filling `__:cite:`nlp-jis-chen2019bert`. + +For each query, the model classifies it as one the intents from the intent dictionary and for each word of the query it will classify +it as one of the slots from the slot dictionary, including out of scope slot for all the remaining words in the query which does not +fall in another slot category. Out of scope slot (``O``) is a part of slot dictionary that the model is trained on. + +Example of model configuration file for training the model can be found at: `NeMo/examples/nlp/intent_slot_classification/conf/intent_slot_classification.yaml `__. +In the configuration file, define the parameters of the training and the model, although most of the default values will work well. 
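+
+Before going through the configuration groups below, the joint architecture described above can be pictured with a minimal sketch. This is illustrative only; it uses Hugging Face ``transformers`` directly rather than NeMo's actual classes, and the class and variable names are made up:
+
+.. code-block:: python
+
+    # Illustrative sketch of a joint intent/slot model: one sequence-level head, one token-level head.
+    # This is not NeMo's implementation; it only mirrors the idea described in the text above.
+    import torch
+    from transformers import AutoModel
+
+    class JointIntentSlotSketch(torch.nn.Module):
+        def __init__(self, num_intents, num_slots, pretrained_name="bert-base-uncased"):
+            super().__init__()
+            self.encoder = AutoModel.from_pretrained(pretrained_name)
+            hidden = self.encoder.config.hidden_size
+            self.intent_head = torch.nn.Linear(hidden, num_intents)  # one prediction per query
+            self.slot_head = torch.nn.Linear(hidden, num_slots)      # one prediction per token
+
+        def forward(self, input_ids, attention_mask):
+            states = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
+            return self.intent_head(states[:, 0]), self.slot_head(states)
+
+    # The two objectives are combined with a weight w (cf. model.intent_loss_weight described below):
+    #     loss = w * intent_loss + (1 - w) * slot_loss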
+ +The specification can be roughly grouped into three categories: + +- Parameters that describe the training process: **trainer** +- Parameters that describe the model: **model** +- Parameters that describe the datasets: **model.train_ds**, **model.validation_ds**, **model.test_ds**, + +More details about parameters in the spec file can be found below: + ++-------------------------------------------+-----------------+----------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+ +| **Parameter** | **Data Type** | **Default** | **Description** | ++-------------------------------------------+-----------------+----------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+ +| **model.data_dir** | string | -- | The path of the data converted to the specified format. | ++-------------------------------------------+-----------------+----------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+ +| **model.class_balancing** | string | ``null`` | Choose from ``[null, weighted_loss]``. The ``weighted_loss`` enables weighted class balancing of the loss. | ++-------------------------------------------+-----------------+----------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+ +| **model.intent_loss_weight** | float | ``0.6`` | The elation of intent-to-slot loss in the total loss. | ++-------------------------------------------+-----------------+----------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+ +| **model.pad_label** | integer | ``-1`` | A value to pad the inputs. | ++-------------------------------------------+-----------------+----------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+ +| **model.ignore_extra_tokens** | boolean | ``false`` | A flag that specifies whether to ignore extra tokens. | ++-------------------------------------------+-----------------+----------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+ +| **model.ignore_start_end** | boolean | ``true`` | A flag that specifies whether to not use the first and last token for slot training. | ++-------------------------------------------+-----------------+----------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+ +| **model.head.num_output_layers** | integer | ``2`` | The number of fully connected layers of the classifier on top of the BERT model. 
| ++-------------------------------------------+-----------------+----------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+ +| **model.head.fc_dropout** | float | ``0.1`` | The dropout ratio of the fully connected layers. | ++-------------------------------------------+-----------------+----------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+ +| **training_ds.prefix** | string | ``train`` | A prefix for the training file names. | ++-------------------------------------------+-----------------+----------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+ +| **validation_ds.prefix** | string | ``dev`` | A prefix for the validation file names. | ++-------------------------------------------+-----------------+----------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+ +| **test_ds.prefix** | string | ``test`` | A prefix for the test file names. | ++-------------------------------------------+-----------------+----------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+ + +For additional config parameters common to all NLP models, refer to the `nlp_model doc `__. + +The following is an example of the command for training the model: + +.. code:: + + python examples/nlp/intent_slot_classification/intent_slot_classification.py + model.data_dir= \ + trainer.max_epochs= \ + trainer.devices=[] \ + trainer.accelerator='gpu' + + +Required Arguments for Training +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- :code:`model.data_dir`: the dataset directory + + +Optional Arguments +^^^^^^^^^^^^^^^^^^ + +Most of the default parameters in the existing configuration file are already set appropriately, however, there are some parameters +you may want to experiment with. + +- ``trainer.max_epochs``: the number of training epochs (reasonable to be between 10 to 100) +- ``model.class_balancing``: value ``weighted_loss`` may help to train the model when there is unbalanced set of classes +- ``model.intent_loss_weight``: a number between 0 to 1 that defines a weight of the intent lost versus a slot loss during training. A default value 0.6 gives a slight preference for the intent lose optimization. + +Training Procedure +^^^^^^^^^^^^^^^^^^ + +At the start of evaluation, NeMo will print out a log of the experiment specification, a summary of the training dataset, and the +model architecture. + +As the model starts training, you should see a progress bar per epoch. During training, after each epoch, NeMo will display accuracy +metrics on the validation dataset for every intent and slot separately, as well as the total accuracy. You can expect these numbers +to grow up to 50-100 epochs, depending on the size of the trained data. Since this is a joint iIntent and slot training, usually +intent's accuracy will grow first for the initial 10-20 epochs, and after that, slot's accuracy will start improving as well. 
+ +At the end of training, NeMo saves the best checkpoint on the validation dataset at the path specified by the experiment spec file +before finishing. + +.. code:: + + GPU available: True, used: True + TPU available: None, using: 0 TPU cores + LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2] + [NeMo W 2021-01-28 14:52:19 exp_manager:299] There was no checkpoint folder at checkpoint_dir :results/checkpoints. Training from scratch. + [NeMo I 2021-01-28 14:52:19 exp_manager:186] Experiments will be logged at results + ... + label precision recall f1 support + weather.weather (label_id: 0) 0.00 0.00 0.00 128 + weather.temperature (label_id: 1) 0.00 0.00 0.00 0 + weather.temperature_yes_no (label_id: 2) 0.00 0.00 0.00 0 + weather.rainfall (label_id: 3) 0.00 0.00 0.00 0 + weather.rainfall_yes_no (label_id: 4) 0.00 0.00 0.00 0 + weather.snow (label_id: 5) 0.00 0.00 0.00 0 + weather.snow_yes_no (label_id: 6) 0.00 0.00 0.00 0 + weather.humidity (label_id: 7) 0.00 0.00 0.00 0 + weather.humidity_yes_no (label_id: 8) 0.00 0.00 0.00 0 + weather.windspeed (label_id: 9) 0.00 0.00 0.00 0 + weather.sunny (label_id: 10) 0.00 0.00 0.00 0 + weather.cloudy (label_id: 11) 0.00 0.00 0.00 0 + weather.alert (label_id: 12) 0.00 0.00 0.00 0 + context.weather (label_id: 13) 0.00 0.00 0.00 0 + context.continue (label_id: 14) 0.00 0.00 0.00 0 + context.navigation (label_id: 15) 0.00 0.00 0.00 0 + context.rating (label_id: 16) 0.00 0.00 0.00 0 + context.distance (label_id: 17) 0.00 0.00 0.00 0 + ------------------- + micro avg 0.00 0.00 0.00 128 + macro avg 0.00 0.00 0.00 128 + weighted avg 0.00 0.00 0.00 128 + +Model Evaluation and Inference +------------------------------ + +There is no separate script for the evaluation and inference of this model in NeMo, however, inside of the example file `examples/nlp/intent_slot_classification/intent_slot_classification.py` +after the training part is finished, you can see the code that evaluates the trained model on an evaluation test set and then an example of doing inference using a list of given queries. + +For the deployment in the production environment, refer to `NVIDIA Riva `__ and `NVIDIA TLT documentation `__. + +References +---------- + +.. bibliography:: nlp_all.bib + :style: plain + :labelprefix: NLP-JIS + :keyprefix: nlp-jis- diff --git a/docs/source/nlp/language_modeling.rst b/docs/source/nlp/language_modeling.rst new file mode 100644 index 0000000000000000000000000000000000000000..854afe6ac6eac7369bd6493c5fe3213ed7c938cf --- /dev/null +++ b/docs/source/nlp/language_modeling.rst @@ -0,0 +1,283 @@ +.. _language_modeling: + +Language Modeling +================= + +A language model (LM) estimates the joint probability of a given text corpus :math:`(x_1,\dots,x_T)` by factorizing it with a chain rule :math:`P(x_1,\dots,x_T) = \prod_{t=1}^T P(x_t|x_1,\dots,x_{t-1})` and sequentially modeling each conditional term in the product. To simplify modeling, it is often assumed that the context size (a number of preceding words) necessary to predict each word :math:`x_t` in the corpus is limited to :math:`N:\;P(x_t|x_1,\dots,x_{t-1}) \approx P(x_t|x_{t-N},\dots,x_{t-1})`. This approximation is commonly referred to as N-gram LM. + +Currently, we mainly support sentence-level LMs which do not consider long-term dependencies and model all sentences independently of each other. Our models are based on the Transformer sequence-to-sequence architecture :cite:`nlp-language_modeling-vaswani2017attention`. 
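+
+As a purely illustrative picture of the factorization above, the sketch below scores a sentence left-to-right while truncating the context to the last N-1 tokens (the N-gram assumption); ``log_p`` is a stand-in for any model that returns log-probabilities:
+
+.. code-block:: python
+
+    # Illustrative only: chain-rule scoring with an N-gram context limit; log_p is a stand-in callable.
+    import math
+
+    def sentence_log_prob(tokens, log_p, n=3):
+        """Sum log P(x_t | x_{t-N+1}, ..., x_{t-1}) over the sentence."""
+        total = 0.0
+        for t, token in enumerate(tokens):
+            context = tuple(tokens[max(0, t - (n - 1)):t])
+            total += log_p(token, context)
+        return total
+
+    # Example with a toy uniform "model" over a 10-word vocabulary:
+    print(sentence_log_prob("let us pretermit that long comparison".split(),
+                            log_p=lambda token, context: math.log(1 / 10)))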
+
+| An example script on how to train the model can be found here: `NeMo/examples/nlp/language_modeling/transformer_lm.py `_.
+| The default configuration file for the model can be found at: `NeMo/examples/nlp/language_modeling/conf/transformer_lm_config.yaml `_.
+
+
+Data Format
+-----------
+
+Unsupervised LMs require a corpus which comprises many examples of sentences from a particular domain (Wikipedia, news, Pubmed abstracts, etc.). We assume that the data is formatted as a text file where each line corresponds to a separate sentence:
+
+.. list-table::
+    :widths: 100
+    :header-rows: 1
+
+    * - Sentence-level LM corpus
+    * - in a silver cake basket as the panins had at their party
+    * - let us pretermit that long comparison
+    * - poverty contempt and sickness treading on my heels i easily resolve not to be affrighted
+
+It is common practice to apply data cleaning, normalization, and tokenization to the data prior to training an LM, and
+NeMo expects already cleaned, normalized, and tokenized data. The only data pre-processing NeMo does is subword tokenization with BPE :cite:`nlp-language_modeling-sennrich2015neural`.
+
+.. note::
+    If the LM is intended to be used in conjunction with another model (e.g. :ref:`re-scoring of ASR `, shallow fusion with NMT), make sure that the training data is preprocessed accordingly (lower-case, no punctuation for ASR, Moses tokenization/normalization for NMT). Otherwise, it might introduce inadequate LM scores.
+
+
+Tokenizer Training
+------------------
+
+Our LMs support all tokenizers available in NeMo, but require special beginning-of-string ```` and end-of-string ```` tokens.
+
+Below is an example of training a `YouTokenToMe `__ BPE tokenizer:
+
+.. code-block:: python
+
+    import youtokentome as yttm
+
+    data = "/path/to/train.txt"         # string, path to file with training data
+    model = "/path/to/tokenizer.model"  # string, path to where the trained model will be saved
+    vocab_size = 32000                  # int, number of tokens in the final vocabulary
+    yttm.BPE.train(data, model, vocab_size)
+
+
+Sentence Dataset Construction
+-----------------------------
+
+Given a BPE tokenizer and a cleaned sentence-level text corpus, the following steps are applied to create a `SentenceDataset `__ object.
+
+#. Text to IDs - Performs tokenization with the specified tokenizer model on an input sentence and maps it to a sequence of tokens.
+
+#. Bucketing - Sentences vary in length and when creating minibatches, we'd like sentences in them to have roughly the same length to minimize the number of ```` tokens and to maximize computational efficiency. This step groups sentences of roughly the same length into buckets.
+
+#. Batching and padding - Creates minibatches with a maximum number of tokens specified by ``model.{train_ds,validation_ds,test_ds}.tokens_in_batch`` from buckets and pads, so they can be packed into a tensor.
+
+To use ``SentenceDataset``, specify the path to the training data in ``file_name`` in the experiment config file. 
Below is the list of all available configuration options: + ++-------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------------+ +| **Parameter** | **Data Type** | **Default** | **Description** | ++-------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.file_name** | str | ``null`` | Path to the file with sentences. | ++-------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.tokens_in_batch** | int | ``512`` | Maximum number of tokens per minibatch. | ++-------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.max_seq_length** | int | ``512`` | Maximum sequence length, to be used with the ``clean`` argument below. | ++-------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.clean** | bool | ``true`` | Whether to clean the dataset by discarding examples that are greater than ``max_seq_length``. | ++-------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.shuffle** | bool | ``true`` | Whether to shuffle minibatches in the PyTorch DataLoader. | ++-------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.num_samples** | int | ``-1`` | Number of samples to use. ``-1`` for the entire dataset. | ++-------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.pin_memory** | bool | ``false`` | Whether to pin memory in the PyTorch DataLoader. | ++-------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.num_workers** | int | ``8`` | Number of workers for the PyTorch DataLoader. 
| ++-------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------------+ + + +Model Configuration and Training +-------------------------------- + +The overall model consists of an encoder and a classification head with the following configuration options: + +.. list-table:: *Transformer Encoder Network* + :widths: 30 5 5 60 + :header-rows: 1 + + * - Parameter + - Data Type + - Default + - Description + * - **model.encoder.max_sequence_length** + - int + - ``512`` + - Maximum allowed sequence length. + * - **model.encoder.learn_positional_encodings** + - bool + - ``false`` + - If ``true``, this is a regular learnable embedding layer. If ``false``, fixes position encodings to sinusoidal. + * - **model.encoder.hidden_size** + - int + - ``512`` + - Size of the transformer hidden states. + * - **model.encoder.num_layers** + - int + - ``6`` + - Number of transformer layers. + * - **model.encoder.inner_size** + - int + - ``2048`` + - Size of the hidden states within the feedforward layers. + * - **model.encoder.num_attention_heads** + - int + - ``8`` + - Number of attention heads. + * - **model.encoder.embedding_dropout** + - float + - ``0.1`` + - Dropout probability of the embedding layer. + * - **model.encoder.ffn_dropout** + - float + - ``0.1`` + - Dropout probability within the feedforward layers. + * - **model.encoder.attn_score_dropout** + - float + - ``0.1`` + - Dropout probability of the attention scores before softmax normalization. + * - **model.encoder.attn_layer_dropout** + - float + - ``0.1`` + - Dropout probability of the attention query, key, and value projection activations. + * - **model.encoder.hidden_act** + - str + - ``relu`` + - Activation function throughout the network. + * - **model.encoder.mask_future** + - bool + - ``true`` + - Whether to mask future timesteps for attention. Defaults to ``true`` for the standard left-to-right LM. + * - **model.encoder.pre_ln** + - bool + - ``false`` + - Whether to apply layer-normalization before (``true``) or after (``false``) a sub-layer. + +.. list-table:: *Head Network (multilayer perceptron)* + :widths: 30 5 5 60 + :header-rows: 1 + + * - Parameter + - Data Type + - Default + - Description + * - **model.head.num_layers** + - int + - ``1`` + - Number of layers in the head network. + * - **model.head.activation** + - str + - ``relu`` + - Activation function used after each layer. + * - **model.head.log_softmax** + - bool + - ``true`` + - Whether to apply ``log_softmax`` to the final layer output. + * - **model.head.dropout** + - float + - ``0.0`` + - Dropout probability after each layer. + + +Our pre-trained models are optimized with Adam, with a maximum learning of 0.001, beta of (0.9, 0.98), and inverse square root learning rate schedule from. The **model.optim** section sets the optimization parameters. + +The following script trains 6-layer Transformer LM: + +.. 
code::
+
+    python examples/nlp/language_modeling/transformer_lm.py \
+    -cn transformer_lm_config \
+    trainer.devices=2 \
+    trainer.accelerator='gpu' \
+    +exp_manager.exp_dir=/path/to/store/results \
+    +exp_manager.create_checkpoint_callback=True \
+    +exp_manager.checkpoint_callback_params.monitor=val_PPL \
+    +exp_manager.checkpoint_callback_params.mode=min \
+    +exp_manager.checkpoint_callback_params.save_top_k=5 \
+    model.train_ds.file_name=/path/to/train.txt \
+    model.validation_ds.file_name=/path/to/valid.txt \
+    model.tokenizer.tokenizer_model=/path/to/yttm_tokenizer_model
+
+The trainer keeps track of the LM perplexity (PPL) on the provided validation set and saves the top 5 (by default) checkpoints by PPL. At the end of training, a ``.nemo`` file is written to the result directory, which allows running inference on a test set.
+
+
+Tarred Datasets for Large Corpora
+---------------------------------
+
+When training with ``DistributedDataParallel``, each process has its own copy of the dataset. For large datasets, this may not always fit in CPU memory. `Webdatasets `__ circumvents this problem by efficiently iterating over tar files stored on disk. Each tar file can contain hundreds to thousands of pickle files, each containing a single minibatch. We recommend using this method when working with datasets of more than 5 million sentences.
+
+To use an existing ``TarredSentenceDataset`` instead of a non-tarred ``SentenceDataset``, set ``is_tarred: true`` in
+the experiment config file. Then, pass in the path to the metadata file in ``metadata_file`` and paths to all of the text tarballs in ``tar_files``, either as a list
+of filepaths, e.g. ``['/data/shard1.tar', '/data/shard2.tar']``, or in a single brace-expandable string, e.g.
+``'/data/shard_{1..64}.tar'`` or ``'/data/shard__OP_1..64_CL_'`` (recommended, see note below).
+
+.. note::
+    For brace expansion, there may be cases where ``{x..y}`` syntax cannot be used due to shell interference. This occurs most commonly
+    inside SLURM scripts. Therefore, we provide a few equivalent replacements. Supported opening braces (equivalent to ``{``) are ``(``,
+    ``[``, ``<`` and the special tag ``_OP_``. Supported closing braces (equivalent to ``}``) are ``)``, ``]``, ``>`` and the special
+    tag ``_CL_``. For SLURM based tasks, we suggest the use of the special tags for ease of use.
+
+Tarred datasets for sentence-level LMs can be created with the following script:
+
+.. code::
+
+    python examples/nlp/machine_translation/create_tarred_monolingual_dataset.py \
+        --pkl_file_prefix lm \
+        --tokenizer_model /path/to/tokenizer_model \
+        --fname /path/to/training_data \
+        --out_dir /path/to/tarred_dataset \
+        --tokens_in_batch 2048 \
+        --num_batches_per_tarfile 250
+
+For example, if your dataset contains 10000 batches, the script above will create 40 tarballs and the output directory will look similar to the following:
+
+.. code::
+
+    /path/to/tarred_dataset
+    ├── lm-batches.tokens.2048.1.tar
+    ├── lm-batches.tokens.2048.2.tar
+    ├── ...
+    ├── lm-batches.tokens.2048.40.tar
+    └── metadata.json
+
+To train the model on this dataset, the following parameters have to be specified in the **model.train_ds** section:
+
+.. code::
+
+    use_tarred_dataset: true
+    tar_files: /path/to/tarred_dataset/lm-batches.2048._OP_1..40_CL_
+    metadata_file: /path/to/tarred_dataset/metadata.json
+
+Below is the full list of available configuration options for ``TarredSentenceDataset``:
+
+.. 
list-table:: + :widths: 30 5 5 60 + :header-rows: 1 + + * - Parameter + - Data Type + - Default + - Description + * - **model.{train_ds,validation_ds,test_ds}.use_tarred_dataset** + - bool + - ``false`` + - Whether to use tarred datasets. + * - **model.{train_ds,validation_ds,test_ds}.tar_files** + - str + - ``null`` + - Path to all tar files. Either a list or a single brace-expandable string. + * - **model.{train_ds,validation_ds,test_ds}.metadata_file** + - str + - ``null`` + - Path to JSON metadata file that contains only a single entry for the total number of batches in the dataset. + * - **model.{train_ds,validation_ds,test_ds}.tar_shuffle_n** + - int + - ``100`` + - How many samples to look ahead and load to be shuffled. + * - **model.{train_ds,validation_ds,test_ds}.shard_strategy** + - str + - ``scatter`` + - How the shards are distributed between multiple workers. Either ``scatter`` (each node gets a unique set of shards) or ``replicate`` (each node gets all of the set of shards available in the tarred dataset). + +References +---------- + +.. bibliography:: nlp_all.bib + :style: plain + :labelprefix: nlp-language_modeling + :keyprefix: nlp-language_modeling- diff --git a/docs/source/nlp/machine_translation/machine_translation.rst b/docs/source/nlp/machine_translation/machine_translation.rst new file mode 100644 index 0000000000000000000000000000000000000000..190ac5b07da9a733b1abdd006fe0abed97278688 --- /dev/null +++ b/docs/source/nlp/machine_translation/machine_translation.rst @@ -0,0 +1,783 @@ +.. _machine_translation: + +Machine Translation Models +========================== +Machine translation is the task of translating text from one language to another. For example, from English to Spanish. Models are +based on the Transformer sequence-to-sequence architecture :cite:`nlp-machine_translation-vaswani2017attention`. + +An example script on how to train the model can be found here: `NeMo/examples/nlp/machine_translation/enc_dec_nmt.py `__. +The default configuration file for the model can be found at: `NeMo/examples/nlp/machine_translation/conf/aayn_base.yaml `__. + +Quick Start Guide +----------------- + +.. code-block:: python + + from nemo.collections.nlp.models import MTEncDecModel + + # To get the list of pre-trained models + MTEncDecModel.list_available_models() + + # Download and load the a pre-trained to translate from English to Spanish + model = MTEncDecModel.from_pretrained("nmt_en_es_transformer24x6") + + # Translate a sentence or list of sentences + translations = model.translate(["Hello!"], source_lang="en", target_lang="es") + +Available Models +^^^^^^^^^^^^^^^^ + +.. 
list-table:: *Pretrained Models*
+    :widths: 5 10
+    :header-rows: 1
+
+    * - Model
+      - Pretrained Checkpoint
+    * - *New Checkpoints*
+      -
+    * - English -> German
+      - https://ngc.nvidia.com/catalog/models/nvidia:nemo:nmt_en_de_transformer24x6
+    * - German -> English
+      - https://ngc.nvidia.com/catalog/models/nvidia:nemo:nmt_de_en_transformer24x6
+    * - English -> Spanish
+      - https://ngc.nvidia.com/catalog/models/nvidia:nemo:nmt_en_es_transformer24x6
+    * - Spanish -> English
+      - https://ngc.nvidia.com/catalog/models/nvidia:nemo:nmt_es_en_transformer24x6
+    * - English -> French
+      - https://ngc.nvidia.com/catalog/models/nvidia:nemo:nmt_en_fr_transformer24x6
+    * - French -> English
+      - https://ngc.nvidia.com/catalog/models/nvidia:nemo:nmt_fr_en_transformer24x6
+    * - English -> Russian
+      - https://ngc.nvidia.com/catalog/models/nvidia:nemo:nmt_en_ru_transformer24x6
+    * - Russian -> English
+      - https://ngc.nvidia.com/catalog/models/nvidia:nemo:nmt_ru_en_transformer24x6
+    * - English -> Chinese
+      - https://ngc.nvidia.com/catalog/models/nvidia:nemo:nmt_en_zh_transformer24x6
+    * - Chinese -> English
+      - https://ngc.nvidia.com/catalog/models/nvidia:nemo:nmt_zh_en_transformer24x6
+    * - *Old Checkpoints*
+      -
+    * - English -> German
+      - https://ngc.nvidia.com/catalog/models/nvidia:nemo:nmt_en_de_transformer12x2
+    * - German -> English
+      - https://ngc.nvidia.com/catalog/models/nvidia:nemo:nmt_de_en_transformer12x2
+    * - English -> Spanish
+      - https://ngc.nvidia.com/catalog/models/nvidia:nemo:nmt_en_es_transformer12x2
+    * - Spanish -> English
+      - https://ngc.nvidia.com/catalog/models/nvidia:nemo:nmt_es_en_transformer12x2
+    * - English -> French
+      - https://ngc.nvidia.com/catalog/models/nvidia:nemo:nmt_en_fr_transformer12x2
+    * - French -> English
+      - https://ngc.nvidia.com/catalog/models/nvidia:nemo:nmt_fr_en_transformer12x2
+    * - English -> Russian
+      - https://ngc.nvidia.com/catalog/models/nvidia:nemo:nmt_en_ru_transformer6x6
+    * - Russian -> English
+      - https://ngc.nvidia.com/catalog/models/nvidia:nemo:nmt_ru_en_transformer6x6
+    * - English -> Chinese
+      - https://ngc.nvidia.com/catalog/models/nvidia:nemo:nmt_en_zh_transformer6x6
+    * - Chinese -> English
+      - https://ngc.nvidia.com/catalog/models/nvidia:nemo:nmt_zh_en_transformer6x6
+
+Data Format
+-----------
+
+Supervised machine translation models require parallel corpora which comprise many examples of sentences in a source language and
+their corresponding translation in a target language. We use parallel data formatted as separate text files for source and target
+languages where sentences in corresponding files are aligned as in the table below.
+
+.. list-table:: *Parallel Corpus*
+    :widths: 10 10
+    :header-rows: 1
+
+    * - train.english.txt
+      - train.spanish.txt
+    * - Hello .
+      - Hola .
+    * - Thank you .
+      - Gracias .
+    * - You can now translate from English to Spanish in NeMo .
+      - Ahora puedes traducir del inglés al español en NeMo .
+
+It is common practice to apply data cleaning, normalization, and tokenization to the data prior to training a translation model, and
+NeMo expects already cleaned, normalized, and tokenized data. The only data pre-processing NeMo does is subword tokenization with BPE
+:cite:`nlp-machine_translation-sennrich2015neural`.
+
+Data Cleaning, Normalization & Tokenization
+-------------------------------------------
+
+We recommend applying the following steps to clean, normalize, and tokenize your data. All released pre-trained models apply these data pre-processing steps.
+
+#. 
Please take a look at a detailed notebook on best practices to pre-process and clean your datasets - NeMo/tutorials/nlp/Data_Preprocessing_and_Cleaning_for_NMT.ipynb
+
+#. Language ID filtering - This step filters out examples from your training dataset that aren't in the correct language. For example,
+   many datasets contain examples where source and target sentences are in the same language. You can use a pre-trained language ID
+   classifier from `fastText `__. Install fastText; you can then run our script using the
+   ``lid.176.bin`` model downloaded from the fastText website.
+
+   .. code::
+
+       python NeMo/scripts/neural_machine_translation/filter_langs_nmt.py \
+           --input-src train.en \
+           --input-tgt train.es \
+           --output-src train_lang_filtered.en \
+           --output-tgt train_lang_filtered.es \
+           --source-lang en \
+           --target-lang es \
+           --removed-src train_noise.en \
+           --removed-tgt train_noise.es \
+           --fasttext-model lid.176.bin
+
+#. Length filtering - We filter out sentences from the data that are below a minimum length (1) or exceed a maximum length (250). We
+   also filter out sentences where the ratio between source and target lengths exceeds 1.3, except for English <-> Chinese models.
+   `Moses `__ is a statistical machine translation toolkit that contains many useful
+   pre-processing scripts.
+
+   .. code::
+
+       perl mosesdecoder/scripts/training/clean-corpus-n.perl -ratio 1.3 train en es train.filter 1 250
+
+#. Data cleaning - While language ID filtering can sometimes help with filtering out noisy sentences that contain too much punctuation,
+   it does not help in cases where the translations are potentially incorrect, disfluent, or incomplete. We use `bicleaner `__,
+   a tool that identifies such sentences. It trains a classifier based on many features, including pre-trained language model fluency, word
+   alignment scores from a word-alignment model like `Giza++ `__, etc. We use their available
+   pre-trained models wherever possible and train models ourselves using their framework for the remaining languages. The following script
+   applies a pre-trained bicleaner model to the data and picks sentences that are clean with probability > 0.5.
+
+   .. code::
+
+       awk '{print "-\t-"}' train.en \
+       | paste -d "\t" - train.filter.en train.filter.es \
+       | bicleaner-classify - - 
> train.en-es.bicleaner.score
+
+#. Data deduplication - We use `bifixer `__ (which uses xxHash) to hash the source and target
+   sentences, based on which we remove duplicate entries from the file. You may want to do something similar to remove training examples
+   that are in the test dataset.
+
+   .. code::
+
+       cat train.en-es.bicleaner.score \
+       | parallel -j 25 --pipe -k -l 30000 python bifixer.py --ignore-segmentation -q - - en es \
+       > train.en-es.bifixer.score
+
+       awk -F "\t" '!seen[$6]++' train.en-es.bifixer.score > train.en-es.bifixer.dedup.score
+
+#. Filter out data to which bifixer assigns a probability < 0.5.
+
+   .. code::
+
+       awk -F "\t" '{ if ($5>0.5) {print $3}}' train.en-es.bifixer.dedup.score > train.cleaned.en
+       awk -F "\t" '{ if ($5>0.5) {print $4}}' train.en-es.bifixer.dedup.score > train.cleaned.es
+
+#. Punctuation normalization - Punctuation, especially quotes, can be written in different ways.
+   It's often useful to normalize the way it appears in text. We use the Moses punctuation normalizer on all languages except Chinese.
+
+   .. code::
+
+       perl mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l es < train.cleaned.es > train.normalized.es
+       perl mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l en < train.cleaned.en > train.normalized.en
+
+   For example:
+
+   .. code::
+
+       Before - Aquí se encuentran joyerías como Tiffany`s entre negocios tradicionales suizos como la confitería Sprüngli.
+       After - Aquí se encuentran joyerías como Tiffany's entre negocios tradicionales suizos como la confitería Sprüngli.
+
+#. Tokenization and word segmentation for Chinese - Naturally written text often contains punctuation markers like commas, full stops,
+   and apostrophes that are attached to words. Tokenization by just splitting a string on spaces will result in separate token IDs for
+   very similar items like ``NeMo`` and ``NeMo.``. Tokenization splits punctuation from the word to create two separate tokens. In the
+   previous example, ``NeMo.`` becomes ``NeMo .``, which, when split by space, results in two tokens and addresses the earlier problem.
+
+   For example:
+
+   .. code::
+
+       Before - Especialmente porque se enfrentará "a Mathieu (Debuchy), Yohan (Cabaye) y Adil (Rami) ", recuerda.
+       After - Especialmente porque se enfrentará " a Mathieu ( Debuchy ) , Yohan ( Cabaye ) y Adil ( Rami ) " , recuerda .
+
+   We use the Moses tokenizer for all languages except Chinese.
+
+   .. code::
+
+       perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l es -no-escape < train.normalized.es > train.tokenized.es
+       perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l en -no-escape < train.normalized.en > train.tokenized.en
+
+   For languages like Chinese, where there is no explicit marker like spaces separating words, we use `Jieba `__ to segment a string into words that are space-separated.
+
+   For example:
+
+   .. code::
+
+       Before - 同时,卫生局认为有必要接种的其他人员,包括公共部门,卫生局将主动联络有关机构取得名单后由卫生中心安排接种。
+       After - 同时 , 卫生局 认为 有 必要 接种 的 其他 人员 , 包括 公共部门 , 卫生局 将 主动 联络 有关 机构 取得 名单 后 由 卫生 中心 安排 接种 。
+
+Training a BPE Tokenization
+---------------------------
+
+Byte-pair encoding (BPE) :cite:`nlp-machine_translation-sennrich2015neural` is a sub-word tokenization algorithm that is commonly used
+to reduce the large vocabulary size of datasets by splitting words into frequently occurring sub-words. Currently, machine translation
+only supports the `YouTokenToMe `__ BPE tokenizer. 
One can set the tokenization configuration +as follows: + ++-----------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------+ +| **Parameter** | **Data Type** | **Default** | **Description** | ++-----------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------+ +| **model.{encoder_tokenizer,decoder_tokenizer}.tokenizer_name** | str | ``yttm`` | BPE library name. Only supports ``yttm`` for now. | ++-----------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------+ +| **model.{encoder_tokenizer,decoder_tokenizer}.tokenizer_model** | str | ``null`` | Path to an existing YTTM BPE model. If ``null``, will train one from scratch on the provided data. | ++-----------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------+ +| **model.{encoder_tokenizer,decoder_tokenizer}.vocab_size** | int | ``null`` | Desired vocabulary size after BPE tokenization. | ++-----------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------+ +| **model.{encoder_tokenizer,decoder_tokenizer}.bpe_dropout** | float | ``null`` | BPE dropout probability. :cite:`nlp-machine_translation-provilkov2019bpe`. | ++-----------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------+ +| **model.{encoder_tokenizer,decoder_tokenizer}.vocab_file** | str | ``null`` | Path to pre-computed vocab file if exists. | ++-----------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------+ +| **model.shared_tokenizer** | bool | ``True`` | Whether to share the tokenizer between the encoder and decoder. | ++-----------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------+ + + +Applying BPE Tokenization, Batching, Bucketing and Padding +---------------------------------------------------------- + +Given BPE tokenizers, and a cleaned parallel corpus, the following steps are applied to create a `TranslationDataset `__ object. + +#. Text to IDs - This performs subword tokenization with the BPE model on an input string and maps it to a sequence of tokens for the + source and target text. + +#. Bucketing - Sentences vary in length and when creating minibatches, we'd like sentences in them to have roughly the same length to + minimize the number of ```` tokens and to maximize computational efficiency. This step groups sentences roughly the same length + into buckets. + +#. Batching and padding - Creates minibatches with a maximum number of tokens specified by ``model.{train_ds,validation_ds,test_ds}.tokens_in_batch`` + from buckets and pads, so they can be packed into a tensor. 
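+
+The bucketing and padding steps are an efficiency device rather than part of the model itself. A rough, illustrative sketch of the idea (not NeMo's actual ``TranslationDataset`` code; the function names are made up) looks like this:
+
+.. code-block:: python
+
+    # Rough sketch of length-bucketing and token-budget batching for parallel data (illustrative only).
+    from collections import defaultdict
+
+    def make_batches(token_pairs, tokens_in_batch=512, pad_id=0, bucket_width=8):
+        buckets = defaultdict(list)
+        for src, tgt in token_pairs:  # group sentence pairs of similar lengths together
+            buckets[(len(src) // bucket_width, len(tgt) // bucket_width)].append((src, tgt))
+        for pairs in buckets.values():
+            batch, budget = [], 0
+            for src, tgt in pairs:
+                batch.append((src, tgt))
+                budget += max(len(src), len(tgt))
+                if budget >= tokens_in_batch:
+                    yield pad(batch, pad_id)
+                    batch, budget = [], 0
+            if batch:
+                yield pad(batch, pad_id)
+
+    def pad(batch, pad_id):
+        src_len = max(len(s) for s, _ in batch)
+        tgt_len = max(len(t) for _, t in batch)
+        return ([s + [pad_id] * (src_len - len(s)) for s, _ in batch],
+                [t + [pad_id] * (tgt_len - len(t)) for _, t in batch])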
+ +Datasets can be configured as follows: + ++-------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------------+ +| **Parameter** | **Data Type** | **Default** | **Description** | ++-------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.src_file_name** | str | ``null`` | Path to the source language file. | ++-------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.tgt_file_name** | str | ``null`` | Path to the target language file. | ++-------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.tokens_in_batch** | int | ``512`` | Maximum number of tokens per minibatch. | ++-------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.clean** | bool | ``true`` | Whether to clean the dataset by discarding examples that are greater than ``max_seq_length``. | ++-------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.max_seq_length** | int | ``512`` | Maximum sequence to be used with the ``clean`` argument above. | ++-------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.shuffle** | bool | ``true`` | Whether to shuffle minibatches in the PyTorch DataLoader. | ++-------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.num_samples** | int | ``-1`` | Number of samples to use. ``-1`` for the entire dataset. | ++-------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.drop_last** | bool | ``false`` | Drop last minibatch if it is not of equal size to the others. | ++-------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.pin_memory** | bool | ``false`` | Whether to pin memory in the PyTorch DataLoader. 
| ++-------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.num_workers** | int | ``8`` | Number of workers for the PyTorch DataLoader. | ++-------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------------+ + + +Tarred Datasets for Large Corpora +--------------------------------- + +When training with ``DistributedDataParallel``, each process has its own copy of the dataset. For large datasets, this may not always +fit in CPU memory. `Webdatasets `__ circumvents this problem by efficiently iterating over +tar files stored on disk. Each tar file can contain hundreds to thousands of pickle files, each containing a single minibatch. + +We recommend using this method when working with datasets with > 1 million sentence pairs. + +Tarred datasets can be configured as follows: + ++-----------------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------+ +| **Parameter** | **Data Type** | **Default** | **Description** | ++-----------------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.use_tarred_dataset** | bool | ``false`` | Whether to use tarred datasets. | ++-----------------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.tar_files** | str | ``null`` | String specifying path to all tar files. Example with 100 tarfiles ``/path/to/tarfiles._OP_1..100_CL_.tar``. | ++-----------------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.metadata_file** | str | ``null`` | Path to JSON metadata file that contains only a single entry for the total number of batches in the dataset. | ++-----------------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.lines_per_dataset_fragment** | int | ``1000000`` | Number of lines to consider for bucketing and padding. | ++-----------------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.num_batches_per_tarfile** | int | ``100`` | Number of batches (pickle files) within each tarfile. 
| ++-----------------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.tar_shuffle_n** | int | ``100`` | How many samples to look ahead and load to be shuffled. | ++-----------------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------+ +| **model.{train_ds,validation_ds,test_ds}.shard_strategy** | str | ``scatter`` | How the shards are distributed between multiple workers. | ++-----------------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------+ +| **model.preproc_out_dir** | str | ``null`` | Path to folder that contains processed tar files or directory where new tar files are written. | ++-----------------------------------------------------------------------+-----------------+----------------+----------------------------------------------------------------------------------------------------------------+ + +Tarred datasets can be created in two ways: + +#. Using the Hydra config and `training script `__. + + For example: + + .. code :: + + python examples/nlp/machine_translation/enc_dec_nmt.py \ + -cn aayn_base \ + do_training=false \ + model.preproc_out_dir=/path/to/preproc_dir \ + model.train_ds.use_tarred_dataset=true \ + model.train_ds.lines_per_dataset_fragment=1000000 \ + model.train_ds.num_batches_per_tarfile=200 \ + model.train_ds.src_file_name=train.tokenized.en \ + model.train_ds.tgt_file_name=train.tokenized.es \ + model.validation_ds.src_file_name=validation.tokenized.en \ + model.validation_ds.tgt_file_name=validation.tokenized.es \ + model.encoder_tokenizer.vocab_size=32000 \ + model.decoder_tokenizer.vocab_size=32000 \ + ~model.test_ds \ + trainer.devices=[0,1,2,3] \ + trainer.accelerator='gpu' \ + +trainer.fast_dev_run=true \ + exp_manager=null \ + + The above script processes the parallel tokenized text files into tarred datasets that are written to ``/path/to/preproc_dir``. Since + ``do_training`` is set to ``False``, the above script only creates tarred datasets and then exits. If ``do_training`` is set ``True``, + then one of two things happen: + + (a) If no tar files are present in ``model.preproc_out_dir``, the script first creates those files and then commences training. + (b) If tar files are already present in ``model.preproc_out_dir``, the script starts training from the provided tar files. + +#. Using a separate script without Hydra. + + Tarred datasets for parallel corpora can also be created with a script that doesn't require specifying a configs via Hydra and + just uses Python argparse. + + For example: + + .. code :: + + python examples/nlp/machine_translation/create_tarred_parallel_dataset.py \ + --shared_tokenizer \ + --clean \ + --bpe_dropout 0.1 \ + --src_fname train.tokenized.en \ + --tgt_fname train.tokenized.es \ + --out_dir /path/to/preproc_dir \ + --vocab_size 32000 \ + --max_seq_length 512 \ + --min_seq_length 1 \ + --tokens_in_batch 8192 \ + --lines_per_dataset_fragment 1000000 \ + --num_batches_per_tarfile 200 + + You can then set `model.preproc_out_dir=/path/to/preproc_dir` and `model.train_ds.use_tarred_dataset=true` to train with this data. 
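+
+To verify the contents of a tarred dataset, it can be helpful to open one shard directly. The following is a minimal
+illustrative sketch (not part of NeMo) that assumes each tar member is a pickled minibatch, as described above; the
+shard path and the member names inside your tar files may differ.
+
+.. code-block:: python
+
+    import pickle
+    import tarfile
+
+    # Hypothetical path to a single shard written by the tarred dataset creation step.
+    shard_path = "/path/to/preproc_dir/batches.tokens.8192.1.tar"
+
+    with tarfile.open(shard_path, "r") as tar:
+        for member in tar.getmembers():
+            if not member.isfile():
+                continue
+            with tar.extractfile(member) as f:
+                batch = pickle.load(f)  # one pickled minibatch per tar member
+            # Inspect the first batch and stop; the exact structure of ``batch``
+            # depends on how the shards were written.
+            print(member.name, type(batch))
+            break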
+ +Model Configuration and Training +-------------------------------- + +The overall model consists of an encoder, decoder, and classification head. Encoders and decoders have the following configuration +options: + ++-------------------------------------------------------------------+-----------------+-----------------------+-----------------------------------------------------------------------------------------------------------------+ +| **Parameter** | **Data Type** | **Default** | **Description** | ++-------------------------------------------------------------------+-----------------+-----------------------+-----------------------------------------------------------------------------------------------------------------+ +| **model.{encoder,decoder}.max_sequence_length** | int | ``512`` | Maximum sequence length of positional encodings. | ++-------------------------------------------------------------------+-----------------+-----------------------+-----------------------------------------------------------------------------------------------------------------+ +| **model.{encoder,decoder}.embedding_dropout** | float | ``0.1`` | Path to JSON metadata file that contains only a single entry for the total number of batches in the dataset. | ++-------------------------------------------------------------------+-----------------+-----------------------+-----------------------------------------------------------------------------------------------------------------+ +| **model.{encoder,decoder}.learn_positional_encodings** | bool | ``false`` | If ``True``, this is a regular learnable embedding layer. If ``False``, fixes position encodings to sinusoidal. | ++-------------------------------------------------------------------+-----------------+-----------------------+-----------------------------------------------------------------------------------------------------------------+ +| **model.{encoder,decoder}.hidden_size** | int | ``512`` | Size of the transformer hidden states. | ++-------------------------------------------------------------------+-----------------+-----------------------+-----------------------------------------------------------------------------------------------------------------+ +| **model.{encoder,decoder}.num_layers** | int | ``6`` | Number of transformer layers. | ++-------------------------------------------------------------------+-----------------+-----------------------+-----------------------------------------------------------------------------------------------------------------+ +| **model.{encoder,decoder}.inner_size** | int | ``2048`` | Size of the hidden states within the feedforward layers. | ++-------------------------------------------------------------------+-----------------+-----------------------+-----------------------------------------------------------------------------------------------------------------+ +| **model.{encoder,decoder}.num_attention_heads** | int | ``8`` | Number of attention heads. | ++-------------------------------------------------------------------+-----------------+-----------------------+-----------------------------------------------------------------------------------------------------------------+ +| **model.{encoder,decoder}.ffn_dropout** | float | ``0.1`` | Dropout probability within the feedforward layers. 
| ++-------------------------------------------------------------------+-----------------+-----------------------+-----------------------------------------------------------------------------------------------------------------+ +| **model.{encoder,decoder}.attn_score_dropout** | float | ``0.1`` | Dropout probability of the attention scores before softmax normalization. | ++-------------------------------------------------------------------+-----------------+-----------------------+-----------------------------------------------------------------------------------------------------------------+ +| **model.{encoder,decoder}.attn_layer_dropout** | float | ``0.1`` | Dropout probability of the attention query, key, and value projection activations. | ++-------------------------------------------------------------------+-----------------+-----------------------+-----------------------------------------------------------------------------------------------------------------+ +| **model.{encoder,decoder}.hidden_act** | str | ``relu`` | Activation function throughout the network. | ++-------------------------------------------------------------------+-----------------+-----------------------+-----------------------------------------------------------------------------------------------------------------+ +| **model.{encoder,decoder}.mask_future** | bool | ``false``, ``true`` | Whether to mask future timesteps for attention. Defaults to ``True`` for decoder and ``False`` for encoder. | ++-------------------------------------------------------------------+-----------------+-----------------------+-----------------------------------------------------------------------------------------------------------------+ +| **model.{encoder,decoder}.pre_ln** | bool | ``false`` | Whether to apply layer-normalization before (``true``) or after (``false``) a sub-layer. | ++-------------------------------------------------------------------+-----------------+-----------------------+-----------------------------------------------------------------------------------------------------------------+ + +Our pre-trained models are optimized with Adam, with a maximum learning of 0.0004, beta of (0.9, 0.98), and inverse square root learning +rate schedule from :cite:`nlp-machine_translation-vaswani2017attention`. The **model.optim** section sets the optimization parameters. + +The following script creates tarred datasets based on the provided parallel corpus and trains a model based on the ``base`` configuration +from :cite:`nlp-machine_translation-vaswani2017attention`. + +.. 
code :: + + python examples/nlp/machine_translation/enc_dec_nmt.py \ + -cn aayn_base \ + do_training=true \ + trainer.devices=8 \ + trainer.accelerator='gpu' \ + ~trainer.max_epochs \ + +trainer.max_steps=100000 \ + +trainer.val_check_interval=1000 \ + +exp_manager.exp_dir=/path/to/store/results \ + +exp_manager.create_checkpoint_callback=True \ + +exp_manager.checkpoint_callback_params.monitor=val_sacreBLEU \ + +exp_manager.checkpoint_callback_params.mode=max \ + +exp_manager.checkpoint_callback_params.save_top_k=5 \ + model.preproc_out_dir=/path/to/preproc_dir \ + model.train_ds.use_tarred_dataset=true \ + model.train_ds.lines_per_dataset_fragment=1000000 \ + model.train_ds.num_batches_per_tarfile=200 \ + model.train_ds.src_file_name=train.tokenized.en \ + model.train_ds.tgt_file_name=train.tokenized.es \ + model.validation_ds.src_file_name=validation.tokenized.en \ + model.validation_ds.tgt_file_name=validation.tokenized.es \ + model.encoder_tokenizer.vocab_size=32000 \ + model.decoder_tokenizer.vocab_size=32000 \ + ~model.test_ds \ + +The trainer keeps track of the sacreBLEU score :cite:`nlp-machine_translation-post2018call` on the provided validation set and saves +the checkpoints that have the top 5 (by default) sacreBLEU scores. + +At the end of training, a ``.nemo`` file is written to the result directory which allows to run inference on a test set. + +Multi-Validation +---------------- + +To run validation on multiple datasets, specify ``validation_ds.src_file_name`` and ``validation_ds.tgt_file_name`` with a list of file paths: + +.. code-block:: bash + + model.validation_ds.src_file_name=[/data/wmt13-en-de.src,/data/wmt14-en-de.src] \ + model.validation_ds.tgt_file_name=[/data/wmt13-en-de.ref,/data/wmt14-en-de.ref] \ + +When using ``val_loss`` or ``val_sacreBLEU`` for the ``exp_manager.checkpoint_callback_params.monitor`` +then the 0th indexed dataset will be used as the monitor. + +To use other indexes, append the index: + +.. code-block:: bash + + exp_manager.checkpoint_callback_params.monitor=val_sacreBLEU_dl_index_1 + +Multiple test datasets work exactly the same way as validation datasets, simply replace ``validation_ds`` by ``test_ds`` in the above examples. + +Bottleneck Models and Latent Variable Models (VAE, MIM) +------------------------------------------------------- + +NMT with bottleneck encoder architecture is also supported (i.e., fixed size bottleneck), along with the training of Latent Variable Models (currently VAE, and MIM). + +1. Supported learning frameworks (**model.model_type**): + * NLL - Conditional cross entropy (the usual NMT loss) + * VAE - Variational Auto-Encoder (`paper `_) + * MIM - Mutual Information Machine (`paper `_) +2. 
Supported encoder architectures (**model.encoder.arch**): + * seq2seq - the usual transformer encoder without a bottleneck + * bridge - attention bridge bottleneck (`paper `_) + * perceiver - Perceiver bottleneck (`paper `_) + + ++----------------------------------------+----------------+--------------+-------------------------------------------------------------------------------------------------------+ +| **Parameter** | **Data Type** | **Default** | **Description** | ++========================================+================+==============+=======================================================================================================+ +| **model.model_type** | str | ``nll`` | Learning (i.e., loss) type: nll (i.e., cross-entropy/auto-encoder), mim, vae (see description above) | ++----------------------------------------+----------------+--------------+-------------------------------------------------------------------------------------------------------+ +| **model.min_logv** | float | ``-6`` | Minimal allowed log variance for mim | ++----------------------------------------+----------------+--------------+-------------------------------------------------------------------------------------------------------+ +| **model.latent_size** | int | ``-1`` | Dimension of latent (projected from hidden) -1 will take value of hidden size | ++----------------------------------------+----------------+--------------+-------------------------------------------------------------------------------------------------------+ +| **model. non_recon_warmup_batches** | bool | ``200000`` | Warm-up steps for mim, and vae losses (anneals non-reconstruction part) | ++----------------------------------------+----------------+--------------+-------------------------------------------------------------------------------------------------------+ +| **model. recon_per_token** | bool | ``true`` | When false reconstruction is computed per sample, not per token | ++----------------------------------------+----------------+--------------+-------------------------------------------------------------------------------------------------------+ +| **model.encoder.arch** | str | ``seq2seq`` | Supported architectures: ``seq2seq``, ``bridge``, ``perceiver`` (see description above). | ++----------------------------------------+----------------+--------------+-------------------------------------------------------------------------------------------------------+ +| **model.encoder.hidden_steps** | int | ``32`` | Fixed number of hidden steps | ++----------------------------------------+----------------+--------------+-------------------------------------------------------------------------------------------------------+ +| **model.encoder.hidden_blocks** | int | ``1`` | Number of repeat blocks (see classes for description) | ++----------------------------------------+----------------+--------------+-------------------------------------------------------------------------------------------------------+ +| **model.encoder. 
hidden_init_method** | str | ``default`` | See classes for available values |
++----------------------------------------+----------------+--------------+-------------------------------------------------------------------------------------------------------+
+
+
+Detailed description of config parameters:
+
+* **model.encoder.arch=seq2seq**
+  * *model.encoder.hidden_steps is ignored*
+  * *model.encoder.hidden_blocks is ignored*
+  * *model.encoder.hidden_init_method is ignored*
+* **model.encoder.arch=bridge**
+  * *model.encoder.hidden_steps:* input is projected to the specified fixed number of steps
+  * *model.encoder.hidden_blocks:* number of encoder blocks to repeat after the attention bridge projection
+  * *model.encoder.hidden_init_method:*
+    * enc_shared (default) - apply the encoder to the inputs, then the attention bridge, followed by hidden_blocks copies of the same encoder (the pre and post encoders share parameters)
+    * identity - apply the attention bridge to the inputs, followed by hidden_blocks copies of the same encoder
+    * enc - similar to enc_shared, but the initial encoder has independent parameters
+* **model.encoder.arch=perceiver**
+  * *model.encoder.hidden_steps:* input is projected to the specified fixed number of steps
+  * *model.encoder.hidden_blocks:* number of cross-attention + self-attention blocks to repeat after the initialization block (all self-attention and cross-attention blocks share parameters)
+  * *model.encoder.hidden_init_method:*
+    * params (default) - hidden state is initialized with learned parameters, followed by cross-attention with independent parameters
+    * bridge - hidden state is initialized with an attention bridge
+
+
+Training requires the use of the following script (instead of ``enc_dec_nmt.py``):
+
+.. code ::
+
+    python -- examples/nlp/machine_translation/enc_dec_nmt-bottleneck.py \
+        --config-path=conf \
+        --config-name=aayn_bottleneck \
+        ...
+        model.model_type=nll \
+        model.non_recon_warmup_batches=7500 \
+        model.encoder.arch=perceiver \
+        model.encoder.hidden_steps=32 \
+        model.encoder.hidden_blocks=2 \
+        model.encoder.hidden_init_method=params \
+        ...
+
+
+Model Inference
+---------------
+
+To generate translations on a test set and compute sacreBLEU scores, run the inference script:
+
+.. code ::
+
+    python examples/nlp/machine_translation/nmt_transformer_infer.py \
+        --model /path/to/model.nemo \
+        --srctext test.en \
+        --tgtout test.en-es.translations \
+        --batch_size 128 \
+        --source_lang en \
+        --target_lang es
+
+The ``--srctext`` file must contain raw text from before tokenization and normalization. The resulting ``--tgtout`` file is detokenized and
+can be used to compute sacreBLEU scores.
+
+.. code ::
+
+    cat test.en-es.translations | sacrebleu test.es
+
+Inference Improvements
+----------------------
+
+In practice, there are a few commonly used techniques at inference time to improve translation quality. NeMo implements:
+
+1) Model Ensembling
+2) Shallow Fusion decoding with transformer language models :cite:`nlp-machine_translation-gulcehre2015using`
+3) Noisy-channel re-ranking :cite:`nlp-machine_translation-yee2019simple`
+
+(a) Model Ensembling - Given many models trained with the same encoder and decoder tokenizer, it is possible to ensemble their predictions (by averaging probabilities at each step) to generate better translations.
+
+.. math::
+
+    P(y_t|y_{<t}, x) = \frac{1}{N} \sum_{i=1}^{N} P_i(y_t|y_{<t}, x)
+
+Pretrained Encoders
+-------------------
+
+Pretrained encoders from `HuggingFace `__
+or `Megatron-LM `__
+can be used to train NeMo NMT models.
+
+The ``library`` flag takes values: ``huggingface``, ``megatron``, and ``nemo``.
+
+The ``model_name`` flag is used to indicate a *named* model architecture. 
+For example, we can use ``bert_base_cased`` from HuggingFace or ``megatron-bert-345m-cased`` from Megatron-LM. + +The ``pretrained`` flag indicates whether or not to download the pretrained weights (``pretrained=True``) or +instantiate the same model architecture with random weights (``pretrained=False``). + +To use a custom model architecture from a specific library, use ``model_name=null`` and then add the +custom configuration under the ``encoder`` configuration. + +HuggingFace +^^^^^^^^^^^ + +We have provided a `HuggingFace config file `__ +to use with HuggingFace encoders. + +To use the config file from CLI: + +.. code :: + + --config-path=conf \ + --config-name=huggingface \ + +As an example, we can configure the NeMo NMT encoder to use ``bert-base-cased`` from HuggingFace +by using the ``huggingface`` config file and setting + +.. code :: + + model.encoder.pretrained=true \ + model.encoder.model_name=bert-base-cased \ + +To use a custom architecture from HuggingFace we can use + +.. code :: + + +model.encoder._target_=transformers.BertConfig \ + +model.encoder.hidden_size=1536 \ + +Note the ``+`` symbol is needed if we're not adding the arguments to the YAML config file. + +Megatron +^^^^^^^^ + +We have provided a `Megatron config file `__ +to use with Megatron encoders. + +To use the config file from CLI: + +.. code :: + + --config-path=conf \ + --config-name=megatron \ + +The ``checkpoint_file`` should be the path to Megatron-LM checkpoint: + +.. code :: + + /path/to/your/megatron/checkpoint/model_optim_rng.pt + +In case your megatron model requires model parallelism, then ``checkpoint_file`` should point to the directory containing the +standard Megatron-LM checkpoint format: + +.. code :: + + 3.9b_bert_no_rng + ├── mp_rank_00 + │ └── model_optim_rng.pt + ├── mp_rank_01 + │ └── model_optim_rng.pt + ├── mp_rank_02 + │ └── model_optim_rng.pt + └── mp_rank_03 + └── model_optim_rng.pt + +As an example, to train a NeMo NMT model with a 3.9B Megatron BERT encoder, +we would use the following encoder configuration: + +.. code :: + + model.encoder.checkpoint_file=/path/to/megatron/checkpoint/3.9b_bert_no_rng \ + model.encoder.hidden_size=2560 \ + model.encoder.num_attention_heads=40 \ + model.encoder.num_layers=48 \ + model.encoder.max_position_embeddings=512 \ + +To train a Megatron 345M BERT, we would use + +.. code :: + + model.encoder.model_name=megatron-bert-cased \ + model.encoder.checkpoint_file=/path/to/your/megatron/checkpoint/model_optim_rng.pt \ + model.encoder.hidden_size=1024 \ + model.encoder.num_attention_heads=16 \ + model.encoder.num_layers=24 \ + model.encoder.max_position_embeddings=512 \ + +If the pretrained megatron model used a custom vocab file, then set: + +.. code:: + + model.encoder_tokenizer.vocab_file=/path/to/your/megatron/vocab_file.txt + model.encoder.vocab_file=/path/to/your/megatron/vocab_file.txt + + +Use ``encoder.model_name=megatron_bert_uncased`` for uncased models with custom vocabularies and +use ``encoder.model_name=megatron_bert_cased`` for cased models with custom vocabularies. + + +References +---------- + +.. bibliography:: ../nlp_all.bib + :style: plain + :labelprefix: nlp-machine_translation + :keyprefix: nlp-machine_translation- diff --git a/docs/source/nlp/megatron.rst b/docs/source/nlp/megatron.rst new file mode 100644 index 0000000000000000000000000000000000000000..743aa2f84b536fae71c666d653c75ee949c1149d --- /dev/null +++ b/docs/source/nlp/megatron.rst @@ -0,0 +1,183 @@ +.. 
_megatron_finetuning:
+
+NeMo Megatron
+=============
+
+Megatron-LM :cite:`nlp-megatron-shoeybi2019megatron` is a large, powerful transformer developed by the Applied Deep Learning Research
+team at NVIDIA. Currently, NeMo Megatron supports 3 types of models:
+
+* GPT-style models (decoder only)
+* T5/BART-style models (encoder-decoder)
+* BERT-style models (encoder only)
+
+.. note::
+    We recommend using `NeMo Megatron containers `_ for pre-training, tuning and running inference with large (1B and above) Megatrons.
+
+
+Model Parallelism
+-----------------
+
+`Megatron-LM `_ is a highly optimized and efficient library for training large language models.
+With Megatron model parallelism, language models can be trained with billions of weights and then used in NeMo for downstream tasks.
+
+NeMo handles pretrained model parallel checkpoints from Megatron-LM automatically, and model parallel models in NeMo have all
+the same features as other NeMo models.
+
+.. note::
+
+    Currently, NeMo only supports tensor model parallelism.
+
+Training
+^^^^^^^^
+
+All of the necessary logic to train model parallel models in NeMo with PyTorch Lightning is contained in the ``NLPDDPStrategy``.
+The ``NLPDDPStrategy`` subclasses the PyTorch Lightning strategy type ``DDPStrategy``.
+See `strategies `_ for more information on PyTorch Lightning strategies.
+
+To enable model parallel training in NeMo:
+
+.. code-block:: python
+
+    trainer = Trainer(strategy=NLPDDPStrategy(), **cfg.trainer)
+
+Megatron-LM checkpoints have a specific format. One checkpoint is saved for each model parallel rank:
+
+.. code-block:: bash
+
+    iter_0080000/
+    ├── mp_rank_00
+    │   └── model_optim_rng.pt
+    └── mp_rank_01
+        └── model_optim_rng.pt
+
+
+To start fine-tuning from a Megatron-LM checkpoint, simply pass the path to the Megatron-LM checkpoint
+via the language model config:
+
+.. code-block:: bash
+
+    model.language_model.lm_checkpoint=/raid/megatron/bert/iter_0080000 \
+
+We also need to input the model configuration. This can be done via JSON:
+
+.. code-block:: json
+
+    {
+        "hidden-size": 1024,
+        "num-attention-heads": 16,
+        "num-layers": 24,
+        "max-seq-length": 512
+    }
+
+And input via command line:
+
+.. code-block:: bash
+
+    model.language_model.config_file=/raid/data/megatron/bert/config.json \
+
+Or the model configuration can be input via YAML:
+
+.. code-block:: YAML
+
+    model:
+        language_model:
+            config:
+                hidden_size: 1024
+                num_attention_heads: 16
+                num_layers: 24
+                max_position_embeddings: 512
+
+Additionally, Megatron-LM requires a vocab file:
+
+.. code-block:: bash
+
+    model.tokenizer.vocab_file=/path/to/vocab.txt
+
+If using the Megatron-LM default tokenizer for training BERT, the vocab file can be omitted:
+
+.. code-block:: bash
+
+    # uncased model
+    model.tokenizer.tokenizer_name=megatron-bert-uncased
+
+.. code-block:: bash
+
+    # cased model
+    model.tokenizer.tokenizer_name=megatron-bert-cased
+
+Auto-Resume
+^^^^^^^^^^^
+
+Resuming training with the NeMo experiment manager and PyTorch Lightning works exactly the same as for other NeMo models.
+While training with PTL, model parallel checkpoints will be saved and loaded properly.
+
+.. 
code-block:: bash + + checkpoints/ + ├── mp_rank_00 + │ ├── mp_autoresume-last.ckpt + │ ├── mp_autoresume---val_loss=0.35-epoch=0.ckpt + │ ├── mp_autoresume---val_loss=0.38-epoch=1.ckpt + │ └── mp_autoresume---val_loss=0.39-epoch=2.ckpt + └── mp_rank_01 + ├── mp_autoresume-last.ckpt + ├── mp_autoresume---val_loss=0.35-epoch=0.ckpt + ├── mp_autoresume---val_loss=0.38-epoch=1.ckpt + └── mp_autoresume---val_loss=0.39-epoch=2.ckpt + +Save and Restore +^^^^^^^^^^^^^^^^ + +Model parallel .nemo files behave the same as all other .nemo files. Calling ``.save_to`` will save +a checkpoint for each model parallel rank inside the .nemo file: + +.. code-block:: bash + + text_class_350m + ├── megatron-bert-uncased_encoder_config.json + ├── megatron_checkpoint_version.json + ├── model_config.yaml + ├── mp_rank_00 + │ └── model_weights.ckpt + ├── mp_rank_01 + │ └── model_weights.ckpt + ├── tokenizer_vocab_dict.json + └── tokenizer.vocab_file + +When restoring a model parallel .nemo file, we must pass in the ``Trainer`` as model parallel requires DDP: + +.. code-block:: python + + model = TokenClassificationModel.restore_from(cfg.pretrained_model, trainer=trainer) + +Evaluation +^^^^^^^^^^ + +Since model parallel models always require more than one GPU, the ``Trainer`` is needed for evaluation: + +.. code-block:: python + + trainer = pl.Trainer(strategy=NLPDDPStrategy(), **cfg.trainer) + + model = TextClassificationModel.restore_from(cfg.model.nemo_path, trainer=trainer) + model.setup_test_data(test_data_config=cfg.model.test_ds) + + trainer.test(model=model, ckpt_path=None) + +BioMegatron +----------- + +BioMegatron has the same network architecture as the Megatron-LM, but is pretrained on a different dataset - `PubMed `_, +a large biomedical text corpus, which achieves better performance in biomedical downstream tasks than the original Megatron-LM. + +Examples of using BioMegatron on biomedical downstream tasks can be found at (can be executed with `Google's Colab `_): +`NeMo/tutorials/nlp/Relation_Extraction-BioMegatron.ipynb `__ and `NeMo/tutorials/nlp/Token_Classification-BioMegatron.ipynb `__. + + +References +---------- + +.. bibliography:: nlp_all.bib + :style: plain + :labelprefix: NLP-MEGATRON + :keyprefix: nlp-megatron- diff --git a/docs/source/nlp/models.rst b/docs/source/nlp/models.rst new file mode 100644 index 0000000000000000000000000000000000000000..932be201bfb2ffd05b8f1a857dca6517d72b9543 --- /dev/null +++ b/docs/source/nlp/models.rst @@ -0,0 +1,24 @@ +.. _nlp_models: + +Tasks +===== + +NeMo's NLP collection supports provides the following task-specific models: + +.. toctree:: + :maxdepth: 1 + + punctuation_and_capitalization_models + token_classification + joint_intent_slot + text_classification + bert_pretraining + language_modeling + nemo_megatron/prompt_learning + question_answering + dialogue + glue_benchmark + information_retrieval + entity_linking + nlp_model + machine_translation/machine_translation diff --git a/docs/source/nlp/nemo_megatron/batching.rst b/docs/source/nlp/nemo_megatron/batching.rst new file mode 100644 index 0000000000000000000000000000000000000000..b7d6ea21306771de1eb90bb58f0578c8c8dbeb82 --- /dev/null +++ b/docs/source/nlp/nemo_megatron/batching.rst @@ -0,0 +1,21 @@ +.. _batching: + +Batching +-------- + +Batch size is one of the first parameters you should play with. For efficiency and convergence reasons we recommend you first try maximizing your batch size per GPU so that your GPU RAM usage is maximized. + +NeMo Megatron uses the following concepts. 
+
+*Micro batch size* is the number of examples per data parallel rank. It is controlled by the ``model.micro_batch_size`` parameter.
+
+*Global batch size* = micro_batch_size * data_parallel_size * gradient_accumulation_steps. For details on ``data_parallel_size`` see the :ref:`parallelisms` section, but typically it is equal to the number of GPUs being used.
+Global batch size is controlled by the ``model.global_batch_size`` parameter.
+
+
+*Gradient Accumulation*
+
+    * Idea: Train with large batch sizes with a fixed memory footprint at the cost of additional compute.
+    * Do k forward and backward passes through the network with different batches, and do not perform parameter updates until after k passes.
+    * Update parameters
+
diff --git a/docs/source/nlp/nemo_megatron/gpt/gpt_training.rst b/docs/source/nlp/nemo_megatron/gpt/gpt_training.rst
new file mode 100644
index 0000000000000000000000000000000000000000..807dce64e86edc8ada9afc3906ebc0280e1fcc80
--- /dev/null
+++ b/docs/source/nlp/nemo_megatron/gpt/gpt_training.rst
@@ -0,0 +1,232 @@
+GPT model training
+------------------
+
+GPT is a decoder-only Transformer model.
+
+
+Quick start
+^^^^^^^^^^^
+Steps below demonstrate training of a GPT-style model with NeMo.
+
+Data download & pre-processing
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. note::
+    Data download, pre-processing and tokenizer training in the example below will take ~3 hours.
+
+**Step 1: Download data**
+
+The step below will download Wikipedia data (around 20GB) and can take several hours.
+
+.. code-block:: bash
+
+    wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
+
+**Step 2: Extract raw data**
+
+.. code-block:: bash
+
+    pip install wikiextractor
+    python -m wikiextractor.WikiExtractor enwiki-latest-pages-articles.xml.bz2 --json
+    find text -name 'wiki_*' -exec cat {} \; > train_data.jsonl
+
+Now, ``train_data.jsonl`` will contain our training data in the JSON lines format. We are interested in the data under the "text" field.
+
+
+**Step 3: Train tokenizer**
+
+Below we will consider 2 options for training data tokenizers: using the pre-built HuggingFace BPE tokenizer, or training and using your own Google Sentencepiece tokenizer.
+Note that only the second option allows you to experiment with the vocabulary size.
+
+*Option 1:* Using HuggingFace GPT2 tokenizer files.
+
+With this option we will just download the pre-built vocabulary and merge files for the BPE tokenizer.
+
+.. code-block:: bash
+
+    wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
+    wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
+
+
+*Option 2:* Using `Google Sentencepiece `_ tokenizer library.
+
+It comes as a dependency with NeMo, so if you have installed NeMo it should already be installed.
+Note that training the tokenizer model will also take some time.
+
+.. code-block:: bash
+
+    sudo apt install jq
+    jq .text train_data.jsonl >> text_for_tokenizer.txt
+    spm_train --input=text_for_tokenizer.txt \
+        --model_prefix=spm_32k_wiki \
+        --vocab_size=32768 \
+        --character_coverage=0.9999 \
+        --model_type=bpe \
+        --byte_fallback=true \
+        --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3 \
+        --split_digits true
+
+After this is done (it will take a while), you'll have two files, ``spm_32k_wiki.model`` and ``spm_32k_wiki.vocab``, which correspond to the model and vocabulary.
+
+**Step 4: Convert training data into memory map format**
+
+This format makes training more efficient, especially with many nodes and GPUs. This step will also tokenize data using the tokenizer model from Step 3. 
+ +*Option 1:* Using HuggingFace GPT2 tokenizer files. + +.. code-block:: bash + + python /scripts/nlp_language_modeling/preprocess_data_for_megatron.py \ + --input=train_data.jsonl \ + --json-keys=text \ + --tokenizer-library=megatron \ + --vocab gpt2-vocab.json \ + --dataset-impl mmap \ + --tokenizer-type GPT2BPETokenizer \ + --merge-file gpt2-merges.txt \ + --output-prefix=hfbpe_gpt_training_data \ + --append-eod \ + --workers=32 + +*Option 2:* Using `Google Sentencepiece `_ tokenizer library. + +.. code-block:: bash + + python /scripts/nlp_language_modeling/preprocess_data_for_megatron.py \ + --input=train_data.jsonl \ + --json-keys=text \ + --tokenizer-library=sentencepiece \ + --tokenizer-model=spm_32k_wiki.model \ + --output-prefix=gpt_training_data \ + --append-eod \ + --workers=32 + + +Train GPT-style Model +~~~~~~~~~~~~~~~~~~~~~ + +Once you have prepared training data and tokenizer, you are ready to train the model. +The configuration we present below has about 124M parameters and it should fit on a single 16GB GPU if using float16. +Let's go!!! + +*Option 1:* Using HuggingFace GPT2 tokenizer files. + +.. code-block:: bash + + python /home/okuchaiev/repos/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \ + --config-path=/home/okuchaiev/repos/NeMo/examples/nlp/language_modeling/conf \ + --config-name=megatron_gpt_config \ + trainer.devices=1 \ + trainer.num_nodes=1 \ + trainer.max_epochs=null \ + trainer.max_steps=300000 \ + trainer.val_check_interval=300 \ + trainer.log_every_n_steps=50 \ + trainer.limit_val_batches=50 \ + trainer.limit_test_batches=50 \ + trainer.accumulate_grad_batches=1 \ + trainer.precision=16 \ + model.micro_batch_size=6 \ + model.global_batch_size=192 \ + model.tensor_model_parallel_size=1 \ + model.pipeline_model_parallel_size=1 \ + model.max_position_embeddings=1024 \ + model.encoder_seq_length=1024 \ + model.hidden_size=768 \ + model.ffn_hidden_size=3072 \ + model.num_layers=12 \ + model.num_attention_heads=12 \ + model.init_method_std=0.021 \ + model.hidden_dropout=0.1 \ + model.layernorm_epsilon=1e-5 \ + model.tokenizer.vocab_file=gpt2-vocab.json \ + model.tokenizer.merge_file=gpt2-merges.txt \ + model.data.data_prefix=[1.0,hfbpe_gpt_training_data_text_document] \ + model.data.num_workers=2 \ + model.data.seq_length=1024 \ + model.data.splits_string=\'980,10,10\' \ + model.optim.name=fused_adam \ + model.optim.lr=6e-4 \ + model.optim.betas=[0.9,0.95] \ + model.optim.weight_decay=0.1 \ + model.optim.sched.name=CosineAnnealing \ + model.optim.sched.warmup_steps=750 \ + model.optim.sched.constant_steps=80000 \ + model.optim.sched.min_lr=6e-5 \ + exp_manager.resume_if_exists=True \ + exp_manager.resume_ignore_no_checkpoint=True \ + exp_manager.create_checkpoint_callback=True \ + exp_manager.checkpoint_callback_params.monitor=val_loss \ + exp_manager.checkpoint_callback_params.save_top_k=3 \ + exp_manager.checkpoint_callback_params.mode=min \ + exp_manager.checkpoint_callback_params.always_save_nemo=False + + +*Option 2:* Using `Google Sentencepiece `_ tokenizer library. + +.. 
code-block:: bash + + python /home/okuchaiev/repos/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \ + --config-path=/home/okuchaiev/repos/NeMo/examples/nlp/language_modeling/conf \ + --config-name=megatron_gpt_config \ + trainer.devices=1 \ + trainer.num_nodes=1 \ + trainer.max_epochs=null \ + trainer.max_steps=300000 \ + trainer.val_check_interval=300 \ + trainer.log_every_n_steps=50 \ + trainer.limit_val_batches=50 \ + trainer.limit_test_batches=50 \ + trainer.accumulate_grad_batches=1 \ + trainer.precision=16 \ + model.micro_batch_size=6 \ + model.global_batch_size=192 \ + model.tensor_model_parallel_size=1 \ + model.pipeline_model_parallel_size=1 \ + model.max_position_embeddings=1024 \ + model.encoder_seq_length=1024 \ + model.hidden_size=768 \ + model.ffn_hidden_size=3072 \ + model.num_layers=12 \ + model.num_attention_heads=12 \ + model.init_method_std=0.021 \ + model.hidden_dropout=0.1 \ + model.layernorm_epsilon=1e-5 \ + model.tokenizer.library=sentencepiece \ + model.tokenizer.model=spm_32k_wiki.model \ + model.data.data_prefix=[1.0,gpt_training_data_text_document] \ + model.data.num_workers=2 \ + model.data.seq_length=1024 \ + model.data.splits_string=\'980,10,10\' \ + model.optim.name=fused_adam \ + model.optim.lr=6e-4 \ + model.optim.betas=[0.9,0.95] \ + model.optim.weight_decay=0.1 \ + model.optim.sched.name=CosineAnnealing \ + model.optim.sched.warmup_steps=750 \ + model.optim.sched.constant_steps=80000 \ + model.optim.sched.min_lr=6e-5 \ + exp_manager.resume_if_exists=True \ + exp_manager.resume_ignore_no_checkpoint=True \ + exp_manager.create_checkpoint_callback=True \ + exp_manager.checkpoint_callback_params.monitor=val_loss \ + exp_manager.checkpoint_callback_params.save_top_k=3 \ + exp_manager.checkpoint_callback_params.mode=min \ + exp_manager.checkpoint_callback_params.always_save_nemo=False + + +Next, simply launch Tensorboard to monitor training like so: + +.. 
code-block:: bash + + tensorboard --logdir nemo_experiments --bind_all + +Next steps +~~~~~~~~~~ + +Please refer to: + +* :ref:`batching` section for batch size adjustments +* :ref:`parallelisms` section for understanding various types of parallelisms +* :ref:`promptlearning` section for details on prompt-tuning and p-tuning + diff --git a/docs/source/nlp/nemo_megatron/images/ddp.gif b/docs/source/nlp/nemo_megatron/images/ddp.gif new file mode 100644 index 0000000000000000000000000000000000000000..feae2f9445c7d43ef7d7da27fdd295e59e94c75e --- /dev/null +++ b/docs/source/nlp/nemo_megatron/images/ddp.gif @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:c1f2c903f2526b5a067c575ceea68c71037779203b8b00fb9cc21732d4c3b8e0 +size 6859866 diff --git a/docs/source/nlp/nemo_megatron/images/pnom.gif b/docs/source/nlp/nemo_megatron/images/pnom.gif new file mode 100644 index 0000000000000000000000000000000000000000..f2f5ed8cd6d7003a9446b2b0aac8bda6bf13c443 --- /dev/null +++ b/docs/source/nlp/nemo_megatron/images/pnom.gif @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:86bd96ef60d79fb69cda823f651dd881d7306579ae7a1e1eaf3b553af15038c7 +size 4178913 diff --git a/docs/source/nlp/nemo_megatron/images/pp.gif b/docs/source/nlp/nemo_megatron/images/pp.gif new file mode 100644 index 0000000000000000000000000000000000000000..6d780094beac75a8f73136c3fbf1075962e1d7eb --- /dev/null +++ b/docs/source/nlp/nemo_megatron/images/pp.gif @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:ab31418983b038558c9d2c85f256205b961e2afd4fe9812fc1d541a939fe4f48 +size 5736616 diff --git a/docs/source/nlp/nemo_megatron/images/sp.gif b/docs/source/nlp/nemo_megatron/images/sp.gif new file mode 100644 index 0000000000000000000000000000000000000000..665df2becf8894474b0c98c675757e8dd3d0631a Binary files /dev/null and b/docs/source/nlp/nemo_megatron/images/sp.gif differ diff --git a/docs/source/nlp/nemo_megatron/images/tp.gif b/docs/source/nlp/nemo_megatron/images/tp.gif new file mode 100644 index 0000000000000000000000000000000000000000..b39bf4b1a3aead65095d2d33cb308da48350fe61 --- /dev/null +++ b/docs/source/nlp/nemo_megatron/images/tp.gif @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:ea5d43e578159891071dcf93997d863b4908879861452e8d7c11b4d8db799248 +size 2478936 diff --git a/docs/source/nlp/nemo_megatron/intro.rst b/docs/source/nlp/nemo_megatron/intro.rst new file mode 100644 index 0000000000000000000000000000000000000000..bcc335fa6ee8bb15bc7eeb8446d2e6285678013a --- /dev/null +++ b/docs/source/nlp/nemo_megatron/intro.rst @@ -0,0 +1,35 @@ +NeMo Megatron +============= + +Megatron :cite:`nlp-megatron-shoeybi2019megatron` is a large, powerful transformer developed by the Applied Deep Learning Research +team at NVIDIA. NeMo Megatron supports several types of models: + +* GPT-style models (decoder only) +* T5/BART/UL2-style models (encoder-decoder) +* BERT-style models (encoder only) +* RETRO model (decoder only) + + + +.. note:: + NeMo Megatron has an Enterprise edition which contains tools for data preprocessing, hyperparameter tuning, container, scripts for various clouds and more. With Enterprise edition you also get deployment tools. Apply for `early access here `_ . + + +.. toctree:: + :maxdepth: 1 + + mlm_migration + gpt/gpt_training + batching + parallelisms + prompt_learning + retro/retro_model + + +References +---------- + +.. 
bibliography:: ../nlp_all.bib + :style: plain + :labelprefix: nlp-megatron + :keyprefix: nlp-megatron- \ No newline at end of file diff --git a/docs/source/nlp/nemo_megatron/mlm_migration.rst b/docs/source/nlp/nemo_megatron/mlm_migration.rst new file mode 100644 index 0000000000000000000000000000000000000000..ffe9764615b586db98106d058fbe8b9a86ab868c --- /dev/null +++ b/docs/source/nlp/nemo_megatron/mlm_migration.rst @@ -0,0 +1,24 @@ +Migrating from Megatron-LM +-------------------------- + +NeMo Megatron and Megatron-LM share many underlying technology. You should be able to convert your GPT model checkpoints trained with Megatron-LM into NeMo Megatron. +Example conversion script: + +.. code-block:: bash + + /examples/nlp/language_modeling/megatron_lm_ckpt_to_nemo.py \ + --checkpoint_folder \ + --checkpoint_name megatron_gpt--val_loss=99.99-step={steps}-consumed_samples={consumed}.0 \ + --nemo_file_path \ + --model_type \ + --tensor_model_parallel_size \ + --pipeline_model_parallel_size \ + --gpus_per_node + + + +To resume the training from converted MegatronLM checkpoint, make sure to set the +`trainer.max_steps=round(lr-warmup-fraction * lr-decay-iters + lr-decay-iters)` +where `lr-warmup-fraction` and `lr-decay-iters` are arguments from MegatronLM training +so the learning rate scheduler will follow the same curve. + diff --git a/docs/source/nlp/nemo_megatron/parallelisms.rst b/docs/source/nlp/nemo_megatron/parallelisms.rst new file mode 100644 index 0000000000000000000000000000000000000000..172721a1d2ddd8042126c523dbb4fce853f85ea5 --- /dev/null +++ b/docs/source/nlp/nemo_megatron/parallelisms.rst @@ -0,0 +1,49 @@ +.. _parallelisms: + +Parallelisms +------------ + +NeMo Megatron supports 4 types of parallelisms (can be mixed together arbitraritly): + +Distributed Data parallelism +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +.. image:: images/ddp.gif + :align: center + :width: 800px + :alt: Distributed Data Parallel + + +Tensor Parallelism +^^^^^^^^^^^^^^^^^^ + +.. image:: images/tp.gif + :align: center + :width: 800px + :alt: Tensor Parallel + +Pipeline Parallelism +^^^^^^^^^^^^^^^^^^^^ + +.. image:: images/pp.gif + :align: center + :width: 800px + :alt: Pipeline Parallel + +Sequence Parallelism +^^^^^^^^^^^^^^^^^^^^ + +.. image:: images/sp.gif + :align: center + :width: 800px + :alt: Sqeuence Parallel + +Parallelism nomenclature +^^^^^^^^^^^^^^^^^^^^^^^^ + +When reading and modifying NeMo Megatron code you will encounter the following terms. + +.. image:: images/pnom.gif + :align: center + :width: 800px + :alt: Parallelism nomenclature diff --git a/docs/source/nlp/nemo_megatron/prompt_learning.rst b/docs/source/nlp/nemo_megatron/prompt_learning.rst new file mode 100644 index 0000000000000000000000000000000000000000..8fe481019a6f127ca899eda39b50842eac844be5 --- /dev/null +++ b/docs/source/nlp/nemo_megatron/prompt_learning.rst @@ -0,0 +1,390 @@ +.. _promptlearning: + +Prompt Learning +--------------- + +Within NeMo we refer to **p-tuning** and **prompt tuning** methods collectively as prompt learning. Both methods are parameter efficient alternatives to fine-tuning pretrained language models. Our NeMo implementation makes it possible to use one pretrained GPT model on many downstream tasks without needing to tune the model's full set of parameters. It also allows for adding new tasks to your model without overwriting or disrupting previous tasks for which the model has already been p-tuned/prompt-tuned. 
Because the original model parameters are frozen and never altered by either method, p-tuning/prompt-tuning also avoids catastrophic forgetting issues often encountered when fine-tuning models. + +Instead of selecting discrete text prompts in a manual or automated fashion, prompt tuning and p-tuning utilize virtual prompt embeddings that can be optimized via gradient descent. The only difference between prompt tuning and p-tuning within NeMo-Megatron is the architecture used to tune the soft prompt tokens during training. + +- Our prompt tuning implementation is based off Lester et. al’s EMNLP 2021 paper "`The Power of Scale for Parameter-Efficient Prompt Tuning `_" +- Our p-tuning implementation is based off Liu et al's paper "`GPT Understands, Too `_" + +Our continuous learning capability for combined p-tuning and prompt tuning with GPT style models is a NeMo specific extension of the author's original work. + +Please also checkout our `prompt learning tutorial notebook. `_ + + +Terminology +^^^^^^^^^^^ +We will be using the terms ``continuous``, ``soft``, and ``virtual`` token interchangeably to refer to embeddings inserted into the model prompt that have no concrete mapping to strings or characters within the model’s vocabulary. These virtual token embeddings exist in contrast to the ``discrete``, ``hard``, or ``real`` tokens that do make up the model’s vocabulary. Virtual tokens are purely 1D vectors with dimensionality equal to that of each real token embedding, matching the ``hidden_size`` hyperparameter. In training and inference, continuous token embeddings are inserted among discrete token embeddings according to a template you provide in the model's config. We will demonstrate how to do this below. + +When referring to p-tuning and prompt tuning together, we will be using the phrase prompt learning for simplicity. + +Prompt Tuning +^^^^^^^^^^^^^ + +In prompt-tuning a pretrained GPT model, soft prompt embeddings are initialized as a 2D matrix of size ``total_virtual_tokens X hidden_size``. Each task the model is prompt-tuned to perform has its own 2D embedding matrix associated with it. Tasks do not share any parameters during training or inference. All GPT model parameters are frozen and only the embedding parameters for each task are updated during training. + +In prompt tuning you can specify how the embeddings are initialized for each task. You can either + +- Initialize embedding parameters according to some random distribution +- Initialize embedding parameters from existing vocabulary embeddings (recommended) + +If you choose to initialize virtual token embeddings from existing embedding weights, you can provide the string of words you want to use for initialization in the model's config. This string will be tokenized and tiled or truncated to match the specified number of virtual tokens you would like to use (``total_virtual_tokens``). Vocab embeddings are copied and used to initialize the soft prompt embedding matrix for each task. The vocab embeddings themselves are not updated or changed during prompt tuning. + +P-Tuning +^^^^^^^^ + +In p-tuning, an LSTM model is used to predict virtual token embeddings. We refer to this LSTM model as our ``prompt_encoder``. LSTM parameters are randomly initialized at the start of p-tuning. All GPT model parameters are frozen, and only the LSTM weights are updated at each training step. 
LSTM parameters are shared between all tasks that are p-tuned at the same time, but the LSTM model outputs unique virtual token embeddings for each task. The virtual tokens predicted by the LSTM are inserted among the discrete token input in the exact same manner as with prompt-tuning. You still specify the number of virtual tokens you want to use by setting ``total_virtual_tokens`` and each virtual token embedding is still a 1D vector of size ``hidden_size``. + +Using Both Prompt and P-Tuning +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +A single pretrained GPT model can use both p-tuning and prompt-tuning. While you must decide to use either p-tuning or prompt-tuning for each task you want your model to perform, you can p-tune your model on a set of tasks *A*, then prompt tune your same model on a different set of tasks *B*, then finally run inference on tasks from both *A* and *B* at the same time. During prompt-tuning or p-tuning, tasks tuned at the same time must use the same number of virtual tokens. During inference, tasks using differing amounts of virtual tokens can be run at the same time. + +When p-tuning completes, prompt tuned virtual tokens from the p-tuning ``prompt_encoder`` are automatically moved to the ``prompt_table`` where all prompt tuned and p-tuned soft prompts are stored. The LSTM ``prompt_encoder`` is then removed from the model. This allows us to preserve previously p-tuned soft prompts while still maintaining the ability to add new p-tuned or prompt-tuned soft prompts in the future. The ``prompt_table`` uses the ``taskname`` as a key to look up the correct virtual tokens for a specified task. The ``prompt_table``'s hash table data structure also makes it possible for each task to flexibly use a different number of virtual tokens. + +P-tuning usually requires fewer virtual tokens per task to achieve good results but uses a higher number of parameters compared to prompt-tuning. For example, if you prompt tune a 125M parameter GPT model (with hidden size 768) on two tasks, using 100 virtual tokens per task, the total parameters tuned during prompt tuning would equal 153k (~.1% of the pre-trained model size). If you p-tune the same 125M GPT model on 2 tasks, using an LSTM with two layers and 10 tokens per task, you will be tuning 8.3M parameters (~6.6% of the pre-trained model size). The increased number of parameters used during p-tuning is mitigated by our ``prompt_table``. When p-tuned soft prompts are placed in the prompt table, only the parameters for the predicted virtual tokens are saved. This allows us to keep the benefit of tuning a larger number of parameters during training, while also preserving the parameter efficiency of prompt-tuning during inference and storing of the model. + +Because p-tuning shares parameters between tasks during training, p-tuning your model on multiple tasks that are similar might allow your model to share insight between tasks. In the same vein, p-tuning on many very different tasks at once might perform worse than prompt tuning, which tunes a distinct set of parameters per task. **Generally we recommend using p-tuning over prompt tuning.** + +Users can also optionally tune the model's full parameters in addition to the soft prompt parameters. See ``model.lm_finetune`` in the Prompt Learning Config section for details on how to configure this. 
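+
+The prompt-tuning parameter count above can be reproduced with some simple arithmetic. The sketch below is only
+illustrative (it is not a NeMo API); the p-tuning figure is not computed here because it depends on the exact LSTM
+``prompt_encoder`` configuration.
+
+.. code-block:: python
+
+    # Prompt tuning: one (total_virtual_tokens x hidden_size) embedding matrix per task.
+    hidden_size = 768           # 125M GPT example from the text above
+    total_virtual_tokens = 100  # virtual tokens per task
+    num_tasks = 2
+
+    prompt_tuning_params = num_tasks * total_virtual_tokens * hidden_size
+    print(prompt_tuning_params)  # 153600, i.e. ~153k tunable parameters (~0.1% of 125M)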
+ +Dataset Preprocessing +^^^^^^^^^^^^^^^^^^^^^ + +The prompt learning dataset accepts a list of json/dictionary objects or a list of json file names where each json file contains a collection of json objects. Each json object must include the field ``taskname`` which is a string identifier for the task the data example corresponds to. They should also include one or more fields corresponding to different sections of the discrete text prompt. The input data might look like: + +.. code:: + + [ + {"taskname": "squad", "context": [CONTEXT_PARAGRAPH_TEXT1], "question": [QUESTION_TEXT1], "answer": [ANSWER_TEXT1]}, + {"taskname": "squad", "context": [CONTEXT_PARAGRAPH_TEXT2], "question": [QUESTION_TEXT2], "answer": [ANSWER_TEXT2]}, + {"taskname": "intent_and_slot", "utterance": [UTTERANCE_TEXT1], "label": [INTENT_TEXT1][SLOT_TEXT1]}, + {"taskname": "intent_and_slot", "utterance": [UTTERANCE_TEXT2], "label": [INTENT_TEXT2][SLOT_TEXT2]}, + {"taskname": "sentiment", "sentence": [SENTENCE_TEXT1], "label": [SENTIMENT_LABEL1]}, + {"taskname": "sentiment", "sentence": [SENTENCE_TEXT2], "label": [SENTIMENT_LABEL2]}, + ] + +These additional fields can be unlimited in number and will be used to help map different parts of the discrete text input to a prompt template that you define. We show how this mapping works and how to construct your prompt template in the Prompt Formatting section. Data examples for each dataset can all be passed to the dataset class in one file, or in separate ``.jsonl`` files in a list. + +.. _data-example-label: + +Prompt Formatting +^^^^^^^^^^^^^^^^^ + +To customize different prompts for different tasks, we simply need to specify the prompt task template in the config file at ``model.task_templates``. The virtual token markers ``<|VIRTUAL_PROMPT_#|>`` signify where you want virtual tokens to be placed in the template string. ``<|VIRTUAL_PROMPT_0|>``, ``<|VIRTUAL_PROMPT_1|>``, and ``<|VIRTUAL_PROMPT_2|>`` indicate where a number of virtual tokens matching the values given at ``virtual_token_splits[0]``, ``virtual_token_splits[1]`` and ``virtual_token_splits[2]`` will be placed. The other variable fields ``{var}`` refer to the fields in the data json. + +For example, given: + +- the data json ``{"sentence1": "And he said, Mama, I'm home.", "sentence2": "He didn't say a word."}`` +- virtual token splits set to ``virtual_token_splits = [3, 3, 3]`` +- a prompt template set to ``prompt_template = "<|VIRTUAL_PROMPT_0|> Hypothesis: [sentence1], <|VIRTUAL_PROMPT_1|> Premise: [sentence2] <|VIRTUAL_PROMPT_2|> Answer:"`` + +the input will be translated into ``VVV Hypothesis: And he said, Mama, I'm home. VVV Premise: He didn't say a word. VVV Answer:``, where ``VVV`` are three virtual tokens. + +**We recommend you first try prompt learning by placing all virtual tokens at the very beginning of your prompt template** like we do with the ``sentiment`` task example below. We've found this gives strong performance. +.. code:: + + config.model.task_templates = [ + { + "taskname": "sentiment", + "prompt_template": "<|VIRTUAL_PROMPT_0|> {sentence} sentiment: {label}", + "total_virtual_tokens": 10, + "virtual_token_splits": [10], + "truncate_field": "sentence", + "answer_only_loss": False, + }, + { + "taskname": "intent_and_slot", + "prompt_template": "<|VIRTUAL_PROMPT_0|> Predict intent and slot <|VIRTUAL_PROMPT_1|> :\n{utterance}{label}", + "total_virtual_tokens": 10, + "virtual_token_splits": [7, 3], + "truncate_field": None, + "answer_only_loss": True, + "answer_field": "label" + } + ] + +.. 
_prompt-formatting-label: + +``model.task_templates`` Config Parameters +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +.. list-table:: + :widths: 15 15 25 + :header-rows: 1 + + * - **Parameter** + - **Data type** + - **Description** + * - **taskname** + - string + - Short string denoting the task, used to lookup task specific virtual tokens from the ``prompt_table``. Refers to the same ``taskname`` in the dataset json objects. + * - **prompt_template** + - string + - a string showing the model where to place virtual tokens and how to map dataset json fields to where they belong in the model prompt + * - **total_virtual_tokens** + - int + - specifies the total number of virtual tokens that will be inserted into the model prompt + * - **virtual_token_splits** + - list of ints + - specifies the number of virtual tokens that belong at each ``<|VIRTUAL_PROMPT_#|>`` marker. ``virtual_token_splits`` values should add up to ``total_virtual_tokens``. The number of ``virtual_token_splits`` should match the number of ``<|VIRTUAL_PROMPT_#|>`` markers. + * - **answer_only_loss** + - bool + - Whether to limit loss calculation to only the answer portion of the prompt during tuning. Strongly recommended for long prompts. + * - **answer_field** + - string + - The field in the data json corresponding to the answer. The loss will only be calculated on this portion of the prompt if ``answer_only_loss`` is ``True``. The answer field must be at the end of the prompt template. + * - **truncate_field** + - string + - specifies which field in the data json to truncate if the length of the input exceeds the maximum sequence length of the model. If ``truncate_field`` is set to ``None``, examples that are too long are simply dropped from the dataset. + +Prompt Learning Specific Config Values +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +.. list-table:: + :widths: 15 15 25 + :header-rows: 1 + + * - **Parameter** + - **Data type** + - **Description** + * - **model.nemo_path** + - string + - Path to where you want to save your model after prompt tuning/p-tuning, must end in `.nemo` + * - **model.virtual_prompt_style** + - string + - one of 'prompt-tuning', 'p-tuning', or 'inference' + * - **model.language_model_path** + - string + - Path to the GPT language model .nemo file you want to use for prompt learning, not needed if ``restore_path`` is set + * - **model.restore_path** + - string + - Path to a .nemo file of existing ``MegatronGPTPromptLearningModel`` that has already been prompt tuned or p-tuned on at least one task. P-tuned or prompt tuned in this training session will be added to this model's `prompt_table`. Should be set to ``null`` if none. + * - **model.new_tasks** + - list of strings + - List of new tasknames to be prompt or p-tuned, + * - **model.existing_tasks** + - list of strings + - List of tasks the model has already been p-tuned/prompt-tuned for, needed when a restore path is given. Should be set to ``[]`` if None. + * - **model.task_templates** + - list + - See the ``model.task_templates`` Config Parameters Table above + * - **model.prompt_tuning.new_prompt_init_methods** + - list of strings + - List of 'text' or 'random', should correspond to the order of tasks listed in ``model.new_tasks``. Only needed if `virtual_prompt_style='prompt-tuning'` + * - **model.prompt_tuning.new_prompt_init_text** + - list of strings + - The text you want to use for soft prompt initalization if ``model.prompt_tuning.new_prompt_init_methods`` is set to 'text' for a task. 
+
+Prompt Learning Specific Config Values
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+.. list-table::
+   :widths: 15 15 25
+   :header-rows: 1
+
+   * - **Parameter**
+     - **Data type**
+     - **Description**
+   * - **model.nemo_path**
+     - string
+     - Path to where you want to save your model after prompt tuning/p-tuning, must end in ``.nemo``.
+   * - **model.virtual_prompt_style**
+     - string
+     - One of 'prompt-tuning', 'p-tuning', or 'inference'.
+   * - **model.language_model_path**
+     - string
+     - Path to the GPT language model ``.nemo`` file you want to use for prompt learning, not needed if ``restore_path`` is set.
+   * - **model.restore_path**
+     - string
+     - Path to a ``.nemo`` file of an existing ``MegatronGPTPromptLearningModel`` that has already been prompt-tuned or p-tuned on at least one task. Tasks p-tuned or prompt-tuned in this training session will be added to this model's ``prompt_table``. Should be set to ``null`` if none.
+   * - **model.new_tasks**
+     - list of strings
+     - List of new tasknames to be prompt-tuned or p-tuned.
+   * - **model.existing_tasks**
+     - list of strings
+     - List of tasks the model has already been p-tuned/prompt-tuned for, needed when a restore path is given. Should be set to ``[]`` if none.
+   * - **model.task_templates**
+     - list
+     - See the ``model.task_templates`` Config Parameters Table above.
+   * - **model.prompt_tuning.new_prompt_init_methods**
+     - list of strings
+     - List of 'text' or 'random', should correspond to the order of tasks listed in ``model.new_tasks``. Only needed if `virtual_prompt_style='prompt-tuning'`.
+   * - **model.prompt_tuning.new_prompt_init_text**
+     - list of strings
+     - The text you want to use for soft prompt initialization if ``model.prompt_tuning.new_prompt_init_methods`` is set to 'text' for a task. Should correspond to the order of tasks listed in ``model.new_tasks``. The text is tokenized and clipped or tiled to match ``total_virtual_tokens`` in ``model.task_templates``. The vocab embeddings associated with each token are copied and used to initialize the soft prompts before tuning.
+   * - **model.p_tuning.dropout**
+     - float
+     - LSTM prompt encoder dropout probability.
+   * - **model.p_tuning.num_layers**
+     - int
+     - Number of layers in the LSTM prompt encoder.
+   * - **model.tensor_model_parallel_size**
+     - int
+     - Intra-layer model parallelism, must match the ``tensor_model_parallel_size`` of the GPT model given at ``language_model_path``.
+   * - **model.batch_size**
+     - int
+     - Global batch size.
+   * - **model.data.train_ds**
+     - list of strings
+     - List of ``.json`` or ``.jsonl`` training dataset files with json objects that have the dataset format described above.
+   * - **model.data.validation_ds**
+     - list of strings
+     - List of ``.json`` or ``.jsonl`` validation dataset files with json objects that have the dataset format described above.
+   * - **model.data.add_eos**
+     - bool
+     - Whether to add an EOS token at the end of each training example (recommended).
+
+An example config file can be found at https://github.com/NVIDIA/NeMo/tree/stable/examples/nlp/language_modeling/conf/megatron_gpt_prompt_learning_config.yaml
+
+Setting New Tasks
+^^^^^^^^^^^^^^^^^
+
+After you p-tune or prompt-tune your model, you can always go back and p-tune or prompt-tune it on more tasks without overwriting the virtual prompts that have already been trained. You can also use a different number of ``total_virtual_tokens`` between training sessions, as long as tasks p-tuned or prompt-tuned at the same time have the same number of ``total_virtual_tokens``. For this reason, when you p-tune on a new task, you need to tell your model which of your tasks are new and which ones already exist (and thus should not be tuned). You do this by setting the ``new_tasks`` and ``existing_tasks`` values in the config file.
+
+Example Multi-Task Prompt Tuning Config and Command
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+First, define a config called ``multitask-prompt-learning.yaml``, as demonstrated below. **In the** ``exp_manager`` **portion of the config,** ``save_nemo_on_train_end`` **should be set to** ``False`` **to avoid unnecessarily saving the incorrect model weights.** This is already done in the example `megatron_gpt_prompt_learning_config.yaml config `_ that you should use as your starting point. The correct prompt learning model will be saved at the ``model.nemo_path`` you set.
+
+.. code::
+
+   name: multitask_prompt_tuning
+   trainer: ...
+   exp_manager: ...
+ model: + seed: 1234 + nemo_path: ${name}.nemo + virtual_prompt_style: "prompt-tuning" + encoder_seq_length: 2048 + tensor_model_parallel_size: 1 + pipeline_model_parallel_size: 1 + global_batch_size: 16 + micro_batch_size: 4 + + restore_path: null + language_model_path: models/megatron_125M_gpt.nemo + existing_tasks: [] + new_tasks: ["sentiment", "intent_and_slot"] + + task_templates: + - taskname: "sentiment" + prompt_template: "<|VIRTUAL_PROMPT_0|> {sentence} sentiment: {label}" + total_virtual_tokens: 100 + virtual_token_splits: [100] + truncate_field: null + answer_only_loss: False + + - taskname: "intent_and_slot" + prompt_template: "<|VIRTUAL_PROMPT_0|> Predict intent and slot <|VIRTUAL_PROMPT_1|> :\n{utterance}{label}" + total_virtual_tokens: 100 + virtual_token_splits: [80, 20] + truncate_field: null + answer_only_loss: True + answer_field: "label" + + prompt_tuning: + new_prompt_init_methods: ["text", "text"] + new_prompt_init_text: ["financial sentiment analysis postive neutral negative", "intent and slot classification virtual assistant task bot please"] + + data: + train_ds: ["data/financial_phrase_bank_train.jsonl", "data/assistent_train.jsonl"] + validation_ds: ["data/financial_phrase_bank_val.jsonl", "data/assistent_val.jsonl"] + add_eos: True + shuffle: True + num_workers: 1 + pin_memory: True + + optim: ... + +(See https://github.com/NVIDIA/NeMo/tree/stable/examples/nlp/language_modeling/conf/megatron_gpt_prompt_learning_config.yaml for what should go in the ``trainer``, ``exp_manager``, and ``optim`` sections.) + +Then run the command + +.. code:: + + python megatron_gpt_prompt_learning.py --config-name=multitask-prompt-learning.yaml + + +Example Multi-Task P-Tuning Config and Command After Prompt-Tuning +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Update ``multitask-prompt-learning.yaml`` from the example above with p-tuning parameters for the new task. Be sure to update ``model.existing_tasks`` with the tasknames from previous prompt learning runs and to use the ``.nemo`` file saved at the end of your last prompt learning session. Values different from the config above have stars commented next to them. + +In this example, the SQuAD task includes the question context as part of the prompt. Because the context is long, we recommend setting ``answer_only_loss`` to ``True`` for this task, and any task where a significant portion of the prompt is not a part of the answer. ``answer_only_loss`` tells the model to only calculate the cross-entropy loss on the answer portion of the training example. Though we recommend placing all virtual tokens at the beginning of the prompt, we place them throughout the prompt in this example to demonstrate how to do so. + +.. code:: + + name: multitask_p_tuning # *** + trainer: ... + exp_manager: ... 
+   model:
+     seed: 1234
+     nemo_path: ${name}.nemo
+     virtual_prompt_style: "p-tuning" # ***
+     encoder_seq_length: 2048
+     tensor_model_parallel_size: 1
+     pipeline_model_parallel_size: 1
+     global_batch_size: 16
+     micro_batch_size: 4
+
+     restore_path: multitask_prompt_tuning.nemo # ***
+     language_model_path: models/megatron_125M_gpt.nemo
+     existing_tasks: ["sentiment", "intent_and_slot"] # ***
+     new_tasks: ["squad"]
+
+     task_templates:
+     - taskname: "sentiment"
+       prompt_template: "<|VIRTUAL_PROMPT_0|> {sentence} sentiment: {label}"
+       total_virtual_tokens: 100
+       virtual_token_splits: [100]
+       truncate_field: null
+       answer_only_loss: False
+
+     - taskname: "intent_and_slot"
+       prompt_template: "<|VIRTUAL_PROMPT_0|> Predict intent and slot <|VIRTUAL_PROMPT_1|> :\n{utterance}{label}"
+       total_virtual_tokens: 100
+       virtual_token_splits: [80, 20]
+       truncate_field: null
+       answer_only_loss: True
+       answer_field: "label"
+
+     - taskname: "squad" # ***
+       prompt_template: "<|VIRTUAL_PROMPT_0|> Answer the question from the context {question} {context} Answer: {answer}" # ***
+       total_virtual_tokens: 9 # ***
+       virtual_token_splits: [9] # ***
+       truncate_field: context # ***
+       answer_only_loss: True # ***
+       answer_field: "answer" # ***
+
+     p_tuning: # ***
+       dropout: 0.0 # ***
+       num_layers: 2 # ***
+
+     data:
+       train_ds: ["data/squad_train.jsonl"] # ***
+       validation_ds: ["data/squad_val.jsonl"] # ***
+       add_eos: True
+       shuffle: True
+       num_workers: 1
+       pin_memory: True
+
+     optim: ...
+
+Then run the command again:
+
+.. code::
+
+   python megatron_gpt_prompt_learning.py --config-name=multitask-prompt-learning.yaml
+
+
+Example Multi-Task Inference
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The inference file can contain a mix of prompts from all the tasks the model has been prompt-tuned on.
+
+.. code::
+
+   python megatron_gpt_prompt_learning_eval.py \
+       virtual_prompt_model_file=PATH_TO_NEMO_PROMPT_LEARNING_MODEL_FILE \
+       gpt_model_file=PATH_TO_FROZEN_GPT_MODEL_FILE \
+       inference.greedy=True \
+       inference.add_BOS=False \
+       trainer.devices=1 \
+       trainer.num_nodes=1 \
+       tensor_model_parallel_size=1 \
+       pipeline_model_parallel_size=1 \
+       prompts=[prompt1,prompt2]
+
+``virtual_prompt_model_file`` should be the path to the ``.nemo`` file saved after p-tuning/prompt tuning, and ``gpt_model_file`` is still the path to the GPT model's ``.nemo`` file.
+
+``prompts`` in this case should be a list of ``.json`` or ``.jsonl`` files containing json objects similar to the ones used during prompt learning. They should have keys that match the fields specified in the prompt template. Fields can be dropped from the prompt dict, and their corresponding section of the prompt template will be automatically removed.
+
+For example, say the prompt template during p-tuning/prompt-tuning looked like:
+
+.. code::
+
+   '<|VIRTUAL_PROMPT_0|> Context: {context} Question: {question} Answer: {answer}'
+
+but you don't want to include the answer field during inference. Simply leave the answer field out of the prompt dict, as below:
+
+.. code::
+
+   {"taskname": "squad", "context": "some paragraph", "question": "question related to paragraph"}
+   {"taskname": "squad", "context": "another paragraph", "question": "a different question related to paragraph"}
+
+
+And the dataset class will automatically format your input to have the form:
+
+..
code:: + + [ + '<|VIRTUAL_PROMPT_0|> Context: some paragraph Question: question related to paragraph Answer: ', + '<|VIRTUAL_PROMPT_0|> Context: another paragraph Question: a different question related to paragraph Answer: ' + ] + +Generally prompt learning inference is just like running inference with a GPT model. The only difference is you need to add ``virtual_prompt_model_file=PATH_TO_NEMO_PROMPT_LEARNING_MODEL_FILE`` to your command if you're using a p-tuned/prompt-tuned model. + +Example prompt learning script: `NeMo/examples/nlp/language_modeling/megatron_gpt_prompt_learning.py.py `__. + +Example prompt tuned inference script: `NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py `__. diff --git a/docs/source/nlp/nemo_megatron/retro/images/arch.png b/docs/source/nlp/nemo_megatron/retro/images/arch.png new file mode 100644 index 0000000000000000000000000000000000000000..eca506a12ceaa0a26c317824412987cc22a9f299 Binary files /dev/null and b/docs/source/nlp/nemo_megatron/retro/images/arch.png differ diff --git a/docs/source/nlp/nemo_megatron/retro/retro_model.rst b/docs/source/nlp/nemo_megatron/retro/retro_model.rst new file mode 100644 index 0000000000000000000000000000000000000000..edbec3d1c2ca43dd1fe0b4b49a61144e8acee4c1 --- /dev/null +++ b/docs/source/nlp/nemo_megatron/retro/retro_model.rst @@ -0,0 +1,2 @@ +Coming Soon ... +================ \ No newline at end of file diff --git a/docs/source/nlp/nlp_all.bib b/docs/source/nlp/nlp_all.bib new file mode 100644 index 0000000000000000000000000000000000000000..fd0f15f6d1da58e9c0b3254711f5600678facdaf --- /dev/null +++ b/docs/source/nlp/nlp_all.bib @@ -0,0 +1,218 @@ +@article{devlin2018bert, + title={Bert: Pre-training of deep bidirectional transformers for language understanding}, + author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina}, + journal={arXiv preprint arXiv:1810.04805}, + year={2018} +} + +@article{shoeybi2019megatron, + title={Megatron-lm: Training multi-billion parameter language models using model parallelism}, + author={Shoeybi, Mohammad and Patwary, Mostofa and Puri, Raul and LeGresley, Patrick and Casper, Jared and Catanzaro, Bryan}, + journal={arXiv preprint arXiv:1909.08053}, + year={2019} +} + +@InProceedings{maas2011, + author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher}, + title = {Learning Word Vectors for Sentiment Analysis}, + booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies}, + month = {June}, + year = {2011}, + address = {Portland, Oregon, USA}, + publisher = {Association for Computational Linguistics}, + pages = {142--150}, + url = {http://www.aclweb.org/anthology/P11-1015} +} + +@inproceedings{socher2013, + title = "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank", + author = "Socher, Richard and Perelygin, Alex and Wu, Jean and Chuang, Jason and Manning, Christopher D. 
and Ng, Andrew and Potts, Christopher", + booktitle = "Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing", + month = oct, + year = "2013", + address = "Seattle, Washington, USA", + publisher = "Association for Computational Linguistics", + url = "https://www.aclweb.org/anthology/D13-1170", + pages = "1631--1642", +} + +@article{lim2018chemical, + title={Chemical--gene relation extraction using recursive neural network}, + author={Lim, Sangrak and Kang, Jaewoo}, + journal={Database}, + volume={2018}, + year={2018}, + publisher={Oxford Academic} +} + +@inproceedings{li2007scalable, + title={Scalable term selection for text categorization}, + author={Li, Jingyang and Sun, Maosong}, + booktitle={Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)}, + pages={774--782}, + year={2007} +} + +@misc{lee2019biobert, + title={BioBERT: a pre-trained biomedical language representation model for biomedical text mining}, + author={Jinhyuk Lee and Wonjin Yoon and Sungdong Kim and Donghyeon Kim and Sunkyu Kim and Chan Ho So and Jaewoo Kang}, + year={2019}, + eprint={1901.08746}, + archivePrefix={arXiv}, + primaryClass={cs.CL} +} + +@misc{shin2020biomegatron, + title={BioMegatron: Larger Biomedical Domain Language Model}, + author={Hoo-Chang Shin and Yang Zhang and Evelina Bakhturina and Raul Puri and Mostofa Patwary and Mohammad Shoeybi and Raghav Mani}, + year={2020}, + eprint={2010.06060}, + archivePrefix={arXiv}, + primaryClass={cs.CL} +} + +@inproceedings{vaswani2017attention, + title={Attention is all you need}, + author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, {\L}ukasz and Polosukhin, Illia}, + booktitle={Advances in Neural Information Processing Systems}, + pages={6000--6010}, + year={2017} +} + +@article{sennrich2015neural, + title={Neural machine translation of rare words with subword units}, + author={Sennrich, Rico and Haddow, Barry and Birch, Alexandra}, + journal={arXiv preprint arXiv:1508.07909}, + year={2015} +} + +@article{provilkov2019bpe, + title={Bpe-dropout: Simple and effective subword regularization}, + author={Provilkov, Ivan and Emelianenko, Dmitrii and Voita, Elena}, + journal={arXiv preprint arXiv:1910.13267}, + year={2019} +} + +@article{post2018call, + title={A call for clarity in reporting BLEU scores}, + author={Post, Matt}, + journal={arXiv preprint arXiv:1804.08771}, + year={2018} +} + +@misc{zhang2021sgdqa, + title={SGD-QA: Fast Schema-Guided Dialogue State Tracking for Unseen Services}, + author={Yang Zhang and Vahid Noroozi and Evelina Bakhturina and Boris Ginsburg}, + year={2021}, + eprint={2105.08049}, + archivePrefix={arXiv}, + primaryClass={cs.CL} +} + +@article{zhang2019neural, + title={Neural Models of Text Normalization for Speech Applications}, + author={Hao Zhang and R. Sproat and Axel H. Ng and Felix Stahlberg and Xiaochang Peng and Kyle Gorman and B. 
Roark}, + journal={Computational Linguistics}, + year={2019}, + pages={293-338} +} + +@misc{liu2021selfalignment, + title={Self-Alignment Pretraining for Biomedical Entity Representations}, + author={Fangyu Liu and Ehsan Shareghi and Zaiqiao Meng and Marco Basaldella and Nigel Collier}, + year={2021}, + eprint={2010.11784}, + archivePrefix={arXiv}, + primaryClass={cs.CL} + } + +@article{gulcehre2015using, + title={On using monolingual corpora in neural machine translation}, + author={Gulcehre, Caglar and Firat, Orhan and Xu, Kelvin and Cho, Kyunghyun and Barrault, Loic and Lin, Huei-Chi and Bougares, Fethi and Schwenk, Holger and Bengio, Yoshua}, + journal={arXiv preprint arXiv:1503.03535}, + year={2015} +} + +@article{yee2019simple, + title={Simple and effective noisy channel modeling for neural machine translation}, + author={Yee, Kyra and Ng, Nathan and Dauphin, Yann N and Auli, Michael}, + journal={arXiv preprint arXiv:1908.05731}, + year={2019} +} + +@inproceedings{koehnetal2007moses, + title = "{M}oses: Open Source Toolkit for Statistical Machine Translation", + author = "Koehn, Philipp and + Hoang, Hieu and + Birch, Alexandra and + Callison-Burch, Chris and + Federico, Marcello and + Bertoldi, Nicola and + Cowan, Brooke and + Shen, Wade and + Moran, Christine and + Zens, Richard and + Dyer, Chris and + Bojar, Ond{\v{r}}ej and + Constantin, Alexandra and + Herbst, Evan", + booktitle = "Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions", + month = jun, + year = "2007", + address = "Prague, Czech Republic", + publisher = "Association for Computational Linguistics", + url = "https://aclanthology.org/P07-2045", + pages = "177--180", +} + +@inproceedings{sunkara20_interspeech, + author={Monica Sunkara and Srikanth Ronanki and Dhanush Bekal and Sravan Bodapati and Katrin Kirchhoff}, + title={{Multimodal Semi-Supervised Learning Framework for Punctuation Prediction in Conversational Speech}}, + year=2020, + booktitle={Proc. 
Interspeech 2020}, + pages={4911--4915}, + doi={10.21437/Interspeech.2020-3074} +} + +@article{chen2019bert, + title={Bert for joint intent classification and slot filling}, + author={Chen, Qian and Zhuo, Zhu and Wang, Wen}, + journal={arXiv preprint arXiv:1902.10909}, + year={2019} +} + +@article{borgeaud2021improving, + title={Improving language models by retrieving from trillions of tokens}, + author={Borgeaud, Sebastian and Mensch, Arthur and Hoffmann, Jordan and Cai, Trevor and Rutherford, Eliza and Millican, Katie and Driessche, George van den and Lespiau, Jean-Baptiste and Damoc, Bogdan and Clark, Aidan and others}, + journal={arXiv preprint arXiv:2112.04426}, + year={2021} +} + +@article{su2021roformer, + title={Roformer: Enhanced transformer with rotary position embedding}, + author={Su, Jianlin and Lu, Yu and Pan, Shengfeng and Wen, Bo and Liu, Yunfeng}, + journal={arXiv preprint arXiv:2104.09864}, + year={2021} +} + +@article{reimers2019sentence, + title={Sentence-bert: Sentence embeddings using siamese bert-networks}, + author={Reimers, Nils and Gurevych, Iryna}, + journal={arXiv preprint arXiv:1908.10084}, + year={2019} +} + +@article{yang2022tensor, + title={Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer}, + author={Yang, Greg and Hu, Edward J and Babuschkin, Igor and Sidor, Szymon and Liu, Xiaodong and Farhi, David and Ryder, Nick and Pachocki, Jakub and Chen, Weizhu and Gao, Jianfeng}, + journal={arXiv preprint arXiv:2203.03466}, + year={2022} +} + +@article{jegou2022faiss, + title={Faiss: Similarity search and clustering of dense vectors library}, + author={J{\'e}gou, Herv{\'e} and Douze, Matthijs and Johnson, Jeff and Hosseini, Lucas and Deng, Chengqi}, + journal={Astrophysics Source Code Library}, + pages={ascl--2210}, + year={2022} +} diff --git a/docs/source/nlp/nlp_model.rst b/docs/source/nlp/nlp_model.rst new file mode 100644 index 0000000000000000000000000000000000000000..0c0b800fe44b5538d881ef3939931d0ea1ae0472 --- /dev/null +++ b/docs/source/nlp/nlp_model.rst @@ -0,0 +1,43 @@ +.. _nlp_model: + +Model NLP +========= + +The config file for NLP models contain three main sections: + + - ``trainer``: contains the configs for PTL training. For more information, refer to :doc:`../core/core` and `PTL Trainer class API `. + - ``exp_manager``: the configs of the experiment manager. For more information, refer to :doc:`../core/core`. + - ``model``: contains the configs of the datasets, model architecture, tokenizer, optimizer, scheduler, etc. + +The following sub-sections of the model section are shared among most of the NLP models. + + - ``tokenizer``: specifies the tokenizer + - ``language_model``: specifies the underlying model to be used as the encoder + - ``optim``: the configs of the optimizer and scheduler :doc:`../core/core` + +The ``tokenizer`` and ``language_model`` sections have the following parameters: + ++------------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ +| **Parameter** | **Data Type** | **Description** | ++------------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ +| **model.tokenizer.tokenizer_name** | string | Tokenizer name will be filled automatically based on ``model.language_model.pretrained_model_name``. 
| ++------------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ +| **model.tokenizer.vocab_file** | string | Path to tokenizer vocabulary. | ++------------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ +| **model.tokenizer.tokenizer_model** | string | Path to tokenizer model (only for sentencepiece tokenizer). | ++------------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ +| **model.language_model.pretrained_model_name** | string | Pre-trained language model name, for example: ``bert-base-cased`` or ``bert-base-uncased``. | ++------------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ +| **model.language_model.lm_checkpoint** | string | Path to the pre-trained language model checkpoint. | ++------------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ +| **model.language_model.config_file** | string | Path to the pre-trained language model config file. | ++------------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ +| **model.language_model.config** | dictionary | Config of the pre-trained language model. | ++------------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+ + +The parameter **model.language_model.pretrained_model_name** can be one of the following: + + - ``megatron-bert-345m-uncased``, ``megatron-bert-345m-cased``, ``biomegatron-bert-345m-uncased``, ``biomegatron-bert-345m-cased``, ``bert-base-uncased``, ``bert-large-uncased``, ``bert-base-cased``, ``bert-large-cased`` + - ``distilbert-base-uncased``, ``distilbert-base-cased`` + - ``roberta-base``, ``roberta-large``, ``distilroberta-base`` + - ``albert-base-v1``, ``albert-large-v1``, ``albert-xlarge-v1``, ``albert-xxlarge-v1``, ``albert-base-v2``, ``albert-large-v2``, ``albert-xlarge-v2``, ``albert-xxlarge-v2`` diff --git a/docs/source/nlp/punctuation_and_capitalization.rst b/docs/source/nlp/punctuation_and_capitalization.rst new file mode 100644 index 0000000000000000000000000000000000000000..4be0d2151d8ee86425cb3a3185f1f14a2b1569e7 --- /dev/null +++ b/docs/source/nlp/punctuation_and_capitalization.rst @@ -0,0 +1,921 @@ +.. _punctuation_and_capitalization: + +Punctuation and Capitalization Model +==================================== + +Quick Start Guide +----------------- + +.. 
code-block:: python + + from nemo.collections.nlp.models import PunctuationCapitalizationModel + + # to get the list of pre-trained models + PunctuationCapitalizationModel.list_available_models() + + # Download and load the pre-trained BERT-based model + model = PunctuationCapitalizationModel.from_pretrained("punctuation_en_bert") + + # try the model on a few examples + model.add_punctuation_capitalization(['how are you', 'great how about you']) + +Model Description +----------------- + +For each word in the input text, the Punctuation and Capitalization model: + +- predicts a punctuation mark that should follow the word (if any). By default, the model supports commas, periods, and question marks. +- predicts if the word should be capitalized or not + +In the Punctuation and Capitalization model, we are jointly training two token-level classifiers on top of a pre-trained +language model, such as `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding `__ :cite:`nlp-punct-devlin2018bert`. + +.. note:: + + We recommend you try this model in a Jupyter notebook (run on `Google's Colab `_.): `NeMo/tutorials/nlp/Punctuation_and_Capitalization.ipynb `__. + + Connect to an instance with a GPU (**Runtime** -> **Change runtime type** -> select **GPU** for the hardware accelerator). + + An example script on how to train and evaluate the model can be found at: `NeMo/examples/nlp/token_classification/punctuation_capitalization_train_evaluate.py `__. + + The default configuration file for the model can be found at: `NeMo/examples/nlp/token_classification/conf/punctuation_capitalization_config.yaml `__. + + The script for inference can be found at: `NeMo/examples/nlp/token_classification/punctuate_capitalize_infer.py `__. + +.. _raw_data_format_punct: + +Raw Data Format +--------------- + +The Punctuation and Capitalization model can work with any text dataset, although it is recommended to balance the +data, especially for the punctuation task. Before pre-processing the data to the format expected by the model, the +data should be split into ``train.txt`` and ``dev.txt`` (and optionally ``test.txt``). Each line in the +``train.txt/dev.txt/test.txt`` should represent one or more full and/or truncated sentences. + +Example of the ``train.txt``/``dev.txt`` file: + +.. code:: + + When is the next flight to New York? + The next flight is ... + .... + + +The ``source_data_dir`` structure should look similar to the following: + +.. code:: + + . + |--sourced_data_dir + |-- dev.txt + |-- train.txt + +.. _nemo-data-format-label: + +NeMo Data Format +---------------- + +The Punctuation and Capitalization model expects the data in the following format: + +The training and evaluation data is divided into 2 files: +- ``text.txt`` +- ``labels.txt`` + +Each line of the ``text.txt`` file contains text sequences, where words are separated with spaces. + +[WORD] [SPACE] [WORD] [SPACE] [WORD], for example: + + :: + + when is the next flight to new york + the next flight is ... + ... + +The ``labels.txt`` file contains corresponding labels for each word in ``text.txt``, the labels are separated with +spaces. 
Each label in the ``labels.txt`` file consists of 2 symbols:
+
+- the first symbol of the label indicates what punctuation mark should follow the word (where ``O`` means no
+  punctuation needed)
+
+- the second symbol determines if a word needs to be capitalized or not (where ``U`` indicates that the word should be
+  upper cased, and ``O`` - no capitalization needed)
+
+By default, the following punctuation marks are considered: commas, periods, and question marks; the remaining punctuation marks were
+removed from the data. This can be changed by introducing new labels in the ``labels.txt`` files.
+
+Each line of the ``labels.txt`` file should follow the format: ``[LABEL] [SPACE] [LABEL] [SPACE] [LABEL]``. For example,
+labels for the above ``text.txt`` file should be:
+
+   ::
+
+    OU OO OO OO OO OO OU ?U
+    OU OO OO OO ...
+    ...
+
+The complete list of all possible labels used in this tutorial is:
+
+- ``OO``
+- ``,O``
+- ``.O``
+- ``?O``
+- ``OU``
+- ``,U``
+- ``.U``
+- ``?U``
+
+Converting Raw Data to NeMo Format
+----------------------------------
+
+To pre-process the raw text data stored under :code:`sourced_data_dir` (see the :ref:`raw_data_format_punct`
+section), run the following command:
+
+.. code::
+
+   python examples/nlp/token_classification/data/prepare_data_for_punctuation_capitalization.py \
+          -s \
+          -o
+
+
+Required Arguments for Dataset Conversion
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+- :code:`-s` or :code:`--source_file`: path to the raw file
+- :code:`-o` or :code:`--output_dir`: path to the directory to store the converted files
+
+After the conversion, the :code:`output_dir` should contain :code:`labels_*.txt` and :code:`text_*.txt` files. The
+default names for the training and evaluation files in the :code:`conf/punctuation_capitalization_config.yaml` are the
+following:
+
+.. code::
+
+   .
+   |--output_dir
+     |-- labels_dev.txt
+     |-- labels_train.txt
+     |-- text_dev.txt
+     |-- text_train.txt
+
+Tarred dataset
+--------------
+
+Tokenization and encoding of the data is quite costly for the punctuation and capitalization task. If your dataset contains a
+lot of samples (~4M), you may want to use a tarred dataset. A tarred dataset is a collection of `.tar` files which
+contain batches ready for passing into a model. A tarred dataset is not loaded into memory entirely, but in small pieces
+that do not overflow memory. Tarred datasets rely on `webdataset `_.
+
+To create a tarred dataset, you will need data in the NeMo format:
+
+.. code::
+
+   python examples/nlp/token_classification/data/create_punctuation_capitalization_tarred_dataset.py \
+       --text \
+       --labels \
+       --output_dir \
+       --num_batches_per_tarfile 100
+
+All tar files contain a similar number of batches, so up to :code:`--num_batches_per_tarfile - 1` batches will be
+discarded during tarred dataset creation.
+
+Besides the `.tar` files with batches, the
+`examples/nlp/token_classification/data/create_punctuation_capitalization_tarred_dataset.py
+`_
+script will create a metadata JSON file and 2 `.csv` files with punctuation and
+capitalization label vocabularies. To use a tarred dataset, you will need to pass the path to the metadata file of your dataset
+in the config parameter :code:`model.train_ds.tar_metadata_file` and set the config parameter
+:code:`model.train_ds.use_tarred_dataset=true`.
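+
+Putting those two parameters together, a training run could consume the tarred dataset with overrides along the lines of the command below. This is only a sketch: the metadata path is a placeholder for the JSON file produced by the script above, and all remaining options come from the default training config.
+
+.. code::
+
+   # Illustrative example only -- the metadata path below is a placeholder.
+   python examples/nlp/token_classification/punctuation_capitalization_train_evaluate.py \
+       model.train_ds.use_tarred_dataset=true \
+       model.train_ds.tar_metadata_file=<PATH/TO/metadata.json>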
+
+Training Punctuation and Capitalization Model
+---------------------------------------------
+
+The language model is initialized with a pre-trained model from
+`HuggingFace Transformers `__, unless the user provides a pre-trained
+checkpoint for the language model. To train the model from scratch, you will need to provide a HuggingFace configuration in
+one of the parameters ``model.language_model.config_file`` or ``model.language_model.config``. An example of a model
+configuration file for training the model can be found at:
+`NeMo/examples/nlp/token_classification/conf/punctuation_capitalization_config.yaml `__.
+
+A configuration file is a `.yaml` file which contains all parameters for model creation, training, testing, and validation.
+The structure of the configuration file for training and testing is described in the :ref:`Run config`
+section. Some parameters are required in a punctuation-and-capitalization `.yaml` config. The default values of
+required parameters are ``???``. If you omit any of the other parameters, they will be initialized according to the default
+values from the following tables.
+
+.. _run-config-label:
+
+Run config
+^^^^^^^^^^
+
+An example of a config file is
+`here `_.
+
+.. list-table:: Run config. The main config passed to the script `punctuation_capitalization_train_evaluate.py `_
+   :widths: 5 5 10 25
+   :header-rows: 1
+
+   * - **Parameter**
+     - **Data type**
+     - **Default value**
+     - **Description**
+   * - **pretrained_model**
+     - string
+     - ``null``
+     - Can be an NVIDIA NGC cloud model or a path to a ``.nemo`` checkpoint. You can get the list of possible cloud options
+       by calling the method :py:meth:`~nemo.collections.nlp.models.PunctuationCapitalizationModel.list_available_models`.
+   * - **name**
+     - string
+     - ``'Punctuation_and_Capitalization'``
+     - The name of the model. Used for naming output directories and ``.nemo`` checkpoints.
+   * - **do_training**
+     - bool
+     - ``true``
+     - Whether to perform training of the model.
+   * - **do_testing**
+     - bool
+     - ``false``
+     - Whether to perform testing of the model after training.
+   * - **model**
+     - :ref:`model config`
+     - :ref:`model config`
+     - A configuration for the :class:`~nemo.collections.nlp.models.PunctuationCapitalizationModel`.
+   * - **trainer**
+     - trainer config
+     -
+     - Parameters of
+       `pytorch_lightning.Trainer `_.
+   * - **exp_manager**
+     - exp manager config
+     -
+     - A configuration with various NeMo training options such as output directories, resuming from checkpoint,
+       tensorboard and W&B logging, and so on. For possible options see the :ref:`exp-manager-label` description and the class
+       :class:`~nemo.utils.exp_manager.exp_manager`.
+
+.. _model-config-label:
+
+Model config
+^^^^^^^^^^^^
+
+.. list-table:: Location of model config in parent config
+   :widths: 5 5
+   :header-rows: 1
+
+   * - **Parent config**
+     - **Key in parent config**
+   * - :ref:`Run config`
+     - ``model``
+
+A configuration of the
+:class:`~nemo.collections.nlp.models.token_classification.punctuation_capitalization_model.PunctuationCapitalizationModel`
+model.
+
+.. list-table:: Model config
+   :widths: 5 5 10 25
+   :header-rows: 1
+
+   * - **Parameter**
+     - **Data type**
+     - **Default value**
+     - **Description**
+   * - **class_labels**
+     - :ref:`class labels config`
+     - :ref:`class labels config`
+     - Cannot be omitted in the `.yaml` config. The ``class_labels`` parameter contains a dictionary with names of label
+       id files used in ``.nemo`` checkpoints. These file names can also be used for passing label vocabularies to the
+       model.
If you wish to use ``class_labels`` for passing vocabularies, please provide path to vocabulary files in + ``model.common_dataset_parameters.label_vocab_dir`` parameter. + * - **common_dataset_parameters** + - :ref:`common dataset parameters config` + - :ref:`common dataset parameters config` + - Label ids and loss mask information. + * - **train_ds** + - :ref:`data config` with string in ``ds_item`` + - ``null`` + - A configuration for creating training dataset and data loader. Cannot be omitted in `.yaml` config if training + is performed. + * - **validation_ds** + - :ref:`data config` with string OR list of strings in ``ds_item`` + - ``null`` + - A configuration for creating validation datasets and data loaders. + * - **test_ds** + - :ref:`data config` with string OR list of strings in ``ds_item`` + - ``null`` + - A configuration for creating test datasets and data loaders. Cannot be omitted in `.yaml` config if testing is + performed. + * - **punct_head** + - :ref:`head config` + - :ref:`head config` + - A configuration for creating punctuation MLP head that is applied to a language model outputs. + * - **capit_head** + - :ref:`head config` + - :ref:`head config` + - A configuration for creating capitalization MLP head that is applied to a language model outputs. + * - **tokenizer** + - :ref:`tokenizer config` + - :ref:`tokenizer config` + - A configuration for creating source text tokenizer. + * - **language_model** + - :ref:`language model config` + - :ref:`language model config` + - A configuration of a BERT-like language model which serves as a model body. + * - **optim** + - optimization config + - ``null`` + - A configuration of optimizer, learning rate scheduler, and L2 regularization. Cannot be omitted in `.yaml` + config if training is performed. For more information see :ref:`Optimization ` and + `primer `_ tutorial. + +.. _class-labels-config-label: + +Class labels config +^^^^^^^^^^^^^^^^^^^ + +.. list-table:: Location of class labels config in parent configs + :widths: 5 5 + :header-rows: 1 + + * - **Parent config** + - **Key in parent config** + * - :ref:`Run config` + - ``model.class_labels`` + * - :ref:`Model config` + - ``class_labels`` + +.. list-table:: Class labels config + :widths: 5 5 5 35 + :header-rows: 1 + + * - **Parameter** + - **Data type** + - **Default value** + - **Description** + * - **punct_labels_file** + - string + - ??? + - A name of a punctuation labels file. This parameter cannot be omitted in `.yaml` config. This name + is used as a name of label ids file in ``.nemo`` checkpoint. It also can be used for passing label vocabulary to + the model. If ``punct_labels_file`` is used as a vocabulary file, then you should provide parameter + ``label_vocab_dir`` in :ref:`common dataset parameters` + (``model.common_dataset_parameters.label_vocab_dir`` in :ref:`run config`). Each line of + ``punct_labels_file`` file contains 1 label. The values are sorted, ``==