Microsoft
Company • Verified
AI & ML interests: None defined yet.
Collections
Phi-3 family of small language and multi-modal models. Language models are available in short- and long-context lengths.
- microsoft/Phi-3.5-mini-instruct: Text Generation • 4B params • 258k downloads • 878 likes
- microsoft/Phi-3.5-MoE-instruct: Text Generation • 42B params • 34.9k downloads • 558 likes
- microsoft/Phi-3.5-vision-instruct: Image-Text-to-Text • 4B params • 1.07M downloads • 690 likes
- microsoft/Phi-3-mini-4k-instruct: Text Generation • 4B params • 480k downloads • 1.21k likes
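For orientation, a minimal sketch of querying one of these checkpoints through the transformers text-generation pipeline; it assumes a recent transformers release with built-in Phi-3 support and enough memory for the 4B model, and the prompt is illustrative:

```python
# Minimal sketch: chat with Phi-3.5-mini-instruct via the transformers pipeline.
# Assumes a recent transformers release with Phi-3 support and enough GPU/CPU memory.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3.5-mini-instruct",
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain the difference between short- and long-context models in two sentences."}]
result = generator(messages, max_new_tokens=128, do_sample=False)
print(result[0]["generated_text"])  # full conversation, including the assistant reply
```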
Artifacts for the paper "Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements" (https://arxiv.org/abs/2410.08968)
MAI-DS-R1 is a DeepSeek-R1 reasoning model that has been post-trained by the Microsoft AI team.
The SpeechT5 framework consists of a shared sequence-to-sequence network and six modality-specific (speech/text) pre-nets and post-nets, and can address a range of spoken-language tasks such as text-to-speech, speech recognition, and voice conversion.
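As a concrete example, a minimal text-to-speech sketch with the microsoft/speecht5_tts checkpoint and its HiFi-GAN vocoder; the random speaker embedding below only exercises the pipeline (real use would load a 512-dim x-vector from a speaker-embedding model or dataset), and the soundfile package is an assumed dependency:

```python
# Minimal sketch: text-to-speech with SpeechT5 plus the matching HiFi-GAN vocoder.
import torch
import soundfile as sf  # assumed dependency for writing the waveform
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Hello from SpeechT5.", return_tensors="pt")
# SpeechT5 conditions on a 512-dim speaker x-vector; a random one is enough for a smoke
# test, but real usage would take an embedding from a speaker-verification model.
speaker_embeddings = torch.randn(1, 512)

speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
sf.write("speecht5_demo.wav", speech.numpy(), samplerate=16000)
```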
The Table Transformer (TATR) is a series of object detection models useful for table extraction from PDF images.
- microsoft/table-transformer-detection: Object Detection • 0.0B params • 4.09M downloads • 365 likes
- microsoft/table-transformer-structure-recognition: Object Detection • 0.0B params • 1.33M downloads • 194 likes
- microsoft/table-transformer-structure-recognition-v1.1-all: Object Detection • 0.0B params • 1.08M downloads • 74 likes
- microsoft/table-transformer-structure-recognition-v1.1-fin: Object Detection • 0.0B params • 442 downloads • 1 like
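A minimal sketch of running the detection checkpoint on a rendered PDF page; the image path is hypothetical, and a recent transformers release with Table Transformer support is assumed:

```python
# Minimal sketch: detect tables on a page image with Table Transformer.
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

image = Image.open("page.png").convert("RGB")  # hypothetical rendered PDF page

processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Turn raw logits and boxes into thresholded, image-space detections.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(outputs, threshold=0.7, target_sizes=target_sizes)[0]
for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), [round(v, 1) for v in box.tolist()])
```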
Models for biomedical research applications, such as radiology report generation and biomedical language understanding.
UDOP is a general multimodal model for document AI.
- Paper: Unifying Vision, Text, and Layout for Universal Document Processing (arXiv:2212.02623 • 11 upvotes)
- microsoft/udop-large: Image-Text-to-Text • 0.7B params • 3.09k downloads • 116 likes
- microsoft/udop-large-512: Image-Text-to-Text • 0.7B params • 283 downloads • 5 likes
- microsoft/udop-large-512-300k: Image-Text-to-Text • 0.7B params • 166 downloads • 32 likes
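A rough sketch of document question answering with microsoft/udop-large; it assumes a transformers release with UDOP support and that Tesseract is installed so the processor's built-in OCR can supply words and bounding boxes. The image path and question are purely illustrative:

```python
# Rough sketch: prompt UDOP with a document image and a question.
from PIL import Image
from transformers import UdopProcessor, UdopForConditionalGeneration

processor = UdopProcessor.from_pretrained("microsoft/udop-large")
model = UdopForConditionalGeneration.from_pretrained("microsoft/udop-large")

image = Image.open("invoice.png").convert("RGB")  # hypothetical scanned document
# With apply_ocr enabled (the image processor's default), words and boxes are
# extracted by Tesseract; otherwise they must be passed to the processor explicitly.
inputs = processor(images=image, text="Question answering. What is the invoice total?", return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```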
- Paper: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (arXiv:2311.06242 • 94 upvotes)
- microsoft/Florence-2-large: Image-Text-to-Text • 1.02M downloads • 1.58k likes
- microsoft/Florence-2-base: Image-Text-to-Text • 633k downloads • 278 likes
- microsoft/Florence-2-large-ft: Image-Text-to-Text • 62.7k downloads • 355 likes
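A minimal captioning sketch for Florence-2; the model ships custom modeling code on the Hub, so trust_remote_code=True is required, and the image path and task token are illustrative:

```python
# Minimal sketch: image captioning with Florence-2 (custom modeling code on the Hub).
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "microsoft/Florence-2-large"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("photo.jpg").convert("RGB")  # hypothetical input image
task = "<CAPTION>"  # Florence-2 selects tasks via prompt tokens such as <CAPTION>, <OD>, <OCR>

inputs = processor(text=task, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=128,
    num_beams=3,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# post_process_generation parses the raw string into a structured, task-specific result.
print(processor.post_process_generation(raw, task=task, image_size=(image.width, image.height)))
```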
Locomotion policies for hundreds of simulated humanoid locomotion clips and demonstration data for training them.
Phi-4 family of small language, multi-modal, and reasoning models.
- microsoft/Phi-4-mini-reasoning: Text Generation • 4B params • 24.5k downloads • 179 likes
- microsoft/Phi-4-reasoning: Text Generation • 15B params • 24.4k downloads • 192 likes
- microsoft/Phi-4-reasoning-plus: Text Generation • 15B params • 12.3k downloads • 294 likes
- microsoft/Phi-4-multimodal-instruct: Automatic Speech Recognition • 6B params • 551k downloads • 1.44k likes
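For the reasoning variants, a minimal sketch that uses the tokenizer's chat template directly instead of the pipeline; it assumes a transformers release that supports the Phi-4 architecture and enough memory for the 4B checkpoint, and the arithmetic prompt is illustrative:

```python
# Minimal sketch: one reasoning prompt to Phi-4-mini-reasoning via the chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-reasoning"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "What is 17 * 23? Show your reasoning briefly."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512, do_sample=False)
# Strip the prompt tokens so only the model's answer is printed.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```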
Phi-1 family of small language models.
🔥 BitNet family of large language models (1-bit LLMs).
- microsoft/bitnet-b1.58-2B-4T: Text Generation • 0.8B params • 10.2k downloads • 1.1k likes
- microsoft/bitnet-b1.58-2B-4T-bf16: Text Generation • 2B params • 7.71k downloads • 29 likes
- microsoft/bitnet-b1.58-2B-4T-gguf: Text Generation • 2B params • 10.8k downloads • 183 likes
- Paper: BitNet b1.58 2B4T Technical Report (arXiv:2504.12285 • 73 upvotes)
LLM2CLIP leverages large language models to push already state-of-the-art pretrained CLIP models even further.
- microsoft/LLM2CLIP-EVA02-L-14-336: Zero-Shot Image Classification • 505 downloads • 59 likes
- microsoft/LLM2CLIP-Openai-L-14-336: Zero-Shot Classification • 0.6B params • 2.02k downloads • 43 likes
- microsoft/LLM2CLIP-EVA02-B-16: 10 downloads • 10 likes
- microsoft/LLM2CLIP-Openai-B-16: Zero-Shot Classification • 0.4B params • 454 downloads • 18 likes
TAPEX is a state-of-the-art table pre-training model that can be used for table-based question answering and table-based fact verification.
- Paper: TAPEX: Table Pre-training via Learning a Neural SQL Executor (arXiv:2107.07653 • 2 upvotes)
- microsoft/tapex-large-finetuned-wtq: Table Question Answering • 0.4B params • 3k downloads • 76 likes
- microsoft/tapex-base-finetuned-wikisql: Table Question Answering • 1.87k downloads • 18 likes
- microsoft/tapex-large-sql-execution: Table Question Answering • 0.4B params • 232 downloads • 17 likes
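A minimal table-QA sketch with the WTQ-finetuned checkpoint; the toy table and question are illustrative, and pandas is assumed for building the table:

```python
# Minimal sketch: table question answering with TAPEX (a BART model over flattened tables).
import pandas as pd
from transformers import TapexTokenizer, BartForConditionalGeneration

tokenizer = TapexTokenizer.from_pretrained("microsoft/tapex-large-finetuned-wtq")
model = BartForConditionalGeneration.from_pretrained("microsoft/tapex-large-finetuned-wtq")

# TAPEX expects a pandas DataFrame with string-valued cells.
table = pd.DataFrame(
    {"year": ["2018", "2019", "2020"], "city": ["Paris", "Lisbon", "Seoul"]}
).astype(str)
query = "in which year is the city lisbon listed?"

encoding = tokenizer(table=table, query=query, return_tensors="pt")
outputs = model.generate(**encoding)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))  # decoded answer string
```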
The LayoutLM series consists of Transformer encoders useful for document AI tasks such as invoice parsing, document image classification, and DocVQA.
The Orca family of LMs developed by Microsoft.
GIT (Generative Image-to-text Transformer) is a model useful for vision-language tasks such as image/video captioning and question answering.
- Paper: GIT: A Generative Image-to-text Transformer for Vision and Language (arXiv:2205.14100 • 1 upvote)
- microsoft/git-base: Image-to-Text • 0.2B params • 429k downloads • 98 likes
- microsoft/git-large: Image-to-Text • 1.54k downloads • 15 likes
- microsoft/git-base-vqav2: Visual Question Answering • 0.2B params • 361 downloads • 19 likes
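A minimal captioning sketch with the transformers image-to-text pipeline; the image path is purely illustrative (a URL or PIL image also works):

```python
# Minimal sketch: caption an image with GIT via the image-to-text pipeline.
from transformers import pipeline

captioner = pipeline("image-to-text", model="microsoft/git-base")
# Accepts a local path, URL, or PIL image; this path is illustrative.
print(captioner("photo.jpg"))
```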
Industrial Foundation Models