Vision Language Models. Building VLMs with Hugging Face Merve Noyan, Andrés Marafioti, Miquel Farré


Authors: Merve Noyan, Andrés Marafioti, Miquel Farré
Publisher: O'Reilly Media
Pages: 408
Available formats: ePub, Mobi


Vision language models (VLMs) combine computer vision and natural language processing to create powerful systems that can interpret, generate, and respond in multimodal contexts. Vision Language Models is a hands-on guide to building real-world VLMs using the most up-to-date stack of machine learning tools from Hugging Face, Meta (PyTorch), NVIDIA (CUDA), and others, written by leading researchers and practitioners Merve Noyan, Miquel Farré, Andrés Marafioti, and Orr Zohar. From image captioning and document understanding to advanced zero-shot inference and retrieval-augmented generation (RAG), this book covers the full VLM application and development lifecycle.

Designed for ML engineers, data scientists, and developers, this guide distills cutting-edge VLM research into practical techniques. Readers will learn how to prepare datasets, select the right architectures, fine-tune and deploy models, and apply them to real-world tasks across a range of industries.

Explore core model architectures and alignment techniques
Train and fine-tune VLMs with Hugging Face, PyTorch, and others
Deploy models for applications like image search and captioning
Implement advanced inference strategies, from zero-shot to agentic systems
Build scalable VLM systems ready for production use


Book details

Ebook ISBN: 979-83-416-2401-6, 9798341624016
Ebook release date: 2026-06-08 (the store's listing date; it may differ from the print edition's release date)
Publication language: English
ePub file size: 20.2 MB
Mobi file size: 20.2 MB

Categories: Artificial intelligence

Table of contents

Foreword
Preface
- Who Is This Book For?
- What You Will Learn
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments
1. Introduction to Vision and Language
- Brief Introduction to Computer Vision
  - Signal Decomposition Techniques
  - Filters and Feature Extraction Kernels
  - From Filters to Convolutional Neural Networks
  - Other Basic Convolutional Neural Network-Based Pipelines
- Transfer Learning
  - Computer Vision Backbones
    - ResNet
    - MobileNet
    - U-Net
    - Going from image to text and back
  - Transformers and Their Origins in Language
    - Recurrent neural networks and long short-term memory
    - The boom of transformers in text
  - Vision Transformers
    - Vision transformer variations
    - Connecting images and text: CLIP
  - Modern Vision Language Models
- Brief Introduction to Hugging Face Open Source Ecosystem
  - Hugging Face Hub
  - Libraries
    - Transformers
    - Accelerate
    - Datasets
  - Coding Example: Searching for Images from Text
- Summary
2. Vision Language Model Applications
- Image Captioning
- Visual Question Answering
- Visual Reasoning
- Visual Language Retrieval
- Document Understanding
- Document Visual Question Answering
- Video Understanding
- Instance Localization with Vision Language Models
  - Zero-Shot Object Detection
  - Object Counting
  - Image Segmentation
- Summary
3. Vision Language Model Training
- A Bird's-Eye View of Training Vision Language Models
  - The How: What Different Training Paradigms Do
  - The When: A Model's Training Stages
- Training Vision Language Models
  - First Things First: Training Data
  - Let's Architect Our Vision Language Model
    - Making our vision language model learn
    - Training our model
  - Loading More Than One Sample at a Time
    - Naive padding
    - Constrained padding
    - Naive packing
    - Naive greedy knapsack algorithm
  - Training with a Real Batch Size
    - The uh-oh moment: Where do the images go?
    - The plan: Placeholder tokens
    - From plan to code
    - A few more practice details
    - The new training loop: In action
  - Inferring with Our Trained Model
  - Making Inference Faster with Key-Value Cache
    - The basic idea of the key-value cache
    - Prefill versus decode
    - Trade-offs and why you should care
  - Dealing with High-Resolution Images
    - Tiling: More pixels, same architecture
    - Pixel shuffle: The compression spell
- Summary
4. Training Data and Preprocessing for VLMs
- Looking at the Data
  - Image-Text Datasets
  - Video-Text Datasets
  - Vision-Language-Action Datasets
- Building a Dataset
  - Data Sourcing at Scale
    - Navigating the legal and ethical minefield
    - Where pretraining and post-training data diverge
  - Data Filtering at Scale
    - Finding good filtering proxies
    - Check for corrupt samples
  - Sample Diversity at Scale
    - Taxonomy development
    - Content categorization
  - Data Annotation and Quality Validation at Scale
    - Choosing your annotation approach
    - Building your own synthesis pipeline
    - Real-world hybrid approaches
  - Preparing the Dataset for Consumption
- Dataset Mixtures: The Hidden Hyperparameter
  - Mixture Ingredients and Proportions
  - Task-Driven Mixture Design
  - Ablations and Evaluation
    - Ablation methodology
    - Measuring mixture impact
- Summary
5. Post-Training Vision Language Models
- Supervised Fine-Tuning
- Parameter-Efficient Fine-Tuning
  - Training with Quantization
  - Introduction to Transformers Reinforcement Learning
    - Supervised fine-tuning example
    - Multimodal alignment
- Reinforcement Learning from Human Feedback
  - Direct Preference Optimization and Mixed Preference Optimization
  - Group Relative Policy Optimization
- Summary
6. Core Architectures of Vision Language Models
- The Key to Combining Information: Multimodal Attention
  - Self-Attention: Finding Relationships Within a Sequence
  - Cross-Attention: Bridging Two Different Streams
- Modern VLM Blueprints: Connecting Pretrained Models
  - The Adapter Approach: Cross-Attention
    - The perceiver resampler: Compressing visual features
    - Gated cross-attention: Injecting vision into the LLM
  - The Unified Sequence Approach: Self-Attention
  - Comparing Architectures: Which Way to Go?
- Foundational Concepts in VLM Design
  - The Fusion Framework: Early or Late?
    - Early fusion: Integrating from the ground up
    - Late fusion: Independent strengths, unified decisions
    - Connecting the spectrum to modern architectures
  - The Encoder-Decoder Pattern: Where It All Started
- Summary
7. Deploying Models for Inference at Scale
- Inference Optimization for VLMs
  - Understanding the KV Cache
    - Memory cost
    - KV cache memory by model
- Attention Optimizations: FlashAttention and Beyond
  - Understanding GPU Memory: The Foundation of All Inference Optimization
  - The Attention Bottleneck and FlashAttention to the Rescue
  - Using Optimized Attention
- Quantization for VLMs
  - Why Quantization Helps: Bandwidth, Not Compute
  - Weight-Only Quantization
  - The Outlier Problem
  - Quantization Methods
    - bitsandbytes
    - Beyond bitsandbytes: Production-ready methods
  - The VLM Quantization Asymmetry
  - Practical Notes
  - torchao
- Exporting Models to Different Runtimes
  - ONNX
  - TensorRT: Maximum GPU Performance
  - Browser Deployment with transformers.js
- Packaging and Deploying in Real Environments
  - It Runs
  - Efficient Deployment with vLLM
    - Running vLLM with a VLM
      - A few flags worth understanding
      - Client code
      - Common friction points
  - Production Optimizations
    - Prefix caching
    - Disaggregated vision encoder
    - Alternative approach: Triton + TensorRT-LLM
- On-Device/Edge Deployment
  - The Edge Landscape
  - MLX on Apple Silicon
  - Llama.cpp
  - Mobile Deployment
  - PEFT Adapters for Edge Customization
  - Hybrid Patterns
- Summary
8. Document AI
- Introduction to Document AI
  - Information Extraction
    - Generative information extraction
    - Information extraction with understanding models
  - Document Parsing
  - Picking the Right Model
  - Multimodal Document Retrieval
    - Architecture of single- and multivector models
    - Differences between single-vector and multivector retrieval models
- Approaches in Solving Document AI Problems
  - Early Document AI Models for Information Extraction and Document Classification
    - LayoutLMv3
    - Donut
  - Code Examples: Document AI with Modern Vision Language Models
    - Fine-tuning KOSMOS2.5 with transformers
    - Fine-tuning SmolVLM2 with TRL
  - Document RAG
    - Document retrieval
    - Building e2e document RAG pipeline
- Summary
9. Video-Language Models
- Foundations
  - Video-Language Tasks
    - Video classification
    - Text-to-video retrieval
    - Captioning, summarization, and video QA
  - From Images to Video: Core Concepts
    - What can we learn from 3D CNNs?
    - Motion as an explicit signal
    - Attention over time (the transformer revolution)
  - The Evolution to Video-Language Models
- Temporal Modeling
  - Core Challenges in Temporal Modeling
  - Attention Mechanisms for Video
    - Spatiotemporal attention: Three strategies
    - Cross-modal attention: When language guides vision
    - Temporal position encoding
    - Putting it all together
- Video-Language Models in Practice
  - Picking the Right Output Mode
  - Retrieval Pipelines That Scale
    - Step 1: Segment videos
    - Step 2: Embed segments and build a FAISS index
    - Step 3 (optional): Retrieve then rerank
  - Video-RAG: Retrieval-Augmented Video QA
  - Fine-Tuning a Video-Language Model for Your Domain
    - What's different from image-VLM fine-tuning?
    - Dataset format
    - Training
- Efficiency in Video-Language Models
  - Token Efficiency
  - Training Efficiency
- Summary
10. Any-to-Any Models
- Introducing the Three Approaches
  - Unified Vocabulary Models
  - Hybrid Models
  - Modular Models
- Unified Vocabulary Models
  - Monolithic Architecture
    - How the monolithic approach works
    - In practice: DeepSeek's Janus-Pro (a transitional architecture)
  - Factorized Heads
    - Asymmetric pattern: Continuous in, discrete out
    - Multicodebook generation
    - How do heads generate RVQ codes?
    - In practice: Qwen2.5-Omni (Thinker-Talker)
- Hybrid Multiobjective Models
  - Variational Autoencoders
  - Connecting Continuous Latents to Language Models Through Diffusion
  - Putting It All Together: From Prompt to Generated Output
    - Diffusion pseudocode
- Late-Conditioning Models
  - Conditioning Interfaces: How to Represent Intent
  - Connectors: Bridging the Gap
  - How Generators Use Conditioning
  - Hands On: Qwen-Image-Edit, a Modular Diffusion Editor
- Training
  - Task Balance and Data Mixing
  - Staged Training: Divide and Conquer
    - Connector-first (alignment before joint training)
    - First understanding and then generation (sequential training)
    - Gradual unmasking (for discrete models)
    - Staged training example: Qwen2.5-Omni
  - Architecture-Specific Loss Functions
  - Getting Your Hands Dirty
    - Training scripts
    - Cheat sheet for training
- Summary
11. Advanced Topics and Cutting-Edge Research
- Agentic Vision Language Models
  - Introduction to Agents
    - The ReAct loop
    - Evaluating agents
  - Introduction to Smolagents
    - First toy example
    - Leveraging Hugging Face Spaces as tools
    - Connecting your model
  - Computer Use Agents
    - Using Holo2-Localization
    - Using ScaleCUA for agentic computer use
    - Vision browser example
- Vision-Language-Action Models
  - Closing the Perception-Action Loop
    - What VLAs actually do
    - Dataset requirements
  - From VLM to VLA
    - The action expert
    - From input to output, step by step
  - Model Landscape Overview
    - π0.6: VLAs that learn from experience
    - GR00T N1.5: Foundation VLA for humanoids
    - SmolVLA: The efficient
- Summary
Index