BaseballCV Datasets

BaseballCV provides curated datasets for baseball computer vision tasks. They cover annotated data for object detection and player tracking, as well as unannotated frames and video for custom annotation.

Available Datasets

YOLO Format Datasets

YOLO format datasets are primarily used for object detection tasks and are compatible with YOLO models.

baseball_rubber_home_glove
- Content: Baseball, rubber, home plate, and catcher’s glove annotations
- Distribution:
  - Train: 3,925 images with annotations
  - Test: 88 images with annotations
  - Valid: 98 images with annotations
- Class mapping (class ID to name):
```
classes = {
    0: 'glove',
    1: 'homeplate',
    2: 'baseball',
    3: 'rubber'
}
```
baseball_rubber_home
- Content: Baseball, rubber, and home plate annotations
- Distribution:
  - Train: 3,905 images with annotations
  - Test: 87 images with annotations
  - Valid: 98 images with annotations
- Class mapping (class ID to name):
```
classes = {
    15: 'homeplate',
    16: 'baseball',
    17: 'rubber'
}
```
baseball
- Content: Baseball-only annotations, optimized for ball tracking
- Distribution:
  - Train: 4,534 images
  - Test: 426 images
  - Valid: 375 images
phc
- Content: Pitcher, hitter, and catcher annotations
- Distribution:
  - Train, Test, Valid splits
- Primary use: Player tracking and positioning analysis
amateur_pitcher_hitter
- Content: Pitchers and hitters in amateur baseball games
- Distribution:
  - Train, Test, Valid splits
- Primary use: Amateur baseball analysis
OKD_NOKD
- Content: Catcher stance classification (One Knee Down vs No One Knee Down)
- Distribution:
  - OKD: 1,408 images
  - NOKD: 1,408 images

COCO Format Datasets

COCO format datasets are suited to training DETR and other transformer-based detection models.

baseball_rubber_home_COCO
- Content: COCO format annotations for baseball, rubber, and home plate
- Distribution:
  - Train: 3,994 images
  - Test: 774 images
  - Valid: 400 images
baseball_rubber_home_glove_COCO
- Content: COCO format annotations including glove tracking
- Distribution:
  - Train: 3,475 images
  - Test: 408 images
  - Valid: 206 images

JSONL Format Datasets

JSONL format datasets are optimized for vision-language models like Florence2 and PaliGemma2.

amateur_hitter_pitcher_jsonl
- Content: JSONL format annotations for amateur baseball players
- Distribution:
  - Train, Test, Valid splits
- Primary use: Vision-language model training for amateur baseball analysis

Raw Photo Datasets

broadcast_10k_frames
- Content: 10,000 unannotated MLB broadcast frames
- Resolution: 1280 x 720 pixels
- Usage: Custom annotation and model training
broadcast_15k_frames
- Content: 15,000 unannotated MLB broadcast frames
- Resolution: 1280 x 720 pixels
- Usage: Custom annotation and model training

HuggingFace Hosted CV Datasets

international_amateur_baseball_pitcher_photo
- Content: 10,000 unannotated international amateur baseball pitcher photos
- Usage: Custom annotation and model training for amateur baseball pitcher detection
international_amateur_baseball_photos
- Content: 100,000 unannotated international amateur baseball photos
- Usage: Custom annotation and model training for amateur baseball detection
international_amateur_baseball_catcher_photos
- Content: 15,000 unannotated international amateur baseball catcher photos
- Usage: Custom annotation and model training for amateur baseball catcher detection
international_amateur_baseball_catcher_video
- Content: Video clips of amateur baseball catchers
- Usage: Catcher motion analysis and detection model training
international_amateur_baseball_game_video
- Content: Complete amateur baseball game footage
- Usage: Full-game analysis and player tracking
international_amateur_baseball_bp_video
- Content: Batting practice footage from amateur baseball
- Usage: Swing analysis and training

HuggingFace Hosted Numerical Datasets

mlb_glove_tracking_april_2024
- Content: Glove-tracking data covering more than 100,000 plays, recorded over the course of each pitch
- Usage: Command estimation and catcher training

Using the Datasets

Loading Datasets

BaseballCV provides a simple interface for loading datasets:

from baseballcv.functions import LoadTools

# Initialize LoadTools
load_tools = LoadTools()

# Load a pre-annotated dataset
dataset_path = load_tools.load_dataset("baseball_rubber_home_glove")

# Load raw photos for custom annotation
raw_dataset = load_tools.load_dataset("broadcast_10k_frames")

# Load a HuggingFace hosted dataset
hf_dataset = load_tools.load_dataset("international_amateur_baseball_photos")
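
After loading, it can help to confirm what actually landed on disk. The sketch below counts image files per split. It assumes that `load_dataset` returns a local directory path (as the variable name `dataset_path` above suggests) and that splits live in subfolders named `train`, `test`, and `valid`; both the folder names and the helper `count_images_by_split` are assumptions for illustration, not part of the BaseballCV API.

```python
from pathlib import Path

# Extensions treated as images; adjust if a dataset uses other formats.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}

def count_images_by_split(dataset_path, splits=("train", "test", "valid")):
    """Return {split: image_count} for each split folder that exists.

    Hypothetical helper: assumes one subfolder per split under dataset_path.
    """
    root = Path(dataset_path)
    counts = {}
    for split in splits:
        split_dir = root / split
        if not split_dir.is_dir():
            continue  # layout differs from the assumption; skip this split
        counts[split] = sum(
            1 for p in split_dir.rglob("*") if p.suffix.lower() in IMAGE_EXTS
        )
    return counts
```

Comparing the returned counts with the figures listed under Available Datasets is a quick sanity check on a download.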

Working with Different Formats

YOLO Format

import os

def process_yolo_dataset(dataset_path):
    """Process a YOLO format dataset"""
    # Label format: class_id x_center y_center width height
    labels_dir = os.path.join(dataset_path, "labels")
    images_dir = os.path.join(dataset_path, "images")

    for label_file in os.listdir(labels_dir):
        with open(os.path.join(labels_dir, label_file), 'r') as f:
            annotations = f.readlines()
        # Each line: class_id x_center y_center width height
        for line in annotations:
            if not line.strip():
                continue  # skip blank lines
            class_id, x_center, y_center, width, height = line.split()
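
To draw or inspect a label, the five fields usually need converting to pixel corners. The sketch below does that for a single label line, assuming the standard YOLO convention that coordinates are normalized to the 0..1 range relative to image width and height. The function name `yolo_line_to_pixel_box` is illustrative only.

```python
def yolo_line_to_pixel_box(line, img_width, img_height):
    """Convert one YOLO label line to (class_id, x_min, y_min, x_max, y_max).

    Assumes 'class_id x_center y_center width height' with the four box
    values normalized to 0..1; the result is in pixels.
    """
    parts = line.split()
    class_id = int(parts[0])
    x_c, y_c, w, h = (float(v) for v in parts[1:5])
    x_min = (x_c - w / 2) * img_width
    y_min = (y_c - h / 2) * img_height
    x_max = (x_c + w / 2) * img_width
    y_max = (y_c + h / 2) * img_height
    return class_id, x_min, y_min, x_max, y_max

# A box centered in a 1280x720 frame, a quarter as wide and half as tall
print(yolo_line_to_pixel_box("2 0.5 0.5 0.25 0.5", 1280, 720))
# → (2, 480.0, 180.0, 800.0, 540.0)
```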

COCO Format

import os

from pycocotools.coco import COCO

def process_coco_dataset(dataset_path):
    """Process a COCO format dataset"""
    annotation_file = os.path.join(dataset_path, "_annotations.json")
    coco = COCO(annotation_file)

    # Access images and annotations
    image_ids = coco.getImgIds()
    categories = coco.loadCats(coco.getCatIds())
    return image_ids, categories
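
When the same data is needed in both formats, COCO boxes can be converted to YOLO rows with the standard library alone. COCO stores each `bbox` as `[x_min, y_min, width, height]` in pixels, while YOLO expects a normalized center and size. The sketch below works on an already parsed annotation dictionary; `load_coco` and `coco_to_yolo_rows` are hypothetical helper names, and the category IDs are passed through unchanged, so they may need remapping before training.

```python
import json
from collections import defaultdict

def load_coco(annotation_file):
    """Read a COCO annotation file into a plain dictionary."""
    with open(annotation_file, "r") as f:
        return json.load(f)

def coco_to_yolo_rows(coco):
    """Map each image file name to YOLO-style rows.

    Each row is (category_id, x_center, y_center, width, height) with the
    box values normalized by the image width and height.
    """
    images = {img["id"]: img for img in coco["images"]}
    rows = defaultdict(list)
    for ann in coco["annotations"]:
        img = images[ann["image_id"]]
        x_min, y_min, w, h = ann["bbox"]  # COCO: top-left corner plus size
        rows[img["file_name"]].append((
            ann["category_id"],
            (x_min + w / 2) / img["width"],
            (y_min + h / 2) / img["height"],
            w / img["width"],
            h / img["height"],
        ))
    return dict(rows)
```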

JSONL Format

import json
import os

def process_jsonl_dataset(dataset_path):
    """Process a JSONL format dataset for vision-language models"""
    jsonl_path = os.path.join(dataset_path, "train_annotations.json")
    with open(jsonl_path, 'r') as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines between records
            entry = json.loads(line)
            image_path = os.path.join(dataset_path, entry['image'])
            prefix = entry['prefix']  # Usually a task token like "<OD>"
            suffix = entry['suffix']  # Annotation text with location tags
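
The location tags in `suffix` need decoding before they can be compared with boxes from other formats. The sketch below assumes a Florence-2 style encoding, where each object is written as a label followed by four tags, `<loc_x1><loc_y1><loc_x2><loc_y2>`, and each value is a bin index on a 0..999 grid. That layout is an assumption about these files, not something confirmed here; PaliGemma-style tags use a different order and grid, so check a few records first. `parse_suffix` is an illustrative name, and the scaling is a simple approximation of the bin position.

```python
import re

# One object: a label, then exactly four <loc_N> tags (assumed layout).
OBJECT_PATTERN = re.compile(r"([^<>]+)((?:<loc_\d+>){4})")

def parse_suffix(suffix, img_width, img_height, bins=1000):
    """Decode 'label<loc_x1><loc_y1><loc_x2><loc_y2>...' into pixel boxes.

    Returns a list of (label, x_min, y_min, x_max, y_max).
    """
    boxes = []
    for label, locs in OBJECT_PATTERN.findall(suffix):
        x1, y1, x2, y2 = (int(v) for v in re.findall(r"<loc_(\d+)>", locs))
        boxes.append((
            label.strip(),
            x1 * img_width / bins,
            y1 * img_height / bins,
            x2 * img_width / bins,
            y2 * img_height / bins,
        ))
    return boxes
```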

Creating Custom Datasets

BaseballCV supports creating custom datasets from video footage:

from baseballcv.functions import DataTools

data_tools = DataTools()

# Generate dataset from videos
data_tools.generate_photo_dataset(
    output_frames_folder="custom_dataset",
    max_plays=100,
    max_num_frames=1000,
    start_date="2024-05-01",
    end_date="2024-05-31"
)

# Auto-annotate in YOLO format using an existing model
data_tools.automated_annotation(
    model_alias="ball_tracking",
    image_dir="custom_dataset",
    output_dir="annotated_dataset",
    conf=0.8,
    mode="legacy"
)

# Auto-annotate using Autodistill and natural language
ontology = { 
    "a mitt worn by a baseball player for catching a baseball": "glove",
    "a baseball in flight": "baseball",
    "the white pentagon-shaped home plate on a baseball field": "homeplate"
}

data_tools.automated_annotation(
    model_type="detection",
    image_dir="custom_dataset",
    output_dir="annotated_dataset",
    conf=0.8,
    mode="autodistill",
    ontology=ontology
)

Dataset Quality Considerations

When working with BaseballCV datasets, consider:

Resolution and Quality
- Professional datasets are derived from MLB broadcast footage
- Standard resolution: 1280x720 pixels
- Amateur datasets vary in quality and camera angles
Annotation Consistency
- Ball annotations are centered on the baseball
- Glove annotations include the pocket area
- Plate and rubber annotations cover the entire visible area
Class Balance
- Baseball class is present in most frames
- Other objects may have varying frequencies
- Consider class weights during training
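
One common way to act on the class-balance points above is inverse-frequency weighting: count how often each class ID appears in the label files, then give rarer classes proportionally larger weights. The sketch below is one such scheme, assuming YOLO `.txt` label files in a single folder; the helper names are illustrative and not part of BaseballCV, and how the weights are consumed depends on the training framework.

```python
from collections import Counter
from pathlib import Path

def count_classes(labels_dir):
    """Count annotation instances per class ID across YOLO label files."""
    counts = Counter()
    for label_file in Path(labels_dir).glob("*.txt"):
        for line in label_file.read_text().splitlines():
            if line.strip():
                counts[int(line.split()[0])] += 1
    return counts

def inverse_frequency_weights(counts):
    """Weight each class by total / (num_classes * class_count).

    A perfectly balanced dataset yields a weight of 1.0 for every class.
    """
    total = sum(counts.values())
    return {cid: total / (len(counts) * n) for cid, n in counts.items()}
```

The same counts can be logged for custom datasets to watch how the class distribution shifts as annotations are added.
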
Data Splits
- Train/test/validation splits are consistent across formats
- Splits maintain temporal separation to prevent data leakage
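
For custom datasets, a basic leakage check is to confirm that no file appears in more than one split. The sketch below compares file stems across split folders, assuming one subfolder per split named `train`, `test`, and `valid` (an assumed layout). It only catches exact name collisions; near-duplicate frames from the same play would need a check based on timestamps or play identifiers, which depends on how the files are named.

```python
from itertools import combinations
from pathlib import Path

def find_split_overlap(dataset_path, splits=("train", "test", "valid")):
    """Return {(split_a, split_b): shared_stems} for every overlapping pair.

    Hypothetical helper: an empty result means no file stem is shared
    between any two splits.
    """
    root = Path(dataset_path)
    stems = {
        split: {p.stem for p in (root / split).rglob("*") if p.is_file()}
        for split in splits
    }
    overlaps = {}
    for a, b in combinations(splits, 2):
        shared = stems[a] & stems[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps
```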

Best Practices

Dataset Selection
- Choose format based on model architecture (YOLO vs COCO vs JSONL)
- Consider using combined datasets for multi-object detection
- Start with smaller datasets for prototyping
Data Augmentation
- Apply appropriate augmentations for baseball scenarios
- Consider motion blur and lighting variations
- Maintain object visibility during augmentation
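
As an illustration of keeping objects visible, the sketch below applies a crop to YOLO boxes: each box is clipped to the crop window, re-normalized to it, and dropped if too little of the original box remains. The function name, the normalized `(left, top, right, bottom)` crop format, and the 0.5 default threshold are all choices made for this example rather than BaseballCV settings.

```python
def crop_yolo_boxes(boxes, crop, min_visible=0.5):
    """Re-express YOLO boxes relative to a crop window.

    boxes: list of (class_id, x_center, y_center, width, height), normalized
           to the full image.
    crop:  (left, top, right, bottom), normalized to the full image.
    Boxes whose visible area fraction is below min_visible are dropped.
    """
    left, top, right, bottom = crop
    crop_w, crop_h = right - left, bottom - top
    kept = []
    for class_id, x_c, y_c, w, h in boxes:
        if w <= 0 or h <= 0:
            continue  # degenerate box
        # Clip the box corners to the crop window
        x1, y1 = max(x_c - w / 2, left), max(y_c - h / 2, top)
        x2, y2 = min(x_c + w / 2, right), min(y_c + h / 2, bottom)
        if x2 <= x1 or y2 <= y1:
            continue  # box lies entirely outside the crop
        if (x2 - x1) * (y2 - y1) / (w * h) < min_visible:
            continue  # too little of the object remains visible
        kept.append((
            class_id,
            ((x1 + x2) / 2 - left) / crop_w,
            ((y1 + y2) / 2 - top) / crop_h,
            (x2 - x1) / crop_w,
            (y2 - y1) / crop_h,
        ))
    return kept
```

Small objects such as the baseball are the most likely to fall below the threshold, so it may be worth counting how many boxes each augmentation removes.
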
Validation
- Regularly validate annotations during training
- Use visualization tools to check detection quality
- Monitor class distribution in custom datasets
Domain Adaptation
- When moving between professional and amateur footage, consider retraining
- Use transfer learning from MLB models to amateur scenarios
- Validate performance across different camera angles and lighting conditions