# Object Detection with YOLO

![Segmentation Results](images/object-detection-header.png "Segmentation of bounding boxes")

## Introduction

The rapid advancement of computer vision has paved the way for innovations that can significantly enhance automated systems and improve public safety, especially in public transportation. Among the myriad of object detection models, [YOLO](https://docs.ultralytics.com/) (You Only Look Once) stands out due to its ability to detect objects in real time with remarkable accuracy and low computing requirements. The small size of the models makes them well suited for deployment on on-board equipment with limited resources, while remaining performant enough for dynamic environments where rapid decision-making is crucial, such as public transportation and autonomous driving.

This experiment, prompted by a job opening I was applying to, delves into the application of the YOLO object detection model to cab ride videos from trains and trams. These videos, captured directly from the front-facing cameras of public transit vehicles and posted on YouTube, offer a rich dataset that reflects the diverse and unpredictable urban environment through which these vehicles travel. The primary goal of using such a model in this context is to enhance transportation safety and operational efficiency by identifying potential hazards and improving route management based on real-world data. Other applications include line optimization, passenger counting, track condition analysis and, of course, fully autonomous driving.

While exploring the process of training a YOLO model on these specific videos, I wanted to document the experiment and highlight the challenges faced while adapting the model to the peculiarities of railway and tram systems. By keeping the dataset and effort deliberately limited, I want to demonstrate my amazing skills, humm... lol, not really, but rather show how such complex tasks can be accomplished with a very limited amount of human annotation and intervention, by using foundation models to train very specialized ones.

## Methodology

The methodology used to train the vision model on cab ride videos encompasses several critical steps, from data collection to model training and validation. Each step is vital, and its quality influences the model's accuracy and usefulness in real-world scenarios. This is _just_ an experiment; a real-world model would require more data, quality assurance, and intermediate or dedicated models for sub-tasks.

### Data Collection

The primary data for this project consists of YouTube cab ride videos recorded in trains and trams. These videos are typically captured through front-facing cameras mounted on the vehicles, providing a driver's-eye view of the route. The footage includes diverse scenes from urban and rural settings under various weather and lighting conditions, making it a good source of input data.

### Data Characteristics

The videos are characterized by high-resolution imagery that captures the details necessary for accurate object detection, such as obstacles on tracks, signals, and other vehicles. The collection spans several hours of footage, ensuring a comprehensive dataset that includes a wide range of scenarios and anomalies. These videos are often very lengthy and include stops, tunnels, or long stretches of very similar frames.
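Before any processing, it helps to get a quick overview of each source file. The snippet below is a minimal sketch using OpenCV to report resolution, frame rate and duration; the file path is only a placeholder.

```python
import cv2


def probe_video(video_path):
    """Print basic characteristics (resolution, fps, duration) of a video file."""
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        print(f"Error: could not open {video_path}")
        return
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    duration_min = frame_count / fps / 60 if fps else 0
    cap.release()
    print(f"{video_path}: {width}x{height}, {fps:.1f} fps, {duration_min:.1f} min")


# Placeholder path for illustration
probe_video("videos/tram_cab_ride.mp4")
```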
To select which video segments are interesting and offer a variety of situations, as a first step we split the videos into segments of a fixed length.

```python
import math
import os

from moviepy.editor import VideoFileClip


def split_video(video_path, output_folder, segment_length=120):
    """Split a video file into segments of a fixed length and save them as separate files."""
    # Load the video
    video = VideoFileClip(video_path)
    duration = video.duration

    # Create the output folder if it does not exist
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    # Calculate the number of segments needed
    num_segments = math.ceil(duration / segment_length)
    print(f"Splitting the video into {num_segments} segments.")

    # Loop through all segments
    for i in range(num_segments):
        # Calculate start and end times
        start_time = i * segment_length
        end_time = min((i + 1) * segment_length, duration)

        # Cut the segment
        segment = video.subclip(start_time, end_time)
        video_name = os.path.basename(video_path)

        # Define the output file name
        output_filename = f"{output_folder}/{video_name[:-4]}_{i+1}.mp4"

        # Write the segment to a file
        segment.write_videofile(output_filename, codec='libx264', audio_codec='aac')
        print(f"Segment {i+1} is done.")
```

### Preparation and Preprocessing

Now that we have a set of two-minute videos, let's focus only on those with relevant content. A couple of techniques to identify frame similarity came to mind, but they would not highlight sequences containing different objects or situations. So I quickly shuffled through the videos manually and kept about 15 minutes of footage to analyse out of the initial 7 hours, covering urban and rural scenes, trams and trains, with and without other vehicles or pedestrians.

#### Frame extraction

Due to the continuous nature of video files, the next step involves extracting _some_ frames at a fixed interval (stride). This reduces the volume of data to a manageable size for annotation and training.

```python
import os

import cv2


def extract_frames(video_path, output_folder, stride=12):
    """Extract every `stride`-th frame from a video file and save it as a PNG image."""
    # Load the video
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        print("Error: could not open the video.")
        return

    # Create the output folder if it does not exist
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    # Extract frames
    frame_count = 0
    saved_count = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frame_count += 1
        if frame_count % stride != 0:
            continue
        frame_path = os.path.join(output_folder, f"{frame_count:06d}.png")
        cv2.imwrite(frame_path, frame)
        saved_count += 1

    cap.release()
    print(f"Extracted {saved_count} of {frame_count} frames to {output_folder}")
```

#### Annotation Process

![Label Studio Annotations](images/object-detection-image1.png "Labelling interface in Label Studio")

From these 15 minutes I selected as few as 170 frames, again aiming for a small but comprehensive set of situations and conditions to be labelled. Each of these frames was then manually annotated by a team of trained annotators (me, myself and I) using Label Studio. This involves identifying and labeling various objects of interest, such as pedestrians, vehicles, signals, signs, rails, etc. The annotations are exported in the YOLO format, which includes bounding boxes and object class labels.
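For reference, the YOLO export produces one plain-text label file per image, with one line per object: the class index followed by the box centre x, centre y, width and height, all normalized to the image dimensions. The values below are made up for illustration (class 2 is `Car` and class 5 is `Sign` in the class list defined in the dataset configuration further down):

```text
2 0.513281 0.662037 0.104687 0.087963
5 0.861719 0.412500 0.023438 0.061111
```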
## Model Training

The YOLOv8n (nano) model is selected for this project due to its balance of speed and accuracy, making it suitable for real-time detection tasks. YOLOv8 is known for its improved performance over previous versions through enhancements in architecture and training techniques.

### Dataset Configuration

```yaml
path: ../object-detection/datasets
train: detection
val: detection
names:
  0: Bike
  1: Bus
  2: Car
  3: CargoWagon
  4: Pedestrian
  5: Sign
  6: TrafficLight
  7: Train
  8: Tram
```

### Training Process

The training process involves feeding the annotated frames into the YOLO model. Data augmentation techniques such as rotation, scaling, and color adjustment are employed to improve the model's robustness by simulating various operational scenarios. The model undergoes several training and validation cycles to minimize overfitting and enhance its generalization capabilities.

```bash
yolo.exe train detect data=trainz.detect.yaml model=yolov8n.pt
```

### Validation and Testing

Post-training, the model is validated using a separate set of video frames that were not included in the training set. This step is crucial to evaluate the model's performance and accuracy in detecting objects under different conditions.

```bash
Validating runs\detection\weights\best.pt...
Ultralytics YOLOv8.1.47 🚀 Python-3.11.9 torch-2.2.2+cpu CPU
Model summary (fused): 168 layers, 3007403 parameters, 0 gradients, 8.1 GFLOPs
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██| 6/6 [2.24s/it]
                   all        177       1011      0.955      0.876      0.931      0.755
                  Bike        177          4          1      0.644      0.764      0.705
                   Bus        177         11          1      0.886      0.995      0.895
                   Car        177        389      0.943      0.925      0.975      0.761
            CargoWagon        177         33      0.869      0.848      0.865      0.677
            Pedestrian        177        149      0.901       0.94      0.963      0.734
                  Sign        177        212      0.949      0.789      0.898      0.604
          TrafficLight        177        157      0.964      0.924      0.969      0.701
                 Train        177         34      0.992      0.971      0.975      0.856
                  Tram        177         22      0.976      0.955      0.978      0.862
Speed: 0.9ms preprocess, 55.5ms inference, 0.0ms loss, 0.7ms postprocess per image
Results saved to runs\detection\
```

### Performance Metrics

The effectiveness of the trained model is measured using standard metrics such as precision, recall, and Intersection over Union (IoU). In the validation and testing phase it is essential to measure performance to ensure the model can reliably identify objects under various conditions. Precision and recall are the two critical metrics used for this purpose: they provide insight into the model's accuracy and its ability to detect all relevant objects within the video frames.

Note the performance of the model without any further optimization or attention: 55 ms of inference time, or roughly 20 frames per second without any GPU assistance, is very good. Further parameter fine-tuning or resolution reduction could make a significant difference, but that is beyond the scope of this POC.

![Training Results](images/object-detection-training-results.png "Training result plots")
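The aggregate numbers above can also be pulled programmatically, which is handy when comparing runs. The snippet below is a minimal sketch using the ultralytics Python API; the weight path and data YAML are the ones assumed throughout this post, and the attribute names reflect recent ultralytics releases.

```python
from ultralytics import YOLO

# Assumed paths: the exported best weights and the dataset YAML shown earlier.
model = YOLO("runs/detection/weights/best.pt")
metrics = model.val(data="trainz.detect.yaml")

# Aggregate box metrics over all classes
print(f"precision: {metrics.box.mp:.3f}")
print(f"recall:    {metrics.box.mr:.3f}")
print(f"mAP50:     {metrics.box.map50:.3f}")
print(f"mAP50-95:  {metrics.box.map:.3f}")
```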
**Precision** (or positive predictive value) measures the accuracy of the detections made by the model. In the context of the YOLO vision model for cab ride videos, precision reflects the proportion of correct positive detections out of all positive detections made. For example, if the model identifies 100 objects as vehicles and 90 of these identifications are correct, the precision is 90%. High precision is crucial in transportation settings to minimize false alarms, which can lead to unnecessary disruptions or desensitization to alerts.

**Recall** (or sensitivity) measures the model's ability to find all the relevant cases (or objects) within the dataset. In terms of this project, recall assesses the proportion of actual objects in the video frames that the model successfully detects. For instance, if there are 100 actual vehicles in the video and the model correctly identifies 85 of them, the recall is 85%. High recall is particularly important in safety-critical applications like transportation to ensure that potential hazards are not overlooked.

Both metrics matter because they help balance the model's performance. A high precision rate with a low recall rate might indicate that the model is too conservative, missing potential hazards. Conversely, a high recall rate with low precision might mean the model generates many false positives, which could reduce trust in, or the efficiency of, the system. Therefore, tuning the model to achieve a balanced trade-off between precision and recall is vital for practical deployment in public transportation monitoring systems.

**Intersection over Union** (IoU) is another metric used alongside precision and recall. It measures the overlap between the predicted bounding box and the actual bounding box, providing a direct measurement of localization accuracy, which is essential for precise object detection in dynamic environments like those captured in train and tram cab ride videos.

## From Object Detection to Segmentation

Meta's SAM (Segment Anything Model) provides a powerful tool for generating segmentation datasets from an initial set of detection data. This is particularly useful when you have a dataset labeled for object detection and want to extend it with segmentation labels, which are typically more detailed and involve classifying each pixel of the object.

![Segmentation Results](images/object-detection-segmentation-3.png "Segmentation of bounding boxes")

### Extending Detection Models to Generate a Segmentation Dataset

Building upon the foundation laid by the initial object detection model, this project took a significant step forward by employing Meta's Segment Anything Model (SAM) to enhance the dataset with segmentation labels. Integrating SAM into the methodology allowed standard detection outputs, specifically bounding boxes, to be transformed into detailed pixel-level segmentation maps. This process bridged the gap between detection and segmentation, providing a comprehensive understanding of each object's precise contours and boundaries within the urban transit environment captured in the cab ride videos.

![Segmentation Results](images/object-detection-segmentation-2.png "Segmentation of bounding boxes")

### Integration of SAM with Detection Outputs

Initially, the project used a detection model trained to identify various objects, such as vehicles, pedestrians, and other significant elements, within the urban landscape. The detection model located these objects and outlined them with bounding boxes. The transition from detection to segmentation began by feeding these bounding box coordinates into SAM. SAM was then used to precisely delineate the shapes enclosed within these boxes, relying on the texture, color, and form contrasts between the objects and their backgrounds.

![Segmentation Results](images/object-detection-segmentation-1.png "Segmentation of bounding boxes")

### Creating a Rich Segmentation Dataset

The result of this integration was a series of high-quality segmentation masks corresponding to each detected object. These masks detail the objects at a pixel level, providing a far more nuanced dataset than was originally available with mere detection labels.
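Conceptually, this boils down to prompting SAM with each predicted bounding box. The sketch below illustrates the idea with Meta's `segment-anything` package; the checkpoint file, image path and box coordinates are placeholders, and the ultralytics `auto_annotate` helper shown just after wraps the same logic end to end.

```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Assumptions: a local SAM ViT-B checkpoint and one extracted frame.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# SAM expects an RGB image
image = cv2.cvtColor(cv2.imread("datasets/detection/images/000120.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# One bounding box from the detector, in pixel coordinates (x0, y0, x1, y1)
box = np.array([412, 230, 655, 398])

# Ask SAM for a single mask constrained to that box
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
print(masks.shape, scores)  # (1, H, W) boolean mask and its confidence score
```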
To compile this enriched dataset, each original image was paired with its newly generated segmentation mask. This pairing formed a comprehensive set of data that included both the original detection information and the advanced segmentation details. The `auto_annotate` helper from ultralytics chains the trained detection model and SAM to do exactly that:

```python
from ultralytics.data.annotator import auto_annotate

auto_annotate(
    data="datasets\\track-detection",
    det_model="runs\\detection\\weights\\best.pt",
    sam_model="sam_b.pt",
    output_dir="datasets\\autosegment",
)
```

### Quality Assurance and Dataset Refinement

Critical to this methodology is the quality assurance phase. Each generated segmentation mask should undergo a thorough review to ensure that it meets the project's standards for accuracy and consistency. This step is essential, but far less time- and resource-consuming than manual annotation. The precision of the segmentation masks directly influences the effectiveness of subsequent models trained on this data. Where discrepancies or inaccuracies are noted, adjustments should be made through manual corrections to the masks, ensuring that the dataset upholds the integrity required for advanced computer vision applications.

### Utilization for Advanced Model Training

The enriched segmentation dataset prepared in this manner is not merely an exercise but a practical toolkit for further research and development. With these detailed segmentation maps, we could train more sophisticated models capable of performing complex tasks that rely on an intricate understanding of the spatial and textural context of objects within an image. The annotated masks can now be used to assist the annotation of further data, or to crop and hide parts of the frames for different sub-task processing. Such tasks include object tracking, distance estimation, obstacle detection, sign reading and signal interpretation. These tasks may require different specialized models with varying performance requirements, which is why generating the initial segmentation masks from live images at low cost is essential.

## Conclusion

This exploration into the use of YOLO for object detection on cab ride videos has revealed the significant potential of AI in public transportation. The successful application of YOLOv8n demonstrates not just a technological triumph but also a blueprint for future innovations in autonomous navigation and safety enhancements. By creatively leveraging YouTube videos as a data source and employing Meta's SAM for segmentation, I have shown that even with constrained resources and a very limited amount of annotated data, one can generate a dataset rich enough to train a sophisticated model.

My takeaways from this experience include:

* The feasibility of applying advanced AI models like YOLO to real-world situations with limited data.
* The importance of balancing precision and recall, particularly in safety-critical applications.
* The versatility of YOLO, which extends beyond detection to enable comprehensive scene understanding through segmentation.
* The power of leveraging large, compute- and resource-intensive models to train small, lightweight, specialized models.

This work paves the way for more intricate applications and sets the stage for further refinement and application of AI in public transportation, promising a future where safety and efficiency are greatly enhanced by intelligent systems.