[CV] Human Pose Estimation with Object Detection

이번 포스팅에서는 저번에 논문 리뷰했던 Human Pose Estimation을 구현한 내용을 review할 것이다.

https://github.com/hjpark83/CVLab/tree/main/Human%20Pose%20Estimation/Implementation

CVLab/Human Pose Estimation/Implementation at main · hjpark83/CVLab

HYU Computer Vision Lab. Contribute to hjpark83/CVLab development by creating an account on GitHub.

github.com

출처는 제 github site를 참고하시면 되고 여기서는 구현 내용과 결과에 대해 다뤄보고자 합니다.

Human Pose Estimation의 경우 인간의 신체에 특정 point [보통 관절]를 설정하고 그 부분을 학습시킨 뒤 point 끼리의 connection을 해서 인간의 skeleton의 위치를 예측하는 task라고 생각하면 됩니다.

이번에 구현할 때 사용한 model은 OpenPose와 YOLOv9입니다.
그 이유는 OpenPose가 real-time에 강점이 있고, YOLOv9도 이전 모델 대비 성능과 안정성이 높아졌기 때문에 사용하게 되었습니다.

따라서 구현을 하기 위해서는 point와 point들끼리의 연결[pair]을 지정해 주어야 합니다.

전체 파일을 위와 같이 구성이 되어 있습니다. 여기에 이제 OpenPose model을 추가한...

pose.json

이 부분의 경우 .json 파일에 지정을 해서 config.py 파일에서 json 파일을 읽어서 data를 불러오도록 하였습니다. 또한 json 파일에서 미리 학습된 model도 불러오도록 설정했습니다.

Point의 경우 [머리, 목, 어깨, 팔꿈치, 손목, 엉덩이, 무릎, 발목] 까지 해서 총 14개를 지정하였고, point들을 연결한 결과는 위의 image처럼 됩니다.

{
  "BODY_PARTS": {
      "Head": 0,
      "Neck": 1,
      "RShoulder": 2,
      "RElbow": 3,
      "RWrist": 4,
      "LShoulder": 5,
      "LElbow": 6,
      "LWrist": 7,
      "RHip": 8,
      "RKnee": 9,
      "RAnkle": 10,
      "LHip": 11,
      "LKnee": 12,
      "LAnkle": 13,
      "Nose": 14,
      "LEar": 15,
      "REar": 16,
      "LEye": 17,
      "REye": 18
  },
  "POSE_PAIRS": [
      ["Head", "Neck"],
      ["Head", "Nose"],
      ["Head", "LEar"],
      ["Head", "REar"],
      ["Head", "LEye"],
      ["Head", "REye"],
      ["Neck", "RShoulder"],
      ["RShoulder", "RElbow"],
      ["RElbow", "RWrist"],
      ["Neck", "LShoulder"],
      ["LShoulder", "LElbow"],
      ["LElbow", "LWrist"],
      ["Neck", "RHip"],
      ["RHip", "RKnee"],
      ["RKnee", "RAnkle"],
      ["Neck", "LHip"],
      ["LHip", "LKnee"],
      ["LKnee", "LAnkle"]
  ],
  "colors": [
      [255, 0, 0],
      [0, 255, 0],
      [0, 0, 255],
      [128, 0, 0],
      [0, 128, 0],
      [0, 0, 128],
      [255, 255, 0],
      [255, 0, 255],
      [0, 255, 255],
      [128, 128, 0],
      [128, 0, 128],
      [0, 128, 128],
      [255, 255, 255],
      [128, 128, 128]
  ],
  "pose_protoFile": "models/pose/coco/pose_deploy_linevec.prototxt",
  "pose_weightsFile": "models/pose/coco/pose_iter_440000.caffemodel",
  "face_protoFile": "models/face/pose_deploy.prototxt",
  "face_weightsFile": "models/face/pose_iter_116000.caffemodel",
  "hand_protoFile": "models/hand/pose_deploy.prototxt",
  "hand_weightsFile": "models/hand/pose_iter_102000.caffemodel",
  "image_path": "openpose/input/image",
  "video_path": "openpose/input/video",
  "output_image_path": "openpose/output/image",
  "output_video_path": "openpose/output/video"
}

이게 point를 연결을 잘 해줘야 하는게 point들을 잘못 연결하면 아래 image 처럼 머리에서 목을 거쳐서 양쪽 엉덩이로 가는 것이 아니라 머리에서 바로 엉덩이로 가게 되어 부자연스러운 결과가 나오게 됩니다.

제대로 pair를 지정해주게 되면 결과는 다음과 같이 위의 image보다 자연스러운 결과가 출력됩니다.

config.py

config.py에서는 앞에서 언급했듯이 pose.json 파일에 저장해 놓은 point, pair, image/video 경로, pre-trained model의 경로를 읽어와서 갖고 있다가 호출했을 때 정보를 전달해주는 역할을 하는 파일 입니다.

import json

class Config:
    def __init__(self, config_file='pose.json'):
        with open(config_file, 'r') as file:
            config = json.load(file)
        
        self.BODY_PARTS = config['BODY_PARTS']
        self.POSE_PAIRS = config['POSE_PAIRS']
        self.colors = config['colors']
        self.pose_protoFile = config['pose_protoFile']
        self.pose_weightsFile = config['pose_weightsFile']
        self.face_protoFile = config['face_protoFile']
        self.face_weightsFile = config['face_weightsFile']
        self.hand_protoFile = config['hand_protoFile']
        self.hand_weightsFile = config['hand_weightsFile']
        self.image_path = config['image_path']
        self.video_path = config['video_path']
        self.output_image_path = config['output_image_path']
        self.output_video_path = config['output_video_path']

process.py

process.py는 image, video output을 처리하는 역할을 하는 함수입니다. 이번 구현에서는 원본 input에 pose estimate한 skeleton을 입히는 것 뿐만 아니라 skeleton 자체도 따로 저장을 하도록 구현했기 때문에 이 부분을 처리하는 역할을 한다고 보면 됩니다.

import os
import cv2
import numpy as np
from pose_estimation import PoseEstimation, initialize_video_writer

class Process_Media:
    def __init__(self, yolo_model, body_parts, pose_pairs, colors, 
                 pose_proto_file, pose_weights_file, face_proto_file, face_weights_file, hand_proto_file, hand_weights_file):
        
        self.yolo_model = yolo_model
        self.pose_estimator = PoseEstimation(
            yolo_model,
            body_parts,
            pose_pairs,
            colors,
            pose_proto_file,
            pose_weights_file,
            face_proto_file,
            face_weights_file,
            hand_proto_file,
            hand_weights_file
        )


    def make_output(self, source, output_path):
        if source == "camera":
            self.process_camera(output_path)
        elif source.endswith((".png", ".jpg")):
            self.process_image(source, output_path)
        elif source.endswith((".mp4", ".avi")):
            self.process_video(source, output_path)
        else:
            print(f"Unsupported file format: {source}")


    def process_image(self, image_path, output_path):
        image = cv2.imread(image_path)
        height, width, _ = image.shape
        boxes, labels, confidences = self.yolo_model.detect_objects(image)
        image = self.yolo_model.draw_boxes(image, boxes, labels, confidences)

        blank_frame = np.zeros((height, width, 3), dtype=np.uint8)

        for (startX, startY, endX, endY) in boxes:
            _, pose_image = self.pose_estimator.process_pose(image, startX, startY, endX, endY, (height, width, 3))
            blank_frame = cv2.add(blank_frame, pose_image)
        
        # Save output images
        base_dir = os.path.dirname(os.path.dirname(output_path)) 
        pose_output_dir = os.path.join(base_dir, 'pose/image')
        os.makedirs(pose_output_dir, exist_ok=True)
        
        pose_image_path = os.path.join(pose_output_dir, os.path.basename(output_path))
        cv2.imwrite(pose_image_path, blank_frame)
        cv2.imwrite(output_path, image)
        print(f"Finished processing image: {image_path}")


    def process_video(self, video_path, output_path):
        cap = cv2.VideoCapture(video_path)
        out, out_pose, frame_size = initialize_video_writer(cap, output_path)

        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break

            boxes, labels, confidences = self.yolo_model.detect_objects(frame)
            frame = self.yolo_model.draw_boxes(frame, boxes, labels, confidences)

            blank_frame = np.zeros((frame_size[1], frame_size[0], 3), dtype=np.uint8)

            for (startX, startY, endX, endY) in boxes:
                _, pose_frame = self.pose_estimator.process_pose(frame, startX, startY, endX, endY, (frame_size[1], frame_size[0], 3))
                blank_frame = cv2.add(blank_frame, pose_frame)

            out.write(frame)
            out_pose.write(blank_frame)

        cap.release()
        out.release()
        out_pose.release()
        print(f"Finished processing video: {video_path}")


    def process_camera(self, output_path):
        cap = cv2.VideoCapture(0)
        out, out_pose, frame_size = initialize_video_writer(cap, output_path)

        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break

            boxes, labels, confidences = self.yolo_model.detect_objects(frame)
            frame = self.yolo_model.draw_boxes(frame, boxes, labels, confidences)

            blank_frame = np.zeros((int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)), int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)), 3), dtype=np.uint8)

            for (startX, startY, endX, endY) in boxes:
                _, pose_frame = self.pose_estimator.process_pose(frame, startX, startY, endX, endY, blank_frame.shape)
                blank_frame = cv2.add(blank_frame, pose_frame)

            self.pose_estimator.process_face_with_haar(frame)

            out.write(frame)
            out_pose.write(blank_frame)
            cv2.imshow('Pose Estimation', frame)

            if cv2.waitKey(1) & 0xFF == ord('q'):
                break

        cap.release()
        out.release()
        out_pose.release()
        cv2.destroyAllWindows()
        print(f"Finished processing camera stream.")

object_detection.py

yolov9s.pt 파일을 불러와서 object detection을 수행하는 파일입니다. Confidence 값은 0.5로 설정했고, 처음에 구현했을 때는 label이 human인 object에만 bounding box를 입히도록 했었는데 그렇게 하니 사람이 아닌데 비슷하게 생겨서 사람이라고 잘못 인식하는 대상에 bounding box가 그려지는 현상이 생겨서 오히려 더 눈에 띄는 상황이 발생했습니다. 따라서 이 때문에 confidence가 0.5를 넘기는 모든 object에 bounding box를 그리도록 변경하였습니다.

import cv2
import random
from ultralytics import YOLO

class YOLOModel:
    def __init__(self, model_path='yolov9s.pt'):
        self.model = YOLO(model_path).to('cuda')
        self.colors = {}


    def detect_objects(self, image):
        results = self.model.predict(image)
        boxes = []
        labels = []
        confidences = []

        for r in results:
            for box in r.boxes:
                confidence = float(box.conf)
                if confidence > 0.5:
                    x1, y1, x2, y2 = map(int, box.xyxy[0])
                    label = self.model.names[int(box.cls)]
                    boxes.append([x1, y1, x2, y2])
                    labels.append(label)
                    confidences.append(confidence)

                    if label not in self.colors:
                        self.colors[label] = (random.randint(0, 255), random.randint(0, 255), random.randint(0, 255))
        return boxes, labels, confidences


    def draw_boxes(self, image, boxes, labels=None, confidences=None):
        for i, (startX, startY, endX, endY) in enumerate(boxes):
            label = labels[i] if labels is not None else ''
            color = self.colors.get(label, (0, 255, 0))

            cv2.rectangle(image, (startX, startY), (endX, endY), color, 5)

            confidence = confidences[i] if confidences is not None else ''
            text = f"{label}: {confidence:.2f}" if confidence else label

            (text_width, text_height), baseline = cv2.getTextSize(text, cv2.FONT_HERSHEY_TRIPLEX, 0.5, 1)
            y = max(startY, text_height + 10)

            cv2.rectangle(image, (startX, y - text_height - 10),
                          (startX + text_width, y + baseline - 10),
                          color, cv2.FILLED)
            cv2.putText(image, text, (startX, y - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 1)
        return image
    
    
    def detect_faces_haar(self, image):
        face_cascade = cv2.CascadeClassifier('models/face/haarcascade_frontalface_alt.xml')
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, 1.1, 4)
        return faces


    def process_face_with_haar(self, image):
        faces = self.detect_faces_haar(image)
        for (x, y, w, h) in faces:
            roi = image[y:y+h, x:x+w]
            
            if len(roi.shape) == 2 or roi.shape[2] == 1:
                roi = cv2.cvtColor(roi, cv2.COLOR_GRAY2BGR)

            face_keypoints, _ = self.process_pose(roi, 0, 0, w, h, roi.shape)

pose_estimation.py

여기서는 pose estimation을 하는 역할을 합니다. get_point 함수에서 point들에 대한 정보를 불러오면, prob 값을 구한 뒤 이 값이 0.05보다 크면 배열에 넣어 놓고 이 배열을 들고 draw_pose 함수로 넘어가서 point들을 연결한 뒤 process_pose 함수에서 2개의 output에 예측한 pose를 그리도록 하였습니다.

import os
import cv2
import numpy as np

def initialize_video_writer(cap, output_path):
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = int(cap.get(cv2.CAP_PROP_FPS))
    
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')

    # video
    base_dir = os.path.dirname(os.path.dirname(output_path))
    video_output_dir = os.path.join(base_dir, 'video')
    os.makedirs(video_output_dir, exist_ok=True)

    video_output_path = os.path.join(video_output_dir, os.path.basename(output_path))
    out = cv2.VideoWriter(video_output_path, fourcc, fps, (width, height))

    # pose
    pose_output_dir = os.path.join(base_dir, 'pose/video')
    os.makedirs(pose_output_dir, exist_ok=True)

    pose_output_path = os.path.join(pose_output_dir, os.path.basename(output_path))
    out_pose = cv2.VideoWriter(pose_output_path, fourcc, fps, (width, height))

    return out, out_pose, (width, height)


class PoseEstimation:
    def __init__(self, yolo_model, body_parts, pose_pairs, colors, 
                 pose_proto_file, pose_weights_file, face_proto_file, face_weights_file, hand_proto_file, hand_weights_file):
        
        self.yolo_model = yolo_model
        self.body_parts = body_parts
        self.pose_pairs = pose_pairs
        self.colors = colors
        self.net_pose = cv2.dnn.readNetFromCaffe(pose_proto_file, pose_weights_file)
        self.net_face = cv2.dnn.readNetFromCaffe(face_proto_file, face_weights_file)
        self.net_hand = cv2.dnn.readNetFromCaffe(hand_proto_file, hand_weights_file)


    def get_points(self, input, output, startX, startY, H, W, imageHeight, imageWidth):
        points = []
        for i in range(len(self.body_parts)):
            probMap = output[0, i, :, :]
            minVal, prob, minLoc, point = cv2.minMaxLoc(probMap)
            x = int(startX + (imageWidth * point[0]) / W)
            y = int(startY + (imageHeight * point[1]) / H)
            if prob > 0.05: 
                points.append((x, y))
            else:
                points.append(None) 
        return points
    
    
    def draw_pose(self, image, points, pose_pairs):
        for idx, pair in enumerate(pose_pairs):
            partA = points[self.body_parts[pair[0]]]
            partB = points[self.body_parts[pair[1]]]
            if partA and partB:
                color = self.colors[idx % len(self.colors)]
                cv2.line(image, partA, partB, color, 8)
                

    def process_pose(self, input, startX, startY, endX, endY, frame_size):
        roi = input[startY:endY, startX:endX]
        imageHeight, imageWidth, _ = roi.shape

        if roi.dtype != np.uint8:
            roi = roi.astype(np.uint8)

        inpBlob = cv2.dnn.blobFromImage(roi, 1.0 / 255, (imageWidth, imageHeight),
                                        (0, 0, 0), swapRB=False, crop=False)
        self.net_pose.setInput(inpBlob)
        output = self.net_pose.forward()

        H = output.shape[2]
        W = output.shape[3]

        points = self.get_points(input, output, startX, startY, H, W, imageHeight, imageWidth)
        
        self.draw_pose(input, points, self.pose_pairs)
        blank_frame = np.zeros(frame_size, dtype=np.uint8)
        self.draw_pose(blank_frame, points, self.pose_pairs)

        return points, blank_frame

main.py

import os
import argparse
from pose_estimation import PoseEstimation
from object_detection import YOLOModel
from process import Process_Media
from config import Config

def parse_args():
    parser = argparse.ArgumentParser(description="Pose Estimation using OpenPose model")
    parser.add_argument('--source', type=str, required=True, help="Name of the input file (image or video)")
    parser.add_argument('--output', type=str, required=True, help="Name of the output file (image or video)")
    return parser.parse_args()


def main():
    args = parse_args()
    config = Config()

    yolo_model = YOLOModel(model_path='yolov9s.pt')
    PE = PoseEstimation(
        yolo_model,
        config.BODY_PARTS,
        config.POSE_PAIRS,
        config.colors,
        config.pose_protoFile,
        config.pose_weightsFile,
        config.face_protoFile,
        config.face_weightsFile,
        config.hand_protoFile,
        config.hand_weightsFile
    )

    process_media = Process_Media(
        yolo_model,
        config.BODY_PARTS,
        config.POSE_PAIRS,
        config.colors,
        config.pose_protoFile,
        config.pose_weightsFile,
        config.face_protoFile,
        config.face_weightsFile,
        config.hand_protoFile,
        config.hand_weightsFile
    )

    if args.source == "camera":
        source_file_path = args.source
        output_file_path = args.output
    else:
        source_file_path = os.path.join(config.image_path if args.source.endswith((".png", ".jpg")) else config.video_path, args.source)
        output_file_path = os.path.join(config.output_image_path if args.source.endswith((".png", ".jpg")) else config.output_video_path, args.output)

    process_media.make_output(source_file_path, output_file_path)
    
if __name__ == "__main__":
    main()

Result

Image

Image의 경우 위의 그림과 같이 결과가 나옵니다. Pose estimation의 경우 역동적인 움직임을 갖는 운동경기나 춤 같은 곳에 쓰일 때 효과적이기 때문에 input image는 인터넷에서 여러 image를 다운로드 받아서 각각 input으로 넣었습니다.
[참고로 위의 image의 경우 저거 자체가 1개의 image가 아니라 여러 image를 편집한 하나의 image입니다]

그리고 아까 pose estimation을 구현할 때 input image 뿐만 아니라 흑백 image 위에 pose skeleton만 그린 결과도 따로 저장을 했다고 했었는데, 이에 대한 결과는 아래의 이미지처럼 됩니다.

위의 원본 image랑 비교하면 뒤의 배경을 없애고 skeleton만 남긴 형태입니다. 이렇게 보았을 때 pose에 대한 분석이 더 용이하지 않을까 싶어서 이런 형식으로도 구현을 해보았습니다.

Video

이게 PPT에 넣었던 영상을 캡쳐한 영상이다보니 각각의 영상이 동시에 시작을 하지 않아서 그 부분은 양해부탁드립니다

아무튼 동영상을 input으로 넣으면 processing하는 과정은 평균 2~30분 정도로 짧은 편은 아니지만, 결과는 잘 나온 것을 확인할 수 있습니다. 다만 아쉬운 점은 pose estimation에서 흔히 발생하는 문제이지만, 여러 대상이 겹치거나, 가려지거나, 너무 작은 경우 제대로 출력이 되지 않는 현상이 보입니다. 이 부분은 더 보완을 할 수 있다면 해야할 것 같습니다.

마찬가지로 동영상 역시 skeleton만 따로 저장을 하면 아래와 같이 됩니다.

지금까지 Pose Estimation에 대한 논문들을 읽고 Object detection과 결합해서 제 나름대로 구현을 한 내용에 대해 다뤄보았습니다. Quality 측면에서는 부족한 부분이 많지만, 기존에 나와있는 모델들을 활용해서 이러한 결과물이 도출된 점에서는 유의미한 결과이지 않나 라는 생각이 듭니다.

긴 글 읽어주셔서 감사합니다.

'Pose Estimation' 카테고리의 다른 글

[논문] Stacked Hourglass Networks for Human Pose Estimation (3)	2025.01.03
[CV] Human Pose Estimation (2)	2024.09.11

pose.json

config.py

process.py

object_detection.py

pose_estimation.py

main.py

Result

Image

Video

'Pose Estimation' 카테고리의 다른 글

티스토리툴바