3) Data Cleaning for Pet Photos at Scale: EXIF, Dedup, Quality & Background Control

Posted at 2025-10-17

Summary: A pipeline for cleaning pet photos for training and display: normalize EXIF orientation, filter blurry or underexposed images, remove duplicates (pHash), detect occlusion, and control the background.

Table of contents

Why “data quality > model quality”

EXIF orientation & basic normalization

Quality assessment (blur/exposure)

Deduplication with pHash

Simple occlusion/pose detection

Background control (segment → blur/solid)

Data report & saving results

Further reading

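Why “data quality > model quality”

Low-quality images (blurry, poorly exposed, duplicated) can significantly lower a model's F1 score.

E-commerce listings need consistent images, which indirectly helps click-through rate (CTR) and conversion rate (CR).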

EXIF orientation

Many photos appear rotated because their orientation is stored only in EXIF metadata. Normalize it:

from PIL import Image, ImageOps

def normalize_exif(path):
    # Open the image and apply the EXIF orientation tag to the pixel data
    img = Image.open(path)
    return ImageOps.exif_transpose(img)  # auto-rotate by EXIF

Quality assessment

A crude example: approximate “sharpness” by the variance of an edge map; this can be swapped for the variance of the Laplacian, computed with OpenCV or PIL.

from PIL import ImageFilter, ImageStat

def sharpness_score(img):
    # simple edge detection on the grayscale image
    edges = img.convert("L").filter(ImageFilter.FIND_EDGES)
    return ImageStat.Stat(edges).var[0]  # higher value → sharper image
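
The summary also lists underexposed photos as a filtering target, which the sharpness score does not capture. Below is a minimal sketch of an exposure check, assuming a 256-bin grayscale histogram such as the one returned by `img.convert("L").histogram()`. The function names `exposure_stats` and `exposure_label` and the thresholds (`low=40`, `high=215`, 16-level clipping bands) are illustrative assumptions, not values from a reference implementation; tune them on your own data.

```python
def exposure_stats(hist):
    """Summarize exposure from a 256-bin grayscale histogram."""
    if len(hist) != 256:
        raise ValueError("expected a 256-bin histogram")
    total = sum(hist)
    if total == 0:
        raise ValueError("empty histogram")
    # Mean gray level, weighted by pixel count per level
    mean = sum(level * count for level, count in enumerate(hist)) / total
    dark_frac = sum(hist[:16]) / total     # share of near-black pixels
    bright_frac = sum(hist[240:]) / total  # share of near-white pixels
    return {"mean": mean, "dark_frac": dark_frac, "bright_frac": bright_frac}


def exposure_label(hist, low=40, high=215):
    """Label an image by mean brightness only (assumed thresholds)."""
    mean = exposure_stats(hist)["mean"]
    if mean < low:
        return "underexposed"
    if mean > high:
        return "overexposed"
    return "ok"
```

The clipping fractions are returned but not used for the label, so they can be logged in the report and inspected before deciding whether they deserve their own threshold.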

Deduplication with pHash

Use the imagehash library:

import imagehash
from PIL import Image

def get_phash(path):
    # Perceptual hash as a hex string (64 bits with the default hash size)
    img = Image.open(path)
    return str(imagehash.phash(img))

Set a Hamming-distance threshold if near-duplicates should be matched as well:

from imagehash import hex_to_hash

def is_near_duplicate(h1, h2, max_dist=4):
    # h1, h2: hex strings produced by str(imagehash.phash(...))
    # Subtracting two ImageHash objects yields their Hamming distance
    return (hex_to_hash(h1) - hex_to_hash(h2)) <= max_dist
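
To show how such a threshold might be applied across a whole folder, here is a minimal standard-library sketch that works directly on the hex strings. The helper names `hamming_hex` and `drop_near_duplicates` are hypothetical, and the greedy strategy (keep the first file, drop later ones that fall within `max_dist` of any kept file) is one simple choice among several.

```python
def hamming_hex(h1, h2):
    """Hamming distance between two equal-length hex hash strings."""
    if len(h1) != len(h2):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes differ
    return bin(int(h1, 16) ^ int(h2, 16)).count("1")


def drop_near_duplicates(hashes, max_dist=4):
    """Greedy pass over a dict of file name -> hex hash string.

    Returns (kept, dropped): kept is a list of file names in input order,
    dropped maps each removed file to the kept file it matched.
    """
    kept, dropped = [], {}
    for name, h in hashes.items():
        match = next(
            (k for k in kept if hamming_hex(hashes[k], h) <= max_dist), None
        )
        if match is None:
            kept.append(name)
        else:
            dropped[name] = match
    return kept, dropped


hashes = {
    "a.jpg": "ffff0000ffff0000",
    "b.jpg": "ffff0000ffff0001",  # differs from a.jpg by one bit
    "c.jpg": "0000ffff0000ffff",
}
kept, dropped = drop_near_duplicates(hashes)
print(kept, dropped)
```

This pass compares each file against every kept file, so its cost grows quadratically with the number of distinct images; for very large collections an index over the hashes would likely be needed.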

Occlusion/pose (heuristic)

If a detector is available (face/eyes/muzzle), drop images where the face is not visible.

Without a detector, fall back on a heuristic: take a center crop; low contrast there may mean the pet is facing away or occluded. This is only a rough estimate.
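
The center-crop heuristic can be sketched with Pillow alone. The version below measures contrast as the standard deviation of gray levels in the central region; the function names, the crop fraction `frac=0.5`, and the threshold `min_contrast=12.0` are all assumptions for illustration and would need calibration against labeled examples.

```python
from PIL import Image, ImageStat


def center_contrast(img, frac=0.5):
    """Standard deviation of gray levels inside the central crop."""
    gray = img.convert("L")
    w, h = gray.size
    cw, ch = max(1, int(w * frac)), max(1, int(h * frac))
    left, top = (w - cw) // 2, (h - ch) // 2
    crop = gray.crop((left, top, left + cw, top + ch))
    return ImageStat.Stat(crop).stddev[0]


def maybe_occluded(img, min_contrast=12.0):
    """Flag images whose center is nearly uniform (assumed threshold)."""
    return center_contrast(img) < min_contrast
```

Because a uniformly colored coat can also produce a low-contrast center, flagged images are better routed to manual review than deleted outright.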

Background control

Goal: a clean image in which the subject stands out.

A lightweight segmenter (e.g., MODNet or a small U²-Net) can separate the subject from the background; after that, apply either:

A light background blur (Gaussian)

Or a neutral solid-color background

Pseudo-code:

from PIL import Image, ImageFilter

def apply_bg_blur(img, mask):
    # img: PIL Image; mask: PIL Image of the same size, mode "L" (255 = subject, 0 = background)
    bg = img.filter(ImageFilter.GaussianBlur(radius=6))
    return Image.composite(img, bg, mask)  # keep the subject sharp
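
The solid-color option follows the same compositing pattern. This is a sketch under the same mask convention (255 = subject, 0 = background); the function name `apply_bg_solid` and the default light gray `(240, 240, 240)` are assumptions chosen for illustration.

```python
from PIL import Image


def apply_bg_solid(img, mask, color=(240, 240, 240)):
    """Replace the background with a flat color, keeping the subject."""
    if mask.size != img.size:
        raise ValueError("mask and image must have the same size")
    rgb = img.convert("RGB")
    bg = Image.new("RGB", rgb.size, color)
    # Where the mask is 255 the subject pixel is used, where 0 the flat color
    return Image.composite(rgb, bg, mask.convert("L"))
```

A soft-edged mask (values between 0 and 255 near the boundary) blends the subject into the new background and tends to hide segmentation errors better than a hard 0/255 mask.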

Data report

Sample pipeline (combining the steps above):

import json
import os

import imagehash
from PIL import Image, ImageOps, ImageFilter, ImageStat

def normalize_exif(img):
    return ImageOps.exif_transpose(img)

def sharpness_score(img):
    edges = img.convert("L").filter(ImageFilter.FIND_EDGES)
    return ImageStat.Stat(edges).var[0]

def phash_str(img):
    return str(imagehash.phash(img))

seen = {}  # pHash string -> first file seen with that hash
report = {"kept": [], "dropped": []}

os.makedirs("clean/ok", exist_ok=True)
os.makedirs("clean/drop", exist_ok=True)

# sorted() makes the choice of which duplicate is kept deterministic
for fname in sorted(os.listdir("raw")):
    if not fname.lower().endswith((".jpg", ".jpeg", ".png")):
        continue
    path = os.path.join("raw", fname)
    try:
        img = Image.open(path)
        img = normalize_exif(img)
        h = phash_str(img)

        # duplicate check (exact pHash match only)
        if h in seen:
            report["dropped"].append({"file": fname, "reason": "duplicate", "dup_of": seen[h]})
            img.save(os.path.join("clean/drop", fname))
            continue
        seen[h] = fname

        # sharpness check (the threshold must be tuned per dataset)
        s = sharpness_score(img)
        if s < 15:
            report["dropped"].append({"file": fname, "reason": "blurry", "score": float(s)})
            img.save(os.path.join("clean/drop", fname))
            continue

        # keep
        img.save(os.path.join("clean/ok", fname))
        report["kept"].append({"file": fname, "sharpness": float(s)})

    except Exception as e:
        # unreadable or corrupt files are recorded instead of stopping the run
        report["dropped"].append({"file": fname, "reason": f"error:{e}"})

with open("clean/report.json", "w", encoding="utf-8") as f:
    json.dump(report, f, ensure_ascii=False, indent=2)

Additional suggestions:

Record a histogram of drop reasons (blurry/duplicate/other).

Manually spot-check about 50 images to adjust the thresholds.
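
Both suggestions can be sketched against the report structure written by the pipeline (a dict with "kept" and "dropped" lists whose items carry a "file" key and, for dropped items, a "reason"). The helper names `drop_reason_histogram` and `spot_check_sample` are hypothetical, and grouping every `error:...` reason under a single "error" bucket is an assumption about how the counts are most useful.

```python
import random
from collections import Counter


def drop_reason_histogram(report):
    """Count dropped files per reason, folding 'error:<msg>' into 'error'."""
    counts = Counter()
    for item in report.get("dropped", []):
        reason = item.get("reason", "other")
        counts[reason.split(":", 1)[0]] += 1
    return dict(counts)


def spot_check_sample(report, k=50, seed=0):
    """Pick up to k file names (kept and dropped) for manual review."""
    files = [
        item["file"]
        for item in report.get("kept", []) + report.get("dropped", [])
    ]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(files, min(k, len(files)))


report = {
    "kept": [{"file": "a.jpg", "sharpness": 42.0}],
    "dropped": [
        {"file": "b.jpg", "reason": "blurry", "score": 3.1},
        {"file": "c.jpg", "reason": "duplicate", "dup_of": "a.jpg"},
        {"file": "d.jpg", "reason": "error:cannot identify image file"},
        {"file": "e.jpg", "reason": "blurry", "score": 7.9},
    ],
}
print(drop_reason_histogram(report))
print(spot_check_sample(report, k=3))
```

Sampling from both kept and dropped files lets the reviewer see false positives and false negatives of each threshold in the same pass.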

Further reading

PIL/Pillow docs

imagehash

OpenCV Image Quality Heuristics
