언어 감지 python 라이브러리

게시 2025/06/10

By SouthGlory

2 분읽는 시간

utf-8 을 써야 한다.
언어마다 호환되는 폰트가 다르다.

def contains_vietnamese_chars(text: str) -> bool:
    """
    베트남어 고유 문자 포함 여부 판단
    """
    vietnamese_chars &#x3D; "ĂÂÊÔƠƯăâêôơưĐđạộụịởốỉể"
    return any(c in vietnamese_chars for c in text)

실제로 어떤 언어를 사용한 건지 판단한다.

pip install fast-langdetect

(python3.13까지 지원, 현재 2025-06-10) facebook에서 만든 fasttext라는 라이브러리가 원래 유명한데, 업데이트가 더 이상 없고, python3.10까지만 지원되어서 직접 가장 최근 업데이트된 관련 라이브러리 검색해서 찾았음.

from fast_langdetect import detect
import re

def detect_languages_per_sentence(text: str, low_memory: bool &#x3D; True) -> list[dict]:
    """
    문장을 분리한 뒤 각 문장의 언어와 신뢰도를 반환.

    Args:
        text (str): 전체 텍스트
        low_memory (bool): 모델 용량 설정

    Returns:
        List[dict]: [{"sentence": str, "lang": str, "score": float}]
    """
    sentences &#x3D; re.split(r&#x27;(? 0.8 for r in per_sentence)

    if includes_vietnamese:
        pretendard_path &#x3D; os.path.join(
            os.path.dirname(os.path.dirname(default_font_path)),
            "Pretendard", "Pretendard-Regular.otf"
        )
        if os.path.exists(pretendard_path):
            return ImageFont.truetype(pretendard_path, font_size)
        else:
            print("⚠️ 베트남어가 감지되었지만 Pretendard 폰트를 찾을 수 없습니다. 기본 폰트 사용.")
    return ImageFont.truetype(default_font_path, font_size)

참고로, low_memory=True로 하면 좀더 빠를 뿐 아니라 성능 저하도 거의 안느껴짐.

+추가

인공지능이 아무래도 오답률이 있다보니, 그냥 하드하게 검사하는게 맘 편함. https://gist.github.com/southglory/52dad558587a20ea2cfd19d79c73c60e

detect character language manually detect character language manually. GitHub Gist: instantly share code, notes, and snippets. detect character language manually. GitHub Gist: instantly share code, notes, and snippets.

개발, 개발도구

인기 태그