◾Preprocessing

🔸Why preprocessing matters

[Penedo, Guilherme, et al. "The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only."]

Preprocessing removes about 90% of the raw data as duplicate or unnecessary information.

🔸Pipeline

🔻Types of text preprocessing

HTML tags, special characters, emoticons
Regular expressions
Stopwords
Stemming, lemmatization
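
The cleaning steps listed above can be chained into one small function. The following is a minimal sketch using only the Python standard library. The stopword set, the suffix list, and the names `preprocess` and `strip_suffix` are illustrative assumptions, and the suffix stripping is only a crude stand-in for a real stemmer or lemmatizer.

```python
import html
import re

# Tiny illustrative stopword set (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and"}

TAG_RE = re.compile(r"<[^>]+>")        # HTML tags
NON_WORD_RE = re.compile(r"[^\w\s]")   # special characters, emoji, punctuation
SPACE_RE = re.compile(r"\s+")          # runs of whitespace


def strip_suffix(word):
    """Crude stemming stand-in: drop a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Clean raw text and return a list of normalized tokens."""
    text = html.unescape(text)           # decode entities such as &amp;
    text = TAG_RE.sub(" ", text)         # remove HTML tags
    text = NON_WORD_RE.sub(" ", text)    # remove special characters and emoji
    text = SPACE_RE.sub(" ", text).strip().lower()
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return [strip_suffix(t) for t in tokens]


print(preprocess("<p>The cats are walking &amp; jumping 😀!</p>"))
# ['cat', 'walk', 'jump']
```

The order matters in this sketch: entities are decoded and tags removed before the special-character pass, otherwise the angle brackets would be stripped and the tag names would be left behind as words.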

🔸Preprocessing tools

🔻Types of morphological analyzers

◾Tokenization

Choose the tokenizer according to the unit that becomes a token: eojeol (space-delimited chunk), word, morpheme, syllable, etc.
- character-based, word-based, subword-based, …
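
To make the difference between these granularities concrete, here is a minimal sketch in plain Python. The function names and the toy vocabulary are assumptions for illustration. The subword function uses greedy longest-match segmentation, which is only one simple strategy and not the method of any particular tokenizer library.

```python
def char_tokens(text):
    """Character-based: every non-whitespace character is a token."""
    return [ch for ch in text if not ch.isspace()]


def word_tokens(text):
    """Word-based: split on whitespace."""
    return text.split()


def subword_tokens(word, vocab):
    """Subword-based: greedy longest-match against a toy vocabulary."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matched: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces


toy_vocab = {"token", "ization", "un", "happy"}  # hypothetical vocabulary
print(word_tokens("text preprocessing"))         # ['text', 'preprocessing']
print(char_tokens("ab c"))                       # ['a', 'b', 'c']
print(subword_tokens("tokenization", toy_vocab)) # ['token', 'ization']
```

For Korean text stored as precomposed Hangul, `word_tokens` yields space-delimited units (eojeol) and `char_tokens` yields syllables. Morpheme-level tokens need a morphological analyzer, which this sketch does not cover.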

🔸Considerations when tokenizing

Exclude punctuation, special characters, and the like.
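
One simple way to leave out punctuation and special characters is to keep only runs of word characters. This is a sketch under that assumption. The function name is hypothetical, and the pattern also keeps digits and underscores because they count as word characters in Python's `re` module.

```python
import re

WORD_RE = re.compile(r"\w+")  # runs of letters, digits, and underscores


def tokenize_without_punctuation(text):
    """Return word tokens, dropping punctuation and other symbols."""
    return WORD_RE.findall(text)


print(tokenize_without_punctuation("Hello, world! (test)"))
# ['Hello', 'world', 'test']
```

Dropping every symbol has side effects: `"don't"` comes out as `don` and `t`, and hyphenated words are split in two. Whether that is acceptable depends on the task.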