◾Traditional evaluation methods

🔻Confusion Matrix

e.g.
ground truth(gt) : Here is a breakdown of what is happening and why
prediction : Here is a what happening and also why

Precision = TP / (TP + FP)
- TP : tokens that appear in both the prediction and the gt.
- FP : tokens that appear in the prediction but not in the gt.
- TP + FP = length of the prediction sequence
$$ \operatorname{Precision} =\frac{\text{\# of correct tokens}}{\operatorname{len}(\text{prediction})} $$
example

ground truth(gt) : Here is a breakdown of what is happening and why
prediction : Here is a what happening and also why

→ precision = (7) / (7 + 1) = 7/8
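The precision count above can be reproduced with a few lines of Python. This is a minimal sketch, assuming whitespace tokenization, case-sensitive matching, and clipped counts (a repeated token is credited at most as often as it occurs in the gt); the function name `token_precision` is illustrative and not taken from any library.

```python
from collections import Counter

def token_precision(prediction: str, gt: str) -> float:
    """Share of prediction tokens that also occur in the gt (clipped counts)."""
    pred_counts = Counter(prediction.split())
    gt_counts = Counter(gt.split())
    # Counter & Counter keeps the minimum count per token, so a token is
    # never credited more often than it occurs in the gt.
    tp = sum((pred_counts & gt_counts).values())
    total_pred = sum(pred_counts.values())  # TP + FP = len(prediction)
    return tp / total_pred if total_pred else 0.0

gt = "Here is a breakdown of what is happening and why"
prediction = "Here is a what happening and also why"
print(token_precision(prediction, gt))  # 0.875, i.e. 7/8
```

The only false positive is "also", which is why the denominator is 7 + 1.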

Recall = TP / (TP + FN)
- TP : tokens that appear in both the prediction and the gt.
- FN : tokens that appear in the gt but not in the prediction.
- TP + FN = length of the gt sequence
$$ \operatorname{Recall} =\frac{\text{\# of correct tokens}}{\operatorname{len}(\text{gt})} $$
example

ground truth(gt) : Here is a breakdown of what is happening and why
prediction : Here is a what happening and also why

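→ recall = (7) / (7 + 3) = 7/10

→ "is" occurs twice in the gt and is generally counted as 2 tokens, so len(gt) = 10. The prediction contains "is" only once, so the second occurrence counts as a FN (together with "breakdown" and "of").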
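To make the FN tokens explicit, recall can be sketched in Python as follows. As an assumption, tokens are split on whitespace and duplicates are counted separately, so len(gt) is 10 for this example; `token_recall` is a hypothetical helper name.

```python
from collections import Counter

def token_recall(prediction: str, gt: str):
    """Return (recall, missed gt tokens) using clipped token counts."""
    pred_counts = Counter(prediction.split())
    gt_counts = Counter(gt.split())
    tp = sum((pred_counts & gt_counts).values())
    # Counter subtraction keeps only positive remainders: the gt tokens
    # that the prediction failed to cover (the FN tokens).
    missed = gt_counts - pred_counts
    total_gt = sum(gt_counts.values())  # TP + FN = len(gt)
    recall = tp / total_gt if total_gt else 0.0
    return recall, sorted(missed.elements())

gt = "Here is a breakdown of what is happening and why"
prediction = "Here is a what happening and also why"
recall, missed = token_recall(prediction, gt)
print(recall)  # 0.7, i.e. 7/10
print(missed)  # ['breakdown', 'is', 'of']
```

The second "is" shows up among the missed tokens because the prediction contains the word only once.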

F1-score: defined the same way as the standard F1-score, i.e. the harmonic mean of precision and recall.

$$ \operatorname{F1-score} =2\times \frac{\text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} $$
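As a quick check of the formula, the sketch below plugs in the token-level values for the running example (precision 7/8 and recall 7/10, i.e. 7 correct tokens over a prediction of 8 tokens and a gt of 10 tokens). Returning 0 when both inputs are 0 is an assumed convention to avoid division by zero.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention: no overlap at all gives F1 = 0
    return 2 * precision * recall / (precision + recall)

f1 = f1_score(7 / 8, 7 / 10)
print(round(f1, 4))  # 0.7778, i.e. 7/9
```

Equivalently, F1 = 2·TP / (len(prediction) + len(gt)) = 14/18 for this example.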

Use (BLEU): mainly used for machine translation and text generation, and also frequently used for Vision-Language tasks.
Metrics

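$$ \operatorname{BLEU}=\min\left(1,\frac{\operatorname{len}(\text{pred})}{\operatorname{len}(\text{gt})}\right)\left(\prod_{i=1}^N \operatorname{Precision}_i\right)^{\frac{1}{N}} $$
- $\left(\prod_{i=1}^N \operatorname{Precision}_i\right)^{\frac{1}{N}}$ : geometric mean of the n-gram precisions for n = 1, ..., N. It evaluates precision at several n-gram sizes rather than on single tokens only.
- $\min\left(1,\frac{\operatorname{len}(\text{pred})}{\operatorname{len}(\text{gt})}\right)$ : brevity penalty term. Precision is computed relative to the prediction, so a model that outputs a single word that happens to be in the gt would get a perfect score. To prevent this, the n-gram precision score is scaled down the shorter the prediction is compared with the gt.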
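The two terms can be put together in a short Python sketch. It follows the simplified form used in these notes: a length-ratio brevity penalty multiplied by the geometric mean of the n-gram precisions (exponent 1/N). The assumptions are whitespace tokenization, a single reference, clipped n-gram counts, and no smoothing; the helper names are illustrative. The original BLEU definition uses an exponential brevity penalty instead, so library implementations may return different values.

```python
from collections import Counter
import math

def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(pred_tokens, gt_tokens, n):
    """Clipped n-gram precision: matched n-grams / n-grams in the prediction."""
    pred = ngram_counts(pred_tokens, n)
    gt = ngram_counts(gt_tokens, n)
    total = sum(pred.values())
    if total == 0:
        return 0.0
    return sum((pred & gt).values()) / total

def simple_bleu(prediction, gt, max_n=4):
    """Simplified BLEU: min(1, len ratio) * geometric mean of n-gram precisions."""
    pred_tokens, gt_tokens = prediction.split(), gt.split()
    if not pred_tokens or not gt_tokens:
        return 0.0
    precisions = [ngram_precision(pred_tokens, gt_tokens, n)
                  for n in range(1, max_n + 1)]
    brevity = min(1.0, len(pred_tokens) / len(gt_tokens))
    return brevity * math.prod(precisions) ** (1.0 / max_n)

gt = "Here is a breakdown of what is happening and why"
prediction = "Here is a what happening and also why"
print(simple_bleu(prediction, gt, max_n=2))  # 0.8 * sqrt(7/8 * 3/7), about 0.49
print(simple_bleu(prediction, gt, max_n=4))  # 0.0, no 4-gram matches
```

With `max_n=4` the score collapses to 0 because a single zero precision zeroes the product; practical implementations usually apply some form of smoothing for short sentences.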
example