openai/evals , how to use the LLM evaluation tool
openai/evals , LLM test evaluation util tool framework library usage
https://github.com/openai/evals - 14.3k
//-----------------------------------------------------------------------------
< Installation >
- Download the source
git clone https://github.com/openai/evals
cd evals
git lfs fetch --all
git lfs pull
//-------------------------------------
* Create a virtual environment for Python 3.9
C:\Python\Python39\python.exe -m venv py39
py39\Scripts\activate
py39\Scripts\python.exe -m pip install --upgrade pip
//-------------------------------------
pip install -e .
//-----------------------------------------------------------------------------
< Writing evals >
https://github.com/openai/evals?tab=readme-ov-file#writing-evals
Eval example) Getting Started with OpenAI Evals
https://cookbook.openai.com/examples/evaluation/getting_started_with_openai_evals
Guide to building an eval
https://github.com/openai/evals/blob/main/docs/build-eval.md
* Create the eval dataset
JSONL (one JSON object per line), the same format as the JSONL used for fine-tuning
- Sample path
evals/registry/data/<eval_name>/samples.jsonl
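A minimal sketch of writing such a file with the Python standard library. The `input`/`ideal` keys follow the sample layout described in the build-eval guide linked above, as far as I can tell; the `write_samples` helper and the sample content are hypothetical and only for illustration.

```python
import json
from pathlib import Path


def write_samples(path, samples):
    """Write one JSON object per line (JSONL), creating parent folders as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")


# Hypothetical sample: chat-style "input" plus the expected "ideal" answer.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with a single word."},
            {"role": "user", "content": "What is the capital of France?"},
        ],
        "ideal": "Paris",
    },
]
```

Calling `write_samples("evals/registry/data/<eval_name>/samples.jsonl", samples)` with a real eval name in place of the placeholder would produce the file at the path shown above.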
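* Write the .yaml file
id - identifier of the eval
description - short description of the eval
disclaimer - additional notes about the eval
metrics - evaluation metric; three basic match types are available: match, includes, fuzzyMatch (the registered examples in these notes list accuracy in this field)
- Sample path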
evals/registry/evals/<eval_name>.yaml
* Register the eval
- Example: test-match
Registered in the file evals\registry\evals\test-basic.yaml
"""
test-match:
  id: test-match.s1.simple-v0
  description: Example eval that checks sampled text matches the expected output.
  disclaimer: This is an example disclaimer.
  metrics: [accuracy]
test-match.s1.simple-v0: # name format: <eval_name>.<split>.<version>
  class: evals.elsuite.test.match:TestMatch # eval class path: evals\elsuite\test\match.py
"""
//-------------------------------------
* Run the eval (basic example)
oaieval gpt-3.5-turbo <eval_name>
oaieval gpt-3.5-turbo test-match
//-----------------------------------------------------------------------------
< Adding a custom eval >
https://github.com/openai/evals/blob/main/docs/custom-eval.md
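- Built-in modules
evals/api.py : provides common interfaces and utilities
evals/record.py : provides recording of eval results, e.g. to local JSON
evals/metrics.py : defines commonly used metrics
- See also: machine translation eval example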
evals/elsuite/translate.py
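To give a sense of the kind of metric evals/metrics.py covers, here is a rough sketch of an accuracy computation over match results. It is an illustration only and does not reproduce the library's implementation; the `accuracy` function and the list-of-dicts event shape are assumptions.

```python
def accuracy(events):
    """Fraction of events whose 'correct' flag is true (0.0 for an empty list)."""
    if not events:
        return 0.0
    num_correct = sum(1 for event in events if event["correct"])
    return num_correct / len(events)


# Hypothetical match results: three correct answers out of four.
events = [{"correct": True}, {"correct": False}, {"correct": True}, {"correct": True}]
print(accuracy(events))  # 0.75
```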
//-------------------------------------
Example: evaluating basic arithmetic ability
* Create the dataset
evals\registry\user_data\train.jsonl
{"problem": "2+2=", "answer": "4"}
{"problem": "4*4=", "answer": "16"}
evals\registry\user_data\test.jsonl
{"problem": "48+2=", "answer": "50"}
{"problem": "5*20=", "answer": "100"}
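Before running an eval it can help to check that every line of the dataset parses and carries both keys. The sketch below uses only the standard library; `parse_jsonl` is a hypothetical helper, not part of evals, and the `problem`/`answer` keys are the ones used in the files above.

```python
import json


def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, checking the expected keys."""
    samples = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        sample = json.loads(line)
        missing = {"problem", "answer"} - sample.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        samples.append(sample)
    return samples


# Same content as train.jsonl above.
train_text = '{"problem": "2+2=", "answer": "4"}\n{"problem": "4*4=", "answer": "16"}\n'
train_samples = parse_jsonl(train_text)
print(len(train_samples))  # 2
```

Note that the answers are stored as strings, which matches how they are compared against the sampled model text later on.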
* Write the eval Python class
- Create the file evals\elsuite\arithmetic.py
import random
import textwrap

import evals
import evals.metrics


class Arithmetic(evals.Eval):
    def __init__(self, train_jsonl, test_jsonl, train_samples_per_prompt=2, **kwargs):
        super().__init__(**kwargs)
        self.train_jsonl = train_jsonl
        self.test_jsonl = test_jsonl
        self.train_samples_per_prompt = train_samples_per_prompt

    def run(self, recorder):
        """
        Called by the `oaieval` CLI to run the eval. The `eval_all_samples` method calls `eval_sample`.
        """
        self.train_samples = evals.get_jsonl(self.train_jsonl)
        test_samples = evals.get_jsonl(self.test_jsonl)
        self.eval_all_samples(recorder, test_samples)

        # Record overall metrics
        return {
            "accuracy": evals.metrics.get_accuracy(recorder.get_events("match")),
        }

    def eval_sample(self, test_sample, rng: random.Random):
        """
        Called by the `eval_all_samples` method to evaluate a single sample.

        ARGS
        ====
        `test_sample`: a line from the JSONL test file
        `rng`: should be used for any randomness that is needed during evaluation

        This method does the following:
        1. Generate a prompt that contains the task statement, a few examples, and the test question.
        2. Generate a completion from the model.
        3. Check if the generated answer is correct.
        """
        stuffing = rng.sample(self.train_samples, self.train_samples_per_prompt)

        prompt = [
            {"role": "system", "content": "Solve the following math problems"},
        ]

        for i, sample in enumerate(stuffing + [test_sample]):
            if i < len(stuffing):
                prompt += [
                    {"role": "system", "content": sample["problem"], "name": "example_user"},
                    {"role": "system", "content": sample["answer"], "name": "example_assistant"},
                ]
            else:
                prompt += [{"role": "user", "content": sample["problem"]}]

        result = self.completion_fn(prompt=prompt, temperature=0.0, max_tokens=1)
        sampled = result.get_completions()[0]

        evals.record_and_check_match(prompt=prompt, sampled=sampled, expected=test_sample["answer"])
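To see what prompt the model actually receives, the few-shot construction from `eval_sample` can be pulled out into a standalone function that needs neither the evals package nor an API key. `build_prompt` is a hypothetical name for this sketch; the message layout mirrors the class above.

```python
import random


def build_prompt(train_samples, test_sample, k, rng):
    """Build a chat prompt: task statement, k few-shot examples, then the question."""
    stuffing = rng.sample(train_samples, k)
    prompt = [{"role": "system", "content": "Solve the following math problems"}]
    for sample in stuffing:
        prompt += [
            {"role": "system", "content": sample["problem"], "name": "example_user"},
            {"role": "system", "content": sample["answer"], "name": "example_assistant"},
        ]
    prompt.append({"role": "user", "content": test_sample["problem"]})
    return prompt


train = [{"problem": "2+2=", "answer": "4"}, {"problem": "4*4=", "answer": "16"}]
test = {"problem": "48+2=", "answer": "50"}
prompt = build_prompt(train, test, k=2, rng=random.Random(0))
for message in prompt:
    print(message)
```

With two training samples and `k=2` the prompt has six messages: the task statement, two problem/answer pairs in a seed-dependent order, and the test question. `rng.sample` raises `ValueError` if `k` exceeds the number of training samples, so the training file needs at least `train_samples_per_prompt` lines.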
* Register the eval
Create the file evals/registry/evals/arithmetic.yaml
# Define a base eval
arithmetic:
  # id specifies the eval that this eval is an alias for
  # in this case, arithmetic is an alias for arithmetic.dev.match-v1
  # When you run `oaieval davinci arithmetic`, you are actually running `oaieval davinci arithmetic.dev.match-v1`
  id: arithmetic.dev.match-v1
  # The metrics that this eval records
  # The first metric will be considered to be the primary metric
  metrics: [accuracy]
  description: Evaluate arithmetic ability
# Define the eval
arithmetic.dev.match-v1:
  # Specify the class name as a dotted path to the module and class
  class: evals.elsuite.arithmetic:Arithmetic
  # Specify the arguments as a dictionary of JSONL URIs
  # These arguments can be anything that you want to pass to the class constructor
  args:
    train_jsonl: evals\registry\user_data\train.jsonl
    test_jsonl: evals\registry\user_data\test.jsonl
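The id `arithmetic.dev.match-v1` follows the `<eval_name>.<split>.<version>` naming pattern. A small helper can make that structure explicit; `parse_eval_id` and `EvalId` are hypothetical names for illustration and are not part of the evals library.

```python
from typing import NamedTuple


class EvalId(NamedTuple):
    eval_name: str
    split: str
    version: str


def parse_eval_id(full_id: str) -> EvalId:
    """Split '<eval_name>.<split>.<version>' into its three parts."""
    parts = full_id.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected <eval_name>.<split>.<version>, got {full_id!r}")
    return EvalId(*parts)


parsed = parse_eval_id("arithmetic.dev.match-v1")
print(parsed.eval_name, parsed.split, parsed.version)  # arithmetic dev match-v1
```

The short alias `arithmetic` has no split or version, so this helper rejects it; only the full id carries all three parts.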
* Run the eval
oaieval gpt-3.5-turbo arithmetic
//-----------------------------------------------------------------------------
< References >
Completion Functions
https://github.com/openai/evals/blob/main/docs/completion-fns.md