openai/evals: how to use the LLM evaluation tool
openai/evals: usage guide for the LLM test and evaluation utility / tool / framework / library
https://github.com/openai/evals - 14.3k 
//-----------------------------------------------------------------------------
< Installation >
- Download the source
git clone https://github.com/openai/evals
cd evals 
git lfs fetch --all 
git lfs pull 
//------------------------------------- 
* Create a virtual environment for Python 3.9
C:\Python\Python39\python.exe  -m venv py39 
py39\Scripts\activate 
py39\Scripts\python.exe -m pip install --upgrade pip 
//------------------------------------- 
pip install -e . 
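After the editable install, a quick sanity check can confirm that the interpreter and the package are in place. This is a minimal sketch: the helper names `python_ok` and `package_installed` are hypothetical, and the 3.9 threshold simply mirrors the virtual environment created above.

```python
import importlib.util
import sys

# Assumption: 3.9 is taken from the virtual environment created in these notes.
MIN_PYTHON = (3, 9)


def python_ok(version_info=None):
    """Return True if the interpreter version is at least MIN_PYTHON."""
    v = sys.version_info if version_info is None else version_info
    return tuple(v[:2]) >= MIN_PYTHON


def package_installed(name):
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(name) is not None


print("Python OK:", python_ok())
print("evals importable:", package_installed("evals"))
```

Run it with the interpreter from the activated environment (py39\Scripts\python.exe) so the check reflects that environment.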
//----------------------------------------------------------------------------- 
< Writing evals >
https://github.com/openai/evals?tab=readme-ov-file#writing-evals
Eval example: Getting Started with OpenAI Evals
https://cookbook.openai.com/examples/evaluation/getting_started_with_openai_evals
Guide to building an eval
https://github.com/openai/evals/blob/main/docs/build-eval.md
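* Create the eval dataset
JSONL with one sample per line, similar to the files used for fine-tuning; the basic eval templates expect an "input" (chat messages) and an "ideal" (expected answer) key on each line
- Sample path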
evals/registry/data/<eval_name>/samples.jsonl 
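The sample file layout can be sketched as follows. This is a minimal sketch, assuming the "input" / "ideal" keys described in build-eval.md; the eval name my-eval, the helper `write_samples`, and the sample content are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_samples(path, samples):
    """Write one JSON object per line (JSONL) and return the path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")
    return path


# Hypothetical sample: "input" is a chat-style message list,
# "ideal" is the expected answer.
SAMPLES = [
    {
        "input": [
            {"role": "system", "content": "Answer with a single word."},
            {"role": "user", "content": "What is the capital of France?"},
        ],
        "ideal": "Paris",
    },
]

# Written under a temporary directory here; in the repo the target would be
# evals/registry/data/<eval_name>/samples.jsonl.
demo_root = Path(tempfile.mkdtemp())
samples_path = write_samples(demo_root / "my-eval" / "samples.jsonl", SAMPLES)
print(samples_path.read_text(encoding="utf-8"))
```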
* Write the .yaml file
id - identifier of the eval
description - short description of the eval
disclaimer - additional notes about the eval
metrics - eval metric, one of three: match, includes, fuzzyMatch
- Sample path
evals/registry/evals/<eval_name>.yaml 
* Register the eval
- Example: test-match
Registered in evals\registry\evals\test-basic.yaml
""" 
test-match: 
  id: test-match.s1.simple-v0 
  description: Example eval that checks sampled text matches the expected output. 
  disclaimer: This is an example disclaimer. 
  metrics: [accuracy] 
test-match.s1.simple-v0:   # name format: <eval_name>.<split>.<version>
  class: evals.elsuite.test.match:TestMatch  # eval class, defined in evals\elsuite\test\match.py
""" 
//------------------------------------- 
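The <eval_name>.<split>.<version> naming format noted in the test-match registration can be sketched as a small parser. This is only an illustration: `EvalName` and `parse_eval_id` are hypothetical names, not part of the evals library, and the sketch assumes that none of the three parts contains a dot.

```python
from typing import NamedTuple


class EvalName(NamedTuple):
    eval_name: str
    split: str
    version: str


def parse_eval_id(eval_id: str) -> EvalName:
    """Split '<eval_name>.<split>.<version>' into its three parts."""
    parts = eval_id.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected <eval_name>.<split>.<version>, got {eval_id!r}"
        )
    return EvalName(*parts)


parsed = parse_eval_id("test-match.s1.simple-v0")
print(parsed)
```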
* Run the eval (basic example)
oaieval gpt-3.5-turbo <eval_name> 
oaieval gpt-3.5-turbo test-match 
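When the run is driven from a script, the command above can be assembled in Python. This is a minimal sketch: `build_oaieval_command` and `api_key_present` are hypothetical helpers, and it assumes the API key is supplied through the OPENAI_API_KEY environment variable. It only builds and prints the command; nothing is executed.

```python
import os
import shlex


def build_oaieval_command(model, eval_name):
    """Return the argv list for `oaieval <model> <eval_name>`."""
    if not model or not eval_name:
        raise ValueError("both model and eval_name are required")
    return ["oaieval", model, eval_name]


def api_key_present(env=None):
    """Return True if OPENAI_API_KEY is set to a non-empty value."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY"))


cmd = build_oaieval_command("gpt-3.5-turbo", "test-match")
print(shlex.join(cmd))
if not api_key_present():
    print("OPENAI_API_KEY is not set; the eval cannot call the API")
# To actually run it: subprocess.run(cmd, check=True)
```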
//----------------------------------------------------------------------------- 
< How to add a custom eval >
https://github.com/openai/evals/blob/main/docs/custom-eval.md
- Built-in modules
evals/api.py : provides common interfaces and utilities
evals/record.py : records eval results, e.g. to a local JSON file
evals/metrics.py : defines common metrics of interest
- Reference: machine translation eval example
evals/elsuite/translate.py
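As an illustration of the kind of metric evals/metrics.py holds, the accuracy used by the arithmetic example later in these notes can be approximated as below. This is a simplified stand-in, not the library's implementation: events are represented as plain dicts with a hypothetical "correct" flag.

```python
import math


def accuracy(events):
    """Fraction of events whose 'correct' flag is true (NaN if there are none)."""
    if not events:
        return float("nan")
    num_correct = sum(1 for event in events if event["correct"])
    return num_correct / len(events)


# Hypothetical match events: one correct, one incorrect.
demo_events = [{"correct": True}, {"correct": False}]
print(accuracy(demo_events))
```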
//-------------------------------------
Example: evaluating basic arithmetic ability
* Create the dataset
evals\registry\user_data\train.jsonl 
{"problem": "2+2=", "answer": "4"} 
{"problem": "4*4=", "answer": "16"} 
evals\registry\user_data\test.jsonl 
{"problem": "48+2=", "answer": "50"} 
{"problem": "5*20=", "answer": "100"} 
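The two files can also be generated and checked from Python. This is a minimal sketch: `write_jsonl` and `load_and_check` are hypothetical helpers, and the files are written to a temporary directory rather than evals\registry\user_data.

```python
import json
import tempfile
from pathlib import Path

TRAIN = [{"problem": "2+2=", "answer": "4"}, {"problem": "4*4=", "answer": "16"}]
TEST = [{"problem": "48+2=", "answer": "50"}, {"problem": "5*20=", "answer": "100"}]


def write_jsonl(path, rows):
    """Write each row as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def load_and_check(path, required=("problem", "answer")):
    """Read a JSONL file and verify every row has the required keys."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            missing = [key for key in required if key not in row]
            if missing:
                raise ValueError(f"{path}:{line_no} is missing keys {missing}")
            rows.append(row)
    return rows


data_dir = Path(tempfile.mkdtemp())
write_jsonl(data_dir / "train.jsonl", TRAIN)
write_jsonl(data_dir / "test.jsonl", TEST)
train_rows = load_and_check(data_dir / "train.jsonl")
test_rows = load_and_check(data_dir / "test.jsonl")
print(len(train_rows), "train samples,", len(test_rows), "test samples")
```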
* Write the eval Python class
- Create the file evals\elsuite\arithmetic.py
import random
import textwrap
import evals
import evals.metrics
class Arithmetic(evals.Eval):
    def __init__(self, train_jsonl, test_jsonl, train_samples_per_prompt=2, **kwargs):
        super().__init__(**kwargs)
        self.train_jsonl = train_jsonl
        self.test_jsonl = test_jsonl
        self.train_samples_per_prompt = train_samples_per_prompt
    def run(self, recorder):
        """
        Called by the `oaieval` CLI to run the eval. The `eval_all_samples` method calls `eval_sample`.
        """
        self.train_samples = evals.get_jsonl(self.train_jsonl)
        test_samples = evals.get_jsonl(self.test_jsonl)
        self.eval_all_samples(recorder, test_samples)
        # Record overall metrics
        return {
            "accuracy": evals.metrics.get_accuracy(recorder.get_events("match")),
        }
    def eval_sample(self, test_sample, rng: random.Random):
        """
        Called by the `eval_all_samples` method to evaluate a single sample.
        ARGS
        ====
        `test_sample`: a line from the JSONL test file
        `rng`: should be used for any randomness that is needed during evaluation
        This method does the following:
        1. Generate a prompt that contains the task statement, a few examples, and the test question.
        2. Generate a completion from the model.
        3. Check if the generated answer is correct.
        """
        stuffing = rng.sample(self.train_samples, self.train_samples_per_prompt)
        prompt = [
            {"role": "system", "content": "Solve the following math problems"},
        ]
        for i, sample in enumerate(stuffing + [test_sample]):
            if i < len(stuffing):
                prompt += [
                    {"role": "system", "content": sample["problem"], "name": "example_user"},
                    {"role": "system", "content": sample["answer"], "name": "example_assistant"},
                ]
            else:
                prompt += [{"role": "user", "content": sample["problem"]}]
        result = self.completion_fn(prompt=prompt, temperature=0.0, max_tokens=1)
        sampled = result.get_completions()[0]
        evals.record_and_check_match(prompt=prompt, sampled=sampled, expected=test_sample["answer"])
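The few-shot prompt that `eval_sample` assembles can be inspected without calling any model. The sketch below copies only the prompt-building step into a standalone function; `build_prompt` is a hypothetical name and the function does not depend on the evals package.

```python
import random


def build_prompt(train_samples, test_sample, rng, train_samples_per_prompt=2):
    """Build the chat prompt: task statement, few-shot examples, test question."""
    stuffing = rng.sample(train_samples, train_samples_per_prompt)
    prompt = [{"role": "system", "content": "Solve the following math problems"}]
    for sample in stuffing:
        # Few-shot examples are passed as named system messages.
        prompt.append(
            {"role": "system", "content": sample["problem"], "name": "example_user"}
        )
        prompt.append(
            {"role": "system", "content": sample["answer"], "name": "example_assistant"}
        )
    # The actual question comes last, as the user message.
    prompt.append({"role": "user", "content": test_sample["problem"]})
    return prompt


train = [{"problem": "2+2=", "answer": "4"}, {"problem": "4*4=", "answer": "16"}]
test_sample = {"problem": "48+2=", "answer": "50"}
prompt = build_prompt(train, test_sample, random.Random(0))
for message in prompt:
    print(message)
```

Passing a seeded `random.Random` keeps the choice of examples reproducible, which is why `eval_sample` receives `rng` instead of using the global random state.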
* Register the eval
Create the file evals/registry/evals/arithmetic.yaml
# Define a base eval 
arithmetic: 
  # id specifies the eval that this eval is an alias for 
  # in this case, arithmetic is an alias for arithmetic.dev.match-v1 
  # When you run `oaieval gpt-3.5-turbo arithmetic`, you are actually running `oaieval gpt-3.5-turbo arithmetic.dev.match-v1`
  id: arithmetic.dev.match-v1 
  # The metrics that this eval records 
  # The first metric will be considered to be the primary metric 
  metrics: [accuracy] 
  description: Evaluate arithmetic ability 
# Define the eval 
arithmetic.dev.match-v1: 
  # Specify the class name as a dotted path to the module and class 
  class: evals.elsuite.arithmetic:Arithmetic 
  # Specify the arguments as a dictionary of JSONL URIs 
  # These arguments can be anything that you want to pass to the class constructor 
  args: 
    train_jsonl: evals\registry\user_data\train.jsonl 
    test_jsonl: evals\registry\user_data\test.jsonl 
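The `class:` value uses a module:ClassName form. A rough way to check that such a path resolves is sketched below; `resolve_class_path` is a hypothetical helper, and the registry's own loading logic may differ. A standard-library class is used in the demo so the snippet runs without the evals package.

```python
import importlib


def resolve_class_path(class_path):
    """Import 'package.module:ClassName' and return the named attribute."""
    module_name, sep, attr = class_path.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:ClassName', got {class_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Demo with a standard-library class; for the eval above the argument
# would be "evals.elsuite.arithmetic:Arithmetic".
cls = resolve_class_path("collections:OrderedDict")
print(cls)
```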
* Run the eval
oaieval gpt-3.5-turbo arithmetic 
//----------------------------------------------------------------------------- 
< References >
Completion Functions 
https://github.com/openai/evals/blob/main/docs/completion-fns.md