openai/evals: how to use the LLM evaluation tool

codens 2024. 6. 30. 03:36

How to use openai/evals, an LLM test/evaluation utility, tool, framework and library

https://github.com/openai/evals - 14.3k

//-----------------------------------------------------------------------------
< Installation >

- Download the source
git clone https://github.com/openai/evals
cd evals
git lfs fetch --all
git lfs pull

//-------------------------------------
* Create a virtual environment for Python 3.9
C:\Python\Python39\python.exe  -m venv py39
py39\Scripts\activate
py39\Scripts\python.exe -m pip install --upgrade pip
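
To confirm the commands above took effect, a small check can be run from inside the activated environment. This is a minimal sketch: the helper names `in_virtualenv` and `python_at_least` are made up for illustration, and the 3.9 floor simply mirrors the version used in this post.

```python
import sys


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv (prefix differs from the base install)."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def python_at_least(major: int, minor: int) -> bool:
    """True when the running interpreter is at least the given version."""
    return sys.version_info >= (major, minor)


if __name__ == "__main__":
    print("virtualenv active:", in_virtualenv())
    print("python >= 3.9:", python_at_least(3, 9))
```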

//-------------------------------------
pip install -e .

//-----------------------------------------------------------------------------
< Writing evals >
https://github.com/openai/evals?tab=readme-ov-file#writing-evals

Eval example) Getting Started with OpenAI Evals
https://cookbook.openai.com/examples/evaluation/getting_started_with_openai_evals

Manual for building an eval
https://github.com/openai/evals/blob/main/docs/build-eval.md

* Create the evaluation dataset
JSONL (one JSON object per line), the same kind of file used for fine-tuning
- Sample path
evals/registry/data/<eval_name>/samples.jsonl
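
As a rough illustration of that layout, the sketch below builds two JSONL lines in Python. It assumes the `input` / `ideal` keys that the build-eval manual linked above uses for the basic templates; the helper names and the sample contents are hypothetical.

```python
import json


def make_sample(system: str, user: str, ideal: str) -> dict:
    """Build one sample: a chat-style `input` plus the expected `ideal` answer."""
    return {
        "input": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "ideal": ideal,
    }


def to_jsonl(samples: list) -> str:
    """Serialize samples as JSONL: one JSON object per line."""
    return "".join(json.dumps(s, ensure_ascii=False) + "\n" for s in samples)


if __name__ == "__main__":
    samples = [
        make_sample("Answer with a number only.", "2+2=", "4"),
        make_sample("Answer with a number only.", "4*4=", "16"),
    ]
    # Redirect this output into samples.jsonl to get a file in the layout above.
    print(to_jsonl(samples), end="")
```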

* Write the .yaml file
id - identifier of the eval
description - short description of the eval
disclaimer - additional notes about the eval
metrics - the evaluation metric; choose one of three: match, includes, fuzzyMatch

- Sample path
evals/registry/evals/<eval_name>.yaml
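
The three options named on the metrics line differ in how strictly a model's answer is compared with the expected one. The functions below are a simplified sketch of that idea, not the library's implementation: the names are hypothetical, and the normalization used for the fuzzy variant (lowercasing and trimming whitespace) is an assumption.

```python
def match(sampled: str, expected: str) -> bool:
    """Strict check: the sampled text must start with the expected answer."""
    return sampled.strip().startswith(expected)


def includes(sampled: str, expected: str) -> bool:
    """Looser check: the expected answer appears anywhere in the sampled text."""
    return expected in sampled


def fuzzy_match(sampled: str, expected: str) -> bool:
    """Loosest check: after normalization, either string contains the other."""
    a, b = sampled.strip().lower(), expected.strip().lower()
    if not a or not b:
        # Avoid treating an empty string as "contained" in everything.
        return a == b
    return a in b or b in a


if __name__ == "__main__":
    answer = "The answer is 4"
    print(match(answer, "4"), includes(answer, "4"), fuzzy_match(answer, "4"))
```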

* Register the eval
- Example: test-match
Registered in the file evals\registry\evals\test-basic.yaml

"""
test-match:
  id: test-match.s1.simple-v0
  description: Example eval that checks sampled text matches the expected output.
  disclaimer: This is an example disclaimer.
  metrics: [accuracy]
test-match.s1.simple-v0:   # name format: <eval_name>.<split>.<version>
  class: evals.elsuite.test.match:TestMatch  # path to the eval class: evals\elsuite\test\match.py
"""

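The registry entry above has two parts: a short alias that points at a full id, and the full id that names the class. The sketch below follows that chain over a plain dict standing in for the parsed YAML; `resolve_eval` and `split_name` are hypothetical helpers, not part of the library.

```python
# A plain dict standing in for the parsed contents of test-basic.yaml.
REGISTRY = {
    "test-match": {
        "id": "test-match.s1.simple-v0",
        "description": "Example eval that checks sampled text matches the expected output.",
        "metrics": ["accuracy"],
    },
    "test-match.s1.simple-v0": {
        "class": "evals.elsuite.test.match:TestMatch",
    },
}


def resolve_eval(name: str, registry: dict) -> tuple:
    """Follow `id` aliases until an entry with a `class` key is reached."""
    seen = set()
    while "class" not in registry[name]:
        if name in seen:
            raise ValueError(f"alias loop at {name!r}")
        seen.add(name)
        name = registry[name]["id"]
    return name, registry[name]["class"]


def split_name(full_id: str) -> tuple:
    """Split <eval_name>.<split>.<version> into its three parts."""
    eval_name, split, version = full_id.split(".", 2)
    return eval_name, split, version


if __name__ == "__main__":
    full_id, class_path = resolve_eval("test-match", REGISTRY)
    print(full_id, class_path, split_name(full_id))
```
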
//-------------------------------------
* Run the eval (basic eval example)
oaieval gpt-3.5-turbo <eval_name>
oaieval gpt-3.5-turbo test-match
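
When the run is launched from a script rather than typed by hand, the same command can be assembled in Python. This is a sketch under assumptions: `build_oaieval_command`, `api_key_present` and `run_eval` are hypothetical helpers, only the two positional arguments shown above are used, and the check for `OPENAI_API_KEY` reflects the environment variable the OpenAI client normally reads.

```python
import os
import shlex
import subprocess


def build_oaieval_command(model: str, eval_name: str) -> list:
    """Return the argument list for `oaieval <model> <eval_name>`."""
    return ["oaieval", model, eval_name]


def api_key_present(env=os.environ) -> bool:
    """True when a non-empty OPENAI_API_KEY is available."""
    return bool(env.get("OPENAI_API_KEY"))


def run_eval(model: str, eval_name: str) -> int:
    """Run the eval as a subprocess and return its exit code."""
    if not api_key_present():
        raise RuntimeError("OPENAI_API_KEY is not set")
    cmd = build_oaieval_command(model, eval_name)
    print("running:", shlex.join(cmd))
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Only prints the command; call run_eval(...) to actually start the run.
    print(shlex.join(build_oaieval_command("gpt-3.5-turbo", "test-match")))
```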

//-----------------------------------------------------------------------------
< How to add a custom eval >
https://github.com/openai/evals/blob/main/docs/custom-eval.md

- Built-in modules
evals/api.py : provides common interfaces and utilities
evals/record.py : records eval results, for example to local JSON
evals/metrics.py : defines commonly used metrics
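
As an example of the kind of metric meant here, accuracy is the share of recorded match events that were marked correct. The sketch below computes it over plain dicts; the event shape (a `correct` flag under `data`) and the function name `accuracy` are assumptions made for illustration rather than the module's actual code.

```python
import math


def accuracy(events: list) -> float:
    """Fraction of events whose data["correct"] flag is true; NaN when there are no events."""
    if not events:
        return math.nan
    correct = sum(1 for event in events if event["data"]["correct"])
    return correct / len(events)


if __name__ == "__main__":
    events = [
        {"data": {"correct": True}},
        {"data": {"correct": False}},
    ]
    print(accuracy(events))
```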

- Reference: machine translation eval example
evals/elsuite/translate.py

//-------------------------------------
Example: evaluating basic arithmetic ability

* Create the dataset
evals\registry\user_data\train.jsonl
{"problem": "2+2=", "answer": "4"}
{"problem": "4*4=", "answer": "16"}

evals\registry\user_data\test.jsonl
{"problem": "48+2=", "answer": "50"}
{"problem": "5*20=", "answer": "100"}
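
Hand-written JSONL is easy to get subtly wrong (a missing quote or key). The following sketch writes the two files from Python and checks that every line carries both keys; `write_jsonl` and `validate_jsonl` are hypothetical helper names, and the output directory is taken from the paths above.

```python
import json
from pathlib import Path

TRAIN = [{"problem": "2+2=", "answer": "4"}, {"problem": "4*4=", "answer": "16"}]
TEST = [{"problem": "48+2=", "answer": "50"}, {"problem": "5*20=", "answer": "100"}]
REQUIRED_KEYS = {"problem", "answer"}


def write_jsonl(path: Path, rows: list) -> None:
    """Write one JSON object per line, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def validate_jsonl(path: Path) -> int:
    """Return the number of valid rows; raise ValueError on a row missing a required key."""
    count = 0
    with path.open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            row = json.loads(line)
            missing = REQUIRED_KEYS - set(row)
            if missing:
                raise ValueError(f"{path}:{lineno} is missing {sorted(missing)}")
            count += 1
    return count


if __name__ == "__main__":
    data_dir = Path("evals") / "registry" / "user_data"
    for name, rows in (("train.jsonl", TRAIN), ("test.jsonl", TEST)):
        write_jsonl(data_dir / name, rows)
        print(name, validate_jsonl(data_dir / name), "rows")
```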

* Write the eval Python class

- Create the file evals\elsuite\arithmetic.py

import random

import evals
import evals.metrics


class Arithmetic(evals.Eval):
    def __init__(self, train_jsonl, test_jsonl, train_samples_per_prompt=2, **kwargs):
        super().__init__(**kwargs)
        self.train_jsonl = train_jsonl
        self.test_jsonl = test_jsonl
        self.train_samples_per_prompt = train_samples_per_prompt

    def run(self, recorder):
        """
        Called by the `oaieval` CLI to run the eval. The `eval_all_samples` method calls `eval_sample`.
        """
        self.train_samples = evals.get_jsonl(self.train_jsonl)
        test_samples = evals.get_jsonl(self.test_jsonl)
        self.eval_all_samples(recorder, test_samples)

        # Record overall metrics
        return {
            "accuracy": evals.metrics.get_accuracy(recorder.get_events("match")),
        }

    def eval_sample(self, test_sample, rng: random.Random):
        """
        Called by the `eval_all_samples` method to evaluate a single sample.

        ARGS
        ====
        `test_sample`: a line from the JSONL test file
        `rng`: should be used for any randomness that is needed during evaluation

        This method does the following:
        1. Generate a prompt that contains the task statement, a few examples, and the test question.
        2. Generate a completion from the model.
        3. Check if the generated answer is correct.
        """
        stuffing = rng.sample(self.train_samples, self.train_samples_per_prompt)

        prompt = [
            {"role": "system", "content": "Solve the following math problems"},
        ]

        for i, sample in enumerate(stuffing + [test_sample]):
            if i < len(stuffing):
                prompt += [
                    {"role": "system", "content": sample["problem"], "name": "example_user"},
                    {"role": "system", "content": sample["answer"], "name": "example_assistant"},
                ]
            else:
                prompt += [{"role": "user", "content": sample["problem"]}]

        result = self.completion_fn(prompt=prompt, temperature=0.0, max_tokens=1)
        sampled = result.get_completions()[0]

        evals.record_and_check_match(prompt=prompt, sampled=sampled, expected=test_sample["answer"])
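
The prompt assembled in `eval_sample` can be inspected without calling a model. The sketch below lifts the same few-shot loop into a standalone function; `build_prompt` is a hypothetical name, and the seed passed to `random.Random` is arbitrary.

```python
import random


def build_prompt(train_samples: list, test_sample: dict, k: int, rng: random.Random) -> list:
    """System instruction, k worked examples, then the test question as the user turn."""
    prompt = [{"role": "system", "content": "Solve the following math problems"}]
    for sample in rng.sample(train_samples, k):
        prompt += [
            {"role": "system", "content": sample["problem"], "name": "example_user"},
            {"role": "system", "content": sample["answer"], "name": "example_assistant"},
        ]
    prompt.append({"role": "user", "content": test_sample["problem"]})
    return prompt


if __name__ == "__main__":
    train = [{"problem": "2+2=", "answer": "4"}, {"problem": "4*4=", "answer": "16"}]
    test = {"problem": "48+2=", "answer": "50"}
    for message in build_prompt(train, test, k=2, rng=random.Random(0)):
        print(message)
```

With k examples the prompt holds 2k + 2 messages: one instruction, two per example, and the final question.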

* Register the eval
Create the file evals/registry/evals/arithmetic.yaml

# Define a base eval
arithmetic:
  # id specifies the eval that this eval is an alias for
  # in this case, arithmetic is an alias for arithmetic.dev.match-v1
  # When you run `oaieval davinci arithmetic`, you are actually running `oaieval davinci arithmetic.dev.match-v1`
  id: arithmetic.dev.match-v1
  # The metrics that this eval records
  # The first metric will be considered to be the primary metric
  metrics: [accuracy]
  description: Evaluate arithmetic ability
# Define the eval
arithmetic.dev.match-v1:
  # Specify the class name as a dotted path to the module and class
  class: evals.elsuite.arithmetic:Arithmetic
  # Specify the arguments as a dictionary of JSONL URIs
  # These arguments can be anything that you want to pass to the class constructor
  args:
    train_jsonl: evals\registry\user_data\train.jsonl
    test_jsonl: evals\registry\user_data\test.jsonl
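
The `class` value uses a `module.path:ClassName` form. One common way to turn such a string into a class is `importlib`; the sketch below shows that general technique with a standard-library class so it runs anywhere. `load_class` is a hypothetical helper and is not claimed to be how the library itself loads evals.

```python
import importlib


def load_class(spec: str):
    """Import the module before the colon and return the attribute named after it."""
    module_path, _, class_name = spec.partition(":")
    if not module_path or not class_name:
        raise ValueError(f"expected 'module.path:ClassName', got {spec!r}")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


if __name__ == "__main__":
    # A standard-library stand-in for a spec such as evals.elsuite.arithmetic:Arithmetic.
    print(load_class("collections:OrderedDict"))
```

A typo in the module path surfaces as `ModuleNotFoundError` and a typo in the class name as `AttributeError`, which is a quick way to tell the two mistakes apart when a registration does not load.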

* Run the eval
oaieval gpt-3.5-turbo arithmetic

//-----------------------------------------------------------------------------
< References >
Completion Functions
https://github.com/openai/evals/blob/main/docs/completion-fns.md
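
The Arithmetic class earlier in this post relies on only two things from a completion function: it can be called with `prompt=` plus keyword options, and its result offers `get_completions()`. A stand-in with that shape is handy for dry runs without API calls. The class names below are hypothetical and the lookup-table behaviour is purely illustrative.

```python
class FixedResult:
    """Holds completions and exposes them through get_completions()."""

    def __init__(self, completions: list):
        self._completions = completions

    def get_completions(self) -> list:
        return self._completions


class LookupCompletionFn:
    """Answers the last message of a chat prompt from a lookup table; unknown prompts get ''."""

    def __init__(self, answers: dict):
        self.answers = answers

    def __call__(self, prompt, **kwargs) -> FixedResult:
        # Extra keyword options such as temperature or max_tokens are accepted and ignored.
        question = prompt[-1]["content"] if isinstance(prompt, list) else str(prompt)
        return FixedResult([self.answers.get(question, "")])


if __name__ == "__main__":
    fn = LookupCompletionFn({"48+2=": "50"})
    result = fn(prompt=[{"role": "user", "content": "48+2="}], temperature=0.0, max_tokens=1)
    print(result.get_completions()[0])
```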
