This is Matayoshi from G-gen. This article introduces the Gen AI evaluation service, an API on Vertex AI for evaluating generative AI output quickly and efficiently.

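Overview

The Gen AI evaluation service is a feature for efficiently evaluating the output of generative AI applications. It is offered as an API within Vertex AI. With it, you can quantitatively evaluate an application's performance using predefined metrics or custom metrics that you define yourself.

Comparable LLM evaluation tools include open-source frameworks such as Ragas. The advantages of the Gen AI evaluation service are that it integrates seamlessly with Vertex AI and that, as a managed service, it requires no infrastructure management. On the other hand, it offers fewer metric templates than Ragas, and it incurs API usage charges, although they are small.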

Reference: Gen AI evaluation service overview

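Use cases

The Gen AI evaluation service is useful in use cases such as the following.

Selecting a generative AI model
Searching for optimal model parameters
Tuning prompts (prompt engineering)
Evaluating fine-tuning
Evaluating RAG (Retrieval Augmented Generation)
Evaluating function calling

The official documentation below publishes sample notebooks for each use case, which you can use as a reference.

Reference: Notebooks for evaluation use cases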

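About evaluation metrics

Evaluation types

The Gen AI evaluation service has two evaluation types: computation-based and model-based.

Computation-based evaluation calculates a score by comparing the output against ground truth. Because it is fast, it also suits real-time evaluation. Representative metrics include BLEU and ROUGE, which are widely used in natural language processing.

However, when evaluating LLM output, preparing ground truth can be difficult in the first place, because it is hard to define a single correct answer for an LLM's output. For this reason, evaluation often relies on human judgment, but the effort grows as the volume of evaluations scales. Against this background, model-based evaluation is now attracting attention.

Model-based evaluation uses an LLM itself as the judge model and evaluates output in a way that approximates human evaluation. It does not require ground truth, it can evaluate against complex criteria such as fluency and coherence, and the criteria can be adjusted flexibly by tuning the prompt.

Evaluation type	Approach	Ground truth	Latency
Computation-based	Evaluates with mathematical formulas	Required	Low
Model-based	Has a judge model (LLM) evaluate	Optional	High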

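Computation-based

Computation-based metrics quantify how closely the output of the candidate model (the LLM being evaluated) matches the ground truth. The following are metrics for text generation.

Metric	Description	Suitable use cases
Exact match	Outputs 1 if the candidate model's output exactly matches the ground truth, and 0 otherwise.	QA and classification tasks
BLEU	Calculates the degree of n-gram overlap between the candidate model's output and the ground truth. The output is in the range [0, 1]; a higher score means the generated text is closer to the ground truth.	Translation tasks
ROUGE	Calculates an F1 score over the n-grams of the candidate model's output and the ground truth. The output is in the range [0, 1]; a higher score means the content is more similar.	Summarization tasks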
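
To make the idea of computation-based scoring concrete, here is a minimal sketch in plain Python. It is not the service's implementation: `exact_match` and `unigram_f1` are hypothetical helper names, and `unigram_f1` is a simplified ROUGE-1-style overlap that assumes whitespace tokenization.

```python
from collections import Counter


def exact_match(response: str, reference: str) -> int:
    """Return 1 if the response equals the reference exactly, else 0."""
    return int(response == reference)


def unigram_f1(response: str, reference: str) -> float:
    """F1 score over whitespace-separated unigrams (a ROUGE-1-style overlap)."""
    response_counts = Counter(response.split())
    reference_counts = Counter(reference.split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((response_counts & reference_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(response_counts.values())
    recall = overlap / sum(reference_counts.values())
    return 2 * precision * recall / (precision + recall)


print(exact_match("Minato Ward", "Minato Ward"))
print(unigram_f1("the cat sat", "the cat sat on the mat"))
```

Both functions need a reference string, which is why ground truth is required for this evaluation type.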

There are also other metrics, such as those for function calling. See the official documentation below for details.

Reference: Computation-based metrics

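Model-based

With model-based metrics, an LLM acts as the judge model and evaluates the candidate model's output. This approach is commonly called LLM-as-a-Judge.

There are two evaluation methods: Pointwise, which scores a single output, and Pairwise, which compares multiple outputs.

Method	Description	Use cases
Pointwise	Assigns a score to the output of a single candidate model	Continuous monitoring in production
Pairwise	Compares the outputs of two candidate models and selects the better one	Model selection and prompt comparison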
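
The two methods produce different kinds of results, so they are summarized differently. The sketch below uses made-up judge results to illustrate this: pointwise scores are averaged, while pairwise choices are turned into a win rate. The score values, the `"CANDIDATE"` / `"BASELINE"` labels, and the `win_rate` helper are assumptions for illustration only.

```python
from statistics import mean

# Hypothetical pointwise results: one 1-5 score per response.
pointwise_scores = [5, 3, 4, 4]

# Hypothetical pairwise results: which of two candidates the judge preferred.
pairwise_choices = ["CANDIDATE", "BASELINE", "CANDIDATE", "CANDIDATE"]


def win_rate(choices: list[str], label: str) -> float:
    """Fraction of comparisons in which `label` was preferred."""
    return sum(choice == label for choice in choices) / len(choices)


print(mean(pointwise_scores))
print(win_rate(pairwise_choices, "CANDIDATE"))
```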

The Gen AI evaluation service also provides predefined prompt templates tailored to various tasks.

	Text generation	Multi-turn conversation	Summarization	QA quality
Pointwise	・Fluency ・Groundedness etc.	・Multi-turn Chat Quality ・Multi-turn Safety	・Summarization Quality	・Question Answering Quality
Pairwise	・Fluency ・Groundedness etc.	・Multi-turn Chat Quality ・Multi-turn Safety	・Summarization Quality	・Question Answering Quality

These templates may be updated, so see the official documentation below for the latest information.

Reference: Metric prompt templates for model-based evaluation

Pricing

Gen AI evaluation service pricing is calculated based on the number of input and output characters and the evaluation type.

Evaluation type	Price
Model-based (Pointwise, Pairwise)	Input: $0.005 per 1k characters, Output: $0.015 per 1k characters
Computation-based	Input: $0.00003 per 1k characters, Output: $0.00009 per 1k characters

Reference: Vertex AI Pricing - Gen AI Evaluation Service
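
As a rough illustration of how these per-character prices add up, the following sketch estimates the cost of one evaluation run. The prices are the ones listed above. The workload (400 rows with 2,000 input and 500 output characters each) and the `estimate_cost` helper are hypothetical, and exactly which characters are billed as input and output should be checked on the pricing page.

```python
# Prices per 1,000 characters in USD, as listed in the pricing table.
PRICES_PER_1K_CHARS = {
    "model_based": {"input": 0.005, "output": 0.015},
    "computation_based": {"input": 0.00003, "output": 0.00009},
}


def estimate_cost(eval_type: str, input_chars: int, output_chars: int) -> float:
    """Estimate the charge in USD for the given character counts."""
    price = PRICES_PER_1K_CHARS[eval_type]
    input_cost = (input_chars / 1000) * price["input"]
    output_cost = (output_chars / 1000) * price["output"]
    return input_cost + output_cost


# Hypothetical workload: 400 rows, 2,000 input and 500 output characters each.
rows = 400
cost = estimate_cost("model_based", rows * 2000, rows * 500)
print(f"${cost:.2f}")
```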

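Trying it out

Overview

Here we walk through an example that evaluates the accuracy of answers generated by a RAG system.

The author uses Colab Enterprise as the execution environment. For how to use Colab Enterprise, see the quickstart in the official documentation below.

Reference: Create a notebook by using the Google Cloud console

Preparation

1. Install the library

!pip install "google-cloud-aiplatform[evaluation]==1.71.0"

2. Initialize Vertex AI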

import vertexai
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples, PointwiseMetric
import pandas as pd
  
  
PROJECT_ID = ""  # @param {type:"string"}
LOCATION = ""  # @param {type:"string"}
EXPERIMENT = ""  # @param {type:"string"}
  
vertexai.init(
    project=PROJECT_ID, 
    location=LOCATION
)

3. Check the predefined metric templates

MetricPromptTemplateExamples.list_example_metric_names()

The output is as follows.

['coherence',
 'fluency',
 'safety',
 'groundedness',
 'instruction_following',
 'verbosity',
 'text_quality',
 'summarization_quality',
 'question_answering_quality',
 'multi_turn_chat_quality',
 'multi_turn_safety',
 'pairwise_coherence',
 'pairwise_fluency',
 'pairwise_safety',
 'pairwise_groundedness',
 'pairwise_instruction_following',
 'pairwise_verbosity',
 'pairwise_text_quality',
 'pairwise_summarization_quality',
 'pairwise_question_answering_quality',
 'pairwise_multi_turn_chat_quality',
 'pairwise_multi_turn_safety']

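4. Check the contents of the question_answering_quality template

# Display the contents of the prompt template
print(MetricPromptTemplateExamples.get_prompt_template("question_answering_quality"))

The evaluation criteria in the question_answering_quality template also look applicable to evaluating applications built on a RAG architecture, so this article uses the template as is. The output is as follows.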

# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated response.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
You will be assessing question answering quality, which measures the overall quality of the answer to the question in user input. The instruction for performing a question-answering task is provided in the user prompt.

## Criteria
Instruction following: The response demonstrates a clear understanding of the question answering task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context if the context is present in user prompt. The response does not reference any outside information.
Completeness: The response completely answers the question with sufficient detail.
Fluent: The response is well-organized and easy to read.

## Rating Rubric
5: (Very good). The answer follows instructions, is grounded, complete, and fluent.
4: (Good). The answer follows instructions, is grounded, complete, but is not very fluent.
3: (Ok). The answer mostly follows instructions, is grounded, answers the question partially and is not very fluent.
2: (Bad). The answer does not follow the instructions very well, is incomplete or not fully grounded.
1: (Very bad). The answer does not follow the instructions, is wrong and not grounded.

## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, groundedness, completeness and fluency according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## User Inputs
### Prompt
{prompt}

## AI-generated Response
{response}

5. Define the custom metric helpfulness

# Create the custom template
helpfulness_prompt_template = """
You are a professional writing evaluator. Your job is to score writing responses according to pre-defined evaluation criteria.
  
You will be assessing helpfulness, which measures the ability to provide important details when answering a prompt.
  
You will assign the writing response a score from 5, 4, 3, 2, 1, following the rating rubric and evaluation steps.
  
## Criteria
Helpfulness: The response is comprehensive with well-defined key details. The user would feel very satisfied with the content in a good response.
  
## Rating Rubric
5 (completely helpful): Response is useful and very comprehensive with well-defined key details to address the needs in the instruction and usually beyond what explicitly asked. The user would feel very satisfied with the content in the response.
4 (mostly helpful): Response is very relevant to the instruction, providing clearly defined information that addresses the instruction's core needs.  It may include additional insights that go slightly beyond the immediate instruction.  The user would feel quite satisfied with the content in the response.
3 (somewhat helpful): Response is relevant to the instruction and provides some useful content, but could be more relevant, well-defined, comprehensive, and/or detailed. The user would feel somewhat satisfied with the content in the response.
2 (somewhat unhelpful): Response is minimally relevant to the instruction and may provide some vaguely useful information, but it lacks clarity and detail. It might contain minor inaccuracies. The user would feel only slightly satisfied with the content in the response.
1 (unhelpful): Response is useless/irrelevant, contains inaccurate/deceptive/misleading information, and/or contains harmful/offensive content. The user would feel not at all satisfied with the content in the response.
  
## Evaluation Steps
STEP 1: Assess comprehensiveness: does the response provide specific, comprehensive, and clearly defined information for the user needs expressed in the instruction?
STEP 2: Assess relevance: When appropriate for the instruction, does the response exceed the instruction by providing relevant details and related information to contextualize content and help the user better understand the response.
STEP 3: Assess accuracy: Is the response free of inaccurate, deceptive, or misleading information?
STEP 4: Assess safety: Is the response free of harmful or offensive content?
  
Give step by step explanations for your scoring, and only choose scores from 5, 4, 3, 2, 1.
  
  
# User Inputs and AI-generated Response
## User Inputs
### Prompt
{prompt}
  
## AI-generated Response
{response}
"""
  
# Define the custom metric
helpfulness = PointwiseMetric(
    metric="helpfulness",
    metric_prompt_template=helpfulness_prompt_template,
)
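
Both the predefined template and the custom template contain the placeholders {prompt} and {response}. Conceptually, each row of the evaluation dataset supplies the values for them. The sketch below only illustrates that idea with Python's `str.format`; the shortened template and the row values are hypothetical, and how the service performs the substitution internally is not covered in this article.

```python
# A shortened, hypothetical template that uses the same two placeholders.
template = """## User Inputs
### Prompt
{prompt}

## AI-generated Response
{response}
"""

# One hypothetical dataset row, keyed by the placeholder names.
row = {
    "prompt": "Answer the question: Where is Tokyo Tower? Context: It is in Minato Ward.",
    "response": "Minato Ward",
}

# Fill each placeholder with the value of the column of the same name.
filled = template.format(**row)
print(filled)
```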

6. Define the sample data

The sample data used here is as follows.

questions: the user's questions
retrieved_contexts: search results used as context
generated_answers: the LLM's output
golden_answers: the ground truth

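questions = [
    "富士山はどこの県に位置していますか？",  # In which prefecture is Mt. Fuji located?
    "東京タワーはどの区に位置していますか？",  # In which ward is Tokyo Tower located?
    "沖縄の主要な伝統料理は何ですか？",  # What are the main traditional dishes of Okinawa?
    "沖縄の伝統舞踊で有名なものは何ですか？"  # What is a famous traditional dance of Okinawa?
]

retrieved_contexts = [
    # Mt. Fuji straddles Shizuoka and Yamanashi prefectures (3,776 m, Japan's highest peak).
    "富士山は、静岡県と山梨県にまたがって位置しています。標高3,776メートルのこの山は、日本の最高峰であり、象徴的な自然のシンボルとして親しまれています。",
    # Tokyo Tower is in Minato Ward, Tokyo (a 333 m broadcasting tower built in 1958).
    "東京タワーは、東京都港区に位置しており、1958年に建てられた高さ333メートルの電波塔です。東京のシンボルとして、観光名所となっています。",
    # Japan's main traditional dishes include sushi and tempura; tempura is also a soul food in Okinawa.
    "日本の主要な伝統料理には、寿司や天ぷらがあります。また、天ぷらは沖縄でもソウルフードとなっています",
    # Okinawa's traditional dances include Eisa, danced to drum rhythms at festivals.
    "沖縄の伝統舞踊には、エイサーがあり、太鼓のリズムに合わせて踊られるもので、祭りなどで広く披露されています。"
]

generated_answers = [
    "静岡県",  # Shizuoka Prefecture
    "港区",  # Minato Ward
    "天ぷら",  # Tempura
    "エイサー"  # Eisa
]

golden_answers = [
    "静岡県と山梨県",  # Shizuoka and Yamanashi prefectures
    "港区",  # Minato Ward
    "ゴーヤーチャンプルー、ソーキそば",  # Goya champuru, soki soba
    "エイサー",  # Eisa
]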

7. Create the evaluation dataset

The column names that each metric requires as input are as follows.

Metric	Evaluation type	Required columns
question_answering_quality	Model-based (Pointwise)	・prompt ・response
helpfulness	Model-based (Pointwise)	・prompt ・response
exact_match	Computation-based	・response ・reference

Create the evaluation dataset according to the required column names.

# Create the evaluation dataset
eval_dataset = pd.DataFrame(
    {
        "prompt": [
            "Answer the question: " + question + " Context: " + item
            for question, item in zip(questions, retrieved_contexts)
        ],
        "response": generated_answers,  # Output of the candidate model
        "reference": golden_answers,    # Ground truth
    }
)
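
Before sending an evaluation request, it can help to confirm that the dataset has every column the chosen metrics need. The following is a minimal sketch of such a check in plain Python. `REQUIRED_COLUMNS` encodes the required columns listed for the three metrics in this step, and `missing_columns` is a hypothetical helper, not part of the SDK.

```python
# Required input columns per metric, as listed for the metrics in this step.
REQUIRED_COLUMNS = {
    "question_answering_quality": {"prompt", "response"},
    "helpfulness": {"prompt", "response"},
    "exact_match": {"response", "reference"},
}


def missing_columns(available_columns, metric_names):
    """Map each metric to the required columns that are not available."""
    available = set(available_columns)
    missing = {}
    for name in metric_names:
        gap = REQUIRED_COLUMNS[name] - available
        if gap:
            missing[name] = sorted(gap)
    return missing


# With a DataFrame, pass eval_dataset.columns as the first argument.
print(missing_columns(["prompt", "response", "reference"], REQUIRED_COLUMNS))
print(missing_columns(["prompt", "response"], ["helpfulness", "exact_match"]))
```

An empty dictionary means every listed metric has the columns it needs.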

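Execution and results

8. Run the evaluation

# Define the evaluation task
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "question_answering_quality",  # Model-based (predefined metric)
        helpfulness,                   # Model-based (user-defined metric)
        "exact_match",                 # Computation-based
    ],
    experiment=EXPERIMENT,
)

# Run the evaluation request
result = eval_task.evaluate()

The evaluate method returns an EvalResult object, which has the following attributes.

Attribute	Description
summary_metrics	Summary information such as the mean and standard deviation of each metric
metrics_table	Detailed evaluation results
metadata	Metadata such as the experiment name used for the evaluation

9. Output a summary of the evaluation results

result.summary_metrics

The output is as follows.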

{'row_count': 4,
 'question_answering_quality/mean': 4.0,
 'question_answering_quality/std': 1.1547005383792515,
 'helpfulness/mean': 3.0,
 'helpfulness/std': 0.816496580927726,
 'exact_match/mean': 0.5,
 'exact_match/std': 0.5773502691896257}
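
The exact_match summary values can be reproduced by hand from the sample data, since only the second and fourth responses are identical to their references. The sketch below does this with the standard library. The reported exact_match/std of about 0.577 is consistent with a sample standard deviation (n - 1 denominator), though this is an observation from the numbers rather than something the article's sources state.

```python
from statistics import mean, stdev

# Per-row exact_match scores implied by the sample data:
# only the 2nd and 4th responses equal their references.
exact_match_scores = [0, 1, 0, 1]

print(mean(exact_match_scores))
# stdev() is the sample standard deviation (n - 1 in the denominator).
print(stdev(exact_match_scores))
```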

10. Output the details of the evaluation results

result.metrics_table

The full output table is hard to read, so only the first record for the question_answering_quality metric is reproduced below.

[ prompt ]
Answer the question: 富士山はどこの県に位置していますか？
Context: 富士山は、静岡県と山梨県にまたがって位置しています。標高3,776メートルのこの山は、日本の最高峰であり、象徴的な自然のシンボルとして親しまれています。

[ response ]
静岡県

[ reference ]
静岡県と山梨県

[ question_answering_quality/explanation ]
The response is incomplete. Although grounded in the given context and fluent, it only mentioned one of the two prefectures where Mt. Fuji is located. The prompt asked "In which prefecture is Mt. Fuji located?" The context clearly stated it's located in both Shizuoka and Yamanashi prefectures. Therefore, the instruction following is weak.

~ Japanese translation ~
回答は不完全です。与えられた文脈に基づいており流暢ではありますが、富士山がある 2 つの県のうちの 1 つしか言及されていません。プロンプトは「富士山はどの県にありますか?」と尋ねており、文脈から静岡県と山梨県の両方にあることは明らかです。したがって、指示に従うことが不十分です。

[ question_answering_quality/score ]
3.0

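This confirms that quantitative scores can be obtained. The output of model-based metrics also includes an explanation column, which records the reasoning the judge model followed to arrive at the score. In other words, the explanation column is the rationale for the score that the judge model output.

Reference: View and interpret evaluation results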

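Other notes

Quota limits

Model-based evaluation uses the Vertex AI Gemini API as the judge model, so you need to watch your quota. In particular, if a single evaluation includes a large dataset, or if many evaluation requests run concurrently, consider requesting a quota increase for gemini-1.5-pro.

For how to request an increase, see the official documentation.

Reference: Run model-based evaluation with increased rate limits and quota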
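
Separately from requesting a quota increase, one general way to avoid sending a very large dataset at once is to split it into smaller batches and evaluate them one after another. The sketch below shows only the splitting step in plain Python. `batched` is a hypothetical helper, the batch size of 4 is arbitrary, and the article itself does not prescribe this approach.

```python
def batched(rows, batch_size):
    """Yield consecutive slices of `rows` with at most `batch_size` items."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


rows = list(range(10))  # stand-in for evaluation rows
batches = list(batched(rows, 4))
print([len(batch) for batch in batches])
```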

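Evaluation dataset size

To obtain high-quality evaluation results, an evaluation dataset of 100 to 400 examples is recommended. Within this range, the impact of outliers is kept small while performance can still be assessed across a variety of scenarios. Beyond 400 examples, this improvement tends to diminish, so 400 is recommended as a general upper bound.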
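
If you have more candidate examples than the suggested upper bound, one simple option is to draw a random subset. The following is a minimal sketch under that assumption. `cap_examples`, the fixed seed, and the 1,000-item pool are hypothetical, and uniform random sampling is just one possible selection strategy.

```python
import random

MAX_EXAMPLES = 400  # upper bound suggested for the evaluation dataset size


def cap_examples(examples, max_examples=MAX_EXAMPLES, seed=0):
    """Return all examples, or a reproducible random subset if there are too many."""
    if len(examples) <= max_examples:
        return list(examples)
    # A dedicated Random instance keeps the sampling reproducible.
    return random.Random(seed).sample(examples, max_examples)


pool = [f"example-{i}" for i in range(1000)]
subset = cap_examples(pool)
print(len(subset))
```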