NuExtract: A Foundation Model for Structured Extraction
We introduce NuExtract, a lightweight text-to-JSON LLM. NuExtract extracts arbitrarily complex information from text and turns it into structured data.
from 0.5B to 7B parameters
With exact-match F1, the 7B model comes close to GPT-4o
Summarizing in my own words:
A small, domain-agnostic model for structured extraction
From C4, they used Llama 3 to build a 50k dataset of text / schema / JSON examples, then continued training Phi-3 on it
Structured Extraction
Its goal is to extract all kinds of information from a document - entities, quantities, dates, and so on - and to identify their (potentially hierarchical) relationships.
extraction tree
Often represented as JSON
Two applications:
parsing technical documents such as medical reports, legal documents, or financial reports
Knowledge bases; can also be used for RAG
(IMO: the structuring part)
chatbot conversations
this is the holy grail of information extraction.
Using GPT-4
code:prompt
Given the following JSON template and text, return a version of the JSON template filled in with the relevant data. Don't return anything besides the filled in JSON content.
{
}
Input: *<input>*
Output:
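To make the task concrete, here is a toy instance (my own illustration, not from the post): a template, an input text, and the filled-in JSON we would expect back.
code:toy_task.py
# Toy instance of the text-to-JSON task (illustrative, not from the post).
template = {"company": "", "founded": ""}   # "empty" JSON to be filled in
text = "Acme Corp was founded in 1947 in Springfield."
# Expected model output: the template with values copied from the text.
expected = {"company": "Acme Corp", "founded": "1947"}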
To improve performance, we could add examples to the prompt. However, such “in-context learning” quickly saturates, as we found out for entity recognition in our recent paper introducing NuNER:
Task-Specific Foundation Models
a model specialized for a generic task - such as sentiment analysis or entity recognition - but agnostic in terms of data domain and specific problem to solve.
They have the advantages of being small, usable in a private setting, and often being better at the task than much larger generic foundation models.
https://cdn.prod.website-files.com/638364a5e52e44ec3ca952a3/6718f125e8d1adc804243a82_66798ca6c6ca598b309ae9ff_taskspecfoundmod.png
Corpus: C4
Annotation with Llama 3 (zero-shot)
Not perfect, but they accept it as good enough
Phi-3 is trained on these annotations
The resulting model can then be used in a zero-shot setting or fine-tuned to solve a specific problem, which it will solve better than a large generic model would.
Template/Schema Representation
We choose to represent the schema by a sort of empty JSON
code:schema example
{
}
Each array is filled with an element template, and empty strings indicate the extracted fields.
Note that we only output strings and ignore other JSON types as we don’t see much interest in supporting them (you can always return a number as a string).
Numbers, too, are returned as strings
Note that this template format does not allow for the inclusion of field descriptions.
The assumption is that you provide examples instead
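Putting the format rules together, a hypothetical schema (my own example, not from the post) could look like this: arrays contain a single element template, and every leaf is an empty string.
code:schema_format.py
import json

# Hypothetical schema in the template format described above (illustrative).
# Empty strings mark fields to extract; each array holds one element template;
# every leaf is a string, so numbers come back as strings too.
schema = {
    "paper": {
        "title": "",
        "publication year": "",   # a number, still returned as a string
    },
    "authors": [                  # array filled with an element template
        {"name": "", "affiliation": ""},
    ],
}
print(json.dumps(schema, indent=2))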
Dataset Creation
The target applications are in medical, legal, and financial domains, but they want the model itself to be domain-agnostic
we use 300k English pieces of text from the C4 dataset
a large and diverse general-domain dataset
To annotate this text, we first prompt an LLM to generate a template from each piece of text.
code:prompt to first obtain a schema
!!!START Context!!!
*<text-to-annotate>*
!!!END Context!!!
Goal: Generate an information extraction dataset.
Input: Text document + instructions for annotation.
Output: 1 JSON object (schema).
Schema:
Describes the information to be extracted.
Each field should:
Be a clear and concise name representing the extracted data.
ONLY STRING TYPE ARE ALLOWED AS VALUES (it can be an array of strings, or an object with string values, or an array of objects with string values...).
NO BOOLEAN, INT, ENUM, ETC.
The schema can focus only on part of the context document, or on the whole document.
Constraints:
Extracted information should be thematically coherent and form a well-structured JSON schema with a clear relationship between fields.
*<few-shot examples>*
(Here they restrict values to strings only. I wonder if restricting to strings is unavoidable when using Llama 3 and the like.)
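As an illustration of this first step (my own toy example, not from the post), a short snippet and a schema the annotator might generate for it:
code:schema_generation_toy.py
# Toy illustration of the schema-generation step (text and output are mine).
text = "OpenCo released Model X on March 3rd, claiming a 20% speedup."
generated_schema = {
    "company": "",
    "releases": [
        {"product": "", "date": "", "claimed improvement": ""},
    ],
}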
Once we have the templates, we can use the LLM to extract information according to each template
code:prompt to extract information according to the schema
!!!START Context!!!
*<text-to-annotate>*
!!!END Context!!!
Goal: Extract strings from the text corresponding to the given schema.
Input: Text document + schema.
Output: 1 JSON object
Schema:
The schema describes the information to be extracted.
ONLY STRING TYPE ARE ALLOWED AS VALUES (it can be an array of strings, or an object with string values, or an array of objects with string values...).
NO BOOLEAN, INT, ENUM, ETC.
The schema can focus only on part of the context document, or on the whole document.
Output:
THE OUTPUT SHOULD FOLLOW EXACTLY THE SCHEMA.
It should respect the schema and contain the extracted information from the context document.
THE STRING SHOULD BE PRESENT EXACTLY AS IT IS IN THE CONTEXT DOCUMENT. NO PARAPHRASING ALLOWED.
If the information is NOT PRESENT in the context, return "" for empty string and [] for empty array. If the list of object is empty, return [].
Return only the information extracted as JSON. Do not output anything else or says anything else.
Information to extract:
*<schema>*
We use this prompt with Llama 3 70B to annotate the 300k pieces of text, and then filter out examples for which the template is not followed, as well as examples for which extracted values are not found in the text. This results in 50k annotated examples.
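The post gives no code for this filtering, but a sketch of what it could look like (my reconstruction, with the assumptions noted in comments): keep an example only if the output has the same tree structure as the template and every non-empty leaf appears verbatim in the text.
code:filtering_sketch.py
# Sketch of the dataset filtering step (my reconstruction, not from the post).

def follows_template(template, output):
    """True if `output` has the same tree structure as `template`."""
    if isinstance(template, dict):
        return (isinstance(output, dict)
                and set(output) == set(template)
                and all(follows_template(template[k], output[k]) for k in template))
    if isinstance(template, list):
        # Assumes each array in the template holds exactly one element template.
        return (isinstance(output, list)
                and all(follows_template(template[0], el) for el in output))
    return isinstance(output, str)  # leaves are always strings

def leaves(node):
    """Yield every string leaf of an extraction tree."""
    if isinstance(node, dict):
        for v in node.values():
            yield from leaves(v)
    elif isinstance(node, list):
        for el in node:
            yield from leaves(el)
    else:
        yield node

def keep_example(text, template, output):
    return (follows_template(template, output)
            and all(leaf in text for leaf in leaves(output) if leaf))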
(An analysis of the LLM-annotated dataset follows)
most pieces of text are below 200 words, but there is a tail going up to 1200 words, which typically corresponds to a 2-3 page document.
most extraction trees have a depth of 3, 4, or 5, but some even reach a depth of 9!
Not only zero-shot (few-shot as well)
One idea: include field definitions and example values (an even better idea follows below)
give full input → output examples of the task in the prompt. The issue is that input texts can be long, which takes up valuable context space.
We found that, surprisingly, only providing the outputs works great.
(Meaning they line up only the outputs as few-shot examples?)
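My reading of the trick, sketched below: the few-shot block lists only example outputs (JSON following the schema), without the input texts they were extracted from, which keeps the prompt short.
code:output_only_fewshot.py
import json

# Sketch of an "output-only" few-shot prompt (my interpretation; the exact
# prompt layout NuExtract uses is not shown in the post).
schema = {"company": "", "founded": ""}
output_examples = [                      # example outputs only, no input texts
    {"company": "Acme Corp", "founded": "1947"},
    {"company": "Globex", "founded": "1989"},
]
text = "Initech, started in 1997, makes reporting software."

prompt = "Template:\n" + json.dumps(schema, indent=2) + "\n"
for ex in output_examples:
    prompt += "Example output:\n" + json.dumps(ex, indent=2) + "\n"
prompt += "Input: " + text + "\nOutput:\n"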
Base Models
in the case of structured extraction, the output space is large and complex, so we need to generate the output like we would to generate text.
They considered two architectures: encoder-decoder and pure decoder
As a result, we opt to use pure decoder LLMs. We use Phi-3-mini (3.8B parameters) for NuExtract, Phi-3-small (7B parameters) for NuExtract-large, and Qwen1.5-0.5B (0.5B parameters) for NuExtract-tiny.
Fine-tuned on the dataset above
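A minimal inference sketch with Hugging Face transformers (mine, not from the post; the checkpoint id and the prompt layout are assumptions, so check the model card for the real format):
code:inference_sketch.py
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal inference sketch (checkpoint id and prompt layout are assumptions).
model_name = "numind/NuExtract"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

template = {"company": "", "founded": ""}
text = "Acme Corp was founded in 1947 in Springfield."
prompt = f"Template:\n{json.dumps(template, indent=2)}\nText:\n{text}\nOutput:\n"

inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))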
Evaluation
the JSON is valid, the schema is respected, and all extracted values are correct except for one!
We find that it always produces valid JSON expressions and has no difficulty following the template.
They built a benchmark (to be released once finalized)
For each problem, we create a template, find a set of raw text, and manually extract information from these pieces of text.
Evaluation metric
Classic JSON distances such as Tree Edit Distance are not well adapted to our case because we want to heavily penalize the model when a schema is not respected, and we do not want to penalize the model if array elements are permuted.
We ended up creating a simple tree matching method that aligns extracted values (the leaves of the tree) through a recursive process, computes similarity between corresponding values through exact matching, and averages these leaf similarities to obtain a measure between 0 (trees are completely different) and 1 (trees are a perfect match).
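A sketch of such a metric (my reconstruction from this description; the greedy array alignment and the handling of extra predicted elements are my assumptions):
code:tree_match_sketch.py
# Reconstruction of the tree-matching similarity (illustrative, not official).
# Leaves score 1 on exact match and 0 otherwise; a violated schema scores 0
# for every reference leaf below it; arrays are aligned greedily so permuted
# elements are not penalized. Extra predicted elements are ignored here,
# which is a simplification.

def count_leaves(node):
    if isinstance(node, dict):
        return sum(count_leaves(v) for v in node.values())
    if isinstance(node, list):
        return sum(count_leaves(el) for el in node)
    return 1

def leaf_scores(ref, pred):
    if isinstance(ref, dict):
        if not isinstance(pred, dict):
            return [0.0] * count_leaves(ref)
        out = []
        for k, v in ref.items():
            out += leaf_scores(v, pred.get(k))
        return out
    if isinstance(ref, list):
        if not isinstance(pred, list):
            return [0.0] * count_leaves(ref)
        remaining = list(pred)
        out = []
        for el in ref:  # align each reference element with its best match
            if not remaining:
                out += [0.0] * count_leaves(el)
                continue
            best = max(remaining, key=lambda p: sum(leaf_scores(el, p)))
            remaining.remove(best)
            out += leaf_scores(el, best)
        return out
    return [1.0 if ref == pred else 0.0]  # exact match on string leaves

def tree_similarity(ref, pred):
    scores = leaf_scores(ref, pred)
    return sum(scores) / len(scores) if scores else 1.0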
We see that NuExtract-tiny is better than GPT-3.5 while being at least 100 times smaller, that NuExtract outperforms Llama3-70B while being 35 times smaller, and that NuExtract-large is reaching GPT-4o levels while being at least 100 times smaller.
zero-shot
Few-shot examples (40-shot) follow as well