NuExtract: A Foundation Model for Structured Extraction
We introduce NuExtract, a lightweight text-to-JSON LLM. NuExtract extracts arbitrarily complex information from text and turns it into structured data.
from 0.5B to 7B parameters
With exact-match F1, the 7B model comes close to GPT-4o
Summarizing in my own words:
A small, domain-agnostic model for structured extraction
From C4, they used Llama 3 to build a 50k dataset of text / schema / JSON examples, then continued training Phi-3 on it
Structured Extraction
Its goal is to extract all kinds of information from a document - entities, quantities, dates, and so on - and to identify their (potentially hierarchical) relationships.
extraction tree
Often represented as JSON
Two applications:
parsing technical documents such as medical reports, legal documents, or financial reports
Knowledge bases; can also be used for RAG
(IMO: the structuring part)
chatbot conversations
this is the holy grail of information extraction.
Using GPT-4
code:prompt
Given the following JSON template and text, return a version of the JSON template filled in with the relevant data. Don't return anything besides the filled in JSON content.
{
}
Input: *<input>*
Output:
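To make the task concrete, here is a toy instance (my own illustration, not from the post): a template, an input text, and the filled-in JSON we would expect back.
code:toy_task.py
# Toy instance of the text-to-JSON task (illustrative, not from the post).
template = {"company": "", "founded": ""}   # "empty" JSON to be filled in
text = "Acme Corp was founded in 1947 in Springfield."
# Expected model output: the template with values copied from the text.
expected = {"company": "Acme Corp", "founded": "1947"}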
To improve performance, we could add examples to the prompt. However, such “in-context learning” quickly saturates, as we found out for entity recognition in our recent paper introducing NuNER:
Task-Specific Foundation Models
a model specialized for a generic task - such as sentiment analysis or entity recognition - but agnostic in terms of data domain and specific problem to solve.
They have the advantages of being small, usable in a private setting, and often being better at the task than much larger generic foundation models.
https://cdn.prod.website-files.com/638364a5e52e44ec3ca952a3/6718f125e8d1adc804243a82_66798ca6c6ca598b309ae9ff_taskspecfoundmod.png
Corpus: C4
Annotation with Llama 3 (zero-shot)
Not perfect, but they accept it as good enough
Phi-3 is trained on these annotations
The resulting model can then be used in a zero-shot setting or fine-tuned to solve a specific problem, which it will solve better than a large generic model would.
Template/Schema Representation
We choose to represent the schema by a sort of empty JSON
code:schema example
{
}
Each array is filled with an element template, and empty strings indicate the extracted fields.
Note that we only output strings and ignore other JSON types as we don’t see much interest in supporting them (you can always return a number as a string).
Numbers, too, are returned as strings
Note that this template format does not allow for the inclusion of field descriptions.
The assumption is that you provide examples instead
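Putting the format rules together, a hypothetical schema (my own example, not from the post) could look like this: arrays contain a single element template, and every leaf is an empty string.
code:schema_format.py
import json

# Hypothetical schema in the template format described above (illustrative).
# Empty strings mark fields to extract; each array holds one element template;
# every leaf is a string, so numbers come back as strings too.
schema = {
    "paper": {
        "title": "",
        "publication year": "",   # a number, still returned as a string
    },
    "authors": [                  # array filled with an element template
        {"name": "", "affiliation": ""},
    ],
}
print(json.dumps(schema, indent=2))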
Dataset Creation
The target applications are in medical, legal, and financial domains, but they want the model itself to be domain-agnostic
we use 300k English pieces of text from the C4 dataset
a large and diverse general-domain dataset
To annotate this text, we first prompt an LLM to generate a template from each piece of text.
code:prompt to first obtain a schema
!!!START Context!!!
*<text-to-annotate>*
!!!END Context!!!
Goal: Generate an information extraction dataset.
Input: Text document + instructions for annotation.
Output: 1 JSON object (schema).
Schema:
Describes the information to be extracted.
Each field should:
Be a clear and concise name representing the extracted data.
ONLY STRING TYPE ARE ALLOWED AS VALUES (it can be an array of strings, or an object with string values, or an array of objects with string values...).
NO BOOLEAN, INT, ENUM, ETC.
The schema can focus only on part of the context document, or on the whole document.
Constraints:
Extracted information should be thematically coherent and form a well-structured JSON schema with a clear relationship between fields.
*<few-shot examples>*
(Here they restrict values to strings only. I wonder if restricting to strings is unavoidable when using Llama 3 and the like.)
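As an illustration of this first step (my own toy example, not from the post), a short snippet and a schema the annotator might generate for it:
code:schema_generation_toy.py
# Toy illustration of the schema-generation step (text and output are mine).
text = "OpenCo released Model X on March 3rd, claiming a 20% speedup."
generated_schema = {
    "company": "",
    "releases": [
        {"product": "", "date": "", "claimed improvement": ""},
    ],
}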
Once we have the templates, we can use the LLM to extract information according to each template
code:prompt to extract information according to the schema
!!!START Context!!!
*<text-to-annotate>*
!!!END Context!!!
Goal: Extract strings from the text corresponding to the given schema.
Input: Text document + schema.
Output: 1 JSON object
Schema:
The schema describes the information to be extracted.
ONLY STRING TYPE ARE ALLOWED AS VALUES (it can be an array of strings, or an object with string values, or an array of objects with string values...).
NO BOOLEAN, INT, ENUM, ETC.
The schema can focus only on part of the context document, or on the whole document.
Output:
THE OUTPUT SHOULD FOLLOW EXACTLY THE SCHEMA.
It should respect the schema and contain the extracted information from the context document.
THE STRING SHOULD BE PRESENT EXACTLY AS IT IS IN THE CONTEXT DOCUMENT. NO PARAPHRASING ALLOWED.
If the information is NOT PRESENT in the context, return "" for empty string and [] for empty array. If the list of object is empty, return [].
Return only the information extracted as JSON. Do not output anything else or says anything else.
Information to extract:
*<schema>*
We use this prompt with Llama 3 70B to annotate the 300k pieces of text, and then filter out examples for which the template is not followed, as well as examples for which extracted values are not found in the text. This results in 50k annotated examples.
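The post gives no code for this filtering, but a sketch of what it could look like (my reconstruction, with the assumptions noted in comments): keep an example only if the output has the same tree structure as the template and every non-empty leaf appears verbatim in the text.
code:filtering_sketch.py
# Sketch of the dataset filtering step (my reconstruction, not from the post).

def follows_template(template, output):
    """True if `output` has the same tree structure as `template`."""
    if isinstance(template, dict):
        return (isinstance(output, dict)
                and set(output) == set(template)
                and all(follows_template(template[k], output[k]) for k in template))
    if isinstance(template, list):
        # Assumes each array in the template holds exactly one element template.
        return (isinstance(output, list)
                and all(follows_template(template[0], el) for el in output))
    return isinstance(output, str)  # leaves are always strings

def leaves(node):
    """Yield every string leaf of an extraction tree."""
    if isinstance(node, dict):
        for v in node.values():
            yield from leaves(v)
    elif isinstance(node, list):
        for el in node:
            yield from leaves(el)
    else:
        yield node

def keep_example(text, template, output):
    return (follows_template(template, output)
            and all(leaf in text for leaf in leaves(output) if leaf))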
(An analysis of the LLM-annotated dataset follows)
most pieces of text are below 200 words, but there is a tail going up to 1200 words, which typically corresponds to a 2-3 page document.
most extraction trees have a depth of 3, 4, or 5, but some even reach a depth of 9!
Not only zero-shot (few-shot as well)
One idea: include field definitions and example values (an even better idea follows below)
give full input → output examples of the task in the prompt. The issue is that input texts can be long, which takes up valuable context space.
We found that, surprisingly, only providing the outputs works great.
(Meaning they line up only the outputs as few-shot examples?)
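My reading of the trick, sketched below: the few-shot block lists only example outputs (JSON following the schema), without the input texts they were extracted from, which keeps the prompt short.
code:output_only_fewshot.py
import json

# Sketch of an "output-only" few-shot prompt (my interpretation; the exact
# prompt layout NuExtract uses is not shown in the post).
schema = {"company": "", "founded": ""}
output_examples = [                      # example outputs only, no input texts
    {"company": "Acme Corp", "founded": "1947"},
    {"company": "Globex", "founded": "1989"},
]
text = "Initech, started in 1997, makes reporting software."

prompt = "Template:\n" + json.dumps(schema, indent=2) + "\n"
for ex in output_examples:
    prompt += "Example output:\n" + json.dumps(ex, indent=2) + "\n"
prompt += "Input: " + text + "\nOutput:\n"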
Base Models
in the case of structured extraction, the output space is large and complex, so we need to generate the output like we would to generate text.
They considered two architectures: encoder-decoder and pure decoder
As a result, we opt to use pure decoder LLMs. We use Phi-3-mini (3.8B parameters) for NuExtract, Phi-3-small (7B parameters) for NuExtract-large, and Qwen1.5-0.5B (0.5B parameters) for NuExtract-tiny.
Fine-tuned on the dataset above
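A minimal inference sketch with Hugging Face transformers (mine, not from the post; the checkpoint id and the prompt layout are assumptions, so check the model card for the real format):
code:inference_sketch.py
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal inference sketch (checkpoint id and prompt layout are assumptions).
model_name = "numind/NuExtract"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

template = {"company": "", "founded": ""}
text = "Acme Corp was founded in 1947 in Springfield."
prompt = f"Template:\n{json.dumps(template, indent=2)}\nText:\n{text}\nOutput:\n"

inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))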
Evaluation
the JSON is valid, the schema is respected, and all extracted values are correct except for one!
We find that it always produces valid JSON expressions and has no difficulty following the template.
They built a benchmark (to be released once finalized)
For each problem, we create a template, find a set of raw text, and manually extract information from these pieces of text.
Evaluation metric
Classic JSON distances such as Tree Edit Distance are not well adapted to our case because we want to heavily penalize the model when a schema is not respected, and we do not want to penalize the model if array elements are permuted.
We ended up creating a simple tree matching method that aligns extracted values (the leaves of the tree) through a recursive process, computes similarity between corresponding values through exact matching, and averages these leaf similarities to obtain a measure between 0 (trees are completely different) and 1 (trees are a perfect match).
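A sketch of such a metric (my reconstruction from this description; the greedy array alignment and the handling of extra predicted elements are my assumptions):
code:tree_match_sketch.py
# Reconstruction of the tree-matching similarity (illustrative, not official).
# Leaves score 1 on exact match and 0 otherwise; a violated schema scores 0
# for every reference leaf below it; arrays are aligned greedily so permuted
# elements are not penalized. Extra predicted elements are ignored here,
# which is a simplification.

def count_leaves(node):
    if isinstance(node, dict):
        return sum(count_leaves(v) for v in node.values())
    if isinstance(node, list):
        return sum(count_leaves(el) for el in node)
    return 1

def leaf_scores(ref, pred):
    if isinstance(ref, dict):
        if not isinstance(pred, dict):
            return [0.0] * count_leaves(ref)
        out = []
        for k, v in ref.items():
            out += leaf_scores(v, pred.get(k))
        return out
    if isinstance(ref, list):
        if not isinstance(pred, list):
            return [0.0] * count_leaves(ref)
        remaining = list(pred)
        out = []
        for el in ref:  # align each reference element with its best match
            if not remaining:
                out += [0.0] * count_leaves(el)
                continue
            best = max(remaining, key=lambda p: sum(leaf_scores(el, p)))
            remaining.remove(best)
            out += leaf_scores(el, best)
        return out
    return [1.0 if ref == pred else 0.0]  # exact match on string leaves

def tree_similarity(ref, pred):
    scores = leaf_scores(ref, pred)
    return sum(scores) / len(scores) if scores else 1.0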
We see that NuExtract-tiny is better than GPT-3.5 while being at least 100 times smaller, that NuExtract outperforms Llama3-70B while being 35 times smaller, and that NuExtract-large is reaching GPT-4o levels while being at least 100 times smaller.
zero-shot
Few-shot examples (40-shot) follow as well