Imagine that you want to find a very specific type of entity in a collection of documents. For example, you might be interested in all the different types of fruit mentioned in a set of cooking recipes or the different kinds of knitting stitches in a set of patterns. However, the further you move away from the standard entity types like "name," "location," and "organization," the less likely it is that you will find a good entity linker for your specific task. It is even less likely that you will find a dataset to train your own entity linker.
So what do you do? One option is to annotate a set of documents yourself to train a new model. However, this is very time-consuming and, to be honest, extremely boring. Besides, we live in the era of automation, don't we? I think I read about a new chatbot that can answer all my questions... Can't we use that instead?
Using ChatGPT can be a great idea for tasks like this (as long as your task is not so specific that ChatGPT doesn’t understand it 😉). In this blog, I will describe a method for building an entity dataset using ChatGPT and additional post-processing. The use case for this experiment involves creating a set of medical entities and their corresponding layman-friendly descriptions. In the post-processing stage, the entities are mapped back onto the documents to create entity annotations that can be used to train your own model.
The three main steps are as follows:
Collect entities (medical terminology) from a set of documents using ChatGPT
Map the entities to the set of documents by aligning substring matches
Optional: generate entity explanations to create a simple “knowledge base”
First, we need to collect a set of examples of the entities that we are interested in. In my case, I am looking for medical terminology: terms or phrases that are not typically found in standard language and are used in medical reports. To do so, I created a prompt asking ChatGPT to provide a list of any such terms or phrases found in a document, separated by commas.
System prompt:
You are a medical assistant with a deep understanding of radiology.
User prompt:
I would like to select all medical terminology from the following radiology report. it specifically concerns all terms that do not belong to the standard Dutch vocabulary. Answer with all the terms you can find, separated only by commas. Write the terms exactly as they appear in the report. Don't add anything else to your answer.
Report: {{ insert report here }}
Using the OpenAI Python library (https://github.com/openai/openai-python), it is very easy to run these prompts for a bunch of documents to quickly collect the data you want.
❗ Of course, make sure to adhere to any rules pertaining to the sharing of potentially personally identifiable information, where applicable.
In code, it looks something like this:
import openai

results = []
for doc in documents:
    user_prompt = make_user_prompt(doc)
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    results.append(completion.choices[0].message["content"])
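The make_user_prompt helper is not shown above; my assumption is that it simply fills the report text into the fixed prompt template from earlier (slightly shortened here). A minimal sketch:

```python
# Hypothetical helper: fills the report into the prompt template.
USER_PROMPT_TEMPLATE = (
    "I would like to select all medical terminology from the following "
    "radiology report. Answer with all the terms you can find, separated "
    "only by commas. Write the terms exactly as they appear in the report. "
    "Don't add anything else to your answer.\n\nReport: {report}"
)

def make_user_prompt(report: str) -> str:
    return USER_PROMPT_TEMPLATE.format(report=report)
```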
That’s pretty much it for this step! You now have the raw model response for each document.
In this step, we will transform the raw model response into a set of individual entities. We will also map those entities back onto the documents by identifying the character offsets of each occurrence. This allows us to generate annotated documents for training our own model.
If you followed the previous step, you may have noticed that although we instructed ChatGPT to separate entities using commas, it sometimes uses newlines instead. Bad robot! If you are trying this with a different prompt, you may encounter even more or different variations. This is something that needs to be addressed. In my case, I attempted to determine the separator that ChatGPT used and split the raw model response accordingly. Perhaps you could also solve this using prompt engineering, but ain’t nobody got time for that.
if len(response.split("\n")) > 3:
    separator = "\n"
else:
    separator = ","
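Putting the separator heuristic together with some basic cleanup, a small parsing helper (my own sketch, not the original code) might look like this:

```python
def parse_entities(response: str) -> list[str]:
    # Heuristic: if the response contains several newlines, assume
    # ChatGPT used newlines as the separator instead of commas.
    separator = "\n" if len(response.split("\n")) > 3 else ","
    entities = [e.strip() for e in response.split(separator)]
    # Drop empty strings and deduplicate while preserving order.
    return list(dict.fromkeys(e for e in entities if e))
```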
After obtaining a clean list of entities for each document, the next step is to find where each entity occurs. Depending on your documents and task, normalizing any whitespace in the documents may be necessary to avoid missing entities that span multiple lines. To locate the occurrences of each entity, we can use re.finditer.
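For the whitespace normalization mentioned above, a single regex substitution usually suffices. Note that this changes character offsets, so apply it before matching, not after:

```python
import re

def normalize_whitespace(text: str) -> str:
    # Collapse any run of whitespace (including newlines) into one space,
    # so entities that span a line break can still be matched.
    return re.sub(r"\s+", " ", text).strip()
```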
import re

spans = []
for ent in entities:
    for m in re.finditer(re.escape(ent), document):
        spans.append((*m.span(), "MENTION"))
Alternatively, you could use an approximate string matching approach if necessary.
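As a sketch of that approximate alternative (my own illustration, not what I used), the standard library's difflib can score candidate windows of the document against an entity and return the best-matching span:

```python
from difflib import SequenceMatcher

def find_fuzzy(entity: str, document: str, min_ratio: float = 0.8):
    # Slide a window the size of the entity over the document and keep
    # the (start, end) span with the highest similarity ratio, or None
    # if no window is similar enough. O(n * m) — fine for short reports.
    best = None
    n = len(entity)
    for start in range(0, max(1, len(document) - n + 1)):
        window = document[start:start + n]
        ratio = SequenceMatcher(None, entity.lower(), window.lower()).ratio()
        if ratio >= min_ratio and (best is None or ratio > best[0]):
            best = (ratio, start, start + n)
    return (best[1], best[2]) if best else None
```

This tolerates small spelling differences (e.g. accented characters) between the model output and the report text.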
This will provide us with a list of triplets (start, end, "MENTION"). "MENTION" is the entity class and will be necessary later when aligning entities to tokenized documents using spaCy (https://github.com/explosion/spacy). If this process results in overlapping spans, they must be filtered out before proceeding to the next step, as spaCy cannot handle this automatically.
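One simple way to resolve such overlaps (my own sketch) is to sort the spans and greedily keep the longest non-overlapping ones:

```python
def filter_overlaps(spans):
    # Sort by start offset, longest span first, then greedily keep
    # each span that does not overlap an already accepted one.
    result = []
    for start, end, label in sorted(spans, key=lambda s: (s[0], -(s[1] - s[0]))):
        if all(end <= s or start >= e for s, e, _ in result):
            result.append((start, end, label))
    return result
```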
Since most machine learning methods for entity detection are trained as token classifiers using BIO (beginning, inside, outside) or BILUO (beginning, inside, last, unit, outside) tags, we need to ensure that the generated spans correspond to complete tokens. Fortunately, spaCy can help us with this! We just need to tokenize our documents and attempt to generate BILUO tags using the entity spans and the tokenized document.
import spacy
from spacy.training import biluo_tags_to_offsets, offsets_to_biluo_tags

nlp = spacy.load(...)
doc = nlp(document)  # tokenize the document
biluo_tags = offsets_to_biluo_tags(doc, spans)  # map the spans to BILUO tags using the tokenized document
offsets = biluo_tags_to_offsets(doc, biluo_tags)
offsets = [x for x in offsets if x[-1] == "MENTION"]
The last line is necessary to filter out "unknown entities", i.e. offsets that could not be mapped to tokens. These entities will be marked as (start, end, "-") triplets, where the "-" means that the entity was ignored during the conversion. Alternatively, you can store the BILUO tags directly if that better suits your use case. However, I chose to keep the original spans, which is why I convert back to that format.
And that concludes the difficult part! You now have a set of documents, along with entity annotations, that you can use to train a model. In the last section, I will also generate descriptions for each of the entities I found.
This step effectively involves repeating step 1 using a different prompt. To do this, we will use the separated entities from the previous step as input. The prompts used for generating descriptions are listed below. A system prompt was not used for this part.
User prompt:
Give a simple explanation in one sentence for the following term from a radiology report: {{ insert entity here }}
I ran this prompt for each entity:
explanations = {}
for entity in entities:
    user_prompt = make_user_prompt(entity)
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": user_prompt},
        ],
    )
    explanations[entity] = completion.choices[0].message["content"]
And voilà: a knowledge base with entities and user-friendly descriptions. You can use this as a lookup table or to train more advanced models. Whatever suits your needs!
I have added some useful features to each of the steps, such as duplicate removal, automatic retries for API requests, and tracking the cost of all requests. Using the gpt-3.5-turbo model is very inexpensive (usually under $1 for thousands of entities, depending on your documents and task). Upgrading to GPT-4 will cost a few dollars, which is also reasonable. It may help to estimate beforehand how many tokens you will use, so you won't be surprised by a large credit card bill. 🙂
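For that estimate, a common rule of thumb is that one token corresponds to roughly four characters of English text. This is an approximation, not exact tokenization (use a proper tokenizer such as tiktoken for exact counts), and the default price below is the gpt-3.5-turbo rate at the time of writing, so check current pricing:

```python
def estimate_cost(documents, price_per_1k_tokens=0.002):
    # Rough heuristic: ~4 characters per token for English-like text.
    total_tokens = sum(len(doc) // 4 for doc in documents)
    return total_tokens / 1000 * price_per_1k_tokens
```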
In summary, I have found the quality of the results using this approach to be surprisingly good. I have not yet spotted a mistake in the generated explanations, and the entity recognition is quite comprehensive. This method is certainly very useful in situations where an established model or dataset is simply not available.