Better Natural Language Understanding of Tabular Data through Prompt Engineering
We used Langchain’s Pandas Dataframe Agent and Large Language Models (LLMs) to reason about tabular data. Questions that relied on reasoning did not produce correct results until we applied prompt engineering to define the table semantics.
In my last article (Searching your own files in natural language), I used text embeddings, vector databases and semantic search, together with Langchain chains and ChatOpenAI, to store, search and reason about documents of different formats. In the case of tabular files (e.g., .csv) loaded with CSVLoader, the reasoning was limited by the fact that any non-obvious table semantics was not understood by the model. Moreover, rows were converted to text, so any chain of thought that required looking at several column values would have to bring up too much data to be efficient (never mind that the number of tokens would often exceed the maximum allowed by OpenAI).
Enter the concept of Agents. Suppose that instead of relying on the data currently present in an LLM (which could be out-of-date), you could search the internet for a particular topic, and then ask the LLM to summarize the search results, and answer any questions you may have about those results in a natural language form. Agents are a way of integrating LLMs with external tools, such as search, Python interpreters, Pandas functions, math libraries, or your own custom functions. Here is a clear introductory video about agents.
In this article I will explore the use of Langchain’s Pandas Dataframe Agent. This Pandas agent uses the ReAct framework. The idea of the ReAct framework is to use the LLM to elaborate on what is needed to answer a question (the “thought” step), then decide whether it already has the necessary information or whether it needs to call some external function to get it (the “action” step). The LLM also elaborates on how to call this external function, based on the functions “registered” with the agent. In that case, the external function is called and its output is fed back to the LLM as an “observation”, which is then submitted again to the “thought” step, and the loop iterates until the LLM considers that it “knows” the answer. The figure below illustrates this flow.
The right side of this figure shows the (verbose) output of the pandas_dataframe_agent call to answer a question about attendance at FIFA Men’s Football World Cup games. As we will see later, I am using a table of all men’s football World Cup games from 1930 to 2022, with scores, dates, locations and attendance, which is read into a Pandas dataframe and used to initialize the agent. I will give you the code for it a little later; for now, let’s try to understand what the agent is doing.
Given a question shown in the figure, the LLM engages this thought process:
Thought: I need to find the maximum attendance value in the dataframe
The LLM+Agent knows it has access to data contained in a Pandas dataframe, and assumes that attendance is a column. Then, it kicks off an action, which invokes the Python shell and runs a Pandas command on the dataframe:
Action: python_repl_ast
Action Input: df['attendance'].max()
This action uses Pandas to find the maximum value in column ‘attendance’ in the input dataframe. The output of this command becomes an observation which is sent back to the LLM:
Observation: 173850
At this point the LLM has partial information, but it still needs to find out the date and the location, so it iterates, with new thought, action and observation steps, as shown below:
Thought: I need to find the row with this maximum attendance
Action: python_repl_ast
Action Input: df[df['attendance'] == df['attendance'].max()]
Observation:
              world_cup       date   team 1  team 2  score team 1  \
74  1950 FIFA World Cup  7/16/1950  Uruguay  Brazil             2
   score team 2 penalties team 1 penalties team 2        Location  phase  \
74            1             None             None  Rio De Janeiro  Final
    attendance  year    host
74      173850  1950  Brazil
From this last observation, the LLM concludes that it knows the answer and returns to the user:
Thought: I now know the final answer
Final Answer: The largest attendance was 173850 at the 1950 FIFA World Cup Final between Uruguay and Brazil in Rio De Janeiro on 7/16/1950.
> Finished chain.
‘The largest attendance was 173850 at the 1950 FIFA World Cup Final between Uruguay and Brazil in Rio De Janeiro on 7/16/1950.’
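To make the loop above more concrete, here is a highly simplified, hypothetical sketch of what a ReAct-style agent loop does under the hood. This is not Langchain’s actual implementation (which handles prompting, parsing and error cases much more robustly); it only illustrates the thought/action/observation cycle:

import re

def react_loop(llm, tools, question, max_steps=5):
    # llm: a callable that takes a prompt string and returns the model's text
    # tools: a dict mapping a tool name (e.g. 'python_repl_ast') to a callable
    scratchpad = ""
    for _ in range(max_steps):
        output = llm(f"Question: {question}\n{scratchpad}")
        # Stop when the model declares it knows the answer
        if "Final Answer:" in output:
            return output.split("Final Answer:")[1].strip()
        # Parse the "Action:" and "Action Input:" lines (assumes the model follows the format)
        name = re.search(r"Action: (.*)", output).group(1).strip()
        tool_input = re.search(r"Action Input: (.*)", output).group(1).strip()
        # Run the requested tool and feed its result back as an observation
        observation = tools[name](tool_input)
        scratchpad += f"{output}\nObservation: {observation}\n"
    return "No final answer reached"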
It’s (almost) all in the prompt
The Python code that invokes the Pandas agent is pretty straightforward. Here we are going to use OpenAI (i.e., you need a key) and a csv file containing the data for all FIFA Men’s World Cup games, which I collected from several places on the web and put together as one table. Data and code may be downloaded from my Github. You may have to install some Python libraries — instructions are readily available on the web.
Let’s first load and look at the data.
import pandas as pd
import numpy as np

matches_df = pd.read_csv("./datasets/world_cup_matches_v1.csv")
matches_df = matches_df.replace(np.nan, None)  # represent missing values as None instead of NaN
# Import necessary packages
from langchain.llms import OpenAI
from langchain.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType
# Create the Pandas Dataframe Agent
agent = create_pandas_dataframe_agent(
    OpenAI(temperature=0, model_name="text-davinci-003", openai_api_key="<your_key>"),
    matches_df,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True)
The function create_pandas_dataframe_agent creates the agent, using OpenAI’s LLM ”text-davinci-003” (the default) with temperature = 0 (to get mostly reproducible results). The agent type is, by default, zero-shot, which means the agent uses a prompt that does not rely on examples being given, and it relies on the ReAct framework to decide which tool to use based solely on each tool’s description. To understand this better, let’s look at the default prompt associated with the pandas_dataframe_agent:
agent.agent.llm_chain.prompt.template
---- Default Prompt ----
"""
You are working with a pandas dataframe in Python. The name of the dataframe is `df`.
You should use the tools below to answer the question posed of you:
python_repl_ast: A Python shell. Use this to execute python commands. Input should be a valid python command. When using this tool, sometimes output is abbreviated - make sure it does not look abbreviated before using it in your answer.
Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [python_repl_ast]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
This is the result of `print(df.head())`:
{df_head}
Begin!
Question: {input}
{agent_scratchpad}\
"""
The question is, given a table with semantically-meaningful column names, can the agent and the model make sense of the data, starting from generically-worded questions? Let’s explore this now.
Examples
Let’s now try some questions with varying degrees of reasoning required. You may want to download the table of World Cup results to better understand the actions suggested by the Pandas agent.
To pass a question to the agent we use the agent.run("...") method.
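For example, reusing the question from Query 1 below:

# Ask a natural-language question and get the agent's final answer back as a string
answer = agent.run("how many games were there in the 1938 world cup?")
print(answer)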
Query 1
Simple selection and count
Here we see the agent getting a sub-dataframe based on the value 1938 in the 'year' column and then counting its rows, which gives the correct answer.
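The Pandas operation the agent came up with is equivalent to something like the following (reconstructed here for illustration; the agent’s exact code may differ slightly):

# Select the rows of the 1938 tournament and count them
games_1938 = matches_df[matches_df['year'] == 1938]
print(len(games_1938))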
Given that we are using OpenAI’s non-free models, it is useful to keep track of the number of tokens in each query, as this determines the final cost. To count the tokens in a query we can use the following function:
from langchain.callbacks import get_openai_callback
def count_tokens(chain, query):
    # Run the query inside a callback context that tracks OpenAI token usage
    with get_openai_callback() as cb:
        result = chain.run(query)
        print(f'Spent a total of {cb.total_tokens} tokens')
    return result
# Counting the tokens for a query
count_tokens(agent, "how many games were there in the 1938 world cup?")
---- Output ----
Spent a total of 1252 tokens
These 1252 tokens were almost all due to the prompt, since there were not many characters exchanged as part of the thought, action and observation processes.
Query 2
Extracting semantics from the data
As humans (with a basic understanding of football), we can easily interpret our table of World Cup games. We know to look at the scores to determine the winner and to look at the ‘phase’ column to determine whether a match is a final, for example. But how much of this table-specific semantics does our Pandas agent + LLM actually understand? All we have to do is ask.
In this case, the model correctly figured out that comparing the values in the ‘score team 1’ and ‘score team 2’ columns was the way to determine whether a team won, lost, or tied. Note that it created a new internal column ‘result’ to store the outcome. The answer is not fully correct in the case of a tie in a knockout match, where it would also need to look at the penalties columns.
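The logic the model followed corresponds roughly to the Pandas sketch below (reconstructed for illustration; the agent generates its own version of this code, and this sketch ignores penalties, just as the agent did):

import numpy as np

# Derive a 'result' column by comparing the two score columns
df = matches_df.copy()
df['result'] = np.where(df['score team 1'] > df['score team 2'], df['team 1'],
               np.where(df['score team 1'] < df['score team 2'], df['team 2'], 'Draw'))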
Let’s see now its reasoning for figuring out the final match.
The model correctly figured out how to interpret the table in order to find the ‘final’ match. Note that I deliberately did not use the word ‘final’ in the question.
In the case of knockout matches, if the result is a draw (possibly after extra time), the winner is decided by a penalty shoot-out, and the table has two columns indicating the penalties scored by each team. Let’s see how the model interprets that.
That is correct!
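For reference, a Pandas sketch that also takes the penalty shoot-out columns into account could look like the following (again, an illustration of the reasoning, not the agent’s actual code):

def match_winner(row):
    # The regular score decides the match when it is not a draw
    if row['score team 1'] > row['score team 2']:
        return row['team 1']
    if row['score team 1'] < row['score team 2']:
        return row['team 2']
    # Draw: in knockout matches, fall back to the penalty shoot-out columns
    if row['penalties team 1'] is not None and row['penalties team 2'] is not None:
        if row['penalties team 1'] > row['penalties team 2']:
            return row['team 1']
        if row['penalties team 1'] < row['penalties team 2']:
            return row['team 2']
    return 'Draw'

matches_df['winner'] = matches_df.apply(match_winner, axis=1)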
Let’s try a few more questions that rely on interpretation of the data.
Query 3
Interpreting the question
After trying out several questions, it was clear that the wording of the question mattered. How you ask has an impact on the actions requested by the agent. As an example, let’s ask about the 1994 World Cup final (hint: Italy and Brazil tied the game, and Brazil won on penalties).
This is not correct. It did find the right answer for the scores, but it failed to consider the penalty scores in the case of a tie, and did not say who the winner was (Brazil won on penalties!), even though in the previous query it did say it should look at the penalties columns in case of a tie.
Moreover, it had to return all games in the 1994 World Cup in order to get the ‘Final’ game. This considerably increased the number of tokens in the query, to 3897. It would have been better if the original action had been the following: df[df['year'] == 1994][df['phase'] == 'Final'], which would have returned a single line as the observation and reduced the number of tokens.
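In fact, the more idiomatic way to express that filter in Pandas is a single combined boolean mask, which also avoids the alignment warning that chained indexing can trigger:

# Select the 1994 final in one step, returning a single row
final_1994 = matches_df[(matches_df['year'] == 1994) & (matches_df['phase'] == 'Final')]
print(final_1994)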
If we ask the same question in a different way, we get the same wrong final answer but using a different thought process:
Here again it found the answer based on scores alone, not penalties, which in this case was wrong. Moreover, it used the last row in the sub-dataframe (df[df['year'] == 1994]) as the final match, which happens to be correct in this table, but there is no guarantee of that in general.
Here is another example of a wrong answer, this time about the 1974 World Cup, using the same question wording as before (the one used for the 1994 World Cup). In this case the action finds the right match, but the model interprets the score wrongly (contradicting its own reasoning, stated in the answer to Query 2).
The observation brought back the correct final match, but the last thought associated the scores with the wrong teams and declared the wrong winner (West Germany, as it was called then, won the match 2-1).
In fact, I repeated the same questions several times, and the answers would occasionally vary from correct to totally wrong, despite the temperature parameter being set to 0.0.
Prompt Engineering
As we saw earlier, the default prompt instructs the model to use the dataframe and call the Python interpreter with Pandas commands if that helps it come up with an answer. But clearly the model does not have a good understanding of the semantics of the data (i.e., what each column means), and it is very erratic depending on how the question is worded. This is not what is usually called a ‘hallucination’, since the model only uses the data provided in the dataframe. Rather, the model is simply unable to infer the semantics of the table from the generic default prompt it was given.
Now I am going to tweak the prompt with specific instructions as to what the columns mean and how they relate to each other. A new prompt can be set on the agent with the command below. Pay close attention to the instructions 1 through 9 passed in explicitly.
agent.agent.llm_chain.prompt.template =\
"""
You are working with a pandas dataframe in Python. The name of the dataframe is `df`.
You should interpret the columns of the dataframe as follows:
1) Only answer questions related to the dataframe. For anything else, say you do not know.
2) Each row of the dataframe corresponds to a match or game.
3) The 'phase' column states the stage of the match within each tournament. For example if asked for the final match in a given World Cup year, you can use the code "df[df['year'] == year][df['phase'] == 'Final']".
4) Knockout matches are those that have any of the following values in the 'phase' column: 'Final','Third-place', 'Semi-Finals', 'Quarter-finals', 'Round of 16'.
5) Teams and scores are associated by the terms 'team 1' or 'team 2'. For example, the column 'team 1' is always associated with the columns 'score team 1' and 'penalties team 1', and similarly for 'team 2'.
6) To determine the winner of a match you must compare the columns 'score team 1' with 'score team 2', and for knockout matches, you may have to compare the columns 'penalties team 1' and 'penalties team 2'. If the value in 'score team 1' is bigger than 'score team 2' then the team in the 'team 1' column wins the match. If the value in 'score team 1' is smaller than 'score team 2' then the team in the 'team 2' column wins the match. If the value in column 'score team 1' is equal to the value in 'score team 2' then the game is a draw, unless it is a knockout match, in which case, the penalties columns need to be checked.
7) In a knockout match, if the columns 'score team 1' with 'score team 2' are equal, the winner is determined by the penalties scores, for which you must look at the 'penalties team 1' and 'penalties team 2' columns. If the value in 'penalties team 1' is bigger than 'penalties team 2' then the team in the 'team 1' column wins the match. If the value in 'penalties team 1' is smaller than 'penalties team 2' then the team in the 'team 2' column wins the match.
8) You only know about matches belonging to FIFA Men's Football World Cups. If asked about matches or games in any other tournament, competition or league, say you do not know.
9) If you do not know the answer, just say 'I don't know'.
You should use the tools below and the instructions above to answer the question posed of you:
python_repl_ast: A Python shell. Use this to execute python commands. Input should be a valid python command. When using this tool, sometimes output is abbreviated - make sure it does not look abbreviated before using it in your answer.
Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [python_repl_ast]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
This is the result of `print(df.head())`:
{df_head}
Begin!
Question: {input}
{agent_scratchpad}\
"""
Note that the new prompt does not include any actual data examples, so this could still be considered ‘zero-shot’ prompt engineering, as opposed to ‘one-shot’ or ‘few-shot’ prompting.
Now, let’s repeat the same questions as before and see if we get better results.
Query 1 (new prompt)
Simple selection and count
Same correct result. The number of tokens in this query was 2369 — considerably higher than the 1252 tokens used by the same query with the original prompt.
Query 2 (new prompt)
Extracting semantics from the data
In this case, the reasoning was derived directly from the re-engineered prompt. It remains to be seen if it will actually and always use this reasoning when answering questions.
Query 3 (new prompt)
Interpreting the question
In this case above, using the original prompt, the model was not able to give a completely correct answer, and it used a large number of tokens (3897). With the re-engineered prompt, the model gives a fully correct answer and uses only 2488 tokens because, although the prompt is bigger, it finds the final match with one Pandas command that returns a single line back to the model as the observation.
The same question, worded differently, also produced fully correct results, as shown below.
The final question about the 1974 World Cup final that had initially produced completely wrong results, now returns the correct result.
Query 4 (new prompt)
Complex queries
These queries require more complex language understanding with more complex Pandas actions.
Italy did score 11 goals in the 1938 World Cup. Note that the thought process understood that it had to sum the goals scored by Italy as ‘team 1’ with the goals scored as ‘team 2’, and it called the correct Pandas functions.
Italy did concede 5 goals in the 1938 World Cup. Here the thought process was to sum the goals scored by ‘team 2’ when Italy was ‘team 1’ with the goals scored by ‘team 1’ when Italy was ‘team 2’.
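Reconstructing that reasoning in Pandas (for illustration only; the agent writes its own version of this code):

year, team = 1938, 'Italy'
cup = matches_df[matches_df['year'] == year]

# Goals scored by Italy: as 'team 1' plus as 'team 2'
goals_scored = (cup.loc[cup['team 1'] == team, 'score team 1'].sum()
                + cup.loc[cup['team 2'] == team, 'score team 2'].sum())

# Goals conceded by Italy: the opponents' scores in the same matches
goals_conceded = (cup.loc[cup['team 1'] == team, 'score team 2'].sum()
                  + cup.loc[cup['team 2'] == team, 'score team 1'].sum())

print(goals_scored, goals_conceded)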
How about something requiring multiple sums, such as the total number of goals scored in a World Cup, as in the question below? The model produced the correct result!
Now a similar query with an additional complication — asking the model to include penalty goals. Again, the correct answer was produced.
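A possible Pandas reconstruction of those two sums (hypothetical, using 1938 as an example year, and assuming the score columns are numeric):

cup = matches_df[matches_df['year'] == 1938]

# Total goals in regular play (both teams, all matches of that tournament)
total_goals = cup['score team 1'].sum() + cup['score team 2'].sum()

# The same total, also including penalty shoot-out goals where present
total_with_penalties = (total_goals
                        + cup['penalties team 1'].fillna(0).sum()
                        + cup['penalties team 2'].fillna(0).sum())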
The final query below requires looking at the whole table:
Here the answer is not correct! The thought process interpreted the question correctly, but the Pandas functions applied in the action step were not. The goals needed to be summed separately for each year (for example, by grouping the rows by the ‘year’ column) and the resulting totals compared. For the record, the World Cup with the most goals scored was the 2022 edition, with 172 goals (not counting penalty shootouts).
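One way to express that computation correctly in Pandas would be to group the rows by year before summing, and only then compare the totals (a sketch, not the agent’s output, and again not counting penalty shootouts):

# Total goals (regular play) per World Cup edition
goals_per_cup = (matches_df.groupby('year')[['score team 1', 'score team 2']]
                 .sum()
                 .sum(axis=1))

# Year with the most goals and the corresponding total
print(goals_per_cup.idxmax(), goals_per_cup.max())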
Summary and Conclusions
This article explored the use of Langchain’s Pandas dataframe agent and large language models for understanding tabular data through natural language. I presented several queries and results using the default prompt (provided with the pandas_dataframe_agent) and contrasted them against the results obtained with a re-engineered prompt.
The Pandas dataframe agent is a significant improvement over the basic CSVLoader, and the language model does a decent job of trying to understand the table and apply the correct Pandas functions to extract data from it. However, it is not able to understand all the semantics associated with the columns of the table, and it produces wrong answers to questions that require such understanding.
I used prompt engineering to tell the model the semantics of the columns and how they relate to each other. This significantly improved the correctness of the results, but it is still not perfect. This is a very active research area, and models and agents still need to improve in order to reach human-level understanding of tabular data.
As a final observation, the wording of the modified prompt also seems to matter considerably. I tried writing the same information about the meaning of the columns in several different ways, and the results were different. It seems that the more explicit an instruction is, the better the understanding, and it also helps to repeat parts of an instruction in different contexts.
The csv file with the World Cup games and the Jupyter notebook used in this article can be found here.