Gen our own AI
In the spirit of “drink your own champagne”, we also “Gen our own AI”.
We already work on GenAI projects at many of our customers, since, implemented well, it can really boost productivity - whether that is finding answers in thousands of documents, translating natural language into SQL queries, or answering customer questions by swiftly sifting through thousands of machine manuals.
However, PoCs often show nice results - until you test them against reality and have to keep them from hallucinating. The proof of the pudding is in the eating. That is why we also invest in our own chatbot. Knowing that there are many ways to build one, we decided to test multiple implementations and score how well each of them answers. Our first attempts were Azure OpenAI with LangChain and Azure OpenAI with LangGraph, both connected to our internal data such as documents, messages, training material and technical designs.
So what are the challenges when you want to turn large language models (LLMs) into something that delivers truly useful insights?
Learning points on the tool side
Let's start with the tools. What are our learnings so far?
Starting point: Microsoft Copilot Studio
We started with what Microsoft advised some of our customers: use Microsoft Copilot Studio out of the box.
We found that getting started is quick and easy: a nice click-and-play way to create fairly basic chatbots with some GenAI abilities. Great for chatbots for things like customer service or employee support for simple questions. It also lets you build extensions for Copilot for Microsoft 365.
However, we soon hit its boundaries: it is very limited in what it can do, in which tasks it can perform, and in which of your own knowledge you can add.
For our use case we therefore had to turn to other options, since we wanted to connect multiple data sources of quite some size. In addition, the low-code approach did not give us enough freedom to tune.
Next stop: Azure AI Studio
So we moved to Azure AI Studio, which combines low code and high code and with that gives more tuning possibilities.
It does require a lot of prompting, though, to get it to select the correct tool (our use case has some overlapping tools, which makes it hard for the model to determine the correct one).
The issues we encountered:
- No possibility to run tools in parallel
- No possibility to go back (create a cyclic graph)
- No possibility to handle human input other than asking for it and running through the entire system again
So although AI Studio seemed promising and our results improved incrementally, in the end we hit the boundaries of what we could achieve with it.
On to: Azure OpenAI with LangChain
Adding LangChain to the equation helped us improve our assistant's decision tree, which it uses to decide which data set (tool) to use for which question. LangChain is high code, so we had more influence on the steps the assistant was taking.
But still, we could not run the various tools in parallel; the assistant could only walk one path at a time.
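To make this concrete, here is a minimal sketch of the shape such a LangChain setup takes: each data set becomes a tool, and a tool-calling agent picks one per question. The deployment name is an assumption and the tool bodies are stubbed, so treat this as an illustration rather than our actual implementation.

```python
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.tools import tool
from langchain_openai import AzureChatOpenAI

@tool
def search_documents(query: str) -> str:
    """Search internal documents and technical designs."""
    # In a real setup this queries a vector store; stubbed here to keep the sketch self-contained.
    return f"(stub) top document chunks for: {query}"

@tool
def search_training_material(query: str) -> str:
    """Search training material."""
    return f"(stub) top training-material chunks for: {query}"

# Assumes AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY are set in the environment;
# the deployment name below is illustrative.
llm = AzureChatOpenAI(azure_deployment="gpt-4o", api_version="2024-06-01")

tools = [search_documents, search_training_material]
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an internal assistant. Pick the most relevant tool for each question."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools)
print(executor.invoke({"input": "What did we design for project X?"})["output"])
```

The decision of which tool to use lives in the prompt plus the tool descriptions, which is where most of our tuning effort went.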
And next: Azure OpenAI with LangGraph
Then we added LangGraph, which brought some big improvements. We were now able to run the tools in parallel. Importantly, it also added cyclic graphs, making it possible to let the agent retry with different tools or questions when it was not able to answer. And with "Human in the Loop" we could ask for feedback during processing instead of only at the end, without having to restart the entire process with a new query.
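A minimal LangGraph sketch of those three ingredients - fan-out to tools in parallel, a cycle back for a retry, and a human-in-the-loop interrupt - might look like this. Node bodies are stubbed and all names are illustrative, so read it as the shape of the graph rather than our actual agent.

```python
import operator
from typing import Annotated, TypedDict

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import END, START, StateGraph

class State(TypedDict):
    question: str
    results: Annotated[list[str], operator.add]  # reducer: parallel nodes append here
    attempts: int

def search_docs(state: State) -> dict:
    return {"results": [f"(stub) document hits for: {state['question']}"]}

def search_messages(state: State) -> dict:
    return {"results": [f"(stub) message hits for: {state['question']}"]}

def grade(state: State) -> dict:
    return {"attempts": state.get("attempts", 0) + 1}

def good_enough(state: State) -> str:
    # Decide whether to answer or cycle back for another attempt.
    return "answer" if state["results"] or state["attempts"] >= 2 else "retry"

def clarify_with_user(state: State) -> dict:
    return {}  # execution pauses before this node, see interrupt_before below

def answer(state: State) -> dict:
    return {}

builder = StateGraph(State)
for name, node in [("search_docs", search_docs), ("search_messages", search_messages),
                   ("grade", grade), ("clarify_with_user", clarify_with_user),
                   ("answer", answer)]:
    builder.add_node(name, node)

builder.add_edge(START, "search_docs")      # fan-out: both searches run
builder.add_edge(START, "search_messages")  # in parallel in the same step
builder.add_edge("search_docs", "grade")
builder.add_edge("search_messages", "grade")
builder.add_conditional_edges("grade", good_enough,
                              {"retry": "clarify_with_user", "answer": "answer"})
builder.add_edge("clarify_with_user", "search_docs")      # the cycle back:
builder.add_edge("clarify_with_user", "search_messages")  # retry both searches
builder.add_edge("answer", END)

# Checkpointer plus interrupt_before gives the "Human in the Loop" pause: the run
# stops before clarify_with_user and can be resumed with the user's feedback.
graph = builder.compile(checkpointer=MemorySaver(), interrupt_before=["clarify_with_user"])
```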
We found that we could better steer the tool selection.
And most importantly, we had more influence on every step the agent takes.
Another advantage we realized: the visual representation of the graph really helps with designing the implementation.
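If you use LangGraph, the compiled graph can render itself, for example as Mermaid diagram source (this assumes the `graph` object from the sketch above):

```python
# Render the compiled graph for design discussions; paste the output into a Mermaid viewer.
print(graph.get_graph().draw_mermaid())
```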
Learning points on the implementation side
Today’s best is tomorrow’s worst
GenAI models have come a long way already. Since we started working with them, we’ve seen the user experience improve drastically, from waiting many minutes for mediocre answers, to immediately getting decent to good answers.
Faster generation enabled us to create more complex systems, for example where the agent reflects on its own work through clever prompting, making its final answers more accurate in the process.
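As an illustration of that reflection pattern (a sketch with made-up prompts and an assumed Azure OpenAI deployment, not our production prompts): the model drafts an answer, critiques its own draft, and then revises it.

```python
from langchain_openai import AzureChatOpenAI

llm = AzureChatOpenAI(azure_deployment="gpt-4o", api_version="2024-06-01")  # illustrative deployment

def answer_with_reflection(question: str, context: str) -> str:
    """Draft, self-critique, then revise - all prompts are illustrative."""
    draft = llm.invoke(
        f"Answer using only this context.\n\nContext:\n{context}\n\nQuestion: {question}"
    ).content
    critique = llm.invoke(
        "Critique this draft for factual errors and claims not supported by the context.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nDraft: {draft}"
    ).content
    return llm.invoke(
        "Rewrite the draft and fix the issues from the critique. "
        "Say you don't know if the context is insufficient.\n\n"
        f"Question: {question}\nDraft: {draft}\nCritique: {critique}"
    ).content
```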
With how fast GenAI is evolving, you really step into the world of FOMO. We need to make sure we stay up to date every day. Upgrades to LLMs, new technologies, and updates to old technologies are all moving very fast (e.g. both Copilot Studio and Azure AI Studio have had improvements since we left them, and LangGraph is evolving very fast as well).
It is a fine line between constantly moving to new possibilities and sticking with what worked OK but could be improved with new enhancements.
Good prompting is key
Just as with ChatGPT prompting: a well-crafted prompt can make the difference between a bad and a great answer. We like Andrej Karpathy's quote "The hottest new programming language is English", which goes not only for coding but also for people using GenAI solutions.
We needed to do a lot of system prompting to ensure that answers stay within our predefined guard rails - to keep the agent from answering unrelated questions, such as questions about the weather when its core functionality is helping with document analysis.
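For illustration, a system prompt with that kind of guard rails could look like the following (a made-up example, not our actual prompt):

```python
# Illustrative system prompt, not our production prompt.
SYSTEM_PROMPT = """You are an internal assistant for document analysis.
Only answer questions about our internal documents, projects and training material.
If a question is outside that scope (for example the weather or general trivia),
say that it is out of scope and do not attempt an answer.
Base every answer strictly on the retrieved context; if the context does not
contain the answer, say so instead of guessing.
Always reference the source document(s) you used."""
```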
We also built in security measures to ensure that users do not get answers based on data they are not allowed to see. Our developer team, for example, is not allowed to see time-writing data from other projects. And if I do not have access to financial data at the original source, I should not get an answer based on that data either.
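Conceptually it boils down to filtering on the caller's permissions before anything reaches the LLM. A toy sketch (the data model and field names are assumptions; in production this kind of security trimming belongs in the search index itself):

```python
from dataclasses import dataclass

@dataclass
class Doc:
    content: str
    allowed_groups: set[str]

@dataclass
class User:
    name: str
    groups: set[str]

def retrieve_for_user(query: str, user: User, index: list[Doc], k: int = 10) -> list[Doc]:
    # Only documents the user may see are even considered as context for the LLM.
    visible = [d for d in index if d.allowed_groups & user.groups]
    # Toy relevance ordering; a real system ranks via the vector / search index.
    visible.sort(key=lambda d: query.lower() in d.content.lower(), reverse=True)
    return visible[:k]

docs = [Doc("Financial report Q3", {"finance"}), Doc("Technical design project X", {"dev"})]
dev = User("developer", {"dev"})
print([d.content for d in retrieve_for_user("project X design", dev, docs)])
```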
Learning points on the data side
Shit in = shit out, even for GenAI. Your data quality needs to be of a certain standard, and a few documents can have a big impact on your results - especially if, like we did, you choose to only show the top 10 most relevant results.
For example, we had an Excel document that was a download of detailed hours from our time-writing tool Harvest. This xls has thousands of rows with project names, dates and hours per day. Because some of the project names matched the questions asked, the LLM kept returning that specific xls as the only and best result - whether we asked what we did at project X, at customer Y, or a combination of both.
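One way to handle this is to keep such raw data exports out of the semantic index during ingestion. A hedged sketch of such a filter (the rules and thresholds are illustrative, not the exact ones we used):

```python
# Illustrative rules: keep raw tabular exports and near-empty files out of the semantic index.
EXCLUDED_SUFFIXES = {".xls", ".xlsx", ".csv"}

def should_index(path: str, text: str) -> bool:
    if any(path.lower().endswith(s) for s in EXCLUDED_SUFFIXES):
        return False  # route tabular exports to a structured (SQL-style) tool instead
    if len(text.split()) < 50:
        return False  # near-empty documents add noise rather than knowledge
    return True

print(should_index("harvest_hours_2024.xlsx", "project;date;hours;..."))  # False
```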
Another example: we had many versions of the same document stored. Just using the latest version sometimes gave less information or less valuable answers, so we spent time defining what to do with multiple versions of the same document. The easy case is a version 1 with policies from 2019, a version 2 from 2021, a version 3 from 2023 and a completely overhauled version 4 from 2024: there you would only use the latest version. But when is the information in previous versions still valid and useful, for example when version 1 was the complete document and version 1.1 a trimmed-down management summary?
That made us think about getting documents "created" for GenAI: can you use an "opt-in" method where users add the useful documents themselves? And an "opt-out" method where users can mark a result as incorrect, causing the document to be removed?
It's good to start thinking about a strategy to make your document store GenAI-ready. Things to think about: what to do with visual presentations, images in manuals, column naming in tables, etc. You could, for example, let another (visual) GenAI model interpret the visuals in a document and store them as text in your knowledge store. It remains crucial, however, to verify whether the generated textual representation of those visuals is accurate enough to become part of the knowledge store.
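As a sketch of that idea (the deployment name and prompt are assumptions, and the output would still need the human accuracy check mentioned above): send each image from a document to a multimodal model and store the description alongside the document text.

```python
import base64

from langchain_core.messages import HumanMessage
from langchain_openai import AzureChatOpenAI

# Illustrative deployment name; assumes a vision-capable Azure OpenAI deployment.
vision_llm = AzureChatOpenAI(azure_deployment="gpt-4o", api_version="2024-06-01")

def describe_image(path: str) -> str:
    """Turn an image from a document into searchable text for the knowledge store."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    msg = HumanMessage(content=[
        {"type": "text", "text": "Describe this figure so it can be searched as text: "
                                 "what it shows, the key labels, and any numbers."},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
    ])
    return vision_llm.invoke([msg]).content
```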
Learning points from the testing side
With all our different implementations, we also wanted an objective way to determine which implementation gives the best results. We created a scoring mechanism for that, which we used to score the answers to a set of 85 questions asked by pilot users. Each answer scored -5 points if it was not good at all (incorrect, hallucinating, link to a wrong document), 0 points if no answer was given, +1 point if the answer was slightly OK, and +5 points if the answer was good (a small sketch of this tally follows after the results below). We scored the following implementations:
- LangChain with all data
- LangChain with a cleaned set of data (70 documents removed, such as the time-writing xls)
- LangGraph with all data
- LangGraph with a cleaned set of data.
And we had two different users test all of these implementations:
- A user with limited access to the source data
- A user with full access to all of the source data
This gave quite some interesting outcomes. For example: the user with limited access ended up with a negative total in the first rounds, whereas the user with full access had positive totals for all implementations. And there was no overall "winner", due to the differences in scores between limited and full access. So we are working on more implementations and more testing.
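The tally itself is simple; a minimal sketch (with made-up judgements) of how the scores are totalled per implementation:

```python
# Score values from our rubric; the example judgements are made up.
SCORES = {"bad": -5, "none": 0, "slightly_ok": 1, "good": 5}

def total_score(judgements: list[str]) -> int:
    return sum(SCORES[j] for j in judgements)

# e.g. one user's judgements of the answers from one implementation:
print(total_score(["good", "slightly_ok", "bad", "none", "good"]))  # 6
```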
So – when to Gen the AI?
Don’t use GenAI because of the hype.
GenAI is not the solution to every problem, and not every problem needs GenAI.
It’s important to keep asking if the use of GenAI really brings value. Sometimes a dashboard could bring the same value at lower cost or could even bring more value.
Our opinion: GenAI brings value when it helps you perform hard or time-consuming tasks, when it finds information quickly, when it acts as an interface to a system where you would otherwise have to perform a lot of steps or adjust a lot of settings, and when it makes tasks that are normally expert-only available to non-experts as well.
But keep in mind: the LLM is 'black-boxish'. It is way more complicated to test and improve, since the model can generate great answers for one user and really bad answers for another - with a similar question and similar available data.
This makes it hard to move from a PoC to a full production-ready implementation.
Tuning, opt-in and opt-out, validation of user feedback and using that to improve your system prompts: this all remains a regular task after implementation.
Be sure to take into account that there is work after the PoC.
Otherwise, your cool GenAI implementation fades out in no time.