Infraestruturas de Portugal AI agent procurement: open weight models, OCR/RAG and deployment
Today's post continues looking into the substantive elements of the Infraestruturas de Portugal procurement process for BIA. It starts by discussing open weight models, then moves on to OCR (optical character recognition) and RAG (retrieval augmented generation), and concludes with the deployment plans for BIA. It is the last blog post of the series for the next 10 days or so, as I will be taking a week off. I think there is still enough material for at least one or two more posts after the Easter break.
Open weight models
Infraestruturas de Portugal sets out in the technical specifications, as examples, that the LLM used for inference should be "OSS", which (as mentioned yesterday) I assume is intended to mean open source software. It name-drops the following LLM models and suppliers:
- Llama 2
- Llama 3
- Mistral
- Falcon
- StableLM
Of the list above, only Llama 2 and Llama 3 are actual models, and they are not really open source, just open weights. They can be used without payment as long as the requirements of the corresponding Llama 2 Community License or Llama 3 Community License are met.
Llama 2 and Llama 3 have three drawbacks for BIA. First, these are ancient models from 2023 and 2024. They are completely outdated: the equivalent of seeing a tender today for a car suggesting the Ford Model T as an idea of what the contracting authority is looking for.
Second, these are not chain-of-thought models, which are explicitly mentioned in the technical specifications when discussing the proof of concept. Infraestruturas de Portugal is right to flag chain-of-thought as important, since it is the only way to get some visibility into the black box of LLM decision-making, but neither Llama 2 nor Llama 3 offers it. Furthermore, neither of these two models is particularly good in European Portuguese, and that is a factor to take into account when choosing an underlying model, since all the work they will be doing is in that language. It is worth noting that the more modern Llama 4 is not mentioned by Infraestruturas de Portugal, for reasons that are unclear (although its license is, I think, a bit more restrictive).
Third, these older models have really small context windows (<32k tokens). For an LLM, the context window works as its working memory, ie the amount of information it can hold to address queries or produce an output. Larger context windows tend to increase accuracy (somewhat), reduce hallucinations and improve coherence in answers.
A small context window means they can hold maybe 50-70 pages of text in memory to be queried or to produce an output from. Depending on the contract, bids and documents associated with them can be hundreds of pages long meaning they will not fit within the context window of the model that is being tasked to summarise them for the jury. It does not help that the (correctly) stated preference for a chain-of-thought model means that part of the context window will be consumed by the tokens necessary for that operation.
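The page estimates above can be sketched with some back-of-the-envelope arithmetic. The figures here are my own illustrative assumptions (words per page and tokens per word vary a lot by document and tokenizer, and Portuguese tends to tokenize less efficiently than English), not anything from the tender:

```python
# Rough estimate of how many document pages fit in an LLM context window.
# Assumptions (illustrative): ~300 words per page of tender documents,
# ~1.7 tokens per word for European Portuguese text.
TOKENS_PER_PAGE = 300 * 1.7  # ~510 tokens per page

def pages_that_fit(context_window_tokens, reserved_for_output=4096):
    """Pages of input that fit once room is reserved for the model's output
    (and, in a chain-of-thought model, its reasoning tokens)."""
    usable = context_window_tokens - reserved_for_output
    return int(usable // TOKENS_PER_PAGE)

print(pages_that_fit(32_768))   # older Llama-class window -> 56 pages
print(pages_that_fit(128_000))  # more recent open-weight models -> 242 pages
```

Under these assumptions a 32k window holds roughly 50-60 pages, which is exactly why multi-hundred-page bid files cannot simply be pasted in whole.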
The other names mentioned by Infraestruturas de Portugal as examples are not models but the companies/foundries behind them: Mistral, Falcon and StableLM. Mistral does have chain-of-thought models; some of its models are available under the Apache License 2.0, others under a modified MIT license, and some are not available freely. Obviously, the best Mistral models these days fall into the latter category.
What about other open weights models?
What is absent from this list of suggestions are newer models, such as the gpt-oss series from OpenAI or the Chinese-made open weight models coming from DeepSeek, Alibaba (Qwen) or Moonshot AI (Kimi), which are the state of the art when it comes to open weight models made available under a permissive or liberal license. Qwen 3.5, for example, is available under the Apache License 2.0. Is their omission a sign of a lack of knowledge about them, or an indication that their use is not welcome? Either way, those are likely to be among the better options within the constraints imposed by Infraestruturas de Portugal.
Going back to my original point about open source (in general) making sense for contracting authorities, I think we are at a particular juncture where the solution Infraestruturas de Portugal is looking for is probably only achievable with a frontier model that is not available for 'free'. In other words, deploying an AI 'agent' of this kind can, I think, only be done with the cutting-edge models from Anthropic, OpenAI or Google. Not exactly what people like to hear, and naturally with a very different cost structure, especially at the deployment stage.
Alternatively, had this been a national effort instead of a one-off attempt by a single contracting authority, the just-announced Mistral Forge looks like exactly the best way of going about developing an AI model fully under the control of the public sector. But alas, that is not what BIA is supposed to be. Anyway, here's a nice writeup about Mistral Forge; note the comments made there about RAG (retrieval-augmented generation).
OCR (optical character recognition) and RAG (retrieval-augmented generation)
Looking at the workflow suggested in the technical specifications, the documents feeding BIA will first pass through a pipeline of various OCR techniques to extract the information. The tools suggested there are all open source in the correct sense of the term.
On a cursory look this makes perfect sense, since many documents will be either PDFs or scans, and both are treated by computers as images rather than text, so their content must somehow be turned into text. But herein lies another issue: OCR and RAG for LLMs are not solved problems. It is true that the OCR stage is done as pre-processing before the data is fed into the LLM, but I think the problem holds, since even frontier models struggle with RAG. This post illustrates well the difficulties and tradeoffs associated with RAG.
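To make the RAG failure mode concrete, the retrieval step looks roughly like the sketch below. Real pipelines use embedding models and a vector store; here a naive word-overlap score stands in, purely to show the shape of it, and the tender wording is hypothetical. The weakness is visible in miniature: if the query and the relevant chunk do not share vocabulary (or OCR mangled the chunk), the wrong text gets handed to the model.

```python
# Minimal sketch of the retrieval step in a RAG pipeline.
# Embedding similarity is replaced by naive word overlap for illustration.

def chunk(text, size=40):
    """Split a document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def score(query, chunk_text):
    """Fraction of query words appearing in the chunk (stand-in for cosine similarity)."""
    q = set(query.lower().split())
    c = set(chunk_text.lower().split())
    return len(q & c) / (len(q) or 1)

def retrieve(query, chunks, k=2):
    """Return the k highest-scoring chunks; only these reach the LLM's context."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

bid = ["price schedule and payment terms", "delivery deadline and execution period"]
print(retrieve("payment terms for the contract", bid, k=1))
```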
Once more, language may be an issue here as well. I do not know how well those OCR tools handle Portuguese, and in particular our special characters, but pairing them with an old model might be problematic. I am being tentative here due to my own limited knowledge. Having said that, there is a real and significant risk that the model will be fed wrong or incomplete information through weaknesses in the OCR/RAG pipeline...which will then be the basis of the summarisation the jury is given.
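The special-characters worry is at least cheap to test for. A simple sanity check (my own illustrative sketch, not part of any tool mentioned in the tender) compares OCR output against reference text with accents stripped, to flag cases where "ã", "ç" or "é" were silently lost:

```python
# Detect whether an OCR pass dropped Portuguese diacritics.
import unicodedata

def strip_accents(s):
    """Remove combining marks: 'execução' -> 'execucao'."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

def diacritics_lost(reference, ocr_output):
    """True if the texts differ, but match once accents are removed."""
    return (reference != ocr_output
            and strip_accents(reference) == strip_accents(ocr_output))

print(diacritics_lost("execução", "execucao"))  # True: the OCR lost ç and ã
print(diacritics_lost("execução", "execução"))  # False: faithful output
```

Dropped diacritics are among the mildest OCR failures (the text stays mostly readable), but they can still degrade retrieval, since "execução" and "execucao" no longer match.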
Deployment
As discussed previously, Infraestruturas de Portugal wants the flexibility of deploying the tool either on a cloud or on premises, and indicates as the suggested minimum hardware for deployment an NVIDIA A10/A100 GPU. It is unclear whether this refers to a single GPU or a potential cluster of NVIDIA A10/A100 GPUs; upon reflection I will take it to be the latter, as the single-GPU scenario does not seem feasible for deploying BIA. However, even for older-generation cards like the A10/A100, a cluster of these GPUs is not going to be cheap, whether hired on the cloud or, especially, acquired to run on premises. Another reason why the cost of running BIA should have been included in the award criteria, although I'm still puzzled why the SLA (service level agreement) is an award criterion...
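Why a single GPU seems infeasible comes down to simple arithmetic. The sketch below is illustrative only: it counts model weights alone (the KV cache, activations and serving overhead add substantially more), and it assumes the 80 GB A100 variant, which the tender does not specify:

```python
# Back-of-the-envelope VRAM sizing for serving an open-weight model.
# Counts weights only; real serving needs headroom for the KV cache.
import math

A100_VRAM_GB = 80  # assuming the 80 GB A100 variant

def weights_gb(params_billion, bytes_per_param):
    """Memory for the weights: 1e9 params * bytes, expressed in GB."""
    return params_billion * bytes_per_param

def gpus_needed(params_billion, bytes_per_param=2):
    """GPUs required to hold the weights (2 bytes/param = fp16/bf16)."""
    return math.ceil(weights_gb(params_billion, bytes_per_param) / A100_VRAM_GB)

print(gpus_needed(70))     # 70B model in bf16 -> 2 GPUs for weights alone
print(gpus_needed(70, 1))  # 8-bit quantised    -> 1 GPU, before KV cache
```

Even a mid-sized 70B model in half precision will not fit on one card, and the long context windows discussed above inflate the KV cache further, so a cluster is the realistic reading of the specification.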
Anyway, I think the tradeoff here is that a more performant solution, based on a larger model with a longer context window, requires more hardware to run, be it on the cloud or on premises. That, in turn, means higher running costs.
From the draft technical specifications it seems Infraestruturas de Portugal will be responsible only for its own deployment and not for the wider public sector. While BIA is supposed to be available for adoption by other public bodies, at least for now it seems that will be done on each authority's own dime. Once more, I think the right approach here would be to centralise development and deployment instead of expecting each contracting authority to roll its own implementation.
This series stops here for a much needed Easter break. After Easter I am planning at least one or two further entries. The first will be on the performance targets and the other on the regulatory risks arising from BIA.