We are currently exploring the role of local knowledge bases (KBs) in retrieval-augmented generation (RAG) AI processing. This post is the first in a series documenting our “sandbox” knowledge bases (created over a period of about 20 years) and how we’re using them in various Generative AI (GenAI) projects.
What are local knowledge bases?
Local knowledge bases (KBs) are locally owned and controlled sources of information. They are specialized, often proprietary, and are usually considered more “authoritative” and “trustworthy” than the general information sources on which large language models (LLMs) are typically trained.
They are often cited as containing the “ground truth” against which new information generated through AI tools and techniques must be compared.

What is RAG processing?
Retrieval-Augmented Generation (RAG) processing allows a search engine, assistant, or other AI tool to “consult” an available local knowledge base before “falling back” on the additional information available from a trained large language model.
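As a rough sketch (not tied to any particular framework), the “consult first, fall back second” flow looks something like this in Python. The retrieve() keyword-overlap search and the call_llm() stub are simplified placeholders of our own, standing in for a real vector index and a real LLM API:

```python
# Minimal sketch of RAG's "consult the local KB first, fall back on the LLM" flow.
# retrieve() and call_llm() are illustrative placeholders, not any framework's API.

def retrieve(question, local_kb, top_k=2):
    """Naive keyword-overlap search; a production system would use a vector index."""
    q_words = set(question.lower().split())
    matches = [text for text in local_kb if q_words & set(text.lower().split())]
    matches.sort(key=lambda text: len(q_words & set(text.lower().split())), reverse=True)
    return matches[:top_k]

def call_llm(prompt):
    """Placeholder for whatever LLM API is actually in use."""
    return f"[LLM response to prompt beginning: {prompt[:60]!r}]"

def rag_answer(question, local_kb):
    passages = retrieve(question, local_kb)      # "consult" the local knowledge base first
    if passages:
        context = "\n".join(passages)
        prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    else:
        prompt = question                        # "fall back" on the model's trained knowledge
    return call_llm(prompt)

print(rag_answer("Please give me the phone number for Customer X",
                 ["Customer X phone number: 555-0100",
                  "Customer Y is located in Folsom"]))
```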
What specific role do local KBs play in the RAG process?
Local knowledge bases play a crucial role in RAG processing by providing a curated source of information that can be efficiently searched and retrieved to augment the outputs of LLMs.
In RAG systems, the local knowledge base serves as an external repository of up-to-date, domain-specific information that the LLM can draw on. The knowledge base owner is able to keep the information private and local, if desired.
In RAG systems, the LLM can access (“retrieve”) and incorporate relevant facts and data that might not be part of its original training set, and then “generate” either simple, direct answers to user questions (“Please give me the phone number for Customer X”) or more elaborate answers that can be packaged programmatically for more complex use cases.
A programmatic example might be Python code that accesses a local KB containing a number of recently published articles about financial trends and produces output containing summaries of the articles along with relevant charts and diagrams. The output might also include recommendations, for example, to buy or sell shares of a particular company.
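Here is a minimal sketch of what such a program might look like. The article texts, the summarize() stand-in, and the report format are all placeholders of our own invention; a real version would call an actual LLM and could render charts with a plotting library:

```python
# Illustrative only: the article texts are placeholders, and summarize() stands in
# for an LLM summarization call; chart generation is noted but omitted for brevity.

articles = [
    {"title": "Rate cuts and the bond market", "text": "Placeholder article text..."},
    {"title": "Tech earnings: a mixed quarter", "text": "Placeholder article text..."},
]

def summarize(article_text):
    """Stand-in for sending a retrieved article to an LLM and returning its summary."""
    return f"[summary of {len(article_text)} characters of article text]"

def build_report(kb_articles):
    lines = ["Financial trends report", "=" * 24]
    for article in kb_articles:
        lines.append("")
        lines.append(article["title"])
        lines.append(summarize(article["text"]))
    # A fuller version might render charts from data in the articles and append
    # buy/sell recommendations generated from the summaries.
    return "\n".join(lines)

print(build_report(articles))
```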
What are the key advantages of the RAG methodology?
By leveraging a local KB, RAG systems can do the following:
- Improve accuracy and factual correctness of responses
- Provide up-to-date information without retraining the model
- Offer domain-specific knowledge tailored to an organization’s needs
- Reduce hallucinations and incorrect outputs from the LLM
What are the local KBs we are using in our explorations?
Some local knowledge bases are more useful than others in RAG processing.
For example, KBs that have a clear and logical structure, are chunked into “bite-sized” pieces, and have been pre-labeled with relevant tags are particularly useful.
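To make that concrete, here is one way such chunked, pre-labeled content might be represented and filtered in Python. The field names and tag values are our own illustration, not a standard schema:

```python
# A sketch of "bite-sized," pre-labeled chunks and tag-based filtering.

chunks = [
    {"id": "topic-0001", "tags": ["task", "reference"],
     "text": "A short, self-contained piece of the knowledge base."},
    {"id": "topic-0002", "tags": ["concept", "history"],
     "text": "Another small chunk, labeled with the domains it covers."},
]

def chunks_with_tag(kb_chunks, tag):
    """Narrow retrieval to chunks that were pre-labeled with a given tag."""
    return [chunk for chunk in kb_chunks if tag in chunk["tags"]]

for chunk in chunks_with_tag(chunks, "history"):
    print(chunk["id"], "->", chunk["text"])
```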
As long-time (since 2006) DITA/XML consultants and practitioners, we are well aware of the benefits that DITA-based information sources offer to both human and AI users.
We happen to have a small collection of DITA-based information sources that we wrote ourselves and that have served us well in prior prototyping efforts. They include the following:
- “Shopping for groceries” and “Cleaning the garage,” which we used as super-simple examples in the technical documentation we wrote about the DITA Open Toolkit and in the tutorial and reference document we wrote for new users of DITA itself.
- DITAinformationcenter, a book-sized set of technical documentation about DITA that could be useful as a template for documentation for other technologies, especially those that are AI-based.
- The El Dorado Hills Handbook, a 300-page book about an unincorporated community in the Sierra foothills near Sacramento. The original content was well organized but written in an unstructured format. We have since transformed a subset of it into a DITA-based template that could be useful to other community-oriented organizations engaged in telling their stories.
- A number of documents in the history domain that could be useful to people writing family histories, genealogies, memoirs, and travel narratives.
We have been doing some preliminary experimentation with these KBs and RAG processing, and we plan on posting about our experiences as they unfold.
What’s next?
We have already published on this website several posts about our early experiences with RAG processing, and we plan to publish more.
Notes, references
We queried Perplexity about citing AI resources, and it responded as follows:
“If you were citing information I provided today about AI citation, it might look like this: ‘Perplexity AI (2024). Information on citing AI assistants. Retrieved July 23, 2024 from a conversation with Perplexity’s AI assistant.’”
Or, as a shorter citation in APA style:
Perplexity AI, personal communication, July 23, 2024
In any case, I made liberal use of the well-crafted summaries that Perplexity provided me in writing the first three sections above. Thank you, Perplexity AI!