⚙️ Minimal Configuration

There are sensible defaults for most of the configuration parameters. However, you have to specify at least:

  1. The language models used during the different steps of the EvidenceSeeker pipeline (including API keys if necessary).
  2. Where to find and/or store the indexed knowledge base used for fact-checking.

The following sections describe how to configure the EvidenceSeeker pipeline minimally for the preprocessor, confirmation analyser, and retriever components.

Requirements for Minimal Configuration

The minimal configuration requires that your LLM provider supports the following features:

  • Constrained decoding/structured output via JSON schemata: Some steps in the EvidenceSeeker pipeline require the language model to return structured output in a specific format (typically JSON).
  • Log probabilities: The EvidenceSeeker pipeline relies on the language model returning log probabilities (logprobs) for the generated tokens.

If your LLM (provider) does not support these features, use an alternative configuration as described under “Advanced Configuration”.
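
If you are unsure whether your provider offers both features, a quick test request can help. The following sketch is purely illustrative and not part of EvidenceSeeker: it uses the openai Python package against a hypothetical OpenAI-compatible endpoint (base URL, model name, and API key are placeholders), asks for a JSON-schema-constrained response, and requests logprobs. Whether both parameters are honoured depends on your provider.

from openai import OpenAI

# Placeholder endpoint, model name, and key; adapt them to your provider.
client = OpenAI(base_url="https://example.com/v1", api_key="your_api_key")

response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Return a JSON object with a single key 'verdict'."}],
    # structured output constrained by a JSON schema
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "verdict",
            "schema": {
                "type": "object",
                "properties": {"verdict": {"type": "string"}},
                "required": ["verdict"],
            },
        },
    },
    # log probabilities for the generated tokens
    logprobs=True,
)

print(response.choices[0].message.content)  # should be valid JSON
print(response.choices[0].logprobs)         # should not be None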

Preprocessor and Confirmation Analyser

You can configure the language models and their API keys for both the preprocessor and the confirmation analyser via configuration files: create one YAML file per component, preprocessor_config.yaml and confirmation_analysis_config.yaml, each with the following content:

used_model_key: your_model_identifier
# optional, if you want to use a file for setting up API keys
env_file: path/to/your/api_keys.txt
models:
  your_model_identifier:
    api_key_name: name_of_your_api_key
    backend_type: backend_type_of_your_model
    base_url: base_url_of_your_language_model
    description: description_of_your_model
    max_tokens: 1024
    model: model_name_or_identifier
    name: name_of_your_model
    temperature: 0.2
    timeout: 260

Clarifications:

  • Both your_model_identifier and name_of_your_model are arbitrary identifiers that you can choose.
  • Both components expect the API key to be set as an environment variable whose name is given by api_key_name. If you use a file with environment variables via env_file, that file should contain a line of the form name_of_your_api_key=your_api_key. Alternatively, you can set the environment variable directly in your shell or script before running the EvidenceSeeker pipeline (see the example after this list).
    • If you do not need an API key (e.g., if you use a local model), you can omit the env_file and api_key_name parameters.
  • base_url and model specify the API endpoint and the model to use, respectively.
  • The backend_type determines which API client is used (e.g., OpenAI, HuggingFace, etc.). For a list of supported backends, see here.
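
For example, assuming you chose MY_PROVIDER_API_KEY as api_key_name (a name picked here purely for illustration), the file referenced by env_file would contain the single line MY_PROVIDER_API_KEY=your_api_key. Alternatively, the same variable can be set in a Python script before the pipeline is created:

import os

# Hypothetical variable name; it must match the api_key_name in your config.
os.environ["MY_PROVIDER_API_KEY"] = "your_api_key"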

Retriever Component

The retriever component uses an embedding model to create and search your indexed knowledge base. It can be minimally configured using a YAML configuration file, retrieval_config.yaml, with the following content:

env_file: path/to/your/api_keys.txt
api_key_name: api_key_name_for_your_embedding_model
embed_backend_type: huggingface_inference_api
embed_base_url: base_url_of_your_embedding_model
embed_model_name: model_name_or_identifier_of_your_embedding_model
# path to your knowledge base that is used to 
# create the index of your knowledge base
document_input_dir: path/to/your/knowledge_base
# path to the directory where the index is stored
index_persist_path: path/to/your/index

Clarifications:

Both embed_base_url and embed_model_name are important since they specify the endpoint and the embedding model, respectively. If you use an embedding model hosted by HuggingFace and choose, for instance, sentence-transformers/paraphrase-multilingual-mpnet-base-v2, you would set:

  • embed_base_url to “https://router.huggingface.co/hf-inference/models/sentence-transformers/paraphrase-multilingual-mpnet-base-v2” and
  • embed_model_name to “sentence-transformers/paraphrase-multilingual-mpnet-base-v2”
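
Putting this together, a retrieval_config.yaml for that HuggingFace-hosted embedding model could look as follows; the API key name (HF_TOKEN is only an example) and the paths are placeholders you need to adapt:

env_file: path/to/your/api_keys.txt
api_key_name: HF_TOKEN
embed_backend_type: huggingface_inference_api
embed_base_url: https://router.huggingface.co/hf-inference/models/sentence-transformers/paraphrase-multilingual-mpnet-base-v2
embed_model_name: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
document_input_dir: path/to/your/knowledge_base
index_persist_path: path/to/your/index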

Executing the Pipeline

Using these configuration files, you can fact-check a statement against your knowledge base as follows:

from evidence_seeker import EvidenceSeeker
import asyncio

pipeline = EvidenceSeeker(
    retrieval_config_file="path/to/retrieval_config.yaml",
    confirmation_analysis_config_file="path/to/confirmation_analysis_config.yaml",
    preprocessing_config_file="path/to/preprocessor_config.yaml",
)
# run the pipeline
results = asyncio.run(pipeline("your statement to fact-check"))
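
Note that asyncio.run() cannot be called from an already running event loop. If you work in such a context (for instance, a Jupyter notebook with top-level await), you can await the pipeline directly instead:

# inside an async function or a notebook cell that supports top-level await
results = await pipeline("your statement to fact-check")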