Evaluation

How to evaluate Foyle’s performance

    What You’ll Learn

    How to quantitatively evaluate Foyle’s AI quality

    Building an Evaluation Dataset

    To evaluate Foyle’s performance, you need to build an evaluation dataset. This dataset should contain a set of examples where you know the correct answer. Each example should be a .foyle file that ends with a code block containing a command. The last command represents the command you’d like Foyle to predict when the input is the rest of the document.
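
    For instance, a single example might look like the sketch below. This is only a sketch: it assumes a notebook-style .foyle document of markdown and code cells, and dev-project is a placeholder project ID; adapt it to however your own .foyle documents are structured. Everything before the final code cell is the input, and the command in that final cell is what Foyle should predict.

    Markdown cell:  List the service accounts used for the dev cluster
    Code cell:      gcloud iam service-accounts list --project=dev-project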

    It's usually a good idea to start by hand crafting a few examples based on your knowledge of your infrastructure and the kinds of questions you expect to ask. A good place to start is by creating examples that test basic knowledge of how your infrastructure is configured, e.g.:

    1. Describe the cluster used for dev/prod?
    2. List the service accounts used for dev
    3. List image repositories

    A common theme here is to think about what information a new hire would need to know to get started. Even a new hire experienced in the technologies you use would still need to be told about your specific configuration, e.g. which AWS account or GCP project to use.

    For the next set of examples, you can start to think about the conventions your teams use. For example, if you have specific tagging conventions for images or labels, you should come up with examples that test whether Foyle has learned those conventions, such as:

    1. Describe the prod image for service foo?
    2. Fetch the logs for the service bar?
    3. List the deployment for application foo?

    A good way to test whether Foyle has learned these conventions, rather than memorized specific answers, is to create examples for non-existent services, images, etc. Since users are unlikely to have actually issued those requests, Foyle cannot simply have memorized the answers. The sketch below shows what such an example might look like.
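
    For instance, suppose (hypothetically) your convention is that each service runs in a Kubernetes namespace named after the service and its pods carry a matching app label. An example for a made-up service bar might then look like:

    Markdown cell:  Fetch the logs for the service bar
    Code cell:      kubectl logs -l app=bar -n bar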

    Evaluating Foyle

    Once you have an evaluation dataset, define an Experiment resource in a YAML file. Here's an example:

    kind: Experiment
    metadata:
      name: "learning"
    spec:
      # Directory containing the evaluation dataset (.foyle files)
      evalDir: /Users/jlewi/git_foyle/data/eval
      # Local directory where the experiment's results are stored
      dbDir: /Users/jlewi/foyle_experiments/learning
      # Google Sheet the results are written to, and the sheet (tab) name
      sheetID: "1O0thD-p9DBF4G_shGMniivBB3pdaYifgSzWXBxELKqE"
      sheetName: "Results"
      # Configuration of the agent being evaluated
      Agent:
        model: gpt-3.5-turbo-0125
        rag:
          # Whether to use retrieval (RAG) and how many examples to retrieve
          enabled: true
          maxResults: 3
    

    You can then run the evaluation with:

    foyle apply /path/to/experiment.yaml
    

    This runs the evaluation. If it succeeds, the results are written to the Google Sheet specified in the Experiment. Each row in the sheet contains the results for one example in the evaluation dataset. The distance column measures how far the predicted command is from the actual command: it is the number of arguments that would need to be changed to turn the predicted command into the actual command. If the predicted and actual commands are identical, the distance is zero; the maximum possible distance is the larger of the number of arguments in the predicted and actual commands.
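
    For intuition, here is a hypothetical pair of commands that differ in a single argument and therefore have a distance of one:

    Actual:     kubectl logs -l app=bar -n prod
    Predicted:  kubectl logs -l app=bar -n dev
    Distance:   1 (only the namespace argument differs)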