UNDER CONSTRUCTION - with AI - so some stuff can be "off message"

movie lines utility:
supernote25.com/movielines

for educational reference on AI -
deeplearning.ai

Using Sheets to run batches and compare agent revisions

Covering a multi-screen application with a semantic UI may require 10,000 or more questions to account for the diverse ways users (and other AI agents) will interact with it. Regression analysis groups questions by expected response, which gives a high-level overview of the AI agent's stability. The example spreadsheet shows initial responses to questions resolved by SMEs (subject matter experts: people who can source the "truth").

INSERT SPREADSHEET IMAGE
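The grouping idea above can be sketched in a few lines of Python. This is only an illustration, not the tooling used on this project: the function name, the row layout (question, expected, actual), and the sample questions are all assumptions made for the example.

```python
from collections import defaultdict


def group_match_rates(rows):
    """Group rows by expected response and return the match rate per group.

    rows: iterable of (question, expected, actual) tuples (hypothetical layout).
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for _question, expected, actual in rows:
        totals[expected] += 1
        # Compare case-insensitively and ignore surrounding whitespace.
        if actual.strip().lower() == expected.strip().lower():
            matches[expected] += 1
    return {expected: matches[expected] / totals[expected] for expected in totals}


# Hypothetical sample rows, as they might be exported from a sheet.
rows = [
    ("How do I open settings?", "open_settings", "open_settings"),
    ("Take me to preferences", "open_settings", "open_settings"),
    ("Where are my options?", "open_settings", "show_help"),
    ("Show my invoices", "open_billing", "open_billing"),
]
rates = group_match_rates(rows)
for expected, rate in rates.items():
    print(f"{expected}: {rate:.0%} of questions match")
```

A group whose match rate drops between builds is a quick signal of where to look first.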

Initial Analysis: The first batch of results is assessed by humans to establish a baseline of correctness. This initial assessment serves as the "truth" against which subsequent builds are compared.

Dynamic Truth: However, the "truth" isn't always static. If a later build produces a response that is deemed more accurate, that response becomes the new benchmark for comparison.
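As a rough sketch of a moving benchmark, the snippet below replaces the stored "truth" for one question once a reviewer has judged a newer response more accurate, and keeps a record of what was replaced. The function name and data shapes are hypothetical, chosen only for illustration.

```python
def promote_to_truth(truth, question, new_response, history):
    """Make new_response the benchmark for a question and log the old value.

    truth: dict mapping question -> benchmark response (hypothetical structure).
    history: list that receives (question, old_response, new_response) tuples.
    """
    history.append((question, truth.get(question), new_response))
    truth[question] = new_response
    return truth


truth = {"How many screens are there?": "Two"}
history = []
# A reviewer has decided the later build's answer is more accurate.
promote_to_truth(truth, "How many screens are there?", "Three main screens", history)
print(truth)
print(history)
```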

Build-to-Build Comparison: The process focuses on identifying changes between builds. This is crucial because a single change can impact numerous, or even all, outputs, and such widespread effects are hard to analyze. To keep the analysis tractable, the "truth" each new build is compared against is typically the results from the most recent previous build.
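A build-to-build comparison can be sketched as a simple diff of two result sets. The example assumes each build's results are available as a mapping from question to response; the function name and sample data are illustrative only.

```python
def diff_builds(previous, new):
    """Return (question, old_response, new_response) for every changed answer.

    previous, new: dicts mapping question -> response (assumed structure).
    A question missing from the new build is reported with None as its new value.
    """
    changes = []
    for question, old_response in previous.items():
        new_response = new.get(question)
        if new_response != old_response:
            changes.append((question, old_response, new_response))
    return changes


previous_build = {"Show my invoices": "open_billing", "Open settings": "open_settings"}
new_build = {"Show my invoices": "open_billing", "Open settings": "show_help"}
changes = diff_builds(previous_build, new_build)
print(changes)
```

Only the changed rows need human attention, which is what keeps a 10,000-question batch reviewable.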

The Importance of Regression Analysis: Regression analysis between builds is essential. It allows us to incorporate insights into the next set of training instructions. Without this analysis, we risk introducing unexpected variance across a wide range of responses.

The Goal of Regression: The core purpose of regression is to monitor for variance and refine the system to consistently achieve 100% accuracy in matching expected results.

Coverage vs. Resolution: Although more questions and tests give finer resolution, it's not necessary to retain all of them. Batches of tests may fluctuate between correct and incorrect status from one build to the next.

Focus on Edge Cases: Some questions will consistently produce fluctuating labels. These "edge cases" are the ones that vary between builds and require attention. They are critical to retain in your coverage.
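One possible way to surface these edge cases is to count how often a question's label flips across builds. The sketch below assumes a per-question list of "pass"/"fail" labels ordered from oldest to newest build; the function name, the `min_flips` threshold, and the sample data are hypothetical.

```python
def find_edge_cases(label_history, min_flips=2):
    """Return {question: flip_count} for questions whose label keeps changing.

    label_history: dict mapping question -> list of labels, oldest build first.
    min_flips: illustrative threshold for calling a question an edge case.
    """
    edge_cases = {}
    for question, labels in label_history.items():
        # A flip is any adjacent pair of builds with different labels.
        flips = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
        if flips >= min_flips:
            edge_cases[question] = flips
    return edge_cases


label_history = {
    "Show my invoices": ["pass", "pass", "pass", "pass"],
    "Where are my options?": ["pass", "fail", "pass", "fail"],
    "Open settings": ["fail", "pass", "pass", "pass"],
}
edge_cases = find_edge_cases(label_history)
print(edge_cases)
```

Questions that never flip are candidates for trimming; questions returned here are the ones to keep.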

Quality Over Quantity: While more coverage is generally beneficial, having good quality coverage of these edge cases is essential for achieving optimal results.

In summary: The regression process is a dynamic cycle of analysis, comparison, and refinement. It's about establishing a "truth", challenging it, and continuously refining the system to achieve the highest possible accuracy. The focus is on understanding the impact of changes, managing variance, and paying special attention to those tricky edge cases that often hold the key to unlocking further improvements.

Agile hybriding: Previously, regression required code changes for each new build and batch. Now, users can modify prompts directly within a spreadsheet and utilize a Sheets add-on for efficient batch runs and result imports. Cloud-based snapshots enable precise recreation without code access or proprietary tools. While the process benefits from matrixing techniques for scalability (tracing the fewest cases for maximum accuracy), it's essentially a brute force method with numerous optimization possibilities, all manageable within the spreadsheet by non-developers. This accessibility empowers entry-level professionals to contribute meaningfully while developing valuable skills.
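For readers curious what a batch run does under the hood, here is a minimal sketch that reads questions from CSV text (as a sheet export might provide), sends each one to an agent, and writes the responses back as CSV. It does not reproduce the Sheets add-on mentioned above: `run_batch` and `stub_agent` are hypothetical names, and the stub stands in for a real agent call.

```python
import csv
import io


def run_batch(csv_text, agent):
    """Run every question in csv_text through agent and return result CSV text.

    csv_text: CSV with a 'question' header column (assumed export format).
    agent: any callable taking a question string and returning a response string.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["question", "response"], lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow({"question": row["question"], "response": agent(row["question"])})
    return out.getvalue()


def stub_agent(question):
    # Placeholder for a real agent call; it simply echoes the question.
    return f"stub answer to: {question}"


sheet_export = "question\nShow my invoices\nOpen settings\n"
result = run_batch(sheet_export, stub_agent)
print(result)
```

Because the agent is passed in as a plain callable, the same batch can be rerun against a different build without touching the batch code.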

Prompt Tuning Guide

Welcome to the prompt tuning process! Your contributions are valuable as you learn and grow with us.

Directions

If you're unsure about editing a cell, please leave it as is. Use the Version History to revert any accidental changes.
Focus on your work. If you have questions, try to find answers within our team's shared resources.
This sheet is for task-related communication. Feedback on the product and new feature ideas are handled elsewhere.
Collaborate with other prompt tuners to find solutions.
Strive for accuracy. Mistakes can impact the team's time. Learn to verify change resolutions.
If you identify an error, please correct it.
Work only within the current regression sheet. Previous sheets are for historical reference only.

PROMPT TUNING TASKS

Find your tasks and instructions on the Resolution Tasks list sheet. The queue is prioritized and should be addressed promptly.

This queue contains unexpected changes that need resolution after a new build and regression analysis.

Once the queue is empty, you can generate new coverage using the application. Please be thoughtful and efficient in creating new cases.

The queue is refreshed daily at 7 a.m. Pacific Time. Pay close attention to updates, especially during final closedown.

General Guidance, Tasks, and Plan

Use the application to identify questions for coverage. Store your findings in clearly named new sheets, without altering existing data.
Familiarize yourself with the application's three main screens. Use the chat bot to understand the data presented and how it aligns with what you see.
Your initial task is to create three groups of questions, one for each main screen. Please do not modify the SME's Questions sheet.

Unexpected Change Resolution Guidelines

You will be working with results from two builds, typically named by date. Analyze the values from both the previous and new builds.
If the new value has changed but your analysis shows it is correct, mark it as "pass."
If the new value is incorrect, mark it as "fail."
If you are uncertain, investigate further and re-check later. One change can often affect multiple outputs.
The application UI is the ultimate source of truth.
Use the application to verify answers when necessary, but refer to the previous build's value otherwise.
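The guidelines above can be read as a small decision procedure. The sketch below is one interpretation, not an official rule set: it assumes that a changed value which has not yet been checked against the application UI is set aside for a later re-check, and the function name and "recheck" label are made up for the example.

```python
def resolve_change(previous_value, new_value, ui_value=None):
    """Suggest a label for one row of the resolution queue.

    ui_value: what the application UI shows, or None if not yet verified.
    Assumption: changed values without UI verification are flagged 'recheck'.
    """
    if new_value == previous_value:
        return "unchanged"  # nothing to resolve
    if ui_value is None:
        return "recheck"  # changed, but not yet verified against the UI
    # The application UI is the ultimate source of truth.
    return "pass" if new_value == ui_value else "fail"


print(resolve_change("3 items", "4 items", ui_value="4 items"))  # changed and correct
print(resolve_change("3 items", "4 items", ui_value="3 items"))  # changed and wrong
print(resolve_change("3 items", "4 items"))                      # not yet verified
```

A human reviewer still makes the final call; the function only mirrors the order in which the checks are applied.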

Expected vs. Unexpected Changes

Changes are a normal part of the process, and we will provide notes on what to expect in each build.

Use this information to determine whether an output change is anticipated or unexpected based on the build changes.
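As an illustration of separating expected from unexpected changes, the sketch below assumes each changed row is tagged with a group (for example, the screen it relates to) and that the build notes name the groups expected to change. Both the `group` field and the function name are hypothetical.

```python
def split_by_expectation(changes, expected_groups):
    """Split changed rows into (expected, unexpected) lists.

    changes: list of dicts with 'question' and 'group' keys (assumed layout).
    expected_groups: set of group names the build notes say should change.
    """
    expected, unexpected = [], []
    for change in changes:
        if change["group"] in expected_groups:
            expected.append(change)
        else:
            unexpected.append(change)
    return expected, unexpected


changes = [
    {"question": "Show my invoices", "group": "billing"},
    {"question": "Open settings", "group": "settings"},
]
# Hypothetical build note: only billing answers were meant to change.
expected, unexpected = split_by_expectation(changes, {"billing"})
print("expected:", [c["question"] for c in expected])
print("unexpected:", [c["question"] for c in unexpected])
```

The unexpected list is what would feed the resolution queue.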

Apply careful thought and patience. This process ensures control as we refine our dynamic engine and stabilize results.

If you find any errors in these instructions, please correct them.

Focus on understanding and executing the process as described.

About Privacy contact@handtop.com
