** UNDER CONSTRUCTION - with AI - so some stuff can be "off message" **
movie lines utility:
supernote25.com/movielines
for educational reference on AI -
deeplearning.ai
Using Sheets to run batches and compare agent revisions
For a semantic UI to cover an application with multiple screens effectively, it may take 10,000 or
more questions to account for the diverse ways users (and other AI agents) will interact with it. Regression
analysis, which groups questions by their expected responses, offers a high-level overview of the AI agent's
stability.
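As a rough illustration of grouping questions by expected response, the sketch below tallies how many responses in each group match. The column names (question, expected, actual) and the sample rows are assumptions made for this example, not the layout of the actual sheet.

```python
from collections import defaultdict

def summarize_by_expected(rows):
    """Group rows by expected response and count matches per group.

    rows: iterable of dicts with hypothetical keys 'question', 'expected', 'actual'.
    Returns {expected_response: (matches, total)}.
    """
    groups = defaultdict(lambda: [0, 0])
    for row in rows:
        tally = groups[row["expected"]]
        tally[1] += 1                      # total questions in this group
        if row["actual"] == row["expected"]:
            tally[0] += 1                  # responses that match the expected value
    return {expected: (m, t) for expected, (m, t) in groups.items()}

# Made-up sample data for illustration only.
rows = [
    {"question": "Show my open orders", "expected": "orders_screen", "actual": "orders_screen"},
    {"question": "What have I ordered?", "expected": "orders_screen", "actual": "profile_screen"},
    {"question": "Change my email", "expected": "profile_screen", "actual": "profile_screen"},
]
summary = summarize_by_expected(rows)
for expected, (matches, total) in summary.items():
    print(f"{expected}: {matches}/{total} matched")
```

A group whose match count drops between builds is a quick signal of where to look first.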
Example spreadsheet with initial responses to questions resolved by SMEs (subject matter experts: people
who can source the "truth").
INSERT SPREADSHEET IMAGE
Initial Analysis: The first batch of results is assessed by humans to establish a baseline of correctness.
This initial assessment serves as the "truth" against which subsequent builds are compared.
Dynamic Truth: However, the "truth" isn't always static. If a later build produces a response that is deemed
more accurate, that response becomes the new benchmark for comparison.
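One way to picture a "truth" that moves is the minimal sketch below, in which a reviewer accepts some of a new build's responses and those responses replace the old baseline entries. The function name, the dictionary layout, and the sample values are hypothetical.

```python
def update_baseline(baseline, new_results, accepted):
    """Return a new baseline in which accepted questions take the new build's response.

    baseline, new_results: dicts mapping question -> response (assumed layout).
    accepted: questions whose new response a reviewer judged more accurate.
    """
    updated = dict(baseline)               # leave the original baseline untouched
    for question in accepted:
        if question in new_results:
            updated[question] = new_results[question]
    return updated

# Made-up sample data for illustration only.
baseline = {"Change my email": "profile_screen", "Show my open orders": "orders_screen"}
new_results = {"Change my email": "settings_screen", "Show my open orders": "home_screen"}
new_baseline = update_baseline(baseline, new_results, accepted=["Change my email"])
print(new_baseline)
```

Only the accepted question changes; the rejected change keeps its earlier benchmark.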
Build-to-Build Comparison: The process focuses on identifying changes between builds. This is crucial
because a single change can affect many, or even all, outputs, and such widespread effects are hard to
analyze. Therefore, the "truth" for comparison is typically the set of results from the most recent previous build.
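A build-to-build comparison can be sketched as a simple diff of two result sets keyed by question. The data layout and names here are assumptions for illustration; in practice the values would come from the two build columns in the sheet.

```python
def diff_builds(previous, new):
    """List every question whose response differs between two builds.

    previous, new: dicts mapping question -> response (assumed layout).
    Returns a list of (question, previous_value, new_value) tuples.
    """
    changes = []
    for question, old_value in previous.items():
        new_value = new.get(question)      # None if the new build has no row for it
        if new_value != old_value:
            changes.append((question, old_value, new_value))
    return changes

# Made-up sample data for illustration only.
build_prev = {"Show my open orders": "orders_screen", "Change my email": "profile_screen"}
build_new = {"Show my open orders": "orders_screen", "Change my email": "settings_screen"}
changes = diff_builds(build_prev, build_new)
for question, old_value, new_value in changes:
    print(f"{question!r}: {old_value} -> {new_value}")
```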
The Importance of Regression Analysis: Regression analysis between builds is essential. It allows us to
incorporate insights into the next set of training instructions. Without this analysis, we risk introducing
unexpected variance across a wide range of responses.
The Goal of Regression: The core purpose of regression is to monitor for variance and refine the system to
consistently achieve 100% accuracy in matching expected results.
Coverage vs. Resolution: While the number of questions and tests is related to resolution, it's not
necessary to retain all of them. Batches of tests may fluctuate between correct and incorrect status.
Focus on Edge Cases: Some questions will consistently produce fluctuating labels. These "edge cases" are the
ones that vary between builds and require attention. They are critical to retain in your coverage.
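Fluctuating questions can be found by counting how often a label flips across builds. This is a sketch under assumptions: the per-question label history, the "pass"/"fail" labels as list entries, and the min_flips threshold are all hypothetical choices for the example.

```python
def count_flips(labels):
    """Count how many times consecutive labels differ."""
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)

def find_edge_cases(history, min_flips=2):
    """Return questions whose label flipped at least min_flips times.

    history: dict mapping question -> list of labels, ordered oldest to newest build.
    """
    return [q for q, labels in history.items() if count_flips(labels) >= min_flips]

# Made-up sample data for illustration only.
history = {
    "Show my open orders": ["pass", "pass", "pass"],   # stable
    "What have I ordered?": ["pass", "fail", "pass"],  # flips twice
    "Change my email": ["fail", "pass", "pass"],       # fixed once, then stable
}
edge_cases = find_edge_cases(history)
print(edge_cases)
```

A threshold of two flips separates a one-time fix from a question that keeps changing; a different threshold may suit a longer build history.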
Quality Over Quantity: While more coverage is generally beneficial, having good quality coverage of these
edge cases is essential for achieving optimal results.
In summary: The regression process is a dynamic cycle of analysis, comparison, and refinement. It's about
establishing a "truth", challenging it, and continuously refining the system to achieve the highest possible
accuracy. The focus is on understanding the impact of changes, managing variance, and paying special
attention to those tricky edge cases that often hold the key to unlocking further improvements.
Agile hybriding: Previously, regression required code changes for each new build and batch. Now, users can
modify prompts directly within a spreadsheet and utilize a Sheets add-on for efficient batch runs and result
imports. Cloud-based snapshots enable precise recreation without code access or proprietary tools. While the
process benefits from matrixing techniques for scalability (tracing the fewest cases for maximum accuracy),
it's essentially a brute-force method with numerous optimization possibilities, all manageable within the
spreadsheet by non-developers. This accessibility empowers entry-level professionals to contribute
meaningfully while developing valuable skills.
Prompt Tuning Guide
Welcome to the prompt tuning process! Your contributions are valuable as you learn and grow with us.
Directions
- If you're unsure about editing a cell, please leave it as is. Use the Version History to revert any
accidental changes.
- Focus on your work. If you have questions, try to find answers within our team's shared resources.
- This sheet is for task-related communication. Product feedback and new feature ideas are
handled elsewhere.
- Collaborate with other prompt tuners to find solutions.
- Strive for accuracy. Mistakes can impact the team's time. Learn to verify change resolutions.
- If you identify an error, please correct it.
- Work only within the current regression sheet. Previous sheets are for historical reference only.
PROMPT TUNING TASKS
Find your tasks and instructions on the Resolution Tasks list sheet. The queue is prioritized and should
be addressed promptly.
This queue contains unexpected changes that need resolution after a new build and regression analysis.
Once the queue is empty, you can generate new coverage using the application. Please be thoughtful and
efficient in creating new cases.
The queue is refreshed daily at 7 a.m. Pacific Time. Pay close attention to updates, especially during
final closedown.
General Guidance, Tasks, and Plan
- Use the application to identify questions for coverage. Store your findings in clearly named new
sheets, without altering existing data.
- Familiarize yourself with the application's three main screens. Use the chat bot to understand the
data presented and how it aligns with what you see.
- Your initial task is to create three groups of questions, one for each main screen. Please do not
modify the SME's Questions sheet.
Unexpected Change Resolution Guidelines
- You will be working with results from two builds, typically named by date. Analyze the values from
both the previous and new builds.
- If the new value has changed but your analysis shows it is correct, mark it as "pass."
- If the new value is incorrect, mark it as "fail."
- If you are uncertain, investigate further and re-check later. One change can often affect multiple
outputs.
- The application UI is the ultimate source of truth.
- Use the application to verify answers when necessary; otherwise, refer to the previous build's
value.
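The guidelines above can be read as a small decision rule, sketched below. It is one interpretation, not the team's official procedure: the function name, the "unchanged" and "recheck" labels, and the ui_value parameter (the value read from the application UI, if it was checked) are assumptions for this example.

```python
def resolve_change(previous_value, new_value, ui_value=None):
    """Label a build-to-build change following the guidelines above.

    ui_value: the answer verified in the application UI, or None if not yet checked.
    Returns "unchanged", "pass", "fail", or "recheck".
    """
    if new_value == previous_value:
        return "unchanged"                 # matches the previous build, nothing to resolve
    if ui_value is None:
        return "recheck"                   # changed but not verified yet: investigate later
    # The application UI is treated as the ultimate source of truth.
    return "pass" if new_value == ui_value else "fail"

# Made-up sample values for illustration only.
print(resolve_change("profile_screen", "settings_screen", ui_value="settings_screen"))
print(resolve_change("profile_screen", "settings_screen", ui_value="profile_screen"))
print(resolve_change("profile_screen", "settings_screen"))
```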
Expected vs. Unexpected Changes
Changes are a normal part of the process, and we will provide notes on what to expect in each build.
Use this information to determine whether an output change is anticipated or unexpected based on the
build changes.
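As a sketch of separating expected from unexpected changes, the example below assumes the build notes can be reduced to a set of question groups that are expected to change. That reduction, the 'group' field, and the sample data are hypothetical; real build notes may need human judgment to apply.

```python
def split_by_expectation(changes, expected_groups):
    """Split changed questions into expected and unexpected lists.

    changes: list of dicts with hypothetical keys 'question' and 'group'.
    expected_groups: set of group names the build notes say should change.
    Returns (expected_questions, unexpected_questions).
    """
    expected, unexpected = [], []
    for change in changes:
        target = expected if change["group"] in expected_groups else unexpected
        target.append(change["question"])
    return expected, unexpected

# Made-up sample data for illustration only.
changes = [
    {"question": "Change my email", "group": "profile"},
    {"question": "Show my open orders", "group": "orders"},
]
expected, unexpected = split_by_expectation(changes, expected_groups={"profile"})
print("expected:", expected)
print("unexpected:", unexpected)
```

Under this reading, the unexpected list is what would feed the resolution queue.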
Apply careful thought and patience. This process ensures control as we refine our dynamic engine and
stabilize results.
If you find any errors in these instructions, please correct them.
Focus on understanding and executing the process as described.