![rw-book-cover](https://mitchellh.com)

---

> "Prompt Engineering" emerged from the growth of language models to describe the process of applying prompting to effectively extract information from language models, typically for use in real-world applications.

- [View Highlight](https://read.readwise.io/read/01h0hwb6j62mznd00nn1wsfwk7)

---

> "Blind Prompting" is a term I am using to describe the method of creating prompts with a crude trial-and-error approach paired with minimal or no testing and a very surface-level knowledge of prompting. *Blind prompting is not prompt engineering.*

- [View Highlight](https://read.readwise.io/read/01h0hwbpvybmjqgdprg2w2rd9x)

---

> whether prompt engineering can truly be described as "engineering" or if it's just ["witchcraft"](https://news.ycombinator.com/item?id=35524725) spouted by hype-chasers

- [View Highlight](https://read.readwise.io/read/01h0hwcd2gwg4qn8md7vr9janq)

---

> The demonstration set contains an expected input along with an expected output. This set will serve multiple goals:
> 1. It will be used to measure the accuracy of our prompt. By using the input of a single demonstration, we can assert that we receive the expected output.
> 2. It specifies what we expect the prompt inputs and outputs to look like, allowing us as engineers to determine if it is the right shape for our problem.
> 3. We can use a subset of this demonstration set as exemplars for a few-shot approach if we choose to use a few-shot prompt. For those unfamiliar with the term "few-shot", few-shot is a style of prompt where examples are given in addition to the prompt. See [here for a good overview of Few-Shot vs Zero-Shot prompting](https://www.promptingguide.ai/techniques/fewshot).

- [View Highlight](https://read.readwise.io/read/01h0j3y1cpawy2sw7fb7fjcaza)

---

> The more demonstrations you have, the better testing you can do, but also the more expensive it becomes due to token usage. At a certain size, it is often more economical to [fine-tune a language model](https://huggingface.co/course/chapter7/3?fw=pt).

- [View Highlight](https://read.readwise.io/read/01h0j3yrd8kr4v4z3vcmvzqvkf)

---

> **First, we are only extracting one piece of information.** It may be tempting to try to get the model to extract our entire event such as event name, attendees, time, location, etc. and output it as some beautiful ready-to-use JSON or some other format. The model may be able to do this. But when approaching a new problem, I recommend decomposing it into a single problem first. This makes the problem more tractable, and will also eventually give you a baseline accuracy that you can use to benchmark whether the multi-output approach is actually worth it or not.

- [View Highlight](https://read.readwise.io/read/01h0j404ergsazkpz1k7dd9msz)

---

> A prompt candidate is a prompt that we feel may elicit the desired behavior we want from the language model. We come up with multiple candidates because it's unlikely we'll choose the best prompt right away.

- [View Highlight](https://read.readwise.io/read/01h0j4bfpnm8ytbp72k7nwrrrf)

---

> When building a few-shot prompt, equal distribution of labels matters, demonstrating the full set of labels matters, etc. When choosing exemplars, exemplars that the LLM was likely to get wrong typically perform best, exemplars have been shown to often perform best when ordered shortest to longest, etc.

- [View Highlight](https://read.readwise.io/read/01h0j4c8fsksvd8dd2z8atq62k)

---
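The demonstration set and the prompt candidates above translate directly into a small testing harness. The sketch below is illustrative rather than taken from the article: the MM/DD date-extraction task mirrors the article's event example, but the `DEMONSTRATIONS` data, the prompt wording, and the injected `model` callable (a stand-in for whatever LLM API you actually call) are all assumptions.

```python
from typing import Callable, Iterable, Sequence, Tuple

# Demonstration set: expected input paired with expected output (made-up examples).
DEMONSTRATIONS: Sequence[Tuple[str, str]] = [
    ("Dinner with Alice at 7pm on 11/03", "11/03"),
    ("Team offsite on 05/22, all day", "05/22"),
    ("Dentist appointment at 9am on 12/01", "12/01"),
]

# Zero-shot candidate: instruction only, no exemplars.
ZERO_SHOT = "Extract the date of the event in MM/DD format.\n\nText: {text}\nDate:"


def few_shot_prompt(text: str, exemplars: Iterable[Tuple[str, str]]) -> str:
    """Few-shot candidate: prepend solved exemplars drawn from the demonstration set."""
    shots = "\n\n".join(f"Text: {t}\nDate: {d}" for t, d in exemplars)
    return f"{shots}\n\nText: {text}\nDate:"


def accuracy(model: Callable[[str], str],
             build_prompt: Callable[[str], str],
             demos: Sequence[Tuple[str, str]]) -> float:
    """Fraction of demonstrations whose model output matches the expected output."""
    correct = sum(model(build_prompt(text)).strip() == expected
                  for text, expected in demos)
    return correct / len(demos)
```

The zero-shot baseline is then `accuracy(call_llm, lambda t: ZERO_SHOT.format(text=t), DEMONSTRATIONS)`, where `call_llm` is whatever function wraps your model's API; a few-shot candidate is scored the same way, with its exemplars held out of the scoring set.

---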
> I always test zero-shot first. I want to get a baseline accuracy metric. From there, you can then test few-shot and compare not only different candidates but different prompting types. And so on.

- [View Highlight](https://read.readwise.io/read/01h0j4e860rw4g9mg4zsq3scrt)

---

> the few-shot example doesn't say something like "mimic the examples below." Experimental research has shown this doesn't reliably increase accuracy, so I like to test without it first to limit tokens. Second, the few-shot example exemplars don't ever show the "MM/DD" extraction as an example, which is poor form.

- [View Highlight](https://read.readwise.io/read/01h0j4gn2dkr67b1vkhxqmch6b)

---

> Finally, you choose one of the prompt candidates to integrate into your application. This isn't necessarily the most accurate prompt. This is a cost vs. accuracy analysis based on model used, tokens required, and accuracy presented.

- [View Highlight](https://read.readwise.io/read/01h0j4mkbjmcmejwdnf89g8em6)

---

> Due to the probabilistic nature of generative AI, your prompt likely has some issues. Even if your accuracy on your test set is 100%, there are probably unknown inputs that produce incorrect outputs. Therefore, you should *trust but verify* and *add verification failures to your demonstration set* in order to develop new prompts and increase accuracy.

- [View Highlight](https://read.readwise.io/read/01h0j4ra7dbkb53612m135ynj3)

---

> we may want to explicitly ask users: "is this event correct?" And if they say "no," then log the natural language input for human review. Or, we can maybe do better to automatically track any events our users manually change after our automatic information extraction

- [View Highlight](https://read.readwise.io/read/01h0j4s20bwbt8f75d9pe61sn8)

---

> if our prompt is generating code (such as a regular expression or programming language text), we can -- at a minimum -- try to *parse it*.

- [View Highlight](https://read.readwise.io/read/01h0j4svy5qa15msgx1pyb83sy)

---
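As a concrete illustration of the *trust but verify* and *parse it* points above (a minimal sketch, not the article's code): if the prompt's job is to generate a regular expression, the cheapest verification is simply compiling the output, and any failure can be logged together with the original input so it can be reviewed and folded back into the demonstration set. The `failure_log` list is a hypothetical stand-in for whatever review queue or database your application uses.

```python
import re
from typing import List, Optional


def verify_generated_regex(user_input: str, generated_pattern: str,
                           failure_log: List[dict]) -> Optional[re.Pattern]:
    """Minimum verification for a prompt that emits regular expressions:
    try to parse (compile) the model's output. On failure, record the
    original input so it can be human-reviewed and added to the
    demonstration set for the next round of prompt development."""
    try:
        return re.compile(generated_pattern)
    except re.error:
        failure_log.append({"input": user_input, "output": generated_pattern})
        return None
```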